federation/out: tweak publish retry backoff #884
The ideal backoff strategy probably involves personal preference and may be up for debate, so feel free to suggest alternatives and further tweaks. But giving up after merely ~23-28min just seems too short to me, and getting the current strategy to tolerate one-day downtimes takes too many retries for my liking.
With the current strategy, the individual and cumulative backoff looks like this (the + part denotes the max extra random delay):
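A minimal sketch reproducing that schedule, assuming `sidekiq_backoff(attempt, pow)` computes `attempt^pow + 15 + rand(1..30) * attempt` seconds (an inferred formula; the real helper lives in `Pleroma.Workers.WorkerHelper` and its exact constants may differ):

```elixir
# Sketch only: the sidekiq-style formula below is an assumption;
# check Pleroma.Workers.WorkerHelper for the real implementation.
pow = 5

schedule =
  for attempt <- 1..4 do
    fixed = trunc(:math.pow(attempt, pow)) + 15
    max_extra = 30 * attempt
    {attempt, fixed, max_extra}
  end

Enum.reduce(schedule, {0, 0}, fn {attempt, fixed, max_extra}, {sum, extra_sum} ->
  sum = sum + fixed
  extra_sum = extra_sum + max_extra
  IO.puts("attempt=#{attempt}: #{fixed}+#{max_extra}s, cumulative #{sum}+#{extra_sum}s")
  {sum, extra_sum}
end)

# attempt=1:   16+30s,  cumulative   16+30s
# attempt=2:   47+60s,  cumulative   63+90s
# attempt=3:  258+90s,  cumulative  321+180s
# attempt=4: 1039+120s, cumulative 1360+300s  (~23-28 min total)
```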
We default to 5 retries, meaning the last backoff runs with attempt=4. Therefore outgoing activities might already be permanently dropped by a downtime of only 23 minutes, which doesn't seem too implausible to occur. Furthermore, it seems excessive to retry this quickly this often at the beginning.
At the same time, we'd like to have at least one quickish retry to deal with transient issues and maintain reasonable federation responsiveness. If an admin wants to tolerate a one-day downtime of remotes, retries need to be almost doubled.
The new backoff strategy implemented in this commit instead switches to an exponential backoff after a few initial attempts: initial retries are still fast, but the same amount of retries now allows a remote downtime of at least 40 minutes. Customising the retry count to 5 allows for whole-day downtimes.
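To see how such tolerances could come about, here is a hedged back-of-the-envelope sketch; it assumes `exponential_backoff(attempt, base)` yields roughly `base^attempt` seconds (an assumption inferred from the helper's name in the diff below) and keeps the old curve for the first three attempts:

```elixir
# Assumption: exponential_backoff(attempt, 9.5) ≈ 9.5^attempt seconds;
# the first three backoffs keep the old sidekiq-style values.
delays =
  for attempt <- 1..6 do
    if attempt > 3,
      do: trunc(:math.pow(9.5, attempt)),
      else: trunc(:math.pow(attempt, 5)) + 15
  end

delays
|> Enum.scan(&+/2)
|> Enum.with_index(1)
|> Enum.each(fn {cumulative, n} ->
  IO.puts("after backoff ##{n}: ~#{Float.round(cumulative / 3600, 1)}h")
end)

# Ignoring the random extra delay this gives roughly:
#   after backoff #4:  ~2.4h  (well past the old ~23-28min cutoff)
#   after backoff #5: ~23.8h  (about a whole day)
#   after backoff #6: ~228h   (about 9.5 days, i.e. over a week)
```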
For some reason I thought we kept trying for a week or so 🤔 What happens after the ~23-28min? Do we completely stop trying to deliver, or is the job put on an Oban queue and retried later from there? (I expect the latter, but I never actually checked; I just assumed.) Or is "put on the queue and retry later" already what happens until the 4 retries explained here have happened?
Publisher jobs are already handled in an Oban queue; once a job has exhausted its five retries (i.e. after the fourth backoff), it is discarded and no further delivery attempts for this activity to this inbox are made.
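In generic Oban terms, the mechanism looks like this (a sketch, not Akkoma's actual worker module; `deliver/1` is a hypothetical stand-in for the publisher logic):

```elixir
defmodule MyApp.PublisherWorker do
  # After max_attempts failed attempts, Oban moves the job to the
  # "discarded" state and never runs it again.
  use Oban.Worker, queue: :federator_outgoing, max_attempts: 5

  @impl Oban.Worker
  def perform(%Oban.Job{args: args}) do
    # Returning {:error, reason} (or raising) counts as a failed
    # attempt and schedules a retry after backoff/1 seconds.
    deliver(args)
  end

  @impl Oban.Worker
  def backoff(%Oban.Job{attempt: attempt}) do
    # Seconds to wait before the next attempt; attempt is the
    # number of the attempt that just failed.
    trunc(:math.pow(attempt, 5)) + 15
  end

  # Hypothetical delivery stub.
  defp deliver(_args), do: :ok
end
```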
Perhaps you're thinking of when we mark an instance as unreachable (`:pleroma, :instance, :federation_reachability_timeout_days`)? This is indeed a week by default, and once it is exceeded no deliveries to the instance will be attempted at all anymore until we ourselves receive an activity from the affected domain. Incidentally, if you'd like each delivery of each individual activity to be retried for about a week, you can raise max retries to 7 with the backoff strategy implemented here.
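For reference, both knobs live in the instance config; a hedged example (key names as used by Pleroma/Akkoma to the best of my knowledge, double-check against your config docs):

```elixir
import Config

config :pleroma, :instance,
  # After an instance has been unreachable for this long, stop all
  # deliveries to it until we hear from it again (default: 7 days).
  federation_reachability_timeout_days: 7

config :pleroma, :workers,
  retries: [
    # With the backoff strategy from this PR, 7 attempts keep
    # retrying each individual delivery for roughly a week.
    federator_outgoing: 7
  ]
```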
```diff
@@ -11,2 +11,3 @@
   def backoff(%Job{attempt: attempt}) when is_integer(attempt) do
-    Pleroma.Workers.WorkerHelper.sidekiq_backoff(attempt, 5)
+    if attempt > 3 do
+      Pleroma.Workers.WorkerHelper.exponential_backoff(attempt, 9.5)
```
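Only the start of the changed function is visible in the hunk; presumably the full function reads something like this (a reconstruction, since the else branch is not shown in the review):

```elixir
def backoff(%Job{attempt: attempt}) when is_integer(attempt) do
  if attempt > 3 do
    Pleroma.Workers.WorkerHelper.exponential_backoff(attempt, 9.5)
  else
    Pleroma.Workers.WorkerHelper.sidekiq_backoff(attempt, 5)
  end
end
```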
mmm magic numbers i love them i do

why start this at only the third attempt?

i think with some other values for the base i tested before, the scaling on the lower parts was very poor. With `9.5` it's kinda ok? Though the second and third attempts being a bit quicker, with the third attempt then already happening after ~16min instead of ~40min before jumping to 3h40min, does feel a bit smoother; it probably doesn't matter too much in practice though. Can change it to just always use the `9.5^n` backoff (or feel free to edit it yourself before merge; maintainer edits are enabled on this PR).
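For what it's worth, the ~16min figure checks out under the same `base^attempt` assumption as above; a pure `9.5^n` curve gives:

```elixir
# Cumulative delay under a pure 9.5^attempt curve (random extra
# delay ignored; 9.5^attempt is an assumed reading of the helper).
1..4
|> Enum.map(&trunc(:math.pow(9.5, &1)))
|> Enum.scan(&+/2)
|> IO.inspect(label: "cumulative seconds")

# => [9, 99, 956, 9101]
#    i.e. ~10s, ~1.7min, ~16min, then a jump to a few hours
```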