federation/out: tweak publish retry backoff #884

Open
Oneric wants to merge 1 commit from Oneric/akkoma:publish_backoff into develop
Member

The ideal backoff strategy is probably partly a matter of personal preference and may be up for debate, so feel free to suggest alternatives and further tweaks.
But giving up after merely ~23-28min just seems too short to me, and making the current strategy tolerant of one-day downtimes takes too many retries for my liking.


With the current strategy the individual
and cumulative backoffs look like this
(the + part denotes the maximum extra random delay):

| attempt | backoff_single |        | cumulative  |          |
| ------- | -------------- | ------ | ----------- | -------- |
| 1       | 16+30          |        | 16+30       |          |
| 2       | 47+60          |        | 63+90       |          |
| 3       | 243+90         | ≈ 4min | 321+180     |          |
| 4       | 1024+120       | ≈17min | 1360+300    | ≈23+5min |
| 5       | 3125+150       | ≈52min | 4500+450    | ≈75+8min |
| 6       | 7776+180       | ≈ 2.1h | 12291+630   | ≈3.4h    |
| 7       | 16807+210      | ≈ 4.6h | 29113+840   | ≈8h      |
| 8       | 32768+240      | ≈ 9.1h | 61896+1080  | ≈17h     |
| 9       | 59049+270      | ≈16.4h | 120960+1350 | ≈33h     |
| 10      | 100000+300     | ≈27.7h | 220975+1650 | ≈61h     |
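
For reference, the single-attempt values above are consistent with a sidekiq-style backoff of roughly `attempt^5 + 15` seconds plus up to `30 * attempt` seconds of random jitter (the "+N" column). A minimal sketch of how such values come about; the constants are inferred from the table rather than copied from the source, and the real helper is the `Pleroma.Workers.WorkerHelper.sidekiq_backoff/2` call visible in the review diff further down:

```elixir
# Sketch of the current, sidekiq-style polynomial backoff (in seconds).
# Constants are inferred from the table above, not copied from the source.
defmodule CurrentBackoffSketch do
  @base_backoff 15

  def sidekiq_backoff(attempt, pow) do
    # deterministic part: attempt^pow + 15; jitter: up to 2 * 15 * attempt
    trunc(:math.pow(attempt, pow) + @base_backoff + :rand.uniform(2 * @base_backoff) * attempt)
  end
end

# Example: attempt 4 yields roughly 1040-1160 s, i.e. the "1024+120 ≈17min" row.
IO.inspect(CurrentBackoffSketch.sidekiq_backoff(4, 5))
```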

We default to 5 retries, meaning the last backoff runs with attempt=4.
Outgoing activities might therefore already be permanently dropped after a
remote downtime of only ~23 minutes, which doesn't seem too implausible to occur.
Furthermore, it seems excessive to retry this quickly and this often at the
beginning.
At the same time, we’d like to have at least one quickish retry to deal
with transient issues and maintain reasonable federation responsiveness.

If an admin wants to tolerate a one-day downtime of remotes,
the retry count needs to be almost doubled: with the current strategy the
cumulative backoff only exceeds 24h around attempt 9 (see the table above).

The new backoff strategy implemented in this commit instead
switches to an exponential backoff after a few initial attempts:

| attempt | backoff_single |        | cumulative |        |
| ------- | -------------- | ------ | ---------- | ------ |
| 1       | 16+30          |        | 16+30      |        |
| 2       | 143+60         |        | 159+90     |        |
| 3       | 2202+90        | ≈37min | 2361+180   | ≈40min |
| 4       | 8160+120       | ≈ 2.3h | 10521+300  | ≈ 3h   |
| 5       | 77393+150      | ≈21.5h | 87914+450  | ≈24h   |
| 6       | 735106+180     | ≈8.5d  | 823020+630 | ≈9.5d  |

Initial retries are still fast, but the same number of retries
now allows a remote downtime of at least 40 minutes. Customising
the retry count to 5 allows for whole-day downtimes.
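
Judging from the table above and the review diff further down, the new strategy keeps a polynomial backoff for the first three attempts and only then switches to `9.5^attempt`. A minimal sketch of that combined logic; the early-attempt exponent (7) and the jitter term are inferred from the listed values, not copied verbatim from the patch:

```elixir
# Sketch of the proposed strategy: polynomial backoff for the first three
# attempts, 9.5^attempt afterwards. Constants are inferred from the table
# above rather than taken verbatim from the patch.
defmodule NewBackoffSketch do
  @base_backoff 15

  def backoff(attempt) when attempt > 3 do
    # e.g. attempt = 5 -> 9.5^5 + 15 ≈ 77393 s ≈ 21.5 h (plus jitter)
    trunc(:math.pow(9.5, attempt) + @base_backoff + jitter(attempt))
  end

  def backoff(attempt) do
    # e.g. attempt = 3 -> 3^7 + 15 = 2202 s ≈ 37 min (plus jitter)
    trunc(:math.pow(attempt, 7) + @base_backoff + jitter(attempt))
  end

  # up to 2 * 15 * attempt seconds of random extra delay (the "+N" column)
  defp jitter(attempt), do: :rand.uniform(2 * @base_backoff) * attempt
end
```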

Oneric added 1 commit 2025-03-18 00:15:15 +00:00
federation/out: tweak publish retry backoff
Some checks are pending
ci/woodpecker/pr/build-amd64 Pipeline is pending approval
ci/woodpecker/pr/build-arm64 Pipeline is pending approval
ci/woodpecker/pr/docs Pipeline is pending approval
ci/woodpecker/pr/lint Pipeline is pending approval
ci/woodpecker/pr/test/1 Pipeline is pending approval
ci/woodpecker/pr/test/2 Pipeline is pending approval
4011d20dbe
Contributor

For some reason i thought we kept trying for a week or so 🤔 What happens after the ~23-28min? Do we completely stop trying to deliver, or is the job put on an oban queue and retried later from there? (I expect the latter, but never actually checked that, I just assumed.) Or is "put on queue and retry later" already what happens until the 4 retries as explained here have happened?

Author
Member

> What happens after the ~23-28min? Do we completely stop trying to deliver, or is the job put on an oban queue and retried later from there? (I expect the latter, but never actually checked that, I just assumed.) Or is "put on queue and retry later" already what happens until the 4 retries as explained here have happened?

Publisher jobs are already handled in an Oban queue; once it has exhausted its five retries (i.e. after the fourth backoff) the job is discarded and no further delivery attempts for this activity to this inbox are made.

> For some reason i thought we kept trying for a week or so

Perhaps you’re thinking of when we mark an instance as unreachable (`:pleroma, :instance, :federation_reachability_timeout_days`)? This is indeed a week by default, and once exceeded no deliveries to the instance will be attempted at all anymore until we ourselves receive an activity from the affected domain.

Incidentally, if you’d like each delivery of each individual activity to still be retried after about a week, you can raise max retries to 7 with the backoff strategy implemented here.
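
For anyone who wants to tweak these knobs: assuming Akkoma still uses Pleroma's `:workers` retry settings (I haven't double-checked the exact key names here, so treat this as a sketch), raising the outgoing-federation retry count and adjusting the reachability window would look roughly like this:

```elixir
# Rough config sketch; the :workers/retries key names assume Akkoma kept
# Pleroma's worker settings - double-check against the config cheatsheet.
import Config

config :pleroma, :workers,
  retries: [
    # maximum Oban attempts for outgoing federation jobs
    federator_outgoing: 7
  ]

config :pleroma, :instance,
  # days until an unresponsive instance is marked unreachable (default: 7)
  federation_reachability_timeout_days: 7
```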

floatingghost reviewed 2025-03-31 10:35:22 +00:00
```diff
@@ -11,2 +11,3 @@
   def backoff(%Job{attempt: attempt}) when is_integer(attempt) do
-    Pleroma.Workers.WorkerHelper.sidekiq_backoff(attempt, 5)
+    if attempt > 3 do
+      Pleroma.Workers.WorkerHelper.exponential_backoff(attempt, 9.5)
```

mmm magic numbers i love them i do

why start this at only the third attempt?

Author
Member

i think with some other values for the base i tested before, the scaling at the lower end was very poor. With `9.5` it’s kinda ok? Though the second and third attempts are a bit quicker, with the third attempt then already happening after ~16min instead of ~40min before jumping to 3h.

40min feels a bit smoother, but it probably doesn’t matter too much in practice. Can change it to just always use the `9.5^n` backoff *(or feel free to edit it yourself before merge; maintainer edits are enabled on this PR)*
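
To make that comparison concrete: a pure `9.5^n` backoff with the same +15s base (jitter omitted) would give roughly the following for the first three attempts, i.e. about 16-17 minutes cumulative before the next attempt instead of the ~40 minutes of the blended variant; attempts 4 and later are identical either way.

```elixir
# First three backoffs with a pure 9.5^n strategy (jitter omitted), in seconds.
# Compare with 16 s, 143 s and 2202 s for the blended variant in the PR description.
for n <- 1..3, do: trunc(:math.pow(9.5, n) + 15)
# => [24, 105, 872]  (cumulative ≈ 1001 s ≈ 17 min)
```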


Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u publish_backoff:Oneric-publish_backoff
git checkout Oneric-publish_backoff