Handle upstream rate limits #269

Open
opened 2022-11-12 14:52:35 +00:00 by tim · 10 comments

From what I gathered from the code, it seems like upstream rate limits aren't honored. This leads to errors updating local data when they are hit, as shown by this log from my instance:

Nov 12 14:51:40 tools pleroma[51289]: [error] Could not fetch page https://mastodon.art/users/wartman/statuses/109289637819341132/replies?min_id=109298976493782710&page=true - {:ok, %Tesla.Env{__client__: %Tesla.Client{adapter: nil, fun: nil, post: [], pre: [{Tesla.Middleware.FollowRedirects, :call, [[]]}]}, __module__: Tesla, body: "{\"error\":\"Too many requests\"}", headers: [{"date", "Sat, 12 Nov 2022 13:51:40 GMT"}, {"content-type", "application/json"}, {"transfer-encoding", "chunked"}, {"connection", "keep-alive"}, {"x-ratelimit-limit", "300"}, {"x-ratelimit-remaining", "0"}, {"x-ratelimit-reset", "2022-11-12T14:00:00.312308Z"}, {"cache-control", "no-cache"}, {"content-security-policy", "base-uri 'none'; default-src 'none'; frame-ancestors 'none'; font-src 'self' https://mastodon.art; img-src 'self' https: data: blob: https://mastodon.art; style-src 'self' https://mastodon.art 'nonce-lNZq5+BSJRXGfPVmeNnOxg=='; media-src 'self' https: data: https://mastodon.art; frame-src 'self' https:; manifest-src 'self' https://mastodon.art; connect-src 'self' data: blob: https://mastodon.art https://cdn.masto.host wss://mastodon.art; script-src 'self' https://mastodon.art; child-src 'self' blob: https://mastodon.art; worker-src 'self' blob: https://mastodon.art"}, {"x-request-id", "a65d1eff-7396-42c2-a372-d40bf7525499"}, {"x-runtime", "0.002416"}, {"strict-transport-security", "max-age=63072000; includeSubDomains"}], method: :get, opts: [adapter: [name: MyFinch, pool_timeout: 5000, receive_timeout: 5000]], query: [], status: 429, url: "https://mastodon.art/users/wartman/statuses/109289637819341132/replies?min_id=109298976493782710&page=true"}}
Nov 12 14:59:49 tools pleroma[51289]: [error] Could not fetch page https://mastodon.art/users/GrumpyMark/statuses/108198746918729482/replies?min_id=108198761137309864&page=true - {:ok, %Tesla.Env{__client__: %Tesla.Client{adapter: nil, fun: nil, post: [], pre: [{Tesla.Middleware.FollowRedirects, :call, [[]]}]}, __module__: Tesla, body: "{\"error\":\"Too many requests\"}", headers: [{"date", "Sat, 12 Nov 2022 13:59:49 GMT"}, {"content-type", "application/json"}, {"transfer-encoding", "chunked"}, {"connection", "keep-alive"}, {"x-ratelimit-limit", "300"}, {"x-ratelimit-remaining", "0"}, {"x-ratelimit-reset", "2022-11-12T14:00:00.417795Z"}, {"cache-control", "no-cache"}, {"content-security-policy", "base-uri 'none'; default-src 'none'; frame-ancestors 'none'; font-src 'self' https://mastodon.art; img-src 'self' https: data: blob: https://mastodon.art; style-src 'self' https://mastodon.art 'nonce-QV5KAmTYm56UYckq0fu2sw=='; media-src 'self' https: data: https://mastodon.art; frame-src 'self' https:; manifest-src 'self' https://mastodon.art; connect-src 'self' data: blob: https://mastodon.art https://cdn.masto.host wss://mastodon.art; script-src 'self' https://mastodon.art; child-src 'self' blob: https://mastodon.art; worker-src 'self' blob: https://mastodon.art"}, {"x-request-id", "262d8bd2-2c2b-4041-ab35-5f29a4ab05de"}, {"x-runtime", "0.002189"}, {"strict-transport-security", "max-age=63072000; includeSubDomains"}], method: :get, opts: [adapter: [name: MyFinch, pool_timeout: 5000, receive_timeout: 5000]], query: [], status: 429, url: "https://mastodon.art/users/GrumpyMark/statuses/108198746918729482/replies?min_id=108198761137309864&page=true"}}
Nov 12 14:59:49 tools pleroma[51289]: [error] Could not fetch page https://mastodon.art/users/GrumpyMark/statuses/109296694109885528/replies?only_other_accounts=true&page=true - {:ok, %Tesla.Env{__client__: %Tesla.Client{adapter: nil, fun: nil, post: [], pre: [{Tesla.Middleware.FollowRedirects, :call, [[]]}]}, __module__: Tesla, body: "{\"error\":\"Too many requests\"}", headers: [{"date", "Sat, 12 Nov 2022 13:59:49 GMT"}, {"content-type", "application/json"}, {"transfer-encoding", "chunked"}, {"connection", "keep-alive"}, {"x-ratelimit-limit", "300"}, {"x-ratelimit-remaining", "0"}, {"x-ratelimit-reset", "2022-11-12T14:00:00.458681Z"}, {"cache-control", "no-cache"}, {"content-security-policy", "base-uri 'none'; default-src 'none'; frame-ancestors 'none'; font-src 'self' https://mastodon.art; img-src 'self' https: data: blob: https://mastodon.art; style-src 'self' https://mastodon.art 'nonce-xWWaojX5BzRxRt9vcFcgbQ=='; media-src 'self' https: data: https://mastodon.art; frame-src 'self' https:; manifest-src 'self' https://mastodon.art; connect-src 'self' data: blob: https://mastodon.art https://cdn.masto.host wss://mastodon.art; script-src 'self' https://mastodon.art; child-src 'self' blob: https://mastodon.art; worker-src 'self' blob: https://mastodon.art"}, {"x-request-id", "3366fddb-42c8-4128-89ed-557c485a9127"}, {"x-runtime", "0.008219"}, {"strict-transport-security", "max-age=63072000; includeSubDomains"}], method: :get, opts: [adapter: [name: MyFinch, pool_timeout: 5000, receive_timeout: 5000]], query: [], status: 429, url: "https://mastodon.art/users/GrumpyMark/statuses/109296694109885528/replies?only_other_accounts=true&page=true"}}

In an optimal world, Akkoma would be able to parse the rate-limit headers and know exactly when it's fine to retry again (given that those headers are not really standardized, this might be annoying, though). Alternatively, one could store the information that a rate limit was hit, as indicated by an HTTP 429 response, fail all further requests to that host locally, and just blindly retry after x seconds/minutes.
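
For illustration, a minimal sketch (not actual Akkoma code; the module and function names here are made up) of how the x-ratelimit-reset header from a 429 response like the ones above could be turned into a wait time:

```elixir
defmodule RateLimitBackoff do
  # Hypothetical helper: given a rate-limited Tesla response, work out how
  # many seconds to wait before retrying, based on Mastodon's
  # x-ratelimit-reset header. Falls back to a fixed delay when the header
  # is missing or unparsable.
  @default_backoff_s 300

  def retry_after_seconds(%Tesla.Env{status: 429, headers: headers}) do
    with {_, reset} <- List.keyfind(headers, "x-ratelimit-reset", 0),
         {:ok, reset_at, _offset} <- DateTime.from_iso8601(reset) do
      max(DateTime.diff(reset_at, DateTime.utc_now(), :second), 1)
    else
      _ -> @default_backoff_s
    end
  end
end
```

A caller hitting a 429 could then reschedule the fetch after that many seconds and fail further requests to the same host locally in the meantime.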

Contributor

Currently working with @Oneric on finding out what the root cause of the problem is, since I've had this happen to my instance as well.

There's some stuff in upstream Pleroma that does seem to fix the queue getting stuck due to errors like this:

  • https://git.pleroma.social/pleroma/pleroma/-/merge_requests/3777
  • https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4015
  • https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4077

These probably don't fix the root cause, but may be helpful with alleviating the symptoms at least.

I'm pretty sure we have the first and third of those, the second I don't think we do

incidentally I did implement rate limit handling a while back so this should be better than it was

Contributor

for whatever reason though there does seem to be an edge case where this fails.

Member

> incidentally I did implement rate limit handling

oh, i see it now on the backoff-http branch (https://akkoma.dev/AkkomaGang/akkoma/commits/branch/backoff-http), but those commits appear to have never made it into develop ^^`

btw, are there any callers which should bypass the backoff, or is there another reason why the backoff isn’t enforced in plain Http.get?

(still, i wonder what causes Akkoma to spam multiple requests per second here in the first place)

> I'm pretty sure we have the first and third of those, the second I don't think we do

~~yeah, checking now we already have timeout in purge_expired_filter, attachments_cleanup_worker, backup_worker and worker_helper. But not in remote_fetcher~~ which is (supposedly, albeit it’s not clear this is actually the offending code path) related to the problematic replies collection fetch from here, #606 and #419

The third fixes a mistake in the second; we don’t have that, but without the second it’s also not (yet) needed

Member

A suggestion for the backoff-http branch: it currently handles only 429 and Mastodon’s custom X-Ratelimit-Reset header.

GoToSocial uses the same for too many requests from a single server (https://docs.gotosocial.org/en/latest/federation/federating_with_gotosocial/#request-throttling-rate-limiting), but also a 503 response (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503) and the more standard Retry-After header if the server is just generally overloaded. We should back off based on this too.
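
A rough sketch of what that combined handling could look like; this is illustrative only and not the branch's actual code, and the HTTP-date form of Retry-After is left out for brevity:

```elixir
defmodule CombinedBackoff do
  @default_backoff_s 300

  # Pick a wait time for 429 (rate limited) and 503 (overloaded) responses,
  # preferring the standard Retry-After header and falling back to
  # Mastodon's x-ratelimit-reset.
  def backoff_seconds(%Tesla.Env{status: status, headers: headers})
      when status in [429, 503] do
    retry_after = get_header(headers, "retry-after")
    reset = get_header(headers, "x-ratelimit-reset")

    cond do
      # Retry-After as delta-seconds; the HTTP-date form would need its own
      # parser and is omitted here.
      is_binary(retry_after) and match?({_, ""}, Integer.parse(retry_after)) ->
        {seconds, ""} = Integer.parse(retry_after)
        seconds

      is_binary(reset) and match?({:ok, _, _}, DateTime.from_iso8601(reset)) ->
        {:ok, reset_at, _offset} = DateTime.from_iso8601(reset)
        max(DateTime.diff(reset_at, DateTime.utc_now(), :second), 1)

      true ->
        @default_backoff_s
    end
  end

  defp get_header(headers, name) do
    case List.keyfind(headers, name, 0) do
      {_, value} -> value
      nil -> nil
    end
  end
end
```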

Member

> yeah, checking now we already have timeout in purge_expired_filter, attachments_cleanup_worker, backup_worker and worker_helper. But not in remote_fetcher which is (supposedly, albeit it’s not clear this is actually the offending code path) related to the problematic replies collection fetch from here

duh, i was missing the obvious: worker_helper.ex is used by the other workers, so it defining a timeout ofc propagates to them. worker_helper defines timeout based on a config value, and the default config for remote_fetcher is 10 seconds.
The others use their own explicit timeout function because their config keys don’t follow the usual naming scheme.

Meaning we indeed already have a more flexible timeout system than the hardcoded values from https://git.pleroma.social/pleroma/pleroma/-/merge_requests/3777
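
For reference, the general pattern being described looks roughly like this; the module name and config layout are made up for the example, this is not the actual Akkoma code:

```elixir
defmodule MyApp.Workers.RemoteFetcherWorker do
  # Rough illustration of the pattern described above: an Oban worker
  # deriving its timeout from application config instead of a hardcoded
  # value. The config layout shown in the comment below is an assumption.
  use Oban.Worker, queue: :remote_fetcher

  @impl Oban.Worker
  def perform(%Oban.Job{args: _args}) do
    # ... fetch the remote object here ...
    :ok
  end

  @impl Oban.Worker
  def timeout(_job) do
    # e.g. config :my_app, :workers, timeout: [remote_fetcher: :timer.seconds(10)]
    :my_app
    |> Application.get_env(:workers, [])
    |> Keyword.get(:timeout, [])
    |> Keyword.get(:remote_fetcher, :timer.seconds(10))
  end
end
```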

Member

I tried replicating this by spoofing responses for the affected post and its replies collections, but the post overloading akko.wtf was deleted (and i didn’t make a backup beforehand). Instead i tried mocking some other post from the same server, but it’s possible there’s something specific to the original post that this is missing.

The problematic collection from #419 is gone together with its instance. Collections from this issue and #606 are still up, but their instances require signed fetches. Could someone with an actual live instance go and grab the AP representation of the post and the first two pages of its reply collection? URLs should be:

#269
https://mastodon.art/users/wartman/statuses/109289637819341132
https://mastodon.art/users/wartman/statuses/109289637819341132/replies
https://mastodon.art/users/wartman/statuses/109289637819341132/replies?page=true
https://mastodon.art/users/wartman/statuses/109289637819341132/replies?min_id=109298976493782710&page=true

#606
https://compostintraining.club/users/prplecake/statuses/108493687465375532
https://compostintraining.club/users/prplecake/statuses/108493687465375532/replies
https://compostintraining.club/users/prplecake/statuses/108493687465375532/replies?page=true
https://compostintraining.club/users/prplecake/statuses/108493687465375532/replies?only_other_accounts=true&page=true
How to do manual signed fetches

Connect to the running instance with a remote shell and run

Pleroma.Object.Fetcher.fetch_and_contain_remote_object_from_id("url here")

This will just fetch the object without putting it through any further processing on the server. Watch both the log output in the remote shell and normal logs of the running instance for any errors.

To avoid creating a giant wall of text, you can use the snippet below to wrap it in an expandable section like this one:

<details>
<summary>Status from 269</summary>
Just the status
<!-- put JSON here inside a code block -->
...
</details>

As for connecting a remote shell to the instance you can use e.g.:

# Start Akkoma with (or find node name of already running instance)
env MIX_ENV=prod elixir --sname akkoma -S mix phx.server

# and in a different terminal session then attach a shell with
iex --remsh akkoma --sname devtests


Nonetheless i got some noteworthy findings out of it; i’ll summarise everything old and new about this below:

  • All known problematic cases occurred while fetching a "replies" collection page beyond "first" from a Mastodon instance. All seem to be top-level posts and at least one doesn’t even have replies now (maybe they were deleted, if not presumably there’s still an empty collection with an empty "next" page)
  • Notably, no issues were reported (so far) about fetching pinned posts ("featured" collection), which appears to be the only other collection we try to fully resolve
  • "replies" is processed in two places in the pipeline:
    • in ArticleNotePageValidator.fix_replies the collection is synchronously fetched (up to the configured maximum element count) and the collection object is replaced by a list of AP ids of the post’s replies
      • if fetching a page fails here, the whole post will be rejected due to failing the changeset validation of this validator’s embedded Ecto schema.
        That’s because Collections.Fetcher apparently returns {:ok, {:error, %Tesla.Env{status: 429, ...}}}, so instead of the current intended “nuke replies on failure” strategy we insert a {:error, %Tesla.Env{}} tuple as replies (see the sketch after this list)
    • in ActivityPub.SideEffects.handle(%{data: %{"type" => "Create"}}) all elements of this preprocessed list get enqueued for later fetching via RemoteFetcherWorker
  • Pleroma does not fetch the full collection; at most it fetches "first" and only if items for "first" are not already inlined like Mastodon seems to do
  • Akkoma started to attempt full reply collection fetches with 772c209914 merged 2022-08-27
  • The first release afterwards was 3.2.0 on 2022-09-10
  • The first issue was reported on 2022-11-12, the same day but iinm around an hour before 3.4.0 was released

I still have no idea what leads to this getting retried several times a second. When trying to put the stubbed out post or a stub reply to this post through the fetch pipeline, the offending post just failed once and was never retried even when viewing the (orphaned) reply in the frontend.
Things might be different if the offending post gets delivered to our inbox — this would also explain why we can get the post just fine and only fail on a subsequent collection page. Still though — if inbox processing fails, there should be a backoff period and a limit to retries before discarding the job. But we observed several reattempts per second over a prolonged time.
Even if the originating instance redelivered the status a couple times, they too should have backoff + max attempts, so this shouldn’t lead to such a flood either.
Afaicr, prior to its deletion the post did not have many (if any) replies or interactions, so it’s also not a case of many instances somehow delivering activities referencing this post.
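
For context, a sketch of the retry controls Oban normally provides (an illustrative worker, not Akkoma's actual inbox worker), which is why a sustained flood of several retries per second is surprising:

```elixir
defmodule MyApp.Workers.InboxWorker do
  # Illustrative only: an Oban worker with a bounded number of attempts and
  # an exponential backoff between them. With settings like these a failed
  # job is retried a limited number of times with growing delays and then
  # discarded, not hammered several times per second.
  use Oban.Worker, queue: :federator_incoming, max_attempts: 5

  @impl Oban.Worker
  def perform(%Oban.Job{args: _args}), do: :ok

  @impl Oban.Worker
  def backoff(%Oban.Job{attempt: attempt}) do
    # Seconds to wait before the next attempt.
    trunc(:math.pow(2, attempt)) + 15
  end
end
```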

The post tripping up akko.wtf does not exist on akko.wtf. I don’t know for sure if that’s because it failed ArticleNotePageValidator from the start or if @norm deleted the post while trying to resolve this.


Regardless of what actually causes the request flood, it might be a good idea to just return all already fetched items when a single page fails and/or let processing of the initial post itself succeed. (However, doing so will probably hinder finding out the flood’s root cause)
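
A sketch of the first idea, with a hypothetical fetch_page function standing in for whatever performs the (signed) HTTP request:

```elixir
defmodule PartialCollectionFetch do
  # Accumulate reply IDs page by page and, when a later page fails (e.g.
  # with HTTP 429), return the partial list instead of an error.
  def fetch_pages(first_page_url, fetch_page, max_items \\ 500) do
    do_fetch(first_page_url, fetch_page, [], max_items)
  end

  defp do_fetch(nil, _fetch_page, acc, _remaining), do: {:ok, acc}
  defp do_fetch(_url, _fetch_page, acc, remaining) when remaining <= 0, do: {:ok, acc}

  defp do_fetch(url, fetch_page, acc, remaining) do
    case fetch_page.(url) do
      {:ok, %{"orderedItems" => items} = page} ->
        do_fetch(page["next"], fetch_page, acc ++ items, remaining - length(items))

      # Partial success: a single failed page no longer rejects the whole post.
      {:error, _reason} ->
        {:ok, acc}
    end
  end
end
```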

> I still have no idea what leads to this getting retried several times a second

reading this, I am struck with a theory

if you have a particularly long string of replies, could we fetch the post, try to fetch its replies, fetch the first reply, try to fetch -its- replies, and chain on like this?

I'll have to look at the pipeline to see if that could happen
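
If that turned out to be the case, one way to bound it would be threading a depth counter through the recursive fetches; a purely hypothetical sketch with made-up names:

```elixir
defmodule ReplyDepthGuard do
  # Hypothetical guard for the theory above: carry a depth counter through
  # recursive reply fetches so a long reply chain cannot fan out endlessly.
  @max_reply_depth 10

  def maybe_fetch_replies(_post, depth) when depth >= @max_reply_depth, do: :skip

  def maybe_fetch_replies(post, depth) do
    for reply_id <- post["replies"] || [] do
      # fetch_and_process/2 stands in for whatever fetches or enqueues a
      # reply; the important part is that it passes depth + 1 onwards.
      fetch_and_process(reply_id, depth + 1)
    end

    :ok
  end

  defp fetch_and_process(_reply_id, _depth), do: :ok
end
```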

> oh, i see it now on the backoff-http branch, but those commits appear to have never made it into develop ^^`

me bigbig dum

PR is up, gonna run on ihba (i did before and it was fine but i'll run it rebased on current develop)
