Handle upstream rate limits #269
From what I gathered from the code, it seems like upstream rate limits aren't honored. This leads to errors updating local data when they are hit, as shown by this log from my instance:
In an optimal world, Akkoma would be able to parse the headers and know exactly when it's fine to retry (given that those headers are not really standardized, this might be annoying, though). Alternatively, one could store the fact that a rate limit was hit, as indicated by an HTTP 429 response, and just blindly retry after x seconds/minutes, failing all further requests locally in the meantime.
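A minimal sketch of that second approach, assuming Tesla responses and a plain ETS table for the per-host state (the module and function names below are made up for illustration, not Akkoma's actual code):

```elixir
defmodule RateLimitBackoff do
  # Hypothetical sketch, not Akkoma's actual code: remember a "retry no
  # earlier than" timestamp per host after a 429 and let callers fail fast
  # locally until that point has passed.
  @table :rate_limit_backoff

  def setup, do: :ets.new(@table, [:named_table, :public, read_concurrency: true])

  # Prefer the standard Retry-After header (seconds), then Mastodon's
  # X-RateLimit-Reset (an ISO8601 timestamp); fall back to a blind fixed
  # delay when neither is present or parseable.
  def note_rate_limit(%Tesla.Env{status: 429, url: url} = env, default_delay \\ 300) do
    delay =
      with :error <- retry_after_seconds(Tesla.get_header(env, "retry-after")),
           :error <- reset_seconds(Tesla.get_header(env, "x-ratelimit-reset")) do
        default_delay
      end

    :ets.insert(@table, {URI.parse(url).host, System.system_time(:second) + delay})
  end

  def allowed?(host) do
    case :ets.lookup(@table, host) do
      [{^host, until}] -> System.system_time(:second) >= until
      [] -> true
    end
  end

  defp retry_after_seconds(nil), do: :error

  defp retry_after_seconds(value) do
    case Integer.parse(value) do
      {seconds, _} -> seconds
      :error -> :error
    end
  end

  defp reset_seconds(nil), do: :error

  defp reset_seconds(value) do
    case DateTime.from_iso8601(value) do
      {:ok, reset_at, _offset} -> max(DateTime.diff(reset_at, DateTime.utc_now()), 0)
      {:error, _} -> :error
    end
  end
end
```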
Currently working with @Oneric on finding out what the root cause of the problem is, since I've had this happen to my instance as well.
There's some stuff in upstream pleroma that does seem to fix the queue getting stuck due to errors like this:
These probably don't fix the root cause, but may at least help alleviate the symptoms.
I'm pretty sure we have the first and third of those; the second I don't think we do
incidentally I did implement rate limit handling a while back so this should be better than it was
for whatever reason though there does seem to be an edge case where this fails.
oh, i see it now on the backoff-http branch, but those commits appear to have never made it into develop ^^`
btw, are there any callers which should bypass the backoff, or another reason why the backoff isn’t enforced in plain `Http.get`?
(still, i wonder what causes Akkoma to spam multiple requests per second here in the first place)
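For illustration, enforcing it in a plain get wrapper could reuse the hypothetical `RateLimitBackoff` sketch from above and fail fast while a host is still inside its backoff window (again just a sketch, not the actual `Http.get`):

```elixir
defmodule BackoffAwareHttp do
  # Hypothetical sketch building on RateLimitBackoff above: refuse to hit a
  # host again while its known rate-limit window is still open.
  def get(url, opts \\ []) do
    host = URI.parse(url).host

    if RateLimitBackoff.allowed?(host) do
      Tesla.get(Tesla.client([]), url, opts)
    else
      {:error, :rate_limited_locally}
    end
  end
end
```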
yeah, checking now we already have `timeout` in `purge_expired_filter`, `attachments_cleanup_worker`, `backup_worker` and `worker_helper`, but not in `remote_fetcher`, which is (supposed to be) related to the problematic replies collection fetch from here, #606 and #419 (albeit it’s not clear this is actually the offending code path)
The third fixes a mistake in the second; we don’t have that, but without the second it’s also not (yet) needed
A suggestion for the backoff-http branch: it currently handles only 429 and Mastodon’s custom `X-Ratelimit-Reset` header. GoToSocial uses the same for too many requests from a single server, but also a 503 response and the more standard `Retry-After` header if the server is just generally overloaded. We should back off based on this too.

duh, i was missing the obvious: `worker_helper.ex` is used by other workers, so it defining a timeout ofc propagates to the others. `worker_helper` defines `timeout` based on a config value, and the default config for `remote_fetcher` is 10 seconds. The others use their own explicit `timeout` function because their config keys don’t follow the usual naming scheme. Meaning we indeed already have a more flexible timeout system than the hardcoded values from https://git.pleroma.social/pleroma/pleroma/-/merge_requests/3777
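For comparison, a config-driven per-worker execution timeout with Oban looks roughly like the following; the module name, config key and default below are placeholders rather than Akkoma's actual settings (worker_helper derives its value from config in a broadly similar way):

```elixir
defmodule MyApp.Workers.RemoteFetcherWorker do
  # Illustrative only: Oban's optional timeout/1 callback limits how long a
  # single perform/1 run may take, here read from application config.
  use Oban.Worker, queue: :remote_fetcher

  @impl Oban.Worker
  def timeout(_job) do
    # e.g. config :my_app, :worker_timeouts, remote_fetcher: 10
    :my_app
    |> Application.get_env(:worker_timeouts, [])
    |> Keyword.get(:remote_fetcher, 10)
    |> :timer.seconds()
  end

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"id" => id}}) do
    # ... fetch the remote object identified by `id` ...
    {:ok, id}
  end
end
```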
I tried replicating this by spoofing responses for the affected post and its replies collections, but the post overloading akko.wtf was deleted (and i didn’t make a backup before). Instead i tried mocking some other post from the same server, but it’s possible there’s something specific to the original post this is missing.
The problematic collection from #419 is gone together with its instance. Collections from this issue and #606 are still up, but their instances require signed fetches. Could someone with an actual live instance go and grab the AP representation of the post and the first two pages of its reply collection? URLs should be:
How to do manual signed fetches
Connect to the running instance with a remote shell and run
This will just fetch the object without putting it through any further processing on the server. Watch both the log output in the remote shell and normal logs of the running instance for any errors.
To not create a giant wall of text you can use the below to wrap it in an expandable section like this one
As for connecting a remote shell to the instance, you can use e.g.:
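For illustration, assuming `Pleroma.Object.Fetcher.fetch_and_contain_remote_object_from_id/1` is the fetcher entry point meant here, the fetch itself might look roughly like this from the remote console (the URL and install details are placeholders):

```elixir
# In a remote IEx console attached to the running instance (for an OTP
# release something like `su akkoma -s $SHELL -lc "./bin/pleroma remote"`;
# adjust user and paths to your setup).
# This performs the fetch, signed if the instance is configured for signed
# fetches, and returns the raw AP document without any further local
# processing.
Pleroma.Object.Fetcher.fetch_and_contain_remote_object_from_id(
  "https://example.com/users/someone/statuses/123456/replies?page=true"
)
```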
Nonetheless i got some noteworthy findings out of it; i’ll summarise everything old and new about this below:
"replies"
collection page beyond"first"
from a Mastodon instance. All seem to be top-level posts and at least one doesn’t even have replies now (maybe they were deleted, if not presumably there’s still an empty collection with an empty"next"
page)"featured"
collection), which appears to be the only other collection we try to fully resolve"replies"
is processed in two places in the pipeline:ArticleNotePageValidator.fix_replies
the collection is synchronously fetched (up to the configured maximum element count) and the collection object is replaced by a list of AP ids of the post’s repliesThat’s because
Collections.Fetcher
apparently returns{:ok, {:error, %Tesla.Env{status: 429, ...}}}
, so instead of the current intended “nuke replies on failure” strategy we insert a{:error, %Tesla.Env{}}
tuple asreplies
ActivityPub.SideEffects..handle(%{data: %{"type" => "Create"}})
all elements of this preprocessed list get enqueued for later fetching viaRemoteFetcherWorker
"first"
and only ifitems
for"first"
are not already inlined like Mastodon seems to do772c209914
merged 2022-08-27I still have no idea what leads to this getting retried several times a second. When trying to put the stubbed out post or a stub reply to this post through the fetch pipeline, the offending post just failed once and was never retried even when viewing the (orphaned) reply in the frontend.
Things might be different if the offending post gets delivered to our inbox — this would also explain why we can get the post just fine and only fail on a subsequent collection page. Still though — if inbox processing fails, there should be a backoff period and a limit to retries before discarding the job. But we observed several reattempts per second over a prolonged time.
Even if the originating instance redelivered the status a couple times, they too should have backoff + max attempts, so this shouldn’t lead to such a flood either.
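As a hedged sketch of what that expectation looks like in Oban terms (queue name, attempt limit and delays below are illustrative, not Akkoma's actual configuration):

```elixir
defmodule MyApp.Workers.InboxWorker do
  # Illustrative only: with Oban, a failing job is retried at most
  # `max_attempts` times with a growing delay between attempts and is
  # discarded afterwards.
  use Oban.Worker, queue: :federator_incoming, max_attempts: 5

  @impl Oban.Worker
  def backoff(%Oban.Job{attempt: attempt}) do
    # delay before the next attempt, in seconds: 30, 60, 120, 240, ...
    trunc(:math.pow(2, attempt - 1) * 30)
  end

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"activity" => _activity}}) do
    # process the incoming activity; an {:error, _} return or a raised
    # exception counts as a failed attempt and triggers the backoff above
    :ok
  end
end
```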
Afaicr, prior to its deletion the post did not have many (if any) replies or other interactions, so it’s also not many instances somehow delivering activities referencing this post.
The post tripping up akko.wtf does not exist on akko.wtf. I don’t know for sure if that’s because it failed `ArticleNotePageValidator` from the start or if @norm deleted the post while trying to resolve this.

Regardless of what actually causes the request flood, it might be a good idea to just return all already fetched items when a single page fails and/or let processing of the initial post itself succeed. (However, doing so will probably hinder finding out the flood’s root cause.)
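A minimal sketch of that fallback, assuming the fetcher can return a plain list, an error tuple, or the nested `{:ok, {:error, ...}}` shape seen above (the helper below is hypothetical, not existing Akkoma code):

```elixir
defmodule RepliesFallback do
  # Normalise whatever the collection fetcher returns so that "replies" only
  # ever becomes a plain list of AP ids, never an error tuple. The nested
  # clause mirrors the {:ok, {:error, %Tesla.Env{status: 429, ...}}} result
  # observed in this issue.
  def normalize({:ok, ids}) when is_list(ids), do: ids
  def normalize({:ok, {:error, _}}), do: []
  def normalize(_), do: []
end
```

A `fix_replies`-style caller could then always do `Map.put(data, "replies", RepliesFallback.normalize(result))`; returning the items fetched before the failing page instead of an empty list would additionally need support from the fetcher itself.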
reading this, I am struck with a theory
if you have a particularly long string of replies, could we fetch the post, try to fetch replies, fetch the first reply, try to fetch -its- replies, and chain on like this?
I'll have to look at the pipeline to see if that could happen
me bigbig dum
PR is up, gonna run on ihba (i did before and it was fine but i'll run it rebased on current develop)