To facilitate this ObjectValidator.fetch_actor_and_object is adapted to
return an informative error. Otherwise we’d be unable to make an
informed decision on retrying or not later. There’s no point in
retrying to fetch MRF-blocked stuff or private posts for example.
This is the only user of fetch_actor_and_object which previously just
always preteneded to be successful. For all the activity types handled
here, we absolutely need the referenced object to be able to process it
(other than Announce whether or not processing those activity types for
unknown remote objects is desirable in the first place is up for debate)
All other users of the similar fetch_actor already properly check success.
Note, this currently lumps all reolv failure reasons together,
so even e.g. boosts of MRF rejected posts will still exhaust all
retries. The following commit improves on this.
It makes decisions based on error sources harder since all possible
nesting levels need to be checked for. As shown by the return values
handled in the receiver worker something else still nests those,
but this is a first start.
This value is currently only used by Prometheus metrics
but (after optimisng the peer query inthe preceeding commit)
the most costly part of instance stats.
This query is one of the top cost offenders during an instances
lifetime. For small instances it was shown to take up 30-50% percent of
the total database query time, while for bigger isntaces it still held
a spot in the top 3 — alost as or even more expensive overall than
timeline queries!
The good news is, there’s a cheaper way using the instance table:
no need to process each entry, no need to filter NULLs
and no need to dedupe. EXPLAIN estimates the cost of the
old query as 13272.39 and the cost of the new query as 395.74
for me; i.e. a 33-fold reduction.
Results can slightly differ. E.g. we might have an old user
predating the instance tables existence and no interaction with since
or no instance table entry due to failure to query nodeinfo.
Conversely, we might have an instance entry but all known users got
deleted since.
However, this seems unproblematic in practice
and well worth the perf improvment.
Given the previous query didn’t exclude unreachable instances
neither does the new query.
Ideally we’d like to split this up more and count most invalid documents
as an error, but silently drop e.g. Deletes for unknown objects.
However, this is hard to extract from the changeset and jobs canceled
with :discard don’t count as exceptions and I’m not aware of a idiomatic
way to cancel further retries while retaining the exception status.
Thus at least keep a log, but since superfluous "Delete"s
seem kinda frequent, don't log at error, only info level.
Retrying seems unlikely to be helpful:
- if it timed out, chances are the short delay before reattempting
won't give the remote enough time to recover from its outage and
a longer delay makes the job pointless as users likely scrolled
further already. (Likely this is already be the case after the
first 20s timeout)
- if the remote data is so borked we couldn't even parse it far enough
for an "invalid metadata" error, chances are it will remain borked
upon reattempt
There were two issues leading to needles effort:
Most importnatly, the use of AP IDs as "source_url" meant multiple
simultaneous jobs got scheduled for the same instance even with the
default unique settings.
Also jobs were scheduled uncontionally for each processed AP object
meaning we incured oberhead from managing Oban jobs even if we knew it
wasn't necessary. By comparison the single query to check if an update
is needed should be cheaper overall.
The return value is never used here; later stages which actually need it
fetch the user themselves and it doesn't matter wheter we wait for the
fech here or later (if needed at all).
Even more, this early fetch always fails if the user was already deleted
or never known to begin with, but we get something referencing it; e.g.
the very Delete action carrying out the user deletion.
This prevents processing of the Delete, but before that it will be
reattempted several times, each time attempring to fetch the
non-existing profile, wasting resources.
It was used to migrate OStatus connections to ActivityPub if possible,
but support for OStatus was long since dropped, all new actors always AP
and if anything wasn't migrated before, their instance is already marked
as unreachable anyway.
The associated logic was also buggy in several ways and deleted users
got set to ap_enabled=false also causing some issues.
This patch is a pretty direct port of the original Pleroma MR;
follow-up commits will further fix and clean up remaining issues.
Changes made (other than trivial merge conflict resolutions):
- converted CHANGELOG format
- adapted migration id for Akkoma’s timeline
- removed ap_enabled from additional tests
Ported-from: https://git.pleroma.social/pleroma/pleroma/-/merge_requests/3880