And for now treat partial fetches as a success, since for all
current users partial collection data is better than no data at all.
If an error occurred while fetching a page, this previously
returned a bogus {:ok, {:error, _}} success, causing the error
to be attached to the object as an reply list subsequently
leading to the whole post getting rejected during validation.
Also the pinned collection caller did not actually handle
the preexisting error case resulting in process crashes.
Oban cataches crashes to handle job failure and retry,
thus it never bubbles up all the way and nothing is logged by default.
For better debugging, catch and log any crashes.
The object lookup is later repeated in the validator, but due to
caching shouldn't incur any noticeable performance impact.
It’s actually preferable to check here, since it avoids the otherwise
occuring user lookup and overhead from starting and aborting a
transaction
Most of them actually only accept either activities or a
non-activity object later; querying both is then a waste
of resources and may create false positives.
The only thing this does is changing the updated_at field of the user.
Afaict this came to be because prior to pins federating this was split
into two functions, one of which created a changeset, the other applying
a given changeset. When this was merged the bits were just copied into
place.
User.get_or_fetch_by_(apid|nickname) are the only external users of fetch_and_prepare_user_from_ap_id,
thus there’s no point in duplicating logging, expecially not at error level.
Currently (duplicated) _not_found errors for users make up the bulk of my logs
and are created almost every second. Deleted users are a common occurence and not
worth logging outside of debug
To facilitate this ObjectValidator.fetch_actor_and_object is adapted to
return an informative error. Otherwise we’d be unable to make an
informed decision on retrying or not later. There’s no point in
retrying to fetch MRF-blocked stuff or private posts for example.
This is the only user of fetch_actor_and_object which previously just
always preteneded to be successful. For all the activity types handled
here, we absolutely need the referenced object to be able to process it
(other than Announce whether or not processing those activity types for
unknown remote objects is desirable in the first place is up for debate)
All other users of the similar fetch_actor already properly check success.
Note, this currently lumps all reolv failure reasons together,
so even e.g. boosts of MRF rejected posts will still exhaust all
retries. The following commit improves on this.
It makes decisions based on error sources harder since all possible
nesting levels need to be checked for. As shown by the return values
handled in the receiver worker something else still nests those,
but this is a first start.
This value is currently only used by Prometheus metrics
but (after optimisng the peer query inthe preceeding commit)
the most costly part of instance stats.
This query is one of the top cost offenders during an instances
lifetime. For small instances it was shown to take up 30-50% percent of
the total database query time, while for bigger isntaces it still held
a spot in the top 3 — alost as or even more expensive overall than
timeline queries!
The good news is, there’s a cheaper way using the instance table:
no need to process each entry, no need to filter NULLs
and no need to dedupe. EXPLAIN estimates the cost of the
old query as 13272.39 and the cost of the new query as 395.74
for me; i.e. a 33-fold reduction.
Results can slightly differ. E.g. we might have an old user
predating the instance tables existence and no interaction with since
or no instance table entry due to failure to query nodeinfo.
Conversely, we might have an instance entry but all known users got
deleted since.
However, this seems unproblematic in practice
and well worth the perf improvment.
Given the previous query didn’t exclude unreachable instances
neither does the new query.
Ideally we’d like to split this up more and count most invalid documents
as an error, but silently drop e.g. Deletes for unknown objects.
However, this is hard to extract from the changeset and jobs canceled
with :discard don’t count as exceptions and I’m not aware of a idiomatic
way to cancel further retries while retaining the exception status.
Thus at least keep a log, but since superfluous "Delete"s
seem kinda frequent, don't log at error, only info level.
Retrying seems unlikely to be helpful:
- if it timed out, chances are the short delay before reattempting
won't give the remote enough time to recover from its outage and
a longer delay makes the job pointless as users likely scrolled
further already. (Likely this is already be the case after the
first 20s timeout)
- if the remote data is so borked we couldn't even parse it far enough
for an "invalid metadata" error, chances are it will remain borked
upon reattempt
There were two issues leading to needles effort:
Most importnatly, the use of AP IDs as "source_url" meant multiple
simultaneous jobs got scheduled for the same instance even with the
default unique settings.
Also jobs were scheduled uncontionally for each processed AP object
meaning we incured oberhead from managing Oban jobs even if we knew it
wasn't necessary. By comparison the single query to check if an update
is needed should be cheaper overall.
The return value is never used here; later stages which actually need it
fetch the user themselves and it doesn't matter wheter we wait for the
fech here or later (if needed at all).
Even more, this early fetch always fails if the user was already deleted
or never known to begin with, but we get something referencing it; e.g.
the very Delete action carrying out the user deletion.
This prevents processing of the Delete, but before that it will be
reattempted several times, each time attempring to fetch the
non-existing profile, wasting resources.
It was used to migrate OStatus connections to ActivityPub if possible,
but support for OStatus was long since dropped, all new actors always AP
and if anything wasn't migrated before, their instance is already marked
as unreachable anyway.
The associated logic was also buggy in several ways and deleted users
got set to ap_enabled=false also causing some issues.
This patch is a pretty direct port of the original Pleroma MR;
follow-up commits will further fix and clean up remaining issues.
Changes made (other than trivial merge conflict resolutions):
- converted CHANGELOG format
- adapted migration id for Akkoma’s timeline
- removed ap_enabled from additional tests
Ported-from: https://git.pleroma.social/pleroma/pleroma/-/merge_requests/3880