Improve stat queries and ReceiverWorker logic #862

Merged
floatingghost merged 36 commits from Oneric/akkoma:perf_tweaks_stats+jobs into develop 2025-02-14 19:22:40 +00:00
Member

This is a batch of patches primarily aimed at improving perf, but also cleaning up some other stuff along the way.
I’ve been gradually applying this on my instance clearing up metrics such that job exceptions are now actually meaningful and more importantly yielding a significant reduction in jobs with a noticeable database load reduction. Addresses part of #784.

The relative difference will probably differ for larger instances, but considering the two stat queries caused a considerable chunk of total db time even for ihba and akko.wtf in #784, i expect it to still be a measurable improvement.

I haven't seen anything odd in logs anymore, but while the major perf improvements have weeks of testing, i also only finished the last touch-ups about two days ago, so it might be good to give this some more testing before merge (it’s been uneventful weeks by now); preferably also on more servers.
The following allows viewing a filtered log with frequent but unrelated messages removed: `grep -Ev '^$|Can'\''t find LRDD template in|(Could not decode featured collection at fetch|Object rejected).*:(not_found|forbidden|mrf_reject_check|reject)' /var/log/akkoma | awk '/^[[0-9]{4}-[0-9]{2}-[0-9]{2} / {print "";} 1 {print;}' | less +G`

As usual, the PR is best reviewed commit by commit, and each commit message gives more details.

The main improvements fall into the following categories:

  • purge ap_enabled; backported from Pleroma + fixups
    This isn’t directly a perf improvement, but about half of all suspicious log entries I investigated during testing turned out to be a pre-existing issue related to poor ap_enabled handling, so to save me time and annoyance this is included before the main patches here.
    Dropping the user fetch and limiting refetches might very slightly improve perf though.
  • Avoid unnecessary Worker churn:
    Before this, a known-noop worker was created for each discovered post if db search is used, and NodeInfo refetcher jobs were frequently duplicated and also created when not needed.
    Creating, queuing and cleaning up jobs comes with overhead and increases PostgreSQL MVCC-related table bloat.
    Note: the job filtering mechanism here is implemented differently than in the 6-month-old #789 PR; the latter could adopt the more transparent version introduced here once both are merged.
  • Cheaper stat queries:
    Even though stat queries are only run once per 5min, they were ridiculously expensive and caused a well visible, regular spike in average query time. On one single-user instance in #784 it even took up more than half of the total database time!
  • Improve decision when to retry or discard ReceiverWorker and RichMediaBackfill.
    Previously, e.g. Deletes of unknown objects failed the validator and always exhausted all retries, and several similar issues existed.
    The Delete scenario is quite common: when a remote user followed by a local user deleted most of their posts, this caused by far the biggest receiver-worker job count (job failures) per minute the server experienced. But also in general there was a constant, high background job failure rate from needlessly retried receiver jobs.

Ever since adding the final "gracefully ignore" bits, i haven’t encountered any "invalid AP" logs so far.

Important disclaimers: this was actually production tested on something based off upstream commit 71d3bbd7be, i.e. slightly behind current develop; pulling in commits changing the db layout might have disturbed the relative perf diff stats.
Furthermore, more patches not in this PR were applied on top of the aforementioned commit, most notably in this context PR #789, which severely lessens the impact of the by far biggest offender in unpatched outlier stats (ref #784). Without #789 but with attachment cleanup enabled there likely won't be any noticeable change, since the absurdly high cost of futile attachment cleanup attempts drowns out everything else.

Results:
Before the patch the average total db query time had well visible spikes every 5 minutes when the stat queries got refreshed:

db_prepatch_5min_peak.webp

The average exec time of the remote and local user count query (both averaged together) was 18000µs per query; after the patches the remaining local user count queries only take 67µs on average and the remote estimate is quite fast too. The regular spikes are gone:

db_postpatch_no_peak.webp

No image for this, but similarly the processing cost from the oban_jobs UPDATE query also dropped; not only from running fewer queries, the cost of each individual query also dropped to a third presumably due to less bloat from deleted rows (oban tables are regularly reindexed by default).

As for jobs other than Receive-Fetch-Publish (RFP), the decrease is also well visible (note there are several hues of blue used to represent jobs and not all indicators are shown in the image; the bulk of the jobs prior to patching were SearchIndexingWorker and NodeInfoFetcher, not RichMedia and Background):
jobs_total_timeline.webp

As for RFP jobs, there used to be a constant background of failures exceeding completions:
rfp_events_zoomin.webp

See also the huge failure spike arising from a user account sending out deletes for almost all its posts, barely any of which were known locally and thus reprocessed 5 times:
rfp_events_total_timeline.webp

And the average job count per minute:
jobs_per_minute.webp

Everything combined also measurably decreased time spent on db actions on this small instance. Spikes in total time without a proportional rise in exec time are from a request flood causing many queries to stay in queue for a while:
db_48h-avg.webp

Comparing the averages of the 48h averages (ofc the two may differ due to unpredictable federation workload, but both time intervals span several weeks and i think the difference is clearly greater than expected from background fluctuations alone):

| | query exec | query total | job |
|----|----|----|----|
| pre-patch | 356.82ms | 373.27ms | 1022.02ms |
| post-patch | 267.64ms | 292.56ms | 783.06ms |
| relative time savings | 25% | 22% | 23% |
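The savings row follows directly from the pre/post averages; a quick check in plain Python, using the figures from the table above:

```python
# Recompute the "relative time savings" row from the measured averages (ms).
pre  = {"query exec": 356.82, "query total": 373.27, "job": 1022.02}
post = {"query exec": 267.64, "query total": 292.56, "job": 783.06}

# relative saving = 1 - post/pre, rounded to whole percent
savings = {k: round((1 - post[k] / pre[k]) * 100) for k in pre}
print(savings)  # {'query exec': 25, 'query total': 22, 'job': 23}
```
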
Future outlook (things out of scope for this PR)

The main issue and last thing i did was graceful handling of Deletes.
It’s awkward to do in the current transmogrifier, validator, side_effect pipeline.
Ideally we’d want to look up the object only once and pass it along, instead of refetching it.
The check for orphaned activities is also awkward and, since Activity queries are never cached, more costly.
I’m personally in favour of just dropping this and strongly recommending all prune_objects runs be followed by, or directly include, an activity prune.

Notably missing is skipping past duplicate user Deletes;
there’s even a test to ensure deletes of supposedly-deleted users are reattempted
since atm is_active is immediately set to false, but cleaning up posts etc.
happens afterwards in a job which might fail.
Spurious user Deletes are also common though, in fact my instance was hit by a buggy instance
repeatedly flooding the same user Delete every 5 minutes for months until the remote admin fixed it
and i’ve heard others be similarly affected by other instances of the same software.
Fixing this probably requires changing the deletion status to a tri-state active, delete_in_progress, deleted
plus db migrations, so i omitted it here.
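The tri-state idea could look roughly like this; an illustrative Python sketch only (the real change would be an Ecto enum plus a DB migration, and these state/action names are hypothetical):

```python
from enum import Enum

class DeletionState(Enum):
    ACTIVE = "active"
    DELETE_IN_PROGRESS = "delete_in_progress"
    DELETED = "deleted"

def handle_user_delete(state: DeletionState) -> tuple[DeletionState, str]:
    """Decide how to react to an incoming user Delete given current state."""
    if state is DeletionState.ACTIVE:
        # first Delete seen: mark in-progress and enqueue the cleanup job
        return DeletionState.DELETE_IN_PROGRESS, "enqueue_cleanup"
    if state is DeletionState.DELETE_IN_PROGRESS:
        # cleanup job still running (or failed and pending retry):
        # don't start a second cleanup, but don't pretend we're done either
        return state, "retry_later"
    # already fully deleted: spurious repeat Deletes become cheap no-ops
    return state, "noop"
```

With this, the flood of repeated user Deletes described above would hit the `noop` branch instead of spawning a full deletion job each time.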

Another future improvement would be better handling of duplicate inserts,
which now appear to be the most frequent cause of :link_resolve retries.
E.g. when a followed remote user replies to and likes a post, the parent will be fetched
when processing the received reply, but also needs to be resolved when receiving the Like.
If both arrive nearly simultaneously it’s possible both start fetching (annoying remote servers with duplicated requests), but only one wins and the other errors out. If the Like loses, it will log a message
and the job will be retried.
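One common way to handle this race is the "insert, and on a unique-constraint conflict fetch the row the winner inserted" pattern. A minimal sketch with an in-memory store standing in for the database, purely to illustrate the idea (not Akkoma's actual code):

```python
class UniqueViolation(Exception):
    """Stands in for the database's unique-constraint error."""

class FakeObjectStore:
    """A dict keyed by AP id stands in for the objects table + unique index."""
    def __init__(self):
        self._rows = {}

    def insert(self, ap_id, data):
        if ap_id in self._rows:
            raise UniqueViolation(ap_id)
        self._rows[ap_id] = data
        return data

    def get(self, ap_id):
        return self._rows.get(ap_id)

def insert_or_fetch(store, ap_id, data):
    """Insert the fetched object; on a duplicate-key race, reuse the row
    the concurrent job already inserted instead of failing the whole job."""
    try:
        return store.insert(ap_id, data)
    except UniqueViolation:
        return store.get(ap_id)

store = FakeObjectStore()
a = insert_or_fetch(store, "https://example.com/note/1", {"content": "hi"})
b = insert_or_fetch(store, "https://example.com/note/1", {"content": "hi"})
assert a is b  # the loser reuses the winner's row instead of erroring out
```

This turns the losing job's `:link_resolve` failure-and-retry into an immediate success, and avoids the duplicated remote fetch only if the existence check happens before fetching.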

Also the retry backoff for :rate_limit errors should ideally use the rate-limit reset
info sent by the remote. Our current receiver retry backoff is far too short for this,
and most likely all retries will just run into another rate_limit error.

This PR also doesn't do anything about the high insertion cost for activities
(on average 1000µs per insertion here; for objects it’s 290µs and for saner tables about 1-90µs).

Oneric force-pushed perf_tweaks_stats+jobs from 1c276f36d5 to 870d15de12 2024-12-22 21:06:14 +00:00 Compare
Oneric force-pushed perf_tweaks_stats+jobs from 870d15de12 to cac7ab8bc0 2025-01-03 19:54:56 +00:00 Compare
Author
Member

rebased to resolve the CHANGELOG conflict and added a new small commit to also log any crashes occurring during ReceiverWorker handling.

Nothing problematic happened since posting the patches.
Occasionally (like about once a day) there’s an odd failing job due to a missing object which should be impossible due to the prior checks added here, but this might come from some cache weirdness (cache changes don't obey transaction isolation) and/or a race condition. Usually those errors disappeared on retry. During the whole time so far only two Update activities (couldn't find the object to Update in later stages, even though initial checks found it) and one Delete (trying to Delete an actor-less tombstone in later stages, even though initial checks for an actor field succeeded) had persistent failures.

Whatever it is, it's almost certainly a pre-existing thing, since the problem always happens in parts not touched by this PR and evidently passes barriers set up in this PR. Thus I don’t think they are blockers for merging this.


will stage this against current develop and run it on ihba for a while, will release as part of a patch in a week or two if it's all gud

Oneric force-pushed perf_tweaks_stats+jobs from cac7ab8bc0 to ca1681e224 2025-01-06 05:47:14 +00:00 Compare
Oneric force-pushed perf_tweaks_stats+jobs from ca1681e224 to b441251aed 2025-01-07 19:28:06 +00:00 Compare
Author
Member

for reference: after discussion on IRC, we suspect the nil inbox issue addressed by 2025.01.01 resulted from an old OStatus user account never upgraded to AP and the removal of ap_enabled here.
AP requires actors to have an inbox and outbox, and since a few versions we explicitly reject users without an inbox and (unintentionally) crash on explicit-nil inboxes while parsing the user, effectively rejecting those as well.

If true, this can only affect instances old enough to have started out with OStatus support still present and which never received anything from a follower triggering a user refresh (possibly because the remote died before adopting AP).

While it wasn’t linked to the removal of ap_enabled, Pleroma experienced the same issue some months after releasing the removal of ap_enabled and applied a very similar fix: https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4083
There were no further nil inbox patches in Pleroma afaict, making it seem unlikely this causes any further problems.

The inbox value should automatically be updated on the next user refresh if they now support AP. If desired we could just leave it at the already deployed fix, or try to refresh all ap_enabled=false users already during DB migration, but the existing fix needs to remain to deal with dead or temporarily unavailable remotes. Imho "just waiting for the next natural refresh" is good enough

Oneric added 3 commits 2025-01-27 18:58:33 +00:00
And for now treat partial fetches as a success, since for all
current users partial collection data is better than no data at all.

If an error occurred while fetching a page, this previously
returned a bogus {:ok, {:error, _}} success, causing the error
to be attached to the object as a reply list, subsequently
leading to the whole post getting rejected during validation.

Also the pinned collection caller did not actually handle
the preexisting error case resulting in process crashes.
Reported by riley on IRC
federation.md: list FEP-dc88 formatting mathematics
ace86fdaaa
Implemented by #642
Author
Member

Added another fix for preëxisting bugs related to collection fetches + two unrelated small fixes here since the exact commit likely depends on prior changes here anyway and there’s little point in untangling it just to deal with merge conflicts later.

Oneric added 1 commit 2025-02-10 03:41:43 +00:00
article_note_validator: fix handling of Mastodon-style replies collections
94738f29cc
The first collection page is (sometimes?) inlined
which caused crashes when attempting to log the fetch failure.
But there’s no need to fetch and we can treat it like the other inline collection
Oneric force-pushed perf_tweaks_stats+jobs from 94738f29cc to f0a99b4595 2025-02-14 17:50:32 +00:00 Compare

i have indeed been running this patch locally since i last commented, with no issues noted

floatingghost merged commit 4c41f8c286 into develop 2025-02-14 19:22:40 +00:00
floatingghost deleted branch perf_tweaks_stats+jobs 2025-02-14 19:22:44 +00:00