Fix various attachment cleanup issues #789

Open
Oneric wants to merge 2 commits from Oneric/akkoma:attachcleanup-overeager into develop
Member

Comparing the run count of the cleanup query from #784 (comment) with the total attachment count from #765 (comment) showed suspiciously many query executions.
Adding some debug logs (thx @norm) revealed the query also runs for deleted remote posts, even though there is obviously nothing we can or need to do about them. Skip those useless yet expensive query runs.
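As a rough sketch of the intended check (module and function names here are illustrative only, not the actual implementation):

```elixir
# Illustrative sketch only; all names are assumptions, not Akkoma's actual code.
# Idea: decide up front whether a deleted object can even have locally stored
# attachments; remote posts and posts without attachments are no-ops.
defmodule AttachmentCleanupSketch do
  @instance_base_url "https://example.social"

  # Local post with attachments: the (expensive) cleanup query is warranted.
  def cleanup_needed?(%{"attachment" => [_ | _], "id" => id}),
    do: String.starts_with?(id, @instance_base_url)

  # No attachments at all: nothing to clean up.
  def cleanup_needed?(_object_data), do: false
end

# A remote post with attachments is filtered out before any DB work happens:
AttachmentCleanupSketch.cleanup_needed?(%{
  "id" => "https://remote.example/objects/1",
  "attachment" => [%{"url" => "https://remote.example/media/a.png"}]
})
#=> false
```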

While at it also:

  • filter out noops earlier to avoid unnecessary Oban-related DB churn
  • fix deletion of files after a media domain migration (previously mentioned in #765) if a new setting is set; this new setting might also be helpful for an eventual db schema migration (again #765)
  • ~~add a delay before deletion to fix #775~~ moved into a separate PR for further discussion
  • ~~fix the changelog~~ cherry-picked into and applied as part of #786
Oneric added 5 commits 2024-06-03 22:28:25 +00:00
The cleanup attachment worker was run for every deleted post,
even if it’s a remote post whose attachments we don't even store.
This was especially bad due to attachment cleanup involving a
particularly heavy query wasting a bunch of database perf for nil.

This was uncovered by comparing statistics from
#784 and
#765 (comment)
Until now only the current base_url was checked.
After a media domain migration all pre-existing files
thus became undeletable. Fix this with a new config
option that allows listing all old media base urls.
(This may also come in handy for a later db refactor, see
 #765)
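A possible shape for such an option (the `all_base_urls` name is taken from the discussion below; the key name and its placement under `Pleroma.Upload` are assumptions, not the final config):

```elixir
# Hypothetical config sketch; `all_base_urls` and its placement are assumptions.
# Idea: list every base URL attachments were ever served under, so cleanup can
# still recognise (and delete) files uploaded before a media domain migration.
config :pleroma, Pleroma.Upload,
  base_url: "https://media.example.org/usercontent/",
  all_base_urls: [
    "https://media.example.org/usercontent/",
    "https://example.org/media/"
  ]
```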
Otherwise attachments have a high chance to disappear with akkoma-fe’s
“delete & redraft” feature when cleanup is enabled in the backend. Since
we don't know whether a deletion was intended to be part of a redraft
process or whether the redraft was abandoned, we still have to
delete attachments eventually.
A thirty minute delay should provide sufficient time for redrafting.

Fixes: #775
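For context, such a delay would most naturally be implemented by scheduling the Oban cleanup job into the future instead of running it immediately; a minimal sketch assuming the `cleanup_attachments_delay` setting reviewed below and Akkoma's existing attachment cleanup worker (the argument shape is an assumption):

```elixir
# Minimal sketch, not the actual enqueue code from this PR: read the configured
# delay (in seconds) and let Oban schedule the cleanup that far into the future.
object_data = %{"id" => "https://example.social/objects/123"}  # placeholder

delay = Pleroma.Config.get([:instance, :cleanup_attachments_delay], 1800)

%{"op" => "cleanup_attachments", "object" => object_data}
|> Pleroma.Workers.AttachmentsCleanupWorker.new(schedule_in: delay)
|> Oban.insert()
```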
Apparently got jumbled during some rebase(s)
Add cleanup_attachment-related changelog entries
Oneric force-pushed attachcleanup-overeager from a15afbbc76 to 0d3edfb7fe 2024-06-04 00:50:43 +00:00 Compare
Oneric force-pushed attachcleanup-overeager from 0d3edfb7fe to c5bb5397f1 2024-06-04 00:51:04 +00:00 Compare
norm reviewed 2024-06-04 00:56:40 +00:00
@@ -254,6 +255,7 @@ config :pleroma, :instance,
external_user_synchronization: true,
extended_nickname_format: true,
cleanup_attachments: false,
cleanup_attachments_delay: 1800,
Contributor

Kinda think 30 minutes is a bit long if the main purpose is to allow for delete & redraft, though I guess there wouldn't be more than a few d/r requests going on at a time anyways.

Author
Member

i just picked some value which should be safe for virtually all redraft scenarios; i’m open to change it to any other suggested value

quite honestly any value is fine, having a few extra bytes on your drive for 30 mins isn't the end of the world

Contributor

Also unsure about the `all_base_url` thing, having to track every base url ever used manually seems like a bit much.

Author
Member

Also unsure about the all_base_url thing, having to track every base url ever used manually seems like a bit much.

I’m not sure how to reliably solve this without having a full base url list in the current db scheme.

With forced dedupe, the file identifier of any URL will only be the last part and we don’t need a base url at all. However, without dedupe it was typically the last two parts for the local uploader. For S3 there could once be arbitrarily many parts before f77ec96707, always two afterwards, and with later changes only one (with no consideration of what happens if “strict_encoding” creates overlaps; impossible now with forced dedupe). I’ve also seen another, now removed uploader in git history which might have had yet another file identifier scheme. When migrating between uploaders old identifiers will (mostly) be kept iiuc.
If we try to handle this by just using progressively more parts of the URL as the identifier until deletion succeeds, we might end up mistakenly deleting another file (unlikely in common setups, but if it happens it causes irrecoverable data loss); e.g.:

`a.png` was uploaded in the past to the base URL `https://example.org/media/service1/` without intermediate directories. The file got lost during some hardware or uploader migration. Later on, something gets uploaded with the file handle `service1/a.png` to the base URL `https://media.example.org/usercontent/`.
Now the post containing the original `a.png` gets deleted; deleting the original file fails, so deletion of `service1/a.png` is successfully attempted instead, and an unrelated file gets destroyed.
(to make this more applicable to the actual logic, replace `service1` with some UUID and `a` with a shasum; for readability i used friendlier names here)

In practice i’d expect media domain and/or uploader migrations to happen rarely if at all, and therefore for most instances the list should only consist of 2 or 3 entries anyway (local↔S3, instance domain → subdomain).

I also _really_ do not like the `all_base_urls` setting

suggestion: best-effort in which we check for the same base domain

for example, we consider an attachment deletable if its base domain (x.(MATCHES.net)) is the same as our endpoint OR media URLs - which should account for both media migrated to s3 and people that just moved to a subdomain

Author
Member

suggestion: best-effort in which we check for the same base domain

for example, we consider an attachment deletable if its base domain (x.(MATCHES.net)) is the same as our endpoint OR media URLs - which should account for both media migrated to s3 and people that just moved to a subdomain

There might still be some failures to delete, like after migrating from S3 to local uploader, but afaict no accidental deletions of unrelated files. For file cleanup this is probably good enough.
Looking ahead to an eventual db refactor though, do you think it’s also reasonable to lose media objects for such affected files during the eventual db migration, or do you have ideas on how to solve this too without a full base url list?
**EDIT:** to clarify, attachments from affected posts will not disappear, but they’ll be treated like remote attachments with no local `media_id`

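For illustration, the best-effort base-domain match suggested above could look roughly like this (a sketch only; the module name and the naive domain handling are assumptions, not an actual implementation):

```elixir
# Rough sketch of the best-effort check; not actual Akkoma code. A production
# version would need a public-suffix-aware library (example.co.uk etc.).
defmodule DeletableCheckSketch do
  # Naive "base domain": the last two labels of the URL's host.
  defp base_domain(url) do
    host = URI.parse(url).host || ""
    host |> String.split(".") |> Enum.take(-2) |> Enum.join(".")
  end

  # An attachment is considered deletable when its base domain matches the
  # base domain of the instance endpoint or of the configured media base URL.
  def deletable?(attachment_url, endpoint_url, media_base_url) do
    base_domain(attachment_url) in [base_domain(endpoint_url), base_domain(media_base_url)]
  end
end

# Old instance-domain uploads and current media-subdomain uploads both match:
DeletableCheckSketch.deletable?(
  "https://example.net/media/abc.png",
  "https://example.net",
  "https://media.example.net"
)
#=> true
```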
Oneric force-pushed attachcleanup-overeager from c5bb5397f1 to 8578b1103b 2024-06-07 17:05:13 +00:00 Compare
Oneric force-pushed attachcleanup-overeager from 8578b1103b to df2373e550 2024-06-11 23:32:26 +00:00 Compare
Author
Member

swapped the commit order so the (now) first two can already be merged without a domain migration fix if desired; not constantly running a heavy but useless query should be a nice improvement

re `all_base_urls`: i can change to e.g. the suggested “best-effort” guess, but just wanted to make sure possible interactions with later changes are taken into consideration
(but i’m too lazy to change things when it’s not yet clear if or what change is needed)

Oneric force-pushed attachcleanup-overeager from df2373e550 to 5e1a47ce70 2024-07-06 01:22:35 +00:00 Compare
Author
Member

To avoid any obstacles i completely dropped the media migration fix from here; the remaining two commits already provide a significant improvement by avoiding hammering the DB with heavy but unnecessary noop queries

I’ll open a new PR to discuss how to best deal with migrated media domains

Oneric referenced this pull request from a commit 2024-07-06 01:28:50 +00:00