Fix various attachment cleanup issues #789

Open
Oneric wants to merge 3 commits from Oneric/akkoma:attachcleanup-overeager into develop
Member

Comparing the run count of the cleanup query from #784 (comment) with the total attachment count from #765 (comment) showed suspiciously many query executions.
Adding some debug logs (thx @norm) revealed the query also runs for deleted remote posts, even though we obviously don't need to do anything about them. Skip those useless yet expensive query runs.

While at it also:

  • filter out noops earlier to avoid unnecessary Oban-related DB churn (see the sketch after this list)
  • fix deletion of files after a media domain migration (previously mentioned in #765) if a new setting is set; this new setting might also be helpful for an eventual db schema migration (again #765)
  • add a delay before deletion to fix #775
  • fix the changelog
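To illustrate the first two points, a minimal sketch of the enqueue-side checks. This is not the actual Akkoma code: the worker name `AttachmentsCleanupWorker`, the `%{local: ...}` post shape, and the job args are assumptions for illustration.

```elixir
defmodule AttachmentCleanupSketch do
  # Remote posts: their attachments were never stored by us, so running
  # the heavy cleanup query would burn DB time for nothing.
  def maybe_enqueue_cleanup(%{local: false}, _attachments), do: :ok

  # Posts without attachments: filtering these out here avoids creating
  # Oban jobs that are guaranteed noops.
  def maybe_enqueue_cleanup(_post, []), do: :ok

  # Only local posts with attachments get a cleanup job enqueued.
  def maybe_enqueue_cleanup(%{local: true, id: id}, _attachments) do
    %{"op" => "cleanup_attachments", "object_id" => id}
    |> AttachmentsCleanupWorker.new()
    |> Oban.insert()
  end
end
```
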
Oneric added 5 commits 2024-06-03 22:28:25 +00:00
The cleanup attachment worker was run for every deleted post,
even if it’s a remote post whose attachments we don't even store.
This was especially bad since attachment cleanup involves a
particularly heavy query, wasting a bunch of database perf for nothing.

This was uncovered by comparing statistics from
#784 and
#765 (comment)
Until now only the current base_url was checked.
After a media domain migration all pre-existing files
thus turned undeletable. Fix this with a new config
option allowing all old media base URLs to be listed.
(This may also come in handy for a later db refactor, see
 #765)
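
A hypothetical config sketch of how such an option could look; the key name `all_base_urls` is taken from the discussion below and may differ in the final patch:

```elixir
import Config

config :pleroma, :instance,
  cleanup_attachments: true,
  # every base URL media was ever served under, so files uploaded
  # before a media domain migration remain deletable
  all_base_urls: [
    "https://media.example.org/",
    "https://example.org/media/"
  ]
```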
Otherwise attachments have a high chance of disappearing with akkoma-fe’s
“delete & redraft” feature when cleanup is enabled in the backend. Since
we don't know whether a deletion was intended to be part of a redraft
process, or even whether the redraft was abandoned, we still have to
delete attachments eventually.
A thirty-minute delay should provide sufficient time for redrafting.

Fixes: #775
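
A sketch of how such a delay can be applied when enqueueing the job. `:schedule_in` is a standard Oban job option and `cleanup_attachments_delay` matches the diff reviewed below; the worker name and surrounding plumbing are assumptions:

```elixir
# Defer cleanup by the configured number of seconds (default 1800 = 30 min
# per the diff below) so “delete & redraft” can complete before files vanish.
delay = Pleroma.Config.get([:instance, :cleanup_attachments_delay], 1800)

%{"op" => "cleanup_attachments", "object_id" => object.id}
|> AttachmentsCleanupWorker.new(schedule_in: delay)
|> Oban.insert()
```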
Apparently got jumbled during some rebase(s)
Add cleanup_attachment-related changelog entries
Some checks are pending
ci/woodpecker/pr/build-amd64 Pipeline is pending
ci/woodpecker/pr/build-arm64 Pipeline is pending
ci/woodpecker/pr/docs Pipeline is pending
ci/woodpecker/pr/lint Pipeline is pending
ci/woodpecker/pr/test Pipeline is pending
a15afbbc76
Oneric force-pushed attachcleanup-overeager from a15afbbc76 to 0d3edfb7fe 2024-06-04 00:50:43 +00:00 Compare
Oneric force-pushed attachcleanup-overeager from 0d3edfb7fe to c5bb5397f1 2024-06-04 00:51:04 +00:00 Compare
norm reviewed 2024-06-04 00:56:40 +00:00
@@ -254,6 +255,7 @@ config :pleroma, :instance,
   external_user_synchronization: true,
   extended_nickname_format: true,
   cleanup_attachments: false,
+  cleanup_attachments_delay: 1800,
Contributor

Kinda think 30 minutes is a bit long if the main purpose is to allow for delete & redraft, though I guess there wouldn't be more than a few d/r requests going on at a time anyways.

Author
Member

i just picked some value which should be safe for virtually all redraft scenarios; i’m open to change it to any other suggested value

quite honestly any value is fine, having a few extra bytes on your drive for 30 mins isn't the end of the world

Contributor

Also unsure about the `all_base_url` thing, having to track every base url ever used manually seems like a bit much.
Author
Member

> Also unsure about the `all_base_url` thing, having to track every base url ever used manually seems like a bit much.

I’m not sure how to reliably solve this *without* having a full base url list in the current db scheme.

With forced dedupe, the file identifier of any URL will only be the last part and we don’t need a base url at all. However, before, without dedupe, this was typically the last two parts for the local uploader. For S3 there once could be arbitrarily many parts before f77ec96707, always two after it, and with later changes always only one *(and no consideration of what happens if “strict_encoding” creates overlaps; impossible now with forced dedupe)*. I’ve also seen another, now removed uploader in git history which might have had yet another file identifier scheme. When migrating between uploaders, old identifiers will (mostly) be kept iiuc.
If we tried to handle this by using progressively more parts of the URL as an identifier until deletion succeeds, we might end up mistakenly deleting another file *(albeit unlikely to happen in common setups; if it does happen, it creates irrecoverable data loss)*; e.g.:

`a.png` was uploaded in the past to the base URL `https://example.org/media/service1/` without intermediate directories. The file got lost during some hardware or uploader migration. Later on, something gets uploaded with the file handle `service1/a.png` to base URL `https://media.example.org/usercontent/`.
Now the post containing the original `a.png` gets deleted; deleting the original file fails, so deletion of `service1/a.png` is attempted next and succeeds — but now an unrelated file got destroyed.
*(To make this more applicable to the actual logic, replace `service1` with some UUID and `a` with a shasum; for readability I used friendlier names here.)*

In practice I’d expect media domain and/or uploader migrations to happen rarely if at all, and therefore for most instances the list should only consist of 2 or 3 entries anyway *(local↔S3, instance domain → subdomain)*.
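
For illustration, a minimal sketch of the prefix-strip lookup the full-list option enables (hypothetical helper, not the actual patch): the base URL must match exactly, so there is no progressively-longer-suffix guessing that could hit an unrelated file.

```elixir
defmodule BaseUrlStripSketch do
  # Return the storage identifier for a URL by stripping the first matching
  # known base URL; nil means “not ours / not deletable”, never a guess.
  def strip_base_url(url, base_urls) do
    Enum.find_value(base_urls, fn base ->
      if String.starts_with?(url, base) do
        String.replace_prefix(url, base, "")
      end
    end)
  end
end
```

For example, `strip_base_url("https://media.example.org/usercontent/service1/a.png", ["https://media.example.org/usercontent/"])` yields `"service1/a.png"`, while a URL under an unknown base yields `nil` and is left alone.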

I also *really* do not like the `all_base_urls` setting

suggestion: best-effort in which we check for the same base domain

for example, we consider an attachment deletable if its base domain (x.(MATCHES.net)) is the same as our endpoint OR media URLs - which should account for both media migrated to s3 and people that just moved to a subdomain

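A rough sketch of that best-effort check, using a naive “last two labels” notion of base domain; this is an assumption-laden illustration, and a real implementation would need a public-suffix list to handle TLDs like .co.uk:

```elixir
defmodule BaseDomainSketch do
  # Deletable if the attachment’s base domain matches the base domain of
  # our endpoint or of the configured media URL.
  def deletable?(attachment_url, our_urls) do
    Enum.any?(our_urls, &(base_domain(&1) == base_domain(attachment_url)))
  end

  # Naive base-domain extraction: keep only the last two host labels,
  # e.g. "media.example.org" -> "example.org".
  defp base_domain(url) do
    url
    |> URI.parse()
    |> Map.get(:host)
    |> String.split(".")
    |> Enum.take(-2)
    |> Enum.join(".")
  end
end
```
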
Author
Member

> suggestion: best-effort in which we check for the same base domain
>
> for example, we consider an attachment deletable if its base domain (x.(MATCHES.net)) is the same as our endpoint OR media URLs - which should account for both media migrated to s3 and people that just moved to a subdomain

There might still be some failures to delete, like after migrating *from* S3 *to* the local uploader, but afaict no accidental deletions of unrelated files. For file cleanup this is probably good enough.
Looking ahead to an eventual db refactor though: do you think it’s also reasonable to lose media objects for such affected files during the eventual db migration, or do you have ideas for solving this too without a full base url list?
**EDIT:** to clarify, attachments from affected posts will not disappear, but they’ll be treated like remote attachments with no local `media_id`

Oneric force-pushed attachcleanup-overeager from c5bb5397f1 to 8578b1103b 2024-06-07 17:05:13 +00:00 Compare
Oneric force-pushed attachcleanup-overeager from 8578b1103b to df2373e550 2024-06-11 23:32:26 +00:00 Compare
Author
Member

swapped the commit order so the (now) first two can already be merged without the domain migration fix if desired; no longer constantly running a heavy but useless query should be a nice improvement

re `all_base_urls`: i can change it to e.g. the suggested “best-effort” guess, but just wanted to make sure possible interactions with later changes are taken into consideration
(but i’m too lazy to change things when it’s not yet clear if or what change is needed)

All checks were successful
ci/woodpecker/pr/lint Pipeline was successful
ci/woodpecker/pr/test Pipeline was successful
ci/woodpecker/pr/build-amd64 Pipeline was successful
ci/woodpecker/pr/build-arm64 Pipeline was successful
ci/woodpecker/pr/docs Pipeline was successful
This pull request has changes conflicting with the target branch.
  • docs/docs/configuration/cheatsheet.md

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u attachcleanup-overeager:Oneric-attachcleanup-overeager
git checkout Oneric-attachcleanup-overeager