Fix various attachment cleanup issues #789

Open
Oneric wants to merge 2 commits from Oneric/akkoma:attachcleanup-overeager into develop
Member

Comparing the run count of the cleanup query from [#784 (comment)](https://akkoma.dev/AkkomaGang/akkoma/issues/784#issuecomment-12252) with the total attachment count from [#765 (comment)](https://akkoma.dev/AkkomaGang/akkoma/issues/765#issuecomment-12256) showed suspiciously many query executions.
Adding some debug logs (thx @norm) revealed the query also runs for deleted remote posts even though we obviously don’t need to do anything about them. Skip those useless yet expensive query runs.

While at it also:

  • filter out noops earlier to avoid unnecessary Oban-related DB churn (see the sketch after this list)
  • add a delay before deletion to fix #775
  • ~~fix deletion of files after a media domain migration (prev mentioned in #765) if a new setting is set; this new setting might also be helpful for an eventual db schema migration (again #765)~~ moved into separate PR for further discussion
  • ~~fix the changelog~~ cherry-picked into and applied as part of #786
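
Conceptually (a hypothetical sketch with made-up module, function, and field names, not the actual diff), the first two points amount to bailing out before the expensive query for anything that is not a local post with attachments:

```elixir
defmodule CleanupGuard do
  # Hypothetical sketch, not Akkoma's real module or field names.
  # Deleted remote posts: nothing of ours to clean up, skip the heavy query.
  def maybe_schedule_cleanup(%{local: false}), do: :skip
  # Local posts without attachments: a guaranteed noop, so don't enqueue an Oban job at all.
  def maybe_schedule_cleanup(%{attachments: []}), do: :skip
  # Only real candidates reach the expensive attachment-usage query.
  def maybe_schedule_cleanup(post), do: {:enqueue_cleanup, post}
end

CleanupGuard.maybe_schedule_cleanup(%{local: false, attachments: ["a.png"]})
# => :skip
```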
Oneric force-pushed attachcleanup-overeager from a15afbbc76 to 0d3edfb7fe 2024-06-04 00:50:43 +00:00 Compare
Oneric force-pushed attachcleanup-overeager from 0d3edfb7fe to c5bb5397f1 2024-06-04 00:51:04 +00:00 Compare
norm reviewed 2024-06-04 00:56:40 +00:00
@@ -254,6 +255,7 @@ config :pleroma, :instance,
 external_user_synchronization: true,
 extended_nickname_format: true,
 cleanup_attachments: false,
+cleanup_attachments_delay: 1800,
Contributor

Kinda think 30 minutes is a bit long if the main purpose is to allow for delete & redraft, though I guess there wouldn't be more than a few d/r requests going on at a time anyways.

Author
Member

i just picked some value which should be safe for virtually all redraft scenarios; i’m open to changing it to any other suggested value

quite honestly any value is fine, having a few extra bytes on your drive for 30 mins isn't the end of the world

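For reference, overriding the proposed default in config would look something like this (setting names taken from the diff above; the value is only an example):

```elixir
# e.g. in config/prod.secret.exs
config :pleroma, :instance,
  cleanup_attachments: true,
  # delay in seconds before attachment deletion; the proposed default is 1800 (30 min)
  cleanup_attachments_delay: 600
```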
Contributor

Also unsure about the `all_base_url` thing, having to track every base url ever used manually seems like a bit much.
Author
Member

> Also unsure about the `all_base_url` thing, having to track every base url ever used manually seems like a bit much.

I’m not sure how to reliably solve this *without* having a full base url list in the current db scheme.

With forced dedupe, the file identifier of any URL will only be the last part and we don’t need a base url at all. However, without dedupe this was previously typically the last two parts for the local uploader. For S3 there once used to be possibly arbitrarily many parts before f77ec96707, always two after it, and with later changes always only one *(and no consideration what happens if “strict_encoding” creates overlaps; impossible now with forced dedupe)*. I’ve also seen another, now removed uploader in git history which might have had yet another file identifier scheme. When migrating between uploaders, old identifiers will (mostly) be kept iiuc.
When trying to handle this by just using more and more parts of the URL as an identifier until deletion succeeds, we might end up mistakenly deleting another file *(albeit unlikely to happen in common setups; if it happens it creates irrecoverable data loss)*; e.g.:

`a.png` was uploaded in the past to the base URL `https://example.org/media/service1/` without intermediate directories. The file got lost during some hardware or uploader migration. Later on, something gets uploaded with the file handle `service1/a.png` to base URL `https://media.example.org/usercontent/`.
Now the post containing the original `a.png` gets deleted; deleting the original file fails, so deletion of `service1/a.png` is attempted and succeeds — but now an unrelated file got destroyed.
*(to make this more applicable to actual logic, replace `service1` with some UUID and `a` with a shasum; for readability i used friendlier names here)*

In practice i’d expect media domain and/or uploader migrations to happen rarely if at all, and therefore for most instances the list should only consist of 2 or 3 entries anyway *(local↔S3, instance domain → subdomain)*.
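
To make the failure mode concrete, here is a minimal sketch of such a suffix-guessing approach (hypothetical helper, not actual Akkoma code):

```elixir
# Hypothetical sketch: derive candidate file identifiers by taking
# ever-longer suffixes of the URL path.
defmodule SuffixGuess do
  def candidates(url) do
    parts =
      URI.parse(url).path
      |> String.trim_leading("/")
      |> String.split("/")

    # Try the last path segment first, then the last two, and so on.
    for n <- 1..length(parts), do: parts |> Enum.take(-n) |> Enum.join("/")
  end
end

SuffixGuess.candidates("https://example.org/media/service1/a.png")
# => ["a.png", "service1/a.png", "media/service1/a.png"]
# If deleting "a.png" fails because the file was lost, the next guess
# "service1/a.png" can match (and destroy) an unrelated upload that lives
# under a different base URL.
```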


I also _really_ do not like the `all_base_urls` setting

suggestion: best-effort in which we check for the same base domain

for example, we consider an attachment deletable if its base domain (x.(MATCHES.net)) is the same as our endpoint _OR_ media URLs - which should account for both media migrated to s3 and people that just moved to a subdomain
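
A rough sketch of that heuristic (hypothetical code; the naive "last two labels" notion of base domain would need extra care for multi-part TLDs like `.co.uk`):

```elixir
defmodule BaseDomain do
  # Naive "base domain": the last two DNS labels (no public-suffix handling).
  def base(host), do: host |> String.split(".") |> Enum.take(-2) |> Enum.join(".")

  # An attachment counts as deletable if its base domain matches the base
  # domain of the instance endpoint or of the configured media base URL.
  def deletable?(attachment_url, trusted_base_urls) do
    host = URI.parse(attachment_url).host
    Enum.any?(trusted_base_urls, &(base(URI.parse(&1).host) == base(host)))
  end
end

BaseDomain.deletable?(
  "https://media.example.net/uploads/a.png",
  ["https://example.net", "https://s3.example.net/media"]
)
# => true (same base domain "example.net"), covering both a media subdomain
# and an S3 bucket served under the instance's domain
```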

Author
Member

> suggestion: best-effort in which we check for the same base domain
>
> for example, we consider an attachment deletable if its base domain (x.(MATCHES.net)) is the same as our endpoint _OR_ media URLs - which should account for both media migrated to s3 and people that just moved to a subdomain

There might still be some failures to delete, like after migrating _from_ S3 _to_ the local uploader, but afaict no accidental deletions of unrelated files. For file cleanup this is probably good enough.
Looking ahead to an eventual db refactor though, do you think it’s also reasonable to lose media objects for such affected files during that migration, or do you have ideas how to solve this too without a full base url list?
**EDIT:** to clarify, attachments from affected posts will not disappear, but they’ll be treated like remote attachments with no local `media_id`

Oneric force-pushed attachcleanup-overeager from c5bb5397f1 to 8578b1103b 2024-06-07 17:05:13 +00:00 Compare
Oneric force-pushed attachcleanup-overeager from 8578b1103b to df2373e550 2024-06-11 23:32:26 +00:00 Compare
Author
Member

swapped commit order so the (now) first two can already be merged without the domain migration fix if desired; not constantly running a heavy but useless query should be a nice improvement

re `all_base_urls`: i can change to e.g. the suggested “best-effort” guess, but just wanted to make sure possible interactions with later changes are taken into consideration
(but i’m too lazy to change things when it’s not yet clear if or what change is needed)

Oneric force-pushed attachcleanup-overeager from df2373e550 to 5e1a47ce70 2024-07-06 01:22:35 +00:00 Compare
Author
Member

To avoid any obstacles i completely dropped the media migration fix from here; the remaining two commits already provide a significant improvement by avoiding hammering the DB with heavy but unnecessary noop queries

I’ll open a new PR to discuss how to best deal with migrated media domains

Oneric referenced this pull request from a commit 2024-07-06 01:28:50 +00:00
Oneric force-pushed attachcleanup-overeager from 5e1a47ce70 to b31388b5df 2024-11-16 23:41:49 +00:00 Compare
Contributor

Seems to be working well so far on my instance.
