Fix various attachment cleanup issues #789
Reference: AkkomaGang/akkoma#789
Comparing the run count of the cleanup query from #784 (comment) with the total attachment count from #765 (comment) showed suspiciously many query executions.
Adding some debug logs (thx @norm) revealed the query also runs for deleted remote posts, even though we obviously don't need to do anything about them. Skip those useless yet expensive query runs.
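A minimal sketch of the idea (the object shape and names here are assumptions for illustration, not Akkoma's actual code):

```elixir
# Sketch only: bail out before touching the database when the deleted
# object is remote or carries no attachments, so the expensive
# attachment-usage query only runs where it can actually matter.
defmodule AttachmentCleanupSketch do
  # `object` is assumed to be a map with a `:local` flag and an
  # `:attachments` list; the real schema differs.
  def maybe_cleanup(%{local: false}), do: :skip
  def maybe_cleanup(%{local: true, attachments: []}), do: :skip

  def maybe_cleanup(%{local: true, attachments: attachments}) do
    # Here the real code would enqueue the cleanup job / run the query.
    {:cleanup, attachments}
  end
end
```

Only the last clause ever reaches the point where the query would run.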
While at it also:
- add a delay before deletion to fix #775 (moved into a separate PR for further discussion)
- fix the changelog (cherry-picked into and applied as part of #786)

Force-pushed from a15afbbc76 to 0d3edfb7fe
Force-pushed from 0d3edfb7fe to c5bb5397f1
@@ -254,6 +255,7 @@ config :pleroma, :instance,
   external_user_synchronization: true,
   extended_nickname_format: true,
   cleanup_attachments: false,
+  cleanup_attachments_delay: 1800,
Kinda think 30 minutes is a bit long if the main purpose is to allow for delete & redraft, though I guess there wouldn't be more than a few d/r requests going on at a time anyways.
i just picked some value which should be safe for virtually all redraft scenarios; i’m open to change it to any other suggested value
quite honestly any value is fine, having a few extra bytes on your drive for 30 mins isn't the end of the world
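For context, an instance admin opting into this would end up with something like the following (a sketch based on the diff above and the 30-minute value discussed here; the exact config file and surrounding settings vary per setup):

```elixir
import Config

# Enable attachment cleanup and delay deletion by 30 minutes (1800 s)
# so delete & redraft can still reuse the uploaded files.
config :pleroma, :instance,
  cleanup_attachments: true,
  cleanup_attachments_delay: 1800
```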
Also unsure about the all_base_url thing; having to track every base url ever used manually seems like a bit much.

I'm not sure how to reliably solve this without having a full base url list in the current db scheme.
With forced dedupe, the file identifier of any URL will only be the last part and we don't need a base url at all. However, without dedupe this used to be typically the last two parts for the local uploader. For S3 there once could be arbitrarily many parts before f77ec96707, always two after it, and with later changes always only one (with no consideration of what happens if "strict_encoding" creates overlaps; impossible now with forced dedupe). I've also seen another, now removed uploader in git history which might have had yet another file identifier scheme. When migrating between uploaders, old identifiers will (mostly) be kept iiuc.

If we tried to handle this by just using more and more parts of the URL as the identifier until deletion succeeds, we might end up mistakenly deleting another file (unlikely to happen in common setups, but if it does it creates irrecoverable data loss); e.g.:

a.png was uploaded in the past to the base URL https://example.org/media/service1/ without intermediate directories. The file got lost during some hardware or uploader migration. Later on something gets uploaded with the file handle service1/a.png to the base URL https://media.example.org/usercontent/. Now the post containing the original a.png gets deleted; deleting the original file fails, so deletion of service1/a.png is attempted next and succeeds, but an unrelated file got destroyed. (To make this more applicable to the actual logic, replace service1 with some UUID and a with a shasum; for readability i used friendlier names here.)

In practice i'd expect media domain and/or uploader migrations to happen rarely if at all, and therefore for most instances the list should only consist of 2 or 3 entries anyway (local↔S3, instance domain → subdomain).
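To make the ambiguity concrete, here is a toy sketch (module and function names are made up for illustration, not Akkoma APIs): the identifier is whatever remains after stripping a base URL, so without knowing the historical base URL the split is guesswork.

```elixir
# Illustration only: the file id is what is left after stripping the
# base URL. If the base URL in use at upload time is unknown, the split
# is ambiguous and guessing can point at the wrong file.
defmodule FileIdSketch do
  def file_id(href, base_url) do
    if String.starts_with?(href, base_url) do
      {:ok, String.replace_prefix(href, base_url, "")}
    else
      {:error, :unknown_base_url}
    end
  end
end

FileIdSketch.file_id("https://example.org/media/service1/a.png", "https://example.org/media/")
#=> {:ok, "service1/a.png"}
FileIdSketch.file_id("https://example.org/media/service1/a.png", "https://example.org/media/service1/")
#=> {:ok, "a.png"}
# Both answers are plausible; picking the wrong one is how the unrelated
# service1/a.png upload from the example above could get deleted.
```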
I also really do not like the all_base_urls setting.

Suggestion: best-effort, in which we check for the same base domain. For example, we consider an attachment deletable if its base domain (the MATCHES part of x.MATCHES.net) is the same as our endpoint or media URLs, which should account for both media migrated to S3 and people that just moved to a subdomain.
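A rough sketch of that heuristic (purely illustrative, with a deliberately naive notion of "base domain"; a real implementation would want a public-suffix-aware comparison):

```elixir
# Sketch of the suggested heuristic: an attachment counts as locally
# deletable if its host shares a base domain with the instance endpoint
# or the configured media/upload base URL.
defmodule BaseDomainSketch do
  # Naive base domain: the last two host labels ("media.example.org" ->
  # "example.org"). Real code would need a public-suffix list.
  defp base_domain(url) do
    host = URI.parse(url).host || ""
    host |> String.split(".") |> Enum.take(-2) |> Enum.join(".")
  end

  def deletable?(attachment_url, local_base_urls) do
    Enum.any?(local_base_urls, &(base_domain(&1) == base_domain(attachment_url)))
  end
end

BaseDomainSketch.deletable?(
  "https://media.example.org/usercontent/a.png",
  ["https://example.org", "https://s3.somewhere-else.net/bucket"]
)
#=> true (same base domain as the instance endpoint)
```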
There might still be some failures to delete, like after migrating from S3 to local uploader, but afaict no accidental deletions of unrelated files. For file cleanup this is probably good enough.
Looking ahead to an eventual db refactor though, do you think it's also reasonable to lose media objects for such affected files during that migration, or do you have ideas how to solve this too without a full base url list?

EDIT: to clarify, attachments from affected posts will not disappear, but they'll be treated like remote attachments with no local media_id
Force-pushed from c5bb5397f1 to 8578b1103b
Force-pushed from 8578b1103b to df2373e550
Swapped the commit order so the (now) first two can already be merged without the domain migration fix if desired; not constantly running a heavy but useless query should be a nice improvement.
re all_base_urls: i can change it to e.g. the suggested "best-effort" guess, but i just wanted to make sure possible interactions with later changes are taken into consideration (but i'm too lazy to change things when it's not yet clear if or what change is needed)
Force-pushed from df2373e550 to 5e1a47ce70
To avoid any obstacles i completely dropped the media migration fix from here; the remaining two commits already provide a significant improvement by avoiding hammering the DB with heavy but unnecessary no-op queries.
I’ll open a new PR to discuss how to best deal with migrated media domains