RFC: database scheme for media uploads #765
Well, it will be a while before there’s any chance of me actually implementing this, but it’s probably still good to put it up here already for discussion and as a reference. I certainly also won’t mind if someone else helps out with the remaining perf tests or even takes over implementation :)
Basics
Each post has zero or more attachment objects associated with it and those attachment objects each reference a file, but multiple attachments can reference the same file.
In Mastodon an attachment object exclusively belongs to a single (scheduled) post (see also the API docs talking about `GET /api/v1/media/:id` requests being denied once it’s attached).

In *oma, we currently actually allow multiple posts to reference the same media attachment, with some restrictions on who can access it. There’s no support for this in akkoma-fe or any other frontend or app afaik though, and I suspect in practice only few attachments (from manual API use/testing and maybe C2S?) are actually shared.
Furthermore, in Mastodon it is (presumably) sufficient to only call the media update API for alt text edits etc. to show up in associated posts, but in Akkoma those changes only become visible once the post itself is also updated (this is a consequence of the current storage scheme and not realistically fixable within it due to excessive query overhead).
Also, Mastodon treats post attachment media separately from user profile images; the latter aren’t MastoAPI media objects.
Current Situation
At the moment we keep track of local media (i.e. belonging to our instance, not necessarily `Uploader.Local`) via `type=Image` (user profile {avatars,banners,backgrounds}) and `type=Document` (post attachments) JSON objects in, well... the `objects` table. Like all other objects, their JSON blob has an AP-like URL `id` (even though this URL cannot actually be resolved via HTTP). The primary database key of this object is used for the Masto-API `media_id`, but internally posts reference their attachments via the URL id inside the JSON data.

This is quite bad.
This is already conceptually a bit weird, as all other types (`Note`, `Question`, `Answer`, `Tombstone`) in the `objects` table are actually dereferenceable ActivityPub objects, while uploaded media data is not. But the much bigger issue is the inefficiencies and usability problems / bugs resulting from this.
On a more general note, it bloats the already large `objects` table further, making all queries on it more costly, and table indices presumably scale super-linearly with object count. It now also contends for and is affected by `objects` table locks. Just splitting it out and keeping everything else as-is would probably already help somewhat, but is far from enough.

It’s hard to check which other attachments reference the same file (you need to at least strip query parameters and then join the table with itself over the processed URL), and for `Uploader.Local` the file storage path is also hard-tied to the public-facing file URL, meaning to determine the file path we need to strip the `base_url` prefix and query parameters. As a result, instances which migrated their media domain can never successfully delete old, pre-migration files.
In fact, this lookup is so costly that deletion of attachments was moved into an asynchronous job queue five years ago (`d6a532bf0f`), and since this still wasn't enough for large servers it was made optional and disabled by default four years ago (`8057157ee3`). Independently, both Akkoma and Pleroma later added a 900s timeout to attachment cleanup jobs.

Attachment cleanup being disabled by default (or timing out even if enabled) means servers accumulate useless entries in `objects` as well as orphaned, useless upload files. Eternal bloat.

Proposal
Media attachments not actually being ActivityPub objects makes this much easier to fix than if they were.
There’s no need to consider federation effects or preserve URI paths; only Masto-API `media_id`s are relevant. Still, “easier” here is relative; most of the upload and media handling needs to be adjusted, and the data migration itself also poses some challenges.
Attachment objects being allowed to be associated with multiple posts makes things a bit more inconvenient and less efficient if we wanted to preserve that (we’d need multiple separate join tables, like for hashtags). I think we can move to exclusive ownership though, and it shouldn’t be too bad if we change media ids for the rare shared attachment during migration. In the following, exclusive ownership is assumed.
I believe we should split attachments out of `objects` and also clearly distinguish uploaded files (which can be shared) from attachment objects (which reference a single file but also contain additional metadata, like alt text and, if we ever implement it, focal point etc.). Thus we’d add two new tables, `attachments` and `attachment_files`.

We can then also decouple deduplication from the filename (and otoh I think we also don't check if a file already exists but rely on just nothing actually changing on overwrite — with external uploaders each store/overwrite might still create a new version of the key or have some cost).
There are some details still to hammer out, but I’ll put the general DB scheme first:
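Roughly along these lines (a rough sketch only: the exact column names, types and constraints are illustrative, and foreign-key details are omitted):

```sql
-- Rough sketch; names, types and constraints are illustrative, not a final migration.
CREATE TABLE attachment_files (
    id             uuid PRIMARY KEY,
    sha256         bytea NOT NULL,
    media_type     text  NOT NULL,
    extensions     text,
    storage_handle text  NOT NULL,              -- where the data lives; decoupled from the public URL
    inserted_at    timestamp NOT NULL,
    UNIQUE (sha256, media_type)                 -- dedupe on content hash + media type
);

CREATE TABLE attachments (
    id             uuid PRIMARY KEY,            -- doubles as the Masto-API media_id
    file           uuid NOT NULL,               -- references attachment_files (id)
    owner          uuid NOT NULL,               -- the uploading user
    -- at most one of the next three is non-NULL (exclusive ownership):
    note           uuid,
    scheduled_note uuid,
    profile        uuid,
    description    text,                        -- alt text
    url_parameters jsonb NOT NULL DEFAULT '{}', -- currently just link_name
    inserted_at    timestamp NOT NULL,
    updated_at     timestamp NOT NULL
);

-- ownership index; whether this can be one combined HASH index is discussed under "Indices"
CREATE INDEX attachments_ownership_index ON attachments (note, scheduled_note, profile);
-- file reference index; possibly unnecessary, see "Cleanup"
CREATE INDEX attachments_file_index ON attachments (file);
```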
Thanks to the first index we can cheaply locate all attachment objects belonging to a post and orphaned attachments.
Thanks to the second, we can quickly find out whether anything else references the same file (if needed; see “Cleanup”).
Now attachment deletion should be so cheap again that we can enable it by default (and probably don't even need an option to disable it anymore in the first place).
Notes and some open questions re schema
Attachments vs user images
We could split user images and attachments to cut the mutually exclusive foreign keys down to only two columns.
E.g. directly referencing `upload_files` from `users`, but then we’d have to check two tables instead when deleting files (or also split files, with no deduplication between categories). Also, for implementing e.g. avatar alt texts (#249) it makes more sense to use the existing attachment infrastructure. If we ever revive C2S, we’ll never know ahead of time what an upload will get used for. And ofc we’d still need two different foreign keys in `media` for normal and scheduled posts.

So currently it seems to me like one shared `media` table makes more sense.

URL parameters
It’s a flat JSONB object for future extensibility, but atm it’s only `link_name`, and we could also just use one `link_name` column instead, or one plain TEXT `url_parameters` column. No strong opinion here.

Cleanup
Attachments are created before their posts and might not ever be associated with a post at all, so for cleaning them up we have no choice but to (additionally or exclusively) run a regularly scheduled job querying `attachments.note IS NULL`.
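For illustration, such a sweep could boil down to something like this (grace period and exact column names taken from the sketch above):

```sql
-- Periodic sweep for attachments nothing references any more; the grace period is arbitrary.
DELETE FROM attachments
WHERE note IS NULL
  AND scheduled_note IS NULL
  AND profile IS NULL
  AND inserted_at < now() - interval '1 day';
```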
For files we could do the same, but it’d require a `LEFT JOIN` between `attachment_files` and `attachments`, and I’m not sure if indices will actually be used for this. Instead, after each attachment delete (but within the same DB transaction) we could check whether its file is still referenced and, if not, also delete the file. This guarantees indices can be used and should always be cheap per file, but it is more effort to use correctly, as we can’t just rely on `ON DELETE CASCADE` triggers to delete attachments when their post is deleted.

However, at least on the test data (50 million entries, no other load), joins with a 410-thousand-row table “only” took about 3.5 seconds if the DB is in the file cache. Compared to a 900s timeout, maayyybee it might be fine to run such a `LEFT JOIN` once a day (and regardless of how many locks this operation holds, it will now only affect media operations instead of basically everything).
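In terms of the sketch schema from earlier, the two variants would look roughly like this:

```sql
-- Variant 1: right after deleting an attachment (in the same transaction),
-- drop its file if nothing else references it any more ($1 = that attachment's file id)
DELETE FROM attachment_files f
WHERE f.id = $1
  AND NOT EXISTS (SELECT 1 FROM attachments a WHERE a.file = f.id)
RETURNING f.storage_handle;

-- Variant 2: periodic sweep for files no longer referenced by any attachment
SELECT f.id, f.storage_handle
FROM attachment_files AS f
LEFT JOIN attachments AS a ON a.file = f.id
WHERE a.id IS NULL;
```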
Mismatched file extensions

Currently we don’t actually deduplicate all identical files, only those for which both data and file extension match. In most cases different file extensions won't occur anyway (but consider e.g. `jpeg` vs `jpg` and `png` vs `PNG`).

To replicate this behaviour, `attachment_files` contains an `extensions` column and uniqueness is checked for the combination of hash and media type. We should at least normalise these extensions though, to avoid too many unnecessary copies; e.g. doing a roundtrip through `MIME` and, if it’s just `octet-stream`, at least lowercasing it.

We could also just not care about extension mismatches, but if the first uploader used an actually wrong extension, all future uploads would get this borked extension too. eh
Just stripping the file extension from the URL might at first seem sensible too (instead sending a `Content-Type` header) — but then we’d either have to look up the media type in the database on each media GET request or read the media type from a new URL parameter (or enforce an extension for `link_name`). This means that by editing the URL any file could be served with any user-chosen media type, though ofc those media types would still go through the same sanitisation as now.

But I’m also not sure if setting `Content-Type` works at all with external uploaders like S3 (if you know S3, let me know).

EDIT: on upload and normal updates we can set a single `Content-Type` header and `Content-Disposition` header per object, which will get used for regular requests. However, it is possible to override these with signed URLs; we could use this for full deduplication regardless of the original extension.

Using signed URLs is already necessary to make `link_name` (`Content-Disposition: filename=...;`) actually work with S3 (afaict `link_name` is currently not implemented at all for S3 uploads).

See “Overriding response header values through the request” in https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html
EDIT2: ... but apparently presigned URLs are only valid for a limited time (https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html#PresignedUrl-Expiration)
Reused attachments
Any reason to allow reusing the same attachment object after all, even though Mastodon doesn't?
File paths
We can now fully decouple file paths from file content since hashes are tracked in the DB.
Using e.g. the Base62-encoded file (uu)id is unproblematic and safe on any case-sensitive filesystem (do we care about case-insensitive filesystems?) and on S3 (currently there’s a cursed, lossy `strict_encoding` function for pre-`Dedupe` uploads), and would bring the file name length down from 64 characters (the hex SHA-256 digest) to at most 22 (a 128-bit UUID needs no more than ⌈128 / log₂ 62⌉ = 22 Base62 characters), which is a bit more friendly.

Indices
If we use scheduled file scans for cleanup, we maybe don't need the `media_file` index. I observed PostgreSQL not using indices before when joining with a too-large table. This will need testing.

I’m not sure yet whether `HASH(note, scheduled_note, profile)` (lots of `NULL` entries; wastes a lot of space) or `HASH(note)` (since this makes up the bulk and a full scan on the rest might be cheap enough anyway) is better.
If using a combined index, we also need to test whether PostgreSQL is smart enough to infer the `NULL`s for the other two entries from the constraint, or if we’ll always need to spell out `note = 'some-id' AND scheduled_note IS NULL AND profile IS NULL` etc. (ideally also test on the lowest supported PostgreSQL version).
At least for indices over a single UUID, `HASH` is definitely better than `BTREE` though:

- We'll only ever do equality lookups on the index; uuids have no meaningful ordering anyway. Thus `HASH` is sufficient and saves on disk storage compared to `BTREE`.
- Tests on a table with 50,000,000 non-primary-key UUID rows also suggest `HASH` performs better than `BTREE` here on PostgreSQL 16.

Details on the perf test
Observations from timed tests and `EXPLAIN (ANALYZE) <query>;`:

- `lookups_big`: indices are never used; everything is about the same speed
- `lookups`: indices are still used (but because everything is faster, the absolute time diff between HASH and BTREE isn’t that big)
- `HASH` is a bit smaller and notably faster than BTREE; the analyser’s estimated cost range is just a tad better, but the actual runtime duration was only about half of `BTREE`’s on individual lookups for me
- `BTREE` is massively faster than no index at all

This test needs to be repeated with a combined index over three UUIDs.
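A test of this kind can be reproduced with something along these lines (an illustrative setup, not the exact script behind the numbers above):

```sql
-- 50M rows with a non-primary-key uuid column to index
CREATE TABLE media_test (id bigserial PRIMARY KEY, note uuid NOT NULL);
INSERT INTO media_test (note)
SELECT gen_random_uuid() FROM generate_series(1, 50000000);

-- build one of these at a time and compare
CREATE INDEX media_test_note_hash  ON media_test USING hash  (note);
-- CREATE INDEX media_test_note_btree ON media_test USING btree (note);

-- single equality lookup; substitute a uuid actually present in the table
EXPLAIN (ANALYZE)
SELECT id FROM media_test WHERE note = '00000000-0000-0000-0000-000000000000';
```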
There’s one issue with `HASH` though; prior to PostgreSQL 10, `HASH` indices were not yet crash-safe (and also slow). Currently we still support 9.6 and have a runtime check for >= 11.

Given even PostgreSQL 11 already reached EOL about half a year ago, I think we can just raise our minimum version to 11 and also get rid of the runtime check.
S3

- `storage_handle` for it?
- S3's base url?

Migrations between S3 and Local
Do we want to track whether an upload was made with S3 or Local, so old files can remain with their original upload method? But then deletion would always need to consider all upload methods etc...
(I’m guessing “no”, but just in case)
The trouble with migrations
As mentioned before, current uploads can’t be deleted after a media domain change.
Our DB migration will need to somehow deal with URLs using old media `base_url`s, and also do something if the `base_url` matches but its file no longer exists (due to e.g. hardware or admin mess-ups).

If this isn’t handled gracefully, we’ll be flooded with support requests and, worse, people manually mangling their DB before reaching out. If we’re too forgiving on the other hand, the migration might turn out lossy or otherwise mess things up in a way which only shows up after a while (and again mangled databases and a support flood).
I’m not yet sure how to best handle that.
Perhaps we should already introduce an `old_base_urls: [...]` config option ahead of time, together with a mix task to check whether they cover all existing media. Then people could use this to fix up their config ahead of time and we could just fail on unparseable URLs (but we’ll still have to do something about missing files).
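Such a mix task could boil down to a check along these lines (assuming upload objects keep their URL at `data->'url'->0->>'href'`; the hostnames are placeholders for your current and former `base_url`s):

```sql
-- Count local upload objects whose URL is not covered by any known base URL
SELECT count(*)
FROM objects
WHERE data->>'type' IN ('Document', 'Image')
  AND data->'url'->0->>'href' NOT LIKE 'https://media.example.com/%'
  AND data->'url'->0->>'href' NOT LIKE 'https://old-media.example.com/%';
```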
Future

Just replacing `Image` and `Document` objects is already a sufficiently big change, so I’d like to limit the initial port to this. However, once that’s done, it allows us to also tackle some other things. I’ll give some ideas here for future reference, but they shouldn’t be the focus atm.
We could fix the media update bug.
The most simple approach is to now just automatically update the (scheduled) post (in `objects`) when media is updated. But this means we still have to operate on `objects` and duplicate already-known information. Ideally posts would read all data about their attachments from `media` (and `file`). This gives us updates for free and there’s no more potential for copies to fall out of sync. However, this raises some new questions/problems for remote media...

For instance admins: checking stats
For realistic mockups and perf tests I’d like to get some references for media and file counts on real-world instances. If you’re running an instance but don’t want to run too involved SQL queries, just … would already help.
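For example, something as simple as per-type counts of upload objects under the current scheme would do:

```sql
-- Document = post attachments, Image = user profile images (current scheme)
SELECT data->>'type' AS type, count(*)
FROM objects
WHERE data->>'type' IN ('Document', 'Image')
GROUP BY 1;
```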
But if you’re up for more complex queries, the below gives more insight and shows how much orphaned attachments and files currently waste space on your server. If it’s a larger server or runs on weak hardware, maybe better do this on a backup restored to a different machine to not impact instance operation.
SQL queries to check for orphaned attachments and files
These queries may also give an idea of why the current scheme is problematic. (The current cleanup job also has to check if the attachment itself is referenced by multiple objects, which we…)
If you use or used unusual media `base_url`s you might need to tweak the final two queries a bit.
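To give an idea of the shape of such queries: under the current scheme, finding post attachments nothing references any more means joining `objects` against itself over the URLs embedded in the post JSON, roughly like this (simplified: it ignores profile images and query-parameter stripping, and assumes the usual `data->'url'->0->>'href'` layout):

```sql
-- Simplified illustration of the kind of join the current scheme forces on us
SELECT count(*)
FROM objects AS att
WHERE att.data->>'type' = 'Document'
  AND NOT EXISTS (
        SELECT 1
        FROM objects AS note
        CROSS JOIN LATERAL jsonb_array_elements(
               CASE WHEN jsonb_typeof(note.data->'attachment') = 'array'
                    THEN note.data->'attachment' ELSE '[]'::jsonb END) AS a
        WHERE note.data->>'type' = 'Note'
          AND a->'url'->0->>'href' = att.data->'url'->0->>'href'
      );
```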
Summary of TODOs

- create realistic (also in terms of row count) mock-up tables and test whether PostgreSQL will even use the `file` index for `LEFT JOIN`s; if not, this index is unnecessary bloat and can be dropped
I imagine it would theoretically be useful in combination with a "Drive"-like thing, just like how Misskey and its forks do things.
Some results from the queries:
afaict from poking it a bit *key’s “Drive” works as follows; let me know if something’s missing or incorrect:
No auto-deletion is kinda opposite to how *oma works now, but we could add a flag to exempt some files from auto deletion.
Listing all files of a user is already possible by aggregating media attachments by `owner`. Similarly, determining total size is made easier with the new scheme if the file table also records the size.

Automatically updating all alt texts may sometimes be useful, but doing so when editing from a single post may also be surprising and unexpected. Sharing the sensitive flag could also be problematic, considering attachments of posts with a CW are often automatically set to “sensitive” (and some clients directly display attachments of CWed posts if not set to sensitive; but if the image itself doesn't warrant a “sensitive” flag yet may sometimes need hiding (e.g. because, when paired with the content of a post, it may be a spoiler), this may be problematic).
If sharing the properties is desired, allowing reuse may indeed be a good idea. I’m undecided whether sharing is actually always a good idea though, or if just cloning an attachment with its current properties upon post creation suffices or is perhaps even better.
As far as I can tell all the above seem to be correct, at least with Sharkey.
Perhaps cloning might be the better option since sharing the alt text and sensitive flag isn't always a good idea like you said.
Stats from an approximately-single-user instance:
and
If needed I can also run the more involved ones, although I suspect data from an instance with apparently 5 images in total might be somewhat useless. :P
Very cool, sounds massively more reasonable than the current situation. I'm quite worried about how hard migrations will be for already existing instances during updates, but other than that I only have a couple of (possibly quite uninformed) comments.
Couldn't there be an additional type of object (draft?) that would get associated with these images and then we could have a cascading delete? Then there would be four possibilities for an object referring to a piece of media, and always exactly one of them is not null. Periodic cleanup (if needed?) would only have to look at the "draft" (or whatever) table, which would usually be absolutely tiny.
I'm usually very in favour of making everything ridiculously future-proof, and I cannot think of one.
I wouldn't worry about this too much and leave it as-is – the minimal deduplication and marginally more readable DB specification are not worth the trouble of somehow standardizing file extensions. Especially with S3 complicating things even more.
Here are my results for the additional queries:

I selected query 1 as that seemed to be the least costly of the two according to `EXPLAIN`, which gave me 3178 rows. (I was unable to save a tsv file for some reason.)
I am not familiar with Akkoma's architecture or codebase, so please think of this as a curious question rather than a suggestion.
I noticed the sha256 in the proposed table schema. Would it be better to use a content-addressable filesystem like IPFS? They have put a lot of thought into future-proof content-addressing like multihash
There is some discussion on this Mastodon issue.
@timorl
akkoma-fe also has a “Delete & Redraft” feature; a cascading delete would always remove attachments from redrafts (also see #775)
Completely stripping extensions seems impossible with the current S3 backend (I guess in theory it could work if S3 versioning is turned on and for each `link_name` a new version of the item is created, but this makes things much more complicated).

Just standardising the spelling of extensions should be fairly trivial though; not sure how much it matters in practice, but if it doesn't cause trouble we might as well.
@eshnil
IPFS is completely orthogonal to how uploads are tracked in our DB. If anything it would be another uploader backend.
Yeah, I got that, I just wasn't clear, sorry. ^^" I proposed that a post can end up in a "draft" state, in which it's not published, but keeps the attachments (and any other contents?). "Delete & Redraft" would then put posts in this state (maybe "Delete" as well? although that might be unexpected behaviour). Any uploaded media would be associated with such a draft initially as well. When the drafts would get deleted if unused could probably be configurable.
I think this might be worth the trouble, since you get a much better reflection of what the user expects in the DB and much easier attachment deletion.
There’s no difference from the backend’s perspective between “Delete & Redraft” and regular delete until the redraft is actually resubmitted, so yes if implemented this will necessarily also have to apply to normal deletes.
Given the current proposal already has an index over ownership references anyway, is there really a perf benefit though? Lookups for “not owned” should already be just a simple index lookup. Potentially though, the `ON DELETE SET NULL` trigger needs to be replaced with a custom “set NULL and update `updated_at`” trigger to make deletion grace periods work reliably.

Right now, unless the ownership index disappears, tracking this separately seems to just add complexity to me, but maybe I’m still missing the point?
The point was to make this more uniform, reasoning about stuff (and general maintainability) tends to become easier then. But I might be using intuitions from other kinds of software dev which don't apply as much to DBs (mainly nulls/nils etc. being bad). I don't have much experience designing these, so if you are saying this is alright then lets go with your approach.
IMMEDIATE EDIT ON RE-READING: wait, but the index could disappear with that additional ownership relation, couldn't it? Or am I misunderstanding its purpose?
Sorry, I forgot to reply to this.
Currently we need the following to be relatively quick to not thrash performance (in no particular order):

1. …
2. (a) … -OR- (b) being able to prune orphaned files (finding all unreferenced files for regular cleanup swipes) — …
3. …
4. …
5. …

4 is trivially achieved with the foreign file key column in media attachments. An index on this column is necessary to facilitate a sufficiently fast 2a, but if regular cleanup swipes (2b) turn out to be sufficiently fast too, we prob don’t even need an index here.
1, 3 and 5 all (hopefully; to be tested) work out with the combined `HASH(note, profile, scheduled_note)` index. The drawback here is that we waste a little bit of space in the table on two always-empty uuid columns. The major advantage is that we can check ownership for all possible types in one go thanks to the combined index. This way we also don’t need to explicitly track not-yet-or-no-longer-referenced attachments; we get that for free as a `NULL, NULL, NULL` lookup.

Should a combined index turn out problematic in testing, we’ll have to split it up. In that case though, keeping references to what uses the media attachment inside the media table doesn’t make much sense any more; we’d fill up each index with many `NULL` entries. Instead we’d use three join tables, each using the media id as their primary key and the appropriate other foreign key as the second field.

This means though that we’d have to resave the media id and now have three `BTREE` indices instead of one `HASH` index, so we probably don’t really save much space or might even use more. Furthermore, we’d now need 3 lookups to determine whether a given attachment is already referenced, and we no longer get a guaranteed cheap lookup for “all unused media”.

We could in theory relax the reuse requirement a bit without going to the more expensive many-to-many relationship by saying “a single media attachment can only be referenced by a single post and user profile each”. The issue being that, due to scheduled notes and published notes being in different tables, we’d still need two lookups to determine whether it’s safe to associate this attachment with a new (scheduled) post (and we’d have to remember to update both join tables atomically in a single transaction when a scheduled post gets promoted to a published post).
Going all in with arbitrarily many posts being able to reference a media attachment again creates additional overhead and problems, as reasoned e.g. in the original post.
tl;dr: if a combined index works out, I expect it to perform the best and we don’t need separate relations. If it doesn’t, we’ll need separate relations and pay special attention to cleanup performance.
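For concreteness, the fallback with separate relations would look roughly like this (same caveats as the earlier sketch):

```sql
-- One join table per referencing type; the media/attachment id is the primary key,
-- so each attachment can still only be referenced once (exclusive ownership).
CREATE TABLE attachments_notes (
    attachment uuid PRIMARY KEY,   -- references attachments (id)
    note       uuid NOT NULL
);
CREATE INDEX attachments_notes_note_index ON attachments_notes (note);

-- ... plus analogous attachments_scheduled_notes and attachments_profiles tables
```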
Let it be no surprise that I 100% share the concerns laid out here, and I would be very happy if this could be fixed, or at least improved upon :)
I'm going off on a tangent here, but isn't this really also a problem in the design of the code? I see Repo imports all over the place. IMO it would be much cleaner to have something like `Pleroma.DataStorage` and let that do all database things. You would still need to change code (and of course there's a big refactoring needed to get to that point, which is completely out-of-scope for this specific issue), but then it would purely be in this small part of the codebase, which is clearly meant for data storage, instead of having to go through the entire codebase hoping you didn't miss anything. (This is another discussion, I know, but maybe also one worth having somewhere?)

YES! \o/ ty! (^.^)
I didn't understand at first, but I understand the reason is to allow multiple media entries for the same file due to deduplication (while still allowing e.g. different media descriptions). This seems indeed like a good reason to split them up like this.
You've already thought of more than I had!
One thing I notice is that the url isn't there any more. Is the idea to rebuild it each time? If so, how / what fields will be used for it? At the very least, I think it should be independent of the Uploader bc that can change, while url's should be stable (unless we send out Update activities for each post with media when something in the url format changes).
I'm unsure what a "proper" normalisation would say, but I think it makes sense for the reason you say; More overhead for some parts, and we won't always know what it is linked to beforehand.
I think this is currently filled in by an UploadFilter? I guess it makes sense to keep it somewhat "flexible" in that regard then, since the UploadFilters are modular. But how does it return query parameters now? Aren't they just part of the url? Why not keep the same thing? (At least for a first run.)
Note that deleting the file itself can also go wrong. If something goes wrong and the file still exists, the deletion of the database entries should also fail (i.e. roll back). A retry then happens the next time the attachment cleanup job runs. If we can do it like that, and that is indeed cheaper to do, then that seems best to me. If we want an extra job "just in case", it could be a mix task, but it should not be needed, so I wouldn't spend time on it.
No opinion bc I don't have enough insight into how this all works and what the implications will be. All I can say is that whatever we do, it should work independently of the storage back-end. (ATM Local and S3, but it's always possible someone adds something else later.) If it's too much trouble to "fix", then I agree with @timorl to not worry about this and leave it as-is.
None that I can think of where reuploading the file, or adding a new media entry, wouldn't be an acceptable work-around. As mentioned in previous comments, description, for example, makes sense within the context of a post. It would be weird to change the description, and accidentally have it changed in other posts too. It would also cause extra overhead with federation, bc those posts have technically speaking been updated.
Can't provide feedback since I don't know enough about it
`storage_handle` is generic, so this should also be generic. Maybe this should just be (part of) the final url? It doesn't matter then what uploader is used. (Personally I would start with "flattening" the tables and not think about these things at all yet, but maybe that causes other problems, so 🤷‍♀️)

It's not just old_base_url, it's also the old uploader... How are such migrations handled now? Do people move their media from one storage (e.g. Local) to another (e.g. S3)? How do the url's keep working?
But also, (how) does cleanup in these cases work at the moment? If it doesn't work properly now, then maybe it's OK to not fix this specific case just yet? (I.e. move it to "Future" section) Should be properly documented, though (somewhere an admin will see when they change the Uploader).
phew, that was a lot to go over ^^'
For the requested queries, I get an error for the "involved" one.
Here are the "less involved" ones:
But when I try the bigger thing, I get an error when creating the function. I put all of it in a file, then did `sudo -Hu postgres psql -d pleroma_ynh -f run_thing.psql`; this is the output:

OS and DB version:
It seems that for the latest postgres it's `regexp_replace(source, pattern, replacement [, start [, N ]] [, flags ])`, but my version doesn't support the `[, start [, N ]]` yet: https://www.postgresql.org/docs/13/functions-matching.html