RFC: database scheme for media uploads #765
Reference: AkkomaGang/akkoma#765
Well, it will be a while before there’s any chance of me actually implementing this, but it’s probably still good to put it up here already for discussion and as a reference. I certainly also won’t mind if someone else helps out with the remaining perf tests or even takes over implementation :)
Basics
Each post has zero or more attachment objects associated with it and those attachment objects each reference a file, but multiple attachments can reference the same file.
In Mastodon an attachment object exclusively belongs to a single (scheduled) post (see also API docs talking about `GET /api/v1/media/:id` requests being denied once it’s attached). In *oma, we currently actually allow multiple posts to reference the same media attachment, with some restrictions on who can access it. Though there’s no support for this in akkoma-fe or any other frontend or app afaik. I suspect in practice only a few attachments (from manual API use/testing and maybe C2S?) are actually shared.
Furthermore, in Mastodon it is (presumably) sufficient to only call the media update API for alt text edits etc. to show up in associated posts, but in Akkoma those changes only become visible once the post itself is also updated (this is a consequence of the current storage scheme and not realistically fixable within it due to excessive query overhead).
Also, Mastodon treats post attachment media separately from user profile images; the latter aren’t MastoAPI media objects.
Current Situation
At the moment we keep track of local media (i.e. belonging to our instance, not necessarily `Uploader.Local`) via `type=Image` (user profile {avatars,banners,backgrounds}) and `type=Document` (post attachments) JSON objects in, well... the `objects` table. Like all other objects, their JSON blob has an AP-like URL `id` (even though this URL cannot actually be resolved via HTTP). The primary database key of this object is used for the Masto-API `media_id`, but internally posts reference their attachments via the URL id inside the JSON data.
This is quite bad.
This is already conceptually a bit weird, as all other types (`Note`, `Question`, `Answer`, `Tombstone`) in the `objects` table are actually dereferenceable ActivityPub objects, while uploaded media data is not. But the much bigger issues are the inefficiencies and usability problems / bugs resulting from this.
On a more general note, it bloats the already large `objects` table further, making all queries on it more costly, and table indices presumably scale super-linearly with object count. It now also contends for and is affected by `objects` table locks. Just splitting it out and keeping everything else as is would probably already help somewhat, but is far from enough.
It’s hard to check which other attachments reference the same file (you need to at least strip query parameters and then join the table with itself over the processed URL), and for `Uploader.Local` the file storage path is also hard-tied to the public-facing file URL, meaning that to determine the file path we need to strip the `base_url` prefix and query parameters. As a result, instances which migrated their media domain can never successfully delete old, pre-migration files.
In fact this lookup is so costly that deletion of attachments was moved into an asynchronous job queue five years ago (d6a532bf0f), and since this still wasn’t enough for large servers, it was made optional and disabled by default four years ago (8057157ee3). Independently, both Akkoma and Pleroma later added a 900s timeout to attachment cleanup jobs.
Attachment cleanup being disabled by default (or timing out even if enabled) means servers accumulate useless entries in `objects` as well as orphaned upload files. Eternal bloat.
Proposal
Media attachments not actually being ActivityPub objects makes this much easier to fix than if they were.
There’s no need to consider federation effects or preserve URI paths; only Masto-API `media_id`s are relevant. Still, “easier” here is relative; most of the upload and media handling needs to be adjusted, and the data migration itself also poses some challenges.
Attachment objects being allowed to be associated with multiple posts would make things a bit more inconvenient and less efficient if we wanted to preserve that (we’d need a separate join table, like for hashtags). I think we can move to exclusive ownership though, and it shouldn’t be too bad if we change media ids for the rare shared attachment during migration. In the following, exclusive ownership is assumed.
I believe we should split attachments out of `objects` and also clearly distinguish uploaded files (which can be shared) from attachment objects (which reference a single file but also contain additional metadata, like alt text and, if we ever implement it, focal point etc.). Thus we’d add two new tables, `attachments` and `attachment_files`.
We can then also decouple deduplication from the filename (and otoh I think we also don’t check whether a file already exists, but rely on nothing actually changing on overwrite — with external uploaders each store/overwrite might still create a new version of the key or have some cost).
There are some details still to hammer out, but I’ll put the general DB scheme first:
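Roughly something like this (all names, types and constraints here are a tentative sketch, not final):

```sql
-- Tentative sketch; names, types and reference targets are placeholders.
CREATE TABLE attachment_files (
    id uuid PRIMARY KEY,
    hash bytea NOT NULL,       -- content hash, now decoupled from the file name
    media_type text NOT NULL,  -- normalised MIME type
    extensions text,           -- normalised file extension (see below)
    inserted_at timestamp NOT NULL,
    -- deduplicate only when both content and media type match
    UNIQUE (hash, media_type)
);

CREATE TABLE attachments (
    id uuid PRIMARY KEY,
    file uuid NOT NULL REFERENCES attachment_files (id),
    -- at most one of these is set (exclusive ownership);
    -- all NULL means the attachment is (still) orphaned
    note uuid,
    scheduled_note uuid,
    profile uuid,
    description text,          -- alt text
    url_parameters jsonb NOT NULL DEFAULT '{}',
    inserted_at timestamp NOT NULL,
    updated_at timestamp NOT NULL,
    CHECK (num_nonnulls(note, scheduled_note, profile) <= 1)
);

-- first index: find all attachments of a post (and orphans)
CREATE INDEX attachments_note_index ON attachments (note);
-- second index: find other attachments referencing the same file
CREATE INDEX attachments_file_index ON attachments (file);
```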
Thanks to the first index we can cheaply locate all attachment objects belonging to a post and orphaned attachments.
Thanks to the second, we can quickly find out whether anything else references the same file (if needed; see “Cleanup”).
Now attachment deletion should again be cheap enough that we can enable it by default (and probably don’t even need an option to disable it anymore).
Notes and some open questions re schema
Attachments vs user images
We could split user images and attachments to cut down mutually exclusive foreign keys to only two columns.
E.g. directly referencing `upload_files` from `users`; but then we’ll have to check two tables instead when deleting files (or also split files, with no deduplication between the categories). Also, for implementing e.g. avatar alt texts (#249) it makes more sense to use the existing attachment infrastructure. If we ever revive C2S, we’ll never know ahead of time what an upload will get used for. And ofc we still need two different foreign keys in `media` for normal and scheduled posts.
So currently it seems to me like one shared `media` table makes more sense.
URL parameters
It’s a flat JSONB object for future extensibility, but atm it only holds `link_name`, and we could also just use one `link_name` column instead, or one plain TEXT `url_parameters` column. No strong opinion here.
Cleanup
Attachments are created before their posts and might not ever be associated with a post, so for cleaning them up we have no choice but to (additionally or exclusively) run a regularly scheduled job querying `attachments.note IS NULL`.
For files we could do the same, but it’d require a `LEFT JOIN` between `attachment_files` and `attachments`, and I’m not sure if indices will actually be used for this. Instead, after each attachment delete (but within the same db transaction) we could check whether its file is still referenced and, if not, also delete the file. This guarantees indices can be used and should always be cheap per file, but it is more effort to use correctly, as we can’t just rely on `ON DELETE CASCADE` triggers to delete attachments when their post is deleted.
However, at least on the test data (50 million entries, no other load), joins with a 410-thousand-row table “only” took about 3.5 seconds if the DB is in the file cache. Compared to a 900s timeout, maayyybee it might be fine to run such a `LEFT JOIN` once a day (and regardless of how many locks this operation holds, it will now only affect media operations instead of basically everything).
Mismatched file extensions
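Sketched in SQL (using the table and column names assumed earlier; `:attachment_id` / `:file_id` are application-side placeholders):

```sql
-- Variant 1: periodic scan for files no longer referenced by any attachment.
-- Whether an index on attachments(file) actually gets used here needs testing.
SELECT f.id
FROM attachment_files f
LEFT JOIN attachments a ON a.file = f.id
WHERE a.id IS NULL;

-- Variant 2: per-attachment cleanup within a single transaction.
BEGIN;
DELETE FROM attachments WHERE id = :attachment_id;
DELETE FROM attachment_files f
WHERE f.id = :file_id
  AND NOT EXISTS (SELECT 1 FROM attachments a WHERE a.file = f.id);
-- if the second DELETE removed a row, the actual file on disk/S3
-- still has to be deleted afterwards
COMMIT;
```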
Currently we don’t actually deduplicate all identical files, only those for which both data and file extension match. In most cases different file extensions won’t occur anyway (but consider e.g. `jpeg` vs `jpg` and `png` vs `PNG`).
To replicate this behaviour `attachment_files` contains an `extensions` column and uniqueness is checked for the combination of hash and media type. We should at least normalise these extensions though, to avoid too many unnecessary copies; e.g. doing a roundtrip through `MIME` and, if it’s just `octet-stream`, at least lowercasing it.
We could also just not care about extension mismatches, but if the first uploader used an actually wrong extension, all future uploads would get this borked extension too. eh
Just stripping the file extension from the URL might at first seem sensible too (instead sending a `Content-Type` header) — but then we’ll either have to look up the media type in the database on each media GET request, or read the media type from a new URL parameter (or enforce an extension for `link_name`). This means that by editing the URL any file could be served with any user-chosen media type, though ofc those media types would still go through the same sanitisation as now.
But I’m also not sure if setting `Content-Type` works at all with external uploaders like S3 (if you know S3, let me know).
EDIT: on upload and normal updates we can set a single `Content-Type` header and `Content-Disposition` header per object, which will get used for regular requests. However, it is possible to override these with signed URLs; we could use this for full deduplication regardless of original extension.
Using signed URLs is already necessary to make `link_name` (`Content-Disposition: filename=...;`) actually work with S3 (afaict `link_name` is currently not implemented at all for S3 uploads).
See “Overriding response header values through the request” in https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html
EDIT2: ... but apparently presigned URLs are only valid for a limited time (https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html#PresignedUrl-Expiration)
Reused attachments
Any reason to allow reusing the same attachment object after all, even though Mastodon doesn’t?
File paths
We can now fully decouple file paths from file content since hashes are tracked in the DB.
Using e.g. the Base62-encoded file (uu)id is unproblematic and safe on any case-sensitive filesystem (do we care about case-insensitive filesystems?) and S3 (currently there’s a cursed, lossy `strict_encoding` function for pre-`Dedupe` uploads), and it would bring the file name length down from 64 characters to at most 22, which is a bit more friendly.
Indices
If we use scheduled file scans for cleanup, we maybe don’t need the `media_file` index. I’ve observed PostgreSQL not using indices before when joining with a too-large table. This will need testing.
I’m not sure yet whether

- `HASH(note, scheduled_note, profile)` (one combined index; but it also has to cover rows where everything is `NULL`, wasting a lot of space), or
- `HASH(note)` (since this makes up the bulk and a full scan on the rest might be cheap enough anyway)

is better. (Note that PostgreSQL supports multicolumn indices only for `BTREE`, GiST, GIN and BRIN, so a combined index couldn’t actually use `HASH`.)
If using a combined index, we also need to test whether PostgreSQL is smart enough to infer `IS NULL` for the other two entries from the constraint, or if we’ll always need to spell out `note = 'some-id' AND scheduled_note IS NULL AND profile IS NULL` etc. (ideally also test on the lowest supported PostgreSQL version).
At least for indices over a single UUID, `HASH` is definitely better than `BTREE` though: we’ll only ever do equality lookups on the index; UUIDs have no meaningful ordering anyway. Thus `HASH` is sufficient and saves on disk storage compared to `BTREE`. Tests on a table with 50,000,000 non-primary-key UUID rows also suggest `HASH` performs better than `BTREE` here on PostgreSQL 16.
Details on the perf test
Observations from timed tests and `EXPLAIN (ANALYZE) <query>;`:

- for `lookups_big`, indices are never used; everything is about the same speed
- for `lookups`, indices are still used (but because everything is faster, the absolute time difference between `HASH` and `BTREE` isn’t that big)
- `HASH` is a bit smaller and notably faster than `BTREE`. In the analyser’s estimated cost range it is just a tad better, but the actual runtime duration was only about half of `BTREE`’s on individual lookups for me
- `BTREE` is massively faster than no index at all

This test needs to be repeated with a combined index with three UUIDs.
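A mock-up along these lines should do for repeating it; this is my sketch of such a setup, not necessarily the exact original test (`gen_random_uuid()` needs the pgcrypto extension before PostgreSQL 13):

```sql
CREATE TABLE lookups (
    id bigserial PRIMARY KEY,
    note uuid,
    scheduled_note uuid,
    profile uuid
);
-- 50 million rows, as in the single-UUID test
INSERT INTO lookups (note)
SELECT gen_random_uuid() FROM generate_series(1, 50000000);

-- single-column variant
CREATE INDEX lookups_note_hash ON lookups USING HASH (note);
-- combined variant (BTREE, since HASH indices are single-column only)
CREATE INDEX lookups_combined_btree ON lookups (note, scheduled_note, profile);

EXPLAIN (ANALYZE)
SELECT * FROM lookups
WHERE note = '00000000-0000-0000-0000-000000000001';
```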
There’s one issue with `HASH` though; prior to PostgreSQL 10, `HASH` indices were not yet crash-safe (and also slow). Currently we still support 9.6 and have a runtime check for >= 11.
Given even PostgreSQL 11 already reached EOL about half a year ago, I think we can just raise our minimum version to 11 and also get rid of the runtime check.
S3
Do we also need a `storage_handle` for it, and what about `S3`’s base url?
Migrations between S3 and Local
Do we want to track whether an upload was made with S3 or Local, so old files can remain with their original upload method? But then deletion would always need to consider all upload methods etc...
(I’m guessing “no”, but just in case)
The trouble with migrations
As mentioned before, current uploads can’t be deleted after a media domain change.
Our db migration will need to somehow deal with URLs using old media `base_url`s, and also do something if the `base_url` matches but its file no longer exists (due to e.g. hardware or admin mess-ups).
If this isn’t handled gracefully we’ll be flooded with support requests and, worse, people manually mangling their DB before reaching out. If we’re too forgiving on the other hand, the migration might turn out lossy or otherwise mess things up in a way which only shows up after a while (and again mangled databases and a support flood).
I’m not yet sure how to best handle that.
Perhaps we should already introduce an `old_base_urls: [...]` config option ahead of time, together with a mix task to check whether it covers all existing media. Then people could use this to fix up their config ahead of time and we could just fail on unparseable URLs (but we’ll still have to do something about missing files).
Future
Just replacing `Image` and `Document` objects is already a sufficiently big change, so I’d like to limit the initial port to this. However, once done, this allows us to also tackle some other things. I’ll give some ideas here for future reference, but they shouldn’t be the focus atm.
We could fix the media update bug.
The most simple approach is to just automatically update the (scheduled) post (in `objects`) when media is updated. But this means we still have to operate on `objects` and duplicate already known information. Ideally posts would read all data about their attachments from `media` (and `file`). This gives us updates for free and there’s no more potential for copies to fall out of sync. However, this raises some new questions/problems for remote media...
For instance admins: checking stats
For realistic mockups and perf tests I’d like to get some references for media and file counts on real-world instances. If you’re running an instance but don’t want to run too involved SQL queries, just a rough count would already help.
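Something as simple as this (counting media objects in the current `objects` scheme) would already be enough:

```sql
-- Rough per-type media counts under the current storage scheme.
SELECT data->>'type' AS type, count(*)
FROM objects
WHERE data->>'type' IN ('Image', 'Document')
GROUP BY data->>'type';
```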
But if you’re up for more complex queries, the below gives more insight and shows how much orphaned attachments and files currently waste space on your server.
If it’s a larger server or runs on weak hardware, maybe better do this on
a backup restored to a different machine to not impact instance operation.
SQL queries to check for orphaned attachments and files
These queries may also give an idea of why the current scheme is problematic. (The current cleanup job also has to check if the attachment itself is referenced by multiple objects, which we’
If you use or used unusual media `base_url`s you might need to tweak the final two queries a bit.
Summary of TODOs
- create realistic (also in terms of row count) mock-up tables and test whether PostgreSQL will even use the `file` index for `LEFT JOIN`s; if not, this index is unnecessary bloat and can be dropped