RFC: database scheme for media uploads #765

Open
opened 2024-04-28 22:13:52 +00:00 by Oneric

Well, it will be a while before there’s any chance of me actually implementing this, but it’s probably still good to put this up here already for discussion and as a reference. I certainly also won’t mind if someone else helps out with the remaining perf tests or even takes over implementation :)

Basics

Each post has zero or more attachment objects associated with it and those attachment objects each reference a file, but multiple attachments can reference the same file.

In Mastodon an attachment object exclusively belongs to a single (scheduled) post. (see also API docs talking about GET /api/v1/media/:id requests being denied once it’s attached)

In *oma, we currently actually allow multiple posts to reference the same media attachment, with some restrictions on who can access it. Though there’s no support for this in akkoma-fe or any other frontend or app afaik. I suspect in practice only a few attachments (from manual API use/testing and maybe C2S?) are actually shared.

Furthermore, in Mastodon it is (presumably) sufficient to only call the media update API for alt text edits etc to show up in associated posts, but in Akkoma those changes only become visible once the post itself is also updated (this is a consequence of the current storage scheme and not realistically fixable within this scheme due to excessive query overhead).

Also, Mastodon treats post attachment media separately from user profile images; the latter aren’t MastoAPI media objects.

Current Situation

At the moment we keep track of local media (i.e. belonging to our instance, not necessarily Uploader.Local) via type=Image (user profile {avatars,banners,backgrounds}) and type=Document (post attachments) JSON objects in, well... the objects table. Like all other objects, their JSON blob has an AP-like URL id (even though this URL cannot actually be resolved via HTTP). The primary database key of this object is used as the Masto-API media_id, but internally posts reference their attachment via the URL id inside the JSON data.

This is quite bad.

This is already conceptually a bit weird, as all other types (Note, Question, Answer, Tombstone) in the objects table are actually dereferenceable ActivityPub objects, while uploaded media data is not.

But the much bigger issue is the inefficiencies and usability problems / bugs resulting from this.
On a more general note, it bloats the already large objects table further, making all queries on it more costly, and table indices presumably scale super-linearly with object count. It now also contends for and is affected by objects table locks. Just splitting it out and keeping everything else as is would probably already help somewhat, but is far from enough.

It’s hard to check which other attachments reference the same file (you need to at least strip query parameters and then join the table with itself over the processed URL), and for Uploader.Local the file storage path is also hard-tied to the public-facing file URL, meaning that to determine the file path we need to strip the base_url prefix and query parameters.
As a result, instances which migrated their media domain can never successfully delete old, pre-migration files.
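
To illustrate, under the current scheme a check along the lines of “does any other attachment reference the same file as this one?” boils down to something like the self-join below. This is only a sketch using the JSON layout of the cleanup queries further down; $1 is a hypothetical placeholder for the attachment object’s id.

-- sketch: other Document objects sharing a (query-parameter-stripped) href with attachment $1
SELECT DISTINCT o2.id
FROM objects AS o1,
     jsonb_array_elements(o1.data->'url') AS u1,
     objects AS o2,
     jsonb_array_elements(o2.data->'url') AS u2
WHERE o1.id = $1
  AND o2.id <> o1.id
  AND o2.data->>'type' = 'Document'
  AND regexp_replace(u1->>'href', '\?.*$', '') = regexp_replace(u2->>'href', '\?.*$', '');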

In fact, this lookup is so costly that deletion of attachments was moved into an asynchronous job queue
five years ago (d6a532bf0f), and, since this still wasn't enough for large servers,
it was made optional and disabled by default four years ago (8057157ee3).
Independently, both Akkoma and Pleroma later added a 900s timeout to attachment cleanup jobs.

Attachment cleanup being disabled by default (or timing out even if enabled) means servers accumulate useless entries in objects as well as orphaned upload files. Eternal bloat.


Proposal

Media attachments not actually being ActivityPub objects makes this much easier to fix than if they were.
There’s no need to consider federation effects or preserve URI paths; only Masto-API media_ids are relevant.
Still, “easier” here is relative; most of the upload and media handling needs to be adjusted and the data migration itself also poses some challenges.

Attachment objects being allowed to be associated with multiple posts makes things a bit more inconvenient and less efficient if we wanted to preserve that (we’d need multiple separate join tables like for hashtags; see the sketch below). I think we can move to exclusive ownership though, and it shouldn’t be too bad if we change media ids for the rare shared attachment during migration. In the following, exclusive ownership is assumed.
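
For reference, preserving shared attachments would require join tables roughly like the following. This is a sketch of the alternative being ruled out; the table and column names are made up.

-- hypothetical join table if one attachment could belong to several posts
CREATE TABLE media_objects(
    media_id uuid NOT NULL REFERENCES media(id) ON DELETE CASCADE,
    object_id uuid NOT NULL REFERENCES objects(id),
    PRIMARY KEY (media_id, object_id)
);
-- ...plus analogous tables for scheduled posts and user profile images.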

I believe we should split attachments out of objects and also clearly distinguish uploaded files (which can be shared) from attachment objects (which reference a single file but also carry additional metadata, like alt text and, if we ever implement it, focal point etc). Thus we’d add two new tables, called media and upload_files in the sketch below.
We can then also decouple deduplication from the filename (and otoh I think we currently don’t even check whether a file already exists, but rely on nothing actually changing on overwrite — with external uploaders each store/overwrite might still create a new version of the key or incur some cost).

There are some details still to hammer out, but I’ll put the general DB scheme first:

CREATE TABLE upload_files(
    id uuid PRIMARY KEY, -- any UUID scheme should be fine
    sha256 BYTEA NOT NULL,
    extension TEXT, -- see later notes
    UNIQUE(sha256, extension) NULLS NOT DISTINCT, -- dedupe is now baked into db
    -- Uploader.Local: path relative to upload dir
    -- S3: key? or bucket:key?
    storage_handle TEXT UNIQUE NOT NULL,
    inserted_at timestamp NOT NULL, -- first upload
    updated_at timestamp NOT NULL   -- optional: time of newest associated upload
);


CREATE TABLE media(
    id uuid PRIMARY KEY, -- Flake-CompatType (to preserve old media ids)
    owner uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,

    -- Tracking who uses media; all initially NULL
    note uuid REFERENCES objects(id), -- if post attachment; due to "delete&redraft" _no_ ON DELETE CASCADE
    profile uuid REFERENCES users(id) ON DELETE CASCADE, -- if user image
    scheduled_note uuid REFERENCES scheduled_activities(id) ON DELETE CASCADE, -- if scheduled post attachment
    CHECK(
        (note IS NULL  AND  profile IS NULL) OR
        (note IS NULL  AND  scheduled_note IS NULL) OR
        (profile IS NULL  AND  scheduled_note IS NULL)
    ),

    -- or ON DELETE CASCADE to clean up on messups ig
    file uuid NOT NULL REFERENCES upload_files(id) ON DELETE RESTRICT,
    description TEXT,
    title TEXT,
    url_parameters jsonb, -- currently just link_name; simple flat "key": "value" object
    inserted_at timestamp NOT NULL,
    updated_at timestamp NOT NULL
);
CREATE INDEX media_usage_index ON media USING HASH (note, scheduled_note, profile);
CREATE INDEX media_file_index ON media USING HASH (file);

Thanks to the first index we can cheaply locate all attachment objects belonging to a post as well as orphaned attachments.
Thanks to the second, we can quickly find out whether anything else references the same file (if needed; see “Cleanup”).
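
The lookups these indices are meant to serve would look roughly like the queries below (a sketch; $1/$2 are hypothetical placeholders).

-- all attachments of a given post
SELECT id FROM media WHERE note = $1;

-- orphaned attachments (never attached to anything)
SELECT id FROM media
WHERE note IS NULL AND scheduled_note IS NULL AND profile IS NULL;

-- does any other attachment still reference the same file?
SELECT EXISTS(SELECT 1 FROM media WHERE file = $1 AND id <> $2);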

Now attachment deletions should be so cheap again that we can enable them by default (and probably don't even need an option to disable them anymore in the first place).

Notes and some open questions re schema

Attachments vs user images

We could split user images and attachments to cut the mutually exclusive foreign keys down to only two columns,
e.g. by directly referencing upload_files from users. But then we’d have to check two tables instead of one when deleting files (or also split files, with no deduplication between the categories). Also, for implementing e.g. avatar alt texts (#249) it makes more sense to use the existing attachment infrastructure. If we ever revive C2S, we’ll never know ahead of time what an upload gets used for. And ofc we still need two different foreign keys in media for normal and scheduled posts.

So currently it seems to me like one shared media table makes more sense.

URL parameters

It’s a flat JSONB object for future extensibility, but atm it’s only link_name and we could also just use a single link_name column instead, or one plain TEXT url_parameters column. No strong opinion here.

Cleanup

Attachments are created before their posts and might not ever be associated with a post, so for cleaning them up we have no choice but to (additionally or exclusively) run a regularly scheduled job querying media.note IS NULL.

For files we could do the same, but it’d require a LEFT JOIN between upload_files and media and I’m not sure whether indices will actually be used for this. Instead we could, after each attachment delete (but within the same DB transaction), check whether its file is still referenced and, if not, also delete the file (see the sketch below).
This guarantees indices can be used and should always be cheap per file, but it is more effort to use correctly, as we can’t just rely on ON DELETE CASCADE triggers to delete attachments when their post is deleted.
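
A sketch of that per-delete check, assuming the schema above ($1 stands for the file id of the just-deleted attachment; run in the same transaction as the attachment delete):

-- drop the file row only if no attachment references it anymore
DELETE FROM upload_files AS f
WHERE f.id = $1
  AND NOT EXISTS (SELECT 1 FROM media AS m WHERE m.file = f.id);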

However, at least on the test data (50 million entries, no other load), joins with a 410-thousand-row table "only" took about 3.5 seconds if the DB is in the file cache. Compared to a 900s timeout, maayyybee it might be fine to run such a LEFT JOIN once a day (and regardless of how many locks this operation holds, it will now only affect media operations instead of basically everything).
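
The scheduled-scan variant would be roughly the query below; whether the file index actually gets used for it is exactly what needs testing.

-- files no longer referenced by any attachment
SELECT f.id, f.storage_handle
FROM upload_files AS f
     LEFT JOIN media AS m ON m.file = f.id
WHERE m.id IS NULL;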

Mismatched file extensions

Currently we don’t actually deduplicate all identical files, only those for which both data and file extension match. In most cases different file extensions won't occur anyway (but consider e.g. jpeg vs jpg and png vs PNG).
To replicate this behaviour, upload_files contains an extension column and uniqueness is checked for the combination of hash and extension. We should at least normalise these extensions though, to avoid too many unnecessary copies; e.g. doing a roundtrip through MIME and, if that only yields octet-stream, at least lowercasing the extension.

We could also just not care about extension mismatches, but if the first uploader used an actually wrong extension, all future uploads would get this borked extension too. Eh.

Just stripping the file extension from the URL might at first seem sensible too (sending a Content-Type header instead) — but then we’ll either have to look up the media type in the database on each media GET request, or read the media type from a new URL parameter (or enforce an extension for link_name). The latter means that by editing the URL any file could be served with any user-chosen media type, though ofc those media types would still go through the same sanitisation as now.
But I’m also not sure if setting Content-Type works at all with external uploaders like S3 (if you know S3, let me know).

EDIT: on upload and normal updates we can set a single Content-Type header and Content-Disposition header per object which will get used for regular requests. However, it is possible to override these with signed URLs; we could use this for full deduplication regardless of original extension.
Using signed URLs is already necessary to make link_name (Content-Disposition: filename=...;) actually work with S3 (afaict link_name is currently not implemented at all for S3 uploads).
See “Overriding response header values through the request” in https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html

EDIT2: ... but apparently presigned URLs are only valid for a limited time (https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html#PresignedUrl-Expiration)

Reused attachments

Any reason to allow reusing the same attachment object after all, even though Mastodon doesn't?

File paths

We can now fully decouple file paths from file content since hashes are tracked in the DB.
Using e.g. the Base62-encoded file (uu)id is unproblematic and safe on any case-sensitive filesystem (do we care about case-insensitive filesystems?) and on S3 (currently there’s a cursed, lossy strict_encoding function for pre-Dedupe uploads), and it would bring the file name length down from 64 characters to at most 22 (⌈128 / log₂ 62⌉ = 22), which is a bit more friendly.

Indices

If we use scheduled file scans for cleanup, we maybe don't need the media_file index. I’ve observed PostgreSQL not using indices before when joining with a table that is too large. This will need testing.

I’m not sure yet whether

  • one HASH(note, scheduled_note, profile)
  • all three as separate indices (most entries will be NULL; wastes a lot of space)
  • only HASH(note) (since this makes up the bulk and a full scan on the rest might be cheap enough anyway)

is better.
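
A further option worth benchmarking alongside these (not listed above) would be partial indices, which simply skip the mostly-NULL rows:

-- partial indices only contain rows where the respective column is set,
-- so the many NULL entries don't take up space (index names are made up)
CREATE INDEX media_note_index ON media USING HASH (note) WHERE note IS NOT NULL;
CREATE INDEX media_scheduled_note_index ON media USING HASH (scheduled_note) WHERE scheduled_note IS NOT NULL;
CREATE INDEX media_profile_index ON media USING HASH (profile) WHERE profile IS NOT NULL;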

If using a combined index we also need to test whether PostgreSQL is smart
enough to infer IS NULL for the other two entries from the constraint, or
whether we’ll always need to spell out note = 'some-id' AND scheduled_note IS NULL AND profile IS NULL etc.
(ideally also test on the lowest supported PostgreSQL version).
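
I.e. compare plans for something along these lines (with a real uuid in place of 'some-id'):

EXPLAIN (ANALYZE) SELECT id FROM media WHERE note = 'some-id';
EXPLAIN (ANALYZE) SELECT id FROM media
  WHERE note = 'some-id' AND scheduled_note IS NULL AND profile IS NULL;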

At least for indices over a single UUID, HASH is definitely better than BTREE though:
We'll only ever do equality lookups on the index; uuids have no meaningful ordering anyway.
Thus HASH is sufficient and saves on disk storage compared to BTREE.
Tests on a table with 50,000,000 non-primary-key UUID rows also suggest HASH performs better than BTREE here on PostgreSQL 16.

Details on the perf test
-- Generate test data with 50 million UUIDs
CREATE TABLE testdata AS
  SELECT generate_series(1,50_000_000) AS key,
         gen_random_uuid() AS uuid;
-- duplicate for comparisons
CREATE TABLE testdata_btree AS SELECT * FROM testdata;
CREATE TABLE testdata_noi AS SELECT * FROM testdata;

-- create indices
CREATE INDEX test_hash_index ON testdata USING HASH (uuid);
CREATE INDEX test_btree_index ON testdata_btree USING BTREE (uuid);

-- check storage consumption
SELECT
   relname  as table_name,
   pg_size_pretty(pg_total_relation_size(relid)) As "Total Size",
   pg_size_pretty(pg_indexes_size(relid)) as "Index Size",
   pg_size_pretty(pg_table_size(relid)) as "Storage Size (with TOAST)",
   pg_size_pretty(pg_relation_size(relid)) as "Data Size (without TOAST)"
   FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- observation: HASH is smaller

-- create a small and big lookup table
CREATE TABLE lookups AS (
  (SELECT uuid FROM testdata WHERE (random() * 31) >= 30 LIMIT 30_000)
  UNION ALL
  SELECT gen_random_uuid() AS uuid FROM generate_series(1,3_000)
);

CREATE TABLE lookups_big AS (
  (SELECT uuid FROM testdata WHERE (random() * 31) >= 30 LIMIT 400_000)
  UNION ALL
  SELECT gen_random_uuid() AS uuid FROM generate_series(1,10_000)
);

-- finally do some timed test queries and check EXPLAIN ANALYZE ...
-- for different scenarios
-- e.g.
--  SELECT key FROM testdata WHERE uuid = 'an existing uuid';
--  SELECT key FROM testdata WHERE uuid = 'a non-existing uuid';
--  SELECT key FROM testdata AS d JOIN lookups AS l ON d.uuid = l.uuid;
--  SELECT key FROM testdata AS d JOIN lookups_big AS l ON d.uuid = l.uuid;

Observations from timed tests and EXPLAIN (ANALYZE) <query>;:

  • joins with lookups_big never use indices; everything is about the same speed
  • for joins with lookups, indices are still used (but because everything is faster, the absolute time difference between HASH and BTREE isn’t that big)
  • HASH is a bit smaller and notably faster than BTREE. Its estimated cost range in the analyser is only a tad better, but the actual runtime was about half of BTREE’s on individual lookups for me.
  • BTREE is massively faster than no index at all
# JOINING both testdata tables as an extreme test
# note how indices are not used
test_indices=# EXPLAIN ANALYZE SELECT d.key FROM testdata AS d JOIN testdata2 AS l ON d.uuid = l.uuid;
                                                               QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1687782.11..7086396.11 rows=50000000 width=4) (actual time=14382.006..50487.475 rows=50000000 loops=1)
   Hash Cond: (d.uuid = l.uuid)
   ->  Seq Scan on testdata d  (cost=0.00..818512.00 rows=50000000 width=20) (actual time=0.009..3567.087 rows=50000000 loops=1)
   ->  Hash  (cost=818559.16..818559.16 rows=50004716 width=16) (actual time=14372.441..14372.442 rows=50000000 loops=1)
         Buckets: 262144  Batches: 512  Memory Usage: 6617kB
         ->  Seq Scan on testdata2 l  (cost=0.00..818559.16 rows=50004716 width=16) (actual time=0.041..4730.541 rows=50000000 loops=1)
 Planning Time: 0.368 ms
 Execution Time: 51901.144 ms


This test needs to be repeated with a combined index with three UUIDs.

There’s one issue with HASH though: prior to PostgreSQL 10, HASH indices were not yet crash-safe (and also slow).
Currently we still support 9.6 and have a runtime check for >= 11.
Given that even PostgreSQL 11 already reached EOL about half a year ago, I think we can just raise our minimum version to 11 and also get rid of the runtime check.

S3

  • I’m not too familiar with the S3 uploader, what would be a good storage_handle for it?
  • Do we (want to) support multiple buckets?
  • Is the bucket name already part of S3’s base URL?

Migrations between S3 and Local

Do we want to track whether an upload was made with S3 or Local, so old files can remain with their original upload method? But then deletion would always need to consider all upload methods etc...
(I’m guessing “no”, but just in case)

The trouble with migrations

As mentioned before, current uploads can’t be deleted after a media domain change.
Our db migration will need to somehow deal with URLs using old media base_urls
and also do something if the base_url matches but its file no longer exists (due to e.g. hardware or admin mess ups).
If this isn’t handled gracefully we’ll be flooded with support requests and, worse, people manually mangling their DB before reaching out. If we’re too forgiving, on the other hand, the migration might turn out lossy or otherwise mess things up in a way which only shows up after a while (and again mangled databases and a support flood).

I’m not yet sure how to best handle that.
Perhaps we should already introduce an old_base_urls: [...] config ahead of time, together with a mix task to check whether they cover all existing media. Then people could use this to fix up their config ahead of time and we could just fail on unparseable URLs (but we’ll still have to do something about missing files).
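
Such a check could boil down to a query like the one below (a sketch; the LIKE patterns stand in for the configured current and old base_urls):

-- local media objects whose URL is not covered by any known base_url
SELECT DISTINCT u->>'href'
FROM objects AS o,
     jsonb_array_elements(o.data->'url') AS u
WHERE o.data->>'type' IN ('Document', 'Image')
  AND u->>'href' NOT LIKE 'https://media.current.example/%'
  AND u->>'href' NOT LIKE 'https://media.old.example/%';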

Future

Just replacing Image and Document objects is already a sufficiently big change, so I’d like to limit the initial port to this.

However, once done, this allows us to also tackle some other things. I’ll give some ideas here for future reference, but they shouldn’t be the focus atm.

We could fix the media update bug.
The simplest approach is to just automatically update the (scheduled) post (in objects) when media is updated.
But this means we still have to operate on objects and duplicate already-known information. Ideally posts would read all data about their attachments from media (and file). This gives us updates for free and there’s no more potential for copies to fall out of sync. However, this raises some new questions/problems for remote media...

For instance admins: checking stats

For realistic mockups and perf tests I’d like to get some reference numbers for media
and file counts on real-world instances. If you’re running an instance but don’t
want to run too involved SQL queries, just

SELECT COUNT(*) FROM objects;
SELECT COUNT(*) FROM objects WHERE data->>'type' = 'Note';
SELECT COUNT(*) FROM objects WHERE data->>'type' = 'Image';
SELECT COUNT(*) FROM objects WHERE data->>'type' = 'Document';

would already help.

But if you’re up for more complex queries, the queries below give more insight and show how much space orphaned attachments and files currently waste on your server.

If it’s a larger server or one running on weak hardware, it’s maybe better to do this on
a backup restored to a different machine so as not to impact instance operation.

SQL queries to check for orphaned attachments and files

These queries may also give an idea of why the current scheme is problematic. (The current cleanup job also has to check whether the attachment itself is referenced by multiple objects, which we…

If you use or used unusual media base_urls you might need to tweak the final two queries a bit.

-- Total object count for reference
SELECT COUNT(*) FROM objects;

-- Now lets get the orphaned and referenced attachments
-- (temporary tables are auto deleted when the connection closes)
CREATE TEMPORARY TABLE tmp_orphanattachs AS (
  WITH attachs(aid) AS (
    SELECT DISTINCT a->>'id'
    FROM objects AS o, jsonb_array_elements(o.data->'attachment') AS a
    WHERE o.data->>'type' = 'Note'
  )
  SELECT DISTINCT id, data, inserted_at
  FROM objects LEFT JOIN attachs ON data->>'id' = aid
  WHERE data->>'type' = 'Document' 
        AND aid IS NULL
);

CREATE TEMPORARY TABLE tmp_refattachs AS (
  SELECT DISTINCT o.id, o.data
  FROM objects AS o LEFT JOIN tmp_orphanattachs AS toa
       ON o.id = toa.id
  WHERE toa.id IS NULL
        AND o.data->>'type' = 'Document' 
);

-- And referenced and orphaned user images
CREATE TEMPORARY TABLE tmp_refimgs AS (
  WITH images(aid) AS (
    SELECT DISTINCT avatar->>'id' FROM users WHERE local = 't'
    UNION    
    SELECT DISTINCT banner->>'id' FROM users WHERE local = 't'                  
    UNION    
    SELECT DISTINCT background->>'id' FROM users WHERE local = 't'                                  
  )
  SELECT DISTINCT id, data             
  FROM objects JOIN images ON data->>'id' = aid
  WHERE data->>'type' = 'Image'
        AND aid IS NOT NULL
);

CREATE TEMPORARY TABLE tmp_orphanimgs AS (
  SELECT DISTINCT o.id, o.data, o.inserted_at
  FROM objects AS o LEFT JOIN tmp_refimgs AS tri
       ON tri.id = o.id
  WHERE tri.id IS NULL
        AND o.data->>'type' = 'Image'
);

-- Compare counts
SELECT COUNT(*) FROM tmp_refattachs;
SELECT COUNT(*) FROM tmp_orphanattachs;

SELECT COUNT(*) FROM tmp_refimgs;
SELECT COUNT(*) FROM tmp_orphanimgs;

-- Determine files that are only associated with orphaned attachments
--
-- if you want to output the result to a file, use:
--   COPY (<insert query here>) TO '/path/to/file.tsv' (csv, delimiter E'\t');

-- tweak this if using an unusual base_url
CREATE FUNCTION pg_temp.get_file(text) RETURNS text LANGUAGE plpgsql AS
$$ BEGIN
  RETURN
    regexp_replace(
      regexp_replace($1, '\?.*$', '', 1, 1),
      '^https?://[^/]+/(media/)?', '', 1, 1
    );
END; $$
;

-- First ALL actually referenced files
CREATE TEMPORARY TABLE tmp_rfiles AS (
  SELECT DISTINCT pg_temp.get_file(url->>'href') AS rfile
  FROM tmp_refattachs AS a,
       jsonb_array_elements(a.data->'url') AS url
  UNION
  SELECT DISTINCT pg_temp.get_file(url->>'href') AS rfile
  FROM tmp_refimgs AS i,
       jsonb_array_elements(i.data->'url') AS url
);

-- Finally orphaned files
--
-- NOTE: There are two variants below, both do the same
--   EXPLAIN ANALYZE’s estimated cost likes the first much more
--   but in actual execution time the second performed much better on my testdata,
--   but for real-world instances datasets will be much larger and this might differ
--
--   Prob best to run EXPLAIN (without ANALYZE) on both and take the one which
--   is (estimated to be) less costly for your particular database.

-- variant 1
WITH ourls(ourl, inserted_at) AS (
  SELECT DISTINCT url->>'href', inserted_at
  FROM tmp_orphanattachs AS a,
       jsonb_array_elements(a.data->'url') AS url
  UNION ALL
  SELECT DISTINCT url->>'href', inserted_at
  FROM tmp_orphanimgs AS i,
       jsonb_array_elements(i.data->'url') AS url
), ofiles(ofile, upload_time) AS (
  SELECT DISTINCT
    pg_temp.get_file(ourl) AS ofile,
    MIN(inserted_at) AS upload_time
  FROM ourls
  GROUP BY ofile
)
SELECT DISTINCT ofile, upload_time
FROM ofiles LEFT JOIN tmp_rfiles ON ofile = rfile
WHERE rfile IS NULL
ORDER BY upload_time, ofile;

-- variant 2
WITH ofiles_a(ofile, upload_time) AS (
  SELECT DISTINCT
    pg_temp.get_file(url->>'href') AS ofile,
    MIN(inserted_at) AS upload_time
  FROM tmp_orphanattachs AS a,
       jsonb_array_elements(a.data->'url') AS url
  GROUP BY ofile
), ofiles_i(ofile, upload_time) AS (
  SELECT DISTINCT
    pg_temp.get_file(url->>'href') AS ofile,
    MIN(inserted_at) AS upload_time
  FROM tmp_orphanimgs AS i,
       jsonb_array_elements(i.data->'url') AS url
  GROUP BY ofile
), ofiles(ofile, upload_time) AS (
  SELECT * FROM ofiles_a
  UNION ALL
  SELECT * FROM ofiles_i
)
SELECT DISTINCT ofile, MIN(upload_time) AS "first upload"
FROM ofiles LEFT JOIN tmp_rfiles ON ofile = rfile
WHERE rfile IS NULL
GROUP BY ofile
ORDER BY "first upload", ofile;

Summary of TODOs

  • collect stats from real-world instances
  • raise minimum PostgreSQL version to 11 (gets rid of the current runtime check and of the broken HASH index behaviour in pre-10 PostgreSQL)
  • decide on what to do about mismatched extensions
  • decide on a file cleanup strategy (regular scan or a check on post deletion?)
    • if using scheduled scan:
      create realistic (also in terms of row count) mock up tables and test whether PostgreSQL will even use the file index for LEFT JOINs;
      if not this index is unnecessary bloat and can be dropped
  • perf tests for different index types (also check storage consumption) and whether indices for scheduled posts and users are needed at all
  • actually writing the patches once the above is done
  • much, much, much testing of the migration taking media domain migrations, missing files and other edge cases into account