RFC: database scheme for media uploads #765

Open
opened 2024-04-28 22:13:52 +00:00 by Oneric

Well, it will be a while before there’s any chance of me actually implementing this, but it’s probably still good to put this up here already for discussion and as a reference. I certainly also won’t mind if someone else helps out with the remaining perf tests or even takes over implementation :)

Basics

Each post has zero or more attachment objects associated with it and those attachment objects each reference a file, but multiple attachments can reference the same file.

In Mastodon an attachment object exclusively belongs to a single (scheduled) post. (see also API docs talking about GET /api/v1/media/:id requests being denied once it’s attached)

In *oma, we currently actually allow multiple posts to reference the same media attachment, with some restrictions on who can access it. Though there’s no support for this in akkoma-fe or any other frontend or app afaik. I suspect in practice only a few attachments (from manual API use/testing and maybe C2S?) are actually shared.

Furthermore, in Mastodon it is (presumably) sufficient to only call the media update API for alt text edits etc to show up in associated posts, but in Akkoma those changes only become visible once the post itself is also updated (this is a consequence of the current storage scheme and not realistically fixable within this scheme due to excessive query overhead).

Also, Mastodon treats post attachment media separately from user profile images; the latter aren’t MastoAPI media objects.

Current Situation

At the moment we keep track of local media (i.e. belonging to our instance, not necessarily Uploader.Local) via type=Image (user profile {avatars,banners,backgrounds}) and type=Document (post attachments) JSON objects in, well... the objects table. Like all other objects, their JSON blob has an AP-like URL id (even though this URL cannot actually be resolved via HTTP). The primary database key of this object is used as the Masto-API media_id, but internally posts reference their attachment via the URL id inside the JSON data.

This is quite bad.

This is already conceptually a bit weird, as all other types (Note, Question, Answer, Tombstone) in the objects table are actually dereferenceable ActivityPub objects, while uploaded media data is not.

But the much bigger issue is the inefficiencies and usability problems / bugs resulting from this.
On a more general note, it bloats the already large objects table further, making all queries on it more costly, and table indices presumably scale super-linearly with object count. It now also contends for and is affected by objects table locks. Just splitting it out and keeping everything else as is would probably already help somewhat, but is far from enough.

It’s hard to check which other attachments reference the same file (you need to at least strip query parameters and then join the table with itself over the processed URL), and for Uploader.Local the file storage path is also hard-tied to the public-facing file URL, meaning that to determine the file path we need to strip the base_url prefix and query parameters.
As a result, instances which migrated their media domain can never successfully delete old, pre-migration files.
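
To illustrate, under the current scheme a check along the lines of “does any other attachment reference the same file as this one?” boils down to something like the self-join below. This is only a sketch using the JSON layout of the cleanup queries further down; $1 is a hypothetical placeholder for the attachment object’s id.

-- sketch: other Document objects sharing a (query-parameter-stripped) href with attachment $1
SELECT DISTINCT o2.id
FROM objects AS o1,
     jsonb_array_elements(o1.data->'url') AS u1,
     objects AS o2,
     jsonb_array_elements(o2.data->'url') AS u2
WHERE o1.id = $1
  AND o2.id <> o1.id
  AND o2.data->>'type' = 'Document'
  AND regexp_replace(u1->>'href', '\?.*$', '') = regexp_replace(u2->>'href', '\?.*$', '');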

In fact, this lookup is so costly that deletion of attachments was moved into an asynchronous job queue
five years ago (d6a532bf0f), and, since this still wasn't enough for large servers,
it was made optional and disabled by default four years ago (8057157ee3).
Independently, both Akkoma and Pleroma later added a 900s timeout to attachment cleanup jobs.

Attachment cleanup being disabled by default (or timing out even if enabled) means servers accumulate useless entries in objects as well as orphaned upload files. Eternal bloat.


Proposal

Media attachments not actually being ActivityPub objects makes this much easier to fix than if they were.
There’s no need to consider federation effects or preserve URI paths; only Masto-API media_ids are relevant.
Still, “easier” here is relative; most of the upload and media handling needs to be adjusted and the data migration itself also poses some challenges.

Attachment objects being allowed to be associated with multiple posts makes things a bit more inconvenient and less efficient if we wanted to preserve that (we’d need multiple separate join tables like for hashtags; see the sketch below). I think we can move to exclusive ownership though, and it shouldn’t be too bad if we change media ids for the rare shared attachment during migration. In the following, exclusive ownership is assumed.
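
For reference, preserving shared attachments would require join tables roughly like the following. This is a sketch of the alternative being ruled out; the table and column names are made up.

-- hypothetical join table if one attachment could belong to several posts
CREATE TABLE media_objects(
    media_id uuid NOT NULL REFERENCES media(id) ON DELETE CASCADE,
    object_id uuid NOT NULL REFERENCES objects(id),
    PRIMARY KEY (media_id, object_id)
);
-- ...plus analogous tables for scheduled posts and user profile images.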

I believe we should split attachments out of objects and also clearly distinguish uploaded files (which can be shared) from attachment objects (which reference a single file but also carry additional metadata, like alt text and, if we ever implement it, focal point etc). Thus we’d add two new tables, called media and upload_files in the sketch below.
We can then also decouple deduplication from the filename (and otoh I think we currently don’t even check whether a file already exists, but rely on nothing actually changing on overwrite — with external uploaders each store/overwrite might still create a new version of the key or incur some cost).

There are some details still to hammer out, but I’ll put the general DB scheme first:

CREATE TABLE upload_files(
    id uuid PRIMARY KEY, -- any UUID scheme should be fine
    sha256 BYTEA NOT NULL,
    extension TEXT, -- see later notes
    UNIQUE(sha256, extension) NULLS NOT DISTINCT, -- dedupe is now baked into db
    -- Uploader.Local: path relative to upload dir
    -- S3: key? or bucket:key?
    storage_handle TEXT UNIQUE NOT NULL,
    inserted_at timestamp NOT NULL, -- first upload
    updated_at timestamp NOT NULL   -- optional: time of newest associated upload
);


CREATE TABLE media(
    id uuid PRIMARY KEY, -- Flake-CompatType (to preserve old media ids)
    owner uuid NOT NULL REFERENCES users(id) ON DELETE CASCADE,

    -- Tracking who uses media; all initially NULL
    note uuid REFERENCES objects(id), -- if post attachment; due to "delete&redraft" _no_ ON DELETE CASCADE
    profile uuid REFERENCES users(id) ON DELETE CASCADE, -- if user image
    scheduled_note uuid REFERENCES scheduled_activities(id) ON DELETE CASCADE, -- if scheduled post attachment
    CHECK(
        (note IS NULL  AND  profile IS NULL) OR
        (note IS NULL  AND  scheduled_note IS NULL) OR
        (profile IS NULL  AND  scheduled_note IS NULL)
    ),

    -- or ON DELETE CASCADE to clean up on messups ig
    file uuid NOT NULL REFERENCES upload_files(id) ON DELETE RESTRICT,
    description TEXT,
    title TEXT,
    url_parameters jsonb, -- currently just link_name; simple flat "key": "value" object
    inserted_at timestamp NOT NULL,
    updated_at timestamp NOT NULL
);
CREATE INDEX media_usage_index ON media USING HASH (note, scheduled_note, profile);
CREATE INDEX media_file_index ON media USING HASH (file);

Thanks to the first index we can cheaply locate all attachment objects belonging to a post as well as orphaned attachments.
Thanks to the second, we can quickly find out whether anything else references the same file (if needed; see “Cleanup”).
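
The lookups these indices are meant to serve would look roughly like the queries below (a sketch; $1/$2 are hypothetical placeholders).

-- all attachments of a given post
SELECT id FROM media WHERE note = $1;

-- orphaned attachments (never attached to anything)
SELECT id FROM media
WHERE note IS NULL AND scheduled_note IS NULL AND profile IS NULL;

-- does any other attachment still reference the same file?
SELECT EXISTS(SELECT 1 FROM media WHERE file = $1 AND id <> $2);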

Now attachment deletions should be so cheap again that we can enable them by default (and probably don't even need an option to disable them anymore in the first place).

Notes and some open questions re schema

Attachments vs user images

We could split user images and attachments to cut the mutually exclusive foreign keys down to only two columns,
e.g. by directly referencing upload_files from users. But then we’d have to check two tables instead of one when deleting files (or also split files, with no deduplication between the categories). Also, for implementing e.g. avatar alt texts (#249) it makes more sense to use the existing attachment infrastructure. If we ever revive C2S, we’ll never know ahead of time what an upload gets used for. And ofc we still need two different foreign keys in media for normal and scheduled posts.

So currently it seems to me like one shared media table makes more sense.

URL parameters

It’s a flat JSONB object for future extensibility, but atm it’s only link_name and we could also just use a single link_name column instead, or one plain TEXT url_parameters column. No strong opinion here.

Cleanup

Attachments are created before their posts and might not ever be associated with a post, so for cleaning them up we have no choice but to (additionally or exclusively) run a regularly scheduled job querying media.note IS NULL.

For files we could do the same, but it’d require a LEFT JOIN between upload_files and media and I’m not sure whether indices will actually be used for this. Instead we could, after each attachment delete (but within the same DB transaction), check whether its file is still referenced and, if not, also delete the file (see the sketch below).
This guarantees indices can be used and should always be cheap per file, but it is more effort to use correctly, as we can’t just rely on ON DELETE CASCADE triggers to delete attachments when their post is deleted.
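
A sketch of that per-delete check, assuming the schema above ($1 stands for the file id of the just-deleted attachment; run in the same transaction as the attachment delete):

-- drop the file row only if no attachment references it anymore
DELETE FROM upload_files AS f
WHERE f.id = $1
  AND NOT EXISTS (SELECT 1 FROM media AS m WHERE m.file = f.id);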

However, at least on the test data (50 million entries, no other load), joins with a 410-thousand-row table "only" took about 3.5 seconds if the DB is in the file cache. Compared to a 900s timeout, maayyybee it might be fine to run such a LEFT JOIN once a day (and regardless of how many locks this operation holds, it will now only affect media operations instead of basically everything).
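
The scheduled-scan variant would be roughly the query below; whether the file index actually gets used for it is exactly what needs testing.

-- files no longer referenced by any attachment
SELECT f.id, f.storage_handle
FROM upload_files AS f
     LEFT JOIN media AS m ON m.file = f.id
WHERE m.id IS NULL;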

Mismatched file extensions

Currently we don’t actually deduplicate all identical files, only those for which both data and file extension match. In most cases different file extensions won't occur anyway (but consider e.g. jpeg vs jpg and png vs PNG).
To replicate this behaviour, upload_files contains an extension column and uniqueness is checked for the combination of hash and extension. We should at least normalise these extensions though, to avoid too many unnecessary copies; e.g. doing a roundtrip through MIME and, if that only yields octet-stream, at least lowercasing the extension.

We could also just not care about extension mismatches, but if the first uploader used an actually wrong extension, all future uploads would get this borked extension too. Eh.

Just stripping the file extension from the URL might at first seem sensible too (sending a Content-Type header instead) — but then we’ll either have to look up the media type in the database on each media GET request, or read the media type from a new URL parameter (or enforce an extension for link_name). The latter means that by editing the URL any file could be served with any user-chosen media type, though ofc those media types would still go through the same sanitisation as now.
But I’m also not sure if setting Content-Type works at all with external uploaders like S3 (if you know S3, let me know).

EDIT: on upload and normal updates we can set a single Content-Type header and Content-Disposition header per object which will get used for regular requests. However, it is possible to override these with signed URLs; we could use this for full deduplication regardless of original extension.
Using signed URLs is already necessary to make link_name (Content-Disposition: filename=...;) actually work with S3 (afaict link_name is currently not implemented at all for S3 uploads).
See “Overriding response header values through the request” in https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html

EDIT2: ... but apparently presigned URLs are only valid for a limited time (https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-presigned-url.html#PresignedUrl-Expiration)

Reused attachments

Any reason to allow reusing the same attachment object after all, even though Mastodon doesn't?

File paths

We can now fully decouple file paths from file content since hashes are tracked in the DB.
Using e.g. the Base62-encoded file (uu)id is unproblematic and safe on any case-sensitive filesystem (do we care about case-insensitive filesystems?) and on S3 (currently there’s a cursed, lossy strict_encoding function for pre-Dedupe uploads), and it would bring the file name length down from 64 characters to at most 22 (⌈128 / log₂ 62⌉ = 22), which is a bit more friendly.

Indices

If we use scheduled file scans for cleanup, we maybe don't need the media_file index. I’ve observed PostgreSQL not using indices before when joining with a table that is too large. This will need testing.

I’m not sure yet whether

  • one HASH(note, scheduled_note, profile)
  • all three as separate indices (most entries will be NULL; wastes a lot of space)
  • only HASH(note) (since this makes up the bulk and a full scan on the rest might be cheap enough anyway)

is better.
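
A further option worth benchmarking alongside these (not listed above) would be partial indices, which simply skip the mostly-NULL rows:

-- partial indices only contain rows where the respective column is set,
-- so the many NULL entries don't take up space (index names are made up)
CREATE INDEX media_note_index ON media USING HASH (note) WHERE note IS NOT NULL;
CREATE INDEX media_scheduled_note_index ON media USING HASH (scheduled_note) WHERE scheduled_note IS NOT NULL;
CREATE INDEX media_profile_index ON media USING HASH (profile) WHERE profile IS NOT NULL;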

If using a combined index we also need to test whether PostgreSQL is smart
enough to infer IS NULL for the other two entries from the constraint, or
whether we’ll always need to spell out note = 'some-id' AND scheduled_note IS NULL AND profile IS NULL etc.
(ideally also test on the lowest supported PostgreSQL version).
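
I.e. compare plans for something along these lines (with a real uuid in place of 'some-id'):

EXPLAIN (ANALYZE) SELECT id FROM media WHERE note = 'some-id';
EXPLAIN (ANALYZE) SELECT id FROM media
  WHERE note = 'some-id' AND scheduled_note IS NULL AND profile IS NULL;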

At least for indices over a single UUID, HASH is definitely better than BTREE though:
We'll only ever do equality lookups on the index; uuids have no meaningful ordering anyway.
Thus HASH is sufficient and saves on disk storage compared to BTREE.
Tests on a table with 50,000,000 non-primary-key UUID rows also suggest HASH performs better than BTREE here on PostgreSQL 16.

Details on the perf test
-- Generate test data with 50 million UUIDs
CREATE TABLE testdata AS
  SELECT generate_series(1,50_000_000) AS key,
         gen_random_uuid() AS uuid;
-- duplicate for comparisons
CREATE TABLE testdata_btree AS SELECT * FROM testdata;
CREATE TABLE testdata_noi AS SELECT * FROM testdata;

-- create indices
CREATE INDEX test_hash_index ON testdata USING HASH (uuid);
CREATE INDEX test_btree_index ON testdata_btree USING BTREE (uuid);

-- check storage consumption
SELECT
   relname  as table_name,
   pg_size_pretty(pg_total_relation_size(relid)) As "Total Size",
   pg_size_pretty(pg_indexes_size(relid)) as "Index Size",
   pg_size_pretty(pg_table_size(relid)) as "Storage Size (with TOAST)",
   pg_size_pretty(pg_relation_size(relid)) as "Data Size (without TOAST)"
   FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- observation: HASH is smaller

-- create a small and big lookup table
CREATE TABLE lookups AS (
  (SELECT uuid FROM testdata WHERE (random() * 31) >= 30 LIMIT 30_000)
  UNION ALL
  SELECT gen_random_uuid() AS uuid FROM generate_series(1,3_000)
);

CREATE TABLE lookups_big AS (
  (SELECT uuid FROM testdata WHERE (random() * 31) >= 30 LIMIT 400_000)
  UNION ALL
  SELECT gen_random_uuid() AS uuid FROM generate_series(1,10_000)
);

-- finally do some timed test queries and check EXPLAIN ANALYZE ...
-- for different scenarios
-- e.g.
--  SELECT key FROM testdata WHERE uuid = 'an existing uuid';
--  SELECT key FROM testdata WHERE uuid = 'a non-existing uuid';
--  SELECT key FROM testdata AS d JOIN lookups AS l ON d.uuid = l.uuid;
--  SELECT key FROM testdata AS d JOIN lookups_big AS l ON d.uuid = l.uuid;

Observations from timed tests and EXPLAIN (ANALYZE) <query>;:

  • joins with lookups_big never use indices; everything is about the same speed
  • for joins with lookups, indices are still used (but because everything is faster, the absolute time difference between HASH and BTREE isn’t that big)
  • HASH is a bit smaller and notably faster than BTREE. Its estimated cost range in the analyser is only a tad better, but the actual runtime was about half of BTREE’s on individual lookups for me.
  • BTREE is massively faster than no index at all
# JOINING both testdata tables as an extreme test
# note how indices are not used
test_indices=# EXPLAIN ANALYZE SELECT d.key FROM testdata AS d JOIN testdata2 AS l ON d.uuid = l.uuid;
                                                               QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1687782.11..7086396.11 rows=50000000 width=4) (actual time=14382.006..50487.475 rows=50000000 loops=1)
   Hash Cond: (d.uuid = l.uuid)
   ->  Seq Scan on testdata d  (cost=0.00..818512.00 rows=50000000 width=20) (actual time=0.009..3567.087 rows=50000000 loops=1)
   ->  Hash  (cost=818559.16..818559.16 rows=50004716 width=16) (actual time=14372.441..14372.442 rows=50000000 loops=1)
         Buckets: 262144  Batches: 512  Memory Usage: 6617kB
         ->  Seq Scan on testdata2 l  (cost=0.00..818559.16 rows=50004716 width=16) (actual time=0.041..4730.541 rows=50000000 loops=1)
 Planning Time: 0.368 ms
 Execution Time: 51901.144 ms


This test needs to be repeated with a combined index with three UUIDs.

There’s one issue with HASH though: prior to PostgreSQL 10, HASH indices were not yet crash-safe (and also slow).
Currently we still support 9.6 and have a runtime check for >= 11.
Given that even PostgreSQL 11 already reached EOL about half a year ago, I think we can just raise our minimum version to 11 and also get rid of the runtime check.

S3

  • I’m not too familiar with the S3 uploader, what would be a good storage_handle for it?
  • Do we (want to) support multiple buckets?
  • Is the bucket name already part of S3’s base URL?

Migrations between S3 and Local

Do we want to track whether an upload was made with S3 or Local, so old files can remain with their original upload method? But then deletion would always need to consider all upload methods etc...
(I’m guessing “no”, but just in case)

The trouble with migrations

As mentioned before, current uploads can’t be deleted after a media domain change.
Our db migration will need to somehow deal with URLs using old media base_urls
and also do something if the base_url matches but its file no longer exists (due to e.g. hardware or admin mess ups).
If this isn’t handled gracefully we’ll be flooded with support requests and, worse, people manually mangling their DB before reaching out. If we’re too forgiving, on the other hand, the migration might turn out lossy or otherwise mess things up in a way which only shows up after a while (and again mangled databases and a support flood).

I’m not yet sure how to best handle that.
Perhaps we should already introduce an old_base_urls: [...] config ahead of time, together with a mix task to check whether they cover all existing media. Then people could use this to fix up their config ahead of time and we could just fail on unparseable URLs (but we’ll still have to do something about missing files).
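
Such a check could boil down to a query like the one below (a sketch; the LIKE patterns stand in for the configured current and old base_urls):

-- local media objects whose URL is not covered by any known base_url
SELECT DISTINCT u->>'href'
FROM objects AS o,
     jsonb_array_elements(o.data->'url') AS u
WHERE o.data->>'type' IN ('Document', 'Image')
  AND u->>'href' NOT LIKE 'https://media.current.example/%'
  AND u->>'href' NOT LIKE 'https://media.old.example/%';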

Future

Just replacing Image and Document objects is already a sufficiently big change, so I’d like to limit the initial port to this.

However, once done, this allows us to also tackle some other things. I’ll give some ideas here for future reference, but they shouldn’t be the focus atm.

We could fix the media update bug.
The simplest approach is to just automatically update the (scheduled) post (in objects) when media is updated.
But this means we still have to operate on objects and duplicate already-known information. Ideally posts would read all data about their attachments from media (and file). This gives us updates for free and there’s no more potential for copies to fall out of sync. However, this raises some new questions/problems for remote media...

For instance admins: checking stats

For realistic mockups and perf tests I’d like to get some reference numbers for media
and file counts on real-world instances. If you’re running an instance but don’t
want to run too involved SQL queries, just

SELECT COUNT(*) FROM objects;
SELECT COUNT(*) FROM objects WHERE data->>'type' = 'Note';
SELECT COUNT(*) FROM objects WHERE data->>'type' = 'Image';
SELECT COUNT(*) FROM objects WHERE data->>'type' = 'Document';

would already help.

But if you’re up for more complex queries, the queries below give more insight and show how much space orphaned attachments and files currently waste on your server.

If it’s a larger server or one running on weak hardware, it’s maybe better to do this on
a backup restored to a different machine so as not to impact instance operation.

SQL queries to check for orphaned attachments and files

These queries may also give an idea of why the current scheme is problematic. (The current cleanup job also has to check whether the attachment itself is referenced by multiple objects, which we…

If you use or used unusual media base_urls you might need to tweak the final two queries a bit.

-- Total object count for reference
SELECT COUNT(*) FROM objects;

-- Now lets get the orphaned and referenced attachments
-- (temporary tables are auto deleted when the connection closes)
CREATE TEMPORARY TABLE tmp_orphanattachs AS (
  WITH attachs(aid) AS (
    SELECT DISTINCT a->>'id'
    FROM objects AS o, jsonb_array_elements(o.data->'attachment') AS a
    WHERE o.data->>'type' = 'Note'
  )
  SELECT DISTINCT id, data, inserted_at
  FROM objects LEFT JOIN attachs ON data->>'id' = aid
  WHERE data->>'type' = 'Document' 
        AND aid IS NULL
);

CREATE TEMPORARY TABLE tmp_refattachs AS (
  SELECT DISTINCT o.id, o.data
  FROM objects AS o LEFT JOIN tmp_orphanattachs AS toa
       ON o.id = toa.id
  WHERE toa.id IS NULL
        AND o.data->>'type' = 'Document' 
);

-- And referenced and orphaned user images
CREATE TEMPORARY TABLE tmp_refimgs AS (
  WITH images(aid) AS (
    SELECT DISTINCT avatar->>'id' FROM users WHERE local = 't'
    UNION    
    SELECT DISTINCT banner->>'id' FROM users WHERE local = 't'                  
    UNION    
    SELECT DISTINCT background->>'id' FROM users WHERE local = 't'                                  
  )
  SELECT DISTINCT id, data             
  FROM objects JOIN images ON data->>'id' = aid
  WHERE data->>'type' = 'Image'
        AND aid IS NOT NULL
);

CREATE TEMPORARY TABLE tmp_orphanimgs AS (
  SELECT DISTINCT o.id, o.data, o.inserted_at
  FROM objects AS o LEFT JOIN tmp_refimgs AS tri
       ON tri.id = o.id
  WHERE tri.id IS NULL
        AND o.data->>'type' = 'Image'
);

-- Compare counts
SELECT COUNT(*) FROM tmp_refattachs;
SELECT COUNT(*) FROM tmp_orphanattachs;

SELECT COUNT(*) FROM tmp_refimgs;
SELECT COUNT(*) FROM tmp_orphanimgs;

-- Determine files that are only associated with orphaned attachments
--
-- if you want to output the result to a file, use:
--   COPY (<insert query here>) TO '/path/to/file.tsv' (csv, delimiter E'\t');

-- tweak this if using an unusual base_url
CREATE FUNCTION pg_temp.get_file(text) RETURNS text LANGUAGE plpgsql AS
$$ BEGIN
  RETURN
    regexp_replace(
      regexp_replace($1, '\?.*$', '', 1, 1),
      '^https?://[^/]+/(media/)?', '', 1, 1
    );
END; $$
;

-- First ALL actually referenced files
CREATE TEMPORARY TABLE tmp_rfiles AS (
  SELECT DISTINCT pg_temp.get_file(url->>'href') AS rfile
  FROM tmp_refattachs AS a,
       jsonb_array_elements(a.data->'url') AS url
  UNION
  SELECT DISTINCT pg_temp.get_file(url->>'href') AS rfile
  FROM tmp_refimgs AS i,
       jsonb_array_elements(i.data->'url') AS url
);

-- Finally orphaned files
--
-- NOTE: There are two variants below, both do the same
--   EXPLAIN ANALYZE’s estimated cost likes the first much more
--   but in actual execution time the second performed much better on my testdata,
--   but for real-world instances datasets will be much larger and this might differ
--
--   Prob best to run EXPLAIN (without ANALYZE) on both and take the one which
--   is (estimated to be) less costly for your particular database.

-- variant 1
WITH ourls(ourl, inserted_at) AS (
  SELECT DISTINCT url->>'href', inserted_at
  FROM tmp_orphanattachs AS a,
       jsonb_array_elements(a.data->'url') AS url
  UNION ALL
  SELECT DISTINCT url->>'href', inserted_at
  FROM tmp_orphanimgs AS i,
       jsonb_array_elements(i.data->'url') AS url
), ofiles(ofile, upload_time) AS (
  SELECT DISTINCT
    pg_temp.get_file(ourl) AS ofile,
    MIN(inserted_at) AS upload_time
  FROM ourls
  GROUP BY ofile
)
SELECT DISTINCT ofile, upload_time
FROM ofiles LEFT JOIN tmp_rfiles ON ofile = rfile
WHERE rfile IS NULL
ORDER BY upload_time, ofile;

-- variant 2
WITH ofiles_a(ofile, upload_time) AS (
  SELECT DISTINCT
    pg_temp.get_file(url->>'href') AS ofile,
    MIN(inserted_at) AS upload_time
  FROM tmp_orphanattachs AS a,
       jsonb_array_elements(a.data->'url') AS url
  GROUP BY ofile
), ofiles_i(ofile, upload_time) AS (
  SELECT DISTINCT
    pg_temp.get_file(url->>'href') AS ofile,
    MIN(inserted_at) AS upload_time
  FROM tmp_orphanimgs AS i,
       jsonb_array_elements(i.data->'url') AS url
  GROUP BY ofile
), ofiles(ofile, upload_time) AS (
  SELECT * FROM ofiles_a
  UNION ALL
  SELECT * FROM ofiles_i
)
SELECT DISTINCT ofile, MIN(upload_time) AS "first upload"
FROM ofiles LEFT JOIN tmp_rfiles ON ofile = rfile
WHERE rfile IS NULL
GROUP BY ofile
ORDER BY "first upload", ofile;

Summary of TODOs

  • collect stats from real-world instances
  • raise minimum PostgreSQL version to 11 (gets rid of the current runtime check and of the broken HASH index behaviour in pre-10 PostgreSQL)
  • decide on what to do about mismatched extensions
  • decide on a file cleanup strategy (regular scan or a check on post deletion?)
    • if using scheduled scan:
      create realistic (also in terms of row count) mock up tables and test whether PostgreSQL will even use the file index for LEFT JOINs;
      if not this index is unnecessary bloat and can be dropped
  • perf tests for different index types (also check storage consumption) and whether indices for scheduled posts and users are needed at all
  • actually writing the patches once the above is done
  • much, much, much testing of the migration taking media domain migrations, missing files and other edge cases into account