Writing image descriptions into EXIF tags #746

Open
opened 2024-04-16 19:09:51 +00:00 by timorl · 5 comments
Contributor

As mentioned in #744, it would be nice to be able to automatically write media descriptions into the EXIF data of image files, so that posters who download and later re-upload them get the description filled in immediately. This issue is mostly for discussing some details of implementing this idea and what the precise behaviour should be.

@Oneric already mentioned quite a few possible technical challenges in [this comment](https://akkoma.dev/AkkomaGang/akkoma/pulls/744#issuecomment-11619_). I've only started contributing and I kinda have no idea what I'm doing, so below I'll write some stuff in the hope that more knowledgeable people will tell me where I'm being silly – if not, I'll just read the code.

As for precise ideas – initially I thought this could be done as a message filter on all posts, but that cannot work, since most media is not being kept locally anyway, right? So the functionality has to work only for in-instance posts?

The writing of the tag should happen after the user clicks "Post", so I assume this won't be quite in the same spot as the reading, since that happens on file upload. Now, if I understood @Oneric's comment correctly, the file that gets uploaded is already on the server before the post gets made. Does it stay there if the user aborts the process of writing the post, or is it somehow garbage-collected? If the latter, could that mechanism help with the rewriting that we would like to do, by removing the original upload _after_ the post is made and the EXIF data has been modified (with the hash/name changed accordingly)?

(Alternative approach that I don't like, but mentioning it for completeness – we could hash just the contents of the file without (most of) the EXIF metadata and use that as a filename. This is complicated and could behave very weirdly with multiple people posting the same image with different descriptions, so bleh. Speaking of multiple posts of the same image – I don't think this and `Dedup` will be much of an issue: people probably post the exact same image relatively rarely, or use versions of it downloaded from other posts, and once this feature is complete they would likely just keep the auto-filled image descriptions.)

One more minor point – I would assume that if the file already has an `ImageDescription` tag, then we don't want to touch it, right? Or is overwriting it a good idea for some reason?

Member

> since most media is not being kept locally anyway, right?

Yep. If `:media_proxy` is enabled, a cache of remote files can be kept, but this cache is managed by nginx and just directly corresponds to remote content. With mediaproxy preview we potentially also preprocess remote content (and cache it), but this only takes the media as input and doesn’t know anything about its associated post. In fact, here too a single media file might be associated with multiple distinct remote posts with differing alt text.

> Now, if I understood @Oneric's comment correctly, the file that gets uploaded is already on the server before the post gets made.

Yup, that’s a result of how the Mastodon API works. When you [post a new status](https://docs.joinmastodon.org/methods/statuses/#create) with attachments, you must include an array of `media_ids`. To get media ids you must first [upload media](https://docs.joinmastodon.org/methods/media/#v2) to the server. After uploading you can update media to change alt text etc.
Meaning, what actually happens when you draft a new post is that each attachment gets immediately uploaded without any alt text or title. Once you hit the post button, multiple API requests are then made to update media info and finally actually create the post.
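To make the order of requests described above concrete, here is a minimal sketch against an in-memory stand-in for an instance. `FakeServer` and `draft_and_post` are made up for illustration and make no network calls; only the endpoint paths noted in the comments are taken from the Mastodon API docs linked above.

```python
import itertools


class FakeServer:
    """Hypothetical in-memory stand-in for an instance; records the calls it gets."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.media = {}
        self.log = []

    def upload_media(self, data):
        # Corresponds to POST /api/v2/media: no alt text is known yet.
        media_id = str(next(self._ids))
        self.media[media_id] = {"data": data, "description": None}
        self.log.append(("POST", "/api/v2/media"))
        return media_id

    def update_media(self, media_id, description):
        # Corresponds to PUT /api/v1/media/:id: this is where alt text arrives.
        self.media[media_id]["description"] = description
        self.log.append(("PUT", f"/api/v1/media/{media_id}"))

    def create_status(self, text, media_ids):
        # Corresponds to POST /api/v1/statuses with a media_ids array.
        self.log.append(("POST", "/api/v1/statuses"))
        return {"text": text, "media_ids": list(media_ids)}


def draft_and_post(server, text, attachments):
    """attachments is a list of (file_bytes, alt_text) pairs."""
    # While drafting: every attachment is uploaded right away.
    ids = [server.upload_media(data) for data, _ in attachments]
    # After hitting "Post": first the media updates, then the status itself.
    for media_id, (_, alt_text) in zip(ids, attachments):
        server.update_media(media_id, alt_text)
    return server.create_status(text, ids)


server = FakeServer()
status = draft_and_post(server, "look at this", [(b"fake image bytes", "a cat")])
for call in server.log:
    print(call)
```

The point for this issue is that the file bytes reach the server in the first call, before any description exists, so writing the description into the file can only happen at the media update step or later.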

> Does it stay there if the user aborts the process of writing the post, or is it somehow garbage-collected?

Ideally it would get deleted at some point :)
In practice (~~and as already hinted at in ilja’s original PR~~ nope; must’ve been somewhere else, sorry), media are currently tracked in a suboptimal form in our database. As a result, operations like checking if media are actually used are rather expensive.
If you enabled the relevant config option, media will get deleted together with their post, but any media which already gets scrapped in draft status or is later edited out stays (unless manual action is taken).

I’ve been thinking of how to split out media and files into separate and easily scannable tables, but realistically it will be at least several months before i actually get around to implementing a large scale data migration like this. (Maybe i should already write down and publish the db scheme i have in mind anyway, so it can be discussed and mayhaps even picked up by someone else earlier)
If implemented, this would indeed help with garbage collecting older versions of uploads whose alt text was modified.

(given how things are stored, i suspect our posts also don’t pick up updated alt text if you only use the `/media` API, but only once a `/status` update is also issued, while in Mastodon using only `/media` is presumably enough)

> One more minor point – I would assume that if the file already has an `ImageDescription` tag, then we don't want to touch it, right? Or is overwriting it a good idea for some reason?

If the alt text gets changed, metadata should be overwritten as well; the original description can differ from the edited alt text. E.g. the poster may have translated the original description from metadata into another language.
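As a small illustration of that rule, the decision could be sketched like this. `description_to_embed` is a hypothetical helper, and leaving the file untouched when no alt text was entered is an assumption, not something settled in this thread.

```python
from typing import Optional


def description_to_embed(exif_description: Optional[str],
                         alt_text: Optional[str]) -> Optional[str]:
    """Return the text to write into the file, or None to leave the file as-is."""
    if not alt_text:
        # Assumption: no alt text entered, so keep whatever the file carries.
        return None
    if alt_text == exif_description:
        # Already in sync; no rewrite (and no new file hash) needed.
        return None
    # The poster edited the description (e.g. translated it): alt text wins.
    return alt_text


print(description_to_embed("a red bicycle", "ein rotes Fahrrad"))
```

Skipping the rewrite when nothing changed matters here because any rewrite produces a new file on disk.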

> Alternative approach that I don't like, but mentioning it for completeness – we could hash just the contents of the file without (most of) the EXIF metadata and use that as a filename.

yeah no; as you already pointed out, this will cause issues if pure image data matches but metadata differs, and leaking data from one user into another user’s context is never a good idea
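A toy sketch of why the pixel-only hash is a problem, assuming filenames are derived from a SHA-256 of the upload. The "file format" below is a made-up stand-in (a description line followed by pixel bytes), not real EXIF, and all function names are hypothetical.

```python
import hashlib


def make_file(description: str, pixels: bytes) -> bytes:
    # Toy container, NOT real EXIF: one description line, then the pixel data.
    return b"DESC=" + description.encode("utf-8") + b"\n" + pixels


def whole_file_name(data: bytes, ext: str = "png") -> str:
    # Name derived from a hash of the complete upload, metadata included.
    return f"{hashlib.sha256(data).hexdigest()}.{ext}"


def pixels_only_name(data: bytes, ext: str = "png") -> str:
    # The rejected alternative: ignore the metadata part when hashing.
    pixels = data.split(b"\n", 1)[1]
    return f"{hashlib.sha256(pixels).hexdigest()}.{ext}"


pixels = b"\x00\x01\x02 same picture"
alice = make_file("a red bicycle", pixels)
bob = make_file("ein rotes Fahrrad", pixels)

# Different descriptions give different names when hashing the whole file ...
print(whole_file_name(alice) == whole_file_name(bob))
# ... but collide when hashing pixels only, so one user's upload (and its
# embedded description) would be served for the other user's post.
print(pixels_only_name(alice) == pixels_only_name(bob))
```

The same sketch also shows the cost of the whole-file approach: writing a description into an already uploaded file changes its hash, so the rewritten file is stored next to the original rather than replacing it.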

Author
Contributor

> If the alt text gets changed, metadata should be overwritten

Ha, good points in there and I'm happy you made them, since that actually makes things easier.

My general feeling is that the best we could do for now would be to first upload the media and then create a copy with overwritten contents on post creation. This would essentially double the media library size for every instance, which doesn't sound acceptable. :/

If we had the media purging capabilities you describe then this design would work, with the possible problem of confusing some implementations that assume that the masto endpoint they are touching preserves the media urls (maybe ids as well? would they have to change in this case?) on post. I think that would technically be incorrect, but I'm not sure how common such an assumption would be. Considering the post action returns the contents and you can read all that from there it shouldn't be too tempting, so I would hope this would be ok?

> Maybe i should already write down and publish the db scheme

That would be grand! Maybe I could even help with the implementation a tad. Is there already an issue for tracking this?

Author
Contributor

Waitwaitwait, again, posting too early, without reading your posts and the docs in enough detail, sorry.

So it's not actually the posting action that sets the description, but the [media update](https://docs.joinmastodon.org/methods/media/#update), right? And this returns a new, full, response about the media anyway, including possibly a changed id and url?

So for now, if I understand correctly, we might have multiple media with different ids pointing to the same file, but with different (non-EXIF) metadata (e.g. description), right? And they are pointing to the same file because of deduplication. This still means we would currently have to double the files for writing to work, but makes me much more optimistic about this working with all clients after purging unused media is added, since they kinda already have to deal with the media update potentially returning different data in the response.
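A rough sketch of that situation and of what purging unused media would need to compute. The `Media` record below is a hypothetical data model for illustration only; it is not how Akkoma stores media today, nor the schema being discussed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Media:
    id: str
    file: str                   # deduplicated, content-addressed filename
    description: Optional[str]  # alt text stored next to the file, not in it
    post_id: Optional[str]      # None for media dropped from a draft or edited out


def orphaned_files(media):
    """Files that no attached media record points to any more."""
    used = {m.file for m in media if m.post_id is not None}
    return {m.file for m in media} - used


records = [
    # Two media ids share one file but carry different descriptions.
    Media("1", "abc.png", "a red bicycle", post_id="p1"),
    Media("2", "abc.png", "ein rotes Fahrrad", post_id="p2"),
    # Original upload whose rewritten copy replaced it in the post.
    Media("3", "def.png", None, post_id=None),
]
print(orphaned_files(records))
```

With records laid out like this the check is a cheap set difference; the expensive part mentioned earlier in the thread comes from media not being tracked in such an easily scannable form.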

Member

> the media update, right?

yep (though unlike on Mastodon, currently on *oma media updates only show up in the post once the post itself is also updated. Fixing this without insane db cost also depends on changing how media objects are tracked)

> And this returns a new, full, response about the media anyway, including possibly a changed id and url?

All modifying Masto-APIs return the full updated object as if it were retrieved via its `GET` endpoint. This doesn’t mean, though, that all fields are actually allowed to change, and i suspect that assuming the ID (which is part of the API url after all) remains unchanged is common enough to cause issues in practice when broken. Afaict Mastodon never changes IDs.
E.g. it seems not too unreasonable for some clients with post drafting capabilities to already put media on the server when a draft is saved and closed (but not yet posted) and to remember just the media ids for a later refetch.

URLs probably also never change in Mastodon, but here it just doesn’t make much sense for clients to track and store them in the first place, so i expect no or only minor issues (i might be wrong though).

> we might have multiple media with different ids pointing to the same file, but with different (non-EXIF) metadata (e.g. description), right? And they are pointing to the same file because of deduplication.

yep

> > Maybe i should already write down and publish the db scheme
>
> That would be grand!

i’ll probably write it up by or on next weekend then :)

> Is there already an issue for tracking this?

I’m sure it was brought up before, but the only mention in akkoma.dev issues or PRs i can find right now is this remark by ilja in #504:

> Media is stored in the objects table. For me, that table is for objects that show on the TL's directly and at least can be fetched over AP. Uploads aren't that. Storing it in the Objects table doesn't make sense to me, neither from a SQL nor from a NoSQL perspective. However, changing that would imply DB migrations, which I don't want to do rn.

Member

> > Maybe i should already write down and publish the db scheme
>
> That would be grand! Maybe I could even help with the implementation a tad. Is there already an issue for tracking this?

@timorl now there is: #765

Reference: AkkomaGang/akkoma#746