Writing image descriptions into EXIF tags #746
Reference: AkkomaGang/akkoma#746
As mentioned in #744, it would be nice to be able to automatically write media descriptions into the EXIF data of image files, so that anyone who downloads an image and re-uploads it gets the description filled in immediately. This issue is mostly for discussing some details of implementing this idea and what the precise behaviour should be.
@Oneric already mentioned quite a few possible technical challenges in this comment. I've only started contributing and I kinda have no idea what I'm doing, so below I'll write some stuff with the hope that more knowledgeable people tell me where I'm being silly – if not, I'll just read the code.
As for precise ideas – initially I thought this could be done as a message filter on all posts, but that cannot work, since most media is not being kept locally anyway, right? So the functionality has to work only for in-instance posts?
The writing of the tag should happen after the user clicks "Post", so I assume this won't be quite in the same spot as the reading, since that happens on file upload. Now, if I understood @Oneric's comment correctly, the file that gets uploaded is already on the server before the post gets made. Does it stay there if the user aborts the process of writing the post, or is it somehow garbage-collected? If the latter, could that mechanism help with the rewriting that we would like to do, by removing the original upload after the post is made, and the EXIF data modified + hash/name changed?
(Alternative approach that I don't like, but mentioning it for completeness – we could hash just the contents of the file without (most of) the EXIF metadata and use that as a filename. This is complicated and could behave very weirdly with multiple people posting the same image with different descriptions, so bleh. Speaking of multiple posts of the same image – I don't think this and `Dedup` will be much of an issue; people probably either post the exact same image relatively rarely, or use versions of it downloaded from other posts, and with this feature being complete they would likely just keep the auto-filled image descriptions.)
One more minor point – I would assume that if the file already has an `ImageDescription` tag, then we don't want to touch it, right? Or is overwriting it a good idea for some reason?
Yep. If `:media_proxy` is enabled, a cache of remote files can be kept, but this cache is managed by nginx and directly corresponds to remote content. With mediaproxy preview we potentially also preprocess remote content (and cache it), but this only takes the media as input and doesn’t know anything about its associated post. In fact, here too a single media file might be associated with multiple distinct remote posts with differing alt text.
Yup, that’s a result of how the Mastodon API works. When you post a new status with attachments, you must include an array of `media_ids`. To get media ids you must first upload media to the server. After uploading you can update media to change alt text etc.
Meaning what actually happens when you draft a new post: each attachment gets immediately uploaded without any alt text or title. Once you hit the post button, multiple API requests are then made to update media info and finally actually create the post.
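As a rough illustration of the request sequence described above, here is a minimal Python sketch that only builds the list of calls a client would make; it sends nothing over the network. The endpoint paths are the ones from the Mastodon API (media upload, media update, status creation). The `Request` class, the `plan_post_requests` helper and the `media-N` ids are hypothetical and exist only for this example; real ids come back in the upload response.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    method: str
    path: str
    params: dict = field(default_factory=dict)


def plan_post_requests(status_text, attachments):
    """Return the ordered API calls for posting `status_text`.

    `attachments` is a list of (filename, alt_text) pairs.
    """
    requests = []
    media_ids = []

    # 1. While drafting: every file is uploaded right away, without alt text.
    for index, (filename, _alt) in enumerate(attachments, start=1):
        requests.append(Request("POST", "/api/v2/media", {"file": filename}))
        media_ids.append(f"media-{index}")  # placeholder for the server-assigned id

    # 2. After hitting "Post": alt text is set through separate media updates.
    for media_id, (_filename, alt) in zip(media_ids, attachments):
        if alt:
            requests.append(
                Request("PUT", f"/api/v1/media/{media_id}", {"description": alt})
            )

    # 3. Finally the status is created, referencing the media only by id.
    requests.append(
        Request(
            "POST",
            "/api/v1/statuses",
            {"status": status_text, "media_ids": media_ids},
        )
    )
    return requests


for request in plan_post_requests("hello", [("cat.jpg", "A cat on a sofa")]):
    print(request.method, request.path)
```

The point for this issue is step 2: the description only reaches the server after the file has already been stored, so any EXIF writing would have to happen on the media update (or later), not on upload.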
Ideally it would get deleted at some point :)
In practice (i thought this was already hinted at in ilja’s original PR, but nope; must’ve been somewhere else, sorry), media are currently tracked in a suboptimal form in our database. As a result, operations like checking whether media are actually used are rather expensive.
If you enabled the relevant config option, media will get deleted together with their post, but any media which already gets scrapped in draft status or is later edited out stays (unless manual action is taken).
I’ve been thinking of how to split out media and files into separate and easily scannable tables, but realistically it will be at least several months before i actually get around to implementing a large scale data migration like this. (Maybe i should already write down and publish the db schema i have in mind anyway, so it can be discussed and mayhaps even picked up by someone else earlier.)
If implemented this would indeed help with garbage collecting older versions of uploads whose alt text was modified
(given how things are stored, i suspect our posts also don’t get updated alt text if you only use the `/media` API, but only once you also issue a `/status` update, while in Mastodon using only `/media` is presumably enough)
If the alt text gets changed, metadata should be overwritten as well; the original description can differ from the edited alt text. E.g. the poster may have translated the original description from metadata into another language.
yeah no; as you already pointed out this will cause issues if pure image data matches but metadata differs, and leaking data from one user into another user’s context is never a good idea
Ha, good points in there and I'm happy you made them, since that actually makes things easier.
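To make the filename collision concrete, here is a small Python sketch comparing the two naming schemes. It assumes uploads are named by the SHA-256 of their bytes (the choice of hash is an assumption made for this example) and that the description lives in a JPEG APP1 segment. `strip_app1`, `full_name`, `pixel_name` and `fake_jpeg` are hypothetical helpers; `fake_jpeg` produces synthetic marker segments, not a decodable image.

```python
import hashlib

SOI = b"\xff\xd8"   # start of image
APP1 = b"\xff\xe1"  # segment that carries EXIF data
SOS = b"\xff\xda"   # start of scan: compressed image data follows


def strip_app1(jpeg: bytes) -> bytes:
    """Return the JPEG bytes with every APP1 segment before the scan removed."""
    if jpeg[:2] != SOI:
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(jpeg):
        marker = jpeg[pos:pos + 2]
        if marker == SOS:
            break
        # The 2-byte big-endian length counts itself but not the marker.
        length = int.from_bytes(jpeg[pos + 2:pos + 4], "big")
        if marker != APP1:
            out += jpeg[pos:pos + 2 + length]
        pos += 2 + length
    out += jpeg[pos:]  # scan data and everything after it is copied unchanged
    return bytes(out)


def full_name(data: bytes) -> str:
    """Name derived from the whole file, metadata included."""
    return hashlib.sha256(data).hexdigest()


def pixel_name(data: bytes) -> str:
    """Name derived from the file with EXIF stripped (the rejected idea)."""
    return hashlib.sha256(strip_app1(data)).hexdigest()


def fake_jpeg(description: str) -> bytes:
    """Build synthetic JPEG-like bytes with `description` inside APP1."""
    payload = b"Exif\x00\x00" + description.encode("utf-8")
    app1 = APP1 + (len(payload) + 2).to_bytes(2, "big") + payload
    dqt = b"\xff\xdb" + (5).to_bytes(2, "big") + b"\x01\x02\x03"
    scan = SOS + b"\x00\x02" + b"pixels" + b"\xff\xd9"
    return SOI + app1 + dqt + scan


alice = fake_jpeg("A cat on a sofa")
bob = fake_jpeg("Un chat sur un canapé")
print(full_name(alice) == full_name(bob))    # False
print(pixel_name(alice) == pixel_name(bob))  # True
```

With metadata-free hashing both uploads map to the same name, so whichever file is written last decides which description everyone downloading that file sees. Hashing the whole file keeps the two users' versions apart.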
My general feeling is that the best we could do for now would be to first upload the media and then create a copy with overwritten contents on post creation. This would essentially double the media library size for every instance, which doesn't sound acceptable. :/
If we had the media purging capabilities you describe then this design would work, with the possible problem of confusing some implementations that assume that the masto endpoint they are touching preserves the media urls (maybe ids as well? would they have to change in this case?) on post. I think that would technically be incorrect, but I'm not sure how common such an assumption would be. Considering the post action returns the contents and you can read all that from there it shouldn't be too tempting, so I would hope this would be ok?
That would be grand! Maybe I could even help with the implementation a tad. Is there already an issue for tracking this?
Waitwaitwait, again, posting too early, without reading your posts and the docs in enough detail, sorry.
So it's not actually the posting action that sets the description, but the media update, right? And this returns a new, full, response about the media anyway, including possibly a changed id and url?
So for now, if I understand correctly, we might have multiple media with different ids pointing to the same file, but with different (non-EXIF) metadata (e.g. description), right? And they are pointing to the same file because of deduplication. This still means we would have to currently double the files for writing to work, but makes me much more optimistic about this working with all clients after purging unused media is added, since they kinda already have to deal with the media update potentially returning different data in the response.
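The situation described here can be modelled with a toy in-memory store, sketched below in Python. Nothing in it is Akkoma code: `MediaStore`, `embed_description` and the marker-based "metadata" are hypothetical stand-ins, and SHA-256 file naming is an assumption. It shows several media ids sharing one deduplicated file, a description update writing a new file while keeping the id stable, and a purge step that removes files no media record points to any more.

```python
import hashlib

MARKER = b"\x00DESC:"  # toy stand-in for a real EXIF ImageDescription tag


def embed_description(data: bytes, description: str) -> bytes:
    """Return `data` with any previous toy description replaced."""
    base = data.split(MARKER)[0]
    return base + MARKER + description.encode("utf-8")


class MediaStore:
    def __init__(self):
        self.files = {}  # filename (content hash) -> bytes
        self.media = {}  # media id -> {"file": filename, "description": str}
        self._next_id = 1

    def _store(self, data: bytes) -> str:
        name = hashlib.sha256(data).hexdigest()
        self.files[name] = data  # identical bytes share a single entry
        return name

    def upload(self, data: bytes) -> str:
        media_id = str(self._next_id)
        self._next_id += 1
        self.media[media_id] = {"file": self._store(data), "description": ""}
        return media_id

    def update_description(self, media_id: str, description: str) -> dict:
        """Write a rewritten copy of the file; the media id stays the same."""
        record = self.media[media_id]
        rewritten = embed_description(self.files[record["file"]], description)
        record["file"] = self._store(rewritten)
        record["description"] = description
        return record

    def purge_unreferenced(self) -> list:
        """Delete files that no media record points to; return their names."""
        used = {record["file"] for record in self.media.values()}
        removed = [name for name in self.files if name not in used]
        for name in removed:
            del self.files[name]
        return removed


store = MediaStore()
first = store.upload(b"same image bytes")
second = store.upload(b"same image bytes")
print(len(store.files))  # 1
store.update_description(first, "A cat on a sofa")
store.update_description(second, "Un chat sur un canapé")
print(len(store.files))  # 3
print(len(store.purge_unreferenced()))  # 1
```

Until something like `purge_unreferenced` exists, every described upload leaves its undescribed original behind, which is the doubling of the media library mentioned earlier in the thread.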
yep (though unlike on Mastodon, currently on *oma media updates only show up in the post, once the post itself is also updated. fixing this without insane db cost also depends on changing how media objects are tracked)
All modification Masto-APIs return the full updated object as if it were retrieved via its `GET` endpoint. This doesn’t mean though all fields are actually allowed to change, and i suspect assuming the ID (which is part of the API url after all) remains unchanged is common enough to cause issues in practice when broken. Afaict Mastodon never changes IDs.
E.g. it seems not too unreasonable for some clients with post drafting capabilities to upload or update media on the server as soon as a draft is saved and closed (but not yet posted) and remember just the media ids for later refetch.
URLs probably also never change in Mastodon, but here it just doesn’t make much sense for clients to track and store them in the first place so i expect no or only minor issues (i might be wrong though).
yep
i’ll probably write it up by or on next weekend then :)
I’m sure it was brought up before, but the only mention in akkoma.dev issues or PRs i can find right now is this remark by ilja in #504:
@timorl now there is: #765