Read image description from EXIF data #744

Merged
floatingghost merged 10 commits from timorl/akkoma:elseinspe into develop 2024-04-25 12:52:32 +00:00
Contributor

This is just #241 with the current develop merged in. I think it would be a pretty rad feature to have and I'll gladly go through review with this PR.

After this is done I might do the next part, with writing EXIF data as well, as @ilja intended, but no guarantees. <_<

This is just #241 with the current develop merged in. I think it would be a pretty rad feature to have and I'll gladly go through review with this PR. After this is done I might do the next part, with writing EXIF data as well, as @ilja intended, but no guarantees. <_<
timorl added 4 commits 2024-04-15 15:45:37 +00:00
During attachment upload Pleroma returns a "description" field.

* This MR allows Pleroma to read the EXIF data during upload and return the description to the FE using this field.
    * If a description is already present (e.g. because a previous module added it), it will use that
    * Otherwise it will read from the EXIF data. First it will check -ImageDescription, if that's empty, it will check -iptc:Caption-Abstract
    * If no description is found, it will simply return nil, which is the default value
* When people set up a new instance, they will be asked if they want to read metadata and this module will be activated if so

There was an Exiftool module, which has now been renamed to Exiftool.StripLocation
Descriptions from exif data with only whitespeces are considered empty
Some checks are pending
ci/woodpecker/pr/woodpecker Pipeline is pending
66a04cead3
I noticed that pictures taken with Ubuntu-Touch have whitespace in one of the fields
This should just be ignored imo
Merge branch 'develop' into elseinspe
Some checks failed
ci/woodpecker/pr/lint Pipeline failed
ci/woodpecker/pr/test unknown status
ci/woodpecker/pr/build-arm64 unknown status
ci/woodpecker/pr/build-amd64 unknown status
ci/woodpecker/pr/docs unknown status
b144218dce
Member

fyi, if #745 gets merged the renaming of the existing Exiftool filter here as well as the order of filters generated by the instance mix tasks will need a little tweaking
(i assume if #745 gets merged at all, it will be before this, as renaming the filter twice in short succession doesn’t make much sense)

Looking ahead a bit:

writing EXIF data as well

hmm, a single image file might be attached to multiple posts by different people. I guess Dedupe being the final upload filter will ensure different alt texts mean different file hashes and thus no conflict.

But now editing the alt text will change file content, hash and thus the upload URL which previously never happened; this should be fine MastoAPI-wise, but probably needs some additional tweaks in our. Unless Mastodon also changes media ids on update, I’m assuming changing the id might cause issues with some clients. But if we just update the existing attachment objects, this will leave orphaned upload files without an associated attachment object. Cleaning old files during updates on the other hand entails querying all attachments to check if anything uses the same file; might be costly.

fyi, if #745 gets merged the renaming of the existing Exiftool filter here as well as the order of filters generated by the `instance` mix tasks will need a little tweaking *(i assume if #745 gets merged at all, it will be before this, as renaming the filter twice in short succession doesn’t make much sense)* Looking ahead a bit: > writing EXIF data as well hmm, a single image file might be attached to multiple posts by different people. I guess `Dedupe` being the final upload filter will ensure different alt texts mean different file hashes and thus no conflict. But now editing the alt text will change file content, hash and thus the upload URL which previously never happened; this should be fine MastoAPI-wise, but probably needs some additional tweaks in our. Unless Mastodon also changes media ids on update, I’m assuming changing the id might cause issues with some clients. But if we just update the existing attachment objects, this will leave orphaned upload files without an associated attachment object. Cleaning old files during updates on the other hand entails querying _all_ attachments to check if anything uses the same file; might be costly.

this fails lint so CI can't run tests over it - can you run mix format on this branch? thankies

this fails lint so CI can't run tests over it - can you run `mix format` on this branch? thankies
timorl added 2 commits 2024-04-16 18:37:17 +00:00
Rename StripLocation to StripMetadata for temporal-proofing reasons
Some checks are pending
ci/woodpecker/pr/build-amd64 Pipeline is pending
ci/woodpecker/pr/build-arm64 Pipeline is pending
ci/woodpecker/pr/docs Pipeline is pending
ci/woodpecker/pr/lint Pipeline is pending
ci/woodpecker/pr/test Pipeline is pending
cd7af81896
Author
Contributor

Ran the formatter (sorry!) although it touched a couple of files I'm pretty sure don't actually fall under this PR.

Renamed StripLocation to StripMetadata, technically true already and definitely true after #745. As for the order of operations – wouldn't it be better to make the stripping more configurable and just make it skip over ImageDescription if metadata reading/writing is enabled? I'm not sure about exposing the configuration to users – on the one hand I also don't see why anyone would want to strip only GPS metadata (as you rightly point out in the other PR), on the other the metadata tags are not all for tracking purposes and I imagine there might be some e.g. photographers who would like to preserve at least a subset of them when uploading. Maybe having a way do define "don't strip tags from this list" would be a good idea?

As for the writing part – I'll create an issue to discuss this to avoid creating chaos here. Definitely thank you for the points, I've been wondering what's so much harder about that part than the reading part and now I'm starting to get an understanding.

Ran the formatter (sorry!) although it touched a couple of files I'm pretty sure don't actually fall under this PR. Renamed `StripLocation` to `StripMetadata`, technically true already and definitely true after #745. As for the order of operations – wouldn't it be better to make the stripping more configurable and just make it skip over `ImageDescription` if metadata reading/writing is enabled? I'm not sure about exposing the configuration to users – on the one hand I also don't see why anyone would want to strip only GPS metadata (as you rightly point out in the other PR), on the other the metadata tags are not all for tracking purposes and I imagine there might be some e.g. photographers who would like to preserve at least a subset of them when uploading. Maybe having a way do define "don't strip tags from this list" would be a good idea? As for the writing part – I'll create an issue to discuss this to avoid creating chaos here. Definitely thank you for the points, I've been wondering what's so much harder about that part than the reading part and now I'm starting to get an understanding.
Member

wouldn't it be better to make the stripping more configurable and just make it skip over ImageDescription if metadata reading/writing is enabled?

If the alt text get’s edited or the content of the different description metadata fields differs, skipping stripping can leave unintended data exposed

As for metadata stripping being more configurable in general (but independent of other loaded filters) to allow preserving/deleting specific tag (groups) this is a good point. It would probably be easiest to expose a purge and preserve string list which will just get mapped directly to exiftool commandline parameters. The drawback being misspelling something here will result in all uploads failing

> wouldn't it be better to make the stripping more configurable and just make it skip over ImageDescription if metadata reading/writing is enabled? If the alt text get’s edited or the content of the different description metadata fields differs, skipping stripping can leave unintended data exposed As for metadata stripping being more configurable in general (but independent of other loaded filters) to allow preserving/deleting specific tag (groups) this is a good point. It would probably be easiest to expose a `purge` and `preserve` string list which will just get mapped directly to exiftool commandline parameters. The drawback being misspelling something here will result in all uploads failing
Member

Ran the formatter (sorry!) although it touched a couple of files I'm pretty sure don't actually fall under this PR.

running mix format on this PR’s HEAD reverts some of the formatting changes for me:

diff --git a/lib/pleroma/upload/filter/exiftool/read_description.ex b/lib/pleroma/upload/filter/exiftool/read_description.ex
index 03d698a81..5dfe60cd8 100644
--- a/lib/pleroma/upload/filter/exiftool/read_description.ex
+++ b/lib/pleroma/upload/filter/exiftool/read_description.ex
@@ -33,7 +33,10 @@ defp read_when_empty(current_description, _, _) when is_binary(current_descripti
   defp read_when_empty(_, file, tag) do
     try do
       {tag_content, 0} =
-        System.cmd("exiftool", ["-b", "-s3", tag, file], stderr_to_stdout: true, parallelism: true)
+        System.cmd("exiftool", ["-b", "-s3", tag, file],
+          stderr_to_stdout: true,
+          parallelism: true
+        )
 
       tag_content = String.trim(tag_content)
 
diff --git a/lib/pleroma/web/plugs/o_auth_scopes_plug.ex b/lib/pleroma/web/plugs/o_auth_scopes_plug.ex
index f017c8bc7..e4d098a7d 100644
--- a/lib/pleroma/web/plugs/o_auth_scopes_plug.ex
+++ b/lib/pleroma/web/plugs/o_auth_scopes_plug.ex
@@ -34,7 +34,9 @@ def perform(%Plug.Conn{assigns: assigns} = conn, %{scopes: scopes} = options) do
         permissions = Enum.join(missing_scopes, " #{op} ")
 
         error_message =
-          dgettext("errors", "Insufficient permissions: %{permissions}.", permissions: permissions)
+          dgettext("errors", "Insufficient permissions: %{permissions}.",
+            permissions: permissions
+          )
 
         conn
         |> put_resp_content_type("application/json")

And running it on the commit just before formatting was applied only touches stuff from this PR for me:

diff --git a/lib/pleroma/upload/filter/exiftool/read_description.ex b/lib/pleroma/upload/filter/exiftool/read_description.ex
index 03d698a81..5dfe60cd8 100644
--- a/lib/pleroma/upload/filter/exiftool/read_description.ex
+++ b/lib/pleroma/upload/filter/exiftool/read_description.ex
@@ -33,7 +33,10 @@ defp read_when_empty(current_description, _, _) when is_binary(current_descripti
   defp read_when_empty(_, file, tag) do
     try do
       {tag_content, 0} =
-        System.cmd("exiftool", ["-b", "-s3", tag, file], stderr_to_stdout: true, parallelism: true)
+        System.cmd("exiftool", ["-b", "-s3", tag, file],
+          stderr_to_stdout: true,
+          parallelism: true
+        )
 
       tag_content = String.trim(tag_content)
 
diff --git a/test/mix/tasks/pleroma/instance_test.exs b/test/mix/tasks/pleroma/instance_test.exs
index a2b73e0c8..d650bdf7c 100644
--- a/test/mix/tasks/pleroma/instance_test.exs
+++ b/test/mix/tasks/pleroma/instance_test.exs
@@ -93,8 +93,10 @@ test "running gen" do
     assert generated_config =~ "password: \"dbpass\""
     assert generated_config =~ "configurable_from_database: true"
     assert generated_config =~ "http: [ip: {127, 0, 0, 1}, port: 4000]"
+
     assert generated_config =~
              "filters: [Pleroma.Upload.Filter.Exiftool.StripLocation, Pleroma.Upload.Filter.Exiftool.ReadDescription]"
+
     assert generated_config =~ "base_url: \"https://media.pleroma.social/media\""
     assert File.read!(tmp_path() <> "setup.psql") == generated_setup_psql()
     assert File.exists?(Path.expand("./test/instance/static/robots.txt"))

For reference this is with elixir/mix 1.15 and OTP26 (though since CI tests multiple elixir versions, this alone shouldn’t be the cause).
I’m not sure what stuff mix format picks up, but do you have anything which might influence it like a .editorconfig file etc?

> Ran the formatter (sorry!) although it touched a couple of files I'm pretty sure don't actually fall under this PR. running mix format on this PR’s HEAD reverts some of the formatting changes for me: ```diff diff --git a/lib/pleroma/upload/filter/exiftool/read_description.ex b/lib/pleroma/upload/filter/exiftool/read_description.ex index 03d698a81..5dfe60cd8 100644 --- a/lib/pleroma/upload/filter/exiftool/read_description.ex +++ b/lib/pleroma/upload/filter/exiftool/read_description.ex @@ -33,7 +33,10 @@ defp read_when_empty(current_description, _, _) when is_binary(current_descripti defp read_when_empty(_, file, tag) do try do {tag_content, 0} = - System.cmd("exiftool", ["-b", "-s3", tag, file], stderr_to_stdout: true, parallelism: true) + System.cmd("exiftool", ["-b", "-s3", tag, file], + stderr_to_stdout: true, + parallelism: true + ) tag_content = String.trim(tag_content) diff --git a/lib/pleroma/web/plugs/o_auth_scopes_plug.ex b/lib/pleroma/web/plugs/o_auth_scopes_plug.ex index f017c8bc7..e4d098a7d 100644 --- a/lib/pleroma/web/plugs/o_auth_scopes_plug.ex +++ b/lib/pleroma/web/plugs/o_auth_scopes_plug.ex @@ -34,7 +34,9 @@ def perform(%Plug.Conn{assigns: assigns} = conn, %{scopes: scopes} = options) do permissions = Enum.join(missing_scopes, " #{op} ") error_message = - dgettext("errors", "Insufficient permissions: %{permissions}.", permissions: permissions) + dgettext("errors", "Insufficient permissions: %{permissions}.", + permissions: permissions + ) conn |> put_resp_content_type("application/json") ``` And running it on the commit just before formatting was applied only touches stuff from this PR for me: ```diff diff --git a/lib/pleroma/upload/filter/exiftool/read_description.ex b/lib/pleroma/upload/filter/exiftool/read_description.ex index 03d698a81..5dfe60cd8 100644 --- a/lib/pleroma/upload/filter/exiftool/read_description.ex +++ b/lib/pleroma/upload/filter/exiftool/read_description.ex @@ -33,7 +33,10 @@ defp read_when_empty(current_description, _, _) when is_binary(current_descripti defp read_when_empty(_, file, tag) do try do {tag_content, 0} = - System.cmd("exiftool", ["-b", "-s3", tag, file], stderr_to_stdout: true, parallelism: true) + System.cmd("exiftool", ["-b", "-s3", tag, file], + stderr_to_stdout: true, + parallelism: true + ) tag_content = String.trim(tag_content) diff --git a/test/mix/tasks/pleroma/instance_test.exs b/test/mix/tasks/pleroma/instance_test.exs index a2b73e0c8..d650bdf7c 100644 --- a/test/mix/tasks/pleroma/instance_test.exs +++ b/test/mix/tasks/pleroma/instance_test.exs @@ -93,8 +93,10 @@ test "running gen" do assert generated_config =~ "password: \"dbpass\"" assert generated_config =~ "configurable_from_database: true" assert generated_config =~ "http: [ip: {127, 0, 0, 1}, port: 4000]" + assert generated_config =~ "filters: [Pleroma.Upload.Filter.Exiftool.StripLocation, Pleroma.Upload.Filter.Exiftool.ReadDescription]" + assert generated_config =~ "base_url: \"https://media.pleroma.social/media\"" assert File.read!(tmp_path() <> "setup.psql") == generated_setup_psql() assert File.exists?(Path.expand("./test/instance/static/robots.txt")) ``` For reference this is with elixir/mix 1.15 and OTP26 *(though since CI tests multiple elixir versions, this alone shouldn’t be the cause)*. I’m not sure what stuff `mix format` picks up, but do you have anything which might influence it like a `.editorconfig` file etc?

the lint step in particular runs under 1.15/otp26 on ci (I should bump to 1.16 but I don't think formatting changes between them)

the lint step in particular runs under 1.15/otp26 on ci (I should bump to 1.16 but I don't think formatting changes between them)
timorl added 3 commits 2024-04-19 18:52:07 +00:00
Read description before stripping metadata
Some checks are pending
ci/woodpecker/pr/build-amd64 Pipeline is pending
ci/woodpecker/pr/build-arm64 Pipeline is pending
ci/woodpecker/pr/docs Pipeline is pending
ci/woodpecker/pr/lint Pipeline is pending
ci/woodpecker/pr/test Pipeline is pending
09d3ccf770
Author
Contributor

I should bump to 1.16

No need to shame me so hard for being only on 1.14. ;-;

Anyway, I updated to 1.15 and the formatting seems better now. Although I get a lot of tests failing, both on my branch as well as on develop (this was already the case for 1.14, so no change there). No idea what that is about, but I'm pretty sure none of these are related to the changes in this PR.

I also changed the order of the filter application (I think?). I'm pretty sure the purge and preserve lists will still eventually be necessary, when we add writing, but it seems that won't happen for a while.

The drawback being misspelling something here will result in all uploads failing

Silly question – considering that the config seems to just be code, and running tests seems relatively simple, perhaps we could just add tests that check passing a single image through the configured filters and immediately fail on startup? Also, maybe this discussion should actually be happening in the other PR. <_<"

> I should bump to 1.16 No need to shame me _so_ hard for being only on 1.14. ;-; Anyway, I updated to 1.15 and the formatting seems better now. Although I get a lot of tests failing, both on my branch as well as on `develop` (this was already the case for 1.14, so no change there). No idea what that is about, but I'm pretty sure none of these are related to the changes in this PR. I also changed the order of the filter application (I think?). I'm pretty sure the `purge` and `preserve` lists will still eventually be necessary, when we add writing, but it seems that won't happen for a while. > The drawback being misspelling something here will result in all uploads failing Silly question – considering that the config seems to just be code, and running tests seems relatively simple, perhaps we could just add tests that check passing a single image through the configured filters and immediately fail on startup? Also, maybe this discussion should actually be happening in the other PR. <_<"
Member

huh, so the Elixir version can affect formating results `^^

Anyway, I updated to 1.15 and the formatting seems better now. Although I get a lot of tests failing, both on my branch as well as on develop (this was already the case for 1.14, so no change there). No idea what that is about, but I'm pretty sure none of these are related to the changes in this PR.

A few tests are flaky (mostly emoji stuff and "only federate to reachable instances") and occasionally fail eventhough nothing is wrong. Rerunning just those on their own will show if somethings wrong; i.e. mix test <path to test source files>...
But if you get consistent and many failures that’s probably not it. Do you have all optional dependencies installed (e.g. ffmpeg)? Tests include all optional paths and if you’re missing stuff it won’t work

fwiw running tests on you’re branch (and after excluding one flaky fail) there’s only one real failure:
test/mix/tasks/pleroma/instance_test.exs:32 :: test running gen (Mix.Tasks.Pleroma.InstanceTest) still checks for eadDescriptioncoming _after_StripMetadata` and now fails.

huh, so the Elixir version can affect formating results `^^ > Anyway, I updated to 1.15 and the formatting seems better now. Although I get a lot of tests failing, both on my branch as well as on develop (this was already the case for 1.14, so no change there). No idea what that is about, but I'm pretty sure none of these are related to the changes in this PR. A few tests are flaky (mostly emoji stuff and "only federate to reachable instances") and occasionally fail eventhough nothing is wrong. Rerunning just those on their own will show if somethings wrong; i.e. `mix test <path to test source files>...` But if you get consistent and many failures that’s probably not it. Do you have all optional dependencies installed (e.g. ffmpeg)? Tests include all optional paths and if you’re missing stuff it won’t work fwiw running tests on you’re branch (and after excluding one flaky fail) there’s only one real failure: `test/mix/tasks/pleroma/instance_test.exs:32 :: test running gen (Mix.Tasks.Pleroma.InstanceTest)` still checks for eadDescription` coming _after_ `StripMetadata` and now fails.
timorl added 1 commit 2024-04-21 17:43:37 +00:00
Fix the one test that wasn't just being flaky
Some checks failed
ci/woodpecker/pr/lint Pipeline was successful
ci/woodpecker/pr/test Pipeline was successful
ci/woodpecker/pr/build-arm64 unknown status
ci/woodpecker/pr/build-amd64 unknown status
ci/woodpecker/pr/docs unknown status
3f54945033
Author
Contributor

Quite a couple of these tests seem flaky and the logs being thrown around during test execution don't help with clarity. ._. Anyway, I fixed the test that was actually wrong (thanks for the pointer!) and after a few tries even got a "0 failures", so should now be fine.

Quite a couple of these tests seem flaky and the logs being thrown around during test execution don't help with clarity. ._. Anyway, I fixed the test that was actually wrong (thanks for the pointer!) and after a few tries even got a "0 failures", so should now be fine.
floatingghost merged commit b1c6621e66 into develop 2024-04-25 12:52:32 +00:00
floatingghost deleted branch elseinspe 2024-04-25 12:52:32 +00:00
Member

I had a feeling i fixed something while porting the rename to #745 but forgot what. Now I (too late, sorry) remember:
The migration id is still 20220220135625; this works fine while applying, but messes up the order when rolling migrations back

Since rename action is idempotent I think we can still just rename it to 20240425120000_. I guess to not break rolling back for everyone who already applied this we’ll also need a noop migration under the 2022 id

I had a feeling i fixed something while porting the rename to #745 but forgot what. Now I (too late, sorry) remember: The migration id is still `20220220135625`; this works fine while applying, but messes up the order when rolling migrations back Since rename action is idempotent I think we can still just rename it to `20240425120000_`. I guess to not break rolling back for everyone who already applied this we’ll also need a noop migration under the 2022 id
Sign in to join this conversation.
No description provided.