[bug] Rich media parser crashes, resulting in 504 and no posts being shown #790

Closed
opened 2024-06-04 13:06:20 +00:00 by ahfrom · 9 comments

Your setup

OTP

Extra details

Akkoma from nixpkgs running in a Docker image. Details: https://gist.github.com/astahfrom/d75414bef384009066312a2612780fdf

Version

3.13.1

PostgreSQL version

16

What were you trying to do?

Load my timeline.

What did you expect to happen?

Posts would show up.

What actually happened?

The server returned a 504 error and my timeline was completely blank.

This has happened twice now. The first time due to this post: https://recurse.social/@lindsey/112425593807393381
The second time due to this post: https://mastodon.scot/@herdingdata/112538347267348104

For the first post, I just waited until my timeline had enough new posts that Akkoma no longer tried to display the problematic one. If I tried to seek it out, I would just not see any posts in that context. This prevented me from reading a thread of replies that I was interested in.

The second time, I used an account on a Mastodon server to find the post in question and then muted that person on my Akkoma server. This made my timeline show up again (without the problematic post).

At time of writing, recurse.social lists "v4.0.14+hometown-1.1.1" and mastodon.scot lists "v4.2.9".
I would be happy to file a bug against Mastodon if this is non-compliant behaviour, but I would also be very interested in a workaround, so my timeline doesn't just disappear because of some random post.
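
In the meantime, one stopgap (assuming the stock Pleroma/Akkoma `:rich_media` config key, which I believe controls this parser) would be to disable link previews entirely so the crashing code path is never hit:

```
# possible stopgap: turn off rich media previews entirely
# (assumes the standard :rich_media config key)
config :pleroma, :rich_media, enabled: false
```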

Logs

Logs for the https://recurse.social/@lindsey/112425593807393381 post:

```
09:03:04.184 [error] Task #PID<0.5259.0> started from #PID<0.5258.0> terminating
** (ArgumentError) scheme is required for url: wp-json/oembed/1.0/embed?url=http:%252F%252Fwww.weaselhat.com%252F2015%252F03%252F16%252Fa-refinement-type-by-any-other-name%252F
    (finch 0.16.0) lib/finch/request.ex:135: Finch.Request.parse_url/1
    (finch 0.16.0) lib/finch/request.ex:103: Finch.Request.build/5
    (tesla 1.9.0) lib/tesla/adapter/finch.ex:66: Tesla.Adapter.Finch.call/2
    (tesla 1.9.0) lib/tesla/middleware/telemetry.ex:99: Tesla.Middleware.Telemetry.call/3
    (tesla 1.9.0) lib/tesla/middleware/follow_redirects.ex:46: Tesla.Middleware.FollowRedirects.redirect/3
    (pleroma 3.13.1) lib/pleroma/web/rich_media/helpers.ex:84: Pleroma.Web.RichMedia.Helpers.rich_media_get/1
    (pleroma 3.13.1) lib/pleroma/web/rich_media/parsers/o_embed.ex:25: Pleroma.Web.RichMedia.Parsers.OEmbed.get_oembed_data/1
    (pleroma 3.13.1) lib/pleroma/web/rich_media/parsers/o_embed.ex:9: Pleroma.Web.RichMedia.Parsers.OEmbed.parse/2
Function: #Function<6.88132293/0 in Pleroma.Web.RichMedia.Parser.parse_with_timeout/1>
    Args: []
```

Logs for the https://mastodon.scot/@herdingdata/112538347267348104 post:

```
16:24:16.877 [error] Task #PID<0.1548085.0> started from #PID<0.1547880.0> terminating
** (ArgumentError) scheme is required for url: //seblee.co/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fseblee.co%2F
    (finch 0.16.0) lib/finch/request.ex:135: Finch.Request.parse_url/1
    (finch 0.16.0) lib/finch/request.ex:103: Finch.Request.build/5
    (tesla 1.9.0) lib/tesla/adapter/finch.ex:66: Tesla.Adapter.Finch.call/2
    (tesla 1.9.0) lib/tesla/middleware/telemetry.ex:99: Tesla.Middleware.Telemetry.call/3
    (tesla 1.9.0) lib/tesla/middleware/follow_redirects.ex:46: Tesla.Middleware.FollowRedirects.redirect/3
    (pleroma 3.13.1) lib/pleroma/web/rich_media/helpers.ex:84: Pleroma.Web.RichMedia.Helpers.rich_media_get/1
    (pleroma 3.13.1) lib/pleroma/web/rich_media/parsers/o_embed.ex:25: Pleroma.Web.RichMedia.Parsers.OEmbed.get_oembed_data/1
    (pleroma 3.13.1) lib/pleroma/web/rich_media/parsers/o_embed.ex:9: Pleroma.Web.RichMedia.Parsers.OEmbed.parse/2
Function: #Function<6.88132293/0 in Pleroma.Web.RichMedia.Parser.parse_with_timeout/1>
    Args: []
```
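
For what it's worth, both crashing URLs lack a scheme: the first is a relative path (`wp-json/oembed/1.0/embed?...`) and the second is protocol-relative (`//seblee.co/...`). A browser would resolve these against the page URL, but Finch rejects them outright. A minimal sketch of the kind of normalization that would avoid the crash (a hypothetical helper, not Akkoma's actual fix):

```
# Hypothetical helper, not Akkoma's actual fix: resolve scheme-less
# oEmbed hrefs against the absolute URL of the page they came from.
defmodule OEmbedUrl do
  def absolutize(href, page_url) do
    case URI.parse(href) do
      # no scheme: relative ("wp-json/...") or protocol-relative ("//host/...")
      %URI{scheme: nil} -> page_url |> URI.merge(href) |> URI.to_string()
      # already absolute: pass through untouched
      _ -> href
    end
  end
end

OEmbedUrl.absolutize("//seblee.co/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fseblee.co%2F", "https://seblee.co/")
#=> "https://seblee.co/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fseblee.co%2F"
```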

Severity

I cannot use it as easily as I'd like

Have you searched for this issue?

  • I have double-checked and have not found this issue mentioned anywhere.
ahfrom added the bug label 2024-06-04 13:06:20 +00:00

raised #791 which should address this

I would recommend, if you haven't been following my Pleroma refactor of the Rich Media stuff, that you do the same -- currently the rich media fetching is done *while the posts are being fetched to render the timeline*, which is terrible, and it's also not cached (other than in Cachex). My changes make the rich media fetching asynchronous so it doesn't block rendering timelines if we don't have it cached yet, and as soon as it's received it streams to the timelines on PleromaFE automagically. It's very nice.

https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4057
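
For anyone skimming: the shape of that change is roughly the following (a simplified sketch with made-up module names, not the MR's actual code) -- serve whatever is cached immediately, and otherwise backfill in the background instead of blocking the timeline render.

```
# Simplified sketch of the async pattern described above
# (hypothetical module names, not the MR's actual code).
def get_card(url) do
  case Cachex.get(:rich_media_cache, url) do
    {:ok, nil} ->
      # cache miss: render the timeline without a preview and backfill it
      Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
        with {:ok, card} <- Fetcher.fetch(url) do
          Cachex.put(:rich_media_cache, url, card)
          # ...then push the card out over the streaming API
        end
      end)

      :nothing

    {:ok, card} ->
      {:ok, card}
  end
end
```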

thanks for the information ^

i will take a look at it and see how it works

> I would recommend, if you haven't been following my Pleroma refactor of the Rich Media stuff, that you do the same -- currently the rich media fetching is done *while the posts are being fetched to render the timeline*, which is terrible, and it's also not cached (other than in Cachex). My changes make the rich media fetching asynchronous so it doesn't block rendering timelines if we don't have it cached yet, and as soon as it's received it streams to the timelines on PleromaFE automagically. It's very nice.
>
> https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4057

@feld we've adapted these changes, but be very wary - taking your patches verbatim led to a memory leak, with some supervised tasks never being killed and sitting there holding allocated binaries that cannot be GC'd

whilst i am not sure of the precise set of circumstances that leads to that leak, it ended up being quite severe on one of the instances we're beta-testing on

our solution is currently to move the rich media fetcher from a supervised task, where we have to worry about it ourselves, into an oban job, where we can rely on oban's inbuilt duplicate removal and timeouts to reap the processes if they stall or whatever else
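
roughly, that looks like the following (a sketch with hypothetical names, not the actual akkoma worker) - oban's uniqueness constraint deduplicates fetches for the same URL, and its timeout callback reaps stalled jobs:

```
defmodule MyApp.Workers.RichMediaBackfill do
  # sketch only: `unique` deduplicates jobs for the same URL, and
  # `timeout/1` makes oban kill the job's process if the fetch stalls
  use Oban.Worker,
    queue: :rich_media,
    max_attempts: 1,
    unique: [period: 300, keys: [:url]]

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"url" => url}}) do
    # the actual fetch-and-cache logic lives here (hypothetical call)
    RichMedia.Backfill.run(url)
  end

  @impl Oban.Worker
  def timeout(_job), do: :timer.seconds(15)
end
```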

I'm not sure why you'd have a binary leak from this, but I have solved other binary leaks

https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4060

https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4064

Also I notice you're using Finch, which will generate a new connection pool for every single server it connects to, which will leak. I'm generally only recommending Gun and Hackney now, and I've fixed Gun's connection pool (pretty sure 😅).

My server can have an uptime of a week and when it is idle the binaries are only 20MB. My graphs show binaries never crossed 50MB in the last week.

Hope this helps

point being, if your task stalls for any reason, you have an instant leak on your hands with the current implementation

but eh i've done more than my part in relaying the testing info back so 🍤

What are you referring to stalling? The HTTP request? I feel like you're being incredibly opaque

Hi, I was the one that was experiencing the binary leakage with the richmedia patches. Apparently some RichMedia.Backfill tasks were stuck on something (not even really sure what), and that caused them to stick around for a very long time. After around 11 hours, the total amount of memory occupied by binaries grew to 2.8 GiB.

Forcing a GC sweep with `:recon.bin_leak` didn't seem to do much. It was only after @floatingghost moved the tasks to oban that this problem stopped occurring.

Does Gun or Hackney just maintain one connection pool for all hosts?

If you could reproduce it again it would be good to know what, e.g., the top 10 binary PIDs are with their full state:

```
# don't truncate the output
IEx.configure(inspect: [limit: :infinity])

# dump the data for the top 10 processes using binaries
:recon.proc_count(:binary_memory, 10)
|> Enum.map(fn {pid, _, _} -> :sys.get_status(pid) end)
```

edit: `:sys.get_status/1` includes the state, so that seems better

and then we can better see exactly what they're stuck on

Gun ~~and Hackney~~ do not have their own connection pooling logic. Gun exposes everything required to be used with a custom connection pool and we have that code in Pleroma, but it appears it was fully removed from Akkoma. ~~Hackney is being used in Pleroma with no connection pool at all; the process will exit when the connection is done.~~ I forgot Hackney has some connection pool stuff, but it's kind of opaque and I'm not sure how or if it's being used via Tesla. We have some Hackney pools defined in our configuration but nowhere in our code do we call `:hackney_pool`. I've found in the Tesla code that you can define the adapter as `{Tesla.Adapter.Hackney, pool: :my_pool}`. I'll have to do more research on that...

Finch is really only meant to be used for making requests to API endpoints and not to random HTTP servers across the internet. Finch starts a new connection pool for every `{scheme, host, port}` it has to make a request to.

https://elixirforum.com/t/massive-increase-in-binaries-using-finch/36392/15
