[bug] Rich media parser crashes, resulting in 504 and no posts being shown #790
Your setup
OTP
Extra details
Akkoma from nixpkgs running in a Docker image. Details: https://gist.github.com/astahfrom/d75414bef384009066312a2612780fdf
Version
3.13.1
PostgreSQL version
16
What were you trying to do?
Load my timeline.
What did you expect to happen?
Posts would show up.
What actually happened?
The server returned a 504 error and my timeline was completely blank.
This has happened twice now. The first time due to this post: https://recurse.social/@lindsey/112425593807393381
The second time due to this post: https://mastodon.scot/@herdingdata/112538347267348104
For the first post, I just waited until my timeline had enough new posts that Akkoma no longer tried to display the problematic one. If I tried to seek it out, I would just not see any posts in that context. This prevented me from reading a thread of replies that I was interested in.
The second time, I used an account on a Mastodon server to find the post in question and then muted that person on my Akkoma server. This made my timeline show up again (without the problematic post).
At time of writing, recurse.social lists "v4.0.14+hometown-1.1.1" and mastodon.scot lists "v4.2.9".
I would be happy to file a bug against Mastodon if this is non-compliant behaviour, but I would also be very interested in a workaround, so my timeline doesn't just disappear because of some random post.
Logs
Severity
I cannot use it as easily as I'd like
Have you searched for this issue?
raised #791 which should address this
If you haven't been following my Pleroma refactor of the Rich Media stuff, I'd recommend taking a look and doing the same -- currently the rich media parsing is done while the posts are being fetched to render the timeline, which is terrible, and it's also not cached anywhere other than Cachex. My changes make the rich media fetch asynchronous so it doesn't block rendering timelines if we don't have it cached yet, and as soon as a card is received it's streamed to the timelines in PleromaFE automagically. It's very nice.
https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4057
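For readers following along, here is a rough sketch of the flow described above; the module, cache, and PubSub names are made up for illustration and are not the MR's actual code:

```elixir
defmodule MyApp.RichMedia.Async do
  # Hypothetical illustration: return a cached card immediately, otherwise
  # fetch it in the background and stream it out once it arrives, so the
  # timeline render never blocks on a remote HTTP request.
  def get_card(url) do
    case Cachex.get(:rich_media_cache, url) do
      {:ok, nil} ->
        Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
          with {:ok, card} <- MyApp.RichMedia.Parser.parse(url) do
            Cachex.put(:rich_media_cache, url, card)
            # push the card to connected clients instead of blocking the render
            Phoenix.PubSub.broadcast(MyApp.PubSub, "rich_media:#{url}", {:card, url, card})
          end
        end)

        :fetching

      {:ok, card} ->
        {:ok, card}
    end
  end
end
```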
thanks for the information ^
i will take a look at it and see how it works
@feld we've adapted these changes, but be very wary - taking your patches verbatim led to a memory leak with some supervised tasks not being killed and sitting there with allocated binaries that cannot be GC'd
whilst i am not sure as to the precise set of circumstances that can lead to that leak, it ended up being quite severe on one of the instances we're beta-testing on
our solution is currently to move the rich media fetcher from a supervised task, where we have to worry about it, into an oban job, so we can rely on its inbuilt duplicate removal and timeouts to reap the processes if they stall or whatever else
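A minimal sketch of what that Oban-based approach can look like (module, queue, and cache names here are illustrative, not Akkoma's actual code):

```elixir
defmodule MyApp.RichMedia.BackfillWorker do
  use Oban.Worker,
    queue: :rich_media,
    max_attempts: 3,
    # Oban's uniqueness constraint drops duplicate jobs for the same URL
    unique: [period: 300]

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"url" => url}}) do
    with {:ok, card} <- MyApp.RichMedia.Parser.parse(url) do
      Cachex.put(:rich_media_cache, url, card)
    end

    :ok
  end

  # Jobs running past this are killed, so a stalled HTTP request can't sit
  # around holding binaries the way an unsupervised task can.
  @impl Oban.Worker
  def timeout(_job), do: :timer.seconds(15)
end
```

Enqueuing is then just `%{url: url} |> MyApp.RichMedia.BackfillWorker.new() |> Oban.insert()`.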
I'm not sure why you'd have a binary leak from this, but I have solved other binary leaks
https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4060
https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4064
Also I notice you're using Finch, which will generate a new connection pool for every single server it connects to, and that will leak. I'm generally only recommending Gun and Hackney now, and I've fixed Gun's connection pool (pretty sure 😅).
My server can have an uptime of a week and when it is idle the binaries are only 20MB. My graphs show binaries never crossed 50MB in the last week.
Hope this helps
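For context on the Finch point (not from the thread, just how Finch is configured): each `{scheme, host, port}` gets its own pool, and the `:default` entry only sets the options those per-origin pools are created with; it does not make them shared:

```elixir
# Illustrative supervision tree entry; the name is made up.
children = [
  {Finch,
   name: MyApp.Finch,
   pools: %{
     # options applied to every origin-specific pool Finch spins up
     default: [size: 10, count: 1]
   }}
]
```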
point being, if your task stalls for any reason, you have an instant leak on your hands with the current implementation
but eh i've done more than my part in relaying the testing info back so 🍤
What are you referring to as stalling? The HTTP request? I feel like you're being incredibly opaque
Hi, I was the one that was experiencing the binary leakage with the rich media patches. Apparently some RichMedia.Backfill tasks were stuck on something (not even really sure what), and that caused them to stick around for a very long time. After around 11 hours, the total amount of memory occupied by binaries grew to 2.8 GiB.

Forcing a GC sweep with `:recon.bin_leak` didn't seem to do much. It was only after @floatingghost moved the tasks to Oban that this problem stopped occurring.

Does Gun or Hackney just maintain one connection pool for all hosts?
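For anyone reproducing this, the GC sweep mentioned above is run from an attached IEx shell and assumes the `:recon` dependency is present:

```elixir
:erlang.memory(:binary)   # total bytes currently held in refc binaries
:recon.bin_leak(10)       # garbage-collect every process, report the 10 that freed the most binaries
:erlang.memory(:binary)   # if this barely moves, something is still holding references
```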
If you could reproduce it again, it would be good to know what e.g. the top 10 binary PIDs are, with their full state.

edit: `:sys.get_status/1` includes the state, so that seems better, and then we can better see exactly what they're stuck on.
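One way to gather what's being asked for here, assuming `:recon` is available (the exact commands are a guess, not from the thread):

```elixir
for {pid, bytes, _info} <- :recon.proc_count(:binary_memory, 10) do
  IO.inspect({pid, bytes, Process.info(pid, :current_function)}, label: "binary-heavy process")

  # :sys.get_status/2 only answers for processes that speak the sys protocol
  # (GenServers and friends); plain tasks will exit with a timeout, hence the catch.
  try do
    IO.inspect(:sys.get_status(pid, 5_000), label: "state")
  catch
    :exit, reason -> IO.inspect(reason, label: "no sys state")
  end
end
```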
Gun and Hackney do not have their own connection pooling logic. Gun exposes everything required to be used with a custom connection pool, and we have that code in Pleroma, but it appears it was fully removed from Akkoma. Hackney is being used in Pleroma with no connection pool at all; the process will exit when the connection is done.

I forgot Hackney has some connection pool stuff, but it's kind of opaque and I'm not sure how or if it's being used via Tesla. We have some Hackney pools defined in our configuration, but nowhere in our code do we call `:hackney_pool`. I've found in the Tesla code you can define the adapter as `{Tesla.Adapter.Hackney, pool: :my_pool}`. I'll have to do more research on that...

Finch is really only meant to be used for making requests to API endpoints and not to random HTTP servers across the internet. Finch starts a new connection pool for every `{scheme, host, port}` it has to make a request to.

https://elixirforum.com/t/massive-increase-in-binaries-using-finch/36392/15