[bug] Akkoma getting OOM killed #736
Reference: AkkomaGang/akkoma#736
Your setup
From source
Extra details
NixOS 23.11
Version
3.12.1 (originally 3.10.4)
PostgreSQL version
16.2
What were you trying to do?
For the 3rd time in the last 2-3 months, Akkoma got OOM killed (each time while I was asleep). Hours before getting killed, the system started to require more and more memory (from normally 2GB to about 13GB of the available 16GB before getting OOM killed). I have no definite proof Akkoma was the memory hog, but all three times Akkoma was selected by the oomd.
What did you expect to happen?
Akkoma having a somewhat stable memory footprint.
What actually happened?
Akkoma amasses a seemingly unstoppable memory footprint after a trigger event (or so it seems). The sharp edge in the graph is exactly between 17:02 and 17:03. The logs after this point seem normal up until the oom killer kicks in.
Logs
Severity
I cannot use it as easily as I'd like
Have you searched for this issue?
This is 8 months out of date and crucially is missing critical security fixes. Please read 3.12’s upgrade instructions carefully and then upgrade as soon as you can.
(If the problem is resolved by the upgrade, please close the issue)
If the issue persists on 3.12, please check the live dashboard for more information once mem usage has already risen a fair bit. While logged in as an admin, you can access the live dashboard under /phoenix/live_dashboard. The “Home”, “Processes” (sort by memory usage) and “Ecto Stats” tabs are probably particularly helpful. Post again with this info attached.
Your log excerpt doesn’t seem too unusual.
I have also seen this on GenServer Social, but I am never around when it starts so I am not sure what triggers it yet, I figured it was just a "me issue" so never prioritized digging in.
It has happened about once a month since the instance was originally deployed (2022), and continues to happen on version 3.12.
No single process is consuming the memory; instead, the process count in the live dashboard grows rapidly until it runs out of memory. On a 16 GB RAM server, this takes about 1 day (the app usually hovers around 1.6 GB memory usage).
The last time it happened, it seemed like :ssl_gen_statem.init/1 was the process getting replicated. I'm not sure if it's a retry in a pool gone rogue or what.
I'll attach to the running instance and try to get better data the next time it happens.
I finally updated to akkoma 3.12.1. The live dashboard is accessible and ready whenever this bug might manifest again.
I also created a grafana alert to hopefully notify me on high memory usage to look into it when it happens.
Unfortunately, since this issue occurs only about once a month, it's hard to determine whether it was fixed by the update or simply hasn't occurred again. But @paulyd reported the issue persists on 3.12. Let's wait for more data.
seems like this tends to occur when akkoma is unable to fetch a post, see also AkkomaGang/akkoma#787
#762 (merged in 76ded10a70) does fix it for when the other server responds with a 429, but I think there should be a global backoff for any sort of fetch that could fail so that it doesn't end up clogging up the queue and eating up memory
i found out what it is! it was a relic from old, OLD, OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOLD pleroma wherein some post fetches would use an ad-hoc Task which can infinitely recurse
PR raised, that should be fixed soon
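For illustration only, the "global backoff" idea suggested above could be sketched roughly as follows. This is not Akkoma's code (Akkoma is written in Elixir) and not what #762 or the later PR implements; the class name `FetchBackoff`, its parameters and the per-host keying are all assumptions made up for this sketch. It only shows the general technique: every failed fetch to a host doubles the wait before the next attempt, up to a cap, so repeated failures cannot pile up in the queue.

```python
import time


class FetchBackoff:
    """Hypothetical per-host exponential backoff tracker (illustrative only)."""

    def __init__(self, base_delay=30.0, max_delay=3600.0, clock=time.monotonic):
        self.base_delay = base_delay  # delay after the first failure, in seconds
        self.max_delay = max_delay    # upper bound for any single delay
        self.clock = clock            # injectable clock, eases testing
        self._failures = {}           # host -> consecutive failure count
        self._retry_at = {}           # host -> earliest time of the next attempt

    def allowed(self, host):
        """Return True if a fetch to this host may be attempted now."""
        retry_at = self._retry_at.get(host)
        return retry_at is None or self.clock() >= retry_at

    def record_failure(self, host):
        """Register a failed fetch and return the delay that now applies."""
        count = self._failures.get(host, 0) + 1
        self._failures[host] = count
        # Double the delay with each consecutive failure, capped at max_delay.
        delay = min(self.base_delay * 2 ** (count - 1), self.max_delay)
        self._retry_at[host] = self.clock() + delay
        return delay

    def record_success(self, host):
        """Reset the backoff state once a fetch to this host succeeds."""
        self._failures.pop(host, None)
        self._retry_at.pop(host, None)
```

A caller would check `allowed(host)` before scheduling a fetch and drop or postpone the job otherwise, so that any kind of failure (not only a 429) slows down further attempts.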
Very cool, thanks for digging into it! I'm closing this issue in the belief that the problem is fixed now, but people can reopen (or comment on) it if issues arise again.
It seems this issue persists... My Akkoma is currently sucking up all memory and fully utilizing one CPU core with the same symptoms as before. This is with Akkoma 3.13.2 (on NixOS 24.05), which should include the fix in #762.
I looked into the admin phoenix webpage, but none of the points stand out with weird memory usage. Some quick screenshots in the attachment.
EDIT: Excerpt of the log during this time frame:
there hasn't been a release with the fix included