[bug] Akkoma getting OOM killed #736

Closed
opened 2024-04-05 14:37:48 +00:00 by Mynacol · 7 comments

Your setup

From source

Extra details

NixOS 23.11

Version

3.12.1 (originally 3.10.4)

PostgreSQL version

16.2

What were you trying to do?

For the 3rd time in the last 2-3 months akkoma was getting OOM killed (each while I was asleep). Hours before getting killed, the system started to require more and more memory (from normally 2GB to about 13GB of the available 16GB before getting OOM killed). I have no definite proof akkoma was the memory hog, but all three times akkoma was selected by the oomd.

What did you expect to happen?

Akkoma having a somewhat stable memory footprint.

What actually happened?

Akkoma amasses a seemingly unstoppable memory footprint after a trigger event (or so it seems). The sharp edge in the graph is exactly between 17:02 and 17:03. The logs after this point seem normal up until the oom killer kicks in.

Logs

Apr 01 17:01:21 akkoma[2675140]: request_id=F8I2otKQ3iMGjuwADLCh [error] Could not fetch user https://mastodon.social/users/*redacted*, {nil, {:error, {"Object has been deleted", "https://mastodon.social/users/*redacted*", 410}}}
Apr 01 17:03:16 postgres[2996846]: [2996846] ERROR:  duplicate key value violates unique constraint "objects_unique_apid_index"
Apr 01 17:03:16 postgres[2996846]: [2996846] DETAIL:  Key ((data ->> 'id'::text))=(https://mastodon.social/users/lisamelton/statuses/112197041349103569) already exists.
Apr 01 17:03:16 postgres[2996846]: [2996846] STATEMENT:  INSERT INTO "objects" ("data","inserted_at","updated_at") VALUES ($1,$2,$3) RETURNING "id"
Apr 01 17:03:16 akkoma[2675140]: [error] {:error, {:persist, {:error, #Ecto.Changeset<action: :insert, changes: %{data: %{"actor" => "https://mastodon.social/users/lisamelton", "announcement_count" => 0, "announcements" => [], "attachment" => [], "attributedTo" => "https://mastodon.social/users/lisamelton", "bcc" => [], "bto" => [], "cc" => ["https://chaos.social/users/rjayasinghe", "https://mastodon.gamedev.place/users/grumpygamer", "https://mastodon.social/users/Gargron", "https://mastodon.social/users/lisamelton/followers", "https://oisaur.com/users/renchap"], "content" => "<p><span class=\"h-card\" translate=\"no\"><a href=\"https://chaos.social/@rjayasinghe\" class=\"u-url mention\">@<span>rjayasinghe</span></a></span> Exactly! It&#39;s another reason that I love Mastodon and the Fediverse. 💖</p><p><span class=\"h-card\" translate=\"no\"><a href=\"https://oisaur.com/@renchap\" class=\"u-url mention\">@<span>renchap</span></a></span> <span class=\"h-card\" translate=\"no\"><a href=\"https://mastodon.social/@Gargron\" class=\"u-url mention\">@<span>Gargron</span></a></span> <span class=\"h-card\" translate=\"no\"><a href=\"https://mastodon.gamedev.place/@grumpygamer\" class=\"u-url mention\">@<span>grumpygamer</span></a></span></p>", "contentMap" => %{"en" => "<p><span class=\"h-card\" translate=\"no\"><a href=\"https://chaos.social/@rjayasinghe\" class=\"u-url mention\">@<span>rjayasinghe</span></a></span> Exactly! It&#39;s another reason that I love Mastodon and the Fediverse. 💖</p><p><span class=\"h-card\" translate=\"no\"><a href=\"https://oisaur.com/@renchap\" class=\"u-url mention\">@<span>renchap</span></a></span> <span class=\"h-card\" translate=\"no\"><a href=\"https://mastodon.social/@Gargron\" class=\"u-url mention\">@<span>Gargron</span></a></span> <span class=\"h-card\" translate=\"no\"><a href=\"https://mastodon.gamedev.place/@grumpygamer\" class=\"u-url mention\">@<span>grumpygamer</span></a></span></p>"}, "context" => "tag:mastodon.social,2024-04-01:objectId=676394297:objectType=Conversation", "emoji" => %{}, "id" => "https://mastodon.social/users/lisamelton/statuses/112197041349103569", "inReplyTo" => "https://chaos.social/users/rjayasinghe/statuses/112197035470513639", "like_count" => 0, "likes" => [], "published" => "2024-04-01T17:03:07Z", "replies" => [], "replies_count" => 0, "sensitive" => false, "tag" => [%{"href" => "https://chaos.social/users/rjayasinghe", "name" => "@rjayasinghe@chaos.social", "type" => "Mention"}, %{"href" => "https://oisaur.com/users/renchap", "name" => "@renchap@oisaur.com", "type" => "Mention"}, %{"href" => "https://mastodon.social/users/Gargron", "name" => "@Gargron", "type" => "Mention"}, %{"href" => "https://mastodon.gamedev.place/users/grumpygamer", "name" => "@grumpygamer@mastodon.gamedev.place", "type" => "Mention"}], "to" => ["https://www.w3.org/ns/activitystreams#Public"], "type" => "Note", "url" => "https://mastodon.social/@lisamelton/112197041349103569"}}, errors: [ap_id: {"has already been taken", [constraint: :unique, constraint_name: "objects_unique_apid_index"]}], data: #Pleroma.Object<>, valid?: false>}}}
Apr 01 17:03:32 akkoma[2675140]: request_id=F8I2vzwHmay26kYADLQB [error] Could not decode featured collection at fetch https://mastodon.social/users/*redacted*/collections/featured, {:error, {"Object has been deleted", "https://mastodon.social/users/*redacted*/collections/featured", 410}}
Apr 01 17:03:47 akkoma[2675140]: request_id=F8I2xNrLnFF5PrAADLRh [error] Could not fetch user https://cupoftea.social/users/*redacted*, {nil, {:error, {"Object has been deleted", "https://cupoftea.social/users/*redacted*", 410}}}
Apr 01 17:03:47 akkoma[2675140]: request_id=F8I2xNrLnFF5PrAADLRh [error] Could not fetch user https://cupoftea.social/users/*redacted*, {nil, {:error, {"Object has been deleted", "https://cupoftea.social/users/*redacted*", 410}}}
Apr 01 17:03:53 postgres[2200]: [2200] LOG:  checkpoint starting: time
Apr 01 17:04:18 akkoma[2675140]: [error] Error while fetching https://bsd.network/users/*redacted*/statuses/*redacted*: {:error, "timeout"}
Apr 01 17:04:18 akkoma[2675140]: [warn] Couldn't fetch reply@"https://bsd.network/users/*redacted*/statuses/*redacted*", error: nil
Apr 01 17:04:26 akkoma[2675140]: [info] Updating metadata for bears.town
Apr 01 17:04:30 akkoma[2675140]: [info] Updating metadata for fearness.org
Apr 01 17:04:31 akkoma[2675140]: [info] Postgrex.Protocol (#PID<0.3942.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.22233.10> exited
Apr 01 17:04:32 postgres[2978467]: [2978467] LOG:  could not send data to client: Broken pipe
Apr 01 17:04:32 postgres[2978467]: [2978467] ERROR:  canceling statement due to user request
Apr 01 17:04:32 postgres[2978467]: [2978467] FATAL:  connection to client lost
Apr 01 17:04:40 akkoma[2675140]: [warn] Timeout while fetching rich media for *redacted*

Severity

I cannot use it as easily as I'd like

Have you searched for this issue?

  • I have double-checked and have not found this issue mentioned anywhere.
### Your setup From source ### Extra details NixOS 23.11 ### Version 3.12.1 (originally 3.10.4) ### PostgreSQL version 16.2 ### What were you trying to do? For the 3rd time in the last 2-3 months akkoma was getting OOM killed (each while I was asleep). Hours before getting killed, the system started to require more and more memory (from normally 2GB to about 13GB of the available 16GB before getting OOM killed). I have no definite proof akkoma was the memory hog, but all three times akkoma was selected by the oomd. ### What did you expect to happen? Akkoma having a somewhat stable memory footprint. ### What actually happened? Akkoma amasses a seemingly unstoppable memory footprint after a trigger event (or so it seems). The sharp edge in the graph is exactly between 17:02 and 17:03. The logs after this point seem normal up until the oom killer kicks in. ### Logs ```shell Apr 01 17:01:21 akkoma[2675140]: request_id=F8I2otKQ3iMGjuwADLCh [error] Could not fetch user https://mastodon.social/users/*redacted*, {nil, {:error, {"Object has been deleted", "https://mastodon.social/users/*redacted*", 410}}} Apr 01 17:03:16 postgres[2996846]: [2996846] ERROR: duplicate key value violates unique constraint "objects_unique_apid_index" Apr 01 17:03:16 postgres[2996846]: [2996846] DETAIL: Key ((data ->> 'id'::text))=(https://mastodon.social/users/lisamelton/statuses/112197041349103569) already exists. Apr 01 17:03:16 postgres[2996846]: [2996846] STATEMENT: INSERT INTO "objects" ("data","inserted_at","updated_at") VALUES ($1,$2,$3) RETURNING "id" Apr 01 17:03:16 akkoma[2675140]: [error] {:error, {:persist, {:error, #Ecto.Changeset<action: :insert, changes: %{data: %{"actor" => "https://mastodon.social/users/lisamelton", "announcement_count" => 0, "announcements" => [], "attachment" => [], "attributedTo" => "https://mastodon.social/users/lisamelton", "bcc" => [], "bto" => [], "cc" => ["https://chaos.social/users/rjayasinghe", "https://mastodon.gamedev.place/users/grumpygamer", "https://mastodon.social/users/Gargron", "https://mastodon.social/users/lisamelton/followers", "https://oisaur.com/users/renchap"], "content" => "<p><span class=\"h-card\" translate=\"no\"><a href=\"https://chaos.social/@rjayasinghe\" class=\"u-url mention\">@<span>rjayasinghe</span></a></span> Exactly! It&#39;s another reason that I love Mastodon and the Fediverse. 💖</p><p><span class=\"h-card\" translate=\"no\"><a href=\"https://oisaur.com/@renchap\" class=\"u-url mention\">@<span>renchap</span></a></span> <span class=\"h-card\" translate=\"no\"><a href=\"https://mastodon.social/@Gargron\" class=\"u-url mention\">@<span>Gargron</span></a></span> <span class=\"h-card\" translate=\"no\"><a href=\"https://mastodon.gamedev.place/@grumpygamer\" class=\"u-url mention\">@<span>grumpygamer</span></a></span></p>", "contentMap" => %{"en" => "<p><span class=\"h-card\" translate=\"no\"><a href=\"https://chaos.social/@rjayasinghe\" class=\"u-url mention\">@<span>rjayasinghe</span></a></span> Exactly! It&#39;s another reason that I love Mastodon and the Fediverse. 💖</p><p><span class=\"h-card\" translate=\"no\"><a href=\"https://oisaur.com/@renchap\" class=\"u-url mention\">@<span>renchap</span></a></span> <span class=\"h-card\" translate=\"no\"><a href=\"https://mastodon.social/@Gargron\" class=\"u-url mention\">@<span>Gargron</span></a></span> <span class=\"h-card\" translate=\"no\"><a href=\"https://mastodon.gamedev.place/@grumpygamer\" class=\"u-url mention\">@<span>grumpygamer</span></a></span></p>"}, "context" => "tag:mastodon.social,2024-04-01:objectId=676394297:objectType=Conversation", "emoji" => %{}, "id" => "https://mastodon.social/users/lisamelton/statuses/112197041349103569", "inReplyTo" => "https://chaos.social/users/rjayasinghe/statuses/112197035470513639", "like_count" => 0, "likes" => [], "published" => "2024-04-01T17:03:07Z", "replies" => [], "replies_count" => 0, "sensitive" => false, "tag" => [%{"href" => "https://chaos.social/users/rjayasinghe", "name" => "@rjayasinghe@chaos.social", "type" => "Mention"}, %{"href" => "https://oisaur.com/users/renchap", "name" => "@renchap@oisaur.com", "type" => "Mention"}, %{"href" => "https://mastodon.social/users/Gargron", "name" => "@Gargron", "type" => "Mention"}, %{"href" => "https://mastodon.gamedev.place/users/grumpygamer", "name" => "@grumpygamer@mastodon.gamedev.place", "type" => "Mention"}], "to" => ["https://www.w3.org/ns/activitystreams#Public"], "type" => "Note", "url" => "https://mastodon.social/@lisamelton/112197041349103569"}}, errors: [ap_id: {"has already been taken", [constraint: :unique, constraint_name: "objects_unique_apid_index"]}], data: #Pleroma.Object<>, valid?: false>}}} Apr 01 17:03:32 akkoma[2675140]: request_id=F8I2vzwHmay26kYADLQB [error] Could not decode featured collection at fetch https://mastodon.social/users/*redacted*/collections/featured, {:error, {"Object has been deleted", "https://mastodon.social/users/*redacted*/collections/featured", 410}} Apr 01 17:03:47 akkoma[2675140]: request_id=F8I2xNrLnFF5PrAADLRh [error] Could not fetch user https://cupoftea.social/users/*redacted*, {nil, {:error, {"Object has been deleted", "https://cupoftea.social/users/*redacted*", 410}}} Apr 01 17:03:47 akkoma[2675140]: request_id=F8I2xNrLnFF5PrAADLRh [error] Could not fetch user https://cupoftea.social/users/*redacted*, {nil, {:error, {"Object has been deleted", "https://cupoftea.social/users/*redacted*", 410}}} Apr 01 17:03:53 postgres[2200]: [2200] LOG: checkpoint starting: time Apr 01 17:04:18 akkoma[2675140]: [error] Error while fetching https://bsd.network/users/*redacted*/statuses/*redacted*: {:error, "timeout"} Apr 01 17:04:18 akkoma[2675140]: [warn] Couldn't fetch reply@"https://bsd.network/users/*redacted*/statuses/*redacted*", error: nil Apr 01 17:04:26 akkoma[2675140]: [info] Updating metadata for bears.town Apr 01 17:04:30 akkoma[2675140]: [info] Updating metadata for fearness.org Apr 01 17:04:31 akkoma[2675140]: [info] Postgrex.Protocol (#PID<0.3942.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.22233.10> exited Apr 01 17:04:32 postgres[2978467]: [2978467] LOG: could not send data to client: Broken pipe Apr 01 17:04:32 postgres[2978467]: [2978467] ERROR: canceling statement due to user request Apr 01 17:04:32 postgres[2978467]: [2978467] FATAL: connection to client lost Apr 01 17:04:40 akkoma[2675140]: [warn] Timeout while fetching rich media for *redacted* ``` ### Severity I cannot use it as easily as I'd like ### Have you searched for this issue? - [x] I have double-checked and have not found this issue mentioned anywhere.
Mynacol added the
bug
label 2024-04-05 14:37:48 +00:00
Member

Version
3.10.4

This is 8 months out of date and crucially is missing critical security fixes. Please read 3.12’s upgrade instructions carefully and then upgrade as soon as you can.

(If the problem is resolved by the upgrade, please close the issue)
If the issue persist on 3.12, please check the live dashboard for more information once mem usage has already risen a fair bit. While logged ina as an admin, you can access the live dashboard under /phoenix/live_dashboard. The “Home”, “Processes” (sort by memory usage) and “Ecto Stats” tabs are probably particularly helpful. Post gain with this info attached
Your log excerpt doesn’t seem too unusual

> **Version** > 3.10.4 This is 8 months out of date and crucially is missing critical security fixes. Please read [3.12’s upgrade instructions](https://meta.akkoma.dev/t/akkoma-stable-2024-03-securer-i-barely-know-her/681) carefully and then upgrade as soon as you can. (If the problem is resolved by the upgrade, please close the issue) If the issue persist on 3.12, please check the live dashboard for more information once mem usage has already risen a fair bit. While logged ina as an admin, you can access the live dashboard under `/phoenix/live_dashboard`. The “Home”, “Processes” *(sort by memory usage)* and “Ecto Stats” tabs are probably particularly helpful. Post gain with this info attached Your log excerpt doesn’t seem too unusual
Contributor

I have also seen this on GenServer Social, but I am never around when it starts so I am not sure what triggers it yet, I figured it was just a "me issue" so never prioritized digging in.

It starts about once a month since was originally deployed(2022), and continues to happen on version 3,12.

There is no one process consuming memory, instead the process count in live dashboard grows rapidly until it runs out of memory. On a 16gb RAM server, this takes about 1 day. (the app usually hovers around 1,6gb memory usage)

The last time it happened it seemed like :ssl_gen_statem.init/1 was the process getting replicated. I'm not sure if it's a retry in a pool gone rogue or what.

I'll attach to the running instance and try to get better data the next time it happens.

I have also seen this on GenServer Social, but I am never around when it starts so I am not sure what triggers it yet, I figured it was just a "me issue" so never prioritized digging in. It starts about once a month since was originally deployed(2022), and continues to happen on version 3,12. There is no one process consuming memory, instead the process count in live dashboard grows rapidly until it runs out of memory. On a 16gb RAM server, this takes about 1 day. (the app usually hovers around 1,6gb memory usage) The last time it happened it seemed like `:ssl_gen_statem.init/1` was the process getting replicated. I'm not sure if it's a retry in a pool gone rogue or what. I'll attach to the running instance and try to get better data the next time it happens.
Author

I finally updated to akkoma 3.12.1. The live dashboard is accessible and ready whenever this bug might manifest again.
I also created a grafana alert to hopefully notify me on high memory usage to look into it when it happens.

Unfortunately, this issue occurs about once a month, it's hard to determine whether it was fixed through the update or hasn't occurred again. But @paulyd reported the issue persists on 3.12. Let's wait for more data.

I finally updated to akkoma 3.12.1. The live dashboard is accessible and ready whenever this bug might manifest again. I also created a grafana alert to hopefully notify me on high memory usage to look into it when it happens. Unfortunately, this issue occurs about once a month, it's hard to determine whether it was fixed through the update or hasn't occurred again. But @paulyd reported the issue persists on 3.12. Let's wait for more data.
Contributor

seems like this tends to occur when akkoma is unable to fetch a post, see also AkkomaGang/akkoma#787

seems like this tends to occur when akkoma is unable to fetch a post, see also AkkomaGang/akkoma#787
Contributor

#762 (merged in 76ded10a70) does fix it for when the other server responds with a 429, but I think there should be a global backoff for any sort of fetch that could fail so that it doesn't end up clogging up the queue and eating up memory

#762 (merged in 76ded10a709859ce661867bf28d778fb7f3a2a21) does fix it for when the other server responds with a 429, but I think there should be a global backoff for any sort of fetch that could fail so that it doesn't end up clogging up the queue and eating up memory

i found out what it is! it was a relic from old, OLD, OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOLD pleroma wherein some post fetches would use an ad-hoc Task which can infinitely recurse

PR raised, that should be fixed soon

i found out what it is! it was a relic from old, OLD, OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOLD pleroma wherein some post fetches would use an ad-hoc `Task` which can infinitely recurse PR raised, that should be fixed soon
Author

Very cool, thanks for digging into it! I'm closing this issue in belief that the problem is fixed now, but people can reopen (or comment on) it if issues still arise again.

Very cool, thanks for digging into it! I'm closing this issue in belief that the problem is fixed now, but people can reopen (or comment on) it if issues still arise again.
Sign in to join this conversation.
No milestone
No project
No assignees
5 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: AkkomaGang/akkoma#736
No description provided.