Avoid accumulation of stale data in websockets #806

Merged
floatingghost merged 1 commit from Oneric/akkoma:websocket_fullsweep into develop 2024-06-23 02:19:37 +00:00
Member

This is the patch which came up again recently; the issue was originally reported on IRC by bjo and jmjl in March, and they also tested a global `ERL_FULLSWEEP_AFTER`. Nobody responded about testing this more specific tweak, but given all the other pointers towards websockets it seems likely to be the culprit.

Putting this here for more visibility, or for merging anyway if you think it’s reasonably likely to be correct.
See the commit message for more details.
Oneric added 1 commit 2024-06-16 16:03:00 +00:00
Avoid accumulation of stale data in websockets
All checks were successful
ci/woodpecker/pr/lint Pipeline was successful
ci/woodpecker/pr/test Pipeline was successful
ci/woodpecker/pr/build-amd64 Pipeline was successful
ci/woodpecker/pr/build-arm64 Pipeline was successful
ci/woodpecker/pr/docs Pipeline was successful
4b5a398f22
We’ve received reports of some specific instances slowly accumulating
more and more binary data over time up to OOMs and globally setting
ERL_FULLSWEEP_AFTER=0 has proven to be an effective countermeasure.
However, this incurs increased CPU costs everywhere and is
thus not suitable to apply out of the box.

There are many reports unrelated to Akkoma of long-lived Phoenix
websockets getting into a state unfavourable for the garbage
collector depending on usage pattern, resulting in exactly the
observed behaviour.
Therefore it seems likely affected instances are using timeline
streaming and do so in just the right way to trigger this. We
can tune the garbage collector just for websocket processes
and use a more lenient value of 20 to keep the added perf cost
in check.

Unfortunately none of the affected instances responded to inquiries
to test this more selective GC tuning, so this is not fully verified.
However, the general reports regarding websockets, together with
Pleroma — as it turns out — having applied and properly tested a very
similar tweak, seem to support this theory.

Ref.:
  https://www.erlang.org/doc/man/erlang#ghlink-process_flag-2-idp226
  https://blog.guzman.codes/using-phoenix-channels-high-memory-usage-save-money-with-erlfullsweepafter
  https://git.pleroma.social/pleroma/pleroma/-/merge_requests/4060
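
For illustration, the whole tweak boils down to one `process_flag/2` call in the long-lived websocket process; a minimal standalone sketch (handler and module names are placeholders, not the actual Akkoma code):

```elixir
defmodule MyApp.StreamingWebsocketHandler do
  # Minimal Cowboy websocket handler skeleton; only the GC tweak matters here.

  # Force a full GC sweep after 20 minor collections instead of the default
  # 65535 (or whatever ERL_FULLSWEEP_AFTER sets globally), so off-heap
  # binaries referenced by this long-lived process are released sooner.
  @fullsweep_after 20

  def init(req, state) do
    {:cowboy_websocket, req, state}
  end

  def websocket_init(state) do
    :erlang.process_flag(:fullsweep_after, @fullsweep_after)
    {:ok, state}
  end

  def websocket_handle(_frame, state), do: {:ok, state}

  def websocket_info(_message, state), do: {:ok, state}
end
```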
Author
Member

Looks like you also have accounts here @bjo @jmjl; can you test this patch and remove the global `ERL_FULLSWEEP_AFTER`? Ideally also verify the memory leak still occurs when neither this patch is applied nor global `ERL_FULLSWEEP_AFTER` is set.
floatingghost reviewed 2024-06-16 16:29:41 +00:00
@@ -61,1 +63,4 @@
# process is long-lived and can sometimes accumulate stale data in such a way it's
# not freed by young garbage cycles, thus make full collection sweeps more frequent
:erlang.process_flag(:fullsweep_after, @fullsweep_after)

It looks like you should be able to specify this in the websocket options rather than having to use the lower-level Erlang functions:

https://hexdocs.pm/phoenix/Phoenix.Endpoint.html#socket/3-websocket-configuration
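
For comparison, a generic sketch of what that higher-level configuration looks like in a Phoenix endpoint (placeholder names; as noted in the reply below, Akkoma's streaming handler isn't currently mounted this way):

```elixir
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  # The websocket transport accepts :fullsweep_after directly, so no manual
  # :erlang.process_flag/2 call is needed inside the handler itself.
  socket "/api/v1/streaming", MyAppWeb.StreamingSocket,
    websocket: [fullsweep_after: 20],
    longpoll: false
end
```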
Author
Member

We currently don’t use Phoenix’s higher-level websocket interface for this; instead `WebsocketHandler` is listed as a higher-priority endpoint than `Pleroma.Web.Endpoint`. I don’t know why this is the case (maybe the higher-level interface didn’t exist yet when streaming support was introduced seven years ago), and when I wrote this patch some months ago I had hoped to first get confirmation whether this really is the issue. Pleroma migrated to Phoenix’s WebSockets, which is why they can use this setting in their patch. If there’s no reason to stick with the custom implementation we probably should migrate as well.

There’s also `lib/phoenix/transports/web_socket/raw.ex`, which defines a module in Phoenix’s namespace, `Phoenix.Transports.WebSocket.Raw`, but I have no idea what this does.
First-time contributor

I will try to use a source-based instance in the next days to test it.
Author
Member

Thanks! When you start your from-source server, it’d be great if you could use

```
MIX_ENV=prod elixir --sname akkoma --cookie akkoma_cookie -S mix phx.server
```

such that — in case the patch doesn't work — you can run the memory-leak diagnostic tool later: [diagnostic-tools/binary-leak-checker.sh](https://akkoma.dev/AkkomaGang/akkoma/src/branch/develop/diagnostic-tools/binary-leak-checker.sh)

This way we’ll (hopefully) be able to get insight into which processes are actually responsible for the leaks.
Ideally test whether connecting a remote shell works ahead of time; if it doesn’t, you may have configured `epmd` too strictly, but you can just ask on IRC or here about it.
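
In case it helps, a rough manual equivalent of that check (not the repo's script) can be run from a remote shell, e.g. `iex --sname debug --cookie akkoma_cookie --remsh akkoma@<host>` (node name is an assumption based on the command above), to list the processes referencing the most off-heap binary data:

```elixir
# Sums the sizes of refc binaries referenced by each process and returns the
# ten largest holders; the process_info items used here are standard OTP, but
# this is a quick sketch rather than diagnostic-tools/binary-leak-checker.sh.
Process.list()
|> Enum.map(fn pid ->
  with {:binary, bins} <- :erlang.process_info(pid, :binary),
       info when is_list(info) <- Process.info(pid, [:registered_name, :initial_call]) do
    total_bytes = bins |> Enum.map(&elem(&1, 1)) |> Enum.sum()
    {total_bytes, pid, info}
  else
    # Process died between the listing and the lookup, or info is unavailable.
    _ -> nil
  end
end)
|> Enum.reject(&is_nil/1)
|> Enum.sort(:desc)
|> Enum.take(10)
```

The `:initial_call` of the top entries should make it obvious whether the websocket handler processes are the ones hoarding binaries.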
floatingghost approved these changes 2024-06-17 21:36:23 +00:00
floatingghost left a comment
Owner

Approving code-side; will need to wait to see if this fixes the issue before merging.
First-time contributor

Unfortunately I wasn't able to package the source variant, so I'm now running a binary install with 4b5a398f22, and it seems much better: Akkoma is using around 300-400 MB of memory.
Oneric force-pushed websocket_fullsweep from 4b5a398f22 to 13e2a811ec 2024-06-22 23:22:58 +00:00 Compare

Feedback is good by me; we should watch to see whether migrating to the new Phoenix interface is worth it or not.
floatingghost merged commit f66135ed08 into develop 2024-06-23 02:19:37 +00:00
floatingghost deleted branch websocket_fullsweep 2024-06-23 02:19:37 +00:00
First-time contributor

Memory usage is still fine, currently 350 MB, running since 2024-06-21 21:32:48 CEST.