Compare commits

...
Sign in to create a new pull request.

11 commits

Author SHA1 Message Date
2e384dcda9 [TEST] dbprune: fuzzymode
Splitting it up hopefully helps with load and OOMs (on smaller VPSes).
But untested.

Also: technically can delete activities referencing actually existing
objects, but atm won't happen for anything we understand in the first
place. But risks future changes leading to loss of (remote) data.....
Local activities are never affected though

Less severe this might also miss some technically deleteable entries
if they’ll later get dealt with by the regular auto pruning anyway.
2024-05-15 03:51:14 +02:00
c127d48308 dbprune/activites: prune array activities first
Some checks failed
ci/woodpecker/pr/test Pipeline is pending
ci/woodpecker/pr/lint Pipeline failed
ci/woodpecker/pr/test/1 unknown status
ci/woodpecker/pr/test/2 unknown status
ci/woodpecker/pr/test/3 unknown status
ci/woodpecker/pr/test/4 unknown status
ci/woodpecker/pr/build-amd64 unknown status
ci/woodpecker/pr/docs unknown status
ci/woodpecker/pr/build-arm64 Pipeline was successful
This query is less costly; if something goes wrong or gets aborted later
at least this part will arelady be done.
2024-05-15 03:45:50 +02:00
40ae91a45c dbprune: allow splitting array and single activity prunes
The former is typically just a few reports; it doesn't make sense to
rerun it over and over again in batched prunes or if a full prune OOMed.
2024-05-15 03:45:50 +02:00
3c319ea732 dbprune: use query! 2024-05-15 01:43:53 +02:00
91e4f4f885 dbprune: add more logs
Pruning can go on for a long time; give admins some insight into that
something is happening to make it less frustrating and to make it easier
which part of the process is stalled should this happen.

Again most of the changes are merely reindents;
review with whitespace changes hidden recommended.
2024-05-15 01:43:53 +02:00
7e03886886 dbprune: shortcut array activity search
This brought down query costs from 7,953,740.90 to 47,600.97
2024-05-15 01:43:53 +02:00
1caac640da Test both standalone and flag mode for pruning orphaned activities 2024-05-15 01:43:53 +02:00
b03947917a Also allow limiting the initial prune_object
May sometimes be helpful to get more predictable runtime
than just with an age-based limit.

The subquery for the non-keep-threads path is required
since delte_all does not directly accept limit().

Again most of the diff is just adjusting indentation, best
hide whitespace-only changes with git diff -w or similar.
2024-05-15 01:43:53 +02:00
3258842d0c Log number of deleted rows in prune_orphaned_activities
This gives feedback when to stop rerunning limited batches.

Most of the diff is just adjusting indentation; best reviewed
with whitespace-only changes hidden, e.g. `git diff -w`.
2024-05-14 23:45:10 +02:00
ff684ba8ea Add standalone prune_orphaned_activities CLI task
This part of pruning can be very expensive and bog down the whole
instance to an unusable sate for a long time. It can thus be desireable
to split it from prune_objects and run it on its own in smaller limited batches.

If the batches are smaller enough and spaced out a bit, it may even be possible
to avoid any downtime. If not, the limit can still help to at least make the
downtime duration somewhat more predictable.
2024-05-14 23:45:10 +02:00
f5b5838c4d refactor: move prune_orphaned_activities into own function
No logic changes. Preparation for standalone orphan pruning.
2024-05-14 23:45:10 +02:00
4 changed files with 360 additions and 105 deletions

View file

@ -92,6 +92,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
- Akkoma API is now documented
- ability to auto-approve follow requests from users you are already following
- The SimplePolicy MRF can now strip user backgrounds from selected remote hosts
- New standalone `prune_orphaned_activities` mix task with configurable batch limit
- The `prune_objects` mix task now accepts a `--limit` parameter for initial object pruning
## Changed
- OTP builds are now built on erlang OTP26

View file

@ -50,9 +50,40 @@ This will prune remote posts older than 90 days (configurable with [`config :ple
- `--keep-threads` - Don't prune posts when they are part of a thread where at least one post has seen local interaction (e.g. one of the posts is a local post, or is favourited by a local user, or has been repeated by a local user...). It also wont delete posts when at least one of the posts in that thread is kept (e.g. because one of the posts has seen recent activity).
- `--keep-non-public` - Keep non-public posts like DM's and followers-only, even if they are remote.
- `--limit` - limits how many remote posts get pruned. This limit does **not** apply to any of the follow up jobs. If wanting to keep the database load in check it is thus advisable to run the standalone `prune_orphaned_activities` task with a limit afterwards instead of passing `--prune-orphaned-activities` to this task.
- `--prune-orphaned-activities` - Also prune orphaned activities afterwards. Activities are things like Like, Create, Announce, Flag (aka reports)... They can significantly help reduce the database size.
- `--vacuum` - Run `VACUUM FULL` after the objects are pruned. This should not be used on a regular basis, but is useful if your instance has been running for a long time before pruning.
## Prune orphaned activities from the database
This will prune activities which are no longer referenced by anything.
Such activities might be the result of running `prune_objects` without `--prune-orphaned-activities`.
The same notes and warnings apply as for `prune_objects`.
The task will print out how many rows were freed in total in its last
line of output in the form `Deleted 345 rows`.
When running the job in limited batches this can be used to determine
when all orphaned activities have been deleted.
=== "OTP"
```sh
./bin/pleroma_ctl database prune_orphaned_activities [option ...]
```
=== "From Source"
```sh
mix pleroma.database prune_orphaned_activities [option ...]
```
### Options
- `--limit n` - Only delete up to `n` activities in each query making up this job, i.e. if this job runs two queries at most `2n` activities will be deleted. Running this task repeatedly in limited batches can help maintain the instances responsiveness while still freeing up some space.
- `--no-singles` - Do not delete activites referencing single objects
- `--no-arrays` - Do not delete activites referencing an array of objects
- `--fuzzy` - Run fuzzy delete subqueries where supported; potentially more resource friendly but might miss a few activities or rarely delete some actually referenced remote activities (local data is never deleted)
## Create a conversation for all existing DMs
Can be safely re-run

View file

@ -20,6 +20,194 @@ defmodule Mix.Tasks.Pleroma.Database do
@shortdoc "A collection of database related tasks"
@moduledoc File.read!("docs/docs/administration/CLI_tasks/database.md")
defp maybe_limit(query, limit_cnt) do
if is_number(limit_cnt) and limit_cnt > 0 do
limit(query, [], ^limit_cnt)
else
query
end
end
defp limit_statement(limit) when is_number(limit) do
if limit > 0 do
"LIMIT #{limit}"
else
""
end
end
defp restrict_to_local_singles_type(query, types) do
query
|> where(
[a],
not a.local and
fragment("?->>'type' = ANY(?)", a.data, ^types) and
fragment("jsonb_typeof(?->'object') = 'string'", a.data)
)
end
defp prune_activity_by_id(idquery) do
from(
a in Pleroma.Activity,
where: a.id in subquery(idquery)
)
|> Repo.delete_all()
end
defp prune_orphaned_activities_singles(limit, opts)
defp prune_orphaned_activities_singles(limit, true = _fuzzy) do
# WARNING: the *_only parts match our current implementation, but other servers
# appears to be more lenient what those activities can refer to. Atm
# deleting those is no loss since we don't udnerstand them anyway, but
# should this chagne this absolutetly needs adjusting!!!
qdel_useronly =
from(
a in Pleroma.Activity,
select: a.id,
left_join: u in Pleroma.User,
on: fragment("?->>'object'", a.data) == u.ap_id,
where: is_nil(u.id)
)
|> restrict_to_local_singles_type(["Block", "Move"])
|> maybe_limit(limit)
{del_useronly, _} = prune_activity_by_id(qdel_useronly)
Logger.info("- Deleted #{del_useronly} activities related to users")
qdel_actonly =
from(
a in Pleroma.Activity,
select: a.id,
left_join: a2 in Pleroma.Activity,
on: fragment("?->>'object' = ?->>'id'", a.data, a2.data),
where: is_nil(a2.id)
)
|> restrict_to_local_singles_type(["Accept", "Reject"])
|> maybe_limit(limit)
{del_actonly, _} = prune_activity_by_id(qdel_actonly)
Logger.info("- Deleted #{del_actonly} activities related to other activities")
qdel_objonly =
from(
a in Pleroma.Activity,
select: a.id,
left_join: o in Pleroma.Object,
on: fragment("?->>'object' = ?->>'id'", a.data, o.data),
where: is_nil(o.id)
)
|> restrict_to_local_singles_type(["Like", "EmojiReact", "Announce"])
|> maybe_limit(limit)
{del_objonly, _} = prune_activity_by_id(qdel_objonly)
Logger.info("- Deleted #{del_actonly} activities related to non-activity objects")
qdel_mixed =
from(
a in Pleroma.Activity,
select: a.id,
left_join: o in Pleroma.Object,
on: fragment("?->>'object' = ?->>'id'", a.data, o.data),
left_join: a2 in Pleroma.Activity,
on: fragment("?->>'object' = ?->>'id'", a.data, a2.data),
left_join: u in Pleroma.User,
on: fragment("?->>'object'", a.data) == u.ap_id,
where:
is_nil(o.id) and
is_nil(a2.id) and
is_nil(u.id)
)
|> restrict_to_local_singles_type(["Create", "Add", "Remove"])
|> maybe_limit(limit)
{del_mixed, _} = prune_activity_by_id(qdel_mixed)
Logger.info("- Deleted #{del_actonly} activities related to various types")
del_useronly + del_actonly + del_objonly + del_mixed
end
defp prune_orphaned_activities_singles(limit, false = _fuzzy) do
%{:num_rows => del_single} =
"""
delete from public.activities
where id in (
select a.id from public.activities a
left join public.objects o on a.data ->> 'object' = o.data ->> 'id'
left join public.activities a2 on a.data ->> 'object' = a2.data ->> 'id'
left join public.users u on a.data ->> 'object' = u.ap_id
where not a.local
and jsonb_typeof(a."data" -> 'object') = 'string'
and o.id is null
and a2.id is null
and u.id is null
#{limit_statement(limit)}
)
"""
|> Repo.query!([], timeout: :infinity)
Logger.info("Prune activity singles: deteleted #{del_single} rows...")
del_single
end
defp prune_orphaned_activities_array(limit) do
%{:num_rows => del_array} =
"""
delete from public.activities
where id in (
select a.id from public.activities a
join json_array_elements_text((a."data" -> 'object')::json) as j
on a.data->>'type' = 'Flag'
left join public.objects o on j.value = o.data ->> 'id'
left join public.activities a2 on j.value = a2.data ->> 'id'
left join public.users u on j.value = u.ap_id
group by a.id
having max(o.data ->> 'id') is null
and max(a2.data ->> 'id') is null
and max(u.ap_id) is null
#{limit_statement(limit)}
)
"""
|> Repo.query!([], timeout: :infinity)
Logger.info("Prune activity arrays: deteleted #{del_array} rows...")
del_array
end
def prune_orphaned_activities(limit \\ 0, opts \\ []) when is_number(limit) do
# Activities can either refer to a single object id, and array of object ids
# or contain an inlined object (at least after going through our normalisation)
#
# Flag is the only type we support with an array (and always has arrays).
# Update the only one with inlined objects, but old Update activities are
#
# We already regularly purge old Delte, Undo, Update and Remove and if
# rejected Follow requests anyway; no need to explicitly deal with those here.
#
# Since theres an index on types and there are typically only few Flag
# activites, its _much_ faster to utilise the index. To avoid accidentally
# deleting useful activities should more types be added, keep typeof for singles.
# Prune activities who link to an array of objects
del_array =
if Keyword.get(opts, :arrays, true) do
prune_orphaned_activities_array(limit)
else
0
end
# Prune activities who link to a single object
del_single =
if Keyword.get(opts, :singles, true) do
fuzzy = Keyword.get(opts, :fuzzy, false)
prune_orphaned_activities_singles(limit, fuzzy)
else
0
end
del_single + del_array
end
def run(["remove_embedded_objects" | args]) do
{options, [], []} =
OptionParser.parse(
@ -62,6 +250,38 @@ def run(["update_users_following_followers_counts"]) do
)
end
def run(["prune_orphaned_activities" | args]) do
{options, [], []} =
OptionParser.parse(
args,
strict: [
limit: :integer,
singles: :boolean,
arrays: :boolean,
fuzzy: :boolean
]
)
start_pleroma()
{limit, options} = Keyword.pop(options, :limit, 0)
log_message = "Pruning orphaned activities"
log_message =
if limit > 0 do
log_message <> ", limiting deletion to #{limit} rows"
else
log_message
end
Logger.info(log_message)
deleted = prune_orphaned_activities(limit, options)
Logger.info("Deleted #{deleted} rows")
end
def run(["prune_objects" | args]) do
{options, [], []} =
OptionParser.parse(
@ -70,7 +290,8 @@ def run(["prune_objects" | args]) do
vacuum: :boolean,
keep_threads: :boolean,
keep_non_public: :boolean,
prune_orphaned_activities: :boolean
prune_orphaned_activities: :boolean,
limit: :integer
]
)
@ -79,6 +300,8 @@ def run(["prune_objects" | args]) do
deadline = Pleroma.Config.get([:instance, :remote_post_retention_days])
time_deadline = NaiveDateTime.utc_now() |> NaiveDateTime.add(-(deadline * 86_400))
limit_cnt = Keyword.get(options, :limit, 0)
log_message = "Pruning objects older than #{deadline} days"
log_message =
@ -110,129 +333,124 @@ def run(["prune_objects" | args]) do
log_message
end
log_message =
if limit_cnt > 0 do
log_message <> ", limiting to #{limit_cnt} rows"
else
log_message
end
Logger.info(log_message)
if Keyword.get(options, :keep_threads) do
# We want to delete objects from threads where
# 1. the newest post is still old
# 2. none of the activities is local
# 3. none of the activities is bookmarked
# 4. optionally none of the posts is non-public
deletable_context =
if Keyword.get(options, :keep_non_public) do
Pleroma.Activity
|> join(:left, [a], b in Pleroma.Bookmark, on: a.id == b.activity_id)
|> group_by([a], fragment("? ->> 'context'::text", a.data))
|> having(
[a],
not fragment(
# Posts (checked on Create Activity) is non-public
"bool_or((not(?->'to' \\? ? OR ?->'cc' \\? ?)) and ? ->> 'type' = 'Create')",
a.data,
^Pleroma.Constants.as_public(),
a.data,
^Pleroma.Constants.as_public(),
a.data
{del_obj, _} =
if Keyword.get(options, :keep_threads) do
# We want to delete objects from threads where
# 1. the newest post is still old
# 2. none of the activities is local
# 3. none of the activities is bookmarked
# 4. optionally none of the posts is non-public
deletable_context =
if Keyword.get(options, :keep_non_public) do
Pleroma.Activity
|> join(:left, [a], b in Pleroma.Bookmark, on: a.id == b.activity_id)
|> group_by([a], fragment("? ->> 'context'::text", a.data))
|> having(
[a],
not fragment(
# Posts (checked on Create Activity) is non-public
"bool_or((not(?->'to' \\? ? OR ?->'cc' \\? ?)) and ? ->> 'type' = 'Create')",
a.data,
^Pleroma.Constants.as_public(),
a.data,
^Pleroma.Constants.as_public(),
a.data
)
)
)
else
Pleroma.Activity
|> join(:left, [a], b in Pleroma.Bookmark, on: a.id == b.activity_id)
|> group_by([a], fragment("? ->> 'context'::text", a.data))
end
|> having([a], max(a.updated_at) < ^time_deadline)
|> having([a], not fragment("bool_or(?)", a.local))
|> having([_, b], fragment("max(?::text) is null", b.id))
|> select([a], fragment("? ->> 'context'::text", a.data))
else
Pleroma.Activity
|> join(:left, [a], b in Pleroma.Bookmark, on: a.id == b.activity_id)
|> group_by([a], fragment("? ->> 'context'::text", a.data))
end
|> having([a], max(a.updated_at) < ^time_deadline)
|> having([a], not fragment("bool_or(?)", a.local))
|> having([_, b], fragment("max(?::text) is null", b.id))
|> maybe_limit(limit_cnt)
|> select([a], fragment("? ->> 'context'::text", a.data))
Pleroma.Object
|> where([o], fragment("? ->> 'context'::text", o.data) in subquery(deletable_context))
else
if Keyword.get(options, :keep_non_public) do
Pleroma.Object
|> where(
[o],
fragment(
"?->'to' \\? ? OR ?->'cc' \\? ?",
o.data,
^Pleroma.Constants.as_public(),
o.data,
^Pleroma.Constants.as_public()
)
)
|> where([o], fragment("? ->> 'context'::text", o.data) in subquery(deletable_context))
else
deletable =
if Keyword.get(options, :keep_non_public) do
Pleroma.Object
|> where(
[o],
fragment(
"?->'to' \\? ? OR ?->'cc' \\? ?",
o.data,
^Pleroma.Constants.as_public(),
o.data,
^Pleroma.Constants.as_public()
)
)
else
Pleroma.Object
end
|> where([o], o.updated_at < ^time_deadline)
|> where(
[o],
fragment("split_part(?->>'actor', '/', 3) != ?", o.data, ^Pleroma.Web.Endpoint.host())
)
|> maybe_limit(limit_cnt)
|> select([o], o.id)
Pleroma.Object
|> where([o], o.id in subquery(deletable))
end
|> where([o], o.updated_at < ^time_deadline)
|> where(
[o],
fragment("split_part(?->>'actor', '/', 3) != ?", o.data, ^Pleroma.Web.Endpoint.host())
)
end
|> Repo.delete_all(timeout: :infinity)
|> Repo.delete_all(timeout: :infinity)
Logger.info("Deleted #{del_obj} objects...")
if !Keyword.get(options, :keep_threads) do
# Without the --keep-threads option, it's possible that bookmarked
# objects have been deleted. We remove the corresponding bookmarks.
"""
delete from public.bookmarks
where id in (
select b.id from public.bookmarks b
left join public.activities a on b.activity_id = a.id
left join public.objects o on a."data" ->> 'object' = o.data ->> 'id'
where o.id is null
)
"""
|> Repo.query([], timeout: :infinity)
%{:num_rows => del_bookmarks} =
"""
delete from public.bookmarks
where id in (
select b.id from public.bookmarks b
left join public.activities a on b.activity_id = a.id
left join public.objects o on a."data" ->> 'object' = o.data ->> 'id'
where o.id is null
)
"""
|> Repo.query!([], timeout: :infinity)
Logger.info("Deleted #{del_bookmarks} orphaned bookmarks...")
end
if Keyword.get(options, :prune_orphaned_activities) do
# Prune activities who link to a single object
"""
delete from public.activities
where id in (
select a.id from public.activities a
left join public.objects o on a.data ->> 'object' = o.data ->> 'id'
left join public.activities a2 on a.data ->> 'object' = a2.data ->> 'id'
left join public.users u on a.data ->> 'object' = u.ap_id
where not a.local
and jsonb_typeof(a."data" -> 'object') = 'string'
and o.id is null
and a2.id is null
and u.id is null
)
"""
|> Repo.query([], timeout: :infinity)
# Prune activities who link to an array of objects
"""
delete from public.activities
where id in (
select a.id from public.activities a
join json_array_elements_text((a."data" -> 'object')::json) as j on jsonb_typeof(a."data" -> 'object') = 'array'
left join public.objects o on j.value = o.data ->> 'id'
left join public.activities a2 on j.value = a2.data ->> 'id'
left join public.users u on j.value = u.ap_id
group by a.id
having max(o.data ->> 'id') is null
and max(a2.data ->> 'id') is null
and max(u.ap_id) is null
)
"""
|> Repo.query([], timeout: :infinity)
del_activities = prune_orphaned_activities()
Logger.info("Deleted #{del_activities} orphaned activities...")
end
"""
DELETE FROM hashtags AS ht
WHERE NOT EXISTS (
SELECT 1 FROM hashtags_objects hto
WHERE ht.id = hto.hashtag_id)
"""
|> Repo.query()
%{:num_rows => del_hashtags} =
"""
DELETE FROM hashtags AS ht
WHERE NOT EXISTS (
SELECT 1 FROM hashtags_objects hto
WHERE ht.id = hto.hashtag_id)
"""
|> Repo.query!()
Logger.info("Deleted #{del_hashtags} no longer used hashtags...")
if Keyword.get(options, :vacuum) do
Logger.info("Starting vacuum...")
Maintenance.vacuum("full")
end
Logger.info("All done!")
end
def run(["prune_task"]) do

View file

@ -470,7 +470,7 @@ test "it prunes orphaned activities with the --prune-orphaned-activities" do
assert length(activities) == 4
end
test "it prunes orphaned activities with the --prune-orphaned-activities when the objects are referenced from an array" do
test "it prunes orphaned activities with prune_orphaned_activities when the objects are referenced from an array" do
%Object{} |> Map.merge(%{data: %{"id" => "existing_object"}}) |> Repo.insert()
%User{} |> Map.merge(%{ap_id: "existing_actor"}) |> Repo.insert()
@ -478,6 +478,7 @@ test "it prunes orphaned activities with the --prune-orphaned-activities when th
|> Map.merge(%{
local: false,
data: %{
"type" => "Flag",
"id" => "remote_activity_existing_object",
"object" => ["non_ existing_object", "existing_object"]
}
@ -488,6 +489,7 @@ test "it prunes orphaned activities with the --prune-orphaned-activities when th
|> Map.merge(%{
local: false,
data: %{
"type" => "Flag",
"id" => "remote_activity_existing_actor",
"object" => ["non_ existing_object", "existing_actor"]
}
@ -498,6 +500,7 @@ test "it prunes orphaned activities with the --prune-orphaned-activities when th
|> Map.merge(%{
local: false,
data: %{
"type" => "Flag",
"id" => "remote_activity_existing_activity",
"object" => ["non_ existing_object", "remote_activity_existing_actor"]
}
@ -508,6 +511,7 @@ test "it prunes orphaned activities with the --prune-orphaned-activities when th
|> Map.merge(%{
local: false,
data: %{
"type" => "Flag",
"id" => "remote_activity_without_existing_referenced_object",
"object" => ["owo", "whats_this"]
}
@ -517,7 +521,7 @@ test "it prunes orphaned activities with the --prune-orphaned-activities when th
assert length(Repo.all(Activity)) == 4
Mix.Tasks.Pleroma.Database.run(["prune_objects"])
assert length(Repo.all(Activity)) == 4
Mix.Tasks.Pleroma.Database.run(["prune_objects", "--prune-orphaned-activities"])
Mix.Tasks.Pleroma.Database.run(["prune_orphaned_activities"])
activities = Repo.all(Activity)
assert length(activities) == 3