Inlined reblog / announce / boost, like, reply and poll vote info desyncs from activities #956

Open
opened 2025-08-01 20:52:19 +00:00 by Oneric · 6 comments
Owner

Your setup

From source

Extra details

Fedora (i believe)

Version

basically 16d7d612ff (exact: this fork)

PostgreSQL version

17

Current Status (updated)

  • there are now mix tasks available to retroactively fix desyncs for stuff inlined into objects: #964
    • but poll resyncs are missing but may be impossible for already closed polls and must not apply to remote polls
  • we still aren’t sure about the root cause and desyncs most likely will continue to happen during operation (necessitating regular resyncs via the above mix task)
    • candidates are:
      • cachex object cache not being synced to the failure/success of and leaking across transactions
      • inlined copies only getting update after the main processing transaction ended — and funny things are known to happen due to this split (all local database stuff should happen in the same transaction; only federation and notification effects etc should be done outside the transaction)
  • the basic premise of having multiple references for one thing which might desync applies to more which needs to be investigated:
    • poll votes are handled analogously to Like and Announces, just with the authoritative reference being Answer objects rather than an activity type
      AFFECTED: I can now confirm poll votes are affected too. From incoming votes on a multi-choice poll, both vote Answer objects from a particular unlucky single voter were accepted and inserted into the database, but only one showed up in the inlined count. This effect manifest near immediately while the poll was still running ruling out any late-rot theories thus further cementing the dubious elixir-side caches as the main suspect
    • follow requests have both a dedicated table, but (in their current implementation) also sometimes need to look up and act on the followactivities (and via a custom field the follow state is also duplicated in those activities...).
      Unlike the others, here no elixir-side caching is directly involved in either side though

What’s wrong?

I noticed some, but not all, old posts I once boosted, still showed up in my user profile (as expected), but the post details showed as if I didn’t boost it (in some cases the post details claimed nobody boosted it despite it showing up in my profile).

Attempting to re-boost the post via akko-fe, didn’t change the visual status nor the logical state. Notably the api/v1/statuses/:id/reblog API did return a success though albeit the copy of the post details in its response body too indicated "reblogged": "false".
This means affected posts are both impossible to re-boost and also un-boost via the frontend.

Manually calling POST api/v1/statuses/:id/unreblog still worked though restoring things to a consistent state.


As it turns out, I got somehow lost from the inlined announcements list inside the posts data column. This is an inlined JSON array of the AP ids of all users who boosted a post. The Announce activities still existed in the database however, leading to it showing up in the profile(, preventing deletion during db prunes) and likely preventing /reblog attempts from doing anything at all.

I have no idea how this happened or if it’s an still existing bug or something already fixed.
Ideally such info shouldn’t be duplicated at all, but I imagine the db will collapse if every render has to join objects and activities via AP IDs to get the list of boosters.

Even if the underlying issue is already fixed or we figure out what the issue is¹, there might be more affected posts and instances than the few I’ve seen myself. If this issue is more widespread, adding a mix task to fix up the db state might make sense.


[1]: Afairc though, it used to be normal and was only corrupted at some unknown later point. Else I’d have guessed it’s due to elixir-side caches not being in sync with transaction failures / success and a concurrent change to the same post ended up poisoning the cache and thus undoing the initial change from the boost.

Severity

I can manage

Have you searched for this issue?

  • I have double-checked and have not found this issue mentioned anywhere.
### Your setup From source ### Extra details Fedora (i believe) ### Version basically 16d7d612ffe21d3c83606dadb6ccfdf0217184ee (exact: [this fork](https://akkoma.dev/norm/akkoma/commit/03714b70b78f717ad99d1d5e09523aa671cdcdb2)) ### PostgreSQL version 17 ### Current Status (updated) - [X] there are now `mix` tasks available to retroactively fix desyncs for stuff inlined into `objects`: #964 - [ ] but poll resyncs are missing but may be impossible for already closed polls and must not apply to remote polls - [ ] we still aren’t sure about the root cause and desyncs most likely will continue to happen during operation *(necessitating regular resyncs via the above mix task)* - candidates are: - cachex object cache not being synced to the failure/success of and leaking across transactions - inlined copies only getting update _after_ the main processing transaction ended — and funny things are known to happen due to this split *(all local database stuff should happen in the same transaction; only federation and notification effects etc should be done outside the transaction)* - [ ] the basic premise of having multiple references for one thing which might desync applies to more which needs to be investigated: - [X] poll votes are handled analogously to `Like` and `Announce`s, just with the authoritative reference being `Answer` *objects* rather than an activity type **AFFECTED**: I can now confirm poll votes are affected too. From incoming votes on a multi-choice poll, both vote Answer objects from a particular unlucky single voter were accepted and inserted into the database, but only one showed up in the inlined count. This effect manifest near immediately while the poll was still running ruling out any late-rot theories thus further cementing the dubious elixir-side caches as the main suspect - [ ] follow requests have both a dedicated table, but (in their current implementation) _also_ sometimes need to look up and act on the `follow`activities (and via a custom field the follow state is also duplicated in those activities...). Unlike the others, here no elixir-side caching is directly involved in either side though ### What’s wrong? I noticed some, but not all, old posts I once boosted, still showed up in my user profile *(as expected)*, but the post details showed as if I didn’t boost it *(in some cases the post details claimed _nobody_ boosted it despite it showing up in my profile)*. Attempting to re-boost the post via akko-fe, didn’t change the visual status nor the logical state. Notably the `api/v1/statuses/:id/reblog` API did return a success though albeit the copy of the post details in its response body too indicated `"reblogged": "false"`. This means affected posts are both impossible to re-boost and _also_ un-boost via the frontend. Manually calling `POST api/v1/statuses/:id/unreblog` still worked though restoring things to a consistent state. ---- As it turns out, I got somehow lost from the inlined `announcements` list inside the posts `data` column. This is an inlined JSON array of the AP ids of all users who boosted a post. The `Announce` activities still existed in the database however, leading to it showing up in the profile(, preventing deletion during db prunes) and likely preventing `/reblog` attempts from doing anything at all. I have no idea how this happened or if it’s an still existing bug or something already fixed. Ideally such info shouldn’t be duplicated at all, but I imagine the db will collapse if every render has to join `objects` and `activities` via AP IDs to get the list of boosters. Even if the underlying issue is already fixed or we figure out what the issue is¹, there might be more affected posts and instances than the few I’ve seen myself. If this issue is more widespread, adding a `mix` task to fix up the db state might make sense. ---- [1]: Afairc though, it used to be normal and was only corrupted at some unknown later point. Else I’d have guessed it’s due to elixir-side caches not being in sync with transaction failures / success and a concurrent change to the same post ended up poisoning the cache and thus undoing the initial change from the boost. ### Severity I can manage ### Have you searched for this issue? - [x] I have double-checked and have not found this issue mentioned anywhere.
Author
Owner

After checking the database with an (unoptimised) query, I can confirm this also affects 24 objects on my own instance; the oldest from shortly after instance creation and the newest from end of May (so about two months ago).

SELECT o.id, o.data->>'id', o.inserted_at, o.updated_at, o.data->>'announcements'
FROM objects AS o,
     jsonb_to_record(o.data) AS ia(announcements text[])
WHERE
  jsonb_typeof(o.data->'announcements') = 'array' AND
  EXISTS (
    SELECT a.data->>'actor'
    FROM activities AS a
    WHERE a.data->>'type' = 'Announce' AND
          COALESCE(a.data->'object'->>'id', a.data->>'object') = o.data->>'id' AND
          a.data->>'actor' <> ALL(ia.announcements)
    LIMIT 1
  )
;
After checking the database with an (unoptimised) query, I can confirm this also affects 24 objects on my own instance; the oldest from shortly after instance creation and the newest from end of May (so about two months ago). ```sql SELECT o.id, o.data->>'id', o.inserted_at, o.updated_at, o.data->>'announcements' FROM objects AS o, jsonb_to_record(o.data) AS ia(announcements text[]) WHERE jsonb_typeof(o.data->'announcements') = 'array' AND EXISTS ( SELECT a.data->>'actor' FROM activities AS a WHERE a.data->>'type' = 'Announce' AND COALESCE(a.data->'object'->>'id', a.data->>'object') = o.data->>'id' AND a.data->>'actor' <> ALL(ia.announcements) LIMIT 1 ) ; ```
Author
Owner

For a single affected post, this query fixes the issue:

UPDATE objects AS o
--- CAUTION: this syntax requires PostgreSQL 14 !!
SET data['announcements'] = to_jsonb(array(
  SELECT a.data->>'actor'
  FROM activities AS a
  WHERE a.data->>'type' = 'Announce' AND
        COALESCE(a.data->'object'->>'id', a.data->>'object') = o.data->>'id' AND
        a.data->>'actor' IS NOT NULL
))
WHERE o.id = 83;

UPDATE objects
SET data['announcement_count'] = to_jsonb(jsonb_array_length(data->'announcements'))
WHERE id = 83;

Besides identifying and fixing the underlying issue ofc, it remains to optimise both queries, convert them into Ecto format and create a mix task which both detects and fixes this where necessary.
Checking the opposite direction too, i.e. users being listed in announcements etc without an Announce activity existing, seems like a good idea too.

We might also want to check whether anything else is implemented similarly and could be affected; namely replies, likes and emoji reactions

UPDATE (without checking the other direction still):

  • replies is indeed also heavily affected (some ~4700 occurrences; if excluding non-public replies it’s still ~4600); since akko-fe uses context to fetcht threads this doesn’t show up much though
    Desynced repliesCounts "only" occured 57 times
  • likes too is affected (29 instances on my instance); this should be similarly visible in the frontend like the originally reported boost issue
  • inlined reactions use a different format which is more effort to fully check. There are no cases of all inlined emoji reactions having vanished while associated EmojiReact activities remain, but it seems plausible this one suffers too
For a single affected post, this query fixes the issue: ```sql UPDATE objects AS o --- CAUTION: this syntax requires PostgreSQL 14 !! SET data['announcements'] = to_jsonb(array( SELECT a.data->>'actor' FROM activities AS a WHERE a.data->>'type' = 'Announce' AND COALESCE(a.data->'object'->>'id', a.data->>'object') = o.data->>'id' AND a.data->>'actor' IS NOT NULL )) WHERE o.id = 83; UPDATE objects SET data['announcement_count'] = to_jsonb(jsonb_array_length(data->'announcements')) WHERE id = 83; ``` Besides identifying and fixing the underlying issue ofc, it remains to optimise both queries, convert them into `Ecto` format and create a `mix` task which both detects and fixes this where necessary. Checking the opposite direction too, i.e. users being listed in `announcements` etc _without_ an `Announce` activity existing, seems like a good idea too. We might also want to check whether anything else is implemented similarly and could be affected; namely replies, likes and emoji reactions **UPDATE** (without checking the other direction still): - ~~`replies` is indeed also **heavily** affected (some \~4700 occurrences; if excluding non-public replies it’s still \~4600); since akko-fe uses `context` to fetcht threads this doesn’t show up much though~~ Desynced `repliesCount`s "only" occured 57 times - `likes` too is affected *(29 instances on my instance)*; this should be similarly visible in the frontend like the originally reported boost issue - inlined `reactions` use a different format which is more effort to fully check. There are no cases of _all_ inlined emoji reactions having vanished while associated `EmojiReact` activities remain, but it seems plausible this one suffers too
Oneric changed title from Inlined reblogs / announces / boosts info desyncs from Announce activities to Inlined reblog / announce / boost info desyncs from Announce activities 2025-08-01 21:39:18 +00:00
Oneric changed title from Inlined reblog / announce / boost info desyncs from Announce activities to Inlined reblog / announce / boost, like and reply info desyncs from activities 2025-08-01 23:48:36 +00:00
Author
Owner

(Still unoptimised) queries which identify and fix both missing (activity exists, but not in inlined data) and zombie (still in inlined data, but no activity) in one go for likes and announcements. Since they only differ by trivial replacements, just the one for likes is shown below:

WITH ref(apid, correct) AS (
  SELECT data->>'object' AS apid, to_jsonb(ARRAY_AGG(data->>'actor'))
  FROM activities
  WHERE
   data->>'type' = 'Like'
   AND data->>'id' IS NOT NULL
   AND data->>'actor' IS NOT NULL
 GROUP BY apid
)
UPDATE objects AS o
--- this syntax needs PostgreSQL >=14
SET data['likes'] = ref.correct,
    data['like_count'] = to_jsonb(jsonb_array_length(ref.correct))
FROM ref
WHERE
  o.data->>'id' = ref.apid
  AND jsonb_typeof(o.data->'likes') = 'array'
  AND NOT (
    (ref.correct @> (o.data->'likes')) AND (ref.correct <@ (o.data->'likes'))
  )
;

Since they don't seem to have much value, I’m inclined to just drop inlined replies now already rather than resyncing them. The reply count will be kept though since it’s used in regular status responses. While it too desyncs, this is only a cosmetic issue.
UPDATE turns out the inlined replies array already is unused anyway. Them being in the database is just an artifact of it temporarily holding the AP IDs of existing replies to fetch when a post is initially received.

Query to fix replies count using only publicly-visible replies (note: only one of two instances in the code of increasing this counter currently restricts this to publicly-visible content)

WITH ref(apid, cnt) AS (
  SELECT data->>'inReplyTo' AS apid, COUNT(*)
  FROM objects
  WHERE
    data->>'type' <> 'Answer'
    AND data->>'inReplyTo' IS NOT NULL
    AND (
      data->'to' @> '"https://www.w3.org/ns/activitystreams#Public"'::jsonb OR
      data->'cc' @> '"https://www.w3.org/ns/activitystreams#Public"'::jsonb
    )
  GROUP BY apid
)
UPDATE objects AS o
--- this syntax requires PostgreSQL >= 14
SET data['repliesCount'] = to_jsonb(ref.cnt)
FROM ref
WHERE
  o.data->>'id' = ref.apid AND
  (o.data->>'repliesCount')::bigint <> ref.cnt
;

(The format of inlined reactions remains a pain to deal with, so I still haven’t looked at it properly.
In fact it’s even worse, Pleroma.Web.PleromaAPI.EmojiReactionController::filter_allowed_users/3 shows there are three different allowed formats for array elements)

(Still unoptimised) queries which identify and fix both missing *(activity exists, but not in inlined data)* and zombie *(still in inlined data, but no activity)* in one go for likes and announcements. Since they only differ by trivial replacements, just the one for likes is shown below: ```sql WITH ref(apid, correct) AS ( SELECT data->>'object' AS apid, to_jsonb(ARRAY_AGG(data->>'actor')) FROM activities WHERE data->>'type' = 'Like' AND data->>'id' IS NOT NULL AND data->>'actor' IS NOT NULL GROUP BY apid ) UPDATE objects AS o --- this syntax needs PostgreSQL >=14 SET data['likes'] = ref.correct, data['like_count'] = to_jsonb(jsonb_array_length(ref.correct)) FROM ref WHERE o.data->>'id' = ref.apid AND jsonb_typeof(o.data->'likes') = 'array' AND NOT ( (ref.correct @> (o.data->'likes')) AND (ref.correct <@ (o.data->'likes')) ) ; ``` Since they don't seem to have much value, I’m inclined to just drop inlined replies now already rather than resyncing them. The reply count will be kept though since it’s used in regular status responses. While it too desyncs, this is only a cosmetic issue. **UPDATE** turns out the inlined `replies` array already is unused anyway. Them being in the database is just an artifact of it _temporarily_ holding the AP IDs of existing replies to fetch when a post is initially received. Query to fix replies count using only publicly-visible replies *(note: only one of two instances in the code of increasing this counter currently restricts this to publicly-visible content)* ```sql WITH ref(apid, cnt) AS ( SELECT data->>'inReplyTo' AS apid, COUNT(*) FROM objects WHERE data->>'type' <> 'Answer' AND data->>'inReplyTo' IS NOT NULL AND ( data->'to' @> '"https://www.w3.org/ns/activitystreams#Public"'::jsonb OR data->'cc' @> '"https://www.w3.org/ns/activitystreams#Public"'::jsonb ) GROUP BY apid ) UPDATE objects AS o --- this syntax requires PostgreSQL >= 14 SET data['repliesCount'] = to_jsonb(ref.cnt) FROM ref WHERE o.data->>'id' = ref.apid AND (o.data->>'repliesCount')::bigint <> ref.cnt ; ``` *(The format of inlined reactions remains a pain to deal with, so I still haven’t looked at it properly. In fact it’s even worse, `Pleroma.Web.PleromaAPI.EmojiReactionController::filter_allowed_users/3` shows there are **three** different allowed formats for array elements)*
Author
Owner

Alright, I can now confirm reactions were indeed also affected, but apart from the expected desync as observed in other inline copies, there were also oddities resulting from Iceshrimp.NET federating (from its POV) remote emoji reactions with a @remote.domain indicator as part of the emoji shortcode. This sometimes lead to it being treated as a seperate emoji and sometimes correctly being folded into the orignal emoji (perhaps depending on the order in which reactions were processed).
This might also be why sometimes remote emoji reactions just don’t seem to work (the API will split the name at the @ and look for a match for the initial part)

I’ll let the validator drop these remote indicators and add a migration to fix up any existing activities.

I had 5 such Iceshrimp.NET-flavour remote emoji reactions 3 of which counted as separate emoji entries in reaction and 1 desync unrelated to such remote indicators.

Since pure SQL is often more readable than fragment-infested ecto, here’s a pure SQL query to check for reactions desync — assuming each custom emoji reaction has exactly one element in the tag array:

WITH expanded AS (
  SELECT data->>'object' AS apid,
         TRIM(data->>'content', ':') AS emoji_name,
         ARRAY_AGG(DISTINCT data->>'actor') AS actors,
         data#>>'{tag,0,icon,url}' AS url
  FROM activities
  WHERE
    data->>'type' = 'EmojiReact' AND
    data->>'actor' IS NOT NULL AND
    TRIM(data->>'content', ':') IS NOT NULL
  GROUP BY
    apid, emoji_name, url
), ref(apid, correct) AS (
  SELECT apid, jsonb_agg(DISTINCT ARRAY[to_jsonb(emoji_name), to_jsonb(actors), to_jsonb(url)])
  FROM expanded
  GROUP BY apid
)
SELECT r.apid, o.data->'reactions', r.correct
FROM objects AS o
     JOIN ref AS r ON r.apid = o.data->>'id'
WHERE NOT (r.correct @> (o.data->'reactions') AND r.correct <@ (o.data->'reactions'))
;
Alright, I can now confirm `reactions` were indeed also affected, but apart from the expected desync as observed in other inline copies, there were also oddities resulting from Iceshrimp.NET federating (from its POV) remote emoji reactions with a `@remote.domain` indicator as part of the emoji shortcode. This sometimes lead to it being treated as a seperate emoji and sometimes correctly being folded into the orignal emoji (perhaps depending on the order in which reactions were processed). This might also be why sometimes remote emoji reactions just don’t seem to work *(the API will split the name at the `@` and look for a match for the initial part)* I’ll let the validator drop these remote indicators and add a migration to fix up any existing activities. I had 5 such Iceshrimp.NET-flavour remote emoji reactions 3 of which counted as separate emoji entries in `reaction` and 1 desync unrelated to such remote indicators. Since pure SQL is often more readable than `fragment`-infested ecto, here’s a pure SQL query to check for `reactions` desync — assuming each custom emoji reaction has exactly one element in the `tag` array: ```sql WITH expanded AS ( SELECT data->>'object' AS apid, TRIM(data->>'content', ':') AS emoji_name, ARRAY_AGG(DISTINCT data->>'actor') AS actors, data#>>'{tag,0,icon,url}' AS url FROM activities WHERE data->>'type' = 'EmojiReact' AND data->>'actor' IS NOT NULL AND TRIM(data->>'content', ':') IS NOT NULL GROUP BY apid, emoji_name, url ), ref(apid, correct) AS ( SELECT apid, jsonb_agg(DISTINCT ARRAY[to_jsonb(emoji_name), to_jsonb(actors), to_jsonb(url)]) FROM expanded GROUP BY apid ) SELECT r.apid, o.data->'reactions', r.correct FROM objects AS o JOIN ref AS r ON r.apid = o.data->>'id' WHERE NOT (r.correct @> (o.data->'reactions') AND r.correct <@ (o.data->'reactions')) ; ```
Author
Owner

Just noticed most of the inlined copies are updated only in side effects, i.e. after the main transaction ended already and we know some funny things occasionally happen during side-effect processing (#888). Though the odd crashes from #888 don’t seem to always cause a mismatch like observed here, they might still account for a portion of the missing entries

Just noticed most of the inlined copies are updated only in _side effects_, i.e. _after_ the main transaction ended already and we know some funny things occasionally happen during side-effect processing (#888). Though the odd crashes from #888 don’t seem to always cause a mismatch like observed here, they might still account for a portion of the missing entries
Oneric changed title from Inlined reblog / announce / boost, like and reply info desyncs from activities to Inlined reblog / announce / boost, like, reply and poll vote info desyncs from activities 2026-02-05 23:54:39 +00:00
Author
Owner

I can confirm this also affects poll votes.

From incoming votes on a multi-choice poll, both vote Answer objects from a particular unlucky single voter were accepted and inserted into the database, but only one showed up in the inlined count. This effect manifest near immediately while the poll was still running ruling out any late-rot theories thus further cementing the dubious elixir-side caches as the main suspect

The mix task should be extended to also resync local(!) polls, though if we delete Answer objects (from remote users) after a poll closed, this might only be applicable to still open polls. (And if we don't delete them yet, we perhaps should since they’re no longer relevant (other than to retroactively fix this bug))

I can confirm this also affects poll votes. From incoming votes on a multi-choice poll, both vote Answer objects from a particular unlucky single voter were accepted and inserted into the database, but only one showed up in the inlined count. This effect manifest near immediately while the poll was still running ruling out any late-rot theories thus further cementing the dubious elixir-side caches as the main suspect The mix task should be extended to also resync local(!) polls, though _if_ we delete `Answer` objects (from remote users) after a poll closed, this might only be applicable to still open polls. (And if we don't delete them yet, we perhaps should since they’re no longer relevant (other than to retroactively fix this bug))
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
AkkomaGang/akkoma#956
No description provided.