Update hashtag prune to account for followed hashtags #844

Open
norm wants to merge 2 commits from norm/akkoma:hashtag-prune into develop
Contributor

Currently pruning hashtags with the prune_objects task only accounts for whether that hashtag is associated with an object, but this may lead to a foreign key constraint violation if that hashtag has no objects but is followed by a local user.

This adds an additional check to see if that hashtag has any followers before proceeding to delete it.

Currently pruning hashtags with the `prune_objects` task only accounts for whether that hashtag is associated with an object, but this may lead to a foreign key constraint violation if that hashtag has no objects but is followed by a local user. This adds an additional check to see if that hashtag has any followers before proceeding to delete it.
norm added 1 commit 2024-10-25 15:11:07 +00:00
Update hashtag prune to account for followed hashtags
Some checks are pending
ci/woodpecker/pr/build-amd64 Pipeline is pending
ci/woodpecker/pr/build-arm64 Pipeline is pending
ci/woodpecker/pr/docs Pipeline is pending
ci/woodpecker/pr/lint Pipeline is pending
ci/woodpecker/pr/test Pipeline is pending
893d9ce93a
norm force-pushed hashtag-prune from 893d9ce93a to 40da4e88ea 2024-10-25 15:56:27 +00:00 Compare
norm added 1 commit 2024-10-25 16:25:28 +00:00
Use LEFT JOIN instead of UNION for hashtag pruning
Some checks are pending
ci/woodpecker/pr/build-amd64 Pipeline is pending
ci/woodpecker/pr/build-arm64 Pipeline is pending
ci/woodpecker/pr/docs Pipeline is pending
ci/woodpecker/pr/lint Pipeline is pending
ci/woodpecker/pr/test Pipeline is pending
f7fe187c70
norm force-pushed hashtag-prune from f7fe187c70 to 88a8086ad3 2024-10-25 16:26:35 +00:00 Compare
Oneric reviewed 2024-10-26 22:23:09 +00:00
Oneric left a comment
Member

Seems fine, though hearing it both variants were verified to produce the same result would be reassuring

Notably yet another, more standard and simpler variant to write this is

DELETE FROM hashtags AS ht
WHERE ht.id NOT IN (
  SELECT hashtag_id FROM hashtags_objects
  UNION
  SELECT hashtag_id FROM user_follows_hashtag
);

or alternatively with UNION ALL instead of UNION.

On my db (small, young instance; PostgreSQL 16) the initial SELECT 1 .. UNION variant performs atrociously (821 012.04) according to EXPLAIN and while UNION ALL is somewhat better (69 485.28), USING ... LEFT JOIN is estimated to take an order of magnitude less still (6 020.80). The above with UNION is a tad better (5 810.82) and UNION ALL supposedly better still (4 081.40) outperforming even the pre-patch version.
For reference the previous query without followed hashtags was estimated at 4 478.42 and if rewritten with an NOT IN instead of NOT EXISTS 3 362.64.
However, these numbers may be skewed by my user_follow_hashtag_table being empty and the cost estimate for UNION vs UNION ALL may potentially be quite a bit off depending on how hashtag frequency is distributed.

Ideally real execution times (EXPLAIN ANALYZE) for each would be tested for the same action via aborted transactions on multiple, bigger instances. (When doing this keep caching in mind, and measure the execution time of each query a couple of times in a row to avoid skewing results). Practically though, USING .. LEFT JOIN may very well be good enough since other queries during prune take orders of magnitude more time so while i welcome further experimenting and optimising on this, it’s not a reason to hold this bugfix back as far as I’m concerned

Seems fine, though hearing it both variants were verified to produce the same result would be reassuring Notably yet another, more standard and simpler variant to write this is ```sql DELETE FROM hashtags AS ht WHERE ht.id NOT IN ( SELECT hashtag_id FROM hashtags_objects UNION SELECT hashtag_id FROM user_follows_hashtag ); ``` or alternatively with `UNION ALL` instead of `UNION`. On my db *(small, young instance; PostgreSQL 16)* the initial `SELECT 1 .. UNION` variant performs atrociously *(821 012.04)* according to `EXPLAIN` and while `UNION ALL` is somewhat better *(69 485.28)*, `USING ... LEFT JOIN` is estimated to take an order of magnitude less still *(6 020.80)*. The above with `UNION` is a tad better *(5 810.82)* and `UNION ALL` supposedly better still *(4 081.40)* outperforming even the pre-patch version. For reference the previous query without followed hashtags was estimated at 4 478.42 and if rewritten with an `NOT IN` instead of `NOT EXISTS` 3 362.64. **However**, these numbers may be skewed by my `user_follow_hashtag_table` being empty and the cost estimate for `UNION` vs `UNION ALL` may potentially be quite a bit off depending on how hashtag frequency is distributed. Ideally real execution times (`EXPLAIN ANALYZE`) for each would be tested for the same action via aborted transactions on multiple, bigger instances. *(When doing this keep caching in mind, and measure the execution time of each query a couple of times in a row to avoid skewing results)*. Practically though, `USING .. LEFT JOIN` may very well be good enough since other queries during prune take orders of magnitude more time so while i welcome further experimenting and optimising on this, it’s not a reason to hold this bugfix back as far as I’m concerned
Author
Contributor

Running the simpler query resulted in a foreign key violation.

I was however able to confirm that both the SELECT 1 ... UNION variant and the LEFT JOIN variant both deleted the same amount of rows.

akkoma=# BEGIN;
BEGIN
akkoma=*# 
      DELETE FROM hashtags AS ht
      WHERE NOT EXISTS (
        SELECT 1 FROM hashtags_objects hto
akkoma(*# 
        WHERE ht.id = hto.hashtag_id
        UNION
        SELECT 1 FROM user_follows_hashtag ufht
        WHERE ht.id = ufht.hashtag_id);
DELETE 1
akkoma=*# rollback;
ROLLBACK
akkoma=# begin;
BEGIN
akkoma=*# 
      DELETE FROM hashtags
      USING hashtags AS ht
      LEFT JOIN hashtags_objects hto
            ON ht.id = hto.hashtag_id
      LEFT JOIN user_follows_hashtag ufht
            ON ht.id = ufht.hashtag_id
      WHERE
          hashtags.id = ht.id
          AND hto.hashtag_id is NULL
          AND ufht.hashtag_id is NULL;
DELETE 1
akkoma=*# rollback;
~~Running the simpler query resulted in a foreign key violation.~~ I was however able to confirm that both the `SELECT 1 ... UNION` variant and the `LEFT JOIN` variant both deleted the same amount of rows. ``` akkoma=# BEGIN; BEGIN akkoma=*# DELETE FROM hashtags AS ht WHERE NOT EXISTS ( SELECT 1 FROM hashtags_objects hto akkoma(*# WHERE ht.id = hto.hashtag_id UNION SELECT 1 FROM user_follows_hashtag ufht WHERE ht.id = ufht.hashtag_id); DELETE 1 akkoma=*# rollback; ROLLBACK akkoma=# begin; BEGIN akkoma=*# DELETE FROM hashtags USING hashtags AS ht LEFT JOIN hashtags_objects hto ON ht.id = hto.hashtag_id LEFT JOIN user_follows_hashtag ufht ON ht.id = ufht.hashtag_id WHERE hashtags.id = ht.id AND hto.hashtag_id is NULL AND ufht.hashtag_id is NULL; DELETE 1 akkoma=*# rollback; ```
Author
Contributor

EXPLAIN ANALYZE on the UNION version took an average of 1486.024ms, while LEFT JOIN took 286.3125 ms in my testing so the LEFT JOIN query is definitely more optimal in my case.

`EXPLAIN ANALYZE` on the `UNION` version took an average of 1486.024ms, while `LEFT JOIN` took 286.3125 ms in my testing so the `LEFT JOIN` query is definitely more optimal in my case.
Member

Running the simpler query resulted in a foreign key violation.

oops, crucial omission: try again with NOT IN instead of IN
(i’ll edited the previous post to avoid further confusion and fix the slightly different costs)

> Running the simpler query resulted in a foreign key violation. oops, crucial omission: try again with `NOT IN` instead of `IN` *(i’ll edited the previous post to avoid further confusion and fix the slightly different costs)*
Author
Contributor

Seems like that query is slower than the LEFT JOIN but not as slow as the UNION. I got an average of 426.755 ms over 3 runs of that query on my database.

Seems like that query is slower than the `LEFT JOIN` but not as slow as the `UNION`. I got an average of 426.755 ms over 3 runs of that query on my database.
Author
Contributor

This is the query plan I got when running that query:

                                                                     QUERY PLAN                                                                      
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on hashtags ht  (cost=119932.75..123683.54 rows=0 width=0) (actual time=416.968..416.970 rows=0 loops=1)
   ->  Seq Scan on hashtags ht  (cost=119932.75..123683.54 rows=89472 width=6) (actual time=416.328..416.952 rows=1 loops=1)
         Filter: (NOT (hashed SubPlan 1))
         Rows Removed by Filter: 168277
         SubPlan 1
           ->  HashAggregate  (cost=86009.25..115757.55 rows=1670080 width=8) (actual time=311.000..328.918 rows=168277 loops=1)
                 Group Key: hashtags_objects.hashtag_id
                 Planned Partitions: 4  Batches: 1  Memory Usage: 19473kB
                 ->  Append  (cost=0.00..34080.20 rows=1670080 width=8) (actual time=0.009..158.232 rows=1588534 loops=1)
                       ->  Seq Scan on hashtags_objects  (cost=0.00..25728.07 rows=1670007 width=8) (actual time=0.008..82.662 rows=1588461 loops=1)
                       ->  Seq Scan on user_follows_hashtag  (cost=0.00..1.73 rows=73 width=8) (actual time=0.004..0.008 rows=73 loops=1)
 Planning Time: 0.075 ms
 Trigger for constraint hashtags_objects_hashtag_id_fkey: time=0.115 calls=1
 Trigger for constraint user_follows_hashtag_hashtag_id_fkey: time=0.039 calls=1
 Execution Time: 425.866 ms
(15 rows)
This is the query plan I got when running that query: ``` QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------- Delete on hashtags ht (cost=119932.75..123683.54 rows=0 width=0) (actual time=416.968..416.970 rows=0 loops=1) -> Seq Scan on hashtags ht (cost=119932.75..123683.54 rows=89472 width=6) (actual time=416.328..416.952 rows=1 loops=1) Filter: (NOT (hashed SubPlan 1)) Rows Removed by Filter: 168277 SubPlan 1 -> HashAggregate (cost=86009.25..115757.55 rows=1670080 width=8) (actual time=311.000..328.918 rows=168277 loops=1) Group Key: hashtags_objects.hashtag_id Planned Partitions: 4 Batches: 1 Memory Usage: 19473kB -> Append (cost=0.00..34080.20 rows=1670080 width=8) (actual time=0.009..158.232 rows=1588534 loops=1) -> Seq Scan on hashtags_objects (cost=0.00..25728.07 rows=1670007 width=8) (actual time=0.008..82.662 rows=1588461 loops=1) -> Seq Scan on user_follows_hashtag (cost=0.00..1.73 rows=73 width=8) (actual time=0.004..0.008 rows=73 loops=1) Planning Time: 0.075 ms Trigger for constraint hashtags_objects_hashtag_id_fkey: time=0.115 calls=1 Trigger for constraint user_follows_hashtag_hashtag_id_fkey: time=0.039 calls=1 Execution Time: 425.866 ms (15 rows) ```
Member

The query plan matches mine; notably here your cost estimate for the simpler version (123683.54) is already worse than the 42254.88 you got for LEFT JOIN yesterday.

So yep, USING ... LEFT JOIN appears to be the clear winner for akko’s db

The query plan matches mine; notably here your cost estimate for the simpler version *(`123683.54`)* is already worse than the `42254.88` you got for `LEFT JOIN` yesterday. So yep, `USING ... LEFT JOIN` appears to be the clear winner for akko’s db
Oneric approved these changes 2024-10-26 23:08:30 +00:00
Some checks are pending
ci/woodpecker/pr/build-amd64 Pipeline is pending
ci/woodpecker/pr/build-arm64 Pipeline is pending
ci/woodpecker/pr/docs Pipeline is pending
ci/woodpecker/pr/lint Pipeline is pending
ci/woodpecker/pr/test Pipeline is pending
This pull request can be merged automatically.
This branch is out-of-date with the base branch
You are not authorized to merge this pull request.
View command line instructions

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u hashtag-prune:norm-hashtag-prune
git checkout norm-hashtag-prune
Sign in to join this conversation.
No description provided.