[bug] Japanese full text search using tsja doesn't work when logged in #650

Open
opened 2023-10-20 10:02:12 +00:00 by chayleaf · 4 comments

Your setup

From source

Extra details

NixOS @ 81e8f48ebdecf07aab321182011b067aafc78896

Version

3.10.4

PostgreSQL version

13

What were you trying to do?

I followed this guide. At first I wanted to open a report about Japanese search not working at all, but then I tried a search in private mode, and it works. However, when I'm logged in, it still doesn't work.

What did you expect to happen?

I expected Japanese search to always work.

What actually happened?

Japanese search doesn't work when logged in.

Logs

No response

Severity

I can manage

Have you searched for this issue?

  • I have double-checked and have not found this issue mentioned anywhere.
### Your setup From source ### Extra details NixOS @ 81e8f48ebdecf07aab321182011b067aafc78896 ### Version 3.10.4 ### PostgreSQL version 13 ### What were you trying to do? I followed [this guide](https://docs.akkoma.dev/stable/configuration/howto_search_cjk/). At first I wanted to open a report about Japanese search not working at all, but then I tried a search in private mode, and [it works](https://fedi.pavluk.org/search?query=%E8%A8%80%E8%AA%9E). However, when I'm logged in, it still doesn't work. ### What did you expect to happen? I expected Japanese search to always work. ### What actually happened? Japanese search doesn't work when logged in. ### Logs _No response_ ### Severity I can manage ### Have you searched for this issue? - [x] I have double-checked and have not found this issue mentioned anywhere.
chayleaf added the
bug
label 2023-10-20 10:02:12 +00:00
Author

When enabling debug logs

SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at" FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') INNER JOIN LATERAL (SELECT is_active from users WHERE ap_id = a0."actor" AND is_active = TRUE) AS f2 ON TRUE WHERE (a0."data"->>'type' = 'Create') AND ($1 = ANY(a0."recipients")) AND (to_tsvector($2::oid::regconfig, o1."data"->>'content') @@ websearch_to_tsquery($3)) AND (a0."local" = TRUE) ORDER BY a0."id" desc nulls last LIMIT $4 OFFSET $5 ["https://www.w3.org/ns/activitystreams#Public", 678472, "日本語", 20, 0]

is there in the anonymous search results, but there's nothing that includes 678472 (the search language oid) in the logged in search results. Still, select current_setting('default_text_search_config')::regconfig::oid; runs in both cases. I assume this means the query fails when logged in.

postgresql logs confirm this:

Oct 20 23:46:07 server postgres[1983882]: [1983882] ERROR:  canceling statement due to user request
Oct 20 23:46:07 server postgres[1983882]: [1983882] STATEMENT:  SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at" FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') INNER JOIN LATERAL (SELECT is_active from users WHERE ap_id = a0."actor" AND is_active = TRUE) AS f2 ON TRUE WHERE (a0."data"->>'type' = 'Create') AND ($1 = ANY(a0."recipients")) AND (to_tsvector($2::oid::regconfig, o1."data"->>'content') @@ websearch_to_tsquery($3)) AND (NOT (a0."actor" = ANY($4))) ORDER BY a0."id" desc nulls last LIMIT $5 OFFSET $6
Oct 20 23:46:07 server postgres[1984689]: [1984689] FATAL:  terminating connection due to administrator command
Oct 20 23:46:07 server postgres[1984689]: [1984689] STATEMENT:  SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at" FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') INNER JOIN LATERAL (SELECT is_active from users WHERE ap_id = a0."actor" AND is_active = TRUE) AS f2 ON TRUE WHERE (a0."data"->>'type' = 'Create') AND ($1 = ANY(a0."recipients")) AND (to_tsvector($2::oid::regconfig, o1."data"->>'content') @@ websearch_to_tsquery($3)) AND (NOT (a0."actor" = ANY($4))) ORDER BY a0."id" desc nulls last LIMIT $5 OFFSET $6
Oct 20 23:46:07 server postgres[1984690]: [1984690] FATAL:  terminating connection due to administrator command
Oct 20 23:46:07 server postgres[1984690]: [1984690] STATEMENT:  SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at" FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') INNER JOIN LATERAL (SELECT is_active from users WHERE ap_id = a0."actor" AND is_active = TRUE) AS f2 ON TRUE WHERE (a0."data"->>'type' = 'Create') AND ($1 = ANY(a0."recipients")) AND (to_tsvector($2::oid::regconfig, o1."data"->>'content') @@ websearch_to_tsquery($3)) AND (NOT (a0."actor" = ANY($4))) ORDER BY a0."id" desc nulls last LIMIT $5 OFFSET $6
Oct 20 23:46:07 server postgres[1805940]: [1805940] LOG:  background worker "parallel worker" (PID 1984689) exited with exit code 1
Oct 20 23:46:07 server postgres[1805940]: [1805940] LOG:  background worker "parallel worker" (PID 1984690) exited with exit code 1

this happens both with missing words like ajwoawjfwao and Japanese words

And postgres complains WARNING: invalid value for parameter "default_text_search_config": "japanese" every once in a while, even though it doesn't log anything when trying SELECT * FROM ts_debug('日本語が喋れるんかい')

When enabling debug logs ``` SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at" FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') INNER JOIN LATERAL (SELECT is_active from users WHERE ap_id = a0."actor" AND is_active = TRUE) AS f2 ON TRUE WHERE (a0."data"->>'type' = 'Create') AND ($1 = ANY(a0."recipients")) AND (to_tsvector($2::oid::regconfig, o1."data"->>'content') @@ websearch_to_tsquery($3)) AND (a0."local" = TRUE) ORDER BY a0."id" desc nulls last LIMIT $4 OFFSET $5 ["https://www.w3.org/ns/activitystreams#Public", 678472, "日本語", 20, 0] ``` is there in the anonymous search results, but there's nothing that includes 678472 (the search language oid) in the logged in search results. Still, `select current_setting('default_text_search_config')::regconfig::oid;` runs in both cases. I assume this means the query fails when logged in. postgresql logs confirm this: ``` Oct 20 23:46:07 server postgres[1983882]: [1983882] ERROR: canceling statement due to user request Oct 20 23:46:07 server postgres[1983882]: [1983882] STATEMENT: SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at" FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') INNER JOIN LATERAL (SELECT is_active from users WHERE ap_id = a0."actor" AND is_active = TRUE) AS f2 ON TRUE WHERE (a0."data"->>'type' = 'Create') AND ($1 = ANY(a0."recipients")) AND (to_tsvector($2::oid::regconfig, o1."data"->>'content') @@ websearch_to_tsquery($3)) AND (NOT (a0."actor" = ANY($4))) ORDER BY a0."id" desc nulls last LIMIT $5 OFFSET $6 Oct 20 23:46:07 server postgres[1984689]: [1984689] FATAL: terminating connection due to administrator command Oct 20 23:46:07 server postgres[1984689]: [1984689] STATEMENT: SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at" FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') INNER JOIN LATERAL (SELECT is_active from users WHERE ap_id = a0."actor" AND is_active = TRUE) AS f2 ON TRUE WHERE (a0."data"->>'type' = 'Create') AND ($1 = ANY(a0."recipients")) AND (to_tsvector($2::oid::regconfig, o1."data"->>'content') @@ websearch_to_tsquery($3)) AND (NOT (a0."actor" = ANY($4))) ORDER BY a0."id" desc nulls last LIMIT $5 OFFSET $6 Oct 20 23:46:07 server postgres[1984690]: [1984690] FATAL: terminating connection due to administrator command Oct 20 23:46:07 server postgres[1984690]: [1984690] STATEMENT: SELECT a0."id", a0."data", a0."local", a0."actor", a0."recipients", a0."inserted_at", a0."updated_at", o1."id", o1."data", o1."inserted_at", o1."updated_at" FROM "activities" AS a0 INNER JOIN "objects" AS o1 ON (o1."data"->>'id') = COALESCE(a0."data"->'object'->>'id', a0."data"->>'object') INNER JOIN LATERAL (SELECT is_active from users WHERE ap_id = a0."actor" AND is_active = TRUE) AS f2 ON TRUE WHERE (a0."data"->>'type' = 'Create') AND ($1 = ANY(a0."recipients")) AND (to_tsvector($2::oid::regconfig, o1."data"->>'content') @@ websearch_to_tsquery($3)) AND (NOT (a0."actor" = ANY($4))) ORDER BY a0."id" desc nulls last LIMIT $5 OFFSET $6 Oct 20 23:46:07 server postgres[1805940]: [1805940] LOG: background worker "parallel worker" (PID 1984689) exited with exit code 1 Oct 20 23:46:07 server postgres[1805940]: [1805940] LOG: background worker "parallel worker" (PID 1984690) exited with exit code 1 ``` this happens both with missing words like ajwoawjfwao and Japanese words And postgres complains `WARNING: invalid value for parameter "default_text_search_config": "japanese"` every once in a while, even though it doesn't log anything when trying `SELECT * FROM ts_debug('日本語が喋れるんかい')`
Author

After trying out various queries, the main issue seems to be that the query simply takes too long to execute for most queries

notably, if I replace websearch_to_tsquery() function call with the result directly, it gets noticeably faster (previously I couldn't wait for it to finish, now it takes a couple seconds). If it's cached, that would constitute a partial fix.

After trying out various queries, the main issue seems to be that the query simply takes too long to execute for most queries notably, if I replace `websearch_to_tsquery()` function call with the result directly, it gets noticeably faster (previously I couldn't wait for it to finish, now it takes a couple seconds). If it's cached, that would constitute a partial fix.
Author

@floatingghost I've verified that caching websearch_to_tsquery massively helps (though plain SELECT * FROM objects WHERE data->>content = ... is still much much faster), but I have no idea how to do it - Elixir most likely doesn't support the tsquery Postgres type, so it probably has to become an even more convoluted SQL expression, and both my SQL and Elixir skills aren't enough for that. Alternatively, a SQL expression reordering could help too (doing the SELECT * FROM objects before everything else), but, again, I have no idea how to do that. As Russians say, "this is the extent of my authority".

@floatingghost I've verified that caching `websearch_to_tsquery` massively helps (though plain `SELECT * FROM objects WHERE data->>content = ...` is still much much faster), but I have no idea how to do it - Elixir most likely doesn't support the `tsquery` Postgres type, so it probably has to become an even more convoluted SQL expression, and both my SQL and Elixir skills aren't enough for that. Alternatively, a SQL expression reordering could help too (doing the `SELECT * FROM objects` before everything else), but, again, I have no idea how to do that. As Russians say, "this is the extent of my authority".
Member

notably, if I replace websearch_to_tsquery() function call with the result directly, it gets noticeably faster (previously I couldn't wait for it to finish, now it takes a couple seconds). If it's cached, that would constitute a partial fix.
I've verified that caching websearch_to_tsquery massively helps

hmm, i can’t quite reproduce this with english. Currently we use the single argument version of websearch_to_tsvector which is indeed not ideal since it’s not immutable like the two argument version with explicit language configuration. The latter really ought to be cached by PostgreSQL already but the current version should already be cached too since the config can’t change during an (implicit) transaction. With PostgreSQL both appear to get cached for me, with just a minor cost difference since the one arg version needs to query the current config once.
Unless ofc setting up a custom config somehow degraded to_tsquery into an unstable function.

In theory we could extract the call into the FROM part with a cross join (as used for better readability in modern PostgreSQL docs about textseach), but there’s a problem: this requests a join operation which is costly and prior to PostgreSQL 13 (meaning your setup just barely make the cut), Postgres wasn’t smart enough to optimise this join away incurring a substantial performance penalty.

Can you test if perf is improved if you change the websearch_to_tsquery(?) call here to websearch_to_tsquery('japanese', ?)?

though plain SELECT * FROM objects WHERE data->>content = ... is still much much faster

A plain comparison will actually need to iterate through the entire table and compare each post’s content to the search term. Fulltext search with ts_queries and ts_vector can use the objects_fts index and should be much faster. Are you sure you run the mix task to update your index to the new language setting?

And postgres complains WARNING: invalid value for parameter "default_text_search_config": "japanese" every once in a while, even though it doesn't log anything when trying SELECT * FROM ts_debug('日本語が喋れるんかい')

Are you sure you setup tsja and the default_text_search_config for the right user and database? I’m not really familiar with custom text search configs but this does seem odd


For more insight please connect to the akkoma database as the akkoma user and run

\d objects_fts

EXPLAIN SELECT data->>'id' FROM objects WHERE to_tsvector('japanese', data->>'content') @@ websearch_to_tsquery('言語');

EXPLAIN SELECT data->>'id' FROM objects WHERE to_tsvector('japanese', data->>'content') @@ websearch_to_tsquery('japanese', '言語');

EXPLAIN SELECT data->>'id' FROM objects,  websearch_to_tsquery('言語') AS query WHERE to_tsvector('japanese', data->>'content') @@ query;

EXPLAIN SELECT data->>'id' FROM objects,  websearch_to_tsquery('japanese', '言語') AS query WHERE to_tsvector('japanese', data->>'content') @@ query;
> notably, if I replace websearch_to_tsquery() function call with the result directly, it gets noticeably faster (previously I couldn't wait for it to finish, now it takes a couple seconds). If it's cached, that would constitute a partial fix. > I've verified that caching websearch_to_tsquery massively helps hmm, i can’t quite reproduce this with `english`. Currently we use the single argument version of `websearch_to_tsvector` which is indeed not ideal since it’s not immutable like the two argument version with explicit language configuration. The latter really ought to be cached by PostgreSQL already but the current version should already be cached too since the config can’t change during an (implicit) transaction. With PostgreSQL both appear to get cached for me, with just a minor cost difference since the one arg version needs to query the current config once. Unless ofc setting up a custom config somehow degraded `to_tsquery` into an unstable function. In theory we could extract the call into the `FROM` part with a cross join (as used for better readability in modern PostgreSQL docs about textseach), but there’s a problem: this requests a join operation which is costly and [prior to PostgreSQL 13](https://github.com/postgres/postgres/commit/7266d0997dd2a0632da38a594c78e25ff21df67e) (meaning your setup just barely make the cut), Postgres wasn’t smart enough to optimise this join away incurring a substantial performance penalty. Can you test if perf is improved if you change the `websearch_to_tsquery(?)` call [here](https://akkoma.dev/AkkomaGang/akkoma/src/commit/21a81e111165bd29ef953d5e51c682f3dd94dfb4/lib/pleroma/search/database_search.ex#L103) to `websearch_to_tsquery('japanese', ?)`? > though plain SELECT * FROM objects WHERE data->>content = ... is still much much faster A plain comparison will actually need to iterate through the entire table and compare each post’s content to the search term. Fulltext search with ts_queries and ts_vector can use the `objects_fts` index and should be much faster. Are you sure you run the mix task to update your index to the new language setting? > And postgres complains WARNING: invalid value for parameter "default_text_search_config": "japanese" every once in a while, even though it doesn't log anything when trying SELECT * FROM ts_debug('日本語が喋れるんかい') Are you sure you setup tsja and the default_text_search_config for the right user and database? I’m not really familiar with custom text search configs but this does seem odd --- For more insight please connect to the akkoma database as the akkoma user and run ```sql \d objects_fts EXPLAIN SELECT data->>'id' FROM objects WHERE to_tsvector('japanese', data->>'content') @@ websearch_to_tsquery('言語'); EXPLAIN SELECT data->>'id' FROM objects WHERE to_tsvector('japanese', data->>'content') @@ websearch_to_tsquery('japanese', '言語'); EXPLAIN SELECT data->>'id' FROM objects, websearch_to_tsquery('言語') AS query WHERE to_tsvector('japanese', data->>'content') @@ query; EXPLAIN SELECT data->>'id' FROM objects, websearch_to_tsquery('japanese', '言語') AS query WHERE to_tsvector('japanese', data->>'content') @@ query; ```
Sign in to join this conversation.
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: AkkomaGang/akkoma#650
No description provided.