Tweak search #1113

Merged
Oneric merged 20 commits from Oneric/akkoma:search-overhaul into develop 2026-05-22 20:25:10 +00:00
Owner

Both post and user search. Hashtag "search" remains unchanged.

Most of this should be clear, objective improvements, but there are also some more opinionated/subjective changes.
The bulk of the changes concerns user search, which in addition to its poor result quality was shown in #1112 to perform atrociously. It should now be better (a planner confusion keeps it from being as efficient as it could be on my instance, but it is still clearly better than before on average; see the commit message for perf measurements. This should improve when/if we overhaul user deletion and actually drop deleted users from the database).
There are also some changes not directly related to search but a consequence of changes made for search.

As a short summary of the more noteworthy and opinionated points:

  • post search:
    • can now match content warnings
      • (this requires manual intervention during upgrades for instances using the non-default RUM index (which was already enabled via manually running an optional migration))
    • opinionated (?; on the other hand, less opinionated than the old behaviour): new installs default to simple rather than english FTS config
    • unchanged: results are still sorted by id (date) not match quality. This is the same for all other fedi implementations i know of (when not using an external search provider like Meilisearch).
      Ranking by match quality is (presumably much? but haven’t tested it on real-world db tbh) more expensive and can only ever work with offset pagination rather than id pagination.
  • user search
    • new faster and more restrictive path for explicit nickname lookups (query prefixed by @); to be used by mention autocomplete (see also AkkomaGang/akkoma-fe#507)
    • generally more precise result filtering
    • but not (yet?) changing ranking of entries which made it through filter. Let’s see if the better filters already suffice
    • opinionated: no more boosting of follow* users. We have a follow_only API toggle, it doesn’t make too much sense and degraded query performance (at least the way it was implemented before)
  • general
    • new fine-grained restrict_unauthenticated toggles for search. Previously it was: everything, local-only or nothing with the latter being tied to a full instance lockdown (private instance)
    • the plain unique nickname index on the users table is replaced by an explicitly case-folded index. All queries (i hope, if i didn’t miss any) have been changed to match (was necessary for fast and case-insensitive starts_with queries)
      • if no issues pop up this would also allow us to change the column type to regular text for a minor efficiency increase. Apparently citext is also considered “legacy” by postgres devs (the more powerful replacement are non-deterministic ICU collations, but they are not a good fit for prefix lookups as needed in the new user search; our other citext column email though could in principle migrate to that)
By itself the name implies only local lookups are performed.
get_cached_by_ap_id does NOT perform network lookups.
(Neither does get_cached_by_id, although it too does a weird extra indirection through ap_id)
We already queried the text search config anyway for GIN and
the version with an explicit config is an immutable function
which ensures the query planner doesn’t do silly things and caches
the evaluation result.
(PostgreSQL 16 seems to execute even the one-arg version only once)

This might help with the repeated reevaluation problem reported in
#650. Note, another even more
foolproof way to ensure only a single eval happens is using a cross join
with the websearch_to_tsquery result. However this can incur significant
overhead for performing a join operation. Since PostgreSQL 13 [1], the
planner is smart enough to re-inline cross joins with immutable
functions, but before that both to_tsvector variants suffer from this
and as of PostgreSQL 16 the one-arg version still does.

Maybe fixes: #650
Setting a language matching the dominant post and query language
allows normalising inputs to also match on slight, usually
inconsequential differences (like singular vs plural form)
and omitting common filler words without too much meaning on their own.
Though the latter can also be undesirable (in the default config)
ending up reducing e.g. "but why an apple" to just "appl".

If the language does _not_ match stop-word removal and normalisation can
utterly mangle the post text and query into uselessness. Thus, forcing
specifically English by default with possible vague search issues
and a hard to discover change mechanism necessitating a costly index
rebuild seems like a bad idea.

Instead default new installs to the language-agnostic "simple" config,
which has no stop words and performs no word normalisation. It still
strips HTML tags and attributes though. If desired search can be made
more forgiving by changing the config, but there will be no straight up
_issues_ with the default.
Since we cannot know whether the config was already customised or
intentionally set to english and to avoid forcing a costly index
rebuild, existing installs will keep their current setting.
Already known content is still located via URI or nick
to not shift offset positions.
And for the database search, don't bother with a FTS
search if we already have an exact AP id match. Locating
this was almost certainly the intent of the user query.
Both "summary" and "content" are HTML fragments.
For the latter we often also have source.content
as an alternative with the original (often plaintext) markup.
However, PostgreSQL’s to_tsvector already strips out HTML tags
with their attributes.
It however will not know about e.g. MFM markup or BBCode.
Thus it’s actually _better_ to use the HTML form here.
Most of our other domain extractions in SQL use split_part.
Indeed testing shows split_part is almost ten times faster
than the regex substring, so let’s adopt it here too.

Running each variant on a real-world user table
took ~340ms with substring and ~35ms with split_part
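
To illustrate the two extraction styles compared above, here is a minimal Python sketch. It only models the semantics; the actual SQL expressions and the exact regex used in the codebase are not shown in this commit message, so `split_part` below is a hypothetical re-implementation of the PostgreSQL function for positive field numbers, and the regex is an assumed stand-in.

```python
import re
from typing import Optional


def split_part(text: str, delimiter: str, n: int) -> str:
    """Model of PostgreSQL's split_part for positive n.

    Fields are 1-based; an out-of-range field yields the empty string.
    """
    parts = text.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


def domain_via_regex(nickname: str) -> Optional[str]:
    """Assumed regex variant: capture everything after the first '@'.

    Returns None when there is no '@', similar to how a non-matching
    regex substring yields NULL in SQL.
    """
    match = re.search(r"@(.*)$", nickname)
    return match.group(1) if match else None


remote = "alice@example.org"
local = "bob"  # local nicknames carry no domain part

print(split_part(remote, "@", 2), domain_via_regex(remote))
print(repr(split_part(local, "@", 2)), domain_via_regex(local))
```

Both variants agree on remote nicknames; they differ only in what they return for nicknames without a domain (empty string versus no value), which callers may need to account for.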
Just as in the preceding commit,
split_part is about 100× more efficient
The old user search query involved several layers of nested subqueries
and the FTS condition and index were rather complex too. At the same
time, the manually generated ts_vector was overly permissive, allowing
entries where _any_ of the input fragments (after splitting in
preprocessing) occur, without a strong guarantee for matches containing
_all_ fragments being preferred at the end.
Furthermore, before the preceding commit it contained an awkward
UNION-like construct via a (twice) passed-in array of pre-determined
preferred results. Now it still contains two independently queried
instances of the following list (and a single follower list) for
restricting results if requested and for always boosting search ranks
of users with a follow* relation.
Meaning both performance and result quality were quite poor.

For one particular instance, as reported in
#1112 (comment)
this led to user search queries taking 500ms(!) on average, making it
_the_ individually most costly query, taking about 20× longer than
the second worst, status full-text search. User search also took
10% of the total DB time during a full week even though there were
only 31 queries.

This new approach splits the index and condition logic for
nicknames and display names, using FTS only for the latter.
This results in individually simpler conditions, allowing stricter
criteria for nickname matches and an overall simpler query.
Despite now using two indexes, the combined size of these new indexes
is actually smaller than the old one (at least on the instance used
for initial testing).
In principle the new nickname index is logically redundant with the
new, explicitly casefolded index since the column type "citext" is
already case-insensitive. However, the planner cannot pick up either
version with/without an explicit CASEFOLD in the query and the
starts_with function does not honour citext’s case-insensitivity.
A future commit might resolve this duplication by migrating everything
to explicit case-folding.

Search terms explicitly marked as nicknames by a leading @
will now _only_ match on the nickname column, increasing result
relevancy and speeding the query up. This can be used e.g. for
mention autocomplete suggestions.
Our akkoma-fe frontend adopted this as part of
AkkomaGang/akkoma-fe#507
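
The split between the explicit-nickname path and the general path can be sketched roughly as follows. This is an in-memory Python model, not the actual Elixir/SQL implementation: the `User` type and `search_users` function are hypothetical, lowercasing stands in for the case-folded index, and the plain substring test in the general branch is only a crude placeholder for the real FTS and trigram logic.

```python
from dataclasses import dataclass


@dataclass
class User:
    nickname: str
    display_name: str


def search_users(query: str, users: list[User]) -> list[User]:
    """Route a query to the nickname-only or the general path."""
    q = query.strip()
    if q.startswith("@"):
        # Explicit nickname lookup: case-insensitive prefix match on the
        # nickname only; display names are never consulted.
        prefix = q[1:].lower()
        return [u for u in users if u.nickname.lower().startswith(prefix)]
    # General path (placeholder): match nickname or display name.
    needle = q.lower()
    return [
        u
        for u in users
        if needle in u.nickname.lower() or needle in u.display_name.lower()
    ]


users = [
    User("oneric@example.org", "Someone"),
    User("bob", "Oneric fan"),
]
print([u.nickname for u in search_users("@ONe", users)])
print([u.nickname for u in search_users("oneric", users)])
```

With the leading @, only the account whose nickname starts with the term is returned; without it, the account that merely mentions the term in its display name shows up as well.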

One more opinionated change in this commit is the removal of boosted
search ranks for users with follow* relations. This seemed more like
a crutch to alleviate the generally poor results. When using search to
discover an account, either the account is well-known and the query
already pretty exact, or the account is not well-known and boosting
somewhat similar well-known accounts defeats the point.
For mention autocomplete suggestions it might make more sense, but
the new explicit nickname path should already improve this and
currently akkoma-fe does not use the backend’s ordering for this anyway.
Furthermore, as it was implemented before it forced a subquery layer
degrading performance and it will always necessitate passing
potentially very large follow* lists from the database to elixir
and back, adding more overhead.

Other than this, the actual ranking of results meeting the initial
filters is not touched in this commit (yet?). Let’s first see if the
stricter initial filters are enough to let the ranking work sensibly.
(Considered alternatives involve either ts_rank_cd which appears to
 scale differently from the trigram ranking making it hard to combine
 both meaningfully, a set of somewhat arbitrary heuristics or both.
 The current ranking logic is more straightforward at least and
 also appears somewhat sensible already.)

Comparing execution times of the old and new fuzzy queries on a couple
of inputs on a copy of the db of the same instance reporting ~500ms
average execution time atm, but on a different, faster machine.
The comparison is done without any block or follow restrictions,
nor the recently removed top_user_ids on the old version.

 search input  |  old [ms]  |  new [ms]
 --------------|------------|-----------
     "ONeriC"  |    0.552   |   9.117
          "a"  |  293.303   |  40.083
   "John Doe"  |   37.579   |   0.323
   "Jane Doe"  |    7.108   |   0.238

The degradation for the first input is the result of a planner mishap
from too coarse estimates / statistics. It expects way more rows to
match the FTS filter than actually do and thus decides a parallel scan
of the is_active filter to AND the (expected large) result set later
would be a good idea. Since the actual FTS result set is small though,
now almost all time is spent on the is_active index scan.
For queries with more than one word, the planner’s estimate of FTS
matches is more accurate and no such parallel is_active scan is
performed.
This may be improved once we rework user deletes and actually clean up
deleted users. Or perhaps it might turn out the is_active index is not
needed at all; possible usecases atm might involve the user index
endpoint listing all users (though it might prefer a last_status
scan anyway) and improving the active user estimate in our telemetry
statistics.

Even with this planner mishap though, the new query clearly performs
better in the "worst case" of very short search terms (commonly
encountered in searches for mention autocomplete suggestions)
and due to the different ts_vector also searches with multiple words.
It’s always much better than the previously reported average
performance and never atrociously bad.

Furthermore, for first-character autocomplete suggestions this mishap
can now be avoided (if the client cooperates) through the
explicit-nickname alternative path.
It is only used by admin API search
It is confusing for the single-value form of the nickname constraint
to act differently from the list version.
Furthermore all usages actually expecting a substring match via
the :nickname key wanted to match specifically a suffix and
might have processed unintended extra results before.
E.g. when collecting results for @funny.tld it would have also
matched users on funny.tld.otherinstance.example.

Exact matches can also be performed more efficiently
than either of the other two.
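
The difference between the three matching modes can be seen in a small Python sketch. The function names and sample nicknames are made up for illustration and do not correspond to the actual query builder keys.

```python
nicknames = [
    "alice@funny.tld",
    "bob@funny.tld.otherinstance.example",
    "carol@elsewhere.example",
]


def substring_match(nicks: list[str], fragment: str) -> list[str]:
    """Old single-value behaviour: fragment may occur anywhere."""
    return [n for n in nicks if fragment in n]


def suffix_match(nicks: list[str], suffix: str) -> list[str]:
    """What callers collecting users of one instance actually wanted."""
    return [n for n in nicks if n.endswith(suffix)]


def exact_match(nicks: list[str], nickname: str) -> list[str]:
    """Plain equality, which can use a regular index lookup."""
    return [n for n in nicks if n == nickname]


print(substring_match(nicknames, "@funny.tld"))
print(suffix_match(nicknames, "@funny.tld"))
print(exact_match(nicknames, "alice@funny.tld"))
```

The substring variant also returns the account on funny.tld.otherinstance.example, while the suffix variant returns only the account on funny.tld itself.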
db: drop redundant nickname index
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
58996e4625
All remaining nickname filters have been converted to use explicit casefolding.
If no issues show up, we can also convert the column type to regular text later
Oneric force-pushed search-overhaul from 58996e4625
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
to 3af928ac07
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
2026-04-26 15:47:38 +00:00
Compare
Author
Owner

good thing CI uses an older postgres; apparently newer versions like 18 autoconvert citext to text for CASEFOLD, but the version in CI (15?) errors out on CASEFOLD with a citext argument.
Now every casefold call should explicitly convert to text

Oneric force-pushed search-overhaul from 3af928ac07
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
to 256ff4e600
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
2026-04-26 15:52:49 +00:00
Compare
Oneric changed title from Overhaul search to Tweak search 2026-04-26 15:53:31 +00:00
Author
Owner

oh, actually the issue is CASEFOLD itself not yet existing
hmm...

Apparently it was added only very recently in PostgreSQL 18; too new to just bump our minimal required version. So i guess i’ll change this to use LOWER instead later. (The advantage of CASEFOLD is that it can work better on glyphs with ambiguous lower/upper forms or when only the lower xor upper form exists but not both.)
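
As a rough illustration of the difference between lowercasing and case folding, here is a Python analogue. It uses Python’s own `str.lower` and `str.casefold`; PostgreSQL’s LOWER and CASEFOLD depend on the collation in use, so treat this as a sketch of the general idea rather than of the exact database behaviour.

```python
a = "Straße"
b = "STRASSE"

# Lowercasing keeps the sharp s, so the two spellings stay different.
print(a.lower(), b.lower(), a.lower() == b.lower())

# Full case folding maps the sharp s to "ss", so they compare equal.
print(a.casefold(), b.casefold(), a.casefold() == b.casefold())
```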

Oneric force-pushed search-overhaul from 256ff4e600
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
to b9925c3e12
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
2026-04-26 16:13:57 +00:00
Compare
Oneric force-pushed search-overhaul from b9925c3e12
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
to b436369de2
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
2026-04-26 16:22:32 +00:00
Compare
Author
Owner

and the "pg_c_utf8" collation is also relatively recent; seemingly introduced in postgresql 17

Since an explicit collation is only used for starts_with here, i think just using the much older "C"/"POSIX" there should give equivalent results. (For lowercasing or casefolding though they differ; the latter cannot map e.g. accented characters like À → à)

EDIT: with this it’s now all good even on CI’s postgres 15, it seems

Oneric force-pushed search-overhaul from b436369de2
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
to b396825b4d
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
2026-04-26 16:41:00 +00:00
Compare
Oneric force-pushed search-overhaul from b396825b4d
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
to 775337754d
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
2026-04-26 22:52:18 +00:00
Compare
Oneric force-pushed search-overhaul from 775337754d
Some checks failed
ci/woodpecker/pr/test/2 Pipeline failed
ci/woodpecker/pr/test/1 Pipeline failed
to c92b794da2
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
2026-04-26 23:02:53 +00:00
Compare
user/search: break tie on equal search rank
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
9b94b30346
By preferring local accounts and then accounts
the server knows about for longer.

I considered using last_status_at to prefer accounts
with more recent publicly visible activity, but this
might make pagination iffy when new statuses are posted
in between API calls.
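
The tie-break described above can be sketched as a sort key in Python. The `Hit` type and its field names (`rank`, `local`, `inserted_at`) are assumptions made for this example; the real query orders rows in SQL.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Hit:
    nickname: str
    rank: float            # search rank, higher is better
    local: bool            # account lives on this instance
    inserted_at: datetime  # when the server first learned of the account


def order_hits(hits: list[Hit]) -> list[Hit]:
    """Sort by rank, then local before remote, then longest-known first."""
    return sorted(hits, key=lambda h: (-h.rank, not h.local, h.inserted_at))


hits = [
    Hit("remote_old", 1.0, False, datetime(2020, 1, 1)),
    Hit("local_new", 1.0, True, datetime(2024, 1, 1)),
    Hit("local_old", 1.0, True, datetime(2021, 1, 1)),
    Hit("best", 2.0, False, datetime(2025, 1, 1)),
]
print([h.nickname for h in order_hits(hits)])
```

Because none of the key components change when new statuses are posted, the order stays stable between paginated API calls, unlike a last_status_at based tie-break.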
Oneric force-pushed search-overhaul from 9b94b30346
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
to 80bc30b568
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
2026-05-04 20:02:21 +00:00
Compare
Oneric force-pushed search-overhaul from 80bc30b568
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
to a2eac6d414
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
2026-05-05 18:33:58 +00:00
Compare
Oneric force-pushed search-overhaul from a2eac6d414
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
to 5b05ab84f6
All checks were successful
ci/woodpecker/pr/test/2 Pipeline was successful
ci/woodpecker/pr/test/1 Pipeline was successful
2026-05-09 22:01:45 +00:00
Compare
Author
Owner

Seems all good on my instance and didn’t break ihba either, so seems good to go

Oneric merged commit fb392a8562 into develop 2026-05-22 20:25:10 +00:00
Oneric deleted branch search-overhaul 2026-05-22 20:25:11 +00:00
Reference
AkkomaGang/akkoma!1113