[TESTING] dbsearch: actually rank search results

Until now we always sorted search matches by id (basically: most recent)
and grabbed the first couple. However, if no special input modifiers are
used, websearch_to_tsquery creates a pretty loose search query and the
most recent matches might be far from the most relevant matches.

Thus rank results by their relevancy instead. Normalisation mode 16
takes the number of unique words in a post into account, since a
post with 500 unique words matching a 5 word query is likely less
relevant than a post consisting entirely of the 5 queried words.
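
As a rough illustration of that effect: the PostgreSQL documentation
describes mode 16 as dividing the rank by 1 + the logarithm of the
number of unique words in the document. The symbols r (raw rank) and
u (unique word count) are names chosen here, and the natural logarithm
is assumed purely for the arithmetic, so treat the numbers as a sketch
rather than what the server computes exactly.

```latex
r_{16} = \frac{r}{1 + \log u}
\qquad
u = 5:\; 1 + \ln 5 \approx 2.61
\qquad
u = 500:\; 1 + \ln 500 \approx 7.21
```

Under these assumptions, at equal raw rank the 500-word post ends up
scored roughly 2.8 times lower than the 5-word post.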

Let’s hope this improves on our notoriously bad search.

The tradeoff is more costly queries. On a simplified mockup
query without any filtering or joins with activities, the planner’s
cost estimate didn’t change much, but measured wall clock time for
a single query increased from ~1.7ms to 2.3ms.

As far as I can tell, the cost of ts_rank(_cd) was never discussed
before. The original version 9f0a2a714b didn’t sort at all and there’s
no associated discussion. Later it was sorted by date (1dd2c8163f) but
quickly changed to sorting by id (ff5e957476), which isn’t too
different with old sequential ids and current FlakeIDs. This was
carried forward until eventually being removed in 817c66bc3e, probably
because pagination sorts anyway.

For now RUM results continue to be ranked solely by recency,
as they have been since RUM support was introduced in
  01c45ddc9e and
  f1e67bdc31.
It is possible to make it use an efficient relevancy-based ranking,
but this requires changes to its index, which are beyond the scope of
this commit.

TODO: not sure how much the normalisation helps or if a non-log
  or non-unique word normalisation would be better.
Oneric 2024-05-03 23:36:51 +02:00
parent 788a0bdc26
commit d5881d08b7
2 changed files with 19 additions and 2 deletions

@@ -6,6 +6,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 ## Unreleased
+## Fixed
+- Search results for the default built-in GIN search are now actually ranked by relevancy
 ## 2024.04
 ## Added

@@ -15,6 +15,9 @@ defmodule Pleroma.Search.DatabaseSearch do
   @behaviour Pleroma.Search.SearchBackend
+  # See: https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING
+  @rank_normalisation 16
   def search(user, search_query, options \\ []) do
     index_type = if Pleroma.Config.get([:database, :rum_enabled]), do: :rum, else: :gin
     limit = Enum.min([Keyword.get(options, :limit), 40])
@@ -31,7 +34,7 @@ defmodule Pleroma.Search.DatabaseSearch do
       |> maybe_restrict_author(author)
       |> maybe_restrict_blocked(user)
       |> Pagination.fetch_paginated(
-        %{"offset" => offset, "limit" => limit, "skip_order" => index_type == :rum},
+        %{"offset" => offset, "limit" => limit, "skip_order" => true},
         :offset
       )
       |> maybe_fetch(user, search_query)
@@ -86,7 +89,18 @@ defmodule Pleroma.Search.DatabaseSearch do
           o.data,
           ^tsc,
           ^search_query
-        )
+        ),
+      order_by: [
+        desc:
+          fragment(
+            "ts_rank_cd(to_tsvector(?::oid::regconfig, ?->>'content'), websearch_to_tsquery(?::oid::regconfig, ?), ?)",
+            ^tsc,
+            o.data,
+            ^tsc,
+            ^search_query,
+            @rank_normalisation
+          )
+      ]
     )
   end