[TESTING] dbsearch: actually rank search results

Until now we always sorted search matches by id (basically: most recent) and grabbed the first couple. However, if no special input modifiers are used, websearch_to_tsquery creates a pretty loose search query and the most recent matches might be far from the most relevant ones. Thus rank results by their relevancy instead.

Normalisation mode 16 takes the number of unique words of a post into account, since a post with 500 unique words matching a 5-word query is likely less relevant than a post consisting entirely of the 5 queried words.

Let’s hope this improves on our notoriously bad search.

The tradeoff here is more costly queries. On a simplified mockup query without any filtering or joins with activities, the planner’s cost estimate didn’t change much, but measured wall clock time for a single query increased from ~1.7ms to 2.3ms.

As far as I can tell, the cost of ts_rank(_cd) was never discussed before. The original version 9f0a2a714b didn’t sort at all and there’s no associated discussion. Later it was sorted by date in 1dd2c8163f, but this was quickly changed to sorting by id in ff5e957476 (which isn’t too different, with both the old sequential ids and current FlakeIDs). This was carried forward until eventually being removed in 817c66bc3e, probably because pagination sorts anyway.

For now RUM results continue to be ranked solely by recency, as they have been since RUM support was introduced in 01c45ddc9e and f1e67bdc31. It is possible to make RUM use an efficient relevancy-based ranking, but this requires changes to its index, which is beyond the scope of this commit.

TODO: not sure how much the normalisation helps or if a non-log or non-unique word normalisation would be better.
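To get a feel for what the normalisation flag does, the following standalone query is a minimal sketch that can be run in psql against any PostgreSQL 11 or newer. The two sample documents, the `docs` and `q` names, and the hardcoded `'english'` config are made up for illustration; the actual search uses the configured text search config and the `content` of real objects. According to the PostgreSQL documentation, flag 16 divides the rank by 1 + the logarithm of the number of unique words in the document, so the longer document should end up with the lower normalised rank even though both contain the query words right next to each other.

```sql
-- Illustrative only: sample documents and the 'english' config are assumptions,
-- not data or settings taken from this commit.
WITH docs(label, content) AS (
  VALUES
    ('short', 'search ranking'),
    ('long',  'search ranking buried among cooking gardening travel music painting hiking notes')
),
q AS (
  SELECT websearch_to_tsquery('english', 'search ranking') AS query
)
SELECT
  d.label,
  -- 0: no normalisation, document length is ignored
  ts_rank_cd(to_tsvector('english', d.content), q.query, 0)  AS raw_rank,
  -- 16: rank is scaled down based on the log of the number of unique words
  ts_rank_cd(to_tsvector('english', d.content), q.query, 16) AS normalised_rank
FROM docs d
CROSS JOIN q
ORDER BY normalised_rank DESC;
```

Swapping 16 for another flag (e.g. 8 for plain unique word count, or 1 for log of document length) in the last argument is a quick way to compare the alternatives mentioned in the TODO.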
2024-05-03 23:36:51 +02:00 · 2024-05-03 23:36:51 +02:00 · d5881d08b7
parent 788a0bdc26
commit d5881d08b7
2 changed files with 19 additions and 2 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

 ## Unreleased

+## Fixed
+- Search results for the default built-in GIN search are now actually ranked by relevancy
+
 ## 2024.04

 ## Added
--- a/lib/pleroma/search/database_search.ex
+++ b/lib/pleroma/search/database_search.ex
@@ -15,6 +15,9 @@ defmodule Pleroma.Search.DatabaseSearch do

  @behaviour Pleroma.Search.SearchBackend

+  # See: https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING
+  @rank_normalisation 16
+
  def search(user, search_query, options \\ []) do
    index_type = if Pleroma.Config.get([:database, :rum_enabled]), do: :rum, else: :gin
    limit = Enum.min([Keyword.get(options, :limit), 40])
@@ -31,7 +34,7 @@ defmodule Pleroma.Search.DatabaseSearch do
      |> maybe_restrict_author(author)
      |> maybe_restrict_blocked(user)
      |> Pagination.fetch_paginated(
-        %{"offset" => offset, "limit" => limit, "skip_order" => index_type == :rum},
+        %{"offset" => offset, "limit" => limit, "skip_order" => true},
        :offset
      )
      |> maybe_fetch(user, search_query)
@@ -86,7 +89,18 @@ defmodule Pleroma.Search.DatabaseSearch do
          o.data,
          ^tsc,
          ^search_query
-        )
+        ),
+      order_by: [
+        desc:
+          fragment(
+            "ts_rank_cd(to_tsvector(?::oid::regconfig, ?->>'content'), websearch_to_tsquery(?::oid::regconfig, ?), ?)",
+            ^tsc,
+            o.data,
+            ^tsc,
+            ^search_query,
+            @rank_normalisation
+          )
+      ]
    )
  end