From bc84698087763d61fc55928970b6e81ddb2a5900 Mon Sep 17 00:00:00 2001
From: Oneric
Date: Fri, 3 May 2024 23:36:51 +0200
Subject: [PATCH] [TESTING] dbsearch: actually rank search results
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Until now we always sorted search matches by id (basically: most recent)
and grabbed the first couple. However, if no special input modifiers are
used, websearch_to_tsquery creates a pretty loose search vector and the
most recent matches might be far from the most relevant matches.

Thus rank results by their relevancy instead. Normalisation mode 16
takes the number of unique words of a post into account, since a post
with 500 unique words matching a 5 word query is likely less relevant
than a post consisting entirely of the 5 queried words.
Let’s hope this improves on our notoriously bad search.

The tradeoff here is more costly queries. On a simplified mockup query
without any filtering or joins with activities, the planner’s cost
estimate didn’t change much, but measured wall clock time for a single
query increased from ~1.7ms to 2.3ms.

As far as I can tell, the cost of ts_rank(_cd) was never discussed
before. The original version 9f0a2a714b498edfbacc638fa79e06e3a8dc4d04
didn’t sort at all and there’s no associated discussion. Later it was
sorted by date in 1dd2c8163f233a205d6f110af010a637403e163e, but this was
quickly changed to sorting by id in
ff5e9574760accbf92f6e351819e1566b835002e (which isn’t too different with
old sequential ids and current FlakeIDs). This was carried forward until
eventually being removed in 817c66bc3ecf26596cbbc6086a9dc9b95b88fc0a,
probably because pagination sorts anyway.

For now RUM results continue to be ranked solely by recency, as they
have been since RUM support was introduced in
01c45ddc9ead715131b3c583caa14fcf20845354 and
f1e67bdc312ba16a37916024244d6cb9d4417c9e.
It is possible to make it use an efficient relevancy-based ranking, but
this requires changes to its index, which is beyond the scope of this
commit.

TODO: not sure how much the normalisation helps or if a non-log or
      non-unique word normalisation would be better.
---
 CHANGELOG.md                          |  3 +++
 lib/pleroma/search/database_search.ex | 18 ++++++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8649d65c8..bc69dddeb 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,6 +6,9 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
 ## Unreleased
 
+## Fixed
+- Search results for the default built-in GIN search are now actually ranked by relevancy
+
 ## 2024.04
 
 ## Added
diff --git a/lib/pleroma/search/database_search.ex b/lib/pleroma/search/database_search.ex
index bf566c3cb..010e3ccbb 100644
--- a/lib/pleroma/search/database_search.ex
+++ b/lib/pleroma/search/database_search.ex
@@ -15,6 +15,9 @@ defmodule Pleroma.Search.DatabaseSearch do
 
   @behaviour Pleroma.Search.SearchBackend
 
+  # See: https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING
+  @rank_normalisation 16
+
   def search(user, search_query, options \\ []) do
     index_type = if Pleroma.Config.get([:database, :rum_enabled]), do: :rum, else: :gin
     limit = Enum.min([Keyword.get(options, :limit), 40])
@@ -31,7 +34,7 @@ def search(user, search_query, options \\ []) do
       |> maybe_restrict_author(author)
       |> maybe_restrict_blocked(user)
       |> Pagination.fetch_paginated(
-        %{"offset" => offset, "limit" => limit, "skip_order" => index_type == :rum},
+        %{"offset" => offset, "limit" => limit, "skip_order" => true},
         :offset
       )
       |> maybe_fetch(user, search_query)
@@ -86,7 +89,18 @@ defp query_with(q, :gin, search_query) do
           o.data,
           ^tsc,
           ^search_query
-        )
+        ),
+      order_by: [
+        desc:
+          fragment(
+            "ts_rank_cd(to_tsvector(?::oid::regconfig, ?->>'content'), websearch_to_tsquery(?::oid::regconfig, ?), ?)",
+            ^tsc,
+            o.data,
+            ^tsc,
+            ^search_query,
+            @rank_normalisation
+          )
+      ]
     )
   end