Add backfilling of posts #846
Reference: AkkomaGang/akkoma#846
Hey! I decided it was kinda dumb that outboxes exist everywhere yet no software uses them for their intended (or any, really) purpose.
So now I added backfilling of posts from remote instances. This only fetches a single page, and will re-fetch whenever the user's page is requested, so that it won't cause there to be only a slice of posts. :D
2aefeeee61 to 34f78ab4cd
the implementation looks very sensible - i saw the pr title and thought "i'm gonna have to ask for this to be a worker" but it already is which is nice
two little things before this can be merged - first off, a changelog entry would be nice
and also, this will need some tests - current remote fetcher worker tests are over here -> https://akkoma.dev/AkkomaGang/akkoma/src/branch/develop/test/pleroma/workers/remote_fetcher_worker_test.exs
should give you a decent start - you don't need to be too extensive, but some basic input-output checks would be good i think
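A minimal test along those lines might look like the sketch below; the `perform_job` helper comes from `Oban.Testing`, but the `"fetch_outbox"` op name and the exact assertion are assumptions based on the linked remote fetcher worker tests, not this PR's actual code, and the usual HTTP stubs from the existing tests are assumed to be in place.

```elixir
defmodule Pleroma.Workers.RemoteFetcherWorkerTest do
  use Pleroma.DataCase
  use Oban.Testing, repo: Pleroma.Repo

  alias Pleroma.Workers.RemoteFetcherWorker

  test "an enqueued outbox fetch job completes for a known remote outbox" do
    # "fetch_outbox" is a hypothetical op name; mirror whatever the PR enqueues.
    # Assumes the Tesla.Mock HTTP fixtures used by the existing tests.
    assert {:ok, _} =
             perform_job(RemoteFetcherWorker, %{
               "op" => "fetch_outbox",
               "id" => "https://example.com/users/alice/outbox"
             })
  end
end
```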
Thanks! ill go look at the tests stuff while on the train (in a few minutes).
I haven’t had time to take a proper look, but from a glance I’m concerned about some parts of this. Feel free to point out how i missed the (obvious) justification and/or reason why this is good and safe on my quick glance
@ -20,3 +20,3 @@
def cast(%{"id" => object}), do: cast(object)
def cast(_), do: :error
def cast(o), do: {:error, o}
is this signature change actually safe? Why is it done?
Note this is also inconsistent with the return of a prior function clause
this can be reverted. i experienced several errors that were not specific enough and doing this told me what the issue was. ill revert it if you want
reverting this in the next commit
@ -1773,3 +1775,3 @@
end
def pin_data_from_featured_collection(%{
def activity_data_from_outbox_collection(%{
This seems to replace pinned fetches everywhere, meaning we no longer are guaranteed to get pinned posts of a user profile?
pins are still fetched, using the same function. it treats an outbox the same as the featured collection
if the naming is misleading, i can change it
yes, please use a more generic name; calling "fetch_outbox" to fetch the pinned collection is just as confusing as leaving the original name and calling fetch_pins to get the outbox
@ -641,0 +640,4 @@
defp handle_incoming_normalised(%{"object" => o}, options),
do: handle_incoming_normalised(o, options)
defp handle_incoming_normalised(o, _), do: {:error, {:unknown_object_class, o}}
Just processing the referenced object if we don't understand the activity is imo a bad idea and likely to break stuff: e.g. consider a malformed Delete which might now end up actually inserting the object instead of deleting it
It can also end up breaking #845 (which i also haven’t reviewed yet)
similarly the changed return signature from :error to a tuple {:error, {.., ..}} needs some attention to check whether anything is affected by this
the issue is that when fetching other things (e.g. pins), we get objects directly. akkoma just wraps things in an activity no matter what, so we get an activity that's inside an activity
this arises from https://akkoma.dev/AkkomaGang/akkoma/src/branch/develop/lib/pleroma/object/fetcher.ex#L150 where we wrap everything in a fake activity
we can modify said fake create activity to have some sort of indication that it is fake
adding such an indication in the next commit
I don't like yet another "fake activity" marker (We already have one) with different semantics. Instead, the fetch logic should be changed to only wrap objects which actually need this
Both this and #845 could use a new fetchable_objects constant
Where? I couldn't find it personally, but I'm also pretty new to the codebase
Multiple actually:
and also a "fake" => true property sometimes.
Furthermore, Create activities for fetched remote posts have a null id since a few years or so (and fixing a fallout from this keeps getting deferred due to priority and time, see #712). The need to deal with them in all subsequent places makes things messy and imo they ideally shouldn't exist at all.
Unfortunately some of this is hard to fix due to fundamental design choices from early days (e.g. requiring the presence of a Create activity for all viewable objects). Fortunately for what you’re trying to do here, it’s easy enough to just not create unnecessary activities in the first place, avoiding all of this.
Wouldn't I need to completely rewrite the function around https://akkoma.dev/AkkomaGang/akkoma/src/branch/develop/lib/pleroma/object/fetcher.ex#L150 for that? Or at the very least completely bypass the function?
I can definitely switch to using "fake" => true or something. Or I can check if the ID is null
I just realized this can actually be done relatively easily and will be implementing it now
@ -306,6 +306,8 @@ def show(%{assigns: %{user: for_user}} = conn, %{id: nickname_or_id} = params) d
def statuses(%{assigns: %{user: reading_user}} = conn, params) do
with %User{} = user <- User.get_cached_by_nickname_or_id(params.id, for: reading_user),
:visible <- User.visible_for(user, reading_user) do
ActivityPub.enqueue_outbox_fetches(user)
imho this should be limited to, or there should at least be an option to limit this to, authenticated requests only. Processing new posts entails pulling in their whole threads, new referenced users and their pinned (now generally latest) posts and their threads and so on, which can become quite costly.
Also no need to enqueue fetches if the viewed profile is a local user.
Furthermore, there prob should be a timeout (e.g. 60 min) so we don't constantly spam outbox fetches for popular remote users
we do always fetch the outbox, but activities in it are still cached. a timeout may make sense either way.
local users never get an outbox address property, its only added to the user object when its a remote one. so technically we're already not fetching local outboxes. making that more explicit makes sense tho
This is insufficient, since:
And also doesn’t address the limitation to authenticated users only. As I mentioned this can be costly and I don’t want unauthenticated users browsing user profiles to ever be able to create such a significant workload for the instance for profiles actual local users might never even look at. (Letting unauthenticated viewers trigger this also exacerbates the request spam issue)
I've made it only fetch once per minute per user, and ignore local users. If one minute is too small, I propose making this a setting, though I'm not familiar with how to do that at this point. 10 mins should also be fine to hard-code, but an hour feels a little much.
The timeout mechanism seems like it should work and uses mechanisms you’re familiar with.
Though this much additional DB traffic to frequently update timestamps might (untested) negatively affect overall perf. Given the cooldown period is well within typical continuous-uptime, I was thinking of just using a runtime-only hint in caches; see e.g. #762 for an example of a (new) cachex set being used for timeouts.
To add the restriction for logged-in users, see e.g. lib/pleroma/web/mastodon_api/controllers/timeline_controller.ex’s public/2 function.
If this is too much, we can also adjust those bits later before merge (though it might take a while)
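A cache-backed cooldown along those lines could look roughly like this sketch; the cache name, worker call, and TTL are all illustrative assumptions, not the codebase's actual identifiers:

```elixir
# Hypothetical cooldown check: skip the fetch on a cache hit, otherwise
# record the attempt with a TTL and enqueue the fetch worker.
def enqueue_outbox_fetches(%{local: false, outbox: outbox} = user) when is_binary(outbox) do
  case Cachex.exists?(:outbox_fetch_cache, user.ap_id) do
    {:ok, true} ->
      # fetched recently; do nothing until the TTL expires
      :ok

    _ ->
      Cachex.put(:outbox_fetch_cache, user.ap_id, true, ttl: :timer.minutes(30))

      %{"op" => "fetch_outbox", "id" => outbox}
      |> Pleroma.Workers.RemoteFetcherWorker.new()
      |> Oban.insert()

      :ok
  end
end
```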
I have now added the cachex bit. Ill be pushing this once i have verified it still works, and will next be looking at the logged in users restriction.
Okay! Hope this is good now (no more when not logged in, cachex for timeouts). If so, I think that would be all in terms of what you requested, but let me know if I've missed anything!
Great! Unfortunately it will take a bit before i can take a proper look at everything again, but i posted some quick remarks on the new commits already
@ -0,0 +6,4 @@
add_if_not_exists(:outbox, :text)
end
execute("update users set outbox = replace(inbox, 'inbox', 'outbox')")
This does not generally hold
im aware. but for migrating instances, no outboxes will be populated at first so this is kinda the only solution if you want previously fetched users to have the property
any idea how to do this more accurately?
don’t populate it and wait for the next user refresh to occur or Update activity to arrive filling outbox with the correct value
(also this breaks for e.g. users named "inbox")
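Following that suggestion, the migration could simply add the column and leave it unset; a sketch, with the module name assumed:

```elixir
defmodule Pleroma.Repo.Migrations.UsersAddOutboxes do
  use Ecto.Migration

  def up do
    alter table(:users) do
      # No backfill: outbox stays nil until the next user refresh or an
      # incoming Update activity fills in the real value.
      add_if_not_exists(:outbox, :text)
    end
  end

  def down do
    alter table(:users) do
      remove_if_exists(:outbox, :text)
    end
  end
end
```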
@Oneric i made some changes but i also still have a few questions about part of your feedback (feel free to take your time, but just in case you havent seen it yet)
okay so since i have quite strong unmedicated ADHD, fixing the issues with this will probably take me a while since my motivation has disappeared now.
ill convert this to a draft for now and continue when i can either get medicated (only a few more months i think) or maybe by some miracle ill be bored/motivated enough to continue (train journeys may help here)
Add backfilling of posts to WIP: Add backfilling of posts
Sorry if my curt comments yesterday ended up sounding harsh; I’m stretched thin on time atm and just quickly pointed out whatever seemed in need of further change
Oh dont worry, its not related to you. I just lose motivation after doing one project for a few days. And itll return, just takes a while
@ -1839,0 +1852,4 @@
{true} <-
{last_fetch > now ||
last_fetch < now - 60}
) do
I'm not too familiar with elixir yet - Is this idiomatic?
Wrangling the fallback for nil into a sure-to-fail NaiveDateTime will be awkward using with; a small function is probably better.
Also for the current version:
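For example, a small predicate could absorb the nil fallback instead of forcing it through the with; the function name and cooldown argument here are illustrative, not from the PR:

```elixir
# true when no fetch was recorded yet, or when the cooldown has elapsed
defp outbox_fetch_due?(nil, _cooldown_seconds), do: true

defp outbox_fetch_due?(%NaiveDateTime{} = last_fetch, cooldown_seconds) do
  NaiveDateTime.diff(NaiveDateTime.utc_now(), last_fetch) >= cooldown_seconds
end
```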
wrt typical styling: just run mix format and code style will be adjusted automatically
3e4424684f to fdaff0a70b
488d4b7c2e to 73e587ed9b
73e587ed9b to e37ffa1b1f
phew sorry for the commit spam, ill probably end up squashing all that later
@floatingghost
Your recently added migration breaks due to trying to parse users, but at that point not all columns are set up. This won't happen in prod at this point because there's no migrations afterwards, but just as an early report, I'm putting it here.
So far i'm just moving the migration after mine to fix it for my instances, but this will cause issues in the future if left in.
whoops, good catch!
I think we can avoid parsing the User struct at all here by just selecting the fields we need in the initial query
this is still happening and makes developing noticeably harder
The following patch should fix it; you can insert it at the start:
Fix SigningKey migration patch
should i add that into the PR or will it be pushed to develop soon? id just merge from that if so
I do not know when it will be pushed to develop, but since you can easily drop the patch again later if needed with git rebase -i develop, i’d personally just add it here for now
@tudbut: the data migration fix was pushed yesterday
WIP: Add backfilling of posts to Add backfilling of posts
@ -1857,0 +1868,4 @@
@timer,
outbox_address,
true,
ttl: Config.get!([:activitypub, :outbox_refetch_cooldown])
hmm.. i imagine some will want to disable this completely and intuitively a cooldown of :infinity makes sense for this. As is, :infinity won't prevent fetches though, as at least one fetch will happen before there’s a cache entry. Perhaps check for :infinity before the cache lookup
good idea, will do!
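The :infinity short-circuit could be as simple as this sketch, where maybe_enqueue_outbox_fetch/1 is a hypothetical name standing in for the existing cache-checking logic:

```elixir
def enqueue_outbox_fetches(%{local: false} = user) do
  case Config.get!([:activitypub, :outbox_refetch_cooldown]) do
    # backfilling disabled entirely: never touch the cache or enqueue
    :infinity -> :ok
    _cooldown -> maybe_enqueue_outbox_fetch(user)
  end
end
```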
@ -1857,0 +1889,4 @@
def enqueue_outbox_fetches(%{local: false} = user) do
make_user_from_ap_id(user.ap_id)
:ok
Technically AP spec requires all actors to provide at least an inbox and outbox and this should never happen (well, .. except for old, by now unreachable, pre-AP users). If an implementation decided to omit outbox though, due to it having little relevance in Server2Server interaction, this would end up refetching the user each time the account statuses are queried, including duplicates for simultaneous API calls. That’s not nice on remote servers.
While still not ideal (no dedupe yet), the common approach in the codebase is to use User.get_or_fetch_by_ap_id. It will refresh the user with a built-in cooldown (by default 1 day, can be changed with the maximum_age: parameter)
good catch! luckily an easy fix :D
will do tomorrow
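The suggested swap is small; a sketch of the fallback clause for users without a stored outbox, with the clause shape assumed from the quoted diff:

```elixir
# Instead of unconditionally refetching via make_user_from_ap_id/1,
# let get_or_fetch_by_ap_id/1 apply its built-in refresh cooldown.
def enqueue_outbox_fetches(%{local: false} = user) do
  User.get_or_fetch_by_ap_id(user.ap_id)
  :ok
end
```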
@ -309,0 +312,4 @@
_ ->
ActivityPub.enqueue_outbox_fetches(user)
end
style nit: this can be simplified to an if without else
how did i miss that! probably forgot that if statements exist or something :P
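i.e. roughly (variable names taken from the quoted controller code):

```elixir
# `if` without `else` returns nil when the condition is false,
# which is fine here since the result isn't used
if reading_user do
  ActivityPub.enqueue_outbox_fetches(user)
end
```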
Most comments are just informational remarks or minor nits.
The biggest remaining hurdle appears to be avoiding repetitive errors from trying to reinsert activities without performance impact; not sure what the best course of action is.
I was originally also concerned about whether fetch-specific security checks after retrieval hold up when dealing with activities since so far the code only had to handle terminal objects, but it seems like it’s fine.
@ -373,2 +373,3 @@
authorized_fetch_mode: false,
max_collection_objects: 50
max_collection_objects: 50,
outbox_refetch_cooldown: 5 * 60
note: 5min still seems too short to me as a default, but ultimately this is up to floati to decide
ill do 30 mins as default
@ -1589,2 +1589,3 @@
featured_address = data["featured"]
{:ok, pinned_objects} = fetch_and_prepare_featured_from_ap_id(featured_address)
{:ok, pinned_objects} = fetch_and_prepare_collection_from_ap_id(featured_address)
outbox_address = data["outbox"]
note: the preëxisting {:ok, _} = pattern match here causes the process to crash if an error occurs while fetching the collection, which is undesirable if the process also did anything other than fetching this collection.
I think i fixed this already either in #862 or some not-yet published patches building further on top of it (but it’s hard to keep track of so many interdependent patch trees), so prepare for merge conflicts possibly showing up here
thanks, but i dont fear merge conflicts anymore :p
@ -1857,0 +1879,4 @@
})
a ->
Logger.debug("Not refetching outbox (TTL not reached)")
a is unused, creating a compiler warning (this should be the same in your elixir version). Instead of a variable use an underscore _ as a generic placeholder
Yeah i was using it for debug output but then removed that.
@ -308,1 +308,4 @@
:visible <- User.visible_for(user, reading_user) do
case reading_user do
nil -> {}
This will be replaced with an if anyway, but for future reference wrt mix format: If one branch uses multiple lines, all branches must break condition and body by a newline
replaced by if !is_nil
@ -13,0 +18,4 @@
Enum.each(outbox, fn {ap_id, _} ->
if is_nil(Object.get_cached_by_ap_id(ap_id)) do
Fetcher.fetch_object_from_id(ap_id, depth: 1)
end
Fetcher.fetch_object_from_id already checks whether the object is already locally known, no need to check again here. Neither checks whether an activity of this AP id is already known though, which will likely blow up the amount of "already known" error logs. But activity lookups aren't cached and thus costly atm (most lookups also don't directly use the id but its object property anyway)...
awesome, thank you. simplified to remove conditional
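With the redundant lookup dropped, the loop from the quoted diff reduces to:

```elixir
# fetch_object_from_id/2 already skips objects that are locally known
Enum.each(outbox, fn {ap_id, _} ->
  Fetcher.fetch_object_from_id(ap_id, depth: 1)
end)
```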
@ -0,0 +3,4 @@
def up do
alter table(:users) do
add_if_not_exists(:outbox, :text)
This is likely the default anyway, but for consistency with the definition in user.ex it’d be good to define the nil default
i took example from users_add_inboxes: and it doesnt seem like an argument specifying default exists for the function
found default parameter in documentation :D
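With the default spelled out explicitly (Ecto's add_if_not_exists/3 accepts the same options as add/3, including default:):

```elixir
alter table(:users) do
  # nil default, matching the schema definition in user.ex
  add_if_not_exists(:outbox, :text, default: nil)
end
```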
@ -0,0 +7,4 @@
file # libmagic
ffmpeg
];
}
If this was just added because it shows up as a new file you can add shell.nix to a user-side .gitignore
i do think it makes sense to have it, but its obviously external to the PR's subject. ill remove it if desired
Also 3 tests are failing, make sure to run mix test || mix test --failed and fix any errors:
aside: i love this collection name in the tests
okay so i just re-discovered why i didnt use the tests: i get about a hundred failures for irrelevant reasons like these URLs or somehow broken MRF policies
ill push without testing it myself again
this push includes a mix format which changed a few files that aren't related. i had been avoiding this before by restoring such files on every format but i hope its fine to just commit them
this looks like the test env isn’t applied for some reason. Are you manually overriding MIX_ENV while running mix test || mix test --failed?
Please do not reformat any preëxisting code unrelated to the PR. Everything should already be mix format clean. If you’re using an older elixir version the format guidelines might differ from what our canonical dev version used in CI expects; changing it will just cause CI failures later
my favourite kind of collection ^^
Thinking a bit about it, I believe the best way to deal with the issue of constraint errors from reprocessing outbox activities is to remember the last processed outbox element and abort further insertion when it’s encountered again. With this check added, also querying the database to check for any pre-existing (e.g. from federation) activity shouldn't be too costly anymore.
We should not remember just any last activity from the user which e.g. arrived via federation, since then we might miss entries inbetween which were not federated and the ordering might be different. Only from outbox fetches.
This leaves the issue of whether to add the activity check to the outbox fetching specifically or to all fetches in Object.Fetcher; not sure