Make Remote Cache purge itself free from old media #71

Open
opened 2022-08-19 07:43:46 +00:00 by puniko · 5 comments
Contributor

Remote Cache is somewhat broken (as we all know), in that it doesn't purge old cached data, leading to an ever-growing cache directory which you'd have to clean manually.

With that, enabling remote cache doesn't make a lot of sense.

So I suggest adding a way to purge old remote media automatically, or at least a CLI task that does it, so we could run a cron job to handle it.

Or remove it entirely (although I would like to have a working remote cache; it makes loading media way smoother).

Owner

Question would be what criteria to use for expiration and what to do if someone requests expired media. For example we could do something like keeping media for a number of days (perhaps configurable) like maybe 30 days or so. After that they are removed from disk and if someone requests them they are handled as if remote cache was disabled for them.

Implementation-wise, this expiring could be possible by checking `drive_file.createdAt`. Disabling remote cache for those should be possible with `UPDATE drive_file SET storedInternal = false, isLink = true` when they are removed from disk. Not sure if this correctly handles cases where the file is stored in S3.
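A minimal sketch of that expiry check in TypeScript (the function name and the configurable retention value are hypothetical illustrations, not actual FoundKey code):

```typescript
// Sketch: decide whether a cached remote file has expired, based on
// drive_file.createdAt and a configurable retention period.
// `maxAgeDays` stands in for a hypothetical config value, e.g. 30.
function isExpired(createdAt: Date, maxAgeDays: number, now: Date = new Date()): boolean {
	const maxAgeMs = maxAgeDays * 24 * 60 * 60 * 1000;
	return now.getTime() - createdAt.getTime() > maxAgeMs;
}

// For expired files, a cleanup task would delete the blob from disk (or S3)
// and then run the UPDATE mentioned above:
//   UPDATE drive_file SET storedInternal = false, isLink = true WHERE ...;

// A file cached 31 days ago with a 30-day retention is expired:
console.log(isExpired(new Date(Date.now() - 31 * 24 * 60 * 60 * 1000), 30)); // true
```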

Author
Contributor

> Question would be what criteria to use for expiration

I think the last time it was fetched from the remote, or the last time it was accessed. Or it could also use a threshold on storage usage (might be harder to implement). Maybe we can peek at what Pleroma uses as criteria.

> and what to do if someone requests expired media.

Pleroma refetches it and caches it again AFAIK (might be wrong on that, though).

> After that they are removed from disk and if someone requests them they are handled as if remote cache was disabled for them.

I don't think this is a good way of handling it. In that case, it should be cached again, but maybe for a shorter time? If remote cache is enabled, a client should never have to fetch from another place.
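The "cache again, but for a shorter time" idea could be sketched like this (the halving policy, minimum, and names are purely illustrative, not anything FoundKey or Pleroma implements):

```typescript
// Sketch: when an expired file is requested and re-cached, give it a
// shorter retention than a freshly fetched file. The halving policy and
// the one-day floor are made up for illustration.
function nextRetentionDays(previousDays: number, minDays = 1): number {
	return Math.max(minDays, Math.floor(previousDays / 2));
}

console.log(nextRetentionDays(30)); // 15
console.log(nextRetentionDays(1));  // 1
```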

Owner

> last time it was accessed

I was trying to avoid having to keep track of that; storing the last access time could be annoying. I'm not sure we could use the filesystem metadata reliably for this. Further, I'm not sure we could even do it, because files might be cached, in which case we would have no way to know the last access time.

> last time it was fetched from remote

Mmh, I think the file is only fetched once? Although I guess with Misskey drive it could be used on multiple posts.

> refetch it and caches it again

Again, I was trying to avoid having to keep track of additional data, but it might be unavoidable.

> if remote cache is enabled, a client should not have to fetch from another place ever

I never said they should fetch it from the remote. We could use the media proxy. In that case we might also be able to leverage HTTP caching and/or caching in the web server to avoid having to handle caching ourselves.
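If the media proxy route is taken, leveraging HTTP caching could be as simple as emitting standard caching headers on proxied media responses. A sketch with example values (the helper and the chosen directives are assumptions, not actual FoundKey behaviour):

```typescript
// Sketch: build Cache-Control headers for proxied remote media so that
// browsers and reverse proxies (e.g. nginx) can cache it, keeping the
// application itself out of the caching business.
function cacheHeaders(maxAgeSeconds: number): Record<string, string> {
	return {
		// `immutable` would be reasonable if a given media URL never
		// changes content (an assumption here, not a verified property).
		'Cache-Control': `public, max-age=${maxAgeSeconds}, immutable`,
	};
}

console.log(cacheHeaders(604800)['Cache-Control']);
// public, max-age=604800, immutable
```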

Owner

> was trying to avoid having to keep track of that. Having to store last access time could probably be annoying. Not sure if we could use the file system metadata reliably for this. Further, I'm not sure if we could even do it because files might be cached, in which case we would have no way to know when the last access time was.

Yeah, if we use the filesystem, it may just not work if it's mounted with `noatime`, which means no access time is recorded. Probably the best option is using the time it was fetched from the remote.
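For illustration, Node does expose the filesystem's access time via `fs.stat`, but on a `noatime` mount that value does not advance on reads, which is why it can't serve as a reliable last-access record (a small standalone demo, not FoundKey code):

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

// Sketch: read a file's access time from filesystem metadata.
// On a `noatime` (or `relatime`) mount this value does not update on
// every read, which is why tracking fetch time in the database is the
// more reliable option.
const file = path.join(os.tmpdir(), 'atime-demo.txt');
fs.writeFileSync(file, 'hello');
fs.readFileSync(file);

const { atime, mtime } = fs.statSync(file);
console.log('atime:', atime.toISOString(), 'mtime:', mtime.toISOString());
```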

> Mmh, I think the file is only fetched once? Although I guess with Misskey drive it could be used on multiple posts

Does Misskey perform a duplicate check of some sort when fetching files?

> I never said they should fetch it from the remote. We could use media proxy. In that case we might also be able to leverage HTTP caching and/or caching of the web server to avoid having to handle caching ourselves.

I believe Pleroma does use nginx's server-side caching for their implementation (could be wrong), so it's possible we could do the same if that's the case.

Owner

> Probably the best option is using the time it was fetched from the remote.

We currently only have the `createdAt` time in the database.

> Does Misskey perform a duplicate check of some sort when fetching files?

Yes, it uses an MD5 hash of the file. If there is another file with the same MD5 hash (and the `force` option is not set), the temporary file that was just downloaded is discarded.

https://akkoma.dev/FoundKeyGang/FoundKey/src/commit/0965d3cbd9145f5d0ddd9df8cd0e8e6ea0588b31/packages/backend/src/services/drive/add-file.ts#L356-L367
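That dedup check boils down to hashing the downloaded file and looking for an existing row with the same hash. A standalone sketch of the hashing part (the `knownHashes` set stands in for the actual `drive_file` query; names are hypothetical):

```typescript
import { createHash } from 'node:crypto';

// Sketch of the duplicate check: hash the just-downloaded file with MD5
// and compare against hashes already stored for existing drive files.
function md5(buf: Buffer): string {
	return createHash('md5').update(buf).digest('hex');
}

function isDuplicate(file: Buffer, knownHashes: Set<string>): boolean {
	return knownHashes.has(md5(file));
}

const existing = new Set([md5(Buffer.from('cat.png bytes'))]);
console.log(isDuplicate(Buffer.from('cat.png bytes'), existing)); // true
console.log(isDuplicate(Buffer.from('other bytes'), existing));   // false
```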

Johann150 added this to the (deleted) project 2022-10-31 19:39:56 +00:00
Johann150 added the fix label 2022-12-23 10:22:19 +00:00
Johann150 removed this from the (deleted) project 2022-12-23 10:22:22 +00:00
Reference: FoundKeyGang/FoundKey#71