Websites are increasingly getting more bloated with tricks like inlining content (e.g., CNN.com) which puts pages at or above 5MB. This value may still be too low.
Rich Media parsing was previously handled on-demand with a 2 second HTTP request timeout and retained only in Cachex. Every time a Pleroma instance is restarted it will have to request and parse the data for each status with a URL detected. When fetching a batch of statuses they were processed in parallel to attempt to keep the maximum latency at 2 seconds, but often resulted in a timeline appearing to hang during loading due to a URL that could not be successfully reached. URLs which had images links that expire (Amazon AWS) were parsed and inserted with a TTL to ensure the image link would not break.
Rich Media data is now cached in the database and fetched asynchronously. Cachex is used as a read-through cache. When the data becomes available we stream an update to the clients. If the result is returned quickly the experience is almost seamless. Activities were already processed for their Rich Media data during ingestion to warm the cache, so users should not normally encounter the asynchronous loading of the Rich Media data.
Implementation notes:
- The async worker is a Task with a globally unique process name to prevent duplicate processing of the same URL
- The Task will attempt to fetch the data 3 times with increasing sleep time between attempts
- The HTTP request obeys the default HTTP request timeout value instead of 2 seconds
- URLs that cannot be successfully parsed due to an unexpected error receives a negative cache entry for 15 minutes
- URLs that fail with an expected error will receive a negative cache with no TTL
- Activities that have no detected URLs insert a nil value in the Cachex :scrubber_cache so we do not repeat parsing the object content with Floki every time the activity is rendered
- Expiring image URLs are handled with an Oban job
- There is no automatic cleanup of the Rich Media data in the database, but it is safe to delete at any time
- The post draft/preview feature makes the URL processing synchronous so the rendered post preview will have an accurate rendering
Overall performance of timelines and creating new posts which contain URLs is greatly improved.
This lets us:
- avoid issues with broken hash indices for PostgreSQL <10
- drop runtime checks and legacy codepaths for <11 in db search
- always enable custom query plans for performance optimisation
PostgreSQL 11 is already EOL since 2023-11-09, so
in theory everyone should already have moved on to 12 anyway.
The newest git HEAD of MIME already knows about APNG, but this
hasn’t been released yet. Without this, APNG attachments from
remote posts won’t display as images in frontends.
Fixes: akkoma#657
By mapping all extensions related to our custom privileged types
back to innocuous text/plain, our custom types will never automatically
be inserted which was one of the factors making impersonation possible.
Note, this does not invalidate the upload and emoji Content-Type
restrictions from previous commits. Apart from counterfeit AP objects
there are other payloads with standard types this protects against,
e.g. *.js Javascript payloads as used in prior frontend injections.
This actually was already intended before to eradict all future
path-traversal-style exploits and to fix issues with some
characters like akkoma#610 in 0b2ec0ccee. However, Dedupe and
AnonymizeFilename got mixed up. The latter only anonymises the name
in Content-Disposition headers GET parameters (with link_name),
_not_ the upload path.
Even without Dedupe, the upload path is prefixed by an UUID,
so it _should_ already be hard to guess for attackers. But now
we actually can be sure no path shenanigangs occur, uploads
reliably work and save some disk space.
While this makes the final path predictable, this prediction is
not exploitable. Insertion of a back-reference to the upload
itself requires pulling off a successfull preimage attack against
SHA-256, which is deemed infeasible for the foreseeable futures.
Dedupe was already included in the default list in config.exs
since 28cfb2c37a, but this will get overridde by whatever the
config generated by the "pleroma.instance gen" task chose.
Upload+delete tests running in parallel using Dedupe might be flaky, but
this was already true before and needs its own commit to fix eventually.
The lack thereof enables spoofing ActivityPub objects.
A malicious user could upload fake activities as attachments
and (if having access to remote search) trick local and remote
fedi instances into fetching and processing it as a valid object.
If uploads are hosted on the same domain as the instance itself,
it is possible for anyone with upload access to impersonate(!)
other users of the same instance.
If uploads are exclusively hosted on a different domain, even the most
basic check of domain of the object id and fetch url matching should
prevent impersonation. However, it may still be possible to trick
servers into accepting bogus users on the upload (sub)domain and bogus
notes attributed to such users.
Instances which later migrated to a different domain and have a
permissive redirect rule in place can still be vulnerable.
If — like Akkoma — the fetching server is overly permissive with
redirects, impersonation still works.
This was possible because Plug.Static also uses our custom
MIME type mappings used for actually authentic AP objects.
Provided external storage providers don’t somehow return ActivityStream
Content-Types on their own, instances using those are also safe against
their users being spoofed via uploads.
Akkoma instances using the OnlyMedia upload filter
cannot be exploited as a vector in this way — IF the
fetching server validates the Content-Type of
fetched objects (Akkoma itself does this already).
However, restricting uploads to only multimedia files may be a bit too
heavy-handed. Instead this commit will restrict the returned
Content-Type headers for user uploaded files to a safe subset, falling
back to generic 'application/octet-stream' for anything else.
This will also protect against non-AP payloads as e.g. used in
past frontend code injection attacks.
It’s a slight regression in user comfort, if say PDFs are uploaded,
but this trade-off seems fairly acceptable.
(Note, just excluding our own custom types would offer no protection
against non-AP payloads and bear a (perhaps small) risk of a silent
regression should MIME ever decide to add a canonical extension for
ActivityPub objects)
Now, one might expect there to be other defence mechanisms
besides Content-Type preventing counterfeits from being accepted,
like e.g. validation of the queried URL and AP ID matching.
Inserting a self-reference into our uploads is hard, but unfortunately
*oma does not verify the id in such a way and happily accepts _anything_
from the same domain (without even considering redirects).
E.g. Sharkey (and possibly other *keys) seem to attempt to guard
against this by immediately refetching the object from its ID, but
this is easily circumvented by just uploading two payloads with the
ID of one linking to the other.
Unfortunately *oma is thus _both_ a vector for spoofing and
vulnerable to those spoof payloads, resulting in an easy way
to impersonate our users.
Similar flaws exists for emoji and media proxy.
Subsequent commits will fix this by rigorously sanitising
content types in more areas, hardening our checks, improving
the default config and discouraging insecure config options.
This fixes an oversight in e99e2407f3
which added background_removal as a possible SimplePolicy setting.
However, it did _not_ add a default value to the base config and
as it turns out instance_list doesn’t handle unset options well.
In effect this caused federating instances with SimplePolicy enabled
but background_removal not explicitly configured to always trip up for
outgoing account updates in check_background_removal (and incoming
updates from Sharkey).
For added ""fun"" this error was able to block account updates made
e.g. via /api/v1/accounts/update_credentials.
Tests were unaffected since they explicitly override
all relevant config options.
Set a default to avoid all this
(note to self: don’t forget next time, baka!)
When no image description is filled in, Pleroma allowed fallbacks.
Those were (based on a setting) either the filename, or a fixed description.
Neither are good options for image descriptions imo, so here we remove this.
Note that there's two tests removed who supposedly tested something else.
But examining closer, they didn't seem to test what they claimed to test,
so I removed them rather than try to "fix" them.
This should help mitigate negative impacts related to block-retaliation
and block-circumvention when blocks become visible to the blocked party.
Instances interested in broadcasting blocks can turn this on if they
wish. This should have always been the default.
See also: AkkomaGang/akkoma-fe#274
Argos Translate is a Python module for translation and can be used as a command line tool.
This is also the engine for LibreTranslate, for which we already have a module.
Here we can use the engine directly from our server without doing requests to a third party or having to install our own LibreTranslate webservice (obviously you do have to install Argos Translate).
One thing that's currently still missing from Argos Translate is auto-detection of languages (see <https://github.com/argosopentech/argos-translate/issues/9>). For now, when no source language is provided, we just return the text unchanged, supposedly translated from the target language. That way you get a near immediate response in pleroma-fe when clicking Translate, after which you can select the source language from a dropdown.
Argos Translate also doesn't seem to handle html very well. Therefore we give admins the option to strip the html before translating. I made this an option because I'm unsure if/how this will change in the future.
Co-authored-by: ilja <git@ilja.space>
Reviewed-on: AkkomaGang/akkoma#351
Co-authored-by: ilja <akkoma.dev@ilja.space>
Co-committed-by: ilja <akkoma.dev@ilja.space>
- Drop Expect-CT
Expect-CT has been redundant since 2018 when Certificate Transparency became mandated and required for all CAs and browsers. This header is only implemented in Chrome and is now deprecated. HTTP header analysers do not check this anymore as this is enforced by default. See https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Expect-CT
- Raise HSTS to 2 years and explicitly preload
The longer age for HSTS, the better. Header analysers prefer 2 years over 1 year now as free TLS is very common using Let's Encrypt.
For HSTS to be fully effective, you need to submit your root domain (domain.tld) to https://hstspreload.org. However, a requirement for this is the "preload" directive in Strict-Transport-Security. If you do not have "preload", it will reject your domain.
- Drop X-Download-Options
This is an IE8-era header when Adobe products used to use the IE engine for making outbound web requests to embed webpages in things like Adobe Acrobat (PDFs). Modern apps are using Microsoft Edge WebView2 or Chromium Embedded Framework. No modern browser checks or header analyser check for this.
- Set base-uri to 'none'
This is to specify the domain for relative links (`<base>` HTML tag). pleroma-fe does not use this and it's an incredibly niche tag.
I use all of these myself on my instance by rewriting the headers with zero problems. No breakage observed.
I have not compiled my Elixr changes, but I don't see why they'd break.
Co-authored-by: r3g_5z <june@terezi.dev>
Reviewed-on: AkkomaGang/akkoma#294
Co-authored-by: @r3g_5z@plem.sapphic.site <june@terezi.dev>
Co-committed-by: @r3g_5z@plem.sapphic.site <june@terezi.dev>