73 commits

Author SHA1 Message Date
5da9cbd8a5 RichMedia refactor
Rich Media parsing was previously handled on demand with a 2 second HTTP request timeout, and the results were retained only in Cachex. Every time a Pleroma instance was restarted, it had to request and parse the data again for each status with a detected URL. When fetching a batch of statuses, they were processed in parallel to try to keep the maximum latency at 2 seconds, but this often resulted in a timeline appearing to hang during loading because of a URL that could not be reached. URLs with image links that expire (Amazon AWS) were parsed and inserted with a TTL to ensure the image link would not break.

Rich Media data is now cached in the database and fetched asynchronously. Cachex is used as a read-through cache. When the data becomes available we stream an update to the clients. If the result is returned quickly the experience is almost seamless. Activities were already processed for their Rich Media data during ingestion to warm the cache, so users should not normally encounter the asynchronous loading of the Rich Media data.
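The read-through arrangement described above can be illustrated with a minimal sketch. It is not Pleroma's implementation (which is written in Elixir and uses Cachex and the database); the class name `ReadThroughCache`, the `load_from_store` callback, and the example URL are hypothetical and chosen only for illustration. The sketch shows the general technique: look in the fast in-memory cache first, and on a miss fall back to the persistent store and remember the result.

```python
from typing import Callable, Dict, Optional


class ReadThroughCache:
    """In-memory cache that falls back to a slower persistent store on a miss.

    Hypothetical illustration only; names do not come from Pleroma.
    """

    def __init__(self, load_from_store: Callable[[str], Optional[dict]]):
        self._memory: Dict[str, dict] = {}
        self._load_from_store = load_from_store

    def get(self, url: str) -> Optional[dict]:
        # Fast path: the value is already held in memory.
        if url in self._memory:
            return self._memory[url]
        # Slow path: read through to the persistent store.
        value = self._load_from_store(url)
        if value is not None:
            # Keep the result so the next lookup skips the store.
            self._memory[url] = value
        return value


# A plain dict stands in for the database table (assumption).
database = {"https://example.com/a": {"title": "Example"}}
lookups = []


def load(url: str) -> Optional[dict]:
    lookups.append(url)  # record every trip to the "database"
    return database.get(url)


cache = ReadThroughCache(load)
first = cache.get("https://example.com/a")    # miss, loaded from the store
second = cache.get("https://example.com/a")   # hit, served from memory
missing = cache.get("https://example.com/missing")
```

In this sketch a restart would empty only the in-memory layer; the data in the persistent store survives, which is the property the commit message relies on.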

Implementation notes:

- The async worker is a Task with a globally unique process name to prevent duplicate processing of the same URL
- The Task will attempt to fetch the data 3 times with increasing sleep time between attempts
- The HTTP request obeys the default HTTP request timeout value instead of 2 seconds
- URLs that cannot be successfully parsed due to an unexpected error receive a negative cache entry for 15 minutes
- URLs that fail with an expected error receive a negative cache entry with no TTL
- Activities that have no detected URLs insert a nil value in the Cachex :scrubber_cache so we do not repeat parsing the object content with Floki every time the activity is rendered
- Expiring image URLs are handled with an Oban job
- There is no automatic cleanup of the Rich Media data in the database, but it is safe to delete at any time
- The post draft/preview feature makes the URL processing synchronous so the rendered post preview will have an accurate rendering
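The retry and negative-caching notes above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the actual Elixir code. The 3 attempts and the 15 minute TTL come from the notes; everything else is an assumption: the names `fetch_with_retries`, `ExpectedError`, and `negative_cache`, the 1 s and 2 s delays, and the choice not to retry after an expected error.

```python
import time
from typing import Callable, Dict, Optional

MAX_ATTEMPTS = 3                # from the notes: fetch is attempted 3 times
UNEXPECTED_ERROR_TTL = 15 * 60  # from the notes: 15 minutes, in seconds


class ExpectedError(Exception):
    """Hypothetical marker for failures that a retry would not fix."""


# url -> expiry timestamp, or None for an entry with no TTL (assumed layout)
negative_cache: Dict[str, Optional[float]] = {}


def fetch_with_retries(
    url: str,
    fetch: Callable[[str], dict],
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return fetch(url)
        except ExpectedError:
            # Assumption: expected errors are not retried; no TTL is set.
            negative_cache[url] = None
            return None
        except Exception:
            if attempt < MAX_ATTEMPTS:
                # Increasing delay between attempts (1 s, then 2 s; assumed).
                sleep(attempt)
    # All attempts failed unexpectedly: remember that for 15 minutes.
    negative_cache[url] = now() + UNEXPECTED_ERROR_TTL
    return None


def is_negatively_cached(
    url: str, now: Callable[[], float] = time.monotonic
) -> bool:
    if url not in negative_cache:
        return False
    expires_at = negative_cache[url]
    return expires_at is None or now() < expires_at
```

Passing `sleep` and `now` as parameters keeps the sketch easy to exercise without real delays; a caller would check `is_negatively_cached(url)` before scheduling another fetch.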

Overall performance of timelines and creating new posts which contain URLs is greatly improved.
2024-06-09 17:33:48 +01:00
Alex Gleason
3ff9c5e2a6
Break out activity-specific HTML functions into Pleroma.Activity.HTML
Fixes cycles in lib/pleroma/ecto_type/activity_pub/object_validators/safe_text.ex
2021-05-29 12:29:11 -05:00
Haelwenn (lanodan) Monnier
c4439c630f
Bump Copyright to 2021
grep -rl '# Copyright © .* Pleroma' * | xargs sed -i 's;Copyright © .* Pleroma .*;Copyright © 2017-2021 Pleroma Authors <https://pleroma.social/>;'
2021-01-13 07:49:50 +01:00
lain
e1e7e4d379 Object: Rework how Object.normalize works
Now it defaults to not fetching, and the option is named.
2021-01-04 13:38:31 +01:00
lain
713612c377 Cachex: Make caching provider switchable at runtime.
Defaults to Cachex.
2020-12-18 17:44:46 +01:00
rinpatch
e198ba492e Rich Media: Do not cache URLs for preview statuses
Closes #1987
2020-09-05 20:53:46 +03:00
rinpatch
46236d1d87 html.ex: optimize external url extraction
By using a :not() selector and only extracting attributes from the
first match.
2020-09-02 12:45:20 +03:00
Alexander Strizhakov
6512ef6879
excluding attachment links from RichMedia 2020-06-29 15:25:57 +03:00
Haelwenn (lanodan) Monnier
6da6540036
Bump copyright years of files changed after 2020-01-07
Done via the following command:
git diff fcd5dd259a --stat --name-only | xargs sed -i '/Pleroma Authors/c# Copyright © 2017-2020 Pleroma Authors <https:\/\/pleroma.social\/>'
2020-03-02 06:08:45 +01:00
rinpatch
472132215e Use floki's new APIs for parsing fragments 2020-02-16 01:55:26 +03:00
237b2068f9 Revert "Merge branch 'feat/floki-fasthtml' into 'develop'"
This reverts merge request !2194
2020-02-11 16:55:18 +00:00
rinpatch
ea1631d7e6 Make Floki use fast_html 2020-02-11 16:17:21 +03:00
Egor Kislitsyn
b7a57d8e38 Use Pleroma.Utils.compile_dir/1 in Pleroma.HTML.compile_scrubbers/0 2019-12-10 00:38:01 +07:00
rinpatch
d6c89068f3 HTML: Compile Scrubbers on boot
This makes it possible to configure their behavior on OTP releases.
2019-12-08 20:35:41 +03:00
rinpatch
a21340caa1 Fix never matching clause
`length/1` is only used with lists.
2019-12-08 16:46:18 +03:00
Egor Kislitsyn
cf52106e05
Update Floki dependency 2019-12-02 13:38:35 +07:00
Egor Kislitsyn
a98cda7758
Fix Pleroma.HTML.extract_first_external_url/2 2019-11-29 15:49:35 +07:00
rinpatch
ae59b38203 Rip out the rest of htmlsanitizeex 2019-10-30 09:20:13 +03:00
rinpatch
77cfb08b8c Remove commented-out code 2019-10-29 20:58:54 +03:00
rinpatch
08f6837065 Switch from HtmlSanitizeEx to FastSanitize 2019-10-29 01:18:08 +03:00
Egor Kislitsyn
cf3041220a Add support for rel="ugc" 2019-09-19 14:56:10 +07:00
lain
ef43016b2c Merge branch 'feature/custom-fields' into 'develop'
Add custom profile fields

See merge request pleroma/pleroma!1488
2019-08-20 12:44:14 +00:00
Haelwenn (lanodan) Monnier
a6a814420d
html.ex: Allow sub and sup elements by default
Closes: https://git.pleroma.social/pleroma/pleroma/issues/1191
2019-08-14 22:49:13 +02:00
Egor Kislitsyn
f7bbf99caa Use info.fields instead of source_data for remote users 2019-08-14 14:52:54 +07:00
rinpatch
035368d363 Rich Media: Skip Microformats hashtags
When fixing this problem I incorrectly assumed a.hashtag is
the proper way for detecting hashtags, but it is just something Pleroma and
Mastodon add. Per microformats it should be detected by the presence of rel=tag.

This MR adds a check for rel=tag, but I still left a.hashtag just in case
2019-06-19 00:46:30 +03:00
rinpatch
d0ebc0edf3 Fix hashtags being picked up by rich media parser
Closes #989
2019-06-14 14:34:42 +03:00
Egor Kislitsyn
99f70c7e20 Use Pleroma.Config everywhere 2019-05-30 15:33:58 +07:00
Haelwenn (lanodan) Monnier
85b5c60694
Pleroma.Formatter: width/height to class=emoji 2019-05-03 16:25:58 +02:00
rinpatch
51e26f14f7 Remove redundant ensure_scrubbed_html
It is never used as handling for fake and non-fake activities was merged
into one function above it
2019-05-01 13:52:44 +03:00
Sachin Joshi
85fa2fbce4 add scrubber for html special char 2019-05-01 01:37:17 +05:45
kaniini
030a7876b4 Merge branch 'security/fix-html-class-scrubbing' into 'develop'
html: lock down allowed class attributes to only those related to microformats

See merge request pleroma/pleroma!1090
2019-04-23 23:07:56 +00:00
William Pitcock
f5535e5743 html: lock down allowed class attributes to only those related to microformats 2019-04-23 23:03:45 +00:00
rinpatch
627e5a0a49 Merge branch 'develop' into feature/database-compaction 2019-04-17 12:22:32 +03:00
rinpatch
f0f30019e1 Refactor html caching functions to have a key instead of a module, use more correct terminology and fix summaries in mastoapi 2019-04-05 15:19:44 +03:00
rinpatch
975482f091 insert object defaults for fake activities and make credo happy 2019-04-01 12:16:51 +03:00
rinpatch
45ba10bf47 Fix the issue with HTML scrubber 2019-04-01 11:55:59 +03:00
Fong-Wan Chau
4ed2618f6c Allow 'rel' attribute on <a> link with specific values (for hashtag recognition). 2019-03-17 11:03:19 -04:00
Haelwenn (lanodan) Monnier
fb82f6fc7c
[Credo] Remove parentheses on argument-less functions 2019-03-13 04:26:56 +01:00
Haelwenn (lanodan) Monnier
381fe44172
HTML.Scrubber.Default: Consistency 2019-02-09 14:59:21 +01:00
Haelwenn (lanodan) Monnier
2272934a5e
Stash 2019-02-09 14:59:21 +01:00
Haelwenn (lanodan) Monnier
60ea29dfe6
Credo fixes: alias grouping/ordering 2019-02-09 14:59:20 +01:00
William Pitcock
a2bb5d890d html: don't attempt to parse nil content 2019-02-05 05:06:17 +00:00
William Pitcock
ddb5545202 rich media: kill some testsuite noise 2019-01-28 20:55:33 +00:00
William Pitcock
be9abb2cc5 html: add utility function to extract first URL from an object and cache the result 2019-01-26 14:55:12 +00:00
William Pitcock
1ddab78247 html: allow microformats-related markup through the html filter 2019-01-16 03:54:01 +00:00
Rin Toshaka
1e2d58982e oopsies 2019-01-05 00:25:31 +01:00
Rin Toshaka
846082e54f Different caches based on the module. Remove scrubber version since it is not relevant anymore 2019-01-05 00:19:46 +01:00
William Pitcock
980b5288ed update copyright years to 2019 2018-12-31 15:41:47 +00:00
Rin Toshaka
7e09c2bd7d Move scrubber cache-related functions to Pleroma.HTML 2018-12-31 08:19:48 +01:00
Rin Toshaka
c50353e6ae shame on me for not testing after revert 2018-12-30 20:44:17 +01:00