[feat] more permissive post language selection #462

Open
opened 2023-02-11 16:49:30 +00:00 by yheuhtozr · 6 comments

The idea

Thank you for all your time and hard work to bring about the update! I'm happy to see the nice little post language selector on my instance.

I know that the current state is just the first step of the whole picture, and I don't know about the future roadmap, but anyway here are my couple of suggestions to make it better.

  • Allow free input as the post language
    • The current implementation seems to be restricted to ISO 639 two-letter codes. That is far from sufficient to describe all the languages the world needs. If validation is necessary, it should instead accept any well-formed BCP 47 tag (https://www.w3.org/International/articles/language-tags/index.en), as defined in the ActivityPub recommendation (https://www.w3.org/TR/activitystreams-vocabulary/#dfn-content). Since BCP 47 is partly open-ended, listing every available option is unrealistic, so it would be better to let users type language tags by hand, with the preset language list serving only as suggestions.
    • Why do we need BCP 47 at all? It comes down to how technology deals with languages. In particular, some "languages" can only be represented correctly by hyphenated codes: en-GB (British English) and en-US (American English) may not be terribly different, but es-ES (European Spanish), es-MX (Mexican Spanish), and es-AR (Argentine Spanish) have diverged so much that a very common word in one of them can be obscene in the others.
    • Accurate language tagging improves accessibility: for example, the same character sometimes needs to be displayed with a different shape depending on language and script, and screen readers can pronounce words correctly in the intended language.
  • Allow option for empty value or non-linguistic content
    • After all, what language does a kitty photo belong to? It is beyond words.
    • Alternatively, there are some ISO 639 codes explicitly for no language content (zxx) or undetermined (und).
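
To make "well-formed BCP 47" more concrete, here is a minimal Python sketch of a validator. It is an illustration only, not Akkoma's or Mastodon's code: the name `is_well_formed` is hypothetical, and the pattern is a simplified reading of the BCP 47 grammar that assumes extended language subtags, private-use-only tags, and grandfathered tags can be left out. It checks the shape of a tag, not whether its subtags are registered.

```python
import re

# Simplified BCP 47 "langtag" shape (assumption: extlang, private-use-only
# and grandfathered tags are deliberately not handled in this sketch).
_LANGTAG = re.compile(
    r"""
    (?:[a-z]{2,3}|[a-z]{5,8})                 # primary language subtag
    (?:-[a-z]{4})?                            # optional script, e.g. Hant
    (?:-(?:[a-z]{2}|[0-9]{3}))?               # optional region, e.g. MX or 419
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*  # zero or more variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*      # zero or more extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                # optional private-use suffix
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_well_formed(tag: str) -> bool:
    """Return True if `tag` has the overall shape of a BCP 47 language tag."""
    return _LANGTAG.fullmatch(tag) is not None


if __name__ == "__main__":
    for tag in ("en", "es-MX", "zh-Hant-TW", "zxx", "und", "en_US", "not a tag"):
        print(f"{tag!r}: {is_well_formed(tag)}")
```

Under this sketch, plain ISO 639 codes such as `en`, `zxx`, and `und` remain valid input, so a preset list of suggestions and free-form entry could share the same check.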

The reasoning

No response

Have you searched for this feature request?

  • I have double-checked and have not found this feature request mentioned anywhere.
  • This feature is related to the Akkoma backend specifically, and not pleroma-fe.
yheuhtozr added the feature request label 2023-02-11 16:49:30 +00:00

whilst an empty value might be desired, valid input will not be expanded beyond ISO639

this is due to the API being compatible with the mastodon equivalent

see https://docs.joinmastodon.org/methods/statuses/#create for more information

as such, the activitypub spec does not factor into this

Author

So, am I correct that ActivityPub does not cover how the server communicates with clients, and that Akkoma follows the Mastodon API for that part? If so, I'd like to look into why they chose ISO 639 as the value range (I don't think it is a very mainstream choice).

By the way, since ISO 639 itself already contains thousands of codes and is updated irregularly, I still think free input is the better solution, so that you don't need to add options every time someone asks for one, but I'm not sure.


yes, activitypub only dictates server-to-server communication in practice
client-to-server de-facto uses mastodon API


given that we want to maintain compatibility with the masto API, posting languages will remain ISO639

Author

So, I checked Mastodon's code base (https://github.com/mastodon/mastodon/blob/main/app/helpers/languages_helper.rb), and it is actually written to handle BCP 47 strings. It just discards the non-language parts for internal use, so accepting such tags is unlikely to break Mastodon or software derived from it.
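
As a rough illustration of what "discarding the non-language parts" means, here is a minimal Python sketch. It is not Mastodon's actual code (which is Ruby); the function name `primary_language` is hypothetical, and the sketch assumes the input is a hyphen-separated tag whose first subtag is the language.

```python
def primary_language(tag: str) -> str:
    """Return the lowercased first subtag of a hyphen-separated language tag."""
    # Everything after the first hyphen (script, region, variants, ...) is dropped.
    return tag.split("-", 1)[0].lower()


if __name__ == "__main__":
    for tag in ("en-us", "es-MX", "zh-Hant-TW", "zxx"):
        print(tag, "->", primary_language(tag))
```

A server that normalizes tags this way for its own filtering can still store and return the full tag, so richer tags need not break clients that only look at the language part.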

Also, the Mastodon API documentation contradicts its implementation. The Status object (https://docs.joinmastodon.org/entities/Status/) returned by the POST /api/v1/statuses API (https://docs.joinmastodon.org/methods/statuses/#create) that you mentioned earlier is supposed to have a language attribute that is a "String (ISO 639 Part 1 two-letter language code)", but the API does return some ISO 639-3 codes, which you can see in the source file mentioned above.

Moreover, this issue (https://github.com/mastodon/mastodon/issues/23990) suggests that Friendica sends the language in BCP 47 format and that the Mastodon API simply passes it through. You can try:

$ wget -q -O - https://mastodon.social/api/v1/statuses/109976145338061924
{"id":"109976145338061924","created_at":"2023-03-06T11:39:23.000Z","in_reply_to_id":"109974324301757156","in_reply_to_account_id":"11307","sensitive":false,"spoiler_text":"","visibility":"public","language":"en-us","uri":"https://f.lapo.it/objects/6a1cc041-1564-05d0-ebe2-f4e513729754","url":"https://f.lapo.it/display/6a1cc041-1564-05d0-ebe2-f4e513729754","replies_count":1,"reblogs_count":0,"favourites_count":0,"edited_at":null,"content":"\u003cspan class=\"h-card\"\u003e\u003ca href=\"https://mastodon.social/users/mcc\" class=\"u-url mention\" rel=\"nofollow noopener noreferrer\" target=\"_blank\"\u003e@\u003cspan\u003emcc\u003c/span\u003e\u003c/a\u003e\u003c/span\u003e Interesting!\u003cbr\u003eIt seems to be available (only?) on iTunes US and on BluRay in some countries, but not anywhere here in Italy.","reblog":null,"account":{"id":"763792","username":"lapo","acct":"lapo@f.lapo.it","display_name":"Lapo Luchini","locked":true,"bot":false,"discoverable":true,"group":false,"created_at":"2019-03-15T00:00:00.000Z","note":"Lapo Luchini is a carbon-based life form.","url":"https://f.lapo.it/profile/lapo","avatar":"https://files.mastodon.social/cache/accounts/avatars/000/763/792/original/573fbbc923a1617d.png","avatar_static":"https://files.mastodon.social/cache/accounts/avatars/000/763/792/original/573fbbc923a1617d.png","header":"https://mastodon.social/headers/original/missing.png","header_static":"https://mastodon.social/headers/original/missing.png","followers_count":61,"following_count":48,"statuses_count":118,"last_status_at":"2023-03-06","emojis":[],"fields":[]},"media_attachments":[],"mentions":[{"id":"11307","username":"mcc","url":"https://mastodon.social/@mcc","acct":"mcc"}],"tags":[],"emojis":[],"card":null,"poll":null}
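
The mismatch with the documented format can also be checked mechanically. The following is a minimal Python sketch: `SAMPLE_BODY` is a trimmed copy of the response above that keeps only the `id` and `language` fields, the function name is hypothetical, and treating a null `language` as acceptable is an assumption rather than something stated in this thread.

```python
import json
import re

# Trimmed copy of the response shown above; only two fields are kept.
SAMPLE_BODY = '{"id":"109976145338061924","language":"en-us"}'


def language_matches_documented_format(body: str) -> bool:
    """True if `language` is null or a two-letter code, as the docs describe."""
    language = json.loads(body).get("language")
    # Assumption: a missing or null language is treated as acceptable.
    return language is None or re.fullmatch(r"[a-z]{2}", language) is not None


if __name__ == "__main__":
    # "en-us" has a region subtag, so it does not fit the documented format.
    print(language_matches_documented_format(SAMPLE_BODY))
```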

So the restriction to ISO 639 in their API has no technical basis. Instead, it seems to stem from Eugen's pretty opinionated view of how language codes work (https://github.com/mastodon/mastodon/issues/18538#issuecomment-1139639976; "opinionated" is a euphemism). In the end, it looks like a Mastodon-specific design decision.

In that case, would it be possible for us to simply extend the spec on our side? If Akkoma has to stick strictly to the Mastodon API, I am thinking of opening a PR to amend their documentation.

Author
ping @floatingghost
Reference: AkkomaGang/akkoma#462