Post a Comment On: /dev/dump

"Your language sucks..."

Blogger András said...

You think collation in French is hard? Try Hungarian! :)

The Hungarian alphabet contains several "double letters" (cs, dz, gy, ly, ny, sz, ty, zs) and a "triple letter" (dzs), but these don't have their own code points -- they're just represented by their constituent letters in writing. However, they affect collation and, by extension, regular expression matching (think [a-z], which doesn't include "zs", because "zs" sorts after z!).

The official collation order is:

A, Á, B, C, Cs, D, Dz, Dzs, E, É, F, G, Gy, H, I, Í, J, K, L, Ly, M, N, Ny, O, Ó, Ö, Ő, P, Q, R, S, Sz, T, Ty, U, Ú, Ü, Ű, V, W, X, Y, Z, Zs

(Luckily, the language doesn't insist on title case -- normally all constituent characters of all collation symbols are capitalized.)

This means that words starting with "cs" are sorted after words starting with e.g. "cu", and that /^[a-c]/ shouldn't match e.g. "csók" (kiss) because it starts with a cs, not a c.

But this is where it gets _really_ interesting, because these special letter combinations can also occur (especially in compound words) without being a "single letter (collation symbol) represented by two characters". For example, "mézsör" contains a z, followed by an s, as does "őzsuta"; "pácsó" contains a c followed by an s (so that "pácsó" would be sorted between "pácra" and "páctól", not between pácsr* and pácst*, where c and s represent the collating symbol "cs").

Another pathological example is "egészség" (health -- literally "wholeness"), which contains an "sz" followed by an "s", not an "s" followed by a "zs" -- and the only way to know that is from a dictionary. "Rézszínű" ("copper coloured") is a compound word that may not even appear in dictionaries, so how do you guess whether it has zs-s or z-sz? (In case you're wondering, it's the latter.)

I'm fairly certain there are even words that can be read two ways and you have to infer which one it is based on context, which makes regex matching fun, to say the least. And if someone makes a sick pun that depends on _both_ readings...
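The digraph behaviour described above can be sketched with a naive longest-match tokenizer over the official collation order. This is a hypothetical Python sketch, not real locale machinery -- and the greedy matching is exactly the heuristic that fails on compounds like "mézsör":

```python
# Sketch of Hungarian digraph-aware sorting, using the official collation
# order quoted above. Greedy longest-match tokenization is only a heuristic:
# it misreads compounds like "mézsör" (really z + s, not the letter "zs").
HU_LETTERS = [
    "a", "á", "b", "c", "cs", "d", "dz", "dzs", "e", "é", "f", "g", "gy",
    "h", "i", "í", "j", "k", "l", "ly", "m", "n", "ny", "o", "ó", "ö", "ő",
    "p", "q", "r", "s", "sz", "t", "ty", "u", "ú", "ü", "ű", "v", "w", "x",
    "y", "z", "zs",
]
RANK = {letter: i for i, letter in enumerate(HU_LETTERS)}
# Try longer collation symbols first ("dzs" before "dz" before "d").
MULTI = sorted((l for l in HU_LETTERS if len(l) > 1), key=len, reverse=True)

def tokenize(word):
    """Split a lower-case word into collation symbols, longest match first."""
    out, i = [], 0
    while i < len(word):
        for m in MULTI:
            if word.startswith(m, i):
                out.append(m)
                i += len(m)
                break
        else:
            out.append(word[i])
            i += 1
    return out

def hu_key(word):
    return [RANK[t] for t in tokenize(word)]

# "csók" starts with the single letter "cs", which sorts after "cu":
print(tokenize("csók"))                       # ['cs', 'ó', 'k']
print(sorted(["csók", "cukor"], key=hu_key))  # ['cukor', 'csók']
```

A real implementation would of course pull the rules (including the dictionary-dependent exceptions) from locale data rather than hard-code them.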

October 18, 2014 at 1:44 AM

Blogger Mithaldu said...

Your point about German is incorrect. It is acceptable to replace ß with ss in computing environments. (Because we Germans know the English computocracy made too many things that can't handle it.)

They are however not the same letter or sound. Example:

Fass (barrel) has a short a

Fuß (foot) has a long u

That is because double consonants in German shorten the preceding vowel, and while ß is in pronunciation a combination of s and z (not s and s), it is a single consonant.

October 18, 2014 at 7:42 AM

Blogger 题叶 said...

No, the set of Chinese characters is limited, though there are a great many of them. But we never create new characters now, and our IME solutions don't allow us to do that, while English speakers just combine words to create new ones.

October 18, 2014 at 7:45 AM

Blogger Michael Richter said...

"ASCII is universal. Why did you fight it? But that's probably just me being mostly Anglo-centric / bigoted."

The AMERICAN Standard Code for Information Interchange is "universal". Huh. I see.

So, I'm left with a few options here:
1. You are incredibly stupid.
2. You are incredibly ignorant.
3. You are incredibly delusional.

Which one do you think is the kindest interpretation?

October 18, 2014 at 7:52 AM

Blogger Peter Jeschke said...

Point 8) No, ß and ss are not equivalent. They are used in different cases and influence the pronunciation. (An example -- correct: Spaß (fun); incorrect: Spass. There's no such word.)

Many people write ss if they can't write ß for some reason (eg they don't have the key on their keyboard). But that's still wrong, just accepted.
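That "accepted but not identical" relationship is baked into Unicode's standard case mappings, which is probably why the two look interchangeable to software. For example, in Python:

```python
# Unicode case mapping folds ß to "ss" -- so case-insensitive software
# conflates them even though they are distinct letters in German.
print("Spaß".upper())      # SPASS -- uppercasing loses the ß
print("ß".casefold())      # ss    -- casefolding conflates them too
print("Spaß" == "Spass")   # False -- as raw code points they differ
```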

October 18, 2014 at 9:42 AM

Blogger ivanhoe said...

Actually Russian is not strictly phonetic: it has two letters, called the soft and hard signs, which are used only to affect how the letter before them is pronounced. On the other hand, the Serbian Cyrillic alphabet is indeed 100% phonetic, and contrary to what you claim, it's still pretty easy for automatic processing, even though many Serbs use the Latin alphabet to write, because detection and transliteration between the two alphabets is quite easy -- especially from Cyrillic to Latin (you can directly replace letters one by one). Also, there's nothing schizophrenic about the two alphabets, honestly; it's simply a holdover from the days of the Yugoslav Federation, where Serbs had to communicate on a daily basis with people from other parts of Yugoslavia who used only the Latin alphabet... being on the same (code) page simplifies things :)
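The one-by-one replacement really is that simple in the Cyrillic-to-Latin direction. A sketch (lower case only, for brevity); note that three Cyrillic letters map to Latin *digraphs*, which is why the reverse direction needs a longest-match step:

```python
# Serbian Cyrillic -> Latin transliteration: a direct per-character map.
# љ, њ, and џ expand to Latin digraphs (lj, nj, dž).
SR_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

def to_latin(text):
    # Characters outside the map (spaces, punctuation) pass through.
    return "".join(SR_CYR_TO_LAT.get(ch, ch) for ch in text)

print(to_latin("љубав"))   # ljubav ("love")
```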

October 18, 2014 at 10:42 AM

Blogger Garrett D'Amore said...

Thanks for pointing out the difference between ß and "ss". I seem to recall they were equivalent in the collation order, but now I'm having trouble finding it. (The new collation input files are huge, and painful to work with.)

October 18, 2014 at 10:52 AM

Blogger Garrett D'Amore said...

With respect to ASCII: ASCII is universal now. It is also known as ISO 646, which is an ISO (international) standard.

ASCII sucks for people who need letters that fall outside of it. But having colliding code points makes things hard, given that these days practically the whole world relies on ASCII.

(And yes -- pretty much all the Internet standards started with ASCII, though some now accept UTF-8. Originally HTML, domain names, RFC 822 mail, etc. could only be encoded in ASCII.) For many countries, the ISO 8859 standards added extensions to the character set for different languages; even Turkish has an ISO 8859 variant that doesn't collide with ASCII (8859-9).

In fact, even Russian, using KOI8-R, offers an 8-bit character set that leaves ASCII in the lower order bits.

Unicode does this too: UTF-8 is a strict superset of ASCII. So do the EUC encodings (the extended Asian encodings).
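That "leaves ASCII in the lower half" property is easy to check: encoding ASCII text in any of these encodings yields the identical bytes. In Python, for instance:

```python
# ASCII text comes out byte-for-byte identical in every encoding that
# extends ASCII: UTF-8, KOI8-R, EUC-JP, ISO 8859-9, ...
s = "plain ASCII text"
ascii_bytes = s.encode("ascii")
for enc in ("utf-8", "koi8_r", "euc_jp", "iso8859_9"):
    assert s.encode(enc) == ascii_bytes
print("all encodings agree on ASCII input")
```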

So yes, ASCII *is* universal.

We can argue whether this is a result of bigotry, accident, or other causes. As I said earlier, if your encoding system conflicts with ASCII, it's broken. And the reason is all those other universal things -- like, oh, e-mail.

So, maybe you hate America, but hating on ASCII is just plain stupid.

October 18, 2014 at 10:59 AM

Blogger Garrett D'Amore said...

Oh, by the way, while my use of ß was unfortunate as an example, there are definitely others. Most of them are characters composed with accents, or ligatures like ṻ or Æ. (And not just in Latin scripts, either: Greek has them too; an example: ώ.) In some languages these characters are unique and distinct from their components, but often they aren't, and the single code point can be decomposed into multiple components. (This actually leads to a whole standard around "normalization forms"...)
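Those normalization forms are exactly the "one code point vs. decomposed components" issue. A quick Python illustration (NFC composes, NFD decomposes):

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e + combining acute accent

# They render identically but compare unequal as raw strings...
assert composed != decomposed
# ...until both are normalized to the same form.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
print(len(composed), len(decomposed))   # 1 2
```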

October 18, 2014 at 11:09 AM

Blogger Peter said...

Canadian English has special rules for collation -- the phone books (at least in the days when they were printed on dead trees) have a separate section for names starting with "Mac" and "Mc", and these both sort together (e.g., "Macdonald" and "McDonald" are considered the same).
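That phone-book rule can be sketched as a sort key (a hypothetical helper; real directories also fold case, spaces, and apostrophes):

```python
# Phone-book style key: treat a leading "Mc" as "Mac" so that
# McDonald and Macdonald interleave in a single section.
def phonebook_key(name):
    n = name.lower()
    if n.startswith("mc"):
        n = "mac" + n[2:]
    return n

names = ["McDonald", "Mabbott", "Macdonald", "MacTavish"]
print(sorted(names, key=phonebook_key))
# ['Mabbott', 'McDonald', 'Macdonald', 'MacTavish']
```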

October 18, 2014 at 11:10 AM

Blogger Garrett D'Amore said...

Russian hard and soft signs (ъ and ь respectively) probably do violate strictly phonetic rules -- since they affect the pronunciation of an earlier letter. But they generally have no impact on text processing (sorting).

If a given language uses a script consistently, it's no problem. But when users mix and match two scripts interchangeably (so that Г and G can be used interchangeably, for example), it does horrible things to sorting, etc. You have two equivalent forms (for collation) with different code points. Presumably when you convert upper to lower case it just works, although there can be confusion: is "m" a lower-case Latin "M", or is it a lower-case Cyrillic "Т" (which in italic type can look just like "m")? Fortunately the two forms have separate identities in their code points, although visually they can be impossible to distinguish from one another.
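The look-alike problem is easy to demonstrate: Latin and Cyrillic capitals that render (near-)identically are entirely different code points, and each lowercases within its own script. In Python:

```python
# Latin "M" vs Cyrillic "М": visually alike, distinct code points.
latin, cyrillic = "M", "\u041c"
print(hex(ord(latin)), hex(ord(cyrillic)))   # 0x4d 0x41c
assert latin != cyrillic
assert latin.lower() == "m"           # Latin small m
assert cyrillic.lower() == "\u043c"   # Cyrillic small em, not Latin "m"
```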

For POSIX, we don't support mixing Cyrillic and Latin -- a locale must choose one or the other as primary. (You can use both, but they don't sort identically, and message catalogs will exclusively use one script or the other, depending on which is chosen.)

October 18, 2014 at 12:45 PM

Blogger Garrett D'Amore said...

Btw, in case you're wondering, if I were to design a language for text processing, it would probably look a lot like English. (Actually, in this era, with ASCII *everywhere*, it would *really* look a lot like English.)

What I'd change in English is to get rid of case. Case sucks. Many languages dispense with case entirely and are better off for it.

Probably I'd also nuke articles. While I have no problem with them, Russian does fine without them. And they create problems because we insist on sorting certain things with special handling for articles -- many algorithms ignore articles for sorting purposes.

Otherwise English works really, really well -- and Russian is slightly better off without those articles, too!

One thing that I would definitely make certain of is to ensure that accent characters were not used at all. If you want to use a different letter -- just use a different letter, don't use some accent character to modify it.

(Hmm... I wonder, does the International Phonetic Alphabet -- IPA -- meet these criteria already? I should check it out.)

October 18, 2014 at 1:13 PM

Blogger Russ Williams said...

This rant is very tautological. ASCII is dominant because English speakers were dominant when computer character sets were first codified. If e.g. Polish speakers had been dominant when computer character sets were first codified, then by the same argument you'd need to say that English sucks since a hypothetical Polish-created "ASCII" might not contain Q and V and X, for example, which English needs.

Considering diacritical marks to be somehow inherently problematic or harder is completely English-centric; if ASCII had been written by Poles, it would have ą ę ł ć ś ź ż ó as fundamental distinct characters along with a e l c s z o. That ć and c look similar (one visually contains the other) has as much significance as the fact that b and l or o look similar in English (b visually contains l and o). Whatever language was used to define the core (single-byte) character set would be considered normal & convenient. There's nothing inherently more difficult about ą as a letter than a as a letter. They are simply 2 distinct letters.

You also seem a bit blind to quirks of English (or biased and more willing to forgive them) if you regard English as "pretty awesome for data-processing". E.g. special case coding needed for various irregular nouns and verbs in text output. (Or even needing to use a different form for plurals: "1 result" vs "2 results" requires more coding.) E.g. various spelling differences between US & British English.

Which is not to say that there aren't plenty of worse annoyances in some other languages. But the idea that English is "awesome" for data processing seems dubious, and very dependent on the historical accident that ASCII was created by English speakers, so by no inherent virtue of English itself it of course is most conveniently represented in ASCII.

October 20, 2014 at 4:11 AM

Blogger Richard Cobbe said...

I sympathize with your difficulties in processing non-ASCII text, although you've got a much harder problem than I ever had -- I was just trying to deal with classical Greek.

There are a couple of factual points that you might find interesting, although they don't help with the underlying problem very much.

First, I'm not sure it's really straightforward to say that languages shouldn't use diacriticals but should instead use different letters. There are some cases, as in French, where e and é are pronounced differently, so different letters (or digraphs) might make sense -- though there are many cases in French where the accent doesn't indicate different pronunciation: "ou" and "où" mean "or" and "where," respectively, but they're pronounced identically.

In (modern) Greek, however, the accent on ώ indicates that this is the syllable that gets the stress, rather than a modification of the vowel sound. Maybe I'm just used to the traditional orthographies for these things, but it feels really strange to me to use different letters to indicate a supra-segmental property like stress.

Second, while IPA may look promising, it introduces lots of complications of its own. Do you go for a phonemic spelling, or a phonetic spelling? Or, to take an example from English, how do you spell "nuclear"? [nuklir] or [nukjəlɚ]? (Or is that a vocalic [l] in the second example?) I use the first pronunciation exclusively, but other people use the second -- and that's entirely within the constraints of American English. Supporting multiple dialects and accents only adds to the complexity.

October 26, 2014 at 7:51 PM

Blogger kaishakunin said...

You are wrong about Russian; we do have special context-dependent collation rules. Namely, the Russian letter "ё" is sorted after "е" if it is the last letter or if all of the following letters are the same; otherwise, it is treated as equivalent to "е".

Here's an extract from my test cases for Unicode collation:

1) "ё", "е" -> sorting -> "е", "ё"
Here "ё" is sorted after "е" because it is the last letter.

2) "ёлка", "еда", "ель" -> sorting -> "еда", "ёлка", "ель"
Here "ё" is sorted equivalent to "е" because it is not the last letter and the subsequent letters are different.
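That rule amounts to a two-level comparison: ё equals е at the primary level, with ё after е as a tiebreaker. A minimal sort-key sketch (lower case only) that reproduces both test cases:

```python
# Two-level key: primary = word with ё folded to е; the original word
# breaks ties (the code point for ё sorts after е), matching the rule
# that ё only matters when everything else is equal.
def ru_key(word):
    return (word.replace("ё", "е"), word)

print(sorted(["ё", "е"], key=ru_key))              # ['е', 'ё']
print(sorted(["ёлка", "еда", "ель"], key=ru_key))  # ['еда', 'ёлка', 'ель']
```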

October 29, 2014 at 6:21 AM

Blogger Garrett D'Amore said...

Huh. I didn't know that about Russian collation. So scratch it from the list of non-broken languages.

I don't think we collate English in the US the same way, but I could be mistaken. (Reading the rules for collation from the CLDR is non-trivial. At least after they have been turned into localedef grammar.)

Basically, it's starting to look like *all* natural languages suck, at least in some form. It's just that some suck worse than others. (Again, this is in the context of text processing.)

I wonder about Esperanto. It was an invented language, but I bet its inventors gave no thought to usefulness or ease for computing applications. (It was driven by political considerations rather than pragmatic ones.)

October 29, 2014 at 6:29 AM

Blogger Garrett D'Amore said...

Also, for the record: yes, I avoided grammar, which includes rules for pluralization, verbs, etc.

I think pretty much *all* languages become terrible if you start to consider grammar; especially since even the most strongly rule-based languages (German?) still have *some* exceptions. (English is better because it has fewer and simpler rules than many European languages, but it's worse because it has far more exceptions than most.)

I have no idea about non-European language grammars; I suspect that I'm just blissfully ignorant. Gosh, given the other challenges some of those languages have (tones, and character sets with thousands -- tens of thousands -- of glyphs), I would hope that they would have much much simpler grammars.

October 29, 2014 at 6:35 AM

Blogger Russ Williams said...

Esperanto of course was indeed created before electronic computers. But it is by design much more regular than English and other ethnic/national languages. It has literally no irregular nouns or verbs, for example. And its orthography is much more rational (especially than that of English), with basically one letter = one phoneme. (Of course you will complain that some of its letters ĉĝĥĵŝŭ are not in ASCII, by accident of history.)

So it is arguably much better suited for computer processing (ignoring the ASCII issue, which by default makes English about the only suitable language if ASCII is your top priority.)

Indeed it has been used as a bridge/hub language in some translation projects, perhaps the most successful having been DLT by Toon Witkam.

I have some direct personal experience with minor text processing in Esperanto, as I wrote a very simple program to verify the syllables and accents in a very long epic poem to help out someone translating it from Vietnamese. Such a program would have been far more complex for English, requiring a dictionary database of the pronunciation of words, whereas in Esperanto the pronunciation is completely reliably deducible from the spelling.
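A toy version of that kind of check (hypothetical helper names) relies on two Esperanto rules: every vowel (a, e, i, o, u; ŭ is a semivowel and doesn't count) forms one syllable, and stress always falls on the penultimate syllable:

```python
# Esperanto scansion sketch: syllable count = vowel count, and stress is
# always on the next-to-last syllable. No pronunciation dictionary needed.
VOWELS = set("aeiou")

def vowel_positions(word):
    return [i for i, ch in enumerate(word.lower()) if ch in VOWELS]

def syllable_count(word):
    return len(vowel_positions(word))

def stressed_vowel_index(word):
    pos = vowel_positions(word)
    return pos[-2] if len(pos) > 1 else pos[0]

print(syllable_count("Esperanto"))                     # 4
print("Esperanto"[stressed_vowel_index("Esperanto")])  # a
```

The equivalent check for English really would need a pronunciation database, exactly as Russ says.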

October 29, 2014 at 6:37 AM

Blogger Garrett D'Amore said...

So ASCII is a historical accident. I have no qualm with languages using different glyphs, although I object when there are thousands of them (non-phonetic writing systems).

In today's world, most languages will fit inside the BMP of Unicode, and that's good enough for me. :-)

I do object when people use accents to modify characters but don't use a full separate code point for the new character. If we treat these as unique characters rather than composed forms, then all is well. :-)

It sounds like Esperanto is far better than many others, probably because no thinking human would intentionally create irregular forms.

October 29, 2014 at 7:57 AM