As of 2016-02-26, there will be no more posts for this blog. s/blog/pba/

I thought 'You've got to be kidding me!' when I read the following sentence from "Breaking down the language barrier - six years in":

In early 2006, we rolled out our first languages: Chinese, then Arabic.

In case you don't know, I am a native Chinese speaker, and I can tell you one thing: Chinese-English translation by Google Translate is just terrible. By my standard, a translation has only two possible results: acceptable or unacceptable. There is no such thing as a value between 0.0 and 1.0 when evaluating the quality of a translation in real life. Either you understand it or you don't. 56.17% accuracy? What the heck does that actually mean?

After I read that sentence, I copied the rest of the post and got a Chinese translation from Google Translate. I hadn't tried Google Translate for a while, and the translation is still bad. Each sentence reads like a collection of words you are familiar with, but put together in a sentence they don't make much sense at all. You have to guess the overall meaning by reading a word or a phrase at a time, section by section, and you can still get lost in translation, literally, and not in a good way.

Eventually, I went back to read the original English version.

I can't speak for translation of other languages, since English and Chinese are the only two I know. But I have seen that Korean translation wasn't considered good, either: I was watching a live stream when an American asked some Koreans for a translation and then asked them about the one from Google Translate.

It seems Google Translate can get good results with Indo-European languages (or maybe just between Germanic languages, or languages within the same branch). I once asked a German about Google Translate's English translation of a German song, and he said it was okay.

If you need a translation of a single word, Google Translate might be okay, but still not for Chinese. It's like a land mine, and you don't even know you are dead already. The difference between any two languages is huge, or where do you think those funny Engrish signs come from?

Even between English dialects, spellings, phrases, and so on already differ significantly. Well, Google Translate doesn't offer a choice of dialect, but it seems to understand them.

In my humble opinion, machine translation hasn't reached a practical level. As for "breaking down the language barrier," to be perfectly honest, it's still a big chunk of rock. I don't want to diminish Google Translate, but it's simply a fact. You don't need to read about how much effort they have put in to understand that the computational complexity is high. Just think about how much time a human has to put in to become a good translator.

To be honest, I don't expect to see much improvement in the next decade. Machine translation has been around since the 90s. That's when I first used a portable device which could translate a full sentence, not just a single word or phrase, which was a great selling point at the time even if it only did well with very short sentences. It might have been around since computers were first able to display multiple languages.

If you are using Google Translate for serious business (not literally), use it at your own risk. But I would advise hiring a real human translator for the time being.

Obviously, I am not the only one who has tried to look up "instagram" on Google Dictionary:


I had read the sentence "I instagrammed a photo of the ...", which was why I thought "instagram" could be a word. Well, it is not. By the way, Google Dictionary should not make suggestions the same way normal Google Search does. It should only give you suggestions for which it has definitions.

I didn't find it on instagr.am either, but its About Us page hints at it a bit [emphasis added]:

Instagram came from that inspiration - could we make sharing your life as instant and magic as those first Polaroid pictures must have felt?

Not sure where the "gram" came from. "Photogram"? It doesn't look like it.

You have probably heard that Facebook bought Instagram for one billion dollars. It's been all over the Internet, with people discussing whether it was worth that price. To be honest, I don't really care about the deal: I don't own an iPhone, nor do I use Facebook. I am just someone who wants to learn a new word whenever he has a chance and is up to it.

So, what's the definition of it? It doesn't matter, since it is not a real word. The only thing I need to know is what Instagram is for, and I know that already.

Whenever I need to turn a non-word into a not-really-a-word verb, I suffix it with "'d," for example "instagram'd." I think it is a better way to write when you are trying to give readers a feeling of something.

Do you know what the output of the following code is?

#!/usr/bin/env python2
import locale
locale.setlocale(locale.LC_CTYPE, 'tr_TR.utf8')
print 'i'.upper()

The answer is i, not I.

It began with this bug report, where I commented:

This is so bizarre: scancode: 42 name:KEY_SHiFT_L

Why is it small i?

I didn't understand why that key symbol had a small i. Later I learned that i in Turkish is a totally different story from the i we know in English: there are actually four different letters of i.

1   i, İ, ı, and I

(The header title is actually: i, İ, ı, and I.)

So, a quick lesson on switching case for i in Turkish (hopefully I am not making mistakes ;):

  • dotted i and İ
    • [upper] i (ASCII i) to İ (U+0130)
    • [lower] İ (U+0130) to i (ASCII i)
  • dotless ı and I
    • [upper] ı (U+0131) to I (ASCII I)
    • [lower] I (ASCII I) to ı (U+0131)

Two of these four are the same characters we have in ASCII.

Did you notice something wasn't right? As I said at the beginning, the upper case of i is still i, but the wiki says the upper case of i is İ (U+0130), the dotted capital I. I believe Python couldn't get it right even with the locale set, but Python is not the only one. From the wiki:

Dotless i (and dotted capital I) is handled problematically in the Turkish locales of several software packages, including Oracle DBMS, Java,[1] and Unixware 7, where implicit capitalization of names of keywords, variables, and tables has effects not foreseen by the application developers. The C or US English locales do not have these problems.

However, if you set the locale correctly (with the right charset), it has no problem:

import locale
locale.setlocale(locale.LC_CTYPE, 'tr_TR.iso88599')

lower_i = '\xfd i'
upper_I = 'I \xdd'
print 'lower_i', lower_i.decode('iso8859-9').encode('utf-8')
print '2_upper', lower_i.upper().decode('iso8859-9').encode('utf-8')
print
print 'upper_I', upper_I.decode('iso8859-9').encode('utf-8')
print '2_lower', upper_I.lower().decode('iso8859-9').encode('utf-8')

Output:

lower_i ı i
2_upper I İ

upper_I I İ
2_lower ı i

They are correct.

But! (here comes my favorite word) If you use Unicode strings, you get an unexpected result:

import locale
locale.setlocale(locale.LC_CTYPE, 'tr_TR.utf8')

lower_i = u'\u0131 i'
upper_I = u'I \u0130'
print 'lower_i', lower_i.encode('utf-8')
print '2_upper', lower_i.upper().encode('utf-8')
print
print 'upper_I', upper_I.encode('utf-8')
print '2_lower', upper_I.lower().encode('utf-8')

Output:

lower_i ı i
2_upper I I

upper_I I İ
2_lower i i

For dotless small ı and dotted capital İ, the results are correct. The other two are not. However, if you are not really dealing with locale stuff, i.e. Turkish, this might be what you want; see the next section.
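For comparison, here is a small sketch of how Python 3's str (which is always Unicode) treats the same four letters. It assumes Python 3.3 or later and sets no locale at all; the codepoints helper is a made-up name for illustration. It does not reproduce the Python 2 output above exactly: lowercasing İ there gave a plain i, while Python 3 keeps the dot as a separate combining character.

```python
def codepoints(text):
    """Return the code points of text as a list of 'U+XXXX' labels."""
    return ["U+%04X" % ord(ch) for ch in text]

# The four Turkish i letters: i, dotless ı, I, dotted İ.
for ch in ["i", "\u0131", "I", "\u0130"]:
    print(codepoints(ch),
          "upper:", codepoints(ch.upper()),
          "lower:", codepoints(ch.lower()))
```

The last line is the surprising one: İ (U+0130) lowercases to two code points, U+0069 followed by the combining dot U+0307, so the string gets longer.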

The only way I know to deal with this is to replace manually:

import locale
locale.setlocale(locale.LC_CTYPE, 'tr_TR.utf8')

lower_i = u'\u0131 i'
upper_I = u'I \u0130'
print 'lower_i', lower_i.encode('utf-8')
print '2_upper', lower_i.replace(u'i', u'\u0130').upper().encode('utf-8')

Output:

lower_i ı i
2_upper I İ
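The same manual replacement can be wrapped into a pair of helpers. This is only a sketch in Python 3, not what the project in the bug report does; turkish_upper and turkish_lower are hypothetical names, and only the four i letters get special treatment before the regular case conversion runs.

```python
# Swap the Turkish i letters first, then let str.upper()/str.lower()
# handle every other character.
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})   # i -> İ, ı -> I
_TR_LOWER = str.maketrans({"\u0130": "i", "I": "\u0131"})   # İ -> i, I -> ı

def turkish_upper(text):
    """Upper-case text using the Turkish rules for i and ı."""
    return text.translate(_TR_UPPER).upper()

def turkish_lower(text):
    """Lower-case text using the Turkish rules for İ and I."""
    return text.translate(_TR_LOWER).lower()

print(turkish_upper("\u0131 i"))   # I İ
print(turkish_lower("I \u0130"))   # ı i
```

Doing the replacement before calling upper() or lower() matters: once the default conversion has run, i and ı have both become I and can no longer be told apart.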

2   String Normalization with Turkish locale

In the bug of that project, all key symbols are switched to upper case and then compared to values from a predefined table. The data in the table is all CAPS, and that is the problem: we can never find a match since i isn't being switched to I.
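One way to make such a lookup independent of the locale is to upper-case only the ASCII letters yourself. The sketch below is in Python 3 and is an assumption about how this could be done, not the project's actual fix; KEY_TABLE, ascii_upper, and lookup_scancode are hypothetical names, and the single table entry is taken from the comment quoted earlier.

```python
import string

# Hypothetical all-caps table; the real project's table is not shown here.
KEY_TABLE = {"KEY_SHIFT_L": 42}

# Maps only ASCII a-z to A-Z and leaves every other character untouched.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)

def ascii_upper(name):
    """Upper-case the ASCII letters of name, whatever the locale is."""
    return name.translate(_ASCII_UPPER)

def lookup_scancode(name):
    """Return the scancode for a key name, or None if it is unknown."""
    return KEY_TABLE.get(ascii_upper(name))

print(lookup_scancode("Key_Shift_L"))   # 42
```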

This is just one case. When coding, metadata most likely just matches [a-z0-9-_]+, so it is always ASCII. You might sanitize it to make sure, e.g. a blog post slug. Say a post title is This Is A Post; a typical slug would be this-is-a-post. If you only use str strings, you end up with thIs-Is-a-post.

A quick fix is to convert the string to Unicode, and that would be fine. If you are using Python 3, you won't even be aware of this.

Another way is to set the locale, which I did for that bug at first.
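For the slug example, a minimal Python 3 sketch could look like the following. The slugify name is made up for illustration, and the function is deliberately naive: it keeps only ASCII letters and digits, which is an assumption about what a slug should contain.

```python
import re

def slugify(title):
    """Turn a post title into a lower-case, hyphen-separated ASCII slug."""
    # str.lower() on a Python 3 str does not consult the process locale.
    slug = title.lower()
    # Collapse every run of characters outside a-z and 0-9 into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    # Drop hyphens left over at either end.
    return slug.strip("-")

print(slugify("This Is A Post"))   # this-is-a-post
```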

3   Conclusion

Locale is as painful as key stuff, and no, I can't speak Turkish, and yes, I only read that wiki page. (Okay, okay, half of it.)

While I was reading that wiki page, I was shocked to read that the lack of the dotless ı on a phone system caused deaths.