Language Processing

Introduction

ASCII

Single byte encoding only using the bottom 7 bits(0-127). The top bit is always 0.

In English, 128 symbols are enough to represent all character. But in other situations, French for example, they are insufficient. So we use the top bit to represent accents so that there are up to 256 characters. ‘é’ encodes as 1000 0010(130).
But here comes another problem. In different languages, the same binary encoding represents different characters, such as 130 in French is é, but in Hebrew, it is Gimel(ג). Not to mention Chinese characters(more than 100 thousand). So we introduce another encoding system, unicode.

Unicode

“Unicode encoding” is more properly known as UTF-16: 2 bytes per “code point”. This is the native format of strings in .NET. Values outside the Basic Multilingual Plane(BMP) are encoded as surrogate pairs. These are relatively rarely used - which is a good job, as very few developers get them right, I suspect. “Unicode” is really the character set - it is unfortunate that the term is also used as a synonym for UTF-16 in .NET and various Windows applications.
Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8, UTF-7 and UTF-32:
UTF-8: Variable length encoding, 1-4 bytes covers every current character. ASCII values are encoded as ASCII.
UTF-7: Usually used for mail encoding. Chances are if you think you need it and you’re not doing mail, you’re wrong. (not widely used at all.)
UTF-32: Fixed width encoding using 4 bytes per code point. This isn’t very efficient, but makes life easier outside the BMP.

UTF-8

UTF-8 has become the dominant character encoding for the World Wide Web. The rule of UTF-8 is:
(1) If this is in ASCII, UTF-8 is the same with ASCII
(2) For n UTF bytes(n > 1), the first n bits in the first byte set as 1, the n + 1 bit sets as 0, all the first two bits in the following bytes are all 10. The rest of the bits are represented as the unicode of the character.

UTF Bytes Hexadecimal Binary
1 0000 0000 to 0000 007F 0xxxxxxx
2 0000 0080 to 0000 07FF 110xxxxx 10xxxxxx
3 0000 0800 to 0000 FFFF 1110xxxx 10xxxxxx 10xxxxxx
4 0001 0000 to 0010 FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

For example, the unicode of “严” is 4E25 (100111000100101). According to the table above, this character belongs to the third row. So its UTF-8 is E4B8A5 (11100100 10111000 10100101).
We can see that actually unicode is different with UTF-8. But there are some libraries that can do the convert.

Read More

Your browser is out-of-date!

Update your browser to view this website correctly. Update my browser now

×