Character Encoding - ASCII, Unicode & UTF-8

Mar 18 2015

Introduction

ASCII

A single-byte encoding that uses only the low 7 bits (values 0-127). The top bit is always 0.
ASCII Chart
For English, 128 symbols are enough to represent every character. Other languages, French for example, need more. Extended encodings therefore use the top bit as well, which allows up to 256 characters and leaves room for accented letters. In one such encoding, ‘é’ is 1000 0010 (130).
These extensions create another problem: the same byte value stands for different characters in different languages. The value 130 is é in a French encoding but Gimel (ג) in a Hebrew one. Languages such as Chinese, which by some counts has more than 100 thousand characters, cannot fit in a single byte at all. Unicode was introduced to solve this.
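The clash can be reproduced with a short Java sketch. It assumes the encodings the text has in mind are the IBM PC code pages 437 and 862, and that the running JDK ships them under the names `IBM437` and `IBM862`; neither assumption comes from the article, so the lookup is guarded. The class and method names are illustrative.

```java
import java.nio.charset.Charset;

final class CodePageDemo {
    /** Decodes a single byte value with the named charset, or returns null if the JVM lacks it. */
    static String decodeByte(int value, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return null; // extended code pages are optional in a Java runtime
        }
        return new String(new byte[] {(byte) value}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumed code pages: IBM437 (Western European) and IBM862 (Hebrew).
        for (String name : new String[] {"IBM437", "IBM862"}) {
            String decoded = decodeByte(130, name);
            String shown = decoded == null
                    ? "(charset not available)"
                    : String.format("U+%04X", decoded.codePointAt(0));
            System.out.println(name + ": byte 130 -> " + shown);
        }
    }
}
```

On a JDK that provides both charsets, the same byte should come back as U+00E9 (é) under the first and U+05D2 (ג) under the second.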

Unicode

What is often called “Unicode encoding” is more properly known as UTF-16: 2 bytes per code unit. This is the native format of strings in .NET. Code points outside the Basic Multilingual Plane (BMP) take two code units, known as a surrogate pair. These are relatively rarely used, which is a good job, as very few developers get them right, I suspect. “Unicode” is really the character set; it is unfortunate that the term is also used as a synonym for UTF-16 in .NET and various Windows applications.
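Java's `String` is also indexed in UTF-16 code units, so surrogate pairs can be observed there too. The sketch below is only an illustration; the class name and the choice of U+1F600 as a code point outside the BMP are mine, not the article's.

```java
final class SurrogateDemo {
    /** Lists the UTF-16 code units that Java uses for one code point. */
    static String describe(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        StringBuilder units = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                units.append(' ');
            }
            units.append(String.format("%04X", (int) s.charAt(i)));
        }
        return String.format("U+%04X -> %d code unit(s): %s", codePoint, s.length(), units);
    }

    public static void main(String[] args) {
        System.out.println(describe(0x4E25));  // inside the BMP: one code unit
        System.out.println(describe(0x1F600)); // outside the BMP: a surrogate pair
    }
}
```

The usual mistake is to treat `length()` or `charAt()` as counting characters. `codePointCount` and `codePointAt` are the safer choices once text may leave the BMP.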
Unicode can be implemented by different character encodings. Besides UTF-16, the ones you are most likely to meet are UTF-8, UTF-7 and UTF-32:
UTF-8: Variable-length encoding; 1-4 bytes cover every current character. ASCII values are encoded as ASCII.
UTF-7: Usually used for mail encoding, and not widely used at all. Chances are if you think you need it and you’re not doing mail, you’re wrong.
UTF-32: Fixed width encoding using 4 bytes per code point. This isn’t very efficient, but makes life easier outside the BMP.
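To make the size trade-off concrete, here is a small Java sketch that counts the encoded bytes for three sample code points. The samples and names are my own. UTF-7 is left out because the JDK does not ship it, and `UTF-32BE` is looked up defensively since it is not among the charsets a JVM is required to provide.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedSizeDemo {
    /** Number of bytes the text occupies in the given encoding. */
    static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                   // ASCII
            "\u4E25",                              // inside the BMP
            new String(Character.toChars(0x1F600)) // outside the BMP
        };
        Charset utf32 = Charset.isSupported("UTF-32BE") ? Charset.forName("UTF-32BE") : null;
        for (String s : samples) {
            // UTF_16BE is used instead of UTF_16 so that no byte order mark is counted.
            System.out.printf("U+%04X  UTF-8: %d  UTF-16: %d  UTF-32: %s%n",
                    s.codePointAt(0),
                    byteCount(s, StandardCharsets.UTF_8),
                    byteCount(s, StandardCharsets.UTF_16BE),
                    utf32 == null ? "n/a" : String.valueOf(byteCount(s, utf32)));
        }
    }
}
```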

UTF-8

UTF-8 has become the dominant character encoding for the World Wide Web. Its rules are:
(1) For an ASCII character, the UTF-8 encoding is identical to ASCII: a single byte whose top bit is 0 and whose remaining 7 bits hold the code point.
(2) For a character encoded in n bytes (n > 1), the first n bits of the first byte are 1 and bit n + 1 is 0; every following byte begins with 10. The remaining bits are filled with the character's Unicode code point.

Bytes	Code point range (hexadecimal)	Binary layout
1	0000 0000 to 0000 007F	0xxxxxxx
2	0000 0080 to 0000 07FF	110xxxxx 10xxxxxx
3	0000 0800 to 0000 FFFF	1110xxxx 10xxxxxx 10xxxxxx
4	0001 0000 to 0010 FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

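For example, the Unicode code point of “严” is U+4E25 (100111000100101 in binary). It falls in the range 0800 to FFFF, the third row of the table above, so it takes three bytes. Writing its bits, left-padded with a zero to 16 bits, into the x positions of 1110xxxx 10xxxxxx 10xxxxxx gives the UTF-8 encoding E4B8A5 (11100100 10111000 10100101).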
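The table above translates almost directly into bit shifts and masks. The following is a hand-rolled sketch for illustration only (the class and method names are made up); in real code you would let the library do this, and the `main` method checks the sketch against `String.getBytes`.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Sketch {
    /** Encodes one Unicode code point using the byte layouts from the table. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point: " + cp);
        }
        if (cp <= 0x7F) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp <= 0x7FF) {    // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp <= 0xFFFF) {   // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    /** Formats bytes as upper-case hexadecimal. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4E25);
        byte[] library = "\u4E25".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));                    // E4B8A5
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```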
So a character's Unicode code point and its UTF-8 bytes are two different things. Most languages ship library routines that convert between them.

Preprocessing

Read UTF-8 file:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadUTF8 {
    public static void main(String[] args) {
        // The path is relative to the working directory.
        // try-with-resources closes the reader and the underlying stream automatically.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("test.txt"), StandardCharsets.UTF_8))) {
            String str;
            // Decode the file as UTF-8 and print it line by line.
            while ((str = in.readLine()) != null) {
                System.out.println(str);
            }
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}

Write UTF-8 file out:

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteUTF8 {
    public static void main(String[] args) {
        // The path is relative to the working directory.
        // try-with-resources flushes and closes the writer automatically.
        try (Writer out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("test.txt"), StandardCharsets.UTF_8))) {
            out.append("Website UTF-8").append("\r\n");
            // Non-ASCII sample text belongs in the next two lines; the "?" characters are placeholders.
            out.append("?? UTF-8").append("\r\n");
            out.append("??????? UTF-8").append("\r\n");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
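The two programs above use the classic `java.io` streams. As an alternative sketch, `java.nio.file.Files` can do the same write and read in one call each. The class name, the temporary file, and the sample lines are assumptions for illustration; the temporary file keeps the example from touching `test.txt`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class NioUtf8Demo {
    /** Writes the lines to a temporary file as UTF-8, reads them back, and deletes the file. */
    static List<String> roundTrip(List<String> lines) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".txt");
            try {
                Files.write(tmp, lines, StandardCharsets.UTF_8);
                return Files.readAllLines(tmp, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\u4E25" is written as an escape so the source file's own encoding does not matter.
        List<String> lines = Arrays.asList("Website UTF-8", "\u4E25 UTF-8");
        System.out.println(roundTrip(lines).equals(lines)); // true
    }
}
```

Passing the charset explicitly in both directions is the important part: it keeps the result independent of the platform default encoding.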

Convert from String to byte[]:

String s = "some text here";
byte[] b = s.getBytes("UTF-8"); // encode the string as UTF-8 bytes

Convert from byte[] to String:

byte[] b = {(byte) 99, (byte) 97, (byte) 116};
String s = new String(b, "US-ASCII"); // decodes to "cat"
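Both conversions only work when the encoding used to decode matches the one used to encode. The sketch below, with illustrative names of my own, round-trips “严” through UTF-8 and then decodes the same bytes with a mismatched charset, ISO-8859-1, to show what goes wrong.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RoundTripDemo {
    /** Encodes the text with one charset and decodes the resulting bytes with another. */
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u4E25";

        // Matching charsets: the original string comes back intact.
        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(original)); // true

        // Mismatched charsets: each of the three UTF-8 bytes becomes its own character.
        String garbled = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());      // 3
    }
}
```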

#InterviewNote #Encoding