Sunday 22 May 2022

 

EXTRACT ONE BYTE HEX CODE FROM UTF-8 BYTES REPRESENTING A UNICODE CHARACTER (UTF-8 TRICKS)

[This article assumes the reader knows the hexadecimal number system]

What is UTF-8?

Unicode assigns every character a numeric index called a code point. It was originally designed as a table with 16-bit addresses, so two bytes (8 bits each) could represent any character, including those outside English. English characters, however, are conventionally stored in a single byte each, so an encoding was needed that works universally for all scripts while staying compatible with that. Moreover, as more scripts and symbols from around the world were added (in an effort to include every character and symbol in use), 16 bits became insufficient, and Unicode code points now extend up to U+10FFFF. UTF-8 provided the solution: it is an encoding that can represent any character used anywhere within one common system.

UTF-8 uses one to four bytes in a stream to represent a character or symbol.

UTF-8 byte sequences follow a definite pattern in their leading bits, and this pattern is what lets us check whether a byte sequence is valid UTF-8. The patterns are shown below:

Single Byte UTF-8

For single-byte UTF-8 encoding, the pattern is: 0xxxxxxx (7 bits are used).

Thus it can represent 00000000 to 01111111, i.e. Unicode U+0000 to U+007F. This is sufficient to express English (ASCII) characters.

Two Byte UTF-8

For two-byte UTF-8 encoding, the pattern is: 110xxxxx 10xxxxxx (here, 5 + 6 = 11 bits are used).

Thus, it can represent U+0080 to U+07FF. This range covers scripts such as Greek and Arabic.

Three Byte UTF-8

For three-byte UTF-8 encoding, the pattern is: 1110xxxx 10xxxxxx 10xxxxxx. It uses a total of 4 + 6 + 6 = 16 bits, so it can represent U+0800 to U+FFFF. Characters of Devanagari, Bengali, Gurmukhi and the other Indic scripts fall in this range.

Beyond U+FFFF, more than 16 bits are needed to represent the extended Unicode characters.

Four Byte UTF-8

For four-byte UTF-8 encoding, the pattern is: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. It uses a total of 3 + 6 + 6 + 6 = 21 bits, and it represents U+10000 to U+10FFFF.
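The four patterns above can be put together in a small encoder. The following is only a sketch in C: the function name utf8_encode and its return convention are assumptions made for illustration, not part of any standard library. It also rejects the surrogate range U+D800 to U+DFFF, which is not discussed above but is not allowed in UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point into out[] (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (above U+10FFFF, or a surrogate in U+D800..U+DFFF). */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    assert(out != NULL);
    if (cp <= 0x7F) {                         /* 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                        /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                       /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                         /* surrogates are not characters */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                     /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                                 /* outside the Unicode range */
}
```

For example, utf8_encode(0x09C0, buf) should write the three bytes E0 A7 80 into buf and return 3.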

These bit patterns are what make it possible to check whether a byte sequence is genuine UTF-8.
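As an illustration of such a check, here is a minimal sketch in C. The names utf8_seq_len and utf8_pattern_ok are hypothetical. The check looks only at the leading-bit patterns described above; a complete validator would also have to reject overlong encodings, surrogates and values above U+10FFFF, which this sketch deliberately does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by a lead byte, or 0 if the byte
 * cannot start a sequence (10xxxxxx or 11111xxx). */
size_t utf8_seq_len(uint8_t lead)
{
    if ((lead & 0x80) == 0x00) return 1;      /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;      /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;      /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;      /* 11110xxx */
    return 0;
}

/* Returns 1 if every sequence in s[0..n) has a valid lead byte
 * followed by the right number of 10xxxxxx continuation bytes,
 * and 0 otherwise. Pattern check only, not full validation. */
int utf8_pattern_ok(const uint8_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t i = 0;
    while (i < n) {
        size_t len = utf8_seq_len(s[i]);
        if (len == 0 || len > n - i)
            return 0;                         /* bad lead byte or truncated input */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                     /* not a continuation byte */
        }
        i += len;
    }
    return 1;
}
```

With this sketch, the bytes E0 A7 80 pass the check, while E0 A7 alone (truncated) or E0 27 80 (bad continuation byte) do not.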

Example Decode

In the following example we will extract a single-byte hex code from a three-byte UTF-8 stream. The hex code will be unique within the Unicode range used for that script.

The UTF-8 sequence for the Bengali vowel sign ী is E0 A7 80, and its position in the Unicode table is U+09C0.

The UTF-8 pattern for Bengali characters (a three-byte sequence) is: 1110 xxxx - 10xx xxxx - 10xx xxxx.

The 1st byte is E0, which in binary is 1110 0000.

The 2nd byte is A7, which in binary is 1010 0111.

The 3rd byte is 80, which in binary is 1000 0000.

If we strip the pattern-identification bits (1110 from the first byte, 10 from each of the other two) and join the remaining bits 0000, 100111 and 000000, the whole bit sequence for ী becomes 0000 1001 1100 0000. Converting to hex gives 09C0, i.e. U+09C0, as per the Unicode table.
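The manual decoding above can be written as a short C function. This is a sketch only; the name utf8_decode3 and the choice of returning -1 for a sequence that does not match the three-byte pattern are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one three-byte UTF-8 sequence.
 * Returns the code point, or -1 if the bytes do not match
 * the pattern 1110xxxx 10xxxxxx 10xxxxxx. */
int32_t utf8_decode3(const uint8_t b[3])
{
    assert(b != NULL);
    if ((b[0] & 0xF0) != 0xE0 ||
        (b[1] & 0xC0) != 0x80 ||
        (b[2] & 0xC0) != 0x80)
        return -1;

    return ((int32_t)(b[0] & 0x0F) << 12)     /* 4 payload bits of byte 1 */
         | ((int32_t)(b[1] & 0x3F) << 6)      /* 6 payload bits of byte 2 */
         |  (int32_t)(b[2] & 0x3F);           /* 6 payload bits of byte 3 */
}
```

Called with the bytes E0 A7 80, it returns 0x09C0, matching the result worked out by hand.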

Now, we will do the same through coding:

For this, we take the third byte as the base and replace its leading 10 with the last two bits of the middle byte. The remaining leading bits always form 09 for this range of Unicode characters (the Bengali block, U+0980 to U+09FF), so they are not needed.

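    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t a[3] = { 0xE0, 0xA7, 0x80 };   /* UTF-8 sequence for U+09C0 */

        uint8_t x = a[1] & 0x03;               /* keep only the last 2 bits of the middle byte */
        x = (uint8_t)(x << 6);                 /* shift them left by 6 into the top 2 positions */
        uint8_t y = a[2] & 0x3F;               /* reset the leftmost 2 bits of the third byte to 0 */
        y = x | y;                             /* combine both parts into one byte */

        printf("0x%02X\n", y);                 /* prints 0xC0 */
        return 0;
    }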

The value of y is 0xC0, the low byte of U+09C0, which identifies ী uniquely within the required range.
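The same trick can be wrapped in a small reusable function. This is a sketch; the name utf8_3byte_low is hypothetical, and it assumes the caller already knows that the character lies in a range where the high byte is fixed (09 for Bengali), since that byte is discarded.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given the 2nd and 3rd bytes of a three-byte
 * UTF-8 sequence, return the low byte of the code point.
 * Both arguments must be continuation bytes (10xxxxxx). */
uint8_t utf8_3byte_low(uint8_t mid, uint8_t last)
{
    assert((mid & 0xC0) == 0x80 && (last & 0xC0) == 0x80);
    return (uint8_t)(((mid & 0x03) << 6)      /* last 2 bits of the middle byte */
                   | (last & 0x3F));          /* 6 payload bits of the third byte */
}
```

For instance, the Bengali letter ক (U+0995) is encoded as E0 A6 95, and utf8_3byte_low(0xA6, 0x95) gives 0x95.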