EXTRACT A ONE-BYTE HEX CODE FROM THE UTF-8 BYTES REPRESENTING A UNICODE CHARACTER (UTF-8 TRICKS)
[This article assumes the reader knows the hexadecimal number system.]
What is UTF-8?
Multilingual characters are indexed in the Unicode table, where each character was originally given a 16-bit code point. With two bytes (of 8 bits each) we could therefore represent characters beyond the English alphabet. But English (ASCII) characters fit in a single byte, so a universal encoding was needed that supports all scripts in one common scheme. Moreover, as more scripts from around the world were added (Unicode aims to include every character and symbol in use), 16 bits became insufficient. UTF-8 brought the solution: an encoding system that can represent any character used anywhere within one common scheme.
The UTF-8 encoding system uses one to four bytes in a stream to represent a character or symbol.
UTF-8 byte sequences follow a definite, unique pattern, and this pattern is used to validate a correct UTF-8 sequence. The patterns are demonstrated below.
Single Byte UTF-8
For single-byte UTF-8 encoding, the pattern is 0xxxxxxx (7 usable bits). It can therefore represent 00000000 to 01111111, i.e. Unicode U+0000 to U+007F, which is sufficient to express the English (ASCII) characters.
Two Byte UTF-8
For 2-byte UTF-8 encoding, the pattern is 110xxxxx 10xxxxxx (here, 5 + 6 = 11 usable bits). It can therefore represent U+0080 to U+07FF, which covers scripts like Greek, Arabic, etc.
Three Byte UTF-8
For 3-byte UTF-8 encoding, the pattern is 1110xxxx 10xxxxxx 10xxxxxx, giving a total of 4 + 6 + 6 = 16 usable bits. It can therefore represent U+0800 to U+FFFF. Indic scripts such as Devanagari, Bengali and Gurmukhi fall in this range.
Beyond this, more than 16 bits are needed to represent the extended Unicode characters.
Four Byte UTF-8
For 4-byte UTF-8 encoding, the pattern is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, giving a total of 3 + 6 + 6 + 6 = 21 usable bits. It can therefore represent U+10000 to U+10FFFF.
This unique pattern of the UTF-8 encoding system is what makes it possible to validate a true UTF-8 sequence.
Example Decode
In the following example we will extract a single-byte hex code from a three-byte UTF-8 stream. The hex code is unique within the Unicode range used for that script.
The UTF-8 sequence for the Bengali ী is E0 A7 80, and the position of ী in the Unicode table is U+09C0.
Now, the UTF-8 pattern for Bengali characters (3-byte sequence) is: 1110xxxx 10xxxxxx 10xxxxxx.
The 1st byte is E0, which in binary is 1110 0000. The 2nd byte is A7, i.e. 1010 0111. The 3rd byte is 80, i.e. 1000 0000.
Stripping the pattern-identification bits and concatenating the payloads (0000 + 100111 + 000000), the whole bit sequence for ী becomes 0000 1001 1100 0000. Converted to hex, this is U+09C0, matching the Unicode table.
Now we will do the same in code. We take the third byte as the base and replace its leading 10 with the last two bits of the middle byte. The remaining leading payload bits always form 0x09, which is the same for every character in this range and is therefore not required.
    uint8_t a[3];
    a[0] = 0xE0;
    a[1] = 0xA7;
    a[2] = 0x80;              /* UTF-8 sequence for ী */
    uint8_t x = a[1] & 0x03;  /* take the last 2 bits only */
    x = x << 6;               /* shift the bits 6 steps left */
    uint8_t y = a[2] & 0x3F;  /* reset the leftmost 2 bits to 0 */
    y = x | y;

The result y is 0xC0, which is the hex code of ী, unique within the required range.