UTF-8 Encoding & Decoding — Zero-Knowledge Guide
Computers store bytes, not characters. UTF-8 defines how to turn characters into bytes (encoding)
and bytes back into characters (decoding).
1) Bits, bytes, hex (what is C3?)
A byte is 8 bits. We write a byte as two hexadecimal (hex) digits. Each hex digit = 4 bits.
Example: hex C3 ⇒ C = 12 = 1100, 3 = 0011 ⇒ 1100 0011.
So C3 “becomes” 1100 0011 simply by replacing each hex digit with its 4 bits (hex → binary).
0=0000 1=0001 2=0010 3=0011 4=0100 5=0101 6=0110 7=0111
8=1000 9=1001 A=1010 B=1011 C=1100 D=1101 E=1110 F=1111
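The table lookup above can be checked with a few lines of Python. This is a minimal sketch; the function name hex_byte_to_bits is made up for illustration and is not part of any library.

```python
def hex_byte_to_bits(hex_byte: str) -> str:
    """Return the 8 bits of a two-digit hex byte, grouped as two nibbles."""
    value = int(hex_byte, 16)          # parse the hex digits as a number
    if not 0 <= value <= 0xFF:
        raise ValueError("expected exactly one byte (00 to FF)")
    bits = format(value, "08b")        # 8 binary digits, zero-padded
    return bits[:4] + " " + bits[4:]   # one group of 4 bits per hex digit


print(hex_byte_to_bits("C3"))  # 1100 0011
print(hex_byte_to_bits("A9"))  # 1010 1001
```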
2) UTF-8 lead/continuation prefixes
Look at the first bits of the first byte:
• 0xxxxxxx → 1 byte total (ASCII range).
• 110xxxxx → 2 bytes total (next must start with 10).
• 1110xxxx → 3 bytes total (then two 10 bytes).
• 11110xxx → 4 bytes total (then three 10 bytes).
• Continuation bytes always start 10xxxxxx.
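The prefix rules above can be sketched as a small Python function. The name sequence_length is hypothetical, and the function only inspects the prefix bits; it does not apply the further validity checks a full UTF-8 decoder would perform.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a character occupies, judged from its first byte."""
    if first_byte >> 7 == 0b0:       # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, so it cannot begin a character.
    raise ValueError(f"0x{first_byte:02X} is not a valid first byte")


for b in (0x41, 0xC3, 0xE2, 0xF0):
    print(f"{b:02X} -> {sequence_length(b)} byte(s)")
```

Calling sequence_length(0xA9) raises ValueError, because A9 = 1010 1001 starts with 10 and is therefore a continuation byte.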
3) ENCODING by hand (char ⇒ bytes)
Example: Encode ‘£’ (U+00A3).
Step 1: U+00A3 = hex A3 = binary 1010 0011.
Step 2: The code point lies in the range U+0080–U+07FF ⇒ use the 2-byte template 110xxxxx 10xxxxxx.
Step 3: Fill the x’s from right to left. Last 6 bits (100011) → 2nd byte: 10 100011 = 1010 0011 (A3).
The remaining bits are 10; pad with leading zeros to 5 bits (00010) → 1st byte: 110 00010 = 1100 0010 (C2).
Answer: C2 A3.
Another quick one: ‘é’ (U+00E9 = 1110 1001) ⇒ 110 00011, 10 101001 ⇒ C3 A9.
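The hand method for the 2-byte case can be written with bit shifts and masks. This is a sketch that covers only U+0080–U+07FF; encode_two_byte is a made-up name, and the result is compared with Python’s built-in str.encode as a cross-check.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF using the 110xxxxx 10xxxxxx template."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch handles only the 2-byte range")
    first = 0b11000000 | (code_point >> 6)           # 110 + the top 5 bits
    second = 0b10000000 | (code_point & 0b00111111)  # 10 + the last 6 bits
    return bytes([first, second])


print(encode_two_byte(0xA3).hex(" ").upper())  # C2 A3
print(encode_two_byte(0xE9).hex(" ").upper())  # C3 A9

# Cross-check against the built-in encoder.
assert encode_two_byte(0xA3) == "\u00a3".encode("utf-8")
```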
4) DECODING by hand (bytes ⇒ char)
Example: Decode C3 A9.
Step 1: C3⇒1100 0011 (starts with 110 ⇒ 2-byte char). A9⇒1010 1001.
Step 2: Strip prefixes: from first drop 110 ⇒ 00011; from second drop 10 ⇒ 101001.
Step 3: Join bits: 00011 101001 = 1110 1001 = hex E9 = U+00E9 = ‘é’.
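The three decoding steps above map directly onto masks and a shift. This is a sketch for 2-byte sequences only; decode_two_byte is a hypothetical name, and unlike a full decoder it does not reject every malformed input.

```python
def decode_two_byte(data: bytes) -> str:
    """Decode one 2-byte UTF-8 sequence by stripping the 110 and 10 prefixes."""
    if len(data) != 2:
        raise ValueError("expected exactly 2 bytes")
    first, second = data
    if first >> 5 != 0b110 or second >> 6 != 0b10:
        raise ValueError("bytes do not match 110xxxxx 10xxxxxx")
    high = first & 0b00011111        # drop 110, keep 5 bits
    low = second & 0b00111111        # drop 10, keep 6 bits
    code_point = (high << 6) | low   # join the 5 + 6 bits
    return chr(code_point)


ch = decode_two_byte(bytes.fromhex("C3A9"))
print(f"U+{ord(ch):04X}")  # U+00E9
```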
5) What to remember for exams
• ASCII stays 1 byte in UTF-8. Others use 2–4 bytes.
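• The number of leading 1s in the first byte gives the character’s length in bytes (110 ⇒ 2, 1110 ⇒ 3, 11110 ⇒ 4). A leading 0 means 1 byte; a leading 10 marks a continuation byte, never a first byte.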
• Show working and units (bytes) when asked for file sizes.
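A file-size answer can be checked by counting bytes per character. The sample text below is invented for illustration, and utf8_size is a made-up helper name; the sketch assumes the file holds nothing but the UTF-8 text itself.

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


sample = "caf\u00e9 \u00a35"   # 7 characters: c a f é space £ 5

# Working: 5 ASCII characters at 1 byte each, plus é and £ at 2 bytes each.
for ch in sample:
    print(f"U+{ord(ch):04X}: {utf8_size(ch)} byte(s)")

print("total:", utf8_size(sample), "bytes")  # total: 9 bytes
```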