Uni Code Image

Uploaded by

senbeth11
Copyright
© All Rights Reserved
UTF-8 Encoding & Decoding — Zero-Knowledge Guide

What you store are bytes. UTF-8 tells you how to turn characters into bytes (encoding) and back
(decoding).

1) Bits, bytes, hex (what is C3?)


A byte is 8 bits. We write a byte as two hexadecimal (hex) digits. Each hex digit = 4 bits.
Example: hex C3 ⇒ C = 12 = 1100, 3 = 0011 ⇒ 1100 0011.
So the reason C3 “becomes” 1100 0011 is: it’s just hex → binary.

0=0000 1=0001 2=0010 3=0011 4=0100 5=0101 6=0110 7=0111
8=1000 9=1001 A=1010 B=1011 C=1100 D=1101 E=1110 F=1111
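
The table lookup above can be checked mechanically. This is a minimal Python sketch; the function name hex_byte_to_bits is made up for illustration and is not part of any library.

```python
def hex_byte_to_bits(hex_byte: str) -> str:
    """Return the 8-bit binary form of a two-digit hex byte such as 'C3'.

    hex_byte_to_bits is an illustrative helper name, not a standard function.
    """
    value = int(hex_byte, 16)          # parse the two hex digits as a number
    bits = format(value, "08b")        # write it as exactly 8 binary digits
    return bits[:4] + " " + bits[4:]   # one 4-bit group per hex digit


print(hex_byte_to_bits("C3"))  # 1100 0011
```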

2) UTF-8 lead/continuation prefixes


Look at the first bits of the first byte:

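• 0xxxxxxx → 1 byte total (ASCII range).

• 110xxxxx → 2 bytes total (the next byte must start with 10).

• 1110xxxx → 3 bytes total (followed by two bytes that start with 10).

• 11110xxx → 4 bytes total (followed by three bytes that start with 10).

• Continuation bytes always start 10xxxxxx, so a byte starting with 10 is never the first byte of a character.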
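
The prefix rules above can be sketched as a small Python function. The name sequence_length is hypothetical, and the sketch only inspects the prefix bits; a real decoder would also reject a few lead bytes that are never valid in UTF-8.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte uses.

    Illustrative helper: it checks only the leading bits described above.
    """
    if first_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a lead byte
    raise ValueError("not a lead byte")


print(sequence_length(0xC3))  # 2
```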

3) ENCODING by hand (char ⇒ bytes)


Example: Encode ‘£’ (U+00A3).
Step 1: U+00A3 = hex A3 = binary 1010 0011.
Step 2: U+00A3 lies in the range U+0080–U+07FF ⇒ use the 2-byte template 110xxxxx 10xxxxxx.
Step 3: Fill the x’s from right to left. Last 6 bits (100011) → 2nd byte: 10 100011 = 1010 0011 (A3).
Remaining bits (10, padded on the left to 5 bits) → 1st byte: 00010 ⇒ 110 00010 = 1100 0010 (C2).
Answer: C2 A3.
Another quick one: ‘é’ (U+00E9) ⇒ C3 A9.
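
The hand method can be written out with shifts and masks. The following is a sketch in Python, not a replacement for the built-in str.encode; utf8_encode is an illustrative name, and the 3-byte and 4-byte ranges (U+0800–U+FFFF and U+10000–U+10FFFF) are added here for completeness. It does not reject surrogate code points, which a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point to UTF-8 bytes by filling the templates.

    Illustrative helper; surrogates (U+D800-U+DFFF) are not rejected here.
    """
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),      # top 5 bits
            0b10000000 | (code_point & 0x3F),    # last 6 bits
        ])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0x3F),
            0b10000000 | (code_point & 0x3F),
        ])
    if code_point <= 0x10FFFF:                   # 11110xxx + three 10xxxxxx
        return bytes([
            0b11110000 | (code_point >> 18),
            0b10000000 | ((code_point >> 12) & 0x3F),
            0b10000000 | ((code_point >> 6) & 0x3F),
            0b10000000 | (code_point & 0x3F),
        ])
    raise ValueError("code point out of Unicode range")


print(utf8_encode(0x00A3).hex(" ").upper())  # C2 A3
print(utf8_encode(0x00E9).hex(" ").upper())  # C3 A9
```

Comparing the result with Python's own encoder, for example chr(0x00A3).encode("utf-8"), is a quick way to check hand working.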

4) DECODING by hand (bytes ⇒ char)


Example: Decode C3 A9.
Step 1: C3⇒1100 0011 (starts with 110 ⇒ 2-byte char). A9⇒1010 1001.
Step 2: Strip prefixes: from first drop 110 ⇒ 00011; from second drop 10 ⇒ 101001.
Step 3: Join bits: 00011 101001 = 000 1110 1001 (11 bits; drop the leading zeros) = 1110 1001 = hex E9 = U+00E9 = ‘é’.
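
The same three steps can be sketched in Python for a single character. utf8_decode_one is a made-up name, and this version assumes the input is exactly one character; it checks the prefixes but does not reject overlong encodings, which a strict decoder would.

```python
def utf8_decode_one(data: bytes) -> str:
    """Decode the bytes of exactly one UTF-8 character.

    Illustrative helper: prefix checks only, no overlong-form rejection.
    """
    first = data[0]
    # Step 1: read the prefix to get the length, and strip it from byte 1
    if first >> 7 == 0b0:
        length, value = 1, first
    elif first >> 5 == 0b110:
        length, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:
        length, value = 3, first & 0b00001111
    elif first >> 3 == 0b11110:
        length, value = 4, first & 0b00000111
    else:
        raise ValueError("invalid first byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this first byte")
    # Steps 2 and 3: drop the 10 prefix of each later byte and join the bits
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0b00111111)
    return chr(value)


print(utf8_decode_one(bytes.fromhex("C3A9")))  # é
```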

5) What to remember for exams


• ASCII stays 1 byte in UTF-8. Others use 2–4 bytes.

• If the first byte starts with 0, the character is 1 byte long. Otherwise, count the leading 1s in the first byte (2, 3 or 4) to get the number of bytes.

• Show working and units (bytes) when asked for file sizes.
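
For file-size questions, the byte count of a piece of text can be checked with Python's built-in encoder. This is a small sketch; utf8_size is an illustrative name, and it counts only the text itself, with no allowance for anything else a file might contain.

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies when stored as UTF-8."""
    return len(text.encode("utf-8"))


print(utf8_size("cafe"))        # 4 bytes: four ASCII characters
print(utf8_size("caf\u00e9"))   # 5 bytes: é (U+00E9) takes 2 bytes
print(utf8_size("\u00a35"))     # 3 bytes: £ takes 2 bytes, 5 takes 1
```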
