UTF-8, UTF-16, UTF-32: Encoding Methods
These key facts cover the UTF-8, UTF-16 and UTF-32 encoding methods. This is section 6 of 10 in the Character Sets topic (Memory & Storage) for GCSE Computer Science, which comes with 15 exam-style questions and 18 flashcards. The topic appears less often in exams, but it can still be a useful differentiator on mixed-topic papers.
Unicode Transformation Formats (UTF):
Unicode defines WHAT each character's code (its code point) is. A UTF defines HOW that code is stored as bytes.
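The WHAT/HOW split can be seen with a short Python sketch. The variable names are illustrative only, and the big-endian codec names ("utf-16-be", "utf-32-be") are chosen here so that no byte-order mark is added to the output.

```python
# One character, one Unicode code point, three different byte layouts.
ch = "\u00e9"                      # the letter e with an acute accent

code_point = ord(ch)               # WHAT: the number Unicode assigns (U+00E9)

# HOW: each UTF stores that same number differently.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-be")   # big-endian, no byte-order mark
utf32_bytes = ch.encode("utf-32-be")   # big-endian, no byte-order mark

print(f"Code point: U+{code_point:04X}")
print("UTF-8 :", utf8_bytes.hex())     # c3a9
print("UTF-16:", utf16_bytes.hex())    # 00e9
print("UTF-32:", utf32_bytes.hex())    # 000000e9
```

The code point is the same in every case; only the number and arrangement of bytes changes.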
UTF-8 (Most Common):
- Variable length: 1 to 4 bytes per character
- ASCII compatible: ASCII characters still use 1 byte (efficient!)
- English text: 1 byte per character (same size as ASCII)
- Accented letters: 2 bytes (é, ñ, ü)
- Chinese/Japanese: 3 bytes per character (for most characters)
- Emoji: 4 bytes
- Advantages: Efficient for English, backward compatible with ASCII
- Disadvantage: Most Asian-language characters take 3 bytes, which is 1.5× the 2 bytes they need in UTF-16
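The byte counts listed above can be checked with a small Python sketch. The helper name utf8_length and the sample characters are assumptions made for illustration; they are not part of any exam specification.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character needs in UTF-8."""
    return len(ch.encode("utf-8"))

# Sample characters written as escapes so the code point is unambiguous.
samples = {
    "ASCII letter A": "A",
    "Accented e": "\u00e9",
    "Chinese character": "\u4e2d",
    "Emoji": "\U0001F600",
}

for label, ch in samples.items():
    print(f"{label}: {utf8_length(ch)} byte(s)")

# ASCII compatibility: plain English text is byte-for-byte identical
# whether it is encoded as ASCII or as UTF-8.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
print("Same bytes as ASCII:", same_as_ascii)
```

This prints 1, 2, 3 and 4 bytes for the four samples, matching the list.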
UTF-16:
- Variable length: 2 or 4 bytes per character
- Most characters: 2 bytes (including Chinese, Japanese, Korean)
- Emoji & rare characters: 4 bytes (stored as a surrogate pair of two 2-byte units)
- Use case: Windows internals, Java, JavaScript strings
- Advantage: Efficient for Asian languages
- Disadvantage: English takes 2× space vs ASCII/UTF-8
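A minimal Python sketch of the UTF-16 trade-off described above. The helper name size_in and the sample strings are hypothetical, and "utf-16-be" is used so that Python does not add a 2-byte byte-order mark to the counts.

```python
def size_in(text: str, encoding: str) -> int:
    """Return the number of bytes the text needs in the given encoding."""
    return len(text.encode(encoding))

english = "Hi"
chinese = "\u4e2d\u6587"           # two Chinese characters
emoji = "\U0001F600"               # one emoji, outside the 2-byte range

for label, text in [("English", english), ("Chinese", chinese), ("Emoji", emoji)]:
    print(label, "UTF-8:", size_in(text, "utf-8"),
          "UTF-16:", size_in(text, "utf-16-be"))

# The emoji is stored as a surrogate pair: two 2-byte units.
pair = emoji.encode("utf-16-be")
high_unit = pair[:2].hex()         # first 2-byte unit
low_unit = pair[2:].hex()          # second 2-byte unit
print("Surrogate pair:", high_unit, low_unit)
```

English doubles in size (2 bytes in UTF-8, 4 in UTF-16), the Chinese text shrinks (6 bytes down to 4), and the emoji takes 4 bytes in both.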
UTF-32:
- Fixed length: Exactly 4 bytes per character (always)
- Advantage: Simple - every character same size, easy indexing
- Disadvantage: Wastes space - 'A' takes 4 bytes (0x00000041)
- Use case: Internal processing where speed matters more than storage space
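The "easy indexing" point can be illustrated with a short Python sketch. The function name char_at is hypothetical, and the example assumes big-endian UTF-32 without a byte-order mark.

```python
def char_at(data: bytes, index: int) -> str:
    """Return the character at position index in UTF-32 encoded bytes.

    Every character is exactly 4 bytes, so character number index
    always starts at byte index * 4. No scanning is needed.
    """
    start = index * 4
    return data[start:start + 4].decode("utf-32-be")

text = "A\u4e2d\U0001F600"           # letter, Chinese character, emoji
data = text.encode("utf-32-be")

print(len(data))                      # 12: three characters x 4 bytes
print("A".encode("utf-32-be").hex())  # 00000041
print(char_at(data, 2) == "\U0001F600")
```

With UTF-8 or UTF-16 the same lookup would need to walk through the earlier characters first, because their sizes vary.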