Giving Every Character a Number
You already know that computers store everything as binary — patterns of 0s and 1s. But how does a computer store text? How does it know the difference between the letter “A”, the number “7”, and a question mark?
The answer is surprisingly simple: every character is given a unique number. When you type the letter “A” on your keyboard, the computer does not store the shape of the letter — it stores the number 65. When it needs to display that character on screen, it looks up number 65 in a table and draws the corresponding shape.
This system of mapping characters to numbers is called character encoding. Think of it like a secret code book — everyone agrees that “A” means 65, “B” means 66, a space means 32, and so on. As long as every computer uses the same code book, text can be stored, transmitted, and displayed correctly everywhere.
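You can see this mapping directly in Python, which exposes it through the built-in `ord()` and `chr()` functions:

```python
# ord() gives the code number for a character; chr() goes the other way.
print(ord("A"))   # 65 - the code for uppercase A
print(ord(" "))   # 32 - the code for a space
print(chr(66))    # B  - the character whose code is 66
```

Encoding and decoding are exact opposites: `chr(ord("A"))` gets you back to “A”.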
There are two main character encoding standards you need to know for your GCSE:
- ASCII — the original standard, designed in the 1960s for English-only text
- Unicode — the modern universal standard that covers every writing system on Earth (plus emojis!)
ASCII — The Original Standard
ASCII stands for American Standard Code for Information Interchange. It was created in 1963 and quickly became the universal way to encode text in early computers. ASCII uses 7 bits per character, which means it can represent 2⁷ = 128 different characters (numbered 0 to 127).
Those 128 characters include:
- Uppercase letters: A (65) to Z (90) — 26 characters
- Lowercase letters: a (97) to z (122) — 26 characters
- Digits: 0 (48) to 9 (57) — 10 characters
- Space: code 32
- Punctuation and symbols: ! @ # $ % & * ( ) and many more
- Control characters: codes 0–31 (things like “new line”, “tab”, and “backspace” — invisible characters that control formatting)
Key ASCII Code Ranges
You do not need to memorise the entire ASCII table, but you must know these key values:
| Character | ASCII Code | Binary (7-bit) |
|---|---|---|
| Space | 32 | 0100000 |
| 0 | 48 | 0110000 |
| 1 | 49 | 0110001 |
| 9 | 57 | 0111001 |
| A | 65 | 1000001 |
| B | 66 | 1000010 |
| Z | 90 | 1011010 |
| a | 97 | 1100001 |
| b | 98 | 1100010 |
| z | 122 | 1111010 |
How ASCII Enables Alphabetical Sorting
Notice that the letters are in order: A=65, B=66, C=67, and so on. This is not a coincidence — it was designed this way deliberately. Because “A” has a smaller code number than “B”, a computer can sort words alphabetically simply by comparing the numbers. The computer does not need to “know” the alphabet — it just compares the codes.
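Python (used here just to illustrate the idea) sorts strings in exactly this way, by comparing the character codes:

```python
# Comparing strings compares their character codes one by one,
# so alphabetical order falls out of the numeric order for free.
words = ["Cherry", "Apple", "Banana"]
print(sorted(words))  # ['Apple', 'Banana', 'Cherry']
print("A" < "B")      # True, because 65 < 66
```

One caveat: because every uppercase code (65–90) is smaller than every lowercase code (97–122), a plain code comparison puts “Zoo” before “apple”.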
The Uppercase/Lowercase Pattern
There is a neat relationship between uppercase and lowercase letters in ASCII:
| Uppercase | Code | Lowercase | Code | Difference |
|---|---|---|---|---|
| A | 65 | a | 97 | 32 |
| B | 66 | b | 98 | 32 |
| C | 67 | c | 99 | 32 |
| … | … | … | … | … |
| Z | 90 | z | 122 | 32 |
The lowercase version is ALWAYS 32 more than the uppercase.
This means that to convert an uppercase letter to lowercase, a computer simply adds 32 to the ASCII code. To convert lowercase to uppercase, it subtracts 32. This is incredibly efficient and still used in programming today.
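As a sketch, a case-conversion function in Python could use exactly this trick (the helper name `to_lower` is just illustrative):

```python
def to_lower(ch):
    """Convert a single uppercase letter to lowercase by adding 32 to its code."""
    if "A" <= ch <= "Z":
        return chr(ord(ch) + 32)
    return ch  # anything else is left unchanged

print(to_lower("G"))  # g
print(to_lower("g"))  # g (already lowercase, unchanged)
print(to_lower("!"))  # ! (not a letter, unchanged)
```

Real-world functions like Python's `str.lower()` handle far more than ASCII, but for the 26 English letters this is all that is happening underneath.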
The Limitation of ASCII
With only 128 characters, ASCII can represent English text and nothing more. It has no room for:
- Accented letters (like é, ü, ñ)
- Other alphabets (Chinese, Arabic, Japanese, Hindi, Greek, Russian…)
- Mathematical symbols beyond the basics
- Emojis
In the 1960s this was fine — computers were mostly used by English-speaking scientists and engineers. But as computers spread around the world, 128 characters was nowhere near enough. Something far bigger was needed.
Unicode — The Universal Standard
As the internet connected people across the globe, it became clear that the world needed a single encoding system that could handle every writing system. The solution was Unicode, first published in 1991.
Unicode uses up to 32 bits per character, giving it the theoretical capacity to represent over 4 billion characters. In practice, Unicode currently defines over 149,000 characters covering:
- Every modern writing system: Latin, Chinese, Arabic, Japanese (Kanji, Hiragana, Katakana), Hindi (Devanagari), Korean (Hangul), Cyrillic (Russian), Greek, Hebrew, Thai, and many more
- Historical scripts: Egyptian hieroglyphs, Cuneiform, Runic
- Mathematical symbols, musical notation, and technical symbols
- Emojis: thousands of them, from smiley faces to flags to food
Backwards Compatibility with ASCII
A brilliant design decision: the first 128 Unicode characters are identical to ASCII. This means that A is still 65, a is still 97, and 0 is still 48 in Unicode. Any old ASCII text file is automatically valid Unicode too. This is called backwards compatibility — the new system works with everything the old system created.
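You can check this in Python: a pure-ASCII string produces exactly the same bytes whichever of the two encodings you use:

```python
# A pure-ASCII string encodes to identical bytes in ASCII and UTF-8.
text = "Hello"
print(text.encode("ascii"))   # b'Hello'
print(text.encode("utf-8"))   # b'Hello'
print(text.encode("ascii") == text.encode("utf-8"))  # True
```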
UTF-8: The Most Common Encoding
Unicode defines the numbers (called code points), but it needs an encoding scheme to turn those numbers into actual bytes stored on disk. The most widely used encoding is UTF-8, which uses a clever variable-length system:
- 1 byte for ASCII characters (codes 0–127) — identical to ASCII
- 2 bytes for accented European letters, Greek, Cyrillic, Arabic, Hebrew
- 3 bytes for Chinese, Japanese, and Korean characters
- 4 bytes for emojis, rare historical scripts, and mathematical symbols
This is incredibly efficient: English text takes exactly the same space in UTF-8 as in ASCII (1 byte per character), while characters from other languages use only as many bytes as they need.
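Python's `encode()` method shows these sizes directly; taking `len()` of the encoded result counts the bytes:

```python
# UTF-8's variable-length sizes, one character from each category.
for ch in ["A", "é", "你", "😀"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")
# A 1 byte(s)    (ASCII range)
# é 2 byte(s)    (accented Latin)
# 你 3 byte(s)   (Chinese)
# 😀 4 byte(s)   (emoji)
```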
Emojis in Unicode
Every emoji has a unique Unicode code point, written with a “U+” prefix followed by a hexadecimal number:
| Emoji | Name | Code Point | UTF-8 Bytes |
|---|---|---|---|
| 😀 | Grinning Face | U+1F600 | 4 bytes |
| 👍 | Thumbs Up | U+1F44D | 4 bytes |
| ❤ | Red Heart | U+2764 | 3 bytes |
| 🚀 | Rocket | U+1F680 | 4 bytes |
| 🎮 | Video Game Controller | U+1F3AE | 4 bytes |
Why do emojis look different on different devices? The Unicode standard only defines the code point and a general description (e.g. “grinning face”). Each company — Apple, Google, Samsung, Microsoft — designs its own artwork for each emoji. That is why the same thumbs-up emoji (U+1F44D) can look quite different on an iPhone versus an Android phone. The code is the same, but the visual design is different.
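In Python, `ord()` reveals an emoji's code point, and `hex()` formats it in the same style as the “U+” notation:

```python
# ord() gives the code point; encoding shows the UTF-8 byte count.
thumbs = "👍"
print(hex(ord(thumbs)))             # 0x1f44d - matches U+1F44D
print(len(thumbs.encode("utf-8")))  # 4 bytes
```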
Storage Comparison
Text: “Hello”
- ASCII: H(72) e(101) l(108) l(108) o(111) = 5 bytes
- UTF-8: H(72) e(101) l(108) l(108) o(111) = 5 bytes (same!)

Chinese text: 你好 (“hello” in Chinese)
- UTF-8: 你 (3 bytes) + 好 (3 bytes) = 6 bytes

Emoji: 😀
- UTF-8: 1 character × 4 bytes = 4 bytes

Key insight: UTF-8 is efficient because simple characters use fewer bytes, and complex ones use more.
| Feature | ASCII | Unicode (UTF-8) |
|---|---|---|
| Bits per character | 7 bits (stored in 1 byte) | 8 to 32 bits (1 to 4 bytes) |
| Total characters | 128 | Over 149,000 |
| Languages supported | English only | Every written language |
| Emojis | None | Thousands |
| Storage (English text) | 1 byte per character | 1 byte per character |
| Storage (other scripts) | Not supported | 2–4 bytes per character |
| Backwards compatible? | — | Yes (first 128 = ASCII) |
Interactive Exercises
Exercise 1: ASCII Code Lookup
A random character will appear below. Type its ASCII code number.
Hint: A=65, a=97, 0=48. Letters and digits are in order from there.
Exercise 2: ASCII Decoder
The ASCII codes below spell out a word. Decode them and type the word.
Remember: A=65, a=97. Work out each letter from the code.
Exercise 3: Text to ASCII Converter
Type any text below and see its ASCII/Unicode codes, binary representation, and byte count in real time.
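Without the interactive page, a small Python sketch can do the same job (the function name `show_codes` is just illustrative):

```python
# For each character: its code point, its 8-bit binary (where it fits
# in one byte), and how many bytes it takes in UTF-8.
def show_codes(text):
    for ch in text:
        code = ord(ch)
        binary = format(code, "08b") if code < 256 else "-"
        print(f"{ch!r}: code {code}, binary {binary}, "
              f"{len(ch.encode('utf-8'))} UTF-8 byte(s)")

show_codes("Hi!")
```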
Test Yourself
Click on each question to reveal the answer. Try to work it out yourself first!
Question: How many different characters can 7-bit ASCII represent?
Answer: 128 characters.
With 7 bits, the number of possible combinations is 2⁷ = 128. These are numbered 0 to 127.
Question: What are the ASCII codes for ‘A’ and ‘a’?
Answer: A = 65, a = 97.
The difference between an uppercase and its lowercase equivalent is always 32. So to go from ‘A’ (65) to ‘a’ (97), add 32. This pattern holds for every letter in the alphabet.
Question: Give two limitations of ASCII.
Answer:
- ASCII only supports English characters. It cannot represent letters from other languages such as Chinese, Arabic, Hindi, or even accented European letters like é or ü.
- With only 128 characters, there is no room for additional symbols, emojis, or the characters needed by the billions of people who do not write in English.
Question: What is Unicode, and how does it overcome the limitations of ASCII?
Answer: Unicode uses up to 32 bits per character, allowing it to represent over 149,000 characters. This includes every modern writing system (Chinese, Arabic, Japanese, Hindi, Korean, etc.), historical scripts, mathematical symbols, and emojis. Unicode is also backwards compatible with ASCII — the first 128 Unicode characters are identical to ASCII, so existing ASCII text works without any changes.
Question: What is UTF-8, and why is it so widely used?
Answer: UTF-8 is the most widely used encoding scheme for Unicode. It uses a variable-length system: ASCII characters take only 1 byte, while other characters use 2, 3, or 4 bytes as needed. This makes it efficient because English text is no larger than in ASCII, but it can still represent every Unicode character. UTF-8 is used by the vast majority of websites, emails, and modern software.
Question: The ASCII codes 80, 121, 116, 104, 111, 110 spell out a word. What is it?
Answer: “Python”
Working: 80=P, 121=y, 116=t, 104=h, 111=o, 110=n.
Method: P is the 16th letter of the alphabet, and uppercase letters start at 65, so P = 65 + 15 = 80. The lowercase letters: y = 97 + 24 = 121, t = 97 + 19 = 116, h = 97 + 7 = 104, o = 97 + 14 = 111, n = 97 + 13 = 110.
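The decoding can be checked in one line of Python:

```python
# chr() turns each code back into its character; joining them rebuilds the word.
codes = [80, 121, 116, 104, 111, 110]
word = "".join(chr(c) for c in codes)
print(word)  # Python
```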
Question: A text file contains 500 ASCII characters. How many bytes does it take to store?
Answer: 500 bytes.
Each ASCII character is stored using 1 byte (8 bits, though only 7 are needed for the code — the 8th bit is typically 0). So 500 characters × 1 byte = 500 bytes. In UTF-8, these same 500 ASCII characters would also be 500 bytes, since UTF-8 uses 1 byte for any character in the 0–127 range.
Key Vocabulary
Make sure you know all of these terms for your exam:
| Term | Definition |
|---|---|
| ASCII | American Standard Code for Information Interchange — a 7-bit character encoding that represents 128 characters including English letters, digits, punctuation, and control characters. |
| Unicode | A universal character encoding standard that can represent over 149,000 characters from every writing system in the world, plus emojis and symbols. |
| UTF-8 | The most common Unicode encoding scheme. Uses variable-length encoding: 1 byte for ASCII characters, 2–4 bytes for other characters. Backwards compatible with ASCII. |
| Character Encoding | An agreed-upon system that assigns a unique number to each character (letter, digit, symbol) so that computers can store, transmit, and display text. |
| Code Point | The unique number assigned to a character in Unicode, written with a “U+” prefix and a hexadecimal value (e.g. U+0041 for ‘A’, U+1F600 for the grinning face emoji). |
| Backwards Compatible | When a newer system is designed to work correctly with data or files created for an older system. Unicode is backwards compatible with ASCII because its first 128 code points are identical. |
Exam Tips
Watch out for these common mistakes:
- Saying “Unicode uses 2 bytes per character.” This is a common misconception: UTF-8 uses a variable number of bytes (1 to 4). Only the older UTF-16 encoding uses a minimum of 2 bytes.
- Forgetting backwards compatibility. If asked “what is the advantage of Unicode over ASCII,” always mention that Unicode is backwards compatible — it does not break existing ASCII files.
- Confusing the character with its code. The digit character ‘7’ has ASCII code 55, not 7. The character ‘0’ has code 48, not 0. This trips up many students.
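A quick Python check makes that last point concrete:

```python
# The character '7' and the number 7 are different things.
print(ord("7"))             # 55 - the ASCII code of the character
print(int("7"))             # 7  - the numeric value after conversion
print(ord("7") - ord("0"))  # 7  - the classic digit-to-value trick
```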
Past Paper Questions
Try these exam-style questions, then click to reveal the mark scheme answer.
Explain the difference between ASCII and Unicode character encoding. [2 marks]
Mark scheme:
- ASCII uses 7 bits per character and can represent 128 characters (1)
- Unicode uses up to 32 bits per character and can represent characters from every language in the world (1)
The ASCII code for the letter 'A' is 65. What is the ASCII code for the letter 'D'? [1 mark]
Mark scheme:
- 68 (65 + 3) (1)
Give one advantage and one disadvantage of using Unicode instead of ASCII. [2 marks]
Mark scheme:
- Advantage: Unicode can represent characters from all languages / supports emojis and special symbols (1)
- Disadvantage: Unicode uses more bits per character so files are larger / requires more storage space (1)
Character Encoding in Everyday Life
Character encoding is something you use every single day without thinking about it:
- Every text message you send is encoded as a sequence of Unicode numbers, transmitted as bytes, and decoded back into characters on the other person’s phone.
- Every web page you visit declares its encoding (almost always UTF-8) in its HTML header, so your browser knows how to interpret the bytes into readable text.
- Every emoji you send is a Unicode code point. When you send a thumbs-up to a friend with a different phone, they receive the same code point (U+1F44D) but see their device’s own artwork for it.
- Every programming language uses character encoding when working with strings. Python 3, for example, uses Unicode by default, which is why you can write `print("你好")` and it works perfectly.
Have you ever seen strange characters like “�” or “Ã©” on a website? That happens when the encoding is wrong — the browser is trying to decode bytes using one encoding scheme when the file was saved in a different one. Understanding character encoding helps you understand and fix these problems.
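You can reproduce this effect deliberately in Python by decoding UTF-8 bytes with the wrong scheme:

```python
# Mojibake in miniature: encode with UTF-8, then decode with Latin-1.
data = "é".encode("utf-8")      # two bytes: 0xC3 0xA9
print(data.decode("latin-1"))   # Ã© - the classic garbled result
print(data.decode("utf-8"))     # é  - correct with the right encoding
```

Latin-1 treats each byte as one character, so the two UTF-8 bytes of “é” come out as two separate characters.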
From the simple 128-character ASCII table of the 1960s to Unicode’s 149,000+ characters today, character encoding is a brilliant example of how computer science evolves to meet the needs of a connected world.
Further Reading
- BBC Bitesize — Edexcel GCSE Computer Science — Full specification coverage for data representation including character encoding
- Isaac Computer Science — Data Representation — In-depth explanations of ASCII, Unicode, and text encoding
- GCSE Topic 2: Data Representation — Interactive revision tools and character encoding activities