Unicode Basics
What is Unicode?
Unicode is a universal character encoding standard designed to represent text in the world's writing systems. Unlike earlier standards such as ASCII, which defines only 128 characters, Unicode covers the characters of virtually all modern languages, including complex scripts, emoji, and special symbols.
Character Representation
In Unicode, each character is assigned a unique code point, a numerical value in the range 0 to 0x10FFFF. Code points are conventionally written in hexadecimal with a U+ prefix, for example U+0041 for the letter A.
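As a minimal sketch of the code point concept, the following Java snippet prints a few code points in hexadecimal and checks the valid range. The class name `CodePointDemo` and the helper `toUPlus` are hypothetical names chosen for illustration; `Character.isValidCodePoint` is a standard JDK method.

```java
public class CodePointDemo {
    // Hypothetical helper: formats a code point as U+ followed by
    // at least four uppercase hexadecimal digits.
    static String toUPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        System.out.println(toUPlus('A'));      // U+0041
        System.out.println(toUPlus(0x20AC));   // U+20AC (euro sign)
        System.out.println(toUPlus(0x1F60A));  // U+1F60A (an emoji)

        // Valid code points lie between 0 and 0x10FFFF inclusive.
        System.out.println(Character.isValidCodePoint(0x10FFFF)); // true
        System.out.println(Character.isValidCodePoint(0x110000)); // false
    }
}
```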
Code Point Types
| Code Point Range | Type |
| --- | --- |
| U+0000 - U+007F | Basic Latin |
| U+0080 - U+07FF | Latin Extended and Other Scripts |
| U+0800 - U+FFFF | More Complex Scripts |
| U+10000 - U+10FFFF | Supplementary Planes |
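The ranges in the table can be turned into a simple lookup. The sketch below is illustrative only: `CodePointRangeDemo` and `rangeType` are hypothetical names, and the labels are copied from the table rather than taken from any official Unicode classification.

```java
public class CodePointRangeDemo {
    // Hypothetical helper: maps a code point to the range labels used in the table.
    static String rangeType(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        if (codePoint <= 0x7F) return "Basic Latin";
        if (codePoint <= 0x7FF) return "Latin Extended and Other Scripts";
        if (codePoint <= 0xFFFF) return "More Complex Scripts";
        return "Supplementary Planes";
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x4E2D, 0x1F60A};
        for (int cp : samples) {
            // The plane number is the code point divided by 0x10000.
            System.out.printf("U+%04X -> %s (plane %d)%n", cp, rangeType(cp), cp >>> 16);
        }
    }
}
```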
Encoding Methods
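Unicode defines several encoding forms that map code points to bytes:
- UTF-8 (variable-length encoding, 1 to 4 bytes per code point)
- UTF-16 (variable-length encoding, one or two 16-bit code units per code point)
- UTF-32 (fixed-length encoding, one 32-bit code unit per code point)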
graph TD
A[Unicode Code Point] --> B{Encoding Method}
B --> |UTF-8| C[Variable Length Encoding]
B --> |UTF-16| D[One or Two 16-bit Code Units]
B --> |UTF-32| E[Fixed 32-bit Encoding]
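To make the size differences between the encodings concrete, here is a small sketch that encodes one character from each code point range and prints the resulting byte counts. `EncodingSizeDemo` and `encodedLength` are hypothetical names. The sketch also assumes the running JDK provides a charset named `UTF-32BE`, which is common but is not one of the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {
    // Hypothetical helper: number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Assumption: UTF-32BE is available on this JDK.
        Charset utf32 = Charset.forName("UTF-32BE");
        // U+0041, U+00E9, U+4E2D, and U+1F60A (written as a surrogate pair)
        String[] samples = {"A", "\u00E9", "\u4E2D", "\uD83D\uDE0A"};
        for (String s : samples) {
            // The big-endian variants are used so that no byte order mark is added.
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d bytes%n",
                    s.codePointAt(0),
                    encodedLength(s, StandardCharsets.UTF_8),
                    encodedLength(s, StandardCharsets.UTF_16BE),
                    encodedLength(s, utf32));
        }
    }
}
```

UTF-8 grows from 1 to 4 bytes across the samples, UTF-16 uses 2 bytes until the supplementary character needs 4, and UTF-32 always uses 4.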
Supplementary Characters
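Characters beyond the Basic Multilingual Plane (BMP), that is, code points above U+FFFF, do not fit in a single 16-bit unit. UTF-16 represents each of them as a surrogate pair: a high surrogate in the range U+D800 to U+DBFF followed by a low surrogate in the range U+DC00 to U+DFFF.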
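The surrogate pair arithmetic can be sketched by hand and cross-checked against the JDK. In the snippet below, `SurrogatePairDemo` and `toSurrogates` are hypothetical names. `Character.toChars` and `Character.isSupplementaryCodePoint` are standard methods that application code would normally use instead of this manual version.

```java
import java.util.Arrays;

public class SurrogatePairDemo {
    // Hypothetical helper: splits a supplementary code point (U+10000..U+10FFFF)
    // into its UTF-16 high and low surrogates.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F60A);
        // Prints: U+1F60A -> D83D DE0A
        System.out.printf("U+1F60A -> %04X %04X%n", (int) pair[0], (int) pair[1]);
        // Cross-check against the JDK's own conversion; prints true.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F60A)));
    }
}
```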
Java Unicode Support
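Java's char type is a 16-bit UTF-16 code unit, and the String API exposes text as a sequence of these units. Java therefore supports Unicode natively and can handle characters from all planes, but a supplementary character occupies two char values, so code that works with individual char values must account for surrogate pairs.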
Example Code
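public class UnicodeDemo {
    public static void main(String[] args) {
        // U+1F60A lies outside the BMP, so UTF-16 needs two char values for it
        char emoji = '\uD83D';       // High (leading) surrogate
        char emojiSecond = '\uDE0A'; // Low (trailing) surrogate
        // Concatenating both halves in order yields the complete character;
        // whether it renders depends on the console's encoding and font
        System.out.println("Emoji: " + emoji + emojiSecond);
    }
}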
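Because a supplementary character takes two char values, `String.length()` counts UTF-16 code units rather than code points. The following sketch shows the difference using the standard `codePointCount` and `codePoints` methods. `CodePointIterationDemo` and `codePointLength` are hypothetical names used only for this illustration.

```java
public class CodePointIterationDemo {
    // Hypothetical helper: counts code points instead of UTF-16 code units.
    static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        String text = "Hi \uD83D\uDE0A"; // "Hi " followed by U+1F60A
        System.out.println(text.length());         // 5 UTF-16 code units
        System.out.println(codePointLength(text)); // 4 code points

        // Iterate by code point so the surrogate pair is never split.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Iterating with `codePoints()` rather than `charAt` is one way to avoid processing half of a surrogate pair on its own.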
Why Unicode Matters
Unicode enables:
- Multilingual text processing
- Consistent character representation
- Global software internationalization
By providing a comprehensive character encoding standard, Unicode has become essential in modern software development, especially for applications targeting a global audience.