Unicode Basics
What is Unicode?
Unicode is a universal character encoding standard designed to represent text in most of the world's writing systems. Unlike traditional character encodings, Unicode provides a unique code point for every character across different languages and scripts.
Unicode Character Representation
In Java, text is stored as sequences of 16-bit `char` values using the UTF-16 encoding. A single 16-bit value can represent any of the 65,536 code points in the Basic Multilingual Plane; code points beyond that range require a pair of `char` values. Code points are conventionally written in hexadecimal as U+XXXX.
```mermaid
graph LR
A[Character] --> B[Unicode Code Point]
B --> C[Hexadecimal Representation]
```
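The character-to-code-point-to-hexadecimal pipeline above can be sketched in Java. This is a minimal illustration (class name is arbitrary); it casts a `char` to `int` and uses `String.codePointAt` to recover the numeric code point:

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // A char is numerically equal to its Unicode code point (for BMP characters).
        char latin = 'A';
        System.out.printf("U+%04X%n", (int) latin);        // U+0041

        // codePointAt reads the code point at a given index in a string.
        String han = "\u6C49";                             // 汉
        System.out.printf("U+%04X%n", han.codePointAt(0)); // U+6C49
    }
}
```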
Unicode Character Types
| Type | Description | Example |
|------|-------------|---------|
| Basic Latin | Standard ASCII characters | A, b, 1, @ |
| Other BMP scripts | Extended characters in the Basic Multilingual Plane (e.g. CJK ideographs) | 汉, é, Ω |
| Emoji | Graphical symbols, mostly in the supplementary planes above U+FFFF | 😀, 🎉, 🚀 |
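The three categories in the table can be inspected programmatically. A small sketch (class name is arbitrary) that prints the code point of one example from each row, showing that the emoji's code point exceeds U+FFFF:

```java
public class PlaneCheck {
    public static void main(String[] args) {
        // One example per table row: ASCII 'A', CJK '汉', and the emoji U+1F600.
        String mixed = "A\u6C49\uD83D\uDE00";
        // codePoints() iterates full code points, not 16-bit chars.
        mixed.codePoints()
             .forEach(cp -> System.out.printf("U+%04X%n", cp));
        // Prints: U+0041, U+6C49, U+1F600
    }
}
```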
Code Point vs Char in Java
In Java, a `char` is a 16-bit unsigned integer that represents a single UTF-16 code unit. Code points above U+FFFF, called supplementary characters, do not fit in one `char` and are stored as a surrogate pair of two `char` values.
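The `Character` class exposes this distinction directly. A minimal sketch (class name is arbitrary) using the emoji code point U+1F600 as an example of a supplementary character:

```java
public class SupplementaryDemo {
    public static void main(String[] args) {
        int smiley = 0x1F600; // 😀, outside the 16-bit char range

        // True: this code point lies above U+FFFF.
        System.out.println(Character.isSupplementaryCodePoint(smiley));

        // 2: it needs a surrogate pair of chars in UTF-16.
        System.out.println(Character.charCount(smiley));

        // toChars produces the surrogate pair; the resulting String
        // has length 2 but contains a single code point.
        String s = new String(Character.toChars(smiley));
        System.out.println(s.length());                          // 2
        System.out.println(s.codePointCount(0, s.length()));     // 1
    }
}
```

This is why `String.length()` counts UTF-16 code units, not characters as a user perceives them.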
Example of Unicode Conversion
```java
public class UnicodeExample {
    public static void main(String[] args) {
        // Basic Latin character
        char latinChar = 'A';           // U+0041
        // CJK character (the \u escape avoids source-file encoding issues)
        char unicodeChar = '\u6C49';    // 汉
        System.out.println("Latin Char: " + latinChar);
        System.out.println("Unicode Char: " + unicodeChar);
    }
}
```
Practical Considerations
When working with Unicode in Java, developers must be aware of:
- Character encoding when reading and writing text (specify a charset such as UTF-8 explicitly rather than relying on the platform default)
- The difference between the `char` count of a string and its code point count
- Proper handling of supplementary characters, which occupy two `char` values as a surrogate pair
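The first point above, explicit character encoding, can be illustrated with `String.getBytes`. A minimal sketch (class name is arbitrary) showing that the same character occupies a different number of bytes in different encodings:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "\u6C49"; // 汉

        // UTF-8 encodes U+6C49 in three bytes.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);   // 3

        // UTF-16 (big-endian, no BOM) encodes it in two bytes.
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(utf16.length);  // 2
    }
}
```

Passing a `StandardCharsets` constant instead of relying on the platform default keeps behavior consistent across operating systems.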
At LabEx, we recommend understanding these nuances for robust character manipulation in Java applications.