Advanced Unicode Techniques
Normalization Techniques
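Unicode normalization makes text comparison reliable by transforming canonically equivalent character sequences, such as a precomposed "é" and an "e" followed by a combining acute accent, into one standard form.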
import java.text.Normalizer;

public class NormalizationDemo {
    public static void main(String[] args) {
        String text1 = "é";       // Composed form: single code point U+00E9
        String text2 = "e\u0301"; // Decomposed form: 'e' + combining acute accent
        // Normalize both strings to canonical composition (NFC)
        String normalized1 = Normalizer.normalize(text1, Normalizer.Form.NFC);
        String normalized2 = Normalizer.normalize(text2, Normalizer.Form.NFC);
        System.out.println(text1.equals(text2)); // false: the code point sequences differ
        System.out.println(normalized1.equals(normalized2)); // true
    }
}
graph TD
A[Unicode Normalization]
A --> B[NFC: Canonical Composition]
A --> C[NFD: Canonical Decomposition]
A --> D[NFKC: Compatibility Composition]
A --> E[NFKD: Compatibility Decomposition]
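The difference between the canonical forms (NFC, NFD) and the compatibility forms (NFKC, NFKD) is easier to see on concrete input. The following is a minimal sketch rather than a definitive reference; the class name `NormalizationFormsDemo` and the helper `lengthAfter` are illustrative names, not part of any API.

```java
import java.text.Normalizer;

public class NormalizationFormsDemo {
    // Hypothetical helper: length in UTF-16 code units after normalizing to the given form.
    static int lengthAfter(String s, Normalizer.Form form) {
        return Normalizer.normalize(s, form).length();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9";   // 'é' as a single precomposed code point
        String ligature = "\uFB01"; // 'fi' ligature, a compatibility character

        // Canonical forms only compose or decompose canonically equivalent sequences.
        System.out.println(lengthAfter(eAcute, Normalizer.Form.NFC)); // 1
        System.out.println(lengthAfter(eAcute, Normalizer.Form.NFD)); // 2

        // NFC leaves the ligature alone; NFKC replaces it with the plain letters "fi".
        System.out.println(Normalizer.normalize(ligature, Normalizer.Form.NFC).equals(ligature)); // true
        System.out.println(Normalizer.normalize(ligature, Normalizer.Form.NFKC)); // fi
    }
}
```

Because the compatibility forms discard formatting distinctions such as ligatures, they tend to suit search and matching better than storage of text that must round-trip unchanged.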
Regular Expression with Unicode
| Pattern | Description | Example |
|---------|-------------|---------|
| \p{L} | Any letter | Matches 'A', '中', 'ñ' |
| \p{N} | Any number | Matches '1', '๑', '٣' |
| \p{P} | Any punctuation | Matches '!', '。', '¿' |
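As a quick check of the table, the three properties can be applied to single characters. This is a small sketch under the assumption that each input is exactly one character; `UnicodeCategoryCheck` and `classify` are hypothetical names chosen for illustration, and the non-ASCII samples are written as escapes so the source file encoding does not matter.

```java
import java.util.regex.Pattern;

public class UnicodeCategoryCheck {
    // Patterns are compiled once and reused for every call.
    private static final Pattern LETTER = Pattern.compile("\\p{L}");
    private static final Pattern NUMBER = Pattern.compile("\\p{N}");
    private static final Pattern PUNCTUATION = Pattern.compile("\\p{P}");

    // Hypothetical helper: names the first matching category for a one-character string.
    static String classify(String ch) {
        if (LETTER.matcher(ch).matches()) return "letter";
        if (NUMBER.matcher(ch).matches()) return "number";
        if (PUNCTUATION.matcher(ch).matches()) return "punctuation";
        return "other";
    }

    public static void main(String[] args) {
        // CJK ideograph, n with tilde, Arabic-Indic digit three,
        // ideographic full stop, inverted question mark, dollar sign
        String[] samples = {"A", "\u4E2D", "\u00F1", "1", "\u0663", "!", "\u3002", "\u00BF", "$"};
        for (String s : samples) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Currency and math symbols such as '$' fall under \p{S} (symbols), not \p{P}, so they are reported as "other" here.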
Unicode-aware String Processing
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeRegexDemo {
    public static void main(String[] args) {
        String text = "Hello, 世界! 123 Café";
        // Unicode-aware regex: runs of letters and runs of numbers in any script
        Pattern letterPattern = Pattern.compile("\\p{L}+");
        Pattern numberPattern = Pattern.compile("\\p{N}+");
        Matcher letterMatcher = letterPattern.matcher(text);
        Matcher numberMatcher = numberPattern.matcher(text);
        // Prints "Hello", "世界" and "Café", one per line
        while (letterMatcher.find()) {
            System.out.println("Letters: " + letterMatcher.group());
        }
        // Prints "123"
        while (numberMatcher.find()) {
            System.out.println("Numbers: " + numberMatcher.group());
        }
    }
}
Internationalization and Localization
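import java.text.NumberFormat;
import java.util.Locale;
import java.util.ResourceBundle;

public class LocalizationDemo {
    public static void main(String[] args) {
        // Set specific locale (Locale.JAPAN is the ja-JP locale)
        Locale japaneseLocale = Locale.JAPAN;
        // Requires a "messages" bundle on the classpath (for example
        // messages_ja_JP.properties) that defines the key "welcome";
        // otherwise a MissingResourceException is thrown
        ResourceBundle bundle = ResourceBundle.getBundle("messages", japaneseLocale);
        String greeting = bundle.getString("welcome");
        System.out.println(greeting);
        // Locale-specific formatting
        NumberFormat currencyFormat = NumberFormat.getCurrencyInstance(japaneseLocale);
        System.out.println(currencyFormat.format(1000));
    }
}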
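To see how locale-specific formatting changes the result, the same amount can be formatted for several locales. This is a minimal sketch; `CurrencyByLocaleDemo` and `formatCurrency` are illustrative names. The exact currency symbol and spacing depend on the locale data shipped with the JDK in use, so no exact output is shown.

```java
import java.text.NumberFormat;
import java.util.Locale;

public class CurrencyByLocaleDemo {
    // Hypothetical helper: formats an amount with the currency conventions of the locale.
    static String formatCurrency(double amount, Locale locale) {
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }

    public static void main(String[] args) {
        Locale[] locales = {Locale.JAPAN, Locale.US, Locale.GERMANY};
        for (Locale locale : locales) {
            // Grouping separator, decimal separator, fraction digits and
            // symbol placement all follow the locale
            System.out.println(locale.toLanguageTag() + " -> " + formatCurrency(1000, locale));
        }
    }
}
```

Only the locale argument changes between calls, which is why it is usually better to pass a `Locale` explicitly than to rely on the JVM default.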
- Use StringBuilder for repeated string manipulations
- Prefer String.codePointAt() over char-by-char handling, which splits supplementary characters into two surrogates
- Cache regex patterns for repeated use
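The three tips above can be combined in one small sketch. The names `CodePointDemo`, `describeCodePoints`, and `collapseWhitespace` are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

public class CodePointDemo {
    // Compiled once and cached, instead of recompiling on every call
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Walks the string by code point, so a surrogate pair is handled as one unit.
    static String describeCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // advance by 1 or 2 char values
        }
        return sb.toString();
    }

    // Replaces each run of whitespace with a single space using the cached pattern.
    static String collapseWhitespace(String s) {
        return WHITESPACE.matcher(s).replaceAll(" ");
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b"; // 'a', U+1F600 (an emoji), 'b'
        System.out.println(text.length());                         // 4
        System.out.println(text.codePointCount(0, text.length())); // 3
        System.out.println(describeCodePoints(text));              // U+0061 U+1F600 U+0062
        System.out.println(collapseWhitespace("a  \t b"));         // a b
    }
}
```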
Text Segmentation
import java.text.BreakIterator;

public class BreakIteratorDemo {
    public static void main(String[] args) {
        String text = "Hello, 世界! How are you?";
        // Character-level iteration over user-perceived characters
        BreakIterator charIterator = BreakIterator.getCharacterInstance();
        charIterator.setText(text);
        int start = charIterator.first();
        // Each [start, end) range is one character boundary pair
        for (int end = charIterator.next(); end != BreakIterator.DONE;
                start = end, end = charIterator.next()) {
            System.out.println(text.substring(start, end));
        }
    }
}
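One practical use of character-level segmentation is counting what a reader sees as characters, which can differ from both `length()` and the code point count. The sketch below assumes a base letter followed by a combining mark; `GraphemeCountDemo` and `countGraphemes` are illustrative names. Handling of more complex sequences, such as multi-part emoji, varies between JDK versions and is not covered here.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class GraphemeCountDemo {
    // Hypothetical helper: counts the boundaries reported by the character BreakIterator.
    static int countGraphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "cafe\u0301"; // "café" with a decomposed final letter
        System.out.println(text.length());                         // 5 char values
        System.out.println(text.codePointCount(0, text.length())); // 5 code points
        System.out.println(countGraphemes(text));                  // 4 user-perceived characters
    }
}
```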
Advanced Text Comparison
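import java.text.Collator;

public class TextComparisonDemo {
    public static void main(String[] args) {
        String text1 = "café";       // Precomposed é (U+00E9)
        String text2 = "cafe\u0301"; // 'e' + combining acute accent
        Collator collator = Collator.getInstance(); // Uses the default locale
        // PRIMARY strength compares base letters only, ignoring accents and case
        collator.setStrength(Collator.PRIMARY);
        // Decompose canonically so composed and decomposed input are treated alike
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        System.out.println(collator.compare(text1, text2)); // 0 (equal)
    }
}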
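The strength setting decides which differences count. The following sketch compares the same pairs at different strengths; `CollatorStrengthDemo` and `equalAt` are hypothetical names, and the English locale is an assumption made so the results do not depend on the JVM default.

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrengthDemo {
    // Hypothetical helper: true if the two strings collate as equal at the given strength.
    static boolean equalAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String plain = "cafe";
        String accented = "caf\u00E9"; // "café"
        String upper = "CAFE";

        // PRIMARY: only base letters matter
        System.out.println(equalAt(Collator.PRIMARY, plain, accented));   // true
        // SECONDARY: accents matter, case does not
        System.out.println(equalAt(Collator.SECONDARY, plain, accented)); // false
        System.out.println(equalAt(Collator.SECONDARY, plain, upper));    // true
        // TERTIARY: case matters as well
        System.out.println(equalAt(Collator.TERTIARY, plain, upper));     // false
    }
}
```

PRIMARY tends to suit search-style matching, while TERTIARY is closer to what users expect when sorting a list for display.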
Best Practices
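- Understand Unicode complexity: one user-perceived character can span several code points, and one code point can span two Java char values
- Use built-in Java Unicode handling methods (Normalizer, Collator, BreakIterator, the code point APIs) rather than hand-written logic
- Test with diverse character sets, including combining marks, non-Latin scripts, and supplementary characters such as emoji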
LabEx recommends continuous learning and practice with Unicode techniques for robust internationalization.