Hadar – Chinese Simplified/Traditional Conversion
Project: Hadar (GitHub) | Language: C++
Overview
Hadar is a high-performance text conversion engine for converting between Simplified and Traditional Chinese (Simplified ↔ Traditional). It evolved from OpenCC and keeps that project's core dictionary-based, rule-driven conversion approach while optimizing for speed and memory efficiency.
The engine uses static word/phrase dictionaries and applies deterministic rules for conversion. It is primarily suited for offline preprocessing, NLP pipelines, and batch text normalization.
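To make the dictionary-driven approach concrete, here is a minimal sketch of greedy longest-match conversion over UTF-8 text. It is not Hadar's actual code: `PhraseTable`, `utf8_len`, `convert`, and the `max_chars` parameter are hypothetical names chosen for illustration, and the string literals in any caller are assumed to be UTF-8 encoded.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical phrase table (source phrase -> target phrase).
// This is an illustration only, not Hadar's dictionary format.
using PhraseTable = std::unordered_map<std::string, std::string>;

// Byte length of the UTF-8 sequence that starts with lead byte `b`.
// Invalid lead bytes count as one byte so the scan always advances.
inline std::size_t utf8_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b >> 5) == 0x06) return 2;
    if ((b >> 4) == 0x0E) return 3;
    if ((b >> 3) == 0x1E) return 4;
    return 1;
}

// Greedy forward maximum matching: at each position, try the longest
// candidate (up to `max_chars` code points) first, then shorter ones.
// Text with no table entry is copied through unchanged.
std::string convert(std::string_view text, const PhraseTable& table,
                    std::size_t max_chars) {
    max_chars = std::max<std::size_t>(max_chars, 1);
    std::string out;
    out.reserve(text.size());
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Byte offsets at which the next `max_chars` code points end.
        std::vector<std::size_t> ends;
        std::size_t p = pos;
        while (p < text.size() && ends.size() < max_chars) {
            p = std::min(text.size(),
                         p + utf8_len(static_cast<unsigned char>(text[p])));
            ends.push_back(p);
        }
        bool matched = false;
        for (std::size_t i = ends.size(); i-- > 0;) {
            auto it = table.find(std::string(text.substr(pos, ends[i] - pos)));
            if (it != table.end()) {
                out += it->second;
                pos = ends[i];
                matched = true;
                break;
            }
        }
        if (!matched) {  // no entry: copy one code point as-is
            out.append(text.substr(pos, ends[0] - pos));
            pos = ends[0];
        }
    }
    return out;
}
```

With a table that contains both the phrase entry 头发 → 頭髮 and the single-character entry 发 → 發, the input 头发 converts to 頭髮. A table holding only single-character entries would produce 頭發 instead, which is the kind of error phrase-level matching avoids.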
Key Features
- Simplified ↔ Traditional conversion: Accurate conversion at word and phrase level, not only character-by-character.
- Rule-driven mapping: Resolves ambiguous mappings with priority rules (for example, 发 can map to either 發 or 髮).
- Static dictionaries: Memory-efficient, fast lookup for large-scale corpora.
- Batch-optimized processing: Built to convert corpora of millions of characters efficiently.
- Unicode support: Fully supports UTF-8 encoded Chinese text.
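One common way to get a compact, read-only dictionary with fast lookup is a sorted flat array searched by binary search. The sketch below illustrates that general technique only; `StaticDict` is a hypothetical class, and nothing here is taken from Hadar's source or its on-disk dictionary format.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical immutable dictionary: entries are sorted once at
// construction and never modified afterwards.
class StaticDict {
public:
    using Entry = std::pair<std::string, std::string>;

    explicit StaticDict(std::vector<Entry> entries)
        : entries_(std::move(entries)) {
        // Sort by key (then value) in byte order, which for UTF-8
        // keys is the same as code point order.
        std::sort(entries_.begin(), entries_.end());
    }

    // O(log n) lookup. Returns the mapped value, or nullopt if absent.
    // The returned view is valid for as long as the dictionary lives.
    std::optional<std::string_view> find(std::string_view key) const {
        auto it = std::lower_bound(
            entries_.begin(), entries_.end(), key,
            [](const Entry& e, std::string_view k) {
                return std::string_view(e.first) < k;
            });
        if (it != entries_.end() && std::string_view(it->first) == key) {
            return std::string_view(it->second);
        }
        return std::nullopt;
    }

private:
    std::vector<Entry> entries_;
};
```

Because the table has no insert or erase operations, lookups need no locking and the storage stays contiguous. The trade-off is that changing an entry means rebuilding the whole table.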
Typical Use Cases
- Preprocessing text for Chinese NLP pipelines.
- Normalizing user-generated content across Simplified and Traditional Chinese.
- Large-scale corpus conversion for search engines or text analytics.
- Integration into offline or batch processing systems requiring deterministic conversion.
Industrial Fit
| Requirement | Hadar Support |
|---|---|
| Simplified ↔ Traditional conversion | ✔️ |
| Word/phrase level accuracy | ✔️ |
| Dynamic updates | ❌ |
| High-volume batch processing | ✔️ |
| Unicode support | ✔️ |