IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IKAnalyzer has gone through four major versions. It began as a Chinese word segmentation component built on the open-source project Lucene, combining dictionary-based segmentation with grammar analysis algorithms. Starting with version 3.0, IK became a general-purpose word segmentation component for Java, independent of the Lucene project, while still providing a default implementation optimized for Lucene. In the 2012 version, IK implemented a simple algorithm for resolving segmentation ambiguity, marking the evolution of the IK segmenter from plain dictionary segmentation toward segmentation that simulates semantic analysis.
IKAnalyzer 2012 features:
It uses a distinctive "forward-iterating, finest-grained segmentation algorithm" and supports two segmentation modes: fine-grained and intelligent word segmentation;
Tested on an ordinary PC (Core2 i7 3.4 GHz dual-core, 4 GB memory, Windows 7 64-bit, Sun JDK 1.6_29 64-bit), IK2012 processes about 1.6 million characters per second (3000 KB/s).
In the 2012 version, the intelligent word segmentation mode supports simple disambiguation and merges numerals and quantifiers (measure words) in its output.
It uses multiple sub-processors for analysis and supports segmentation of English letters, digits, Chinese words, and more, and it is compatible with Korean and Japanese characters. Dictionary storage is optimized for a smaller memory footprint. User-defined dictionary extensions are supported. Notably, in the 2012 version, dictionaries support words that mix Chinese, English, and digits.
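To make the difference between the two segmentation modes concrete, here is a minimal, self-contained sketch in plain Java. It is not IK's implementation and does not use IK's API: the class ToySegmenter and its methods are hypothetical names chosen for illustration. Fine-grained mode is approximated by emitting every dictionary word found at every start position, and intelligent mode is approximated by greedy longest match, which is only a stand-in for IK's disambiguation step. The addWord method illustrates the idea of extending the dictionary with user-defined entries.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy dictionary segmenter (hypothetical, for illustration only; not IK's code). */
final class ToySegmenter {
    private final Set<String> dict;
    private int maxLen = 1; // length of the longest dictionary word

    ToySegmenter(Set<String> words) {
        this.dict = new HashSet<>(words); // private copy so user words can be added
        for (String w : words) {
            maxLen = Math.max(maxLen, w.length());
        }
    }

    /** User dictionary extension: register one extra word. */
    void addWord(String word) {
        dict.add(word);
        maxLen = Math.max(maxLen, word.length());
    }

    /** Fine-grained sketch: emit every dictionary word at every start offset. */
    List<String> fineGrained(String text) {
        List<String> out = new ArrayList<>();
        int covered = 0; // end offset of the furthest token emitted so far
        for (int i = 0; i < text.length(); i++) {
            boolean matched = false;
            int limit = Math.min(text.length(), i + maxLen);
            for (int j = i + 1; j <= limit; j++) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                    covered = Math.max(covered, j);
                    matched = true;
                }
            }
            // A character that no emitted word covers becomes its own token.
            if (!matched && i >= covered) {
                out.add(text.substring(i, i + 1));
                covered = i + 1;
            }
        }
        return out;
    }

    /** Intelligent-mode stand-in: greedy longest match, left to right. */
    List<String> smart(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1; // fall back to a single character
            for (int j = Math.min(text.length(), i + maxLen); j > i + 1; j--) {
                if (dict.contains(text.substring(i, j))) {
                    end = j;
                    break;
                }
            }
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }
}
```

With a dictionary of "ab", "abc", "bc", and "cd", the input "abcde" yields the overlapping tokens ab, abc, bc, cd, e in the fine-grained sketch, and the non-overlapping tokens abc, d, e in the greedy sketch. After addWord("de"), the greedy result becomes abc, de. The fine-grained output suits indexing, where recall matters, while the non-overlapping output is closer to what a reader would consider a single segmentation.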