ToolGood.Words

Python

3.1.0.0

A high-performance illegal word (sensitive word) detection component. It also provides Traditional/Simplified Chinese conversion, full-width/half-width conversion, Pinyin initials, full Pinyin, fuzzy Pinyin search, and other functions.

In C#, filtering with StringSearchEx2.Replace against a 48k-word sensitive vocabulary exceeds 300 million characters per second (CPU: i7-8750H).

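Contents of the csharp folder:

    ToolGood.Pinyin.Build:          generates the Pinyin for words
    ToolGood.Pinyin.Pretreatment:   generates the Pinyin preprocessing data, verifies Pinyin, minimizes phrases
    ToolGood.Transformation.Build:  generates the Simplified/Traditional conversion files; when updating, put the files in the same directory; word lists are based on https://github.com/BYVoid/OpenCC
    ToolGood.Words.Contrast:        string search comparison
    ToolGood.Words.Test:            unit tests
    ToolGood.Words:                 source code of this project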

Illegal word (sensitive word) detection (string search)

Illegal word (sensitive word) detection classes: StringSearch, StringSearchEx, StringSearchEx2, WordsSearch, WordsSearchEx, WordsSearchEx2, IllegalWordsSearch.

StringSearch, StringSearchEx, StringSearchEx2, StringSearchEx3: the FindFirst method returns the result as a string.
WordsSearch, WordsSearchEx, WordsSearchEx2, WordsSearchEx3: the FindFirst method returns a WordsSearchResult, which contains not only the keyword but also its start position, end position, keyword index, and so on.
IllegalWordsSearch: a class dedicated to filtering illegal words (sensitive words). You can set the skip length. By default it converts full-width to half-width, ignores case, and handles skip words, repeated words, and a blacklist. Its FindFirst method returns an IllegalWordsSearchResult, which contains the keyword, the corresponding original text, the start and end positions, and the blacklist type.
IllegalWordsSearch, StringSearchEx, StringSearchEx2, WordsSearchEx, and WordsSearchEx2 provide Save and Load methods to speed up initialization.
Common methods: SetKeywords, ContainsAny, FindFirst, FindAll, Replace.
Methods unique to IllegalWordsSearch: SetSkipWords (set skip words), SetBlacklist (set blacklist).
IllegalWordsSearch field UseIgnoreCase: sets whether case is ignored. It must be set before calling SetKeywords. Note: this field has no effect when the Load method is used.
StringSearchEx3 and WordsSearchEx3 are pointer-optimized versions. In actual measurements their performance fluctuated considerably.
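
To make the IllegalWordsSearch options above more concrete, here is a minimal standalone Python sketch of the general idea: fold full-width characters to half-width, ignore case, drop skip characters before matching, then map each hit back to the original text. It is not the library's implementation. The names to_half_width, normalize, and mask are hypothetical, and the keyword lookup is a plain substring search chosen only for brevity.

```python
def to_half_width(ch):
    """Map a full-width character to its half-width form; others pass through."""
    code = ord(ch)
    if code == 0x3000:                  # full-width (ideographic) space
        return " "
    if 0xFF01 <= code <= 0xFF5E:        # full-width ASCII block
        return chr(code - 0xFEE0)
    return ch


def normalize(text, skip_chars):
    """Return the normalized string and, per kept char, its original index."""
    kept, origin = [], []
    for i, ch in enumerate(text):
        ch = to_half_width(ch).lower()
        if ch in skip_chars:
            continue                    # skip characters never take part in matching
        kept.append(ch)
        origin.append(i)
    return "".join(kept), origin


def mask(text, keywords, skip_chars="", star="*"):
    """Mask every keyword hit, including skipped characters inside the hit."""
    norm, origin = normalize(text, skip_chars)
    out = list(text)
    for kw in keywords:
        kw = kw.lower()
        if not kw:
            continue
        pos = norm.find(kw)
        while pos != -1:
            first, last = origin[pos], origin[pos + len(kw) - 1]
            out[first:last + 1] = star * (last - first + 1)
            pos = norm.find(kw, pos + 1)
    return "".join(out)


print(mask("xＡ-b*Cy", ["abc"], skip_chars="-*"))  # x*****y
```

Keeping the index map alongside the normalized string is what lets a result report positions in the original text rather than in the normalized one.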

    string s = "中国|国人|zg人";
    string test = "我是中国人";

    StringSearch iwords = new StringSearch();
    iwords.SetKeywords(s.Split('|'));

    var b = iwords.ContainsAny(test);
    Assert.AreEqual(true, b);

    var f = iwords.FindFirst(test);
    Assert.AreEqual("中国", f);

    var all = iwords.FindAll(test);
    Assert.AreEqual("中国", all[0]);
    Assert.AreEqual("国人", all[1]);
    Assert.AreEqual(2, all.Count);

    var str = iwords.Replace(test, '*');
    Assert.AreEqual("我是***", str);
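
For readers who want to see what a multi-keyword FindAll and Replace involve, the following standalone Python sketch reproduces the results of the example above with a simple trie scan. It is an illustration under stated assumptions, not the algorithm used by StringSearch: build_trie, find_all, and replace are hypothetical names, and the scan restarts at every position, so it is far slower than an optimized automaton.

```python
def build_trie(keywords):
    """Build a nested-dict trie; the key None marks the end of a keyword."""
    root = {}
    for word in keywords:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def find_all(trie, text):
    """Return (keyword, start, end) for every hit; end is inclusive."""
    hits = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break                       # no keyword continues with this char
            if None in node:
                hits.append((node[None], start, end))
    return hits


def replace(trie, text, star="*"):
    """Mask every character covered by at least one hit."""
    out = list(text)
    for _, start, end in find_all(trie, text):
        for i in range(start, end + 1):
            out[i] = star
    return "".join(out)


trie = build_trie("中国|国人|zg人".split("|"))
text = "我是中国人"
print([kw for kw, _, _ in find_all(trie, text)])  # ['中国', '国人']
print(replace(trie, text))                          # 我是***
```

Overlapping hits ("中国" and "国人" share "国") are both reported, which is why the replacement covers three characters.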

Illegal word (sensitive word) detection (string search) (supports wildcards)

Illegal word (sensitive word) detection classes: StringMatch, StringMatchEx, WordsMatch, WordsMatchEx.

A subset of regular expression syntax is supported: . (dot), ? (question mark), [] (square brackets), and (|) (parentheses with vertical bars).

    string s = ".[中美]国|国人|zg人";
    string test = "我是中国人";

    WordsMatch wordsSearch = new WordsMatch();
    wordsSearch.SetKeywords(s.Split('|'));

    var b = wordsSearch.ContainsAny(test);
    Assert.AreEqual(true, b);

    var f = wordsSearch.FindFirst(test);
    Assert.AreEqual("是中国", f.Keyword);

    var alls = wordsSearch.FindAll(test);
    Assert.AreEqual("是中国", alls[0].Keyword);
    Assert.AreEqual(".[中美]国", alls[0].MatchKeyword);
    Assert.AreEqual(1, alls[0].Start);
    Assert.AreEqual(3, alls[0].End);
    Assert.AreEqual(0, alls[0].Index); // Index is the keyword's index, starting from 0
    Assert.AreEqual("国人", alls[1].Keyword);
    Assert.AreEqual(2, alls.Count);

    var t = wordsSearch.Replace(test, '*');
    Assert.AreEqual("我****", t);
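
The wildcard behaviour in the example above can be approximated with Python's re module. This is a rough standalone sketch, not how WordsMatch works internally: it assumes the wildcard characters carry their usual regular expression meaning, escapes everything else, and runs one compiled pattern per keyword. The names META, to_regex, find_all, and replace are hypothetical; the result keys mirror the properties used in the C# assertions.

```python
import re

META = set(".?[]()|")  # the wildcard characters listed in this section


def to_regex(keyword):
    """Keep the supported wildcard characters and escape everything else."""
    return re.compile("".join(c if c in META else re.escape(c) for c in keyword))


def find_all(keywords, text):
    """Return hits as dicts, ordered by end position; End is inclusive."""
    hits = []
    for index, kw in enumerate(keywords):
        for m in to_regex(kw).finditer(text):
            if m.group():                   # ignore empty matches
                hits.append({"Keyword": m.group(), "MatchKeyword": kw,
                             "Start": m.start(), "End": m.end() - 1,
                             "Index": index})
    hits.sort(key=lambda h: (h["End"], h["Start"]))
    return hits


def replace(keywords, text, star="*"):
    """Mask every character covered by at least one hit."""
    out = list(text)
    for h in find_all(keywords, text):
        for i in range(h["Start"], h["End"] + 1):
            out[i] = star
    return "".join(out)


keywords = ".[中美]国|国人|zg人".split("|")
hits = find_all(keywords, "我是中国人")
print([h["Keyword"] for h in hits])      # ['是中国', '国人']
print(replace(keywords, "我是中国人"))    # 我****
```

Running one regular expression per keyword scales poorly with large vocabularies, which is the problem a dedicated matcher is meant to avoid.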

Traditional and simplified interchange, full-width and half-width interchange, numbers converted to Chinese uppercase, pinyin operations

    // Convert to Simplified Chinese
    WordsHelper.ToSimplifiedChinese("我愛中國");
    WordsHelper.ToSimplifiedChinese("我愛中國", 1); // Hong Kong/Macau Traditional to Simplified
    WordsHelper.ToSimplifiedChinese("我愛中國", 2); // Taiwan Traditional to Simplified
    // Convert to Traditional Chinese
    WordsHelper.ToTraditionalChinese("我爱中国");
    WordsHelper.ToTraditionalChinese("我爱中国", 1); // Simplified to Hong Kong/Macau Traditional
    WordsHelper.ToTraditionalChinese("我爱中国", 2); // Simplified to Taiwan Traditional
    // Convert to full-width
    WordsHelper.ToSBC("abcABC123");
    // Convert to half-width
    WordsHelper.ToDBC("ａｂｃＡＢＣ１２３");
    // Convert a number to Chinese uppercase numerals
    WordsHelper.ToChineseRMB(12345678901.12);
    // Convert Chinese numerals to a number
    WordsHelper.ToNumber("壹佰贰拾叁亿肆仟伍佰陆拾柒万捌仟玖佰零壹元壹角贰分");
    // Get the full Pinyin
    WordsHelper.GetPinyin("我爱中国"); // WoAiZhongGuo
    WordsHelper.GetPinyin("我爱中国", ","); // Wo,Ai,Zhong,Guo
    WordsHelper.GetPinyin("我爱中国", true); // WǒÀiZhōngGuó

    // Get the Pinyin initials
    WordsHelper.GetFirstPinyin("我爱中国"); // WAZG
    // Get all Pinyin readings of a character
    WordsHelper.GetAllPinyin('传'); // Chuan,Zhuan
    // Get the Pinyin of a personal name
    WordsHelper.GetPinyinForName("单一一"); // ShanYiYi
    WordsHelper.GetPinyinForName("单一一", ","); // Shan,Yi,Yi
    WordsHelper.GetPinyinForName("单一一", true); // ShànYīYī
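
The full-width/half-width conversion is the easiest of these to show from first principles. The standalone Python sketch below relies only on the Unicode layout: the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from ASCII 0x21 to 0x7E, and the full-width space is U+3000. The function names to_sbc and to_dbc are hypothetical, and whether the library's ToSBC and ToDBC cover exactly this character range is an assumption.

```python
OFFSET = 0xFEE0  # distance between "!" (U+0021) and "！" (U+FF01)


def to_sbc(text):
    """Half-width to full-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x20:
            out.append("\u3000")            # space has its own full-width form
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def to_dbc(text):
    """Full-width to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - OFFSET))
        else:
            out.append(ch)
    return "".join(out)


print(to_sbc("abcABC123"))              # ａｂｃＡＢＣ１２３
print(to_dbc("ａｂｃＡＢＣ１２３"))      # abcABC123
```

Characters outside both ranges, such as Chinese text, pass through unchanged, so the two functions are inverses of each other on mixed input.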

Pinyin branch

ToolGood.Words.Pinyin aims for faster loading (currently C# only).

Pinyin matching

PinyinMatch: methods include SetKeywords, SetIndexs, Find, and FindIndex.

PinyinMatch<T>: methods include SetKeywordsFunc, SetPinyinFunc, SetPinyinSplitChar, and Find.

    string s = "北京|天津|河北|辽宁|吉林|黑龙江|山东|江苏|上海|浙江|安徽|福建|江西|广东|广西|海南|河南|湖南|湖北|山西|内蒙古|宁夏|青海|陕西|甘肃|新疆|四川|贵州|云南|重庆|西藏|香港|澳门|台湾";

    PinyinMatch match = new PinyinMatch();
    match.SetKeywords(s.Split('|').ToList());

    var all = match.Find("BJ");
    Assert.AreEqual("北京", all[0]);
    Assert.AreEqual(1, all.Count);

    all = match.Find("北J");
    Assert.AreEqual("北京", all[0]);
    Assert.AreEqual(1, all.Count);

    all = match.Find("北Ji");
    Assert.AreEqual("北京", all[0]);
    Assert.AreEqual(1, all.Count);

    all = match.Find("S");
    Assert.AreEqual("山东", all[0]);
    Assert.AreEqual("江苏", all[1]);

    // FindIndex returns the positions of the matching keywords in the keyword list
    var all2 = match.FindIndex("BJ");
    Assert.AreEqual(0, all2[0]);
    Assert.AreEqual(1, all2.Count);
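
The matching rule suggested by these assertions is that each part of the query is either a Chinese character itself or a prefix of that character's Pinyin, consumed by consecutive characters of a keyword. The standalone Python sketch below implements that reading. It is an interpretation of the example, not the PinyinMatch algorithm: the PINYIN table is a tiny hand-written stand-in for a real Pinyin dictionary, the keyword list is shortened, and find, find_index, and _match_from are hypothetical names.

```python
# Hypothetical hand-written Pinyin table covering only the demo keywords.
PINYIN = {
    "北": "bei", "京": "jing", "天": "tian", "津": "jin", "河": "he",
    "山": "shan", "东": "dong", "江": "jiang", "苏": "su",
    "上": "shang", "海": "hai",
}


def _match_from(query, qi, word, wi):
    """True if query[qi:] can be consumed by consecutive chars of word[wi:]."""
    if qi == len(query):
        return True
    if wi == len(word):
        return False
    ch = word[wi]
    # Option 1: the query contains the Chinese character itself.
    if query[qi] == ch and _match_from(query, qi + 1, word, wi + 1):
        return True
    # Option 2: the query contains a prefix of the character's Pinyin.
    py = PINYIN.get(ch, "")
    for k in range(1, len(py) + 1):
        if query[qi:qi + k].lower() != py[:k]:
            break
        if _match_from(query, qi + k, word, wi + 1):
            return True
    return False


def find_index(keywords, query):
    """Indices of keywords that match the query at any starting character."""
    return [i for i, w in enumerate(keywords)
            if any(_match_from(query, 0, w, wi) for wi in range(len(w)))]


def find(keywords, query):
    return [keywords[i] for i in find_index(keywords, query)]


keywords = "北京|天津|河北|山东|江苏|上海".split("|")
print(find(keywords, "BJ"))    # ['北京']
print(find(keywords, "北Ji"))  # ['北京']
print(find(keywords, "S"))     # ['山东', '江苏', '上海']
```

Note that "河北" does not match "BJ": "B" can be consumed by "北", but no character follows it to consume "J".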

Performance comparison

After performing 100,000 performance comparisons, the results are as follows:

Note: C#'s built-in regular expressions are very slow. StringSearchEx2.ContainsAny is more than 88,000 times faster than Regex.IsMatch; the exact ratio depends on the number of keywords.

Regex.Matches behaves much like IQueryable: it only returns a MatchCollection and does not compute the matches up front.

In the FindAll test (the test text contains sensitive words and is not shown here; you can debug and inspect it yourself):

FastFilter detected only 7

StringSearch detected 14

Aside: after taking a closer look at the surprising 3 ms result for Regex.Matches, I found that Regex.Matches has a small problem:

Regex.Matches detected only 11

Implementations in other languages

Lua version

Author: wenlifan. Repository: https://github.com/wenlifan/SensitiveWordFilter

Recommended

The "ToolGood Content Review System" is now officially open source. It runs on both Windows and Linux and uses less than 100 MB of memory.

Official website: https://toolgood.com/

Open source code: https://github.com/toolgood/ToolGood.TextFilter

Sensitive Information Filtering Research Association, QQ group: 128994346 (full)

I am not a teacher, so please do not ask basic questions about how to use or load the project.

Articles related to sensitive words

1. Things about sensitive word filtering scheme

2. Common company sensitive word review system

3. Solution for newbies to filter sensitive words

4. Filtering methods for commonly used sensitive words on the Internet

5. Optimization principles of the ToolGood.Words sensitive word filtering algorithm (paid, 30 yuan, the price of one KFC meal)

6. Detailed explanation of the ToolGood.TextFilter open source code optimizations (paid, 300 yuan). Compares against the IllegalWordsSearch algorithm to explain the optimization points of the ToolGood.TextFilter filtering algorithm and how memory usage is reduced. A small part has not been written yet. If you are impatient you can buy it now; I will keep updating it.

7. Regex-to-DFA algorithm (C# version, Java version) (paid, 30 yuan, the price of one KFC meal). Regex-to-DFA conversion is one of the core algorithms of ToolGood.TextFilter.

8. C# version of pornographic image detection (paid, 30 yuan, the price of one KFC meal)