A high-performance illegal word (sensitive word) detection component. It also provides Traditional/Simplified Chinese conversion, full-width/half-width conversion, Pinyin first-letter extraction, full Pinyin extraction, Pinyin fuzzy search, and other functions.
C# version: when filtering with StringSearchEx2.Replace against a 48k-word sensitive vocabulary, throughput exceeds 300 million characters per second (CPU: i7-8750H).
csharp folder description:
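ToolGood.Pinyin.Build: generates the Pinyin for words
ToolGood.Pinyin.Pretreatment: generates preprocessed Pinyin data, verifies Pinyin, and minimizes phrases
ToolGood.Transformation.Build: generates the Simplified/Traditional conversion files; when updating, place the files in the same directory. The word lists are based on https://github.com/BYVoid/OpenCC
ToolGood.Words.Contrast: string search comparison
ToolGood.Words.Test: unit tests
ToolGood.Words: source code of this project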
Illegal word (sensitive word) detection classes: StringSearch, StringSearchEx, StringSearchEx2, WordsSearch, WordsSearchEx, WordsSearchEx2, IllegalWordsSearch.
StringSearch, StringSearchEx, StringSearchEx2, StringSearchEx3: the FindFirst method returns a result of type string.
WordsSearch, WordsSearchEx, WordsSearchEx2, WordsSearchEx3: the FindFirst method returns a result of type WordsSearchResult. WordsSearchResult contains not only the keyword but also its start position, end position, keyword index, and so on.
IllegalWordsSearch: a class dedicated to filtering illegal words (sensitive words). You can set the skip length; by default it converts full-width to half-width, ignores case, and applies skip words, repeated words, and the blacklist. Its FindFirst method returns an IllegalWordsSearchResult, which contains the keyword, the corresponding original text, the start position, and the blacklist type.
IllegalWordsSearch, StringSearchEx, StringSearchEx2, WordsSearchEx, and WordsSearchEx2 provide Save and Load methods to speed up initialization.
Common methods: SetKeywords, ContainsAny, FindFirst, FindAll, Replace.
Methods unique to IllegalWordsSearch: SetSkipWords (set skip words), SetBlacklist (set blacklist).
IllegalWordsSearch field UseIgnoreCase: sets whether to ignore case. It must be set before calling SetKeywords. Note: this field has no effect when the Load method is used.
StringSearchEx3 and WordsSearchEx3 are pointer-optimized versions. In actual measurements, their performance was found to fluctuate considerably.
string s = "中国|国人|zg人";
string test = "我是中国人";
StringSearch iwords = new StringSearch();
iwords.SetKeywords(s.Split('|'));
var b = iwords.ContainsAny(test);
Assert.AreEqual(true, b);
var f = iwords.FindFirst(test);
Assert.AreEqual("中国", f);
var all = iwords.FindAll(test);
Assert.AreEqual("中国", all[0]);
Assert.AreEqual("国人", all[1]);
Assert.AreEqual(2, all.Count);
var str = iwords.Replace(test, '*');
Assert.AreEqual("我是***", str);
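The behavior asserted above (FindAll reporting overlapping hits, Replace masking every covered character) can be reproduced with a small trie. The following is only a minimal sketch of that idea, not the algorithm ToolGood.Words uses; AddKeyword, FindAll, Replace, children, and endsWith are hypothetical names chosen for this illustration, and the scan is deliberately naive.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; none of these names come from ToolGood.Words.
// The trie is kept in two parallel lists: children[i] maps a character to the
// index of a child node, and endsWith[i] is the keyword ending at node i ("" if none).
var children = new List<Dictionary<char, int>> { new Dictionary<char, int>() };
var endsWith = new List<string> { "" };

void AddKeyword(string word)
{
    int node = 0;
    foreach (char c in word)
    {
        if (!children[node].TryGetValue(c, out int next))
        {
            next = children.Count;
            children.Add(new Dictionary<char, int>());
            endsWith.Add("");
            children[node][c] = next;
        }
        node = next;
    }
    endsWith[node] = word;
}

// Reports every keyword occurrence, including overlapping ones, as (keyword, start).
List<(string Keyword, int Start)> FindAll(string text)
{
    var hits = new List<(string Keyword, int Start)>();
    for (int start = 0; start < text.Length; start++)
    {
        int node = 0;
        for (int i = start; i < text.Length; i++)
        {
            if (!children[node].TryGetValue(text[i], out node)) break;
            if (endsWith[node].Length > 0) hits.Add((endsWith[node], start));
        }
    }
    return hits;
}

// Masks every character covered by any hit.
string Replace(string text, char mask)
{
    char[] chars = text.ToCharArray();
    foreach (var (keyword, start) in FindAll(text))
        for (int i = start; i < start + keyword.Length; i++) chars[i] = mask;
    return new string(chars);
}

foreach (string k in "中国|国人|zg人".Split('|')) AddKeyword(k);
Console.WriteLine(FindAll("我是中国人").Count);   // 2 (中国 and 国人)
Console.WriteLine(Replace("我是中国人", '*'));    // 我是***
```

Restarting the scan at every position keeps the sketch short but costs time proportional to text length times the longest keyword; it is meant to show the semantics, not the speed.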
Illegal word (sensitive word) detection classes with pattern support: StringMatch, StringMatchEx, WordsMatch, WordsMatchEx.
They support a subset of regular expression syntax: . (dot), ? (question mark), [] (square brackets), and (|) (parentheses with vertical bars).
string s = ".[中美]国|国人|zg人";
string test = "我是中国人";
WordsMatch wordsSearch = new WordsMatch();
wordsSearch.SetKeywords(s.Split('|'));
var b = wordsSearch.ContainsAny(test);
Assert.AreEqual(true, b);
var f = wordsSearch.FindFirst(test);
Assert.AreEqual("是中国", f.Keyword);
var alls = wordsSearch.FindAll(test);
Assert.AreEqual("是中国", alls[0].Keyword);
Assert.AreEqual(".[中美]国", alls[0].MatchKeyword);
Assert.AreEqual(1, alls[0].Start);
Assert.AreEqual(3, alls[0].End);
Assert.AreEqual(0, alls[0].Index); // Index is the keyword's index, starting from 0 by default
Assert.AreEqual("国人", alls[1].Keyword);
Assert.AreEqual(2, alls.Count);
var t = wordsSearch.Replace(test, '*');
Assert.AreEqual("我****", t);
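For readers less familiar with these symbols, the sketch below shows their usual meaning using .NET's built-in System.Text.RegularExpressions as a stand-in. It assumes the symbols carry the same meaning in WordsMatch, which the example above suggests for . and []; the behavior of ? and (|) inside the library is an assumption here, and the patterns with 中华 and 美国 are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

string test = "我是中国人";

// '.' matches any single character; "[中美]" matches either 中 or 美.
Match m = Regex.Match(test, ".[中美]国");
Console.WriteLine($"{m.Value} at {m.Index}");   // 是中国 at 1

// '?' makes the preceding character optional, so both 中国 and 中华国 would match.
bool optional = Regex.IsMatch("中国", "^中华?国$");

// "(a|b)" picks one of several alternatives.
bool alternation = Regex.IsMatch("美国人", "^(中|美)国人$");

Console.WriteLine($"{optional} {alternation}");  // True True
```

The match 是中国 starting at index 1 agrees with the Keyword and Start values asserted for WordsMatch above.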
// Convert to Simplified Chinese
WordsHelper.ToSimplifiedChinese("我愛中國");
WordsHelper.ToSimplifiedChinese("我愛中國", 1); // Hong Kong/Macau Traditional to Simplified
WordsHelper.ToSimplifiedChinese("我愛中國", 2); // Taiwan Traditional to Simplified
// Convert to Traditional Chinese
WordsHelper.ToTraditionalChinese("我爱中国");
WordsHelper.ToTraditionalChinese("我爱中国", 1); // Simplified to Hong Kong/Macau Traditional
WordsHelper.ToTraditionalChinese("我爱中国", 2); // Simplified to Taiwan Traditional
// Convert to full-width
WordsHelper.ToSBC("abcABC123");
// Convert to half-width
WordsHelper.ToDBC("abcABC123");
// Convert a number to Chinese uppercase (financial) numerals
WordsHelper.ToChineseRMB(12345678901.12);
// Convert Chinese numerals to a number
WordsHelper.ToNumber("壹佰贰拾叁亿肆仟伍佰陆拾柒万捌仟玖佰零壹元壹角贰分");
// Get full Pinyin
WordsHelper.GetPinyin("我爱中国"); // WoAiZhongGuo
WordsHelper.GetPinyin("我爱中国", ","); // Wo,Ai,Zhong,Guo
WordsHelper.GetPinyin("我爱中国", true); // WǒÀiZhōngGuó
// Get Pinyin first letters
WordsHelper.GetFirstPinyin("我爱中国"); // WAZG
// Get all Pinyin readings of a character
WordsHelper.GetAllPinyin('传'); // Chuan,Zhuan
// Get Pinyin for a personal name
WordsHelper.GetPinyinForName("单一一"); // ShanYiYi
WordsHelper.GetPinyinForName("单一一", ","); // Shan,Yi,Yi
WordsHelper.GetPinyinForName("单一一", true); // ShànYīYī
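As background for the full-width/half-width calls above, the conversion itself is a fixed offset in Unicode. The sketch below shows that mapping under the assumption that only the ASCII range and the space character are converted; ToFullWidth and ToHalfWidth are hypothetical helpers written for this illustration and are not the library's ToSBC/ToDBC implementation.

```csharp
using System;
using System.Text;

// Hypothetical helpers for illustration; not the library's implementation.
// Full-width forms U+FF01..U+FF5E mirror ASCII U+0021..U+007E at an offset of 0xFEE0,
// and the ideographic space U+3000 corresponds to the ASCII space.
string ToFullWidth(string input)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (c == ' ') sb.Append('\u3000');
        else if (c >= '!' && c <= '~') sb.Append((char)(c + 0xFEE0));
        else sb.Append(c);
    }
    return sb.ToString();
}

string ToHalfWidth(string input)
{
    var sb = new StringBuilder(input.Length);
    foreach (char c in input)
    {
        if (c == '\u3000') sb.Append(' ');
        else if (c >= '\uFF01' && c <= '\uFF5E') sb.Append((char)(c - 0xFEE0));
        else sb.Append(c);
    }
    return sb.ToString();
}

string wide = ToFullWidth("abcABC123");
Console.WriteLine(wide);               // ａｂｃＡＢＣ１２３
Console.WriteLine(ToHalfWidth(wide));  // abcABC123
```

Characters outside these ranges, including Chinese text, pass through unchanged, so the two helpers are inverses of each other on any input.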
ToolGood.Words.Pinyin aims for faster loading (currently C# only).
PinyinMatch: methods include SetKeywords, SetIndexs, Find, and FindIndex.
PinyinMatch<T>: methods include SetKeywordsFunc, SetPinyinFunc, SetPinyinSplitChar, and Find.
string s = "北京|天津|河北|辽宁|吉林|黑龙江|山东|江苏|上海|浙江|安徽|福建|江西|广东|广西|海南|河南|湖南|湖北|山西|内蒙古|宁夏|青海|陕西|甘肃|新疆|四川|贵州|云南|重庆|西藏|香港|澳门|台湾";
PinyinMatch match = new PinyinMatch();
match.SetKeywords(s.Split('|').ToList());
var all = match.Find("BJ");
Assert.AreEqual("北京", all[0]);
Assert.AreEqual(1, all.Count);
all = match.Find("北J");
Assert.AreEqual("北京", all[0]);
Assert.AreEqual(1, all.Count);
all = match.Find("北Ji");
Assert.AreEqual("北京", all[0]);
Assert.AreEqual(1, all.Count);
all = match.Find("S");
Assert.AreEqual("山东", all[0]);
Assert.AreEqual("江苏", all[1]);
var all2 = match.FindIndex("BJ");
Assert.AreEqual(0, all2[0]);
Assert.AreEqual(1, all2.Count); // check the FindIndex result, not the earlier Find("S") result
After performing 100,000 performance comparisons, the results are as follows:
Note: C#'s built-in Regex is very slow. StringSearchEx2.ContainsAny is more than 88,000 times faster than Regex.IsMatch; the exact ratio depends on the number of keywords.
Regex.Matches behaves like IQueryable: it only returns a MatchCollection and does not perform the matching at that point.
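The deferred behavior of Regex.Matches can be observed directly: the call returns at once, and the scan only happens when the collection is actually used, for example by reading Count. The sketch below is a rough illustration with a made-up input and pattern, not the benchmark used for the figures in this README; tick counts vary by machine, so none are shown.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up input: the same short sentence repeated 100,000 times.
string text = string.Concat(Enumerable.Repeat("我是中国人,", 100_000));
var regex = new Regex("中国|国人");

var sw = Stopwatch.StartNew();
MatchCollection matches = regex.Matches(text);  // lazy: no scanning has happened yet
long createTicks = sw.ElapsedTicks;

sw.Restart();
int count = matches.Count;                      // forces the whole text to be scanned
long countTicks = sw.ElapsedTicks;

// Regex reports non-overlapping matches, so 中国 is found in each sentence
// but the overlapping 国人 is not: one match per repetition.
Console.WriteLine($"Matches(): {createTicks} ticks, Count: {countTicks} ticks, {count} matches");
```

A fair timing comparison therefore has to include a step that enumerates the MatchCollection, otherwise only the cost of creating the lazy collection is measured.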
In the FindAll test (the test text contains sensitive words, so it is not shown here; you can debug and check it yourself):
FastFilter detected only 7.
StringSearch detected 14.
Aside: after puzzling over the seemingly magical 3 ms of Regex.Matches, I found that Regex.Matches has a small problem: it detected only 11.
FastFilter author: wenlifan. Address: https://github.com/wenlifan/SensitiveWordFilter
"ToolGood Content Review System" is now officially open source. It runs on both Windows and Linux and uses less than 100 MB of memory.
Official website: https://toolgood.com/
Open source code: https://github.com/toolgood/ToolGood.TextFilter
Sensitive Information Filtering Research Association, QQ group: 128994346 (full)
I am not a teacher, so please do not ask basic questions about using or loading the project.
1. About sensitive word filtering schemes
2. Sensitive word review systems commonly used by companies
3. Sensitive word filtering solutions for beginners
4. Sensitive word filtering methods commonly found on the Internet
5. Optimization principles of the ToolGood.Words sensitive word filtering algorithm (paid: 30 yuan, about one KFC meal)
6. Detailed explanation of the ToolGood.TextFilter open source code optimizations (paid: 300 yuan). Compares against the IllegalWordsSearch algorithm to explain the optimization points of the ToolGood.TextFilter filtering algorithm and how it reduces memory usage. A small part is not yet written; those who are impatient can buy it first, and I will keep updating it.
7. Regular expression to DFA algorithm (C# and Java versions) (paid: 30 yuan, about one KFC meal). One of the core algorithms of ToolGood.TextFilter is regular expression to DFA conversion.
8. Pornographic image detection in C# (paid: 30 yuan, about one KFC meal)