ToolGood.Words

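A high-performance component for detecting illegal (sensitive) words. It also converts between Traditional and Simplified Chinese, converts between full-width and half-width characters, extracts pinyin initials and full pinyin, and supports fuzzy pinyin search.

In C#, filtering with StringSearchEx2.Replace against a 48k-entry sensitive-word list runs at more than 300 million characters per second (CPU: i7-8750H).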

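Contents of the csharp folder:

    ToolGood.Pinyin.Build:          generates pinyin for words
    ToolGood.Pinyin.Pretreatment:   generates pinyin preprocessing data, verifies pinyin, minimizes word groups
    ToolGood.Transformation.Build:  generates the Simplified/Traditional conversion files; when updating, place the files in the same directory; word lists are based on https://github.com/BYVoid/OpenCC
    ToolGood.Words.Contrast:        string-search comparison
    ToolGood.Words.Test:            unit tests
    ToolGood.Words:                 source code of this project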

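Illegal (sensitive) word detection (string search)

Detection classes: StringSearch, StringSearchEx, StringSearchEx2, WordsSearch, WordsSearchEx, WordsSearchEx2, IllegalWordsSearch.

StringSearch, StringSearchEx, StringSearchEx2, StringSearchEx3: FindFirst returns a string.
WordsSearch, WordsSearchEx, WordsSearchEx2, WordsSearchEx3: FindFirst returns a WordsSearchResult, which carries not only the keyword but also its start position, end position, and keyword index.
IllegalWordsSearch: a class dedicated to filtering illegal (sensitive) words. The skip length is configurable, and by default it converts full-width to half-width, ignores case, and handles skip words, repeated words, and a blacklist. FindFirst returns an IllegalWordsSearchResult containing the keyword, the corresponding original text, the start and end positions, and the blacklist type.
IllegalWordsSearch, StringSearchEx, StringSearchEx2, WordsSearchEx, and WordsSearchEx2 provide Save and Load methods, which speed up initialization.
Common methods: SetKeywords, ContainsAny, FindFirst, FindAll, Replace.
Methods specific to IllegalWordsSearch: SetSkipWords (sets the skip characters) and SetBlacklist (sets the blacklist).
IllegalWordsSearch field UseIgnoreCase: controls whether case is ignored. It must be set before calling SetKeywords. Note: the field has no effect when Load is used.
StringSearchEx3 and WordsSearchEx3 are pointer-based optimized versions; in testing, their performance fluctuated considerably.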
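The IllegalWordsSearch behaviour described above (full-width to half-width conversion, case-insensitive matching, skip characters, and a result that points back into the original text) can be illustrated with a small Python sketch. This is not the library's implementation: the skip set `SKIP_CHARS` and the helper names `fold`, `normalize`, and `find_first` are hypothetical, and the search is a plain substring scan rather than an optimized automaton.

```python
# Hypothetical skip set; the library's real defaults are not given in the text.
SKIP_CHARS = set(" *-_.")

def fold(ch):
    """Map one character to its half-width, lowercase-ASCII form."""
    code = ord(ch)
    if code == 0x3000:                 # full-width (ideographic) space
        ch = " "
    elif 0xFF01 <= code <= 0xFF5E:     # full-width ASCII block
        ch = chr(code - 0xFEE0)
    if "A" <= ch <= "Z":               # ASCII-only lowercasing keeps a 1:1 mapping
        ch = ch.lower()
    return ch

def normalize(text):
    """Return the folded text without skip characters, plus the original indexes."""
    chars, positions = [], []
    for i, ch in enumerate(text):
        ch = fold(ch)
        if ch in SKIP_CHARS:
            continue
        chars.append(ch)
        positions.append(i)
    return "".join(chars), positions

def find_first(text, keywords):
    """Return (keyword, original_text, start, end) for the earliest hit, or None."""
    norm, positions = normalize(text)
    best = None
    for kw in keywords:
        folded = "".join(fold(c) for c in kw)
        if not folded:
            continue
        at = norm.find(folded)
        if at != -1 and (best is None or at < best[0]):
            best = (at, len(folded), kw)
    if best is None:
        return None
    at, length, kw = best
    start, end = positions[at], positions[at + length - 1]
    return kw, text[start:end + 1], start, end

# The full-width "Ｚ" and the skipped "-" do not hide the keyword "zg人".
print(find_first("我是Ｚ-g人", ["中国", "zg人"]))
```

The index list returned by `normalize` is what lets the result report the span in the original text rather than in the normalized copy.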

    // Requires the ToolGood.Words namespace; Assert comes from the unit-test framework.
    string s = "中国|国人|zg人";
    string test = "我是中国人";

    StringSearch iwords = new StringSearch();
    iwords.SetKeywords(s.Split('|'));

    // True if the text contains any keyword.
    var b = iwords.ContainsAny(test);
    Assert.AreEqual(true, b);

    // First keyword found in the text.
    var f = iwords.FindFirst(test);
    Assert.AreEqual("中国", f);

    // All keywords found, including overlapping ones.
    var all = iwords.FindAll(test);
    Assert.AreEqual("中国", all[0]);
    Assert.AreEqual("国人", all[1]);
    Assert.AreEqual(2, all.Count);

    // Mask every matched character.
    var str = iwords.Replace(test, '*');
    Assert.AreEqual("我是***", str);
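To make the semantics of the common methods concrete, here is a minimal Python sketch of multi-keyword search over a character trie. It is an illustration only: the text does not describe the library's internal algorithm, and the function names `build_trie`, `find_all`, `contains_any`, `find_first`, and `replace` are chosen here to mirror the C# method names. The scan restarts at every position, so it is simple rather than fast.

```python
def build_trie(keywords):
    """Build a nested-dict trie; the key None marks the end of a keyword."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node[None] = kw
    return root

def find_all(text, trie):
    """Return (start, end, keyword) for every hit, overlapping hits included."""
    hits = []
    for start in range(len(text)):
        node = trie
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if None in node:
                hits.append((start, i, node[None]))
    return hits

def contains_any(text, trie):
    return bool(find_all(text, trie))

def find_first(text, trie):
    hits = find_all(text, trie)
    return hits[0][2] if hits else None

def replace(text, trie, mask="*"):
    """Mask every character that belongs to at least one hit."""
    out = list(text)
    for start, end, _ in find_all(text, trie):
        for i in range(start, end + 1):
            out[i] = mask
    return "".join(out)

trie = build_trie("中国|国人|zg人".split("|"))
print(find_all("我是中国人", trie))   # [(2, 3, '中国'), (3, 4, '国人')]
print(replace("我是中国人", trie))    # 我是***
```

Because "中国" and "国人" overlap on "国", the replacement masks three characters, which matches the "我是***" result in the C# example.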

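Illegal (sensitive) word detection (string search) with wildcard support

Detection classes: StringMatch, StringMatchEx, WordsMatch, WordsMatchEx.

A subset of regular-expression syntax is supported: . (dot), ? (question mark), [] (square brackets), and (|) (parentheses with a vertical bar).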

    string s = ".[中美]国|国人|zg人";
    string test = "我是中国人";

    WordsMatch wordsSearch = new WordsMatch();
    wordsSearch.SetKeywords(s.Split('|'));

    var b = wordsSearch.ContainsAny(test);
    Assert.AreEqual(true, b);

    // Keyword is the matched text, not the pattern.
    var f = wordsSearch.FindFirst(test);
    Assert.AreEqual("是中国", f.Keyword);

    var alls = wordsSearch.FindAll(test);
    Assert.AreEqual("是中国", alls[0].Keyword);
    Assert.AreEqual(".[中美]国", alls[0].MatchKeyword); // the pattern that matched
    Assert.AreEqual(1, alls[0].Start);
    Assert.AreEqual(3, alls[0].End);
    Assert.AreEqual(0, alls[0].Index); // index of the pattern; starts at 0 by default
    Assert.AreEqual("国人", alls[1].Keyword);
    Assert.AreEqual(2, alls.Count);

    var t = wordsSearch.Replace(test, '*');
    Assert.AreEqual("我****", t);
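The wildcard result fields can be reproduced with a short Python sketch that hands each keyword to the standard `re` module. This is an assumption-laden illustration: Python's `re` accepts far more syntax than the subset listed above, the library's matcher is a separate implementation that may differ in edge cases, and the names `find_all_patterns` and `replace_patterns` are hypothetical.

```python
import re

def find_all_patterns(text, keywords):
    """Return dicts with keyword, match_keyword, start, end, index for every hit."""
    compiled = [re.compile(kw) for kw in keywords]
    hits = []
    for start in range(len(text)):
        for index, pattern in enumerate(compiled):
            m = pattern.match(text, start)      # anchored at this start position
            if m and m.end() > m.start():       # ignore empty matches
                hits.append({
                    "keyword": m.group(0),             # the matched text
                    "match_keyword": keywords[index],  # the pattern that matched
                    "start": m.start(),
                    "end": m.end() - 1,                # inclusive end position
                    "index": index,
                })
    return hits

def replace_patterns(text, keywords, mask="*"):
    """Mask every character covered by at least one hit."""
    out = list(text)
    for hit in find_all_patterns(text, keywords):
        for i in range(hit["start"], hit["end"] + 1):
            out[i] = mask
    return "".join(out)

patterns = ".[中美]国|国人|zg人".split("|")
for hit in find_all_patterns("我是中国人", patterns):
    print(hit)
print(replace_patterns("我是中国人", patterns))   # 我****
```

The first hit covers positions 1 to 3 ("是中国") and the second covers 3 to 4 ("国人"), so the masked output agrees with the "我****" result in the C# example.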

Traditional/Simplified conversion, full-width/half-width conversion, numbers to Chinese uppercase, and pinyin operations

    // Convert to Simplified Chinese
    WordsHelper.ToSimplifiedChinese("我愛中國");
    WordsHelper.ToSimplifiedChinese("我愛中國", 1); // Hong Kong/Macau Traditional to Simplified
    WordsHelper.ToSimplifiedChinese("我愛中國", 2); // Taiwan Traditional to Simplified
    // Convert to Traditional Chinese
    WordsHelper.ToTraditionalChinese("我爱中国");
    WordsHelper.ToTraditionalChinese("我爱中国", 1); // Simplified to Hong Kong/Macau Traditional
    WordsHelper.ToTraditionalChinese("我爱中国", 2); // Simplified to Taiwan Traditional
    // Convert to full-width
    WordsHelper.ToSBC("abcABC123");
    // Convert to half-width
    WordsHelper.ToDBC("ａｂｃＡＢＣ１２３");
    // Convert a number to Chinese uppercase (currency) numerals
    WordsHelper.ToChineseRMB(12345678901.12);
    // Convert Chinese numerals to a number
    WordsHelper.ToNumber("壹佰贰拾叁亿肆仟伍佰陆拾柒万捌仟玖佰零壹元壹角贰分");
    // Get full pinyin
    WordsHelper.GetPinyin("我爱中国"); // WoAiZhongGuo
    WordsHelper.GetPinyin("我爱中国", ","); // Wo,Ai,Zhong,Guo
    WordsHelper.GetPinyin("我爱中国", true); // WǒÀiZhōngGuó

    // Get pinyin initials
    WordsHelper.GetFirstPinyin("我爱中国"); // WAZG
    // Get all pinyin readings of a character
    WordsHelper.GetAllPinyin('传'); // Chuan,Zhuan
    // Get pinyin for a personal name
    WordsHelper.GetPinyinForName("单一一"); // ShanYiYi
    WordsHelper.GetPinyinForName("单一一", ","); // Shan,Yi,Yi
    WordsHelper.GetPinyinForName("单一一", true); // ShànYīYī
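Of the helpers above, the full-width/half-width conversion is simple enough to sketch directly. The Python functions below are an independent illustration, not the code behind ToSBC and ToDBC, and they rely on one assumption about scope: only the printable ASCII range and the space character are converted, using the fixed Unicode offset 0xFEE0 between the two blocks.

```python
FULL_WIDTH_OFFSET = 0xFEE0   # distance between "!" (U+0021) and "！" (U+FF01)

def to_full_width(text):
    """Convert printable ASCII to full-width forms (the direction ToSBC is described as)."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x20:
            out.append("\u3000")                      # ideographic space
        elif 0x21 <= code <= 0x7E:
            out.append(chr(code + FULL_WIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)

def to_half_width(text):
    """Convert full-width forms back to ASCII (the direction ToDBC is described as)."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - FULL_WIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)

print(to_full_width("abcABC123"))            # ａｂｃＡＢＣ１２３
print(to_half_width("ａｂｃＡＢＣ１２３"))   # abcABC123
```

Characters outside those ranges, including Chinese text, pass through unchanged in both directions.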

Pinyin branch

ToolGood.Words.Pinyin aims for faster loading (currently available in C# only).

Pinyin matching

PinyinMatch: methods SetKeywords, SetIndexs, Find, FindIndex.

PinyinMatch<T>: methods SetKeywordsFunc, SetPinyinFunc, SetPinyinSplitChar, Find.

    // ToList() requires System.Linq.
    string s = "北京|天津|河北|辽宁|吉林|黑龙江|山东|江苏|上海|浙江|安徽|福建|江西|广东|广西|海南|河南|湖南|湖北|山西|内蒙古|宁夏|青海|陕西|甘肃|新疆|四川|贵州|云南|重庆|西藏|香港|澳门|台湾";

    PinyinMatch match = new PinyinMatch();
    match.SetKeywords(s.Split('|').ToList());

    // Match by pinyin initials.
    var all = match.Find("BJ");
    Assert.AreEqual("北京", all[0]);
    Assert.AreEqual(1, all.Count);

    // Mix a Chinese character with a pinyin initial.
    all = match.Find("北J");
    Assert.AreEqual("北京", all[0]);
    Assert.AreEqual(1, all.Count);

    // Mix a Chinese character with a partial pinyin spelling.
    all = match.Find("北Ji");
    Assert.AreEqual("北京", all[0]);
    Assert.AreEqual(1, all.Count);

    // A single initial can match anywhere inside a keyword.
    all = match.Find("S");
    Assert.AreEqual("山东", all[0]);
    Assert.AreEqual("江苏", all[1]);

    // FindIndex returns keyword positions instead of keyword text.
    var all2 = match.FindIndex("BJ");
    Assert.AreEqual(0, all2[0]);
    Assert.AreEqual(1, all2.Count);
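The matching behaviour shown above (initials, mixed Chinese characters and letters, partial spellings, matches starting anywhere in a keyword) can be sketched in Python as a small backtracking matcher. Everything here is illustrative: the `PINYIN` table is a tiny hand-written stand-in for a real pinyin dictionary, each character has a single reading, and `find` and `_match_from` are hypothetical names, not the library's algorithm.

```python
# Tiny hand-written pinyin table for illustration; a real matcher needs a full dictionary.
PINYIN = {
    "北": "bei", "京": "jing", "天": "tian", "津": "jin",
    "山": "shan", "东": "dong", "江": "jiang", "苏": "su",
    "上": "shang", "海": "hai",
}

def _match_from(query, word, qi, wi):
    """True if query[qi:] can be consumed by consecutive characters of word[wi:]."""
    if qi == len(query):
        return True
    if wi == len(word):
        return False
    ch = word[wi]
    # Option 1: the query contains the Chinese character itself.
    if query[qi] == ch and _match_from(query, word, qi + 1, wi + 1):
        return True
    # Option 2: the query contains a non-empty prefix of the character's pinyin.
    pinyin = PINYIN.get(ch, "")
    for n in range(1, len(pinyin) + 1):
        if query[qi:qi + n].lower() != pinyin[:n]:
            break
        if _match_from(query, word, qi + n, wi + 1):
            return True
    return False

def find(query, keywords):
    """Return the keywords matching the query, in their original order."""
    if not query:
        return []
    return [w for w in keywords
            if any(_match_from(query, w, 0, i) for i in range(len(w)))]

cities = ["北京", "天津", "山东", "江苏", "上海"]
print(find("BJ", cities))     # ['北京']
print(find("北Ji", cities))   # ['北京']
print(find("S", cities))      # ['山东', '江苏', '上海']
```

As in the C# example, "S" matches "江苏" through its second character, because a match may begin at any position inside the keyword.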

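Performance comparison

Results of a performance comparison over 100,000 runs:

Note: C#'s built-in regular expressions are slow. StringSearchEx2.ContainsAny is more than 88,000 times as fast as Regex.IsMatch, although the ratio depends on the number of keywords.

Regex.Matches behaves much like IQueryable: it only returns a MatchCollection, and the matches have not been computed yet.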
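The same benchmarking pitfall exists in Python, where `re.finditer` also returns a lazy iterator. The sketch below is an analogy rather than a reproduction of the C# benchmark: the function name `time_finditer` and the sample text are made up for illustration, and the timings it prints will vary by machine.

```python
import re
import time

def time_finditer(pattern, text):
    """Return (seconds to create the iterator, seconds to consume it, hit count)."""
    t0 = time.perf_counter()
    lazy = pattern.finditer(text)   # returns immediately; nothing is scanned yet
    t1 = time.perf_counter()
    matches = list(lazy)            # the actual scan happens here
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, len(matches)

keywords = ["中国", "国人", "zg人"]
pattern = re.compile("|".join(re.escape(k) for k in keywords))
create_s, consume_s, hits = time_finditer(pattern, "我是中国人" * 10_000)
print(f"create: {create_s:.6f}s  consume: {consume_s:.6f}s  hits: {hits}")
```

Timing only the call that creates the lazy result measures almost nothing, so a fair comparison has to enumerate the matches.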

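In the FindAll test (the detected text contains sensitive words, so it is not shown here; you can debug the test yourself to see it):

FastFilter detected only 7.

StringSearch detected 14.

An aside: while looking into the surprising 3 ms result for Regex.Matches, I found that Regex.Matches has a small problem:

Regex.Matches detected only 11.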
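The text does not say why the regex found fewer hits. One possible contributor, offered here purely as an assumption, is that a regex alternation reports non-overlapping matches, while a keyword searcher can report keywords that overlap. The Python sketch below shows that difference with the sample keywords used earlier; the helper names are hypothetical.

```python
import re

def regex_hits(text, keywords):
    """Non-overlapping matches of a keyword alternation, left to right."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    return pattern.findall(text)

def overlapping_hits(text, keywords):
    """Every keyword occurrence, including ones that overlap earlier hits."""
    return [kw
            for start in range(len(text))
            for kw in keywords
            if text.startswith(kw, start)]

keywords = ["中国", "国人", "zg人"]
print(regex_hits("我是中国人", keywords))        # ['中国']
print(overlapping_hits("我是中国人", keywords))  # ['中国', '国人']
```

Once the regex has consumed "中国", scanning resumes after it, so "国人" is never reported.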

Implementations in other languages

Lua version

Author: wenlifan. URL: https://github.com/wenlifan/SensitiveWordFilter

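Articles on sensitive-word filtering

1. Notes on sensitive-word filtering approaches

2. Sensitive-word review policies at ordinary companies

3. A sensitive-word filtering approach for beginners

4. Common sensitive-word filtering methods on the web

5. How the ToolGood.Words filtering algorithm is optimized (paid, 30 yuan, about the price of a KFC meal)

6. ToolGood.TextFilter open-source code optimization in detail (paid, 300 yuan). Compares it with the IllegalWordsSearch algorithm and explains where the ToolGood.TextFilter filtering algorithm is optimized and how memory usage is reduced. A small part is not yet written; readers in a hurry can buy it now, and I will keep updating it.

7. Regex-to-DFA algorithm (C# and Java versions) (paid, 30 yuan, about the price of a KFC meal). One of the core algorithms in ToolGood.TextFilter uses regex-to-DFA conversion.

8. Image pornography detection in C# (paid, 30 yuan, about the price of a KFC meal)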

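Bitcoin private key collider

A Bitcoin private key collider that uses a computer's idle capacity (3 GB of memory) to try for 250,000 bitcoins.

Bitcoin private key collider (paid, 50 yuan)

Bitcoin private key collider source code (paid, 500 yuan)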

Additional information

Version: 3.1.0.0
Type: Python
Updated: 2024-12-21
Size: 74.71MB
Source: GitHub