Several different methods of calculating fluency are provided: ngrams, gpt, masked bert. For the kenlm method, please refer to Su Shen's blog.

method | introduction | Model | Case |
---|---|---|---|
ngrams | Uses n-grams to calculate the probability of the next word [one-way sliding window] | Baidu Netdisk: no8i (trained on the THUCNews abstract dataset); can also be trained on other corpora with train_ngramslm.py | Case |
gpt | Uses a Chinese GPT to calculate the probability of the next word [one-way] | Baidu Netdisk: qmzg; you can also follow the link to obtain other pre-trained Chinese GPT models, or train your own | Case |
bert | Masks each word in the sentence, predicts the distribution over the masked position, and reads off the probability of the original word [two-way] | Baidu Netdisk: ma3b; you can also follow the link to obtain other pre-trained Chinese BERT models, or train your own | Case |
albert | Same as bert, but with a smaller model | Baidu Netdisk: q6pb; you can also follow the link to obtain other pre-trained Chinese ALBERT models, or train your own | Case |
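Whichever model is used, the reported score is a perplexity over per-token probabilities. For reference, the usual definition is shown below; the base-2 form is inferred from the verbose outputs further down, not from the library's documentation:

$$
\mathrm{ppl}(w_1,\dots,w_N) = 2^{-\frac{1}{N}\sum_{i=1}^{N}\log_2 p(w_i \mid \mathrm{context}_i)}
$$

where the context is the preceding words for the one-way models (ngrams, gpt) and the rest of the sentence for the masked ones (bert, albert).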
torch and transformers need to be installed; please install them yourself. Use cases can be found in example.py.
Usage:

Input (test corpus):
```python
sentences = [
    "中国人的性情是总喜欢调和折中的,譬如你说,这屋子太暗,须在这里开一个窗,大家一定不允许的。但如果你主张拆掉屋顶他们就来调和,愿意开窗了。",
    "惟将终夜长开眼,报答平生未展眉",
    "我原以为,你身为汉朝老臣,来到阵前,面对两军将士,必有高论。没想到,竟说出如此粗鄙之语!",
    "人生当中成功只是一时的,失败却是主旋律,但是如何面对失败,却把人分成不同的样子,有的人会被失败击垮,有的人能够不断的爬起来继续向前,我想真正的成熟,应该不是追求完美,而是直面自己的缺憾,这才是生活的本质,罗曼罗兰说过,这个世界上只有一种真正的英雄主义,那就是认清生活的真相,并且仍然热爱它。难道向上攀爬的那条路不是比站在顶峰更让人热血澎湃吗?",
    "我在树上游泳。",
    "我在游泳池游泳。",
    "我游泳在游泳池。",
    "尤是为了,更佳大的,念,念,李是彼,更伟大的多,你只会用这种方法解决问题吗!",
]
```
For details on training the model, see train_ngramslm.py.
Since this model was trained on the THUCNews abstract dataset and lacks classical poetry and prose, the ppl of some non-vernacular texts is relatively high. Otherwise the scores are fairly accurate, capture semantics well, and are not affected by sentence length.
```python
import jieba
import time
from models import NgramsLanguageModel

start_time = time.time()

model = NgramsLanguageModel.from_pretrained("./thucnews_lm_model")

print(f"Loading ngrams model cost {time.time() - start_time:.3f} seconds.")

for s in sentences:
    ppl = model.perplexity(
        x=jieba.lcut(s),  # a word-segmented sentence or paragraph
        verbose=False,    # whether to print per-token probabilities, default=False
    )
    print(f"ppl: {ppl:.5f} # {s}")

print(model.perplexity(jieba.lcut(sentences[-4]), verbose=True))
# Loading ngrams model cost 26.640 seconds.
#
# ppl: 8572.17074 # 中国人的性情是总喜欢调和折中的,譬如你说,这屋子太暗,须在这里开一个窗,大家一定不允许的。但如果你主张拆掉屋顶他们就来调和,愿意开窗了。
# ppl: 660033.44283 # 惟将终夜长开眼,报答平生未展眉
# ppl: 121955.03294 # 我原以为,你身为汉朝老臣,来到阵前,面对两军将士,必有高论。没想到,竟说出如此粗鄙之语!
# ppl: 6831.79220 # 人生当中成功只是一时的,失败却是主旋律,但是如何面对失败,却把人分成不同的样子,有的人会被失败击垮,有的人能够不断的爬起来继续向前,我想真正的成熟,应该不是追求完美,而是直面自己的缺憾,这才是生活的本质,罗曼罗兰说过,这个世界上只有一种真正的英雄主义,那就是认清生活的真相,并且仍然热爱它。难道向上攀爬的那条路不是比站在顶峰更让人热血澎湃吗?
# ppl: 12816.52860 # 我在树上游泳。
# ppl: 7122.96754 # 我在游泳池游泳。
# ppl: 61286.99997 # 我游泳在游泳池。
# ppl: 135742.90546 # 尤是为了,更佳大的,念,念,李是彼,更伟大的多,你只会用这种方法解决问题吗!
#
# ['我', '在'] | 0.00901780
# ['在', '树上'] | 0.00003544
# ['树上', '游泳'] | 0.00000059
# ['游泳', '。'] | 0.00019609
# l score: -13.64571794
# 12816.528602897242
```
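As a sanity check (not part of the library), the printed `l score` appears to be the average log2 probability of the bigrams above, and the ppl is 2 raised to its negation. Recomputing from the rounded probabilities reproduces the value up to rounding error:

```python
import math

# bigram probabilities printed by verbose=True above
probs = [0.00901780, 0.00003544, 0.00000059, 0.00019609]

l_score = sum(math.log2(p) for p in probs) / len(probs)  # ≈ -13.65
ppl = 2 ** (-l_score)                                    # ≈ 12.8k, matching 12816.53 up to rounding
print(l_score, ppl)
```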
BERT generally performs better than the ngrams method. ALBERT is fast, but its results are not ideal.
```python
from models import MaskedBert, MaskedAlbert

model = MaskedAlbert.from_pretrained("/home/baojunshan/data/pretrained_models/albert_base_zh")

# model = MaskedBert.from_pretrained(
#     path="/home/baojunshan/data/pretrained_models/chinese_bert_wwm_ext_pytorch",
#     device="cpu",        # use "cpu" or "cuda:0", default=cpu
#     sentence_length=50,  # long texts are split into sentences of at most this length, default=50
# )

for s in sentences:
    ppl = model.perplexity(
        x=" ".join(s),    # characters separated by spaces, or pass a list
        verbose=False,    # whether to print per-character probabilities, default=False
        temperature=1.0,  # softmax temperature, default=1
        batch_size=100,   # inference batch size, adjust for CPU/GPU, default=100
    )
    print(f"ppl: {ppl:.5f} # {s}")

model.perplexity(sentences[-4], verbose=True)
# model.score(...)  # same parameters
# ppl: 4.20476 # 中国人的性情是总喜欢调和折中的,譬如你说,这屋子太暗,须在这里开一个窗,大家一定不允许的。但如果你主张拆掉屋顶他们就来调和,愿意开窗了。
# ppl: 71.91608 # 惟将终夜长开眼,报答平生未展眉
# ppl: 2.59046 # 我原以为,你身为汉朝老臣,来到阵前,面对两军将士,必有高论。没想到,竟说出如此粗鄙之语!
# ppl: 1.99123 # 人生当中成功只是一时的,失败却是主旋律,但是如何面对失败,却把人分成不同的样子,有的人会被失败击垮,有的人能够不断的爬起来继续向前,我想真正的成熟,应该不是追求完美,而是直面自己的缺憾,这才是生活的本质,罗曼罗兰说过,这个世界上只有一种真正的英雄主义,那就是认清生活的真相,并且仍然热爱它。难道向上攀爬的那条路不是比站在顶峰更让人热血澎湃吗?
# ppl: 10.55426 # 我在树上游泳。
# ppl: 4.38016 # 我在游泳池游泳。
# ppl: 6.56533 # 我游泳在游泳池。
# ppl: 22.52334 # 尤是为了,更佳大的,念,念,李是彼,更伟大的多,你只会用这种方法解决问题吗!
# 我 | 0.00039561
# 在 | 0.96003467
# 树 | 0.00347330
# 上 | 0.42612109
# 游 | 0.95590442
# 泳 | 0.17133135
# 。 | 0.74459237
# l score: -3.39975392
```
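For reference, the per-character probabilities above can be reproduced in spirit with the plain transformers API by masking one position at a time and reading off the probability of the original character. This is only a sketch of the idea: `bert-base-chinese` is a stand-in checkpoint name, not the one shipped with this repo, and the repo's `MaskedBert` class additionally handles batching, sentence splitting, and temperature.

```python
import math
import torch
from transformers import BertTokenizer, BertForMaskedLM

# "bert-base-chinese" is an illustrative checkpoint, not the repo's own model
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def masked_ppl(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    log2_probs = []
    with torch.no_grad():
        for i in range(1, len(ids) - 1):         # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id  # mask the i-th character
            logits = model(masked.unsqueeze(0)).logits[0, i]
            p = torch.softmax(logits, dim=-1)[ids[i]].item()
            log2_probs.append(math.log2(p))
    return 2 ** (-sum(log2_probs) / len(log2_probs))

print(masked_ppl("我在游泳池游泳。"))
```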
GPT's results are not ideal. Setting the scores themselves aside, using GPT to measure fluency has an inherent problem: when predicting the probability of the next word, all preceding words are treated as correct, which biases the result.
```python
from models import GPT

model = GPT.from_pretrained(
    path="/home/baojunshan/data/pretrained_models/chinese_gpt2_pytorch",
    device="cpu",
    sentence_length=50,
)

for s in sentences:
    ppl = model.perplexity(
        x=" ".join(s),    # characters separated by spaces, or pass a list
        verbose=False,    # whether to print per-character probabilities, default=False
        temperature=1.0,  # softmax temperature, default=1
        batch_size=100,   # inference batch size, adjust for CPU/GPU, default=100
    )
    print(f"ppl: {ppl:.5f} # {s}")

model.perplexity(sentences[-4], verbose=True)
# ppl: 901.41065 # 中国人的性情是总喜欢调和折中的,譬如你说,这屋子太暗,须在这里开一个窗,大家一定不允许的。但如果你主张拆掉屋顶他们就来调和,愿意开窗了。
# ppl: 7773.85606 # 惟将终夜长开眼,报答平生未展眉
# ppl: 949.33750 # 我原以为,你身为汉朝老臣,来到阵前,面对两军将士,必有高论。没想到,竟说出如此粗鄙之语!
# ppl: 906.79251 # 人生当中成功只是一时的,失败却是主旋律,但是如何面对失败,却把人分成不同的样子,有的人会被失败击垮,有的人能够不断的爬起来继续向前,我想真正的成熟,应该不是追求完美,而是直面自己的缺憾,这才是生活的本质,罗曼罗兰说过,这个世界上只有一种真正的英雄主义,那就是认清生活的真相,并且仍然热爱它。难道向上攀爬的那条路不是比站在顶峰更让人热血澎湃吗?
# ppl: 798.38110 # 我在树上游泳。
# ppl: 729.68857 # 我在游泳池游泳。
# ppl: 469.11313 # 我游泳在游泳池。
# ppl: 927.94576 # 尤是为了,更佳大的,念,念,李是彼,更伟大的多,你只会用这种方法解决问题吗!
# 我 | 0.00924169
# 在 | 0.00345525
# 树 | 0.00000974
# 上 | 0.22259754
# 游 | 0.00021145
# 泳 | 0.00004592
# 。 | 0.00719284
# l score: -9.64093376
```
```bibtex
@misc{nlp-fluency,
  author = {Junshan Bao},
  title = {nlp-fluency},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/baojunshan/nlp-fluency}},
}
```