Converts Chinese characters to Pinyin. It can be used for phonetic annotation, sorting, and retrieval of Chinese characters (Russian translation).
The initial version of the code was based on the implementation in hotoo/pinyin.
pip install pypinyin
>>> from pypinyin import pinyin, lazy_pinyin, Style
>>> pinyin('中心')  # or pinyin(['中心']); a list argument means the input is already segmented into words
[['zhōng'], ['xīn']]
>>> pinyin('中心', heteronym=True)  # enable heteronym (multiple readings) mode
[['zhōng', 'zhòng'], ['xīn']]
>>> pinyin('中心', style=Style.FIRST_LETTER)  # set the pinyin style
[['z'], ['x']]
>>> pinyin('中心', style=Style.TONE2, heteronym=True)
[['zho1ng', 'zho4ng'], ['xi1n']]
>>> pinyin('中心', style=Style.TONE3, heteronym=True)
[['zhong1', 'zhong4'], ['xin1']]
>>> pinyin('中心', style=Style.BOPOMOFO)  # Bopomofo (Zhuyin) style
[['ㄓㄨㄥ'], ['ㄒㄧㄣ']]
>>> lazy_pinyin('威妥玛拼音', style=Style.WADEGILES)
['wei', "t'o", 'ma', "p'in", 'yin']
>>> lazy_pinyin('中心')  # ignore heteronyms
['zhong', 'xin']
>>> lazy_pinyin('战略', v_to_u=True)  # use ü instead of v
['zhan', 'lüe']
# use 5 to mark the neutral tone
>>> lazy_pinyin('衣裳', style=Style.TONE3, neutral_tone_with_five=True)
['yi1', 'shang5']
# tone sandhi: nǐ hǎo -> ní hǎo
>>> lazy_pinyin('你好', style=Style.TONE2, tone_sandhi=True)
['ni2', 'ha3o']
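The tone sandhi example above (nǐ hǎo -> ní hǎo) follows the common rule that a third tone directly before another third tone is pronounced as a second tone. The following is only a minimal sketch of that single rule on TONE3-style syllables. The function name `third_tone_sandhi` is hypothetical and this is not how pypinyin implements `tone_sandhi=True`, which may cover more cases.

```python
def third_tone_sandhi(syllables):
    """Apply the 3-3 -> 2-3 rule to a list of TONE3-style syllables.

    Hypothetical helper for illustration only. Each decision is based on
    the original tones, so a run of third tones becomes 2-2-...-3.
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            # Replace the trailing tone number 3 with 2.
            result[i] = syllables[i][:-1] + "2"
    return result


print(third_tone_sandhi(["ni3", "hao3"]))  # ['ni2', 'hao3']
```

Syllables without a third-tone neighbour, such as `['zhong1', 'xin1']`, pass through unchanged.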
Things to note:
- By default, the neutral tone is not marked in the result (pass neutral_tone_with_five=True to mark the neutral tone with 5).
- By default, pinyin styles without tone marks use v to represent ü (pass v_to_u=True to use ü instead of v).
- The pinyin of 嗯 is not en, as most people think, and some pinyin have neither an initial nor a final. See the explanation in the FAQ below for details.
Command line tools:
$ pypinyin 音乐
yīn yuè
$ python -m pypinyin.tools.toneconvert to-tone 'zhong4 xin1'
zhòng xīn
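To make the `to-tone` conversion above more concrete, here is a minimal sketch of how a numbered syllable such as `zhong4` can be turned into `zhòng`. It assumes the usual placement rule (a or e takes the mark, then the o of "ou", otherwise the last vowel) and whitespace-separated lowercase syllables. All names are hypothetical and this is not the code behind `pypinyin.tools.toneconvert`.

```python
import re
import unicodedata

# Combining diacritics for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300"}


def mark_position(letters):
    """Return the index of the vowel that carries the tone mark, or -1."""
    for vowel in ("a", "e"):
        if vowel in letters:
            return letters.index(vowel)
    if "ou" in letters:
        return letters.index("o")
    # Otherwise the mark goes on the last vowel of the syllable.
    positions = [i for i, ch in enumerate(letters) if ch in "iouü"]
    return positions[-1] if positions else -1


def numbered_to_marked(syllable):
    """Convert one TONE3-style syllable (e.g. 'zhong4') to tone-mark form."""
    match = re.fullmatch(r"([a-zü]+)([1-5])", syllable)
    if not match:
        return syllable  # leave anything unexpected untouched
    letters, tone = match.groups()
    pos = mark_position(letters)
    if tone == "5" or pos < 0:
        return letters  # neutral tone: no mark
    marked = letters[: pos + 1] + TONE_MARKS[tone] + letters[pos + 1 :]
    # Compose base letter and combining mark into a single character.
    return unicodedata.normalize("NFC", marked)


def convert(text):
    return " ".join(numbered_to_marked(s) for s in text.split())


print(convert("zhong4 xin1"))  # zhòng xīn
```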
For detailed documentation, please visit: https://pypinyin.readthedocs.io/.
For questions about project code development, you can check out the development documentation.
Pinyin accuracy can be improved by the following methods:
>>> from pypinyin import load_phrases_dict, load_single_dict
>>> load_phrases_dict({'桔子': [['jú'], ['zǐ']]})  # add the phrase "桔子"
>>> load_single_dict({ord('还'): 'hái,huán'})  # reorder or override the default pinyin of "还"
# use the pinyin data in cc_cedict.txt from the phrase-pinyin-data project to improve results
>>> from pypinyin_dict.phrase_pinyin_data import cc_cedict
>>> cc_cedict.load()
# use the pinyin data in kXHC1983.txt from the pinyin-data project to improve results
>>> from pypinyin_dict.pinyin_data import kxhc1983
>>> kxhc1983.load()
>>> # segment the text with another word segmentation module, such as jieba,
>>> # or apply another segmentation algorithm to the word data in phrases_dict.py
>>> import jieba
>>> from pypinyin import pinyin
>>> words = list(jieba.cut('每股24.67美元的确定性协议'))
>>> pinyin(words)
>>> from pypinyin import Style, pinyin
>>> pinyin('下雨天', style=Style.INITIALS)
[['x'], [''], ['t']]
The result for 雨 is an empty string because, according to the Hanyu Pinyin scheme (汉语拼音方案), y, w, and ü (yu) are not initials.
In the initials style (INITIALS), characters such as 雨, 我, and 圆 return an empty string, because according to the Hanyu Pinyin scheme y, w, and ü (yu) are not initials. Instead, y or w is added to certain finals when there is no initial, and ü has its own specific rules. -- @hotoo
If this causes you trouble, also be careful with characters that have no initial at all (such as 啊, 饿, 按, and 昂). In that case, what you may need is the first-letter style (FIRST_LETTER). -- @hotoo
Reference: hotoo/pinyin#57, #22, #27, #44
If this behavior is not what you want and you simply want y treated as an initial, you can specify strict=False, which may meet your expectations:
>>> from pypinyin import Style, pinyin
>>> pinyin('下雨天', style=Style.INITIALS)
[['x'], [''], ['t']]
>>> pinyin('下雨天', style=Style.INITIALS, strict=False)
[['x'], ['y'], ['t']]
See the effects of the strict parameter for details.
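The difference between the two modes can be illustrated with a small sketch that extracts the initial from a toneless syllable. It assumes the 21 initials of the Hanyu Pinyin scheme for strict mode and simply adds y and w in non-strict mode. The names below are hypothetical, and pypinyin's real `strict` handling covers more than this (syllabic nasals such as the pinyin of 嗯 are not handled here).

```python
# Two-letter initials come first so that "zh" is matched before "z".
STRICT_INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
)
# Assumption for this sketch: non-strict mode also treats y and w as initials.
LOOSE_INITIALS = STRICT_INITIALS + ("y", "w")


def initial_sketch(syllable, strict=True):
    """Return the initial of a toneless syllable, or '' if it has none."""
    candidates = STRICT_INITIALS if strict else LOOSE_INITIALS
    for initial in candidates:
        if syllable.startswith(initial):
            return initial
    return ""


syllables = ["xia", "yu", "tian"]  # toneless pinyin of 下雨天
print([initial_sketch(s) for s in syllables])                # ['x', '', 't']
print([initial_sketch(s, strict=False) for s in syllables])  # ['x', 'y', 't']
```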
In strict=True mode, a small number of pinyin have neither an initial nor a final. Examples are the following pinyin (from the characters 嗯, 呒, 呣, and 唔):
ń ńg ňg ǹg ň ǹ m̄ ḿ m̀
Note in particular that none of the pinyin readings of 嗯 has an initial or a final, and the default pinyin of 呣 has neither an initial nor a final. See #109 #259 #284 for details.
You can use the helper functions provided by the pypinyin.contrib.tone_convert module to convert standard pinyin into other pinyin styles, for example to convert zhōng to zhong, or to extract the initial or final of a pinyin syllable:
>>> from pypinyin.contrib.tone_convert import to_normal, to_tone, to_initials, to_finals
>>> to_normal('zhōng')
'zhong'
>>> to_tone('zhong1')
'zhōng'
>>> to_initials('zhōng')
'zh'
>>> to_finals('zhōng')
'ong'
For more helper functions for pinyin conversion, see the documentation of the pypinyin.contrib.tone_convert module.
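As a rough illustration of what such a conversion involves, the sketch below strips the tone mark from a syllable with Unicode decomposition and reports the tone as a number. The function names are hypothetical and this is not the module's implementation. Unlike the library's default for toneless styles, this sketch keeps ü rather than writing v.

```python
import unicodedata

# Combining diacritics mapped to tone numbers 1-4.
TONE_NUMBERS = {"\u0304": "1", "\u0301": "2", "\u030c": "3", "\u0300": "4"}


def split_tone(syllable):
    """Return (toneless letters, tone number or '') for one syllable."""
    tone = ""
    kept = []
    # NFD splits e.g. 'ō' into 'o' followed by a combining macron.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_NUMBERS:
            tone = TONE_NUMBERS[ch]
        else:
            kept.append(ch)
    # Recompose so that 'u' + diaeresis becomes the single character 'ü'.
    return unicodedata.normalize("NFC", "".join(kept)), tone


def to_normal_sketch(syllable):
    return split_tone(syllable)[0]


def to_tone3_sketch(syllable):
    letters, tone = split_tone(syllable)
    return letters + tone


print(to_normal_sketch("zhōng"))  # zhong
print(to_tone3_sketch("zhōng"))   # zhong1
```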
If pinyin accuracy is not a major concern, you can reduce memory usage by setting the environment variables PYPINYIN_NO_PHRASES and PYPINYIN_NO_DICT_COPY. See the documentation for details.
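One way to set these variables from within Python is sketched below. It rests on two assumptions that should be checked against the documentation: that the variables are read when pypinyin is first imported, so they must be set beforehand, and that any non-empty value such as "true" enables them.

```python
import os

# Assumption: pypinyin reads these variables at import time, so they
# have to be set before the first "import pypinyin" in the process.
os.environ["PYPINYIN_NO_PHRASES"] = "true"    # assumed value; skip the phrase dictionary
os.environ["PYPINYIN_NO_DICT_COPY"] = "true"  # assumed value; avoid copying the dictionaries

# Import pypinyin only after this point, for example:
# from pypinyin import lazy_pinyin
```

Setting the variables in the shell before starting Python has the same effect and avoids any dependence on import order.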
For more FAQ details, see the FAQ section of the documentation.