adnb34g 发表于 2019-3-29 13:30:46

汉语言处理工具pyhanlp的拼音转换与字符正则化

汉字转拼音HanLP中的汉字转拼音功能也十分的强大。说明:l HanLP不仅支持基础的汉字转拼音,还支持声母、韵母、音调、音标和输入法首字母首声母功能。l HanLP能够识别多音字,也能给繁体中文注拼音。l 最重要的是,HanLP采用的模式匹配升级到AhoCorasickDoubleArrayTrie,性能大幅提升,能够提供毫秒级的响应速度!算法详解:l 《汉字转拼音与简繁转换的Java实现》1. # 汉字转拼音2. Pinyin = JClass("com.hankcs.hanlp.dictionary.py.Pinyin")3. text = "重载不是重任!"4. pinyin_list = HanLP.convertToPinyinList(text) 5. print("原文,", end=" ")6. print(text)7. print("拼音(数字音调),", end=" ")8. print(pinyin_list)9. print("拼音(符号音调),", end=" ")10. for pinyin in pinyin_list:11. print("%s," % pinyin.getPinyinWithToneMark(), end=" ")12. print("\n拼音(无音调),", end=" ")13. for pinyin in pinyin_list:14. print("%s," % pinyin.getPinyinWithoutTone(), end=" ")15. print("\n声调,", end=" ")16. for pinyin in pinyin_list:17. print("%s," % pinyin.getTone(), end=" ")18. print("\n声母,", end=" ")19. for pinyin in pinyin_list:20. print("%s," % pinyin.getShengmu(), end=" ")21. print("\n韵母,", end=" ")22. for pinyin in pinyin_list:23. print("%s," % pinyin.getYunmu(), end=" ")24. print("\n输入法头,", end=" ")25. for pinyin in pinyin_list:26. print("%s," % pinyin.getHead(), end=" ") 27. print()28. # 拼音转换可选保留无拼音的原字符29. print(HanLP.convertToPinyinString("截至2012年,", " ", True))30. print(HanLP.convertToPinyinString("截至2012年,", " ", False)) 1.原文, 重载不是重任!2.拼音(数字音调), 3.拼音(符号音调), chóng, zǎi, bú, shì, zhòng, rèn, none, 4.拼音(无音调), chong, zai, bu, shi, zhong, ren, none, 5.声调, 2, 3, 2, 4, 4, 4, 5, 6.声母, ch, z, b, sh, zh, r, none, 7.韵母, ong, ai, u, i, ong, en, none, 8.输入法头, ch, z, b, sh, zh, r, none, 9.jie zhi none none none none nian none10.jie zhi 2 0 1 2 nian ,拼音转中文HanLP中的数据结构和接口是灵活的,组合这些接口,可以自己创造新功能,我们可以使用AhoCorasickDoubleArrayTrie实现的最长分词器,需要用户调用setTrie()提供一个AhoCorasickDoubleArrayTrie 1.StringDictionary = JClass(2."com.hankcs.hanlp.corpus.dictionary.StringDictionary")3.CommonAhoCorasickDoubleArrayTrieSegment = JClass(4."com.hankcs.hanlp.seg.Other.CommonAhoCorasickDoubleArrayTrieSegment")5.Config = JClass("com.hankcs.hanlp.HanLP$Config")6.7.TreeMap = JClass("java.util.TreeMap")8.TreeSet = JClass("java.util.TreeSet")9.10.dictionary = StringDictionary()11.dictionary.load(Config.PinyinDictionaryPath)12.entry = {}13.m_map = TreeMap()14.for entry in dictionary.entrySet():15.pinyins = entry.getValue().replace("[\\d,]", "")16.words = m_map.get(pinyins)17.if words is None:18.words = TreeSet()19.m_map.put(pinyins, words)20.words.add(entry.getKey())21.words = TreeSet()22.words.add("绿色")23.words.add("滤色")24.m_map.put("lvse", words)25.26.segment = CommonAhoCorasickDoubleArrayTrieSegment(m_map)27.print(segment.segment("renmenrenweiyalujiangbujianlvse"))28.print(segment.segment("lvsehaihaodajiadongxidayinji")) 1.]2., haihaodajiadongxidayinji/null] 字符正则化演示正规化字符配置项的效果(繁体->简体,全角->半角,大写->小写)。该配置项位于hanlp.properties中,通过Normalization=true来开启(现在直接通过HanLP.Config.Normalization开启即可)。 切换配置后必须删除CustomDictionary.txt.bin缓存,否则只影响动态插入的新词。在我动笔前一个星期,已经有同学添加了,添加自定义词典之后,自动删除缓存的功能。地址请参阅github.com/hankcs/HanLP/pull/954,现在只需要开启正则化即可 1.CustomDictionary =JClass("com.hankcs.hanlp.dictionary.CustomDictionary")2.print("HanLP.Config.Normalization = False\n")3.HanLP.Config.Normalization = False4.CustomDictionary.insert("爱听4G", "nz 1000")5.print(HanLP.segment("爱听4g"))6.print(HanLP.segment("爱听4G"))7.print(HanLP.segment("爱听4G"))8.print(HanLP.segment("爱听4G"))9.print(HanLP.segment("愛聽4G"))10.11.print(HanLP.segment("喜欢4G"))12.print(HanLP.segment("hankcs在臺灣寫代碼")) 13.14.print("\nHanLP.Config.Normalization = True\n")15.HanLP.Config.Normalization = True16.print(HanLP.segment("爱听4g"))17.print(HanLP.segment("爱听4G"))18.print(HanLP.segment("爱听4G"))19.print(HanLP.segment("爱听4G"))20.print(HanLP.segment("愛聽4G"))21.22.print(HanLP.segment("喜欢4G"))23.print(HanLP.segment("hankcs在臺灣寫代碼"))24.25.HanLP.Config.ShowTermNature = False27.text = HanLP.s2tw("现在的HanLP已经添加了添加自定义词典之后,自动删除缓存的功能,现在只需要开启正则化即可")28.print(text)29.print(HanLP.segment(text))30.HanLP.Config.ShowTermNature = False 1.HanLP.Config.Normalization = False2.3.[爱听4g]4.[爱听4G]5.[爱, 听, 4, G]6.[爱, 听, 4, G]7.[愛, 聽, 4, G]8.[喜欢, 4, G]9.10.11.HanLP.Config.Normalization = True12.13.[爱听4g]14.[爱听4g]15.[爱听4g]16.[爱听4g]17.[爱听4g]18.[喜欢, 4, g]19.20.現在的HanLP已經新增了新增自定義詞典之後,自動刪除快取的功能,現在只需要開啟正則化即可21.[现在, 的, hanlp, 已经, 新增, 了, 新增, 自定义, 词典, 之后, ,, 自动, 删除, 快, 取, 的---------------------
页: [1]
查看完整版本: 汉语言处理工具pyhanlp的拼音转换与字符正则化