Comment Content Segmentation

Chinese Word Segmentation with jieba

GitHub project: jieba

jieba is very easy to use. Since we only need it to moderate comment text, we stick to the most basic segmentation feature. Below is an example using the default (accurate) mode:

```
import jieba

seg_list = jieba.cut("今天军训我摸鱼")

for word in seg_list:
    print(word)
```
Note that `jieba.cut` returns a generator rather than a list; iterating over it yields the tokens one by one.
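
For reference, jieba actually offers three segmentation modes; a quick sketch comparing them on the same sentence:

```
import jieba

text = "今天军训我摸鱼"

# Default (accurate) mode: splits the sentence into the most likely tokens
print("/".join(jieba.cut(text)))

# Full mode: lists every word the dictionary can find, overlaps included
print("/".join(jieba.cut(text, cut_all=True)))

# Search-engine mode: re-splits long tokens to improve recall
print("/".join(jieba.cut_for_search(text)))
```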

Next, we try segmenting with jieba and then checking each token against a stop-word list, so that comments containing banned words can be filtered out.

```
import jieba

jieba.load_userdict('stop-words.txt')

# Use a normal variable name instead of shadowing the built-in open()
with open('stop-words.txt', 'r', encoding='utf-8') as f:
    stopwords = f.read().split('\n')

seg_list = jieba.cut('今天军训我摸鱼')

for word in seg_list:
    print(word)
    if word in stopwords:
        print(f"The post contains a stop word: {word}")
```

Here we also load a user-defined dictionary with `jieba.load_userdict('stop-words.txt')`, which improves segmentation accuracy: the stop words themselves become known to the tokenizer, so they are more likely to be cut out as whole tokens.
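
Since stop-word lookups happen for every token of every comment, it is also worth holding the list in a set (O(1) membership tests instead of O(n) list scans). A minimal sketch of a reusable filter built on the same `stop-words.txt` (the `contains_stopword` helper is our own; `load_userdict` accepts a plain one-word-per-line file because the frequency and POS columns are optional):

```
import jieba

# stop-words.txt: one word per line, e.g.
#   摸鱼
#   广告
jieba.load_userdict('stop-words.txt')

with open('stop-words.txt', 'r', encoding='utf-8') as f:
    # A set gives O(1) lookups; strip() drops stray whitespace and blank lines
    stopwords = {line.strip() for line in f if line.strip()}

def contains_stopword(comment: str) -> bool:
    """Return True if any jieba token of the comment is a stop word."""
    return any(word in stopwords for word in jieba.cut(comment))

print(contains_stopword('今天军训我摸鱼'))  # True if 摸鱼 is in the list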

jieba does have some drawbacks; for example, importing the library is slow:

```
import jieba
```

In an actual test, we segmented a 231 KB plain-text file: loading jieba took about 26.34 s, while the segmentation itself took only about 1.25 s. The measurement code:

```
import time
import csv

# Start timing the import
start_jieba = time.time()
import jieba
end_jieba = time.time()
print(f"Loading jieba took {end_jieba - start_jieba} s")

# Start timing the segmentation
start = time.time()

with open('Three Body.txt', 'r', encoding='gbk') as file:
    text = file.read()
    print('Starting segmentation')
    seg_list = jieba.cut(text)
    with open('jieba_cut.csv', 'w', newline='', encoding='utf-8') as output_file:
        writer = csv.writer(output_file)
        for word in seg_list:
            writer.writerow([word])

# Stop timing
end = time.time()
print(f"Segmentation took {end - start} s")
```

Chinese Word Segmentation with HanLP

GitHub project: HanLP

For my comment-segmentation use case, the biggest difference between HanLP and jieba is accuracy: HanLP segments more precisely. When processing long text, it can perform much finer-grained analysis of syntax, part of speech, and word sense.

HanLP also offers a choice between multi-task and single-task models, letting us strike our own balance between **speed** and **accuracy**. The **multi-task** models, however, demand a lot of GPU memory (I never managed to run them successfully).
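
To make the trade-off concrete, here is a minimal sketch of loading each kind of model with the local API. The identifiers are examples from HanLP's pretrained model catalogue, not necessarily the exact ones we tried:

```
import hanlp

# Single-task: one small model per task; lighter, load only what you need
tok = hanlp.load(hanlp.pretrained.tok.FINE_ELECTRA_SMALL_ZH)
print(tok('今天军训我摸鱼'))

# Multi-task: one large model covering tok/pos/ner/srl/dep/sdp/con at once;
# more accurate overall but needs far more (GPU) memory
mtl = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)
print(mtl('今天军训我摸鱼', tasks='tok'))
```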

The HanLP developers also provide a very simple way to get started, via the RESTful API:

```
from hanlp_restful import HanLPClient

# auth=None means anonymous access; language='zh' for Chinese, 'mul' for multilingual
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')

HanLP.parse("2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。")
```

The call returns a JSON structure that looks like this:

```
{
  "tok/fine": [
    ["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次", "世代", "最", "先进", "的", "多", "语种", "NLP", "技术", "。"],
    ["阿婆主", "来到", "北京", "立方庭", "参观", "自然", "语义", "科技", "公司", "。"]
  ],
  "tok/coarse": [
    ["2021年", "HanLPv2.1", "为", "生产", "环境", "带来", "次世代", "最", "先进", "的", "多语种", "NLP", "技术", "。"],
    ["阿婆主", "来到", "北京立方庭", "参观", "自然语义科技公司", "。"]
  ],
  "pos/ctb": [
    ["NT", "NR", "P", "NN", "NN", "VV", "JJ", "NN", "AD", "JJ", "DEG", "CD", "NN", "NR", "NN", "PU"],
    ["NN", "VV", "NR", "NR", "VV", "NN", "NN", "NN", "NN", "PU"]
  ],
  "pos/pku": [
    ["t", "nx", "p", "vn", "n", "v", "b", "n", "d", "a", "u", "a", "n", "nx", "n", "w"],
    ["n", "v", "ns", "ns", "v", "n", "n", "n", "n", "w"]
  ],
  "pos/863": [
    ["nt", "w", "p", "v", "n", "v", "a", "nt", "d", "a", "u", "a", "n", "ws", "n", "w"],
    ["n", "v", "ns", "n", "v", "n", "n", "n", "n", "w"]
  ],
  "ner/pku": [
    [],
    [["北京立方庭", "ns", 2, 4], ["自然语义科技公司", "nt", 5, 9]]
  ],
  "ner/msra": [
    [["2021年", "DATE", 0, 1], ["HanLPv2.1", "ORGANIZATION", 1, 2]],
    [["北京", "LOCATION", 2, 3], ["立方庭", "LOCATION", 3, 4], ["自然语义科技公司", "ORGANIZATION", 5, 9]]
  ],
  "ner/ontonotes": [
    [["2021年", "DATE", 0, 1], ["HanLPv2.1", "ORG", 1, 2]],
    [["北京立方庭", "FAC", 2, 4], ["自然语义科技公司", "ORG", 5, 9]]
  ],
  "srl": [
    [[["2021年", "ARGM-TMP", 0, 1], ["HanLPv2.1", "ARG0", 1, 2], ["为生产环境", "ARG2", 2, 5], ["带来", "PRED", 5, 6], ["次世代最先进的多语种NLP技术", "ARG1", 6, 15]], [["最", "ARGM-ADV", 8, 9], ["先进", "PRED", 9, 10], ["技术", "ARG0", 14, 15]]],
    [[["阿婆主", "ARG0", 0, 1], ["来到", "PRED", 1, 2], ["北京立方庭", "ARG1", 2, 4]], [["阿婆主", "ARG0", 0, 1], ["参观", "PRED", 4, 5], ["自然语义科技公司", "ARG1", 5, 9]]]
  ],
  "dep": [
    [[6, "tmod"], [6, "nsubj"], [6, "prep"], [5, "nn"], [3, "pobj"], [0, "root"], [8, "amod"], [15, "nn"], [10, "advmod"], [15, "rcmod"], [10, "assm"], [13, "nummod"], [15, "nn"], [15, "nn"], [6, "dobj"], [6, "punct"]],
    [[2, "nsubj"], [0, "root"], [4, "nn"], [2, "dobj"], [2, "conj"], [9, "nn"], [9, "nn"], [9, "nn"], [5, "dobj"], [2, "punct"]]
  ],
  "sdp": [
    [[[6, "Time"]], [[6, "Exp"]], [[5, "mPrep"]], [[5, "Desc"]], [[6, "Datv"]], [[13, "dDesc"]], [[0, "Root"], [8, "Desc"], [13, "Desc"]], [[15, "Time"]], [[10, "mDegr"]], [[15, "Desc"]], [[10, "mAux"]], [[8, "Quan"], [13, "Quan"]], [[15, "Desc"]], [[15, "Nmod"]], [[6, "Pat"]], [[6, "mPunc"]]],
    [[[2, "Agt"], [5, "Agt"]], [[0, "Root"]], [[4, "Loc"]], [[2, "Lfin"]], [[2, "ePurp"]], [[8, "Nmod"]], [[9, "Nmod"]], [[9, "Nmod"]], [[5, "Datv"]], [[5, "mPunc"]]]
  ],
  "con": [
    ["TOP", [["IP", [["NP", [["NT", ["2021年"]]]], ["NP", [["NR", ["HanLPv2.1"]]]], ["VP", [["PP", [["P", ["为"]], ["NP", [["NN", ["生产"]], ["NN", ["环境"]]]]]], ["VP", [["VV", ["带来"]], ["NP", [["ADJP", [["NP", [["ADJP", [["JJ", ["次"]]]], ["NP", [["NN", ["世代"]]]]]], ["ADVP", [["AD", ["最"]]]], ["VP", [["JJ", ["先进"]]]]]], ["DEG", ["的"]], ["NP", [["QP", [["CD", ["多"]]]], ["NP", [["NN", ["语种"]]]]]], ["NP", [["NR", ["NLP"]], ["NN", ["技术"]]]]]]]]]], ["PU", ["。"]]]]]],
    ["TOP", [["IP", [["NP", [["NN", ["阿婆主"]]]], ["VP", [["VP", [["VV", ["来到"]], ["NP", [["NR", ["北京"]], ["NR", ["立方庭"]]]]]], ["VP", [["VV", ["参观"]], ["NP", [["NN", ["自然"]], ["NN", ["语义"]], ["NN", ["科技"]], ["NN", ["公司"]]]]]]]], ["PU", ["。"]]]]]]
  ]
}
```
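
Since `HanLPClient.parse` returns a `Document` (a dict subclass), the individual analyses can be pulled out by key. A minimal sketch, continuing from the client created above:

```
doc = HanLP.parse("2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。")

# Each value holds one list per sentence
for tokens in doc['tok/fine']:
    print('/'.join(tokens))

# The coarse tokenizer keeps multi-word expressions such as 次世代 intact
for tokens in doc['tok/coarse']:
    print('/'.join(tokens))
```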

Unlike jieba with its low time cost, loading HanLP took about 6.78 s and segmentation took a full 62.71 s. The measurement code:

```
import time
import csv

# Start timing the import and model loading
start = time.time()

import hanlp
HanLP = hanlp.pipeline() \
    .append(hanlp.utils.rules.split_sentence, output_key='sentences') \
    .append(hanlp.load('FINE_ELECTRA_SMALL_ZH'), output_key='tok') \
    .append(hanlp.load('CTB9_POS_ELECTRA_SMALL'), output_key='pos') \
    .append(hanlp.load('MSRA_NER_ELECTRA_SMALL_ZH'), output_key='ner', input_key='tok') \
    .append(hanlp.load('CTB9_DEP_ELECTRA_SMALL', conll=0), output_key='dep', input_key='tok') \
    .append(hanlp.load('CTB9_CON_ELECTRA_SMALL'), output_key='con', input_key='tok')

# Stop timing the load
end = time.time()
print(f"Loading HanLP took {end - start} s")

# Start timing the segmentation
start = time.time()

with open('Three Body.txt', 'r', encoding='gbk') as file:
    text = file.read()
    print('Starting segmentation')
    json_file = HanLP(text)
    # 'tok' holds one token list per sentence, so iterate two levels deep
    tok_fine = json_file['tok']
    with open('hanlp_cut.csv', 'w', newline='', encoding='utf-8') as output_file:
        writer = csv.writer(output_file)
        for sentence in tok_fine:
            for word in sentence:
                writer.writerow([word])

# Stop timing
end = time.time()
print(f"Segmentation took {end - start} s")
```
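
Most of those 62 seconds go into the POS, NER, dependency, and constituency models, which comment moderation does not need. If tokenization alone suffices, a slimmer setup is possible; a minimal sketch using only the small tokenizer (same model identifier as in the pipeline above):

```
import hanlp

# Load only the small fine-grained tokenizer
tok = hanlp.load('FINE_ELECTRA_SMALL_ZH')

# The tokenizer accepts a batch (list of sentences), which is much faster
# than calling it once per sentence
sentences = ['今天军训我摸鱼', '阿婆主来到北京立方庭参观自然语义科技公司。']
for tokens in tok(sentences):
    print('/'.join(tokens))
```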

Summary

jieba and HanLP each have their strengths, but for now we will most likely go with jieba: it is fast and consumes relatively few resources. If the opportunity arises, we would be happy to deploy HanLP on a high-performance GPU server someday, but that is a story for another time.

小树,小树!