Segmenting Chinese Dialogues with jieba, langid, and the Stanford Segmenter

When training a word2vec model on Chinese data, word segmentation is an essential first step: Chinese text has no spaces between words, so sentences must be split into words automatically.

Fortunately, several open-source tools for Chinese word segmentation are available.

Java

The Stanford Word Segmenter is built by the Stanford NLP Group and fits into a general pipeline for Chinese NLP engineering.

http://nlp.stanford.edu/software/segmenter.shtml

The script below segments every .tsv dialogue file in a directory, masking tab characters first so the segmenter does not break the column structure:

#! /bin/bash
###########################################
# Process segmenting using stanford-segmenter
###########################################

# constants
baseDir=$(cd "$(dirname "$0")"; pwd)
SEGMENT_CMD=nlp.stanford.edu/stanford-segmenter-2015-12-09/segment.sh
workDir=$baseDir/dialogues-segmented

# functions
# Segment a single file. Tabs are replaced with a placeholder so the
# segmenter leaves the TSV columns intact, then restored afterwards.
# Note: \t in the sed pattern requires GNU sed; with BSD sed use a literal tab.
function process_file(){
    echo "process file" "$1"
    sed 's/\t/@_tab_@/g' "$1" > "$1-tmp"
    bash -x "$SEGMENT_CMD" -k ctb "$1-tmp" UTF-8 0 > "$1-segmented-tmp"
    sed 's/@_tab_@/\t/g' "$1-segmented-tmp" > "$1-segmented"
    rm "$1-tmp" "$1-segmented-tmp"
}

# Segment every .tsv file under the working directory.
function loop_file(){
    cd "$workDir" || exit 1
    for x in $(find . -name "*.tsv"); do
        process_file "$x"
    done
}

# main
# Bail out early if the script is sourced instead of executed.
[ -z "${BASH_SOURCE[0]}" -o "${BASH_SOURCE[0]}" = "$0" ] || return

if [ -f "$SEGMENT_CMD" ]; then
    echo "Segmenter $SEGMENT_CMD exists."
    loop_file
else
    echo "Error: Segmenter $SEGMENT_CMD does not exist."
    exit 1
fi

Python

jieba

sudo pip install jieba

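A minimal sketch of segmenting a sentence with jieba (the sample sentence comes from jieba's own documentation; jieba.cut returns a generator of tokens):

# -*- coding: utf-8 -*-
import jieba

sentence = u"我来到北京清华大学"
# join the tokens with spaces, the input format word2vec expects
print(" ".join(jieba.cut(sentence)))
# expected output: 我 来到 北京 清华大学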

Also in Python, langid can be used to detect which language a piece of text is written in, which is handy for filtering non-Chinese lines out of the data.

sudo pip install langid

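A minimal sketch: langid.classify returns a (language code, score) pair, where the score is langid's internal confidence value (an unnormalized log-probability unless normalization is enabled):

import langid

print(langid.classify(u"这是一个中文句子"))               # e.g. ('zh', ...)
print(langid.classify(u"This is an English sentence."))  # e.g. ('en', ...)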

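Putting the pieces together: once the dialogues are segmented into space-separated tokens, they can be fed directly into word2vec. Below is a minimal sketch using gensim; the file path is hypothetical, and in older gensim releases the vector_size parameter is named size:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# hypothetical path: one segmented sentence per line, tokens separated by whitespace
sentences = LineSentence("dialogues-segmented/example.tsv-segmented")
model = Word2Vec(sentences, vector_size=100, window=5, min_count=2)
model.save("dialogues.word2vec")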