Segmenting Chinese Dialogues with jieba, langid, and the Stanford Segmenter

When training a word2vec model on Chinese data, the sentences must first be segmented into words, because written Chinese does not separate words with whitespace. How should this segmentation be automated?

Fortunately, several open-source segmentation tools are available online.
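Once the dialogues are segmented into space-separated words, training the word vectors is straightforward. Below is a minimal sketch, assuming gensim >= 4.0; the input file name is hypothetical and stands for the segmenter output (one sentence per line, words separated by spaces).

from gensim.models import Word2Vec

# Each training sentence is a list of words; the file name and the
# hyperparameters are illustrative only.
with open("dialogues-segmented.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f if line.strip()]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=5)
model.save("dialogues.w2v")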

Java

The Stanford Segmenter, built by the Stanford NLP Group, runs on the JVM and fits into a general pipeline for Chinese NLP engineering:

http://nlp.stanford.edu/software/segmenter.shtml

The bash script below uses it to batch-segment tab-separated (.tsv) dialogue files.

#! /bin/bash
###########################################
# Batch-segment Chinese dialogue files with
# the Stanford Segmenter.
###########################################

# constants
baseDir=$(cd "$(dirname "$0")"; pwd)
# Path to segment.sh inside the unpacked Stanford Segmenter distribution;
# anchored to $baseDir so it still resolves after cd-ing into $workDir.
SEGMENT_CMD=$baseDir/nlp.stanford.edu/stanford-segmenter-2015-12-09/segment.sh
workDir=$baseDir/dialogues-segmented

# functions
function process_file(){
    echo "process file" "$1"
    # Mask tabs so the segmenter cannot break the TSV column structure.
    sed $'s/\t/@_tab_@/g' "$1" > "$1-tmp"
    # -k: keep whitespace as-is; ctb: Penn Chinese Treebank model;
    # UTF-8: input encoding; 0: output only the single best segmentation.
    bash "$SEGMENT_CMD" -k ctb "$1-tmp" UTF-8 0 > "$1-segmented-tmp"
    # Restore the tabs.
    sed $'s/@_tab_@/\t/g' "$1-segmented-tmp" > "$1-segmented"
    rm "$1-tmp" "$1-segmented-tmp"
}

function loop_file(){
    cd "$workDir" || exit 1
    for x in $(find . -name "*.tsv"); do
        process_file "$x"
    done
}

# main
# Run the body only when executed directly, not when sourced.
[ -z "${BASH_SOURCE[0]}" -o "${BASH_SOURCE[0]}" = "$0" ] || return

if [ -f "$SEGMENT_CMD" ]; then
    echo "Segmenter $SEGMENT_CMD exists."
    loop_file
else
    echo "Error: Segmenter $SEGMENT_CMD does not exist."
    exit 1
fi

Python

jieba

sudo pip install jieba

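A minimal usage sketch (the sample sentence is arbitrary):

import jieba

# Accurate mode (the default); lcut returns the segments as a list.
sentence = "在处理中文数据时需要自动分词"
words = jieba.lcut(sentence)
print(" ".join(words))

jieba.cut returns a generator instead of a list, which is preferable when streaming a large corpus.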

langid can also be used from Python to detect the language of a piece of text, for example to filter out non-Chinese lines before segmentation.

sudo pip install langid

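A minimal sketch of filtering lines by detected language (the sample strings are arbitrary):

import langid

# Restricting the candidate set usually improves accuracy when only a
# few languages are expected in the data.
langid.set_languages(["zh", "en"])

for line in ["我爱北京天安门", "hello world"]:
    lang, score = langid.classify(line)  # (language code, score)
    if lang == "zh":
        print(line)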

Hailiang Wang @Chatopera: chatbots, machine learning, intelligent customer service
Co-founder & CEO of Chatopera, which operates the chatbot platform https://bot.chatopera.com to get chatbots online. He has explored commercial applications of chatbots since 2015, building a flow engine driven by natural-language interaction, speech recognition, and natural language understanding, and published the book 《智能问答与深度学习》 (Intelligent Question Answering and Deep Learning) in 2018.