Speech Recognition Testing with the Kaldi CVTE v2 Model (1/2)

Initial Files

First, install Kaldi by following the official documentation.

Then download http://kaldi-asr.org/models/m2 and extract it into egs/cvte, so that the directory kaldi/egs/cvte/s5 exists.

The following sections describe how to add new audio files and run recognition tests on them.

data/wav/chat001

This directory stores the audio files:

data/wav/chat001
├── 001.wav
└── 002.wav

For how to record audio files, refer to the companion article 语音处理常用工具集 and the commands it describes.

The audio file format:

$ sox --info data/wav/chat001/001.wav

Input File     : 'data/wav/chat001/001.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:06.25 = 100000 samples ~ 468.75 CDDA sectors
File Size      : 200k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
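If a recording does not already match this format, sox can convert it. A minimal sketch, assuming a hypothetical input file input.wav:

```shell
# Resample to 16 kHz, downmix to mono, and force 16-bit signed PCM,
# matching the format shown above.
sox input.wav -r 16000 -c 1 -b 16 -e signed-integer output.wav
```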

data/chat001/test

cd egs/cvte/s5
data/chat001/test
├── conf
│   └── fbank.conf
├── frame_shift
├── spk2utt
├── text
├── utt2spk
└── wav.scp

The conf directory and the frame_shift file are copied from data/fbank/test.

wav.scp: the list of audio files

CHAT001_20200801_001 data/wav/chat001/001.wav
CHAT001_20200801_002 data/wav/chat001/002.wav

The separator between the first and second columns is a tab; do not replace it with four spaces. The same applies to the files below.
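The wav.scp above can also be generated with a small shell loop; the utterance-ID prefix here simply follows the CHAT001_20200801 naming used in this example:

```shell
# Build wav.scp with tab-separated columns from the files in data/wav/chat001.
prefix=CHAT001_20200801
for f in data/wav/chat001/*.wav; do
  printf '%s_%s\t%s\n' "$prefix" "$(basename "$f" .wav)" "$f"
done > data/chat001/test/wav.scp
```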

text: the transcript corresponding to each audio file

CHAT001_20200801_001	上海 浦东机场 入境 防 输入 全 闭环 管理 
CHAT001_20200801_002	北京 地铁 宣武门 站 综合 改造 新增 换乘 通道

In text, the second column is a sequence of space-separated words; the vocabulary is defined in exp/chain/tdnn/graph/words.txt.

The corresponding phone set is in exp/chain/tdnn/graph/phones.txt.

The mapping between words and phones is in exp/chain/tdnn/graph/phones/align_lexicon.int.

In that file, integers such as 149 and 133 are phone IDs, defined in exp/chain/tdnn/graph/phones.txt.
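To inspect align_lexicon.int in readable form, the integer phone IDs can be mapped back to symbols with phones.txt as a lookup table. This is a pure-awk sketch; Kaldi also ships utils/int2sym.pl for the same job:

```shell
graph=exp/chain/tdnn/graph
# Column 1 of phones.txt is the symbol, column 2 its integer ID;
# columns 3+ of align_lexicon.int are the phone IDs for the word in columns 1-2.
awk 'NR==FNR { sym[$2] = $1; next }
     { out = $1 " " $2
       for (i = 3; i <= NF; i++) out = out " " sym[$i]
       print out }' $graph/phones.txt $graph/phones/align_lexicon.int
```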

spk2utt and utt2spk define the mapping between speakers and utterances.

$ cat data/chat001/test/utt2spk
CHAT001_20200801_001 CHAT001_20200801_001
CHAT001_20200801_002 CHAT001_20200801_002

$ cat data/chat001/test/spk2utt
CHAT001_20200801_001 CHAT001_20200801_001
CHAT001_20200801_002 CHAT001_20200801_002

Here each utterance ID is reused as its speaker ID. In Kaldi, "speaker" is a loose concept; ideally you would assign a separate ID to each distinct person speaking.
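With utterance IDs doubling as speaker IDs, both mapping files can be derived from wav.scp. Kaldi provides utils/utt2spk_to_spk2utt.pl for the second step; a pure-awk equivalent is sketched here:

```shell
# utt2spk: map each utterance to itself as the speaker.
awk '{ print $1, $1 }' data/chat001/test/wav.scp > data/chat001/test/utt2spk
# spk2utt: group utterances by speaker (same result as utt2spk_to_spk2utt.pl).
awk '{ utts[$2] = utts[$2] " " $1 }
     END { for (s in utts) print s utts[s] }' \
    data/chat001/test/utt2spk | sort > data/chat001/test/spk2utt
```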

Validating the initial files

utils/validate_data_dir.sh data/chat001/test

Fixing errors automatically

utils/fix_data_dir.sh data/chat001/test
The fix script takes care of things such as sorting the files.
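Most validation failures come from sort order: Kaldi requires every data file to be sorted on the first key in the C locale. sort -c checks this without modifying anything:

```shell
# Exits non-zero (and prints the offending line) if a file is out of order.
export LC_ALL=C
for f in wav.scp text utt2spk spk2utt; do
  sort -c -k1,1 data/chat001/test/$f
done
```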

Running Decoding and Checking the WER

The run script

kaldi/egs/cvte/s5/run_chat001.sh

#!/bin/bash


. ./cmd.sh
. ./path.sh

# step 1: generate fbank features
obj_dir=data/chat001

for x in test; do
  rm -rf fbank/$x
  mkdir -p fbank/$x

  # compute fbank without pitch
  steps/make_fbank.sh --nj 1 --cmd "run.pl" $obj_dir/$x exp/make_fbank/$x fbank/$x || exit 1;
  # compute cmvn
  steps/compute_cmvn_stats.sh $obj_dir/$x exp/fbank_cmvn/$x fbank/$x || exit 1;
done

# #step 2: offline-decoding
test_data=data/chat001/test
dir=exp/chain/tdnn

steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
  --nj 1 --num-threads 1 \
  --cmd "$decode_cmd" --iter final \
  --frames-per-chunk 50 \
  $dir/graph $test_data $dir/decode_chat001_test

# # note: the model is trained using "apply-cmvn-online",
# # so you can modify the corresponding code in steps/nnet3/decode.sh to obtain the best performance;
# # if you run steps/nnet3/decode.sh as-is, the performance is still good,
# # just slightly worse than with the "apply-cmvn-online" method.

The script runs in the following steps:

Step 1 - Generate features for the test data

$ tree data/chat001/test
data/chat001/test
├── cmvn.scp
├── conf
│   └── fbank.conf
├── feats.scp
├── frame_shift
├── spk2utt
├── split1
│   └── 1
│       ├── cmvn.scp
│       ├── feats.scp
│       ├── spk2utt
│       ├── text
│       ├── utt2dur
│       ├── utt2num_frames
│       ├── utt2spk
│       └── wav.scp
├── text
├── utt2dur
├── utt2num_frames
├── utt2spk
└── wav.scp

feats.scp, utt2dur, and utt2num_frames are all generated by make_fbank.sh, which also writes additional files under fbank/test.
cmvn.scp holds the normalization statistics, generated by steps/compute_cmvn_stats.sh.
The splitN folders are per-job subfolders created when a large dataset is processed in parallel and then merged.
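As a quick sanity check on the generated features, the second column of utt2num_frames can be summed to get the total frame count:

```shell
# Total number of feature frames across all utterances.
awk '{ sum += $2 } END { print sum, "frames" }' data/chat001/test/utt2num_frames
```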

The fbank/test directory:

fbank/test
├── cmvn_test.ark
├── cmvn_test.scp
├── raw_fbank_test.1.ark
└── raw_fbank_test.1.scp

The exp/make_fbank directory:

exp/make_fbank
└── test
    ├── make_fbank_test.1.log
    └── wav.1.scp

Step 2 - Decoding

steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
  --nj 1 --num-threads 1 \
  --cmd "$decode_cmd" --iter final \
  --frames-per-chunk 50 \
  $dir/graph $test_data $dir/decode_chat001_test

Decoding also computes the WER, and the decoder can be configured to output n-best lists.

Viewing the decoding log

cat exp/chain/tdnn/decode_chat001_test/log/decode.1.log

Best WER result

$ cat exp/chain/tdnn/decode_chat001_test/scoring_kaldi/best_cer
%WER 2.94 [ 1 / 34, 0 ins, 0 del, 1 sub ] exp/chain/tdnn/decode_chat001_test/cer_7_0.0

This test contains 34 characters in total; against the reference, the recognition result has 0 insertions, 0 deletions, and 1 substitution.
Per the decode log, the substitution is 防 recognized as 房. Separately, the word 闭环 is absent from the pronunciation lexicon and was output as the two individual characters 闭 and 环, which is identical at the character level, so the result can arguably be considered accurate.
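The 2.94% figure is simply the error count over the character count:

```shell
awk 'BEGIN { printf "%.2f%%\n", 100 * 1 / 34 }'   # 2.94%
```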

Other logs

The n-best WER outputs are under exp/chain/tdnn/decode_chat001_test/scoring_kaldi.

Interpreting the Decode Command

During decoding, the following command is executed:

# nnet3-latgen-faster --frame-subsampling-factor=3 --frames-per-chunk=50 --extra-left-context=0 --extra-right-context=0 --extra-left-context-initial=-1 --extra-right-context-final=-1 --minimize=false --max-active=7000 --min-active=200 --beam=15.0 --lattice-beam=8.0 --acoustic-scale=1.0 --allow-partial=true --word-symbol-table=exp/chain/tdnn/graph/words.txt exp/chain/tdnn/final.mdl exp/chain/tdnn/graph/HCLG.fst "ark,s,cs:apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:- |" "ark:|lattice-scale --acoustic-scale=10.0 ark:- ark:- | gzip -c >exp/chain/tdnn/decode_chat001_test/lat.1.gz"
# Started at Sat Aug  1 16:21:14 CST 2020
#
nnet3-latgen-faster --frame-subsampling-factor=3 --frames-per-chunk=50 --extra-left-context=0 --extra-right-context=0 --extra-left-context-initial=-1 --extra-right-context-final=-1 --minimize=false --max-active=7000 --min-active=200 --beam=15.0 --lattice-beam=8.0 --acoustic-scale=1.0 --allow-partial=true --word-symbol-table=exp/chain/tdnn/graph/words.txt exp/chain/tdnn/final.mdl exp/chain/tdnn/graph/HCLG.fst 'ark,s,cs:apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:- |' 'ark:|lattice-scale --acoustic-scale=10.0 ark:- ark:- | gzip -c >exp/chain/tdnn/decode_chat001_test/lat.1.gz'
LOG (nnet3-latgen-faster[5.5.765-f88d5]:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 1 orphan nodes.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 2 orphan components.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:Collapse():nnet-utils.cc:1472) Added 1 components, removed 2
lattice-scale --acoustic-scale=10.0 ark:- ark:-
apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:-
LOG (nnet3-latgen-faster[5.5.765-f88d5]:CheckAndFixConfigs():nnet-am-decodable-simple.cc:294) Increasing --frames-per-chunk from 50 to 51 to make it a multiple of --frame-subsampling-factor=3
CHAT001_20200801_001 上海 浦东机场 入境 房 输入 全 闭 环 管理
LOG (nnet3-latgen-faster[5.5.765-f88d5]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:375) Log-like per frame for utterance CHAT001_20200801_001 is 2.19918 over 208 frames.
LOG (apply-cmvn[5.5.765-f88d5]:main():apply-cmvn.cc:162) Applied cepstral mean normalization to 2 utterances, errors on 0
CHAT001_20200801_002 北京 地铁 宣武门 站 综合 改造 新增 换乘 通道
LOG (nnet3-latgen-faster[5.5.765-f88d5]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:375) Log-like per frame for utterance CHAT001_20200801_002 is 2.19511 over 333 frames.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:main():nnet3-latgen-faster.cc:256) Time taken 10.9386s: real-time factor assuming 100 frames/sec is 0.673972
LOG (nnet3-latgen-faster[5.5.765-f88d5]:main():nnet3-latgen-faster.cc:259) Done 2 utterances, failed for 0
LOG (nnet3-latgen-faster[5.5.765-f88d5]:main():nnet3-latgen-faster.cc:261) Overall log-likelihood per frame is 2.19668 over 541 frames.
LOG (nnet3-latgen-faster[5.5.765-f88d5]:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.00447 seconds taken in nnet3 compilation total (breakdown: 0.00219 compilation, 0.00168 optimization, 0 shortcut expansion, 0.000385 checking, 1.1e-05 computing indexes, 0.000209 misc.) + 0 I/O.
LOG (lattice-scale[5.5.765-f88d5]:main():lattice-scale.cc:107) Done 2 lattices.
# Accounting: time=53 threads=1
# Ended (code 0) at Sat Aug  1 16:22:07 CST 2020, elapsed time 53 seconds

Let's look at the argument list in detail:

nnet3-latgen-faster \
    --frame-subsampling-factor=3 \
    --frames-per-chunk=50 \
    --extra-left-context=0 \
    --extra-right-context=0 \
    --extra-left-context-initial=-1 \
    --extra-right-context-final=-1 \
    --minimize=false \
    --max-active=7000 \
    --min-active=200 \
    --beam=15.0 \
    --lattice-beam=8.0 \
    --acoustic-scale=1.0 \
    --allow-partial=true \
    --word-symbol-table=exp/chain/tdnn/graph/words.txt \
    exp/chain/tdnn/final.mdl \
    exp/chain/tdnn/graph/HCLG.fst \
    "ark,s,cs:apply-cmvn --norm-means=true --norm-vars=false --utt2spk=ark:data/chat001/test/split1/1/utt2spk scp:data/chat001/test/split1/1/cmvn.scp scp:data/chat001/test/split1/1/feats.scp ark:- |" \
    "ark:|lattice-scale --acoustic-scale=10.0 ark:- ark:- | gzip -c >exp/chain/tdnn/decode_chat001_test/lat.1.gz"

The nnet3-latgen-faster command is built on the LatticeFasterDecoder decoder, with acoustic scores supplied by an nnet3 model.

There are also similar commands, nnet3-latgen-faster-parallel and nnet3-latgen-faster-batch.

Printing the help of nnet3-latgen-faster:


Generate lattices using nnet3 neural net model.
Usage: nnet3-latgen-faster [options] <nnet-in> <fst-in|fsts-rspecifier> <features-rspecifier> <lattice-wspecifier> [ <words-wspecifier> [<alignments-wspecifier>] ]
See also: nnet3-latgen-faster-parallel, nnet3-latgen-faster-batch
 
Options:
  --acoustic-scale            : Scaling factor for acoustic log-likelihoods (caution: is a no-op if set in the program nnet3-compute (float, default = 0.1)
  --allow-partial             : If true, produce output even if end state was not reached. (bool, default = false)
  --beam                      : Decoding beam.  Larger->slower, more accurate. (float, default = 16)
  --beam-delta                : Increment used in decoding-- this parameter is obscure and relates to a speedup in the way the max-active constraint is applied.  Larger is more accurate. (float, default = 0.5)
  --computation.debug         : If true, turn on debug for the neural net computation (very verbose!) Will be turned on regardless if --verbose >= 5 (bool, default = false)
  --debug-computation         : If true, turn on debug for the actual computation (very verbose!) (bool, default = false)
  --delta                     : Tolerance used in determinization (float, default = 0.000976562)
  --determinize-lattice       : If true, determinize the lattice (lattice-determinization, keeping only best pdf-sequence for each word-sequence). (bool, default = true)
  --extra-left-context        : Number of frames of additional left-context to add on top of the neural net's inherent left context (may be useful in recurrent setups (int, default = 0)
  --extra-left-context-initial : If >= 0, overrides the --extra-left-context value at the start of an utterance. (int, default = -1)
  --extra-right-context       : Number of frames of additional right-context to add on top of the neural net's inherent right context (may be useful in recurrent setups (int, default = 0)
  --extra-right-context-final : If >= 0, overrides the --extra-right-context value at the end of an utterance. (int, default = -1)
  --frame-subsampling-factor  : Required if the frame-rate of the output (e.g. in 'chain' models) is less than the frame-rate of the original alignment. (int, default = 1)
  --frames-per-chunk          : Number of frames in each chunk that is separately evaluated by the neural net.  Measured before any subsampling, if the --frame-subsampling-factor options is used (i.e. counts input frames (int, default = 50)
  --hash-ratio                : Setting used in decoder to control hash behavior (float, default = 2)
  --ivectors                  : Rspecifier for iVectors as vectors (i.e. not estimated online); per utterance by default, or per speaker if you provide the --utt2spk option. (string, default = "")
  --lattice-beam              : Lattice generation beam.  Larger->slower, and deeper lattices (float, default = 10)
  --max-active                : Decoder max active states.  Larger->slower; more accurate (int, default = 2147483647)
  --max-mem                   : Maximum approximate memory usage in determinization (real usage might be many times this). (int, default = 50000000)
  --min-active                : Decoder minimum #active states. (int, default = 200)
  --minimize                  : If true, push and minimize after determinization. (bool, default = false)
  --online-ivector-period     : Number of frames between iVectors in matrices supplied to the --online-ivectors option (int, default = 0)
  --online-ivectors           : Rspecifier for iVectors estimated online, as matrices.  If you supply this, you must set the --online-ivector-period option. (string, default = "")
  --optimization.allocate-from-other : Instead of deleting a matrix of a given size and then allocating a matrix of the same size, allow re-use of that memory (bool, default = true)
  --optimization.allow-left-merge : Set to false to disable left-merging of variables in remove-assignments (obscure option) (bool, default = true)
  --optimization.allow-right-merge : Set to false to disable right-merging of variables in remove-assignments (obscure option) (bool, default = true)
  --optimization.backprop-in-place : Set to false to disable optimization that allows in-place backprop (bool, default = true)
  --optimization.consolidate-model-update : Set to false to disable optimization that consolidates the model-update phase of backprop (e.g. for recurrent architectures (bool, default = true)
  --optimization.convert-addition : Set to false to disable the optimization that converts Add commands into Copy commands wherever possible. (bool, default = true)
  --optimization.extend-matrices : This optimization can reduce memory requirements for TDNNs when applied together with --convert-addition=true (bool, default = true)
  --optimization.initialize-undefined : Set to false to disable optimization that avoids redundant zeroing (bool, default = true)
  --optimization.max-deriv-time : You can set this to the maximum t value that you want derivatives to be computed at when updating the model.  This is an optimization that saves time in the backprop phase for recurrent frameworks (int, default = 2147483647)
  --optimization.max-deriv-time-relative : An alternative mechanism for setting the --max-deriv-time, suitable for situations where the length of the egs is variable.  If set, it is equivalent to setting the --max-deriv-time to this value plus the largest 't' value in any 'output' node of the computation request. (int, default = 2147483647)
  --optimization.memory-compression-level : This is only relevant to training, not decoding.  Set this to 0,1,2; higher levels are more aggressive at reducing memory by compressing quantities needed for backprop, potentially at the expense of speed and the accuracy of derivatives.  0 means no compression at all; 1 means compression that shouldn't affect results at all. (int, default = 1)
  --optimization.min-deriv-time : You can set this to the minimum t value that you want derivatives to be computed at when updating the model.  This is an optimization that saves time in the backprop phase for recurrent frameworks (int, default = -2147483648)
  --optimization.move-sizing-commands : Set to false to disable optimization that moves matrix allocation and deallocation commands to conserve memory. (bool, default = true)
  --optimization.optimize     : Set this to false to turn off all optimizations (bool, default = true)
  --optimization.optimize-row-ops : Set to false to disable certain optimizations that act on operations of type *Row*. (bool, default = true)
  --optimization.propagate-in-place : Set to false to disable optimization that allows in-place propagation (bool, default = true)
  --optimization.remove-assignments : Set to false to disable optimization that removes redundant assignments (bool, default = true)
  --optimization.snip-row-ops : Set this to false to disable an optimization that reduces the size of certain per-row operations (bool, default = true)
  --optimization.split-row-ops : Set to false to disable an optimization that may replace some operations of type kCopyRowsMulti or kAddRowsMulti with up to two simpler operations. (bool, default = true)
  --phone-determinize         : If true, do an initial pass of determinization on both phones and words (see also --word-determinize) (bool, default = true)
  --prune-interval            : Interval (in frames) at which to prune tokens (int, default = 25)
  --utt2spk                   : Rspecifier for utt2spk option used to get ivectors per speaker (string, default = "")
  --word-determinize          : If true, do a second pass of determinization on words only (see also --phone-determinize) (bool, default = true)
  --word-symbol-table         : Symbol table for words [for debug output] (string, default = "")
 
Standard options:
  --config                    : Configuration file to read (this option may be repeated) (string, default = "")
  --help                      : Print out usage message (bool, default = false)
  --print-args                : Print the command line arguments (to stderr) (bool, default = true)
  --verbose                   : Verbose level (higher->more logging) (int, default = 0)

Further Reading

https://blog.csdn.net/qq_25750561/article/details/81070092

https://www.cnblogs.com/yszd/p/12192769.html

https://github.com/naxingyu/kaldi_cvte_model_test

Hailiang Wang @ Chatopera: chatbots, machine learning, intelligent customer service
Co-founder & CEO of Chatopera, which operates the chatbot platform https://bot.chatopera.com, helping chatbots go live. He has explored commercial applications of chatbots since 2015, building a natural-language-driven flow engine, speech recognition, and natural language understanding, and published the book 《智能问答与深度学习》 (Intelligent Question Answering and Deep Learning) in 2018.