Written by Nickolay Shmyrev
Why Chinese WER is important
It has been almost a year since we last updated the news here. Time goes
fast and new things keep us busy. There is some news to discuss, but most
of it is worth a Twitter post, not a blog post.
We recently spent some time working on Chinese. It is a very active area
with many strong participants. Wenet has 3 active chats on WeChat, and
there are Baidu, Pika, SpeechHome, SpeechIO and many other communities.
There is an interesting ongoing SpeechIO evaluation published
just last month, something definitely missing in other parts of the
world. Nvidia also recently trained a Mandarin Conformer model.
The K2 team also publishes very nice models.
We see more progress, with attention on streaming and latency. Latency
issues are really important for real-life applications. Many teams declare
low latency, but very few actually provide it, including low latency
of intermediate results. The K2 team is doing a nice job here.
In general, the direction of research is still the same: bigger models
which require more compute power and training on more data. We do see
some progress here. However, there are voices raised about other aspects
of the current approach: scalability of compute and naturalness of the
result. One nice recent paper is Gary Marcus's analysis of DALL-E 2, where
the system fails to do even basic things. It is nice to see that our
focus on interpretable AI which learns continuously is getting more
attention. For example, there is a nice ContinualAI project.
There is an opposite movement as well. Some companies are training
end-to-end models from pure sound to letters for 40 languages at once and
claim good accuracy. Such models are good when audio is clear, but they
degrade significantly in the presence of noise. The accuracy is reported
on clean audiobook datasets, so you never see the difference. In general,
if we want accuracy, we need to understand not just the sounds, but also
the words and even the overall context of the audio. A nice recent paper
in this direction is Effective Cross-Utterance Language Modeling for
Conversational Speech Recognition.
The Chinese language has its own specifics here, as usual. Traditionally,
Chinese is written without spaces between words, which leads to the idea
that you do not need to measure word error rate and that it is enough to
measure character error rate instead. That is what everyone is doing.
However, by focusing on characters we totally lose high-level semantics,
which means we reduce interpretability. As a result, the system might
behave well on clean audio but will degrade much more significantly in
noise. We consider word segmentation and word-based models very important
for stability. Of course, it is not full semantics yet, but it is still an
important way to build an explainable (and tunable) speech recognition
model.
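To make the difference between the two metrics concrete, here is a minimal
sketch in Python. It assumes the reference and the hypothesis are already
segmented into words; the function names and the example sentence are made
up for illustration and do not come from any particular toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer(ref_words, hyp_words):
    """Word error rate over pre-segmented word lists."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(ref_words, hyp_words):
    """Character error rate: the same words joined and compared per character."""
    ref_chars = list("".join(ref_words))
    hyp_chars = list("".join(hyp_words))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Illustrative example: one wrong character inside one word.
ref = ["我们", "喜欢", "语音", "识别"]
hyp = ["我们", "喜欢", "雨音", "识别"]

print("CER:", cer(ref, hyp))
print("WER:", wer(ref, hyp))
```

A single wrong character costs one out of eight characters but one out of
four words, so the word-level metric penalizes a broken word more visibly
than the character-level one.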
There are a number of NLP tools for Chinese LM training with words. Some
of them use HMMs (Jieba), some use CRFs (pkuseg), and some use lightweight
networks (LSTMs), like Baidu’s LAC. Some use heavy BERT-based transformers
and even computer vision (Zen). The task is more or less well understood.
However, the standard test set is a bit artificial and not universal
enough, as usual. Also, the metrics don’t cover the speed of segmentation.
https://paperswithcode.com/task/chinese-word-segmentation
BERT-based models (F1 above 0.98) are crazy slow: you won’t be able
to process more than 100 KB of text with them, even on a GPU. If you
want to process terabytes, lightweight algorithms are your choice. Also,
BERT-based models are unstable. They show good accuracy on evaluation
sets, but anything practical from the web totally confuses them. LAC (F1
around 0.84) is OK, but it prefers to pick longer words, which is not very
good for word coverage in ASR. Jieba is fast, but not very accurate (F1
around 0.7).
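For reference, segmentation F1 is usually computed over word boundaries: a
predicted word counts as correct only if both its start and its end match
the gold segmentation. The sketch below is a minimal illustration of that
idea, not the scoring script of any benchmark; the function names and the
example are hypothetical.

```python
def word_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold_words, pred_words):
    """F1 over word spans; both inputs must cover the same characters."""
    gold = word_spans(gold_words)
    pred = word_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# A segmenter that prefers longer words merges the last two gold words.
gold = ["我们", "喜欢", "语音", "识别"]
pred = ["我们", "喜欢", "语音识别"]

print("F1:", segmentation_f1(gold, pred))
```

In this example the merged word produces one wrong span and misses two
gold words, so precision is 2/3 and recall is 2/4. The metric says nothing
about how fast the segmenter runs, so speed has to be measured separately.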
As a result, we decided to proceed with pkuseg, with small
modifications. It is a good project with a focus on practical
segmentation. F1 is around 0.79. We find it works well for a variety of
domains. The only minor issue is that it doesn’t split numbers, but that
is an ASR-specific step which can be solved with post-processing. Thanks
to 马勇 (Daniel) from the WeChat group for recommending it.
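As an illustration of such post-processing, here is a minimal sketch that
takes already segmented tokens and splits every run of digits into
single-digit tokens. This is only one possible reading of "splitting
numbers"; the function name, the per-digit policy, and the example
sentence are our assumptions for the sketch, not the actual modification
applied to pkuseg.

```python
import re

# Runs of decimal digits inside a token.
DIGIT_RUN = re.compile(r"\d+")


def split_numbers(tokens):
    """Split digit runs inside tokens into one token per digit.

    Non-digit parts of a token are kept as they are.
    """
    result = []
    for token in tokens:
        pos = 0
        for match in DIGIT_RUN.finditer(token):
            if match.start() > pos:
                result.append(token[pos:match.start()])  # text before digits
            result.extend(match.group())                  # one token per digit
            pos = match.end()
        if pos < len(token):
            result.append(token[pos:])                    # text after digits
    return result


tokens = ["2022年", "发布", "了", "3个", "模型"]
print(split_numbers(tokens))
```

A real ASR pipeline would probably go further and convert the digits into
their spoken form, which depends on context and is out of scope for this
sketch.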