Why Chinese WER is important

It has been almost a year since we updated the news here. Time goes fast and new things keep us busy. There is some news to discuss, but most of it is worth a Twitter post rather than a blog entry.

We recently spent some time working on Chinese. It is a very active area with many strong participants. Wenet has 3 active chats on Wechat, and then there are Baidu, Pika, SpeechHome, SpeechIO and many other communities. There is an interesting ongoing SpeechIO evaluation published just last month, something definitely missing in other parts of the world. Nvidia also recently trained a Mandarin Conformer model, and the K2 team publishes very nice models too.

We also see more attention on streaming and latency. Latency is really important for real-life applications: many teams declare low latency, but very few actually provide it, in particular low latency of intermediate results. The K2 team is doing a nice job here.

In general, the direction of research is still the same: bigger models which require more compute power and training on more data. We do see some progress here. However, there are voices raised about other aspects of the current approach: the scalability of compute and the naturalness of the results. One nice recent piece is Gary Marcus' analysis of DALL-E 2, where the system fails to do even basic things. It is nice to see that our focus on interpretable AI which learns continuously gets more attention. For example, there is a nice ContinualAI project.

There is an opposite movement as well: some companies are training end-to-end models from raw sound to letters for 40 languages at once and claim good accuracy. Such models are good when the audio is clean, but they degrade significantly in the presence of noise. The accuracy is reported on a clean audiobook dataset, so you never see the difference. In general, if we want accuracy, we need to understand not just the sounds, but also the words and even the overall context of the audio. A nice recent paper in this direction is Effective Cross-Utterance Language Modeling for Conversational Speech Recognition.

Chinese has its own specifics here, as usual. Traditionally, Chinese words are written without spaces, which leads to the idea that you do not need to measure word error rate; measuring character error rate is enough. That is what everyone does. However, while focusing on characters, we totally lose the high-level semantics, which means we reduce interpretability. As a result, the system might behave well on clean audio but will degrade much more significantly in noise. We consider word segmentation and word-based models very important for stability. Of course, it is not full semantics yet, but it is still an important way to build an explainable (and tunable) speech recognition model.
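To make the difference concrete, here is a minimal sketch (not our scoring code) of how CER and WER diverge on the same hypothesis: CER compares character sequences directly, while WER compares the output of a word segmenter (jieba here, purely for illustration), so a single wrong character can break a whole word.

```python
import jieba

def edit_distance(ref, hyp):
    """Plain Levenshtein distance between two token sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

ref = "今天天气很好"        # "the weather is nice today"
hyp = "今天天汽很好"        # one wrong character (气 -> 汽)

cer = edit_distance(list(ref), list(hyp)) / len(ref)

ref_words = list(jieba.cut(ref))   # a character error turns into a word error
hyp_words = list(jieba.cut(hyp))
wer = edit_distance(ref_words, hyp_words) / len(ref_words)

print(f"CER = {cer:.2f}, WER = {wer:.2f}")
```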

There are a number of NLP segmentation tools that enable Chinese LM training with words: some of them use HMMs (Jieba), some use CRFs (pkuseg), some use lightweight networks (LSTMs) like Baidu's LAC, and some use heavy BERT-based transformers and even computer vision (Zen). The task is more or less well understood; however, the standard test set is a bit artificial and not universal enough, as usual. Also, the metrics don't cover the speed of segmentation.

https://paperswithcode.com/task/chinese-word-segmentation
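For reference, this is roughly how the tools above are invoked. The package names and calls are the ones published by the respective projects (jieba, pkuseg, Baidu's LAC), but check their repositories for current usage.

```python
import jieba
import pkuseg
from LAC import LAC

text = "我们最近花了一些时间研究中文语音识别"  # "we recently spent some time on Chinese speech recognition"

print(list(jieba.cut(text)))    # HMM-based, fast

seg = pkuseg.pkuseg()           # CRF-based, default mixed-domain model
print(seg.cut(text))

lac = LAC(mode='seg')           # Baidu's lightweight network, segmentation mode
print(lac.run(text))
```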

BERT-based models (F1 above 0.98) are crazy slow: you won't be able to process more than 100kb of text with them, even with a GPU. If you want to process terabytes, lightweight algorithms are your choice. Also, BERT algorithms are unstable: they show good accuracy on evaluation sets, but anything practical from the web totally confuses them. LAC (F1 around 0.84) is ok, but it prefers to pick longer words, which is not very good for word coverage in ASR. Jieba is fast, but not very accurate (F1 around 0.7).
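Since the published F1 tables say nothing about throughput, a rough speed check like the sketch below already makes the differences visible. We only time jieba and pkuseg here, and the repeated sample sentence is just a stand-in for real data.

```python
import time
import jieba
import pkuseg

# a few hundred thousand characters of synthetic text as a stand-in for a corpus
text = "我们最近花了一些时间研究中文语音识别。" * 20000

def throughput(name, segment):
    start = time.perf_counter()
    segment(text)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(text) / elapsed:,.0f} chars/sec")

throughput("jieba", lambda t: list(jieba.cut(t)))

seg = pkuseg.pkuseg()           # load the model outside the timed region
throughput("pkuseg", seg.cut)
```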

As a result, we decided to proceed with pkuseg with small modifications. It is a good project with a focus on practical segmentation; F1 is around 0.79, and we find it works well for a variety of domains. The only minor issue is that it doesn't split numbers, but that is an ASR-specific step which can be solved with post-processing. Thanks to 马勇 (Daniel) from the Wechat group for recommending it.
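For the number issue, a small post-processing pass is enough. The sketch below is one possible reading of it, assuming we only need digit runs separated from the surrounding characters; the actual normalization in a recipe can be more involved.

```python
import re
import pkuseg

seg = pkuseg.pkuseg()

def segment_for_asr(text):
    words = seg.cut(text)
    out = []
    for w in words:
        # pkuseg keeps tokens like "2021年" or "300" intact;
        # split the digit runs out so the word list stays manageable
        out.extend(p for p in re.split(r"(\d+)", w) if p)
    return out

print(segment_for_asr("他2021年买了300本书"))
# e.g. ['他', '2021', '年', '买', '了', '300', '本', '书']
```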