Brain, Time, CTC blank states and streaming

It is interesting that the longer we study reality, the more unusual it appears to us. For example, if we think about the brain, there are two important ideas we can infer from our current understanding of brain mechanics:

  1. Brain actions are triggered by very short events - spikes
  2. The brain is a highly parallel system

These two propositions have important consequences for the design of speech systems. For example, the individual sounds we produce are most likely generated by spikes inside the brain, which means there are moments with a real spike in electrical activity. And if we model those sounds in software, we should have such spikes too.

This idea is actually already worked out in speech recognition. We moved from 9-state HMMs for TIDIGITS to 3-state HMMs in GMM-HMM systems to the CTC architecture, which follows the function of the brain exactly: there is a quiet blank state and a spike state corresponding to a sound. This architecture is efficient both in terms of memory and speed. There are disadvantages too: CTC might lose the best path while jumping into the blank state and thus can skip words in noisy inputs. But in general I find the idea has not just experimental grounds but also theoretical grounds.

CTC architecture
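To make the spike/blank picture concrete, here is a minimal sketch of CTC greedy decoding, assuming a (time, vocab) matrix of per-frame scores with blank id 0: most frames collapse into the blank state, and output labels only appear at short spikes.

```python
import numpy as np

# Minimal sketch of CTC greedy decoding, assuming `log_probs` is a
# (time, vocab) matrix of per-frame scores and the blank id is 0.
# Most frames fall into the blank state; a non-blank label is a short
# spike that we keep once per run of repeated frames.
def ctc_greedy_decode(log_probs: np.ndarray, blank_id: int = 0) -> list[int]:
    best = log_probs.argmax(axis=-1)      # most likely label per frame
    labels, prev = [], blank_id
    for t in best:
        if t != blank_id and t != prev:   # a spike: a new non-blank label starts
            labels.append(int(t))
        prev = t                          # repeats and blanks are collapsed away
    return labels
```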

These theoretical grounds are not frequently mentioned in the publications and lectures that explain CTC. While reading CTC papers before, I never associated the word “spike” with an actual brain spike. For example, if you read the original CTC paper you might get the impression that this architecture is just a good guess by the authors.

It seems to be well worked out in speech recognition, but similar design decisions are made in other areas. For example, there is a blank state in the Glow-TTS text encoder architecture, which is also used in VITS1 and then dropped in VITS2 without much explanation. People wonder what it is and why it is required, see for example this issue. The authors even claim there is no theoretical basis. In my opinion, given that the blank architecture is justified, the decision to drop it in VITS2 seems wrong. Our experiments training VITS models with and without the blank state confirm that as well.
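In Glow-TTS and VITS1 this blank shows up as a token interspersed between phoneme ids before the text encoder. A rough sketch of that preprocessing step (the blank id 0 and the phoneme ids below are illustrative):

```python
# Rough sketch of the blank interspersing used before the text encoder in
# Glow-TTS / VITS1 (blank id 0 and the phoneme ids below are illustrative).
def intersperse(phoneme_ids: list[int], blank_id: int = 0) -> list[int]:
    result = [blank_id] * (len(phoneme_ids) * 2 + 1)
    result[1::2] = phoneme_ids            # place phonemes between blanks
    return result

print(intersperse([12, 47, 33]))          # [0, 12, 0, 47, 0, 33, 0]
```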

Another area is still speech recognition, but with higher-level spikes. Modern LMs popularized tags for languages, special events and so on, like the translation and language tags in Whisper. They work efficiently compared to encoding the language as an additional dimension, but we rarely think of them as actual brain spikes.
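For illustration, this is roughly how such tags enter the Whisper decoder: language and task are injected as discrete tokens at the start of the output sequence rather than as a separate conditioning dimension (the token names below are approximate; check the actual Whisper tokenizer for the exact set).

```python
# Sketch of a Whisper-style decoder prefix: language and task enter the model
# as discrete tag tokens at the start of the output sequence (token names are
# approximate, not taken from the real tokenizer).
def decoder_prefix(language: str = "en", task: str = "transcribe") -> list[str]:
    return ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>", "<|notimestamps|>"]

print(decoder_prefix("de", "translate"))
# ['<|startoftranscript|>', '<|de|>', '<|translate|>', '<|notimestamps|>']
```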

The ancient Greeks knew three types of time: Chronos, continuous linear time; Cyclos, a repeating sequence of events; and Kairos, a unique moment. Since Newton everything was reduced to just linear time, but in the middle of the 20th century the other concepts came back. Quantum mechanics drove a big shift, but so did the invention of computers. Here one can cite the fundamental work of Leslie Lamport on logical time, for example Time, Clocks, and the Ordering of Events in a Distributed System.

If we think about the second point, parallel processing in the brain, we can actually infer that time in the brain is really non-linear. It is more Cyclos than Chronos. And if we apply this philosophy to speech recognition, we need to review the approach to continuous streaming decoding. I have long been a proponent of fast streaming decoding with a low-latency response, but given the proposition above, I now think it is not really the way the human brain works. Instead, we process incoming data in parallel, meaning we do not always have to generate the response with very small latency. We have certain landmarks in time (think of spikes) where we update our brain state using parallel processing.

Recently there have been a lot of efforts to reduce ASR latency, but they kind of fail: streaming error rates are way higher than non-streaming ones, and people have to apply two-pass processing (think of the U2 architecture). Some teams, for example the Whisper and Silero teams, got the right idea that continuous low latency is not really needed. Instead, we can process audio with a non-streaming architecture at certain points in time and get a response faster and, more importantly, more accurately. But we need a trigger. As a first approximation, a good VAD could serve as a trigger. In the longer term, we can build a trigger network that starts parallel processing at moments of low entropy or other important moments, but not too frequently; a sketch of such a decoding loop is shown below.
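A minimal sketch of this trigger-based decoding, assuming a hypothetical endpoint detector and a non-streaming recognizer (both placeholders, not real library calls): audio is buffered continuously, and a full recognition pass runs only when the trigger fires.

```python
# Sketch of trigger-based decoding. `is_endpoint` (a VAD or, later, a trigger
# network) and `recognize` (a non-streaming ASR model) are hypothetical
# placeholders, not real library calls.
def triggered_decoding(audio_frames, is_endpoint, recognize):
    buffer = []
    for frame in audio_frames:                  # small chunks of raw audio bytes
        buffer.append(frame)
        if is_endpoint(frame):                  # the trigger: a landmark in time
            yield recognize(b"".join(buffer))   # one accurate non-streaming pass
            buffer = []
    if buffer:                                  # flush whatever is left at the end
        yield recognize(b"".join(buffer))
```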

So instead of experimenting with architectures, we can really deduce the optimal one from an understanding of brain mechanics; it allows us to justify the decisions made and move in the right direction. Of course, there is a lot of research here, but it is not well understood by speech engineers. For example:

Enhancement of speech-in-noise comprehension through vibrotactile stimulation at the syllabic rate

Related paper:

Spiking Music: Audio Compression with Event Based Auto-encoders