Optimization in SphinxTrain

I spend quite a significant amount of time training various models. It feels like alchemy: you add this and tune that, and you get nice results. And while training you can read Twitter ;) I have also spent 10 years in a group that builds optimizing compilers, so in theory I should know a lot about them. I rarely apply that in practice, though. But when you are bored by several weeks of training, you can put some of that knowledge to use.



So the algorithm:

1) Train a model for a month and become bored
2) Get the idea that SphinxTrain is compiled without optimization
3) Go to SphinxTrain/config and change the compilation option from -O2 to -O3
4) Measure the run time of a simple bw run with the time command
5) See that the time doesn't really change
6) Add the -pg option to CFLAGS and LDFLAGS to collect a profile
7) See that most of the time is spent in the log_diag_eval function, which is a simple weighted dot product computation
8) Look at the assembler code of log_diag_eval

0x42c3b0 log_diag_eval: unpcklps %xmm0,%xmm0
0x42c3b3 log_diag_eval+3: test %ecx,%ecx
0x42c3b5 log_diag_eval+5: cvtps2pd %xmm0,%xmm0
0x42c3b8 log_diag_eval+8: je 0x42c3fd log_diag_eval+77
0x42c3ba log_diag_eval+10: sub $0x1,%ecx
0x42c3bd log_diag_eval+13: xor %eax,%eax
0x42c3bf log_diag_eval+15: lea 0x4(,%rcx,4),%rcx
0x42c3c7 log_diag_eval+23: nopw 0x0(%rax,%rax,1)
0x42c3d0 log_diag_eval+32: movss (%rdi,%rax,1),%xmm1
0x42c3d5 log_diag_eval+37: subss (%rsi,%rax,1),%xmm1
0x42c3da log_diag_eval+42: unpcklps %xmm1,%xmm1
0x42c3dd log_diag_eval+45: cvtps2pd %xmm1,%xmm2
0x42c3e0 log_diag_eval+48: movss (%rdx,%rax,1),%xmm1
0x42c3e5 log_diag_eval+53: add $0x4,%rax
0x42c3e9 log_diag_eval+57: cmp %rcx,%rax
0x42c3ec log_diag_eval+60: cvtps2pd %xmm1,%xmm1
0x42c3ef log_diag_eval+63: mulsd %xmm2,%xmm1
0x42c3f3 log_diag_eval+67: mulsd %xmm2,%xmm1
0x42c3f7 log_diag_eval+71: subsd %xmm1,%xmm0
0x42c3fb log_diag_eval+75: jne 0x42c3d0 log_diag_eval+32
0x42c3fd log_diag_eval+77: repz retq

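9) Understand that it's not really as good as it could be: the loop handles one element per iteration with scalar instructions (mulsd, subsd) instead of packed ones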

10) Run

gcc -DPACKAGE_NAME=\"SphinxTrain\" -DPACKAGE_TARNAME=\"sphinxtrain\" \
-DPACKAGE_VERSION=\"1.0.99\" -DPACKAGE_STRING=\"SphinxTrain\ 1.0.99\" \
-DPACKAGE_BUGREPORT=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 \
-DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 \
-DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_LIBM=1 \
-I/home/nshmyrev/SphinxTrain/../sphinxbase/include \
-I/home/nshmyrev/SphinxTrain/../sphinxbase/include -I../../../include -O3 \
-g -Wall -fPIC -DPIC -c gauden.c -o obj.x86_64-unknown-linux-gnu/gauden.o \
-ftree-vectorizer-verbose=2

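to see that the log_diag_eval loop isn't vectorized (newer GCC releases report this through -fopt-info-vec-missed instead of -ftree-vectorizer-verbose)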

11) Add -ffast-math and see that it doesn't help

12) Rewrite the function from

float64
log_diag_eval(vector_t obs,
              float32 norm,
              vector_t mean,
              vector_t var_fact,
              uint32 veclen)
{
    float64 d, diff;
    uint32 l;

    d = norm; /* log (1 / 2 pi |sigma^2|) */

    for (l = 0; l < veclen; l++) {
        diff = obs[l] - mean[l];
        d -= var_fact[l] * diff * diff; /* compute -1 / (2 sigma ^2) * (x - m) ^ 2 terms */
    }

    return d;
}

to

float64
log_diag_eval(vector_t obs,
              float32 norm,
              vector_t mean,
              vector_t var_fact,
              uint32 veclen)
{
    float64 d, diff;
    uint32 l;

    d = 0.0;

    for (l = 0; l < veclen; l++) {
        diff = obs[l] - mean[l];
        d += var_fact[l] * diff * diff; /* accumulate 1 / (2 sigma ^2) * (x - m) ^ 2 terms */
    }

    return norm - d; /* norm is log (1 / 2 pi |sigma^2|) */
}

to turn the subtraction inside the loop, which is what hurts, into a plain accumulation.

13) See that the loop is now vectorized. Enjoy the speed!!!
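If you want to convince yourself that the rewrite in step 12 doesn't change the answer, here is a minimal sketch that puts both forms side by side and compares them on made-up data. The typedefs are my stand-ins for the sphinxbase ones, and the names log_diag_eval_sub, log_diag_eval_acc and compare_versions are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <math.h>

/* Assumed stand-ins for the sphinxbase typedefs (not copied from its headers). */
typedef float float32;
typedef double float64;
typedef unsigned int uint32;
typedef float32 *vector_t;

/* Original form: every term is subtracted from the running value. */
static float64
log_diag_eval_sub(vector_t obs, float32 norm, vector_t mean,
                  vector_t var_fact, uint32 veclen)
{
    float64 d = norm, diff;
    uint32 l;

    for (l = 0; l < veclen; l++) {
        diff = obs[l] - mean[l];
        d -= var_fact[l] * diff * diff;
    }
    return d;
}

/* Rewritten form: accumulate the terms, subtract once at the end. */
static float64
log_diag_eval_acc(vector_t obs, float32 norm, vector_t mean,
                  vector_t var_fact, uint32 veclen)
{
    float64 d = 0.0, diff;
    uint32 l;

    for (l = 0; l < veclen; l++) {
        diff = obs[l] - mean[l];
        d += var_fact[l] * diff * diff;
    }
    return norm - d;
}

/* Runs both forms on made-up data and returns the absolute difference. */
float64
compare_versions(void)
{
    enum { N = 39 };               /* hypothetical feature vector length */
    float32 obs[N], mean[N], var_fact[N];
    float64 a, b;
    uint32 i;

    for (i = 0; i < N; i++) {
        obs[i] = 0.01f * (float32)i;
        mean[i] = 0.5f - 0.02f * (float32)i;
        var_fact[i] = 1.0f + 0.1f * (float32)(i % 7);
    }
    a = log_diag_eval_sub(obs, -3.0f, mean, var_fact, N);
    b = log_diag_eval_acc(obs, -3.0f, mean, var_fact, N);

    /* The two forms round differently, so compare with a tolerance. */
    assert(fabs(a - b) < 1e-9);
    return fabs(a - b);
}
```

The two results are not guaranteed to be bit-identical, because the additions happen in a different order. That is why the check uses a small tolerance rather than ==.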

The key thing to understand here is that programming is rather flexible and compilers are rather dumb, so you have to cooperate. You need to use very simple constructs to let the compiler do its work. Moreover, using simple constructs in the code has other benefits: it helps to keep the code style clean and enables automated static analysis with tools like splint.
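To make it a bit more concrete what the vectorizer is being asked to do with an accumulation loop, here is a hand-written sketch of the same computation with two independent partial sums, roughly the way a register holding two doubles would carry them. This is only an illustration under my own assumptions, not what GCC actually emits: the typedefs are stand-ins for the sphinxbase ones and log_diag_eval_lanes is a hypothetical name.

```c
/* Assumed stand-ins for the sphinxbase typedefs (not copied from its headers). */
typedef float float32;
typedef double float64;
typedef unsigned int uint32;
typedef float32 *vector_t;

/*
 * Hand-unrolled sketch of a vectorized reduction: each "lane" keeps its
 * own partial sum, and the lanes are combined once after the loop.
 */
float64
log_diag_eval_lanes(vector_t obs, float32 norm, vector_t mean,
                    vector_t var_fact, uint32 veclen)
{
    float64 sum0 = 0.0, sum1 = 0.0, diff0, diff1;
    uint32 l;

    /* Main body: two elements per iteration, one per lane. */
    for (l = 0; l + 1 < veclen; l += 2) {
        diff0 = obs[l] - mean[l];
        diff1 = obs[l + 1] - mean[l + 1];
        sum0 += var_fact[l] * diff0 * diff0;
        sum1 += var_fact[l + 1] * diff1 * diff1;
    }

    /* Tail: one leftover element when veclen is odd. */
    if (l < veclen) {
        diff0 = obs[l] - mean[l];
        sum0 += var_fact[l] * diff0 * diff0;
    }

    /* Combine the lanes, then subtract from norm as in the rewritten function. */
    return norm - (sum0 + sum1);
}
```

Note that the partial sums add the terms in a different order than the plain loop does. Floating-point addition is not associative, which is why the compiler wants permission such as -ffast-math before it reorders a reduction like this on its own.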

Maybe the same applies to speech recognition. We need to help computers in their efforts to understand us. Speak slowly and articulate clearly, and both we and the computers will enjoy the result.

If you are interested in loop vectorization in GCC, see http://gcc.gnu.org/projects/tree-ssa/vectorization.html