Some more optimization

In addition to the previous post, here are two more tricks for log_diag_eval.

Floats instead of doubles

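If the accumulator is a float rather than a double, SSE can be used more effectively: a 128-bit SSE register holds four floats but only two doubles, and the float inputs no longer need to be converted to double before being accumulated.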
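
To make this concrete, here is a minimal sketch of what a float accumulator could look like. The function name, the signature and the exact arithmetic are assumptions made for illustration, not the real log_diag_eval code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the inner loop of log_diag_eval.
 * The name, the parameters and the formula are assumptions. */
static float
diag_eval_float(const float *obs, const float *mean,
                const float *var_fact, float norm, size_t veclen)
{
    float d = norm;   /* float accumulator instead of double */
    size_t i;

    for (i = 0; i < veclen; i++) {
        float diff = obs[i] - mean[i];   /* stays in single precision */
        d -= diff * diff * var_fact[i];
    }
    return d;
}
```

Note that a compiler will usually vectorize a floating-point reduction like this only when it is allowed to reorder the additions (with GCC, for example, through -ffast-math), so the actual gain depends on the build flags. The float accumulator also gives up some precision compared to double.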

Hardcode vector length

The most common optimization is loop unrolling. It helps optimize memory access and eliminates jump instructions. The problem here is that the number of iterations in log_diag_eval can differ between stages. GCC has an interesting profile-based optimization for this case, see the -fprofile-generate option. The program is run once to collect a profile, and GCC can then derive a few specific optimizations from the runtime data. The good news is that we can be almost sure of the usage patterns of our target loop, so we can optimize without profiling. So, turn


for (i = 0; i < veclen; i++) {
    do work
}


to

if (veclen == 40) { // Commonly used value, 40 floats in each frame
    for (i = 0; i < 40; i++) {
        do work // This loop will be unrolled
    }
} else {
    for (i = 0; i < veclen; i++) {
        do work
    }
}



GCC does the same trick with the profiler, but since our feature frame size is fixed, we can hardcode it. As a result, GCC will unroll the first loop and it will be as fast as the wind.
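
For reference, a compilable sketch of the hardcoded version might look like this. The function name, its parameters and the loop body are hypothetical placeholders for "do work". Only the value 40 comes from the text above.

```c
#include <stddef.h>

#define COMMON_VECLEN 40   /* frame size mentioned above: 40 floats */

/* Hypothetical variant of the evaluation loop with a fast path
 * for the common vector length. The body is an assumed example. */
static float
diag_eval_hardcoded(const float *obs, const float *mean,
                    const float *var_fact, float norm, size_t veclen)
{
    float d = norm;
    size_t i;

    if (veclen == COMMON_VECLEN) {
        /* Trip count is a compile-time constant here, so the
         * compiler is free to unroll this loop. */
        for (i = 0; i < COMMON_VECLEN; i++) {
            float diff = obs[i] - mean[i];
            d -= diff * diff * var_fact[i];
        }
    } else {
        /* Generic fallback for any other length. */
        for (i = 0; i < veclen; i++) {
            float diff = obs[i] - mean[i];
            d -= diff * diff * var_fact[i];
        }
    }
    return d;
}
```

Both branches compute the same result, so the fast path changes only the generated code, not the behavior. Whether the first loop really gets unrolled depends on the compiler version and the optimization level, so it is worth checking the generated assembly.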