Written by
Nickolay Shmyrev
on
Some more optimization
In addition to the previous post, here are two more tricks for log_diag_eval.
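Floats instead of double
If the accumulator is a float, SSE can be used more effectively: a 128-bit SSE register holds four floats but only two doubles, and float inputs no longer have to be converted to double inside the loop.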
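To make the accumulator point concrete, here is a minimal sketch. It is not the actual log_diag_eval code: the function name diag_eval_float, its parameters and the loop body (a variance-weighted squared distance) are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for the inner loop: subtract a variance-weighted
 * squared distance from a starting value. All names are illustrative. */
float diag_eval_float(float norm, const float *x, const float *mean,
                      const float *varinv, size_t veclen)
{
    float acc = norm;   /* float accumulator: same type as the inputs */
    for (size_t i = 0; i < veclen; i++) {
        float diff = x[i] - mean[i];
        acc -= diff * diff * varinv[i];   /* stays in single precision */
    }
    return acc;
}
```

With a double accumulator each product would have to be widened before the subtraction. Note that GCC usually also needs permission to reorder floating-point additions (for example -ffast-math) before it will vectorize a reduction like this, so it is worth checking the generated assembly.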
Hardcode vector length
The most common optimization is loop unrolling. It helps to optimize memory access and eliminates jump instructions. The issue here is that the number of iterations in log_diag_eval can differ between stages. GCC has an interesting profile-based optimization for this case, see the -fprofile-generate option: the instrumented program is run once to collect a profile, and GCC can then derive a few specific optimizations from that runtime data when the program is recompiled with -fprofile-use. The good point is that we can be almost sure of the usage patterns of our target loop, so we can optimize without profiling. So, turn
for (i = 0; i < veclen; i++) {
    /* do work */
}
to
if (veclen == 40) { /* Commonly used value: 40 floats in each frame */
    for (i = 0; i < 40; i++) {
        /* do work; this loop will be unrolled */
    }
} else {
    for (i = 0; i < veclen; i++) {
        /* do work */
    }
}
GCC does the same trick with the profiler, but since our feature frame size is fixed, we can hardcode it. As a result, GCC can unroll the first loop and it will be as fast as the wind.
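A compilable version of the same idea might look like the sketch below. The loop body, the helper weighted_sqdist and the wrapper diag_eval_fixed40 are hypothetical names, not the real log_diag_eval. Only the frame size of 40 comes from the post.

```c
#include <stddef.h>

#define COMMON_VECLEN 40   /* frame size assumed from the post */

/* Hypothetical loop body, kept in one place so both branches share it. */
static inline float weighted_sqdist(float norm, const float *x,
                                    const float *mean, const float *varinv,
                                    size_t n)
{
    float acc = norm;
    for (size_t i = 0; i < n; i++) {
        float diff = x[i] - mean[i];
        acc -= diff * diff * varinv[i];
    }
    return acc;
}

float diag_eval_fixed40(float norm, const float *x, const float *mean,
                        const float *varinv, size_t veclen)
{
    if (veclen == COMMON_VECLEN) {
        /* After inlining, the trip count here is a compile-time constant,
         * which gives the compiler the chance to unroll the loop. */
        return weighted_sqdist(norm, x, mean, varinv, COMMON_VECLEN);
    }
    /* Generic fallback for any other vector length. */
    return weighted_sqdist(norm, x, mean, varinv, veclen);
}
```

Whether the constant-length branch really gets unrolled depends on the GCC version and flags (for example -O3 or -funroll-loops), so compiling with -S and reading the assembly is the safest way to confirm it.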