The project aimed to develop vectorized implementations of matrix multiplication for the SSE, AVX, and Xeon Phi SIMD registers. Our company's role was to consult throughout the project and to choose the optimization strategy. The result was a library for optimized multiplication of small square blocks (2×2 to 16×16) by a long sequence of vectors.

Two operations were to be optimized:

- Block operation AXPY: Y += X×A, where A is a square N×N matrix rather than a scalar as in the standard version, and the rectangular matrices X and Y are of size M×N.
- Block operation DOT: C = Xᵀ×Y, where C is a square N×N matrix rather than a scalar as in the standard version, and the rectangular matrices X and Y are of size M×N.
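
To make the two definitions concrete, here is a minimal scalar reference sketch in C. It is not the library's code: the function names `block_axpy` and `block_dot`, the row-major storage, and the `float` element type are assumptions made for illustration only. A vectorized version would replace the innermost loop with SIMD operations, but the arithmetic would stay the same.

```c
#include <assert.h>
#include <stddef.h>

/* Block AXPY: Y += X * A.
 * X and Y are M x N, A is N x N. Row-major storage is assumed here. */
void block_axpy(size_t m, size_t n,
                const float *x, const float *a, float *y)
{
    assert(x != NULL && a != NULL && y != NULL);
    for (size_t i = 0; i < m; ++i) {          /* each row (vector) of X */
        for (size_t k = 0; k < n; ++k) {
            const float xik = x[i * n + k];
            for (size_t j = 0; j < n; ++j)    /* candidate loop for SIMD */
                y[i * n + j] += xik * a[k * n + j];
        }
    }
}

/* Block DOT: C = X^T * Y.
 * X and Y are M x N, C is N x N. Row-major storage is assumed here. */
void block_dot(size_t m, size_t n,
               const float *x, const float *y, float *c)
{
    assert(x != NULL && y != NULL && c != NULL);
    for (size_t j = 0; j < n * n; ++j)
        c[j] = 0.0f;                          /* C is overwritten, not accumulated */
    for (size_t i = 0; i < m; ++i) {
        for (size_t k = 0; k < n; ++k) {
            const float xik = x[i * n + k];
            for (size_t j = 0; j < n; ++j)    /* candidate loop for SIMD */
                c[k * n + j] += xik * y[i * n + j];
        }
    }
}
```

In both routines M is expected to be large and N small, so the inner loops over N are short and have a fixed trip count for a given block size, which is what makes them natural targets for unrolling and vectorization.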

As a result, we achieved a 7× speedup over the unrolled-loops version for the float data type and up to a 3.5× speedup for the double type.