Vectorization of the matrix multiplication

Developed for the engineering company «Tesis»

Purpose: fast multiplication of block-sparse matrices

The project aimed to develop vectorized implementations of matrix multiplication for SSE, AVX, and Xeon Phi SIMD registers. Our company's role was to consult throughout the project and to choose the optimization strategy. The result is a library for optimized multiplication of small square blocks (2×2 to 16×16) by a long sequence of vectors.

Two operations were to be optimized:

  • Block operation AXPY: Y += X×A, where A is a square N×N matrix rather than the scalar of the ordinary AXPY, and the rectangular matrices X and Y are M×N.
  • Block operation DOT: C = Xᵀ×Y, where C is a square N×N matrix rather than the scalar of the ordinary dot product, and the rectangular matrices X and Y are M×N.
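The two operations can be written out as plain loops. The following is a minimal, non-vectorized reference sketch, not the library's code: the function names `block_axpy` and `block_dot` and the row-major storage of X, Y, A, and C are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: row-major. X and Y are M x N, A and C are N x N.

// Block AXPY: Y += X * A (hypothetical name, reference version).
template <typename T>
void block_axpy(std::size_t M, std::size_t N,
                const std::vector<T>& X, const std::vector<T>& A,
                std::vector<T>& Y) {
    assert(X.size() == M * N && Y.size() == M * N && A.size() == N * N);
    for (std::size_t i = 0; i < M; ++i) {          // row of X and Y
        for (std::size_t k = 0; k < N; ++k) {      // element of that row of X
            const T x = X[i * N + k];
            for (std::size_t j = 0; j < N; ++j) {  // column of A and Y
                Y[i * N + j] += x * A[k * N + j];
            }
        }
    }
}

// Block DOT: C = X^T * Y (hypothetical name, reference version).
template <typename T>
std::vector<T> block_dot(std::size_t M, std::size_t N,
                         const std::vector<T>& X, const std::vector<T>& Y) {
    assert(X.size() == M * N && Y.size() == M * N);
    std::vector<T> C(N * N, T(0));
    for (std::size_t i = 0; i < M; ++i) {          // shared row index
        for (std::size_t r = 0; r < N; ++r) {      // row of C = column of X
            const T x = X[i * N + r];
            for (std::size_t c = 0; c < N; ++c) {  // column of C and Y
                C[r * N + c] += x * Y[i * N + c];
            }
        }
    }
    return C;
}
```

In both routines the innermost loop walks N contiguous elements, which is the part a SIMD implementation would map onto vector registers when N is small and fixed.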

As a result, we achieved a 7-fold speedup over the unrolled-loop version for the float data type and up to a 3.5-fold speedup for the double type.
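To give a sense of what such a vectorization can look like, here is a hedged sketch of the block AXPY for a single case: float data, N = 4, SSE. It is an illustration only, not the project's implementation; the name `block_axpy4`, the row-major layout, and the use of unaligned loads are assumptions. On targets without SSE the sketch falls back to plain loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define BLOCK4_HAVE_SSE 1
#else
#define BLOCK4_HAVE_SSE 0
#endif

// Y += X * A for 4x4 float blocks (hypothetical name).
// Assumed layout: row-major; X and Y hold M rows of 4 floats, A holds 16 floats.
void block_axpy4(std::size_t M, const float* X, const float* A, float* Y) {
    assert(X != nullptr && A != nullptr && Y != nullptr);
#if BLOCK4_HAVE_SSE
    // The four rows of A stay in registers for the whole sweep over X and Y.
    const __m128 a0 = _mm_loadu_ps(A + 0);
    const __m128 a1 = _mm_loadu_ps(A + 4);
    const __m128 a2 = _mm_loadu_ps(A + 8);
    const __m128 a3 = _mm_loadu_ps(A + 12);
    for (std::size_t i = 0; i < M; ++i) {
        const float* x = X + 4 * i;
        __m128 y = _mm_loadu_ps(Y + 4 * i);
        // Row i of Y gains x[0]*A[0,:] + x[1]*A[1,:] + x[2]*A[2,:] + x[3]*A[3,:].
        y = _mm_add_ps(y, _mm_mul_ps(_mm_set1_ps(x[0]), a0));
        y = _mm_add_ps(y, _mm_mul_ps(_mm_set1_ps(x[1]), a1));
        y = _mm_add_ps(y, _mm_mul_ps(_mm_set1_ps(x[2]), a2));
        y = _mm_add_ps(y, _mm_mul_ps(_mm_set1_ps(x[3]), a3));
        _mm_storeu_ps(Y + 4 * i, y);
    }
#else
    // Portable fallback with the same result.
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < 4; ++k)
            for (std::size_t j = 0; j < 4; ++j)
                Y[4 * i + j] += X[4 * i + k] * A[4 * k + j];
#endif
}
```

Each row of Y is updated with four broadcast-multiply-add steps, one per row of A, so a full row is processed per 128-bit register. A 128-bit register holds only two doubles against four floats, which is one plausible reason the gain for double is smaller.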

Specification

Client: Volgograd State Technical University, Volgograd (the project was developed for the engineering company «Tesis»)
Area of use: fast (optimized code) matrix multiplication
Type (platform): Library for Linux
Technologies and algorithms in use: C++, Assembler, Intrinsics, Intel Xeon Phi

Similar projects

Analysis of the energy consumption of open computational packages when launched on a cluster

A research project to build a simulation model of several dozen concurrently operating carrier companies, including terminal centers