CPU Acceleration

The generic implementation of matrix multiplication using a triple nested loop is reasonably fast because the design of RLtools allows the compiler to heavily optimize the code (especially for smaller matrices). For larger matrices it is beneficial to use the CPU's SIMD instructions (e.g. AVX), which can be done through BLAS libraries like Intel MKL or OpenBLAS. We found Intel MKL to be the fastest, but it does not work reliably in Cling (the C/C++ interpreter used as the backend of the Jupyter notebook for this tutorial). Hence we use OpenBLAS in this and the following examples. RLtools has a multiplexing header cpu_mux.h that automatically includes the CPU operations dispatching to the available backend. The backend is selected by defining e.g. RL_TOOLS_BACKEND_ENABLE_OPENBLAS or RL_TOOLS_BACKEND_ENABLE_MKL (these options can also be set during the configuration phase of CMake when using RLtools natively, by passing them on the CLI using -D).
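
For reference, enabling the Intel MKL backend instead of OpenBLAS would only require changing the define before including cpu_mux.h. The following is a hedged sketch, not executed in this notebook, and it assumes MKL is installed and discoverable by the toolchain:

// Sketch: select the Intel MKL backend instead of OpenBLAS (requires MKL to be installed)
#define RL_TOOLS_BACKEND_ENABLE_MKL
#include <rl_tools/operations/cpu_mux.h>
// DEVICE_FACTORY (used below) would then resolve to the MKL-backed CPU device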

[1]:
#define RL_TOOLS_BACKEND_ENABLE_OPENBLAS
#include <rl_tools/operations/cpu_mux.h>
namespace rlt = rl_tools;

We also include iostream and chrono to print and measure the computation times later on:

[2]:
#include <iostream>
#include <chrono>

OpenBLAS contains pre-compiled matrix multiplication kernels and hence needs to be linked (or loaded, in the case of the Cling interpreter):

[3]:
#pragma cling load("openblas")

We define two devices to compare the computation times later on. All CPU_* devices use main memory to store containers, hence matrices allocated with one of them are compatible with the others. For maximum performance it is still recommended to allocate matrices with the device they will be used on afterwards, because e.g. Intel MKL allows memory to be allocated at certain alignment boundaries, which enables faster loads using SIMD instructions. This is integrated into the particular rlt::malloc specialization, which is dispatched to by simply calling rlt::malloc with the desired device.
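
To illustrate why device-specific allocation can matter, here is a conceptual sketch in plain C++ (not the actual rlt::malloc implementation) of an aligned allocation that permits aligned SIMD loads and stores:

#include <cstdlib>

// Conceptual sketch only: allocate 64-byte-aligned storage for 256 floats.
// std::aligned_alloc requires the size to be a multiple of the alignment
// (256 * sizeof(float) = 1024 bytes, a multiple of 64).
float* data = static_cast<float*>(std::aligned_alloc(64, 256 * sizeof(float)));
// ... use data as the backing storage of a matrix ...
std::free(data);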

Because we don't know the outcome of the backend selection ahead of time (or rather we don't want to hard-code it), cpu_mux.h provides a DEVICE_FACTORY template that yields the detected CPU_* device when passed a CPU specification. In this case (since we know OpenBLAS is available and RL_TOOLS_BACKEND_ENABLE_OPENBLAS is defined) this is equivalent to using DEVICE_BLAS = rlt::devices::CPU_OPENBLAS<rlt::devices::DefaultCPUSpecification>.

You can play around with commenting out #define RL_TOOLS_BACKEND_ENABLE_OPENBLAS in the first cell and re-running the notebook. In that case you can see that DEVICE_FACTORY falls back to the plain CPU device for DEVICE_BLAS, resulting in equally slow computation times for both devices in the benchmark later on.
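
If you want to verify which device the factory selected, a compile-time check could look like the following sketch. It relies on the equivalence stated above and would need to be adapted if another backend is picked up:

#include <type_traits>

// Sketch: with RL_TOOLS_BACKEND_ENABLE_OPENBLAS defined, DEVICE_FACTORY should
// resolve to the OpenBLAS CPU device (assumption based on the equivalence above)
static_assert(std::is_same_v<
    rlt::devices::DEVICE_FACTORY<rlt::devices::DefaultCPUSpecification>,
    rlt::devices::CPU_OPENBLAS<rlt::devices::DefaultCPUSpecification>
>, "cpu_mux.h selected the OpenBLAS backend");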

[4]:
using DEVICE = rlt::devices::DefaultCPU;
using DEVICE_BLAS = rlt::devices::DEVICE_FACTORY<rlt::devices::DefaultCPUSpecification>;
using T = double;
using TI = typename DEVICE::index_t;

We instantiate the devices and declare the matrices \(A\), \(B\), \(C\) and \(C_{blas}\) to evaluate the computation:

\[C = A \cdot B\]
[5]:
DEVICE device;
DEVICE_BLAS device_blas;
constexpr TI SIZE = 500;
rlt::Matrix<rlt::matrix::Specification<T, TI, SIZE, SIZE>> A, B, C, C_blas;
auto rng = rlt::random::default_engine(typename DEVICE::SPEC::RANDOM(), 1);

We allocate all the matrices and fill \(A\) and \(B\) with random numbers:

[6]:
rlt::malloc(device_blas, A);
rlt::malloc(device_blas, B);
rlt::malloc(device_blas, C);
rlt::malloc(device_blas, C_blas);
rlt::randn(device, A, rng);
rlt::randn(device, B, rng);

Now we can benchmark the matrix multiplication using the generic implementation:

[7]:
{
    auto start = std::chrono::high_resolution_clock::now();
    rlt::multiply(device, A, B, C);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration<double>(end - start);
    std::cout << "Time (generic, Cling C/C++ interpreter): " << duration.count() << " seconds" << std::endl;
}
Time (generic, Cling C/C++ interpreter): 1.24116 seconds

Equivalently we can run the same computation using OpenBLAS and get the result in much less time:

[8]:
{
    auto start = std::chrono::high_resolution_clock::now();
    rlt::multiply(device_blas, A, B, C_blas);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration<double>(end - start);
    std::cout << "Time (OpenBLAS, Cling C/C++ interpreter): " << duration.count() << " seconds" << std::endl;
}
Time (OpenBLAS, Cling C/C++ interpreter): 0.00617779 seconds

Now we check the resulting matrices against each other:

[9]:
std::cout << "Absolute difference between the resulting matrices: " << rlt::abs_diff(device, C, C_blas) << std::endl;
Absolute difference between the resulting matrices: 2.95702e-09

Depending on the machine, compiler, and library (versions) used, the results might even be exactly equal, but this is not guaranteed. Changing T to float should make the deviations larger, but they can also occur for double because of floating-point inaccuracies: the same mathematical expression does not necessarily yield the same result if the (finite-precision) computations are reordered.
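
To see why reordering matters, here is a small standalone illustration (independent of RLtools) of the non-associativity of floating-point addition:

// Standalone illustration: the same sum evaluated in two different orders
// gives different results in IEEE-754 double precision (prints "1 vs 0")
double fp_a = 1e16, fp_b = -1e16, fp_c = 1.0;
std::cout << ((fp_a + fp_b) + fp_c) << " vs " << (fp_a + (fp_b + fp_c)) << std::endl;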

Another sanity check is printing the matrices. This is infeasible at their full size, hence we only print a \(5 \times 5\) subview. This can be done using the rlt::view operator, which yields a submatrix that is a view of the original matrix. The view is an ordinary rlt::Matrix and can hence be used in all operations that take matrices as input. Since it is a view, it is cheap (zero-copy): it only carries a single pointer at runtime as well as compile-time information about the shape (dimensions and pitch).

[10]:
constexpr TI VIEW_SIZE = 5;
using VIEW = rlt::matrix::ViewSpec<VIEW_SIZE, VIEW_SIZE>;
auto vC = rlt::view(device, C, VIEW{}, 0, 0);
auto vC_blas = rlt::view(device, C_blas, VIEW{}, 0, 0);

std::cout << "Result comparison (top-left " << VIEW_SIZE << "x" << VIEW_SIZE << " for brevity)" << std::endl;
std::cout << std::endl << "C: " << std::endl;
rlt::print(device, vC);
std::cout << std::endl << "C_blas: " << std::endl;
rlt::print(device, vC_blas);
Result comparison (top-left 5x5 for brevity)

C:
   27.437886   -55.120778    -2.961325    35.691352    -1.118319
   -9.361667    37.296970   -34.454112     5.538143     8.603580
    7.339524    -4.856299    -8.972083    30.239564    21.464767
  -41.651640    23.181597   -15.082368    11.505067    20.234685
   29.919496    26.737572     9.850308    19.331748    13.406330

C_blas:
   27.437886   -55.120778    -2.961325    35.691352    -1.118319
   -9.361667    37.296970   -34.454112     5.538143     8.603580
    7.339524    -4.856299    -8.972083    30.239564    21.464767
  -41.651640    23.181597   -15.082368    11.505067    20.234685
   29.919496    26.737572     9.850308    19.331748    13.406330
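
As a final variation on the view above, the row/column offsets passed to rlt::view do not have to be zero. The following sketch (with arbitrarily chosen offsets; any offsets that keep the view inside the \(500 \times 500\) matrix work) prints a \(5 \times 5\) view starting at row 10, column 20:

// Sketch: a 5x5 view of C starting at row 10, column 20 (offsets chosen for illustration)
auto vC_offset = rlt::view(device, C, VIEW{}, 10, 20);
rlt::print(device, vC_offset);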