CPU Acceleration
The generic implementation of matrix multiplication using a triple nested loop is reasonably fast because the design of RLtools allows the compiler to heavily optimize the code (especially for smaller matrices). For larger matrices it is beneficial to use the CPU's SIMD instructions (e.g. AVX). This can be done through BLAS libraries like Intel MKL or OpenBLAS. We found Intel MKL to be the fastest, but it does not work reliably in Cling (the C/C++ interpreter used as the backend of the Jupyter notebook for this tutorial), hence we use OpenBLAS in this and the following examples. RLtools has a multiplexing header cpu_mux.h that automatically includes the CPU operations and dispatches to the available backend. The backend is selected by defining e.g. RL_TOOLS_BACKEND_ENABLE_OPENBLAS or RL_TOOLS_BACKEND_ENABLE_MKL (these options can also be set during the configuration phase of CMake when using RLtools natively, by passing them on the CLI using -D).
[1]:
#define RL_TOOLS_BACKEND_ENABLE_OPENBLAS
#include <rl_tools/operations/cpu_mux.h>
namespace rlt = rl_tools;
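As a sketch of the alternative (assuming Intel MKL is installed and detectable on your system), the cell above could enable the MKL backend instead and the rest of the notebook would remain unchanged:
// Hypothetical variant of the first cell, selecting the Intel MKL backend instead of OpenBLAS
#define RL_TOOLS_BACKEND_ENABLE_MKL
#include <rl_tools/operations/cpu_mux.h>
namespace rlt = rl_tools;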
We also include iostream and chrono to print and measure the computation times later on:
[2]:
#include <iostream>
#include <chrono>
OpenBLAS contains pre-compiled matrix multiplication kernels and hence needs to be linked (or loaded in case of the Cling interpreter):
[3]:
#pragma cling load("openblas")
We define two devices to compare the computation time later on. All CPU_* devices use main memory to store containers, hence matrices allocated by one of them are compatible with the others. For maximum performance it is recommended to allocate matrices with the device they are subsequently used on, because e.g. Intel MKL can allocate chunks of memory aligned to certain boundaries, which allows faster loads using SIMD instructions. This is integrated into the respective rlt::malloc specialization, which is dispatched to simply by calling it with the desired device.
Because we don't know the outcome of the backend selection (or rather don't want to hard-code it), cpu_mux.h provides a DEVICE_FACTORY template that generates the detected CPU_* device when a CPU specification is passed. In this case (since we know OpenBLAS is available and RL_TOOLS_BACKEND_ENABLE_OPENBLAS is defined) this is equivalent to using DEVICE_BLAS = rlt::devices::CPU_OPENBLAS<rlt::devices::DefaultCPUSpecification>.
You can play around with commenting out #define RL_TOOLS_BACKEND_ENABLE_OPENBLAS in the first cell and re-running the notebook. In that case, you will see that a plain CPU device is used for DEVICE_BLAS, resulting in equally slow computation times for both devices in the benchmark later on.
[4]:
using DEVICE = rlt::devices::DefaultCPU;
using DEVICE_BLAS = rlt::devices::DEVICE_FACTORY<rlt::devices::DefaultCPUSpecification>;
using T = double;
using TI = typename DEVICE::index_t;
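As a compile-time sanity check (a sketch, assuming <type_traits> is available), we can verify that the factory indeed selected the OpenBLAS device; if the backend define were removed, this assertion would fail because DEVICE_BLAS would fall back to the plain CPU device:
#include <type_traits>
// passes only if DEVICE_FACTORY dispatched to the OpenBLAS device
static_assert(std::is_same_v<DEVICE_BLAS, rlt::devices::CPU_OPENBLAS<rlt::devices::DefaultCPUSpecification>>, "DEVICE_BLAS is not the OpenBLAS device");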
We declare the devices, the \(A\), \(B\), \(C\), and \(C_{blas}\) matrices, and a random number generator to evaluate the computation:
[5]:
DEVICE device;
DEVICE_BLAS device_blas;
constexpr TI SIZE = 500;
rlt::Matrix<rlt::matrix::Specification<T, TI, SIZE, SIZE>> A, B, C, C_blas;
auto rng = rlt::random::default_engine(typename DEVICE::SPEC::RANDOM(), 1);
We allocate all the matrices and fill \(A\) and \(B\) with random numbers:
[6]:
rlt::malloc(device_blas, A);
rlt::malloc(device_blas, B);
rlt::malloc(device_blas, C);
rlt::malloc(device_blas, C_blas);
rlt::randn(device, A, rng);
rlt::randn(device, B, rng);
Now we can benchmark the matrix multiplication using the generic implementation:
[7]:
{
auto start = std::chrono::high_resolution_clock::now();
rlt::multiply(device, A, B, C);
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration<double>(end - start);
std::cout << "Time Cling (C/C++ interpreter): " << duration.count() << " seconds" << std::endl;
}
Time generic (in Cling, the C/C++ interpreter): 1.24116 seconds
Equivalently we can run the same computation using OpenBLAS and get the result in much less time:
[8]:
{
auto start = std::chrono::high_resolution_clock::now();
rlt::multiply(device_blas, A, B, C_blas);
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration<double>(end - start);
std::cout << "Time Cling (C/C++ interpreter): " << duration.count() << " seconds" << std::endl;
}
Time OpenBLAS (in Cling, the C/C++ interpreter): 0.00617779 seconds
Now we check the resulting matrices against each other:
[9]:
std::cout << "Absolute difference between the resulting matrices: " << rlt::abs_diff(device, C, C_blas) << std::endl;
Absolute difference between the resulting matrices: 2.95702e-09
Depending on the machine, compiler, and library versions used, we might find that the results are exactly equal, but this is not necessarily the case. By changing T to float the deviations should become bigger, but even for double they can occur because of floating point inaccuracies: the same mathematical expression does not necessarily lead to the same result if the (finite precision) computations are reordered.
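To illustrate the reordering point with a minimal example (independent of RLtools): floating point addition is not associative, so regrouping the same sum can change the result:
{
    // 1e16 + 1.0 rounds back to 1e16 in double precision, so the grouping matters
    double a = 1e16, b = -1e16, c = 1.0;
    std::cout << "(a + b) + c = " << ((a + b) + c) << std::endl; // prints 1
    std::cout << "a + (b + c) = " << (a + (b + c)) << std::endl; // prints 0
}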
Another sanity check is printing the matrices. This is infeasible at their full size, hence we only print a subview of size \(5 \times 5\). This can be done using the rlt::view operator, which yields a submatrix that is a view of the original matrix. The view is an ordinary rlt::Matrix and can hence be used in all operations that take matrices as input. Since it is a view it is cheap (zero-copy): at runtime it only carries a single pointer, together with compile-time information about the shape (dimensions and pitch).
[10]:
constexpr TI VIEW_SIZE = 5;
using VIEW = rlt::matrix::ViewSpec<VIEW_SIZE, VIEW_SIZE>;
auto vC = rlt::view(device, C, VIEW{}, 0, 0);
auto vC_blas = rlt::view(device, C_blas, VIEW{}, 0, 0);
std::cout << "Result comparison (top-left " << VIEW_SIZE << "x" << VIEW_SIZE << " for brevity)" << std::endl;
std::cout << std::endl << "C: " << std::endl;
rlt::print(device, vC);
std::cout << std::endl << "C_blas: " << std::endl;
rlt::print(device, vC_blas);
Result comparison (top-left 5x5 for brevity)
C:
27.437886 -55.120778 -2.961325 35.691352 -1.118319
-9.361667 37.296970 -34.454112 5.538143 8.603580
7.339524 -4.856299 -8.972083 30.239564 21.464767
-41.651640 23.181597 -15.082368 11.505067 20.234685
29.919496 26.737572 9.850308 19.331748 13.406330
C_blas:
27.437886 -55.120778 -2.961325 35.691352 -1.118319
-9.361667 37.296970 -34.454112 5.538143 8.603580
7.339524 -4.856299 -8.972083 30.239564 21.464767
-41.651640 23.181597 -15.082368 11.505067 20.234685
29.919496 26.737572 9.850308 19.331748 13.406330
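Since the views are ordinary matrices, we can, for example, also restrict the comparison from above to this submatrix (a small sketch reusing the operations introduced earlier):
std::cout << "Absolute difference (top-left " << VIEW_SIZE << "x" << VIEW_SIZE << "): " << rlt::abs_diff(device, vC, vC_blas) << std::endl;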