Multiple Dispatch

In this notebook, we show the usage of static multiple dispatch using a simple example. Static multiple dispatch is used in RLtools to provide different implementations of primitives used in deep learning, reinforcement learning, etc. that are tailored to the particular device the code is executed on. In comparison to the Julia programming language, which popularized dynamic/runtime multiple dispatch, in RLtools the dispatch to a particular implementation is done at compile time, enabling the compiler to heavily optimize the code for a particular device.

First, we set up the environment as detailed in Containers:

[1]:
#include <rl_tools/operations/cpu.h>
[2]:
namespace rlt = rl_tools;
using DEVICE = rlt::devices::DefaultCPU;
using T = float;
using TI = typename DEVICE::index_t;

Additionally, we create a new device (a hypothetical Accelerator) that is derived from the default CPU device. If we don’t overload any functions for this device, using it will automatically dispatch to the functions of the original device.

[3]:
struct Accelerator: DEVICE{};
[4]:
DEVICE device_1;
Accelerator device_2;
TI seed = 1;
auto rng = rlt::random::default_engine(DEVICE::SPEC::RANDOM(), seed);
rlt::Matrix<rlt::matrix::Specification<T, TI, 3, 3>> m;

We allocate the matrix using the first device. In this case it makes no difference whether we allocate it using the first or the second device, but if we were using e.g. a GPU with separate, on-device memory, we would have to allocate containers for the particular device they will be used on. After allocating containers on different devices, they can be copied between devices using rl_tools::copy.

[5]:
rlt::malloc(device_1, m);
rlt::set_all(device_1, m, 0);
rlt::print(device_1, m);
    0.000000     0.000000     0.000000
    0.000000     0.000000     0.000000
    0.000000     0.000000     0.000000
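
Since device_1 and device_2 both run on the host CPU here, copying is not actually necessary, but as a sketch of the copy mentioned above it could look like the following. Note that the argument order of rl_tools::copy shown here (source device and container before target device and container) is an assumption for illustration, not a documented signature:

rlt::Matrix<rlt::matrix::Specification<T, TI, 3, 3>> m_2;
rlt::malloc(device_2, m_2);             // allocate a second container on the target device
rlt::copy(device_1, device_2, m, m_2);  // assumed argument order: source device/container first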

Now we define a new operation that takes a matrix and increments its first element by 10000000. Note that this function can deal with both device_1 and device_2. Additionally, because the template metaprogramming allows us to pass around the matrix’s Specification at compile time, we can use static_assert to make sure the operator cannot be used on smaller matrices. This shows how static multiple dispatch allows for bounds checking at compile time without any run-time overhead. On another note, we use the index type TI for the loop counter because in RLtools we never hardcode any integer or floating point types, so that the optimal ones can be used depending on the device we are compiling for.

[6]:
template <typename SPEC>
void increment_first_element(DEVICE& device, rl_tools::Matrix<SPEC>& matrix){
    using TI = DEVICE::index_t;
    static_assert(SPEC::ROWS >= 1);
    static_assert(SPEC::COLS >= 1);
    for(TI i=0; i < 10000000; i++){
        rlt::increment(matrix, 0, 0, 1);
    }
}
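
To illustrate the compile-time bounds check, the following (hypothetical, commented-out) call would be rejected by the compiler because the static_assert on SPEC::COLS fails before any code is generated:

// rlt::Matrix<rlt::matrix::Specification<T, TI, 3, 0>> too_small;
// increment_first_element(device_1, too_small); // compile error: static assertion failed (SPEC::COLS >= 1)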

Now we can benchmark the runtime of this admittedly horribly inefficient implementation of the increment operation:

[7]:
#include <chrono>
#include <iostream>
auto start = std::chrono::high_resolution_clock::now();
increment_first_element(device_1, m);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> elapsed = end - start;
std::cout << "Time taken: " << elapsed.count() << " ms\n";
Time taken: 181.915000 ms

We can overload the previous implementation for the new Accelerator device and provide an implementation that is tailored to the hardware. In this hypothetical case we simply provide a more efficient implementation of the same increment operation:

[8]:
template <typename SPEC>
void increment_first_element(Accelerator& device, rl_tools::Matrix<SPEC>& matrix){
    using TI = DEVICE::index_t;
    static_assert(SPEC::ROWS >= 1);
    static_assert(SPEC::COLS >= 1);
    rlt::increment(matrix, 0, 0, 10000000);
}

Executing this implementation on the same data structure but using device_2 yields a significant speedup:

[9]:
rlt::set_all(device_2, m, 0);
auto start = std::chrono::high_resolution_clock::now();
increment_first_element(device_2, m);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> elapsed = end - start;
std::cout << "Time taken: " << elapsed.count() << " ms\n";
Time taken: 0.012250 ms

Now that we have seen how primitives can be tailored to devices using multiple dispatch and specific implementations, we want to use them in higher-level, more abstract algorithms that are agnostic to the hardware they are running on as long as the primitive operations behave the same:

[10]:
template <typename DEVICE, typename SPEC>
void algorithm(DEVICE& device, rl_tools::Matrix<SPEC>& matrix){
    using TI = typename DEVICE::index_t;
    for(TI i=0; i < 5; i++){
        increment_first_element(device, matrix);
    }
}
[11]:
auto start = std::chrono::high_resolution_clock::now();
algorithm(device_1, m);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> elapsed = end - start;
std::cout << "Time taken: " << elapsed.count() << " ms\n";
Time taken: 892.975000 ms
[12]:
auto start = std::chrono::high_resolution_clock::now();
algorithm(device_2, m);
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> elapsed = end - start;
std::cout << "Time taken: " << elapsed.count() << " ms\n";
Time taken: 0.016334 ms

In practice, we provide generic implementations in pure C++ for all deep learning and reinforcement learning operators that do not depend on specific device capabilities. These naive implementations can be quite slow (e.g. for large matrix multiplications), hence we provide specialized implementations that are dispatched to by including the appropriate operations for that device and then calling all algorithms with the specialized device type. For example, the Intel MKL library provides implementations of matrix multiplication that are tailored to Intel processors and their vector extensions (e.g. AVX). Hence, in that case we would include #include <rl_tools/operations/cpu_mkl.h>, which uses all the generic or CPU implementations available in RLtools but overloads the forward and backward passes of neural networks to dispatch to the fast matrix multiplication implementations. Moreover, it also overloads rl_tools::malloc to align the container memory to 64-byte boundaries, which makes loading from and storing to memory more efficient through aggregation of multiple loads and stores.
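
As a rough sketch (not executed in this notebook), switching to the MKL-backed CPU device could look like the following; the device alias rlt::devices::DefaultCPU_MKL is an assumption and might be named differently in the actual header:

// sketch, assuming an Intel CPU with MKL installed
#include <rl_tools/operations/cpu_mkl.h>
namespace rlt = rl_tools;
using DEVICE_MKL = rlt::devices::DefaultCPU_MKL; // assumed alias provided by the MKL operations header
DEVICE_MKL device_mkl;
rlt::Matrix<rlt::matrix::Specification<float, DEVICE_MKL::index_t, 3, 3>> m_mkl;
rlt::malloc(device_mkl, m_mkl); // overloaded to allocate 64-byte-aligned memory
rlt::free(device_mkl, m_mkl);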

We prefer static multiple dispatch in the way shown above over C++ method overriding because the latter requires an implicit method lookup through the virtual method table (VMT) at runtime. In contrast, static multiple dispatch allows the compiler to resolve the dispatch at compile time and hence incurs no runtime overhead, while also allowing it to aggressively inline and optimize across modules. This is especially beneficial on accelerators like GPUs.
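
A consequence worth keeping in mind is that with static dispatch only the static type of the device argument decides which overload is called; there is no runtime lookup based on the dynamic type of the object:

DEVICE& device_2_as_cpu = device_2;           // view the Accelerator through a DEVICE reference
increment_first_element(device_2_as_cpu, m);  // dispatches to the generic DEVICE overload, not the Accelerator one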

Additionally, by always specifying the device, we avoid global state, which is common in other deep RL libraries but in our experience hinders the integration into larger systems. In this way, multiple configurations can coexist in the same translation unit without interfering. Moreover, the device can be used as a context that carries components that should be available in many functions (e.g. the Tensorboard logger) without making them global variables.
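
As a minimal sketch of this idea (the names Logger and LoggingCPU are made up for illustration and are not the actual RLtools device definition), a device type can carry such a component and device-specific overloads can access it without any global state:

struct Logger{ TI invocation_count = 0; };
struct LoggingCPU: DEVICE{ Logger* logger = nullptr; };
template <typename SPEC>
void increment_first_element(LoggingCPU& device, rl_tools::Matrix<SPEC>& matrix){
    if(device.logger != nullptr){ device.logger->invocation_count++; } // record the call in the device's context
    increment_first_element(static_cast<DEVICE&>(device), matrix);     // delegate to the generic CPU implementation
}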