Containers

Some examples of using containers (2D matrices and tensors)

Since RLtools is a header-only library, the compiler only needs to know where its include folder is located (in the Docker image it is cloned or mounted at /usr/local/include/rl_tools). This is a standard location for header files, and the Dockerfile sets C_INCLUDE_PATH to include it.

Most operations in RLtools are generic and work on any device that has a C++17 compiler (standard library support is not required). Some functions, such as random number generation, are device-specific and require implementations that often can only be included on that particular device (e.g. Intel CPU, CUDA GPU). In this example we therefore include the CPU implementations, which entail a dependency on a few standard library facilities (size_t, random number generation, logging, etc.). The same header also includes all the basic generic functions, e.g. those that operate on containers.

[1]:

#include <rl_tools/operations/cpu.h>

All objects in RLtools are encapsulated in the rl_tools namespace, and there is no global state (not even for logging etc.). In programs using RLtools we usually abbreviate the namespace rl_tools to rlt and define three shorthands for frequently used types. DEVICE is the selected device type. T is the floating point type (usually float or double; float can be preferable because it is often much faster on accelerators). TI is the index type, which should usually be the size_t of the device (to match the device’s hardware and provide the best performance).

All algorithms and data structures in RLtools are agnostic to these types through C++ template metaprogramming. In addition, the DEVICE type is used for a static, compile-time form of multiple dispatch that routes certain functions (e.g. a neural network layer's forward pass) to code that is optimized for a particular device. Through this design, the same higher-level algorithms can be executed on all sorts of devices, from HPC clusters through workstations and laptops to smartphones, smartwatches, and microcontrollers, without sacrificing performance. Because of template metaprogramming, quantities such as the matrix dimensions and the number of for-loop iterations are known at compile time, and the compiler can use them to heavily optimize the code through loop unrolling, inlining, etc.

[2]:

namespace rlt = rl_tools;
using DEVICE = rlt::devices::DefaultCPU;
using T = float;
using TI = typename DEVICE::index_t;
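The compile-time dispatch described above can be illustrated without RLtools. The following is a minimal sketch, assuming two hypothetical device tags (GenericDevice and UnrolledDevice) and a hypothetical dot function; none of these names come from the library. It only shows how overload resolution on the device type selects an implementation at compile time.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical device tags for illustration; these are NOT RLtools types.
struct GenericDevice {};
struct UnrolledDevice {};

// Generic implementation: selected for every DEVICE without a better match.
template <typename DEVICE, typename T, std::size_t N>
T dot(DEVICE&, const T (&a)[N], const T (&b)[N]) {
    T acc = 0;
    for (std::size_t i = 0; i < N; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}

// Device-specific overload: it is more specialized in its first parameter,
// so overload resolution picks it at compile time for UnrolledDevice.
template <typename T, std::size_t N>
T dot(UnrolledDevice&, const T (&a)[N], const T (&b)[N]) {
    T acc0 = 0, acc1 = 0; // two accumulators, manually unrolled by two
    std::size_t i = 0;
    for (; i + 1 < N; i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    if (i < N) { // leftover element when N is odd
        acc0 += a[i] * b[i];
    }
    return acc0 + acc1;
}

// Usage: the same call syntax reaches different code depending on the device type.
inline void dot_example() {
    GenericDevice generic;
    UnrolledDevice unrolled;
    const float a[3] = {1, 2, 3};
    const float b[3] = {4, 5, 6};
    assert(dot(generic, a, b) == 32.0f);  // generic loop
    assert(dot(unrolled, a, b) == 32.0f); // unrolled overload
}
```

Because the device is an ordinary type, switching the implementation only requires changing the DEVICE alias; the calling code stays the same and no runtime branch is involved.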

In the following we instantiate a device struct. The DEVICE struct can be empty, and hence have no overhead, while still facilitating tag dispatch. It can also be used as a carrier for additional context that would otherwise be implemented as global state (e.g. logging through a Tensorboard logger).

In the first example we create a matrix and fill it with random numbers (from an isotropic, standard normal distribution). We therefore define the initial seed for our random number generator, which is instantiated depending on the device type. This allows us to easily change the DEVICE definition and have all downstream entities be appropriate for the particular device.

Finally, we create a matrix, in particular a dynamic (heap-allocated) 3x3 matrix. The static, compile-time configuration of the matrix is defined by a specification type (rlt::matrix::Specification<ELEMENT_TYPE, INDEX_TYPE, ROWS, COLS>) that carries the types and compile-time constants. Bundling these attributes into a separate specification, instead of having numerous template parameters on the rlt::Matrix type, makes it easier to write functions that take matrices as input, because we just have to add a typename SPEC parameter to the template. We can still restrict a function to matrices with particular attributes through e.g. static_assert and SFINAE. Moreover, we can add attributes without breaking functions that are written this way.

[3]:

DEVICE device;
TI seed = 1;
auto rng = rlt::random::default_engine(DEVICE::SPEC::RANDOM(), seed);
rlt::Matrix<rlt::matrix::Specification<T, TI, 3, 3>> m;
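To make the specification pattern concrete, here is a minimal sketch that re-creates the idea in plain C++. The Specification and Matrix structs below are simplified stand-ins written for this example (they are not the RLtools definitions), and trace is a hypothetical function that shows how a single typename SPEC parameter plus a static_assert is enough to accept and constrain matrices.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for a specification type; NOT the RLtools definition.
template <typename T_T, typename T_TI, T_TI T_ROWS, T_TI T_COLS>
struct Specification {
    using T = T_T;
    using TI = T_TI;
    static constexpr TI ROWS = T_ROWS;
    static constexpr TI COLS = T_COLS;
};

// Simplified stand-in for a matrix that is configured by one SPEC parameter.
template <typename T_SPEC>
struct Matrix {
    using SPEC = T_SPEC;
    typename SPEC::T data[SPEC::ROWS * SPEC::COLS] = {}; // row-major, zero-filled
};

// A function needs only one template parameter to accept any matrix.
// The static_assert restricts it to square matrices at compile time.
template <typename SPEC>
typename SPEC::T trace(const Matrix<SPEC>& m) {
    static_assert(SPEC::ROWS == SPEC::COLS, "trace requires a square matrix");
    typename SPEC::T acc = 0;
    for (typename SPEC::TI i = 0; i < SPEC::ROWS; i++) {
        acc += m.data[i * SPEC::COLS + i];
    }
    return acc;
}

// Usage: fill the diagonal of a 3x3 matrix and sum it.
inline void trace_example() {
    Matrix<Specification<float, std::size_t, 3, 3>> m;
    m.data[0] = 1;
    m.data[4] = 2;
    m.data[8] = 3;
    assert(trace(m) == 6.0f);
}
```

If a further attribute were added to Specification later (with a default value), trace would keep compiling unchanged because it never spells out the individual template parameters.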

Since we created a dynamic matrix (which just consists of a pointer to the beginning of a memory space), we need to allocate it, which is done using rlt::malloc. Like all functions in RLtools, it takes the device as an input because the device provides the (global) context. In this case the device can be helpful to e.g. align the allocated memory space to certain boundaries to allow for maximum read-write performance on a particular device.

[4]:

rlt::malloc(device, m);

rlt::Matrix defaults to a dynamic, heap-allocated matrix, but we can override this behavior by setting DYNAMIC_ALLOCATION=false in the specification. This yields a static, stack-allocated matrix that does not require rlt::malloc and rlt::free.

[5]:

constexpr bool DYNAMIC_ALLOCATION = false;
rlt::Matrix<rlt::matrix::Specification<T, TI, 3, 3, DYNAMIC_ALLOCATION>> m_static;
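The difference between the two allocation modes can be sketched in plain C++. The following example is an assumption-laden illustration, not the RLtools implementation: Specification, Matrix, allocate, release and CountingDevice are hypothetical names. It shows how a compile-time flag can switch the storage between a pointer and an inline array, and how a device object can carry context (here an allocation counter) instead of global state.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical device that carries context instead of relying on global state.
struct CountingDevice {
    std::size_t live_allocations = 0;
};

// Simplified stand-in for a specification; NOT the RLtools definition.
template <typename T_T, typename T_TI, T_TI T_ROWS, T_TI T_COLS, bool T_DYNAMIC_ALLOCATION = true>
struct Specification {
    using T = T_T;
    using TI = T_TI;
    static constexpr TI ROWS = T_ROWS;
    static constexpr TI COLS = T_COLS;
    static constexpr bool DYNAMIC_ALLOCATION = T_DYNAMIC_ALLOCATION;
};

template <typename SPEC>
struct Matrix {
    using T = typename SPEC::T;
    // Dynamic: only a pointer. Static: the elements live inside the object itself.
    using STORAGE = std::conditional_t<SPEC::DYNAMIC_ALLOCATION, T*, T[SPEC::ROWS * SPEC::COLS]>;
    STORAGE data = {}; // null pointer or zero-filled array (a choice of this sketch)
};

// Allocation is only meaningful (and only compiles) for dynamic matrices.
template <typename DEVICE, typename SPEC>
void allocate(DEVICE& device, Matrix<SPEC>& m) {
    static_assert(SPEC::DYNAMIC_ALLOCATION, "static matrices need no allocation");
    assert(m.data == nullptr);
    m.data = new typename SPEC::T[SPEC::ROWS * SPEC::COLS]; // elements stay uninitialized
    device.live_allocations++;
}

template <typename DEVICE, typename SPEC>
void release(DEVICE& device, Matrix<SPEC>& m) {
    static_assert(SPEC::DYNAMIC_ALLOCATION, "static matrices need no deallocation");
    delete[] m.data;
    m.data = nullptr;
    device.live_allocations--;
}
```

In this sketch the static variant has the size of its nine elements and the dynamic variant has the size of one pointer, which is why only the latter needs an explicit allocation step before use.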

The allocated memory is usually not initialized, hence we fill it with random numbers (from a standard normal distribution):

[6]:

rlt::randn(device, m, rng);

Now we can print the allocated and filled matrix:

[7]:

rlt::print(device, m);

    0.849261    -0.102156    -0.256673
    0.904277    -0.538617    -0.506808
   -0.408192     0.271856    -0.311355

We can access elements using the get and set commands:

[8]:

rlt::get(m, 0, 0)

[8]:

0.849261f

[9]:

rlt::set(m, 0, 0, 1);
rlt::print(device, m);

    1.000000    -0.102156    -0.256673
    0.904277    -0.538617    -0.506808
   -0.408192     0.271856    -0.311355

get returns a reference, so we could technically also set or increment the element through the reference:

[10]:

rlt::get(m, 0, 0) += 10;
rlt::print(device, m);

   11.000000    -0.102156    -0.256673
    0.904277    -0.538617    -0.506808
   -0.408192     0.271856    -0.311355

Writing through the reference is not very intuitive, so we prefer set and increment:

[11]:

rlt::increment(m, 0, 0, -10);
rlt::print(device, m);

    1.000000    -0.102156    -0.256673
    0.904277    -0.538617    -0.506808
   -0.408192     0.271856    -0.311355
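The relationship between get, set and increment can be sketched in a few lines of plain C++. This is a minimal illustration under stated assumptions: the Matrix struct, its row-major layout and the three free functions are written for this example and are not the RLtools implementation. It shows why a get that returns a reference is sufficient to build the other two accessors.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for a static matrix; row-major layout is an assumption.
template <typename T, std::size_t ROWS, std::size_t COLS>
struct Matrix {
    T data[ROWS * COLS] = {};
};

// Returns a reference, so the caller can read or write the element.
template <typename T, std::size_t ROWS, std::size_t COLS>
T& get(Matrix<T, ROWS, COLS>& m, std::size_t row, std::size_t col) {
    assert(row < ROWS && col < COLS);
    return m.data[row * COLS + col];
}

// VALUE is a separate template parameter so that e.g. an int literal
// can be assigned to a float matrix without a deduction conflict.
template <typename T, std::size_t ROWS, std::size_t COLS, typename VALUE>
void set(Matrix<T, ROWS, COLS>& m, std::size_t row, std::size_t col, VALUE value) {
    get(m, row, col) = static_cast<T>(value);
}

template <typename T, std::size_t ROWS, std::size_t COLS, typename VALUE>
void increment(Matrix<T, ROWS, COLS>& m, std::size_t row, std::size_t col, VALUE value) {
    get(m, row, col) += static_cast<T>(value);
}

// Usage: the same sequence of operations as in the cells above.
inline void accessor_example() {
    Matrix<float, 3, 3> m;
    set(m, 0, 0, 1);         // element (0, 0) becomes 1
    get(m, 0, 0) += 10;      // write through the reference: 11
    increment(m, 0, 0, -10); // back to 1
    assert(get(m, 0, 0) == 1.0f);
}
```

Named set and increment functions make the write explicit at the call site, which is the readability argument made above.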

Tensors

Matrices are a simple 2D data structure, but to allow for more complex algorithms we have since introduced a tensor type that can hold arbitrary shapes:

[12]:

using SHAPE = rlt::tensor::Shape<TI, 3, 3, 3>;
using SPEC = rlt::tensor::Specification<T, TI, SHAPE, DYNAMIC_ALLOCATION>;
rlt::Tensor<SPEC> t;

Tensors support most of the operations that matrices also support:

[13]:

rlt::randn(device, t, rng);
rlt::print(device, t);

dim[0] = 0:
  -5.703804e-01  -3.422589e-01   1.008072e-01
  -9.118625e-01   2.108090e+00   9.476308e-02
   5.376303e-01   3.618752e-01  -7.995225e-01

dim[0] = 1:
   8.660405e-01   1.061986e+00   6.006763e-01
   2.661995e+00  -9.388391e-01  -1.549304e-01
   9.058360e-02  -1.328507e+00   1.262284e+00

dim[0] = 2:
   2.677846e+00  -1.236785e+00  -9.119245e-02
  -8.944708e-01  -2.577802e+00   2.305977e+00
   5.642641e-01   5.340819e-01   1.266308e+00

The signature of the set operation differs slightly from the one for matrices: because tensors can have an arbitrary number of dimensions, the indices are passed as variadic arguments (Args...) and therefore have to come last in the operation's signature, after the value:

[14]:

std::cout << rlt::get(device, t, 0, 1, 1) << std::endl;
T new_value = 1337;
rlt::set(device, t, new_value, 0, 1, 1);
std::cout << rlt::get(device, t, 0, 1, 1) << std::endl;

2.108090e+00
1.337000e+03
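Why the indices have to come last can be seen in a small stand-alone sketch. The Tensor struct and the flat_index, get and set functions below are hypothetical and written only for this example (they omit the device argument and are not the RLtools implementation); a row-major layout is assumed. The trailing parameter pack absorbs one index per dimension, so any fixed argument such as the value must precede it.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for a statically shaped tensor; row-major layout is an assumption.
template <typename T, std::size_t... DIMS>
struct Tensor {
    static constexpr std::size_t SIZE = (DIMS * ... * 1); // C++17 fold expression
    T data[SIZE] = {};
};

// Converts one index per dimension into an offset into the flat storage.
template <typename T, std::size_t... DIMS, typename... INDICES>
std::size_t flat_index(const Tensor<T, DIMS...>&, INDICES... indices) {
    static_assert(sizeof...(INDICES) == sizeof...(DIMS), "one index per dimension is required");
    const std::size_t dims[] = {DIMS...};
    const std::size_t idx[] = {static_cast<std::size_t>(indices)...};
    std::size_t flat = 0;
    for (std::size_t d = 0; d < sizeof...(DIMS); d++) {
        assert(idx[d] < dims[d]);
        flat = flat * dims[d] + idx[d];
    }
    return flat;
}

template <typename T, std::size_t... DIMS, typename... INDICES>
T get(const Tensor<T, DIMS...>& t, INDICES... indices) {
    return t.data[flat_index(t, indices...)];
}

// The value precedes the indices because the parameter pack must be last.
// In this sketch the value has to be of type T exactly for deduction to succeed.
template <typename T, std::size_t... DIMS, typename... INDICES>
void set(Tensor<T, DIMS...>& t, T value, INDICES... indices) {
    t.data[flat_index(t, indices...)] = value;
}

// Usage: mirrors the cell above on a 3x3x3 tensor.
inline void tensor_example() {
    Tensor<float, 3, 3, 3> t;
    float new_value = 1337;
    set(t, new_value, 0, 1, 1);
    assert(get(t, 0, 1, 1) == 1337.0f);
}
```

Passing a wrong number of indices is rejected at compile time by the static_assert, while out-of-range indices are caught by the assert in debug builds.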

Tensors can be sliced by using view and view_range:

[15]:

auto mid3x3 = rlt::view(device, t, 1);
rlt::print(device, mid3x3);

   8.660405e-01   1.061986e+00   6.006763e-01
   2.661995e+00  -9.388391e-01  -1.549304e-01
   9.058360e-02  -1.328507e+00   1.262284e+00

[16]:

auto first_rows = rlt::view(device, t, 0, rlt::tensor::ViewSpec<1>{});
std::cout << "First rows: " << std::endl;
rlt::print(device, first_rows);
std::cout << "Last rows: " << std::endl;
auto last_rows = rlt::view(device, t, 2, rlt::tensor::ViewSpec<1>{});
rlt::print(device, last_rows);
std::cout << "First cols: " << std::endl;
auto first_cols = rlt::view(device, t, 0, rlt::tensor::ViewSpec<2>{});
rlt::print(device, first_cols);

First rows:
  -5.703804e-01  -3.422589e-01   1.008072e-01
   8.660405e-01   1.061986e+00   6.006763e-01
   2.677846e+00  -1.236785e+00  -9.119245e-02

Last rows:
   5.376303e-01   3.618752e-01  -7.995225e-01
   9.058360e-02  -1.328507e+00   1.262284e+00
   5.642641e-01   5.340819e-01   1.266308e+00

First cols:
  -5.703804e-01  -9.118625e-01   5.376303e-01
   8.660405e-01   2.661995e+00   9.058360e-02
   2.677846e+00  -8.944708e-01   5.642641e-01
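One way to think about these views is as non-owning, strided windows into the tensor's memory. The sketch below illustrates that idea for the 3D case only; Tensor3, View2D and this view function are hypothetical names for this example, the row-major layout is an assumption, and the RLtools implementation may differ. Fixing an index along one dimension leaves a 2D view whose strides are those of the two remaining dimensions.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for a static 3D tensor; row-major layout is an assumption.
template <typename T, std::size_t D0, std::size_t D1, std::size_t D2>
struct Tensor3 {
    T data[D0 * D1 * D2] = {};
};

// Non-owning 2D window into existing memory, described by sizes and strides.
template <typename T>
struct View2D {
    T* data;
    std::size_t rows, cols, row_stride, col_stride;
    T& operator()(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r * row_stride + c * col_stride];
    }
};

// Fixes `index` along the compile-time dimension DIM and exposes the other two.
template <std::size_t DIM, typename T, std::size_t D0, std::size_t D1, std::size_t D2>
View2D<T> view(Tensor3<T, D0, D1, D2>& t, std::size_t index) {
    static_assert(DIM < 3, "a 3D tensor has dimensions 0, 1 and 2");
    const std::size_t dims[3] = {D0, D1, D2};
    const std::size_t strides[3] = {D1 * D2, D2, 1};
    assert(index < dims[DIM]);
    constexpr std::size_t a = (DIM == 0) ? 1 : 0; // first remaining dimension
    constexpr std::size_t b = (DIM == 2) ? 1 : 2; // second remaining dimension
    return View2D<T>{t.data + index * strides[DIM], dims[a], dims[b], strides[a], strides[b]};
}

// Usage: no data is copied, so writing through a view modifies the tensor.
inline void view_example() {
    Tensor3<int, 3, 3, 3> t;
    for (std::size_t i = 0; i < 27; i++) {
        t.data[i] = static_cast<int>(i);
    }
    View2D<int> first_cols = view<2>(t, 0); // elements t[i][j][0]
    assert(first_cols(2, 1) == 21);         // offset 2*9 + 1*3 + 0
    first_cols(0, 0) = -1;
    assert(t.data[0] == -1);
}
```

With this picture, the three views in the cell above differ only in which dimension is fixed: dimension 1 yields the selected row of every 3x3 block, and dimension 2 yields the selected column of every block.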