The Loop Interface#

The Loop Interface is an abstraction of the RL training loop that provides sensible defaults for each RL algorithm. The goal is to strike a balance: it should work out of the box (batteries included) without much configuration and boilerplate code, while remaining flexible enough to support many modifications of the training procedure. The latter is mainly achieved by not handing control flow over to a train() method, where the user would have to learn about and implement hooks/callbacks to modify the training, but rather by exposing a step(...) operation that advances the RL algorithm by one step (not necessarily a single environment step). In this way the user can implement schedules (e.g. for the learning rate) or curricula, as well as evaluation of intermediate policies and checkpointing, straightforwardly and without learning a new API.
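
For illustration, a minimal sketch of this pattern (it uses the loop state ls, the device, and the optimizer fields that are set up later in this example; the decay factor is an arbitrary assumption):

while(!rlt::step(device, ls)){
    // e.g. an exponential learning rate schedule, applied after each PPO step
    ls.actor_optimizer.parameters.alpha *= 0.99;
    ls.critic_optimizer.parameters.alpha *= 0.99;
    // checkpointing or curricula can be implemented here in the same way
}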

The Loop Interface consists of three main components:

  1. Configuration: Compile-time configuration that defines the network structures and algorithm parameters. It takes an environment type as a template parameter.

  2. State: The configuration gives rise to a state object that stores the whole state of the training procedure.

  3. Step operation: A step operation that advances the state by one step according to the algorithm.

By wrapping the configuration, e.g. with the rl_tools::rl::loop::steps::evaluation module (as shown later in this example), the state gets extended automatically and the step(...) operations are called recursively to advance the partial states of all modules.

[1]:
#define RL_TOOLS_BACKEND_ENABLE_OPENBLAS
#include <rl_tools/operations/cpu_mux.h>
#include <rl_tools/rl/environments/pendulum/operations_generic.h>
#include <rl_tools/nn/optimizers/adam/instance/operations_generic.h>
#include <rl_tools/nn/operations_cpu_mux.h>
#include <rl_tools/nn/layers/standardize/operations_generic.h>
#include <rl_tools/nn_models/mlp_unconditional_stddev/operations_generic.h>
#include <rl_tools/nn_models/sequential/operations_generic.h>
#include <rl_tools/nn/optimizers/adam/operations_generic.h>
namespace rlt = rl_tools;
#pragma cling load("openblas")
[2]:
using DEVICE = rlt::devices::DEVICE_FACTORY<>;
using RNG = decltype(rlt::random::default_engine(typename DEVICE::SPEC::RANDOM{}));
using T = float;
using TI = typename DEVICE::index_t;

For each RL algorithm in RLtools we provide a loop interface consisting of a configuration, a corresponding state data structure, and a step operation. To use the loop interface we include the core loop of, e.g., PPO:

[3]:
#include <rl_tools/rl/algorithms/ppo/loop/core/config.h>
#include <rl_tools/rl/algorithms/ppo/loop/core/operations_generic.h>

Next we can define the MDP in the form of an environment (see Custom Environment for details):

[4]:
using PENDULUM_SPEC = rlt::rl::environments::pendulum::Specification<T, TI, rlt::rl::environments::pendulum::DefaultParameters<T>>;
using ENVIRONMENT = rlt::rl::environments::Pendulum<PENDULUM_SPEC>;

Based on this environment we can create the default PPO loop config (with default shapes for the actor and critic networks as well as other parameters):

[5]:
struct LOOP_CORE_PARAMETERS: rlt::rl::algorithms::ppo::loop::core::DefaultParameters<T, TI, ENVIRONMENT>{
    static constexpr TI EPISODE_STEP_LIMIT = 200;
    static constexpr TI TOTAL_STEP_LIMIT = 300000;
    static constexpr TI STEP_LIMIT = TOTAL_STEP_LIMIT/(ON_POLICY_RUNNER_STEPS_PER_ENV * N_ENVIRONMENTS) + 1; // number of PPO steps
};
using LOOP_CORE_CONFIG = rlt::rl::algorithms::ppo::loop::core::Config<T, TI, RNG, ENVIRONMENT, LOOP_CORE_PARAMETERS>;

This config, which can be customized by creating a subclass and overriding the desired fields, gives rise to a loop state:

[6]:
using LOOP_CORE_STATE = typename LOOP_CORE_CONFIG::template State<LOOP_CORE_CONFIG>;

Next we can create an instance of this state and allocate as well as initialize it:

[7]:
DEVICE device;
LOOP_CORE_STATE lsc;
rlt::malloc(device, lsc);
TI seed = 1337;
rlt::init(device, lsc, seed);

Now we can execute PPO steps. A PPO step consists of collecting LOOP_CORE_CONFIG::CORE_PARAMETERS::N_ENVIRONMENTS * LOOP_CORE_CONFIG::CORE_PARAMETERS::ON_POLICY_RUNNER_STEPS_PER_ENV environment steps using the OnPolicyRunner and then training the actor and critic for LOOP_CORE_CONFIG::CORE_PARAMETERS::PPO_PARAMETERS::N_EPOCHS epochs:

[8]:
bool finished = rlt::step(device, lsc);
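
As a quick sanity check (a sketch based on the parameter names used in the configuration above), the compile-time parameters can be inspected directly, e.g. to see how many environment interactions a single call to step(...) corresponds to:

#include <iostream>
std::cout << "Environment steps per PPO iteration: "
          << LOOP_CORE_PARAMETERS::ON_POLICY_RUNNER_STEPS_PER_ENV * LOOP_CORE_PARAMETERS::N_ENVIRONMENTS << std::endl;
std::cout << "Number of PPO steps: " << LOOP_CORE_PARAMETERS::STEP_LIMIT << std::endl;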

Since we don’t want to re-implement e.g. the evaluation for each algorithm, we can wrap the PPO core config in an evaluation loop config which adds its own configuration, state data structure, and step operation:

[9]:
#include <rl_tools/rl/environments/pendulum/ui_xeus.h> // For the interactive UI used later on
#include <rl_tools/rl/loop/steps/evaluation/config.h>
#include <rl_tools/rl/loop/steps/evaluation/operations_generic.h>
[10]:
template <typename NEXT>
struct LOOP_EVAL_PARAMETERS: rlt::rl::loop::steps::evaluation::Parameters<T, TI, NEXT>{
    static constexpr TI EVALUATION_INTERVAL = 4;
    static constexpr TI NUM_EVALUATION_EPISODES = 10;
    static constexpr TI N_EVALUATIONS = NEXT::CORE_PARAMETERS::STEP_LIMIT / EVALUATION_INTERVAL;
};
using LOOP_CONFIG = rlt::rl::loop::steps::evaluation::Config<LOOP_CORE_CONFIG, LOOP_EVAL_PARAMETERS<LOOP_CORE_CONFIG>>;
using LOOP_STATE = typename LOOP_CONFIG::template State<LOOP_CONFIG>;
[11]:
LOOP_STATE ls;
rlt::malloc(device, ls);
rlt::init(device, ls, seed);
ls.actor_optimizer.parameters.alpha = 1e-3; // increasing the learning rate leads to faster training of the Pendulum-v1 environment
ls.critic_optimizer.parameters.alpha = 1e-3;
[12]:
while(!rlt::step(device, ls)){
    if(ls.step == 5){
        std::cout << "Stepping yourself > hooks/callbacks" << std::endl;
    }
}
Step: 0/74 Mean return: -1682.39 Mean episode length: 200
Step: 4/74 Mean return: -1419.99 Mean episode length: 200
Stepping yourself > hooks/callbacks
Step: 8/74 Mean return: -1183.42 Mean episode length: 200
Step: 12/74 Mean return: -1380.93 Mean episode length: 200
Step: 16/74 Mean return: -1277.29 Mean episode length: 200
Step: 20/74 Mean return: -1395.43 Mean episode length: 200
Step: 24/74 Mean return: -1492.18 Mean episode length: 200
Step: 28/74 Mean return: -1057.63 Mean episode length: 200
Step: 32/74 Mean return: -942.946 Mean episode length: 200
Step: 36/74 Mean return: -607.803 Mean episode length: 200
Step: 40/74 Mean return: -530.716 Mean episode length: 200
Step: 44/74 Mean return: -206.616 Mean episode length: 200
Step: 48/74 Mean return: -165.582 Mean episode length: 200
Step: 52/74 Mean return: -160.067 Mean episode length: 200
Step: 56/74 Mean return: -163.664 Mean episode length: 200
Step: 60/74 Mean return: -174.3 Mean episode length: 200
Step: 64/74 Mean return: -220.182 Mean episode length: 200
Step: 68/74 Mean return: -202.481 Mean episode length: 200
[13]:
using UI_SPEC = rlt::rl::environments::pendulum::ui::xeus::Specification<T, TI, 400, 100>; // float type, index type, size, playback speed (in %)
using UI = rlt::rl::environments::pendulum::ui::xeus::UI<UI_SPEC>;
UI ui;
rlt::MatrixDynamic<rlt::matrix::Specification<T, TI, 1, ENVIRONMENT::Observation::DIM>> observations_mean;
rlt::MatrixDynamic<rlt::matrix::Specification<T, TI, 1, ENVIRONMENT::Observation::DIM>> observations_std;
rlt::malloc(device, observations_mean);
rlt::malloc(device, observations_std);
rlt::set_all(device, observations_mean, 0);
rlt::set_all(device, observations_std, 1);
ui.canvas
[13]: (the interactive pendulum UI canvas is displayed here)
[14]:
rlt::rl::utils::evaluation::Result<rlt::rl::utils::evaluation::Specification<T, TI, LOOP_CORE_CONFIG::ENVIRONMENT_EVALUATION, 1, LOOP_CORE_PARAMETERS::EPISODE_STEP_LIMIT>> result;
rlt::evaluate(device, ls.env_eval, ui, rlt::get_actor(ls), result, ls.actor_deterministic_evaluation_buffers, ls.rng_eval);

You can execute the previous cell again to run another rollout using the UI.

Moving Beyond the Loop Interface#

If you want more fine-grained control than the loop interface permits (e.g. for researching modifications to the algorithms), you can have a look at the definition of the config, the state, and particularly the step(...) implementation (e.g. in rl_tools/rl/algorithms/ppo/loop/core/operations_generic.h in the case of PPO). You can instantiate the data structures in a similar way and then call the PPO operations like collect(...), estimate_generalized_advantages(...), and train(...) yourself.
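
As a starting point, the loop state is a plain struct whose members can be accessed directly, so the loop interface can also be mixed with manual calls. A small sketch using only members that appear earlier in this example (the remaining members, e.g. the PPO struct and the on-policy runner, are defined in the core state header referenced above):

auto& actor = rlt::get_actor(ls); // the current policy, e.g. for manual evaluation or checkpointing
std::cout << "PPO step: " << ls.step
          << " actor learning rate: " << ls.actor_optimizer.parameters.alpha << std::endl;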