active: late 2004 to early 2006
download: sf.net/projects/verve-agents
api: link
paper: pdf

Verve

Reinforcement Learning in C++ and Python

Verve is a software library for reinforcement learning that I developed as part of my master's thesis. Its core object is an intelligent agent with user-defined inputs and outputs, including arbitrary reward signals to guide the agent's behavior. (The word "verve" means vitality, liveliness, aptitude, or talent.)

For all the details, see my master's thesis.

Overview

Reinforcement learning (RL) is a branch of machine learning involving sequential decision making with sparse feedback. It is generally useful in situations where a well-defined control algorithm is unknown but may be learned from direct sensory-motor experience. Given only a scalar reinforcement signal as it explores its environment, the agent optimizes its behavior to earn more reward, thus learning the desired control policy. RL provides a fairly general way to build systems that learn to control things. For example, such systems could learn to control robots, drive vehicles, or direct non-player characters (NPCs) in video games.

Note that RL is not always the right approach. For example, if rich feedback is available in the form of complete sensory-motor data samples, a better approach is pure supervised learning. RL is more appropriate when the desired action for any given input is unknown.

2005 poster describing initial motivation and plans

Verve combines several well-studied RL methods (radial basis functions, single-layer feedforward neural networks, temporal difference learning, and planning) with a few experimental methods (artificial curiosity and internal uncertainty estimation) in a novel learning architecture. Verve's design principles are 1) general purpose (applicable to a wide range of problems), and 2) easy to use (small, intuitive API). It is intended for both discrete environments with discrete time steps and real-time control in continuous environments (e.g. video games and robotics).

Library Features

Written in C++ with Python bindings.
Small, simple API.
Source code is heavily commented.
Library is unit tested.
Open source license (either BSD or LGPL).
Trained agents can be saved to XML files.
Value function can be accessed independently for visualization.
Learning can be disabled for trained agents to save computational resources.

Algorithm Details

State spaces: any number of discrete and/or continuous input dimensions (e.g. discrete battery level, continuous laser rangefinder distance, etc.).
Continuous input dimensions have a resolution parameter to determine their acuity.
Action space: one discrete action variable, with softmax action selection.
Continuous states represented with radial basis functions (RBFs), either exhaustively or auto-allocated as needed.
Actor-critic architecture, with two feedforward neural networks mapping state to value and state to action.
Temporal difference learning with eligibility traces (trains value function and policy networks).
Learned world model: feedforward neural network mapping current (state, action) to next (state, reward, uncertainty).
Planning with learned model via planning trajectories (random model-based forward samples from current state) allows RL from simulated experience.
Automatic planning trajectory length based on model uncertainty estimation.
Artificial curiosity, with bonus rewards based on model uncertainty, drives agents to explore new situations and continually improve their learned models.
Curiosity rewards automatically work in planning mode due to learned estimation of the model's own uncertainty.
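
To make the temporal difference and softmax items above more concrete, here is a minimal sketch of a TD(lambda) value update with accumulating eligibility traces over a linear value function, plus softmax action probabilities. This is the textbook form of both techniques, not Verve's actual code. The names LinearCritic and softmax, and the parameters alpha, gamma, lambda, and temperature, are illustrative assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax (Boltzmann) action probabilities from action preferences.
// A higher temperature gives more uniform (more exploratory) probabilities.
std::vector<double> softmax(const std::vector<double>& prefs, double temperature)
{
  // Subtract the maximum preference for numerical stability.
  const double maxPref = *std::max_element(prefs.begin(), prefs.end());
  std::vector<double> probs(prefs.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < prefs.size(); ++i)
  {
    probs[i] = std::exp((prefs[i] - maxPref) / temperature);
    sum += probs[i];
  }
  for (double& p : probs)
  {
    p /= sum;
  }
  return probs;
}

// Linear value function V(s) = w . phi(s), trained with TD(lambda).
// (Hypothetical class for illustration; not part of the Verve API.)
struct LinearCritic
{
  std::vector<double> weights;
  std::vector<double> traces;
  double alpha;   // Learning rate.
  double gamma;   // Discount factor.
  double lambda;  // Eligibility trace decay rate.

  LinearCritic(std::size_t numFeatures, double a, double g, double l)
  : weights(numFeatures, 0.0), traces(numFeatures, 0.0),
    alpha(a), gamma(g), lambda(l) {}

  // Estimated value of the state described by feature vector phi.
  double value(const std::vector<double>& phi) const
  {
    double v = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
    {
      v += weights[i] * phi[i];
    }
    return v;
  }

  // One TD(lambda) step for the transition phi -> phiNext with the given
  // reward. Returns the TD error.
  double update(const std::vector<double>& phi, double reward,
                const std::vector<double>& phiNext)
  {
    const double tdError = reward + gamma * value(phiNext) - value(phi);
    for (std::size_t i = 0; i < weights.size(); ++i)
    {
      // Decay the old trace, then mark the features active in this state.
      traces[i] = gamma * lambda * traces[i] + phi[i];
      weights[i] += alpha * tdError * traces[i];
    }
    return tdError;
  }
};
```

In an actor-critic setup, the same TD error would typically also be used to adjust the action preferences that are fed to softmax.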

Limitations

While the state representations automatically provide "features" (combinations of input dimensions) that make multilayer neural networks unnecessary in a sense (for value function, policy, and model), one drawback is the potential combinatorial explosion of features needed for high-dimensional input spaces. Auto-allocated RBFs help mitigate this to a large degree, so that in the worst case the necessary storage and processing grow in proportion to the agent's lifetime.
The RBF state representation enables some degree of generalization to nearby states, but this is nowhere near the powerful type of generalization provided by multilayer neural networks. The RBF approach is sort of a glorified form of memorization (which can be sufficient for domains not requiring generalization to novel states).
Agents learn to control a single discrete action variable, but not continuous or high-dimensional actions. A single continuous action variable can be discretized, but high-dimensional actions would need a new internal action representation.
Agents have no temporal state representation, so they cannot predict future events at specific times. One solution to this problem is to use an explicit representation of time (i.e. augmenting the state representation to include a short history of previous states).
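
The auto-allocation idea from the first limitation above can be sketched as follows: a new RBF is created only when the agent visits a state that no existing RBF covers, so storage grows with the states actually visited rather than with the full grid of possible states. This is an assumed scheme for illustration only, not Verve's actual allocation rule. RBFPool, width, and allocRadius are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// A pool of Gaussian RBFs that grows as new regions of state space are
// visited. (Hypothetical class for illustration; not part of the Verve API.)
class RBFPool
{
public:
  RBFPool(double width, double allocRadius)
  : mWidth(width), mAllocRadius(allocRadius) {}

  // Returns the activation of every RBF for the given state. If no existing
  // center lies within the allocation radius, a new RBF is first added,
  // centered on the state itself.
  std::vector<double> activate(const std::vector<double>& state)
  {
    bool covered = false;
    for (const std::vector<double>& center : mCenters)
    {
      if (std::sqrt(squaredDistance(center, state)) <= mAllocRadius)
      {
        covered = true;
        break;
      }
    }
    if (!covered)
    {
      mCenters.push_back(state);
    }

    std::vector<double> activations;
    activations.reserve(mCenters.size());
    for (const std::vector<double>& center : mCenters)
    {
      activations.push_back(
        std::exp(-squaredDistance(center, state) / (2.0 * mWidth * mWidth)));
    }
    return activations;
  }

  // Number of RBFs allocated so far.
  std::size_t size() const { return mCenters.size(); }

private:
  static double squaredDistance(const std::vector<double>& a,
                                const std::vector<double>& b)
  {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
      const double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  double mWidth;        // Standard deviation of each Gaussian.
  double mAllocRadius;  // Maximum distance at which a state counts as covered.
  std::vector<std::vector<double> > mCenters;
};
```

Because a center is added at most once per call, the pool can never hold more RBFs than the number of states the agent has observed, which matches the worst-case growth described above.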

Code Sample

This code shows how to create an Agent and set up a loop that updates the Agent's Observation, reward, and action. Note that several function calls here are user-defined (for example initEnvironment, computeContinuousInput0, computeReward, performAction2, and updateEnvironment).

// Define an AgentDescriptor.
verve::AgentDescriptor agentDesc;
agentDesc.addDiscreteSensor(4);  // 4 possible values.
agentDesc.addContinuousSensor();
agentDesc.addContinuousSensor();
agentDesc.setContinuousSensorResolution(10);
agentDesc.setNumOutputs(3); // 3 actions.

// Create the Agent and Observation.
verve::Agent agent(agentDesc);
verve::Observation obs;
obs.init(agent);

initEnvironment();

// Loop forever (or until desired learning performance).
while (true)
{
  const verve::real dt = 0.1;  // Time step in seconds.

  // Update the Observation based on the current state
  // of the world. Each sensor is accessed via an index.
  obs.setDiscreteValue(0, computeDiscreteInput());
  obs.setContinuousValue(0, computeContinuousInput0());
  obs.setContinuousValue(1, computeContinuousInput1());

  verve::real reward = computeReward();

  // Update the Agent.
  unsigned int action = agent.update(reward, obs, dt);

  // Apply the chosen action to the environment.
  switch(action)
  {
  case 0:
    performAction0();
    break;
  case 1:
    performAction1();
    break;
  case 2:
    performAction2();
    break;
  default:
    break;
  }

  // Simulate the environment ahead by dt seconds.
  updateEnvironment(dt);
}

Pendulum Swing-Up Task

The pendulum swing-up task is a standard RL problem that involves controlling a simple physical system. A pendulum suspended in midair must swing itself up to its highest position to receive a reward, but it is not physically possible to achieve this state without first learning to swing back and forth to build momentum.

For my thesis, I implemented this environment using OPAL to simulate physics and Ogre to render the graphics. This task and the inverted pendulum below helped verify that Verve agents can learn to solve standard control problems.

The pendulum swing-up test program, along with the inverted pendulum and curious robot playground described below, is available for download (Win32 only).

Visualization of value/policy neural networks, before and after learning the swing-up task. Links coming from the top are joined to a bias node (not shown).

Learned value function after trials 1, 5, 20, & 100 for the swing-up task. X axis is angle. Y axis is angular velocity.

Inverted Pendulum Task

Another standard but slightly more difficult control task is the inverted pendulum (aka cart-pole) problem. Here a pole (pendulum) is mounted upside-down on a cart. The cart can move left and right only, and the pole can rotate around just one axis. The agent controls the cart but not the pole, and the agent receives a reward on each time step where the pole has not fallen to either side. Thus, the agent must learn to move the cart back and forth in order to keep the pole balanced vertically.
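
The reward scheme described above can be sketched as a small function: the agent earns a fixed reward on every time step in which the pole is still upright. The failure threshold maxPoleAngle, the reward values 1 and 0, and the function names are illustrative assumptions, not values taken from the thesis.

```cpp
#include <cmath>

// Returns true once the pole has tipped past the failure angle on either
// side. poleAngle is measured from vertical, in radians.
// (Hypothetical helper; the threshold is a free parameter.)
bool poleHasFallen(double poleAngle, double maxPoleAngle)
{
  return std::fabs(poleAngle) > maxPoleAngle;
}

// Reward for one time step: 1 while the pole is balanced, 0 once it has
// fallen to either side.
double balanceReward(double poleAngle, double maxPoleAngle)
{
  return poleHasFallen(poleAngle, maxPoleAngle) ? 0.0 : 1.0;
}
```

A user-defined computeReward function like the one in the code sample could return such a value on each step.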

Artificial Curiosity

For much of my thesis, I focused on artificial curiosity combined with RL. After my thesis work, I made this simulated playground for curious robot exploration.

A simulated robot with a green curiosity indicator

A simulated playground for curious robot exploration