Balancing Chaos: A Physicist’s Dive into Reinforcement Learning with CartPole
What do rocket nozzles, humanoid robots, and an inverted pendulum have in common? They all live on the edge of instability. And now, so does our code.
In this project, I built a physics-based CartPole environment from scratch, visualized its motion, and trained a reinforcement learning agent (PPO) to master the task of balancing chaos — all in Python.
1. Modeling the Physics (Nonlinear Dynamics)
We model the classic inverted pendulum on a cart, governed by Newton’s laws with no linearizing assumptions.
- Cart mass: $ M $
- Pole mass: $ m $
- Pole half-length (pivot to center of mass): $ l $
- Gravity: $ g $
State vector:
\(\mathbf{s} = [x, \dot{x}, \theta, \dot{\theta}]\)
- Action: Apply a force $F$ left or right.
Equations of Motion
\[\ddot{\theta} = \frac{g \sin\theta - \cos\theta \left( \frac{F + m l \dot{\theta}^2 \sin\theta}{M + m} \right)}{l \left( \frac{4}{3} - \frac{m \cos^2\theta}{M + m} \right)}\]

\[\ddot{x} = \frac{F + m l \dot{\theta}^2 \sin\theta - m l \ddot{\theta} \cos\theta}{M + m}\]

- Nonlinear: no small-angle approximation; the full dynamics are simulated.
- Integration: Euler method with timestep $\Delta t = 0.02 \, \text{s}$
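To make the update rule concrete, here is a minimal sketch of one integration step in Python. The parameter values are illustrative (the classic textbook setup), and `dynamics_step` is a hypothetical helper name, not necessarily the repo’s code:

```python
import numpy as np

# Illustrative physical constants (classic CartPole values)
M, m, l, g = 1.0, 0.1, 0.5, 9.8   # cart mass, pole mass, pole half-length, gravity
DT = 0.02                          # integration timestep (s)

def dynamics_step(s, F):
    """One explicit-Euler step of the full nonlinear cart-pole dynamics."""
    x, x_dot, theta, theta_dot = s
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    total = M + m

    # Equations of motion from above (no small-angle approximation)
    temp = (F + m * l * theta_dot**2 * sin_t) / total
    theta_ddot = (g * sin_t - cos_t * temp) / (l * (4.0 / 3.0 - m * cos_t**2 / total))
    x_ddot = temp - m * l * theta_ddot * cos_t / total

    # Explicit Euler update of the state vector [x, x_dot, theta, theta_dot]
    return np.array([
        x + DT * x_dot,
        x_dot + DT * x_ddot,
        theta + DT * theta_dot,
        theta_dot + DT * theta_ddot,
    ], dtype=np.float32)
```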
2. Building a Custom Gymnasium Environment
Instead of using `gym.make("CartPole-v1")`, I built the environment from scratch, both for full control over the dynamics and as a learning exercise.
Key Features:
- Fully compliant with `gymnasium.Env`
- `reset(seed=...)` returns `(observation, info)` and `step()` returns the modern 5-element `(obs, reward, terminated, truncated, info)` tuple
- Exposes realistic nonlinear dynamics to the learning algorithm
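Here is a stripped-down sketch of what such an environment can look like; the class name, observation bounds, and force magnitude are illustrative rather than the repo’s exact `env.py`, and it reuses the `dynamics_step` helper from the physics sketch above:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class CartPoleEnv(gym.Env):
    """Minimal Gymnasium-compatible cart-pole (illustrative sketch)."""

    def __init__(self):
        high = np.array([4.8, np.inf, 0.42, np.inf], dtype=np.float32)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        self.action_space = spaces.Discrete(2)   # 0: push left, 1: push right
        self.state = None

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)                 # seeds self.np_random
        self.state = self.np_random.uniform(-0.05, 0.05, size=4).astype(np.float32)
        return self.state, {}                    # (observation, info)

    def step(self, action):
        F = 10.0 if action == 1 else -10.0
        self.state = dynamics_step(self.state, F)
        x, _, theta, _ = self.state
        terminated = bool(abs(x) > 2.4 or abs(theta) > 0.21)  # off track or pole > ~12 deg
        # (obs, reward, terminated, truncated, info): the modern 5-element tuple
        return self.state, 1.0, terminated, False, {}
```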
3. Visualization
Using matplotlib + patches, I animated the cart and pole in motion and saved videos with OpenCV.
Output includes:
- Left: Cart and pole animation
- Right: Reward accumulation plot
Videos:
- `cartpole_random.mp4`: random actions
- `cartpole_rl.mp4`: trained PPO agent
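The render-to-video pipeline looks roughly like the sketch below; the frame size, colors, FPS, and dummy trajectory are illustrative, not the repo’s exact code:

```python
import matplotlib
matplotlib.use("Agg")                  # headless rendering for video export
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import numpy as np
import cv2

def draw_frame(ax, x, theta, pole_len=1.0):
    """Draw the cart as a rectangle patch and the pole as a line at angle theta."""
    ax.clear()
    ax.set_xlim(-2.4, 2.4)
    ax.set_ylim(-0.5, 1.5)
    ax.add_patch(Rectangle((x - 0.2, 0.0), 0.4, 0.2, color="tab:blue"))   # cart body
    tip = (x + pole_len * np.sin(theta), 0.2 + pole_len * np.cos(theta))
    ax.plot([x, tip[0]], [0.2, tip[1]], lw=3, color="tab:red")            # pole

fig, ax = plt.subplots()
writer = None
for theta in np.linspace(0.0, 0.3, 50):          # dummy trajectory for illustration
    draw_frame(ax, 0.0, theta)
    fig.canvas.draw()
    frame = np.asarray(fig.canvas.buffer_rgba())[..., :3]   # RGBA -> RGB
    if writer is None:
        writer = cv2.VideoWriter("cartpole.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                                 50, (frame.shape[1], frame.shape[0]))
    writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))    # OpenCV expects BGR
writer.release()
```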
4. Reinforcement Learning with PPO
We framed CartPole as a Markov Decision Process:
- State: $s = [x, \dot{x}, \theta, \dot{\theta}]$
- Action: $a \in \{0, 1\}$
- Reward: $+1$ per timestep
- Goal: Maximize total expected reward
PPO Algorithm
PPO balances exploration and exploitation using a clipped objective:
\[L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t\right) \right]\]

where the probability ratio is
\[r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\]
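To make the clipping behavior concrete, here is the per-sample surrogate as a toy NumPy function, written directly from the formula above (an illustration, not SB3’s internal implementation):

```python
import numpy as np

def clipped_surrogate(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Per-sample PPO objective L^CLIP."""
    ratio = np.exp(log_prob_new - log_prob_old)              # r_t(theta)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)                    # pessimistic (lower) bound

# With a positive advantage, a ratio of 1.5 is clipped down to 1.2:
print(clipped_surrogate(np.log(1.5), 0.0, advantage=1.0))    # -> 1.2
```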
- Stable Baselines3 + Gymnasium made implementation simple.
- Training took ~50,000 steps to converge.
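With Stable Baselines3, the training script boils down to a few lines; hyperparameters are left at their defaults here, and `CartPoleEnv` refers to the custom environment sketched in section 2:

```python
from stable_baselines3 import PPO

env = CartPoleEnv()                        # custom environment from env.py
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)        # roughly where training converged
model.save("ppo_cartpole_custom")          # writes ppo_cartpole_custom.zip

# Roll out the trained policy for one episode
obs, _ = env.reset(seed=0)
for _ in range(500):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        break
```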
5. Results
| Agent Type | Behavior | Reward Plot |
|---|---|---|
| Random | Cart goes off screen quickly | Flat reward |
| Trained | Cart balances pole for 500+ timesteps | Increasing reward |
You can visually compare the two by watching the videos.
Project Structure
```
cartpole_rl/
├── env.py                   # Custom Gymnasium-compatible CartPole environment
├── train.py                 # Trains PPO agent using Stable Baselines3
├── visualize.py             # Realtime cart-pole animation using matplotlib
├── visualize_and_save.py    # Saves trained agent animation to MP4 (with reward plot)
├── record_random_policy.py  # Saves random policy animation to MP4
└── ppo_cartpole_custom.zip  # Trained PPO model (after training)
```
Links
- GitHub repo: cartpole_RL
- Videos:
  - `cartpole_random.mp4`
  - `cartpole_rl.mp4`
From nonlinear dynamics to learning control — this was a playground where physics met AI, and chaos was tamed.