Balancing Chaos: A Physicist’s Dive into Reinforcement Learning with CartPole

What do rocket nozzles, humanoid robots, and an inverted pendulum have in common? They all live on the edge of instability. And now, so does our code.

In this project, I built a physics-based CartPole environment from scratch, visualized its motion, and trained a reinforcement learning agent (PPO) to master the task of balancing chaos — all in Python.


1. Modeling the Physics (Nonlinear Dynamics)

We model the classic inverted pendulum on a cart, governed by Newton's laws with no linearizing assumptions.

  • Cart mass: $ M $
  • Pole mass: $ m $
  • Pole length: $ l $
  • Gravity: $ g $
  • State vector:
    \(\mathbf{s} = [x, \dot{x}, \theta, \dot{\theta}]\)

  • Action: Apply a force $F$ left or right.

Equations of Motion

\[\ddot{\theta} = \frac{g \sin\theta - \cos\theta \left( \frac{F + m l \dot{\theta}^2 \sin\theta}{M + m} \right)}{l \left( \frac{4}{3} - \frac{m \cos^2\theta}{M + m} \right)}\]

\[\ddot{x} = \frac{F + m l \dot{\theta}^2 \sin\theta - m l \ddot{\theta} \cos\theta}{M + m}\]
  • Nonlinear: No small-angle approximation — full dynamics.
  • Integration: Euler method with timestep $\Delta t = 0.02 \, \text{s}$
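The update above can be sketched directly in code. This is a minimal, dependency-free version of the dynamics step; the constants M, m, l, g are illustrative values chosen for the sketch, not necessarily the ones used in the project:

```python
import math

# Illustrative physical constants: cart mass, pole mass, pole half-length, gravity
M, m, l, g = 1.0, 0.1, 0.5, 9.81
DT = 0.02  # Euler timestep (s)

def step_dynamics(state, F):
    """One Euler step of the full nonlinear cart-pole dynamics."""
    x, x_dot, theta, theta_dot = state
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    # theta_ddot from the first equation of motion above
    temp = (F + m * l * theta_dot**2 * sin_t) / (M + m)
    theta_ddot = (g * sin_t - cos_t * temp) / (
        l * (4.0 / 3.0 - m * cos_t**2 / (M + m))
    )
    # x_ddot from the second equation, reusing temp = (F + m l th'^2 sin)/(M+m)
    x_ddot = temp - m * l * theta_ddot * cos_t / (M + m)
    # Explicit Euler update
    x += DT * x_dot
    x_dot += DT * x_ddot
    theta += DT * theta_dot
    theta_dot += DT * theta_ddot
    return (x, x_dot, theta, theta_dot)
```

A quick sanity check: with zero force and a small initial tilt, the tilt grows over time, which is exactly the instability the agent must fight.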

2. Building a Custom Gymnasium Environment

Instead of calling gym.make("CartPole-v1"), I built the environment from scratch, both for full control over the dynamics and as a learning exercise.

Key Features:

  • Fully compliant with gymnasium.Env
  • reset(seed) returns the modern (observation, info) pair and step() returns the 5-element (obs, reward, terminated, truncated, info) tuple
  • Exposes realistic dynamics for learning algorithms
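The interface looks roughly like the sketch below. To keep it runnable without any dependencies, this version does not actually subclass gymnasium.Env (the real environment does, and also declares observation_space and action_space); the thresholds and force magnitude are the conventional CartPole values, assumed here for illustration:

```python
import math
import random

class CartPoleEnv:
    """Sketch of the custom environment's interface.
    The project version subclasses gymnasium.Env; shown dependency-free here."""

    def __init__(self, M=1.0, m=0.1, l=0.5, g=9.81, dt=0.02):
        self.M, self.m, self.l, self.g, self.dt = M, m, l, g, dt
        self.force_mag = 10.0  # |F| applied left or right
        self.state = None
        self.steps = 0

    def reset(self, seed=None):
        rng = random.Random(seed)
        self.state = [rng.uniform(-0.05, 0.05) for _ in range(4)]
        self.steps = 0
        return list(self.state), {}  # (observation, info)

    def step(self, action):
        x, x_dot, th, th_dot = self.state
        F = self.force_mag if action == 1 else -self.force_mag
        sin_t, cos_t = math.sin(th), math.cos(th)
        # Full nonlinear dynamics from section 1
        temp = (F + self.m * self.l * th_dot**2 * sin_t) / (self.M + self.m)
        th_dd = (self.g * sin_t - cos_t * temp) / (
            self.l * (4.0 / 3.0 - self.m * cos_t**2 / (self.M + self.m)))
        x_dd = temp - self.m * self.l * th_dd * cos_t / (self.M + self.m)
        self.state = [x + self.dt * x_dot, x_dot + self.dt * x_dd,
                      th + self.dt * th_dot, th_dot + self.dt * th_dd]
        self.steps += 1
        terminated = abs(self.state[0]) > 2.4 or abs(self.state[2]) > 0.21
        truncated = self.steps >= 500
        # Modern 5-element tuple: obs, reward, terminated, truncated, info
        return list(self.state), 1.0, terminated, truncated, {}
```

Because the class follows the Gymnasium API shape, swapping in the real gymnasium.Env base class (plus space declarations) is all that is needed for Stable Baselines3 to train against it.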

3. Visualization

Using matplotlib + patches, I animated the cart and pole in motion and saved videos with OpenCV.

Output includes:

  • Left: Cart and pole animation
  • Right: Reward accumulation plot

Videos:

  • cartpole_random.mp4: Random actions
  • cartpole_rl.mp4: Trained PPO agent
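The drawing side can be sketched with a Rectangle patch for the cart and a line for the pole. The geometry values below are illustrative, and the OpenCV video-writing step is omitted so the sketch stays self-contained:

```python
import math

import matplotlib
matplotlib.use("Agg")  # headless rendering, so this runs without a display
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

def draw_frame(x, theta, ax):
    """Draw the cart (rectangle) and pole (line) for one state."""
    ax.clear()
    cart_w, cart_h, pole_len = 0.4, 0.2, 1.0  # illustrative geometry
    ax.add_patch(Rectangle((x - cart_w / 2, 0.0), cart_w, cart_h,
                           color="tab:blue"))
    # Pole tip measured from the top of the cart; theta = 0 is upright
    tip_x = x + pole_len * math.sin(theta)
    tip_y = cart_h + pole_len * math.cos(theta)
    ax.plot([x, tip_x], [cart_h, tip_y], lw=3, color="tab:red")
    ax.set_xlim(-2.4, 2.4)
    ax.set_ylim(-0.5, 1.6)
    ax.set_aspect("equal")

fig, ax = plt.subplots()
draw_frame(0.3, 0.1, ax)
fig.canvas.draw()  # rasterize; in the project each frame then goes to a video file
```

In the project, each rasterized frame is handed to OpenCV's video writer, and a second axis accumulates the reward curve alongside the animation.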

4. Reinforcement Learning with PPO

We framed CartPole as a Markov Decision Process:

  • State: $s = [x, \dot{x}, \theta, \dot{\theta}]$
  • Action: $a \in \{0, 1\}$
  • Reward: $+1$ per timestep
  • Goal: Maximize total expected reward

PPO Algorithm

PPO balances exploration and exploitation using a clipped objective:

\[L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t\right) \right]\]

Where:

\[r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\]
  • Stable Baselines3 + Gymnasium made implementation simple.
  • Training took ~50,000 steps to converge.
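The clipped objective itself is only a few lines of NumPy. This is an illustrative standalone computation of $L^{\text{CLIP}}$ from log-probabilities and advantage estimates, not Stable Baselines3's internal implementation:

```python
import numpy as np

def clipped_surrogate(new_logp, old_logp, advantages, eps=0.2):
    """PPO clipped surrogate: mean_t min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t),
    where r_t = exp(new_logp - old_logp) is the probability ratio."""
    ratio = np.exp(new_logp - old_logp)                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```

The clip keeps a single update from moving the policy too far: once the ratio leaves $[1-\epsilon, 1+\epsilon]$, a positive advantage stops contributing extra gradient. In practice the optimizer maximizes this quantity (i.e., minimizes its negative).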

5. Results

| Agent Type | Behavior | Reward Plot |
| --- | --- | --- |
| Random | Cart goes off screen quickly | Flat reward |
| Trained | Cart balances pole for 500+ timesteps | Increasing reward |

You can visually compare the two by watching the videos.

Project Structure

```
cartpole_rl/
├── env.py                  # Custom Gymnasium-compatible CartPole environment
├── train.py                # Trains PPO agent using Stable Baselines3
├── visualize.py            # Realtime cart-pole animation using matplotlib
├── visualize_and_save.py   # Saves trained agent animation to MP4 (with reward plot)
├── record_random_policy.py # Saves random policy animation to MP4
└── ppo_cartpole_custom.zip # Trained PPO model (after training)
```


From nonlinear dynamics to learning control — this was a playground where physics met AI, and chaos was tamed.

This post is licensed under CC BY 4.0 by the author.