Balancing Chaos: A Physicist’s Dive into Reinforcement Learning with CartPole
What do rocket nozzles, humanoid robots, and an inverted pendulum have in common? They all live on the edge of instability. And now, so does our code.
In this project, I built a physics-based CartPole environment from scratch, visualized its motion, and trained a reinforcement learning agent (PPO) to master the task of balancing chaos — all in Python.
1. Modeling the Physics (Nonlinear Dynamics)
We model the classic inverted pendulum on a cart, governed by Newton’s laws with no linearizing assumptions.
- Cart mass: $ M $
- Pole mass: $ m $
- Pole half-length (pivot to center of mass): $ l $
- Gravity: $ g $
State vector:
\(\mathbf{s} = [x, \dot{x}, \theta, \dot{\theta}]\)
- Action: Apply a force $F$ left or right.
Equations of Motion
\[\ddot{\theta} = \frac{g \sin\theta - \cos\theta \left( \frac{F + m l \dot{\theta}^2 \sin\theta}{M + m} \right)}{l \left( \frac{4}{3} - \frac{m \cos^2\theta}{M + m} \right)}\]

\[\ddot{x} = \frac{F + m l \dot{\theta}^2 \sin\theta - m l \ddot{\theta} \cos\theta}{M + m}\]

- Nonlinear: no small-angle approximation; the full dynamics are simulated.
- Integration: Euler method with timestep $\Delta t = 0.02 \, \text{s}$
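To make the update rule concrete, here is a minimal sketch of one integration step in Python. The parameter values are illustrative (the classic textbook setup), and `dynamics_step` is a hypothetical helper name, not necessarily the repo’s code:

```python
import numpy as np

# Illustrative physical constants (classic CartPole values)
M, m, l, g = 1.0, 0.1, 0.5, 9.8   # cart mass, pole mass, pole half-length, gravity
DT = 0.02                          # integration timestep (s)

def dynamics_step(s, F):
    """One explicit-Euler step of the full nonlinear cart-pole dynamics."""
    x, x_dot, theta, theta_dot = s
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    total = M + m

    # Equations of motion from above (no small-angle approximation)
    temp = (F + m * l * theta_dot**2 * sin_t) / total
    theta_ddot = (g * sin_t - cos_t * temp) / (l * (4.0 / 3.0 - m * cos_t**2 / total))
    x_ddot = temp - m * l * theta_ddot * cos_t / total

    # Explicit Euler update of the state vector [x, x_dot, theta, theta_dot]
    return np.array([
        x + DT * x_dot,
        x_dot + DT * x_ddot,
        theta + DT * theta_dot,
        theta_dot + DT * theta_ddot,
    ], dtype=np.float32)
```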
2. Building a Custom Gymnasium Environment
Instead of using `gym.make("CartPole-v1")`, I built the environment from scratch, both for full control over the dynamics and as a learning exercise.
Key Features:
- Fully compliant with `gymnasium.Env`
- `reset(seed=...)` returns `(observation, info)` and `step()` returns the modern 5-element `(obs, reward, terminated, truncated, info)` tuple
- Exposes realistic nonlinear dynamics to the learning algorithm
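Here is a stripped-down sketch of what such an environment can look like; the class name, observation bounds, and force magnitude are illustrative rather than the repo’s exact `env.py`, and it reuses the `dynamics_step` helper from the physics sketch above:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class CartPoleEnv(gym.Env):
    """Minimal Gymnasium-compatible cart-pole (illustrative sketch)."""

    def __init__(self):
        high = np.array([4.8, np.inf, 0.42, np.inf], dtype=np.float32)
        self.observation_space = spaces.Box(-high, high, dtype=np.float32)
        self.action_space = spaces.Discrete(2)   # 0: push left, 1: push right
        self.state = None

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)                 # seeds self.np_random
        self.state = self.np_random.uniform(-0.05, 0.05, size=4).astype(np.float32)
        return self.state, {}                    # (observation, info)

    def step(self, action):
        F = 10.0 if action == 1 else -10.0
        self.state = dynamics_step(self.state, F)
        x, _, theta, _ = self.state
        terminated = bool(abs(x) > 2.4 or abs(theta) > 0.21)  # off track or pole > ~12 deg
        # (obs, reward, terminated, truncated, info): the modern 5-element tuple
        return self.state, 1.0, terminated, False, {}
```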
3. Visualization
Using matplotlib + patches, I animated the cart and pole in motion and saved videos with OpenCV.
Output includes:
- Left: Cart and pole animation
- Right: Reward accumulation plot
Videos:
- `cartpole_random.mp4`: random actions
- `cartpole_rl.mp4`: trained PPO agent
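The render-to-video pipeline looks roughly like the sketch below; the frame size, colors, FPS, and dummy trajectory are illustrative, not the repo’s exact code:

```python
import matplotlib
matplotlib.use("Agg")                  # headless rendering for video export
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import numpy as np
import cv2

def draw_frame(ax, x, theta, pole_len=1.0):
    """Draw the cart as a rectangle patch and the pole as a line at angle theta."""
    ax.clear()
    ax.set_xlim(-2.4, 2.4)
    ax.set_ylim(-0.5, 1.5)
    ax.add_patch(Rectangle((x - 0.2, 0.0), 0.4, 0.2, color="tab:blue"))   # cart body
    tip = (x + pole_len * np.sin(theta), 0.2 + pole_len * np.cos(theta))
    ax.plot([x, tip[0]], [0.2, tip[1]], lw=3, color="tab:red")            # pole

fig, ax = plt.subplots()
writer = None
for theta in np.linspace(0.0, 0.3, 50):          # dummy trajectory for illustration
    draw_frame(ax, 0.0, theta)
    fig.canvas.draw()
    frame = np.asarray(fig.canvas.buffer_rgba())[..., :3]   # RGBA -> RGB
    if writer is None:
        writer = cv2.VideoWriter("cartpole.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                                 50, (frame.shape[1], frame.shape[0]))
    writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))    # OpenCV expects BGR
writer.release()
```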
4. Reinforcement Learning with PPO
We framed CartPole as a Markov Decision Process:
- State: $s = [x, \dot{x}, \theta, \dot{\theta}]$
- Action: $a \in \{0, 1\}$
- Reward: $+1$ per timestep
- Goal: Maximize total expected reward
PPO Algorithm
PPO balances exploration and exploitation using a clipped objective:
\[L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t\right) \right]\]

where the probability ratio is
\[r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\]
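To make the clipping behavior concrete, here is the per-sample surrogate as a toy NumPy function, written directly from the formula above (an illustration, not SB3’s internal implementation):

```python
import numpy as np

def clipped_surrogate(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Per-sample PPO objective L^CLIP."""
    ratio = np.exp(log_prob_new - log_prob_old)              # r_t(theta)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)                    # pessimistic (lower) bound

# With a positive advantage, a ratio of 1.5 is clipped down to 1.2:
print(clipped_surrogate(np.log(1.5), 0.0, advantage=1.0))    # -> 1.2
```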
- Stable Baselines3 + Gymnasium made implementation simple.
- Training took ~50,000 steps to converge.
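With Stable Baselines3, the training script boils down to a few lines; hyperparameters are left at their defaults here, and `CartPoleEnv` refers to the custom environment sketched in section 2:

```python
from stable_baselines3 import PPO

env = CartPoleEnv()                        # custom environment from env.py
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)        # roughly where training converged
model.save("ppo_cartpole_custom")          # writes ppo_cartpole_custom.zip

# Roll out the trained policy for one episode
obs, _ = env.reset(seed=0)
for _ in range(500):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        break
```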
5. Results
| Agent Type | Behavior | Reward Plot |
|---|---|---|
| Random | Cart goes off screen quickly | Flat reward |
| Trained | Cart balances pole for 500+ timesteps | Increasing reward |
You can visually compare the two by watching the videos.
Project Structure
```
cartpole_rl/
├── env.py                   # Custom Gymnasium-compatible CartPole environment
├── train.py                 # Trains PPO agent using Stable Baselines3
├── visualize.py             # Realtime cart-pole animation using matplotlib
├── visualize_and_save.py    # Saves trained agent animation to MP4 (with reward plot)
├── record_random_policy.py  # Saves random policy animation to MP4
└── ppo_cartpole_custom.zip  # Trained PPO model (after training)
```
Links
- GitHub repo: cartpole_RL
- Videos:
  - `cartpole_random.mp4`
  - `cartpole_rl.mp4`
From nonlinear dynamics to learning control — this was a playground where physics met AI, and chaos was tamed.