Stable Baselines3 Algorithm Reference
This document summarizes the characteristics of the RL algorithms available in Stable Baselines3 (and SB3-Contrib) to help you select the right algorithm for a specific task.
Algorithm Comparison Table
| Algorithm | Type | Action Space | Sample Efficiency | Training Speed | Use Case |
|---|---|---|---|---|---|
| PPO | On-Policy | All | Medium | Fast | General-purpose, stable |
| A2C | On-Policy | All | Low | Very Fast | Quick prototyping, multiprocessing |
| SAC | Off-Policy | Continuous | High | Medium | Continuous control, sample-efficient |
| TD3 | Off-Policy | Continuous | High | Medium | Continuous control, deterministic |
| DDPG | Off-Policy | Continuous | High | Medium | Continuous control (use TD3 instead) |
| DQN | Off-Policy | Discrete | Medium | Medium | Discrete actions, Atari games |
| HER | Off-Policy | All | Very High | Medium | Goal-conditioned tasks |
| RecurrentPPO | On-Policy | All | Medium | Slow | Partial observability (POMDP) |
Detailed Algorithm Characteristics
PPO (Proximal Policy Optimization)
Overview: General-purpose on-policy algorithm with good performance across many tasks.
Strengths:
- Stable and reliable training
- Works with all action space types (Discrete, Box, MultiDiscrete, MultiBinary)
- Good balance between sample efficiency and training speed
- Excellent for multiprocessing with vectorized environments
- Easy to tune
Weaknesses:
- Less sample-efficient than off-policy methods
- Requires many environment interactions
Best For:
- General-purpose RL tasks
- When stability is important
- When environment simulations are cheap
- Tasks with continuous or discrete actions
Hyperparameter Guidance:
- n_steps: 2048-4096 for continuous, 128-256 for Atari
- learning_rate: 3e-4 is a good default
- n_epochs: 10 for continuous, 4 for Atari
- batch_size: 64
- gamma: 0.99 (0.995-0.999 for long episodes)
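The guidance above distinguishes continuous-control defaults from Atari-style settings. Below is a minimal sketch of the Atari-style setup, assuming the Atari extras (ale-py) are installed; the environment ID, number of environments, and training budget are illustrative choices, not prescribed values.
```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

# Illustrative Atari setup: 8 parallel environments with standard frame stacking.
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=8)
env = VecFrameStack(env, n_stack=4)

model = PPO(
    "CnnPolicy",         # image observations
    env,
    n_steps=128,         # Atari-style rollout length from the guidance above
    n_epochs=4,
    batch_size=64,
    learning_rate=3e-4,
    gamma=0.99,
)
model.learn(total_timesteps=1_000_000)
```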
A2C (Advantage Actor-Critic)
Overview: Synchronous variant of A3C, simpler than PPO but less stable.
Strengths:
- Very fast training (simpler than PPO)
- Works with all action space types
- Good for quick prototyping
- Memory efficient
Weaknesses:
- Less stable than PPO
- Requires careful hyperparameter tuning
- Lower sample efficiency
Best For:
- Quick experimentation
- When training speed is critical
- Simple environments
Hyperparameter Guidance:
- n_steps: 5-256 depending on task
- learning_rate: 7e-4
- gamma: 0.99
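A minimal sketch wiring these defaults into the constructor; the environment, number of environments, and training budget are illustrative assumptions.
```python
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env

# Illustrative setup: 8 parallel copies of a simple discrete-action task.
env = make_vec_env("CartPole-v1", n_envs=8)

model = A2C("MlpPolicy", env, n_steps=5, learning_rate=7e-4, gamma=0.99)
model.learn(total_timesteps=100_000)
```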
SAC (Soft Actor-Critic)
Overview: Off-policy algorithm with entropy regularization, state-of-the-art for continuous control.
Strengths:
- Excellent sample efficiency
- Very stable training
- Automatic entropy tuning
- Good exploration through a stochastic policy
- State-of-the-art for robotics
Weaknesses:
- Only supports continuous action spaces (Box)
- Slower wall-clock time than on-policy methods
- More complex hyperparameters
Best For:
- Continuous control (robotics, physics simulations)
- When sample efficiency is critical
- Expensive environment simulations
- Tasks requiring good exploration
Hyperparameter Guidance:
- learning_rate: 3e-4
- buffer_size: 1M for most tasks
- learning_starts: 10000
- batch_size: 256
- tau: 0.005 (target network update rate)
- train_freq: 1 with gradient_steps=-1 for best performance
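A minimal sketch using these values; the environment and training budget are illustrative, and ent_coef="auto" simply makes the automatic entropy tuning mentioned above explicit (it is also the SB3 default).
```python
from stable_baselines3 import SAC

model = SAC(
    "MlpPolicy",
    "Pendulum-v1",        # illustrative continuous-control task
    learning_rate=3e-4,
    buffer_size=1_000_000,
    learning_starts=10_000,
    batch_size=256,
    tau=0.005,
    train_freq=1,
    gradient_steps=-1,
    ent_coef="auto",      # automatic entropy-coefficient tuning
)
model.learn(total_timesteps=100_000)
```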
TD3 (Twin Delayed DDPG)
Overview: Improved DDPG with clipped double Q-learning, delayed policy updates, and target policy smoothing.
Strengths:
- High sample efficiency
- Deterministic policy (good for deployment)
- More stable than DDPG
- Good for continuous control
Weaknesses:
- Only supports continuous action spaces (Box)
- Less exploration than SAC
- Requires careful tuning
Best For:
- Continuous control tasks
- When deterministic policies are preferred
- Sample-efficient learning
Hyperparameter Guidance:
- learning_rate: 1e-3
- buffer_size: 1M
- learning_starts: 10000
- batch_size: 100
- policy_delay: 2 (update policy every 2 critic updates)
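A minimal sketch using these values; the environment, noise scale, and training budget are illustrative assumptions. Gaussian action noise is the usual way in SB3 to add exploration to TD3's deterministic policy.
```python
import numpy as np
import gymnasium as gym
from stable_baselines3 import TD3
from stable_baselines3.common.noise import NormalActionNoise

env = gym.make("Pendulum-v1")
n_actions = env.action_space.shape[0]
# Gaussian exploration noise added to the deterministic policy's actions.
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3(
    "MlpPolicy",
    env,
    learning_rate=1e-3,
    buffer_size=1_000_000,
    learning_starts=10_000,
    batch_size=100,
    policy_delay=2,        # update the policy every 2 critic updates
    action_noise=action_noise,
)
model.learn(total_timesteps=100_000)
```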
DDPG (Deep Deterministic Policy Gradient)
Overview: Early off-policy continuous control algorithm.
Strengths:
- Continuous action space support
- Off-policy learning
Weaknesses:
- Less stable than TD3 or SAC
- Sensitive to hyperparameters
- Generally outperformed by TD3
Best For:
- Legacy compatibility
Recommendation: Use TD3 instead for new projects.
DQN (Deep Q-Network)
Overview: Classic off-policy algorithm for discrete action spaces.
Strengths:
- Sample-efficient for discrete actions
- Experience replay enables reuse of past data
- Proven success on Atari games
Weaknesses:
- Only supports discrete action spaces
- Can be unstable without proper tuning
- Overestimation bias
Best For:
- Discrete action tasks
- Atari games and similar environments
- When sample efficiency matters
Hyperparameter Guidance:
- learning_rate: 1e-4
- buffer_size: 100K-1M depending on task
- learning_starts: 50000 for Atari
- batch_size: 32
- exploration_fraction: 0.1
- exploration_final_eps: 0.05
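A minimal sketch using these values on a small discrete task; the environment, training budget, and the reduced learning_starts (the 50000 figure above targets Atari-scale runs) are illustrative assumptions.
```python
from stable_baselines3 import DQN

model = DQN(
    "MlpPolicy",
    "CartPole-v1",           # illustrative small discrete-action task
    learning_rate=1e-4,
    buffer_size=100_000,
    learning_starts=1_000,   # 50000 is appropriate for Atari-scale runs
    batch_size=32,
    exploration_fraction=0.1,
    exploration_final_eps=0.05,
)
model.learn(total_timesteps=100_000)
```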
Variants (in SB3-Contrib):
- QR-DQN: distributional RL version of DQN for better value estimates
- For action masking, SB3-Contrib provides MaskablePPO rather than a DQN variant
HER (Hindsight Experience Replay)
Overview: Not a standalone algorithm but a replay buffer strategy for goal-conditioned tasks.
Strengths:
- Dramatically improves learning in sparse-reward settings
- Learns from failures by relabeling goals
- Works with any off-policy algorithm (SAC, TD3, DQN)
Weaknesses:
- Only for goal-conditioned environments
- Requires a specific observation structure (Dict with "observation", "achieved_goal", "desired_goal")
Best For:
- Goal-conditioned tasks (robotics manipulation, navigation)
- Sparse-reward environments
- Tasks where the goal is clear but the reward is binary
Usage:
```python
from stable_baselines3 import SAC, HerReplayBuffer

# env must be a goal-conditioned environment with a Dict observation space
# ("observation", "achieved_goal", "desired_goal") and a compute_reward() method.
model = SAC(
    "MultiInputPolicy",
    env,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        n_sampled_goal=4,
        goal_selection_strategy="future",  # or "episode", "final"
    ),
)
```
RecurrentPPO
Overview: PPO with LSTM policy for handling partial observability.
Strengths:
- Handles partial observability (POMDP)
- Can learn temporal dependencies
- Good for tasks that require memory
Weaknesses:
- Slower training than standard PPO
- More complex to tune
- Requires sequential data
Best For:
- Partially observable environments
- Tasks requiring memory (e.g., navigation without a full map)
- Time-series problems
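A minimal sketch: RecurrentPPO is provided by SB3-Contrib (sb3_contrib), and the environment and step counts are illustrative. At inference time the LSTM state and episode-start flag must be passed back into predict().
```python
import numpy as np
from sb3_contrib import RecurrentPPO

model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", n_steps=128)
model.learn(total_timesteps=50_000)

# The recurrent state must be threaded through predict() at inference time.
vec_env = model.get_env()
obs = vec_env.reset()
lstm_states = None
episode_starts = np.ones((vec_env.num_envs,), dtype=bool)
action, lstm_states = model.predict(
    obs, state=lstm_states, episode_start=episode_starts, deterministic=True
)
```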
Algorithm Selection Guide
Decision Tree
- What is your action space?
  - Continuous (Box) → Consider PPO, SAC, or TD3
  - Discrete → Consider PPO, A2C, or DQN
  - MultiDiscrete/MultiBinary → Use PPO or A2C
- Is sample efficiency critical?
  - Yes (expensive simulations) → Use off-policy: SAC, TD3, DQN, or HER
  - No (cheap simulations) → Use on-policy: PPO, A2C
- Do you need fast wall-clock training?
  - Yes → Use PPO or A2C with vectorized environments
  - No → Any algorithm works
- Is the task goal-conditioned with sparse rewards?
  - Yes → Use HER with SAC or TD3
  - No → Continue with standard algorithms
- Is the environment partially observable?
  - Yes → Use RecurrentPPO
  - No → Use standard algorithms
Quick Recommendations
- Starting out / General tasks: PPO
- Continuous control / Robotics: SAC
- Discrete actions / Atari: DQN or PPO
- Goal-conditioned / Sparse rewards: SAC + HER
- Fast prototyping: A2C
- Sample efficiency critical: SAC, TD3, or DQN
- Partial observability: RecurrentPPO
Training Configuration Tips
For On-Policy Algorithms (PPO, A2C)
```python
# Use vectorized environments for speed
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

env_id = "CartPole-v1"  # any environment ID; CartPole is just an example
env = make_vec_env(env_id, n_envs=8, vec_env_cls=SubprocVecEnv)
model = PPO(
    "MlpPolicy",
    env,
    n_steps=2048,  # collect this many steps per environment before each update
    batch_size=64,
    n_epochs=10,
    learning_rate=3e-4,
    gamma=0.99,
)
```
For Off-Policy Algorithms (SAC, TD3, DQN)
```python
# Fewer environments, but use gradient_steps=-1 for efficiency
from stable_baselines3 import SAC
from stable_baselines3.common.env_util import make_vec_env

env_id = "Pendulum-v1"  # any continuous-control environment ID; Pendulum is just an example
env = make_vec_env(env_id, n_envs=4)
model = SAC(
    "MlpPolicy",
    env,
    buffer_size=1_000_000,
    learning_starts=10000,
    batch_size=256,
    train_freq=1,
    gradient_steps=-1,  # one gradient step per collected env step (4 per rollout with 4 envs)
    learning_rate=3e-4,
)
```
Common Pitfalls
- Using DQN with continuous actions - DQN only works with discrete actions
- Not using vectorized environments with PPO/A2C - Wastes potential speedup
- Using too few environments - On-policy methods need many samples
- Using too large replay buffer - Can cause memory issues
- Not tuning learning rate - Critical for stable training
- Ignoring reward scaling - Normalize rewards for better learning (see the VecNormalize sketch after this list)
- Wrong policy type - Use "CnnPolicy" for images, "MultiInputPolicy" for dict observations
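To address the reward-scaling pitfall above, here is a minimal sketch of reward (and observation) normalization with SB3's VecNormalize wrapper; the environment, hyperparameters, and file name are illustrative assumptions.
```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import VecNormalize

env = make_vec_env("Pendulum-v1", n_envs=4)
# Running mean/std normalization of observations and rewards, with reward clipping.
env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_reward=10.0)

model = PPO("MlpPolicy", env)
model.learn(total_timesteps=100_000)

# The normalization statistics must be saved with the model and reloaded for evaluation.
env.save("vec_normalize.pkl")
```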
Performance Benchmarks
Approximate expected performance (mean reward) on common benchmarks:
Continuous Control (MuJoCo)
- HalfCheetah-v3: PPO ~1800, SAC ~12000, TD3 ~9500
- Hopper-v3: PPO ~2500, SAC ~3600, TD3 ~3600
- Walker2d-v3: PPO ~3000, SAC ~5500, TD3 ~5000
Discrete Control (Atari)
- Breakout: PPO ~400, DQN ~300
- Pong: PPO ~20, DQN ~20
- Space Invaders: PPO ~1000, DQN ~800
Note: Performance varies significantly with hyperparameters and training time.
Additional Resources
- RL Baselines3 Zoo: Collection of pre-trained agents and hyperparameters: https://github.com/DLR-RM/rl-baselines3-zoo
- Hyperparameter Tuning: Use Optuna for systematic tuning (see the sketch at the end of this document)
- Custom Policies: Extend base policies for custom network architectures
- Contribution Repo: SB3-Contrib for experimental algorithms (QR-DQN, TQC, etc.)
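A toy sketch of Optuna-based tuning of PPO's learning rate, as referenced above; the objective, search range, environment, and trial budget are all illustrative assumptions (the RL Baselines3 Zoo ships a far more complete tuning setup).
```python
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial: optuna.Trial) -> float:
    # Sample a learning rate on a log scale and train a short PPO run with it.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    model = PPO("MlpPolicy", "CartPole-v1", learning_rate=lr, verbose=0)
    model.learn(total_timesteps=20_000)
    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```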