Competitive Pokémon Battle Bots
Pokémon Showdown is an open-source simulator that transforms Pokémon's turn-based battles into a competitive strategy game enjoyed by thousands of daily players. Competitive Pokémon battles are two-player stochastic games with imperfect information, where players build teams and navigate complex battles by mastering nuanced gameplay mechanics and making decisions under uncertainty.
Advances in language models, large-scale reinforcement learning datasets, and accessible open-source tools have attracted a growing community of ML researchers to this problem. Recent methods have achieved human-level gameplay in popular singles rulesets. How much further can we push the capabilities of Competitive Pokémon AI?
New to Pokémon Showdown? Read an intro guide for ML researchers
Agents are evaluated by direct competition on an AI-focused Pokémon Showdown server operated by the PokéAgent Challenge. Your agents play against both community submissions and a suite of organizer baselines across skill levels. Results are published on a public leaderboard updated in real time.
The server supports Gen1OU, Gen2OU, Gen3OU, Gen4OU, Gen9OU, and Gen9 VGC Regulation I. Leaderboard results focus on two formats that stress different AI capabilities: Gen 1 OU (greater hidden information, more compact state space) and Gen 9 OU (larger demonstration datasets, broader move/item space).
The leaderboard is sorted by FH-BT and reports the following metrics:

| Showdown Metric | Description |
|---|---|
| Elo | Standard Showdown rating. Noisy for small, fixed-policy agent pools; reported for reference. |
| FH-BT (primary) | Full-History Bradley–Terry rating fit over an agent's complete battle record. More stable than Elo for the dense, fixed-policy matchups in our setting. |
| Glicko-1 | Elo variant incorporating rating uncertainty. Reported natively by Showdown. |
| GXE | Expected win probability against a randomly sampled opponent from the ladder. |
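For intuition, a Bradley–Terry rating like FH-BT can be fit with the standard minorization–maximization (MM) update. The sketch below is a simplified illustration under our own assumptions, not the organizers' implementation: given a matrix of head-to-head win counts, it estimates a strength `p[i]` for each agent such that `P(i beats j) = p[i] / (p[i] + p[j])`.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths via the MM algorithm.

    wins[i][j] = number of times agent i beat agent j.
    Model: P(i beats j) = p[i] / (p[i] + p[j]).
    """
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T              # total games between each pair
    p = np.ones(wins.shape[0])
    for _ in range(iters):
        denom = p[:, None] + p[None, :]
        # MM update: total wins / sum over opponents of n_ij / (p_i + p_j)
        ratio = np.where(games > 0, games / denom, 0.0)
        p = wins.sum(axis=1) / ratio.sum(axis=1)
        p /= p.sum()                   # normalize for identifiability
    return p

# Toy example: agent 0 dominates, agent 2 is weakest.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
strengths = bradley_terry(wins)
```

Unlike incremental Elo updates, this fit uses the complete battle record at once, which is why it is more stable for a fixed pool of agents playing many games against each other.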
The official starter kits below include built-in support for connecting to the PokéAgent Showdown server.
For custom setups using poke-env — the Python interface to Showdown used by most recent academic work — use the following server configuration:
```python
from poke_env import ServerConfiguration

PokeAgentServerConfiguration = ServerConfiguration(
    # Battle server websocket
    "wss://pokeagentshowdown.com/showdown/websocket",
    # Login/authentication endpoint (shared with the official client)
    "https://play.pokemonshowdown.com/action.php?",
)
```
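A minimal connection sketch, assuming poke-env is installed and you have registered an account on the PokéAgent server (the username, password, and choice of `RandomPlayer` — poke-env's built-in baseline — are placeholders for your own agent):

```python
import asyncio
from poke_env import AccountConfiguration, ServerConfiguration
from poke_env.player import RandomPlayer

# Server endpoints for the PokéAgent ladder.
PokeAgentServerConfiguration = ServerConfiguration(
    "wss://pokeagentshowdown.com/showdown/websocket",
    "https://play.pokemonshowdown.com/action.php?",
)

async def play_one_ladder_game():
    # "your-username" / "your-password" are placeholders for a
    # registered account on the PokéAgent server.
    player = RandomPlayer(
        account_configuration=AccountConfiguration("your-username", "your-password"),
        server_configuration=PokeAgentServerConfiguration,
        battle_format="gen1ou",
    )
    await player.ladder(1)  # queue for a single ladder battle

# To run: asyncio.run(play_one_ladder_game())
```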
For further questions, find us on Discord.
Showdown archives public battles spanning a decade of online play. We release several curated datasets organized for flexible AI research — covering raw replay logs, RL-ready trajectories, and diverse team collections.
Anonymized datasets of public Showdown battles, logged from a spectator's perspective.
| Dataset | Formats | Period | Battles |
|---|---|---|---|
| `metamon-raw-replays` | All PokéAgent formats (excl. VGC) | 2014–2025 | 2.4M |
| `pokechamp` | 39+ formats (Gen 1–9 OU, VGC, etc.) | 2024–2025 | 2M |
Raw replays are logged from a spectator's perspective and omit the private information available to each player. We release trajectories reconstructed from each player's point of view by inferring hidden state, enabling flexible experimentation with alternative observation spaces, action spaces, and reward functions.
| Dataset | Source | Trajectories |
|---|---|---|
| `metamon-parsed-replays` | Human demonstrations (inferred private info) | 4M+ |
| `metamon-parsed-pile` | Self-play battles used to train the strongest baselines | 18M |
The combinatorial space of legal, competitively viable teams is a major generalization challenge. Effective training and evaluation require diverse, realistic teams that mirror human trends.
| Dataset | Contents | Size |
|---|---|---|
| `metamon-teams` | Teams inferred from replays + expert-validated teams from community forums | 200K+ |
Organizer baselines are drawn from PokéChamp (LLM) and Metamon (RL), significantly improved and standardized for this benchmark. They span the competitive skill ladder, providing diverse reference points to track progress.
We extend PokéChamp into a generalized scaffolding framework for reasoning models, supporting both frontier API models (GPT, Claude, Gemini) and open-source models (Llama, Gemma, Qwen). The framework converts game state to structured text and provides configurable scaffolding including depth-limited minimax search with LLM-based position evaluation. Even small open-source models achieve meaningful performance with this support. The Extended Timer setting is recommended for fair evaluation of LLM methods.
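As a rough illustration of that scaffolding pattern, the sketch below implements generic depth-limited minimax with a pluggable leaf evaluator. This is our own simplified example, not the framework's API: in a PokéChamp-style agent the `evaluate` callable would prompt an LLM to score a position rendered as text, while here a trivial numeric heuristic over a toy game stands in.

```python
def minimax(state, depth, maximizing, get_moves, apply_move, evaluate):
    """Depth-limited minimax with a pluggable leaf evaluator.

    In an LLM-scaffolded agent, `evaluate` would query a language model
    for a position score; here it is any callable state -> float.
    """
    moves = get_moves(state)
    if depth == 0 or not moves:
        return evaluate(state), None
    best_move = None
    if maximizing:
        best = float("-inf")
        for move in moves:
            value, _ = minimax(apply_move(state, move), depth - 1,
                               False, get_moves, apply_move, evaluate)
            if value > best:
                best, best_move = value, move
    else:
        best = float("inf")
        for move in moves:
            value, _ = minimax(apply_move(state, move), depth - 1,
                               True, get_moves, apply_move, evaluate)
            if value < best:
                best, best_move = value, move
    return best, best_move

# Toy game: states are integers, moves add or subtract.
get_moves = lambda s: [+1, -1, +2]
apply_move = lambda s, m: s + m
evaluate = lambda s: float(s)   # stand-in for an LLM position score

score, move = minimax(0, 2, True, get_moves, apply_move, evaluate)
```

Note that Pokémon turns are simultaneous rather than alternating; this sketch uses standard alternating turns for simplicity, which is one reason real scaffolding needs more machinery than the bare algorithm.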
We extend Metamon and release checkpoints from 30 agents spanning the competitive skill ladder, from compact RNNs to 200M-parameter Transformers. All are trained on the large datasets of human demonstrations and self-play battles released above. These baselines provide high-quality reference points across a range of human skill levels, allowing researchers to benchmark progress and explore compute-efficiency tradeoffs on accessible hardware.
Participants looking for more of a blank slate are encouraged to check out poke-env — the Python interface to Showdown used by most recent academic work.