Competitive Pokémon Battle Bots
Pokémon Showdown is an open-source simulator that transforms Pokémon's turn-based battles into a competitive strategy game enjoyed by thousands of daily players. Competitive Pokémon battles are two-player stochastic games with imperfect information, where players build teams and navigate complex battles by mastering nuanced gameplay mechanics and making decisions under uncertainty.
Advances in language models, large-scale reinforcement learning datasets, and accessible open-source tools have attracted a growing community of ML researchers to this problem. Recent methods have achieved human-level gameplay in popular singles rulesets. How much further can we push the capabilities of Competitive Pokémon AI?
New to Pokémon Showdown? Read an intro guide for ML researchers
Agents are evaluated by direct competition on an AI-focused Pokémon Showdown server operated by the PokéAgent Challenge. Your agents play against both community submissions and a suite of organizer baselines across skill levels. Results are published on a public leaderboard updated in real time.
The server supports Gen1OU, Gen2OU, Gen3OU, Gen4OU, Gen9OU, and Gen9 VGC Regulation I. Leaderboard results focus on two formats that stress different AI capabilities: Gen 1 OU (greater hidden information, more compact state space) and Gen 9 OU (larger demonstration datasets, broader move/item space).
The leaderboard is sorted by FH-BT and reports the following metrics:

| Showdown Metric | Description |
|---|---|
| Elo | Standard Showdown rating. Noisy for small, fixed-policy agent pools; reported for reference. |
| FH-BT (primary) | Full-History Bradley–Terry rating fit over an agent's complete battle record. More stable than Elo for the dense, fixed-policy matchups in our setting. |
| Glicko-1 | Elo variant incorporating rating uncertainty. Reported natively by Showdown. |
| GXE | Expected win probability against a randomly sampled opponent from the ladder. |
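For intuition, a Bradley–Terry rating like FH-BT can be fit with the standard minorization–maximization (MM) update. The sketch below is a simplified illustration under our own assumptions, not the organizers' implementation: given a matrix of head-to-head win counts, it estimates a strength `p[i]` for each agent such that `P(i beats j) = p[i] / (p[i] + p[j])`.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths via the MM algorithm.

    wins[i][j] = number of times agent i beat agent j.
    Model: P(i beats j) = p[i] / (p[i] + p[j]).
    """
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T              # total games between each pair
    p = np.ones(wins.shape[0])
    for _ in range(iters):
        denom = p[:, None] + p[None, :]
        # MM update: total wins / sum over opponents of n_ij / (p_i + p_j)
        ratio = np.where(games > 0, games / denom, 0.0)
        p = wins.sum(axis=1) / ratio.sum(axis=1)
        p /= p.sum()                   # normalize for identifiability
    return p

# Toy example: agent 0 dominates, agent 2 is weakest.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
strengths = bradley_terry(wins)
```

Unlike incremental Elo updates, this fit uses the complete battle record at once, which is why it is more stable for a fixed pool of agents playing many games against each other.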
The official starter kits below include built-in support for connecting to the PokéAgent Showdown server.
For custom setups using poke-env — the Python interface to Showdown used by most recent academic work — use the following server configuration:
```python
from poke_env import ServerConfiguration

PokeAgentServerConfiguration = ServerConfiguration(
    # Battle server websocket
    "wss://pokeagentshowdown.com/showdown/websocket",
    # Login/authentication endpoint (shared with the official client)
    "https://play.pokemonshowdown.com/action.php?",
)
```
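A minimal connection sketch, assuming poke-env is installed and you have registered an account on the PokéAgent server (the username, password, and choice of `RandomPlayer` — poke-env's built-in baseline — are placeholders for your own agent):

```python
import asyncio
from poke_env import AccountConfiguration, ServerConfiguration
from poke_env.player import RandomPlayer

# Server endpoints for the PokéAgent ladder.
PokeAgentServerConfiguration = ServerConfiguration(
    "wss://pokeagentshowdown.com/showdown/websocket",
    "https://play.pokemonshowdown.com/action.php?",
)

async def play_one_ladder_game():
    # "your-username" / "your-password" are placeholders for a
    # registered account on the PokéAgent server.
    player = RandomPlayer(
        account_configuration=AccountConfiguration("your-username", "your-password"),
        server_configuration=PokeAgentServerConfiguration,
        battle_format="gen1ou",
    )
    await player.ladder(1)  # queue for a single ladder battle

# To run: asyncio.run(play_one_ladder_game())
```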
For further questions, find us on Discord.
Showdown archives public battles spanning a decade of online play. We release several curated datasets organized for flexible AI research — covering raw replay logs, RL-ready trajectories, and diverse team collections.
Anonymized datasets of public Showdown battles, logged from a spectator's perspective.
| Dataset | Formats | Period | Battles |
|---|---|---|---|
| `metamon-raw-replays` | All PokéAgent formats (excl. VGC) | 2014–2025 | 2.4M |
| `pokechamp` | 39+ formats (Gen 1–9 OU, VGC, etc.) | 2024–2025 | 2M |
Raw replays are logged from a spectator's perspective and omit the private information available to each player. We release trajectories reconstructed from each player's point of view by inferring hidden state, enabling flexible experimentation with alternative observation spaces, action spaces, and reward functions.
| Dataset | Source | Trajectories |
|---|---|---|
| `metamon-parsed-replays` | Human demonstrations (inferred private info) | 4M+ |
| `metamon-parsed-pile` | Self-play battles used to train the strongest baselines | 18M |
The combinatorial space of legal, competitively viable teams is a major generalization challenge. Effective training and evaluation require diverse, realistic teams that mirror human trends.
| Dataset | Contents | Size |
|---|---|---|
| `metamon-teams` | Teams inferred from replays + expert-validated teams from community forums | 200K+ |
Organizer baselines are drawn from PokéChamp (LLM) and Metamon (RL), significantly improved and standardized for this benchmark. They span the competitive skill ladder, providing diverse reference points to track progress.
We extend PokéChamp into a generalized scaffolding framework for reasoning models, supporting both frontier API models (GPT, Claude, Gemini) and open-source models (Llama, Gemma, Qwen). The framework converts game state to structured text and provides configurable scaffolding including depth-limited minimax search with LLM-based position evaluation. Even small open-source models achieve meaningful performance with this support. The Extended Timer setting is recommended for fair evaluation of LLM methods.
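As a rough illustration of that scaffolding pattern, the sketch below implements generic depth-limited minimax with a pluggable leaf evaluator. This is our own simplified example, not the framework's API: in a PokéChamp-style agent the `evaluate` callable would prompt an LLM to score a position rendered as text, while here a trivial numeric heuristic over a toy game stands in.

```python
def minimax(state, depth, maximizing, get_moves, apply_move, evaluate):
    """Depth-limited minimax with a pluggable leaf evaluator.

    In an LLM-scaffolded agent, `evaluate` would query a language model
    for a position score; here it is any callable state -> float.
    """
    moves = get_moves(state)
    if depth == 0 or not moves:
        return evaluate(state), None
    best_move = None
    if maximizing:
        best = float("-inf")
        for move in moves:
            value, _ = minimax(apply_move(state, move), depth - 1,
                               False, get_moves, apply_move, evaluate)
            if value > best:
                best, best_move = value, move
    else:
        best = float("inf")
        for move in moves:
            value, _ = minimax(apply_move(state, move), depth - 1,
                               True, get_moves, apply_move, evaluate)
            if value < best:
                best, best_move = value, move
    return best, best_move

# Toy game: states are integers, moves add or subtract.
get_moves = lambda s: [+1, -1, +2]
apply_move = lambda s, m: s + m
evaluate = lambda s: float(s)   # stand-in for an LLM position score

score, move = minimax(0, 2, True, get_moves, apply_move, evaluate)
```

Note that Pokémon turns are simultaneous rather than alternating; this sketch uses standard alternating turns for simplicity, which is one reason real scaffolding needs more machinery than the bare algorithm.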
We extend Metamon and release checkpoints from 30 agents spanning the competitive skill ladder, from compact RNNs to 200M-parameter Transformers. All are trained on the large datasets of human demonstrations and self-play battles released above. These baselines provide high-quality reference points across a range of human skill levels, allowing researchers to benchmark progress and explore compute-efficiency tradeoffs on accessible hardware.
Participants looking for more of a blank slate are encouraged to check out poke-env — the Python interface to Showdown used by most recent academic work.