openai/gpt-oss-20b
OpenAI's gpt-oss-20b — 21B-total / 3.6B-active MoE reasoning model with native MXFP4 quant; fits in 16GB VRAM
Overview
gpt-oss-20b is OpenAI's smaller open-weight reasoning model: 21B total parameters with 3.6B activated per token across 32 experts (top-4 routing), shipped with native MXFP4 quantization on the MoE weights. It targets lower-latency and on-device use cases — the model loads in ~16GB of VRAM, runs on a single H100/H200/B200 or AMD MI300X/MI325X/MI355X, and supports the same harmony chat format, configurable reasoning effort (low / medium / high), and built-in tools (browser, python, function calling) as its larger sibling gpt-oss-120b.
Architectural notes:
- 24 layers alternating sliding-window (window=128) and full attention.
- YaRN rope scaling (factor=32) extending 4K → 131K context.
- MXFP4 quantization on the MoE expert weights (model.layers.*.mlp.experts); attention, router, and embeddings stay in BF16.
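The layer and context arithmetic above can be sketched in a few lines (the even/odd assignment of sliding-window vs. full attention layers is illustrative, not read from the model config):

```python
# Sketch of the architecture notes above; constants come from the text,
# the alternation order is an assumption for illustration.
NUM_LAYERS = 24
SLIDING_WINDOW = 128

# Layers alternate between sliding-window and full attention.
attention_pattern = [
    "sliding" if i % 2 == 0 else "full" for i in range(NUM_LAYERS)
]

# YaRN rope scaling stretches the native 4K context by a factor of 32.
ORIGINAL_CONTEXT = 4096
YARN_FACTOR = 32
MAX_CONTEXT = ORIGINAL_CONTEXT * YARN_FACTOR

print(MAX_CONTEXT)  # 131072
```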
Prerequisites
- Hardware: NVIDIA H100/H200/B200 or AMD MI300X/MI325X/MI355X (also runs on Ada/Ampere consumer cards with sufficient VRAM).
- vLLM >= 0.10.0.
- CUDA >= 12.8 if building from source (must match between install and serving).
Install vLLM
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
Docker quickstart:
docker run --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai --model openai/gpt-oss-20b
AMD ROCm wheels:
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
Launch commands
Single GPU (default — works on any 16GB+ card):
vllm serve openai/gpt-oss-20b
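Once the server is up, any OpenAI-compatible client can reach it on the default port 8000. A minimal sketch of the request body (the prompt and max_tokens value are just examples):

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint
# that `vllm serve` exposes on port 8000 by default.
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}
body = json.dumps(payload)
print(body)
# POST this to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json once the server is running.
```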
Blackwell (B200) with FlashInfer MXFP4+MXFP8 MoE:
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
vllm serve openai/gpt-oss-20b \
--kv-cache-dtype fp8 \
--no-enable-prefix-caching \
--max-cudagraph-capture-size 2048 \
--max-num-batched-tokens 8192 \
--stream-interval 20
Hopper (H100/H200): same as Blackwell minus --kv-cache-dtype fp8 and the FlashInfer env var.
AMD MI300X/MI325X/MI355X:
export HSA_NO_SCRATCH_RECLAIM=1
export AMDGCN_USE_BUFFER_OPS=0
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
vllm serve openai/gpt-oss-20b \
--attention-backend ROCM_AITER_UNIFIED_ATTN \
-cc.pass_config.fuse_rope_kvcache=True \
-cc.use_inductor_graph_partition=True \
--gpu-memory-utilization 0.95 \
--block-size 64
Tool use
The /v1/responses endpoint supports built-in tools (browsing, python, MCP). Setup requires uv pip install gpt-oss and either Docker (for the Python sandbox) or PYTHON_EXECUTION_BACKEND=dangerously_use_uv. For demo tools:
vllm serve openai/gpt-oss-20b --tool-server demo
For user-defined function calling, pass:
vllm serve openai/gpt-oss-20b --tool-call-parser openai --enable-auto-tool-choice
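With the server launched as above, user-defined tools follow the standard OpenAI function-calling schema. A hedged sketch of such a request (the get_weather function is a made-up example, not part of the model or server):

```python
import json

# A user-defined function tool in the OpenAI tools format, accepted by
# /v1/chat/completions when the server runs with
# --tool-call-parser openai --enable-auto-tool-choice.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example function
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

request = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}
print(json.dumps(request, indent=2))
```

If the model decides to call the function, the response carries a tool_calls entry whose arguments you execute client-side, then append as a "tool" role message.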
Reasoning effort
gpt-oss exposes three reasoning levels — low, medium, high — selected via the system prompt:
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Explain why eigenvalues matter."},
    ],
)
print(response.choices[0].message.content)
Troubleshooting
- Attention sinks dtype error on Blackwell: ensure VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 and --kv-cache-dtype fp8 are both set.
- tl.language not defined: make sure no extra Triton (e.g., pytorch-triton) is installed alongside vLLM's bundled Triton.
- Harmony vocab download failure: pre-download the tiktoken files and set TIKTOKEN_ENCODINGS_BASE.