openai/gpt-oss-20b
OpenAI's gpt-oss-20b — 21B-total / 3.6B-active MoE reasoning model with native MXFP4 quant; fits in 16GB VRAM
Overview
gpt-oss-20b is OpenAI's smaller open-weight reasoning model: 21B total parameters with 3.6B activated per token across 32 experts (top-4 routing), shipped with native MXFP4 quantization on the MoE weights. It targets lower-latency and on-device use cases — the model loads in ~16GB of VRAM, runs on a single H100/H200/B200 or AMD MI300X/MI325X/MI355X, and supports the same harmony chat format, configurable reasoning effort (low / medium / high), and built-in tools (browser, python, function calling) as its larger sibling gpt-oss-120b.
Architectural notes:
- 24 layers alternating sliding-window (window=128) and full attention.
- YaRN rope scaling (factor=32) extending 4K → 131K context.
- MXFP4 quantization on the MoE expert weights (model.layers.*.mlp.experts); attention, router, and embeddings stay in BF16.
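The layer and context arithmetic above can be sketched in a few lines (the even/odd assignment of sliding-window vs. full attention layers is illustrative, not read from the model config):

```python
# Sketch of the architecture notes above; constants come from the text,
# the alternation order is an assumption for illustration.
NUM_LAYERS = 24
SLIDING_WINDOW = 128

# Layers alternate between sliding-window and full attention.
attention_pattern = [
    "sliding" if i % 2 == 0 else "full" for i in range(NUM_LAYERS)
]

# YaRN rope scaling stretches the native 4K context by a factor of 32.
ORIGINAL_CONTEXT = 4096
YARN_FACTOR = 32
MAX_CONTEXT = ORIGINAL_CONTEXT * YARN_FACTOR

print(MAX_CONTEXT)  # 131072
```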
Prerequisites
- Hardware: NVIDIA H100/H200/B200 or AMD MI300X/MI325X/MI355X (also runs on Ada/Ampere consumer cards with sufficient VRAM).
- vLLM >= 0.10.0.
- CUDA >= 12.8 if building from source (must match between install and serving).
Install vLLM
uv venv
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
Docker quickstart:
docker run --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai --model openai/gpt-oss-20b
AMD ROCm wheels:
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm
Launch commands
Single GPU (default — works on any 16GB+ card):
vllm serve openai/gpt-oss-20b
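Once the server is up, any OpenAI-compatible client can reach it on the default port 8000. A minimal sketch of the request body (the prompt and max_tokens value are just examples):

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint
# that `vllm serve` exposes on port 8000 by default.
payload = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}
body = json.dumps(payload)
print(body)
# POST this to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json once the server is running.
```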
Blackwell (B200) with FlashInfer MXFP4+MXFP8 MoE:
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
vllm serve openai/gpt-oss-20b \
--kv-cache-dtype fp8 \
--no-enable-prefix-caching \
--max-cudagraph-capture-size 2048 \
--max-num-batched-tokens 8192 \
--stream-interval 20
Hopper (H100/H200): same as Blackwell minus --kv-cache-dtype fp8 and the FlashInfer env var.
AMD MI300X/MI325X/MI355X:
export HSA_NO_SCRATCH_RECLAIM=1
export AMDGCN_USE_BUFFER_OPS=0
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
vllm serve openai/gpt-oss-20b \
--attention-backend ROCM_AITER_UNIFIED_ATTN \
-cc.pass_config.fuse_rope_kvcache=True \
-cc.use_inductor_graph_partition=True \
--gpu-memory-utilization 0.95 \
--block-size 64
Tool use
The /v1/responses endpoint supports built-in tools (browsing, python, MCP). Setup requires uv pip install gpt-oss and either Docker (for the Python sandbox) or PYTHON_EXECUTION_BACKEND=dangerously_use_uv. For demo tools:
vllm serve openai/gpt-oss-20b --tool-server demo
For user-defined function calling, pass:
vllm serve openai/gpt-oss-20b --tool-call-parser openai --enable-auto-tool-choice
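With the server launched as above, user-defined tools follow the standard OpenAI function-calling schema. A hedged sketch of such a request (the get_weather function is a made-up example, not part of the model or server):

```python
import json

# A user-defined function tool in the OpenAI tools format, accepted by
# /v1/chat/completions when the server runs with
# --tool-call-parser openai --enable-auto-tool-choice.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical example function
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

request = {
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
    "tool_choice": "auto",
}
print(json.dumps(request, indent=2))
```

If the model decides to call the function, the response carries a tool_calls entry whose arguments you execute client-side, then append as a "tool" role message.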
Reasoning effort
gpt-oss exposes three reasoning levels — low, medium, high — selected via the system prompt:
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Explain why eigenvalues matter."},
    ],
)
print(response.choices[0].message.content)
Troubleshooting
- Attention sinks dtype error on Blackwell: ensure VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 and --kv-cache-dtype fp8 are both set.
- tl.language not defined: make sure no extra Triton (e.g., pytorch-triton) is installed alongside vLLM's bundled Triton.
- Harmony vocab download failure: pre-download the tiktoken files and set TIKTOKEN_ENCODINGS_BASE.