GPU Tensor Cores (PRIMARY)
Reasoning models use the same Transformer decoder architecture as standard LLMs and require GPUs with Tensor Cores for efficient inference. Generating long chain-of-thought (CoT) traces substantially increases VRAM demand (the KV cache grows linearly with sequence length) and GPU time per query.
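As a rough illustration, the sketch below estimates the KV-cache footprint of a single long CoT generation. The model dimensions (80 layers, 8 grouped-query KV heads, head size 128, FP16 cache) are assumptions loosely modeled on a 70B-class dense model, not measurements of any particular reasoning model.

```python
# Rough KV-cache size estimate for one long CoT generation.
# All model dimensions are illustrative assumptions, not measured values.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2, batch=1):
    # Factor of 2 accounts for storing both keys and values at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch

# Assumed 70B-class model with grouped-query attention (8 KV heads),
# generating a 32k-token CoT trace with an FP16 cache.
size = kv_cache_bytes(seq_len=32_768, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"KV cache: {size / 2**30:.1f} GiB per sequence")  # ~10 GiB
```

With full multi-head attention instead of grouped-query attention, the same trace would occupy several times more memory, which is why the KV cache, not the weights, often becomes the limiting factor for long CoT inference.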
Because the token sequences are very long (CoT trace plus final answer), reasoning models need GPUs with large HBM capacity. Models in the 7B–70B range typically need roughly 16 GB of VRAM (7B, FP16) up to 80 GB or more (70B, usually quantized or sharded, since FP16 weights alone are ~140 GB). Inference for the largest reasoning models (e.g., DeepSeek-R1 at 671B parameters) requires multi-GPU configurations. Test-time compute scaling translates directly into higher GPU cost per query than for a standard LLM of the same size.
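A similar back-of-the-envelope calculation shows why the largest reasoning models need multiple GPUs. The bytes-per-parameter values, KV-cache sizes, and the 80 GB GPU capacity below are illustrative assumptions, not a deployment recipe.

```python
# Weights + KV cache vs. available GPU memory: estimate how many GPUs
# a given model needs. Figures are assumptions for illustration only.
import math

def gpus_needed(n_params, bytes_per_param, kv_cache_gib, gpu_gib=80, overhead=0.9):
    weights_gib = n_params * bytes_per_param / 2**30
    usable_per_gpu = gpu_gib * overhead  # headroom for activations and runtime
    return math.ceil((weights_gib + kv_cache_gib) / usable_per_gpu)

# 7B model, FP16 weights, modest KV cache: fits on a single 24 GB GPU.
print(gpus_needed(7e9, 2, kv_cache_gib=4, gpu_gib=24))      # -> 1
# 671B model, FP8 weights, large KV cache for long CoT traces: multi-GPU.
print(gpus_needed(671e9, 1, kv_cache_gib=100, gpu_gib=80))   # -> 11
```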
TPU (GOOD)
TPU v4/v5 pods are used to train large reasoning models (e.g., by Google). They handle long token sequences efficiently thanks to high-bandwidth HBM and a systolic-array architecture optimized for large GEMMs.
RL training of reasoning models on TPUs requires infrastructure adapted to long rollouts and variable sequence lengths, since XLA-compiled TPU programs favor static tensor shapes; one common workaround is sketched below.
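One way to reconcile variable-length rollouts with the TPU preference for static shapes is to pad each rollout to the nearest of a few fixed length buckets, so the compiled program is reused across rollouts of similar length. The sketch below is a minimal illustration with arbitrary bucket sizes, not the setup of any specific training stack.

```python
# Pad variable-length RL rollouts into a small set of fixed-length buckets
# so that TPU/XLA compilations are reused instead of retriggered per length.
# Bucket sizes and PAD_ID are hypothetical values for illustration.

BUCKETS = (2048, 4096, 8192, 16384, 32768)
PAD_ID = 0

def pad_to_bucket(token_ids):
    """Pad a rollout to the smallest bucket that fits it."""
    length = len(token_ids)
    for bucket in BUCKETS:
        if length <= bucket:
            return token_ids + [PAD_ID] * (bucket - length), bucket
    raise ValueError(f"rollout of length {length} exceeds the largest bucket")

padded, bucket = pad_to_bucket(list(range(3000)))
print(bucket, len(padded))  # 4096 4096
```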