Reinforcement Learning for Large Language Models

Arun Shankar
15/10/2025
Complete guide to reinforcement learning for language models, from fundamental mathematical concepts to advanced techniques. Explains RLHF, DPO, reward models, reasoning strategies, and practical applications with an accessible yet rigorous approach.
This document, written by Arun Shankar (Applied AI, Google), presents a comprehensive guide to reinforcement learning (RL) applied to large language models. The text adopts a distinctive pedagogical approach, presenting each mathematical concept in two parallel formats, formal notation and natural language, enabling understanding at different levels of technical depth. The author designed this guide to eliminate the mathematical intimidation barrier that keeps many engineers away from RL, demonstrating that the concepts are accessible when properly explained.

The guide begins by establishing necessary mathematical foundations, including probability, logarithms, expected value, and loss functions, explained intuitively with detailed numerical examples. It then addresses the central problem: why models traditionally trained to predict words are not necessarily useful or safe, introducing the concept of alignment with human preferences.

The document's core explores the RLHF (Reinforcement Learning from Human Feedback) revolution that transformed models like ChatGPT. It describes its three stages in detail: supervised fine-tuning to follow instructions, training reward models based on human comparisons, and optimization using algorithms like PPO. It includes complete mathematical analysis with step-by-step examples illustrating how models learn to generate human-preferred responses.
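The second stage described above, training a reward model from human comparisons, is commonly cast as a Bradley-Terry pairwise ranking problem: the model is penalized by the negative log-probability that the human-chosen response outscores the rejected one. A minimal numeric sketch of that loss (not the guide's own code, just the standard formula on toy scores):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    # Negative log-likelihood that the chosen response beats the rejected one.
    return -math.log(sigmoid(r_chosen - r_rejected))

# If the reward model already ranks the chosen answer higher, the loss is small:
low = bradley_terry_loss(2.0, -1.0)   # margin = +3.0
# If the ranking is inverted, the loss grows, pushing the scores apart:
high = bradley_terry_loss(-1.0, 2.0)  # margin = -3.0
print(round(low, 4), round(high, 4))
```

Only the score margin matters, which is why reward-model scores are meaningful relatively rather than on an absolute scale.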

The text examines modern alternatives like DPO (Direct Preference Optimization), which simplifies RLHF by eliminating the need for explicit reward models, reducing memory requirements by 50%. It analyzes DeepSeek-R1's revolutionary approach that skips supervised fine-tuning and applies RL directly, spontaneously discovering step-by-step reasoning strategies without explicit human examples.
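DPO's simplification can be made concrete: each response gets an *implicit* reward, beta times how much the policy has shifted its log-probability relative to the frozen reference model, and those implicit rewards are plugged into the same pairwise form used for reward models. A minimal sketch with made-up log-probabilities (the function name and numbers are illustrative, not from the guide):

```python
import math

def dpo_loss(logp_pi_chosen, logp_pi_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    # Implicit reward = beta * (policy log-prob - reference log-prob).
    r_chosen = beta * (logp_pi_chosen - logp_ref_chosen)
    r_rejected = beta * (logp_pi_rejected - logp_ref_rejected)
    margin = r_chosen - r_rejected
    # Same pairwise log-sigmoid form as RLHF's reward-model loss, but
    # computed directly from log-probabilities: no separate reward network.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy has raised the chosen response and lowered the rejected one
# relative to the reference, so the loss falls below log(2), its value
# when the policy has not moved at all:
loss = dpo_loss(-10.0, -14.0, -12.0, -12.0, beta=0.1)
print(round(loss, 4))
```

Because the reference model's log-probabilities can be precomputed once, only the policy needs gradients, which is where the memory saving over PPO-based RLHF comes from.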

It covers advanced concepts such as test-time compute scaling (investing more computation during inference to improve accuracy), process reward models (PRM) that evaluate each reasoning step rather than just final outcomes, and modern algorithms beyond PPO and DPO, including GRPO, RLOO, KTO, IPO, and ORPO, with comparative analysis of their advantages.
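The simplest form of test-time compute scaling is best-of-N: sample several candidate responses and keep the one a verifier or reward model scores highest. A toy sketch, where the candidate list and the distance-based scorer stand in for real model samples and a real verifier (both are hypothetical):

```python
def best_of_n(candidates, score_fn):
    # Keep the candidate the verifier/reward model scores highest.
    return max(candidates, key=score_fn)

# Toy stand-ins: candidate answers to 12 * 13, scored by closeness to
# the verified product. A real system would sample N completions from
# the LLM and score each with a reward model or automated checker.
candidates = [150, 156, 160, 149]
score = lambda x: -abs(x - 12 * 13)
best = best_of_n(candidates, score)
print(best)  # 156
```

Accuracy improves with larger N at the cost of more inference calls, and the model itself is never retrained, which is the trade-off the guide highlights.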

The document explores domain-specific applications: code generation with execution feedback, mathematics with formal verifiers, tool use with API success signals, and multi-turn dialogue improvement. It examines verifier-guided generation techniques, Monte Carlo Tree Search, and decoding strategies like rejection sampling and self-consistency.
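Of the decoding strategies listed, self-consistency is the easiest to sketch: sample several chain-of-thought completions independently, extract each one's final answer, and take a majority vote. A minimal illustration on hand-written sample answers (the inputs are invented for the example):

```python
from collections import Counter

def self_consistency(final_answers):
    # Majority vote over final answers from independently sampled
    # chain-of-thought completions; also report the agreement rate.
    counts = Counter(final_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)

# Five sampled reasoning chains, three of which converge on "42":
answer, agreement = self_consistency(["42", "42", "17", "42", "9"])
print(answer, agreement)  # 42 0.6
```

The vote aggregates only final answers, so different (even partly flawed) reasoning paths can still reinforce the same correct result.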

Aimed at AI professionals, researchers, engineers, and students, the material is accessible to everyone from high school readers interested in AI to experts building production systems. Each concept is presented with intuitions, real-world analogies, detailed numerical examples, and warnings about common pitfalls, enabling three reading levels according to the reader's objectives.

Key points

  • Complete RL guide for LLMs with accessible mathematical approach through dual format.
  • RLHF transforms models through supervised fine-tuning, reward models, and PPO optimization.
  • DPO eliminates explicit reward models, simplifying RLHF with 50% less memory.
  • DeepSeek-R1 demonstrates reasoning emerges from pure RL without prior supervised fine-tuning.
  • Test-time compute trades inference time for accuracy without retraining the model.
  • PRMs evaluate each reasoning step, outperforming models that only measure final outcomes.
  • Three reading levels: conceptual understanding, practical implementation, or advanced research.
  • Covers GRPO, RLOO, KTO, IPO, and ORPO with specific use cases for each algorithm.
  • Verifiable applications in code, mathematics, tools, and dialogue with automated rewards.
  • Includes advanced strategies: verifiers, MCTS, rejection sampling, and self-consistency.
