This document, written by Arun Shankar (Applied AI, Google), presents a comprehensive guide to reinforcement learning (RL) applied to large language models. The text adopts a distinctive pedagogical approach, presenting each mathematical concept in two parallel formats, formal notation and natural language, so it can be understood at different levels of technical depth. The author designed the guide to remove the mathematical intimidation barrier that keeps many engineers away from RL, demonstrating that the concepts are accessible when properly explained.
The guide begins by establishing necessary mathematical foundations, including probability, logarithms, expected value, and loss functions, explained intuitively with detailed numerical examples. It then addresses the central problem: why models traditionally trained to predict words are not necessarily useful or safe, introducing the concept of alignment with human preferences.
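To give a flavor of the numerical style the guide relies on, here is a minimal sketch (not drawn from the document itself; all numbers are illustrative) computing an expected value and a cross-entropy loss for a toy next-token distribution:

```python
import math

# Toy next-token distribution over three candidate tokens (illustrative numbers).
probs = {"cat": 0.7, "dog": 0.2, "car": 0.1}

# Expected value: sum of outcome * probability, here over arbitrary token "scores".
scores = {"cat": 1.0, "dog": 0.5, "car": -1.0}
expected_score = sum(probs[t] * scores[t] for t in probs)
print(f"expected score = {expected_score:.2f}")  # 0.7*1.0 + 0.2*0.5 + 0.1*(-1.0) = 0.70

# Cross-entropy loss when the true next token is "cat":
# the negative log-probability the model assigned to the correct token.
loss = -math.log(probs["cat"])
print(f"cross-entropy loss = {loss:.3f}")  # -ln(0.7) ≈ 0.357
```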
The document's core explores the RLHF (Reinforcement Learning from Human Feedback) revolution that transformed models like ChatGPT. It describes its three stages in detail: supervised fine-tuning to follow instructions, training reward models from human comparisons, and optimization with algorithms like PPO. It includes a complete mathematical analysis with step-by-step examples illustrating how models learn to generate human-preferred responses.
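As a minimal sketch of the second stage, the pairwise reward-model objective commonly used in RLHF (Bradley–Terry style) can be written as below; the function and tensor names are illustrative assumptions, not code from the guide:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the reward of the human-preferred response
    above the reward of the rejected one. Inputs are scalar rewards per pair, shape (batch,)."""
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage with made-up reward scores for three preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.4, 0.5, 1.1])
print(reward_model_loss(r_chosen, r_rejected))  # small when chosen rewards exceed rejected ones
```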
The text examines modern alternatives like DPO (Direct Preference Optimization), which simplifies RLHF by eliminating the need for explicit reward models, reducing memory requirements by 50%. It analyzes DeepSeek-R1's revolutionary approach that skips supervised fine-tuning and applies RL directly, spontaneously discovering step-by-step reasoning strategies without explicit human examples.
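To make the simplification concrete, here is a minimal sketch of the DPO loss; it needs only log-probabilities from the policy and a frozen reference model rather than a separately trained reward model. Variable names and the example numbers are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.
    Each argument is the summed log-probability of a full response, shape (batch,)."""
    # The implicit "reward" is the log-probability ratio between policy and reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Maximize the margin between chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Illustrative numbers: the policy already slightly prefers the chosen responses.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-13.5, -9.2]))
print(loss)
```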
It covers advanced concepts such as test-time compute scaling (investing more computation during inference to improve accuracy), process reward models (PRM) that evaluate each reasoning step rather than just final outcomes, and modern algorithms beyond PPO and DPO, including GRPO, RLOO, KTO, IPO, and ORPO, with comparative analysis of their advantages.
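As a rough illustration of combining test-time compute with a process reward model, the sketch below samples several reasoning chains and keeps the one whose weakest step scores highest; generate_candidates and score_step are hypothetical stand-ins for a sampler and a trained PRM, not APIs from the guide:

```python
from typing import Callable, List

def best_of_n_with_prm(
    generate_candidates: Callable[[str, int], List[List[str]]],
    score_step: Callable[[str, List[str]], float],
    prompt: str,
    n: int = 8,
) -> List[str]:
    """Sample n reasoning chains (lists of steps) and return the chain whose
    minimum step score under the process reward model is highest.
    Using the minimum penalizes chains that contain even one weak step."""
    candidates = generate_candidates(prompt, n)

    def chain_score(steps: List[str]) -> float:
        # Score each prefix of the chain ending at step i.
        return min(score_step(prompt, steps[: i + 1]) for i in range(len(steps)))

    return max(candidates, key=chain_score)
```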
The document explores domain-specific applications: code generation with execution feedback, mathematics with formal verifiers, tool use with API success signals, and multi-turn dialogue improvement. It examines verifier-guided generation techniques, Monte Carlo Tree Search, and decoding strategies like rejection sampling and self-consistency.
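As one concrete instance of these decoding strategies, self-consistency can be sketched as sampling several answers and taking a majority vote over the final results; sample_answer is a hypothetical sampling function assumed for illustration:

```python
from collections import Counter
from typing import Callable

def self_consistency(sample_answer: Callable[[str], str], prompt: str, n: int = 16) -> str:
    """Sample n independent answers (e.g., with temperature > 0) and return the
    most frequent final answer, which tends to be more accurate than any single sample."""
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```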
Aimed at AI professionals, researchers, engineers, and students, the material is accessible to readers ranging from high school students interested in AI to experts building production systems. Each concept is presented with intuitions, real-world analogies, detailed numerical examples, and warnings about common pitfalls, supporting three reading levels according to the reader's objectives.