https://www.youtube.com/watch?v=XMnxKGVnEUc
PPO, DPO and ORPO
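Of the three preference-optimization methods named above, DPO has the simplest objective: push up the policy's implicit reward on the chosen response relative to the rejected one, measured as a log-probability ratio against a frozen reference model. A minimal sketch for a single preference pair (the log-prob values you'd pass in come from your own models; `beta` is the usual DPO temperature):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are total log-probabilities of the chosen and rejected
    responses under the policy and the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # Loss is -log(sigmoid(margin)); softplus form, with a guard
    # so exp() cannot overflow for very negative margins.
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; the loss shrinks as the policy assigns relatively more probability to the chosen response.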
From https://lifeinthesingularity.com/p/deepseek-proves-ai-comes-for-all
- Once the reasoning-oriented RL converged, the checkpoint was used to create new SFT data
- To sum it up neatly → DeepSeek-R1's training is a multi-stage pipeline: it starts with a pure RL approach to establish reasoning capabilities, then introduces a cold start with high-quality data, followed by further refinement through both RL and SFT, and finally distillation to transfer these reasoning capabilities to smaller models. This combination of techniques resulted in a model that performs comparably to OpenAI-o1-1217 (cutting edge as of this writing) on various reasoning tasks.
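The staged recipe in the summary above can be sketched as code. Everything here is a hypothetical skeleton: `rl_train`, `sft_train`, and `distill` are placeholder callables standing in for the real training routines, and only the stage ordering is taken from the text:

```python
def train_r1_style(base_model, rl_train, sft_train, distill):
    """Hypothetical sketch of the multi-stage recipe described above.

    rl_train / sft_train / distill are injected placeholders for the
    actual training routines; only the stage ordering is modeled.
    """
    # Stage 1: pure RL on the base model to elicit reasoning
    # (the R1-Zero-style step; its outputs inform the cold-start data).
    zero_ckpt = rl_train(base_model)

    # Stage 2: cold-start SFT on a small set of high-quality data.
    cold_ckpt = sft_train(base_model, data="cold_start")

    # Stage 3: reasoning-oriented RL on the cold-started model.
    rl_ckpt = rl_train(cold_ckpt)

    # Stage 4: once that RL converges, use the checkpoint to create
    # new SFT data, fine-tune again, then run a final RL pass.
    sft_ckpt = sft_train(rl_ckpt, data="sampled_from_checkpoint")
    final_ckpt = rl_train(sft_ckpt)

    # Stage 5: distill the final model into smaller student models.
    return distill(final_ckpt, students=["small_model"])
```

Injecting the routines as arguments keeps the sketch runnable with stubs while making clear that the actual training code is not being claimed here.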
- By combining RL with SFT and distillation, DeepSeek-R1 achieves performance comparable to cutting-edge models like OpenAI's, potentially at a fraction of the training cost. This could democratize access to advanced AI, making it more affordable and accessible for researchers, developers, and smaller organizations.