The goal of this repo is to build the missing pieces of the R1 pipeline such that everybody can reproduce and build on top of it. The project is simple by design and mostly consists of:
We will use the DeepSeek-R1 [tech report](https://github.com/deepseek-ai/DeepSeek-R1) as a guide, which can roughly be broken down into three main steps:
* Step 1: replicate the R1-Distill models by distilling a high-quality corpus from DeepSeek-R1.
* Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will likely involve curating new, large-scale datasets for math, reasoning, and code.
* Step 3: show we can go from a base model to an RL-tuned model via multi-stage training (SFT on a small cold-start set followed by RL), matching the recipe used to create R1.
This will also install PyTorch `v2.5.1`, and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:
```shell
pip install -e ".[dev]"
```
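To confirm that the expected PyTorch build is active after installation, you can print the installed version as a quick sanity check (it should report `2.5.1`, as noted above):

```shell
# Print the installed PyTorch version; it should match the version the vLLM binaries were compiled for
python -c "import torch; print(torch.__version__)"
```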
Next, log into your Hugging Face and Weights & Biases accounts as follows:
```shell
huggingface-cli login
wandb login
```
Finally, check that your system has Git LFS installed so that you can load and push models/datasets to the Hugging Face Hub:
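For example (the install command below assumes a Debian/Ubuntu system; use your platform's package manager otherwise):

```shell
# Check whether Git LFS is available
git-lfs --version

# If the command above fails, install it (Debian/Ubuntu example)
sudo apt-get install git-lfs
```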
We support training models with either DDP or DeepSpeed (ZeRO-2 and ZeRO-3). To switch between methods, simply change the path to the `accelerate` YAML config in `configs`.
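As a minimal sketch of what this looks like in practice (the YAML file names and the `src/open_r1/sft.py` script path below are assumptions for illustration; substitute the configs that actually exist in `configs` and your training script):

```shell
# Same training script; only the Accelerate config path selects the distributed method.
accelerate launch --config_file configs/ddp.yaml src/open_r1/sft.py    # DDP (hypothetical config name)
accelerate launch --config_file configs/zero3.yaml src/open_r1/sft.py  # DeepSpeed ZeRO-3 (hypothetical config name)
```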
> The training commands below are configured for a node of 8 x H100s (80GB). For different hardware and topologies, you may need to tune the batch size and number of gradient accumulation steps.
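> For reference, the effective batch size is the per-device batch size × number of GPUs × gradient accumulation steps, so halving the number of GPUs generally means doubling the gradient accumulation steps (or the per-device batch size, memory permitting) to keep it constant.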
To run SFT on a dataset distilled from DeepSeek-R1 with reasoning traces such as [Bespoke-Stratos-17k](https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k), run:
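(The exact command depends on the repo's launch scripts; the following is a sketch only, with the `src/open_r1/sft.py` path and its flags assumed for illustration rather than taken from the repo's verified CLI.)

```shell
# Illustrative sketch: script path and flag names are assumptions;
# {model}, {dataset}, and {accelerator} are placeholders explained below.
accelerate launch --config_file configs/{accelerator}.yaml src/open_r1/sft.py \
    --model_name_or_path {model} \
    --dataset_name {dataset} \
    --output_dir data/{model}-distill
```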
Here `{model}` and `{dataset}` refer to the model and dataset IDs on the Hugging Face Hub, while `{accelerator}` refers to the choice of 🤗 Accelerate config in `configs`.