Compare commits

10 commits: 82b2a6525f ... 7041fbc9d6

- 7041fbc9d6
- 90a6de94c7
- fbea53267b
- 272b648c03
- 7832290687
- 80e7e7b23c
- f987b3c877
- 96a6b0fa33
- fa9b621cc9
- 52aa8759a2

.gitignore (vendored): 4 changes

@@ -175,4 +175,6 @@ data/
 wandb/
 logs/
 eval_results/
 results/
+
+.vscode/

README.md: 14 changes

@@ -57,7 +57,7 @@ uv venv openr1 --python 3.11 && source openr1/bin/activate && uv pip install --u
 Next, install vLLM:
 
 ```shell
-uv pip install vllm==0.7.1 --link-mode=copy
+uv pip install vllm==0.7.2 --link-mode=copy
 ```
 
 This will also install PyTorch `v2.5.1` and it is **very important** to use this version since the vLLM binaries are compiled for it. You can then install the remaining dependencies for your specific use case via `pip install -e .[LIST OF MODES]`. For most contributors, we recommend:
@@ -126,6 +126,14 @@ accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r
     --per_device_train_batch_size=1 --num_train_epochs=5
 ```
 
+If you also wish to override the Weights and Biases default settings, you can do so as follows:
+
+```shell
+accelerate launch --config_file recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
+    --config recipes/Qwen2.5-1.5B-Instruct/sft/config_demo.yaml \
+    --wandb_entity huggingface --wandb_project open-r1 --run_name Qwen2.5-1.5B-GRPO
+```
+
 > [!NOTE]
 > The training commands below are configured for a node of 8 x H100s (80GB). For different hardware and topologies, you may need to tune the batch size and number of gradient accumulation steps.
 
@@ -141,10 +149,10 @@ ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_con
 
 ### GRPO
 
-To train via the GRPO trainer, we use one GPU to run vLLM for faster generation and the remaining GPUs for training. For example, on a node with 8 GPUs, use the `recipes/accelerate_configs/zero3.yaml` config and then overwrite `num_processes` to run on 7 devices:
+To train via the GRPO trainer, we use one GPU to run vLLM for faster generation and the remaining GPUs for training. For example, on a node with 8 GPUs, use the `recipes/accelerate_configs/zero2.yaml` config and then overwrite `num_processes` to run on 7 devices:
 
 ```shell
-ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
+ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
     --num_processes=7 src/open_r1/grpo.py \
     --config recipes/Qwen2.5-1.5B-Instruct/grpo/config_demo.yaml
 ```

@@ -31,12 +31,12 @@ lr_scheduler_type: cosine
 max_prompt_length: 512
 max_completion_length: 1024
 max_steps: -1
-num_generations: 2
+num_generations: 7
 num_train_epochs: 1
 output_dir: data/DeepSeek-R1-Distill-Qwen-7B-GRPO
 overwrite_output_dir: true
-per_device_eval_batch_size: 4
-per_device_train_batch_size: 2
+per_device_eval_batch_size: 32
+per_device_train_batch_size: 16
 push_to_hub: true
 report_to:
 - wandb

@@ -33,12 +33,12 @@ lr_scheduler_type: cosine
 max_prompt_length: 512
 max_completion_length: 1024
 max_steps: -1
-num_generations: 2
+num_generations: 7
 num_train_epochs: 1
 output_dir: data/Qwen2.5-1.5B-Open-R1-GRPO
 overwrite_output_dir: true
-per_device_eval_batch_size: 4
-per_device_train_batch_size: 2
+per_device_eval_batch_size: 32
+per_device_train_batch_size: 16
 push_to_hub: true
 report_to:
 - wandb

@@ -37,8 +37,8 @@ num_generations: 7
 num_train_epochs: 1
 output_dir: data/Qwen-2.5-7B-Simple-RL
 overwrite_output_dir: true
-per_device_eval_batch_size: 2
-per_device_train_batch_size: 2
+per_device_eval_batch_size: 16
+per_device_train_batch_size: 16
 push_to_hub: true
 report_to:
 - wandb

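For context, the per-device batch sizes above multiply across the training processes. A rough, illustrative sketch of that arithmetic (not part of the diff; it assumes the 7 training processes from the README's GRPO setup, and ignores gradient accumulation, which is configured elsewhere in these recipes):

```python
# Back-of-the-envelope effective batch size for the updated GRPO recipes.
# Assumes 7 training processes (8 GPUs minus 1 reserved for vLLM); gradient
# accumulation steps are not shown in this diff and would scale the result.
per_device_train_batch_size = 16
num_processes = 7
effective_batch_size = per_device_train_batch_size * num_processes
print(effective_batch_size)  # 112 sequences per optimizer step, before accumulation
```
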
@@ -1,5 +1,6 @@
 import argparse
 import asyncio
+import hashlib
 import json
 import os
 import random
@@ -87,14 +88,14 @@ async def process_example(example, session, args, output_file, pbar):
         return None
 
 
-async def load_processed_uuids(output_file):
+async def load_processed_uuids(output_file, uuid_column):
     processed_uuids = set()
     if os.path.exists(output_file):
         async with aiofiles.open(output_file, mode="r") as f:
            async for line in f:
                try:
                    data = json.loads(line)
-                   processed_uuids.add(data["uuid"])
+                   processed_uuids.add(hashlib.md5(str(data[uuid_column]).encode()).hexdigest())
                except json.JSONDecodeError:
                    continue
    return processed_uuids
@@ -120,7 +121,9 @@ async def main():
     args = parser.parse_args()
 
     dataset = load_dataset(args.dataset_name, split="train").shuffle()
-    processed_uuids = await load_processed_uuids(args.output_file)
+    processed_uuids = await load_processed_uuids(args.output_file, args.uuid_column)
+    if processed_uuids:
+        print(f"Found {len(processed_uuids)} already processed examples, resuming from there...")
 
     if not os.path.exists(args.output_file):
         async with aiofiles.open(args.output_file, mode="w") as f:
@@ -129,7 +132,7 @@ async def main():
     active_tasks: Set[asyncio.Task] = set()
 
     pbar = tqdm(
-        total=len(dataset),
+        total=len(dataset) - len(processed_uuids),
         desc="Generating responses",
         unit="row",
         mininterval=2,
@@ -142,7 +145,8 @@ async def main():
         connector=aiohttp.TCPConnector(limit=args.max_concurrent, ttl_dns_cache=300, keepalive_timeout=60 * 60),
     ) as session:
         for example in dataset:
-            if example["uuid"] not in processed_uuids:
+            uuid = hashlib.md5(str(example[args.uuid_column]).encode()).hexdigest()
+            if uuid not in processed_uuids:
                 # Wait if we've hit the concurrency limit
                 while len(active_tasks) >= args.max_concurrent:
                     done, active_tasks = await asyncio.wait(active_tasks, return_when=asyncio.FIRST_COMPLETED)

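The change above keys the resume logic on an MD5 hash of a configurable column instead of a hard-coded `uuid` field. A minimal standalone sketch of the idea (the `question` column and sample rows are purely illustrative, not taken from the script):

```python
import hashlib

# Hypothetical rows; any column can serve as the resume key.
rows = [{"question": "What is 2+2?"}, {"question": "Prove sqrt(2) is irrational."}]
uuid_column = "question"

# Hash the key column the same way when writing and when resuming,
# so previously processed rows are recognized after a restart.
processed = {hashlib.md5(str(r[uuid_column]).encode()).hexdigest() for r in rows[:1]}

for row in rows:
    key = hashlib.md5(str(row[uuid_column]).encode()).hexdigest()
    if key in processed:
        continue  # already generated, skip on resume
    print("would generate a completion for:", row[uuid_column])
```
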
setup.py: 1 change

@@ -58,6 +58,7 @@ _deps = [
     "math-verify==0.5.2",  # Used for math verification in grpo
     "packaging>=23.0",
     "parameterized>=0.9.0",
+    "peft>=0.14.0",
     "pytest",
     "ruff>=0.9.0",
     "safetensors>=0.3.3",

@@ -81,7 +81,7 @@ echo "Uploading details to Hugging Face Hub..."
 DETAILS_FILEPATHS=$(find $OUTPUT_DIR/details/ -type f \( -name "*.parquet" \))
 echo "DETAILS_FILEPATHS: $DETAILS_FILEPATHS"
 TIMESTAMP=$(date +"%Y-%m-%dT%H-%M-%S")
-python src/open_r1/utils/upload_details.py --data_files $DETAILS_FILEPATHS --hub_repo_id $DETAILS_REPO_ID --config_name $MODEL_REVISION.$TASK_NAME.$TIMESTAMP
+python scripts/upload_details.py --data_files $DETAILS_FILEPATHS --hub_repo_id $DETAILS_REPO_ID --config_name $MODEL_REVISION.$TASK_NAME.$TIMESTAMP
 
 echo "Cleaning up ..."
 rm -rf $OUTPUT_DIR

@@ -40,6 +40,14 @@ class GRPOConfig(trl.GRPOConfig):
     )
     overwrite_hub_revision: bool = field(default=False, metadata={"help": "Whether to overwrite the Hub revision."})
     push_to_hub_revision: bool = field(default=False, metadata={"help": "Whether to push to a Hub revision/branch."})
+    wandb_entity: Optional[str] = field(
+        default=None,
+        metadata={"help": ("The entity to store runs under.")},
+    )
+    wandb_project: Optional[str] = field(
+        default=None,
+        metadata={"help": ("The project to store runs under.")},
+    )
 
 
 @dataclass
@@ -64,3 +72,11 @@ class SFTConfig(trl.SFTConfig):
     )
     overwrite_hub_revision: bool = field(default=False, metadata={"help": "Whether to overwrite the Hub revision."})
     push_to_hub_revision: bool = field(default=False, metadata={"help": "Whether to push to a Hub revision/branch."})
+    wandb_entity: Optional[str] = field(
+        default=None,
+        metadata={"help": ("The entity to store runs under.")},
+    )
+    wandb_project: Optional[str] = field(
+        default=None,
+        metadata={"help": ("The project to store runs under.")},
+    )

@@ -30,9 +30,11 @@ from open_r1.rewards import (
     format_reward,
     get_cosine_scaled_reward,
     get_repetition_penalty_reward,
+    len_reward,
     reasoning_steps_reward,
 )
 from open_r1.utils.callbacks import get_callbacks
+from open_r1.utils.wandb_logging import init_wandb_training
 from trl import GRPOTrainer, ModelConfig, ScriptArguments, TrlParser, get_peft_config
 
 
@@ -46,7 +48,7 @@ class GRPOScriptArguments(ScriptArguments):
 
     Args:
         reward_funcs (`list[str]`):
-            List of reward functions. Possible values: 'accuracy', 'format', 'reasoning_steps', 'cosine', 'repetition_penalty'.
+            List of reward functions. Possible values: 'accuracy', 'format', 'reasoning_steps', 'cosine', 'repetition_penalty', 'length'.
         cosine_min_value_wrong (`float`):
             Minimum reward for cosine scaling for wrong answers.
         cosine_max_value_wrong (`float`):
@@ -62,7 +64,7 @@ class GRPOScriptArguments(ScriptArguments):
     reward_funcs: list[str] = field(
         default_factory=lambda: ["accuracy", "format"],
         metadata={
-            "help": "List of reward functions. Possible values: 'accuracy', 'format', 'reasoning_steps', 'cosine', 'repetition_penalty'"
+            "help": "List of reward functions. Possible values: 'accuracy', 'format', 'reasoning_steps', 'cosine', 'repetition_penalty', 'length'"
         },
     )
     cosine_min_value_wrong: float = field(
@@ -130,7 +132,7 @@ def main(script_args, training_args, model_args):
     )
     logger.info(f"Model parameters {model_args}")
     logger.info(f"Script parameters {script_args}")
-    logger.info(f"Data parameters {training_args}")
+    logger.info(f"Training parameters {training_args}")
 
     # Check for last checkpoint
     last_checkpoint = None
@@ -139,6 +141,9 @@ def main(script_args, training_args, model_args):
     if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
         logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.")
 
+    if "wandb" in training_args.report_to:
+        init_wandb_training(training_args)
+
     # Load the dataset
     dataset = load_dataset(script_args.dataset_name, name=script_args.dataset_config)
 
@@ -158,6 +163,7 @@ def main(script_args, training_args, model_args):
             ngram_size=script_args.repetition_n_grams,
             max_penalty=script_args.repetition_max_penalty,
         ),
+        "length": len_reward,
     }
     reward_funcs = [REWARD_FUNCS_REGISTRY[func] for func in script_args.reward_funcs]
 

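With `len_reward` registered under the `"length"` key, it can be selected through `reward_funcs` like any other reward. The snippet below is only a sketch of that registry pattern, with stub functions standing in for the real reward implementations:

```python
# Sketch of the reward-registry pattern shown in the hunk above: string keys
# from the config select which reward callables get passed to the trainer.
def accuracy_reward(completions, **kwargs): ...
def len_reward(completions, solutions, **kwargs): ...

REWARD_FUNCS_REGISTRY = {"accuracy": accuracy_reward, "length": len_reward}

selected = ["accuracy", "length"]  # e.g. the reward_funcs list in a recipe YAML
reward_funcs = [REWARD_FUNCS_REGISTRY[name] for name in selected]
print([f.__name__ for f in reward_funcs])  # ['accuracy_reward', 'len_reward']
```
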
@@ -2,6 +2,7 @@
 
 import math
 import re
+from typing import Dict
 
 from latex2sympy2_extended import NormalizationConfig
 from math_verify import LatexExtractionConfig, parse, verify
@@ -74,6 +75,79 @@ def reasoning_steps_reward(completions, **kwargs):
     return [min(1.0, count / 3) for count in matches]
 
 
+def len_reward(completions: list[Dict[str, str]], solutions: list[str], **kwargs) -> float:
+    """Compute length-based rewards to discourage overthinking and promote token efficiency.
+
+    Taken from the Kimi 1.5 tech report: https://arxiv.org/abs/2501.12599
+
+    Args:
+        completions: List of model completions
+        solutions: List of ground truth solutions
+
+    Returns:
+        List of rewards where:
+        - For correct answers: reward = 0.5 - (len - min_len)/(max_len - min_len)
+        - For incorrect answers: reward = min(0, 0.5 - (len - min_len)/(max_len - min_len))
+    """
+    contents = [completion[0]["content"] for completion in completions]
+
+    # First check correctness of answers
+    correctness = []
+    for content, sol in zip(contents, solutions):
+        gold_parsed = parse(
+            sol,
+            extraction_mode="first_match",
+            extraction_config=[LatexExtractionConfig()],
+        )
+        if len(gold_parsed) == 0:
+            # Skip unparseable examples
+            correctness.append(True)  # Treat as correct to avoid penalizing
+            print("Failed to parse gold solution: ", sol)
+            continue
+
+        answer_parsed = parse(
+            content,
+            extraction_config=[
+                LatexExtractionConfig(
+                    normalization_config=NormalizationConfig(
+                        nits=False,
+                        malformed_operators=False,
+                        basic_latex=True,
+                        equations=True,
+                        boxed=True,
+                        units=True,
+                    ),
+                    boxed_match_priority=0,
+                    try_extract_without_anchor=False,
+                )
+            ],
+            extraction_mode="first_match",
+        )
+        correctness.append(verify(answer_parsed, gold_parsed))
+
+    # Calculate lengths
+    lengths = [len(content) for content in contents]
+    min_len = min(lengths)
+    max_len = max(lengths)
+
+    # If all responses have the same length, return zero rewards
+    if max_len == min_len:
+        return [0.0] * len(completions)
+
+    rewards = []
+    for length, is_correct in zip(lengths, correctness):
+        lambda_val = 0.5 - (length - min_len) / (max_len - min_len)
+
+        if is_correct:
+            reward = lambda_val
+        else:
+            reward = min(0, lambda_val)
+
+        rewards.append(float(reward))
+
+    return rewards
+
+
 def get_cosine_scaled_reward(
     min_value_wrong: float = -1.0,
     max_value_wrong: float = -0.5,

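To make the docstring's formula concrete, here is a small worked example of the length scaling alone. It is not the repository's code; it just re-derives `lambda = 0.5 - (len - min_len)/(max_len - min_len)` for a toy batch where the first two completions are assumed correct and the last incorrect:

```python
# Toy illustration of the Kimi-style length reward computed by len_reward above.
lengths = [100, 300, 500]          # character lengths of three completions
correct = [True, True, False]      # assumed correctness of each completion

min_len, max_len = min(lengths), max(lengths)
rewards = []
for length, is_correct in zip(lengths, correct):
    lam = 0.5 - (length - min_len) / (max_len - min_len)  # ranges over [-0.5, 0.5]
    rewards.append(lam if is_correct else min(0.0, lam))   # incorrect answers are capped at 0

print(rewards)  # [0.5, 0.0, -0.5]: the shortest correct answer gets the largest reward
```
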
@@ -48,6 +48,7 @@ from transformers.trainer_utils import get_last_checkpoint
 
 from open_r1.configs import SFTConfig
 from open_r1.utils.callbacks import get_callbacks
+from open_r1.utils.wandb_logging import init_wandb_training
 from trl import (
     ModelConfig,
     ScriptArguments,
@@ -88,7 +89,7 @@ def main(script_args, training_args, model_args):
     )
     logger.info(f"Model parameters {model_args}")
     logger.info(f"Script parameters {script_args}")
-    logger.info(f"Data parameters {training_args}")
+    logger.info(f"Training parameters {training_args}")
 
     # Check for last checkpoint
     last_checkpoint = None
@@ -97,6 +98,9 @@ def main(script_args, training_args, model_args):
     if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
         logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.")
 
+    if "wandb" in training_args.report_to:
+        init_wandb_training(training_args)
+
     ################
     # Load datasets
     ################

src/open_r1/utils/wandb_logging.py (new file): 11 changes

@@ -0,0 +1,11 @@
+import os
+
+
+def init_wandb_training(training_args):
+    """
+    Helper function for setting up Weights & Biases logging tools.
+    """
+    if training_args.wandb_entity is not None:
+        os.environ["WANDB_ENTITY"] = training_args.wandb_entity
+    if training_args.wandb_project is not None:
+        os.environ["WANDB_PROJECT"] = training_args.wandb_project

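The new helper only exports environment variables that the `wandb` client reads later, at `wandb.init()` time. A quick sketch of the effect, assuming the open-r1 package is installed locally (`pip install -e .`) and using a stand-in object in place of the real parsed training arguments:

```python
import os
from types import SimpleNamespace

from open_r1.utils.wandb_logging import init_wandb_training  # added in this compare

# Stand-in for the training arguments; only the two new fields matter here.
# The entity/project values mirror the README example and are illustrative.
training_args = SimpleNamespace(wandb_entity="huggingface", wandb_project="open-r1")

init_wandb_training(training_args)
print(os.environ["WANDB_ENTITY"], os.environ["WANDB_PROJECT"])  # huggingface open-r1
```
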
@@ -5,6 +5,7 @@ from open_r1.rewards import (
     format_reward,
     get_cosine_scaled_reward,
     get_repetition_penalty_reward,
+    len_reward,
     reasoning_steps_reward,
 )
 
@@ -110,6 +111,75 @@ class TestRewards(unittest.TestCase):
         rewards = format_reward(completion)
         self.assertEqual(rewards[0], 1.0)
 
+    def test_same_length_responses(self):
+        """Test len_reward when all responses have the same length."""
+        completions = [[{"content": r"\boxed{\frac{63}{400}}"}], [{"content": r"\boxed{\frac{64}{400}}"}]]
+        solutions = [r"\frac{63}{400}", r"\frac{63}{400}"]
+
+        rewards = len_reward(completions, solutions)
+        self.assertEqual(rewards, [0.0, 0.0])
+
+    def test_different_lengths_correct_answers(self):
+        """Test len_reward with different length correct answers."""
+        completions = [
+            [{"content": r"\boxed{\frac{63}{400}}"}],  # shorter
+            [{"content": r"\boxed{\frac{63}{400}} " + "x" * 10}],  # longer
+        ]
+        solutions = [r"\frac{63}{400}", r"\frac{63}{400}"]
+
+        rewards = len_reward(completions, solutions)
+        self.assertGreater(rewards[0], rewards[1])  # shorter answer should get higher reward
+        self.assertAlmostEqual(rewards[0], 0.5)  # shortest correct answer gets maximum reward
+
+    def test_different_lengths_incorrect_answers(self):
+        """Test len_reward with different length incorrect answers."""
+        completions = [
+            [{"content": r"\boxed{\frac{64}{400}}"}],  # shorter
+            [{"content": r"\boxed{\frac{64}{400}} " + "x" * 10}],  # longer
+        ]
+        solutions = [r"\frac{63}{400}", r"\frac{63}{400}"]
+
+        rewards = len_reward(completions, solutions)
+        self.assertLessEqual(rewards[0], 0.0)  # incorrect answers should get non-positive rewards
+        self.assertLessEqual(rewards[1], 0.0)
+        self.assertGreater(rewards[0], rewards[1])  # shorter answer should still be penalized less
+
+    def test_mixed_correctness(self):
+        """Test len_reward with mix of correct and incorrect answers of different lengths."""
+        completions = [
+            [{"content": r"\boxed{\frac{63}{400}}"}],  # correct, shorter
+            [{"content": r"\boxed{\frac{63}{400}} " + "x" * 10}],  # correct, longer
+            [{"content": r"\boxed{\frac{64}{400}}"}],  # incorrect, shorter
+            [{"content": r"\boxed{\frac{64}{400}} " + "x" * 10}],  # incorrect, longer
+        ]
+        solutions = [r"\frac{63}{400}"] * 4
+
+        rewards = len_reward(completions, solutions)
+
+        # Shortest correct answer should get positive reward
+        self.assertGreater(rewards[0], 0.0)
+
+        # Longer correct answer might get negative reward:
+        self.assertGreater(rewards[2], rewards[1])
+        self.assertGreaterEqual(rewards[1], rewards[3])
+
+        # Incorrect answers should get non-positive rewards
+        self.assertLessEqual(rewards[2], 0.0)
+        self.assertLessEqual(rewards[3], 0.0)
+
+        # Shorter answers should get better rewards within their correctness category
+        self.assertGreater(rewards[0], rewards[1])  # correct answers
+        self.assertGreater(rewards[2], rewards[3])  # incorrect answers
+
+    def test_unparseable_solution(self):
+        """Test len_reward with unparseable solution."""
+        completions = [[{"content": r"\boxed{answer}"}], [{"content": r"\boxed{answer} " + "x" * 10}]]
+        solutions = ["unparseable_latex", "unparseable_latex"]
+
+        rewards = len_reward(completions, solutions)
+        self.assertGreater(rewards[0], rewards[1])  # shorter answer should still get better reward
+        self.assertAlmostEqual(rewards[0], 0.5)  # treated as correct, shortest gets maximum reward
+
+
 class TestRepetitionPenaltyReward(unittest.TestCase):
     def test_positive_max_penalty_raises_value_error(self):