{"id":53776,"date":"2025-02-16T14:20:03","date_gmt":"2025-02-16T06:20:03","guid":{"rendered":"https:\/\/fwq.ai\/blog\/53776\/"},"modified":"2025-02-16T14:20:03","modified_gmt":"2025-02-16T06:20:03","slug":"%e5%a4%8d%e7%8e%b0deepseek-r1%e7%9a%84%e9%a1%bf%e6%82%9f%e6%97%b6%e5%88%bb","status":"publish","type":"post","link":"https:\/\/fwq.ai\/blog\/53776\/","title":{"rendered":"Reproducing DeepSeek-R1's Aha Moment"},"content":{"rendered":"<p>The release of DeepSeek R1 shocked the industry. Why? Because DeepSeek-R1 is an open model that rivals OpenAI's o1 on complex reasoning tasks, and it was trained with Group Relative Policy Optimization (GRPO) and a multi-stage training approach centered on reinforcement learning. The team not only released the model but also published a research paper describing how they did it.<\/p>\n<p>In the paper, they describe an \u201caha moment\u201d that occurred while training the model with pure reinforcement learning. During this phase, DeepSeek-R1-Zero (the first test of DeepSeek-R1) learned to allocate more thinking time to a problem by re-evaluating its initial approach, without any human feedback or data describing how to do so. They describe this as an \u201caha moment\u201d:<\/p>\n<blockquote><p>\n  
This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.\n<\/p><\/blockquote>\n<p>In this blog post, we want to recreate a small \u201caha moment\u201d of DeepSeek-R1 using Group Relative Policy Optimization (GRPO) and the Countdown game. We will train an open model with reinforcement learning and try to teach it self-verification and search abilities so that it can solve the Countdown game. The Countdown game is a numbers puzzle in which players use a set of randomly drawn numbers and basic arithmetic operations (+, -, \u00d7, \u00f7) to reach a target number or get as close to it as possible.<\/p>\n<pre><code>Target Number: 952\nAvailable Numbers: 25, 50, 75, 100, 3, 6\n\n((100 + 6) \u00d7 3 \u00d7 75 - 50) \/ 25 = 952\n<\/code><\/pre>\n<p>This blog post includes interactive code that you can run in a Jupyter Notebook to see how to train a model with GRPO and Q-LoRA. It is a good way to learn how to use TRL and GRPO, but it is very slow and compute-intensive. In addition, I added a script and instructions for running the training on a node with multiple GPUs or on a SLURM cluster.<\/p>\n<blockquote><p>\n  
Note: This blog post is inspired by Jiayi Pan, who first explored this idea and demonstrated it with a small model.\n<\/p><\/blockquote>\n<p>Before we start, let's take a look at Group Relative Policy Optimization (GRPO) and understand how it works.<\/p>\n<h2>Group Relative Policy Optimization (GRPO)<\/h2>\n<p>Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm for improving the reasoning capabilities of LLMs. It was introduced in the DeepSeekMath paper in the context of mathematical reasoning. GRPO modifies traditional Proximal Policy Optimization (PPO) by eliminating the need for a value function model. Instead, it estimates the baseline from group scores, which reduces memory usage and computational overhead. GRPO, now also used by the Qwen team, works with rule-based or binary rewards as well as general reward models to improve model helpfulness.<\/p>\n<ul>\n<li>Sampling: Generate multiple outputs for each prompt using the current policy.<\/li>\n<li>Reward scoring: Score each generation with a reward function, which can be rule-based or outcome-based.<\/li>\n<li>Advantage calculation: The average reward of the generated outputs serves as the baseline. The advantage of each solution within the group is then computed relative to this baseline, and the rewards are normalized within the group.<\/li>\n<li>Policy optimization: The policy maximizes the GRPO objective, which includes the computed advantages and a KL divergence term. This differs from PPO, which applies the KL term inside the reward.<\/li>\n<\/ul>\n<h2>1. Set up the development environment<\/h2>\n<p>Our first step is to install PyTorch, vLLM and the Hugging Face libraries trl, transformers and datasets. If you haven't heard of trl yet, don't worry. It is a new library on top of transformers and datasets that makes it easier to fine-tune, apply RLHF to, and align open LLMs.<\/p>\n<pre><code># Install Pytorch &amp; other libraries, make sure to match your GPU driver version\n%pip install \"torch==2.5.1\" tensorboard \"setuptools&lt;71.0.0\"  --index-url https:\/\/download.pytorch.org\/whl\/cu121\n \n# Install flash-attn\n%pip install 
flash-attn \n \n# Install Hugging Face libraries\n%pip install  --upgrade \\\n  \"transformers==4.48.1\" \\\n  \"datasets==3.1.0\" \\\n  \"accelerate==1.3.0\" \\\n  \"hf-transfer==0.1.9\" \\\n  \"deepspeed==0.15.4\" \\\n  \"trl==0.14.0\"\n \n# install vLLM \n%pip install \"vllm==0.7.0\"\n \n## IMPORTANT: If you want to run the notebook and the interactive cells you also need to install the following libraries:\n# But first read the blog post and then decide, as they might conflict with the libraries for distributed training. \n# %pip install \"peft==0.14.0\" \"bitsandbytes==0.45.0\"\n <\/code><\/pre>\n<blockquote><p>\n  Note: You may need to restart the kernel to use the updated packages.\n<\/p><\/blockquote>\n<p>We will use the Hugging Face Hub as a remote model versioning service. This means our model, logs and other information are pushed to the Hub automatically during training. You must register on Hugging Face for this. Once you have an account, we will use the login utility from the huggingface_hub package to log in to our account and store our token (access key) on disk.<\/p>\n<pre><code>from huggingface_hub import login\n \nlogin(token=\"\", add_to_git_credential=True) # ADD YOUR TOKEN HERE<\/code><\/pre>\n<h2>2. Generate training samples with a reasoning prefix from the Countdown game<\/h2>\n<p>We will use the Jiayi-Pan\/Countdown-Tasks-3to4 dataset, which contains samples with 3 to 4 numbers and solutions.<\/p>\n<p>For the model, we will use Qwen\/Qwen2.5-3B-Instruct, a 3B-parameter instruction-tuned model. This makes it easier to showcase the \u201caha moment\u201d because the model already follows the prompt format, but you can also use the base version of Qwen or other models. Jiayi Pan found that a model needs a certain level of quality to learn the reasoning process, starting at &gt; 1.5B parameters.<\/p>\n<pre><code>from transformers import AutoTokenizer\nfrom datasets import load_dataset\n \n# Load dataset from Hugging Face Hub\ndataset_id = \"Jiayi-Pan\/Countdown-Tasks-3to4\"\ndataset = load_dataset(dataset_id, split=\"train\")\n# select a random subset of 50k samples\ndataset = dataset.shuffle(seed=42).select(range(50000))\n \n# Load tokenizer from Hugging Face Hub to format the dataset to our \"r1\" prompt \ntokenizer = AutoTokenizer.from_pretrained(\"Qwen\/Qwen2.5-3B-Instruct\")\n \n# generate r1 prompt with a prefix for the model to already start with the thinking process\ndef generate_r1_prompt(numbers, target):\n    r1_prefix = [{\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.\"\n      },\n      { \n        \"role\": \"user\",\n        \"content\": f\"Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, \/) and each number can only be used once. Show your work in &lt;think&gt; &lt;\/think&gt; tags. 
And return the final equation and answer in &lt;answer&gt; &lt;\/answer&gt; tags, for example &lt;answer&gt; (1 + 2) \/ 3 = 1 &lt;\/answer&gt;.\"\n      },\n      {\n        \"role\": \"assistant\",\n        \"content\": \"Let me solve this step by step.\\n&lt;think&gt;\"\n      }]\n    return {\"prompt\": tokenizer.apply_chat_template(r1_prefix, tokenize=False, continue_final_message=True), \"target\": target}\n \n# convert our dataset to the r1 prompt\ndataset = dataset.map(lambda x: generate_r1_prompt(x[\"nums\"], x[\"target\"]))\n \n# split the dataset into train and test\ntrain_test_split = dataset.train_test_split(test_size=0.1)\n \ntrain_dataset = train_test_split[\"train\"]\ntest_dataset = train_test_split[\"test\"]<\/code><\/pre>\n<h2>3. Train the model using GRPO (educational part)<\/h2>\n<blockquote><p>\n  Note: Section 3 covers the basics of using TRL and GRPO. If you want to run the interactive cells, you need to install bitsandbytes and peft, since the Trainer class requires them. This section is mainly for educational purposes.\n<\/p><\/blockquote>\n<p>TRL supports Group Relative Policy Optimization (GRPO) through a dedicated GRPOTrainer for aligning LLMs from preference data, as described in the DeepSeekMath paper. GRPOTrainer is a subclass of the Trainer from the transformers library and supports the same features, including logging, checkpointing, distributed training and parameter-efficient fine-tuning (PEFT).<\/p>\n<p>GRPOTrainer supports generic outcome reward models (ORMs) and custom reward functions, which can be used to implement rule-based reward models. In the DeepSeek R1 paper, the authors implemented rule-based reward models to verify the correctness of generated solutions. In our example we take a similar approach and create 2 reward functions:<\/p>\n<ul>\n<li>Format reward: Checks whether the generation follows the correct format <code>&lt;think&gt; [thinking] &lt;\/think&gt;\\n&lt;answer&gt; [answer] &lt;\/answer&gt;<\/code><\/li>\n<li>Accuracy reward: Extracts the equation from the <code>&lt;answer&gt;<\/code> tag, evaluates it against the target, and checks that every number is used exactly once.<\/li>\n<\/ul>\n<p>Note: A correct <code>&lt;answer&gt;<\/code> in our example contains the equation, for example <code>&lt;answer&gt; 55 + 36 - 7 - 19 &lt;\/answer&gt;<\/code>:<\/p>\n<pre><code>import re\n \ndef format_reward_func(completions, target, **kwargs):\n    \"\"\"\n    Format: &lt;think&gt;...&lt;\/think&gt;&lt;answer&gt;...&lt;\/answer&gt;\n    Args:\n        completions (list[str]): Generated outputs\n        target (list[str]): Expected answers\n \n    Returns:\n        list[float]: Reward scores\n    \"\"\"\n    rewards = []\n \n    for completion, gt in zip(completions, target):\n \n      try:\n        # add synthetic &lt;think&gt; since it is already part of the prompt and prefilled for the assistant, which makes the regex easier to match\n        completion = \"&lt;think&gt;\" + completion        \n        # Check if the format is correct\n        regex = 
r\"^&lt;think&gt;([^&lt;]*(?:&lt;(?!\/?think&gt;)[^&lt;]*)*)&lt;\\\/think&gt;\\n&lt;answer&gt;([\\s\\S]*?)&lt;\\\/answer&gt;$\"\n \n        match = re.search(regex, completion, re.DOTALL) \n        # if the format is not correct, reward is 0\n        if match is None or len(match.groups()) != 2:\n            rewards.append(0.0)\n        else:\n            rewards.append(1.0)\n      except Exception:\n        rewards.append(0.0)\n    return rewards\n \ndef equation_reward_func(completions, target, nums, **kwargs):\n    \"\"\"\n    Evaluates completions based on the mathematical correctness of the answer.\n \n    Args:\n        completions (list[str]): Generated outputs\n        target (list[str]): Expected answers\n        nums (list[list[int]]): Available numbers for each sample\n    \n    Returns:\n        list[float]: Reward scores\n    \"\"\"\n    rewards = []\n    for completion, gt, numbers in zip(completions, target, nums):\n      try:\n        # add synthetic &lt;think&gt; since it is already part of the prompt and prefilled for the assistant, which makes the regex easier to match\n        completion = \"&lt;think&gt;\" + completion\n        # Check if the completion contains an &lt;answer&gt; tag\n        match = re.search(r\"&lt;answer&gt;(.*?)&lt;\\\/answer&gt;\", completion)\n        if match is None:\n            rewards.append(0.0)\n            continue\n        # Extract the \"answer\" part from the completion\n        equation = match.group(1).strip()\n        # Extract all numbers from the equation\n        used_numbers = [int(n) for n in re.findall(r'\\d+', equation)]\n        \n        # Check if all numbers are used exactly once\n        if sorted(used_numbers) != sorted(numbers):\n            rewards.append(0.0)\n            continue\n        # Define a regex pattern that only allows numbers, operators, parentheses, and whitespace\n        allowed_pattern = r'^[\\d+\\-*\/().\\s]+$'\n        if not re.match(allowed_pattern, equation):\n            rewards.append(0.0)\n            continue\n        \n   
     # Evaluate the equation with restricted globals and locals\n        result = eval(equation, {\"__builtins__\": None}, {})\n        # Check if the equation is correct and matches the ground truth\n        if abs(float(result) - float(gt)) &lt; 1e-5:\n            rewards.append(1.0)\n        else:\n            rewards.append(0.0)\n      except Exception:\n            # If evaluation fails, reward is 0\n            rewards.append(0.0) \n    return rewards<\/code><\/pre>\n<p>Let's try our reward functions on a few samples.<\/p>\n<blockquote><p>\n  Note: None of the samples starts with <em><code>&lt;think&gt;<\/code><\/em>, because we add it synthetically to the prompt.\n<\/p><\/blockquote>\n<pre><code>correct_sample_1 = \"\"\"We need to find an equation using the numbers 19, 36, 55, and 7\nexactly once, with basic arithmetic operations, that equals 65. One possible\ncombination is 55 + 36 - 19 + 7... &lt;\/think&gt;\n&lt;answer&gt; 55 + 36 - 7 - 19 &lt;\/answer&gt;\"\"\"\n \ncorrect_sample_2 = \"\"\" ... &lt;\/think&gt;\n&lt;answer&gt; 55 + 36 - 7 - 19 &lt;\/answer&gt;\"\"\"\n \nwrong_format = \"\"\"User: Using the numbers [19, 36, 55, 7], create an equation that equals 65.\"\"\"\n \nwrong_format_2 = \"\"\"To find the equation that equals 79 using the numbers 95, 78, 6, 88, I'll start by adding 88 and 95:\n95 + 88 = 183\nNow, let's subtract 104 from 183 to get 79:\n183 - 104 = 79\n&lt;think&gt; 183 - 104 = 79 &lt;\/think&gt;&lt;think&gt; 183 - 104 = 79 &lt;\/think&gt;&lt;answer&gt; 183 - 104 = 79 &lt;\/answer&gt;\"\"\"\n \nwrong_result = \"\"\" ... 
&lt;\/think&gt;\n&lt;answer&gt; 55 + 36 - 7 - 18 &lt;\/answer&gt;\"\"\"\n \n \ntest_rewards = format_reward_func(completions=[correct_sample_1, correct_sample_2, wrong_format, wrong_format_2, wrong_result], target=[\"65\", \"65\", \"65\", \"65\", \"65\"], nums=[[19, 36, 55, 7]] * 5)\nassert test_rewards == [1.0, 1.0, 0.0, 0.0, 1.0], \"Reward function is not working\"\ntest_rewards = equation_reward_func(completions=[correct_sample_1, correct_sample_2, wrong_format, wrong_format_2, wrong_result], target=[\"65\", \"65\", \"65\", \"65\", \"65\"], nums=[[19, 36, 55, 7]] * 5)\nassert test_rewards == [1.0, 1.0, 0.0, 0.0, 0.0], \"Reward function is not working\"<\/code><\/pre>\n<p>That looks good. Now let's define the remaining training parameters, create a trainer and start training.<\/p>\n<pre><code>from trl import GRPOConfig, GRPOTrainer, get_peft_config, ModelConfig\n \n# our model we are going to use as policy \nmodel_config = ModelConfig(\n    model_name_or_path=\"Qwen\/Qwen2.5-3B-Instruct\",\n    torch_dtype=\"bfloat16\",\n    attn_implementation=\"flash_attention_2\",\n    use_peft=True,\n    load_in_4bit=True,\n)\n \n# Hyperparameters\ntraining_args = GRPOConfig(\n    output_dir=\"qwen-r1-aha-moment\",\n    learning_rate=5e-7,\n    lr_scheduler_type=\"cosine\",\n    logging_steps=10,\n    max_steps=100,\n    per_device_train_batch_size=1,\n    gradient_accumulation_steps=1,\n    gradient_checkpointing=True,\n    gradient_checkpointing_kwargs={\"use_reentrant\": False},\n    bf16=True,\n    # GRPO specific parameters\n    max_prompt_length=256,\n    max_completion_length=1024, # max length of the generated output for our solution\n    num_generations=2,\n    beta=0.001,\n    \n)\ntrainer = GRPOTrainer(\n    model=model_config.model_name_or_path,\n    reward_funcs=[format_reward_func, equation_reward_func],\n    args=training_args,\n    train_dataset=train_dataset,\n  
  eval_dataset=test_dataset,\n    peft_config=get_peft_config(model_config),\n)<\/code><\/pre>\n<p>We can start training by calling the train method on the trainer instance.<\/p>\n<blockquote><p>\n  Note: Reinforcement learning training is very slow and compute-intensive. A single step on 1x L4 with Q-LoRA, a batch size of 1 and only 2 generations per sample takes more than 20 minutes.\n<\/p><\/blockquote>\n<pre><code># Train and push the model to the Hub\ntrainer.train()\n# Save model\ntrainer.save_model(training_args.output_dir)<\/code><\/pre>\n<h2>4. Distributed training example for GRPO using DeepSpeed and vLLM<\/h2>\n<p>More than 20 minutes per step with only 2 generations per sample is not feasible, so we need to scale up our training. Hugging Face TRL added support for distributed training with DeepSpeed and for faster generation with vLLM. I prepared a run_r1_grpo.py script and a grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml config file to run the training.<\/p>\n<p>This configuration was tested and validated on a node with 4x H100 80GB, where a single step takes around 45-60 seconds because vLLM handles generation and DeepSpeed handles distributed training. Make sure num_processes is set to the number of GPUs you have minus 1, since the last GPU is reserved for generation with vLLM. If you use more GPUs, change vllm_device in the config file to the index of the last GPU; for example, with 8 GPUs set vllm_device=7 and num_processes to 7.<\/p>\n<p>Command to run the training:<\/p>\n<pre><code>accelerate launch --num_processes 3 --config_file configs\/accelerate_configs\/deepspeed_zero3.yaml scripts\/run_r1_grpo.py --config receipes\/grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml<\/code><\/pre>\n<p>With the optimized distributed training, a single step with 8 generations per sample takes around 45-60 seconds on 4x H100 80GB. The full training run of 450 steps takes around 6 hours.<\/p>\n<h2>5. Results and training observations<\/h2>\n<p>The script saves random completions to the completion_samples folder, which you can use to inspect the model's progress. The folder contains completion_samples.txt and success_completion_samples.txt. completion_samples.txt includes all completions, while success_completion_samples.txt contains only those that solve the equation correctly. Below you can find interesting training observations on how performance changes over time, along with the TensorBoard logs and successful reasoning samples.<\/p>\n<p>The model, with checkpoints for every 25th step, can be found on the Hugging Face Hub.<\/p>\n<h3>5.1 Hyperparameters<\/h3>\n<p>I started the experiments with the hyperparameters from the DeepSeekMath paper, a learning rate of 1e-6 and a beta (KL coefficient) of 0.04, which led to unstable training runs after around 150 steps. I ran some small ablations and, based on tests reported elsewhere, decreased the learning rate to 5e-7 and the beta to 0.001. I could not test how increasing <code>num_generations<\/code> from 8 to 64 would affect training; 64 is the number of generations used in the DeepSeekMath paper. All other parameters can be found in the grpo-qwen-2.5-3b-deepseek-r1-countdown.yaml config file.<\/p>\n<h3>5.2 Training observations<\/h3>\n<ul>\n<li>At ~50 steps, the model has learned the correct format <code>&lt;think&gt;...&lt;\/think&gt;\\n&lt;answer&gt;...&lt;\/answer&gt;<\/code>.<\/li>\n<li>At 100 steps, the success rate for solving the equation is around 25%. The model starts to \u201creason\u201d in words; see the examples below.<\/li>\n<li>At 200 steps, performance seems to converge much more slowly, with a success rate of around 40%. The model starts to learn a new \u201cformat\u201d in which it solves the equation the way a program would, by trying different combinations and checking the results; see \u201cSuccessful reasoning samples between step 200 and 450\u201d.<\/li>\n<li>At 450 steps, the success rate for solving the equation is 50%. Performance is still improving slowly, and the model keeps the new format it adopted at step 200.<\/li>\n<\/ul>\n<p>I have 4 possible hypotheses for why the model shifts from \u201cword reasoning\u201d to \u201cprogrammatic execution\u201d:<\/p>\n<ul>\n<li>Qwen 2.5 3B is not strong enough or is too small; DeepSeek mentions that you need a very strong base model.<\/li>\n<li>The reward functions are not defined well enough, so the model reward-hacks its way to solving the equation. We could try to force it to use words, for example with a condition on the ratio of numbers to words. (We don't know much about DeepSeek's reward functions.)<\/li>\n<li>Training only on Countdown game tasks may naturally push the model to learn the most efficient way to solve the equation, since no other formats are required.<\/li>\n<li>We did not train the model long enough; the R1 paper shows training visuals for over 8000 steps.<\/li>\n<\/ul>\n<p>TensorBoard results<\/p>\n<p>Successful reasoning samples up to step 200:<\/p>\n<pre><code>&lt;think&gt; First, I need to think about the numbers we have: 14, 70, and 80. To get 85, I think 80 might be a good starting point because it's right in between 85. If I subtract or add 80, it won't lead to 85 easily. Next, 70 seems to be a good intermediate number, especially to use in conjunction with 14. \nNow, I can think about different combinations using addition, subtraction and multiplication\/division. If I take 80 - 70 = 10, which is not needed. 
So, I need to find a way to use 14. If I subtract 14 from 80, I get 66, which still isn't 85. If I subtract 70 from 80, I get 10 again, this isn't constructive yet. \nMultiplication could create numbers larger than the target 85. Let's consider division: If I divide 70 by 14, I get 5. Then if I add this result to 80, I get 85. This combination seems promising: 80 - (70 \/ 14) = 80 - 5 = 75, this still isn't quite right. Let me correct my approach: Let's use division and then addition. If I take 80 \/ (70 \/ 14) \u2192 80 \/ 5 = 16, and then 80 - 16 = 64, still incorrect. Let me simplify: 80 \/ (70 \/ 14) \u2192 80 \/ 5 = 16, and then 80 - 16 gives 64 again, let me correct: 80 - (70 \/ 14) \u2192 80 - 5 = 75, but I need 85. Thus I will use the combinations in a slightly different order to insert numbers: I get 80 + (70 \/ 14) = 80 + 5 = 85&lt;\/think&gt;\n&lt;answer&gt; 80 + (70 \/ 14) &lt;\/answer&gt;<\/code><\/pre>\n<pre><code>&lt;think&gt; First, I need to consider how I can use the numbers 1, 21, and 90 to reach 68. I can try different combinations of operations. Let's try to use subtraction and addition. If I subtract 90 from 21, I get -69, which is too far from 68. Now, if I add 1 to 90, I get 91, which is also not right. Another way is to see what happens if I subtract 21 from 90, which gives me 69. This is closer but still not 68. Let me use multiplication instead of subtraction. 90 * 1 = 90. Now, if I try to reduce this slightly by subtracting 21, I am not getting close to 68.\nNow, let's try another multiplication. I'll think of 1 * (larger value) + (smaller value). Let's try 1 * 90 + 21. This evaluates to 90 + 21 = 111, which is too much. Instead, I will try: 90 \/ 1 + 21 - 7. But this gives a division by 7, which seems complex. Let's re-evaluate: I'll use 1 to divide 21: 21 \/ 1 = 21. 
Now, subtract 21 from 90, and add 1, we have 90 - 21 = 69, using one...\nHey, wait on re-evaluating I will subtract 21 from 90, and include 1 to complete the solution: 90 - (21 + 1) which would be 68.&lt;\/think&gt;\n&lt;answer&gt; 90 - (21 + 1) &lt;\/answer&gt;<\/code><\/pre>\n<pre><code>&lt;think&gt; First, I need to consider the numbers we have: 59, 80, and 66. I need to use these numbers to get 45. Let's try to use subtraction first since it can quickly reduce large numbers.\nLet's see if we can get close by subtracting or adding these numbers. If I subtract 66 from 80, I get 14. Then I need to make this 14 equal 45 when combined with 59. That gives me 14 + 59 = 73, which is too high.\nNow let's try adding numbers. If I add 59 and 66, I get 125. Then I would need to subtract 80 from 125 to get 45, but 80 is too large to fit well with this.\nLet me try a different approach. What if I subtract 66 from 80 first and then add 59? That would be 14 + 59 = 73, which is still not 45.\nHmm, I need to think of another combination. What if I start with 59 and then add 80 and subtract 66? That would be (59 + 80) - 66, let me calculate it. That equals 139 - 66, which is 73.\nThat's still not 45, so let me try rearranging it slightly: maybe 80 - 66 + 59 would be closer to 45. Calculating again: 80 - 66 is 14, then adding 59 gives 73.\nThis way doesn't work either. Let's flip this: maybe (59 + 66) - 80. Let me check: that's 125 - 80, which equals 45.&lt;\/think&gt;\n&lt;answer&gt;(59 + 66) - 80&lt;\/answer&gt;<\/code><\/pre>\n<pre><code>&lt;think&gt; First, I need to find a combination of the numbers 61, 4, 94, and 21 that can be manipulated using basic arithmetic operations to equal 50. One way to approach this is to look for numbers that can be used to reach a target number close to 50 and then adjust with the remaining numbers.\nLet's try combining 94 and 21 first:\n94 + 21 = 115\nNow, we need to get from 115 to 50. 
We can try subtracting 65:\n115 - 65 = 50\nNow we need to find a way to get 65 using the remaining numbers 61 and 4. We can try:\n61 - 4 = 57\nNow we need to get from 57 to 65. Adding 8 would give us that:\n57 + 8 = 65\nSo, we need to find a way to get 8 using 61 and 4. We can try:\n61 - 94 = -36\nThis is negative, so let's try dividing:\n61 \/ 94 = 0.648987...\nThis is not helping, so let's try multiplication and division in a different way:\n(61 - 4) * (94 \/ 21) = 57 * 4.476... = 254.4\nThis is not working either. Let's try a different approach:\n(94 - 61) = 33\nNow we have:\n33 + 21 = 54\nWe need to adjust from 54 to 50, so we can:\n54 - 4 = 50\nSo, the equation becomes:\n(94 - 61) + (21 - 4) = 50\n&lt;\/think&gt;\n&lt;answer&gt;(94 - 61) + (21 - 4)&lt;\/answer&gt;<\/code><\/pre>\n<pre><code>&lt;think&gt; First, I'll consider the possibility of directly subtracting two numbers to get close to 31. If I subtract 57 from 63, I get 7. Now, if I subtract this 7 from 25, I get 25 - 7 = 18. Far from 31, so this doesn't work.\nNext, let's try adding and subtracting. Dividing 63 by 25 will give a decimal, so that's unlikely to work. Now, let's see if adding and subtracting in another order could work. For example, let's try 63 - 25, and then we see this gets us 38. Now, we need a way to get to 31 from here. Subtracting 7 from 38 would work, but we can't figure out how to create that 7 from the given numbers.\nHmm, let me try to incorporate all numbers by adding and subtracting. If I try 63 + 25, I get 88. This is quite far from 31, but if 88 - 57, I get 31 exactly!\nThis seems to be the equation: 63 + 25 - 57 equals 31.\n&lt;\/think&gt;\n&lt;answer&gt; 63 + 25 - 57 &lt;\/answer&gt;<\/code><\/pre>\n<p>Examples of successful reasoning between steps 200 and 450<\/p>\n<pre><code>&lt;think&gt; We need to use the numbers 39, 16, and 55 with basic arithmetic operations to get 78. 
Let's try different combinations:\n- 55 + 39 - 16 = 84 - 16 = 68 (too low)\n- 55 + 16 - 39 = 71 - 39 = 32 (too low)\n- 55 - 16 + 39 = 39 + 39 = 78 (this works)\nSo, the equation is 55 - 16 + 39 = 78. &lt;\/think&gt;\n&lt;answer&gt; 55 - 16 + 39 &lt;\/answer&gt;<\/code><\/pre>\n<pre><code>&lt;think&gt; We need to use the numbers 21, 5, and 40 with basic arithmetic operations to get 24. Let's try different combinations:\n- 21 - 5 - 40 = -15\n- 21 - 5 + 40 = 56\n- 21 + 5 - 40 = -14\n- 21 * 5 \/ 40 = 2.1 (not an integer, let's try another)\n- 40 * 5 \/ 21 = 9.52 (not 24)\n- 40 \/ 5 - 21 = 9 - 21 = -12\n- 5 * 40 \/ 21 = 100 \/ 21 ~= 4.76 (not 24)\n- 5 + 40 - 21 = 24 ! This works.\n&lt;\/think&gt;\n&lt;answer&gt; 5 + 40 - 21 &lt;\/answer&gt;<\/code><\/pre>\n<pre><code>&lt;think&gt; We need to use the numbers 75, 99, 62, and 34 with basic arithmetic operations to get 72. Let's try different combinations:\n- 99 - 75 + 62 - 34 = 92 - 34 = 58 (too high)\n- 99 - 75 - 62 + 34 = 24 - 62 + 34 = -38 + 34 = -4 (too low)\n- 99 + 75 - 62 - 34 = 174 - 96 = 78 (too high)\n- 99 + 75 - 62 + 34 = 174 + 34 - 62 = 208 - 62 = 146 (too high)\n- 99 - 75 + 62 + 34 = 24 + 96 = 120 (too high)\n- 75 + 99 - 62 - 34 = 174 - 96 = 78 (too high)\n- 75 + 99 - 62 + 34 = 174 + 34 - 62 = 208 - 62 = 146 (too high)\n- 75 + 62 - 99 + 34 = 137 - 99 + 34 = 38 + 34 = 72\nSo, 75 + 62 - 99 + 34 equals 72.\n&lt;\/think&gt;\n&lt;answer&gt; 75 + 62 - 99 + 34 &lt;\/answer&gt;<\/code><\/pre>\n<h2>6. Conclusion<\/h2>\n<p>The release of DeepSeek R1 and its research paper may prove to be a turning point for open science and open-source development. Just one week after the DeepSeek release, we were able to reproduce a simple version of R1's learned \u201creasoning\u201d using GRPO and the Countdown Game.
Although our implementation focuses on a specific task rather than general reasoning, and converges on a very specific \u201creasoning\u201d format, it shows that the method works.<\/p>\n<p>In our mini R1 experiment we used GRPO with two rule-based rewards, yet it already required substantial compute: 4 H100 GPUs running for 6 hours to complete just 450 training steps on a 3B-parameter model. This gives a sense of the compute needed to scale reinforcement learning. DeepSeek ran a 671B model for more than 8,000 steps, and they probably ran many ablations.<\/p>\n<p>Looking ahead in 2025, it is clear that we are on the cusp of even greater progress. RL will become more accessible and user-friendly, and more researchers and developers will explore its potential, but it will also require more compute than before and more than supervised fine-tuning.<\/p>\n<hr>\n","protected":false},"excerpt":{"rendered":"<p>The release of DeepSeek R1 shocked the entire industry. Why? Because DeepSeek-R1 is an open model that rivals OpenAI's o1 in complex reasoning tasks, introduced using Group Relative Policy Optimization (GRPO)
and an RL-focused multi-stage training approach. They not only released the model but also published a research paper on how they did it. In the paper, they describe an \u201caha moment\u201d that occurred when training the model with pure reinforcement learning. During this phase, DeepSeek-R1-Zero (the first test of DeepSeek-R1) learned to allocate more thinking time to a problem by reevaluating its initial approach, without any human feedback or data describing how to do so. They describe this as the \u201caha moment\u201d: This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes. In this blog post, we want to recreate the small \u201caha moment\u201d of DeepSeek-R1
using Group Relative Policy Optimization (GRPO) and the Countdown Game. We will train an open model using reinforcement learning, trying to teach it self-verification and search abilities so that it can solve the Countdown Game. The Countdown Game is a numbers puzzle in which players use a set of randomly drawn numbers and basic arithmetic operations (+, -, \u00d7, \u00f7) to reach or get as close as possible to a target number. Target Number: 952 Available Numbers: 25, 50, 75, 100, 3, 6 (100 \u00d7 (3 \u00d7 3)) + (50 + 6 \/ 3) = 952 This blog post includes interactive code that you can run in a Jupyter Notebook to learn how to train a model using GRPO and Q-Lora. This is a great way to learn how to use TRL and GRPO, but it is very slow and requires a lot of compute. Additionally, I added a script and instructions to run the training on a node with multiple 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13],"tags":[],"class_list":["post-53776","post","type-post","status-publish","format-standard","hentry","category-ai"],"_links":{"self":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/posts\/53776","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/comments?post=53776"}],"version-history":[{"count":0,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/posts\/53776\/revisions"}],"wp:attachment":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/media?parent=53776"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/categories?post=53776"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/tags?post=53776"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}