{"id":53727,"date":"2025-02-16T11:27:15","date_gmt":"2025-02-16T03:27:15","guid":{"rendered":"https:\/\/fwq.ai\/blog\/53727\/"},"modified":"2025-02-16T11:27:15","modified_gmt":"2025-02-16T03:27:15","slug":"%e7%94%a8grpo%e7%ae%97%e6%b3%95%e8%ae%ad%e7%bb%83%e5%8c%bb%e7%96%97ai%e6%a8%a1%e5%9e%8b","status":"publish","type":"post","link":"https:\/\/fwq.ai\/blog\/53727\/","title":{"rendered":"\u7528GRPO\u7b97\u6cd5\u8bad\u7ec3\u533b\u7597AI\u6a21\u578b"},"content":{"rendered":"<p>\u5927\u578b\u8bed\u8a00\u6a21\u578b\uff08LLMs\uff09\u4e0e\u533b\u7597\u4fdd\u5065\u7684\u4ea4\u53c9\u70b9\u5e26\u6765\u4e86\u4ee4\u4eba\u5174\u594b\u7684\u673a\u4f1a\uff0c\u4f46\u4e5f\u5e26\u6765\u4e86\u72ec\u7279\u7684\u6311\u6218\u3002\u5728\u672c\u6559\u7a0b\u4e2d\uff0c\u6211\u4eec\u5c06\u63a2\u8ba8\u5982\u4f55\u4f7f\u7528<strong>\u5206\u7ec4\u76f8\u5bf9\u7b56\u7565\u4f18\u5316\uff08GRPO\uff09<\/strong>\u2014\u2014\u4e00\u79cd\u6700\u8fd1\u7531DeepSeek\u56e2\u961f\u5f15\u5165\u7684\u6709\u524d\u9014\u7684\u65b0\u5f3a\u5316\u5b66\u4e60\u6280\u672f\u2014\u2014\u6765\u9002\u5e94\u963f\u91cc\u5df4\u5df4\u7684Qwen-3B\u6a21\u578b\u4ee5\u7528\u4e8e\u533b\u5b66\u63a8\u7406\u3002<\/p>\n<p>\u4e3a\u4ec0\u4e48\u8fd9\u5f88\u91cd\u8981\uff1a<\/p>\n<ul>\n<li> <strong>\u60a3\u8005\u5b89\u5168\u7b2c\u4e00<\/strong>\uff1a\u533b\u5b66AI\u4e2d\u7684\u5e7b\u89c9\u53ef\u80fd\u662f\u5371\u9669\u7684\u3002<\/li>\n<li> <strong>\u9886\u57df\u4e13\u4e1a\u5316<\/strong>\uff1a\u901a\u7528LLMs\u96be\u4ee5\u5904\u7406\u4e34\u5e8a\u63a8\u7406\u3002<\/li>\n<li> <strong>\u6548\u7387<\/strong>\uff1a\u6211\u4eec\u76843B\u53c2\u6570\u6a21\u578b\u53ef\u4ee5\u5728\u6d88\u8d39\u7ea7GPU\u4e0a\u8fd0\u884c\u3002<\/li>\n<\/ul>\n<p>\u63a8\u7406\u6a21\u578b\u5982O3\u548cDeepSeek 
R1 have shown unprecedented improvements on many challenging benchmarks. They mark a shift away from pure supervised fine-tuning toward genuine reinforcement learning (RL). Many breakthroughs in deep learning, such as AlphaGo, came from RL, because a model can learn by interacting with a variety of real scenarios that are difficult to cover with examples during supervised fine-tuning.<\/p>\n<p>DeepSeek R1's performance on several key benchmarks.<\/p>\n<p>If you want to learn more about reasoning models, or more of the history, I highly recommend Maarten's article. The beauty of DeepSeek's work is that they delivered a practical framework for fine-tuning an LLM with GRPO. According to Maarten's article:<\/p>\n<blockquote><p>\n  
The intuition behind this algorithm is that it makes all the choices that lead to a correct or an incorrect answer more or less likely. Those choices can be sets of tokens as well as reasoning steps.\n<\/p><\/blockquote>\n<p>As the image below shows, the goal is to incentivize the model to generate responses that score well both inside the <em>&lt;reasoning&gt;<\/em> and <em>&lt;answer&gt;<\/em> blocks and on a final answer we can easily verify (as with math problems).<\/p>\n<p>The RL pipeline used by DeepSeek-R1-Zero.<\/p>\n<p>That's enough background; let's get hands-on. The code for this article is provided as a notebook, and you can easily run it on a free T4 instance.<\/p>\n<h2>1. Installing Unsloth and TRL<\/h2>\n<p>Open-source software has come a long way. In this tutorial we will use two excellent open-source libraries:<\/p>\n<ul>\n<li> <strong>Unsloth:<\/strong> a library that helps us squeeze as much memory as possible out of the GPU and improves training performance.<\/li>\n<li><strong>TRL:<\/strong> an open-source library from Hugging 
Face that helps us implement GRPO.<\/li>\n<\/ul>\n<p>We will also use the <em>QLoRA<\/em> technique, which lets us fine-tune the model far more efficiently. If you want to learn more about QLoRA, I highly recommend Sebastian's article.<\/p>\n<pre><code>!pip install unsloth vllm  # memory-efficient training and inference  \n!pip install trl@git+https:\/\/github.com\/huggingface\/trl  # GRPO implementation\n<\/code><\/pre>\n<pre><code>from unsloth import FastLanguageModel, PatchFastRL  \nPatchFastRL(\"GRPO\", FastLanguageModel)\n<\/code><\/pre>\n<h2>2. Downloading and Initializing the Model<\/h2>\n<p>We first download the model, reserving 50% of GPU memory for vLLM inference to speed up GRPO training with QLoRA.<\/p>\n<pre><code>from unsloth import is_bfloat16_supported  \nimport torch  \nmax_seq_length = 2048 # can be increased for longer reasoning traces  \nlora_rank = 64 # larger rank = smarter, but slower  \n  \nmodel, tokenizer = FastLanguageModel.from_pretrained(  \n    model_name = \"Qwen\/Qwen2.5-3B-Instruct\",  \n    max_seq_length = max_seq_length,  \n    load_in_4bit = True, # False for LoRA 16bit  \n    fast_inference = True, # enable vLLM fast inference  \n    max_lora_rank = lora_rank,  \n    gpu_memory_utilization = 0.5, # reduce if you run out of memory  \n)  \n  \nmodel = FastLanguageModel.get_peft_model(  \n    model,  \n    r = lora_rank, # choose any number greater than 0! suggested: 8, 16, 32, 64, 128  \n    target_modules = [  \n        \"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",  \n        
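# note: the lines above target the attention projections, and the lines below  \n        # target the MLP projections; dropping the attention set frees memory on OOM  \n        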
\"gate_proj\", \"up_proj\", \"down_proj\",  \n    ], # \u5982\u679c\u51fa\u73b0\u5185\u5b58\u4e0d\u8db3\u5219\u79fb\u9664QKVO  \n    lora_alpha = lora_rank,  \n    use_gradient_checkpointing = \"unsloth\", # \u542f\u7528\u957f\u4e0a\u4e0b\u6587\u5fae\u8c03  \n    random_state = 3407,  \n)\n<\/code><\/pre>\n<p><strong>\u5173\u952e\u9009\u62e9<\/strong>\uff1a<\/p>\n<ul>\n<li><em>\u91cf\u5316\uff1a\u542f\u752816\/24GB GPU\u8bad\u7ec3\uff08\u517c\u5bb9T4\/A10\uff09<\/em><\/li>\n<li><em>LoRA Rank 64\uff1a\u5e73\u8861\u6027\u80fd\u4e0e\u5185\u5b58<\/em><\/li>\n<li><em>vLLM\u96c6\u6210\uff1a\u751f\u6210\u901f\u5ea6\u63d0\u9ad850%<\/em><\/li>\n<\/ul>\n<h2>3\u3001\u6570\u636e\u7b56\u7565\uff1a\u533b\u5b66\u63a8\u7406\u9e21\u5c3e\u9152<\/h2>\n<p>\u6211\u4eec\u4f7f\u7528Hugging Face\u7684<code>interleave_datasets<\/code>\u6df7\u5408\u4e86\u4e09\u4e2a\u5173\u952e\u6570\u636e\u96c6\uff1a<\/p>\n<p><strong>PubMedQA<\/strong>\uff08\u5360\u603b\u6570\u636e\u768470%\uff09\uff1a<\/p>\n<ul>\n<li>\u4e34\u5e8a\u95ee\u7b54\uff0c\u7b54\u6848\u4e3a<em>yes\/no\/maybe<\/em><\/li>\n<li>\u8fc7\u6ee4\u5230&lt;1024\u4e2atoken\u4ee5\u63d0\u9ad8\u5185\u5b58\u6548\u7387<\/li>\n<\/ul>\n<p><strong>GSM8K<\/strong>\uff1a<\/p>\n<ul>\n<li>\u6570\u5b66\u6587\u5b57\u95ee\u9898\uff0c\u4ee5\u4fdd\u6301\u6570\u503c\u63a8\u7406\u80fd\u529b<\/li>\n<\/ul>\n<p><strong>Health 
Benchmarks<\/strong>:<\/p>\n<ul>\n<li>Multiple-choice questions across more than 50 medical specialties<\/li>\n<li>Categories ranging from cardiology to vaccination<\/li>\n<\/ul>\n<p><strong>Tip<\/strong>: the mix should reflect dataset complexity; PubMedQA gets 3x more exposure to account for its difficulty. We don't apply any explicit weighting here, but we shuffle the combined dataset, and since PubMedQA contributes three times as many samples, the model sees those examples three times as often.<\/p>\n<pre><code>import re  \nfrom datasets import load_dataset, Dataset, interleave_datasets, concatenate_datasets  \n  \n# load and prepare the datasets  \nSYSTEM_PROMPT = \"\"\"  \nRespond in the following format:  \n&lt;reasoning&gt;  \n...  \n&lt;\/reasoning&gt;  \n&lt;answer&gt;  \n...  
\n&lt;\/answer&gt;  \n\"\"\"  \n  \nXML_COT_FORMAT = \"\"\"\\  \n&lt;reasoning&gt;  \n{reasoning}  \n&lt;\/reasoning&gt;  \n&lt;answer&gt;  \n{answer}  \n&lt;\/answer&gt;  \n\"\"\"  \n  \ndef extract_xml_answer(text: str) -&gt; str:  \n    answer = text.split(\"&lt;answer&gt;\")[-1]  \n    answer = answer.split(\"&lt;\/answer&gt;\")[0]  \n    return answer.strip()  \n  \ndef extract_hash_answer(text: str) -&gt; str | None:  \n    if \"####\" not in text:  \n        return None  \n    return text.split(\"####\")[1].strip()  \n  \n# uncomment middle messages for 1-shot prompting  \ndef get_datasets(split = \"train\") -&gt; Dataset:  \n    data = load_dataset('openai\/gsm8k', 'main')[split] # type: ignore  \n    data = data.map(lambda x: { # type: ignore  \n        'prompt': [  \n            {'role': 'system', 'content': SYSTEM_PROMPT},  \n            {'role': 'user', 'content': x['question']}  \n        ],  \n        'answer': extract_hash_answer(x['answer']),  \n        'db_set':'gsm8k'  \n    }) # type: ignore  \n    data = data.remove_columns(['question'])  \n      \n    data_qa = load_dataset(\"qiaojin\/PubMedQA\", \"pqa_artificial\")[split] # oversampled relative to the other datasets  \n    data_qa = data_qa.filter(lambda x: len(\"\\n\".join(x['context']['contexts'])) &lt; 1024) # avoid long traces  \n    data_qa = data_qa.map(lambda x: { # type: ignore  \n        'prompt': [  \n            {'role': 'system', 'content': SYSTEM_PROMPT},  \n            {  \n                \"role\": \"user\",  \n                \"content\": \"Given the scientific context below:\\n\" +   \n                          \"\\n\".join(x['context']['contexts']) +   \n                          \"\\n\\nAnswer the following question:\\n\" +  \n                          x['question'] +   \n                          \" with 'yes', 'no' or 'maybe'. 
You need to carefully review the context and reason before answering.\"  \n            },  \n        ],  \n        'answer': x['final_decision'],  \n        'db_set': 'pubmedqa'  \n    }) # type: ignore  \n    data_qa = data_qa.remove_columns(['pubid', 'question', 'context', 'long_answer', 'final_decision'])  \n      \n      \n    categories =['Lab_Medicine', 'Wearables', 'Dermatology', 'Gastroenterology', 'Internal_Medicine', 'Oncology', 'Orthopedics', 'General_Surgery', 'Ophthalmology', 'Audiology', 'Head_Neck_Surgery', 'Elderly_Care', 'Pediatrics', 'Allergy_Immunology', 'Rheumatology', 'Pharmacy', 'Obstetrics_Gynecology', 'Microbiology', 'Dentistry', 'Physical_Medicine_and_Rehabilitation', 'Neurology', 'Psychiatry', 'Pathology', 'Genetics', 'Rare_Diseases', 'Hematology', 'Emergency', 'Endocrinology', 'Radiology', 'Cardiology', 'Pulmonology', 'Infectious_Diseases', 'Critical_Care', 'Pediatric_Surgery', 'Neuroscience', 'Epidemiology', 'Fitness_Sports', 'Health_Education', 'Health_Economics', 'Health_Entrepreneurship', 'Hospital_Management', 'Mental_Health', 'Nutrition', 'Palliative_Care', 'Preventive_Medicine', 'Public_Health', 'Social_Media_Addiction', 'Sleep', 'Supplements', 'Vaccination', 'Work_Health', 'Wellbeing']  \n    data_mc = concatenate_datasets([load_dataset(\"yesilhealth\/Health_Benchmarks\",i)[i] for i in categories])  \n    data_mc = data_mc.map(lambda x: { # type: ignore  \n        'prompt': [  \n            {'role': 'system', 'content': SYSTEM_PROMPT},  \n            {  \n                \"role\": \"user\",  \n                \"content\": \"\\n\\nAnswer the following question:\\n\" +  \n                          x['Questions'] +   \n                          \"\\n With 'A', 'B', 'C' or 'D'. 
You need to carefully review the context and reason before answering.\"  \n            },  \n        ],  \n        'answer': x['Answers'],  \n        'db_set': 'med_mc'  \n    }) # type: ignore  \n    data_mc = data_mc.remove_columns(['Answers', 'Questions'])  \n      \n    dataset = concatenate_datasets([data, data_qa, data_mc])  \n    return dataset\n<\/code><\/pre>\n<h2>4. The Secret Weapon: Reward Engineering<\/h2>\n<p>Our multi-reward system teaches both <strong>reasoning structure<\/strong> and <strong>medical accuracy<\/strong> (see the notebook for the detailed reward functions); in sketch form:<\/p>\n<pre><code>def correctness_reward(responses, answers):  \n    # 2.0 for an exact match, 1.0 for a partial match, 0.0 otherwise  \n    return [2.0 if resp == ans else (1.0 if ans in resp else 0.0)  \n            for resp, ans in zip(responses, answers)]  \n  \ndef format_reward(completions):  \n    # enforce the &lt;reasoning&gt;...&lt;\/answer&gt; structure  \n    return [0.5 if re.match(XML_PATTERN, c, flags=re.DOTALL) else 0.0  \n            for c in completions]\n<\/code><\/pre>\n<p><strong>Reward hierarchy<\/strong>:<\/p>\n<ol>\n<li><strong>Accuracy<\/strong> (50% weight): agreement with the ground truth<\/li>\n<li><strong>Formatting<\/strong> (30%): XML-style reasoning traces<\/li>\n<li><strong>Intermediate checks<\/strong> (20%): valid answer types<\/li>\n<\/ol>\n<blockquote><p>\n 
<em>Analogy<\/em>: it's like mentoring a medical resident; you praise both diagnostic accuracy and proper documentation.\n<\/p><\/blockquote>\n<h2>5. GRPO Training Configuration<\/h2>\n<p>These parameters are largely guesswork and have not been tuned, but they seemed to work well in my initial experiments. Feel free to adjust and experiment for your own use case.<\/p>\n<pre><code>from trl import GRPOConfig, GRPOTrainer  \ntraining_args = GRPOConfig(  \n    use_vllm = True, # use vLLM for fast inference!  \n    learning_rate = 5e-6,  \n    adam_beta1 = 0.9,  \n    adam_beta2 = 0.99,  \n    weight_decay = 0.1,  \n    warmup_ratio = 0.1,  \n    lr_scheduler_type = \"cosine\",  \n    optim = \"adamw_8bit\",  \n    logging_steps = 1,  \n    bf16 = is_bfloat16_supported(),  \n    fp16 = not is_bfloat16_supported(),  \n    per_device_train_batch_size = 1,  \n    gradient_accumulation_steps = 1, # increase to 4 for smoother training  \n    num_generations = 6, # decrease if you run out of memory  \n    max_prompt_length = 1024,  \n    max_completion_length = 1024,  \n    #num_train_epochs = 1, # set to 1 for a full training run  \n    max_steps = 750,  \n    save_steps = 100,  \n    max_grad_norm = 0.1,  \n    report_to = \"none\", # can use Weights &amp; Biases  \n    output_dir = \"outputs\",  \n)  \n  \ntrainer = GRPOTrainer(  \n    model = model,  \n    processing_class = tokenizer,  \n    reward_funcs = [  \n        xmlcount_reward_func,  \n        soft_format_reward_func,  \n        strict_format_reward_func,  \n        int_reward_func,  \n        
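# the reward functions above are defined in the accompanying notebook, ordered  \n        # from format checks to the correctness check, which carries the largest reward  \n        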
correctness_reward_func,  \n    ],  \n    args = training_args,  \n    train_dataset = train_dataset,  \n    eval_dataset = test_dataset,  \n)  \ntrainer.train()\n<\/code><\/pre>\n<p>That's it. Because our reward functions are high quality, you should quickly start to see the rewards climbing (the first milestone of RL fine-tuning).<\/p>\n<p>The model starts to develop reasoning ability.<\/p>\n<p>Let's look at some example answers the model generated during training:<\/p>\n<pre><code>Prompt:\n\nGiven the scientific context below:\n\nWe have previously shown the detrimental effects of 19 sub-erythemal exposures to daily ultraviolet radiation (DUVR, which mimics non-extreme exposure conditions), delivered over 4 weeks to volunteers. This source had UVA (320-400 nm) to UVB (290-320 nm) irradiance ratio of 25, instead of that close to 10 that is typically the case with solar-simulated radiation (SSR) that represents summer global sunlight with a clear sky and quasi-zenith solar irradiance.\nHere, we report on an extension of this previous study, in which we evaluated the photoprotection afforded by a broad-spectrum daily-care product with a low-sun protection factor (SPF 8, UVA-PF 7 and 3* rated UVA protection). We assessed cellular and molecular markers of photodamage that are relevant to skin cancer and photoageing.\nThis study shows that biological effects of repeated exposure to DUVR can be prevented by a broad-spectrum daily-care product and that the level of protection afforded varies with the studied endpoint.\n\nAnswer the following question:\nDoes a broad-spectrum sunscreen prevent cumulative damage from repeated exposure to sub-erythemal solar ultraviolet radiation representative of temperate latitudes? with 'yes', 'no' or 'maybe'. 
You need to carefully review the context and reason before answering. \n\nModel Response:\n\n&lt;reasoning&gt;\nThe context provided discusses a study where repeated exposures to sub-erythemal UV radiation (DUVR) can cause detrimental effects over a period of 4 weeks. It also mentions that a broad-spectrum daily-care product with an SPF 8 and UVA protection factor with a 7 and 3* rating was used. The study aims to evaluate whether this broad-spectrum sunscreen can prevent cumulative damage from such repeated exposures. The mention of repeated exposure and the presence of a broad-spectrum product strongly suggests that the sunscreen can prevent cumulative damage. However, the actual effectiveness of the sunscreen in preventing cumulative damage is not explicitly stated in the provided context. Given the experimental setup and the corrective action (broad-spectrum product) taken, it can be inferred that the answer is likely 'yes'.\n\nThe sunscreen contains SPF 8, which means it provides protection to around 92% of the UVB radiation, and it has a UVA protection factor (UVA-PF) of 7, which likely provides varying degrees of protection to different wavelengths, including those of UVA. 
Such broad-spectrum protection would effectively prevent the cumulative damage associated with prolonged exposure to sub-erythemal UV radiation.\n\n&lt;\/reasoning&gt;\n\n&lt;answer&gt;\nyes\n&lt;\/answer&gt;<\/code><\/pre>\n<p>As you can see, the model has started organizing its thinking into structured reasoning blocks. You can now imagine using a technique like this to significantly improve RAG with specialized models. We could even teach the model to write SQL, normalize medical terminology, and more; the potential is endless!<\/p>\n<hr>\n","protected":false},"excerpt":{"rendered":"<p>The intersection of large language models (LLMs) and healthcare brings exciting opportunities, but also unique challenges. In this tutorial, we explore how to use Group Relative Policy Optimization (GRPO), a promising new reinforcement learning technique recently introduced by the DeepSeek team, to adapt Alibaba's Qwen-3B model for medical reasoning. 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13],"tags":[],"class_list":["post-53727","post","type-post","status-publish","format-standard","hentry","category-ai"],"_links":{"self":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/posts\/53727","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/comments?post=53727"}],"version-history":[{"count":0,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/posts\/53727\/revisions"}],"wp:attachment":[{"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/media?parent=53727"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/categories?post=53727"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fwq.ai\/blog\/wp-json\/wp\/v2\/tags?post=53727"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}