Estimating worst-case frontier risks of open-weight LLMs

Abstract

In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is itself below the Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results contributed to our decision to release the model, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.

via OpenAI News