Technical Report: Performance and baseline evaluations of gpt-oss-safeguard-120b and gpt-oss-safeguard-20b
Introduction
gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are two open-weight reasoning models, post-trained from the gpt-oss models to reason from a provided policy and label content under that policy. They are available under the Apache 2.0 license and our gpt-oss usage policy. Developed with feedback from the open-source community, these text-only models are compatible with our Responses API. The models are customizable, provide full chain-of-thought (CoT), can be used with different reasoning efforts (low, medium, high), and support Structured Outputs.
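As a rough illustration of those features, the sketch below queries a gpt-oss-safeguard model through an OpenAI-compatible Responses API server and sets the reasoning effort. The endpoint URL, model name, and prompt layout are assumptions made for this example (e.g., a local vLLM deployment), not details taken from this report.

```python
# Hypothetical sketch: querying gpt-oss-safeguard through an
# OpenAI-compatible Responses API endpoint (e.g., a local vLLM server).
# The base_url, api_key, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.responses.create(
    model="gpt-oss-safeguard-20b",
    reasoning={"effort": "medium"},  # supported efforts: low, medium, high
    instructions="You are a content classifier. Apply the provided policy.",
    input="POLICY:\n<policy text>\n\nCONTENT:\n<content to label>",
)

# output_text holds the final answer; on servers that expose it, the
# full chain-of-thought appears as reasoning items in response.output.
print(response.output_text)
```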
In this report, we describe gpt-oss-safeguard’s capabilities and provide safety evaluations of the gpt-oss-safeguard models, using the underlying gpt-oss models as a baseline. For more information about the development and architecture of the underlying gpt-oss models, see the original gpt-oss model card [https://openai.com/index/gpt-oss-model-card/].
We recommend using these models to classify content against a provided policy, not as the core functionality with which end users interact; the original gpt-oss models are better suited to those applications. The safety metrics below describe how the gpt-oss-safeguard models behave in chat settings. Although the models are not intended for chat use, they are open models, so someone could deploy them that way; we therefore verified that they meet our safety standards in such usage, and this report shares the results of those tests. We also share an initial evaluation of multi-language performance in a chat setting; note that this does not directly assess performance during content classification with a provided policy.
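To make the recommended classification workflow concrete, here is a minimal sketch of labeling a single piece of content against a caller-supplied policy, using Structured Outputs to constrain the label. The policy text, JSON schema, and sample content are invented for illustration and are not the policies or schemas used in this report's evaluations.

```python
# Hypothetical sketch: policy-based labeling with Structured Outputs.
# The policy, schema, and sample content below are illustrative only.
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

policy = (
    "Label content as VIOLATING if it offers to buy or sell stolen "
    "account credentials; otherwise label it NON_VIOLATING."
)

label_schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": ["VIOLATING", "NON_VIOLATING"]},
        "rationale": {"type": "string"},
    },
    "required": ["label", "rationale"],
    "additionalProperties": False,
}

response = client.responses.create(
    model="gpt-oss-safeguard-120b",
    reasoning={"effort": "high"},
    input=f"POLICY:\n{policy}\n\nCONTENT:\nFresh bank logins for sale, DM me.",
    text={
        "format": {
            "type": "json_schema",
            "name": "policy_label",
            "schema": label_schema,
            "strict": True,
        }
    },
)

verdict = json.loads(response.output_text)
print(verdict["label"])  # expected: "VIOLATING"
```

Constraining the output to an enum keeps downstream routing deterministic, while the rationale field preserves a short human-auditable justification without requiring the full chain-of-thought.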
The gpt-oss-safeguard models are fine-tunes of their gpt-oss counterparts and were trained without any additional biological or cybersecurity data. As a result, we determined that our previous work estimating worst-case frontier risks for the gpt-oss release [https://openai.com/index/estimating-worst-case-frontier-risks-of-open-weight-llms/] also applies to these new models.