Computer Science > Computation and Language

arXiv:2407.09447 (cs)

[Submitted on 12 Jul 2024 (v1), last revised 18 Oct 2024 (this version, v2)]

Title: ASTPrompter: Weakly Supervised Automated Language Model Red-Teaming to Identify Likely Toxic Prompts

Authors: Amelia F. Hardy, Houjun Liu, Bernard Lange, Mykel J. Kochenderfer

Abstract: Typical schemes for the automated red-teaming of large language models (LLMs) focus on discovering prompts that trigger a frozen language model (the defender) to generate toxic text. This often results in the prompting model (the adversary) producing text that is unintelligible and unlikely to arise. Here, we propose a reinforcement learning formulation of the LLM red-teaming task that allows us to discover prompts that both (1) trigger toxic outputs from a frozen defender and (2) have low perplexity as scored by that defender. We argue these cases are the most pertinent in a red-teaming setting because they are likely to arise during normal use of the defender model. We solve this formulation through a novel online and weakly supervised variant of Identity Preference Optimization (IPO) on GPT-2, GPT-2 XL, and TinyLlama defenders. We demonstrate that our policy is capable of generating likely (low-perplexity) prompts that also trigger toxicity from all of these architectures. Furthermore, we show that this policy outperforms baselines by producing attacks that occur with higher probability and are more effective. Finally, we discuss our findings and the observed trade-offs between likelihood and toxicity. Source code for this project is available at: this https URL.
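The two quantities the abstract relies on, defender-scored perplexity and an IPO-style preference loss, can be illustrated with a small sketch. The Python below is illustrative only and is not taken from the paper's source code. `perplexity` uses the standard definition (the exponential of the negative mean per-token log-probability). `ipo_loss` follows the general IPO objective, not the online, weakly supervised variant the authors propose. The function names, the `beta` value, and every number in the usage lines are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity of a sequence from its per-token natural-log probabilities.

    Lower values mean the scoring model (here, hypothetically, the defender)
    considers the text more likely.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ipo_loss(
    policy_logp_preferred: float,
    policy_logp_rejected: float,
    ref_logp_preferred: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed regularization strength, not from the paper
) -> float:
    """Generic IPO loss for one preference pair.

    h is the gap between the policy/reference log-ratios of the preferred
    and rejected sequences; IPO regresses h toward 1 / (2 * beta).
    """
    h = (policy_logp_preferred - ref_logp_preferred) - (
        policy_logp_rejected - ref_logp_rejected
    )
    return (h - 1.0 / (2.0 * beta)) ** 2


# Hypothetical usage: made-up log-probabilities for two candidate prompts.
likely_prompt = [-1.2, -0.8, -1.0]
unlikely_prompt = [-6.5, -7.1, -5.9]
print(perplexity(likely_prompt), perplexity(unlikely_prompt))
print(ipo_loss(-3.0, -9.0, -4.0, -8.0, beta=0.1))
```

In a setup like the one the abstract describes, a prompt would presumably be preferred when it both elicits more toxic defender output and scores lower perplexity; how the paper ranks pairs is not specified in this abstract, so that step is left out here.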

Comments:	8 pages, 8 pages of appendix, 2 tables, 3 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2407.09447 [cs.CL]
	(or arXiv:2407.09447v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.09447

Submission history

From: Houjun Liu
[v1] Fri, 12 Jul 2024 17:33:34 UTC (9,180 KB)
[v2] Fri, 18 Oct 2024 21:14:46 UTC (10,533 KB)
