Language Model Inversion

ICLR 2025 Conference Submission217 Authors

24 Sept 2024 (modified: 24 Sept 2024)ICLR 2025 Conference SubmissionEveryoneRevisionsCC BY 4.0
Keywords: inversion, language models, prompt engineering, security
Abstract: Given a prompt, language models produce a distribution over all possible next tokens; when the prompt is unknown, can we use this distributional information to recover the prompt? We consider the problem of anguage model inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search and reconstruction of the input. On LLAMA-7B, our inversion method reconstructs prompts with a BLEU of $59$ and token-level F1 of $77$ and recovers $23\%$ of prompts exactly
Submission Number: 217
Loading