AI or Human: Watermarking LLM-Generated Text


Erman Ayday, Co-Faculty Director, xLab; Associate Professor, Computer and Data Science

The rapid expansion of artificial intelligence (AI) and natural language processing (NLP) in recent years has made large language models (LLMs) prominent in both academia and industry. Models such as OpenAI’s ChatGPT generate human-like text by interpreting user input and learning from extensive amounts of data. As more users are drawn to these systems, the amount of LLM-generated text on the internet increases, raising potential misinformation, copyright, and plagiarism issues. The ability to identify text produced by LLMs is increasingly vital, especially given the easy access to LLM services in settings such as university coursework, where academic integrity is paramount. These concerns create a need for tools that can differentiate LLM-generated from human-generated material.

Watermarking algorithms could address these concerns by embedding identifiable patterns in LLM-generated text. A watermark makes it possible to detect whether a specific text sequence was generated by an LLM or written by a human.

Current watermarking schemes adjust the distribution of generated output tokens (i.e., the units of text processed in NLP tasks, such as words, phrases, or characters) by partitioning tokens into “green” and “red” lists and increasing the likelihood of green tokens in the LLM output to create a detectable watermark. Detection works by measuring the frequency of red tokens in a text sequence: human-generated text is expected to contain significantly more red tokens than LLM-generated text.
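The green/red idea can be illustrated with a small Python sketch. It is not any particular published scheme: the toy vocabulary, the secret `KEY`, the green fraction `GAMMA`, the choice to seed the partition with the previous token, and the uniform "model" are all assumptions made for illustration. The detector counts green tokens and converts the count to a z-score, which is equivalent to checking for unusually few red tokens.

```python
import hashlib
import math
import random

VOCAB = [f"tok{i}" for i in range(1000)]  # toy vocabulary (hypothetical)
GAMMA = 0.5        # assumed fraction of tokens on the green list
KEY = "demo-key"   # hypothetical secret watermark key


def is_green(prev_token: str, token: str) -> bool:
    """Pseudo-randomly place `token` on the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{KEY}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def generate(n_tokens: int, green_bias: float, rng: random.Random) -> list[str]:
    """Toy stand-in for an LLM: uniform sampling, optionally biased toward green tokens.

    green_bias=0.0 imitates unwatermarked (human-like) text.
    """
    tokens = ["<s>"]
    for _ in range(n_tokens):
        candidate = rng.choice(VOCAB)
        if rng.random() < green_bias:
            # With probability green_bias, resample until the token is green.
            while not is_green(tokens[-1], candidate):
                candidate = rng.choice(VOCAB)
        tokens.append(candidate)
    return tokens


def z_score(tokens: list[str]) -> float:
    """Standardized excess of green tokens over the GAMMA baseline."""
    scored = len(tokens) - 1
    green = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))


if __name__ == "__main__":
    rng = random.Random(0)
    print("watermarked z:", round(z_score(generate(200, 0.9, rng)), 2))
    print("unwatermarked z:", round(z_score(generate(200, 0.0, rng)), 2))
```

In this sketch, a text would be flagged as LLM-generated when its z-score exceeds a chosen threshold (for example 4); the threshold sets the trade-off between false positives and missed detections.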

One major limitation of current watermarking algorithms is their lack of robustness against attacks, which range from basic text manipulations to more powerful attacks. In such attacks, the attacker’s goal is to tamper with the LLM output to remove or distort the watermark so that the detection mechanism can no longer identify LLM-generated output with high confidence. Practicality issues also limit real-world deployment: most algorithms contain multiple tunable parameters, which makes them hard to control in realistic scenarios. Finally, current watermark detection mechanisms are either ineffective or impractical.
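To give a feel for why simple edits can weaken detection, here is a rough back-of-the-envelope model in Python. It rests on assumptions that are not from any specific scheme: each token is independently replaced with probability `edit_prob`, a token's green/red status depends on the token before it (so a token pair keeps its watermark signal only if both tokens are untouched), and any disturbed pair is green with the baseline probability `gamma`. The function name and parameters are hypothetical.

```python
import math


def expected_z(n_tokens: int, green_rate: float, gamma: float, edit_prob: float) -> float:
    """Expected detection z-score after random token replacement (simplified model)."""
    if not 0.0 <= edit_prob <= 1.0:
        raise ValueError("edit_prob must be between 0 and 1")
    # A (previous token, token) pair survives only if neither token was replaced.
    intact = (1.0 - edit_prob) ** 2
    # Disturbed pairs fall back to the baseline green probability gamma.
    expected_green_rate = gamma + (green_rate - gamma) * intact
    return (expected_green_rate - gamma) * math.sqrt(n_tokens) / math.sqrt(gamma * (1.0 - gamma))


if __name__ == "__main__":
    for p in (0.0, 0.1, 0.3, 0.5):
        print(f"edit_prob={p:.1f} -> expected z = {expected_z(200, 0.95, 0.5, p):.2f}")
```

Under these assumptions the signal shrinks quadratically in the fraction of untouched tokens, so an attacker does not need to rewrite the whole text to push the score below a typical detection threshold.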

Recently, Google introduced SynthID-Text, a watermarking algorithm for its LLM (Gemini), which aims to address some of these limitations by embedding context-specific tokens into the text during the generation process using a sampling method called Tournament Sampling. SynthID-Text is specifically designed for scalability and for preserving text quality, making it the first watermarking scheme openly deployed in production for a large-scale model like Gemini. However, SynthID-Text still has limitations: under extensive edits and paraphrasing, it cannot fully prevent adversarial attempts to bypass detection.
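The following Python sketch shows one simplified reading of the tournament idea; it is not Google's implementation. Several candidate tokens are drawn from the model, then paired off over a few rounds, and in each round the candidate with the higher keyed pseudo-random score (a "g-value") advances. The uniform toy model, the `KEY`, the number of `LAYERS`, the use of the previous token as context, and the mean-score detector are all assumptions made for illustration.

```python
import hashlib
import random

KEY = "demo-key"   # hypothetical secret watermark key
LAYERS = 3         # assumed number of tournament rounds (2**LAYERS candidates)
VOCAB = [f"tok{i}" for i in range(1000)]  # toy vocabulary (hypothetical)


def g_value(token: str, context: str, layer: int) -> int:
    """Keyed pseudo-random bit for a token, given its context and tournament layer."""
    digest = hashlib.sha256(f"{KEY}|{context}|{layer}|{token}".encode()).digest()
    return digest[0] & 1


def tournament_sample(context: str, rng: random.Random) -> str:
    """Draw 2**LAYERS candidates and keep the winner of a knockout tournament."""
    # Stand-in for sampling from the model's next-token distribution.
    candidates = [rng.choice(VOCAB) for _ in range(2 ** LAYERS)]
    for layer in range(LAYERS):
        winners = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            ga, gb = g_value(a, context, layer), g_value(b, context, layer)
            if ga == gb:
                winners.append(rng.choice([a, b]))  # ties are broken at random
            else:
                winners.append(a if ga > gb else b)
        candidates = winners
    return candidates[0]


def generate(n_tokens: int, rng: random.Random, watermark: bool = True) -> list[str]:
    """Generate toy text, with or without the tournament watermark."""
    tokens = ["<s>"]
    for _ in range(n_tokens):
        context = tokens[-1]
        tokens.append(tournament_sample(context, rng) if watermark else rng.choice(VOCAB))
    return tokens


def mean_g(tokens: list[str]) -> float:
    """Detector statistic: average g-value over all tokens and layers."""
    values = [g_value(tok, prev, layer)
              for prev, tok in zip(tokens, tokens[1:])
              for layer in range(LAYERS)]
    return sum(values) / len(values)


if __name__ == "__main__":
    rng = random.Random(0)
    print("watermarked mean g:", round(mean_g(generate(200, rng, True)), 3))
    print("unwatermarked mean g:", round(mean_g(generate(200, rng, False)), 3))
```

In this toy setup, unwatermarked text should average close to 0.5, while tournament winners score noticeably higher. Paraphrasing changes the tokens and their contexts, which pulls the average back toward 0.5 and is consistent with the limitation noted above.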

In xLab, we aim to address these shortcomings of existing watermarking algorithms and propose innovative frameworks that utilize topics within text sequences. Since attacks against watermarking algorithms mainly aim to tamper with the watermark so that the detection algorithm fails to detect it, using a topic-based watermarking algorithm for LLMs makes it easier to model the benefit vs. loss of such attacks. Our proposed framework not only enhances robustness against various forms of attacks and ensures efficiency in embedding and detecting watermarks, but also supports adaptability across different LLM applications. An early draft of our work is available at https://arxiv.org/abs/2404.02138