detectai.media

Why AI text watermarking is a bad idea

Watermarking is the strongest AI-text detection there is, and still the wrong basis for an accusation: opt-in, cheaply scrubbed, and forgeable onto the innocent.

By The DetectAI team

Watermarking is the strongest form of AI-text detection, and it is still the wrong thing to build an accusation on. The injected watermark is the one text fingerprint that behaves like a real trace, but it is opt-in only, cheaply scrubbed, and cheaply forged, and a system that can be made to falsely accuse is worse than no system at all. The case for that conclusion has to start by granting the watermark everything it genuinely earns.

Grant the watermark its real strength

The honest argument steelmans the watermark before it limits it. The Kirchenbauer green-list scheme (Kirchenbauer et al., ICML 2023) biases generation toward a pseudorandom “green” subset of tokens, so the output is detectable “from short spans of tokens (as few as 25 tokens)” at “negligible impact on text quality.” It is more robust than that sounds: the reliability follow-up reports that the mark survives human and machine paraphrase, remaining detectable once about 800 tokens have been observed at a 1e-5 false-positive rate (Kirchenbauer et al., ICLR 2024). And it scales: SynthID-Text was A/B-tested across roughly 20 million Gemini responses with no statistically significant quality loss (Dathathri et al., Nature 2024). Where a cooperating model has marked its own output, the watermark is a genuine, low-error provenance signal. Everything below assumes it works exactly as advertised.
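To make that strength concrete, here is a minimal sketch of the kind of statistical test a green-list detector runs: count how many tokens land on the green list and ask how far that count sits above chance. The hashing rule, the key `demo-key`, the green fraction `GAMMA = 0.25`, and every function name are illustrative assumptions, not the authors' released code; a real scheme partitions the model's vocabulary rather than hashing word pairs.

```python
import hashlib
import math
from statistics import NormalDist

GAMMA = 0.25  # assumed green-list fraction; illustrative value only


def is_green(prev_token: str, token: str, key: str = "demo-key") -> bool:
    """Toy green-list rule: hash (key, previous token, token) into [0, 1)."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def z_score(green_count: int, total: int, gamma: float = GAMMA) -> float:
    """One-proportion z-statistic: how far the green count sits above chance."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))


def threshold(false_positive_rate: float) -> float:
    """z cutoff whose one-sided normal tail equals the target false-positive rate."""
    return NormalDist().inv_cdf(1 - false_positive_rate)


def detect(tokens: list[str], fpr: float = 1e-5) -> tuple[float, bool]:
    """Score every token against its predecessor and compare z to the cutoff."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to score")
    scored = len(tokens) - 1
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    z = z_score(greens, scored)
    return z, z > threshold(fpr)
```

Under these assumed settings, 25 scored tokens that are all green give a z-score of about 8.66, comfortably above the cutoff of about 4.26 for a 1e-5 false-positive rate. That is the sense in which a short, clean, cooperating span can be enough.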

It only marks text that opted in

The first limit is structural. A watermark exists only where the generator inserted it. Text from an open-weight model with watermarking disabled, text from any model that does not watermark, and text from a keyboard and a word processor carry no signal at all. The case detectai.media is built around, a real person accused of using AI on their own writing, is exactly the non-cooperative case, where there is nothing to detect. A tool that fires only when the writer’s model chose to cooperate is silent against the writer who used a different model, and silent again for the one who used a watermarking model but stripped the mark.

It is cheaply scrubbed

The second limit is removal. Paraphrasing does not erase the watermark outright; it dilutes it, so a detector needs more text to reach the same confidence, and the dilution is cheap to force. Sadasivan et al. (2023) show that recursive paraphrasing drops a soft-watermark detector’s true-positive rate at a 1% false-positive rate from 99.3% to 9.7% on 300-token passages. No model access and no special tooling are required, only a second pass over the text. That is the practical problem: the watermark may be strong in long, clean, cooperating text, but real text is often short, edited, excerpted, translated, or revised, and its absence then proves nothing about authorship.
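The interaction between dilution and length can be sketched with the same z-statistic a green-list detector uses. The green rates below are hypothetical inputs chosen for illustration, not measurements from any of the cited papers, and `expected_z` and `tokens_needed` are names invented for this sketch.

```python
import math
from statistics import NormalDist


def expected_z(green_rate: float, tokens: int, gamma: float = 0.25) -> float:
    """Expected z-score when each token is green with probability green_rate."""
    return (green_rate - gamma) * math.sqrt(tokens / (gamma * (1 - gamma)))


def tokens_needed(green_rate: float, fpr: float = 1e-5, gamma: float = 0.25) -> float:
    """Passage length at which the expected z-score reaches the cutoff for fpr."""
    if green_rate <= gamma:
        return math.inf  # no excess of green tokens: extra length never helps
    z_cut = NormalDist().inv_cdf(1 - fpr)
    return math.ceil(z_cut**2 * gamma * (1 - gamma) / (green_rate - gamma) ** 2)


# Hypothetical green rates: freshly generated text versus heavily paraphrased text.
for rate in (0.75, 0.35, 0.30, 0.25):
    print(f"green rate {rate:.2f}: about {tokens_needed(rate)} tokens needed")
```

If paraphrasing pushes the green rate down to 0.30 against a chance rate of 0.25, a 300-token passage has an expected z-score of 2.0, far short of the roughly 4.26 needed at a 1e-5 false-positive rate, and the required length climbs to well over a thousand tokens. Short or edited text simply runs out of evidence.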

It is cheaply forged, and forgery frames the innocent

The third limit is the most damaging, because it runs the other way. Jovanović et al. (ICML 2024) demonstrate watermark stealing: querying a watermarked model enough to approximate its rules, then mounting attacks. For a one-time query cost of under $50, they spoof and scrub schemes “previously considered safe” with average success over 80%. Scrubbing removes the mark; spoofing stamps it onto text that no watermarking model ever produced. That is the failure that matters here, because a spoofed watermark on innocent human writing can, in the authors’ words, “incriminate a particular user.” A detection system whose signal can be forged onto an innocent person is not merely a weak tool; it is a dangerous one, and a system that can falsely accuse is worse than none.
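A toy simulation shows why spoofing needs no watermarking model at all. It assumes, for simplicity, that the forger knows the green rule exactly; the attack described above only approximates it from queries. The key `vendor-secret`, the synthetic vocabulary, the four-options-per-slot setup, and all function names are hypothetical.

```python
import hashlib
import math
import random

GAMMA = 0.25     # assumed green-list fraction
Z_CUTOFF = 4.26  # roughly the one-sided normal cutoff for a 1e-5 false-positive rate


def is_green(prev: str, tok: str, key: str = "vendor-secret") -> bool:
    """Toy green rule: hash (key, previous token, token) into [0, 1)."""
    digest = hashlib.sha256(f"{key}|{prev}|{tok}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def z_score(tokens: list[str]) -> float:
    """Detector statistic over every token after the first."""
    n = len(tokens) - 1
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


def write(slots: list[list[str]], spoof: bool) -> list[str]:
    """Pick one word per slot. The honest writer takes the first option;
    the spoofer takes a green option whenever the slot offers one."""
    text = ["<s>"]
    for options in slots:
        choice = options[0]
        if spoof:
            choice = next((w for w in options if is_green(text[-1], w)), options[0])
        text.append(choice)
    return text


rng = random.Random(0)
vocab = [f"w{i}" for i in range(5000)]
slots = [rng.sample(vocab, 4) for _ in range(200)]  # 4 interchangeable wordings per slot

honest = write(slots, spoof=False)
forged = write(slots, spoof=True)
print(f"honest z = {z_score(honest):.2f}, forged z = {z_score(forged):.2f}")
```

No generator is involved in either text. The forged passage clears the detection cutoff purely because each word choice was steered toward the green list, which is exactly the evidence a detector would read as "produced by the watermarked model."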

The key is centralised, and it protects the vendor

Even setting forgery aside, a watermark is provenance for whoever holds the key, not a public test. Its provider-side cousin, retrieval, makes this explicit: matching a candidate against a log of generated text works, but as Krishna et al. (NeurIPS 2023) put it, it “must be maintained by a language model API provider.” Both mechanisms answer one question, “did this come from our system,” which a vendor can ask and a teacher, editor or court cannot. They establish provenance for the party that already controls the text, and that party is never the one bringing a false-accusation case against a stranger.
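The dependence on the provider is easy to see in a sketch of retrieval-style matching. This version uses word-level string similarity from the Python standard library as a stand-in for the semantic retrieval in the cited work; the log contents, the 0.8 cutoff, and the function names are assumptions for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical provider-side log of text the model has generated.
GENERATION_LOG = [
    "the quick brown fox jumps over the lazy dog near the river bank",
]


def best_match(candidate: str, log: list[str]) -> float:
    """Highest word-level similarity between the candidate and any logged output."""
    words = candidate.lower().split()
    return max(
        (SequenceMatcher(None, words, entry.lower().split()).ratio() for entry in log),
        default=0.0,
    )


def came_from_our_system(candidate: str, log: list[str], cutoff: float = 0.8) -> bool:
    """True when the candidate closely matches something in the log."""
    return best_match(candidate, log) >= cutoff


edited = "the quick brown fox leaps over the lazy dog near the river bank"
human = "my grandmother taught me to bake bread on sunday mornings"

print(came_from_our_system(edited, GENERATION_LOG))  # lightly edited model output
print(came_from_our_system(human, GENERATION_LOG))   # unrelated human writing
print(came_from_our_system(edited, []))              # a third party holds no log
```

The last call is the point: without the log, the function has nothing to compare against and can never return a match, so the check is only available to the party that stores the generations.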

It solves the wrong side of the problem

Stack the limits and the burden runs backwards. A watermark can show that a compliant generator may have produced a passage. It cannot show that an unmarked passage is human, it cannot prove that a person did not use a different model, and it does nothing for the falsely accused human whose writing was never marked. It does nothing about the determined evader, who scrubs the mark for the price of a paraphrase, or who never used a watermarking model at all. What it does do is mark the compliant user, the person who used the sanctioned tool honestly, while a forger can stamp the same mark onto an innocent third party. A mechanism that burdens the honest, frees the evader, and can frame the innocent has its incentives exactly backwards.

There is a narrower, legitimate role for watermarking inside a model’s own ecosystem, for disclosure, platform governance and provider accountability, and our sister site examines that durability question in “Can AI watermarks be removed?” But that is provenance a vendor maintains over its own outputs, not proof of authorship a third party can use against a stranger. As the basis for accusing a real person, which is the question behind “Are AI text detectors reliable?”, the watermark fails, and it fails in the one direction a detection system must never fail: by accusing the innocent.

Sources

  • Kirchenbauer, Geiping, Wen et al. (2023). A Watermark for Large Language Models. ICML 2023.
  • Kirchenbauer, Geiping, Wen et al. (2024). On the Reliability of Watermarks for Large Language Models. ICLR 2024.
  • Dathathri, See et al. (2024). Scalable watermarking for identifying large language model outputs. Nature 634:818-823.
  • Jovanović, Staab, Vechev (2024). Watermark Stealing in Large Language Models. ICML 2024.
  • Sadasivan, Kumar, Balasubramanian et al. (2023). Can AI-Generated Text be Reliably Detected?
  • Krishna, Song, Karpinska et al. (2023). Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. NeurIPS 2023.
Tags: text, watermarking, synthid
Last updated: 29 June 2026
Category: Watermarking