Researchers at ETH Zurich and Anthropic just proved that LLMs can identify anonymous internet users for $1-4 per person. They deanonymized two-thirds of a Hacker News user pool at 90% precision. The entire experiment cost less than $2,000.
Pseudonymity is dead. And it was never really alive.
Last week, a team of researchers published a paper called "Large-scale online deanonymization with LLMs." The findings are exactly as alarming as the title suggests.
They built an automated pipeline that reads your anonymous online posts, extracts identity signals (profession, location, interests, writing style, incidental disclosures), and cross-references them against public profiles. Then an LLM reasons over the top candidates and picks the most likely match.
They tested it on 338 Hacker News users. The system correctly identified 226 of them. 67% recall. 90% precision. At a cost of $1-4 per person.
This wasn't a nation-state attack. This wasn't a zero-day exploit. This was six researchers with API access to commercially available AI models, doing something that looks indistinguishable from normal LLM usage. Summarizing text. Generating embeddings. Comparing profiles. No jailbreak required.
And this is the part that matters: it's going to get cheaper.
The End of "Practical Obscurity"
For twenty years, online pseudonymity survived on one thing: economics.
It was always theoretically possible to identify anonymous users. Latanya Sweeney proved in 2002 that 87% of Americans could be uniquely identified using just three data points: a ZIP code, gender, and date of birth. The 2008 Netflix Prize attack linked "anonymous" movie ratings to real IMDb profiles. Deanonymization was never a question of whether. It was a question of how much it cost.
And for most of the internet's history, the cost was high enough to protect most people. Connecting anonymous posts to real identities required human analysts, manual cross-referencing, and structured data that rarely existed in the right format. Your Reddit throwaway was safe the same way a house with no lock is safe: nobody bothered to try the door.
LLMs just removed the cost barrier.
The ETH Zurich pipeline doesn't require structured data. It works on raw, unstructured text. Forum posts, comments, interview transcripts. It extracts identity signals that a human investigator would recognize (this person works in biology, is based in the UK, uses Python, mentioned attending a specific conference) and automates the matching process at scale.
Simon Lermen, one of the paper's authors, put it bluntly: "Ask yourself: could a team of smart investigators figure out who you are from your posts? If yes, LLM agents can likely do the same, and the cost of doing so is only going down."
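You can run that thought experiment on yourself in a few lines. This is a minimal sketch, not the researchers' pipeline: it assumes the OpenAI Python SDK and an API key, the model name and example posts are placeholders I made up, and all it does is ask a model what it can infer. That is the same "normal LLM usage" the whole paper rests on.

```python
# Minimal self-audit sketch: what can a model infer about you from your own posts?
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set;
# the model name, prompt wording, and example posts are placeholders, not the paper's.
from openai import OpenAI

client = OpenAI()

my_posts = [
    # Invented examples; paste your own public posts here.
    "Finally got the sequencing data cleaned up, back to Python scripting all week.",
    "Anyone else at the conference in Birmingham last month? The rain was relentless.",
]

prompt = (
    "From the posts below, list every identity signal you can infer: "
    "profession, location, interests, writing style, incidental details.\n\n"
    + "\n---\n".join(my_posts)
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any capable chat model works
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```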
The Numbers That Should Scare You
The headline stat (67% at 90% precision) is striking. But the scaling numbers are worse.
Against a pool of 1,000 candidates, the pipeline hit 68% true matches at 90% precision. Expand the pool to 89,000 candidates (closer to the size of a real platform community), and identification still held at 55%. Even when the probability of a matching identity being in the candidate pool dropped to 1 in 10,000, the system still managed 9% true matches at 90% precision.
Let those numbers sink in for a second.
Reddit has roughly 100 million daily active users. 9% of that, at high confidence, means approximately 9 million accounts that could be linked to real identities. Not through hacking. Not through data breaches. Through reading public posts.
And it gets worse. Researcher Daniel Paleka told reporters that deanonymization capability "scales predictably with model improvements." In one experiment matching Reddit users across movie discussion subreddits, switching from low to high reasoning effort roughly doubled the correct identification rate. Models are getting better. API costs are falling. Both curves are accelerating.
The $4 price tag isn't a floor. It's day one of a clearance sale.
The Pipeline Looks Like Normal Usage
Here's what makes this fundamentally different from previous privacy attacks: you can't block it.
The Deanonymization Pipeline:
Step 1: Read someone's public posts (normal LLM usage)
Step 2: Extract identity signals from text (normal LLM usage)
Step 3: Generate embeddings for matching (normal LLM usage)
Step 4: Search candidate profiles (normal LLM usage)
Step 5: Reason over top matches (normal LLM usage)
Step 6: Assign confidence score (normal LLM usage)
Combined effect: full deanonymization
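To make the "normal usage" point concrete, here is a rough sketch of what steps 3 through 5 amount to. It assumes the OpenAI Python SDK; the extracted profile and candidate bios are invented, and the real pipeline adds an LLM reasoning pass over the top matches. The core, though, is ordinary semantic search.

```python
# Sketch of steps 3-5: embed an extracted profile, rank candidate bios by similarity.
# Assumes the OpenAI Python SDK; profile text and candidate bios are invented placeholders.
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    """One ordinary embeddings call, nothing exotic."""
    out = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(out.data[0].embedding)

extracted_profile = "Biologist in the UK, writes Python, attended a genomics conference."
candidates = {
    "public_profile_A": "Software engineer in Toronto, mostly Rust and Go.",
    "public_profile_B": "Computational biologist at a UK university, builds Python tooling.",
}

target = embed(extracted_profile)
scored = []
for name, bio in candidates.items():
    vec = embed(bio)
    cosine = float(target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec)))
    scored.append((name, cosine))
scored.sort(key=lambda pair: pair[1], reverse=True)

# The top few candidates would then go to an LLM for the final reasoning pass (step 5).
print(scored)
```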
No single step triggers a safety guardrail. The researchers noted that when models did refuse, "this could be avoided with small prompt changes." Open-source models skip the question entirely.
This isn't an exploit. There's no vulnerability to patch. It's an emergent capability that falls out of making AI models better at the things we already want them to do: reading, summarizing, reasoning, and comparing text.
The more capable LLMs become, the better they get at identifying you. The capability improvement we celebrate as "AI progress" is, on the privacy side, the exact same capability that destroys pseudonymity.
Who Cares When It Costs $4?
When deanonymization cost thousands of dollars and required dedicated analysts, it was a targeted tool. Governments used it against specific suspects. Companies used it in specific investigations. The cost naturally limited the scope.
At $1-4 per person, and falling? The scope becomes "everyone."
Authoritarian governments can now unmask dissidents at scale. A motivated intelligence service could always identify a specific protestor with enough analyst hours. Running that same process against every pseudonymous account in a protest movement was prohibitively expensive. At $4 per identification, it's a rounding error in any national security budget.
Data brokers inherit a new product category. Your anonymous post about a health condition in a support community becomes a lead that can be packaged and sold. Your pseudonymous review of a prescription medication becomes a data point in someone's advertising profile. The data broker industry already moves hundreds of billions of dollars annually. This is a new inventory line.
Corporate retaliation becomes trivial. That anonymous Glassdoor review? That pseudonymous post about workplace conditions? At $4 to identify the author, the question isn't whether your employer can find out who wrote it. It's whether they care enough to spend the price of a coffee.
Stalkers and harassers get a new tool. Every post someone writes under a pseudonym becomes raw material for a targeted attack. Not generic phishing. The kind that references the specific conference you mentioned and the niche framework you complained about in a forum thread three years ago.
The researchers themselves acknowledged this: "The practical obscurity protecting pseudonymous users online no longer holds and threat models for online privacy need to be reconsidered."
Pseudonymity Was Never Security
This is the part most of the coverage is missing. The framing has been: "AI has a new dangerous capability." But that puts the emphasis on the wrong variable.
The correct framing is: pseudonymity was never a security mechanism. It was an economic barrier. And economic barriers don't survive technological cost reduction.
A pseudonym is just a label. It doesn't remove your identity from the system. It adds a layer of indirection between your posts and your real name. That indirection held up exactly as long as it was expensive to unwrap. Now it's cheap.
This is the same lesson the privacy community has been learning in every other domain:
The Pattern:
VPNs → "We don't log" (but they can)
Encrypted email → "We can't read it" (but metadata is visible)
Tor → "We can't trace you" (but traffic analysis exists)
Pseudonyms → "Nobody knows who I am" (but your posts identify you)
The common thread: partial measures protect you until the cost of breaking them drops below the value of identifying you.
The only measure that survives cost reduction is not having the data in the first place.
The Mullvad VPN raid in 2023 is still the clearest illustration. Swedish police showed up with a search warrant. They wanted user data. They left empty-handed. Not because Mullvad's security was impenetrable, but because there was no data to take. Random account numbers. No email. No name. No logs.
When the cost of breaking a privacy measure drops to zero, the only thing that protects you is the absence of data, not the difficulty of accessing it.
What This Means for Infrastructure
If pseudonymity is dead, what actually works?
The answer isn't more pseudonymity. It's not better usernames or more careful posting habits. The researchers showed that even users who were careful about what they posted could be identified through subtle patterns: interests, expertise areas, temporal posting patterns, writing style.
The answer is architectural. Services need to be designed so that user data doesn't exist in a form that can be correlated.
Pseudonymous Architecture (vulnerable to LLM deanonymization):
- User creates account with email
- User chooses username
- User posts under username
- Platform stores: email, IP logs, posting history, metadata, behavioral patterns
- All of this is correlatable
Data-Minimized Architecture (resistant to LLM deanonymization):
- User creates account with random credential
- No email, no name, no personal information
- Platform stores: credential, balance, active services
- Nothing to correlate because nothing was collected
The LLM deanonymization attack works by extracting identity signals from what you've said and matching them to who you are. It requires both halves: behavioral data and identity data. If a service never collects the identity half, there's nothing to match against.
This is the difference between privacy as a policy and privacy as an architecture. A policy says "we won't look at your data." An architecture says "the data doesn't exist." Policies can be broken by technology, by legal compulsion, by insider threats, by acquisitions, by breaches. Architecture survives all of these because there's nothing to break.
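Here is what that difference looks like in code, as a sketch with field names I made up rather than any particular service's schema. The point is structural: the second record contains nothing an identity-matching pipeline could latch onto.

```python
# Illustrative sketch of the two architectures above; field names are hypothetical.
import secrets
from dataclasses import dataclass, field

@dataclass
class PseudonymousAccount:
    # Everything here is correlatable with public posting history.
    email: str
    username: str
    ip_log: list[str] = field(default_factory=list)
    post_history: list[str] = field(default_factory=list)

@dataclass
class DataMinimizedAccount:
    # A random credential, a balance, active services -- nothing to match against.
    credential: str = field(default_factory=lambda: secrets.token_hex(16))
    balance_cents: int = 0
    active_services: list[str] = field(default_factory=list)

account = DataMinimizedAccount()
print(account.credential)  # the only identifier that ever exists for this user
```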
The New Threat Model
The old threat model for online privacy was relatively simple: don't post personal details, use a pseudonym, maybe use a VPN. That was enough to protect against casual identification.
The new threat model needs to account for the fact that AI can extract identity signals from the content of what you write, not just the explicit details you share. Mentioning that you use a specific programming language. Complaining about weather in a way that narrows your geography. Discussing a conference you attended. Having opinions about a niche topic that few people share.
Every post adds what researchers call "micro-data" that narrows your identity space. The more you post, the easier you are to identify. Users who shared 10+ movies in the Reddit experiment were matched at 48% recall, compared to 3% for users who shared just one.
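A back-of-the-envelope calculation shows why a few signals are enough. The attributes and prevalence figures below are invented for illustration (and treat disclosures as independent, which flatters the attacker), but the multiplicative effect is the point.

```python
# Back-of-the-envelope: how incidental disclosures shrink a candidate pool.
# The attribute prevalences below are invented for illustration only, and
# treating them as independent is optimistic, but the shape of the math holds.
pool = 100_000_000  # hypothetical platform user base

disclosures = {
    "works in biology": 0.01,
    "based in the UK": 0.05,
    "writes Python": 0.10,
    "mentioned a specific conference": 0.001,
}

for attribute, prevalence in disclosures.items():
    pool *= prevalence
    print(f"after '{attribute}': ~{pool:,.0f} plausible candidates")
```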
The practical implications:
- Pseudonyms on platforms that collect identity data (email, phone, IP) no longer provide meaningful protection against motivated adversaries
- The more content you produce under a single pseudonym, the more identifiable you become
- Compartmentalization (different identities for different platforms, different interests, different writing patterns) helps but requires extreme discipline
- Services that never collect identity data in the first place are the only ones where behavioral data can't be matched to a person
And here's the final twist: this doesn't just affect you going forward. It affects everything you've already posted. Every comment. Every forum thread. Every anonymous review. All of it is now raw material for $4 deanonymization. You can't un-post it. The data is already public.
What You Can Actually Do
Let's be practical. Most people aren't going to delete all their online accounts. Here's what actually helps, in order of impact:
1. Choose services that don't collect identity data. If a service requires your email, phone number, or real name, assume that everything you do on that service can eventually be linked to you. The service itself might not do it, but the data exists in a form that can be correlated by anyone with access, whether that's a hacker, a government, a data broker, or an LLM.
2. Compartmentalize aggressively. If you use pseudonyms, use different ones on different platforms. Don't discuss the same niche interests across accounts. Vary your writing style. This is exhausting to maintain, which is why option 1 is better.
3. Reduce your posting surface. Fewer posts means fewer identity signals. The researchers found dramatic differences in identification rates based on posting volume. If you must post pseudonymously, post less.
4. Assume your existing posts are compromised. Anything you've posted under a pseudonym on a platform that has your identity data (even just an email) is now matchable. This can't be undone. But it can inform your behavior going forward.
5. For sensitive activities, use infrastructure that was designed from the ground up for data minimization. Not infrastructure that promises to be careful with your data. Infrastructure that architecturally cannot collect it. There's a fundamental difference between "we protect your data" and "your data doesn't exist."
The Bottom Line
Pseudonymity was never security by design. It was security by economics. AI just crashed the price.
The ETH Zurich and Anthropic research isn't a surprise if you've been paying attention. It's the inevitable conclusion of a trend that's been building for two decades: as the cost of data processing approaches zero, every form of privacy that depends on "it's too expensive to identify me" fails.
What survives is the same thing that has always survived: not having the data in the first place. Mullvad understood this when they built a VPN service around random account numbers. The principle applies everywhere. If a service doesn't have your identity, no amount of AI capability can extract it. If a service does have your identity, it's a matter of when, not whether, that link gets made.
The researchers' pipeline costs $4 today. Next year it'll cost less. The year after that, less still. Eventually, deanonymization will be so cheap that it happens passively, embedded in advertising networks and data broker pipelines, running continuously in the background.
The question isn't whether your pseudonym will protect you. It's how much time you have left before it doesn't.
Build accordingly.