Someone Used Claude to Rewrite a Library and Strip Its License. The Legal Question Applies to All of Us.
chardet has 130 million monthly downloads. A developer used Claude to rewrite it in five days and switched the license from LGPL to MIT. The original author came back after 15 years to object. I read the issue and realized I do this every day.

open-source · ai-code-generation · licensing · software-law
May 1, 2026
7 min read

chardet has 130 million monthly downloads on PyPI. In February, Dan Blanchard (the library's maintainer for 12 years) opened a blank repository, pointed Claude at the specification, and told it to build a new character encoding detector from scratch. Five days later he shipped chardet 7.0. Forty-eight times faster. More accurate. And MIT-licensed instead of LGPL.

Then Mark Pilgrim came back.

Pilgrim created chardet in 2006 and disappeared from the internet in 2011, deleting every trace of his online existence. Fifteen years of silence. On March 4, he opened a GitHub issue titled "No right to relicense this project." His argument: copyleft means derivative works carry the same license. The LGPL requires it. "Adding a fancy code generator into the mix does not somehow grant them any additional rights." The issue got 1,465 thumbs up.

I read it the morning it hit Hacker News and my first thought wasn't about copyright law. It was about the code I've been shipping with Claude every day. If Blanchard did something wrong, I do a version of it constantly.

What Blanchard Actually Did

The standard defense is "clean room." That's not quite what happened.

Blanchard started in an empty repo. He told Claude explicitly: "do not use LGPL or GPL source code, do not look at the existing chardet code, start from an empty repository." Day one, Claude's brainstorming subagent tried to explore the old repo anyway. Blanchard blocked it. If you've spent time with Claude Code's subagents, you know this pattern. They act on their own initiative regardless of what you told them.

Then Blanchard himself instructed Claude to fetch charsets.py, create_language_model.py, and metadata/languages.py from the old repo. These were files he'd authored, and he argued he could do what he wanted with his own code. Separately, Claude's subagents fetched universaldetector.py without being asked. 567 lines of original detection logic. Blanchard was direct about it: "That's exactly the kind of exposure I was trying to avoid, and I'm not going to minimize it."

The similarity evidence tells a different story than the process narrative. JPlag token-level analysis: 0.04% average similarity, 1.30% max. git blame -C -C -C: zero lines from Mark Pilgrim in v7. The architecture changed completely. Stateful parallel probers became a stateless 12-stage sequential pipeline. Hand-written byte-walking state machines became calls to Python's built-in str.decode(). From 42 source files and ordinal scoring buckets to 23 files and cosine similarity with IDF weighting.
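The stateless approach described above is easy to picture. Here is a minimal sketch of the try-decode idea (my own illustration, not chardet 7's actual code; the candidate list and its ordering are invented for the example):

```python
def sketch_detect(data: bytes, candidates=("ascii", "utf-8", "utf-16", "cp1252")):
    """Return the first candidate encoding that decodes cleanly, else None.

    Real detectors also score how plausible the decoded text is
    (chardet 7 reportedly uses cosine similarity with IDF weighting);
    this sketch only checks decodability.
    """
    for enc in candidates:
        try:
            data.decode(enc)  # the built-in codec machinery does the byte-walking
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

Order matters: ASCII is a strict subset of UTF-8, so trying it first lets pure-ASCII input report the narrower encoding.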

Three weeks after release, Blanchard switched the license from MIT to 0BSD (a public-domain-equivalent license), "sidestepping the question of whether AI-generated code is copyrightable in the first place."

The LGPL Experts Can't Agree

The Free Software Foundation's position is unambiguous. Zoe Kooyman: "There is nothing 'clean' about a Large Language Model which has ingested the code it is being asked to reimplement." Claude's training data includes chardet's source. The model didn't start from zero even if the repository did.

Bradley Kuhn at the Software Freedom Conservancy launched a formal investigation. His recommendation to developers: "We strongly recommend that folks do not rely on the notion that the new release of chardet has been legitimately relicensed." That investigation is still open.

Richard Fontana co-authored LGPLv3. His initial take on the issue: "I don't currently see any basis for concluding that chardet 7.0.0 is required to be released under the LGPL." Three weeks later he walked it back, describing the earlier comment as "an observation about the state of the discussion at the time," not "a definitive legal conclusion." Even the people who wrote the license can't agree on whether it was violated.

On the other side, Salvatore Sanfilippo (antirez) pointed out that reimplementations have always been legal. GNU reimplemented the UNIX userspace. Contributors had direct exposure to the code they were replacing. Wine reimplemented the Windows API. Compaq reverse-engineered the IBM BIOS in a clean room in the early 1980s. Google won fair use for copying 11,500 lines of Java API declarations. The legal test has never been whether the process was clean. It's whether copyrightable expression survived in the output.

Heather Meeker (a leading open source IP attorney) called Blanchard's transparency effort "one of the more conscientious ones I have seen." Her actual criticism landed somewhere I didn't expect: "The mistake is claiming your code is 'an independent work, not a derivative' while simultaneously shipping it as the next version of the thing you say it's independent from."

And there's a paradox underneath all of this. The same week chardet 7 shipped, SCOTUS declined to hear Thaler v. Perlmutter, leaving the lower-court ruling standing: works generated solely by AI cannot receive copyright protection. If the code Claude generated isn't copyrightable, Blanchard can't apply any license to it. Not MIT. Not 0BSD. The code is public domain by default. The entire licensing argument becomes moot because there's nothing to license.

Armin Ronacher (Flask, Sentry) has a name for this kind of output: "slopfork."

This Isn't About One Library

A site called Malus.sh launched in late April. "Clean room as a service." You give it a package name, it uses two separate LLMs (one reads documentation, the other reimplements from specs alone), and it hands you a permissive-licensed clone. Pay-per-kilobyte pricing. A Hacker News user received a working package for $0.51.

What Blanchard did manually over five days, Malus automates for less than a dollar. If AI-mediated reimplementation is valid, every copyleft library with a public API and a test suite is one prompt away from a permissive fork. We've already seen what happens when open source infrastructure collapses from licensing disputes. This would be that, at industrial scale.

Here's where it gets personal.

I use Claude Code more than any other tool in my stack. I run it overnight with unrestricted file access. Every time I say "rewrite this function," Claude doesn't just touch what I asked. It restructures, renames, extracts helpers, changes variable names. The boundary between "my code" and "AI-generated code" in my codebase stopped being traceable months ago. I've become the editor, not the writer.

I have never run a license scanner on the output. Not on Engram. Not on Ouija. Not on this publishing engine.

The verification debt extends beyond bugs. In audits, thirty-five percent of GitHub Copilot output shows licensing irregularities. GPT-4 reproduces copyrighted text 44% of the time when prompted directly. Claude does better at 16%. But 16% isn't zero. And Copilot's built-in duplicate detection (the feature that flags output matching training data) isn't even enabled by default for individual users. Most developers don't know it exists.

Anthropic's enterprise settlement explicitly excluded output-based claims from indemnification. I use Claude's API across three production projects. If the code it generates infringes someone's copyright, I have zero contractual coverage. The Doe v. GitHub lawsuit (the Copilot class action) had breach-of-license claims survive a motion to dismiss. The court noted that users of AI output "may be liable for breach of contract." Not the tool vendor. The users.

That's me. Probably you too.

Where I Land

On chardet specifically: the similarity evidence supports Blanchard. 0.04% JPlag is difficult to argue with. But calling it a "clean room" is misleading when Claude accessed old source multiple times and when Claude's training data includes the very code it was asked to replace. Both things are true simultaneously. I think Meeker's framing is the most honest one: the technical output looks independent, but shipping it as chardet v7.0 instead of a new package was the real mistake.

On the bigger question: I genuinely don't know. I'm not confident anyone does yet.

The SFC investigation will matter. The Doe v. GitHub appeal will matter. The first court ruling on AI-mediated reimplementation (which hasn't happened, because everyone is scared of setting precedent) will matter most of all.

What won't matter is pretending this question doesn't apply to you because you're not a library maintainer. If you write code with AI and ship it, you're in the same gray zone Blanchard is in. The difference is that he published a transparency post documenting every step. Most of us don't even check.

The chardet question isn't whether one library was relicensed correctly. It's whether software licensing survives contact with a code generator that's read everything ever written. I stopped treating that as theoretical the day I read Mark Pilgrim's issue and realized I do the same thing he's objecting to. Every single day.
