
Using Claude Code as an AI Research Lab 2026-02-20

The scratchpads on this blog are small experiments: take a hypothesis about LLM behavior, test it with a bunch of API calls, and see what the data says. The interesting part is coming up with the hypothesis and making sense of the results, but most of the actual time goes into setting up the Python environment, writing the API call loop, handling rate limits, and saving intermediate results. I'd been putting off a few experiments because of this overhead, and at some point I started wondering whether Claude Code could handle all of it for me.

Claude Code can spawn subagents that run independently with their own tools and workspace, so I set up a two-tier system: a Director that plans experiments and coordinates everything, and Contributors that actually execute them.

How the Director/Contributor split works

The Director is the main Claude Code session where I work. I feed it rough experiment ideas and we discuss them together until the hypothesis, methodology, and parameters are nailed down. Once we've settled on a game plan, the Director creates a workspace folder for the experiment and writes a brief in XML with everything a Contributor needs to run it autonomously.
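The brief format isn't anything standardized; every tag name below is invented for illustration, with values borrowed from the word count experiment described later. A brief might look roughly like:

```xml
<experiment id="word-count-compliance">
  <hypothesis>
    Prompt phrasing ("exactly" vs "approximately") changes how closely
    the model hits a word-count target.
  </hypothesis>
  <method>
    <model>gpt-oss-120b via Cerebras (free tier)</model>
    <variables>
      <targets>100, 500, 1000, 2500</targets>
      <topics>factual, creative, argumentative</topics>
      <phrasings>exactly, approximately</phrasings>
    </variables>
  </method>
  <constraints>
    <workspace>own folder with its own uv environment</workspace>
    <output>shared data/ directory, registered in INDEX.md</output>
    <rule>Save progress after every API call.</rule>
  </constraints>
</experiment>
```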

Contributors are subagents that the Director spawns. A Contributor reads its experiment brief, sets up an isolated Python environment with uv in its workspace folder, then writes and runs the scripts, collects the data, and produces a raw report when it's done. Communication is one-way: the Director gives instructions and the Contributor reports back on completion.
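The first thing a Contributor does with that one-way handoff is turn the brief into parameters it can act on. A minimal sketch, assuming a brief with the invented tag names from above (the real format is whatever the Director writes):

```python
import xml.etree.ElementTree as ET

# A toy brief; tag names are illustrative, not a fixed schema.
BRIEF = """\
<experiment id="demo">
  <hypothesis>Phrasing changes word-count accuracy.</hypothesis>
  <targets>100, 500</targets>
</experiment>
"""

def read_brief(xml_text: str) -> dict:
    """Parse an experiment brief into the fields a Contributor needs."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "hypothesis": root.findtext("hypothesis"),
        "targets": [int(t) for t in root.findtext("targets").split(",")],
    }

brief = read_brief(BRIEF)
print(brief["id"])       # demo
print(brief["targets"])  # [100, 500]
```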

The isolation between experiments matters, because the first Contributor I ran put its pyproject.toml at the repo root and broke the rest of the project. Now each experiment gets its own uv-managed environment in its own folder, and all output data goes into a shared data/ directory with an INDEX.md catalog so experiments can reference each other's results. Scripts also save progress after every API call, which turned out to be a lifesaver when we hit token quotas partway through a run.

Why raw reports instead of blog posts

One thing I was deliberate about was keeping the Contributor's output as raw research data in XML: what was tested, what the parameters were, what the numbers say. A separate process with the blog's writing style guide turns those into articles later.

This matters because the whole point of the blog is to avoid AI-generated writing style, and having the same agent that runs the experiment also write the blog post about it is a recipe for exactly that.

The first experiment: word count compliance

I'd been curious for a while about how accurately LLMs follow word count instructions, and whether the phrasing of the request changes anything. The Contributor set up 120 API calls to Cerebras (gpt-oss-120b on their free tier), varying the target word count (100, 500, 1000, 2500), the topic type (factual, creative, argumentative), and whether the prompt said "write exactly X words" or "write approximately X words."
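That design works out to 4 × 3 × 2 = 24 conditions, so 120 calls implies 5 repetitions per condition; the repetition count and exact prompt templates below are illustrative rather than taken from the actual scripts:

```python
from itertools import product

TARGETS   = [100, 500, 1000, 2500]
TOPICS    = ["factual", "creative", "argumentative"]
PHRASINGS = ["write exactly {n} words", "write approximately {n} words"]
REPS      = 5  # inferred from 120 total calls / 24 conditions

conditions = list(product(TARGETS, TOPICS, PHRASINGS))
runs = [(n, topic, template.format(n=n), rep)
        for (n, topic, template) in conditions
        for rep in range(REPS)]

print(len(conditions))  # 24
print(len(runs))        # 120
```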

The phrasing difference turned out to be massive. "Exactly" produced a mean deviation of about 2%, which means the model was genuinely counting words and landing close to the target. "Approximately", though, blew up to a 45% mean deviation and consistently overshot: at the 2500-word target, responses regularly hit 4000-5000 words, as if the model treats "approximately" as a soft lower bound rather than something to aim for.
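For concreteness, the deviation metric here is just mean relative error against the target; the sample values below are made up to land near the reported 2% and 45% figures:

```python
def mean_deviation(pairs: list[tuple[int, int]]) -> float:
    """Mean relative deviation, where `pairs` is (target, actual)."""
    return sum(abs(actual - target) / target for target, actual in pairs) / len(pairs)

# Illustrative numbers: a 2% deviation means landing within ~50 words
# of a 2500-word target; 45% with overshoot means 2500 often became 3600+.
exact_runs  = [(2500, 2540), (2500, 2460)]
approx_runs = [(2500, 4000), (2500, 3250)]
print(round(mean_deviation(exact_runs), 3))   # 0.016
print(round(mean_deviation(approx_runs), 2))  # 0.45
```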

That overshoot had a consequence I didn't anticipate. The "approximately" condition ate through Cerebras' free tier quota (1M tokens/day) much faster than expected, and we ran out at 113 of 120 calls. The resume support in the scripts saved it since the Contributor could just pick up the remaining 7 calls the next day without re-running anything.

The full results are in the scratchpad.

What broke along the way

The first run wasn't smooth:

  • Subagents couldn't execute bash commands because the permission settings weren't configured for spawned agents. One-time fix, but not obvious until it failed.
  • The first Contributor put its pyproject.toml at the repo root instead of in the workspace folder, which is why the isolation rule exists now.
  • Hatchling's build-system config in pyproject.toml caused import failures for what were really just standalone scripts. Fix: omit [build-system] entirely and let uv handle it.
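
The last fix is concrete enough to show. A minimal pyproject.toml for one experiment folder might look like this; the dependency list is illustrative, and the point is the absent [build-system] table, which stops uv from treating the folder as an installable package:

```toml
# experiments/word-count/pyproject.toml
# No [build-system] table: uv then manages dependencies without
# trying to build the folder itself, which is all standalone
# experiment scripts need.
[project]
name = "word-count-experiment"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = ["openai"]
```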

None of these were hard to fix, but they're the kind of thing you only figure out by actually running the system and watching where it falls over.

What's next

The infrastructure works now, and the next experiments should be smoother since the environment issues are solved. I have a few queued in the backlog, and because everything lives in the same data/ directory with a shared index, they can build on each other's results instead of starting from scratch. An experiment about temperature and output diversity, for instance, can reference the length compliance data directly.

I wouldn't have guessed you could go from "I want to automate my experiments" to actually running one and getting usable data in a single afternoon, but that's roughly what happened. The Director/Contributor pattern isn't complicated once you see it.