250 terabytes of sequenced DNA, and somewhere in it, the pathogens you have to find. The more you detect, and the faster you do it, the higher you score.

This was part one of DARPA's Bioattribution Challenge. Through a misunderstanding, I¹ ended up with seven days to complete a challenge that most entrants had been working on for a month with full teams.

The catch was the runtime: any solution had to finish in eight hours, and the most popular tool for taxonomic classification would take 43 days running single-threaded.²

Half out of desperation, half because I'd been reading about agents doing research, rewriting existing tools, and building new ones, I figured this was the same shape of problem. Stand up an experiment loop, let the model optimize a score, and have it write new tools if necessary. So that's what I built.

And I came fourth.

Bootstrapping from zero

To build an experiment loop you need two things: a pipeline to produce a score (here, macro F1 penalized by runtime) and data to evaluate it against. Unlike most challenges, this one provided neither the data nor the list of pathogens to look for, so I was starting everything from scratch.
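To make the scoring idea concrete, here is a minimal sketch of a macro F1 score with a runtime penalty. The challenge's actual formula is not given in this post, so the linear penalty in `penalized_score`, the 8-hour default, and all function names are assumptions for illustration only.

```python
from collections import defaultdict


def macro_f1(truth: dict, predicted: dict) -> float:
    """Macro F1 over taxa.

    truth and predicted map a sample name to the set of taxa present in it
    (truth) or reported for it (predicted). F1 is computed per taxon across
    all samples, then averaged with equal weight per taxon.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for sample in truth.keys() | predicted.keys():
        true_taxa = truth.get(sample, set())
        pred_taxa = predicted.get(sample, set())
        for taxon in true_taxa & pred_taxa:
            tp[taxon] += 1
        for taxon in pred_taxa - true_taxa:
            fp[taxon] += 1  # false positive: reported but not present
        for taxon in true_taxa - pred_taxa:
            fn[taxon] += 1  # false negative: present but missed
    taxa = set(tp) | set(fp) | set(fn)
    if not taxa:
        return 0.0
    # Every taxon in `taxa` has at least one count, so the denominator is > 0.
    per_taxon = [2 * tp[t] / (2 * tp[t] + fp[t] + fn[t]) for t in taxa]
    return sum(per_taxon) / len(per_taxon)


def penalized_score(f1: float, runtime_hours: float, budget_hours: float = 8.0) -> float:
    """ASSUMED penalty shape: linear decay to zero at the runtime budget."""
    if runtime_hours >= budget_hours:
        return 0.0
    return f1 * (1.0 - runtime_hours / budget_hours)
```

Because every taxon counts equally in the average, a single false positive for a taxon that never appears in the truth set drags the score down as much as a missed pathogen does.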

First I spun up a VM with enough memory and compute to run experiments without keeping my laptop open. Then I deployed agents in parallel.

One group of agents built a modular pipeline I could run sweeps on: swap tools, change parameters, point it at new datasets. Another drew up a list of pathogens that DARPA mentioned on the challenge website: human, crop, and livestock threats. The last built the test data and the reference indices (the genome database the classifier matches reads against) around that list. For the data, I had the agents research read simulators, realistic sample types (e.g., clinical and environmental), and which organisms show up in each, then generate synthetic reads spiked with those pathogens at varying abundances.³
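The spike-in step can be sketched in a few lines. This is a toy illustration and not the generator used for the challenge: it draws error-free, fixed-length reads from in-memory genome strings, whereas a real read simulator also models sequencing errors, quality scores, and paired ends. The function names and the even split of background abundance are assumptions.

```python
import random


def sample_read(genome: str, read_len: int, rng: random.Random) -> str:
    """Return one error-free read from a uniformly random position."""
    if len(genome) < read_len:
        raise ValueError("genome shorter than read length")
    start = rng.randrange(len(genome) - read_len + 1)
    return genome[start:start + read_len]


def spike_sample(background: dict, pathogens: dict, abundance: dict,
                 n_reads: int, read_len: int, seed: int) -> list:
    """Build a labelled synthetic sample.

    background / pathogens map organism name -> genome sequence.
    abundance maps pathogen name -> fraction of reads (assumed to sum to <= 1).
    The remaining fraction is split evenly across background organisms.
    Returns (read_id, source_organism, sequence) tuples; the source label is
    the ground truth a classifier is later scored against.
    """
    if not background:
        raise ValueError("need at least one background organism")
    spiked = sum(abundance.values())
    if not 0.0 <= spiked <= 1.0:
        raise ValueError("pathogen abundances must sum to a value in [0, 1]")
    rng = random.Random(seed)  # seeded so a sample can be regenerated exactly
    bg_names = sorted(background)
    names = list(abundance) + bg_names
    weights = [abundance[n] for n in abundance]
    weights += [(1.0 - spiked) / len(bg_names)] * len(bg_names)
    genomes = {**background, **pathogens}
    reads = []
    for i in range(n_reads):
        source = rng.choices(names, weights=weights)[0]
        reads.append((f"read{i}", source, sample_read(genomes[source], read_len, rng)))
    return reads
```

Sweeping `abundance` from high to very low fractions produces samples that test both detection sensitivity and the false-positive rate on the background organisms.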

Of all this, the synthetic data took the most hand-holding and review, and for good reason: false positives are one of the biggest problems classification tools hit in real workloads. If my data didn't force that problem, any score I optimized toward would be hollow.

The experiment loop

With the test data and a working pipeline in place, I could start the optimization loop: have the agent suggest a parameter or tool change, run the pipeline to benchmark it, write the results to an experiment log, and propose the next experiment.

A big part of making this work was a detailed CLAUDE.md, written so the model could start working right away. It covered what the repo was for, how it was laid out, how to run the important commands, and anything else about the challenge it might need.

When I started a new session, the first instruction was always the same: read the experiment log and pick up where the last entry left off. Each experiment log entry had the date, a table of outputs, the agent's hypothesis for what went right or wrong, and what to try next. That gave the agents a rough memory across sessions and kept me from re-running the same experiments.
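One way to structure such a log is shown below, as a hedged sketch. The post does not say what format the real log used, so the JSON Lines file, the `LogEntry` fields, and the `params` key inside each results row are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class LogEntry:
    day: str          # ISO date of the experiment
    results: list     # table of outputs: one dict per run, e.g. {"params": {...}, ...}
    hypothesis: str   # the agent's explanation of what went right or wrong
    next_steps: str   # what to try next


def append_entry(log_path: Path, entry: LogEntry) -> None:
    """Append one entry as a single JSON line (append-only, so history is kept)."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(entry)) + "\n")


def read_entries(log_path: Path) -> list:
    """Load every entry in order; an absent log is treated as empty."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as fh:
        return [LogEntry(**json.loads(line)) for line in fh if line.strip()]


def last_entry(log_path: Path) -> Optional[LogEntry]:
    """The entry a fresh session should pick up from."""
    entries = read_entries(log_path)
    return entries[-1] if entries else None


def already_tried(log_path: Path, params: dict) -> bool:
    """True if any logged run used exactly these parameters."""
    return any(row.get("params") == params
               for entry in read_entries(log_path)
               for row in entry.results)
```

A new session would call `last_entry` first to recover the previous hypothesis and next steps, and check `already_tried` before launching a run.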

In practice, running these loops looked like a tmux session with several Claude Code panes open in bypass-permissions mode. Since most of the major scaffolding was already done, I let the agent add whatever tooling or change whatever settings it needed without running each plan by me first.

These agents made me faster, not smarter. They'd run whatever sweep I pointed them at, but they couldn't tell me when I was tuning the wrong knob, and I paid the price.

Hitting the wall

Halfway through the week, I ran a back-of-the-envelope estimate of throughput and realized my pipeline was nowhere near fast enough to meet the 8-hour limit. The accuracy was fine, but the speed was a disaster.
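A throughput estimate of this kind is just a few divisions. The sketch below reuses the figures from the Kraken2 footnote (250 TB of input, about 700 B per read pair, 5.7M read pairs per minute) as a worked example. The function names are made up for illustration, a terabyte is taken as 10^12 bytes, and the `workers` parameter assumes perfectly linear scaling, which is optimistic.

```python
def estimate_runtime_hours(total_bytes: float, bytes_per_read_pair: float,
                           read_pairs_per_minute: float, workers: int = 1) -> float:
    """Extrapolate wall-clock hours from a measured per-worker throughput."""
    total_pairs = total_bytes / bytes_per_read_pair
    minutes = total_pairs / (read_pairs_per_minute * workers)
    return minutes / 60.0


def required_speedup(estimated_hours: float, budget_hours: float = 8.0) -> float:
    """How many times faster the pipeline must get to fit the budget."""
    return estimated_hours / budget_hours


# Worked example with the footnote's numbers.
hours = estimate_runtime_hours(250e12, 700, 5.7e6)
print(f"{hours:.0f} h = {hours / 24:.1f} days single-threaded")
print(f"needs a {required_speedup(hours):.0f}x speedup for an 8-hour run")
```

With those inputs the estimate comes out at roughly 43.5 days, so fitting the 8-hour budget takes a speedup of about 130x from some mix of parallelism and faster code.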

I had forgotten to tell the model in the CLAUDE.md that it needed to measure wall-clock time and consider it when choosing experiments to run. Even after I changed that, two problems remained: the model was terrible at extrapolating runtimes, and it kept stopping to ask me questions despite having permission to operate on its own.

I solved both by running bigger experiments and adding a stop hook. First, I had an agent stand up a SLURM cluster on AWS so it could submit batch jobs, wait for them, and analyze the results without me in the loop. This meant I could measure real runtimes at scale instead of trusting the model's guesses. Second, I added a stop hook that reminded the model of its core goal and told it to do whatever it thought would be most reasonable.
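For readers who have not written one, a stop hook can be a small script that the coding harness runs whenever the agent tries to end its turn. The sketch below is not the hook used in the challenge. It assumes the Claude Code hook contract as I understand it: the hook receives a JSON payload on stdin with a `stop_hook_active` flag, and printing `{"decision": "block", "reason": ...}` makes the agent continue with the reason as its instruction. Check those field names against the current hooks documentation before relying on them. The reminder text is invented.

```python
import json
import sys
from typing import Optional

# Invented wording; the real hook's reminder text is not given in the post.
REMINDER = (
    "Core goal: maximize the benchmark score within the 8-hour runtime limit. "
    "Do not stop to ask for confirmation. Pick the most reasonable next "
    "experiment, run it, and record it in the experiment log."
)


def decide(payload: dict) -> Optional[dict]:
    """Return a blocking decision, or None to let the agent stop.

    ASSUMPTION: `stop_hook_active` is true when the agent is already
    continuing because of a previous stop hook. Letting it stop in that case
    avoids an endless loop of forced continuations.
    """
    if payload.get("stop_hook_active"):
        return None
    return {"decision": "block", "reason": REMINDER}


def main() -> int:
    """Entry point for the hook script: read stdin, emit a decision if any."""
    raw = sys.stdin.read()
    payload = json.loads(raw) if raw.strip() else {}
    decision = decide(payload)
    if decision is not None:
        json.dump(decision, sys.stdout)
    return 0
```

A hook script built from this would finish by calling `sys.exit(main())`. Keeping the logic in `decide` makes it testable without piping anything through stdin.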

The oracle trick

Running all these experiments made it clear that my best bet for speed was to rewrite some of the existing tooling, similar to what's already happening in the bioinformatics community with Rust.

I had the agents write plans for the rewrites, then execute them. Each agent used the original tool as an oracle, diffing the results of both whenever changes were made. Once correctness matched, I optimized for speed. In some cases I even let accuracy fall a bit for a significant speed gain.
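The oracle comparison can be sketched as a per-read diff between the two tools' outputs. The two-column `read_id<TAB>taxon_id` format, the function names, and the `min_agreement` threshold are assumptions for illustration; real classifiers emit richer output that would need its own parser.

```python
def parse_assignments(text: str) -> dict:
    """Parse 'read_id<TAB>taxon_id' lines into a dict (hypothetical format)."""
    assignments = {}
    for line in text.splitlines():
        if line.strip():
            read_id, taxon_id = line.split("\t")[:2]
            assignments[read_id] = taxon_id
    return assignments


def diff_against_oracle(oracle: dict, candidate: dict) -> dict:
    """Compare the rewrite's per-read calls against the original tool's."""
    shared = oracle.keys() & candidate.keys()
    mismatched = {rid: (oracle[rid], candidate[rid])
                  for rid in shared if oracle[rid] != candidate[rid]}
    matching = len(shared) - len(mismatched)
    return {
        "mismatched": mismatched,                            # read -> (oracle, candidate)
        "missing": sorted(oracle.keys() - candidate.keys()),  # dropped by the rewrite
        "extra": sorted(candidate.keys() - oracle.keys()),    # not in oracle output
        # Fraction of oracle reads the rewrite reproduces exactly.
        "agreement": matching / len(oracle) if oracle else 1.0,
    }


def passes(report: dict, min_agreement: float = 1.0) -> bool:
    """Gate a change: exact match by default, or a relaxed threshold."""
    return report["agreement"] >= min_agreement
```

Running this after every change with `min_agreement=1.0` enforces exact equivalence during the correctness phase. Lowering the threshold later is one way to make a deliberate accuracy-for-speed trade explicit and measurable.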

This worked better than expected, because I could exploit the constraints of my setup: sizing memory to my reference database and tuning the code to the exact hardware it would run on.

My main takeaway from this was that any tool with well-defined input/output behavior can be rewritten this way.

What I learned

Working with the agents was more like managing a new grad than running a tool: I couldn't offload my thinking, and the back-and-forth was constant. Whenever I stepped away from reading the code, bugs would pop up, especially in the synthetic data generation and SLURM infrastructure.

Coding harnesses need stronger biosecurity safeguards. While working with pathogen genomes, I was repeatedly blocked by classifiers, but I found those safeguards easy to bypass.

Breadth of knowledge is becoming more important: if you know a technique exists, you can point an agent toward it. The model produced code I couldn’t have written myself in the seven-day window, including optimizations I’d heard about in performance engineering but never implemented.

For all that, I came fourth with a score that wasn't great, beaten by teams that included some of the best-known groups in the field. AI is getting better, but for now, the best humans still have the lead.

References

Ye, Simon H., Siddle, Katherine J., Park, Daniel J., and Sabeti, Pardis C. "Benchmarking Metagenomics Tools for Taxonomic Classification." 2019.

¹ Through my startup TwentyTwo.
² This is a single-threaded back-of-the-envelope estimate using Kraken2 throughput from Ye et al. (2019). Per paired-end Illumina read pair (2 × ~150 bp): header (~100 B) + sequence (300 B) + delimiter (4 B) + quality (300 B) ≈ 700 B, so 250 TB / 700 B ≈ 357 billion read pairs. At 5.7M read pairs/min single-threaded, that's ~62,700 minutes, or about 43 days. GPT-5.5 estimates that multi-threading would bring this down, but likely only into the range of tens of hours.
³ If I had more time, I would have taken real MGS data that had pathogens of interest in it, or computationally spiked them in, then BLASTed the dataset to get a ground truth.