2026-16 - Interesting reads
Building a C compiler without writing a line of code
Someone at Anthropic built a team of agents that, for about $20k, wrote a C compiler capable of compiling Linux (among other OSS projects like SQLite) and of compiling and running Doom. My key takeaways:
- This pushes the future definition of SWE. Building a compiler is no small feat, and while we may struggle to get GHCP to write proper TypeScript, they managed to build one targeting x86, ARM and even RISC-V.
- It was made possible only thanks to some fundamentals, namely tooling and testing. Because GCC ships a large test suite the agents could leverage, they were able to steer themselves (see the sketch after this list). The comparison "GCC needed X people during Y years" is therefore misleading: all those years of work are "summarized" in the torture tests. They are the best spec possible, written and refined by humans.
- They are sharing the source code… of the compiler, but not the agent code or the tooling they used (they ran agents in parallel, each in its own Docker container).
- There are some recipes applicable outside this experiment: lock files for parallel agents (sketched below), persistent task lists, ideas files, trimming code/test output to limit context-window pollution, …
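
The test-suite-as-spec idea is worth pausing on. Here is a minimal sketch of such a self-steering loop; everything in it is illustrative, not Anthropic's actual tooling (the `run-torture-tests.sh` wrapper and `run_agent` call are hypothetical):

```python
import subprocess

def failing_tests() -> list[str]:
    """Run the torture suite against the in-progress compiler and
    collect the names of tests that still fail."""
    result = subprocess.run(
        ["./run-torture-tests.sh"],  # hypothetical wrapper around the GCC suite
        capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line.startswith("FAIL")]

def run_agent(prompt: str) -> None:
    """Placeholder for one agent invocation (e.g. an LLM call with
    tool access); the agent edits the compiler sources in place."""
    ...

# The loop itself: the test suite, not a human, decides when the work is done.
while failures := failing_tests():
    # Feed only a small batch of failures to keep the context window clean.
    batch = failures[:5]
    run_agent(
        "These torture tests fail:\n"
        + "\n".join(batch)
        + "\nFix the compiler so they pass, without breaking passing tests."
    )
```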
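On the lock-file recipe: the point is to let parallel agents claim tasks atomically, so two containers never touch the same piece of work. A minimal sketch, assuming a shared volume mounted into every container (paths and task names are mine):

```python
import os

LOCK_DIR = "/shared/locks"  # assumed shared volume, mounted in every container

def try_claim(task_id: str) -> bool:
    """Atomically claim a task: O_CREAT | O_EXCL fails if the lock
    file already exists, so exactly one agent wins the race."""
    path = os.path.join(LOCK_DIR, f"{task_id}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent already owns this task
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return True

def release(task_id: str) -> None:
    os.remove(os.path.join(LOCK_DIR, f"{task_id}.lock"))

# Usage: each agent scans the persistent task list and claims what it can.
for task in ["parse-bitfields", "codegen-riscv-div"]:
    if try_claim(task):
        print(f"working on {task}")
        # ... agent does its work, then:
        release(task)
```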
Why do LLMs hallucinate? Because of us!
TL;DR: A good consultant will try to answer any question; a great consultant is not afraid to say "I don't know". All LLM benchmarks, both in training and post-training, push LLMs to be "good consultants".
- There are "classes" of hallucinations, one of them being sparse facts (birthdays, …) that appear only once in the training data.
- They recommend adding an explicit confidence threshold to all major benchmarks, to reward uncertainty in LLM output (see below).
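
To make that recommendation concrete: one scheme, which as I understand it is the one the paper proposes, grades a correct answer +1, "I don't know" 0, and a wrong answer -t/(1-t), so guessing only pays off when the model is genuinely more than t confident. A sketch of such a scorer (the function name and default t are mine):

```python
def score(answer_correct: bool | None, t: float = 0.75) -> float:
    """Grade one benchmark item under an explicit confidence threshold t.

    answer_correct: True (right), False (wrong), or None ("I don't know").
    A wrong answer costs t/(1-t), so guessing with confidence p has
    expected score p - (1-p)*t/(1-t), which is positive only when p > t:
    below the threshold, abstaining (score 0) is the rational move.
    """
    if answer_correct is None:
        return 0.0           # abstention is rewarded relative to a wrong guess
    if answer_correct:
        return 1.0
    return -t / (1 - t)      # e.g. t=0.75 -> a wrong answer costs 3 points

# With t=0.75, a model that is only 60% sure should say "I don't know":
# guessing yields 0.6 - 0.4*3 = -0.6 expected points, abstaining yields 0.
```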