2026-16 - Interesting reads
Building a C compiler without writing a line of code
Someone at Anthropic built a team of agents that, for about $20k, wrote a C compiler capable of compiling Linux (among other OSS projects like SQLite) and of compiling and running Doom. My key takeaways:
- This pushes the future definition of SWE. Building a compiler is no small feat, and while we may struggle to get GHCP to write proper TypeScript, they managed to build one targeting x86, ARM and even RISC-V.
- It was made possible only thanks to some fundamentals, namely tooling and testing. Because GCC ships a large test suite the agents could leverage, they were able to steer themselves (see the sketch after this list). The comparison "GCC needed X people during Y years" is therefore misleading: all those years of work are "summarized" in the torture tests. They are the best spec possible, written and refined by humans.
- They are sharing the source code… of the compiler, but not the agent code or the tooling they used (they ran agents in parallel, each in its own Docker container).
- There are some recipes applicable outside this experiment: lock files for parallel agents (sketched below), persistent task lists, ideas files, trimming code/test output to limit context-window pollution, …
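
The test-suite-as-spec idea is worth pausing on. Here is a minimal sketch of such a self-steering loop; everything in it is illustrative, not Anthropic's actual tooling (the `run-torture-tests.sh` wrapper and `run_agent` call are hypothetical):

```python
import subprocess

def failing_tests() -> list[str]:
    """Run the torture suite against the in-progress compiler and
    collect the names of tests that still fail."""
    result = subprocess.run(
        ["./run-torture-tests.sh"],  # hypothetical wrapper around the GCC suite
        capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line.startswith("FAIL")]

def run_agent(prompt: str) -> None:
    """Placeholder for one agent invocation (e.g. an LLM call with
    tool access); the agent edits the compiler sources in place."""
    ...

# The loop itself: the test suite, not a human, decides when the work is done.
while failures := failing_tests():
    # Feed only a small batch of failures to keep the context window clean.
    batch = failures[:5]
    run_agent(
        "These torture tests fail:\n"
        + "\n".join(batch)
        + "\nFix the compiler so they pass, without breaking passing tests."
    )
```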
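On the lock-file recipe: the point is to let parallel agents claim tasks atomically, so two containers never touch the same piece of work. A minimal sketch, assuming a shared volume mounted into every container (paths and task names are mine):

```python
import os

LOCK_DIR = "/shared/locks"  # assumed shared volume, mounted in every container

def try_claim(task_id: str) -> bool:
    """Atomically claim a task: O_CREAT | O_EXCL fails if the lock
    file already exists, so exactly one agent wins the race."""
    path = os.path.join(LOCK_DIR, f"{task_id}.lock")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent already owns this task
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
    return True

def release(task_id: str) -> None:
    os.remove(os.path.join(LOCK_DIR, f"{task_id}.lock"))

# Usage: each agent scans the persistent task list and claims what it can.
for task in ["parse-bitfields", "codegen-riscv-div"]:
    if try_claim(task):
        print(f"working on {task}")
        # ... agent does its work, then:
        release(task)
```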
Why do LLMs hallucinate? Because of us!
TL;DR: A good consultant will try to answer any question; a great consultant is not afraid to say "I don't know". All LLM benchmarks, both in training and post-training, push LLMs to be "good consultants".
- There are "classes" of hallucinations, one of them being sparse facts (birthdays, …) that appear only once in the training data.
- They recommend adding an explicit confidence threshold to all major benchmarks, to reward uncertainty in LLM output (see below).
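
To make that recommendation concrete: one scheme, which as I understand it is the one the paper proposes, grades a correct answer +1, "I don't know" 0, and a wrong answer -t/(1-t), so guessing only pays off when the model is genuinely more than t confident. A sketch of such a scorer (the function name and default t are mine):

```python
def score(answer_correct: bool | None, t: float = 0.75) -> float:
    """Grade one benchmark item under an explicit confidence threshold t.

    answer_correct: True (right), False (wrong), or None ("I don't know").
    A wrong answer costs t/(1-t), so guessing with confidence p has
    expected score p - (1-p)*t/(1-t), which is positive only when p > t:
    below the threshold, abstaining (score 0) is the rational move.
    """
    if answer_correct is None:
        return 0.0           # abstention is rewarded relative to a wrong guess
    if answer_correct:
        return 1.0
    return -t / (1 - t)      # e.g. t=0.75 -> a wrong answer costs 3 points

# With t=0.75, a model that is only 60% sure should say "I don't know":
# guessing yields 0.6 - 0.4*3 = -0.6 expected points, abstaining yields 0.
```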