How I Built Eval Tools for Karpathy's Autoresearch



Source: DEV Community

TL;DR: Karpathy's autoresearch runs hundreds of GPT pretraining experiments overnight. It doesn't tell you which ones mattered. I built three CLIs that do: autojudge (noise floor + Pareto analysis), autosteer (what to try next), autoevolve (competing agents, cross-pollinate winners).

The problem

After running autoresearch for a week I had a TSV with thousands of rows and no idea what to trust. The built-in keep/discard logic is: did val_bpb go down? That's it. No noise floor estimation. No way to know if a 0.02% improvement is real signal or run-to-run jitter. After 700 experiments I had 6 "improvements" and zero confidence in any of them.

The eval layer isn't there. Karpathy left it as an exercise.

What I built

autojudge

Reads results.tsv and run.log, estimates the noise floor from recent experiments, checks if the improvement is on the Pareto front (val_bpb vs memory), and returns a verdict with a confidence score.

    pip install autojudge
    autojudge --results results.tsv --run run.log

O
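To make the autojudge description above concrete, here is a minimal Python sketch of the two checks it names: a noise floor estimated from recent val_bpb values, and a Pareto test on (val_bpb, memory). This is not autojudge's actual code. The function names, the MAD-based noise estimate, the k=2.0 threshold, and the assumption that the recent runs are comparable enough for their spread to reflect run-to-run jitter are all mine, chosen for illustration.

```python
from statistics import median


def noise_floor(recent_bpb):
    """Robust spread of recent val_bpb values (hypothetical estimator).

    Uses the median absolute deviation scaled by 1.4826, which approximates
    a standard deviation for roughly normal data.
    """
    med = median(recent_bpb)
    mad = median([abs(x - med) for x in recent_bpb])
    return 1.4826 * mad


def on_pareto_front(candidate, others):
    """True if no other run dominates the candidate.

    Runs are (val_bpb, memory) pairs, lower is better on both axes.
    A run dominates if it is no worse on both and strictly better on one.
    """
    c_bpb, c_mem = candidate
    for bpb, mem in others:
        if bpb <= c_bpb and mem <= c_mem and (bpb < c_bpb or mem < c_mem):
            return False
    return True


def judge(baseline_bpb, candidate, recent_bpb, others, k=2.0):
    """Combine both checks into a keep/discard verdict (illustrative only)."""
    floor = noise_floor(recent_bpb)
    gain = baseline_bpb - candidate[0]  # positive means val_bpb went down
    beats_noise = gain > k * floor
    pareto = on_pareto_front(candidate, others)
    return {
        "gain": gain,
        "noise_floor": floor,
        "beats_noise": beats_noise,
        "pareto": pareto,
        "verdict": "keep" if beats_noise and pareto else "discard",
    }


if __name__ == "__main__":
    recent = [1.000, 1.002, 0.998, 1.001, 0.999]  # made-up val_bpb values
    others = [(0.995, 38.0), (0.990, 45.0)]       # made-up (val_bpb, memory)
    print(judge(1.000, (0.9998, 40.0), recent, others))
    print(judge(1.000, (0.990, 40.0), recent, others))
```

With made-up numbers like these, a 0.02% drop in val_bpb falls inside the estimated jitter and is discarded, while a 1% drop clears the threshold. The real tool also reports a confidence score, which this sketch does not attempt to reproduce.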
