Stop Vibes-Checking Your AI: A Practical Guide to LLM Evaluation
Source: dev.to
My project: Hermes IDE | GitHub Me: gabrielanhaia

You changed one word in your system prompt and now 30% of your outputs are garbage. You wouldn't know that, though, because you tested it by running three examples and thinking "yeah, looks fine." I've done this. More than once. On a Friday afternoon before a deploy.

When I started building an AI-powered dev tool, I quickly realized that manually checking AI outputs doesn't scale past about day three. You need evals. Not the academic kind with 47-page papers behind them. The practical kind that tell you "this prompt change made things better" or "this prompt change broke everything."

Why You Can't Just Unit Test AI

Traditional testing works because functions are deterministic. Same input, same output. Write an assertion, move on. LLMs don't work that way. Ask GPT to summarize the same paragraph twice and you'll get two different summaries. Both might be correct. Or one might hallucinate a detail that wasn't in the source. Or both might
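To make that concrete, here's a minimal sketch of the kind of practical eval this implies. Everything here is hypothetical: `fake_llm_summarize` is a stub standing in for a real (nondeterministic) LLM call, and `eval_summary` is an assumed keyword-based fact check. The idea is simply to score many samples instead of eyeballing three.

```python
import random

def fake_llm_summarize(text: str, seed: int) -> str:
    # Stub standing in for a real LLM API call. Simulates sampling:
    # the same input can produce different summaries, one of them wrong.
    rng = random.Random(seed)
    variants = [
        "Revenue grew 12% in Q3 driven by cloud sales.",
        "Q3 revenue was up 12%, mostly from cloud.",
        "Revenue grew 15% in Q3.",  # hallucinated figure
    ]
    return rng.choice(variants)

def eval_summary(summary: str) -> bool:
    # A minimal fact check: the true figure must appear,
    # and the hallucinated one must not.
    return "12%" in summary and "15%" not in summary

def pass_rate(n: int = 50) -> float:
    # Run the same prompt n times and score each output.
    results = [eval_summary(fake_llm_summarize("Q3 report...", seed=i))
               for i in range(n)]
    return sum(results) / n

if __name__ == "__main__":
    print(f"pass rate: {pass_rate():.0%}")
```

A single run telling you nothing is exactly the problem; a pass rate over 50 samples is a number you can compare before and after a prompt change.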