The goal is not to do evals perfectly, it's to actionably improve your product.
Iteration beats perfection
Craft → Execution Sense
Persistence is extremely valuable. Successful companies right now building in any new area, they are going through the pain of learning this, implementing this and understanding what works and what doesn't work. Pain is the new moat.
I would bias less toward, trying in one go to tell the model, 'Hey, here's exactly what I want you to do.' Instead what I would do is I would chop things up into bits.
You don't want to skip this step. The reason I'm kind of spending so much time on this is this is where people get lost. They go straight into evals like, 'Let me just write some tests,' and that is where things go off the rails.
Before you release your LLM as a judge, you want to make sure it's aligned to the human. A lot of people stop there and they say, 'Okay, I have my judge prompt. We're done.' Don't do that, because that's the fastest way that you can have evals that don't match what's going on, and when people lose trust in your evals, they lose trust in you.