What Actually Works When Shipping AI Products
Notes from the field on why most AI teams measure the wrong things and fix the wrong problems.
The teams that ship good AI products aren’t the ones with the best models or the most sophisticated pipelines. They’re the ones who figured out what’s actually going wrong and built the feedback infrastructure to fix it fast. That sounds obvious. It isn’t how most teams operate.
Start by reading your own logs
The first thing that goes wrong is teams reach for metrics before they understand their failure modes. They pick up off-the-shelf evals — hallucination rate, toxicity, whatever — and those numbers go down while the actual product stays broken.
The alternative is tedious and it works: read your transcripts. Not a sample. Sit down with a domain expert and go through real conversations. Write notes. Don’t start with categories — let the categories emerge. One team did this for an apartment leasing assistant and found that three issues caused 60% of failures. One of them was date handling. Fixing date handling moved their success rate from 33% to 95%. That’s not something any generic eval surface would have surfaced.
Build the tool that lets you look at data
If you’re copying outputs into spreadsheets or jumping between five tabs to see what context the model had, you’re going to look at less data than you should. The friction is real and it compounds.
A custom data viewer doesn’t need to be fancy. It needs to show everything relevant in one place and make it trivially easy to leave a note or flag something. Teams with decent viewers iterate meaningfully faster than teams without them — the time investment pays back quickly. You can build something useful in a day with current AI coding tools. Start there before you build anything else.
Get domain experts writing prompts directly
The standard approach is: domain expert explains what they need, engineer translates it into a prompt, engineer ships it, domain expert says it’s not quite right, repeat. This is slow and it loses information at every handoff.
What works better is giving domain experts direct access to prompts inside something that looks like the actual product — with all the real context in place. The hurdle is usually language. “RAG” means nothing to a leasing agent or a compliance officer. “Making sure the model has the right information before it answers” means something. The concepts aren’t hard once the jargon is out of the way.
You don’t have to wait for real users
Early on you have no data, which makes evaluation hard, which makes improvement hard. Synthetic data breaks that loop. LLMs are genuinely good at generating varied, realistic user inputs — better than most engineers expect.
The useful framing is three dimensions: what features does the system need to handle, what situations will users actually be in, and who are the users. Generate inputs across all three. The important constraint is to generate user inputs, not model outputs — you want test cases, not answers. And verify that the synthetic inputs actually exercise what you think they exercise before you trust them.
On evaluation: binary beats scale
Rating outputs 1–5 sounds like it gives you more information. In practice it mostly gives you more arguments about whether something is a 3 or a 4. Pass/fail is harder to disagree about and easier to act on. The nuance lives in the written critique — why did it fail, what specifically — not in the number.
The other thing that erodes trust in evals is criteria drift. Your standards for what “good” looks like will change as you see more outputs. That’s fine and normal, but it means your LLM judge will drift too. Check alignment between your judges and human judgment regularly. Don’t just automate and assume it’s staying calibrated.
Roadmaps that actually fit how AI works
Feature roadmaps don’t fit AI development well. The feasibility questions aren’t answered upfront — you find out by running experiments. Committing to ship a specific capability by a specific date assumes you know things you don’t know yet.
A better framing is to commit to an experimentation cadence rather than a feature list. Two weeks to understand the data. A month to understand technical feasibility. Six weeks to a testable prototype. Regular decision points. That structure gives you somewhere to pivot without it feeling like a failure. Long flat periods where nothing seems to work are normal before something clicks — a traditional roadmap would have killed the project.
The underlying thing is: evaluation infrastructure is the investment that makes everything else move faster. Build the viewer, instrument the feedback loop, get alignment on what good looks like. The model improvements compound on top of that.