April 27, 2026

Building Australia’s most boring AI

Jack Latrobe
Solutions Architect, Data#3

Back in the age of typewriters, there was a simple standard everyone understood: the machine either worked, or it didn’t. If a keystroke jammed, skipped or failed to land on the page, it was obvious. The value of the tool was not in how impressive it looked, but in whether it reliably did the job it was bought to do, day after day.

That’s the standard I personally measure good AI against: it should be useful, stick around for a while, and have a real impact on the way people do their work.

In my role I see a lot of AI: copilots summarising meeting transcripts, container-hosted models doing document review, and multi-agent systems handling complex workloads that I am not allowed to talk about. Claude, GPT, Gemini. Not to mention OpenClaw, Perplexity and a whole bunch of tools that I’m sure I’ll learn the names of next week.

The problem I see most often has nothing to do with the model, the architecture, the product or the prompt. The problem is that most of these “AIs” aren’t boring enough.

Why most AI fails after the demo

When people buy and use AI, it tends to happen in the same way: someone sees a slick demo, they pilot it, do some tests, then it goes live.

After about a month you get the first bill, it’s usually higher than expected, and then someone asks the only question that really matters: “did we get what we paid for?”

Running AI is the easy part. The system hums along, tokens burn, work gets done. What matters at that point is whether it’s doing the job people thought they were paying for, safely, consistently, and in a way you can explain to the person who signs off the bill at the end of each month. What did your Claude Code actually deliver for those dollars?

What we do for people but skip for AI

When we hire a human worker, most organisations take a structured approach. We write a position description and define success. We assign a manager, run a 90-day probation followed by a 12-month review, then track whether they deliver. If they don’t, we manage that performance, and if it cannot be improved, the worker is removed. It’s not personal. It’s how businesses reduce risk: they set expectations, check progress, and take action if an investment isn’t generating sufficient returns.

For AI, we often skip these “boring” steps. Fast experimentation has its place. Move quickly, learn quickly. However, the moment you start scaling up for real, you need to answer one simple question: “is our AI doing what it says on the tin?”

I try to make that question answerable in under five minutes, without opening a slide deck. What are the top three tasks the AI is meant to help with? For each one, what does “good” look like (time saved, fewer escalations, higher first-pass quality), and what does “bad” look like (rework, policy breaches, incorrect decisions, data leaving where it shouldn’t)?

If you can’t describe both, you probably can’t measure either of them well.

What “boring AI” looks like in practice

My work with customers usually starts the same way. We stop talking about the model for a moment and talk about the job. What decision is the AI making? What workflow is it touching? What does “good” mean in business terms? Where is the risk in it for you?

Then I ask for a baseline. Not a perfect benchmark, just the current way the work gets done. If a human team does the task today, what’s the typical turnaround time? What’s the rework rate? Where do errors show up, and how are they detected or audited?

You need evaluations, but building a full evaluation system with a ground-truth dataset is an exhausting exercise, so I’ll usually start with a basic performance loop: a thumbs up / thumbs down button, a short “was this useful?” prompt, and a periodic check-in with a sample of users. Then we make it concrete. We define a simple rubric (what counts as correct, what counts as complete, what counts as unsafe), and we agree who is going to look at the results.
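
If it helps to picture it, here is a minimal sketch of that loop in Python. Every name in it (FeedbackEvent, usefulness_by_task) is something I have invented for illustration, not any particular product’s API:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    task: str          # which of the top-three tasks this output served
    useful: bool       # the thumbs up / thumbs down signal
    comment: str = ""  # optional "was this useful?" free text

def usefulness_by_task(events: list[FeedbackEvent]) -> dict[str, float]:
    """Share of thumbs-up responses per task, for the periodic check-in."""
    votes: defaultdict[str, list[bool]] = defaultdict(list)
    for e in events:
        votes[e.task].append(e.useful)
    return {task: sum(v) / len(v) for task, v in votes.items()}

if __name__ == "__main__":
    log = [
        FeedbackEvent("summarise_meeting", True),
        FeedbackEvent("summarise_meeting", False, "missed the action items"),
        FeedbackEvent("draft_reply", True),
    ]
    for task, rate in usefulness_by_task(log).items():
        print(f"{task}: {rate:.0%} rated useful")
```

The point is not the code, it’s that the signal is cheap to collect, tied to a named task, and reviewed by a named person.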

Next, once the workload has proven some early value, we might build a lightweight evaluation set: a small collection of real, permissioned examples that represent the meaty parts of the job. We score the answers or outputs against criteria the business cares about (correct, complete, compliant, appropriately cautious), and we keep that set stable so we can detect regressions when prompts, tools or models change.
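
Here is a hedged sketch of what such an evaluation set might look like if you roll your own rather than adopt a framework; EvalExample, pass_rates and the criteria names are illustrative only:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalExample:
    example_id: str
    prompt: str
    # One pass/fail checker per business criterion the output must meet.
    checks: dict[str, Callable[[str], bool]]

def pass_rates(examples: list[EvalExample],
               run_model: Callable[[str], str]) -> dict[str, float]:
    """Pass rate per criterion across the whole set.

    Keep the examples stable, re-run after every prompt, tool or model
    change, and compare with the previous run to catch regressions.
    """
    passes: Counter = Counter()
    totals: Counter = Counter()
    for ex in examples:
        output = run_model(ex.prompt)
        for criterion, check in ex.checks.items():
            totals[criterion] += 1
            passes[criterion] += check(output)
    return {c: passes[c] / totals[c] for c in totals}

if __name__ == "__main__":
    examples = [
        EvalExample(
            example_id="inv-001",
            prompt="Summarise invoice INV-001 and flag any missing fields.",
            checks={
                "complete": lambda out: "missing" in out.lower(),
                "compliant": lambda out: "account number" not in out.lower(),
            },
        ),
    ]
    # A stand-in for the real model call, so the sketch runs on its own.
    print(pass_rates(examples, lambda prompt: "Two missing fields flagged."))
```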

Why measurement determines trust and adoption

As you scale, measurement needs to become programmatic: traces you can follow end-to-end to understand where good answers come from, which lets you explain how much a good output costs and why. This is where a cascade sets in. If your team can’t explain what the system is doing, your business won’t trust it. If they don’t trust it, adoption will plateau. If adoption plateaus, you will never get beyond the pilot, the system will never reach its target users, and it will never meet its expected ROI.
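
As a toy illustration of the kind of number a trace makes possible, here is a sketch that turns token counts and an accept/reject signal into a cost per good output. The record fields and the per-1,000-token prices are assumptions, not anyone’s real rates:

```python
from dataclasses import dataclass

@dataclass
class Trace:
    trace_id: str
    input_tokens: int
    output_tokens: int
    accepted: bool  # did a human (or downstream check) accept the output?

def cost_per_accepted_output(traces: list[Trace],
                             price_in_per_1k: float,
                             price_out_per_1k: float) -> float:
    """Total token spend divided by accepted outputs, so the monthly bill
    maps onto delivered value rather than raw usage."""
    spend = sum(
        t.input_tokens / 1000 * price_in_per_1k
        + t.output_tokens / 1000 * price_out_per_1k
        for t in traces
    )
    accepted = sum(t.accepted for t in traces)
    return spend / accepted if accepted else float("inf")
```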

Abstract accuracy doesn’t help much on its own. A 4 per cent error rate might not look alarming on a dashboard, but across a million transactions a year that is 40,000 failures: it stops being a metric and starts being a cost. What matters is not whether the AI looks mostly right in theory, or whether it passes a “vibe check”. What matters is that when you are asked to show the thing with the big bill has delivered real business value, you can do so using numbers you understand and trust.
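
A back-of-envelope version of that arithmetic, with the per-error remediation cost as a stated assumption rather than a real figure:

```python
# Back-of-envelope only: the $25 per-error figure is an assumption.
error_rate = 0.04
transactions_per_year = 1_000_000
cost_per_error = 25.0  # assumed average cost to detect, triage and fix one error

errors_per_year = error_rate * transactions_per_year   # 40,000
annual_cost = errors_per_year * cost_per_error         # $1,000,000
print(f"{errors_per_year:,.0f} errors a year = ${annual_cost:,.0f} in remediation")
```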

That applies to both single-agent and multi-agent systems. However, once you move into multi-agent territory, you are not just measuring whether each agent performs well on its own. You are measuring how well the whole system works together: how work gets routed, how cleanly agents hand off, where delays or failures occur, and whether the orchestration is improving outcomes rather than just adding complexity.
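
A minimal sketch of what measuring those handoffs might look like, assuming you log each transfer between agents; the Handoff record and its fields are invented for this example:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Handoff:
    source: str  # agent handing the work off
    target: str  # agent receiving it
    ok: bool     # was the step accepted and completed downstream?

def handoff_failure_rates(handoffs: list[Handoff]) -> dict[tuple[str, str], float]:
    """Failure rate per (source, target) edge in the orchestration graph,
    to show where work is being dropped between agents."""
    totals: Counter = Counter()
    fails: Counter = Counter()
    for h in handoffs:
        edge = (h.source, h.target)
        totals[edge] += 1
        fails[edge] += not h.ok
    return {edge: fails[edge] / totals[edge] for edge in totals}
```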

That is also why I don’t think “move fast” is enough of a strategy on its own. Speed matters, but only if it is backed by evidence. You need to know whether feedback is improving, whether known failure modes are being caught, and whether the system is saving or costing you money.

Back to clarity

There was a time when it was obvious whether a tool was earning its place on the desk. When something went wrong, you could see it straight away and do something about it.

The best enterprise AI gets back to that same standard. Not because the technology is simple, but because it is well understood, well measured and well managed. You know what it’s allowed to do, you can see what it did, and you can tell whether it delivered what you paid for.

When an AI system reaches that point, it becomes boring in the best possible way. It does the job, it earns trust and it sticks around long enough to have a real impact.

If you are building AI for business and want a second set of eyes on making sure you are getting the outcomes you need without a shock on your next token bill, please reach out to the team at Data#3 and ask to be put in touch. I am always happy to talk.
