Every AI vendor selling to government now says the same sentence: a human reviews all outputs. Good. That is the right design. Now ask the follow-up question: reviews them where? There is a real difference between a review queue with names attached and the answer you usually get, which is some version of "staff look at it before sending." One of those is a workflow. The other is a hope with a sales deck behind it.
Approved in Bulk at 4:55 on a Friday
Picture the weak version in a utility office. The system drafts 200 annual test reminder letters for backflow assemblies. A billing clerk, covering for the utility clerk who is out, opens the batch screen, scrolls for a few seconds, and clicks Approve All at 4:55 on a Friday. In the batch are three letters staff meant to hold: one assembly on a vacant service, one being replaced during a remodel, and one where the tester's certification number expired in March and the test report should never have been accepted. The letters go out Monday with the rest.
Eighteen months later an auditor asks a simple question: who approved these 200 letters? The only answer the system can give is a sent timestamp. Not a name. Not what the drafts said before edits, if there were edits. Not whether anyone opened a single one. The review happened, in the sense that a person was in the room. But review that leaves no record is, for a public agency, very hard to distinguish from no review at all.
A Queue With Names Attached
Real human-in-the-loop has structure, and the structure is not exotic:
- Assigned review — each item lands in a named person's queue, not in an ambient pile anyone might glance at
- Approvals on the item — who approved it and when, attached to the letter itself
- The draft trail — what the system generated versus what the human sent
- Reasons for changes — when something is edited, rejected, or held, the why is captured in a field, not in someone's memory
- Holds that hold — a flagged assembly stays out of every batch until a person releases it, by name
None of this slows a competent clerk down much. All of it is what protects that clerk when the question comes in eighteen months.
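For readers who want to see that structure as data, here is a minimal sketch in Python. Every name in it (ReviewItem, approve_batch, the field names) is hypothetical and does not come from any particular product. It only illustrates that the five properties above fit in one small record, and that a batch approval can be written so held items cannot ride along.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ReviewItem:
    """One drafted letter in a named reviewer's queue (hypothetical schema)."""
    item_id: str
    assigned_to: str                       # a named person, not a shared pile
    generated_draft: str                   # what the system produced
    sent_text: Optional[str] = None        # what the human actually sent
    approved_by: Optional[str] = None
    approved_at: Optional[datetime] = None
    held: bool = False
    history: list = field(default_factory=list)  # (time, who, action, reason)

    def _log(self, who: str, action: str, reason: str = "") -> None:
        self.history.append((datetime.now(timezone.utc), who, action, reason))

    def hold(self, who: str, reason: str) -> None:
        if not reason.strip():
            raise ValueError("a hold needs a recorded reason")
        self.held = True
        self._log(who, "hold", reason)

    def release(self, who: str, reason: str) -> None:
        if not reason.strip():
            raise ValueError("a release needs a recorded reason")
        self.held = False
        self._log(who, "release", reason)

    def approve(self, who: str, final_text: Optional[str] = None,
                reason: str = "") -> None:
        if self.held:
            raise PermissionError(f"{self.item_id} is on hold")
        if who != self.assigned_to:
            raise PermissionError(f"{self.item_id} is not in {who}'s queue")
        text = self.generated_draft if final_text is None else final_text
        # An edited letter must say why it was edited.
        if text != self.generated_draft and not reason.strip():
            raise ValueError("an edit needs a recorded reason")
        self.sent_text = text
        self.approved_by = who
        self.approved_at = datetime.now(timezone.utc)
        self._log(who, "approve", reason)


def approve_batch(items: list, who: str) -> list:
    """Approve what this reviewer may approve; return the ids left untouched."""
    skipped = []
    for item in items:
        if item.held or item.assigned_to != who:
            skipped.append(item.item_id)
            continue
        item.approve(who)
    return skipped
```

In this sketch, Approve All at 4:55 on a Friday still works. It just returns the list of letters it refused to send, and every letter it did send carries a name and a time.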
Design for Disagreement
Here is the detail that separates serious systems from decorative ones: what happens when the human disagrees? There are systems in production right now where the model flags an account as noncompliant and the only way to override the flag is to email the vendor's support desk and wait. So nobody overrides. The flag stands, the shutoff notice queues up, and the staff member who knew the account was a duplicate from a billing migration shrugs and moves on. The vendor's compliance numbers look great.
If overriding the system is harder than accepting it, people will accept it — quietly, by default, every time. That is how a human-in-the-loop becomes a human rubber stamp. Not through bad intent. Through interface friction. The override should be one step, available to the person doing the work, and it should be the most thoroughly recorded action in the system: who disagreed, with what, and why. An override log full of entries is not a sign the model is failing. It is a sign the humans are still awake.
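A one-step, fully recorded override is not much code. The sketch below is an illustration under assumptions, not any vendor's API: FlagStore, Override, and the account and staff identifiers are all made up for the example. The point is the shape: a single call clears the flag, and that same call cannot succeed without recording who disagreed, with what, and why.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Override:
    """An immutable record of one human disagreement (hypothetical schema)."""
    account_id: str
    model_flag: str        # what the model said
    overridden_by: str     # who disagreed
    reason: str            # why
    at: datetime


class FlagStore:
    def __init__(self) -> None:
        self.flags = {}        # account_id -> active model flag
        self.overrides = []    # append-only override log

    def flag(self, account_id: str, model_flag: str) -> None:
        self.flags[account_id] = model_flag

    def override(self, account_id: str, who: str, reason: str) -> Override:
        """Clear a model flag in one step, and log it in the same step."""
        if not who.strip() or not reason.strip():
            raise ValueError("an override needs a name and a reason")
        if account_id not in self.flags:
            raise KeyError(f"no active flag on {account_id}")
        entry = Override(
            account_id=account_id,
            model_flag=self.flags.pop(account_id),
            overridden_by=who,
            reason=reason,
            at=datetime.now(timezone.utc),
        )
        self.overrides.append(entry)
        return entry


store = FlagStore()
store.flag("ACCT-100", "noncompliant")
store.override("ACCT-100", "j.rivera", "duplicate account from billing migration")
```

After that last call the flag is gone and the log has one entry with a name, a reason, and a timestamp. No support ticket is involved.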
The Question to Ask a Vendor
When a vendor says a human reviews all outputs, ask to see two screens: the one where a named person approves an item, and the one where that person disagrees with it. If the first is a checkbox and the second is a support ticket, the loop is decorative. Buy something else.