An appeal lives or dies on its citations: the payer's own policy, by name and section, applied correctly. AI systems are famously capable of inventing citations that look right and are not. So we publish our verification method and our measured numbers, because "trust us" is not a methodology.
In our own benchmark, roughly 27% of citations in un-gated, single-pass AI drafts failed verification, measured on the same case corpus our engine runs on. That is the industry's quiet defect: letters that read beautifully and cite policies that do not exist, do not say what the letter claims, or do not apply to the patient. Payers increasingly machine-read appeals, and machine reviewers punish exactly this.
Appealit's engine was designed so that a fabricated citation is not merely unlikely. It is structurally blocked.
The engine cannot cite the open internet. Every authority a letter may cite lives in a curated library (we call it the rulepack): FDA labeling, payer medical policies, Medicare coverage manuals, state insurance law, specialty-society guidelines, and peer-reviewed trials pinned to their DOI and journal. Every entry is verified against the primary source before it is admitted, under a verify-don't-trust workflow: no entry gets in on an AI's say-so.
Today the library covers 18 condition classes with 217 verified entries, and grows class by class under the same rule. Duplicate or conflicting entries are machine-blocked at build time.
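A build-time block on duplicate or conflicting entries could look something like the minimal Python sketch below. The `Entry` fields, the `RulepackBuildError` name, and the rule that two entries for the same source and section count as a duplicate (same text) or a conflict (different text) are assumptions made for illustration, not the engine's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    # All field names here are hypothetical, chosen for illustration.
    entry_id: str  # stable identifier a letter would cite
    source: str    # e.g. a payer medical policy or coverage manual
    section: str   # section within that source
    text: str      # wording as verified against the primary source


class RulepackBuildError(Exception):
    """Raised when the library cannot be built cleanly."""


def build_rulepack(entries):
    """Index entries by id, refusing duplicates and conflicts."""
    by_id = {}
    by_locator = {}
    for entry in entries:
        if entry.entry_id in by_id:
            raise RulepackBuildError(f"duplicate id: {entry.entry_id}")
        locator = (entry.source, entry.section)
        prior = by_locator.get(locator)
        if prior is not None:
            # Same source and section already present: identical text is a
            # duplicate, different text is a conflict. Both stop the build.
            kind = "duplicate" if prior.text == entry.text else "conflicting"
            raise RulepackBuildError(
                f"{kind} entries for {locator}: "
                f"{prior.entry_id} and {entry.entry_id}"
            )
        by_id[entry.entry_id] = entry
        by_locator[locator] = entry
    return by_id
```

Failing the whole build, rather than skipping the offending entry, is one way to make sure a questionable library never reaches the drafting step.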
A drafted letter ships only after it clears all three gates. Fail any gate and the letter is revised and re-run; a letter that cannot pass is not shipped, and we say so.
The first gate is not AI. Every citation in the draft must resolve exactly to a verified library entry. A citation that does not resolve blocks the letter, with no exceptions, and the check cannot be argued with.
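As a rough illustration of what "resolve exactly" can mean in practice, here is a small Python sketch of a non-AI resolution check. The `[cite:...]` marker format, the identifier style, and the function name are hypothetical; the source does not describe how citations are encoded in a draft.

```python
import re

# Hypothetical inline marker, e.g. "[cite:payer-x.med-policy.4.2]".
CITE_MARKER = re.compile(r"\[cite:([A-Za-z0-9_.-]+)\]")


def unresolved_citations(draft: str, library_ids: set[str]) -> list[str]:
    """Return every cited id that is not an exact match in the library.

    Matching is plain, case-sensitive string equality: no fuzzy matching
    and no model judgment. An empty result means the check passes.
    """
    cited = CITE_MARKER.findall(draft)
    return [cite_id for cite_id in cited if cite_id not in library_ids]


library = {"payer-x.med-policy.4.2"}
draft = (
    "Coverage criteria are met [cite:payer-x.med-policy.4.2] "
    "and step therapy is waived [cite:payer-x.med-policy.9.9]."
)
blocked = unresolved_citations(draft, library)
```

In this example `blocked` holds the one id that is absent from the library, so the letter would be held back rather than shipped.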
The second gate is a separate AI reviewer, deliberately from a different model vendor than the drafter. It checks that each citation is applied correctly to this patient's facts: right criterion, right section, right conclusion.
The third gate is a separately built reviewer that attacks the letter the way a hostile payer reviewer would, probing for weak arguments, unsupported claims, and anything a denial could latch onto.
Three different systems, three different failure modes, on purpose. A mistake has to survive all of them to reach a letter.
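The revise-and-re-run loop around the gates can be sketched as follows. This is a minimal illustration under stated assumptions: each gate is modeled as a function returning a list of objections (empty means pass), `revise` stands in for the drafter, and the `max_rounds` cap is hypothetical, since the source gives no round limit.

```python
from typing import Callable, Optional

# A gate takes a draft and returns its objections; an empty list is a pass.
Gate = Callable[[str], list[str]]


def first_objections(draft: str, gates: list[Gate]) -> list[str]:
    """Run the gates in order and return the first non-empty objection list."""
    for gate in gates:
        objections = gate(draft)
        if objections:
            return objections
    return []


def run_gated_loop(
    draft: str,
    gates: list[Gate],
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 3,  # hypothetical cap, not a documented value
) -> tuple[Optional[str], int]:
    """Return (letter, rounds_used), or (None, max_rounds) if nothing passes."""
    for round_no in range(1, max_rounds + 1):
        objections = first_objections(draft, gates)
        if not objections:
            return draft, round_no  # cleared every gate: safe to ship
        if round_no < max_rounds:
            draft = revise(draft, objections)
    return None, max_rounds  # decline rather than ship an unverified letter
```

Returning `None` instead of the last draft keeps the failure explicit: a caller has to handle the declined case and cannot ship a letter that never passed.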
The most important thing a verified engine does is decline. When the engine cannot ground an appeal in a verified authority, it does not improvise; it stops and says what is missing.
Every number below is measured engine output on blinded synthetic or public data. Nothing is a projection, and nothing was graded on data the engine was trained or tuned on.
| metric | result | basis |
|---|---|---|
| Fabricated citations in shipped letters | 0 | every published evaluation run to date |
| Held-out draftability benchmark | 95.2% | 249 held-out cases, full three-gate engine, our GLP-1 wedge benchmark |
| Across all published gated-loop runs | 93.0% | 487 loop runs, including re-runs and harder case mixes |
| Average verification rounds per letter | 2.07 | most letters pass all gates in 1-2 rounds |
| Engine time per letter | ~77s | drafting plus all three gates |
| Successful red-team breaches | 0 | published adversarial suite: prompt injection, fabrication pressure, fake-authority bait |
| Automated tests on the engine | 766 | current suite, run on every change |
Every evaluation runs on fresh, blinded cases the engine has never seen, and each batch is thrown away after grading so no run is ever flattered by familiarity. When an evaluation exposes a flaw, the flaw is published along with its fix, not smoothed out of the summary. The point of the whole method is simple: the appeal a real patient files has to be true.
Methodology pages are easy to write. So we also let partners touch the thing: run a rejected-PA batch on synthetic data, open a letter, click a citation, and watch the gates' verdicts live. We run the sandbox with you in a 20-minute walkthrough.