How Appealit verifies every citation: our published methodology

The problem we built against

In our own benchmark, roughly 27% of citations in un-gated, single-pass AI drafts failed verification on the same case corpus our engine runs on. That is the industry's quiet defect: letters that read beautifully and cite policies that do not exist, do not say what the letter claims, or do not apply to the patient. Payers increasingly machine-read appeals, and machine reviewers punish exactly this.

Appealit's engine was designed so that a fabricated citation is not merely unlikely. It is structurally blocked.

Step one: a citation has to already exist in our verified library

The engine cannot cite the open internet. Every authority a letter may cite lives in a curated library (we call it the rulepack): FDA labeling, payer medical policies, Medicare coverage manuals, state insurance law, specialty-society guidelines, and peer-reviewed trials pinned to their DOI and journal. Every entry is verified against the primary source before it is admitted, under a verify-don't-trust workflow: no entry gets in on an AI's say-so.

Today the library covers 18 condition classes with 217 verified entries, and grows class by class under the same rule. Duplicate or conflicting entries are machine-blocked at build time.
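As an illustration only, a build-time check of this kind could look like the Python sketch below. The names RulepackEntry, RulepackBuildError, and build_rulepack, and the fields on each entry, are hypothetical and chosen for the example; this is not the engine's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RulepackEntry:
    entry_id: str   # hypothetical stable identifier for one authority
    source: str     # hypothetical pointer to the primary source
    text_hash: str  # hypothetical hash of the verified source text


class RulepackBuildError(Exception):
    """Raised when the library cannot be built safely."""


def build_rulepack(entries):
    """Index entries by id, refusing duplicates and conflicts."""
    index = {}
    for entry in entries:
        existing = index.get(entry.entry_id)
        if existing is None:
            index[entry.entry_id] = entry
        elif existing == entry:
            # Same id and same content: a duplicate.
            raise RulepackBuildError(f"duplicate entry: {entry.entry_id}")
        else:
            # Same id but different content: a conflict.
            raise RulepackBuildError(f"conflicting entry: {entry.entry_id}")
    return index
```

Failing the whole build, rather than silently keeping one of the two entries, means a person has to decide which version of the authority is correct.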

Step two: three independent gates, every letter, every time

A drafted letter ships only after it clears all three. Fail any gate and the letter is revised and re-run; a letter that cannot pass is not shipped, and we say so.

The deterministic authority check

Not AI. Every citation in the draft must resolve exactly to a verified library entry. A citation that does not resolve blocks the letter, no exceptions, and cannot be argued with.
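A minimal sketch of what "resolve exactly" can mean in practice, assuming citations are carried as string keys and the library is a dictionary keyed the same way. The function names and key format are hypothetical, not the engine's real interface.

```python
def unresolved_citations(citations, rulepack):
    """Return every citation key with no exact match in the library.

    The lookup is a plain dictionary membership test: no fuzzy matching,
    no case folding, no model in the loop.
    """
    return [key for key in citations if key not in rulepack]


def passes_authority_gate(citations, rulepack):
    """The gate passes only when every citation resolves."""
    return not unresolved_citations(citations, rulepack)
```

Because the check is a lookup rather than a judgment, the same draft and the same library always give the same verdict.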

The independent judge

A separate AI reviewer, deliberately from a different model vendor than the drafter, checks that each citation is applied correctly to this patient's facts: right criterion, right section, right conclusion.

The adversarial reviewer

A third, separately built reviewer attacks the letter the way a hostile payer reviewer would, probing for weak arguments, unsupported claims, and anything a denial could latch onto.

Three different systems, three different failure modes, on purpose. A mistake has to survive all of them to reach a letter.
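The revise-and-re-run loop described above can be sketched roughly as follows. This is an illustration under stated assumptions: each gate is modeled as a function that returns a list of objections, the revise callback and the max_rounds cap are hypothetical, and none of the names come from the real engine.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A gate inspects a draft and returns its objections; an empty list means pass.
Gate = Callable[[str], List[str]]


def first_objections(draft: str, gates: Sequence[Gate]) -> List[str]:
    """Run the gates in order and return the first non-empty objections."""
    for gate in gates:
        objections = gate(draft)
        if objections:
            return objections
    return []


def run_gated_loop(
    draft: str,
    gates: Sequence[Gate],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,  # assumed cap; the real limit is not published here
) -> Tuple[Optional[str], int]:
    """Return (letter, rounds) once every gate passes, else (None, max_rounds)."""
    for round_number in range(1, max_rounds + 1):
        objections = first_objections(draft, gates)
        if not objections:
            return draft, round_number
        if round_number < max_rounds:
            # A revised draft goes back through every gate from the start.
            draft = revise(draft, objections)
    return None, max_rounds
```

Returning None instead of the last draft reflects the rule that a letter that cannot pass is not shipped.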

Refusal is a feature

The most important thing a verified engine does is decline. When the engine cannot ground an appeal in a verified authority, it does not improvise; it stops and says what is missing.

Live example from our test runs: handed a denial from a payer that does not exist, the engine returned NEEDS_INFO and refused to draft, rather than inventing a plausible-sounding policy for a fictional insurer. A confident wrong letter costs a patient a real appeal window. We would rather say no.
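In code terms, refusal is an ordinary return value rather than an error path. The sketch below assumes a simplified pre-draft check against a set of payers with verified authorities; apart from the NEEDS_INFO status named above, every identifier (EngineResult, triage, READY_TO_DRAFT) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class EngineResult:
    status: str  # "NEEDS_INFO", or the hypothetical "READY_TO_DRAFT"
    missing: List[str] = field(default_factory=list)


def triage(payer: str, payers_with_authorities: Set[str]) -> EngineResult:
    """Refuse to draft when no verified authority covers the payer."""
    if payer not in payers_with_authorities:
        return EngineResult(
            status="NEEDS_INFO",
            missing=[f"no verified authority on file for payer: {payer}"],
        )
    return EngineResult(status="READY_TO_DRAFT")
```

The missing list is what lets the engine say what it needs instead of only saying no.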

The measured numbers

Every number below is measured engine output on blinded synthetic or public data. Nothing is a projection, and nothing was graded on data the engine trained or tuned on.

metric	result	basis
Fabricated citations in shipped letters	0	every published evaluation run to date
Held-out draftability benchmark	95.2%	249 held-out cases, full three-gate engine, our GLP-1 wedge benchmark
Draftability across all published gated-loop runs	93.0%	487 loop runs, including re-runs and harder case mixes
Average verification rounds per letter	2.07	most letters pass all gates in 1-2 rounds
Engine time per letter	~77s	drafting plus all three gates
Successful red-team breaches	0	published adversarial suite: prompt injection, fabrication pressure, fake-authority bait
Automated tests on the engine	766	current suite, run on every change

What we do not claim

We do not publish overturn or win rates. Not because they look bad, but because we have not yet measured them on real filed appeals, and we will not put an unmeasured percentage in front of you. Much of this market does. When we publish outcome numbers, they will come with the n, the denominator rules, the time window, and per-denial-type splits.
Reference outcomes shown in our sandbox are context, not claims. They are the blinded source-dataset outcomes for those cases and are never used by the engine.
A verified citation is not a guaranteed win. It means the authority exists, says what we say it says, and applies to the case as argued. Payers still decide appeals.

How we evaluate ourselves

Every evaluation runs on fresh, blinded cases the engine has never seen, and each batch is thrown away after grading so no run is ever flattered by familiarity. When an evaluation exposes a flaw, the flaw is published alongside the fix, not smoothed out of the summary. The point of the whole method is simple: the appeal a real patient files has to be true.

Check it yourself

Methodology pages are easy to write. So we also let partners touch the thing: run a rejected-PA batch on synthetic data, open a letter, click a citation, and watch the gates' verdicts live. We run the sandbox with you in a 20-minute walkthrough.

Book a walkthrough
Questions about the method: [email protected]