LLMs and Document Understanding at Human Scale
Insurance policies, wills, health records, and lease agreements are among the most consequential documents in a person's life. They're also among the most unread. We're using large language models to change that — carefully.
The average insurance policy is 47 pages. The average American reads zero of them.
This is not laziness — it's rational. Dense legal language, undefined terms, cross-referenced exclusions, and deliberately obfuscated coverage limits make these documents functionally inaccessible to most readers. The same is true for wills, health records, lease agreements, and retirement account disclosures. These documents define critical rights and obligations. Almost nobody reads them carefully.
One of the core research problems at oue.ai is: how do you use language models to help people actually understand documents that govern their lives, without introducing errors that could be worse than the ignorance you're trying to solve?
The document understanding stack
Our document pipeline has four stages: ingestion, chunking, extraction, and presentation.
Ingestion
We accept PDFs, images, and plain text. For PDFs, we use structured text extraction first; if that fails (scanned documents, image-based PDFs), we fall back to OCR. The quality of extraction at this stage is the primary determinant of downstream accuracy — garbage in, garbage out applies with particular force to LLM pipelines, because the model will fluently summarize whatever text it receives, mangled or not.
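The fallback decision can be sketched as a simple heuristic: if structured extraction yields almost no text per page, the PDF is probably scanned and should be routed to OCR. The threshold, function names, and injected backends here are illustrative, not our production values.

```python
def needs_ocr_fallback(page_texts, min_chars_per_page=200):
    """Heuristic: structured extraction that yields very little text per
    page usually means a scanned or image-based PDF. Threshold is
    illustrative only."""
    if not page_texts:
        return True
    avg = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return avg < min_chars_per_page

def ingest(pdf_path, extract_structured, run_ocr):
    """extract_structured and run_ocr are injected backends (e.g. a PDF
    text extractor and an OCR engine), passed as parameters so the sketch
    stays library-agnostic."""
    pages = extract_structured(pdf_path)
    if needs_ocr_fallback(pages):
        pages = run_ocr(pdf_path)
    return "\n\n".join(pages)
```

Injecting the backends as parameters keeps the control flow testable without committing the sketch to any particular extraction library.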
Chunking
We segment documents by logical section rather than by token count. For insurance policies, this means identifying coverage sections, exclusions, definitions, and endorsements. For wills, it means identifying articles (personal property, real property, beneficiary designations, trustee provisions). Section-aware chunking significantly outperforms naive chunking on extraction tasks because model context isn't wasted bridging unrelated sections.
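A minimal version of section-aware chunking splits on recognized heading patterns rather than token counts. The regex below covers only a few insurance-policy headings and is purely illustrative; real section detection has to handle far messier layouts.

```python
import re

# Illustrative heading patterns for insurance policies; production
# section detection is necessarily richer than a short regex list.
SECTION_PATTERNS = re.compile(
    r"^(SECTION\s+[IVX\d]+|COVERAGES?|EXCLUSIONS?|DEFINITIONS?|ENDORSEMENTS?)\b",
    re.IGNORECASE | re.MULTILINE,
)

def chunk_by_section(text):
    """Split a policy at recognized section headings, so each chunk is
    one logical unit instead of an arbitrary token window."""
    starts = [m.start() for m in SECTION_PATTERNS.finditer(text)]
    if not starts:
        return [text]            # no headings found: treat as one chunk
    if starts[0] != 0:
        starts.insert(0, 0)      # keep any preamble as its own chunk
    starts.append(len(text))
    return [text[a:b].strip() for a, b in zip(starts, starts[1:])]
```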
Extraction
We prompt Claude with a domain-specific schema and ask it to populate fields from each chunk. The schema is specific to document type — we have separate schemas for health insurance, life insurance, auto insurance, wills, trusts, and health records. Fields include:
- Coverage limits (per-incident, annual, lifetime)
- Deductibles and out-of-pocket maximums
- Exclusion clauses (verbatim text extracted)
- Renewal dates and grace periods
- Beneficiary names and conditions
- Medication names, dosages, prescribing providers
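A schema-driven extraction prompt might be assembled like this. The schema subset and field names below are hypothetical (our actual schemas are not reproduced here), and the model call itself is elided — this shows only how a schema turns into an instruction.

```python
# Hypothetical subset of a health-insurance schema; field names echo the
# list above but do not reproduce the real schemas.
HEALTH_INSURANCE_SCHEMA = {
    "annual_deductible": "Dollar amount of the annual deductible",
    "out_of_pocket_max": "Annual out-of-pocket maximum",
    "coverage_limit_lifetime": "Lifetime coverage limit, if any",
    "renewal_date": "Policy renewal date",
}

def build_extraction_prompt(schema, chunk):
    """Ask the model to fill each schema field and quote the exact source
    sentence each value came from; null when a field is absent."""
    fields = "\n".join(f"- {name}: {desc}" for name, desc in schema.items())
    return (
        "Extract the following fields from the policy text below.\n"
        'Return JSON mapping each field name to {"value": ..., "source_text": ...}.\n'
        "Use null for any field not present in the text; never guess.\n\n"
        f"Fields:\n{fields}\n\nPolicy text:\n{chunk}"
    )
```

The explicit "null, never guess" instruction is what lets the pipeline prefer a missing answer over a fabricated one.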
Critically, the model is instructed to return the source text alongside each extracted field — the actual sentence or paragraph from which the value was extracted. This enables source attribution in the UI, which we consider non-negotiable for high-stakes documents.
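Because the model is asked to quote its source, the quotes can be checked mechanically: a quoted sentence that does not appear in the chunk is a fabricated citation and the field is dropped rather than shown. A sketch of that guard, under the assumed JSON shape from above:

```python
def verify_source_attribution(extracted, chunk):
    """Keep only fields whose quoted source_text literally appears in the
    chunk it was extracted from; reject the rest. A simple guard against
    fabricated citations (assumes exact-substring quoting)."""
    verified, rejected = {}, []
    for field, payload in extracted.items():
        if payload is None:          # model reported the field as absent
            continue
        src = (payload.get("source_text") or "").strip()
        if src and src in chunk:
            verified[field] = payload
        else:
            rejected.append(field)
    return verified, rejected
```

An exact-substring check is deliberately strict; a production guard would likely also need whitespace and OCR-noise tolerance.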
Presentation
We never show users only the extracted summary. We show the extracted value, the confidence level, and the source text. Users can click through to the original document section. This design choice reflects a core principle: our job is to make the document more accessible, not to replace it. The document remains the ground truth.
Measuring accuracy
We maintain annotated test sets for each document type — collections of real documents (with sensitive information removed) where human annotators have identified the correct values for every field we extract. We evaluate against these sets using:
- Field-level precision: When we return a value, is it correct?
- Field-level recall: What percentage of present fields do we find?
- Hallucination rate: How often does the model return a value that isn't in the document?
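The three metrics above reduce to simple set arithmetic over predicted and gold field values. A minimal sketch, assuming `None` means "not found" on the predicted side and "not present in the document" on the gold side:

```python
def field_metrics(predicted, gold):
    """Field-level precision, recall, and hallucination rate.
    predicted/gold map field name -> value; None means the field was not
    found (predicted) or not present in the document (gold)."""
    returned = {f for f, v in predicted.items() if v is not None}
    present = {f for f, v in gold.items() if v is not None}
    correct = {f for f in returned & present if predicted[f] == gold[f]}
    hallucinated = returned - present   # value returned, nothing in the doc
    precision = len(correct) / len(returned) if returned else 1.0
    recall = len(correct) / len(present) if present else 1.0
    halluc_rate = len(hallucinated) / len(returned) if returned else 0.0
    return precision, recall, halluc_rate
```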
We optimize for precision and low hallucination rate over recall. It is much better to tell a user "we couldn't find your deductible in this document" than to confidently return the wrong number. A false negative creates confusion. A false positive creates harm.
Current performance on our internal benchmarks across document types ranges from 91% to 97% field-level precision. Hallucination rates are below 2% across all types. These numbers aren't good enough for autonomous action — but they're good enough to provide a meaningful starting point for human review.
The limits we've learned to respect
Some things current models handle poorly, and we've learned to route around them rather than fight them:
Conditional logic. "Coverage applies if the procedure is medically necessary and performed by an in-network provider and the annual deductible has been met" involves three conditions that interact. Models frequently extract the coverage amount without preserving all three conditions. We now explicitly prompt the model to extract each condition, and we display the conditions alongside the value they gate.
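One way to make "display them together" structurally unavoidable is to bind the amount and its conditions into a single record, so the rendering layer cannot show one without the other. A hypothetical sketch:

```python
from dataclasses import dataclass, field

@dataclass
class ConditionalCoverage:
    """A coverage amount bound to every condition gating it, so the UI
    can never render the amount without its conditions."""
    amount: str
    conditions: list = field(default_factory=list)

    def render(self):
        if not self.conditions:
            return f"Covered: {self.amount}"
        return f"Covered: {self.amount}, only if: " + "; ".join(self.conditions)
```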
Cross-document references. A will may reference a trust document that doesn't exist in the system. An insurance policy may incorporate a separate endorsement by reference. We flag these explicitly and tell users to provide the referenced document before relying on our summary.
State-specific legal variations. A healthcare directive valid in California may not be valid in Texas. We surface the state-specificity of legal documents but do not attempt to adjudicate cross-state validity.
Ambiguous language. Legal documents are sometimes deliberately ambiguous — the ambiguity is what gets litigated. We flag linguistic ambiguity when we detect it rather than choosing an interpretation.
What this makes possible
When someone uploads their homeowner's insurance policy to Haven, they see: what's covered, what's excluded, what their limits are, and when it renews — in plain English, with the source text one click away. When a Vigil user uploads their will, their designated emergency contacts can understand what it says if something happens, without needing a lawyer present.
This doesn't replace professional legal or financial advice. It closes the gap between having a document and understanding what you have. That gap, for most people, is enormous — and closing it is one of the most straightforwardly useful things we can do.
oue.ai Research
March 28, 2026