AI Research · April 10, 2026 · 8 min read

Responsible AI for High-Stakes Life Decisions

Most AI tools are built for low-stakes tasks. We're building for the moments that actually define a life — and that requires a fundamentally different approach to how models are prompted, constrained, and evaluated.

There is a meaningful difference between an AI that helps you write a marketing email and one that helps you draft your will. The stakes, the error surface, and the user's emotional state are all categorically different. Yet most AI products treat these contexts identically — a language model, a prompt, a response.

At oue.ai, we build across 16 domains that touch the most consequential decisions in a person's life: estate planning, family health records, retirement, elder care, spiritual growth, education, procreation. Getting something wrong in these contexts doesn't mean a bad tweet — it can mean a will that fails probate, a medication interaction that gets missed, or a retirement projection that leads someone to underestimate how long their savings need to last.

This post describes how we think about deploying large language models responsibly in high-stakes settings.

The core tension

LLMs are extraordinarily capable at generating fluent, confident-sounding text. That's also precisely what makes them dangerous in legal, financial, and medical contexts. Confidence without accuracy is harmful at any level of stakes; it is catastrophic when someone is making an irrevocable decision about their estate or their child's health.

Our first design principle is: the model should surface information, not make decisions. Every AI-generated output in our products is explicitly positioned as a starting point for human judgment, not a substitute for it.

Structured extraction over open generation

We use Claude (Anthropic's model) across our product suite, primarily in a structured-extraction mode rather than open-ended generation. When a user uploads an insurance policy, we don't ask the model to "summarize it." We ask it to extract specific fields into a schema: coverage limits, exclusion clauses, renewal dates, premium amounts, out-of-pocket maximums. This is a fundamentally different task.

Structured extraction is more reliable because:

  • The output space is constrained — the model either finds the field or it doesn't
  • Missing values are explicit ("not found") rather than hallucinated
  • We can validate outputs against expected types, ranges, and formats
  • The original source text is always preserved and surfaced to the user
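These properties can be enforced in code, not just in the prompt. Here is a minimal sketch of post-extraction validation, assuming a hypothetical `PolicyFields` schema and field names (the actual schemas and field sets are not shown in this post). Missing or malformed values become explicit `None`s with recorded problems rather than silent guesses:

```python
from dataclasses import dataclass
from typing import Optional

NOT_FOUND = "not found"  # explicit sentinel the model is told to emit

@dataclass
class PolicyFields:
    """Hypothetical schema for insurance-policy extraction."""
    coverage_limit_usd: Optional[float]
    renewal_date: Optional[str]  # ISO 8601, or None when not found
    premium_usd: Optional[float]
    out_of_pocket_max_usd: Optional[float]

def validate(raw: dict) -> "tuple[PolicyFields, list[str]]":
    """Coerce raw model output into the schema, collecting problems
    instead of guessing. Missing or malformed fields become None."""
    problems: list[str] = []

    def money(key: str) -> Optional[float]:
        v = raw.get(key, NOT_FOUND)
        if v == NOT_FOUND:
            return None  # explicit abstention, never inferred
        try:
            f = float(v)
        except (TypeError, ValueError):
            problems.append(f"{key}: not a number ({v!r})")
            return None
        if f < 0:
            problems.append(f"{key}: negative amount")
            return None
        return f

    date = raw.get("renewal_date", NOT_FOUND)
    fields = PolicyFields(
        coverage_limit_usd=money("coverage_limit_usd"),
        renewal_date=None if date == NOT_FOUND else str(date),
        premium_usd=money("premium_usd"),
        out_of_pocket_max_usd=money("out_of_pocket_max_usd"),
    )
    return fields, problems
```

Because every field passes through a typed validator, a hallucinated value that fails type or range checks is downgraded to "not found" with a logged problem, rather than shown to the user as fact.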

When we do use open generation — for example, explaining a legal clause in plain English — we pair the explanation with the original text so the user can verify it themselves. We also add explicit hedges: "This is our interpretation — consult a licensed attorney for advice specific to your situation." These aren't just legal CYA disclaimers. They're part of the honest communication design.
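Structurally, that pairing means a generated explanation is never a free-floating string. A sketch of the idea, with hypothetical type and function names:

```python
from dataclasses import dataclass

ATTORNEY_HEDGE = ("This is our interpretation — consult a licensed "
                  "attorney for advice specific to your situation.")

@dataclass
class Explanation:
    """A generated explanation always travels with the verbatim
    source clause and an explicit hedge."""
    source_text: str    # the original clause, quoted verbatim
    plain_english: str  # model-generated explanation
    hedge: str = ATTORNEY_HEDGE

def render(e: Explanation) -> str:
    """Show source first so the user can verify the explanation."""
    return (f"Original clause:\n{e.source_text}\n\n"
            f"Plain English:\n{e.plain_english}\n\n{e.hedge}")
```

The point of the data shape is that the UI cannot display the explanation without also displaying the source and the hedge.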

Prompt engineering for accuracy, not persuasion

Most consumer AI prompts are optimized for user satisfaction. Ours are optimized for accuracy. That means we actively engineer against the model's tendency to be helpful in ways that compromise precision.

Concretely, our system prompts for legal and financial domains include explicit instructions like:

  • If a field is ambiguous, say so rather than choosing an interpretation
  • If information is not present in the document, say "not found" — do not infer
  • Never provide specific investment, tax, or legal advice — describe options, not recommendations
  • When summarizing, prefer understatement to overstatement — it is better to under-explain than to introduce false confidence
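The instructions above translate into a system prompt along these lines. This is an illustrative sketch, not our production prompt; the exact wording is hypothetical:

```python
# Illustrative accuracy-first system prompt for legal/financial
# extraction. Wording is a sketch, not the production prompt.
EXTRACTION_SYSTEM_PROMPT = """\
You extract fields from documents. Follow these rules strictly:
1. If a field is ambiguous, say so; do not choose an interpretation.
2. If information is not present in the document, output "not found".
   Do not infer.
3. Never provide specific investment, tax, or legal advice. Describe
   options, not recommendations.
4. When summarizing, prefer understatement to overstatement.
"""
```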

We also separate the model's read of a document from any downstream action. The model tells you what your policy says. It doesn't tell you whether you have enough coverage — that requires knowing your circumstances, risk tolerance, and financial situation, none of which we fully know.

Evaluation

We maintain a growing set of test documents across each domain — insurance policies, wills, health records, lease agreements — with human-annotated ground truth for the fields we extract. Our extraction pipeline is evaluated against this set on every significant prompt change. We prioritize precision over recall: we would rather flag a field as "not found" than return a wrong answer.
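Treating "not found" as an abstention changes how the metrics are scored: an abstention never counts as a wrong answer, so tightening the prompt to abstain more raises precision at the cost of recall. A minimal sketch of field-level scoring under that convention (function and field names are illustrative):

```python
def field_metrics(predictions: dict, ground_truth: dict,
                  abstain: str = "not found") -> "tuple[float, float]":
    """Field-level precision/recall where the model may abstain.
    Precision = correct / attempted; recall = correct / answerable."""
    attempted = correct = answerable = 0
    for key, truth in ground_truth.items():
        pred = predictions.get(key, abstain)
        if truth != abstain:
            answerable += 1      # a human annotator found this field
        if pred == abstain:
            continue             # abstention: no precision penalty
        attempted += 1
        if pred == truth:
            correct += 1
    precision = correct / attempted if attempted else 1.0
    recall = correct / answerable if answerable else 1.0
    return precision, recall
```

In this scoring, a pipeline that answers two fields and gets one wrong has 50% precision, while one that abstains on the uncertain field keeps precision at 100% and gives up some recall — exactly the trade-off we target.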

For generative tasks (plain-English explanations, goal suggestions, session summaries), we use a combination of human review and model-graded evaluation. We sample outputs weekly and review anything that scores below our confidence threshold.
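The weekly sampling loop can be as simple as the following sketch, where `grader` stands in for whatever scoring we run (human rubric or model-graded); the names and defaults are illustrative, not our production API:

```python
import random

def weekly_review_queue(outputs, grader, threshold=0.8,
                        sample_size=50, seed=0):
    """Sample generated outputs, grade each with `grader` (a callable
    returning a score in [0, 1]), and queue anything scoring below
    the confidence threshold for human review."""
    rng = random.Random(seed)  # seeded for reproducible audits
    sample = rng.sample(list(outputs), min(sample_size, len(outputs)))
    return [o for o in sample if grader(o) < threshold]
```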

What we don't do

We do not let models take actions. No AI in our products sends emails, files documents, moves money, or makes external API calls on behalf of users. Every action is initiated by a human after reviewing AI-generated content. This isn't a technical limitation — it's a deliberate constraint. In high-stakes domains, reversibility is a feature.

We also do not optimize for engagement. Our products don't send "you haven't logged your medication today" push notifications designed to maximize daily active users. The measure of success is whether the information we surface helps people make better decisions about their lives — not whether they open the app every day.

The research questions we're still working on

We are an early-stage product company, not a research lab. But we track several open problems that directly affect how we build:

  • Calibration in domain-shifted settings. Models trained on internet-scale text can be systematically miscalibrated when applied to specialized legal or medical language. We are exploring retrieval-augmented approaches that ground model outputs in verified source documents.
  • Multi-document reasoning. A user's estate situation involves a will, a trust, beneficiary designations on retirement accounts, and insurance policies — all of which interact. Cross-document reasoning is genuinely hard and not something we trust current models to do reliably without significant scaffolding.
  • User trust calibration. Research shows that people over-rely on confident AI outputs. We are studying how interface design choices — confidence indicators, source citations, explicit uncertainty signals — affect whether users engage critically with AI-generated content or accept it uncritically.

These are hard problems. We don't have complete solutions. What we have is a commitment to treating them as engineering constraints rather than footnotes — and to building products that respect the weight of the moments they inhabit.

oue.ai Research