AI Research · April 18, 2026 · 11 min read

Multiple Small Models: The Ensemble Architecture Behind Aura

Rather than relying on a single large frontier model, Aura runs an ensemble of continuously LoRA-tuned open-source models — each specialized, each small enough to improve rapidly on personal data. Here's why, and how it works.

When most people think about personalized AI, they imagine a large foundation model that somehow "remembers" you. The implicit mental model is a single large brain that knows everything about everyone. This is not how Aura is built — and we think our approach is meaningfully better for the use case.

Aura's inference architecture is an ensemble of multiple small open-source models, each continuously fine-tuned on synthetic personal data using LoRA adapters. New models are automatically evaluated and integrated as they are released. The ensemble routes queries to the right specialist and synthesizes responses across models when the query spans multiple domains.

This post explains the architecture in detail: why we made this choice, how the pipeline works, and where we think it leads.

Why not one large model?

The obvious alternative to an ensemble of small models is a single large frontier model — GPT-4, Claude, Gemini. These models are extraordinary at general reasoning. For Aura's specific requirements, they have three limitations that matter:

1. Personal fine-tuning is expensive at large scale. Fine-tuning a 70B+ parameter model on personal user data is prohibitively expensive for a per-user operation. LoRA fine-tuning of a 7B–14B parameter model costs a fraction of that and can be done continuously as new personal data accumulates. At our target scale — one personalized model adapter per user — small models are the only economically viable path to genuine personalization.

2. Large models are slow at inference. Streaming a response from a 70B model has materially higher latency than streaming from a 7B model, especially for the shorter, more frequent exchanges that characterize a daily-use personal assistant. Our p50 target for first token is under 300ms. Small models hit this reliably; large models require hardware that most inference providers don't offer at competitive cost.

3. Fine-tuning specializes better than prompting. A large model prompted with a user's context window is not the same as a small model fine-tuned on that user's data. Prompting uses context length at inference time. Fine-tuning changes the model weights. The latter produces more consistent personalization at lower per-query cost, and doesn't have the context-length limitations that constrain how much personal history you can inject at inference time.

The model zoo

Our ensemble currently spans five model families, each selected for different characteristics:

Qwen (Alibaba). The Qwen3 series — particularly the 8B and 14B variants — has produced the strongest results in our benchmarks for structured reasoning tasks: financial calculations, goal decomposition, timeline planning, and cross-domain inference. Qwen3's extended context window (up to 128K tokens) is particularly useful for personal context aggregation. When new Qwen versions are released (Qwen3.5, Qwen4), they enter our pipeline automatically within 48 hours of availability on HuggingFace.

Llama 3 (Meta). Llama 3.1 and 3.2 models excel at instruction following and conversational coherence. We use the 8B variant as the primary conversational backbone for Aura's chat interface. It handles multi-turn dialogue better than other models at its parameter count, which matters for a daily-use assistant where conversation history spans many exchanges.

Mistral / Mixtral. Mistral's models punch above their weight on knowledge retrieval tasks. We use Mistral Nemo (12B) and Mistral Small for queries that require synthesizing across the user's oue.ai data — connecting a health record to an insurance policy, or cross-referencing a retirement projection with an estate plan. Mixtral's mixture-of-experts architecture is also a direct influence on our ensemble routing logic.

Gemma 2 (Google). Gemma 2 9B is our default for document understanding tasks — parsing imported data exports from Google, Meta, and LinkedIn. It handles HTML, JSON, and structured document formats reliably, and its safety tuning aligns well with the sensitive personal data we process.

Phi-4 (Microsoft). Phi-4 (14B) is used for tasks requiring long-chain reasoning with relatively compact context — legal document interpretation, estate planning logic, insurance policy analysis. Microsoft's focus on reasoning efficiency at small scale makes Phi-4 particularly well-suited for tasks where we need the model to reason through a multi-step problem, not just retrieve a pattern.

LoRA fine-tuning with Unsloth

LoRA (Low-Rank Adaptation) is the fine-tuning method that makes per-user model personalization tractable. Instead of updating all model weights during training — which requires storing and computing gradients for billions of parameters — LoRA injects trainable low-rank matrices into specific layers of the model. This cuts the number of trainable parameters by more than 99% compared to full fine-tuning, while retaining most of its quality on the target task.
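The 99%+ figure is easy to verify with back-of-envelope arithmetic. The sketch below uses illustrative Llama-3-8B-like layer dimensions (hidden size 4096, MLP intermediate 14336, 32 layers, grouped-query attention with a 1024-dim KV projection) — these dimensions are assumptions for illustration, not Aura's exact configuration:

```python
# Compare trainable parameter counts: full fine-tuning vs. LoRA.
# Dimensions below are illustrative Llama-3-8B-like values (assumption).
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS = 4096, 14336, 1024, 32
RANK = 16  # LoRA rank r

# (d_in, d_out) for each targeted projection in one transformer block
projections = {
    "q_proj":    (HIDDEN, HIDDEN),
    "k_proj":    (HIDDEN, KV_DIM),
    "v_proj":    (HIDDEN, KV_DIM),
    "o_proj":    (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj":   (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Full fine-tuning trains every weight of every targeted matrix.
full = N_LAYERS * sum(d_in * d_out for d_in, d_out in projections.values())

# LoRA trains only the low-rank factors B (d_out x r) and A (r x d_in)
# whose product approximates the weight update: r * (d_in + d_out) params.
lora = N_LAYERS * sum(RANK * (d_in + d_out) for d_in, d_out in projections.values())

print(f"full fine-tuning: {full / 1e9:.2f}B trainable params")
print(f"LoRA (r={RANK}):  {lora / 1e6:.1f}M trainable params")
print(f"reduction: {100 * (1 - lora / full):.2f}%")
```

With these dimensions the targeted projections alone hold roughly 7B trainable weights, while the rank-16 adapters hold about 42M — a reduction well past 99%.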

We use Unsloth as our LoRA training framework. Unsloth provides 2× faster training and 60% lower VRAM usage compared to standard HuggingFace PEFT training, via kernel-fused operations and custom CUDA implementations. This matters enormously for our economics: we need to train per-user adapters on A100-class hardware, and Unsloth's efficiency directly determines how many adapters we can train per GPU-hour.

The training configuration for each user adapter:

  • Rank r=16, alpha=32 — higher rank captures more user-specific information; we've found 16 to be the sweet spot between capacity and overfitting on our synthetic data volumes
  • Target modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj — full attention and MLP projection coverage
  • Training on 512–2048 synthetic examples per update cycle, generated from the user's actual oue.ai data
  • Learning rate 2e-4 with cosine schedule, 3 epochs, gradient checkpointing enabled
  • Trained in 4-bit NF4 quantization (QLoRA) for memory efficiency
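The cosine schedule named in the configuration can be sketched in a few lines of stdlib Python. This is a minimal version without warmup (real trainers, e.g. HuggingFace's `get_cosine_schedule_with_warmup`, typically prepend a linear warmup phase; the batch size below is also an illustrative assumption, not from this post):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 2e-4,
              min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr to min_lr over total_steps.

    Minimal sketch of the schedule in the config above (lr 2e-4, cosine);
    warmup is omitted for brevity.
    """
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# 1024 examples, assumed batch size 8, 3 epochs -> 384 optimizer steps.
total = 1024 // 8 * 3
print(cosine_lr(0, total))      # starts at 2e-4
print(cosine_lr(total, total))  # decays to 0 by the final step
```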

A full adapter training run for a single user takes approximately 8–12 minutes on an A100, depending on data volume. Adapters are stored as separate files (typically 50–150MB) and hot-swapped at inference time using peft. This means we can serve the personalized model without reloading the base model — just swapping the LoRA adapter weights.
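Conceptually, hot-swapping means the base model stays resident while small per-user adapters are paged in and out. The sketch below illustrates that serving pattern with an LRU adapter cache; class and method names are hypothetical stand-ins, and in practice this maps onto peft's adapter loading and activation APIs:

```python
# Conceptual sketch of per-user adapter hot-swapping (names are
# hypothetical, not Aura's real serving code): one shared base model,
# small LoRA adapters swapped per request, LRU-cached in memory.
from collections import OrderedDict

class AdapterServer:
    def __init__(self, base_model_id: str, cache_size: int = 8):
        self.base_model_id = base_model_id   # base weights loaded once, shared
        self.cache = OrderedDict()           # user_id -> adapter, LRU order
        self.cache_size = cache_size
        self.active_user = None

    def _load_adapter(self, user_id: str):
        # Stand-in for reading a 50-150MB adapter file from encrypted storage.
        return {"user_id": user_id, "rank": 16}

    def activate(self, user_id: str):
        """Swap in a user's adapter without reloading the base model."""
        if user_id not in self.cache:
            if len(self.cache) >= self.cache_size:
                self.cache.popitem(last=False)   # evict least-recently-used
            self.cache[user_id] = self._load_adapter(user_id)
        self.cache.move_to_end(user_id)          # mark most-recently-used
        self.active_user = user_id
        return self.cache[user_id]

server = AdapterServer("llama-3.1-8b")
server.activate("alice")
server.activate("bob")
print(server.active_user)   # bob
```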

Synthetic data generation

Fine-tuning requires labeled training data. For personal AI, the challenge is generating training data that reflects the user's specific context without requiring the user to manually create it.

We solve this with a synthetic data generation pipeline:

Step 1: Context extraction. We pull structured data from all of the user's oue.ai products — goals from Tempo, health records from Nest, financial accounts from Atlas, education history from Edify, etc. This is the same context that populates Aura's inference-time system prompt.

Step 2: Question generation. A base model (currently Claude, used only for data generation, not deployed inference) generates diverse questions a user might ask about this context, ranging from simple lookups ("When does my homeowner's insurance renew?") to complex synthesis ("Given my retirement savings rate and expected Social Security benefit, what's my projected shortfall at age 67?").

Step 3: Answer generation and verification. The same base model generates gold-standard answers, with structured verification for any numeric claims. Answers that fail verification are rejected. The resulting (question, answer) pairs are formatted as instruction-tuning examples.

Step 4: Diversity augmentation. We paraphrase each question 3–5 times using a smaller model to increase training diversity. This prevents the adapter from overfitting to specific phrasings.

The pipeline generates 512–2048 training examples per user per update cycle. Update cycles are triggered by: significant new data being added to any oue.ai product, passage of 30 days since last training, or explicit user request.
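The three trigger conditions above reduce to a small predicate. A minimal sketch, assuming an illustrative "significant new data" threshold and field names that are not specified in this post:

```python
# Sketch of the update-cycle trigger: retrain when the user asks,
# when significant new data arrives, or when 30 days have passed.
# NEW_DATA_THRESHOLD is an assumed value for illustration.
from datetime import datetime, timedelta

RETRAIN_INTERVAL = timedelta(days=30)
NEW_DATA_THRESHOLD = 50   # assumed count of new records deemed "significant"

def should_retrain(last_trained: datetime, new_records: int,
                   user_requested: bool, now: datetime) -> bool:
    if user_requested:
        return True
    if new_records >= NEW_DATA_THRESHOLD:
        return True
    return now - last_trained >= RETRAIN_INTERVAL

now = datetime(2026, 4, 18)
print(should_retrain(datetime(2026, 4, 1), 3, False, now))   # False: 17 days, little data
print(should_retrain(datetime(2026, 3, 1), 0, False, now))   # True: over 30 days elapsed
```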

Ensemble routing and synthesis

At inference time, a query passes through a lightweight router (a fine-tuned Phi-3 Mini) that classifies it along two dimensions: domain (finance, health, legal, education, general) and task type (retrieval, reasoning, generation, synthesis). The router's output determines which models in the ensemble are invoked.

Simple queries go to a single model. Complex queries — those that span multiple domains or require multi-step reasoning — invoke multiple models, and their outputs are synthesized by a second-stage aggregator. For factual claims, the aggregator uses confidence-weighted majority voting; for generative responses, it selects the output of the model with the highest domain-specialization score.
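The routing and voting logic can be sketched as follows. The domain-to-model table and confidence scores are illustrative assumptions (the real router is a fine-tuned Phi-3 Mini classifier, not a lookup table):

```python
# Sketch of two-stage ensemble dispatch: route by domain, then
# confidence-weighted majority voting over factual answers.
# The specialist assignment below is illustrative, not Aura's actual table.
from collections import defaultdict

SPECIALISTS = {
    "finance": "qwen3-14b",
    "health":  "mistral-nemo-12b",
    "legal":   "phi-4",
    "general": "llama-3.1-8b",
}

def route(domains: list[str]) -> list[str]:
    """One domain -> one specialist; multi-domain -> every involved specialist."""
    return [SPECIALISTS.get(d, SPECIALISTS["general"]) for d in domains]

def vote(claims: list[tuple[str, float]]) -> str:
    """Confidence-weighted majority vote over (answer, confidence) pairs."""
    scores = defaultdict(float)
    for answer, confidence in claims:
        scores[answer] += confidence
    return max(scores, key=scores.get)

models = route(["finance", "health"])          # cross-domain query -> two models
answer = vote([("$1,200", 0.9), ("$1,150", 0.4), ("$1,200", 0.7)])
print(models)   # ['qwen3-14b', 'mistral-nemo-12b']
print(answer)   # $1,200
```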

The full ensemble adds approximately 40ms to p50 latency compared to single-model inference. For the quality improvement on complex cross-domain queries, we consider this acceptable.

Continuous model intake

The open-source model landscape changes rapidly. A model family that leads our benchmarks today may be superseded by a new release in three months. We've built a continuous intake pipeline that:

  1. Monitors HuggingFace model releases via the Hub API, filtered by our target parameter ranges (7B–32B) and license compatibility (Apache 2.0, MIT, Llama community license)
  2. Automatically runs new models through a benchmark suite covering our core task types, using a held-out synthetic evaluation set
  3. If a new model beats the current specialist in its category by >2% on our benchmarks, it enters a shadow deployment — serving a small fraction of traffic alongside the current model
  4. After 7 days of shadow traffic with no regression on user feedback signals, the new model replaces the specialist in production
  5. LoRA adapters trained on the previous model architecture are automatically retrained on the new base using knowledge distillation — the previous model's outputs on training examples serve as soft targets for the new adapter
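The promotion gate in steps 3 and 4 is essentially a pair of predicates. A minimal sketch, interpreting the 2% margin as relative and using an assumed record shape (the dataclass fields are illustrative, not the pipeline's real schema):

```python
# Sketch of the intake promotion decision: a candidate must beat the
# incumbent by >2% on benchmarks, then survive 7 days of shadow traffic
# without regressing user-feedback signals. Field names are assumptions.
from dataclasses import dataclass

BENCH_MARGIN = 0.02   # >2% relative benchmark improvement required
SHADOW_DAYS = 7       # required shadow-deployment duration

@dataclass
class Candidate:
    benchmark_score: float
    incumbent_score: float
    shadow_days_served: int
    feedback_regressed: bool

def enters_shadow(c: Candidate) -> bool:
    return c.benchmark_score > c.incumbent_score * (1 + BENCH_MARGIN)

def promote(c: Candidate) -> bool:
    return (enters_shadow(c)
            and c.shadow_days_served >= SHADOW_DAYS
            and not c.feedback_regressed)

c = Candidate(benchmark_score=0.84, incumbent_score=0.80,
              shadow_days_served=7, feedback_regressed=False)
print(enters_shadow(c), promote(c))   # True True
```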

This pipeline means that when Qwen releases Qwen3.5, or Meta releases Llama 4, or Google releases Gemma 3, our ensemble evaluates and potentially integrates them automatically — without requiring engineering intervention for each release.

Privacy architecture

User data never leaves their tenant during fine-tuning. Synthetic training data is generated within the tenant compute boundary. LoRA adapters are stored encrypted at rest and are not shared across users. The base model weights are shared (they're public open-source models), but the personalization layer — the adapter — is strictly per-user.

We do not use user data to improve shared models. The only signal that propagates across users is the aggregate benchmark performance of base models, which is used to update the ensemble composition. Individual user content and synthetic data are tenant-isolated.

Where this leads

The multiple-small-models architecture is not yet common in consumer AI products. Most products use a single large model behind a chat interface. We believe this will change as the economics of personalization become clearer: the value of a model that genuinely knows you — your specific goals, your family's health history, your exact financial situation — is substantially higher than the value of a model that knows everything about everyone generically.

The open-source ecosystem has reached the quality threshold where small models, fine-tuned well, are competitive with large models on specific tasks. Unsloth has lowered the cost of fine-tuning to the point where per-user adapters are economically viable. And the pace of open-source model releases means that the ensemble continues to improve without our engineering effort — it absorbs the progress of the broader research community automatically.

We're building Aura on this architecture not because it's the fashionable choice, but because we think it's the right one for building AI that's genuinely useful at the scale of an individual human life.


oue.ai Research
