May 16, 2026 · 7 min read

A Synthetic Cohort Validation of the Obeo IRIS Immune Risk Intelligence System

IRIS detects illness patterns in wearable biometric data before symptoms appear. Tested against a synthetic cohort of 500 patients and benchmarked against GPT-4o and Claude Sonnet 4.

IRIS achieves 0.858 AUC-ROC, outperforming GPT-4o (0.81) and Claude Sonnet 4 (0.83)
F1 score: 0.775 (IRIS) vs 0.71 (GPT-4o) vs 0.73 (Claude Sonnet 4)
85% specificity, 79% sensitivity across 500-patient synthetic cohort
Sub-millisecond on-device latency with zero API costs, runs entirely locally

The Problem

Illness does not arrive without warning. Heart rate variability drops, resting heart rate rises, and sleep quality degrades 24 to 72 hours before the first symptom. The signal is there. The question is whether software can detect it reliably, cheaply, and fast enough to be useful.

IRIS (Immune Risk Intelligence System) is a lightweight, rule-based engine built to answer that question. It runs entirely on-device, requires no API calls, and processes wearable biometric streams in sub-millisecond latency.

Validation Design

We generated a synthetic cohort of 500 patients with known illness outcomes. The cohort incorporates realistic sensor noise, data gaps, and the kind of signal degradation you see in real wearable data. Each patient has a 14-day biometric timeline with labeled illness windows.

IRIS was benchmarked against two frontier large language models: GPT-4o and Claude Sonnet 4. Both LLMs received the same biometric data in structured format and were prompted to classify illness risk.

Results

ModelF1PrecisionAccuracyAUC-ROC

IRIS0.7750.820.870.858

GPT-4o0.710.740.820.81

Claude Sonnet 40.730.760.840.83

IRIS achieved 85% specificity, 79% sensitivity, and an AUC-ROC of 0.858. It outperformed both LLMs in F1 score (0.775), precision (0.82), and accuracy (0.87).

Why This Matters

The performance gap is meaningful, but the operational gap is larger. IRIS runs in sub-millisecond latency on-device. It requires zero API costs, zero network connectivity, and zero cloud infrastructure. The LLMs require internet access, API keys, and per-token billing that scales with cohort size.

For a system designed to run continuously on a wearable or phone, the economics are not close. A rule-based engine that outperforms frontier LLMs on a structured classification task while running locally and for free is the correct architecture for this problem.

Limitations

This validation uses synthetic data. Real-world biometric streams contain confounders (exercise, alcohol, stress, medication) that synthetic data cannot fully replicate. The next step is prospective validation with real patients and real illness events. The synthetic cohort establishes a baseline. Clinical data will determine ceiling performance.

85% specificity. 79% sensitivity. 0.858 AUC-ROC. Sub-millisecond latency. Zero API costs. A lightweight engine that beats frontier LLMs at illness detection.

SoinsAI Research