7B Model Beats o3 and GPT-5 in Medical AI Vision Reasoning

Medical AI doesn’t just write explanations — it must truly see the evidence.

A Paradigm Shift: From Passive Viewing to Active Visual Reasoning

Traditional medical multimodal models encode images or videos into static visual features, then generate answers and explanations via large language models. But clinical decisions hinge on microscopic clues: a subtle lesion boundary, a fleeting surgical motion, or a transient endoscopic finding — often lasting only seconds.

When models passively ingest visual context, they frequently misfocus, miss pathologies, or misinterpret normal anatomy as abnormal. This limitation isn’t about language capability — it’s a fundamental gap in visual interaction intelligence.

To bridge this, the LeapQuest team (Shanghai Institute of Innovation) — in collaboration with Zhejiang University, Shanghai Jiao Tong University, and Fudan University — introduced two ICML 2026 papers pioneering the Think with Images and Think with Videos paradigms for medical AI:

✅ Models no longer merely generate explanations after viewing — they actively invoke visual tools mid-reasoning to re-examine critical regions or moments.
✅ Visual evidence transitions from input to an integral component of the reasoning chain.
✅ Explanation becomes evidence-verified inference, not post-hoc justification.

Core Concept Diagram

Ophiuchus: Think with Images for Precision Medical Imaging

Ophiuchus transforms LLMs into collaborative visual agents, dynamically selecting and applying domain-specific tools based on real-time reasoning needs:

🔍 SAM2: Pixel-precise lesion segmentation
🎯 BiomedParse: Text-guided anatomical structure localization
➕ Zoom-in: Adaptive magnification of suspicious regions

Each tool’s output is ingested as an observation token — directly feeding the next step in the reasoning chain.

Unlike external plugin architectures, Ophiuchus embeds tool usage into the CoT itself: the model learns when to call a tool, which tool to select, how to interpret outputs, and how to self-correct when evidence contradicts hypotheses.

Ophiuchus Framework

Benchmark Leadership: Small Model, Superior Vision

On 8 VQA benchmarks, Ophiuchus-7B achieves 68.0 average score, outperforming:

OpenAI-o3: 62.2
Gemini 2.5 Pro: 61.8
GPT-5: 59.9

It also attains 97.9% tool-call accuracy, proving that finer-grained visual interrogation — not just scale — drives clinical reliability.

Ophiuchus Performance Comparison

MedScope: Think with Videos for Clinical Temporal Reasoning

Where Ophiuchus solves fine spatial reasoning, MedScope tackles sparse, time-critical evidence in long clinical videos — such as laparoscopic procedures or endoscopic sequences.

Its “think with videos” approach mirrors expert clinician behavior:

🌐 First, build global contextual understanding
⏱️ Then, identify suspicious temporal windows
✂️ Use crop_video to extract key segments
🖼️ Apply get_frame to isolate decisive frames
🔗 Integrate localized observations into final diagnosis

This yields inherently auditable reasoning: every answer is traceable to which video segment was reviewed, which frame was selected, and how evidence supports the conclusion.

Textual vs Visual Chain-of-Thought

ClinVideoSuite & GA-GRPO: Training Evidence-Aware Agents

MedScope introduces:

📚 ClinVideoSuite: 635K timestamped captions, 254K evidence-linked QA pairs, and 34K visual-CoT trajectories
🧠 GA-GRPO: A grounding-aware reinforcement learning algorithm using evidence-modulated advantage to reward retrieval of clinically supportive visual snippets

Results show dramatic degradation without evidence rewards — e.g., R@0.5 drops from 40.1 → 33.2, confirming that answer-level supervision alone fails to teach reliable evidence selection.

MedScope Framework

ClinVideoSuite Data Pipeline

The Bigger Picture: Vision as Cognitive Process

Together, Ophiuchus and MedScope define a new foundation for clinical multimodal intelligence:
➡️ Reasoning is no longer a linear language sequence — it’s a closed-loop interaction among language tokens, visual tools, image regions, video segments, and evidence feedback.

Unified Visual Reasoning Loop

This shift enables three critical capabilities for trustworthy clinical AI:

✅ Reduced hallucination (evidence-grounded outputs)
✅ Stronger explainability (auditable visual provenance)
✅ Robust process handling (stepwise verification for complex workflows)

Clinical Agent Capabilities

Why This Is a Turning Point for Medical AI Agents

In medicine, every conclusion demands an evidence chain. Radiologists zoom margins. Pathologists scan cellular morphology. Surgeons replay maneuvers. Endoscopists track lesion evolution across time.

“Think with Images/Videos” aligns AI cognition with this inherently interactive, evidence-driven, and verifiable clinical practice — establishing an internal hypothesize → verify → refine → answer loop.

It redefines reasoning itself: not just language generation, but dynamic, goal-directed visual exploration.

When models can actively revisit imaging, magnify lesions, clip surgical clips, and validate findings — they move beyond answering questions toward performing clinical visual reasoning.

About LeapQuest

LeapQuest [Rising Horizon] is an interdisciplinary research team at the Shanghai Institute of Innovation, advancing next-generation medical AI agents, visual reasoning, and multimodal foundation models — focused on Visual Reasoning, Agentic RL, and Clinical Tool Integration.

🔗 GitHub Repositories:
– MedScope | Think with Videos: https://github.com/SII-WenjieLisjtu/MedScope
– Ophiuchus | Think with Images: https://github.com/SII-zyj/Ophiuchus

Article sourced from QuantumBit, authored by LeapQuest Team.