Preprint · arXiv 2605.02943

Healthcare AI GYM for Medical Agents

Upstage AI
Abstract

Clinical reasoning demands multi-step interactions — gathering patient history, ordering tests, interpreting results, and making safe treatment decisions — yet no unified environment exists to train generalizable medical AI agents through reinforcement learning. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on Healthcare AI GYM, a Gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks and domain-specific tools.

Our analysis reveals three compounding pathologies absent from single-turn settings: response explosion, multi-turn collapse into verbose monologues, and distillation instability — all stemming from the misalignment of sparse terminal rewards with sequential clinical trajectories. To stabilize training, we propose Turn-Level Truncated On-Policy Distillation (TT-OPD), a self-distillation framework where a gradient-free EMA teacher provides dense, outcome-aware KL regularization at every conversation turn — achieving accuracy comparable to vanilla GRPO with controlled response length and sustained tool use. We further identify a fundamental agentic–textual transfer gap: RL improves procedural competence but does not transfer to text-based QA benchmarks. The environment, training pipeline, and all experimental artifacts are publicly available.

How it works

The overall workflow, step by step

Healthcare AI GYM closes the loop between a clinical agent, a Gymnasium environment, a tool & knowledge ecosystem, and a reward-driven RL trainer.

[Figure: system diagram. The LLM / VLM agent (Qwen3.5-9B, text + vision; multi-turn think · search · submit; policy πθ(aₜ | sₜ)) sends action aₜ to Healthcare AI GYM (reset() · step(a) · render(); 10 clinical domains · 3.6K+ tasks) and receives observation sₜ₊₁, for up to Tₘₐₓ turns. The environment sends calls to the 135 clinical tools (Search, Assess, Actions, Reasoning) and queries to the 828K-passage KB (MedCPT, BioQA; BM25 retrieval · SQLite), receiving results and passages. The trajectory τ is scored by the 5D reward (Acc · Proc · Safe · Fmt · Coh, plus cosine length control), and R_total drives the RL trainer (GRPO + TT-OPD), whose policy-gradient update ∇θ ℒ updates the agent.]
01 reset()

Patient case is loaded

The environment samples a clinical task and returns the first observation — a patient scenario plus the available tool menu — to the agent.
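
The reset() and step(a) calls follow the standard Gymnasium convention: reset() returns (observation, info) and step() returns (observation, reward, terminated, truncated, info). A minimal sketch of one episode under that convention is shown here. `ToyClinicalEnv`, its patient case, the tool names and the reward values are hypothetical stand-ins, not the Healthcare AI GYM API.

```python
from dataclasses import dataclass, field


@dataclass
class ToyClinicalEnv:
    """Toy stand-in that mimics the Gymnasium reset()/step() signatures.

    The case text, tool names and rewards are hypothetical.
    """
    max_turns: int = 8
    turn: int = field(default=0, init=False)

    def reset(self, seed=None):
        # Gymnasium convention: reset() returns (observation, info).
        self.turn = 0
        obs = {
            "case": "Adult patient with fever, hypotension and raised lactate.",
            "tools": ["search", "submit"],
        }
        return obs, {}

    def step(self, action):
        # Gymnasium convention: step() returns
        # (observation, reward, terminated, truncated, info).
        self.turn += 1
        if action["tool"] == "submit":
            reward = 1.0 if action["answer"] == "sepsis" else 0.0
            return {"result": "episode over"}, reward, True, False, {}
        truncated = self.turn >= self.max_turns
        obs = {"result": f"passages for query: {action['query']}"}
        return obs, 0.0, False, truncated, {}


def run_episode(env, policy):
    """Roll out one multi-turn episode; return (trajectory, total reward)."""
    obs, _ = env.reset()
    trajectory, total = [], 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((action, reward))
        total += reward
        done = terminated or truncated
    return trajectory, total


def scripted_policy(obs):
    # Search once, then submit. A real agent would be an LLM call.
    if "case" in obs:
        return {"tool": "search", "query": "fever hypotension lactate"}
    return {"tool": "submit", "answer": "sepsis"}


trajectory, total_reward = run_episode(ToyClinicalEnv(), scripted_policy)
```

The reward arrives only on the final submit action, which is the sparse terminal signal that the method section sets out to densify.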

    Contributions

    Five contributions

    01

    Multi-Domain Medical Gymnasium

    A Gymnasium-compatible environment: 10 clinical domains, 3.6K+ tasks, 135 domain-specific tools, an 828K-passage knowledge base, and a safety-aware 5D reward.

    02

    Systematic RL Benchmark

    A rigorous comparison of GRPO, DAPO, Dr. GRPO and GSPO that exposes the trade-off between peak accuracy (62.0%) and training stability.

    03

    TT-OPD Framework

    Outcome-aware regularization: correctness signals are injected into the teacher's context but withheld from the student, giving dense turn-by-turn guidance with controlled response length.

    04

    Systematic OPD Failure Analysis

    Ablations trace the failure progression — KL collapse → response explosion — and identify multi-turn collapse as an agentic-specific failure mode absent from single-turn OPD.

    05

    Transfer Gap Analysis

    RL-driven procedural competence (+22% composite reward) does not automatically translate to text QA, due to a 51:1 format-reward dilution ratio.

    The environment

    Inside Healthcare AI GYM

    10 Clinical Domains

      135 Tools · 4 types

      • Evidence Retrieval: BM25 over 828K passages
      • Clinical Assessment: 22 validated scores, including SOFA, NEWS and PHQ-9
      • Intervention Actions: labs, imaging, prescribing, triage
      • Reasoning Scaffolds: differential dx, planning, ICD-10

      828K-Passage Knowledge Base

      PubMed abstracts, clinical guidelines (AHA, ACOG, SSC) and textbooks, indexed via BM25 and exposed as tool calls. Built on Self-BioRAG and OLAPH.
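
To make the retrieval step concrete, here is a small sketch of BM25 ranking in pure Python. It uses the common Okapi formulation with a non-negative IDF; the tokenizer, the parameter values k1 = 1.5 and b = 0.75, and the three example passages are assumptions for illustration and are not taken from the GYM's index.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric tokens; a real index would use a proper analyzer.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, passages, k1=1.5, b=0.75):
    """Return passage indices sorted by BM25 score, best first."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    scores = [score(d) for d in docs]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Hypothetical passages, used only to exercise the ranking.
passages = [
    "Lactate measurement in suspected sepsis.",
    "PHQ-9 is a screening score for depression.",
    "Imaging options for acute abdominal pain.",
]
ranking = bm25_rank("sepsis lactate", passages)
```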

      Safety-Aware 5D Reward

      R = w_Acc·Acc + w_Proc·Proc + w_Safe·Safe + w_Fmt·Fmt + w_Coh·Coh

        A critical safety violation caps the composite score at 0.1 — directly countering format-reward dilution. An optional assertion dimension (0.15) is added when rubric annotations exist.
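
The safety cap can be sketched as a weighted sum followed by a clamp. The weights below are hypothetical, since the page does not list the actual values; only the 0.1 cap comes from the text. The optional assertion dimension is not modelled.

```python
# Hypothetical weights: the page does not list the actual values.
WEIGHTS = {"acc": 0.4, "proc": 0.2, "safe": 0.2, "fmt": 0.1, "coh": 0.1}
SAFETY_CAP = 0.1  # cap stated on the page for critical safety violations


def composite_reward(scores, critical_violation=False, weights=WEIGHTS):
    """Weighted sum of the five dimensions, capped on a critical violation."""
    total = sum(weights[k] * scores[k] for k in weights)
    if critical_violation:
        total = min(total, SAFETY_CAP)
    return total


# A trajectory that is accurate and well formatted but unsafe.
scores = {"acc": 1.0, "proc": 0.8, "safe": 0.0, "fmt": 1.0, "coh": 0.9}
uncapped = composite_reward(scores)
capped = composite_reward(scores, critical_violation=True)
```

Without the cap, this trajectory would still collect most of its reward from the other four dimensions; with it, the score cannot exceed 0.1.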

        Method

        TT-OPD: stabilizing multi-turn agentic RL

        On-policy distillation collapses in agentic settings because the teacher goes stale as the student explores. TT-OPD keeps a gradient-free EMA teacher and feeds it outcome-aware hints that the student never sees, turning sparse terminal rewards into dense, per-turn guidance.

        [Figure: one TT-OPD optimization step. (1) Student πθₛ: rollout of n = 3 multi-turn trajectories (think / search / submit). (2) Classification: each outcome maps to a hint, a confirmatory cue when correct (✓) or a diagnostic redirection when wrong (✗). (3) Teacher πθ_T: an EMA copy of the student that re-scores the actions with the hint in context, log πθ_T(aₜ | sₜ⁺). (4) Policy update: ℒ_GRPO(R_cos) and λ · D_KL(πθₛ ‖ πθ_T), turn-level and truncated; ∇θₛ ℒ updates the student. EMA (no gradient): θ_T ← α·θ_T + (1−α)·θ_S.]
        01 rollout

        Student rollout

        For each prompt, the student samples n = 3 multi-turn trajectories — interleaving think, search and submit actions in the GYM.

          θ_T

          Gradient-free EMA teacher

          θ_T ← 0.995·θ_T + 0.005·θ_S, updated every 5 steps with a hard-copy fallback every 30 — the teacher tracks the student without ever taking a gradient.
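
The update rule above can be sketched with scalar "weights" standing in for model parameters. The decay 0.995 and the intervals 5 and 30 come from the text. Treating the hard copy as unconditional at every 30th step is an assumption, since the page calls it a fallback without stating its trigger.

```python
ALPHA = 0.995         # EMA decay from the page
EMA_EVERY = 5         # EMA update interval from the page
HARD_COPY_EVERY = 30  # assumed unconditional; the page calls it a fallback


def update_teacher(teacher, student, step):
    """Update teacher weights in place from student weights, no gradients."""
    if step % HARD_COPY_EVERY == 0:
        for name, value in student.items():
            teacher[name] = value
    elif step % EMA_EVERY == 0:
        for name, value in student.items():
            teacher[name] = ALPHA * teacher[name] + (1 - ALPHA) * value
    return teacher


# Scalars stand in for parameter tensors.
teacher = {"w": 0.0}
student = {"w": 1.0}
history = {}
for step in range(1, 31):
    update_teacher(teacher, student, step)
    history[step] = teacher["w"]
```

With a decay this close to 1, the teacher moves only 0.5% of the way toward the student per EMA update, which is why a periodic hard copy is needed to keep it from lagging far behind.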

          h(τ)

          Outcome-conditioned hints

          Correct trajectories get confirmatory cues; incorrect ones get corrective redirection. Hint tokens enter the teacher's context but are excluded from its log-probabilities, and the student never sees them.
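
One way to picture the hint mechanism and the resulting per-turn signal is sketched here. The hint wording, the `max_tokens` truncation rule and all numbers are hypothetical. The KL term is estimated from student-sampled action tokens as the mean of log πθₛ − log πθ_T, a standard sample-based estimator of D_KL(πθₛ ‖ πθ_T). Reading "truncated" as keeping only the first few tokens of each turn is an assumption.

```python
from collections import defaultdict

# Hypothetical hint wording; the page does not give the actual prompts.
HINTS = {
    True: "Hint: the final answer in this trajectory is correct.",
    False: "Hint: the final answer is wrong; reconsider the differential.",
}


def teacher_context(student_context, is_correct):
    """Prepend the outcome-conditioned hint; only the teacher sees this."""
    return HINTS[is_correct] + "\n" + student_context


def per_turn_kl(student_logps, teacher_logps, turn_ids, max_tokens=2):
    """Sample-based estimate of KL(student || teacher) for each turn.

    Inputs are log-probs of the same student-sampled action tokens, so hint
    tokens never enter the sum. Only the first `max_tokens` tokens of each
    turn are kept (assumed meaning of "truncated").
    """
    ratios = defaultdict(list)
    for s, t, turn in zip(student_logps, teacher_logps, turn_ids):
        if len(ratios[turn]) < max_tokens:
            ratios[turn].append(s - t)
    return {turn: sum(v) / len(v) for turn, v in ratios.items()}


ctx = teacher_context("Patient: fever and hypotension.", is_correct=False)

# Toy log-probs for five action tokens spread over two turns.
student_logps = [-0.5, -1.0, -0.2, -0.7, -0.3]
teacher_logps = [-0.6, -1.5, -0.1, -0.9, -0.3]
turn_ids = [0, 0, 0, 1, 1]
kl = per_turn_kl(student_logps, teacher_logps, turn_ids)
```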

          R_cos

          Cosine length control

          Concise correct answers are rewarded most; reward decays with length and incorrect answers are penalized more as they grow — preventing monotonic explosion toward L_max.
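
A length-aware reward with this shape can be sketched with cosine interpolation between two endpoint values. The endpoints (1.0 and 0.5 for correct answers, −0.5 and −1.0 for incorrect ones) and `max_len` are hypothetical, chosen only to match the qualitative description; they are not the values used in training.

```python
import math


def cosine_interp(length, max_len, at_zero, at_max):
    """Cosine interpolation from at_zero (length 0) to at_max (max_len)."""
    frac = min(max(length / max_len, 0.0), 1.0)
    return at_max + 0.5 * (at_zero - at_max) * (1.0 + math.cos(math.pi * frac))


def cosine_length_reward(is_correct, length, max_len=8192):
    # Endpoint values are hypothetical, chosen to match the description.
    if is_correct:
        # Short correct answers earn the most; reward decays with length.
        return cosine_interp(length, max_len, at_zero=1.0, at_max=0.5)
    # Incorrect answers are penalized more as they grow.
    return cosine_interp(length, max_len, at_zero=-0.5, at_max=-1.0)


lengths = [0, 2048, 4096, 6144, 8192]
correct_curve = [cosine_length_reward(True, n) for n in lengths]
wrong_curve = [cosine_length_reward(False, n) for n in lengths]
```

Both curves fall as length grows, so no trajectory gains reward merely by getting longer.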

          Self-distillation is fragile: three failure modes

          Distillation is most valuable early (steps 1–40) and should be monotonically phased out — not adaptively toggled. Each run dies a different death.

          v29

          Teacher corruption cascade

          Unconditional EMA absorbs corrupted student weights; once KL > 0.7 a positive feedback loop collapses accuracy to 0%.

          death @ step 70 · 10 steps
          v30

          Frozen teacher gap cascade

          A static distillation coefficient (4.0) makes the distillation term 10–20× larger than the RL gradient; when the frozen teacher goes stale, it overwhelms optimization.

          death @ step 74 · 50 steps
          v31 · novel

          Adaptive re-engagement explosion

          After distillation auto-disables for 100+ steps, a natural KL drop re-engages a 120-step-stale teacher → grad explosion (142→301) → permanent learning arrest.

          death @ step 145 · 5 steps
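
The recommendation to phase distillation out monotonically, rather than toggling it adaptively, can be sketched as a schedule that decays to zero and never turns back on. The linear shape and the starting value are assumptions; only the 40-step window comes from the text.

```python
def distill_coef(step, start=1.0, phase_out_end=40):
    """Linearly decay the distillation coefficient to zero, then keep it off.

    Never re-enabling after `phase_out_end` avoids re-engaging a stale
    teacher, the trigger of the third failure mode described above.
    """
    if step >= phase_out_end:
        return 0.0
    return start * (1.0 - step / phase_out_end)


schedule = [distill_coef(s) for s in range(0, 151)]
```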
          Results

          TT-OPD wins 10 of 18 benchmarks

          All models use a Qwen3.5-9B backbone. Base (text) is single-turn log-prob evaluation with no tools; Base+AR adds the multi-turn AgentRunner (135 tools + 828K KB) without RL; GRPO and TT-OPD are RL-trained. Best per row is marked with an asterisk (*).

          Category   Benchmark           Base (text)  Base+AR  GRPO    TT-OPD
          MC QA      MedQA (USMLE)       70.7         78.8     85.5    87.1*
                     MMLU-Med. (6 sub.)  83.8*        60.6     60.1    65.5
                     MedMCQA             63.8         55.8     58.0    66.2*
          Visual QA  VQA-RAD             52.5         63.2*    60.7    63.1
                     PathVQA             40.5         38.7     41.5    45.3*
                     SLAKE               79.0*        30.6     29.5    32.1
                     PMC-VQA             57.9*        35.1     34.2    38.9
                     VQA-Med-2021        8.6          9.8      10.7    15.2*
                     Quilt-VQA           25.2         27.8     25.2    30.7*
          EHR        MIMIC-III           58.5         62.1     61.1    62.7*
                     eICU                53.2         55.9     55.5    57.1*
          LFQA       LiveQA              53.2         58.2     57.7    62.5*
                     MedicationQA        49.5         53.1     55.8    60.9*
                     HealthSearchQA      39.8         41.9     39.5    45.3*
                     KQA-Golden          55.7         62.1     65.3*   64.1
                     KQA-Silver          52.5         61.7     64.9*   62.8
          Broad competence. TT-OPD is best on 10/18 benchmarks across MC QA, Visual QA, EHR and LFQA — +3.9 pp average over Base+AR.
          Agentic overhead. Multi-turn evaluation trades parametric precision for retrieval-augmented reasoning (MMLU 83.8 → 60.6 → 65.5).
          GRPO peaks on recall. Vanilla GRPO leads on KQA-Golden/Silver — higher peak training accuracy aids open-ended factual recall.
          Stability, not raw accuracy. TT-OPD's win is controlled length (5.7–9.3K tok) and sustained 7.0–7.4 turns throughout training.
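
The win count can be re-derived from the rows in the table above. The page lists 16 rows, so the average gain computed below covers only those rows and need not match the reported 18-benchmark figure exactly.

```python
# (Base text, Base+AR, GRPO, TT-OPD) for the 16 rows listed on the page.
ROWS = {
    "MedQA (USMLE)": (70.7, 78.8, 85.5, 87.1),
    "MMLU-Med. (6 sub.)": (83.8, 60.6, 60.1, 65.5),
    "MedMCQA": (63.8, 55.8, 58.0, 66.2),
    "VQA-RAD": (52.5, 63.2, 60.7, 63.1),
    "PathVQA": (40.5, 38.7, 41.5, 45.3),
    "SLAKE": (79.0, 30.6, 29.5, 32.1),
    "PMC-VQA": (57.9, 35.1, 34.2, 38.9),
    "VQA-Med-2021": (8.6, 9.8, 10.7, 15.2),
    "Quilt-VQA": (25.2, 27.8, 25.2, 30.7),
    "MIMIC-III": (58.5, 62.1, 61.1, 62.7),
    "eICU": (53.2, 55.9, 55.5, 57.1),
    "LiveQA": (53.2, 58.2, 57.7, 62.5),
    "MedicationQA": (49.5, 53.1, 55.8, 60.9),
    "HealthSearchQA": (39.8, 41.9, 39.5, 45.3),
    "KQA-Golden": (55.7, 62.1, 65.3, 64.1),
    "KQA-Silver": (52.5, 61.7, 64.9, 62.8),
}

# A win means TT-OPD is strictly higher than the other three columns.
tt_opd_wins = sum(1 for s in ROWS.values() if s[3] > max(s[:3]))

# Mean gain of TT-OPD over Base+AR on the listed rows, in percentage points.
avg_gain_over_ar = sum(s[3] - s[1] for s in ROWS.values()) / len(ROWS)
```

On the listed rows this gives 10 wins for TT-OPD and an average gain of about 4.0 pp over Base+AR.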
          [Figure: training dynamics (validation accuracy, KL divergence, response length, average turns) for TT-OPD vs GRPO.]
          Training dynamics over 60 steps. TT-OPD (red) controls response length and preserves 7.0–7.4 turns, while GRPO oscillates and EMA-only distillation declines toward single-turn behavior.
          Deeper analysis

          Does RL internalize tool use — or copy-paste the prompt?

          A follow-up study asks whether agentic RL writes tool-use reasoning into the model's weights, or whether the agent is just reading the tool spec from its context window. We answer it with Progressive Spec Withdrawal (PSW) — a curriculum that strips tool definitions out of the prompt during training — and a forensic taxonomy of 31,500+ tool calls.

          55
          Tool vocabulary collapse: unique tools invoked fall 55 → 19 once specs are withdrawn
          21%
          Reward-driven pruning: search_pubmed share rises 21.2% → 93.5%; the model keeps only its most reward-efficient tool
          0%
          Already latent: base, GRPO and PSW all hit the same 86% on MedQA with no tool specs; RL adds no new parametric tool encoding

          Accuracy rises after specs are removed (58.0% → 61.7%, close to GRPO's 62.0% peak), and PSW reaches 90% when specs are present. The curriculum therefore teaches the model to exploit tool definitions better, rather than to memorize a capability that the base model's post-training had already provided.
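
A spec-withdrawal curriculum of this kind can be sketched as a function that decides how many tool definitions stay in the prompt at each training step. The linear schedule, the withdrawal order, the step boundaries and every tool name except search_pubmed are hypothetical; the page does not describe the actual schedule.

```python
def visible_tool_specs(tool_specs, step, withdraw_start, withdraw_end):
    """Return the subset of tool specs kept in the prompt at a training step.

    Specs are dropped from the end of the list, linearly between
    `withdraw_start` and `withdraw_end` (an assumed schedule).
    """
    if step <= withdraw_start:
        keep = len(tool_specs)
    elif step >= withdraw_end:
        keep = 0
    else:
        frac_left = (withdraw_end - step) / (withdraw_end - withdraw_start)
        keep = round(frac_left * len(tool_specs))
    return tool_specs[:keep]


# Hypothetical tool names, apart from search_pubmed.
specs = ["search_pubmed", "calc_sofa", "order_lab", "submit_answer"]
early = visible_tool_specs(specs, step=50, withdraw_start=100, withdraw_end=300)
middle = visible_tool_specs(specs, step=200, withdraw_start=100, withdraw_end=300)
late = visible_tool_specs(specs, step=300, withdraw_start=100, withdraw_end=300)
```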

          A taxonomy of tool hallucination

          Across 31,500+ tool calls, failures fall into four distinct modes — each with its own cause and its own signature over training.

          SH

          Schema Hallucination

          A real tool is called with parameters that don't exist in its schema.

          < 0.1% · spikes early, fades at convergence
          PTI

          Phantom Tool Invocation

          Entirely fabricated tools (web_search, google_search) that were never defined.

          driven by pre-training leakage
          SM

          Structural Malformation

          Broken JSON or unclosed tags that abort the call before it runs.

          🐤 the canary — primary stability predictor
          STM

          Semantic Tool Misuse

          Correct syntax, wrong tool for the job — the right call in the wrong place.

          causes redundant retrieval loops

          RL self-corrects its own vocabulary. When specs are withdrawn, phantom tool names surface around steps 35–100 — then reward pressure prunes them away by step 300. Only tools that produce valid, correct answers survive.
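
Three of the four modes can be detected from the call alone, as the sketch below illustrates. The JSON call format with "name" and "arguments" keys, the registry contents and the submit_answer tool are assumptions; search_pubmed and web_search are the names mentioned above. Semantic Tool Misuse depends on clinical context and is not covered.

```python
import json

# Hypothetical registry: tool name -> allowed parameter names.
TOOL_SCHEMAS = {
    "search_pubmed": {"query", "top_k"},
    "submit_answer": {"answer"},
}


def classify_tool_call(raw_call, schemas=TOOL_SCHEMAS):
    """Label a raw tool call as SM, PTI, SH or OK.

    STM (right syntax, wrong tool for the situation) needs clinical context
    and cannot be detected from the call alone, so it is not handled here.
    """
    try:
        call = json.loads(raw_call)
        name, args = call["name"], call["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return "SM"   # structural malformation: call aborts before running
    if not isinstance(name, str) or name not in schemas:
        return "PTI"  # phantom tool invocation: tool was never defined
    if not isinstance(args, dict) or set(args) - schemas[name]:
        return "SH"   # schema hallucination: unknown parameter names
    return "OK"


labels = [
    classify_tool_call('{"name": "search_pubmed", "arguments": {"query": "sepsis"}}'),
    classify_tool_call('{"name": "web_search", "arguments": {"query": "sepsis"}}'),
    classify_tool_call('{"name": "search_pubmed", "arguments": {"query": "x", "site": "y"}}'),
    classify_tool_call('{"name": "search_pubmed", "arguments": {"query": '),
]
```

Counting these labels per training step gives the kind of time series the taxonomy describes, for example a rise and later decline in PTI after specs are withdrawn.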

          Citation

          Cite this work

          @article{jeong2026healthcare,
            title   = {Healthcare AI GYM for Medical Agents},
            author  = {Jeong, Minbyul},
            journal = {arXiv preprint arXiv:2605.02943},
            year    = {2026}
          }