Project: Building a Preventive Cardiology Instruction Dataset for Local LLM Fine-Tuning

From March to June 2026, I’m running a focused open-source project to create a specialized dataset for fine-tuning small, locally runnable large language models (LLMs) on preventive cardiology topics.

I’m applying my background in symbolic reasoning, knowledge representation, and now transformer-based systems to address a persistent gap: most medical LLMs still struggle with deep preventive reasoning, especially in early atherosclerosis detection, coronary artery calcium (CAC) scoring, carotid intima-media thickness (CIMT), and the limitations of stress testing. Critically, they also miss nonlinear and autonomic predictors such as heart rate variability (HRV) complexity (multiscale entropy, detrended fluctuation analysis, power-law slopes) and its interplay with plaque vulnerability and stroke risk.
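
To make the nonlinear metrics concrete: below is a minimal sketch of detrended fluctuation analysis (DFA) over an RR-interval series, using only NumPy. The window sizes, function name, and synthetic input are my own illustrative choices, not part of the project pipeline.

```python
import numpy as np

def dfa_alpha(rr_intervals, scales=(4, 8, 16, 32, 64)):
    """Estimate the DFA scaling exponent (alpha) of an RR-interval series.

    Healthy autonomic dynamics typically show alpha near 1 (1/f-like
    fractal correlations); drift toward 0.5 (uncorrelated noise) or 1.5
    reflects loss of physiological complexity.
    """
    x = np.asarray(rr_intervals, dtype=float)
    # Integrated, mean-centered profile of the series.
    y = np.cumsum(x - x.mean())
    fluctuations = []
    for n in scales:
        rms = []
        for i in range(len(y) // n):
            seg = y[i * n:(i + 1) * n]
            t = np.arange(n)
            # Detrend each window with a least-squares line.
            coeffs = np.polyfit(t, seg, 1)
            rms.append(np.sqrt(np.mean((seg - np.polyval(coeffs, t)) ** 2)))
        fluctuations.append(np.mean(rms))
    # The slope of log F(n) versus log n is the scaling exponent alpha.
    alpha, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return alpha

# Illustrative check only: white noise should yield alpha near 0.5.
print(dfa_alpha(np.random.default_rng(0).normal(800, 50, 2000)))
```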

The goal is a high-quality, clinician-aligned instruction dataset (1,000–5,000+ examples) that enables privacy-safe, offline-capable models for education, patient empowerment, and local experimentation, without relying on massive proprietary data or cloud dependencies. In other words, everything can run on a sovereign, air-gapped machine.

The process emphasizes quality over quantity, starting with manual curation from trusted sources (AHA/ESC guidelines, PubMed abstracts, the Stanford inherited cardiomyopathy dataset, and anonymized preventive patterns) and expanding through synthetic generation grounded in real medical knowledge.
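
To give a sense of the target format, a single curated seed record might look like the sketch below; the field names and the sample Q&A are my working assumptions for illustration, not a finalized schema or entry.

```python
import json

# Hypothetical seed record; field names are a working schema, not final.
seed = {
    "instruction": "A patient asks whether a coronary artery calcium (CAC) "
                   "score of zero means they can skip statin therapy. How "
                   "should a clinician frame the answer?",
    "response": "A CAC score of zero indicates very low short-term "
                "atherosclerotic risk and can support deferring statins in "
                "selected intermediate-risk patients, but it does not "
                "override strong risk enhancers such as diabetes, active "
                "smoking, or familial hypercholesterolemia; reassessment "
                "over time is still warranted.",
    "source": "curated: AHA primary prevention guideline (paraphrased, "
              "illustrative)",
    "tags": ["CAC", "risk-stratification", "statin-decision"],
}

with open("seed_examples.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(seed, ensure_ascii=False) + "\n")
```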

A strong emphasis will be placed on nonlinear/autonomic dimensions: capturing how reduced HRV complexity and fractal dynamics contribute to cardiovascular and stroke risk, independently or synergistically with structural plaque findings.

I’ll incorporate cardiologist feedback for preference optimization, ensuring outputs reflect accurate, nuanced reasoning rather than hallucinations. By releasing the dataset openly on Hugging Face (with full documentation, ethics statement, and sample fine-tunes), the project aims to support accessible tools for preventive cardiology education and research, all running locally on consumer hardware.

Key Steps in the Project Timeline:

  • Weeks 1–4: Define scope, collect and manually curate 200–500 seed examples from guidelines, literature, and preventive concepts (with heavy focus on nonlinear HRV/autonomic predictors in plaque/stroke contexts).
  • Weeks 5–8: Generate 2,000–4,000 synthetic variations using strong base models, with heavy manual review for factual alignment and diversity (a generation sketch follows this list).
  • Weeks 9–12: Share batches with cardiologists for corrections, ranking, and preference data creation to enable DPO-style optimization (see the preference-pair sketch below).
  • Ongoing: Deduplicate, format as instruction-response JSONL, add chain-of-thought reasoning, and test on small models for quality validation (a minimal dedup sketch appears after this list).
  • Final month: Package dataset with README, ethics notes, and baseline fine-tune results; release on Hugging Face.
  • Throughout: Document progress publicly (blog posts, short updates) to share learnings and invite community input.
  • Tools: Leverage MLX on Apple Silicon for local generation/review, simple Python scripts for processing, and clinician loop for alignment.
  • Focus areas: 40% plaque/atherosclerosis/CAC/CIMT/prevention myths; 50% nonlinear HRV/autonomic predictors and complexity loss in CV/stroke risk; 10% hybrid cases combining structural and dynamic markers.
  • Expected outcome: A niche, open dataset that powers better local LLMs for preventive education, with particular strength in underrepresented nonlinear/autonomic reasoning.
  • Call to action: If you’re a cardiologist or researcher interested in reviewing batches or contributing preference data — especially on HRV complexity or plaque-HRV interplay — please reach out.
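
For the synthetic-generation step in weeks 5–8, a minimal sketch using the mlx-lm Python API could look like the following; the model name, prompt template, and grounding text are placeholders I chose for illustration, not the project's final setup.

```python
from mlx_lm import load, generate

# Model choice is a placeholder; any instruct model converted to MLX works.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# The grounding text would come from a curated guideline excerpt;
# this string is a stand-in for illustration only.
grounding = "CAC scoring quantifies calcified coronary plaque burden..."

prompt = (
    "You are writing training data for a preventive-cardiology tutor.\n"
    "Using ONLY the following source text, write one question a patient "
    f"might ask and a clinically careful answer.\n\nSOURCE:\n{grounding}\n"
)

draft = generate(model, tokenizer, prompt=prompt, max_tokens=400)
print(draft)  # Every draft still goes through manual factual review.
```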
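
For the preference data in weeks 9–12, the usual DPO convention is a prompt paired with a chosen and a rejected completion, as consumed by common tooling such as Hugging Face TRL; the record below is a hypothetical example of how a cardiologist ranking might be serialized.

```python
import json

# Hypothetical ranked pair from a cardiologist review batch.
pair = {
    "prompt": "Does a normal stress test rule out significant coronary "
              "artery disease?",
    "chosen": "No. Stress testing mainly detects flow-limiting stenoses, "
              "so non-obstructive but rupture-prone plaque can be missed; "
              "a normal result lowers, but does not eliminate, risk.",
    "rejected": "Yes, a normal stress test means the coronary arteries "
                "are healthy.",
}

with open("preference_pairs.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(pair) + "\n")
```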
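
And for the ongoing deduplication pass, exact matching over normalized text is often sufficient at this dataset size. This sketch (my own, with hypothetical filenames) drops records whose normalized instruction has already been seen.

```python
import hashlib
import json

def norm(text: str) -> str:
    # Case-fold and collapse whitespace so trivial variants collide.
    return " ".join(text.lower().split())

seen, kept = set(), []
with open("raw_dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        key = hashlib.sha256(norm(rec["instruction"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(rec)

with open("dataset_dedup.jsonl", "w", encoding="utf-8") as f:
    for rec in kept:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```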

A blog tracking the project will be available on this website in early March.