Lesson 02 of 10
AI Healthcare Quality & Safety

How AI Learns
Machine Learning, Deep Learning & NLP

Understanding how AI systems learn is not optional technical knowledge for healthcare professionals. It is the foundation of asking the right questions about any AI system — what was it trained on, how does it make decisions, and where is it likely to fail?

What you will learn
Distinguish between supervised, unsupervised, and reinforcement learning and their healthcare applications
Explain how neural networks and deep learning systems process clinical data
Describe how natural language processing enables AI to work with unstructured clinical text
Identify the key factors that determine the quality and reliability of an AI model
Explain model training, validation, and testing and why each phase matters clinically

Three ways machines learn: supervised, unsupervised, and reinforcement

Machine learning is the approach by which AI systems learn patterns from data rather than following explicitly programmed rules. There are three primary learning paradigms, each appropriate for different clinical applications.

Supervised learning is the most common approach in clinical AI. The system learns from labeled examples — pairs of inputs and correct outputs provided by human experts. A chest X-ray AI trained on images labeled by radiologists as 'pneumonia present' or 'pneumonia absent' is using supervised learning. The model learns to predict the label for new, unlabeled images. Quality of the labeled training data is everything — if the labels are wrong, incomplete, or biased, the model learns the wrong patterns.
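
The dependence on labels can be sketched in a few lines of Python. This is a minimal illustration, not how any real imaging system works: the single feature value, the six examples, and the function names fit_centroids and predict are all assumptions made up for this sketch. It learns one average value per label, classifies a new case by the closest average, and then repeats the exercise with systematically wrong labels.

```python
from statistics import mean

# Hypothetical labeled examples: (feature value, expert label).
# The single feature stands in for whatever an imaging model would extract.
labeled = [
    (0.9, "pneumonia"), (0.8, "pneumonia"), (0.7, "pneumonia"),
    (0.2, "normal"), (0.3, "normal"), (0.1, "normal"),
]

def fit_centroids(examples):
    """Learn one average feature value (centroid) per label."""
    by_label = {}
    for value, label in examples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def predict(centroids, value):
    """Assign the label whose centroid is closest to the new value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

# Train on the expert labels, then classify a new, unlabeled case.
prediction = predict(fit_centroids(labeled), 0.75)

# Simulate systematically wrong labels by swapping every label.
flipped = [(v, "normal" if l == "pneumonia" else "pneumonia") for v, l in labeled]
flipped_prediction = predict(fit_centroids(flipped), 0.75)

print(prediction, flipped_prediction)
```

The same new case receives opposite answers depending only on the labels used for training, which is the sense in which the model learns whatever the labels say.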

Unsupervised learning works without labeled examples. The system finds patterns, clusters, or structures in data on its own. In healthcare, unsupervised approaches are used to discover patient subgroups, identify unusual patterns in clinical data, or reduce the complexity of high-dimensional datasets. These applications are powerful but harder to validate clinically, because there is no ground-truth label to compare the output against.
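
As a rough sketch of what finding subgroups without labels looks like, the example below runs a simple one-dimensional k-means clustering. The HbA1c values, the starting centers, and the function name kmeans_1d are assumptions chosen for illustration; real patient clustering uses many variables and more careful methods.

```python
from statistics import mean

def kmeans_1d(values, centers, iterations=10):
    """Group values around the given centers without using any labels."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical HbA1c values (%) with no diagnosis labels attached.
hba1c = [5.1, 5.3, 5.4, 5.6, 8.9, 9.2, 9.5, 10.1]
centers, clusters = kmeans_1d(hba1c, centers=[5.0, 6.0])

print(centers)
print(clusters)
```

The algorithm returns two groups, but nothing in the output says whether those groups are clinically meaningful. That judgment has to come from outside the data.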

Reinforcement learning trains systems through interaction and feedback — the system takes actions, receives rewards or penalties, and gradually learns strategies that maximize reward. Clinical applications are emerging in treatment optimization and robotic surgery, but reinforcement learning in high-stakes clinical settings raises significant safety and governance questions.
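
The action, reward, and strategy loop can be illustrated with a toy simulation. The sketch below uses epsilon-greedy selection, one simple reinforcement learning strategy among many. The two actions, their success rates, and the function name run_bandit are hypothetical, and the environment is purely simulated; nothing here reflects a real treatment decision.

```python
import random

def run_bandit(success_rates, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy learning: try actions, observe rewards, prefer what worked."""
    rng = random.Random(seed)
    counts = [0] * len(success_rates)
    estimates = [0.0] * len(success_rates)
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            action = rng.randrange(len(success_rates))
        else:
            # Exploit: otherwise pick the action with the best estimate so far.
            action = max(range(len(estimates)), key=lambda i: estimates[i])
        reward = 1.0 if rng.random() < success_rates[action] else 0.0
        counts[action] += 1
        # Incremental average of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical simulated environment: action 1 succeeds more often than action 0.
estimates, counts = run_bandit([0.2, 0.8])
best_action = max(range(len(estimates)), key=lambda i: estimates[i])

print(best_action, counts)
```

Note that the system only discovers the better action by sometimes choosing the worse one. In a simulation that costs nothing; with patients it is exactly where the safety and governance questions arise.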

The Labeling Problem

In supervised learning, a model is only as good as its training labels. If radiologists disagreed on labeling, if labels reflect historical biases, or if the condition being labeled was underdiagnosed in the training population, the model inherits those problems — invisibly and at scale.
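
One common way to quantify how much two labelers disagree is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration of that statistic under assumed data: the two lists of labels are invented, and the lesson itself does not prescribe this measure.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two radiologists on the same ten images
# (1 = pneumonia present, 0 = pneumonia absent).
radiologist_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
radiologist_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

raw_agreement = sum(a == b for a, b in zip(radiologist_a, radiologist_b)) / 10
kappa = cohens_kappa(radiologist_a, radiologist_b)

print(raw_agreement, round(kappa, 2))
```

In this made-up example the radiologists agree on 60% of images, yet kappa is only about 0.2 because half of that agreement would be expected by chance alone. A model trained on either set of labels would inherit that uncertainty.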

Neural networks and deep learning in clinical AI

Deep learning is a subset of machine learning that uses artificial neural networks — computational structures loosely inspired by the architecture of biological neural networks. These networks consist of layers of computational nodes. Each layer transforms its input and passes the result to the next layer. Networks with many layers — deep networks — can learn highly complex representations of data.
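
The idea of layers transforming their input and passing it on can be shown with a tiny network written out by hand. This is a sketch under stated assumptions: the weights are hand-picked rather than learned, the two input features are made up, and the sigmoid nonlinearity is just one common choice.

```python
import math

def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One layer: each node takes a weighted sum of its inputs, then applies a nonlinearity."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]

# Hypothetical, hand-picked weights; a real network learns these during training.
hidden_weights = [[0.5, -0.2], [0.1, 0.4]]   # two hidden nodes, two inputs each
hidden_biases = [0.0, 0.0]
output_weights = [[1.0, -1.0]]               # one output node, two inputs
output_biases = [0.0]

features = [1.0, 2.0]                        # two made-up input features
hidden = layer(features, hidden_weights, hidden_biases)
output = layer(hidden, output_weights, output_biases)

print(hidden, output)
```

A deep network repeats the layer step many times with millions of learned weights, which is why its intermediate values are hard to read as clinical reasoning.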

Deep learning has driven the most significant recent advances in clinical AI, particularly in medical imaging. Convolutional neural networks excel at analyzing spatial patterns in images — detecting retinal disease, classifying skin lesions, identifying pulmonary nodules on CT scans. Transformer models — the architecture underlying large language models — excel at processing sequential data, including clinical text.

The clinical challenge with deep learning is opacity. Unlike a decision tree or a logistic regression model, a deep neural network cannot easily explain its reasoning. When a deep learning system flags a chest X-ray as abnormal, it cannot tell the clinician which features drove that prediction in a clinically meaningful way — at least not without specialized explainability techniques applied on top. This opacity has significant implications for clinical governance and patient disclosure.

Training, validation, and testing: the three phases every clinician should understand

Every AI model goes through three distinct development phases that every clinical governance professional should understand. Training is the phase in which the model learns from data — adjusting its parameters to minimize prediction error on the training dataset. A model that performs well on its training data has learned something, but this does not yet tell us whether it has learned something useful.

Validation is the phase in which the model's performance is evaluated on data it has not seen during training — held-out examples from the same source dataset. Validation performance guides decisions about model architecture and training — and is commonly the figure reported in research publications. However, a model that performs well in internal validation may still fail in deployment if the real-world population differs from the training population.

Testing is the final evaluation, carried out on data that played no part in training or in model selection. It is most clinically meaningful when that data is an entirely independent external dataset from a different institution or population, because it then shows how the model performs when the conditions of deployment differ from the conditions of training. A model that performs well in internal validation but poorly in external testing is a significant governance concern.
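
The difference between the three evaluations can be made concrete with a toy model. In the sketch below, the risk scores, outcomes, and function names fit_threshold and accuracy are all hypothetical. The "model" is a single cutoff learned from training data, and the external dataset is deliberately constructed so that the same outcome appears at lower scores than in the source population.

```python
def fit_threshold(examples):
    """Training: pick the cutoff that makes the fewest errors on the training data."""
    candidates = sorted(value for value, _ in examples)
    return min(candidates, key=lambda t: sum((v >= t) != label for v, label in examples))

def accuracy(threshold, examples):
    """Fraction of examples where 'value >= threshold' matches the true label."""
    return sum((v >= threshold) == label for v, label in examples) / len(examples)

# Hypothetical (risk score, outcome) pairs. Training and internal validation
# come from the same source; the external set comes from a different site
# where the same outcome appears at lower scores.
training = [(0.1, False), (0.2, False), (0.3, False), (0.7, True), (0.8, True), (0.9, True)]
internal_validation = [(0.15, False), (0.25, False), (0.75, True), (0.85, True)]
external_test = [(0.1, False), (0.2, False), (0.4, True), (0.5, True), (0.6, True), (0.9, True)]

threshold = fit_threshold(training)
results = {
    "training": accuracy(threshold, training),
    "internal validation": accuracy(threshold, internal_validation),
    "external test": accuracy(threshold, external_test),
}

print(threshold, results)
```

With these invented numbers the model is perfect on training and internal validation data and correct only half the time on the external set. The internal figures gave no warning, because the held-out data shared the same characteristics as the training data.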

Validation vs Real-World Performance

Most AI performance claims in published literature reflect internal validation results. External validation — testing on data from a different institution and population — is the more clinically meaningful measure. Always ask which type of validation a performance claim is based on.

Key concepts from this lesson

Key Concept

Supervised Learning

Learning from labeled examples — the most common approach in clinical AI including diagnostic imaging and predictive models.

Key Concept

Unsupervised Learning

Finding patterns in unlabeled data — used for patient clustering, anomaly detection, and exploratory analysis.

Key Concept

Neural Network

A computational structure of interconnected nodes organized in layers — the foundation of deep learning.

Key Concept

Deep Learning

Machine learning using multi-layer neural networks — powers most modern imaging AI and large language models.

Key Concept

Model Validation

Evaluating model performance on data not used in training — internal validation uses held-out data from the same source; external validation uses independent datasets.

Key Concept

Overfitting

When a model learns the training data too precisely and fails to generalize — a key failure mode in clinical AI development.

Case Study

The model that aced its exam but failed in practice

A diabetic retinopathy screening AI achieves 94% sensitivity and 96% specificity in its validation study — performance that exceeds the average ophthalmologist in controlled conditions. The study is published in a leading clinical journal. The hospital system purchases and deploys the tool across its primary care network.

Six months after deployment, the clinical informatics team conducts a prospective audit. Real-world sensitivity has dropped to 71%, well below the published 94%. Investigation reveals three contributing factors. First, the validation dataset contained high-quality fundus photographs, while the primary care deployment uses tablet-based cameras that produce lower and more variable image quality. Second, the validation images were taken by trained ophthalmic photographers, while the deployment images are taken by medical assistants with minimal training. Third, the validation dataset significantly underrepresented patients with darker fundus pigmentation, a population comprising 34% of the primary care network's diabetic patients.

The performance gap was not fraud or error in the original study. It was the predictable consequence of deploying a model under conditions that differed from its validation environment — different image acquisition conditions and a different patient population.

What this illustrates

Published AI performance figures are not deployment performance figures. They reflect performance under controlled validation conditions. Every deployment environment is different — and the gap between validation performance and real-world performance is one of the most important and least discussed dimensions of clinical AI governance.
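
A local audit of the kind described in the case study comes down to a few simple calculations. The sketch below computes sensitivity and specificity overall and by subgroup. All counts and subgroup names are invented for illustration and are not the case study's audit data; the function names are likewise assumptions.

```python
def sensitivity(cases):
    """True positives divided by all patients who truly have the disease."""
    flags = [predicted for has_disease, predicted in cases if has_disease]
    return sum(flags) / len(flags)

def specificity(cases):
    """True negatives divided by all patients who truly do not have the disease."""
    flags = [not predicted for has_disease, predicted in cases if not has_disease]
    return sum(flags) / len(flags)

def by_subgroup(records):
    """Group (has_disease, model_flagged) pairs by subgroup name."""
    groups = {}
    for subgroup, has_disease, flagged in records:
        groups.setdefault(subgroup, []).append((has_disease, flagged))
    return groups

# Hypothetical audit records: (subgroup, has_disease, model_flagged).
audit = (
    [("well represented", True, True)] * 18 + [("well represented", True, False)] * 2
    + [("well represented", False, False)] * 19 + [("well represented", False, True)] * 1
    + [("underrepresented", True, True)] * 6 + [("underrepresented", True, False)] * 4
    + [("underrepresented", False, False)] * 9 + [("underrepresented", False, True)] * 1
)

all_cases = [(d, f) for _, d, f in audit]
overall_sensitivity = sensitivity(all_cases)
overall_specificity = specificity(all_cases)
per_group = {name: sensitivity(cases) for name, cases in by_subgroup(audit).items()}

print(overall_sensitivity, overall_specificity, per_group)
```

In this made-up audit the overall sensitivity of 80% hides a gap between 90% in one subgroup and 60% in the other. A single headline figure can conceal the populations for whom the tool performs worst.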

Reflection Prompt

What do you actually know about the AI systems you use?

Think about an AI-enabled tool you currently use or that influences care in your organization — an electronic alert, a risk score, a documentation suggestion. Do you know whether it uses supervised or unsupervised learning? Do you know what population it was trained on? Do you know whether its published performance figure reflects internal or external validation? If the answers are no — which is common — what would it take to find out? And whose responsibility is it to know?

Further Learning

The IHI's work on Learning Health Systems provides important context for how organizations can systematically learn from data — a foundation for understanding both the promise and the governance requirements of AI in continuous improvement. Available at ihi.org.

Knowledge Check — Lesson 02

1. A fall risk prediction model is trained on patient records where nurses have documented whether each patient fell during their stay. This is an example of:

A. Unsupervised learning: the model discovers fall patterns without guidance
B. Reinforcement learning: the model receives a penalty each time it predicts incorrectly
C. Supervised learning: the model learns from labeled examples of fall and no-fall outcomes
D. Deep learning: the model uses neural networks to analyze fall risk factors

2. A deep learning imaging AI cannot explain which specific features of a chest X-ray led it to flag an abnormality. This characteristic is best described as:

A. A training error that can be corrected by retraining the model on larger datasets
B. Model opacity: a fundamental characteristic of deep neural networks with governance implications
C. An acceptable limitation that does not affect clinical decision-making
D. Evidence that the model is not functioning correctly and should be decommissioned

3. An AI model performs with 92% accuracy on its internal validation dataset but only 74% accuracy when tested at an external hospital with a different patient population. This is most likely explained by:

A. The external hospital's IT infrastructure is incompatible with the AI system
B. The model has overfitted to its training population and does not generalize to a different population
C. The external validation was conducted incorrectly and the results should be disregarded
D. A 74% accuracy rate is clinically acceptable and does not require investigation

4. Which phase of AI model development is most clinically meaningful for assessing real-world deployment performance?

A. Training, because this is where the model learns its core capabilities
B. Internal validation, because it uses held-out data from the same dataset
C. External testing, because it evaluates performance on an independent population
D. Publication, because peer review ensures clinical validity before deployment

5. The quality of labeled training data in supervised learning is described as critically important because:

A. Larger labeled datasets always produce better models regardless of label quality
B. A supervised model learns the patterns in its labels, including errors, biases, and omissions
C. Regulatory authorities require labeling quality certificates before clinical AI deployment
D. Labels determine the computational cost of training but not the model's clinical performance