Lesson 02 of 10: AI Healthcare Quality & Safety

How AI Learns
Machine Learning, Deep Learning & NLP

Understanding how AI systems learn is not optional technical knowledge for healthcare professionals. It is the foundation of asking the right questions about any AI system — what was it trained on, how does it make decisions, and where is it likely to fail?

What you will learn

Distinguish between supervised, unsupervised, and reinforcement learning and their healthcare applications

Explain how neural networks and deep learning systems process clinical data

Describe how natural language processing enables AI to work with unstructured clinical text

Identify the key factors that determine the quality and reliability of an AI model

Explain model training, validation, and testing and why each phase matters clinically

Three ways machines learn
supervised, unsupervised, and reinforcement

Machine learning is the approach by which AI systems learn patterns from data rather than following explicitly programmed rules. There are three primary learning paradigms, each appropriate for different clinical applications.

Supervised learning is the most common approach in clinical AI. The system learns from labeled examples: pairs of inputs and correct outputs provided by human experts. A chest X-ray AI trained on images labeled by radiologists as 'pneumonia present' or 'pneumonia absent' is using supervised learning. The model learns to predict the label for new, unlabeled images. The quality of the labeled training data largely determines what the model learns: if the labels are wrong, incomplete, or biased, the model learns the wrong patterns.
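To make the idea of learning from labeled examples concrete, here is a minimal sketch of a nearest-centroid classifier. It is a deliberately simple stand-in, not how an imaging model is actually built. The two input features, their values, and the function names `train` and `predict` are all invented for illustration.

```python
# Toy supervised learning: a nearest-centroid classifier.
# All feature values and labels below are invented for illustration.

def train(examples):
    """Learn one centroid (mean feature vector) per label from labeled examples."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Predict the label whose centroid is closest to the new input."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical inputs: (opacity score, temperature in Celsius) with an expert label.
labeled = [
    ((0.9, 38.9), "pneumonia present"),
    ((0.8, 38.4), "pneumonia present"),
    ((0.2, 36.8), "pneumonia absent"),
    ((0.1, 37.0), "pneumonia absent"),
]
model = train(labeled)
prediction = predict(model, (0.85, 38.6))  # a new, unlabeled case
print(prediction)
```

If the expert labels in `labeled` were swapped or wrong, the same code would run without complaint and simply learn the wrong centroids.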

Unsupervised learning works without labeled examples. The system finds patterns, clusters, or structures in data on its own. In healthcare, unsupervised approaches are used to discover patient subgroups, identify unusual patterns in clinical data, or reduce the complexity of high-dimensional datasets. These applications are powerful but more difficult to validate clinically, because there is no ground truth label to compare against.
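As a rough illustration of finding structure without labels, the sketch below runs k-means clustering on a single made-up measurement. The values, the starting centers, and the function name `kmeans_1d` are assumptions chosen for the example. Real patient clustering would involve many features and careful clinical review of the resulting groups.

```python
# Toy unsupervised learning: k-means clustering on one feature, with no labels.
# The values are invented (e.g. a hypothetical lab measurement per patient).

def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

values = [5.1, 5.4, 5.0, 9.8, 10.1, 9.9]
centers, clusters = kmeans_1d(values, centers=[4.0, 12.0])
print(centers)
print(clusters)
```

The algorithm returns two groups, but nothing in the data says whether those groups are clinically meaningful. That is the validation difficulty described above.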

Reinforcement learning trains systems through interaction and feedback — the system takes actions, receives rewards or penalties, and gradually learns strategies that maximize reward. Clinical applications are emerging in treatment optimization and robotic surgery, but reinforcement learning in high-stakes clinical settings raises significant safety and governance questions.
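The action, reward, and update loop can be sketched with a toy epsilon-greedy agent. This is a teaching example only: the two action names and their fixed reward values are hypothetical, and real clinical reinforcement learning involves uncertain, delayed, and safety-critical outcomes that this sketch does not model.

```python
import random

# Toy reinforcement learning: an epsilon-greedy agent choosing between two actions.
# Action names and reward values are hypothetical and deterministic for clarity.
REWARDS = {"action_a": 0.2, "action_b": 1.0}

def run_agent(steps=200, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    estimates = {action: 0.0 for action in REWARDS}
    counts = {action: 0 for action in REWARDS}
    actions = list(REWARDS)
    for step in range(steps):
        if step < len(actions):
            action = actions[step]                      # try every action once first
        elif rng.random() < epsilon:
            action = rng.choice(actions)                # explore occasionally
        else:
            action = max(estimates, key=estimates.get)  # exploit the best estimate
        reward = REWARDS[action]
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

estimates, counts = run_agent()
best = max(estimates, key=estimates.get)
print(best, counts)
```

Note that the agent learns only what the reward signal encodes. If the reward is a poor proxy for patient benefit, the learned strategy will be too, which is one reason governance questions arise.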

The Labeling Problem

In supervised learning, a model is only as good as its training labels. If radiologists disagreed on labeling, if labels reflect historical biases, or if the condition being labeled was underdiagnosed in the training population, the model inherits those problems — invisibly and at scale.
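One common way to quantify how much two labelers disagree is Cohen's kappa, which measures agreement beyond what chance alone would produce. The sketch below computes it from scratch. The two readers' labels are invented, and kappa is offered here as one possible check on label quality, not as a method the lesson prescribes.

```python
from collections import Counter

def cohen_kappa(reader_1, reader_2):
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    n = len(reader_1)
    observed = sum(a == b for a, b in zip(reader_1, reader_2)) / n
    freq_1, freq_2 = Counter(reader_1), Counter(reader_2)
    expected = sum(freq_1[label] * freq_2[label] for label in freq_1) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two readers on the same ten images (1 = finding present).
reader_1 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
reader_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohen_kappa(reader_1, reader_2)
print(round(kappa, 2))  # 0.58
```

Here the readers agree on 8 of 10 images, yet kappa is only about 0.58 once chance agreement is removed. A model trained on either reader's labels inherits that uncertainty.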

Neural networks and
deep learning in clinical AI

Deep learning is a subset of machine learning that uses artificial neural networks — computational structures loosely inspired by the architecture of biological neural networks. These networks consist of layers of computational nodes. Each layer transforms its input and passes the result to the next layer. Networks with many layers — deep networks — can learn highly complex representations of data.
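The layer-by-layer transformation can be sketched in a few lines. The example below is a tiny two-layer network with hand-picked weights. All of the numbers are hypothetical; in a real system the weights are learned during training and there are far more of them.

```python
import math

def dense(inputs, weights, biases, activation):
    """One layer: weighted sum of inputs per node, plus bias, then a nonlinearity."""
    return [activation(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights; in practice these are learned during training.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.2, -0.7]]
output_b = [-0.1]

def forward(features):
    hidden = dense(features, hidden_w, hidden_b, relu)     # layer 1 transforms the input
    return dense(hidden, output_w, output_b, sigmoid)[0]   # layer 2 produces a score

score = forward([1.0, 0.5, 2.0])
print(round(score, 3))
```

A deep network stacks many such layers, so the final score depends on the inputs through a long chain of intermediate values.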

Deep learning has driven the most significant recent advances in clinical AI, particularly in medical imaging. Convolutional neural networks excel at analyzing spatial patterns in images — detecting retinal disease, classifying skin lesions, identifying pulmonary nodules on CT scans. Transformer models — the architecture underlying large language models — excel at processing sequential data, including clinical text.

The clinical challenge with deep learning is opacity. Unlike a decision tree or a logistic regression model, a deep neural network cannot easily explain its reasoning. When a deep learning system flags a chest X-ray as abnormal, it cannot tell the clinician which features drove that prediction in a clinically meaningful way — at least not without specialized explainability techniques applied on top. This opacity has significant implications for clinical governance and patient disclosure.
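As one illustration of what an explainability technique applied on top of a model can look like, the sketch below uses a simple occlusion approach: replace one input at a time with a baseline value and record how much the score changes. The function `model_score` is a made-up stand-in for an opaque model, and the baseline of 0.0 is an assumption. Techniques used on real imaging models are considerably more involved.

```python
import math

def model_score(features):
    """Stand-in for an opaque model: callers see only the input and the output score."""
    z = 2.0 * features[0] - 1.0 * features[1] + 0.1 * features[2]
    return 1.0 / (1.0 + math.exp(-z))

def occlusion_importance(score_fn, features, baseline=0.0):
    """Change in score when each feature is replaced, one at a time, by a baseline."""
    original = score_fn(features)
    importance = []
    for i in range(len(features)):
        occluded = list(features)
        occluded[i] = baseline
        importance.append(original - score_fn(occluded))
    return importance

importance = occlusion_importance(model_score, [1.0, 1.0, 1.0])
print([round(v, 3) for v in importance])
```

A large positive value means the feature pushed the score up, and a negative value means it pushed the score down. The result describes the model's sensitivity to each input, which is not the same as a clinical explanation.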

Training, validation, and testing
the three phases every clinician should understand

Every AI model goes through three distinct development phases that every clinical governance professional should understand. Training is the phase in which the model learns from data — adjusting its parameters to minimize prediction error on the training dataset. A model that performs well on its training data has learned something, but this does not yet tell us whether it has learned something useful.

Validation is the phase in which the model's performance is evaluated on data it has not seen during training — held-out examples from the same source dataset. Validation performance guides decisions about model architecture and training — and is commonly the figure reported in research publications. However, a model that performs well in internal validation may still fail in deployment if the real-world population differs from the training population.

Testing, ideally on an entirely independent external dataset from a different institution or population, is the most clinically meaningful evaluation. It tells us how the model performs when the conditions of deployment differ from the conditions of training. Models that perform well in internal validation but poorly in external testing are a significant governance concern.
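The separation between the three datasets can be sketched as follows, using the lesson's terminology: training and validation data come from the same source, while test data comes from an independent one. The record identifiers, the site names, the 80/20 split, and the function name `split_train_validation` are all assumptions for illustration.

```python
import random

def split_train_validation(records, validation_frac=0.2, seed=42):
    """Shuffle once, then hold out a fraction of the source data for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_frac)
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical record IDs: site A is the development source, site B is external.
source_records = [f"site_a_{i:03d}" for i in range(100)]
external_records = [f"site_b_{i:03d}" for i in range(40)]

train, validation = split_train_validation(source_records)

# No record may appear in more than one set, or the evaluation is contaminated.
assert not set(train) & set(validation)
assert not set(external_records) & set(source_records)

print(len(train), len(validation), len(external_records))  # 80 20 40
```

The validation set here still shares site A's equipment, workflows, and population with the training set. Only the site B records can show how the model behaves when those conditions change.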

Validation vs Real-World Performance

Most AI performance claims in published literature reflect internal validation results. External validation — testing on data from a different institution and population — is the more clinically meaningful measure. Always ask which type of validation a performance claim is based on.

Key concepts
from this lesson

Supervised Learning

Learning from labeled examples. The most common approach in clinical AI, including diagnostic imaging and predictive models.

Unsupervised Learning

Finding patterns in unlabeled data. Used for patient clustering, anomaly detection, and exploratory analysis.

Neural Network

A computational structure of interconnected nodes organized in layers. The foundation of deep learning.

Deep Learning

Machine learning using multi-layer neural networks. Powers most modern imaging AI and large language models.

Model Validation

Evaluating model performance on data not used in training. Internal validation uses held-out data from the same source; external validation uses independent datasets.

Overfitting

When a model learns the training data too precisely and fails to generalize. A key failure mode in clinical AI development.

Case Study

The model that aced its exam but failed in practice

A diabetic retinopathy screening AI achieves 94% sensitivity and 96% specificity in its validation study, performance that exceeds that of the average ophthalmologist in controlled conditions. The study is published in a leading clinical journal. The hospital system purchases and deploys the tool across its primary care network.

Six months after deployment, the clinical informatics team conducts a prospective audit. Real-world sensitivity has dropped to 71%, well below the published figure. Investigation reveals three contributing factors. First, the validation dataset contained high-quality fundus photographs taken by trained ophthalmic photographers, whereas in primary care the images are captured by medical assistants with minimal training. Second, the primary care deployment uses tablet-based cameras. Together, these two factors mean image quality is lower and more variable. Third, the validation dataset significantly underrepresented patients with darker fundus pigmentation, a population comprising 34% of the primary care network's diabetic patients.

The performance gap was not fraud or error in the original study. It was the predictable consequence of deploying a model under conditions that differed from its validation environment — different image acquisition conditions and a different patient population.

What this illustrates

Published AI performance figures are not deployment performance figures. They reflect performance under controlled validation conditions. Every deployment environment is different — and the gap between validation performance and real-world performance is one of the most important and least discussed dimensions of clinical AI governance.
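A local audit of the kind described in the case study comes down to simple arithmetic on confusion-matrix counts, ideally broken out by subgroup. The sketch below shows that calculation. The subgroup names and all counts are hypothetical and are not taken from the case study.

```python
def sensitivity(tp, fn):
    """Proportion of patients with the disease that the tool correctly flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of patients without the disease that the tool correctly clears."""
    return tn / (tn + fp)

# Hypothetical audit counts per subgroup; not taken from the case study.
audit = {
    "subgroup_1": {"tp": 90, "fn": 10, "tn": 380, "fp": 20},
    "subgroup_2": {"tp": 30, "fn": 20, "tn": 190, "fp": 10},
}

by_group = {
    name: (sensitivity(c["tp"], c["fn"]), specificity(c["tn"], c["fp"]))
    for name, c in audit.items()
}
pooled_sensitivity = sensitivity(
    sum(c["tp"] for c in audit.values()),
    sum(c["fn"] for c in audit.values()),
)

for name, (sens, spec) in by_group.items():
    print(name, round(sens, 2), round(spec, 2))
print("pooled sensitivity", round(pooled_sensitivity, 2))
```

In this made-up data the pooled sensitivity is 0.80, which hides the fact that one subgroup sits at 0.90 and the other at 0.60. A single headline figure can mask exactly the kind of population gap the case study describes.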

Reflection Prompt

What do you actually know about the AI systems you use?

Think about an AI-enabled tool you currently use or that influences care in your organization — an electronic alert, a risk score, a documentation suggestion. Do you know whether it uses supervised or unsupervised learning? Do you know what population it was trained on? Do you know whether its published performance figure reflects internal or external validation? If the answers are no — which is common — what would it take to find out? And whose responsibility is it to know?

Further Learning

The IHI's work on Learning Health Systems provides important context for how organizations can systematically learn from data — a foundation for understanding both the promise and the governance requirements of AI in continuous improvement. Available at ihi.org.

Knowledge Check — Lesson 02

1. A fall risk prediction model is trained on patient records where nurses have documented whether each patient fell during their stay. This is an example of:

A. Unsupervised learning: the model discovers fall patterns without guidance

B. Reinforcement learning: the model receives a penalty each time it predicts incorrectly

C. Supervised learning: the model learns from labeled examples of fall and no-fall outcomes

D. Deep learning: the model uses neural networks to analyze fall risk factors

Correct. This is supervised learning: the model learns from labeled training examples (patient records tagged with 'fell' or 'did not fall') to predict future fall risk. The quality of those labels directly determines what the model learns.

Review the lesson. Supervised learning trains on labeled examples, meaning inputs paired with correct outputs. Fall outcome documentation provides the labels. Deep learning describes architecture, not learning type, and can be used in supervised learning.

2. A deep learning imaging AI cannot explain which specific features of a chest X-ray led it to flag an abnormality. This characteristic is best described as:

A. A training error that can be corrected by retraining the model on larger datasets

B. Model opacity: a fundamental characteristic of deep neural networks with governance implications

C. An acceptable limitation that does not affect clinical decision-making

D. Evidence that the model is not functioning correctly and should be decommissioned

Correct. Model opacity, sometimes called the black box problem, is a fundamental characteristic of deep neural networks. They learn complex, high-dimensional representations that cannot be easily translated into clinically interpretable explanations without specialized explainability techniques.

Review the lesson. Opacity in deep learning is not an error; it is an inherent characteristic of the architecture. It has significant governance implications for patient disclosure, clinical accountability, and the design of human oversight mechanisms.

3. An AI model performs with 92% accuracy on its internal validation dataset but only 74% accuracy when tested at an external hospital with a different patient population. This is most likely explained by:

A. The external hospital's IT infrastructure is incompatible with the AI system

B. The model has overfitted to its training population and does not generalize to a different population

C. The external validation was conducted incorrectly and the results should be disregarded

D. A 74% accuracy rate is clinically acceptable and does not require investigation

Correct. The gap between internal validation performance (92%) and external validation performance (74%) suggests overfitting: the model learned patterns specific to its training population that do not transfer to a different institution and patient population.

Review the lesson. The difference between internal and external validation performance is one of the most important, and least discussed, dimensions of clinical AI governance. A model that performs well internally but poorly externally is a significant deployment risk.

4. Which phase of AI model development is most clinically meaningful for assessing real-world deployment performance?

A. Training, because this is where the model learns its core capabilities

B. Internal validation, because it uses held-out data from the same dataset

C. External testing, because it evaluates performance on an independent population

D. Publication, because peer review ensures clinical validity before deployment

Correct. External testing on an independent dataset from a different institution and population is the most clinically meaningful evaluation, because it most closely approximates the conditions of real-world deployment in a new setting.

Review the lesson. Training performance tells us the model learned something. Internal validation tells us it generalized within the same dataset. Only external testing, on genuinely independent data, tells us how the model is likely to perform in deployment.

5. The quality of labeled training data in supervised learning is described as critically important because:

A. Larger labeled datasets always produce better models regardless of label quality

B. A supervised model learns the patterns in its labels, including errors, biases, and omissions

C. Regulatory authorities require labeling quality certificates before clinical AI deployment

D. Labels determine the computational cost of training but not the model's clinical performance

Correct. In supervised learning, a model learns to replicate the patterns in its training labels. If the labels contain errors, reflect historical biases, or systematically underrepresent certain conditions or populations, the model learns those problems, invisibly and at scale.

Review the lesson. The labeling problem is one of the most important quality considerations in supervised learning. Data quantity matters, but data quality, especially label quality, determines what the model actually learns.
