How to Introduce Pipelines in Your Machine Learning Projects
Efficient machine learning projects require clean, structured workflows, especially when dealing with repetitive tasks like preprocessing. In scikit-learn, pipelines are a powerful tool that can simplify and automate these workflows. By seamlessly chaining preprocessing steps with model training, pipelines not only reduce errors but also ensure your code is more readable and maintainable.
In this blog, we’ll guide you through the concept of scikit-learn pipelines, explaining what they are, why they matter, and how to effectively use them. You’ll learn how to seamlessly integrate preprocessing and modeling into your machine learning workflow, all while applying these techniques to solve a real-world data problem.
What Are Pipelines in scikit-learn?
In scikit-learn, a pipeline is a streamlined way to combine data transformations and modeling steps into a single cohesive workflow. By chaining preprocessing steps—such as scaling, encoding, or handling missing values—together with a machine learning model, pipelines simplify and unify the entire process.
Rather than manually preprocessing data before training your model, a pipeline automates these tasks, ensuring consistency and reducing the chance of errors across both the training and testing phases. This approach not only saves time but also enhances the reliability of your machine learning workflow.
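To make this concrete, here is a minimal sketch of the idea, chaining a scaler and a classifier purely for illustration; the fitted pipeline behaves like a single estimator:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Each step is a (name, estimator) pair; calling .fit() runs the
# steps in order, and .predict() applies the same fitted
# transformations to new data automatically.
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('model', LogisticRegression())])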
Understanding The Dataset
The World Health Organization estimates that heart diseases claim 12 million lives globally each year. In the United States and other developed nations, cardiovascular diseases account for nearly half of all deaths. Early detection of these conditions can play a crucial role in guiding high-risk individuals toward necessary lifestyle changes, ultimately reducing complications and improving outcomes. This work aims to identify the most significant risk factors for heart disease while using logistic regression to predict overall risk.
This dataset originates from an ongoing cardiovascular study conducted on residents of Framingham, Massachusetts. The primary objective is to classify and predict a patient’s 10-year risk of developing coronary heart disease (CHD). It contains detailed patient information, including over 4,000 records and 15 attributes, offering valuable insights for analysis. Each attribute is a potential risk factor, spanning demographic, behavioral, and medical categories.
- Medical Factors [history]:
- BP Meds: whether or not the patient was on blood pressure medication (Nominal)
- Prevalent Stroke: whether or not the patient previously had a stroke (Nominal)
- Prevalent Hyp: whether or not the patient was hypertensive (Nominal)
- Diabetes: whether or not the patient had diabetes (Nominal)
- Demographic Factors:
- Sex: male or female (Nominal)
- Age: Age of the patient (Continuous)
- Medical Factors [current]:
- Tot Chol: total cholesterol level (Continuous)
- Sys BP: systolic blood pressure (Continuous)
- Dia BP: diastolic blood pressure (Continuous)
- BMI: Body Mass Index (Continuous)
- Heart Rate: heart rate (Continuous)
- Glucose: glucose level (Continuous)
- Behavioral Factors:
- Current Smoker: whether or not the patient is a current smoker (Nominal)
- Cigs Per Day: the number of cigarettes the person smoked on average in one day (Continuous)
- Target variable:
- TenYearCHD: 10-year risk of coronary heart disease (binary: “1” = yes, “0” = no)
Preprocessing Steps You Can Automate with Pipelines
To create a machine learning pipeline, the first step is to outline its structure by clearly defining each stage involved.
To achieve this, we begin by developing a prototype model on the raw data. The prototype helps us understand the data and identify the preprocessing steps needed before building the final model.
The insights gained from the prototype then guide the design of a pipeline that incorporates all essential preprocessing stages.
Exploring and Preparing Data
Data exploration and preprocessing are essential stages in any machine learning workflow. During data exploration, the goal is to thoroughly analyze the dataset to uncover its structure, highlight missing values, identify outliers, and assess feature distributions. Tools like histograms, boxplots, and scatter plots are invaluable for visualizing patterns and correlations within the data.
Preprocessing, by contrast, focuses on preparing the data for modeling through cleaning and transformation. Key tasks include handling missing values, scaling numerical features, encoding categorical variables, and eliminating irrelevant or redundant features. Together, these steps create a solid foundation for building accurate, reliable machine learning models.
- Printing the first five rows:
import pandas as pd

# Load the Framingham dataset and work on a copy so the raw data stays intact
data = pd.read_csv('framingham.csv', delimiter=',')
df = data.copy()
print(df.head())
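It can also help to visualize feature distributions at this stage, as mentioned above. A quick sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt

# Histograms of every numeric column give a fast read on skew,
# ranges, and obvious outliers.
df.hist(figsize=(12, 10), bins=30)
plt.tight_layout()
plt.show()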
- Printing a concise summary of the DataFrame:
print(df.info())
The output from `df.info()` provides a detailed summary of the DataFrame, including information about the columns, data types, and missing values. Let’s break down the results:
- `<class 'pandas.core.frame.DataFrame'>`: indicates that the object is a pandas DataFrame.
- `RangeIndex: 4238 entries, 0 to 4237`: the DataFrame has 4,238 rows, with row indices ranging from 0 to 4,237.
- The DataFrame has 16 columns in total, each listed with its name, the number of non-null (non-missing) values, and its data type.
- `int64`: 7 columns are of integer type (e.g., `male`, `age`, `currentSmoker`).
- `float64`: 9 columns are of floating-point type (e.g., `education`, `cigsPerDay`, `totChol`).
- Some columns have missing values, as indicated by a non-null count below the total of 4,238 rows:
  - `education`: 4,133 non-null values (105 missing)
  - `cigsPerDay`: 4,209 non-null values (29 missing)
  - `BPMeds`: 4,185 non-null values (53 missing)
  - `totChol`: 4,188 non-null values (50 missing)
  - `BMI`: 4,219 non-null values (19 missing)
  - `heartRate`: 4,237 non-null values (1 missing)
  - `glucose`: 3,850 non-null values (388 missing)
Key Insights
- Missing Data: columns like `education`, `cigsPerDay`, `BPMeds`, `totChol`, `BMI`, `heartRate`, and `glucose` have missing values. You may need to handle these (e.g., by imputation or removal) before analysis.
- Data Types: most columns are numeric (`int64` or `float64`), which is suitable for statistical analysis and machine learning.
- Dataset Size: the dataset is small in memory (529.9 KB), with a moderate number of rows (4,238) and columns (16).
- Target Variable: the column `TenYearCHD` (our target variable for prediction) has no missing values, which is good for modeling.
Handle Missing Values
Missing data occurs when a feature has no value in a particular row. This can happen due to a lack of observation or data corruption. Regardless of the cause, it’s essential to address and manage missing data effectively.
To identify missing values (NaN) in a pandas DataFrame, you can use a combination of pandas functions. By chaining `.isna()`, `.sum()`, and `.sort_values()`, you can generate a clear summary of the missing value counts for each column, sorted for better readability and analysis.
print(df.isna().sum().sort_values())
A common approach to impute missing data is by making an educated guess as to what the missing values could be using statistical or machine learning techniques. The goal is to retain as much data as possible while minimizing bias introduced by missing values.
We can impute the mean of all non-missing values for a specific feature, or alternatively use the median. For categorical variables, the most frequent value is often used. However, it’s essential to split the dataset before fitting the imputer so that test-set information cannot influence the model, a phenomenon known as data leakage.
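For reference, the strategy parameter of SimpleImputer selects the statistic used for filling; a brief sketch (the variable names are illustrative):
from sklearn.impute import SimpleImputer

# 'mean' is the default; 'median' is more robust to outliers
num_imputer = SimpleImputer(strategy='median')
# 'most_frequent' suits categorical (nominal) features
cat_imputer = SimpleImputer(strategy='most_frequent')
A pipeline handles the split-then-impute ordering for us: when pipeline.fit() is called, the imputer learns its statistics from the training data alone.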
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Impute missing values with the column mean (SimpleImputer's default),
# then fit a logistic regression on the completed data
imputer = SimpleImputer()
log_reg = LogisticRegression()
pipeline = Pipeline(steps=[('imputer', imputer), ('log_reg', log_reg)])

# Separate the features from the target and hold out 20% for testing
X = df.drop('TenYearCHD', axis=1)
y = df['TenYearCHD']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fitting the pipeline imputes using statistics from the training data only
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
#output: 0.8549528301886793
The code snippet uses a pipeline to impute missing data and train a logistic regression model, achieving a test accuracy of 85.5%. While this seems high, there are critical considerations to address:
Class Imbalance
The high accuracy may be misleading if your target variable (`TenYearCHD`) is imbalanced (e.g., most samples belong to one class).
print(df['TenYearCHD'].value_counts(normalize=True))
#output:
#TenYearCHD
#0 0.847648
#1 0.152352
If 85% of the data is labeled “0” (no heart disease), a model that always predicts “0” would achieve 85% accuracy yet fail to identify a single true case. This situation is common in practice and calls for performance measures beyond accuracy.
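You can confirm this baseline with scikit-learn’s DummyClassifier, reusing the split from above (a quick sketch):
from sklearn.dummy import DummyClassifier

# Always predict the majority class; no learning involved
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # roughly 0.85, matching the class ratio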
Confusion matrix
Given a binary classification task, such as predicting whether a person has a 10-year risk of coronary heart disease, we can summarize performance in a 2-by-2 table called a confusion matrix.
Interpreting the Values in a Heart Disease Prediction Model:
Usually, the class of interest is called the positive class.
- True Positive (TP) : The person has heart disease, and the model correctly predicts it.
- True Negative (TN): The person does not have heart disease, and the model correctly predicts no disease.
- False Positive (FP) : The person does not have heart disease, but the model incorrectly predicts they do.
- False Negative (FN) : The person has heart disease, but the model incorrectly predicts they don’t.
Performance Metrics from the Confusion Matrix
From the confusion matrix, we can calculate key performance metrics:
- Accuracy: overall correctness of the model.
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Precision (Positive Predictive Value): of the predicted positive cases, how many were actually positive?
  Precision = TP / (TP + FP)
- Recall (Sensitivity): of the actual positive cases, how many were correctly predicted?
  Recall = TP / (TP + FN)
- F1-Score: the harmonic mean of precision and recall.
  F1 = 2 × (Precision × Recall) / (Precision + Recall)
In order to compute the confusion matrix, along with the other metrics, we import classification_report and confusion_matrix from sklearn.metrics and generate predictions on the test set.
from sklearn.metrics import classification_report, confusion_matrix

# Predictions on the held-out test set feed both metrics
y_predict = pipeline.predict(X_test)
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict))
#confusion_matrix :
#[[720 4]
#[119 5]]
#classification_report:
# precision recall f1-score support
# 0 0.86 0.99 0.92 724
# 1 0.56 0.04 0.08 124
# accuracy 0.85 848
# macro avg 0.71 0.52 0.50 848
#weighted avg 0.81 0.85 0.80 848
Let’s break down the confusion matrix and classification report to interpret the model’s performance:
Confusion Matrix:
In scikit-learn, confusion_matrix places actual classes on the rows and predicted classes on the columns, so for labels [0, 1] the layout is [[TN, FP], [FN, TP]].
- True Negatives (TN): 720. The model correctly predicted 720 cases where patients did not have a 10-year risk of CHD.
- False Positives (FP): 4. The model incorrectly predicted a 10-year risk of CHD in 4 cases where there was none.
- False Negatives (FN): 119. The model incorrectly predicted no risk in 119 cases that actually carried a 10-year risk of CHD.
- True Positives (TP): 5. The model correctly predicted only 5 cases where patients had a 10-year risk of CHD.
Classification Report:
Class 0 (No 10-year risk of CHD):
- Precision: 0.86. Out of all the cases predicted as “no risk,” 86% were correct.
- Recall: 0.99. The model correctly identified 99% of the actual “no risk” cases.
- F1-Score: 0.92. A high F1-score indicates strong performance for this class.
Class 1 (10-year risk of CHD):
- Precision: 0.56. Out of all the cases predicted as “risk,” 56% were correct (5 / (5 + 4) ≈ 0.56).
- Recall: 0.04. The model identified only 4% of the actual “risk” cases (5 / (5 + 119) ≈ 0.04).
- F1-Score: 0.08. A very low F1-score indicates poor performance for this class.
Overall Metrics:
- Accuracy: 0.85. The model is correct 85% of the time overall, but accuracy can be misleading on imbalanced datasets.
- Macro Avg: 0.71 (precision), 0.52 (recall), 0.50 (F1-score). The average performance across both classes, giving equal weight to each class.
- Weighted Avg: 0.81 (precision), 0.85 (recall), 0.80 (F1-score). The average performance weighted by the number of samples in each class.
Key Observations
- Class Imbalance: the dataset is imbalanced, with far more “no risk” samples (724) than “risk” samples (124). This imbalance limits the model’s ability to predict the minority class.
- Poor Performance for Class 1 (Risk of CHD): the model struggles to identify patients at risk of CHD, as shown by the very low recall (0.04) and F1-score (0.08). This is critical because missing at-risk patients (false negatives) is a serious issue in healthcare.
- Misleading High Accuracy: the 85% accuracy is driven almost entirely by correct predictions on the majority class. The model fails at the primary task: identifying patients at risk.
Recommendations
- Address Class Imbalance: use techniques like oversampling (e.g., SMOTE), undersampling, or class weighting to balance the dataset and improve performance for the minority class (see the sketch after this list).
- Optimize for Recall: since false negatives (missed risk cases) are critical in healthcare, consider optimizing for recall, for instance by adjusting the decision threshold or tracking metrics like ROC-AUC.
- Feature Engineering: re-evaluate the features used in the model. Are there additional risk factors or biomarkers that could improve predictions for the “risk” class?
- Try Different Models: experiment with other algorithms (e.g., Random Forest, Gradient Boosting, or Neural Networks) that may handle imbalanced data better.
- Evaluate with Domain Experts: collaborate with healthcare professionals to ensure the model’s predictions align with clinical insight and prioritize patient safety.
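As one example of the class-weighting option mentioned above, the earlier pipeline can be re-fitted with class_weight='balanced'; a sketch reusing the imports and split from before (max_iter is raised as a precaution against convergence warnings):
# Same pipeline, but errors on the rare "risk" class are penalized
# in inverse proportion to its frequency in the training data
weighted_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('log_reg', LogisticRegression(class_weight='balanced', max_iter=1000)),
])
weighted_pipeline.fit(X_train, y_train)
print(classification_report(y_test, weighted_pipeline.predict(X_test)))
This typically trades some overall accuracy for better recall on the minority class; whether that trade-off is acceptable is ultimately a clinical judgment.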
Conclusion
The model performs well for identifying patients with no risk of CHD but struggles to detect those at risk. Given the high stakes in healthcare, improving recall for the “risk” class should be the primary focus. Addressing class imbalance and refining the model’s approach will be key to making it clinically useful.
Additionally, the use of pipelines in your model is a significant advantage. Pipelines streamline the machine learning workflow by encapsulating data preprocessing, feature engineering, and model training into a single, cohesive process. This ensures consistency, reduces the risk of data leakage, and makes the model easier to deploy and maintain. For example, in a healthcare context, pipelines can seamlessly integrate steps like scaling numerical features, encoding categorical variables, and applying dimensionality reduction, all while maintaining reproducibility and efficiency.
By leveraging pipelines, you can also experiment more effectively with different preprocessing techniques and models, ensuring that your final solution is robust and reliable. This is especially important in healthcare applications, where model accuracy and interpretability are critical for patient outcomes and clinical decision-making.
FAQs
Why use machine learning to predict CHD risk?
Machine learning can analyze complex patterns in patient data to predict CHD risk more accurately than traditional statistical methods. It can also handle large datasets, incorporate diverse risk factors, and continuously improve predictions as more data becomes available.
What data is used to predict CHD risk?
Common data used includes:
- Demographic information (age, gender)
- Clinical measurements (blood pressure, cholesterol levels, BMI)
- Lifestyle factors (smoking status, physical activity)
- Medical history (diabetes, family history of heart disease)
What is a confusion matrix, and why is it important?
A confusion matrix is a table that shows the performance of a classification model by comparing predicted and actual outcomes. It is important because it helps evaluate metrics like accuracy, precision, recall, and F1-score, which are critical for understanding how well the model identifies patients at risk of CHD.
What does low recall for the “risk” class mean?
Low recall for the “risk” class means the model is missing a large number of patients who are actually at risk of CHD (high false negatives). This is particularly concerning in healthcare, as failing to identify at-risk patients can have serious consequences.
What is the role of pipelines in building a CHD risk prediction model?
Pipelines streamline the machine learning workflow by combining data preprocessing, feature engineering, and model training into a single process. They ensure consistency, prevent data leakage, and make the model easier to deploy and maintain. For example, a pipeline might include steps like scaling numerical features, encoding categorical variables, and training a classifier.