
Classification with scikit-learn Simplified: 3 Essential Steps for Beginners

Classification is a fundamental concept in machine learning, enabling us to predict categories or labels based on input data. For example, we can predict whether a bank transaction is fraudulent or not. There are two possible outcomes here – a fraudulent or a non-fraudulent transaction – which is known as binary classification. scikit-learn, one of Python’s most popular machine learning packages, simplifies this process with its user-friendly tools and robust algorithms. Whether you’re new to classification or refining your skills, this guide will help you dive into scikit-learn’s classification capabilities with confidence.

Data Preparation

Before performing classification, we have to meet some requirements: the data must not have missing values, must be in numeric format, and must be stored as pandas DataFrames, Series, or NumPy arrays. This calls for some exploratory data analysis first to ensure the data is in the correct format. Various pandas methods for descriptive statistics, along with appropriate data visualizations, are useful in this step.

This guide focuses on predicting customer churn – a critical metric for many businesses, especially in the banking and telecom sectors. The dataset used in this guide contains various customer attributes, such as age, balance, and number of products used, along with a target variable indicating whether the customer has exited (churned) or not.

import pandas as pd
import requests

# URL of the training data hosted on GitHub
train_url = "https://raw.githubusercontent.com/zainhaidar16/Binary-Classification-with-a-Bank-Churn-Dataset/refs/heads/main/Data/train.csv"

# Download the file and raise an error if the request fails
response = requests.get(train_url)
response.raise_for_status()

# Save the CSV locally
with open("train.csv", "wb") as csv_file:
    csv_file.write(response.content)

# Load the data into a pandas DataFrame
train_df = pd.read_csv('train.csv', delimiter=',')
print(train_df.head())

This Python code snippet retrieves a CSV file from a GitHub repository, saves it locally, and loads it into a pandas DataFrame for further data analysis. The print statement displays the first 5 rows of the DataFrame, which is useful for quickly inspecting the data structure, column names, and some sample values.
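To follow up on the descriptive statistics mentioned earlier, a quick exploratory pass might look like the sketch below. It only uses standard pandas methods on the train_df DataFrame loaded above.

# Column names, dtypes, and non-null counts in one overview
train_df.info()

# Summary statistics for the numeric columns
print(train_df.describe())

# How many customers churned versus stayed
print(train_df['Exited'].value_counts())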

As a next step, we check for missing values (NaN) in each column of the DataFrame and print the count of missing values for every column.

print(train_df.isna().sum())

When deciding which columns to use as features for your model, it’s essential to focus on variables that are likely to have a meaningful relationship with the target variable (Exited, in this case). Let’s analyze each column and identify those that are suitable as features:

Columns in the Dataset

  1. id or CustomerId
    • Exclude: These are unique identifiers and have no predictive power.
  2. Surname
    • Exclude: A person’s surname is unlikely to correlate with customer churn.
  3. CreditScore
    • Include: A customer’s credit score may influence their likelihood of staying or leaving.
  4. Geography
    • Include: The region or country may play a role in customer behavior and churn rates.
  5. Gender
    • Include: Gender might have a relationship with churn, depending on customer demographics.
  6. Age
    • Include: Age is often an important factor, as customer needs and behaviors change with age.
  7. Tenure
    • Include: How long a customer has been with the bank might strongly impact their decision to stay or leave.
  8. Balance
    • Include: The account balance could indicate a customer’s financial engagement with the bank.
  9. NumOfProducts
    • Include: The number of products (e.g., loans, credit cards) could reflect customer loyalty or satisfaction.
  10. HasCrCard
    • Include: Indicates whether the customer has a credit card. This could influence churn.
  11. IsActiveMember
    • Include: An active member is less likely to churn, making this a relevant feature.
  12. EstimatedSalary
    • Include: A customer’s salary may correlate with their likelihood to churn.
  13. Exited (Target Variable)
    • Exclude as Feature: This is the target variable you are trying to predict, not a feature.

Geography and Gender are categorical and need to be encoded. Use one-hot encoding or label encoding depending on the algorithm.

# One-hot encode categorical variables
train_df = pd.get_dummies(train_df, columns=['Geography', 'Gender'])
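If label encoding is preferred for a particular algorithm instead of one-hot encoding, a minimal sketch using scikit-learn’s LabelEncoder could look like the following. It is shown only for illustration on a freshly loaded copy of the raw data (train_df has already been one-hot encoded above), and the rest of this guide continues with the one-hot encoded DataFrame.

from sklearn.preprocessing import LabelEncoder

# Illustrative alternative: label-encode the raw categorical columns
raw_df = pd.read_csv('train.csv')
for col in ['Geography', 'Gender']:
    raw_df[col] = LabelEncoder().fit_transform(raw_df[col])
print(raw_df[['Geography', 'Gender']].head())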

After the preprocessing steps, we split our data into X, a 2D array of our features, and y, a 1D array of the target values. scikit-learn requires that the features are in an array where each column is a feature and each row is a different observation. Similarly, the target needs to be a single column with the same number of observations as the feature data. We use the dot-values attribute to convert X and y to NumPy arrays. Printing the shape of X and y, we see there are 165034 observations of 13 features, and 165034 observations of the target variable.

X = train_df.drop(columns=['id', 'CustomerId', 'Surname', 'Exited']).values
y = train_df['Exited'].values
print(X.shape, y.shape)
# output: (165034, 13) (165034,), i.e. the same number of observations

It is necessary to split the data into a training set and a test set, fit the classifier using the training set, and then calculate the model’s accuracy using the test set’s labels. To split the data, we import train_test_split from sklearn.model_selection.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

We call train_test_split, passing our features and targets. We commonly use 20% of our data as the test set, which we achieve here by setting the test_size argument to 0.2. The random_state argument sets a seed for the random number generator that splits the data. Using the same number when repeating this step allows us to reproduce the exact split and our downstream results. It is best practice to ensure our split reflects the proportion of labels in our data. So if Exited occurs in 15% of observations, we want 15% of labels in our training and test sets to represent churn. We achieve this by setting stratify equal to y.
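As a quick sanity check on the stratification, we can compare the churn rate in each split; with stratify set to y, the two proportions below should be nearly identical.

# Proportion of churners in the training and test sets
print(y_train.mean(), y_test.mean())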

Building Your First Classification Model

Let’s build our first model! We’ll use an algorithm called k-Nearest Neighbors, which is popular for classification problems. The idea of k-Nearest Neighbors, or KNN, is to predict the label of any data point by looking at the k (for example, three) closest labeled data points and getting them to vote on what label the unlabeled observation should have.
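To make the voting idea concrete, here is a minimal sketch on a made-up 2D dataset; the points and labels below are purely illustrative and are not part of the churn data.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six hypothetical points: three of class 0 near the origin, three of class 1 far away
toy_X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(toy_X, toy_y)

# The three nearest neighbors of (2, 2) all carry label 0, so they outvote class 1
print(toy_knn.predict([[2, 2]]))  # expected output: [0]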

Using scikit-learn to fit a classifier

We instantiate our KNeighborsClassifier by setting n_neighbors equal to 4, and assign it to the variable knn. Then we can fit this classifier to our labeled data by applying the classifier’s dot-fit method and passing two arguments: the feature values, X_train, and the target values, y_train.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=4)
knn.fit(X_train, y_train)

After fitting the classifier, we use the classifier’s dot-predict method and pass it the unseen data as a 2D NumPy array containing features in columns and observations in rows. Printing the predictions returns a binary value for each observation.

y_pred = knn.predict(X_test)

print("Predictions: {}".format(y_pred))

Measuring model performance

In classification, accuracy is a commonly used metric, defined as the number of correct predictions divided by the total number of observations. We could compute accuracy on the data used to fit the classifier. However, as this data was used to train the model, performance will not be indicative of how well it can generalize to unseen data, which is what we are interested in! To check the accuracy, we use the dot-score method, passing X_test and y_test.

print(knn.score(X_test, y_test)) # output: 0.7707758960220559
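The same number can be reproduced directly from the predictions, which is a useful sanity check; this small sketch only assumes the y_pred and y_test arrays defined above.

import numpy as np
from sklearn.metrics import accuracy_score

# Fraction of test observations where the prediction matches the true label
print(accuracy_score(y_test, y_pred))
print(np.mean(y_pred == y_test))  # equivalent manual computation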

Model complexity depends on the value of n_neighbors, which determines how many neighbors are considered when classifying a data point.

Low n_neighbors (High Model Complexity)

Description: When n_neighbors is small (e.g., 1 or 2), the model is more sensitive to individual data points. Each data point’s nearest neighbors dominate the classification decision. Effect:

  • High Variance: The model may overfit the training data, leading to poor generalization on test data.
  • Detailed Decision Boundary: The decision boundary is highly flexible, closely following the training data.
  • Susceptible to Noise: The model may incorrectly classify due to noisy or outlier data points.

High n_neighbors (Low Model Complexity)

Description: When n_neighbors is large, the classification decision is based on a broader group of neighbors. This smooths the decision boundary. Effect:

  • Low Variance, High Bias: The model becomes simpler, possibly underfitting the data and failing to capture important patterns.
  • Smoothed Decision Boundary: The boundary becomes less flexible and more generalized.
  • Robustness to Noise: Outliers and noisy data have less influence, as decisions are averaged across many points.
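As a quick illustration of these two regimes, the sketch below compares a very small and a very large value of n_neighbors on the same training and test sets; the exact numbers will depend on the split and are not from the source.

from sklearn.neighbors import KNeighborsClassifier

# Compare an overly flexible model (k=1) with an overly smooth one (k=100)
for k in (1, 100):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    print(f"k={k}: train accuracy={model.score(X_train, y_train):.3f}, "
          f"test accuracy={model.score(X_test, y_test):.3f}")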

We can also interpret the effect of k using a model complexity curve. With a KNN model, we can calculate accuracy on the training and test sets for increasing k values and plot the results.

import matplotlib.pyplot as plt
import numpy as np

train_accuracy = {}
test_accuracy = {}

neighbors = np.arange(1, 30)

# Fit one KNN model per value of k and record train and test accuracy
for neighbor in neighbors:
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    knn.fit(X_train, y_train)
    train_accuracy[neighbor] = knn.score(X_train, y_train)
    test_accuracy[neighbor] = knn.score(X_test, y_test)

# Plot the model complexity curve
plt.figure(figsize=(10, 10))
plt.title("Model performance based on varying number of neighbors")
plt.plot(neighbors, list(train_accuracy.values()), label='Training accuracy')
plt.plot(neighbors, list(test_accuracy.values()), label='Test accuracy')

plt.legend()

plt.xlabel("Neighbors")
plt.ylabel("Accuracy")

plt.savefig("accuracy.png")
plt.show()

We create dictionaries to store our train and test accuracies, and an array containing a range of k values.We use a for loop to repeat our previous workflow, building several models using a different number of neighbors. We loop through our neighbors array and, inside the loop, we instantiate a KNN model with n_neighbors equal to the neighbor iterator, and fit to the training data. We then calculate training and test set accuracy, storing the results in their respective dictionaries. After our for loop, we then plot the training and test values, including a legend and the labels.

Classification accuracy

As k increases beyond around 16, we see underfitting, where performance plateaus on both the training and test sets, as shown in the plot. The peak test accuracy actually occurs at around 17 neighbors.
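Rather than reading the peak off the plot by eye, we can also pull it out of the test_accuracy dictionary built above; this small sketch simply reports whichever k maximized test accuracy on this particular split.

# k value with the highest test-set accuracy
best_k = max(test_accuracy, key=test_accuracy.get)
print(best_k, test_accuracy[best_k])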

Conclusion

Classification is a cornerstone of machine learning, and mastering it using scikit-learn opens doors to solving diverse real-world problems. With its rich suite of algorithms, preprocessing tools, and evaluation metrics, scikit-learn provides everything you need to build effective and scalable classification models. As a beginner, starting with simple models like Logistic Regression or k-NN helps you understand the fundamental principles. Gradually, exploring advanced methods like Random Forests and SVMs will deepen your expertise. Remember, the key to success lies in experimenting with different algorithms, tuning hyperparameters, and validating your models to ensure robust performance.
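As a starting point for the hyperparameter tuning mentioned above, a cross-validated grid search over n_neighbors is one option. The sketch below uses scikit-learn’s GridSearchCV with an illustrative parameter grid; the range of candidate values is an assumption and should be adapted to your data.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Illustrative grid of candidate k values
param_grid = {'n_neighbors': range(1, 30)}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(grid_search.best_params_, grid_search.best_score_)
print(grid_search.score(X_test, y_test))  # accuracy of the refitted best model on the test set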

Whether you are predicting customer churn, classifying images, or tagging text, scikit-learn equips you with the tools to achieve your goals. Keep learning, practice consistently, and soon you’ll be leveraging classification models to make data-driven decisions with confidence.