Cracking the Code: How Clustering Reveals Winning Strategies for Online Sellers
In today’s digital economy, social media platforms have become powerful sales channels, with Facebook Live emerging as a game-changer for online sellers. From small business owners to established retailers, sellers use live streaming to showcase products, engage with audiences, and drive real-time sales. However, not all live-selling strategies yield the same results. Why do some sellers consistently attract high engagement, while others struggle to gain traction?
This is where machine learning, specifically clustering algorithms, comes into play. By analyzing engagement metrics such as likes, shares, comments, and reactions, clustering can help uncover hidden patterns in seller performance. In this post, we’ll explore how clustering techniques can be used to segment online sellers based on their engagement levels, identify key success factors, and optimize digital selling strategies.
Understanding Unsupervised Learning and Clustering
Machine learning techniques can be broadly categorized into supervised and unsupervised learning. Supervised learning relies on labeled data to train models for classification or prediction tasks. However, in many real-world scenarios, we don’t always have labeled data. This is where unsupervised learning comes in.
What is Unsupervised Learning?
Unsupervised learning is a type of machine learning that identifies patterns in data without predefined labels. Instead of predicting a specific outcome, these models detect hidden structures within the dataset. One of the most commonly used unsupervised learning techniques is clustering, which groups data points based on their similarities.
Clustering and Its Importance
Clustering algorithms help uncover meaningful groupings within a dataset. This is especially useful in business analytics, customer segmentation, and social media insights. In our case, clustering can help categorize online sellers based on engagement metrics, allowing us to understand different seller behaviors and success patterns.
Some widely used clustering algorithms include:
- K-Means Clustering – Groups data into a predefined number of clusters based on similarity.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise) – Identifies clusters based on density, useful for detecting outliers.
- Hierarchical Clustering – Builds a tree-like structure to visualize relationships between data points.
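As a quick illustration, all three are available in scikit-learn with a shared fit/predict-style interface. The snippet below is a sketch on random toy data with placeholder parameters, just to show how each algorithm is instantiated:

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Toy data: 100 points in 2D, only to demonstrate the shared API
X = np.random.rand(100, 2)

kmeans_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)      # fixed number of clusters
dbscan_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)       # density-based; label -1 marks noise/outliers
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)  # bottom-up merging of closest groups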
Applying Clustering to Facebook Live Sellers’ Engagement Data
In this blog post, we analyze a dataset containing 7,050 Facebook posts from 10 Thai fashion and cosmetics retailers, covering text posts, images, and videos (both live and pre-recorded). Each post has recorded engagement metrics, including:
- Likes (Traditional + Emoji Reactions)
- Comments
- Shares
Using clustering, we aim to segment sellers based on their engagement patterns. This will help identify:
- High-engagement sellers who receive strong audience interaction.
- Moderate-engagement sellers with average interaction levels.
- Low-engagement sellers who struggle to attract engagement.
By analyzing these clusters, we can extract insights into what drives success on Facebook Live and other content formats. In the next section, we’ll dive into how we can get the data for preprocessing, feature selection, and applying clustering algorithms.
Accessing and Preparing the Dataset for Analysis
Before diving into data analysis, it’s important to first obtain and set up the dataset. Since the dataset is hosted on a server as a ZIP file, here’s a step-by-step guide to downloading and preparing it for analysis.
- Download the ZIP File
- The dataset containing Facebook post engagement data is hosted as a ZIP file. You can download it from the link provided: simply click the download link, and the ZIP file will be saved to your local machine. Alternatively, you can use Python to automate the download. The following code downloads the ZIP file directly from the server:
import requests  # HTTP library for downloading the file

link = "https://archive.ics.uci.edu/static/public/488/facebook+live+sellers+in+thailand.zip"  # URL of the dataset ZIP file
req = requests.get(link)  # Send a GET request to download the file
with open("data.zip", "wb") as file:  # Open/create 'data.zip' in write-binary mode
    file.write(req.content)  # Write the downloaded content to disk
- Extracting the Data
- Once the ZIP file is downloaded, you’ll need to extract its contents. If you’re working manually, you can use your computer’s file explorer to extract the ZIP file. However, if you prefer to automate the process, here’s how you can do it in Python:
from zipfile import ZipFile  # Tools for working with ZIP compressed files

zip_data_path = "data.zip"  # Path to the downloaded ZIP file
zip_file = ZipFile(zip_data_path)  # ZipFile object representing the archive
for file_name in zip_file.namelist():  # Loop through each file in the archive
    with open(file_name, "wb") as file:  # Open/create a file in write-binary mode
        file.write(zip_file.read(file_name))  # Write the decompressed contents to the new file
zip_file.close()  # Release the archive handle
After extraction, you should see the contents of the ZIP file in your working folder. For this dataset, the extracted data is a CSV file (Live_20210128.csv) containing the Facebook posts and their engagement metrics.
From Raw Data to Pandas: Preparing for Clustering
Raw data alone won’t reveal winning strategies—it needs to be organized for analysis. Using Python’s pandas library, we’ll load the CSV into a DataFrame, where we can clean, explore, and prepare it for clustering.
import pandas as pd #Imports the Pandas library and gives it the alias pd (a common convention)
df = pd.read_csv('Live_20210128.csv') #Reads the CSV file 'Live_20210128.csv' into a DataFrame
print(df.head()) #Displays the first 5 rows of the DataFrame
This will give us an initial look at the data and its structure. The dataset should include columns like post type (text, live video, image), engagement metrics (likes, shares, comments), and possibly other relevant features like post timestamps.
Uncovering Patterns: Exploratory Data Analysis for Clustering
Before applying clustering algorithms, it’s essential to first explore and understand the dataset. Exploratory Data Analysis (EDA) helps uncover key patterns, trends, and potential issues within the data. By visualizing engagement metrics such as likes, comments, and shares, we can gain insights into how different types of Facebook posts perform and whether any anomalies or outliers exist.
In this section, we will:
- Summarize the dataset’s structure and key statistics.
- Identify trends in engagement across different post types (text, images, live videos).
- Detect missing values, outliers, and distribution patterns.
- Use visualizations such as histograms, scatter plots, and box plots to gain deeper insights.
By performing EDA, we can make informed decisions on feature selection and ensure that the data is well-prepared for clustering analysis.
df.info()  # Prints a technical summary: column names, non-null counts, and dtypes (info() prints directly, so no print() wrapper is needed)
The dataset contains 7,050 instances and 16 attributes, whereas the dataset description states there should be 7,051 instances and 12 attributes.
From this, we can infer that the first instance is likely a header row, and the dataset includes four additional attributes beyond what was initially described.
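Beyond the structural summary, a few quick pandas one-liners cover the distribution, missing-value, and post-type questions from the EDA checklist above (a sketch):

print(df.describe())                     # count, mean, std, min, quartiles, max per numeric column
print(df.isnull().sum())                 # missing values per column
print(df['status_type'].value_counts())  # how many posts of each type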
Key Observations:
- Complete Engagement Metrics: all reaction-type columns (`num_likes`, `num_loves`, etc.) have 7,050 non-null values. Implication: no missing data in the critical engagement features, which is ideal for clustering.
- Empty Columns: `Column1` through `Column4` have 0 non-null values (entirely empty). Action: remove them to clean the dataset:

df.drop(['Column1', 'Column2', 'Column3', 'Column4'], inplace=True, axis=1)

- Data Types:
  - Numeric: reaction counts (`int64`)
  - Categorical: `status_type` (object)
  - Datetime: `status_published` (currently stored as `object`)
Should You Keep `status_id`?
- Redundancy confirmed: the column appears to be a duplicate identifier, since:
  - Pandas already has a built-in index (0, 1, 2, …)
  - `status_id` provides no additional unique information in the sample
- When to keep it, only if:
  - It’s a foreign key for joining with other datasets
  - The values have special meaning (e.g., encoded timestamps or seller info)
  - You need to preserve the original post IDs for reference
- Recommended action:

df.drop(['status_id'], inplace=True, axis=1)
Why This Matters for Clustering:
- Unnecessary features can skew results or slow computations
- Cleaner data improves interpretability of clusters
- Memory efficiency (though minimal impact here)
Should You Keep `status_published` for Clustering?
Case 1: Remove It ❌
- If your goal is to cluster posts purely by engagement patterns (reactions, comments, shares), the timestamp itself is irrelevant.
- Action:
df.drop(['status_published'], inplace=True, axis=1)
Case 2: Keep and Transform It ✅
- If you want to cluster based on posting behavior or time-based patterns.
- Use Case:
- Cluster sellers who post at similar times (e.g., “Late-night live sellers”).
- Identify optimal posting schedules for each segment.
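If you go this route, a minimal sketch (assuming pandas can parse the timestamp strings in `status_published`) is to derive numeric time features before dropping the raw column:

df['status_published'] = pd.to_datetime(df['status_published'])  # parse timestamps (assumes a parseable format)
df['post_hour'] = df['status_published'].dt.hour                 # hour of day, 0-23
df['post_dayofweek'] = df['status_published'].dt.dayofweek       # Monday=0 ... Sunday=6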
Should You Keep `status_type` for Clustering?
Keep it if:
- Post type influences engagement (e.g., videos get more comments, photos get more likes).
- You want clusters to reflect content-strategy differences.
`status_type` (e.g., video, photo, status) is a categorical feature that can significantly impact clustering results, but only if you handle it properly.
To visually compare how different Facebook post types perform across the engagement metrics, we can use the `status_type` column to create a series of bar plots. These plots will illustrate the total number of likes, shares, and comments for each post type (e.g., photo, video, status). By doing so, we can easily spot trends, such as whether videos receive more comments due to real-time interaction, or whether photos tend to generate higher reaction counts. Bar plots provide a clear way to analyze engagement distribution across post types, helping us understand which content formats drive the most audience interaction.
import matplotlib.pyplot as plt

# Define engagement metrics to plot
metrics = [
    'num_reactions', 'num_comments', 'num_shares',
    'num_likes', 'num_loves', 'num_wows',
    'num_hahas', 'num_sads', 'num_angrys'
]

# Set up a 3x3 grid of subplots
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(18, 12))
fig.suptitle('Engagement Metrics by Post Type', fontsize=16, y=1)

# Flatten axes for easy iteration
axes = axes.flatten()

# Loop through metrics and plot each in a subplot
for i, metric in enumerate(metrics):
    # Group by post type and plot the total for this metric
    df.groupby('status_type')[metric].sum().plot(
        kind='bar',
        ax=axes[i],
        color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'],  # One color per post type (link, photo, status, video)
        edgecolor='black',
        alpha=0.7
    )
    # Customize subplot
    axes[i].set_title(metric.replace('num_', '').title(), pad=1)
    axes[i].set_xlabel('Post Type', fontsize=10)
    axes[i].set_ylabel('Total Count', fontsize=10)
    axes[i].grid(axis='y', linestyle='--', alpha=0.6)
    # Rotate x-axis labels for readability
    plt.setp(axes[i].get_xticklabels(), rotation=45, ha='right')

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()
The bar plots illustrate how photos and videos dominate across various engagement metrics, including reactions, comments, shares, and specific emoji reactions.
Photos receive the highest number of total reactions and likes, indicating they are the most engaging in terms of general user appreciation. Videos follow closely in reactions but stand out significantly in comments and shares, suggesting they drive more interaction and virality.
The comments bar for videos is overwhelmingly high compared to other post types, showing that videos encourage discussions, likely due to their dynamic nature and real-time interaction. Shares are also significantly higher for videos, meaning users are more inclined to distribute video content than photos, text posts, or links.
“Love” and “Wow” reactions are more common for videos, reinforcing their emotional impact. “Haha” reactions are also higher for videos, suggesting that humorous or entertaining content is often video-based. “Sad” and “Angry” reactions appear more frequently in videos and photos compared to links or text statuses, indicating that emotionally charged content is more common in visual formats.
Links receive the least engagement across all metrics, meaning users are less likely to interact with external content. Status updates have slightly higher engagement than links but still perform poorly compared to photos and videos, likely due to the lack of visual appeal.
To include `status_type` in clustering, we need to convert it into a numerical format, as most clustering algorithms work best with numerical data. Since `status_type` is a categorical variable with values like photo, video, and status, we can use one-hot encoding to create separate binary columns for each category, or label encoding to assign numerical labels to each type. One-hot encoding preserves category independence but increases dimensionality, while label encoding is more compact but may introduce unintended ordinal relationships.
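For contrast, label encoding with scikit-learn would look like the sketch below; we don’t use it here because the integer codes imply an ordering that doesn’t exist (the `status_type_encoded` column name is just illustrative):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Maps each category to an integer, e.g. link=0, photo=1, status=2, video=3 (alphabetical order)
df['status_type_encoded'] = le.fit_transform(df['status_type'])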
The easiest approach is pandas’ get_dummies. It is also the more flexible one, because it can encode as many categorical columns as you like and lets you label the new columns with a prefix. Proper naming will make the rest of the analysis just a little bit easier.
df = pd.get_dummies(df, columns=['status_type'], drop_first=True, prefix='type')  # One-hot encode post type
print(df.head())
Preparing Data for Clustering: The Importance of Feature Scaling
Before applying clustering algorithms, it’s crucial to scale the features to ensure that all variables contribute equally to the distance calculations. Engagement metrics such as likes, shares, and comments have different numerical ranges, which can lead to biased clustering results if not properly scaled. For example, since the number of likes is often much larger than the number of shares, clustering algorithms like K-Means may give disproportionate weight to likes when forming clusters.
Why Scale the Data?
- Avoids bias in clustering: Algorithms that use distance calculations (e.g., K-Means, Hierarchical Clustering) can be skewed if one feature has a much larger scale than others.
- Improves performance: Well-scaled data ensures faster convergence and more meaningful clusters.
- Ensures fair comparison: Scaling puts all features on a comparable scale, preventing any single metric from dominating the analysis.
from sklearn.preprocessing import StandardScaler
features = [
'num_reactions', 'num_comments', 'num_shares',
'num_likes', 'num_loves', 'num_wows',
'num_hahas', 'num_sads', 'num_angrys'
]
standardscaler = StandardScaler(with_mean=True, with_std=True)  # Standardize to zero mean, unit variance
scaled_data = standardscaler.fit_transform(df[features])  # Fit on the engagement features and transform them
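After fitting, a quick sanity check confirms the transform behaved as expected: each standardized column should have a mean near 0 and a standard deviation near 1.

import numpy as np

print(np.round(scaled_data.mean(axis=0), 2))  # should be ~0 for every feature
print(np.round(scaled_data.std(axis=0), 2))   # should be ~1 for every feature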
Segmenting Sellers with K-Means: From Data to Clusters
Now that our data is properly preprocessed and scaled, we can apply K-Means clustering, one of the most widely used unsupervised learning techniques for segmentation. K-Means helps us identify natural groups within the dataset by clustering posts based on engagement metrics like likes, shares, and comments.
How K-Means Works
K-Means is a centroid-based clustering algorithm that works as follows:
- Choose the number of clusters, K.
- Randomly initialize K cluster centroids within the feature space.
- Assign each data point to the nearest centroid based on Euclidean distance.
- Recalculate centroids by averaging the points assigned to each cluster.
- Repeat steps 3 and 4 until centroids stop changing significantly.
By the end of this process, posts with similar engagement patterns are grouped into the same cluster, allowing for valuable insights.
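To make these steps concrete, here is a minimal NumPy sketch of the loop (illustrative only; it assumes no cluster ends up empty, and we’ll rely on scikit-learn’s far more robust implementation for the actual analysis):

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # Step 2: random initialization
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids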
Choosing the Optimal Number of Clusters (K)
One challenge in K-Means is deciding the right number of clusters. The two common methods for determining K are:
- Elbow Method: Plots the variance within clusters (inertia) against different values of K. The point where the curve “bends” is an ideal choice.
- Silhouette Score: Measures how well-separated the clusters are. A higher silhouette score suggests better-defined clusters.
from sklearn.cluster import KMeans

inertias = []
# Try different values of K and compute inertia
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10)
    kmeans.fit(scaled_data)
    inertias.append(kmeans.inertia_)
# Plot Elbow Method
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()
The Elbow Method plot shows how the inertia (within-cluster sum of squares) decreases as the number of clusters K increases. At some point the decrease slows down, forming an elbow shape that marks a good choice for K. Based on the curve, the “elbow” appears to be around K = 3 or 4, meaning that clustering the data into 3 or 4 groups might provide a good balance between simplicity and segmentation quality.
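The silhouette score mentioned earlier offers a useful cross-check on the elbow estimate (a sketch using scikit-learn’s silhouette_score; higher is better):

from sklearn.metrics import silhouette_score

# Silhouette needs at least 2 clusters, so start the search at K=2
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, n_init=10)
    labels = kmeans.fit_predict(scaled_data)
    print(f"K={k}: silhouette score = {silhouette_score(scaled_data, labels):.3f}")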
Applying K-Means to Engagement Data
Once we determine the best K, we can apply K-Means clustering:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10)
labels = kmeans.fit_predict(scaled_data)
df['cluster'] = labels
After clustering, we can analyze the characteristics of the clusters using a scatter plot.
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(df['num_reactions'], df['num_comments'], c=df['cluster'], cmap='RdBu', alpha=0.5)
plt.title('KMeans Clustering of Facebook Live Sellers')
plt.xlabel('Number of Reactions')
plt.ylabel('Number of comments')
plt.grid(linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()
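A numeric profile of each cluster complements the plot. Averaging the original (unscaled) engagement metrics per cluster shows what distinguishes the groups (a sketch):

# Average engagement per cluster, on the original (unscaled) metrics
print(df.groupby('cluster')[features].mean().round(1))
# How many posts fall into each cluster
print(df['cluster'].value_counts())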
Interpreting the K-Means Clustering Results
The scatter plot visualizes the K-Means clustering results, where data points are grouped into different clusters based on the number of reactions (x-axis) and comments (y-axis). Each color represents a distinct cluster.
Key Observations:
- Low Engagement Cluster (Bottom-Left)
- Majority of data points are concentrated near the origin, indicating posts with low reactions and comments.
- This suggests a large group of sellers receive minimal engagement.
- Moderate Engagement Cluster (Middle Region)
- A cluster of points scattered between 500–2000 reactions and comments.
- These posts show moderate engagement, likely representing sellers with a growing audience.
- High Engagement Cluster (Top Region)
- A few data points with thousands of comments and reactions.
- These posts belong to high-performing sellers who generate significant user interaction.
What This Means
- Identifying Key Engagement Patterns: helps sellers understand where they stand and what kind of engagement they attract.
- Targeted Strategy Optimization: sellers in low or moderate engagement clusters can analyze high-performing sellers to improve their content strategy.
- Actionable Insights: businesses can segment sellers based on engagement levels and provide tailored recommendations.
Conclusion
Clustering analysis offers a powerful way to segment Facebook Live sellers based on engagement patterns, helping to uncover key insights that might not be apparent through traditional analytics. By applying K-Means clustering to engagement data—likes, comments, shares, and reactions—we were able to identify distinct seller groups with varying interaction levels.
From our analysis, we observed that video posts tend to drive the highest engagement, followed by photo posts, while status updates and links generally receive lower interaction. This highlights the importance of content strategy for digital sellers, where incorporating more video-based content could significantly enhance engagement.
Furthermore, the use of feature scaling, encoding categorical variables, and selecting the optimal number of clusters using the Elbow Method ensured a robust clustering process. These steps provided a structured approach to segmenting sellers, enabling a data-driven strategy for improving online sales performance.
Ultimately, leveraging machine learning techniques like clustering can help businesses refine their social media marketing approach, tailor their content, and better engage their audience, leading to increased visibility and sales. As digital commerce continues to evolve, integrating data-driven insights will be essential for staying ahead in a competitive landscape.
FAQs (Frequently Asked Questions)
What is clustering, and why is it useful for analyzing Facebook Live sellers?
Clustering is an unsupervised machine learning technique used to group similar data points based on patterns. In this context, it helps segment Facebook Live sellers based on engagement metrics like reactions, comments, and shares, allowing businesses to identify trends and optimize their strategies.
Why did we choose K-Means clustering for this analysis?
K-Means is a widely used clustering algorithm due to its simplicity and efficiency. It works well with numerical data and helps identify groups of sellers with similar engagement levels, making it ideal for this study.
How did we determine the optimal number of clusters?
We used the Elbow Method, which involves plotting the inertia (sum of squared distances to the nearest cluster center) against different values of K (number of clusters). The “elbow point” in the graph indicates the ideal number of clusters.
What were the key findings from the clustering results?
- Video posts tend to generate the highest engagement in terms of comments and shares.
- Photo posts also receive high levels of engagement, but not as much as videos.
- Status updates and link posts generally have lower engagement.
- Sellers can be segmented into different groups based on their engagement levels, helping businesses tailor their content strategy.