K-Means Clustering with the Elbow method (2024)

K-means clustering is an unsupervised learning algorithm that groups data based on each point's Euclidean distance to a central point called a centroid. The centroids are defined by the mean of all points assigned to the same cluster. The algorithm first chooses random points as centroids and then iteratively adjusts them until it converges.

An important thing to remember when using K-means is that the number of clusters is a hyperparameter: it must be defined before running the model.
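To make the iteration concrete, here is a minimal from-scratch sketch of the K-means loop in NumPy. This is illustrative only (the `kmeans` helper and the toy points are my own, not part of the article's penguin workflow); Scikit-Learn's implementation shown later is what you would use in practice:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of toy points
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
labels, centroids = kmeans(X, k=2)
print(labels)  # each pair of nearby points shares a label
```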

K-means can be implemented using Scikit-Learn with just 3 lines of code. Scikit-Learn also provides a smarter centroid-initialization method, k-means++, that helps the model converge faster.
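As a quick preview of those 3 lines (using made-up 2-D points, not the penguin data yet), the whole fit looks like this:

```python
from sklearn.cluster import KMeans

# Toy 2-D points forming three obvious pairs
X = [[1, 1], [1.2, 0.8], [5, 5], [5.2, 4.9], [9, 1], [9.1, 1.2]]

model = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
model.fit(X)
print(model.labels_)  # one cluster label per point
```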

To apply the K-means clustering algorithm, let's load the Palmer Penguins dataset, choose the columns that will be clustered, and use Seaborn to plot a scatter plot with color-coded clusters.


Note: You can download the dataset from this link.

Let's import the libraries and load the Penguins dataset, trimming it to the chosen columns and dropping rows with missing data (there were only 2):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv('penguins.csv')
print(df.shape) # (344, 9)

df = df[['bill_length_mm', 'flipper_length_mm']]
df = df.dropna(axis=0)

We can use the Elbow method to get an indication of the number of clusters in our data. It consists of interpreting a line plot with an elbow shape: the number of clusters is where the elbow bends. The x-axis of the plot is the number of clusters and the y-axis is the Within-Cluster Sum of Squares (WCSS) for each number of clusters:

wcss = []
for i in range(1, 11):
    clustering = KMeans(n_clusters=i, init='k-means++', random_state=42)
    clustering.fit(df)
    wcss.append(clustering.inertia_)

ks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sns.lineplot(x=ks, y=wcss);

The elbow method indicates our data has 2 clusters. Let's plot the data before and after clustering:

# The loop above left `clustering` fit with 10 clusters,
# so refit with the suggested k=2 before plotting
clustering = KMeans(n_clusters=2, init='k-means++', random_state=42)
clustering.fit(df)

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
sns.scatterplot(ax=axes[0], data=df, x='bill_length_mm', y='flipper_length_mm').set_title('Without clustering')
sns.scatterplot(ax=axes[1], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering.labels_).set_title('Using the elbow method');

This example shows that the Elbow method is only a reference when choosing the number of clusters. We already know that there are 3 species of penguins in the dataset, but if we were to determine their number using the Elbow method alone, 2 clusters would be our result.

Since K-means is sensitive to data variance, let's look at the descriptive statistics of the columns we are clustering:

df.describe().T # T is to transpose the table and make it easier to read

This results in:

                   count        mean        std    min      25%     50%    75%    max
bill_length_mm     342.0   43.921930   5.459584   32.1   39.225   44.45   48.5   59.6
flipper_length_mm  342.0  200.915205  14.061714  172.0  190.000  197.00  213.0  231.0

Notice that the two columns have considerable variance and sit on very different scales: flipper lengths span a much wider range than bill lengths. Since K-means relies on distances, the larger-scale feature will dominate the clustering. Let's fix that by scaling the data with StandardScaler:

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
scaled = ss.fit_transform(df)

Now, let's repeat the Elbow method process for the scaled data:

wcss_sc = []
for i in range(1, 11):
    clustering_sc = KMeans(n_clusters=i, init='k-means++', random_state=42)
    clustering_sc.fit(scaled)
    wcss_sc.append(clustering_sc.inertia_)

ks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sns.lineplot(x=ks, y=wcss_sc);


This time, the suggested number of clusters is 3. We can plot the data with the cluster labels again along with the two former plots for comparison:

# Refit on the scaled data with the suggested k=3
# (the loop above left `clustering_sc` fit with 10 clusters)
clustering_sc = KMeans(n_clusters=3, init='k-means++', random_state=42)
clustering_sc.fit(scaled)

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
sns.scatterplot(ax=axes[0], data=df, x='bill_length_mm', y='flipper_length_mm').set_title('Without clustering')
sns.scatterplot(ax=axes[1], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering.labels_).set_title('With the Elbow method')
sns.scatterplot(ax=axes[2], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering_sc.labels_).set_title('With the Elbow method and scaled data');

When using K-means clustering, you need to predetermine the number of clusters. As we have seen, a method for choosing k yields only a suggestion, and the result can be impacted by the amount of variance in the data. It is important to conduct an in-depth analysis and generate more than one model with different values of k when clustering.

If there is no prior indication of how many clusters are in the data, visualize it, test it, and interpret the results to see if the clustering makes sense. If not, cluster again. Also, look at more than one metric and instantiate different clustering models: for K-means, look at the silhouette score, and perhaps try hierarchical clustering to see if the results stay the same.
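For instance, the silhouette score mentioned above (higher is better, ranging from -1 to 1) can be compared across candidate values of k. A rough sketch on synthetic blob data, standing in for the scaled penguin features (the `make_blobs` setup here is my own, for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled features: 3 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```

Since the data was generated with 3 centers, the score should peak at k=3, corroborating what the elbow would suggest.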
