k-Means clustering algorithm and its limitation

Machine Learning Using Python Unsupervised Learning: Clustering
6 minutes
Share the link to this page
Copied
  Completed
You need to have access to the item to view this lesson.
One-time Fee
$69.99
List Price:  $99.99
You save:  $30
€59.73
List Price:  €85.34
You save:  €25.60
£51.80
List Price:  £74.01
You save:  £22.20
CA$96.84
List Price:  CA$138.36
You save:  CA$41.51
A$106.75
List Price:  A$152.51
You save:  A$45.75
S$89.95
List Price:  S$128.50
You save:  S$38.55
HK$545.64
List Price:  HK$779.53
You save:  HK$233.88
CHF 55.85
List Price:  CHF 79.79
You save:  CHF 23.94
NOK kr703.17
List Price:  NOK kr1,004.57
You save:  NOK kr301.40
DKK kr445.91
List Price:  DKK kr637.04
You save:  DKK kr191.13
NZ$118.73
List Price:  NZ$169.63
You save:  NZ$50.89
د.إ257.03
List Price:  د.إ367.21
You save:  د.إ110.17
৳8,495.52
List Price:  ৳12,136.98
You save:  ৳3,641.45
₹6,172.17
List Price:  ₹8,817.76
You save:  ₹2,645.59
RM295.67
List Price:  RM422.40
You save:  RM126.73
₦107,084.70
List Price:  ₦152,984.70
You save:  ₦45,900
₨19,808.40
List Price:  ₨28,298.93
You save:  ₨8,490.52
฿2,243.26
List Price:  ฿3,204.79
You save:  ฿961.53
₺2,883.76
List Price:  ₺4,119.83
You save:  ₺1,236.07
B$378.92
List Price:  B$541.35
You save:  B$162.42
R1,231.03
List Price:  R1,758.69
You save:  R527.66
Лв116.90
List Price:  Лв167.02
You save:  Лв50.11
₩97,041.13
List Price:  ₩138,636.13
You save:  ₩41,595
₪234.19
List Price:  ₪334.57
You save:  ₪100.38
₱3,968.43
List Price:  ₱5,669.43
You save:  ₱1,701
¥10,316.87
List Price:  ¥14,739.02
You save:  ¥4,422.15
MX$1,309.87
List Price:  MX$1,871.33
You save:  MX$561.45
QR255.14
List Price:  QR364.50
You save:  QR109.36
P938.51
List Price:  P1,340.79
You save:  P402.27
KSh9,032.87
List Price:  KSh12,904.65
You save:  KSh3,871.78
E£3,400.11
List Price:  E£4,857.51
You save:  E£1,457.40
ብር9,985.48
List Price:  ብር14,265.58
You save:  ብር4,280.10
Kz64,180.83
List Price:  Kz91,690.83
You save:  Kz27,510
CLP$67,522.15
List Price:  CLP$96,464.35
You save:  CLP$28,942.20
CN¥499.22
List Price:  CN¥713.21
You save:  CN¥213.98
RD$4,414.63
List Price:  RD$6,306.89
You save:  RD$1,892.25
DA9,089.11
List Price:  DA12,985
You save:  DA3,895.89
FJ$157.67
List Price:  FJ$225.25
You save:  FJ$67.58
Q535.38
List Price:  Q764.86
You save:  Q229.48
GY$14,604.61
List Price:  GY$20,864.62
You save:  GY$6,260.01
ISK kr8,549.27
List Price:  ISK kr12,213.77
You save:  ISK kr3,664.50
DH634.40
List Price:  DH906.33
You save:  DH271.92
L1,171.66
List Price:  L1,673.87
You save:  L502.21
ден3,674.28
List Price:  ден5,249.20
You save:  ден1,574.92
MOP$561.02
List Price:  MOP$801.49
You save:  MOP$240.47
N$1,234.02
List Price:  N$1,762.97
You save:  N$528.94
C$2,569.13
List Price:  C$3,670.35
You save:  C$1,101.21
रु9,858.12
List Price:  रु14,083.64
You save:  रु4,225.51
S/245.88
List Price:  S/351.28
You save:  S/105.39
K291.36
List Price:  K416.25
You save:  K124.88
SAR262.47
List Price:  SAR374.98
You save:  SAR112.50
ZK1,666.62
List Price:  ZK2,380.99
You save:  ZK714.36
L303.21
List Price:  L433.17
You save:  L129.96
Kč1,456.45
List Price:  Kč2,080.74
You save:  Kč624.28
Ft23,442.21
List Price:  Ft33,490.30
You save:  Ft10,048.09
SEK kr657.57
List Price:  SEK kr939.43
You save:  SEK kr281.85
ARS$95,110.21
List Price:  ARS$135,877.55
You save:  ARS$40,767.34
Bs482.36
List Price:  Bs689.11
You save:  Bs206.75
COP$278,383.76
List Price:  COP$397,708.14
You save:  COP$119,324.37
₡35,369.65
List Price:  ₡50,530.25
You save:  ₡15,160.59
L1,828.90
List Price:  L2,612.83
You save:  L783.92
₲504,728.30
List Price:  ₲721,071.34
You save:  ₲216,343.03
$U2,810.15
List Price:  $U4,014.67
You save:  $U1,204.52
zł253.83
List Price:  zł362.63
You save:  zł108.80
Already have an account? Log In

Transcript

Hello everyone, welcome to the course of machine learning with Python. In this video, we shall learn about a clustering algorithm called k means clustering. k means is it partitioning clustering algorithm. The K means algorithm partitions the given data into k clusters. each cluster has cluster center called centroid, K is usually user specified. So, what is the K means algorithm select k points randomly as initial centroids repeat from K clusters like C one c two c k by assigning all points to the closest centroid, we compute the centroid of each cluster using the formula mew suffix ci is equals to one sub mod of ci sum over all exists where x belongs to ci.

Okay, so what is mod of ci it denotes the number of points in cluster I until the centroids don't change or no reassignment of data points in different clusters. We call these convergence criteria. So, In this GIS we will see how k means works. So, as you can see, in each iteration, different data points are getting assigned to different clusters. And these separate separate colors denotes basically the cluster boundaries. And this cross marks denotes the cluster centroid.

As you can see total within clusters sum of square is getting reduced as we increase the number of iterations. So, what is within cluster sum of squares So, total within trust yourself squares is often by the following formula so, let's MUJI is the centroid of cluster chain and the requital keys such clusters then within cluster sum of squares are WC SS is equal to five We compute the distance between the X and Meucci, where x belongs to cluster g that means, we compute the distance between all the points of cluster G to its centroid is square up these these distances and sum it up for all the points in cluster ci, then we do it for all the such clusters that means, from cluster one to cluster key, we some of these and we obtain within cluster sum of squares, okay. So, here distance denote the distance function of users choice, usually we use Euclidean distance.

So, somebody lots about k means as initial centroids are often chosen randomly, the data produced may vary from one run to another K means will converge for more common similarity or dissimilarity measures. Most of the convergence happens in the first few iterations complexity of the algorithm is order of Multiplied by key multiplied by d multiplied by Im where a is equals to the number of data points key is number of clusters, small d denotes the dimension of the data set that means number of features and it denotes the number of iterations, no different runs of Kings algorithm on the same data set can produce very different results. This is one of the limitation of K means clustering. Now, why this happens? This is because random selection of initial centroids So, as you can see, these are the original points for optimal clustering, we may obtain a clustering like this, for suboptimal clustering, where the random initial points can produce a cluster like this.

So, this suboptimal clustering is wildly different from this optimal question. So, this is for round one, we obtain optimal clustering. This is for run to where we operate a sub optimal clustering So, choosing different set of initials incorrect sub produced completely different clustering on the same data set. Now, how we can overcome this limitation, so, we can pre process the data first that means normalize or standardize the data and eliminate the outliers if possible samples a data set and use hierarchical clustering that we shall introduce in the next lecture. To determine the initial set chords see that more than key initial supports and from these select key most widely separated centers after clustering, multiple runs and see that one which gives the minimum WC SS that means within clusters of squares value post processing the data, eliminate small clusters that may represent outliers or noises, split the loose clusters that is clusters with relatively high sum of squared error and merge the clusters that are close that means having relatively low sum of squares now k means faces problem when the clusters are of different sizes.

So, these are the original points and after k means these clusters have been produced as you can see, these are wildly different, okay. So, as the cluster sizes are different gaming's has not performed well in this set of points give me some kidneys also does not work well with the clusters are of different densities. So, these are the original points. So, as you can see this cluster is not densely packed, but these two clusters are densely packed. But, however, Keynes has identified these as clusters. So, YK means does not work well, because of different densities of the clusters k means also does not work well when the clusters are of non spherical shape.

So, as you can see, these two clusters are not of physical shape. So, in this case, the desert produces by k means clustering looks like this. Okay. So, as you can see, it has not understood toward the Underland data distribution and has produced some sort of spherical question over here. Okay, which is different from the original points. So so far, this is our discussion on k means clustering.

In the next video, we shall learn how to implement k means clustering in Python. So thank you. See you in the next lecture.

Sign Up

Share

Share with friends, get 20% off
Invite your friends to LearnDesk learning marketplace. For each purchase they make, you get 20% off (upto $10) on your next purchase.