Hello everyone, welcome to the course of machine learning with Python. In this video, we shall learn about hierarchical clustering. Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree. It can be visualized as a dendrogram, a tree-like diagram that records the sequence of merges or splits of clusters. So let's consider these are the data points and this is the dendrogram. Okay. So, what are the strengths of hierarchical clustering? We do not have to assume any particular number of clusters; any desired number of clusters can be obtained by cutting the dendrogram at the proper level. Also, the clusters may correspond to meaningful taxonomies, for example, book cataloging, or in the biological sciences, the animal kingdom, phylogeny reconstruction, and so on. A minimal sketch of building and cutting a dendrogram follows.
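To make the dendrogram idea concrete, here is a minimal sketch, assuming scipy and matplotlib are available; the data points and variable names are made up for illustration, not taken from the lecture:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Made-up 2-D data points forming two loose groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Record the sequence of merges as a linkage matrix.
Z = linkage(X, method="single")

# Visualize the merge sequence as a dendrogram.
dendrogram(Z)
plt.show()

# Cut the dendrogram to obtain any desired number of clusters, e.g. 2.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster label for each of the six points
```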
There are two main types of hierarchical clustering. One is called agglomerative, or the bottom-up approach: start with the points as individual clusters, and at each step merge the closest pair of clusters, until only one cluster (or k clusters) remains. The other is divisive, or the top-down approach: it begins with one all-inclusive cluster, and at each step a cluster is split, until each cluster contains a single point or there are k clusters. Both approaches are perfectly capable of producing quality clusters. A quick illustration of the bottom-up approach is sketched below.
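As a hedged sketch of the bottom-up approach in practice, this example uses scikit-learn's AgglomerativeClustering with made-up data; scikit-learn is an assumption here, since the lecture has not named a library yet:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up data: two loose groups of points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Bottom-up: each point starts as its own cluster, and the closest
# pair of clusters is merged repeatedly until k clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1]
```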
People prefer to use the agglomerative approach more than the divisive approach. This is because it is less computationally expensive to deal with small clusters than with the large, all-inclusive cluster. Now, let's look at this GIF and see how agglomerative clustering works. We begin with all these individual points, and based on the distance they get merged, and step by step we construct the dendrogram. At any point, we can cut the dendrogram to obtain the number of clusters we want. So, traditional hierarchical clustering algorithms use a similarity or distance matrix to split or merge the clusters. This matrix is known as the proximity matrix; a small sketch of computing one follows.
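For instance, a proximity matrix over a handful of made-up points can be computed like this minimal sketch, assuming Euclidean distance and scipy:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2]])

# Pairwise Euclidean distances, arranged as an n x n proximity matrix.
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))
```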
The basic agglomerative clustering algorithm is as follows. Let each data point be a cluster and compute the proximity matrix. Then repeat: merge the two clusters which are closest to each other and update the proximity matrix, until only a single cluster (or k clusters) remains, where k is a user-provided number of clusters. Different approaches to defining the distance or proximity between two clusters distinguish the different algorithms; however, the basic algorithm shown above is followed everywhere. This is agglomerative clustering; a minimal from-scratch sketch of the loop follows.
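Here is a minimal from-scratch sketch of that loop, assuming Euclidean distance and single linkage; for simplicity it recomputes pairwise distances each iteration instead of maintaining the proximity matrix, and all names are illustrative:

```python
import numpy as np

def agglomerate(X, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(X))]  # each point starts as its own cluster
    while len(clusters) > k:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-linkage proximity: minimum pairwise distance.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        clusters[a] += clusters.pop(b)  # merge the closest pair of clusters
    return clusters

X = np.array([[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.0]])
print(agglomerate(X, 2))  # e.g. [[0, 1], [2, 3]]
```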
So, how do we define the inter-cluster distance, or proximity? Following are four different ways to define the inter-cluster distance or proximity: one is MIN, or single linkage; the next one is MAX, or complete linkage; the third one is group average; and the fourth one is the distance between centroids. We shall now briefly discuss these methods. MIN, or single linkage: here the minimum of all inter-cluster distances is considered for the proximity calculation. So, d(C_i, C_j) = min { d(x, y) : x ∈ C_i, y ∈ C_j }, where the vector x belongs to cluster i and the vector y belongs to cluster j; here d(C_i, C_j) denotes the representative distance between cluster i and cluster j. MAX, or complete linkage: here the maximum of all inter-cluster distances is considered for the proximity calculation. So here d(C_i, C_j) = max { d(x, y) : x ∈ C_i, y ∈ C_j }, where x is a point in cluster i and y is a point in cluster j. As discussed earlier, d(C_i, C_j) denotes the representative distance between cluster i and cluster j. A quick sketch of both follows.
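A quick sketch of both definitions, assuming numpy and two made-up clusters; the two linkages differ only in whether we take the min or the max over all inter-cluster pairs:

```python
import numpy as np

Ci = np.array([[1.0, 1.0], [1.5, 1.2]])  # points x in cluster i
Cj = np.array([[4.0, 4.2], [5.0, 5.1]])  # points y in cluster j

# All |Ci| x |Cj| pairwise Euclidean distances d(x, y).
pair = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

d_single = pair.min()    # MIN / single linkage:   min d(x, y)
d_complete = pair.max()  # MAX / complete linkage: max d(x, y)
print(d_single, d_complete)
```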
So, this, as you can see, is MAX, or the complete linkage. Group average: here the average of all inter-cluster distances is considered for the proximity calculation. So, how many distances are there? There are in total |C_i| times |C_j| distances. What is |C_i|? It is the number of points in cluster i, and |C_j| is the number of points in cluster j. So, we sum up all the distances d(x, y) such that x belongs to cluster i and y belongs to cluster j, and this sum is then divided by the total number of distances calculated, which is nothing but |C_i| multiplied by |C_j|: d(C_i, C_j) = (1 / (|C_i| |C_j|)) Σ_{x ∈ C_i, y ∈ C_j} d(x, y). Next, the distance between centroids. How do we calculate the centroid of cluster i? It is computed as μ_i = (1 / |C_i|) Σ_{x ∈ C_i} x, that is, we sum up all the vectors x which belong to cluster i and then divide the sum by the total number of points in cluster i. Similarly, we can compute the centroid μ_j of cluster j. Then the representative distance between cluster i and cluster j is nothing but the distance between the two centroids: d(C_i, C_j) = d(μ_i, μ_j). Note that here d is a distance function of the user's choice; usually we use the Euclidean distance. These last two methods are sketched below.
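Continuing the same sketch (numpy, made-up clusters), group average and the centroid distance look like this:

```python
import numpy as np

Ci = np.array([[1.0, 1.0], [1.5, 1.2]])  # cluster i
Cj = np.array([[4.0, 4.2], [5.0, 5.1]])  # cluster j
pair = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

# Group average: sum of all |Ci| * |Cj| pairwise distances,
# divided by the number of distances, |Ci| * |Cj|.
d_average = pair.sum() / (len(Ci) * len(Cj))

# Distance between centroids: mu_i = (1/|Ci|) * sum of x over Ci, etc.
mu_i = Ci.mean(axis=0)
mu_j = Cj.mean(axis=0)
d_centroid = np.linalg.norm(mu_i - mu_j)  # Euclidean distance, the usual choice
print(d_average, d_centroid)
```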
So, so far this is our agglomerative clustering. In the next video, we will see how to implement agglomerative clustering in Python. Thank you, see you in the next lecture.