Hello everyone, welcome to the course on machine learning with Python. In this video, we shall learn about a new topic called classification and classifiers. So, what is a classification problem? A classification problem is a supervised machine learning task to identify to which of a set of categories or classes a new observation belongs, on the basis of a training set of data containing observations or instances whose category membership or class is already known. So, what is a classifier? A classifier is an algorithm trained for classifying a particular set of objects into a certain number of classes. That means a classifier performs classification.
There are different types of classification tasks, like image classification, classification of certain diseases as malignant or benign, classification of news articles into sports, politics, social, finance, weather, entertainment, health, etc. Now, consider a document classifier: we feed a document to a document classifier and it classifies it into one of several categories, like whether it is social, political, sports, or entertainment. So for this particular document, it classifies it as a sports document. Now consider an image classifier: we feed an image to it, and it classifies it among several categories. So here it has classified the image into the cat category. Now consider a tumor classifier: we feed several attributes of a tumor, like tumor size, lump thickness, etc. to the tumor classifier, and it classifies the tumor as either malignant or benign. Similarly, consider an email classifier: we feed an email document into the email classifier and it classifies it as either spam or non-spam.
So for each type of classification task a separate classifier is needed. A text classifier is unsuitable for image classification and vice versa. Hence classifiers are task specific. If there are only two classes, then it is called a binary classification problem, and the corresponding classifier is known as a binary classifier. If there are more than two classes to classify into, then it is called a multi-class classification problem. Now, patterns. Patterns are nothing but points in the feature space which denote certain observations or objects. The features or attributes are the properties of the object or pattern which help us to classify or cluster certain examples. A student has a set of features like class, marks, attendance, height, weight, etc. So a particular combination of the values of those features denotes a particular student; similarly, some other combination denotes some other student. Okay, now consider the following example of tumor classification: here the two features are age and tumor size. In general, a pattern with k many features or attributes is a vector in k-dimensional feature space. We denote an observation as x^(i) = (x_1^(i), x_2^(i), ..., x_k^(i)), where x_j^(i) denotes the value of the j-th feature or attribute of the i-th observation.
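To make the notation concrete, here is a minimal sketch of how such patterns might be represented in Python with NumPy; the two features (age, tumor size) follow the example above, but the numeric values and labels are purely illustrative.

```python
import numpy as np

# Each row is one observation x^(i); each column is one feature.
# Hypothetical features: age (years) and tumor size (cm).
X = np.array([
    [45.0, 1.2],   # observation x^(1)
    [62.0, 3.8],   # observation x^(2)
    [38.0, 0.7],   # observation x^(3)
])

# Class labels y^(i) for each observation: 0 = benign, 1 = malignant (illustrative).
y = np.array([0, 1, 0])

print(X.shape)   # (3, 2): 3 patterns in a 2-dimensional feature space
print(X[1, 0])   # value of the 1st feature (age) of the 2nd observation
```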
So, in general it looks like the following. Note that each class label is differentiated by the symbols shown in the legend here, okay. So it is basically a three-class classification problem where each of the data points is plotted in the three-dimensional feature space denoted by x1, x2, x3, and class one is denoted by the dots, class two by the stars, and class three by the third symbol in the legend. So, in a classification problem, each of the observations in the training set is provided with a corresponding class label, and is usually represented as (x^(i), y^(i)), where y^(i) is the class label of the corresponding observation x^(i). Now, training and validation. So, usually the given data is subdivided into training data and validation data. Training data is used to train the model and estimate the model parameters.
Validation data is used for evaluation. Sometimes validation data is also called test data. There are several methods of making this split, as discussed in the following. One method is called bootstrapping: a random sample of the entire data set is taken as validation data and the rest as training data. Holdout: a fixed percentage of the entire data set is taken as training data and the rest as validation data; usually a 70% training and 30% test split is most common, but other splitting ratios are also acceptable. We will mostly be using this holdout method for training and testing our model. K-fold cross validation: divide the data set into k parts or folds, train on k-1 folds, test on the remaining fold, and repeat for different combinations. Leave-one-out cross validation is a special case of k-fold cross validation where k is equal to the number of samples in the data set: all the data except a single observation are used for training, and the model is tested on that single held-out observation.
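As a rough sketch of how these splitting strategies look in Python with scikit-learn; the feature matrix and labels here are made-up placeholders, not course data.

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut

# Hypothetical data: 10 observations with 2 features each, and binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Holdout: 70% training, 30% validation (the split we will mostly use).
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# K-fold cross validation: here k = 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X):
    pass  # train on X[train_idx], validate on X[val_idx]

# Leave-one-out: k equals the number of samples in the data set.
loo = LeaveOneOut()
for train_idx, val_idx in loo.split(X):
    pass  # val_idx contains exactly one held-out observation
```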
I hope this is clear to you. Now, evaluation of a binary classifier: how do we evaluate a binary classifier, that is, how do we measure the performance of a classifier? Accuracy is usually the most common and intuitive measure of classification performance. It is measured as: accuracy equals the number of correctly classified patterns divided by the total number of patterns. It is usually represented as a percentage, so the accuracy percentage is nothing but this ratio multiplied by hundred. We measure both the training and validation accuracy of a classifier. Training accuracy measures how well the classifier performs on the training set and is evaluated as: training accuracy equals the number of correctly classified patterns in the training set divided by the total number of patterns in the training set.
What is the validation accuracy? It measures how well the classifier performs on the validation data set, and it is evaluated as: validation accuracy equals the number of correctly classified patterns in the validation set divided by the total number of patterns in the validation set. Note that we train the model on the training data set and we leave the validation data set out; we do not show the validation data to the model. So, our objective is to see how well the model generalizes to unseen data. If the model generalizes well to unseen data, then both the training and the validation accuracy will be good, and we will prefer those models. So a good classifier should have good training as well as validation accuracy. Now, there is something called a confusion matrix which helps us to evaluate the performance of a classifier.
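Here is a minimal sketch of computing training and validation accuracy with scikit-learn, assuming a holdout split as above; the breast-cancer data and the logistic regression model are just illustrative choices, not part of the lecture.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Binary classification data (malignant vs benign), used here only as an example.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)  # the model sees only the training data

# Accuracy = correctly classified patterns / total patterns, on each set.
train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"training accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```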
So what is the confusion matrix? Consider a binary classification problem where there are only two classes, positive and negative. The confusion matrix is a nice way to describe the classifier's performance. As you can see, it is basically a matrix of dimension two by two. Here we have the actual labels along the rows, and along the columns we have the classifier's outcome, or the predicted labels. So the actual label could be positive or negative, and the predicted label could be positive or negative. Now, if a sample which is identified as positive by the classifier is also actually positive, that is, it belongs to the positive class, then it falls over here in this cell of the matrix, the true positives.
Now, if the classifier outcome is negative, but the actual label is positive, then that particular observation falls into this cell of the matrix. Similarly, if the classifier outcome is positive, but the actual label is negative, then it falls in this particular cell of the matrix. And similarly, a true negative is nothing but the case when both the classifier outcome and the actual label of that particular observation are negative. So true positives are the cases when the classifier gives positive output for an actually positive sample, and true negatives are the cases when the classifier gives negative output for an actually negative sample.
So, these are basically the true positive and true negative cases. Now, what is a false positive? It is the case where the outcome is incorrectly classified as positive, but actually it is negative. And what is a false negative? It is the case when an actually positive sample is incorrectly classified as negative. So, true positives and true negatives are the correct classifications, but false positives and false negatives are misclassifications. The misclassification rate, or the misclassification error, is defined as false positives plus false negatives divided by the total number of observations, which is nothing but true positives plus true negatives plus false positives plus false negatives. So the denominator is the total number of observations, and in the numerator we have false positives plus false negatives.
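Continuing the same illustrative binary setup as before, here is one way to compute the confusion matrix and the misclassification rate with scikit-learn; note that scikit-learn also follows the convention of actual labels along the rows and predicted labels along the columns.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Rows = actual labels, columns = predicted labels.
cm = confusion_matrix(y_val, model.predict(X_val))
tn, fp, fn, tp = cm.ravel()  # true negatives, false positives, false negatives, true positives

misclassification_rate = (fp + fn) / (tp + tn + fp + fn)
print(cm)
print(f"misclassification rate: {misclassification_rate:.3f}")
```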
Other performance measures that we can compute from the confusion matrix are the following. Accuracy: accuracy is nothing but one minus the misclassification error; it tells us how many data points we have actually correctly classified. So it is true positives plus true negatives divided by the total number of data points. Then we have precision: how many predicted positive events are actually positive? Precision is nothing but true positives divided by true positives plus false positives. Then we have recall: how many actual positive events are correctly classified? Recall is equal to true positives divided by true positives plus false negatives. Then we have the F1 score, a measure which is the harmonic mean of precision and recall.
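These metrics can be computed in the same way with scikit-learn, again reusing the illustrative binary setup from the previous sketches:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)
y_pred = LogisticRegression(max_iter=5000).fit(X_train, y_train).predict(X_val)

print("accuracy :", accuracy_score(y_val, y_pred))   # (TP + TN) / total = 1 - misclassification error
print("precision:", precision_score(y_val, y_pred))  # TP / (TP + FP)
print("recall   :", recall_score(y_val, y_pred))     # TP / (TP + FN)
print("f1 score :", f1_score(y_val, y_pred))         # harmonic mean of precision and recall
```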
Let's now go ahead and discuss something called multi-class classification. There are plenty of examples of multi-class classification problems in real life. For example, email foldering or tagging: based on the previous foldering or tagging history, a new email should be classified and placed under the relevant folder or tag. Then identifying the weather condition, whether it is sunny, rainy, or overcast, etc.; identifying the genre of a piece of music, whether it is pop, rock, classical or instrumental; then medical diagnosis, parts-of-speech tagging of words in a sentence, etc. We have many more examples. Let's consider a famous example of a multi-class classification data set, the Iris data set. The Iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald A. Fisher. The data set contains 50 samples from each of the three species of Iris: Iris setosa, Iris versicolor, and Iris virginica.
So this is setosa, this is versicolor, and this is virginica. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. So the data set contains a total of 150 observations of Iris flowers; there are four columns of measurements of the flowers in centimeters, and the fifth column is the species of the flower observed. This is a view of the first eight rows of the Iris data set. There are a total of four features, as we have seen, and here the target variable is the species. The species are usually coded as numbers: zero for setosa, one for versicolor, and two for virginica. Okay, so the following are pairwise visualizations obtained from the data, considering two features at a time: each graph or figure contains two different features, one on one axis and the other on the other axis. So there are a total of six plots, as you can see, and different classes are coded in different colors.
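As a quick sketch, the Iris data set can be loaded and inspected in Python with scikit-learn and pandas; the column names below come from scikit-learn's bundled copy of the data.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)  # 4 measurement columns, in cm
df["species"] = iris.target                               # 0 = setosa, 1 = versicolor, 2 = virginica

print(df.shape)           # (150, 5): 150 observations, 4 features + species column
print(df.head(8))         # first eight rows, as shown on the slide
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']

# Pairwise scatter plots (the six unique feature pairs), colored by class, e.g.:
# import seaborn as sns; sns.pairplot(df, hue="species")
```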
Now, evaluation of a multi-class classifier: similar to the binary classifier, here also we can define a confusion matrix, but here the dimension of the confusion matrix will be c by c, where c is nothing but the number of classes present in the data set. Now, the (i, j) entry of the confusion matrix denotes the number of test instances which originally belong to class i but are predicted as members of class j. The diagonal entries of the confusion matrix denote the numbers of correct identifications of the test samples. The total number of correct classifications is nothing but the sum of the diagonal entries of this square matrix, that is, its trace. Okay, so all the entries of the confusion matrix must be non-negative integers.
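To close, here is a minimal sketch of a multi-class confusion matrix on the Iris data; the k-nearest-neighbours model and the holdout split are illustrative choices only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
cm = confusion_matrix(y_val, model.predict(X_val))  # 3 x 3, since there are 3 classes

print(cm)                                        # entry (i, j): actual class i predicted as class j
print("correct classifications:", np.trace(cm))  # sum of the diagonal entries
print("total test samples     :", cm.sum())
```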