Evaluation Basics

32 minutes

Transcript

Once you have developed a machine learning algorithm for classification, you often need to evaluate how good it is. There are several metrics that can be used to evaluate how well a given algorithm performs. This matters when there are multiple candidate algorithms and you need to make an objective choice between them; that decision can be based on the evaluation criteria. You also need to tune the hyperparameters of your machine learning algorithm, such as the learning rate, the number of hidden layers, the number of nodes in each hidden layer, and often the size of the training data as well.

You will often find that tuning these parameters can have a significant impact on the evaluation metrics. By trial and error, you can choose those parameters so that the evaluation metrics improve on subsequent trials. There are several evaluation metrics: accuracy is one of them, then precision, recall, F1 score, and AUC-ROC. AUC-ROC sounds like a fancy term: AUC stands for area under the curve, and ROC stands for receiver operating characteristic. In reality these concepts are pretty easy to understand; they are just known by their fancy names.

So don't get bogged down by the fancy names. I will go through each metric and illustrate it with an example, so that you can develop intuition behind these metrics. Before going into any metric, it is very important to understand something called a confusion matrix. We will take a simple example so that I can drive this point home. Consider a data set of images where there are only two possible outcomes: an image is either a dog image or it is not a dog image. So we are talking about a binary classifier.

There are two things we need to consider: the actual outcome and the expected outcome. Once you are done with training, you use a test set where you know the labels but hide them from the model; those labels are the expected outcomes. You then ask your machine learning algorithm to predict the actual outcome, and it can predict positive or negative. You compare the predictions with the expected outcomes and compute the metric.

Let's take our example again: if the classifier predicts that the image is a dog, let's call that a positive outcome, and if it predicts the image is not a dog, we call that a negative outcome. So the actual outcome can be of two types. Let's draw a table: the actual outcome can be positive or it can be negative.

Similarly, the expected outcome can also be positive or negative. So there are four possibilities that we need to consider. When the expected outcome matches the actual outcome, we call those the true scenarios. When the actual outcome is positive and the expected outcome is also positive, that's a good sign.

We call this a true positive outcome. Similarly, when the actual outcome is negative, meaning you classify an image as not a dog and in reality it is also not a dog image, we call it a true negative. Now, in the case where the actual outcome is positive but the expected outcome is negative, we call it a false positive, and in the case where the actual outcome is negative and the expected outcome is positive, we call it a false negative. These are the four possible types of outcomes for a binary classifier.

Let me explain this again with the dog example, where each image is either a dog or not a dog. If the model predicts the image is a dog and the expected label is also a dog, we call it a true positive. If the model predicts the image is not a dog and the expected label is also not a dog, we call it a true negative. On the other side, where the model's prediction mismatches the expected outcome, we have the false positive and false negative scenarios. In the case where our model says the image is a dog, but in actuality it is not a dog, we call this a false positive.

And in the case where our model says not a dog, but actually it is a dog, we call it a false negative. If our model is really good, we want the actual outcome to match the expected outcome as often as possible: we want to maximize true positives and true negatives, and minimize false negatives and false positives. That is what the evaluation metrics try to capture, as the sketch below shows.
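
As a minimal sketch in Python, here is one way to tally the four confusion-matrix counts. The label lists are made up purely for illustration (1 = dog, 0 = not a dog).

```python
# Hypothetical labels purely for illustration: 1 = dog, 0 = not a dog.
expected = [1, 1, 0, 0, 1, 0, 1, 0]   # ground-truth labels (the "expected outcome")
actual   = [1, 0, 0, 1, 1, 0, 1, 1]   # model predictions (the "actual outcome")

tp = sum(e == 1 and a == 1 for e, a in zip(expected, actual))  # true positives
tn = sum(e == 0 and a == 0 for e, a in zip(expected, actual))  # true negatives
fp = sum(e == 0 and a == 1 for e, a in zip(expected, actual))  # false positives
fn = sum(e == 1 and a == 0 for e, a in zip(expected, actual))  # false negatives

print(tp, tn, fp, fn)  # 3 2 2 1 for the lists above
```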

Now, with this knowledge in mind, let's talk about accuracy. Accuracy is defined as the number of true positive samples plus the number of true negative samples, divided by the total number of samples. Let's take our dog image set as an example: say there are 100 samples in total, of which 50 are dog images and 50 are not-a-dog images. Now suppose our model is really, really good.

It identifies all 50 dog images correctly, as well as the remaining 50 not-a-dog images. Then our accuracy would be (50 + 50) / 100 = 1.0. This is perfect, and this is the accuracy we expect.
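
As a quick check, here is that calculation written out in code, a minimal sketch using the counts from this balanced example.

```python
# Balanced example: 100 images, 50 dogs and 50 non-dogs, all classified correctly.
tp, tn, fp, fn = 50, 50, 0, 0
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 1.0
```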

However, accuracy very often doesn't work when the data set is imbalanced. In the case we just considered, our data set was balanced: out of 100 images, 50 were dogs and 50 were not dogs. Now let's consider a scenario with 100 samples where, say, 99 of the images are dogs.

And one image is not a dog. In this case, suppose the model we are developing performs very poorly and simply classifies every input as a dog, regardless of what it actually is. Then, although our model is performing poorly, the accuracy would still be high, because everything is labeled as a dog. If you calculate it, the accuracy would be (99 + 0) / 100, which is 99%. Even though the model does not know how to tell a dog from a not-a-dog, its accuracy is 99%.
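
The same formula applied to this imbalanced case shows how misleading accuracy can be; again a minimal sketch with the counts from the example.

```python
# Imbalanced example: 99 dog images, 1 non-dog image, and a model that
# blindly predicts "dog" for every input.
tp, tn, fp, fn = 99, 0, 1, 0
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.99, even though the model has learned nothing useful
```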

So although our model reports 99% accuracy, nobody would buy it, because it is not really able to differentiate between a dog and a not-a-dog. This is the problem with an imbalanced data set: because of the heavy presence of dog images and the scarcity of not-a-dog images, the accuracy metric is very misleading and we cannot rely on it in practice. This is a common situation in machine learning, since real data sets may or may not be balanced. You can either try to balance your data set, or, if there is nothing you can do about it, you shouldn't be using the accuracy metric. There are other metrics, like precision, recall, and F1 score, that you should use when the data set is imbalanced.

To deal with the problem of an imbalanced data set, we need to understand two more metrics, called recall and precision. Let's define recall first. Recall is basically the percentage of relevant items in our data set that the model retrieves; technically it is defined as true positives divided by (true positives plus false negatives). Precision answers the question: out of all the results we got, how many are truly positive? It is defined as true positives divided by (true positives plus false positives).
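
Written as code, the two definitions look like this; a minimal sketch, with helper names chosen just for illustration.

```python
def recall(tp, fn):
    # Fraction of the actual positives that the model managed to find.
    return tp / (tp + fn)

def precision(tp, fp):
    # Fraction of the model's positive predictions that are actually positive.
    return tp / (tp + fp)
```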

Let's take an example so that these concepts are clear. Again we'll use a data set of images with 100 total samples, where the outcome is 1 for a dog and 0 for not a dog. In this data set, out of the 100 images, 70 are dog images and 30 are not-a-dog images.

Now, let's say that out of the 70 actual dog images, our model predicts 50 dogs correctly; those 50 are our true positives.

It misses the other 20 dogs, since there were actually 70 dogs in our corpus.

Those 20 missed dogs are false negatives that the model should have identified. So the recall would be 50 / (50 + 20) = 50 / 70, which is around 0.71. For precision, the true positives are again 50, but let's say the model also labels 30 of the not-a-dog images as dogs.

Those 30 are false positives: the model said it's a dog, but it's not a dog. So our precision would be 50 / (50 + 30) = 50 / 80, which is about 0.625. That's our recall and precision for this example.
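
Here is the same arithmetic as a quick check, a minimal sketch using the counts from this example.

```python
# 70 actual dogs among 100 images: the model finds 50 (TP), misses 20 (FN),
# and wrongly labels 30 non-dogs as dogs (FP).
tp, fn, fp = 50, 20, 30
print(tp / (tp + fn))  # recall    = 50/70, about 0.714
print(tp / (tp + fp))  # precision = 50/80 = 0.625
```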

Very often recall and precision are opposed to each other: when you try to improve recall, precision suffers, and when you try to improve precision, recall suffers. Ideally, we would like both recall and precision to be as high as possible, close to 1.0. Let's take another data set where we can understand recall and precision better. Say we have a data set of people boarding flights in the United States, and each person can be classified as a terrorist or not a terrorist. Suppose the list of people who have boarded flights over the years contains roughly 16 million people who are not terrorists.

And we have 10 people who are real terrorists. So again this is an imbalanced data set. In this case we can define recall as true positives divided by (true positives plus false negatives). For this data set, the true positives are terrorists correctly identified, where the expected and actual outcomes match, and the false negatives are terrorists incorrectly labeled as not a terrorist. That is the formula for recall. Similarly, we can define precision as TP divided by (TP + FP).

Here the numerator stays the same: terrorists correctly identified. In the denominator, the false positives are individuals who are not terrorists but are incorrectly labeled as terrorists. This is how you define recall and precision for this particular data set. Now let's consider two cases. Case one: our model flags only one person as a terrorist and labels the remaining 16 million plus nine people as not terrorists.

Let's see what happens in this case. Our precision will be very high, because there is one true positive and there are no false positives. But if you look at the recall, it will be very low, because we have many false negatives: nine of the ten real terrorists are missed. So this is a case of high precision and low recall. Now consider a scenario where our model says everyone is a terrorist.

That is the other extreme case. Here our precision would be 10 divided by (10 + 16 million), which is very low, and our recall would be a perfect 1.0, because there are no false negatives. As you can see, these are the two extreme cases: high precision with low recall, and low precision with high recall. Very often, if you plot precision versus recall, they are opposing quantities and the curve slopes downward. If you try to improve the precision of a model, you tend to reduce its recall, and if you try to improve the recall, you tend to reduce the precision.
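
The two extreme cases can be checked with the same formulas; a minimal sketch, where the passenger count is the rough figure used in the example.

```python
terrorists, passengers = 10, 16_000_000

# Case 1: the model flags exactly one person, and that person is a real terrorist.
tp, fp, fn = 1, 0, 9
print(tp / (tp + fp))  # precision = 1.0
print(tp / (tp + fn))  # recall    = 0.1

# Case 2: the model flags everyone as a terrorist.
tp, fp, fn = terrorists, passengers - terrorists, 0
print(tp / (tp + fp))  # precision is tiny (about 10 / 16 million)
print(tp / (tp + fn))  # recall    = 1.0
```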

Often the right way to use recall and precision is to balance these two characteristics of a machine learning model and achieve an optimum score. That's where the F1 score comes in, and we will talk about that next. The F1 score is defined as the harmonic mean of precision and recall: F1 = 2 * (precision * recall) / (precision + recall).

Why do we use the harmonic mean instead of a simple average? The reason is that F1 tries to punish the extreme cases, like high recall with low precision or high precision with low recall, because we do not want a situation where only one of the values is high; we want the harmonic mean to be high only when both are. Say our precision is 1.0 and our recall is 0. In this case a simple average would be 0.5, but the F1 score will be zero.

Similarly, if you take the other extreme case, where recall is 1 and precision is 0, the F1 score is again zero. And that's exactly what we want: we want to punish the extreme values, so a simple average is not enough, and that's why the F1 score is used. The F1 score always lies between 0 and 1.
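
A minimal sketch of the F1 formula, showing how it punishes the extreme case where a simple average would not.

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print((1.0 + 0.0) / 2)     # simple average = 0.5, which looks deceptively OK
print(f1_score(1.0, 0.0))  # F1 = 0.0, which is what we actually want here
```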

Our goal in machine learning is to maximize the F1 score as much as possible. Area under the curve (AUC) is another metric used to judge how well our model is performing. The ROC curve is nothing but a graph of the false positive rate versus the true positive rate, and the values range from 0 to 1 for each of them.

A random classifier will have an AUC of about 0.5. Before we go further, let's define TPR and FPR. The true positive rate is given by TPR = TP / (TP + FN), and the false positive rate is given by FPR = FP / (FP + TN).
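
As code, a minimal sketch of these two rates, applied to the earlier "everything is a dog" classifier that predicts every sample positive.

```python
def tpr(tp, fn):
    return tp / (tp + fn)   # true positive rate (same formula as recall)

def fpr(fp, tn):
    return fp / (fp + tn)   # false positive rate

# The trivial "everything is a dog" classifier from the accuracy example:
# TP = 99, FN = 0, FP = 1, TN = 0.
print(tpr(99, 0), fpr(1, 0))  # 1.0 1.0 -- the top-right corner of the ROC plot
```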

Now, in a binary classification problem, whenever you want to label something as 1 or 0, you do it based on a predicted probability: if the probability is more than, say, 0.5 or 0.6, you call the sample positive, otherwise negative. That cut-off value is called the threshold. If we start with a threshold of 1.0, nothing is classified as positive, so there are no true positives and no false positives, and we start at the origin of the ROC curve. As we lower the threshold, we get more and more positive outputs, which gives us more true positives but at the same time more false positives.

That's where we start moving up into the plot and tracing out a curve. Each threshold gives one point (FPR, TPR), so by sweeping the threshold across different values we plot the full ROC curve. We want to choose a model, and a threshold, for which the area under this curve is as large as possible; in other words, we want the curve to be pushed as far as possible toward the upper region of the plot.
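
If scikit-learn is available, the ROC points and the area under the curve can be computed like this; a minimal sketch where the labels and scores are made up purely for illustration.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]                          # ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.65, 0.3]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(list(zip(fpr, tpr)))
print(roc_auc_score(y_true, y_score))              # area under the ROC curve
```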

The larger that area under the curve, the better our model is performing. Let's take an example data set to understand how to compute the area under the curve. Here is a data set of medical patients where we want to predict which patients have a disease. Again it is a binary classification: either a patient has the disease or does not. The model outputs a probability between 0 and 1, and based on that we pick some threshold, say a probability threshold of 0.5.

We then classify based on that threshold: if the probability is greater than 0.5, we say the patient has the disease, and if it is lower than 0.5, we say the patient does not. Then we vary the threshold, say to 0.4, and again count how many data points the model classifies as having the disease and not having the disease, and from the outcomes we determine how many are true positives, false positives, true negatives, and false negatives. The table documents these four counts for each value of the threshold, and then we build the confusion matrix.

For a given threshold of 0.5, say, we build the confusion matrix, and the values are 42 TP, 13 FN, 16 FP, and 29 TN. We then compute recall, precision, and F1 score from these values for every threshold. We also compute the true positive rate and the false positive rate, because that is what is needed to compute the area under the curve. (This example is based on a publicly available example data set, a very useful one for practicing precision and recall.) So for each threshold you compute the true positive rate and false positive rate, you repeat the whole process for every threshold, and then you can plot the ROC curve and its area.
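
Plugging in the counts quoted for the 0.5 threshold gives the following; a minimal sketch of the arithmetic.

```python
# Threshold 0.5: TP = 42, FN = 13, FP = 16, TN = 29.
tp, fn, fp, tn = 42, 13, 16, 29
recall    = tp / (tp + fn)                                 # about 0.764
precision = tp / (tp + fp)                                 # about 0.724
f1        = 2 * precision * recall / (precision + recall)  # about 0.743
tpr_, fpr_ = recall, fp / (fp + tn)                        # about 0.764 and 0.356
print(recall, precision, f1, tpr_, fpr_)
```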

You then observe how these metrics vary for different values of the threshold and see how the model classifies. In the table, for each value of the threshold you keep computing all of these metrics, then you draw the chart and see which value of the threshold is the best choice. From this table, the best F1 score out of all these values corresponds to a threshold of 0.4, so by that measure the threshold of 0.4 is the winner. On the graph you can also see where the area under the curve is maximized, around a threshold of 0.5, and computing the area at that point tells you how good the model is.
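
A minimal sketch of that threshold sweep, with made-up labels and probabilities, picking the threshold with the best F1 score.

```python
y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]                         # made-up ground truth
y_score = [0.9, 0.6, 0.8, 0.45, 0.3, 0.7, 0.55, 0.2, 0.65, 0.1]  # made-up probabilities

best_threshold, best_f1 = None, -1.0
for threshold in (0.3, 0.4, 0.5, 0.6, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1

print(best_threshold, best_f1)  # the threshold with the highest F1 score
```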

So that's how we use evaluation metrics like the F1 score to decide whether our model is good or not.
