Welcome to the clinical data management program using SAS. In this video we will be discussing how to create summary statistics from clinical data in SAS. For this we are going to use the procedure called PROC MEANS. So let's start. First we have to execute a LIBNAME statement. This is the CDM library, which we have brought into the SAS environment.
It consists of the following data sets. We will be using these data sets from the CDM library to explain how to create summary statistics using clinical data in SAS. So let's start. We'll be using the procedure called PROC MEANS: PROC MEANS DATA= and then the library and the data set, that is CDM.disease, and then RUN. Let's run this code. PROC MEANS gives us summary statistics for all the numeric variables: N, which is the number of non-missing observations, the mean, which is the average value, the standard deviation, the minimum value and the maximum value. You can also see that DOB, that is date of birth, is treated as a numeric variable, so we also got summary statistics for date of birth.
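As a minimal sketch, the two steps described so far (assigning the library and running a plain PROC MEANS) might look like this. The folder path is hypothetical and is not given in the video; only the library name CDM and the data set name disease come from the video.

```sas
/* Hypothetical path: point this at the folder that holds the CDM data sets */
libname CDM "/path/to/cdm";

/* No VAR statement, so every numeric variable in the data set is summarized. */
/* Default statistics: N, Mean, Std Dev, Minimum, Maximum.                    */
proc means data=CDM.disease;
run;
```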
If we look at our disease data set, it has DOB as a numeric variable, along with average commute, daily internet use and available vehicles. These are all the numeric variables. Why is DOB treated as a numeric variable? SAS stores a date as the number of days between that date and the SAS base date, which is 1 January 1960. So it is basically counting the number of days between the date in the data set and the SAS base date. That is why we got the number of observations, the average value, the standard deviation, the minimum value and the maximum value for DOB. So this is a plain PROC MEANS procedure. Now we'll learn some modifications to it. In a plain PROC MEANS, by default we get summary statistics for all the numeric variables that are present in our data.
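To see the day count for yourself, here is a small sketch that does not need the CDM library at all. The date of birth used here is made up for illustration and is not taken from the disease data set.

```sas
/* Illustrative only: the date of birth below is a made-up value */
data _null_;
    base_date = '01JAN1960'd;     /* the SAS base date, stored as 0     */
    dob       = '15MAR1985'd;     /* stored as days since the base date */
    days      = dob - base_date;  /* same number as dob itself          */
    put base_date= dob= days=;    /* raw numeric values in the log      */
    put dob= date9.;              /* same value shown with a date format */
run;
```

The log shows that dob is just a number, which is why PROC MEANS is able to compute a mean and a standard deviation for it.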
But suppose we want summary statistics for only a few variables. Then how do we go about it? We modify our PROC MEANS procedure. So PROC MEANS DATA=CDM.disease, and we'll use the VAR statement. The VAR statement is used to specify the analysis variables, that is, the variables for which we want the summary statistics to be displayed. Only the analysis variables go in the VAR statement. Here we specify average commute as an analysis variable, and then another variable, daily internet use. You can specify as many variables as you want in the VAR statement.
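A sketch of the VAR statement as described might look like the following. The variable names AvgCommute and DailyInternetUse are assumptions, since the video only speaks the names aloud; check the actual names in your copy of the data set (for example with PROC CONTENTS). It also assumes the CDM library is already assigned.

```sas
/* Assumes the CDM library has already been assigned with LIBNAME */
proc means data=CDM.disease;
    /* Assumed variable names for "average commute" and "daily internet use" */
    var AvgCommute DailyInternetUse;
run;
```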
So I want these two variables, average commute and daily internet use, and they are specified in the VAR statement. Let's run this code. See, here we got the summary statistics for only these two variables, average commute and daily internet use: the number of observations, the mean, that is the average, the standard deviation, that is the spread of the values around the mean, which is the square root of the variance, the minimum value and the maximum value. Now we'll make a further modification to our summary statistics report: we will analyze our analysis variable with respect to a categorical variable. So let me explain this concept.
How do we go about that? PROC MEANS DATA=CDM.disease. We're using the same data set. Then MAXDEC=2. MAXDEC stands for the maximum number of decimal places, so MAXDEC=2 means I want at most two digits after the decimal point. I want to display the summary statistics for the variable called daily internet use.
So this is my analysis variable, which I've specified in the VAR statement so that the summary statistics are displayed for this variable. Then I'm using the keyword CLASS. CLASS disease means I want to analyze daily internet use class-wise, or disease-wise. CLASS is the keyword that defines the categorical variable, so I'm displaying summary statistics of daily internet use disease-wise. Disease is my category, or categorical variable. And then RUN. Let's run this code. See, over here we got the analysis of our daily internet use variable disease-wise. N Obs is the number of observations, and N is the number of non-missing values.
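Put together, the MAXDEC= option and the CLASS statement might be sketched like this. The names DailyInternetUse and Disease are assumed spellings of the variables mentioned in the video, and the CDM library is assumed to be assigned already.

```sas
/* Assumes the CDM library has already been assigned with LIBNAME */
proc means data=CDM.disease maxdec=2;  /* at most two decimal places  */
    var DailyInternetUse;              /* analysis variable (assumed name)    */
    class Disease;                     /* categorical variable (assumed name) */
run;
```

With a CLASS statement the report has one row per disease and adds an N Obs column next to N.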
In this particular data there are no missing values, so the number of observations and the number of non-missing values are the same. Then we have the mean, that is the average, the standard deviation, the minimum value and the maximum value. Out of all these diseases, we can understand from the number of observations that the largest group of patients is suffering from Alzheimer's disease, because there are 330 million patients suffering from Alzheimer's disease. For schizophrenia the average daily internet use is about 4.90, and for kidney disease it is 4.93, so these values are quite high. Out of the whole data there are also many patients with hypertension and endometriosis, and if you go on to diabetes and breast cancer, it is a fairly uniform type of data: every patient is suffering from at least one of the diseases, and the averages for all the diseases are fairly high rather than low. The highest average is for heart disease. For prostate cancer it is type 117 and again type 117.
OK, next. In our last report we got all the summary statistics: number of observations, number of non-missing values, mean, standard deviation, minimum value and maximum value. But suppose we want to display only the mean and standard deviation of our analysis variable. Then how do we go about it? We use the same procedure, PROC MEANS DATA=CDM.disease, followed by MEAN STD. MEAN says we want to display the mean, and STD stands for standard deviation. Then I define the analysis variable as daily internet use, and then CLASS disease, because I want the summary statistics of daily internet use disease-wise. That is why I have defined the categorical variable. And then RUN.
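As a sketch, requesting only the mean and standard deviation might look like this. Again, DailyInternetUse and Disease are assumed variable names, and the CDM library is assumed to be assigned already.

```sas
/* Assumes the CDM library has already been assigned with LIBNAME */
proc means data=CDM.disease mean std;  /* statistic keywords go on the PROC statement */
    var DailyInternetUse;              /* analysis variable (assumed name)    */
    class Disease;                     /* categorical variable (assumed name) */
run;
```

Listing statistic keywords on the PROC MEANS statement replaces the default set, so only the statistics you name are printed for the analysis variable.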
So here we get the mean and standard deviation. See, the number of observations comes by default, and the mean and standard deviation of daily internet use have come disease-wise for the disease data. So basically we have displayed the summary statistics report category-wise, and my categories are the main categories of diseases in my data: Alzheimer's disease, HIV, breast cancer, diabetes, endometriosis, gastritis, heart disease, hypertension, kidney disease, multiple sclerosis, prostate cancer, schizophrenia and skin cancer. So in this video we'll stop here. In my upcoming video I'll be discussing creating tabulated reports. For now let me end this video over here.
Thank you. Goodbye. See you all in the next video.