Lab 13:- Using Data Profiling task to check data quality. (SSIS)

MSBI Step by Step Training Lab 13:- Using Data Profiling task to check data quality. (SSIS)
20 minutes
Share the link to this page
Copied
  Completed
In this lab we will try to understand a very important component called as The Data Profiling task.

  Download

Transcript

In this lab, we'll try to understand a very important component called as the data profiling task. This data profiling task is a control flow task component. In other words, this is visible only in the control flow. So, before we get into the data profiling task, let us try to understand what exactly the word profile means. profile means, description, or you can say information or you can say outline of something a lot of times you must have asked for a profile of a person. And by looking at the profile of the person, you come to know, you know, what is what the person is where he stays, what are his abilities and so on.

In the same way, a lot of times, you know, you'd like to take out a profile of the data and by looking at the profile of the data, you will come to know that what kind of college That data has, for example, you can see over here I have this data which I'm getting into SSIS Okay. Now, you know now this data by looking at this data is very difficult to know what is the quality of this data? So I would like to go and run some kind of algorithm on this, you know, by which, if that algorithm can tell me the profile, it can tell me that saying, okay, there are 10 nulls, there is one primary key, you know, I find that some of the fields have are like email address. So, if I can go and extract the profile of this data, then I can manage my SSIS project in a better way.

And that's what exactly that data profiling tasked us. So, now, here we have a sample data in a CSV file. You can see that you know, we have name in we have email address, we have address, we have date of birth and a lot of fields. And we have some records here as well, right. So, we would like to go and Run the data profiling task on this data. And you'd like to see that what is the profile of this data?

What is the quality of this data? Now this CSV file at this moment is present in your solution explorer. So what I've done is so that you can also practice, I have provided this data file with the name SQL profiling data dot CSV in the Solution Explorer. Now, the first step is in order to go and do data profiling on the CSV file, we need to go and import this data into SQL server. Data profiling task can only profile data which is in SQL Server. It cannot profile directly on CSV files or Excel files or any other format.

So the first step is we need to go and import this CSV file into our SQL Server database. So Let us go and first import this file. So I'm going to go to my solution explorer, and I am going to go and copy the file path name. So, you go to properties and from here I will copy this full file path. Now let us go to SQL Server. And over here, let us right click on this database and let us say task.

Import data. So you can see you have to click on the database icon and say task import data. Once you click on the menu, it actually starts a wizard. So now we have a CSV file. So CSV file means it's a flat file source, right? So I'm going to go and click on the flat file source.

And let us browse to the file name and we will say yes, it is a comma delimited file. The first row is the column names and we will just do a preview and We will do one next. So, the previous screen you know told us that what is the source so the source is the flat file CSV and the destination is SQL Server right. So I will say that yes the destination is SQL Server and it is the customer data warehouse database and I will do a next So, you can see now it says that okay so the source is this file and on the destination it will go and create this table called as SQL profiling data. So, I'm okay with that and I will do a next and I will say finish. And you can see now it is running the Import Wizard and it will go and import the file into my SQL Server table.

So the first thing is the data profiling task needs the data in SQL Server if the data is in SQL Server, he cannot create a profile of your data right. So, this is nice. Now, let us go back to our SQL Server. And let us see if the table has been created. And you can see that the SQL profiling data table has been created good. So let us go and add a new SSIS package.

And the whole goal of this package is to do data profiling. So you can see that I have added a new DTS x package here and I mean naming this package as data profiling. And on this package, you know, you will have only one component on the on the on the control flow and that is the data profiling task. Okay. So I will say that this is check data quality, right. So, when there's only one package and only one component, and the whole goal of this component is to go to this SQL Server Data what we have loaded Go and check what kind of quality of data we have and give us a profile of this data right.

So, that is this data which is loaded right. Great. So now let us go and do edit. Now as soon as you click on Edit, he provides a box here you know where you need to go and give the file name. So, what exactly does this file name do? Now, what happens is you know, when the data quality task actually runs on the data, whatever is the profile of the data, whatever is the quality of the data, all the details are written to this file.

So, this file is where the data profiling task will go and write the profile of the data right. So let us go ahead here and I will provide the file name so I'm going to say yes, new file connection and we will create the file in the same path, you know, All projectors so you're is asking that you know, do you have existing file? No, I don't have existing file create a file by yourself. So let's go and do a browse. So there it is, and we will give the path part of customer information system. In this we will create a file called as profile, okay.

Profile dot txt that's fine. So profile dot txt and let's open okay. And I will say that you know, whenever around the data data profiling task, always go and override it, you know, so that I get the new fresh profile always Okay. Now, the next thing is we need to go and provide the profile request. So what profile request, a profile request is nothing but it says that you know what kind of profiling you want to do. For example, do you want to check, you know that what is the length of each column?

Right? Do you want to check the nulls? How many columns have nulls? How many rows have nulls? Don't to check that what kind of values are there in your data? So you can see over here, there are a list of, you know, profile requests, you know, which you can go and run over your data.

Now, one way is that you you can go and you can click on each one of these profiles request and provide the necessary things are the other thing which I feel very comfortable is you can go and use this quick profile, this quick profile, what it does is, it actually quickly goes and creates, it takes all the profiles request what we have and creates a profile for you, right. So what I'm going to do is I'm going to go and connect to the customer data warehouse, that our team balance on which we want to go and run the profile requests. So there is this table SQL profiling data. And on this I'm going to go and run all the profiles. So, you can see that I have checked to check the nulls I have told to check the statistics to check the value right you know, also I want to check the primary key.

So, we will go into each one of these details later on. But let us first go and run this profile and see the output okay. So let me do okay here. So, you can see now, but other than clicking on each one of them and providing the necessary profile request, he has gone ahead and he has taken all the profile requests and added it to your data profiling task, right. Great. Now let me do okay.

And once I do, okay, you can see that the data profiling task is now configured, right. So let us go and run this DTS x package. So at this moment, you know, we have lots of package over here. But what I'm going to do is I'm going to go and set this at startup, right. And when we run this package, okay, so set this as startup object. And let us do Ctrl A phi or if you wish, you can also do F phi, right.

So f5 or Ctrl f5. So I'm going to do a fire just moment. So I have clicked on the start. So either you can do a phi, or either you can click on Start, both of them are the same things. So you can see that there our data quality task has ran. And whatever is the profile output, it has put it into this folder here in the profile dot txt.

So if you now go and open this profile dot txt, you can see that it's a big XML file here. And it is very difficult to read that what he's saying right? So in order to go and read This XML file in a more user friendly way, let us go back to Visual Studio. And let us stop this application. And let us right click again and do edit. So in order to go and see the output of the file in a proper user friendly way, you can see that he has given us a profile viewer.

You can see this button over here on the screen, which says open profile viewer. So click on Open profile viewers. So the first thing which is very noticeable here is what are the different kinds of profiles which has ran over your data. So you can see that he has ran the candidate profile, the column, land, column null and a lot of things right. And when you click on this profile, he gives you a field level view of what has happened with the profile. And when you click on that seal, he gives you a more detailed view of that profile information.

Right. So there are three sections here one is the profile then what has happened individually with every field and then more detail for that specific field? Okay. So, let us first start with candidate key candidate key profile tells that in your data, which are those columns which has unique records. Now, if I go back to my data here, you know my assumption when I look at this data, my assumption was that the employee code probably should be unique or you know, this is the person which I expect to be unique. So, when you get the data, there is expectation right so, employ employ code normally I expect it to be unique, but when I ran my profile you can see that he's showing me date of birth is unique.

He's showing me email addresses unique but he's not showing me the employed code to be unique. You can see that he shouldn't be pen can't to be unique, right. So, you can see in the employ code that is easy to 01 here There is also easy to 01 here right. So, this candidate key profile tells you that which column of your data are unique. So that you know they can be made as a primary key or they can be made as a candidate key. The second kind of profile what we have here is the column length distribution.

Now, if you look at our data, we have certain length expectations from a data for example, look at country code a country code normally is a three letter character Okay, for example must be a look at the EMP code okay the EMP code is probably a four letter character okay. So normally you know whenever we have any kind of a data we have certain range of length in our mind saying that okay the country code can be three letter character or must be the email address, you know, can be from one character till 10 character right. So this column length profile tells us That every column you know is or is of what length for example, look at the country code, my expectation was that the country code should be of the length three, right. So, you can see that there is a minimum length, but you know, I can also see that there are some data in the country code column, which is a file and, and when I click on this country code down, it says that okay 90% times you can see here 90% times or 90% of the data is having three length, but there are 9% of data, which is having file length.

So, if I go back here, now, I can see that this is three character three character three character, but look at this India, it is five characters, right. So, the column length distribution tells you that basically that column is having, what is the minimum length and the maximum length. The next profile is the column null ratio. It tells us that you know, which have fields are having nulls you know, so, sometimes it is very important as important for us to know that if our data is having null or not. So, you can see that here I am seeing that you know there are two records of nulls in the Name column. So, if I go into my SQL Server I can see that there are two records with null right the next kind of profile is the column pattern profile.

Now, a lot of times what happens is you know a lot of data have certain pattern for example, you can see that over here this EMP code has a pattern it starts with an E and then it follows with a three letter numeric. You can see that the department code has a pattern it for it starts with a D and then follows with a three letter numeric right. So, in the same way the email address also has certain pattern. Look at the date the date has a pattern of mmddyy or a DD mm yy. So, you know if we have such kind of patterns, we would like to actually see that in the profile so, So you can see over here, that's what this column pattern profiles is all about. For example, if I go to date of birth, you can see that he tells me that okay there is a pattern here of date ddmm yyy.

Now, this is a regular expression you can see that he has pulled out a regular expression. In case you are new to regular expression. My suggestion is to go and watch the regular expression videos what we have on questpond.com in the same we look at department code over here also he says that there is a pattern it starts with a D and then there is a three letter numeric Okay, look at the EMP code, again starts with an E and a three letter numeric and there are some other other data as well here which is not matching with the pattern right. So basically, column pattern profile tells you that what kind of patterns are available in your data. So with that, you can know more better that how is the quality of the data. The next kind of data profiling is the Column value distribution.

Now a lot of times you know, you'd like to know that how is the data distributed in the column for example, you can see that in the country code column ind is the most awkward record. So, I would like to know that okay, it has occurred 123456 times right. For instance, look at salary in salary 1000 value is the most awkward value. So 1234 fights four times. So, by this, you know, I can know that what are the most awkward values and how are the values distributed on in the data, right? So, so that's what this thing does.

So, for example, you can see here if I go to the country code column, it says that ind has occurred six times, if I go to the left salary, it says that thousand value, you know, has occurred four times so 36% values are of thousand values, right. So you can see now you know how by using the data profiling task, we can judge our quality of the data. Now before I end this video, I would like to highlight some problems some issues around data profiling task when it comes to implementing the same in real time projects. If you see at this moment, data profiling task is a standalone unit means I run the data profiling task and it gives out the XML and then I can go and view the XML file and see that what kind of quality my data has. Now, as a msbi developer, I would like to take this one step ahead means we would like to go and read that profile data in the data flow task.

And depending on the data profile values, we would like to go and execute some logic in the data flow task. So in order to achieve the same, if you remember, the data profiling task actually gives out the XML file. So this XML file is where he goes and writes the complete details. of the profile. So, what you can do now is you know, once you run this data profiling task, in your data flow task, what you can do is you can go and use the XML data source. So, if you go here to the SSIS Toolbox, you can see that there is something called as an XML source.

So, you can read the XML source, get the data from the data profiling task and then accordingly, take decisions here what to do in this data flow task. In one of the future coming videos, you know, we will talk about you know, how to use the XML reader and the XML source component. Okay. So, so that brings us to the end of this session. So in this session, we were trying to understand what exactly is data profiling task, and how does it help us to know the quality of our data

Sign Up

Share

Share with friends, get 20% off
Invite your friends to LearnDesk learning marketplace. For each purchase they make, you get 20% off (upto $10) on your next purchase.