Lab 37 :- Page Split and Performance issues with SSIS.(SSIS)

MSBI Step by Step Training Lab 37 :- Page Split and Performance issues with SSIS.(SSIS)
25 minutes
Share the link to this page
Copied
  Completed
In this video we will demonstrate the Page Split and Performance issues with (SSIS).

  Download

Transcript

Welcome to lab 37. And in this lab, we will discuss about how page splits affect ETL process or SSIS process. Now, as I've said in some of the previous part of the video that a good msba developer has a very, very good knowledge of SQL Server, he knows how SQL Server works internally. And especially for this video, you need to know two three things before we start. First one is you need to know that how SQL Server organizes internally, the pages right. The second thing you you should know is that how clustered indexes works internally, and then you should know paid splits.

So before you start with this video, I would suggest you to go go through three videos from SQL Server section. The first one is eight KB pages. Second one is indexes. And the third one is, you know, pay splits So go through them. And in this video definitely I'm going to go and revise all those concepts, but you know, these three videos are dedicated for those concepts. So, they are more in depth.

Now, before I move ahead, let us talk about data warehouse a bit. If you look at data warehouse, you know there are two parts to it, one is the left hand side and one is the right hand side. So, the left hand side is you know, the place you know from where the data comes in. So, you have an ETL process in SSIS that pulls data from OLTP systems, you know, it pulls data from file systems, and it inserts data into a data warehouse. So, there is a first process the left hand side process. Then from at the right hand side, you have people fetching data from this data warehouse, the reading data from this data warehouse and doing analysis creating cubes displaying reports and so on.

So you have basically two processes one is the insert process and the other one is reading process and, you know you'd like in a both of these processes should be as efficient as possible, as fast as possible. Now in order to improve the search performance, what do we do? We use our nearest brother and nearest friend, that is indexes. So we create clustered indexes, we create non clustered indexes and so on. So, first let us try to understand how indexes works internally. And then we will try to understand page splits.

Let us take a scenario that we do not have an index. So, you can see that I have a very simple table here, which has two columns, one is ID and one is name. So, let us assume that you do not have indexes and you want to go and search the ninth record. So, what happens? It goes sequentially it makes a table scan. So, it starts from 1234 until it reaches nine.

But now think about that, you know you have a million records like this So, if you have a million records then it has to do a million scan and that is definitely not efficient right. So, when you go and create an index what it does is it actually goes and creates a B tree structure. So, in this B tree structure what SQL does is SQL server does is it actually goes and create group so, it takes this one to five record and it creates one single bucket or you can say leaf node. So it says that okay, this is one group of one to five records, and this is another group of six to 10 Records. So it goes and creates a leaf node on the top of the leaf node it goes and it creates non leaf node saying okay, this is a one to five bucket and this is a six to 10 bucket.

And then again there is level zero so at the top. So what happens now when an end user comes to search something he says, Okay, give me nine. So it four says okay, is nine greater than five. So if nine is greater than five, it completely bypasses The first bucket So, it completely bypasses this first thing over here and it goes and it starts searching from the second bucket and it says okay 678 and nine. So, in other words you know very it was scanning now mine records it is now scanning only three records. So, basically a balanced fee structure is nothing but it is a result of the B that stands for balanced tree.

So, the balanced tree structure is nothing but you know where your records are organized in leaf nodes, non leaf nodes and level zero this leaf nodes is where the final data is present. So, now, let us concentrate more on the leaf nodes you know because you know down in this leaf nodes is where exactly the page split phenomena happens. Now, these leaf nodes have the physical address of the data. So, in other words you know if you say record one, you know this one of that index will have exactly The physical address of the data okay so where the data is stored. So the data is stored differently. The data is stored in something called as data pages why index this whole beat restructure is stored in index pages.

Okay. Now, whatever you store in SQL Server and retrieve in SQL Server they are all stored in pages and retrieving the trillion pages. And these pages are of size eight kb. So, eight KB is the internal structure in which the records are stored and retrieved. In case you want to know more about these eight KB pages, I would request you to go and see the video in SQL Server section where we have internally explained how this eight KB page looks like. Now when we say eight KB eight KB means eight into one zero to four that is approximately 8000 bytes can be stored.

If you see at this moment, my table here has As one record is approximately 3000 bytes you can see this name over here has 3000 bytes. So now if this row size is 3000 bytes and our page structure or pages of 8000 bytes, so maximum in one page, he can only fit in two records, right. So in other words in one page, you know, you will have 3000 for one record 3000 for a second record, and the remaining bytes are approximately 2000 bytes left out. And in the 2000 bytes also, he has to keep some space for the header information and offset information. So hardly any spaces left for the third record. So once SQL knows that, oh, I don't have a lot of space left out, he goes and he creates a new page to fit in your third record.

This phenomena you know that he does not have enough space and he goes and he creates a new page is termed as page split. So in other words, you know, to fit in all of these 10 Records, he will approximately need at least five pages, right? In every page if you say two records so he will approx approximately need five pages. So I'll just let me reiterate, you know what will happen. So at this moment, you know, the first one and two record is in page one, the other two records have gone in page two, and so on and the last fifth page has nine and 10 record that means a depends okayish. And in that, you know, hardly 1500 bytes are left, right.

Now, let's say I go here and I add one more value here, you know, saying added to 11th record. Now what happens? He goes on he asks himself, he says that Okay, so this 11th record is going to add, can I go and fit that into the fifth page? The answer is no. Then he goes, and he does a page split, he creates a new page. And if it's added into that last page, that is the sixth page.

Now, this phenomena of page split, you know, can worsen. If you are loading a lot of records, think about the ETL scenario, where you're loading millions of records. If you're loading millions of records, you then this pace split will happen left and right. And that will make the whole insert process very slow. So we need to figure out a way you know by which, yes, we need indexes, but at the same time, we want to ensure that our inserts happen fast. So how do we go and avoid this piece split?

At the same time? How do we go and also have indexes because at the end of the day, indexes, you know, increases your search performance because of this balanced restructure. But now before I move ahead, the first thing is we need to detect it We have a page split problem or not, if you do not have a page split problem, you know doing unnecessary things does not make sense right. So, first thing is you know, wherever is your production environment you would like to go there and you would like to first detect that is really pay splits are happening or not. Now, before I go ahead and show you how to detect paid splits, the first thing let me tell you, base splits will only happen in clustered indexes, because in clustered index your leaf node points towards the data page, right. So, page splits cannot happen, you know, on non clustered indexes or when when you do not have an index.

So, basically wherever you have a clustered index, only there, the pay splits will happen. So, in this table, if you see TBL test, if I go here to the design, you can see that we have a clustered index over here create a clustered index. Yes, let me do like this. Let me go and say this has no and Let me go and add indexes separately. In other words, normally, you know, the primary key is the index, right? But let me go here and add a separate clustered index, which will actually index on the ID column.

So this is a clustered index. So you can see that over here, I have added a separate index called as ix DBL. Test. I will tell you why I have added it separately, right? So I've kept the primary key separate. And I've kept the indexes separate.

Okay. And I'll tell you that why it is a best practice to keep separate clustered index and not mix it with primary key. So by default, what happens when you create a primary key, this is clustered indexes. Yes, over here, right. So what I've done is I have just created it as a separate entity. So let me go and save this.

And what I will do is I will go and delete all the records, right? So I will go and just delete all the records so that I can show you a split in a more better way so all the records have been deleted. At this moment, you know, this table has no records. Neither, it has any pages. So it is just over there without anything right? Now normally, this base split problem, you'd like to go and detect it on a production server.

And if you see production servers, you know administrators are very protective about their production service, they will not allow you to install anything, they will not give you access, you know, even if they want to give you access, they will send a mail take his permission, right. So normally administrators are very, very sensitive about the production environment, they normally do not like to install any tools or anything, right. So whatever solution you're coming out with to detect baseplates so not have any installation of tools. So whatever is existing in the operating system, we would like to leverage that right. So again, this solution which I'm discussing here, you can use it, you know, without installing any tool. So how do we do it?

We have a very nice software called as perfmon Performance Monitor right. So, let us first go and I'm going to go and start perfmon. So, what is this performance monitor? What does it do? And what exactly is the basic use of a performance monitor? First thing, Performance Monitor is just the tool, the main concept is performance counters.

So, what exactly are Performance Counters? performance counters are nothing but you know they are values which tell you that how a software is performing in your operating system for example, let us say you have SQL Server and you fire SQL. So, there is a performance counter that is saying okay that how much time this SQL took to execute. For example, you have application and you want to know that how much memory has been consumed. So, that application what it does whenever he consumes memory, he actually goes and updates a performance counter called as memory So, you can go and detect their performance counters. So, basically performance counters are nothing but their values which are software in a Windows operating system updates it.

For example, at this moment if you see we have a very simple performance counter which is shown with data over here the processor time. So, it is telling us that basically how much time the processor is now busy right. So, if you go here and if you say add counters you will find there are holes of performance counters right from Excel to word to PowerPoint, even for SQL Server. So, you can see or if you want to detect you know how much your dotnet is consuming a memory, you can use one of these performance counter if you want to go and detect you know that how your SP dotnet is working. Again there is a performance counter for ESP dotnet So, whatever you know software is installed in Windows. This tool which is a part of the operating system which ships as a part of the operating system.

Hence you to detect it for example, you can see even the browser. So, he says that okay, this browser you know basically how much time you know you know how much how much time did it take you know to go to a domain etc. So, basically there are Performance Counters for literally everything in Windows okay. But our interest at this moment is SQL Server and specifically in SQL Server our interest is paid splits right. So, let us go to perfmon and let us go to SQL Server. Now remember, at this moment, I have lots of instances of SQL server running here you can see there is SQL Express and there is SQL 2008.

And then there can be SQL 2012. So I need to now here to SQL 2008. So in SQL 2008, we have something called as access methods. And in this we have a performance counters are called as paid splits per second. So let us go and add this There it is. Now one as you can see that there is a running graph here, which you can see which is running.

But rather than seeing the graph it is to like this you know, we will go and see a report. So in report we can exactly see the value in a one page split has happened to pay split has happened. So, rather than going and seeing the running value, let us go and see the see this exactly the absolute value, right. And let me go and remove this performance counter here as well. So, I'm going to go and remove this processor time at this moment. I'm not interested in processor time, I'm interested only in a report.

So now, at this moment, if you see so let's go back to SQL Server. So what I'm going to do is I'm going to go and place this and SQL Server very nearby. So I just wanted what is the situation, we have no records and we have no big structure internally, right. So let us go here and by default, well, by default, there is already one empty It can be page a located. So that is that right? So what happens here, what I'll do is rather than seeing this because this is per second, right, it will come one and we go very quickly.

So let me just switch back to line. And it clear everything. So there it is, you can see the line is running. So now I'll add one record, let us say shave, nothing will happen, then I will set a test. And I will set tab. Now when I go to add the third record, watch here.

Now when I go to add the third record, you will see a paid split. So if I see your x y Zed, and if I say tab, he says, Oh, I don't find any place now. So he goes and he creates a new page, this is termed as pay splits. Now look at this, test one, nothing. But now when I say again, XD or something, you can see one more page splits, right? So if we're having a lot of these kind of page splits, that means there is a problem.

That means your performance will degrade right. So while inserting These kinds of splits are happening, it is an issue. So, basically on production server, what you will do is you will go and you will add this performance counter which I just spoke about that is base splits per second and this performance counter is present in this access Methods section. So, you will right click on this you will say add counters, you will go to the access methods of SQL Server and ensure that you go to the proper instance. And then you will come to know that pastries are happening. So, now that we know the implications of page split.

Now, think about what will happen if you have an SSIS package, you know who's doing a huge bulk insert or a bulk update into the table. For example, you can see that over here, I have a very simple SSIS package what this package does, it actually goes and loads lots of records into the table. So, if you see when this program runs, so I'm going to go and quickly let us See the line here, let us see the let us go and run the application so that we can run this, look at this look at the page splits. So I ran the application, you can see 231 ps split, it just went away, I will just go back to my line, and you can see there's a huge shoot over here, I'm gonna go and pause this, so that we can watch this. So you can see over here, you know, almost like 200 plus psps has happened.

And this was like hardly 465 rows, but think about if I have millions of rows, which is very much possible in you know, OLAP application, then the performance will completely come down. So we need a way out, you know, by which we definitely want the indexes, but at the same time, you know, I don't want such kind of a shoot, right? So it's kind of pay splits. So how do you go about doing that? So Let me go back to SQL Server and let me just delete all the records. So I'm gonna delete the records.

Right? So basically what a lot of exercise developers do, they go, and they drop the indexes before going and doing such kind of bulk insert. Right? So normally a lot of developers what they do is they go and they execute SQL tasks. So basically, SQL task. Yeah.

So they go and execute SQL task prior to the data flow task, and after the data flow task, so this first SQL task here, what it does is it actually goes and drops the indexes. So it dropped the indexes. It does the insert. And after that, you have one more SQL task here which goes and creates the indexes. Right. In case you are new to this execute SQL task component, what I would suggest is, please go and watch lap 20, where we have a 30 minutes dedicated video only for SQL task, this execute SQL task actually goes and executes SQL, right.

So we need to go and execute to SQL here at this moment, if you remember our index name very quickly, if you remember our index name, I had created an index here called as ix TBL test. And you can now understand that why I created a separate index because if this index was with this primary key, then I need to actually drop this primary key and recreate it right. So that is, I would be affecting two entities for no reason. So that's why I've created a separate index. So you can see this index clustered index name here is ix underscore TBL test. So what we have to do is over here in the first page Good task, we need to go and drop the indexes.

Right? So I'm going to go and copy this. So, let us go here edit. And let us go and give the connection. So already I have a connection here in case you do not have a connection, use the new connection we have done a lot of times, and over here, I will go and put this statement. And I will say parse the query.

So, pause this query and check yes, the query is fine. And the second SQL statement will actually go and create this clustered index. So, you can see that create index ix underscore TBL test. So, again, I'm going to go and create the index here. So, go and put this and put this again let us not bypass this exclude let us go and check this. So, yes, what Great.

So now, let us go and run this. And let us see that if pay splits are happening or not right, so I'm going to go and clear all this right, I'm going to go and start this again. So let us go and run this. But before running, let me just restart my performance counter tracking. So let's go and run this you can see now it is running. And you can see that the pay splits have definitely decreased.

You can see it has decreased a lot. It is showing me now 10 paid split, but still pay splits are happening. So the questions that okay, the values have decreased. But why is the pay split happening? The patient is happening because we are again recreating the indexes, isn't it? So the PHP does not happen for every record insert now, right or it is not evaluated for every record insert.

But when we again, go and create the index, he has to again go and allocate all the pages. That's why this page split, but remember that this page split is very, very lesson number four. example over here Now, let me go and disable this let us say we we again just compare it. So, let us go and run this, let us go and compare this with with indexes. Look at that. So, you can see now with indexes, it was a huge number, but without indexes means dropping and recreating, you can see the numbers are very less.

So, this is more efficient as compared to you know having indexes You know, when we are doing insert. So, basically, the catch is that you go and you execute the SQL, which first goes and drops the indexes. After that you go and you execute your full data flow task. And once you are done, you will then go and recreate the indexes. So, this is one of the very important points you know when it comes to ETL in SSIS to remember that try to avoid pay splits. Remember that in SSIS are insanely easy process, millions of records get inserted and updated.

And, you know, millions of records means you know, you can have at least a lakh base pay splits, right? Wait so that was a 30 minutes video. So in this video we were trying to understand what exactly RP splits and how to go and improve performance in SSIS you know when we have indexes, thank you so much

Sign Up

Share

Share with friends, get 20% off
Invite your friends to LearnDesk learning marketplace. For each purchase they make, you get 20% off (upto $10) on your next purchase.