Lab 29 :- How to implement Multithreading in SSIS?(SSIS)

MSBI Step by Step Training Lab 29 :- How to implement Multithreading in SSIS?(SSIS)
39 minutes
Share the link to this page
Copied
  Completed
In this video we will discuss about what exactly is this multithreading concept.

  Download

Transcript

In this video, we will discuss about what exactly is this multi threading concept? And how does it improve our performance in SSIS. Now, in order to demonstrate the power of multi threading, what I've done is you can see that I have created a very, very simple table. It has three fields here, one is value one, value two, an ID. And at this moment, in this table, there are approximately 433735 Records. So that's a very pretty big number of rows.

And you can see that we have created a very, very simple SSIS package. And in this SSIS package in the data flow task, what I am doing is I have created an oledb source and I'm reading all the records from the table over here. So if you go and see, you can see that in the SQL command, I have written a very simple SQL command, and I'm fetching all the three columns That is value one, value two, and the ID in this oledb source. Now, what I will do is I'm going to go and use a derived column component. And I will be pushing the output of this oledb to the derived column and the derived column, I'm going to go and create two columns here, one column, you know, which will actually make the value one capital. So I'm going to go and see that okay, this is value one capital, and the other column, which will actually make the value to capitals.

So you can see a very, very simple task over here, which I'm doing, I'm trying to convert all the fields or all the records of this table to capital, so both the fields value one and value two. I'm actually trying to convert it into caps. So that's a very simple example over here. But now one thing to remember here is that We have large number of records here you can see it is 433735. So that many records actually are going to get converted into capitals. So that means that this is a very, very resource intensive task right.

So, let me go and run this task and let us see that how much time does it take to complete this task, so I'm going to go and run this task. So that my task is running. And what we will do is you know, once the task finishes, we will go and see that how much time this task has taken to complete. So, let us wait for some time here. So you can see that the task is running. And you know, the time taken for this task, or you know to run will be displayed later on in the progress tab.

Select this experiment come So, we can see it is actually taking every record from here and in the derived column, we are trying to convert both of those fields into capital. So that it is it has been completed. So let me go to the progress tab. And you can see that over here, I'm going to go and copy this message text, I will open notepad and you can see that it has taken me 46 seconds seconds. So approximately, it took me 46 seconds to complete this task over here, right. So good.

Now, the first thing which comes to my mind is that at this moment we have 433735 Records right? So it has taken four to six seconds, but you know, as the records increase in order and definitely this time will also increase. So is there a way we can go and do some kind of optimization and you know, Bring down this process time and improve your performance. So, let us start let us start thinking you know around the same Now, in order to improve the process, let us first understand the current process. So, let us understand the current process and then we will see that you know how this process can be improved. So, the current process is as following First the oledb source goes and he reads the data right he reads the data from database these data or I will say these rows actually are bundled into chunk of memory right and each chunk of memory is sent to the derived column.

So, each chunk of memory is sent to the derived column. Now, these chunks of memory are sent to the derived column sequencing. Again I repeat these chunks of memory you know where 10 rows or five rows you know, depending on the row size is sent sequence In case you are not aware of the, how the memory chunk works, I would suggest you know, please see the video where we talked about memory optimization for SSIS. So, these chunks of memory are sent sequentially and the derived column processes them sequentially. Again I repeat, the chunks of memory are sent sequentially to the derived column and the derived column processes them sequentially. So, everything is sequential over here.

So, fundamentally speaking, we need to change this sequential processing to parallel processing. And in order to do parallel processing, we need to use threads. So, if you go to the data flow task and if you right click on the properties you will find over here there is something called as engine threads. So, you can see over here there is something called as engine threads. So, already we have 10 threads here. So, 10 threads means that basically This data flow task is capable to process 10 things parallely So, at this moment, you know this is one logic.

So, this one logic is getting processed on one thread right, but the remaining nine threads are not used, because our logic does not support it. Remember, over here you are given given 10 threads. So, this data flow task is capable to run and task parallely. But look at your logic your logic is actually processing it sequentially. So, because the logic is doing sequentially, the data flow task has no idea how to run this parallel. So, the first thing what we need to do is we need to change this sequential logic into parallel logic, right.

So, we need to see that, you know if you can divide this data over here into two parts and process them parallely rather than sequentially, then you know, we could have some performance gain here. So let me go and drag and drop one more derive column right or I will rather say, let me copy paste this derive column what I have here because we are going to use the same logic, right. So I'm going to do a Ctrl C and then a Ctrl V. So what we will do now is, you can see that we have two derived columns. So this one derived column over here will actually process the 50% passive part of the data. And the other derived column here will process the remaining 50% part of the data. So I have copied this derive column logic as it is because the logic is same, right?

So what we'll do is we'll put a conditional split here. And this conditional split will divide the records into 50 50%. And he will send the 50% of record to this derive column and 50% of record to disk drive column. This thing what I'm doing here is termed as partitioning. If you want to do parallel processing in SSIS, the first thing what you should do is you should partition your data. For example, at this moment, I'm partitioning my data into 50 50%.

So, once the data is partitioned, then SSIS can take those partition data and run threads on each one of the partition data. So, in other words, the the base of parallel processing in SSIS is that what is your partitioning logic? Now, remember that this partitioning logic can be very complex, and it is very custom, you know, as per project. At this moment, what we will do is we will do, we will send 50% of the records here, and we will send 50% of the records here, right. So, our partitioning logic at this moment will keep it simple. And while I'm doing this parallel programming over here, I will also talk about you know, how people make mistakes in doing parallel processing.

So, at this moment, let me go and drag and drop the The value here the output here. So in this conditional split, I will go and create two partitions, one which will send the 50% of record to the first component here derived column, and the other one will send the 50% of record to the other derived column. So over here, I will say this is partition one, so part one. And over here, I will say, if you remember we had a column here, this column is the ID column, so I'll say ID, so if the ID, so one partition will be greater than zero, right. And it would be from zero to the 50% of the record, right? So, to the 50% of the record.

So, remember our total number of rows were 43375 To write, so, this is the first range of record right I will copy this and the other range would be from this from 50% to the remaining 50% right. So, that it is remember that you know if there are errors in this in this expression here it gives a red sign so it has gone out good. So, we have two partitions In short, one partition, which goes from zero to the 50% of the records, the other partition goes to the remaining 50%. Good. So, I'll just go and see okay So this output goes here. So this is partition one.

The other output goes here that is partition two. So at this moment, if you see, we have, we have a very simple partitioning logic of 50 50% here. Now remember that in projects and if people don't use such simple partitioning logic, you know, you need to have really robust logic. Okay, so that's fine, we'll discuss about the different partitioning logics. So you can see at this moment, it is very simple 50%. So 50% goes here, it is processed.

The other 50% goes here and it is processed. And you can see that both of these tasks, these data flow tasks here are processed parallel, right. So now let me go and run this program. Now, let me make a statement here before I run this program, this logic over here, this code over here, what you have written of this workflow, what you have created here, you think that it is going to execute badly Right, you think so? But if you go and run this and watch this what do you see? You see that yes it is it is processing 50% but the way it is processing 50% is that in a first it is processing the first 50% over here and then the next 50% it will actually process here and again this is sequential.

So, even though now your code you have written thinking parallelism in mind, it is still not executing parallely it is executing sequentially separately definitely, this needs to be improved, it cannot go like this right. So, how do we go about actually making it sequence So, so remember at this moment, this logic is good, the thought is good. It is moving in the right direction, but it is not actually pairing process. This is very important. A lot of developers what they feel I've met them I've seen you know they go to this property here. And to start increasing the number of threads, and the thing that Oh, if we make this 10 times hundred, you know, our, our program will run faster, but remember that it is not about the threads at the end of the day, you have to write the logic or you have to create the SSIS workflow in such a way that it can exclude badly.

So if your logic is not having the nature of parallelism, SSIS cannot do anything about it. Okay, so this logic is moving the right direction, but it is still missing those small pieces of actual parallelism, right. So first thing is your logic should be compatible with the thought of parallelism, or your logic should support parallelism. If our logic is not support, parallelism is not supporting parallelism, then SSIS cannot do anything about it. So now in order to actually start the processing right away parallely what I will do is, I'm going to go and make a small amendment in this conditional split and my partitioning logic So, I will say that if it is even record, right so what I'm gonna do is I'm gonna now go and divide definitely again into 50%. So, the base logic is that yes, I will divide the partitions into 50% but what I will do is I will say that the even records will go to the left side and the order records we go to the right side by doing so, my processing will start immediately right my parallel processing will start immediately rather than going and waiting for from zero to this and then from next to this right.

So, I will say yes okay if it is even so, what I will do is we will say percentage So, that means if there Now remember, the best way to detect even an odd is that if you divide by three and if the remainder is zero then it is odd So, I will say okay if this is equal to zero then this is an odd record or else you know this is a Even record so if you divide by two and if the remainder is zero, so that becomes even record, if you divide by three and if the remainder is zero then that becomes a auto record right? So we're here this partition one is for even records and the partition two is for odd records. And what I will do is I will also go and say that in case you know, if there are situations where probably you know, it is not divisible by two, it is not divisible by three you know, so, those kind of records, I will just go and put it over here.

So, you can see, this one is the default default means, you know the records which are not matching in those if conditions will be sent here. Good. So, let me go and run this program. Remember our previous program, you know, which was sequential, which had the sequential logic took 46 seconds. So, let us see that, you know this time how much it will take So, I'm going to go and run this So you can see, um, you can see that, you know, there are even records and there are order records, I'm not sure why I'm seeing some records in the default. So that's fine.

But you can see that at this moment, you know the all the records are executing parallely. So, you can see that it is not sequential, you know, the rows are going in this direct column task, some rows are going in this direct column task and so on. So, there it is, it is done. So, let us go and see. So, I'm going to go and copy this text. So, this one here was sequential.

And this one year is parallel or we say, multi threaded, multi threaded means the logic actually runs parallely, right. So, you can see almost we have 50% increase In speed, I do know that you know, these readings are not exactly accurate, but because you need to run them three or four times to get to a conclusion, you know, so that's fine, but I can still see that you know, this is 46 seconds and that is this is 28 seconds right. So, that is, that is pretty much it as it is. It is a very nice improvement. It's a 50% improvement in performance right? Good.

Now, this is a very good stage, you know, we are able to do multi threading, we are able to do parallel programming. But if you'd like to improve more, if you see at this moment, what is happening is all those records that means 433735 records are actually coming to conditional split, and then going to the derived columns in a parallel way. But this conditional split over here, which is happening inside my SSIS program is not parallel. So in other Because it is very much possible that now I see a performance improvement, but it is possible that in some situations, this conditional split would be so heavy that actually my overall performance can go down right. So, this partitioning or this conditional split which is happening inside my SSIS program, if it can happen in SQL server itself, again I repeat at this moment the partitioning logic is happening inside this SSIS package, right in this conditional split here, if this partitioning can happen at the SQL Server level that means, at over here you know, when I fetch the records, then I think you know, I would have more better performance right.

So, in other words, if my SQL over here, you know the select statement over here what you see, which I have written by using the SQL command, if I can go and write a valid condition here, you know in this, which I can say okay from zero to x records School is partitioned one then some records to some records is partition two and so on, then I think you know my performance can become much better and much better right. So, let us see. So we have total number of records, this is our total number of records. So, let me go here to my Excel sheet. So, let us say that we want to run five partitions, right, so, five partitions means they should be divided by five, okay, so that means that I should have from zero to whatever right. So, one partition is from zero to eight six, why can't a copy now, I'm copying the formula.

Copy and Paste values right So, one partition this is partition one I can say partition one is from zero to zero to this right? The next partition is from this to this plus this next partition is from this is partition two, right? So I would like to create four or five partitions over here, you know where in one partition the records goes from zero to this and then the other partition is from this to this right? So you're again, this is partition three, this plus this right and so on. So in this way, you know if you can go and create five partitions, and then implement the logic, it would be great Right. So, let me go ahead and first you know complete this calculation of five partitions and then let us see in SSIS package how to implement it.

So, in simple words what we are trying to achieve here is that we are trying to do the partition on the server side rather than doing it right away in SSIS. So, we do not want to do the partitioning logic here, but we want to do the partitioning logic on the server side while we are fetching in SQL. So, right when the data is fetching in SQL View, we want to ensure that it is partitioned and sent. So, what I plan to achieve here is the following. So, in other words okay this So, this one is my partition one okay so, this is not partition one and so, in partition one what we said in our logic we said that it should be between zero to 86747 right So, here I will save that Id is greater than zero and ID is less than 867-478-6747 right.

So this is the first partition. Okay, let's let me preview and check if this is right. Yes. So, that is the first partition which will run right in the same way I will go and create the second partition so Ctrl C and Ctrl V. So this is partition two, right write and here I will say the records are from, so, from 86747 So, from eight six, so it should be greater than 86747 and it should be less than 17349 Okay, why I can copy this 1734941 so 3494 right. So in this way I will go and create my partitioning logic. And you can see that you know, the partitioning is happening right away in the SQL.

Okay. So let me go and create five partitions what I've defined in SQL here so that you can see that I've created the five partitions and each of these partitions are following this calculation, what I've done in this Excel sheet here right. So, let us go and run this you know, if you remember our previous just sequential was four to six, you know, this guy gave us more faster results 28 Let us see you. Again. You know, remember that you know there is no guarantee that you know probably this is a It'll be faster than the previous one. And I'm just trying to give you a view of how the different approaches are, but let's go and run this let us see what happens.

So, you can see are now all of the rules have started in parallel you know rather than having that big if conditional thing what was inside the program, it is happening here in the oledb source right. Again, I would like to point out here you know, until this thing is running in the previous strategy, we had use a if condition of even an odds right, you're again definitely we are doing the partitioning at the source, but you can see that I have 12345 SQL query is now getting fired. So, the stress which was happening in the if condition is now happening on the SQL queries, so, there is still a stress Okay, so, we are finding a query and a lot of things. So, again, you need to see that if it is really worth or not, okay, so if I go here, let me copy this.

So you can see that So this is parallel with if condition even and or kind of a strategy. And this one is parallel with select using the range, right? If you see this, frankly, I was expecting some improvements. But you can see that you know, there are no improvements. This is 29 seconds is 28 seconds. Okay.

So that is fine, you know, but the point here is, you know, I was just trying to talk about different ways of thinking of doing parallel programming or multi threading in SSIS, right. And remember, don't forget that this threads over here the number of threads, what is specified over here are very important. For example, let us say make two threads over here. So here you have written the logic of five and you have just two, right? So again, this is not proper parallel programming, right? So either either have five threads or more than five, all right, so the engine turns are very important.

Now, the way I can spawn threads Inside a data flow task, if you see at this moment the threads are spawning on the data flow task. So, this part one part partition to partition three they're all running on different different threads and they are happening inside the data flow what SSIS has done on the control flow as well you know, they have given you something called as maximum concurrent executables. So, if you wish you can also go and do parallel tasking over here. So, what I can do here is let us do like this. So, let us say this is let us say I have two three data flow task, let me do that. So, let us let me create three data flow task here right and over here I will say that yes I'm going to run maximum three executables fantasy three executables means the data flow task right.

So, data flow task flow task To the data flow task, okay, so what I'll do is I'll say that okay in this you know, just exclude partition, one and two, right? In this execute, partition three and four. Look at the way I'm deleting things. And in the last partition, we say please go and exclude partition five. So at this moment, what I've done as you know, I'm doing multi threading at the level of control flow. So in control flow, if you want to do parallel execution, then don't forget this property here, maximum, concurrent executables.

And if you want to do parallel programming and data flow tasks, don't forget this property or which is threats that is engine threats, right? So, now we have three tasks which will run in parallel. Let us see, do we see any improvements here? So, I'm going to go and run this program, let us see that is what the best so that it is running it will take some time. So, we can see that it is running here it is lying over here parallely running over here right. So, execution execution is happening in every data flow task internally.

So, let us So, we are going to have this is our fourth reading right where we are saying that we are running the SSIS package with threading. Enable. Enable at the control flow task at the controller level. Oh What's happening? Okay. So, let us see how it is done, let us look at the timing of the copier again, you can see that you know, there are no big difference.

So Ctrl V, and there it is. So you can see this is with a select range 2829 seconds, 28 seconds, 30 seconds. So, as you can see that approximately they are doing the one on the same thing. Now, I would like to talk about a very important thing regarding these experiments and conclusions what I'm making here, whatever experiments and conclusions I'm making here can be wrong or it can be different in different scenarios because, for example, you know, when I was running SSIS, with threading, it is possible that at the same time some antivirus started running at the back end, right. And the the results were, you know, not showing up to the mark. So what I would suggest is that, you know, when you're doing these partitioning sessions, tournaments, at least take five or six readings to conclude that you know that that partitioning type is good or bad.

So don't just like the way I'm doing at this moment, you know, I'm just running one time and then checking these readings are right or not, I will suggest that at least two five to six reading and on average, then conclude that yes, this way to go is good or this way to go is bad. Okay. And second, also, I would like to point here that, you know, in this case, also, you know, where I was actually protocol, putting the partitioning on the SQL server side by doing a select. Remember that even when I'm doing a select here, all of these selects, are actually SQL Server trips. So, this is conditioning, what is happening here, right, it is happening in memory in SSIS. And over here, when I'm doing a select, I'm actually making SQL Server trip so I'm making five SQL Server trips, okay.

So, the problem here as well as that you know, because I have five SELECT statements, so it is just compensating, okay. So must be universal to find a really some more way more better way of doing partitioning, okay. So, remember that, you know, even though I, you know, I taught that, you know, yes, if I remove the if conditions from in memory and try to do on the SQL Server level, it will increase the performance. But, you know, I, you know, we did not account the fact that we have to make now five SQL tips over there, right. So, these are the ways of doing partitioning, let me also discuss one of the ways of doing partitioning, one more way of doing partitioning, which is called by using the SQL Server partitioning way. And this is a very professional approach.

Again, I don't guarantee that the speed will increase here, but let me just put the options around you Then depending on the options, you can think more creative of how to do partitioning. Now you know, this video is very focused on SSIS threading, okay, so I won't be going into SQL Server partitioning in detail, but I will tell you what I've done. Second, in case you want to know about SQL Server partitioning, I would suggest you to go and watch the videos of SQL Server for the same where we have what exactly is partitioning in SQL Server. So we have a video on that. But you know, this small text file which I have here, which I have also provided in your SSIS project, right is shipped. So in this you know, I have done the following things to create a partitioning partition in SQL Server.

So you can see that I've said you're the first thing is you have to go and say, specify the logical partitioning means what is the range of partition and how much partition you want to create so you can see for that you fire this screen. partition syntax. And we specify that yes, we want from zero to 86747. And then from 867472173494, and so on. So I've created five partition partitions. So that is the first step.

The second step is you have to specify that where these partitions will be stored. So if you see an SQL Server with we have a primary lock file, and then we have a secondary lock file and so on. So at this moment, I've said that in a store all these these partitions, that means this my firt, my five partition on to the primary one onto my primary device here, right? And the next thing, what I do is that I attach my, you know, my partition scheme to my table so you can see what I'm saying outer table, my table name was TBL source, and I've said that apply the partitioning on this ID column. Right. So apply this partitioning logic on the ID column.

So, basically to create partition, you have to follow these three steps. First is create the partition, second specify, you know where the partition will be stored, and then finally, apply the partitioning logic to the table. Once you do that, internally you can see that he creates partitions. So, for example, the way to the way you go and you refer the partition is you will say select from all the fields from the table name and then you will say where dollar partition and my name of the partition was my five partitions right. So, from that take the first partition. So you can see, when I say count of the first partition, it tells me it is 86747.

If I say now give me the second partition. So it will again be 8647. So in this way, he has created five partitions internally. Right now this is a better approach. Because now the partitioning logic will happen at the back end on the SQL server side. So it will not actually have impact on my SSIS directly.

But yes, you know, we do have the drawback that we have to file five SQL, right. So five times my exercise will fire SQL to SQL Server and get the data. And in case the SQL Server is lying in some other location, or I'll say, in some other server, then you expect this network trip or the network latency. But yes, I think you know, probably that is better as compared to having the if condition inside in memory. So again, we need to check what is good. So now that we have the partition on the SQL server side, what I've done is inside SSIS, wherever we had written that SQL select, you know, saying where ID is greater than this and this, I'm actually I have put the partition values you can see this is partition one.

So this is partition two, right? So what I've done is rather than putting that answer Laws I'm saying ID is less than zero and then greater than eight, eight, whatever, you can see that I'm actually putting the partitioning partition number right. So, let us see if this will be fast or not. Again, as I've said it is all experiments, I'm just trying to give you various kinds of partitioning logic right. And you have to be creative, you have to think out of the box to have performance. So SSIS gives you those threads and executables right, but at the end of the day, it is all up to you, how do you actually define your partitioning logic?

So, let's go and run this you know, let us say let us see that how much time it takes. So that it is running remember, we have done five readings we have taken one which was sequential one which was even an odd then we said parallel, right and then we define threading on the control flow level and now we are doing SQL Server partitioning. So Let us see that if we can get some performance improvements here. Okay finished. Let me see what it is it is 26 Okay, it is better than it is just four seconds faster than other things and I'm happy that it is faster. So you can see now you know just by thinking out of the box or trying to use what SQL server has, you know, I have some extra gains of performance.

So that brings us to this end of the session. The session was a very dear session to me because it took me a lot of time to record because of the complexities that are involved in defining partitioning and thinking about parallel programming but I really loved it and I hope that you have liked it also. In case you have liked it, I would again suggest please go to facebook.com slash correspond and tell us that what you think about our videos so in this video we saw how to do threading. So remember threading are at two levels one is on the control flow task. We have threading over here that is Max concurrent execution. And also we have threading also parallel processing over here as well, which is by the name as engine treads that it is right.

And as we said again number of threads if you just increase and your logic is not having that parallelism nature then nothing will happen. So, increase of thread does not really increase performance, it is the logic what you write it is the partitioning logic how you define write. And also we saw that how we can leverage SQL Server partitioning, you know, to to go in Unhand with engine threads and improve performance. So, hope you enjoyed this video. Thank you very much.

Sign Up

Share

Share with friends, get 20% off
Invite your friends to LearnDesk learning marketplace. For each purchase they make, you get 20% off (upto $10) on your next purchase.