Lab 26 :- CDC(Changed Data Capture) in SSIS. (SSIS)

MSBI Step by Step Training Lab 26 :- CDC(Changed Data Capture) in SSIS. (SSIS)
35 minutes
Share the link to this page
Copied
  Completed
In this video we will discuss a very important concept called as CDC using SSIS.

  Download

Transcript

In this video, we will discuss a very important concept called as CDC and CDC using a size. Now, CDC and SCD. If you remember in one of the previous videos, we discussed a very important concept called as slowly changing dimension. So both CDC and SCD they try to achieve the same goals. In other words, they help you to sync up between source and destination. But the way CDC works internally is very different than SGD.

So both try to achieve the same goals you know, but both are meant for very, very different purpose. So in this video, our concentration would be to understand how to use CDC concept using SSIS. Now before I move ahead, I would like to make a very important statement here. CDC is a concept which belongs to SQL servers. So CDC is not a part of SSIS It is not something which is a, which is given in SSIS. It is a part of core SQL Server.

Once you enable the CDC functionality on SQL Server, it tracks the inserts, updates and deletes What are happening on SQL Server table. So once you go and enable CDC on SQL Server, it helps you to track the changes that are happening on SQL Server table. So my first suggestion would be to go and understand CDC and to end. So before we move ahead with this SSIS plus CDC, I would suggest that you first go and just watch what is CDC, how to enable CDC, what is the role of SQL Server agent in CDC and so on. So I have created a separate dedicated video of CDC with SQL Server. So in this video, I'm not discussing about SSIS.

I'm just discussing about the concept of CDC. So before we move ahead with this video, my humble request would be to first go and see this video, which is in the SQL Server section. What is CDC and how to enable it and then proceed ahead with this video. Once you enable CDC on your database on your table, for example, we have this TBL customer here. Once you enable CDC on this table, you can see that he has created a system table here called as TBL customer underscore CT, right. So, the way this CTC works in SQL servers, whatever changes happens in this TBL customer, it will go and it will update those changes over here in this table.

Okay. So for example, let us say I go and change this, this shift knew I change it to shave updated right Let me go and add one more record here 1001 desti, new. Hundred, right? Let me go and delete this record shift from here. Right? So I perform three operations here, I've added a record called as test a new, updated record from shift to shift updated, and I've deleted one record.

So whatever I've done here, because I have enabled CDC on this table, it goes and it puts all those inserts, updates and deletes over here. So for example, you can see here this dollar operation here indicates that this three and four are updates. So three is the old record, and four is the updated record. This two indicates that it is a new record and one indicates that it is a delete. So basically, CDC is a mechanism is a functionality Just given by SQL Server, and if you go and if you enable it on your table, then it starts tracking the changes that are made to your table data. So one for deletion, two, for adding a new record, and three and four for updation.

So three has the old record, and four has the new record or updated value. Now SSIS uses this wonderful CDC functionality of SQL server to sync up the source and destination tables in an incremental manner. I repeat again, in an incremental manner. So what SSIS has done, you know, they have given a CDC component, which uses this CDC table that means this TBL underscore city table to sync up and to sync up the source and destination in an incremental manner. Now because SSIS is using this CDC table to sync up the sources destination it was only scans for the records which has been updated that means the records which has just recently inserted records which are just recently deleted and updated so it does not scan for all the records. In other words, you know when you're talking about huge tables, huge number of records to sync up between source and destination, there is no other wonderful functionality like CDC.

So now this is the source table we have TBL customer, let us go and create a similar structure of table called as TBL customer destination. So what I will do is I will go and script this table so I'll say create into a new query editor and I will name this table as TBL customer destination. So for simplicity, I will just keep the same field names and I will say this is TBL customer destination must be primary key price. Key table destination right. So just have named is an empty test destination everywhere and I will run this. So with this what happens is we have one more table here created called as the TBL customer test.

So we can see now we have two tables here. One is we have the TBL customer and the other one is we have the TBL customer test. Okay Here it is, something is hanging there I'm so sorry. So you can see that we have two tables here. One is TBL customer and the other one is TBL customer test. Now we want to sync up the TBL customer and TBL customer test destination table is using the CDC component.

Now before I move ahead with the demonstration of how to use the CDC component of SSIS it let me explain you the complete architecture of how this SSIS CDC will work because The configuration is not complicated, but the configuration is very lengthy. And if you do not understand the architecture, then it is very difficult to know that how CDC SSIS is working. So let me first explain you the architecture. And then let me go ahead and show the demonstration. So, here is a diagram, a visual diagram of how the synchronization via CDC will happen. So, we have two tables here one is the TBL customer and the TBL customer desk.

And both of these tables will be synchronized by using the SSIS CDC component. Now, the whole synchronization process happens in a three or four step process I'll say. So, step one, when somebody goes and updates the TBL customer, it sends those updated values to the CDC table. So if you remember, we had this underscore CG table where it uses the flu acts like 1234 to track all the changes to that underscore CD table or the CDC table. Now, what the CDC component does, first, it goes to the main source table that is TBL customer and loads all the record to TBL customer destination. So you can think about that this is step two, and it is the snapshot or execution.

So, in the snapshot execution, he first goes to the TBL customer table. He does not look at the CDC table, he takes all the records from the TBL customer table and dumps it to the destination. So the first step to I will say is one time, it happens just only once. Now after that, what happens after that, when any changes happen in the TBL customer table, the CDC component starts tracking the CDC table. So he does not know track all the records. He just tracks the CDC table and says That okay now let me sink of what has been inserted, updated and deleted, I will not be going to the main table to scan all the records.

So the step three is kind of a background process which will keep running forever. So in short, the CDC step two happens one time where it takes all the records and dumps it to the to the destination table. And step three, you know, happens forever. So after that he just keeps looking at the underscore city table or the CDC table for changes. Now one thing which can come to your mind is saying that, okay, when all of these sync ups are happening, how does the CDC know that which records have been synced up? So for that what SSIS has done, he creates his one small own table called as the SSIS CDC table where it goes and it starts writing the records saying that Okay, I have inserted four records, I've inserted five records.

Don't touch them. Records again. Right. So, this is the architecture of SSIS CDC. So, now, let us go ahead and let us try to sync up the TBL customer and TBL test table what we have using the CDC SSIS component. So as we said, you know, there are going to be two phases in CDC.

The first phase is where the complete data will be taken from the source table to the destination table. And the second phase you know, which will actually keep continuously running afterwards forever will take the data from the CDC table. So, I'm going to go and create two sequence containers here, so I'm gonna go and drag and drop to sequence containers. So, one sequence container will have the logical phase one where the complete load will happen. And the other sequence container will have the logic of phase two, where one Li the changes will be taken. So you can see that this is sequencing container one I have here I will also drag and drop a sequence container to okay.

So this sequence container I will name it as phase one or complete load, right. And this sequence container, I will name it as phase two, incremental only. Okay, so in phase one so now in phase one I'm going to have a data flow task and this data flow task will load the complete data from the source to the destination table. So, in this phase one, we are not looking at the CTC table of SQL Server we are loading the complete data at the start, right and the phase two will be incremental, slow Alright, so, over here, all data. Okay. So we'll go to this all data now and over First thing now this is a normal load.

So, we will have a source table we will have a destination table. So, let's go here. First let me take our source which is a do dotnet a destination which is a do dotnet ad.ca destination right. So, let us start configuring this these are all basic steps you have done. So, this the first phase one is a normal load right. So, new and the database name is my cost dB.

So, this is all normal configuration which we have done till now and the source table is TBL customer right and this table data will be given to the destination which is the TBL customer destination and the mappings are all same because the table structure is all seen right. So, this is the phase one this phase one you can also tell me it as a snapshot phase Where we load the complete data once because for one time the first time, we need to sync up the source and destination and then later on, we are going to have this incremental changes phase running continuously right. Now, if you remember I said that at the end of the day SSIS has to track that how many records have been transferred to what level the synchronization is complete, because if he does not track right then he will be duplicating the records again and again.

So, what it does is, you know, for both of these phases, phase one and phase two, but he goes and he says, the phase one loaded loading starts now, and it ends now. So, in other words, before this, all data is loaded, what we need to do is, we need to mark the start and the end so that SSIS knows till what level the data has been synced up. Again, I repeat by marking the start and the end of these phases SSIS knows till what level his synchronization is finished. So, what we have to do is we have to take the CDC control task. So over here, now those will define the start of phase one, right. And the same way we are going to have the end of phase one, right.

So, with the start and end flags, you know, he comes to know that till what level he has completed things, so, let me go here I'll say Edit, and ABC, okay, mark the initial load start. Now, in order to go this go and mark this load, start and load and he needs to have some variables created internally. So I'll say okay, go ahead and create those variables. So the variable I've named it a CDC state, and also, he needs to go and store this Start and End markup in some table. So let us go ahead and create that table here. So We can see that he's creating a table called a CDC states in your database will say yes to it.

And in that he keeps reporting that till what level the data has been synced up right. And I will say okay. So, there it is. So, this marks the start and in the same way, I would like to also mark the end of the initial load initial load means where we are loading the data for the first time, right. So, again, all these things are same here. Right.

So, if you go to your SQL Server just out of curiosity, you will see that he has created one more table for himself called the CDC states, the CDC states is that table you know, where SSIS writes that till what level has a synchronization finished between the source and the destination. So, let us go and run the application. So, in the first step, when the phase one will run, right, so, I'm going to go and run phase one So, you can see that I have clicked To start here. So in phase one, all the data will be loaded, and the initial load will finish, that means the initial synchronization will happen. So you can see that our initial load has finished. And if I go to my database here, and if I go to my destination table, you can see that all the records have been loaded here.

And this is first time and also in the system table. That means in the CDC table, I'm sorry, in the CDC state stable. He has marked saying that the initial load has started and the initial load has ended successfully. Right. So the first phase of loading has been finished. Now we need to start with the incremental updates.

Now in the same way, let us go and write the logic for the incremental phase as well, right. So I'm going to go and put the CDC control task here. So We will need to mark the start and the end right. So, here again we will say that okay mark the start of phase two right and mark the end of phase two right. And in between now we are going to go and use we are going to go and have a data flow task in which we will write the logic for the incremental load. So, here now, in the data flow task all data will not be loaded one incremental data from the CDC table of SQL server will be loaded right.

So, let me just go and complete this markings first. So, I will say now, I don't want to load the whole data right now. So, I want to load only the processed range that means that the data which has been changed, which has been updated, which has been deleted and not all the data right and Definitely we also need to have the variables here. So, I will say that create a variable CDC state and let us use the same table what we have CDC states right. So okay and also I will mark the end of phase two same way. So, your get processing range was a start and Mark processing range is the end right.

So, get processing range and Mark processing range are nothing but they are markers, which will be stored in the CDC states table saying that till this day will be a finished Good. Now, also let us go and write the incremental CDC logic now in the incremental load, right, I will not be reading from the main source table I will be reading from the CDC source so you can see now that I've taken the CDC source here, and in this I will say that I'm going to go and load The data from this CDC enabled table what it has, and I don't want all the values, right? I just want the net net means I just want the net update, I don't want all the records to be taken, right? And yes, again this variable will be used. And if we do a preview, actually, just to check what is output, you will see that he's actually loading this data from the CDC state tables right.

So that is the CDC CDC table okay. So, this is all good, I will just say Okay, nice. And now, remember that you know from this source table, what we have here, that means from this city table, we have lots of states right. So, something is inserted, something is updated, something is deleted. So, here also I need to identify that from this source, what is coming out, is this an insertion is this updation is that is a deletion right? So for that, you know what we have is over here.

We have a nice component called CDC split of right? So I'm going to go and give this data to the CDC splitter. Now from the CDC splitter, I can say that, okay, this is an insert, this is update or delete. So for example, now let me take a destination here, right? And I can go and put in this destination, and over here and he says, okay, is this an insert? So I'll say yes, this is an insert, I will put it here.

And over here, I will go and define the table by destination table, where to do the insert. And you can see now he's taking the data from the CDC table, the city table, right, great. Good. Now remember, the Phase One has to run only once, right? So I'm going to go and disable this phase one here. So this has ran, it has done its work.

It It has loaded the initial synchronization data between the source and the destination and forth. The data will now be loaded only from the CDC CDC table that means your CDC table offer SQL server that means from here, right? Good. Now what I'll do is let me go and delete all of this data because this data is actually all the bad data right. And so, so let us go and run this. So Henceforth, now we only the incremental logic will run and the phase one complete load will never run, right.

Okay, so you can see I'm just running the program to check everything is working. Yes, everything is working good, right. And let us also check if our source table so let me also go and check our source table here as well. So use Solution Explorer, Object Explorer, that sorry, the destination table, let me just check the destination table. So you can see That our destination table let us see both of these tables in one go right. So, we are this.

So, you can see because of the initial load at this moment both the source and the destination tables are loaded equally and everything is fine. So, now let us go ahead to the database and let us go to TBL customer and let us add some new records in the TBL customer so, I'm going to go and say something like let us say new customer, right, new name and 420 So, that we can remember the number right and there it is. Now, once I make an entry here, the next thing what happens is, he first makes an entry into the city tables, right. So, let us go ahead and just check in our city table. We should now see an entry Have a record with opposition to saying that this is a new record, right? And at this moment, if I go and see I have a destination or both a destination, is it okay?

Sorry. So now if I see in my source table, I have a new record and in my destination table, I don't have a new record right. Now what my expectation is that when I run this program, I expect one record to only come here one year, the incremental changes not all right. So let me go and put a Data Viewer here to just check that really, if I'm just getting that one new record which is sent which has been updated or which has been inserted right. So let me run this. So now what should happen is the Data Viewer should pop up with that only one record you can see now, you can see that he is one day now.

Take The changes he's going to the CDC source table and not to the main table. And I run this Yes, please go and insert it. So you can see over here, he's saying that one record has been sent, yes. And my expectation is that both of these tables should now be synchronized. So you can see now, this new caste which has been added here, has also been added here as well. Great.

So now the way we have done for inserts, you know, we also would like to do for updates. So now what I will do is in the coming part of the video, I'm not going to speak too much here, because if you remember, for filing the updates in tax, we said that you need to use the oledb command, right. So if you remember we said that whenever you say that you want to go and fire update, then you need to use this oledb command right, so that we have already explained so what I'm gonna do is I'm going to go and complete the update and delete in the remaining part of the video, but I will not be speaking and my actions will help you understand what exactly I'm doing. Okay. So let me start from here. Right So, I hope that you have understood you know how to go and set the configuration for oledb command.

So, we are all set here we are have the insert logic we have the update logic we have the Delete logic at place right. So, let us go to a table back again here. So, let us go to our TBL customer and let us start doing some inserts let us start doing some deletes. So, first thing is I'm going to go and delete pq r 9008. So, I'm going to go and delete this. I'm going to add a new customer saying new shave, right new shave one and something, right.

I'm going to go and update this new cost to update cost. Right so you can see i have i've made one delete I believe One update, and I have made one insert. Now let us see that if everything works perfectly well. So I'm going to go and let us also enable the Data Viewer here just to check if everything is working. Right. So enable Data Viewer here enable Data Viewer, you're my mouse.

There it is. And as I've said that the phase one is complete the phase two is something that keeps running again and again right. And also to quickly check quickly check quickly quickly check that our city table so before I go and show you all the output here, just to quickly check over here, the CDC has ensured you can see he has ensured all the tracking has happened, right. So everything is working synchronous in a proper sync. So you can see the first thing is update output. So update so did I Update.

Yes, I did update 1003 perfectly well run it right. So that is done. Next thing What is this? This is a delete? Yes I did delete pq r 9008 perfectly well very nice. And yes new shoe I did insert a new record perfectly nice and this is so you can see now everything one row here, one row here and one row here, right?

Perfectly incremental incremental updates. We are not loading the whole data and again and again, like how we did in SCD This is completely incremental right? And let us go and finally just check if the records are proper or not. So I did some changes here, right. So let me just go and see both the tables side by side. And you can see that everything is in sync but it's Problem Yo, I need to check that.

Why am I having pq r? Oh, very interesting you can see here. Nice. The delete is a problem we can see the new record has been inserted. Right? perfectly nice.

You can see that. Even the Okay, the update also is a problem, right? You can and I'll tell you why it has you know it is because Can you see the IDS over here the IDS on which I'm doing the updates and, and deletes, right? Those IDs are not in sync. This Id never inserted into this table here. Right?

And why did it happen? Let us go and check that so interesting. So we can see the updates and deletes are not working because these IDs are not in sync. Okay. So CDC is working perfectly well. It is just my mistake what I've done here, okay.

That is because there is Google Object Explorer. That is strange. I go to my design here. You can see that there is an ID and this ID for some reason has been auto incremented. I don't know why. Okay, when it happened, because when I did a create syntax, it was never auto incremental.

So I think this is a problem. So this should not be auto incremented. Right? This should also be updated from the table right? from the source table. So again, because of this, that sync up has not happened, or else this lab is working perfectly well.

Right. Good. So there is a small error at my end, which I've done. I'm extremely sorry for that. But I hope that it is clear that why that is happening, the error is happening because my destination table ID is identity. So he is not updating from the source table, he is creating his own IDs.

And on the whole base by which I'm doing the update and delete is on the IDs. So that's where the deletes and updates are not working. But if I go and remove the identity, and if I set it properly, I'm sure it will work. But I hope that You will have got the gist of CDC, you have understood the two different phases in CDC and how to configure them. Now, before I end this session, I would like to also put a comparison sheet between CDC and a CD because both of these functionalities, you know, actually helps us to achieve the same thing you know, they helps us they help us to synchronize between some source and destination. But here are some of the points you know, which will help you to decide that when to use CDC and when to use a CD.

The first point is that both the ends are SQL Server, then when the CDC works, so it will the entire SQL Server, I will say that, don't even think about a CD just move towards CDC. But if you have a non SQL Server such as like a flat file or a text file or Excel file, then CDC cannot work because CDC is a concept of SQL Server, right? So at that moment, you know, SCD is very useful In case you are looking at audit trail history, maintenance, old records and new records, then CDC does not provide any inbuilt functionality, then you have to go to a CD and that also for type one and type two kind of replication right. If you are looking at pure incremental updates, if you look at a CD, a CD goes and scans all the records and then says that if this is proper if this is not there if this is existing and then does the update, but CDC on the other hand is is evaluated at the source the source tells that what has been inserted what has been updated and what has been deleted.

So, in case you're looking at one Li the incremental values that means the values which are changed at the source, then it is CDC. And the final thing I think, which is very important as if it is a large number of updates large number of records, then CDC is the way to go because SCD as the name says slowly changing dimensions. It is only for records you know, which are master tables which are dimension tables. So in case you're looking for large number of updates, a CD will be very, very slow. So these are some of the bullet points you can keep ahead you know when you are looking at implementing CDC and a CD, and you can make decisions accordingly. So I hope that you enjoyed this video in this video we were trying to understand what is it what is CDC, how to enable it and how to use it in SSIS.

And finally, we also saw that what is the difference between CDC and STD thank you so much

Sign Up

Share

Share with friends, get 20% off
Invite your friends to LearnDesk learning marketplace. For each purchase they make, you get 20% off (upto $10) on your next purchase.