ETL From scratch using SSIS integrate to new Data warehouse By Mody Ahmed
Creating a data warehouse sometimes refers to it as creating a copy of the existing transactions data sometimes referred to as "OLTP", this is a dataset that is a transaction in nature meaning it gets updated with new information as they come in, for example, data coming from an outlet store about the products being sold in each store.
As shown in the image above, various departments data will be dumped into the new data warehouse created.
Now that we created the exact source table but obviously with no data on it... Let's now try to load the data from the source table using 'Visual studio'.
SUCCESS !!
This data usually is very important to the company and prefers to keep it intact with very minimal interaction with it.
Therefore, it is better to have a copy of the data and create a new data warehouse where we can bring the OLTP and mix it together with other sources of data generated by the company, for example, marketing data so we can slice and dice the data to figure out, for example, the correlation between the use of this new channel of marketing (google ads) and the increase or the decrease of sales.
now, for the data warehouse purpose to be fulfilled and for analysis purposes for example we will need to create a star schema.
Star Schema
As we can see the middle table will branch out to other tables each table will contain further information about each specific dimension of the business each branch, for example, will be a branch with info about the store sales, store employees, stores products, and so on.
And to get from one table to another is becoming fairly easy as all the tables use an interconnected column of data or as data people, we refer to to "Primary Key".
This model of storing data is very beneficial for STRUCTURED Data.
ETL Vs ELT - Data warehouse vs Cloud data Lake
with the rise of new cloud technology many companies are moving to build their data warehouse in the cloud, I will explain this in a different article. But the concept will remain the same but some tools will be different.
Now, I will demonstrate ETL using free OLTP data generated by Microsoft, you can access data from GitHub from here: https://github.com/Microsoft/sql-server-samples/releases/tag/adventureworks
I then created a new database and will name it 'Mody', the idea here is to load the information from the source table to the database I just created.
Ok since the source database and my newly created database are in the same server, I could just load the entire tables needed to the new database by running,
select*
Into the new database
From the source database
But for the example's sake, we will assume they are not on the same server.
I will demonstrate how can I load just one table from the source database to our newly created database 'Mody'.
First, I will just create the table but with no information on my database, as we said we will only choose 1 table we will go for the EmployeePayHistory table.
So
just to make sure, the source table data looks like this, but obviously, ours is just empty.
so in visual studio, I created a new project, then connected the server where data is stored by right-clicking on the bottom window where it says connection manager and clicking on new OLE DB connection, it prompted me to choose which server and got the name of the server from SQL editor.
As you can see I now connected both databases, the original source named 'Adventurework' and mine named 'Mody'.
As you can see I created one source and one destination OLE DE tag, I then rename them to make things a bit clearer.
Now I will create the visual creation of the loading phase from one database to another.
When I click on it to connect it to the specific database I will then get prompted and asked for the specific table I want to load to my database, I chose the one we already agreed to load and the one we created a table for the 'employee pay history' table. I will do the same for the destination which should be the exact empty table I created.
Now I dragged the blue arrow to point to the target table, and I also directed the tag to my specific table on my database as shown.
If you click on mapping, you can see that all columns are mapped properly from the source to the destination.
Amazing, now we did the E which is the extract, now what about some transformation. I came across these 2 columns on my source table and both of them are dates. so what if I want to get the difference in days between them i.e create a new column where it shows the number in days difference.
for this, I added to 'derived column' to my control flow, to run some transformations to my source data before I load it.
Its fairly easy to create transformation on SSIS, see below
First, name the new column, then the expression, you can choose from the list above any type of transformation you wish, the logic will be inserted in the box, and then you drag and drop the columns you wish to change or aggregate. I also added 'dd' because I want the number of differences in days.
....
Now you should also create this column head in your new table, i.e Mody database.
Don't forget to map it.
Now, we all ready, let's run the package on VisualStudio by clicking the start button, and we get all green checkmarks.
now let's examine our table on Ms. SQL server to see the handsome data uploaded.
We have completed a whole EXTRACT- TRANSFROM-LOAD 👍👌


Comments
Post a Comment