What is Data warehouse?
Data warehouse is a large collection of business data from multiple databases that helps the organization to make decisions.
Types of data in Data warehousing
There are mainly 2 types of data stored,
- Operational data – These data are required in day-to-day work (like, time of delivery, deduction in the salary of the employee who arrives late to work, etc). These data stay at the small databases as they are only required there.
- Strategic information – These information are required for decision making as to running the business smoothy without any big losses. For example, information regarding sales, products that are sold, weather effect on the sales, etc.
Therefore, Data warehouse is subject-oriented (consists of information regarding the business), integrated (there are large quantity of variation of data present about the business/company), non-volatile (these data are not lost until manually removed) and time-variant (the data is updated and exist with the passage of time) collection of data in support of management’s decisions.
Data Extraction
Note: The process of loading the data into the data warehouse is called data warehousing.
Before we get to data extraction, let’s understand the Data warehouse architecture
Different terms in the architecture:-
- OLAP – stands for Online Analytic Processing. It is a process for answering multi-dimensional analytical queries.
- Reporting – how business is doing, how it will do in the future, requirements to improve the business.
- Data Mining – the goal of data mining is to extract the information from a data set and transform them to an understandable structure for further use.
- Meta data – data about the data. It helps the data warehouse analyst to identify and find out what data is in the warehouse and where it is actually stored by consuming less time and more efficiency.
- Summary data – this is the part of the data which is going to be queries to get the information.
- ETL – stands for Extract, Transform and Load. In data warehouse staging area, data is extracted and cleansed and transformed as per the user requirements.
Extract, Transform and Load in Data Warehouse
This is the process used for data extraction. The complete process of extracting the data from the heterogeneous sources, transforming and loading the data into warehouse is called ETL. This is a regular process.
The ultimate purpose of Data Warehouse is query processing which will not be performed efficiently if the ETL is not worked out.
ETL has 3 phases,
- Extraction
- Transform, and
- Loading
Extraction
Data from the previous layers, i.e., the sources of data to the data warehouse, has been extracted to the staging area from those sources.
Data Extraction Strategies – There are mainly 3 types of data extraction strategies. They are,
- Full extraction – all the data from the Operational systems and all the source systems gets extracted into the staging area. This generally happens in 2 scenarios,
- Initial Load, i.e., when data warehouse is getting populated for the first time.
- there is no strategy for identifying the changed records.
- Partial Extraction (with update notification) – in this, we only extract the data which has been modified. It is easy and quick as compared to full extraction. Here, we get notification from the source systems about which data has been updated. This is also called delta.
- Partial Extraction (without update notification) – in this, we do not extract full data set from the source system. We extract the data based on certain keys or certain strategies.
- Full extraction – all the data from the Operational systems and all the source systems gets extracted into the staging area. This generally happens in 2 scenarios,
Note: The extract step should be designed in a way that it does not negatively affect the source system in terms of performance, response time or any kind of locking.
Transform
Data extracted into a staging server is a raw data and cannot be used as it is. It needs to be cleansed, mapped and transformed.
Basic transformation tasks:-
- Selection – we select the data which is required to be loaded into the data warehouse or which is actually meant to be transformed.
- Matching – in this step, we look-up the data in various look-up files and then match the data which needs to be transformed.
- Data Cleansing and enrichment – data is not cleansed in our source systems. It is not standardized because we are fetching the data from more than one source. It has to be normalized and standardized. Hence, we do data cleansing and enrichment.
- Consolidation and summarization – we consolidate and aggregate the data from the source system because we do not want same data from the source system into the data warehouse.
Major Transformation types:-
- Standardizing data – since data is fetched from various sources so it needs to be standardized before loading it to warehouse.
- Character Set conversion and Encoding handling – we also need to convert the data to defined encoding in the Data warehouse as source system may or may not have same encoding.
- Calculated and derived values – new columns from existing columns.
- Splitting and/or merging fields.
- Conversion of unit of measurements – eg. Date Time conversion
- Summarization, aggregation and consolidation
- De – duplication – deleting of duplicate data received from multiple sources.
- Key restructuring – this is important as we do not use the same key in our data warehouse which are used in out operating system. Data warehouse uses the concept of surrogate keys. These keys are the non-meaningful keys and it should be generated irrespective of the keys (Primary keys) defined in source system.
Loading
Data loading fetches the prepared data, applies it to the warehouse and stores it in the database.
Types of loading:-
- Initial load – populating all the data warehouse tables for the first time.
- Incremental load – applying ongoing changes as necessary in a periodic manner. This will only load the records which has either changed ot they are newly inserted from the source systems.
- Full refresh – completely erasing the contents of one or more tables and reloading it with fresh data.
ETL Tools
- Enterprise Softwares – Informatica, IBM, DataStage, CloverETL, Microsoft SQL Server, talend, ab initio.
- Open Source/Community Softwares – kettle, talend.
ETL Tools Features:-
- Source and Target data systems connection
- Scalability and Performance
- Prebuild Transformation connectors
- Data Profiling and data cleansing
- logging and exception handling
- robust administration features (to perform multiple works in an integrated manner)
- easy integration with web services
- efficient batch and real time processing
Data Cleansing
It is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate or incomplete data within a data set.
Difference between Data Cleaning and Transformation:-
Data Cleaning is the process that removes data that does not belong in your dataset. Data transformation is the process of converting data from one format or structure into another. Transformation processes can also be referred to as data wrangling, or munging, transforming and mapping data from one ‘raw’ data form to another format for warehousing and analyzing.
Steps to cleaning data:-
- Remove duplicate or irrelevant information
- Fix structural errors
- Filter unwanted outliers
- Handle missing data
- Validate and QA
Components of quality data:-
- Validity – degree to define business rules or constraints
- Accuracy
- Completeness
- Consistency
- Uniformity
Benefits of Data Cleaning
- removal of errors from multiple sources of data
- ability to map the different functions
- monitoring errors and better reporting
- efficient business practices and quicker decision-making
Metadata
The index of data warehouse is metadata which gives the location to where the data is stored, to the user.
Metadata are of 3 types:-
- Operational MD – It defines the location of the source of the data in the Data warehouse from where it has been extracted.
- Extraction and Transformation – It keeps the information of the database system names, tables and column names and sizes, data types and allowed values. It also includes structural information such as primary and foreign key attributes and indices.
- End User – It has data ownership information, business definition and changing policies.
Role of Metadata
- It acts as a directory
- Used for query tools
- Used in extraction and cleansing tools
- Used in reporting tools
- Used in transformation tools
Related Links
- Data Mining Notes – Click Here
- Database Management System Notes – Click Here