There are six steps involved in the implementation and administration of the data model:
- The operational data
- Data migration
- Database administration
- Middleware
- Decision support applications
- The user or presentation interface
This step provides the source data for the enterprise data warehouse. Because there are usually considerable differences in the quality of data on different operational systems, it is necessary, in some instances, to condition the data before it is transported into the data warehouse environment.
It is very likely that there is a considerable gap between the data model for the data warehouse and other data models on which the individual operational systems are based.
An essential task in building a data warehouse is to properly map such data from the operational data to the data warehouse database.
The extraction, conversion, and migration of data from the source to the target must be done such that the data warehouse holds only accurate, timely, and credible data.
- Refreshing, i.e. copying the data from the operational system into the data warehouse wholesale, is the simpler option. It involves no transformation of the content, although the physical layout may change, for example from hierarchical to relational.
- Updating the data at intervals overcomes the deficiencies of wholesale refreshing.
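The two options can be sketched with in-memory dictionaries standing in for the operational source and the warehouse target; all table and field names here are illustrative, not part of any real system:

```python
# Operational source rows, keyed by a business key (illustrative data).
source = {
    "C1": {"name": "Acme", "region": "East"},
    "C2": {"name": "Bravo", "region": "West"},
}

def full_refresh(warehouse, source):
    """Option 1: drop everything and reload -- simple, no transformation."""
    warehouse.clear()
    warehouse.update({k: dict(v) for k, v in source.items()})

def incremental_update(warehouse, changed_rows):
    """Option 2: apply only the rows that changed since the last load."""
    for key, row in changed_rows.items():
        warehouse[key] = dict(row)

warehouse = {}
full_refresh(warehouse, source)
# Later, only the changed row travels to the warehouse.
incremental_update(warehouse, {"C2": {"name": "Bravo", "region": "North"}})
print(warehouse["C2"]["region"])  # North
```

The incremental path avoids retransferring unchanged rows, which is why it scales better than a full refresh as data volume grows.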
It is in this step that the model is actually implemented. Therefore, compliance with a standard model and quick business benefits have to be maintained. Data granularity and metadata are the most important issues to be considered.
The degree to which the data warehouse is accessed by a wide variety of users determines the complexity needed in the middleware. Middleware is the range of system software required to make the data warehouse accessible in a client/server environment. It is the middleware that allows an application on a client to execute a request for data on a local (LAN) or remote database server (the data warehouse).
Decision support applications are employed to use the data warehouse. Some of those applications present information in the form of predefined reports and statistical analyses. Others are interrogative, allowing users to construct queries and interact directly with the data.
The command-line interface is the most basic interface level and is appropriate for submitting complex ad hoc queries written in SQL. The menu-driven interface provides the user with controlled access to the data. A hypertext interface is useful for presenting metadata to users.
‘OLAP’ stands for On-Line Analytical Processing.
OLAP describes a class of technologies that are designed for live ad hoc data access and analysis, based on multidimensional views of business data.
With OLAP tools individuals can analyze and navigate through data to discover trends, spot exceptions, and get the underlying details to better understand the flow of their business activity.
A user’s view of the enterprise is multidimensional in nature. Sales can be viewed not only by product but also by region, time period, and so on. That is why OLAP models should be multidimensional in nature.
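The multidimensional view can be sketched with plain Python: each sales fact carries a value for every dimension, and the same facts can be rolled up along any of them. The products, regions, and amounts are invented for illustration:

```python
from collections import defaultdict

# Illustrative sales facts: (product, region, quarter, amount).
sales = [
    ("widget", "East", "Q1", 100),
    ("widget", "West", "Q1", 150),
    ("gadget", "East", "Q2", 200),
]

def rollup(facts, dim):
    """Aggregate the amount along one dimension (0=product, 1=region, 2=quarter)."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dim]] += fact[3]
    return dict(totals)

by_product = rollup(sales, 0)
by_region = rollup(sales, 1)
print(by_product)  # {'widget': 250, 'gadget': 200}
print(by_region)   # {'East': 300, 'West': 150}
```

The same facts answer both questions; only the axis of aggregation changes, which is exactly the "view by product, by region, by time" idea above.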
OLAP:
- Purpose: information retrieval and analysis
- Structure: RDBMS
- Data model: multidimensional
- Access: SQL plus data-analysis extensions
- Type of data: data to analyse the business
- Condition of data: historical and descriptive
OLTP:
- Purpose: run day-to-day operations
- Structure: RDBMS
- Data model: normalized
- Access: SQL
- Type of data: data that runs the business
- Condition of data: incomplete, changing
Check the infographic for the differences between OLTP, OLAP, ODS, DWH, and DSS.
The most commonly used schemas are:
- Star Schema
- Snowflake Schema
- Fact Constellation Schema
Characteristics of the star schema:
- Simple structure -> easy-to-understand schema
- Good query performance -> only a small number of tables to join
- Relatively long data-loading time into dimension tables -> denormalization and data redundancy can make the dimension tables large
- The most commonly used schema in data warehouse implementations -> widely supported by a large number of business intelligence tools
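A minimal star schema can be sketched with SQLite: one fact table surrounded by denormalized dimension tables. All table, column, and data values here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One fact table plus denormalized dimension tables -- the star layout.
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'widget', 'hardware');
INSERT INTO dim_date    VALUES (10, '2024-01-05', '2024-01', 2024);
INSERT INTO fact_sales  VALUES (1, 10, 99.5), (1, 10, 0.5);
""")

# A typical star query: a small, fixed number of joins from fact to dimensions.
cur.execute("""
SELECT p.category, d.month, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date    d ON d.date_id    = f.date_id
GROUP BY p.category, d.month
""")
result = cur.fetchall()
print(result)  # [('hardware', '2024-01', 100.0)]
```

Every analytic query joins the fact table directly to each dimension it needs, which is the "small number of tables to join" advantage listed above.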
The snowflake schema architecture is a more complex variation of the star schema used in a data warehouse, because the tables which describe the dimensions are normalized.
Similar to a star schema, a snowflake schema contains fact and dimension tables, with the fact table storing facts about business transactions and some columns of the fact table containing foreign key references to the dimension tables.
Snowflake schemas differ from star schemas in their level of normalization; the dimension tables of a snowflake schema are typically normalized to third normal form (3NF) or higher.
As a result, the data for a given dimension is spread out over multiple tables, and a diagram of the database resembles a snowflake.
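The normalization can be sketched by splitting the product dimension of the earlier star example into two tables; the names and data are again illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Snowflake layout: the product dimension is normalized, so category
# lives in its own table and dim_product references it by key.
cur.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
INSERT INTO dim_category VALUES (1, 'hardware');
INSERT INTO dim_product  VALUES (1, 'widget', 1);
INSERT INTO fact_sales   VALUES (1, 42.0);
""")

# Asking for sales by category now needs one more join than in a star schema.
cur.execute("""
SELECT c.name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product  p ON p.product_id  = f.product_id
JOIN dim_category c ON c.category_id = p.category_id
GROUP BY c.name
""")
result = cur.fetchall()
print(result)  # [('hardware', 42.0)]
```

The extra join is the price of normalization; the payoff is that each dimension attribute is stored exactly once.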
The fact constellation architecture contains multiple fact tables that share many dimension tables. The main shortcoming of the fact constellation schema is a more complicated design, because many variants for particular kinds of aggregation must be considered and selected.
Moreover, the dimension tables are still large. For each star schema it is possible to construct a fact constellation schema.
A data warehouse is the main repository of all significant data collected by various business divisions and departments of an enterprise.
A data warehouse server is the physical storage used by a data warehouse system.
There are three types of warehouse server:
- Enterprise warehouse server
- Data marts
- Virtual data warehouse
Note: There are different versions of, and naming conventions for, the types of warehouse servers; you may see them named differently in different contexts.
The vision for the ‘EDW’ is to provide information that is secure, accurate, timely, consistent, integrated, appropriately detailed, well-organized, and easy to obtain so that people throughout the campus — staff, faculty, researchers, and executive-level administrators — will be better able to assess their needs, set priorities, understand the impact of change, and fulfill their programmatic responsibilities more efficiently.
A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse.
Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart, including all of the hardware and software.
There are two types:
- Stand-Alone data mart
- Dependent data mart
This approach enables a department or work-group to implement a data mart with minimal or no impact on the enterprise’s operational database.
A dependent data mart gets its data from a data warehouse. Data in the warehouse is aggregated, restructured, and summarized as it passes into the dependent data mart. The warehouse's own data sources include operational databases and external sources of data.
Benefits of a dependent data mart include:
- KPI tracking
A virtual data warehouse provides a compact view of the data inventory. It contains metadata and uses middleware to build connections to different data sources. Virtual warehouses can be fast because they allow users to filter the most important pieces of data from different legacy applications.
The main function of the ‘OLAP’ engine is to present the user a multidimensional view of the data warehouse and to provide tools for ‘OLAP’ operations. If the warehouse server organizes the data warehouse in the form of multidimensional arrays, the engine can operate on those arrays directly (the multidimensional model).
The relational model, by contrast, assumes that the warehouse organizes data in a relational structure, and the engine provides an SQL-like environment for OLAP tools.
There are three main types of ‘OLAP’: ROLAP, MOLAP, and HOLAP.
‘ROLAP’ works directly with relational databases. The base data and the dimension tables are stored as relational tables and new tables are created to hold the aggregated information. It depends on a specialized schema design.
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP’s slicing and dicing functionality.
Two important characteristics of ‘ROLAP’ are:
- Data warehouse and relational database are inseparable
- Any change in the dimensional structure requires a physical reorganization of the database, which is too time consuming.
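The ROLAP pattern of "new tables created to hold the aggregated information" can be sketched with SQLite: the engine materializes an aggregate as an ordinary relational table and serves sliced queries from it. Names and figures are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE fact_sales (region TEXT, month TEXT, amount REAL);
INSERT INTO fact_sales VALUES
    ('East', '2024-01', 100), ('East', '2024-01', 50), ('West', '2024-02', 70);

-- The ROLAP pattern: materialize an aggregate as an ordinary relational
-- table, then answer "sliced" queries from it instead of the base facts.
CREATE TABLE agg_sales_region_month AS
SELECT region, month, SUM(amount) AS total
FROM fact_sales
GROUP BY region, month;
""")

cur.execute("SELECT total FROM agg_sales_region_month WHERE region = 'East'")
east_total = cur.fetchone()[0]
print(east_total)  # 150.0
```

Because the aggregate is just another relational table, any change to the dimensional structure forces it to be rebuilt, which is the reorganization cost noted above.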
‘MOLAP’ is the classic form of OLAP and is sometimes referred to as just OLAP. MOLAP stores data in optimized multi-dimensional array storage, rather than in a relational database.
Some of the advantages of MOLAP are:
- Fast query performance due to optimized storage, multidimensional indexing and caching.
- Smaller on-disk size of data compared to data stored in relational database due to compression techniques.
- Automated computation of higher level aggregates of the data.
- It is very compact for low dimension data sets.
- Array models provide natural indexing
Some of the disadvantages of MOLAP are:
- Within some MOLAP Solutions the processing step (data load) can be quite lengthy, especially on large data volumes. This is usually remedied by doing only incremental processing, i.e., processing only the data which have changed (usually new data) instead of reprocessing the entire data set.
- Some MOLAP methodologies introduce data redundancy.
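The array storage and "natural indexing" advantage can be sketched with a tiny dense cube built from nested Python lists; the dimensions and numbers are invented for illustration:

```python
# A tiny MOLAP-style cube: a dense 3-D array indexed by
# (product, region, quarter) position rather than by join keys.
products = ["widget", "gadget"]
regions = ["East", "West"]
quarters = ["Q1", "Q2"]

# cube[p][r][q] holds the sales amount for that cell.
cube = [[[0.0 for _ in quarters] for _ in regions] for _ in products]
cube[0][0][0] = 100.0   # widget / East / Q1
cube[0][1][0] = 150.0   # widget / West / Q1
cube[1][0][1] = 200.0   # gadget / East / Q2

def total_by_product(cube):
    """Aggregate over the region and quarter axes -- pure array traversal,
    no joins and no key lookups."""
    return [sum(q for r in prod for q in r) for prod in cube]

totals = total_by_product(cube)
print(totals)  # [250.0, 200.0]
```

The dense layout also shows the trade-offs listed above: cells exist for every combination (redundant zeros for sparse data), and loading means populating the whole array up front.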
HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches. HOLAP tools can utilize both pre-calculated cubes and relational data sources.
- MOLAP – Multidimensional OLAP – OLAP done using an MDBMS
- Advantages: Powerful, efficient database engines for manipulating data cubes
- Disadvantages: MDBMS use proprietary database structure and DML (e.g., not SQL).
Requires different skill set, modeling tools, etc.
- Many commercial MOLAP systems are tightly integrated with reporting and analysis tools (BI tool sets).
- Some commercial MOLAP servers also support Relational OLAP
- ROLAP – Relational OLAP – OLAP done on a relational DBMS
- Advantages: Uses familiar RDBMS technologies and products.
Uses familiar SQL
- Disadvantages: Historically inefficient implementation (although have improved considerably over time).
Data warehouse systems use back end tools and utilities to populate and refresh the data. These tools and utilities include the following functions:
- Data extraction
- Data cleaning
- Data Transformation
Data extraction is the process of extracting data for the warehouse from various sources. The sources include:
- Production data
- Legacy data
- Internal office systems
- External systems
Data cleaning is also known as data scrubbing. It is a process which ensures that a set of data is correct and accurate. Data accuracy, consistency, and integration are checked during data cleaning. Data cleaning can be applied to a single set of records or to multiple sets of data which need to be merged.
Data cleaning techniques include:
- Using transformation rules
- Using domain-specific knowledge
- Parsing, fuzzy matching, etc.
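The first two techniques, plus fuzzy matching, can be sketched with the standard library; the rule table and the set of valid values are invented examples of domain-specific knowledge:

```python
from difflib import SequenceMatcher

# Transformation rules: normalize known variants to a canonical value.
RULES = {"calif.": "California", "ca": "California", "n.y.": "New York"}

# Domain-specific knowledge: the set of valid values for this field.
VALID_STATES = ["California", "New York", "Texas"]

def clean_state(raw):
    value = raw.strip().lower()
    if value in RULES:                       # rule-based correction
        return RULES[value]
    # Fuzzy matching: accept the closest valid value if it is close enough.
    best = max(VALID_STATES,
               key=lambda s: SequenceMatcher(None, value, s.lower()).ratio())
    if SequenceMatcher(None, value, best.lower()).ratio() > 0.8:
        return best
    return None  # cannot be cleaned automatically; flag for review

print(clean_state(" CA "))        # California (rule)
print(clean_state("Californa"))   # California (fuzzy match)
```

Real cleaning pipelines layer many such checks; the point here is only that rules handle known variants while fuzzy matching catches near-misses such as typos.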
A data transformation converts a set of data values from the data format of a source data system into the data format of a destination data system.
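A minimal sketch of such a conversion, assuming a source system that emits dates as `MM/DD/YYYY` strings and amounts as text, and a target that expects ISO dates and floats (both formats are illustrative):

```python
from datetime import datetime

def transform(row):
    """Convert one source-format record into the destination format."""
    return {
        "date": datetime.strptime(row["date"], "%m/%d/%Y").date().isoformat(),
        "amount": float(row["amount"]),
    }

out = transform({"date": "01/05/2024", "amount": "99.50"})
print(out)  # {'date': '2024-01-05', 'amount': 99.5}
```

In practice such field-level conversions are composed into a mapping that covers every column moving from source to target.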
After the data has been cleansed and transformed into a structure consistent with the data warehouse requirements, data is ready for loading into the data warehouse.
You may make some final transformation during the loading operation, although you should complete any transformations that could identify inconsistencies before the final loading operation.
The loading techniques are:
- Batch loading
- Sequential loading
- Incremental loading
- Dependency loading
- Full Load
Data Refreshing is the process of updating active data replicates based on a regular, known schedule. The frequency and timing of data refreshing must be established to match business needs and must be known by clients.
Data warehouse architecture usually provides a set of management tools which include load manager, query manager, warehouse manager.
In addition, the warehouse manager must be supported by other management tools, such as the server manager and network manager.