It is a relational database/system used for reporting and data analysis. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used for creating analytical reports for knowledge workers throughout the enterprise.
A data warehouse is a centralized relational database that captures information from various divisions of an organization’s related business processes. This information will be stored and analyzed to determine predictive relationships through the use of data mining/analytical techniques.
For an enterprise to be successful and achieve their goals, they must make the right decisions at the right time. And to make and take the right decisions, they (business) need all relevant data to be considered from various departments/divisions of the enterprise. And the best source for that data is a well-designed data warehouse.
Data warehousing is an integral part of any effective business intelligence endeavor. Data warehousing is more than just a database-like method of storing information.
So, the idea is to create a permanent storage space for the data needed to support reporting, analysis, and other Business Intelligence functions. Though it may seem redundant to store data in multiple places (source systems, staging and the data warehouse), there are quite some advantages.
A database is used to store data while a Data Warehouse (though relational) is mostly used to facilitate reporting/analysis/mining/analytics.
Data Warehouse is more than just a database. It is an integral part of any effective business intelligence initiative. Once all the data is gathered from all its raw data via multiple sources, the data is processed using ETL into practical information which the business users can then apply various business rules to determine the effectiveness of business processes, create policy, forecast trends, and analyze the market and much more.
Operational systems are designed to support transaction processing whereas data warehousing systems are designed to support online analytical processing.
For example, the POS in a retail store is Operational system. The final reports generated on the overall sales and profits are done by the analytical system or the Data Warehouse.
An Operational Database is the database-of-record, consisting of system-specific reference data and event data belonging to a transaction-update system. It contains detailed data used to run the day-to-day operations of the business.
For example, a POS (Point of Sale) kiosk in a retail store.
An operational database is used for keeping track of payments and inventory. It takes information and amounts from credit cards and accountants use the operational database because it must balance up to the last penny.
For example, total sales happened in a store and the actual cost of the products to the total amount generated in cash, or online payments or via credit cards or thru any other payment systems. Both should match.
A Data Warehouse is an enterprise wide snapshot of data, typically used for decision-making.
A DBMS that runs these decision-making queries efficiently is sometimes called a “Decision Support System” or DSS.
There are five different stages for a DSS.
- Active Warehousing
Note:- These stages will differ based on the final goal of the Enterprise Data Warehouse, which differs from Enterprise to an other based on the business model.
Data warehouse typically focuses on reporting from a single source of truth within an Enterprise. The data warehouse brings huge value simply by integrating disparate sources of information within an organization into a single repository to drive decision-making across functional and/or product boundaries.
For the most part, the questions in a reporting environment are known in advance. Thus, database structures can be optimized to deliver good performance even when queries require access to huge amounts of information.
Data integration is the biggest challenge of reporting.
In this stage Data Warehouse decision-makers focus less on what happened and more on why it happened.
Analysis activities are concerned with drilling down beneath the numbers on a report to slice and dice data at a detailed level.
As an Enterprise becomes well-entrenched in quantitative decision-making techniques and experiences the value proposition for understanding the “what’s” and “whys” of its business dynamics, the next step is to leverage information for predictive purposes.
Understanding what will happen next in the business has huge implications for proactively managing the strategy for an organization.
This stage requires data mining tools (like RapidMiner (formerly known as YALE), WEKA ,R-Programming, Orange, KNIME, NLTK) for building predictive models using historical detail.
Operational typically means providing access to information for immediate decision-making in the field. Starts to bring us into the realm of active data warehousing.
It focuses on tactical decision support. Tactical decision support is not focused on developing corporate strategy in the ivory tower, but rather on supporting the people in the field who execute it.
The larger the role an active data warehouse plays in the operational aspects of decision support, the more incentive the business has to automate the decision processes.
Both for efficiency reasons and for consistency in decision-making, the business will want to automate decisions when humans do not add significant value.
Data warehouse is termed in four different ways:
- Subject Oriented
A data warehouse can be used to analyze a particular subject area. For example, to learn more about your company’s sales data, you can build a warehouse that concentrates on sales.
Using this warehouse, you can answer questions like “Who was our best customer for this item last year?” This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.
A Data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product to make it unique for the overall Enterprise analytics and reporting.
In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive.
A data warehouse’s focus on change over time is what is meant by the term time variant.
Nonvolatile means that, once entered into the warehouse, data should not change.
This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.
The structure of DWH contains about 4 different components
- Current data
- Older data
- Summarized data
Note: Depending on the business/project needs and the architecture planned to be implemented, these can change.
It reflects the most recent happenings. It is huge in volume because it is stored at the lowest level of granularity. It is fast to access, but expensive and complex to manage.
Infrequently accessed and stored at a level of detail consistent with current detailed data. Generally kept on removable storage like automatic tape library.
Lightly summarized data is distilled from low level of detail, found at the current detailed level.
Highly summarized data is compact and easily accessible. Only frequently used summarized data is permanently stored in the data warehouse.
Familiarly know as data about data. It is used as:
- A directory to locate the contents of the data warehouse.
- A guide to the mapping of data as the data is transformed from the operational environment to the data warehouse environment.
- A guide to the algorithms used for summarization between the current data and the summarized data.
Metadata also contains information about:
- Structure of the data
- Data extraction/transformation history
- Data usage statistics
- Data warehouse table sizes
- Column sizes
- Attribute hierarchies and dimensions
- Performance metrics
It is completely dependent on the set up of DWH, but we can classify these are four types
- Metadata for the data warehouse
- Metadata for the operational data store
- Metadata for the legacy application
- Metadata for the job processing/monitoring
Though each data warehouse is different, all are characterized by a few key common components:
- A data model to define the warehouse contents
- A database, whether hierarchical, relational, or multidimensional
- Numerous utilities for data scrubbing, copy management, data transport, data replication and cross-platform communication
- A server optimized for fast reporting and query processing
- A front-end Decision Support System (DSS) for reporting and trend analysis