A data warehouse is a computer system used for reporting and data analysis. Traditionally, organizations have used data warehouses to store and integrate current and historical data collected from various sources including their sales, marketing, production, and finance systems.
Prior to the arrival of data warehousing, businesses collected data in well-defined, highly structured forms at a reasonably predictable rate and volume. Even as the speed of older technologies advanced, data access and usage were carefully controlled and limited to ensure acceptable performance for every user, thanks to the scarcity of on-premises computing power and storage and the difficulty of increasing those resources. This required organizations to tolerate very long analytics cycles.
The concept of data warehousing dates back to the late 1980s. At that time reporting was typically done from a number of Independent operational systems. The data warehousing concept provided an architectural model for the flow of data from operational systems to decision support environments. In larger corporations, it was typical for multiple decision support environments to operate independently even though their users often required much of the same stored data. The process of gathering, cleaning and integrating data from these disparate sources was in part replicated for each environment.
Dedicated data warehousing systems evolved to reduce the strain on legacy systems during a sustained period of transaction growth and increasing thirst for insights from their data. Hence, all that data was duplicated in a data warehouse for analysis, leaving the database to focus on transactions.
Over the years, data sources expanded beyond internal business operations and external transactions. Data volumes have grown exponentially as have the variety and speed of acquiring data from websites, mobile phones and apps, online games, online banking, and online shopping. Furthermore organizations are now capturing huge amounts of data from Internet of Things (IoT) devices.
Businesses collected data in well-defined, highly structured forms at a reasonably predictable rate and volume. Even as the speed of older technologies advanced, data access and usage were carefully controlled and limited to ensure acceptable performance for every user, thanks to the scarcity of on-premises computing power and storage and the difficulty of increasing those resources. This required organizations to tolerate very long analytics cycles.
There are many types of data warehouses but the three most common are:
1. Enterprise data warehouse (EDW) - an EDW is a database, or collection of databases, that centralizes a business's information acquired from multiple and diverse sources and applications. It makes the data available for analytics and use across the organization. EDWs can be housed in an on-premise server or in the cloud.
2. Operational Data Store (ODS) - an ODS is similar to an EDW in terms of scope, but data is refreshed in near real time and can be used for operational reporting.
3. Data Mart – This is a subset of data stored in a data warehouse and is used to support a specific region, business unit or function area (e.g. Sales)
A data warehouses has become an essential enterprise tool for a variety of reasons:
There are four stages or layers that will always be present in a Data Warehouse architecture.
The data sourcing layer acquires raw data from a variety of data sources. These are typically from operational systems but in recent times companies are increasingly analysing data from a wide range of new sources made possible by the explosion of online apps, mobile data and the internet of things.
In the data staging layer, the data is stored in a staging area, or landing zone, and goes through an Extract, Transform and Load process. The data is cleansed and transformed into a format that is suitable for querying and analysis.
In the data storage layer, depending upon the approach of the data warehousing architecture, the data will be stored in the data warehouse as well as in data marts. Data marts are are repositories of summarized data collected for analysis on a specific section or unit within an organization, for example, the sales department. Some also include an operational data store (ODS), which is a central database that provides a snapshot of the latest data from multiple transactional systems for operational reporting. A recent trend in data warehousing is to move its storage to the cloud, or to have a hybrid cloud-on premise solution.
The presentation layer is the layer with which users interact to fulfill their business intelligence requirements. Queries are run at this layer to perform analysis using popular reporting tools like Tableau and QlikView. This is also the interface between the data warehouse and any data driven apps.