A data warehouse is a computer system used for reporting and data analysis. Traditionally, organizations have used data warehouses to store and integrate current and historical data collected from various sources including their sales, marketing, production, and finance systems.
Prior to the arrival of data warehousing, businesses collected data in well-defined, highly structured forms at a reasonably predictable rate and volume. Even as the speed of older technologies advanced, data access and usage were carefully controlled and limited to ensure acceptable performance for every user, thanks to the scarcity of on-premises computing power and storage and the difficulty of increasing those resources. This required organizations to tolerate very long analytics cycles.
The concept of data warehousing dates back to the late 1980s. At that time reporting was typically done from a number of Independent operational systems. The data warehousing concept provided an architectural model for the flow of data from operational systems to decision support environments. In larger corporations, it was typical for multiple decision support environments to operate independently even though their users often required much of the same stored data. The process of gathering, cleaning and integrating data from these disparate sources was in part replicated for each environment.
Dedicated data warehousing systems evolved to reduce the strain on legacy systems during a sustained period of transaction growth and increasing thirst for insights from their data. Hence, all that data was duplicated in a data warehouse for analysis, leaving the database to focus on transactions.
Over the years, data sources expanded beyond internal business operations and external transactions. Data volumes have grown exponentially as have the variety and speed of acquiring data from websites, mobile phones and apps, online games, online banking, and online shopping. Furthermore organizations are now capturing huge amounts of data from Internet of Things (IoT) devices.
Businesses collected data in well-defined, highly structured forms at a reasonably predictable rate and volume. Even as the speed of older technologies advanced, data access and usage were carefully controlled and limited to ensure acceptable performance for every user, thanks to the scarcity of on-premises computing power and storage and the difficulty of increasing those resources. This required organizations to tolerate very long analytics cycles.
There are many types of data warehouses but the three most common are:
1. Enterprise data warehouse (EDW) - an EDW is a database, or collection of databases, that centralizes a business's information acquired from multiple and diverse sources and applications. It makes the data available for analytics and use across the organization. EDWs can be housed in an on-premise server or in the cloud.
2. Operational Data Store (ODS) - an ODS is similar to an EDW in terms of scope, but data is refreshed in near real time and can be used for operational reporting.
3. Data Mart – This is a subset of data stored in a data warehouse and is used to support a specific region, business unit or function area (e.g. Sales)
A data warehouses has become an essential enterprise tool for a variety of reasons:
There are four stages or layers that will always be present in a Data Warehouse architecture.
The data sourcing layer acquires raw data from a variety of data sources. These are typically from operational systems but in recent times companies are increasingly analysing data from a wide range of new sources made possible by the explosion of online apps, mobile data and the internet of things.
In the data staging layer, the data is stored in a staging area, or landing zone, and goes through an Extract, Transform and Load process. The data is cleansed and transformed into a format that is suitable for querying and analysis.
In the data storage layer, depending upon the approach of the data warehousing architecture, the data will be stored in the data warehouse as well as in data marts. Data marts are are repositories of summarized data collected for analysis on a specific section or unit within an organization, for example, the sales department. Some also include an operational data store (ODS), which is a central database that provides a snapshot of the latest data from multiple transactional systems for operational reporting. A recent trend in data warehousing is to move its storage to the cloud, or to have a hybrid cloud-on premise solution.
The presentation layer is the layer with which users interact to fulfill their business intelligence requirements. Queries are run at this layer to perform analysis using popular reporting tools like Tableau and QlikView. This is also the interface between the data warehouse and any data driven apps.
Business intelligence (BI) assists organisations in making data-driven decisions. Nowadays, BI is a combination of things including business analytics, data mining, data visualization, data tools and best practices working together to give your organization a comprehensive view of your data. BI is a driver of corporate change, identifying where efficiencies can be made and allowing the business to quickly adapt to changing market or supply conditions.
Interestingly, the term Business Intelligence was first coined in 1865 by an American, Richard Millar Devens, in his publication Cyclopædia of Commercial and Business Anecdotes. BI as we know it really only kicked off when Edgar Codd pioneered the development of the relational database but it wasn't until the 1980s that a variety of tools emerged to assess and organise data in simpler ways.
Some activities that fall under the Business Intelligence umbrella include:
Reporting: Sharing data analysis in a variety of formats so that stakeholders can draw conclusions and make decisions.
Data mining: The practice of analysing large databases in order to generate new information
Data visualisation: Turning data analysis into charts, graphs, histograms and other visual representations to more easily understand the data
Benchmarking: Drawing comparisons between current performance data and historical data
Predictive Analytics: The use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data
Data Querying: Using Structured Query Language (SQL) and other methods for asking specific questions and pulling specific answers from datasets
Data preparation: Compiling multiple data sources so that it can be used in data analysis
A cloud data warehouse is one that is delivered in a public cloud as a managed service. Broadly speaking there are 3 models:
In the recent past most major data warehouse technology vendors were promoting on-premises solutions over cloud, mainly because of the volumes of data involved. Nowadays most vendors are shifting their focus and we see a strong uptake in cloud data warehousing solutions, particularly hybrid ones. It may seem that cloud computing will soon entirely replace the idea of on-premises infrastructure.
There are a number of factors that have led to the move to the cloud. These include:
Scalability - Previously, IT teams would estimate their compute power and storage needs years in advance, estimating each department's needs not so much as a crystal ball. As such it was commonplace to over or under estimate. Too much and the company would be paying for storage it didn't need, too little and the business was hamstrung. With cloud data warehouses, this appraisal is unnecessary. Cloud data warehouse providers typically offer a “pay-as-you-go” model so capacity can be increased or decreased on demand.
Improved speed and performance - Modern businesses connect multiple data sources to their data warehouse including CRM, ERP, social media and marketing systems. Cloud data warehouses can integrate more easily with new data sources and so the time to production is much faster than on-premises systems. Query performances are also often superior to on-premises systems as they leverage multiple servers that can balance the data load and increase processing speeds across multiple geographies.
Cost savings - Whereas running a data warehouse in the cloud was previously thought of as an expensive alternative, the costs of running a cloud data warehouse nowadays are significantly less than they were. With large Data Warehouse As a Service (DWaaS) providers such as Amazon Redshift, Google BigQuery, Microsoft Azure Synapse and Snowflake the infrastructure costs tend to be less, and the cost of mass storage in the cloud continues to fall. On-premises data warehouses, however, require expensive hardware and technical resources to perform upgrades, ongoing maintenance, and outage management. The capacity planning team needs to ensure that sufficient computing power is available to accommodate increased demand during busy periods and this excess capacity must be paid for even when demand drops off. Cloud data warehouses, on the other hand, allow data warehouse teams to increase or decrease the available compute power and storage as needed.
Improved self-service capabilities for business users - Self-service analytics is difficult to deliver with a traditional on-premises data warehouse because it is difficult to predict user demand and to ensure that the queries the users run don't affect processing power or hit storage limits. Setting up virtual machines and reserving physical storage for individual users is time consuming and even then business users must understand the data and be proficient in the query language specific to that system to run queries. Cloud data warehouses, however, deliver higher performance even as data volume and number of concurrent users increase. All users can query data from disparate sources without affecting each other, even choosing their own BI or analytical tool - something that is impossible with traditional on-premise solutions.
Security - Previously, on-premises data warehouses were considered more secure than cloud data warehouses, but this too has changed. On-premises data warehouses employ teams who oversee database security which is costly and prone to human error. Cloud providers have invested billions in building controls for safeguarding data such as cloud encryption services and multi-factor authentication and so businesses are now realising that cloud data is actually more secure and less costly to administer.
High availability - Cloud data warehouse providers guarantee a certain percentage of uptime—usually greater than 99% They achieve this with high built-in redundancy, sophisticated monitoring and mechanisms that can switch automatically between active and redundant components if a failure is detected. Whilst modern on-premises data warehouse systems are very reliable, the processes that are in place to detect problems and to recover the systems when they fail require human intervention and so by nature are inferior to those that are built into cloud technologies.
Improved disaster recovery - Disaster recovery for an on-premises data warehouse is costly and often fails to perform when required, such as not being able to withstand a natural disaster or not having been properly tested for system recovery time. Maintaining a DR solution includes providing a secondary data center containing a duplicate set of hardware complete with software licences and a full set of data that has to be continuously replicated. It's a big overhead that will often never even be used. A cloud disaster recovery solution, however, is a completely different proposition. No physical assets are required and nearly all cloud data warehouses support asynchronous duplication of data, and perform consistent snapshots and backups automatically. If a disaster occurs, users can log into a cloud-based application from any location and get access to the replicated data.