A data warehouse is a computer system used for reporting and data analysis. Traditionally, organizations have used data warehouses to store and integrate current and historical data collected from various sources including their sales, marketing, production, and finance systems.
Prior to the arrival of data warehousing, businesses collected data in well-defined, highly structured forms at a reasonably predictable rate and volume. Even as the speed of older technologies advanced, data access and usage were carefully controlled and limited to ensure acceptable performance for every user, thanks to the scarcity of on-premises computing power and storage and the difficulty of increasing those resources. This required organizations to tolerate very long analytics cycles.
The concept of data warehousing dates back to the late 1980s. At that time reporting was typically done from a number of Independent operational systems. The data warehousing concept provided an architectural model for the flow of data from operational systems to decision support environments. In larger corporations, it was typical for multiple decision support environments to operate independently even though their users often required much of the same stored data. The process of gathering, cleaning and integrating data from these disparate sources was in part replicated for each environment.
Dedicated data warehousing systems evolved to reduce the strain on legacy systems during a sustained period of transaction growth and increasing thirst for insights from their data. Hence, all that data was duplicated in a data warehouse for analysis, leaving the database to focus on transactions.
Over the years, data sources expanded beyond internal business operations and external transactions. Data volumes have grown exponentially as have the variety and speed of acquiring data from websites, mobile phones and apps, online games, online banking, and online shopping. Furthermore organizations are now capturing huge amounts of data from Internet of Things (IoT) devices.
Businesses collected data in well-defined, highly structured forms at a reasonably predictable rate and volume. Even as the speed of older technologies advanced, data access and usage were carefully controlled and limited to ensure acceptable performance for every user, thanks to the scarcity of on-premises computing power and storage and the difficulty of increasing those resources. This required organizations to tolerate very long analytics cycles.
There are many types of data warehouses but the three most common are:
Enterprise data warehouse (EDW) - an EDW is a database, or collection of databases, that centralizes a business's information acquired from multiple and diverse sources and applications. It makes the data available for analytics and use across the organization. EDWs can be housed in an on-premise server or in the cloud.
Operational Data Store (ODS) - an ODS is similar to an EDW in terms of scope, but data is refreshed in near real time and can be used for operational reporting.
Data Mart – This is a subset of data stored in a data warehouse and is used to support a specific region, business unit or function area (e.g. Sales)
A data warehouses has become an essential enterprise tool for a variety of reasons:
There are four stages or layers that will always be present in a Data Warehouse architecture.
The data sourcing layer acquires raw data from a variety of data sources. These are typically from operational systems but in recent times companies are increasingly analysing data from a wide range of new sources made possible by the explosion of online apps, mobile data and the internet of things.
In the data staging layer, the data is stored in a staging area, or landing zone, and goes through an Extract, Transform and Load process. The data is cleansed and transformed into a format that is suitable for querying and analysis.
In the data storage layer, depending upon the approach of the data warehousing architecture, the data will be stored in the data warehouse as well as in data marts. Data marts are are repositories of summarized data collected for analysis on a specific section or unit within an organization, for example, the sales department. Some also include an operational data store (ODS), which is a central database that provides a snapshot of the latest data from multiple transactional systems for operational reporting. A recent trend in data warehousing is to move its storage to the cloud, or to have a hybrid cloud-on premise solution.
The presentation layer is the layer with which users interact to fulfill their business intelligence requirements. Queries are run at this layer to perform analysis using popular reporting tools like Tableau and QlikView. This is also the interface between the data warehouse and any data driven apps.
Business intelligence (BI) assists organisations in making data-driven decisions. Nowadays, BI is a combination of things including business analytics, data mining, data visualization, data tools and best practices working together to give your organization a comprehensive view of your data. BI is a driver of corporate change, identifying where efficiencies can be made and allowing the business to quickly adapt to changing market or supply conditions.
Interestingly, the term Business Intelligence was first coined in 1865 by an American, Richard Millar Devens, in his publication Cyclopædia of Commercial and Business Anecdotes. BI as we know it really only kicked off when Edgar Codd pioneered the development of the relational database but it wasn't until the 1980s that a variety of tools emerged to assess and organise data in simpler ways.
Some activities that fall under the Business Intelligence umbrella include:
Reporting: Sharing data analysis in a variety of formats so that stakeholders can draw conclusions and make decisions
Data mining: The practice of analysing large databases in order to generate new information
Data visualisation: Turning data analysis into charts, graphs, histograms and other visual representations to more easily understand the data
Benchmarking: Drawing comparisons between current performance data and historical data
Predictive Analytics: The use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data
Data Querying: Using Structured Query Language (SQL) and other methods for asking specific questions and pulling specific answers from datasets
Data preparation: Compiling multiple data sources so that it can be used in data analysis
A cloud data warehouse is one that is delivered in a public cloud as a managed service. Broadly speaking there are 3 models:
Most data warehouse providers offer at a minimum a PaaS equivalent to their on-premises data warehouse offering, but the market for full SaaS solutions is growing rapidly.
In the recent past most major data warehouse technology vendors were promoting on-premises solutions over cloud, mainly because of the volumes of data involved. Nowadays most vendors are shifting their focus and we see a strong uptake in cloud data warehousing solutions, particularly hybrid ones. It may seem that cloud computing will soon entirely replace the idea of on-premises infrastructure.
There are a number of factors that have led to the move to the cloud. These include:
Hybrid cloud computing combines an on-premises solution, typically a private cloud, with one or more public cloud services, with proprietary software enabling communication between each distinct service. Hybrid cloud services are powerful because they combine the best of both worlds: the integration and optimization of both on-premises and cloud solutions to handle the workloads and processes for which each is best suited.
Its popularity stems from big advances in cloud computing technology and huge growth in cloud storage capacity. Cloud-based solutions are now promoted as a convenient choice for businesses. Until fairly recently most major data warehouse technology vendors were promoting on-premises solutions over cloud, mainly because of the volumes of data involved. In this decade most vendors have shifted their focus and we see a strong uptake in cloud data warehousing solutions, particularly hybrid ones. It may seem that cloud computing will soon entirely replace the idea of on-premises infrastructure.
Data migration is the process of permanently transferring your data from one computer storage system to another. Activities in the data migration project include selecting, preparing, extracting, and transforming the data on the source system; the physical transfer to the target system; and validation of the migrated data for completeness.
Data migration is an important stage in any system implementation, upgrade, or consolidation. Given the volumes of data involved, especially with data warehouse migrations, they are performed in such a way as to be as automated as possible, freeing up human resources from mundane tasks. Nowadays, data warehouse migrations frequently occur as businesses move from on-premises infrastructure and applications to cloud-based storage and applications to optimise or transform their company.
According to a 2019 Gartner report, more than 50% of data migration projects will exceed budget and timeline and/or harm the business, due their flawed strategy and execution.
Your data is extremely important, and if it is not migrated properly, it can cause a devastating loss of data and disrupt your business. As such you need a data migration plan to help ensure success for your data migration.
The selected data migration strategy should take into consideration the needs of the business/end users, the type/volume of data to be migrated, and the company’s resources. It should also be sponsored at board-level to ensure that adequate funding and resources are provided.
There are three broad strategies that can be adopted:
A big bang approach refers to all the data being transferred from its source to a target system in a single operation and point of time. This is the highest risk strategy that requires a lot of planning and careful execution to ensure success. A downside of this approach is a that some downtime will be needed to complete the migration. Business units that require real-time access to their data 24x7 may not find this acceptable.
A trickle migration divides the data into smaller chunks and is a more in-depth method of database migration. If one of these sub-migrations fails, it can usually be rolled back and re-run without interfering with the rest of the migrations. A downside of this approach is that it is time-consuming and requires more resources, as the old system and new system must run simultaneously.
A zero-downtime migration replicates data from the original database to the target database and allows the use of the source database during the migration process. This lessens the disruption to the business and facilitates a faster migration time. Our product SmartSafe Database Replication as a Service facilitates this approach.
Here are some examples of types of data migration:
Storage data migration. This will occur when a business chooses to rationalise their data to take advantage of more efficient storage technologies, often using virtualisation techniques. It involves involves transferring data between storage subsystems or server systems. The content and format of the data itself will not usually be changed in the process and it can usually be achieved with minimal or no impact to the layers above.
There can be several reasons to do a storage data migration, such as:
Database migration. This is the process of migrating data from one or more source databases to one or more target databases, usually by using a database migration service or tool. It is often required when moving from one database technology to another, or to upgrade the version of the database software. In the latter case it is less likely to require a physical data migration, but this can happen with major upgrades such as to the underlying technology. In these cases a physical transformation process may also be required as the underlying data format may have also changed.
When a database migration is finished, the dataset in the source databases resides fully in the target databases. Clients that accessed the source databases are then switched over to the target databases, and the source databases are decommissioned.
Cloud migration. This is the process of moving data, workloads, IT resources, or applications to cloud infrastructure. It often refers to moving tools and data from old, legacy infrastructure or an on-premises data centre to the cloud, but it may also refer to the movement of data between between cloud providers. In many cases, it also entails a storage migration.
Application migration. This is the process of moving software applications between different computing environments. The reasons for doing so include:
Netezza Replication Services is a feature of Netezza that allows Netezza databases to be replicated between a master and one or more secondary nodes. SmartSafe, on the other hand, is a managed service provided by Smart Associates that allows multi-master Netezza database replication with a number of enhanced features.
Here is a comparison between the capabilities of each:
When an enterprise maintains a replicated Netezza architecture, the replication that occurs between the primary and secondary systems is typically uni-directional, i.e. the primary system is fed by the source systems, and the secondary systems are mirror copies of the primary system. If the primary system fails, a secondary system assumes the role of primary, DNS changes are made to redirect the feeds and the user connections, and business carries on as usual until the other system is fixed and roles are reversed again.
Traditional Netezza Database Replication
SmartSafe database replication is bi-directional (BDR) and allows all Netezza nodes to be primary systems.
An example of this is where you have two or more systems and each of them can be the primary system for a set of users, one or more applications and one or more source systems, and you can replicate the data between them all. This gives you a natural way of dividing your users across your infrastructure whilst still being able to have replication across all those different systems. Moreover, in a multi-master scenario, each master also potentially acts as a Disaster Recovery site for one or more of the other masters.
Multi-master bi-directional replication (BDR) as provided by Smart Management Frameworks
If you are thinking of buying a new Netezza Performance Server for IBM Cloud Pak for Data system or of migrating to Netezza in the cloud, Smart Management Frameworks SmartSafe can be used to automate the migration between the old and the new environments.
Replicating the existing Netezza databases to the new environment allows the migration to be performed incrementally and in a highly controlled manner. It allows for a dual run of both old and new systems side by side which means that you can control when the cutover occurs between the old and the new.
For customers wanting to try out the cloud service, SmartSafe will allow you to do a dual-run whilst doing an evaluation as part of the migration process. So if you are not sure if the Cloud solution will cope with the volume of data, queries and users and be better and more reliable than the on-premise option, it could potentially be used to replicate some or all of your existing data to that cloud option. It provides a safe way of testing the new netezza platform without having to take a leap in faith, keeping existing systems in place until you are ready to do the transition and the switch over, without it having to be a big-bang approach.
Many customers maintain a separate Netezza appliance as a development environment, and this system will typically have much less capacity than the production system. Maintaining a subset of data on dev can be quite a fiddly task, especially if you want the data accurately reflect the production environment. With SmartSafe you can replicate a subset of your production data to your dev environment so that it contains real world data that is accurate and up-to-date and small enough to fit on the dev environment. In exactly the same way you could create a separate test environment, and replicate a sub-set of production data to it.
So if, for example, you are constrained by disk space and so only want the last six months of transactional data on dev and test, SmartSafe can maintain that rolling window of six months of historical data for you. Alternatively, it can replicate a percentage of records using a percentage algorithm to determine the data that gets replicated. This is a much better option having to using a WHERE constraint or applying filter to the source data
SmartSafe works with all combinations of versions of Netezza systems - from the PureData for Analytics Nx00x range, to IBM’s latest generation Netezza on System or Netezza on Cloud offerings.
By choosing SmartSafe you will be able to replicate your old and new Netezza systems and avoid the problems that are invariably associated with upgrading to new technologies. Why is this? Because you can run your old and your new systems in parallel and choose the moment you cut-over with a few simple steps, obviating the need for extended system downtime and drawn-out post-migration user acceptance testing.
Mean Time to Recovery is quite simply, “how long will it take to recover our system to being fully operational after a failure”. It is sometimes also referred to as Recovery Time Objective.
MTTR is important to the business as it is regarded as a measure of how fit for purpose your Netezza disaster recovery system is. It implies that a full disaster recovery test has been conducted and that this time to recovery has been achieved, rather than just being an objective. If the MTTR has not been tested, your business runs the risk of losing their Netezza system for a considerable period of time which may have a serious impact on business continuity.
It's all explained in this video.
Recovery Point Objective determines how much data the business can afford to lose after an outage. So, does the data have to be right up to date, or can the business tolerate losing some of it? It factors in that the database transactions that have occurred since the most recent backup will also need to be restored if the RPO is quite a short one, and this may impact the Mean Time to Recovery.
This too is explained in this video.
For customers wanting to migrate from an alternative platform to Netezza, or vice versa, we have a solution called Smart Data Frameworks which will not only automate the migration but can also keep trickle-feeding changes from the Netezza database to the new target database until you are ready to switch over.
So, in the same way that DR as Service can be used to do a parallel run of an old and new Netezza platform, SDF will allow you to do a parallel run between an old Netezza and a new non-Netezza system, as long as you have primary keys declared in your tables. Accordingly, this is a form of replication but it is not designed to be permanent, but rather to assist with the migration process.
Containerization is a major trend in software development that provides an alternative to the use of virtual machines. Containerization is the packaging up of software code and all its related configuration files, libraries and dependencies so that it can run uniformly and consistently on any infrastructure. So, whereas with traditional programming methods there would be considerable effort, both in terms of coding and testing, to transfer code between two computing environments, this is not the case with containerized applications. The “container” is abstracted away from the host operating system, and hence it stands alone and becomes portable and is able to run across any platform or cloud.
The technology is quickly maturing, resulting in measurable benefits for developers and operations teams as well as overall software infrastructure. Developers are able to create and deploy applications faster and more securely without the risk of bugs and errors being introduced when code is transferred between, for example, a desktop computer to a virtual machine or from a Linux to a Windows operating system.
The concept of containerization is decades old, but it wasn't until the open source Docker Engine emerged in 2013 that its adoption accelerated. Docker has become the industry standard for containers with simple developer tools and a universal packaging approach. IBM has its own container format, Podman, which is similar to Docker but is native to OpenShift, which IBM acquired when it bought Red Hat in 2019.
Containers are frequently referred to as “lightweight,” as they share the machine’s operating system kernel which removes the need to associate an operating system within each application. Containers are inherently smaller in capacity than a virtual machine and require less start-up time. The compute capacity available can therefore run more containers than VMs. This results in higher server efficiencies and, in turn, reduces server and licensing costs.
Containerization allows applications to be "baked in" and to run anywhere. This portability is important in terms of the development process and vendor compatibility. It also offers other notable benefits, like fault isolation, ease of management and security.
Kubernetes, also known as K8S, is an open source container orchestration platform that automates the deployment, management, and scaling of containerized applications.
The name Kubernetes originates from Greek, meaning helmsman or pilot, which aptly describes the function it performs. Kubernetes was created and first released by Google in 2015, to manage the massive number of clusters that formed the basis of the Google search engine. Shortly thereafter, Google donated Kubernetes to the Cloud Native Computing Foundation (CNCF) that it had set up with the Linux Foundation to promote container technology.
Kubernetes' role in the containerized Netezza environment is to continuously monitor all the data nodes and host nodes, checking their health and restarting any that have failed. Kubernetes groups containers into logical clusters for easy management and discovery. It sits on top of all the physical resources and decides which CPU will run any given process. Without Kubernetes you would have to run your Netezza host on a dedicated processor which, if it were to fail, would require replacing and would result in downtime. As such, Kubernetes provides a much higher level of fault tolerance and reliability.
K8S can operate both in cloud and on-premise environments, including hybrid clouds. Its key features include:
Red Hat® OpenShift® is a commercialized software product derived from the Kubernetes open source project. OpenShift and Kubernetes are both container orchestration software, but Red Hat OpenShift contains additional features not available from the Kubernetes open source project. Red Hat was one of the first companies to work with Google on Kubernetes, even prior to launch, and has become the 2nd leading contributor to the Kubernetes upstream project.
OpenShift has become the leading enterprise Kubernetes platform. It enables a cloud-like experience everywhere it's deployed. OpenShift gives you the ability to choose where you build, deploy, and run applications whether it’s in the cloud, on-premise or at the edge.
OpenShift and Kubernetes are two of the best known container orchestration platforms. They both enable you to easily deploy and manage containerized applications. Whereas Kubernetes is an integral element of OpenShift, the latter contains additional features that aren't available from the Kubernetes open source project. There are several differences between Kubernetes and OpenShift. The list below compares the main features of each:
Product vs. Project - OpenShift is a commercial product, whereas Kubernetes is an open-source project. To use OpenShift you would need to pay a license fee which would generally be incorporated into any product you use that is built on OpenShift. For example, IBM Cloud Pak for Data provides a restricted use license of OpenShift in support of the services it provides. Kubernetes offers a self-support model that has an extensive community for supporting and growing the open source project. Whether you would use a commercial Kubernetes product or work on your own Kubernetes project depends entirely on the user.
Security - OpenShift has enhanced the security features of Kubernetes. It offers a much improved user experience for setting up and configuring Kubernetes authentications with an integrated server.
Web-UI - The OpenShift web-based User Interface (UI) is much easier to use than the basic one provided by Kubernetes.
Deployment Approach - OpenShift deployment of objects is much more automated, and has improved version control, than the basic Kubernetes deployment approach.
Continuous Integration/Continuous Deployment (CI/CD) - Both OpenShift and Kubernetes allow the construction CI/CD pipelines, but in both you need to deploy additional tools such as automated testing and monitoring, and CI servers, to build a full CI/CD pipeline. OpenShift makes this task easier because it offers a certified Jenkins container that you can use for the CI server, whereas you would need to deploy a third party tool to build a CI/CD pipeline with plain Kubernetes.
Integrated Image Registry - In plain Kubernetes you can set up your own Docker registry, but it doesn't come with an integrated image registry. OpenShift provides an integrated image registry that you can use with Red Hat or Docker Hub. The image registry has a console where you can search for information about images and image streams to projects in a cluster.
Installation - Kubernetes installation is complex and often requires a third-party solution. With OpenShift, however, Kubernetes is built-in so installation is more straight-forward, but it is limited to Red Hat Linux distributions.
Updates - The way updates are handled varies between platforms, although existing clusters can be upgraded instead of being rebuilt from scratch on both platforms. In OpenShift, upgrades are managed via the Red Hat Enterprise Linux package management system. The process is manual on plain Kubernetes using the kubeadm upgrade command.
Hyperconverged infrastructure (HCI) uses virtualization software to combine all the elements of a traditional data center - storage, networking, compute and management - into a distributed infrastructure platform. The virtualization software abstracts and pools all the underlying resources and allocates them dynamically to applications running in virtual machines or containers.
© Smart Associates 2023
Smart Management Frameworks
Smart Data Frameworks
IBM Netezza Performance Server
Yellowbrick Data Warehouse
PDA to CP4D Estimating Tool
Netezza After Life Support
SmartSafe Database Replication as a Service
SmartSecure Active Directory Synchronization
SmartSure Housekeeping Utility for Netezza
Netezza Support & Housekeeping
Smart Health Check Service
Free Smart Health Check Report
Data Warehouse Migration
Data Warehouse FAQs