Data Warehousing FAQs


General Data Warehousing FAQs

A data warehouse is a computer system used for reporting and data analysis. Traditionally, organizations have used data warehouses to store and integrate current and historical data collected from various sources including their sales, marketing, production, and finance systems.

Prior to the arrival of data warehousing, businesses collected data in well-defined, highly structured forms at a reasonably predictable rate and volume. Because on-premises computing power and storage were scarce and difficult to expand, data access and usage were carefully controlled and limited to ensure acceptable performance for every user, even as the speed of older technologies advanced. This required organizations to tolerate very long analytics cycles.

The concept of data warehousing dates back to the late 1980s. At that time, reporting was typically done from a number of independent operational systems. The data warehousing concept provided an architectural model for the flow of data from operational systems to decision support environments. In larger corporations, it was typical for multiple decision support environments to operate independently even though their users often required much of the same stored data. The process of gathering, cleaning and integrating data from these disparate sources was in part replicated for each environment.

Dedicated data warehousing systems evolved to reduce the strain on legacy systems during a sustained period of transaction growth and an increasing thirst for insights from data. All that data was duplicated in a data warehouse for analysis, leaving the operational databases free to focus on transactions.

Over the years, data sources expanded beyond internal business operations and external transactions. Data volumes have grown exponentially, as have the variety and speed of acquiring data from websites, mobile phones and apps, online games, online banking, and online shopping. Furthermore, organizations are now capturing huge amounts of data from Internet of Things (IoT) devices.


There are many types of data warehouses but the three most common are:

  1. Enterprise data warehouse (EDW) - an EDW is a database, or collection of databases, that centralizes a business's information acquired from multiple and diverse sources and applications. It makes the data available for analytics and use across the organization. EDWs can be housed in an on-premise server or in the cloud.

  2. Operational Data Store (ODS) - an ODS is similar to an EDW in terms of scope, but data is refreshed in near real time and can be used for operational reporting.

  3. Data Mart - This is a subset of data stored in a data warehouse and is used to support a specific region, business unit or functional area (e.g. Sales).

A data warehouse has become an essential enterprise tool for a variety of reasons:

  • It is a single point of access for all the company's data, rather than having users connect to many different systems individually
  • It enhances decision making by giving access to all of the company's data rather than selected subsets
  • Because the data is centralised, it is available in a more timely fashion.
  • It promotes enhanced data quality and consistency. As most, if not all, systems feed the EDW, it gives focus to strong data quality policies throughout the organisation.
  • It facilitates historical intelligence by storing large amounts of historical data
  • Performance gains - most data warehouses are modeled, built, and optimized for read access and subsequent fast report generation.
  • It facilitates data mining through the analysis of “Big Data” to predict future trends.
  • It provides a separation between day-to-day operational systems and analytical systems, thus enhancing security
  • It provides a standard set of semantics around data, such as consistency in naming conventions, codes for different types of products, languages, and currencies, and so on.

There are four stages or layers that will always be present in a Data Warehouse architecture.

DW Architecture

The data sourcing layer acquires raw data from a variety of data sources. These are typically operational systems, but in recent times companies are increasingly analysing data from a wide range of new sources made possible by the explosion of online apps, mobile data and the Internet of Things.

In the data staging layer, the data is stored in a staging area, or landing zone, and goes through an Extract, Transform and Load process. The data is cleansed and transformed into a format that is suitable for querying and analysis.
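
To make the ETL idea concrete, here is a minimal sketch of the staging flow in Python, using in-memory SQLite databases and hypothetical table and column names chosen purely for illustration; real pipelines would normally rely on dedicated ETL tooling.

    # Minimal ETL sketch. The databases, tables and columns are hypothetical
    # stand-ins so the example is self-contained and runnable.
    import sqlite3

    source = sqlite3.connect(":memory:")     # stands in for an operational system
    warehouse = sqlite3.connect(":memory:")  # stands in for the staging area

    source.execute("CREATE TABLE sales_orders (order_id INT, amount REAL, order_date TEXT)")
    source.executemany("INSERT INTO sales_orders VALUES (?, ?, ?)",
                       [(1, 120.501, "2024-01-05"), (2, None, "2024-01-06"), (3, 75.0, "2024-02-07")])

    # Extract: pull the raw rows from the operational source.
    rows = source.execute("SELECT order_id, amount, order_date FROM sales_orders").fetchall()

    # Transform: drop records with missing amounts and round to two decimal places.
    cleansed = [(oid, round(amt, 2), dt) for oid, amt, dt in rows if amt is not None]

    # Load: write the cleansed rows into a staging table ready for querying and analysis.
    warehouse.execute("CREATE TABLE stg_sales_orders (order_id INT, amount REAL, order_date TEXT)")
    warehouse.executemany("INSERT INTO stg_sales_orders VALUES (?, ?, ?)", cleansed)
    warehouse.commit()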

In the data storage layer, depending upon the approach of the data warehousing architecture, the data will be stored in the data warehouse as well as in data marts. Data marts are repositories of summarized data collected for analysis on a specific section or unit within an organization, for example, the sales department. Some also include an operational data store (ODS), which is a central database that provides a snapshot of the latest data from multiple transactional systems for operational reporting. A recent trend in data warehousing is to move storage to the cloud, or to adopt a hybrid cloud/on-premises solution.

The presentation layer is the layer with which users interact to fulfill their business intelligence requirements. Queries are run at this layer to perform analysis using popular reporting tools like Tableau and QlikView. This is also the interface between the data warehouse and any data-driven apps.

Business intelligence (BI) assists organisations in making data-driven decisions. Nowadays, BI is a combination of things including business analytics, data mining, data visualization, data tools and best practices working together to give your organization a comprehensive view of your data. BI is a driver of corporate change, identifying where efficiencies can be made and allowing the business to quickly adapt to changing market or supply conditions.

Interestingly, the term Business Intelligence was first coined in 1865 by an American, Richard Millar Devens, in his publication Cyclopædia of Commercial and Business Anecdotes. BI as we know it really only kicked off when Edgar Codd pioneered the development of the relational database but it wasn't until the 1980s that a variety of tools emerged to assess and organise data in simpler ways.

Some activities that fall under the Business Intelligence umbrella include:

  • Reporting: Sharing data analysis in a variety of formats so that stakeholders can draw conclusions and make decisions
  • Data mining: The practice of analysing large databases in order to generate new information
  • Data visualisation: Turning data analysis into charts, graphs, histograms and other visual representations to more easily understand the data
  • Benchmarking: Drawing comparisons between current performance data and historical data
  • Predictive analytics: The use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data
  • Data querying: Using Structured Query Language (SQL) and other methods for asking specific questions and pulling specific answers from datasets (see the sketch after this list)
  • Data preparation: Compiling data from multiple sources so that it can be used in data analysis
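
As a small illustration of data querying, the sketch below runs an aggregate SQL query against the hypothetical staging table from the earlier ETL example; the table and data are recreated here so the snippet stands on its own.

    # Illustrative analytical query against a hypothetical staging table.
    import sqlite3

    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE stg_sales_orders (order_id INT, amount REAL, order_date TEXT)")
    warehouse.executemany("INSERT INTO stg_sales_orders VALUES (?, ?, ?)",
                          [(1, 120.50, "2024-01-05"), (3, 75.00, "2024-02-07")])

    # A typical BI-style question: what were the monthly sales totals?
    monthly_totals = warehouse.execute(
        "SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS total_sales "
        "FROM stg_sales_orders GROUP BY month ORDER BY month"
    ).fetchall()

    for month, total in monthly_totals:
        print(month, total)   # e.g. 2024-01 120.5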

A cloud data warehouse is one that is delivered in a public cloud as a managed service. Broadly speaking, there are three models:

  • Software-as-a-Service (SaaS) - Where the cloud provider manages everything
  • Platform-as-a-Service (PaaS) - Where the cloud provider manages the virtual hardware layer and the virtual software layer except for applications that are built by subscribers
  • Infrastructure-as-a-Service (IaaS) - Where the infrastructure is provided but the subscriber is responsible for managing the virtual operating systems, databases and middleware technologies

Most data warehouse providers offer at a minimum a PaaS equivalent to their on-premises data warehouse offering, but the market for full SaaS solutions is growing rapidly.

In the recent past most major data warehouse technology vendors were promoting on-premises solutions over cloud, mainly because of the volumes of data involved. Nowadays most vendors are shifting their focus and we see a strong uptake in cloud data warehousing solutions, particularly hybrid ones. It may seem that cloud computing will soon entirely replace the idea of on-premises infrastructure.

There are a number of factors that have led to the move to the cloud. These include:

  • Scalability - Previously, IT teams would estimate their compute power and storage needs years in advance, gauging each department's needs with little more accuracy than a crystal ball. As such it was commonplace to over- or under-estimate: too much and the company would be paying for storage it didn't need; too little and the business was hamstrung. With cloud data warehouses, this appraisal is unnecessary. Cloud data warehouse providers typically offer a “pay-as-you-go” model so capacity can be increased or decreased on demand.
  • Improved speed and performance - Modern businesses connect multiple data sources to their data warehouse, including CRM, ERP, social media and marketing systems. Cloud data warehouses can integrate more easily with new data sources, so the time to production is much faster than with on-premises systems. Query performance is also often superior, as cloud platforms can leverage multiple servers that balance the data load and increase processing speeds across multiple geographies.
  • Cost savings - Whereas running a data warehouse in the cloud was previously thought of as an expensive alternative, the costs of running a cloud data warehouse nowadays are significantly less than they were. With large Data Warehouse as a Service (DWaaS) providers such as Amazon Redshift, Google BigQuery, Microsoft Azure Synapse and Snowflake, the infrastructure costs tend to be lower, and the cost of mass storage in the cloud continues to fall. On-premises data warehouses, however, require expensive hardware and technical resources to perform upgrades, ongoing maintenance, and outage management. The capacity planning team needs to ensure that sufficient computing power is available to accommodate increased demand during busy periods, and this excess capacity must be paid for even when demand drops off. Cloud data warehouses, on the other hand, allow data warehouse teams to increase or decrease the available compute power and storage as needed.
  • Improved self-service capabilities for business users - Self-service analytics is difficult to deliver with a traditional on-premises data warehouse because it is difficult to predict user demand and to ensure that the queries users run don't affect processing power or hit storage limits. Setting up virtual machines and reserving physical storage for individual users is time consuming, and even then business users must understand the data and be proficient in the query language specific to that system to run queries. Cloud data warehouses, however, deliver higher performance even as data volume and the number of concurrent users increase. All users can query data from disparate sources without affecting each other, even choosing their own BI or analytical tool - something that is very difficult to achieve with traditional on-premises solutions.
  • Security - Previously, on-premises data warehouses were considered more secure than cloud data warehouses, but this too has changed. On-premises data warehouses employ teams who oversee database security which is costly and prone to human error. Cloud providers have invested billions in building controls for safeguarding data such as cloud encryption services and multi-factor authentication and so businesses are now realising that cloud data is actually more secure and less costly to administer.
  • High availability - Cloud data warehouse providers guarantee a certain percentage of uptime, usually greater than 99%. They achieve this with a high level of built-in redundancy, sophisticated monitoring, and mechanisms that can switch automatically between active and redundant components if a failure is detected. Whilst modern on-premises data warehouse systems are very reliable, the processes that are in place to detect problems and to recover the systems when they fail require human intervention, and so by nature are inferior to those that are built into cloud technologies.
  • Improved disaster recovery - Disaster recovery for an on-premises data warehouse is costly and often fails to perform when required, such as not being able to withstand a natural disaster or not having been properly tested for system recovery time. Maintaining a DR solution includes providing a secondary data center containing a duplicate set of hardware complete with software licences and a full set of data that has to be continuously replicated. It's a big overhead that will often never even be used. A cloud disaster recovery solution, however, is a completely different proposition. No physical assets are required and nearly all cloud data warehouses support asynchronous duplication of data, and perform consistent snapshots and backups automatically. If a disaster occurs, users can log into a cloud-based application from any location and get access to the replicated data.

Hybrid cloud computing combines an on-premises solution, typically a private cloud, with one or more public cloud services, with proprietary software enabling communication between each distinct service. Hybrid cloud services are powerful because they combine the best of both worlds: the integration and optimization of both on-premises and cloud solutions to handle the workloads and processes for which each is best suited.

Its popularity stems from big advances in cloud computing technology and huge growth in cloud storage capacity. Cloud-based solutions are now promoted as a convenient choice for businesses, and, as noted above, vendors who once favoured on-premises solutions are seeing a strong uptake in cloud data warehousing, particularly hybrid solutions.

Hybrid Cloud Computing

Data Migration FAQs

Data migration is the process of permanently transferring your data from one computer storage system to another. Activities in the data migration project include selecting, preparing, extracting, and transforming the data on the source system; the physical transfer to the target system; and validation of the migrated data for completeness.
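
As a simple illustration of that validation step, a migration team might compare row counts and a lightweight aggregate between source and target after the transfer. The sketch below uses hypothetical in-memory databases and a made-up customers table purely to show the idea.

    # Illustrative completeness checks after a migration. Databases and table
    # names are hypothetical; real migrations often add per-column checksums.
    import sqlite3

    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for db in (source, target):
        db.execute("CREATE TABLE customers (customer_id INT, name TEXT)")
        db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

    # Row counts must match between source and target.
    src_count = source.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    tgt_count = target.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    assert src_count == tgt_count, f"Row count mismatch: {src_count} vs {tgt_count}"

    # A cheap content check: compare an aggregate over a key column.
    src_sum = source.execute("SELECT SUM(customer_id) FROM customers").fetchone()[0]
    tgt_sum = target.execute("SELECT SUM(customer_id) FROM customers").fetchone()[0]
    assert src_sum == tgt_sum, "Aggregate mismatch between source and target"

    print("Validation passed:", src_count, "rows verified")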

Data migration is an important stage in any system implementation, upgrade, or consolidation. Given the volumes of data involved, especially with data warehouse migrations, they are automated as far as possible, freeing staff from mundane tasks. Nowadays, data warehouse migrations frequently occur as businesses move from on-premises infrastructure and applications to cloud-based storage and applications to optimise or transform their company.

According to a 2019 Gartner report, more than 50% of data migration projects will exceed their budget and timeline and/or harm the business, due to flawed strategy and execution.

Your data is extremely important, and if it is not migrated properly, it can cause a devastating loss of data and disrupt your business. As such you need a data migration plan to help ensure success for your data migration.

The selected data migration strategy should take into consideration the needs of the business/end users, the type/volume of data to be migrated, and the company’s resources. It should also be sponsored at board-level to ensure that adequate funding and resources are provided.

There are three broad strategies that can be adopted:

A big bang approach refers to all the data being transferred from its source to a target system in a single operation, at a single point in time. This is the highest-risk strategy and it requires a lot of planning and careful execution to ensure success. A downside of this approach is that some downtime will be needed to complete the migration. Business units that require real-time access to their data 24x7 may not find this acceptable.

A trickle migration divides the data into smaller chunks and is a more in-depth method of database migration. If one of these sub-migrations fails, it can usually be rolled back and re-run without interfering with the rest of the migrations. A downside of this approach is that it is time-consuming and requires more resources, as the old system and new system must run simultaneously.

A zero-downtime migration replicates data from the original database to the target database and allows the use of the source database during the migration process. This lessens the disruption to the business and facilitates a faster migration time. Our product SmartSafe Database Replication as a Service facilitates this approach.
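
To make the trickle idea concrete, here is a minimal sketch that copies a hypothetical table in fixed-size batches, committing after each batch so that a failed batch can be re-run without repeating the whole migration. It illustrates the general pattern only, not any particular migration tool.

    # Illustrative chunked ("trickle") copy between two hypothetical databases.
    import sqlite3

    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (order_id INT, amount REAL)")
    source.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i * 1.5) for i in range(1, 1001)])
    target.execute("CREATE TABLE orders (order_id INT, amount REAL)")

    BATCH_SIZE = 250
    offset = 0
    while True:
        # Each batch is a small, independently committed "sub-migration".
        batch = source.execute(
            "SELECT order_id, amount FROM orders ORDER BY order_id LIMIT ? OFFSET ?",
            (BATCH_SIZE, offset),
        ).fetchall()
        if not batch:
            break
        target.executemany("INSERT INTO orders VALUES (?, ?)", batch)
        target.commit()  # only later batches need re-running if a failure occurs here
        offset += BATCH_SIZE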

Here are some examples of types of data migration:

Storage data migration. This will occur when a business chooses to rationalise its data to take advantage of more efficient storage technologies, often using virtualisation techniques. It involves transferring data between storage subsystems or server systems. The content and format of the data itself will not usually be changed in the process, and it can usually be achieved with minimal or no impact to the layers above. There can be several reasons to do a storage data migration, such as:

  • Undertaking a storage hardware refresh to take advantage of new features
  • Resolving storage performance or capacity issues
  • Relocating a data centre

Database migration. This is the process of migrating data from one or more source databases to one or more target databases, usually by using a database migration service or tool. It is often required when moving from one database technology to another, or to upgrade the version of the database software. In the latter case it is less likely to require a physical data migration, but this can happen with major upgrades such as to the underlying technology. In these cases a physical transformation process may also be required as the underlying data format may have also changed.

When a database migration is finished, the dataset in the source databases resides fully in the target databases. Clients that accessed the source databases are then switched over to the target databases, and the source databases are decommissioned.

Cloud migration. This is the process of moving data, workloads, IT resources, or applications to cloud infrastructure. It often refers to moving tools and data from old, legacy infrastructure or an on-premises data centre to the cloud, but it may also refer to the movement of data between cloud providers. In many cases, it also entails a storage migration.

Application migration. This is the process of moving software applications between different computing environments. The reasons for doing so include:

  • When a business changes an application vendor, such as acquiring a new CRM or ERP
  • When migrating applications from on-premises infrastructure to third-party cloud infrastructures, such as IBM Cloud, Amazon Web Services (AWS), or Microsoft Azure.
  • When moving web applications from the private cloud to the public cloud

When a new application is acquired, the likelihood is that the underlying data model will be quite different, and as such the migration will involve substantial transformation of the data.

Smart Database Replication FAQs

Netezza Replication Services is a feature of Netezza that allows Netezza databases to be replicated between a master and one or more secondary nodes. Smart Database Replication, on the other hand, is a managed service provided by Smart Associates that allows multi-master Netezza database replication with a number of enhanced features.

Here is a comparison between the capabilities of each:

Feature                                                    | Smart Database Replication | Netezza Replication Services
Appliance Cross-Type (CP4D/NPS) Support                    | Yes                        | No
Multi-Master Support                                       | Yes                        | No
Multi-Secondary Support                                    | Yes                        | Yes
Same Database in Different Replication Sets                | Yes                        | No
Partial Database Restore (Selected Tables)                 | Yes                        | No
Partial Table Data Restore (via customisable data filter)  | Yes                        | No
Cross Database View Fixing                                 | Yes                        | No
Out-of-the-Box Metadata Replication Setup                  | Yes                        | Yes
Own Replication Metadata Database                          | Yes                        | Yes
Data Auto-Healing Mechanisms                               | Yes                        | No
Built-In Integrity Checks                                  | Yes                        | Yes
Replicating Databases With Different Name                  | Yes                        | No
Configurable Data Resynchronisation Frequency              | Yes                        | No

When an enterprise maintains a replicated Netezza architecture, the replication that occurs between the primary and secondary systems is typically uni-directional, i.e. the primary system is fed by the source systems, and the secondary systems are mirror copies of the primary system. If the primary system fails, a secondary system assumes the role of primary, DNS changes are made to redirect the feeds and the user connections, and business carries on as usual until the other system is fixed and roles are reversed again.


Traditional Netezza Database Replication

Smart Database Replication is bi-directional (BDR) and allows all Netezza nodes to be primary systems.

An example of this is where you have two or more systems and each of them can be the primary system for a set of users, one or more applications and one or more source systems, and you can replicate the data between them all. This gives you a natural way of dividing your users across your infrastructure whilst still being able to have replication across all those different systems. Moreover, in a multi-master scenario, each master also potentially acts as a Disaster Recovery site for one or more of the other masters.


Multi-master bi-directional replication (BDR) as provided by Smart Database Replication

If you are thinking of buying a new Netezza Performance Server for IBM Cloud Pak for Data system or of migrating to Netezza in the cloud, Smart Management Frameworks SmartSafe can be used to automate the migration between the old and the new environments.

Replicating the existing Netezza databases to the new environment allows the migration to be performed incrementally and in a highly controlled manner. It allows for a dual run of both old and new systems side by side which means that you can control when the cutover occurs between the old and the new.

For customers wanting to try out the cloud service, Smart Database Replication will allow you to do a dual-run whilst doing an evaluation as part of the migration process. So if you are not sure whether the cloud solution will cope with the volume of data, queries and users, and be better and more reliable than the on-premise option, it could potentially be used to replicate some or all of your existing data to that cloud option. It provides a safe way of testing the new Netezza platform without having to take a leap of faith, keeping existing systems in place until you are ready to do the transition and the switch-over, without it having to be a big-bang approach.

Many customers maintain a separate Netezza appliance as a development environment, and this system will typically have much less capacity than the production system. Maintaining a subset of data on dev can be quite a fiddly task, especially if you want the data to accurately reflect the production environment. With Smart Database Replication you can replicate a subset of your production data to your dev environment so that it contains real-world data that is accurate, up-to-date and small enough to fit on the dev environment. In exactly the same way you could create a separate test environment, and replicate a subset of production data to it.

So if, for example, you are constrained by disk space and so only want the last six months of transactional data on dev and test, Smart Database Replication can maintain that rolling window of six months of historical data for you. Alternatively, it can replicate a percentage of records, using a percentage algorithm to determine the data that gets replicated. This is a much better option than having to use a WHERE constraint or apply a filter to the source data yourself.
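
Conceptually, such a rolling window is just a date predicate that moves forward over time. The snippet below is purely illustrative of that idea, with a hypothetical transactions table and column; it is not the Smart Database Replication interface.

    # Illustrative only: the kind of filter a rolling six-month window implies.
    from datetime import date, timedelta

    window_start = date.today() - timedelta(days=183)  # roughly six months
    predicate = f"transaction_date >= '{window_start.isoformat()}'"

    # Rows matching the predicate stay in the replicated subset on dev/test;
    # older rows age out as the window moves forward.
    print("Replicate rows where:", predicate)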

Smart Database Replication works with all combinations of versions of Netezza systems - from the PureData for Analytics Nx00x range, to IBM’s latest generation Netezza on System or Netezza on Cloud offerings.

By choosing Smart Database Replication you will be able to replicate your old and new Netezza systems and avoid the problems that are invariably associated with upgrading to new technologies. Why is this? Because you can run your old and your new systems in parallel and choose the moment you cut-over with a few simple steps, obviating the need for extended system downtime and drawn-out post-migration user acceptance testing.

Mean Time to Recovery (MTTR) is, quite simply, “how long will it take to recover our system to a fully operational state after a failure?” It is sometimes also referred to as the Recovery Time Objective (RTO).

MTTR is important to the business as it is regarded as a measure of how fit for purpose your Netezza disaster recovery system is. It implies that a full disaster recovery test has been conducted and that this time to recovery has been achieved, rather than just being an objective. If the MTTR has not been tested, your business runs the risk of losing their Netezza system for a considerable period of time which may have a serious impact on business continuity.


Recovery Point Objective (RPO) determines how much data the business can afford to lose after an outage. So, does the data have to be right up to date, or can the business tolerate losing some of it? For example, if backups are taken every four hours, up to four hours of transactions could be lost, so a shorter RPO demands more frequent backups or continuous replication. A short RPO also means that the database transactions that have occurred since the most recent backup will need to be restored, and this may impact the Mean Time to Recovery.


For customers wanting to migrate from an alternative platform to Netezza, or vice versa, we have a solution called Smart Data Frameworks which will not only automate the migration but can also keep trickle-feeding changes from the Netezza database to the new target database until you are ready to switch over.

So, in the same way that Smart Database Replication can be used to do a parallel run of an old and new Netezza platform, SDF will allow you to do a parallel run between an old Netezza and a new non-Netezza system, as long as you have primary keys declared in your tables. In effect, this is a form of replication, but it is not designed to be permanent; rather, it is there to assist with the migration process.

SDF also now has a Database Replication feature for customers migrating from Netezza to Yellowbrick. This differs from the trickle-feeding described above in that the entire data migration process can be undertaken and then both systems kept synchronized using this new feature.

Netezza Performance Server FAQs

Containerization is a major trend in software development that provides an alternative to the use of virtual machines. Containerization is the packaging up of software code and all its related configuration files, libraries and dependencies so that it can run uniformly and consistently on any infrastructure. So, whereas with traditional programming methods there would be considerable effort, both in terms of coding and testing, to transfer code between two computing environments, this is not the case with containerized applications. The “container” is abstracted away from the host operating system, and hence it stands alone and becomes portable and is able to run across any platform or cloud.

The technology is quickly maturing, resulting in measurable benefits for developers and operations teams as well as overall software infrastructure. Developers are able to create and deploy applications faster and more securely, without the risk of bugs and errors being introduced when code is transferred from, for example, a desktop computer to a virtual machine or from a Linux to a Windows operating system.

The concept of containerization is decades old, but it wasn't until the open source Docker Engine emerged in 2013 that its adoption accelerated. Docker has become the industry standard for containers, with simple developer tools and a universal packaging approach. IBM also has its own container engine, Podman, which is similar to Docker and works with the same container images; it comes from Red Hat, which IBM acquired in 2019, and is native to the OpenShift ecosystem.

Containers are frequently referred to as “lightweight,” as they share the machine’s operating system kernel, which removes the need to bundle a full operating system with each application. Containers are inherently smaller than virtual machines and require less start-up time, so a given amount of compute capacity can run more containers than VMs. This results in higher server efficiencies and, in turn, reduces server and licensing costs.

Containerization allows applications to be "baked in" and to run anywhere. This portability is important in terms of the development process and vendor compatibility. It also offers other notable benefits, like fault isolation, ease of management and security.

Docker v Container

Kubernetes, also known as K8S, is an open source container orchestration platform that automates the deployment, management, and scaling of containerized applications.

The name Kubernetes originates from Greek, meaning helmsman or pilot, which aptly describes the function it performs. Kubernetes was created at Google, drawing on its experience running containerized workloads at massive scale, and was released as an open source project in 2014, with version 1.0 following in 2015. Shortly thereafter, Google donated Kubernetes to the Cloud Native Computing Foundation (CNCF), which it had set up with the Linux Foundation to promote container technology.

Kubernetes' role in the containerized Netezza environment is to continuously monitor all the data nodes and host nodes, checking their health and restarting any that have failed. Kubernetes groups containers into logical clusters for easy management and discovery. It sits on top of all the physical resources and decides which CPU will run any given process. Without Kubernetes you would have to run your Netezza host on a dedicated processor which, if it were to fail, would require replacing and would result in downtime. As such, Kubernetes provides a much higher level of fault tolerance and reliability.
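
As an aside, the same kind of health information is visible from outside the cluster. Below is a minimal sketch using the official Kubernetes Python client (the assumption here is that the kubernetes package is installed and a valid kubeconfig is available); it simply lists pods and their phases and does not reflect any Netezza-specific tooling.

    # Minimal sketch: list pods and their current phase with the official
    # Kubernetes Python client. Assumes `pip install kubernetes` and a valid
    # kubeconfig; purely illustrative, not Netezza-specific.
    from kubernetes import client, config

    config.load_kube_config()   # read cluster credentials from ~/.kube/config
    v1 = client.CoreV1Api()

    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)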

K8S can operate both in cloud and on-premise environments, including hybrid clouds. Its key features include:

  • Load Balancing and Service Discovery: Automatically assigns DNS names and IP addresses to pods and load-balances traffic across them.
  • Automatic Bin Packing: Ensures the availability and optimization of resources by placing containers based on their resource requirements.
  • Self-Recovery: Restarts failed containers, replaces containers after node failures, and removes containers that fail health checks.
  • Rollout and Rollback Automation: Rolls out changes to applications or their configuration progressively, and rolls back to previous versions when issues occur.
  • Batch Execution and Scaling: Enables you to scale applications manually or automatically and manages batches and Continuous Integration (CI) workloads.

Red Hat® OpenShift® is a commercialized software product derived from the Kubernetes open source project. OpenShift and Kubernetes are both container orchestration software, but Red Hat OpenShift contains additional features not available from the Kubernetes open source project. Red Hat was one of the first companies to work with Google on Kubernetes, even prior to launch, and has become the 2nd leading contributor to the Kubernetes upstream project.

OpenShift has become the leading enterprise Kubernetes platform. It enables a cloud-like experience everywhere it's deployed. OpenShift gives you the ability to choose where you build, deploy, and run applications whether it’s in the cloud, on-premise or at the edge.

OpenShift and Kubernetes are two of the best known container orchestration platforms. They both enable you to easily deploy and manage containerized applications. Whereas Kubernetes is an integral element of OpenShift, the latter contains additional features that aren't available from the Kubernetes open source project. There are several differences between Kubernetes and OpenShift. The list below compares the main features of each:

Product vs. Project - OpenShift is a commercial product, whereas Kubernetes is an open-source project. To use OpenShift you would need to pay a license fee which would generally be incorporated into any product you use that is built on OpenShift. For example, IBM Cloud Pak for Data provides a restricted use license of OpenShift in support of the services it provides. Kubernetes offers a self-support model that has an extensive community for supporting and growing the open source project. Whether you would use a commercial Kubernetes product or work on your own Kubernetes project depends entirely on the user.

Security - OpenShift has enhanced the security features of Kubernetes. It offers a much improved user experience for setting up and configuring Kubernetes authentications with an integrated server.

Web-UI - The OpenShift web-based User Interface (UI) is much easier to use than the basic one provided by Kubernetes.

Deployment Approach - OpenShift's deployment of objects is more automated, and offers better version control, than the basic Kubernetes deployment approach.

Continuous Integration/Continuous Deployment (CI/CD) - Both OpenShift and Kubernetes allow the construction of CI/CD pipelines, but in both cases you need to deploy additional tools, such as automated testing and monitoring and CI servers, to build a full CI/CD pipeline. OpenShift makes this task easier because it offers a certified Jenkins container that you can use for the CI server, whereas you would need to deploy a third-party tool to build a CI/CD pipeline with plain Kubernetes.

Integrated Image Registry - In plain Kubernetes you can set up your own Docker registry, but it doesn't come with an integrated image registry. OpenShift provides an integrated image registry that you can use with Red Hat or Docker Hub. The image registry has a console where you can search for information about the images and image streams in a cluster's projects.

Installation - Kubernetes installation is complex and often requires a third-party solution. With OpenShift, however, Kubernetes is built in, so installation is more straightforward, but it is limited to Red Hat Linux distributions.

Updates - The way updates are handled varies between platforms, although existing clusters can be upgraded instead of being rebuilt from scratch on both platforms. In OpenShift, upgrades are managed via the Red Hat Enterprise Linux package management system. The process is manual on plain Kubernetes using the kubeadm upgrade command.

Hyperconverged infrastructure (HCI) uses virtualization software to combine all the elements of a traditional data center - storage, networking, compute and management - into a distributed infrastructure platform. The virtualization software abstracts and pools all the underlying resources and allocates them dynamically to applications running in virtual machines or containers.