Data Warehouse Performance Matters
In this fourth and final episode of the series of blogs accompanying our joint webinars with Yellowbrick about data warehouse modernization, we are going to concentrate on why performance matters in a data warehouse context.
How does performance matter?
From a user perspective, we are all used to snappy, interactive applications on our phones and other devices. Having to go out for coffee whilst your query runs, or waiting for a dashboard to load, may have been acceptable before, but it isn't now. Users are more data-centric and are doing deeper and deeper analytics. A query that runs for 30 minutes is not the way people typically work; it breaks their train of thought.
From a cost management perspective, performance is crucial. If you need high throughput but your queries perform poorly and your system is inefficient, you will need more infrastructure and hardware to deliver consistent performance regardless of workload.
From a productivity perspective, being able to work more quickly saves time and money on everything from development to administration, to the time that end users physically spend doing their jobs. Better performance means they can do more with less. These days, the people accessing data aren't always specialist SQL experts. They rely on a BI or no-code tool to query data, and these tools sometimes generate arbitrary SQL that is syntactically correct but far from optimal. In the past we might have spent days optimizing a query before running it; that is not the reality today. Your data warehouse platform must run the query it is presented with, and being able to execute that query efficiently regardless is crucial.
Performing with mixed workloads
Let’s look now at the different classes of activity, or types of workloads, you may want to execute on your data warehouse at different times of the day, month, or year. What is it that Yellowbrick can do well that, say, Netezza wasn’t so good at? A lot of customers bought Netezza in the first place because of its speed at batch processing. It was not designed for individual operations on single records at a time: that type of workload couldn’t be parallelized, so customers quickly learned to avoid it.
Yellowbrick, on the other hand, was built from the ground up as a hybrid row and column store. The column store is good for analytics over batch-loaded data, and the row store deals with any sort of fast-moving data. As an end user you don't see a difference: it doesn't matter how you load data; the data can be queried regardless of whether it goes first into the row store or straight into the column store. Yellowbrick gives you both in one place, so you don't have to manage multiple systems for those two different velocities of data.
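To make the row-versus-column trade-off concrete, here is a minimal conceptual sketch in Python. This is not Yellowbrick's implementation, just an illustration of why each layout suits a different workload: the table contents and column names are invented for the example.

```python
# Conceptual sketch (not Yellowbrick's implementation): the same table held
# in a row-oriented and a column-oriented layout.

# Row store: each record is stored together, so inserting or
# fetching a single record is cheap -- good for fast-moving data.
rows = [
    {"id": 1, "region": "EU", "amount": 120.0},
    {"id": 2, "region": "US", "amount": 75.5},
    {"id": 3, "region": "EU", "amount": 310.25},
]

# Column store: each column is stored contiguously, so an analytic
# aggregate over one column touches only that column's data.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [120.0, 75.5, 310.25],
}

# Single-record insert: natural in the row layout.
rows.append({"id": 4, "region": "APAC", "amount": 42.0})

# Analytic scan: natural in the column layout -- sum one column
# without reading the others.
total = sum(columns["amount"])
print(total)  # 505.75
```

A hybrid store, as described above, keeps both layouts behind one query interface, so the user never has to choose.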
Yellowbrick can deal with hundreds of users running queries against a single database at a time, which was previously unheard of in the analytics space; most other data warehouse platforms max out at 8 or 10 concurrent queries. Furthermore, Yellowbrick can also mix different types of workloads, which is one of its key strengths: it ensures that small, short-running queries don't get stuck behind long-running data ingestion or transformation tasks. That means you can run BI and reporting analytics confidently whilst data loading and data transformation are happening simultaneously. You can also connect your applications directly to the data warehouse to serve up data through web experiences or mobile applications to thousands of concurrently active users.
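The benefit of keeping short queries out of the queue behind long-running work can be shown with a toy scheduling sketch. This is not Yellowbrick's actual workload manager; the query names and durations below are invented purely to illustrate the idea of separate pools.

```python
# Illustrative sketch: why giving short queries their own pool keeps
# them from queuing behind long-running ingestion and transformation.
from collections import deque

def fifo_wait(queries, target):
    """Seconds until `target` completes if everything shares one FIFO queue."""
    elapsed = 0
    for name, cost in queries:
        elapsed += cost
        if name == target:
            return elapsed

def pooled_wait(queries, target, is_short):
    """Seconds until `target` completes if short queries get their own pool."""
    shorts = deque(q for q in queries if is_short(q))
    elapsed = 0
    for name, cost in shorts:
        elapsed += cost
        if name == target:
            return elapsed

workload = [
    ("nightly_load", 600),    # long-running ingestion (seconds)
    ("transform_job", 300),   # long-running transformation
    ("dashboard_query", 2),   # short BI query
]

print(fifo_wait(workload, "dashboard_query"))    # 902: stuck behind batch work
print(pooled_wait(workload, "dashboard_query",
                  lambda q: q[1] < 10))          # 2: runs in its own pool
```

In the single-queue case the two-second dashboard query waits 15 minutes; with a separate pool for short queries it completes immediately, which is the behaviour described above.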
How does Yellowbrick achieve that kind of performance?
Netezza’s game-changing performance came from its specialist dedicated hardware, the FPGA (Field Programmable Gate Array) processors, which were excellent at running parts of your query in hardware.
To be game-changing today, Yellowbrick has had to do something completely different, given that it is designed to run in the Cloud as well as on-premises. The founders of Yellowbrick came from a hardware background, and they saw that storage technologies like NVMe (Non-Volatile Memory Express) drives, and CPU processing capabilities, were doubling or tripling in performance year on year, while data warehouse database performance was only improving by 15-20%. So there was a growing gap between the capabilities of the underlying hardware (storage, CPU, memory and networking) and the ability of ageing database technologies to take advantage of it. Even in the Cloud today, most vendors are only tweaking software or query planning rather than improving their underlying database technologies.
Yellowbrick has looked at the entire data path, from storage through to the CPU, and has developed the directed accelerator, a patented technology that moves data more efficiently through the whole stack. It uses less hardware and makes better use of CPUs and memory. In fact, it uses far less memory than conventional database platforms, and the resulting cost savings are passed on to the customer.
Running different types of workloads simultaneously is made possible by using multiple independent compute resources, or clusters. You can, for example, have a ten-node cluster belonging to finance, a five-node cluster for HR and a three-node cluster for your data ingestion team. They would all operate on the same data, but completely independently of each other; the only interaction is the usual locking when two clusters try to update the same data at the same time, and from a resource perspective they are completely independent. Yellowbrick can do this better than other Cloud platforms like Snowflake because it can also segment resources within a single cluster. Furthermore, its advanced workload management capabilities ensure that shorter queries don’t get stuck behind longer-running queries, and the resources available to certain teams can be capped.
Where is the proof?
Seeing is believing, and Yellowbrick will set up a proof of concept/value in the Cloud or on-premises, loading your data so you can see the platform running against it. If your data is complex and/or needs any sort of transformation, Smart Associates can help with our automated tools.
We advise customers to test with data volumes and queries that are as realistic as possible, including mixed workloads. It is easy for vendors to optimize for a few queries and impress you with the performance, only for you to discover that when you run the system against your data in the real world, with hundreds of people connected and loads running in parallel, the performance can just seem to disappear and the costs spiral upwards. In a consumption model it is important to prove both the performance and the cost in your proof of value, and not to rely on assumptions the vendor has given you.
Another big benefit of working with realistically sized real data, and running it in your environment, is that the data transformation is already done and your loaded data will have passed your security checks, so if the proof of concept is successful you can transition into your production program very quickly.
Proofs of value last anything from two weeks to a month, and customers are usually able to demonstrate everything they want to in that time. Cost-wise, you only cover your own Cloud costs; Yellowbrick will not charge for its software until you’ve signed up, and it has built-in capabilities, such as pause and resume, that help you manage costs, especially during a proof of value when you don’t want to blow through your budget.
Get a free data migration to Yellowbrick
Getting your data into Yellowbrick is something we at Smart Associates can also help you with. As we’ve talked about in previous webinars, we have software that can automate database migration activities. We can do all the heavy lifting, at no charge during the proof of concept/value stage, so customers can just focus on getting results.
Want to learn more? Why not get in touch with us?
Previous blogs in this series:
The Benefits of Simplicity in Data Warehouse Architecture
Value Drivers in Data Warehouse Selection
A Netezza Customer Journey