Want to make data analysis fast for everyone?Join Us!
Every day customers ask us what the best data warehouse technology is for handling the workload of their growing analytics department: they want to run queries at interactive, real-time speeds at a price-point that fits in their budget. We’ve helped our users with everything from tried-and-true MySQL servers, day-old Postgres boxes, and fifteen year old SQL Server deploys. We help customers work with their production databases and modern cloud analytics solutions like AWS Redshift, Snowflake and Google BigQuery.
Our customers want to know where to get the fastest query times, how much data their data warehouse can handle, how much it will cost them. We want to make sure we provide the best recommendations to our users when they ask for our help navigating the data warehouse space, so we put Redshift, Snowflake, and BigQuery to the test.
After investigating Redshift, Snowflake, and BigQuery, we found that Redshift is the best choice for real-time query speeds on our customers’ typical data volumes. In this post, we will walk through our analysis of these three data warehouse solutions and the compelling use cases we found for each of the technologies.
Real-world Review: Average Query Speeds in Periscope Data
Generally, our customers have tens-of-millions to tens-of-billions of rows of data: 54% of clients have at least tens of millions of rows of data, 29% have at least hundreds of millions, and 10% have billions of rows of data. Customers frequently consider data warehouses when analytics query speeds begin to take longer than 2-3 seconds in production. The average query from a customer’s database in Periscope runs in 8.1 seconds, and when we ran them on a similarly priced, unoptimized Redshift Cluster, we saw an average speedup of 2.9x.
When we look at queries that average 60 seconds or longer, we find an even more compelling story: on average, Redshift increasingly outpaces user databases for long-running queries.
Redshift enables greater enhancements by analyzing queries and optimizing around the run time bottlenecks. The Periscope Data Cache is built on hundreds of Redshift nodes. With optimizations like materialized views and optimized sort and dist keys, we’ve seen dashboards run up to 150 times faster than their source databases.
Cluster Sizes, Running Time, and Getting the Most for Your Dollar
Our challenge was to estimate the performance of Redshift, Snowflake and BigQuery on real-world, interactive use cases. We pulled 1 billion rows (150 GB uncompressed) from our internal page visit logs, and a smaller dimension table to join against. We selected queries written by analysts at Periscope Data over the past three months, and measured peak performance of individual queries, without optimizing the cluster technologies for these specific queries. Our goal was to understand the fastest possible speeds for each technology. Our queries perform filters, aggregations, joins, and sorts. We selected queries that work with many different data sizes, filtering down from one billion rows by factors of 100X to 100,000X. We choose join keys with both high and low cardinality.
The very first chart on Periscope’s own dashboard measures how much time users have spent on the site by hour for every day. We varied the data volume by looking across the last 4, 20, and 120 days and we ran our query on different configurations of our Redshift and Snowflake clusters. We will describe the different clusters and the pricing economics in the next section.
Raw run times on our first performance test. Snowflake’s x-small cluster and the 8-node Redshift cluster cost the same at $2/hr.
Bolstered that the warehouse solutions were producing comparable performance, we stepped into a wider set of queries. Speed is important, but the deciding factor for many users is their budget. The real question we want to answer is how much performance will be available at different price levels.
We want to understand if BigQuery or Snowflake would make for a good alternative to our Redshift caching layer for empowering interactive analytics, so we compared the always-on performance for Redshift, Snowflake, and BigQuery.
Our BigQuery queries cost between seven cents and fifteen cents each. Periscope automatically refreshes queries in the background to keep user dashboards fresh. As a result, customers run thousands of queries per day through Periscope. At seven cents a query it would cost around $70/day for each customer. We could get faster speeds at a lower price point using an 8 node dc1.large Redshift cluster for each customer at $48/day, so migrating to BigQuery would not be cost-effective for us.
Redshift was faster per-dollar than Snowflake across most of our testing. We ran seven additional queries and compared the query run-time relative to the cost of the nodes. Here we break down the data by query to show how the computation cost compares for different node sizes.
Performance relative to spend without optimizing for individual queries.
For clusters running twenty-four hours a day, Redshift is the clear winner, offering better performance per dollar. Redshift performance-per-dollar increased with the size of the cluster. This happens when the aggregated memory, network, and IO capacity increases to hold more of our workload. Snowflake’s performance-per-dollar was comparable through the X-Small, Small, and Medium clusters, with a step down at the Large cluster. Snowflake’s smaller cluster sizes excelled at joining small amounts of data. We suspect it was performing an in-memory join without having to transfer data between nodes over a network layer.
Comparing unoptimized clusters helps you understand performance right out of the box. We want to investigate how much we can improve the query speed on each system. We re-wrote the three longest-running queries and set the sort keys that match the query. On Redshift we additionally set data distribution and compression settings to optimize this set of queries.
Normalized performance relative to spend after optimization.
Query performance changes anywhere from 10% faster to 100 times faster, so we show the relative performance. When we optimize our queries on both data warehouses, we get larger gains on Redshift than we do on Snowflake. Redshift allows for maximum performance through it’s high level of customizability, enabling users to get the maximum performance from their cluster.
In addition to understanding our own use case, we want to help guide our customers to the data warehouse that’s right for them. We will take a look at the performance patterns that make economical sense for Redshift, Snowflake, and BigQuery, as well as the usability of each warehouse.
Three Very Different Pricing Models
BigQuery, Redshift and Snowflake have very different pricing models. All of the warehouses offer on-demand pricing and volume discounts. We’ll cover the on-demand pricing and highlight BigQuery’s monthly-billing model. Redshift and Snowflake offer 30% to 70% discounts for prepaying.
Redshift charges per-hour per-node, which covers both computational power and data storage. With Redshift, you can easily calculate the monthly price by multiplying the price per hour by the size of the cluster and the number of hours in a month. Pricing starts at $0.25 per-hour for 160GB of data. A cluster of 8 dc1.large nodes costs $1,440 for a 30 day month. Redshift lets you treat hardware resources as a commodity, describing exactly how much memory, storage, and I/O throughput you’ll get for the money.
Snowflake’s pricing bills at hour granularity for each virtual warehouse and depends heavily on your usage pattern. Since data storage is decoupled from the computational warehouses, it’s billed separately at $150 per TB per month, or $0.2083 per TB per hour. Snowflake offers seven different tiers of computational warehouses. The smallest cluster, X-Small, costs one credit per hour, or $2/hour. At each level, the number of credits per hour doubles. Snowflake offers a dynamic pricing model - clusters will stop when no queries are running and automatically resume when they are, and they can flexibly resize themselves based on a changing workload, which can save you money when query load decreases.
BigQuery is a totally different ball-game. Instead of paying for clusters by hour, you pay for each query you run and the data you are storing. You can additionally specify a maximum value to let your queries charge. The maximum billing tier corresponds to $5 per TB, and you can set a multiplier for queries that require high-levels of computational power.
BigQuery also offers a flat-rate pricing option that enables predictable monthly billing. Most BigQuery customers find that their query workloads easily fit into the 2000 BigQuery Slots available to them in the monthly billing option. If your workload needs more you can expand your slot allocation in 500 slot increments.
Choosing the Right Cluster Depends on Your Usage Patterns
Snowflake makes it easier to optimize for cost using automatic stopping and starting of its warehouses, and is much faster at increasing/decreasing cluster resources compared to Redshift. The comparable performance of the standard medium Snowflake warehouse and the 8 node Redshift cluster was a good trade-off between speed and performance for our analytics use case. We can plot the cost as a function of percentage of uptime to see where using Redshift or Snowflake’s suspended-warehouse model could save us money over running a cluster 24 hours a day.
Thinking through your use cases of when and how much computational power you need can have a huge impact on your data analytics bill. In our example, if we wanted to use our data warehouse for a continuous block of 10 or fewer hours per day to save on costs,
- Snowflake’s automatic suspending and re-starting of clusters happens automatically
- Redshift you would shut down you cluster and restart it through the Redshift API.
We run our cluster 24 hours a day to handle query volume and manage our ETL pipelines: it makes economical sense for us to stick with Redshift.
There is more nuance when evaluating Snowflake for your own use case. If you have a 12-hour ETL pipeline you run once a week, Snowflake’s dynamic Warehouse model can help you save costs while still offering high burst-performance querying during peak hours. One Snowflake user we spoke with loads huge data sets every day using X-Large Snowflake warehouses for a few hours in the morning as part of their ETL pipeline, and then switches to a Small warehouse for querying during the day to run analyses on the roll-up tables generated daily. Experimenting with your own workload needs is paramount to understand if Snowflake is the best option for your data pipeline and your budget.
In our experiments, BigQuery query speeds were frequently comparable to a Redshift cluster made of 8-12 dc.1large nodes. Empirically, our per-query cost ranged between $0.07 and $0.15 for our billion row, 29 GB compressed dataset. We can look at the cross-over points in terms of number of queries run per-hour on average.
We can compare this against how many user-run queries are run through Periscope per-hour, excluding the background queries we run automatically. Some users simultaneously refresh hundreds of queries on a dashboard multiple times every day, while others run individual queries on an occasional ad-hoc basis throughout their workday. Even at our data volume, relatively small for BigQuery’s standard, it can be worth investigating for those users who only run occasional analytics queries.
Most Periscope Data users run hundreds of interactive queries every hour. We found that only 12% of our customers average 13 or fewer interactive queries per-hour, and 29% average 43 or fewer queries per-hour. The largest and smallest cross-over points in our graph is where BigQuery is economically competitive for our users’ analytics queries.
Our experiments were limited to queries running over one billion rows, or tens of gigabytes of data. When researching data warehouses for your use case, remember to benchmark query performance at the maximum data size in your data sets. The economics and performance can look very different as you scale from Gigabytes to Terabytes and Petabytes. We frequently recommend people experiment with multiple solutions at once - testing peak concurrent workload to measure the total system throughput is as important as the individual performance of queries. When thinking about large data scales, it’s also a good idea to investigate data ingestion, storage, and extraction costs, which vary across data warehouse technologies.
Since our queries are running at low latencies in the single and double-digit seconds, we don’t frequently reach the maximum throughput other users see. For example, our peak BigQuery data throughput was at 10 GB/s, at the Petabyte scale BigQuery demonstrations reach speeds measuring in terabytes per second.
Warehouse Overview: Differentiating Factors
We have been long-time fans of Redshift - the service has great metrics and logging, reliable uptime, and the support team works tirelessly on our behalf when an issue does arise. Redshift has numerous improvements for high performance for massively parallel queries that leads to our average speedup of 8.6x on long-running queries over PostgreSQL, MySQL, Oracle, and SQL Server:
- Workload management provides database admins control over query queues - you can prioritize ad-hoc analytics queries so they aren’t stuck behind slow-moving ETL jobs.
- Data compression: Redshift stores individual columns separately, and compression settings can be defined for each column for not only maximum data storage, but faster throughput loading data from disk and transferring data across the cluster’s network.
- Query optimizer intelligently designed for targeting the parallel computation use cases popular in modern data warehousing.
They also provide a wealth of documentation for understanding Redshift and writing faster queries. Optimizing performance becomes systematic and approachable: there is a good balance between achieving performance through optimizing your SQL queries, optimizing your data layout, and migrating to a larger cluster. You can find a great overview of their automatic data backups, how to load data into Redshift, and a lot more in this helpful FAQ.
Snowflake has been on our radar for a long time, and we recently partnered with Snowflake to support their warehouse solution in Periscope Data. Snowflake, like Redshift, is a relational columnar-store cluster warehousing solution that targets massively parallel and concurrent data warehouse use cases. Snowflake’s key differentiators stems from its use of
- Virtual computation warehouses that are decoupled from nearline data storage. Snowflake makes it easy to have many users running queries through different virtual warehouses at the same time. You can have your overnight ETL processes run on a slower and less expensive warehouse, and then enable real-time ad-hoc queries through a more powerful warehouse during business hours. You can instantly scale up, scale down, or even pause compute power. You can reserve computational power (and cost!) for when you really need it.
- Data retention Similar to Redshift’s automatic backups to S3, Snowflake’s Time Travel feature enables you to revisit data at any point in time over the last ninety days. You can rapidly recover from corrupted data or setup an environment to revisit memories of data exploration from a month ago.
- Automatic tuning efforts: Snowflake tunes the system to improve query run time as you use it. It can automatically scale your warehouses to match your query volume, resizing itself to be larger or smaller based on active demand from your analysts. It won’t totally replace manual query optimization, but it allows for a less hands-on approach to cluster management.
With Snowflake you get competitively fast analytics performance out of the gate, without having to spend a lot of effort optimizing the system for your use case. There are a handful of easter eggs that you will come across with Snowflake, our favorite being semi-structured data processing that makes it easier to query data with flexible schemas, like JSON or Avro.
A vocal minority of our users are using Google BigQuery - they frequently find it works very well in their data workflows, providing fast query speeds even as they scale to very large datasets. BigQuery’s model differs the most from our other data warehouse considerations:
- Serverless: Managing BigQuery is even easier than Snowflake, you can confidently award administrative duties to analysts without an background as a database administrator. Navigating the BigQuery ecosystem, takes a little while to get used to, but after becoming acclimated it’s straightforward to manage your projects and datasets in the Google Cloud Platform.
- Quickly Scale to Petabytes: Out of all of the tools, reaching petabyte scale is the most direct on BigQuery. You don’t have to provision a larger cluster, think if you are using the right size cluster for your dataset, or track cluster scaling over time. You just run your query, and BigQuery handles it all under the hood, watch how easily BigQuery crunches through a petabyte of data in under four minutes.
- Deep Google Cloud Integration: if you are already a heavy user of Google products, trying out BigQuery is a no-brainer. It easily interfaces with tools like Google Analytics and user management is easy with G Suite - the recently rebranded Google for Work.
For technical teams managing workloads of ranging from the millions to the few billions, Redshift is our winner on both price and performance. The pricing model is easy to understand, prevents unexpected fluctuations in cost, and query speeds are available at lower price points for daily-active clusters.
The detailed documentation and many resources surrounding Redshift go a long way in enhancing your Redshift experience. The deep query and system performance information enables extensive analysis and review of the queries running on your clusters, and how to make them faster. Check out our detailed cluster management comparison for an overview of the consoles for managing your Redshift, BigQuery, and Snowflake deployments.
An important consideration that we did not model is total throughput. Looking at the optimal query speed of an isolated query helps you understand if achieving interactive query speeds are feasible, but it is also important to understand the maximum capacity of your data warehouse at peak load. Both Snowflake and BigQuery can automatically and rapidly provision greater compute resources to handle larger data load. Resizing a large Redshift cluster takes several minutes where your cluster is read-only and analysts are unable to change data on the cluster. When evaluating which data solution is the best for your workload, it is important to understand the size of the cluster that will handle many analysts running queries concurrently.
We are heavy users of Redshift, and have found Redshift to be a fantastic choice for users with large, on-going data needs. Redshift clusters offer top-of-the-line performance at best-in-market price points. The ecosystem surrounding Redshift makes debugging and optimization accessible to newcomers through their extensive documentation, and to seasoned pros through the highly tunable cluster configuration.
The next frontier we want to explore is how different data warehouses operate through the hundreds-of-Gigabyte and into the multi-Terabyte level. We would love to hear your experiences working with large data sets on your favorite warehouse @PeriscopeData!