Data Warehouse Management: Redshift, Bigquery, and Snowflake
In our post comparing Redshift, BigQuery, and Snowflake on query performance and cost for interactive analytics, we looked at the trade-offs across different data warehouses from a performance perspective.
BigQuery, Snowflake and Redshift all have web based consoles where you control your data, clusters, user management, query logging and system analytics. In this post we will visit the management console of each system and the supporting tools for managing your warehouse deployments.
Redshift supports directly accessing the compute cluster through command line tools like psql.
With Snowflake you can run queries through their UI. Each user has a user session which is linked across open browsers windows, so it’s best to work in a single browser tab.
BigQuery’s query editor has some additional features over Snowflake - you can format your SQL queries and save common subqueries in a view to save computation across queries.
Both Redshift and Snowflake offers JDBC and ODBC drivers for connecting through Business Intelligence tools like Periscope Data. With BigQuery you can create a Service Account and connect over the custom API they provide.
Monitoring & Logging
Redshift offers granular insight into the queries running on the cluster and an overview of system performance. It is easy to investigate cluster behavior using the different time slice and filter options available through the performance interface.
The query interface provides a live history of the queries being run on the cluster
You can drill down into individual queries to see the query plan and the cluster metrics when the query was running.
The additional insights into the cluster status and data loading offers a comprehensive overview, and enables database administrators to optimize cluster performance by setting cluster parameters for their use case.
Snowflake offers query history that helps you understand the usage patterns of your users and your warehouses. You can see the queries running on the different warehouses and metadata around the run time and the data accessed.
You can dive into a specific warehouse for a snapshot of the cluster’s usage over time
BigQuery has a query history log that shows the query, start and end time, and total bytes processed. They do not have aggregated time views of total usage, each query is it’s own line item.
Redshift stores data directly on the hard drives of the compute nodes, storing the data in slices using adjustable sort keys and distribution strategies that allow you to optimize the data layout to the queries that your analysts run empirically. You can use COPY statements to load data from S3.
Snowflake stores data on Amazon S3 and loads the data into a virtual warehouse at query time, relying on memory caches to support fast performance when running multiple queries. There are less controls over data layout - you can specify the sort order when inserting data into a table - and you largely rely on the Snowflake optimizer for performance improvement.
BigQuery tables can be created from file upload, Google Cloud Storage, or Google Drive. You can load data into Cloud Storage through their Transfer interface or through the command line tool gsutils. Data storage is completely opaque from the user’s perspective.
Redshift has a single computational warehouse associated with your dataset. You can prioritize different queries using workload management (WLM). WLM creates query queues that limit how many queries different users can run in parallel, allowing you to create fairness and prioritization in your workload.
Snowflake’s computational warehouses are totally decoupled from the storage layer. Within a few seconds or minutes, you can add additional computational power and enable different users to work with different clusters, creating fine-grained control over your analytics budget.
BigQuery does not work with isolated compute engines - with BigQuery you submit your query to be run either interactively or as part of a larger batch. Rather than having a fixed cluster resources, you simply pay for what queries you run.
Snowflake offers online, nearline, and offline cluster options - you can enable nearline functionality using their auto-suspend and auto-resume features. Snowflake enterprise offers additional features like IP whitelisting.
Redshift offers online clusters and offline backups. The management interface allows users to control their clusters, specifying security and permission settings, modifying cluster parameters, and inspecting clusters for issues.
BigQuery is a different experience. You can define separate Projects, and they operate as separate spaces with their own data and query management. Projects aren’t directly tied to compute capacity, since you are not provisioning hardware directly. Querying data through BigQuery also offers the nearline-style experience, the data is only activated once you run your query.
Security & User Access
BigQuery integrates with Google for Work identity management, Redshift and Snowflake both also offer user identity management and two-factor authentication, user access controls around data management, warehouse management, and administrative access. Redshift offers fine-grain networking controls through the AWS VPC and IAM roles, and Snowflake Enterprise supports enhanced security and encryption protocols on dedicated servers.
Redshift maintains compliance with a large number of security assurance programs, including HIPAA, PCI - you can check out the full list here. Redshift is SOC 1, 2, and 3 compliant has met the security certifications of numerous nations, including the U.S., U.K., the EU, Australia, Malaysia, Japan, Singapore, and New Zealand.
BigQuery is Soc 2 and 3 compliant, supports HIPAA and PCI agreements, and participates in the EU Data Protection Directive.
Snowflake is Soc 2 Type 2 certified, and HIPAA compliant.
You can learn more about each of these warehouses in our overview of query performance and cost for interactive analytics. We’d love to hear from you about your favorite data warehouse @PeriscopeData!