<p>Columnar stores like Amazon Redshift achieve high speeds by reading only the columns needed to complete a query. The best speedups are achieved when these columns and the intermediate results fit in RAM. These speedups degrade if the intermediate results exceed the available RAM and get written to disk.</p>



<p>Your query is likely exceeding the available RAM if it causes spikes in your disk usage graph:</p>



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-01-29.png" alt="Spikes in the data" class="wp-image-80956"/></figure>



<p>The disk space spikes as temporary tables are created and destroyed, slowing our queries in the process.</p>



<p>Redshift keeps detailed statistics on each query execution, available in the system views svl_query_report and svl_query_summary.</p>



<p>These tables are keyed on query with the ID found in the redshift console or by selecting from svl_qlog. The svl_query_summary view is a summarized version of svl_qlog.</p>



<p>To find recent queries that are hitting disk, run:</p>



<pre class="wp-block-code"><code>select query, substring
from svl_qlog join svl_query_summary using(query)
where starttime > date(getdate()) - interval '1 day'
  and is_diskbased = 't';</code></pre>



<h2 class="wp-block-heading">Digging In</h2>



<p>To dig in, we’ll open up our Redshift console and check the run time for our sample query.</p>



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-02-28.png" alt="Sample query run time" class="wp-image-80962"/></figure>



<p>To see the execution details, run:</p>



<pre class="wp-block-code"><code>select query, step, rows, workmem, label, is_diskbased
from svl_query_summary
where query = 127387
order by workmem desc</code></pre>



<p>The workmem is the upper bound on memory the query planner requires for each step, not the amount actually used. If workmem exceeds the available RAM the step will use disk for intermediate results. This is indicated by the is_diskbased column.</p>



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-03-23.png" alt="diskbased column" class="wp-image-80969"/></figure>



<p>The amount of RAM available is based on the number of query slots allocated to this query. By default, Redshift uses 5 query slots and allocates one slot per query. Many of the steps are using disk so we need to optimize this query.</p>



<h2 class="wp-block-heading">Mitigation</h2>



<p>There are always two approaches to solving memory constraints: add more memory or use less. Adding more memory is expensive and using less can be difficult or even impossible. Some queries will always need to use disk.</p>



<p>For the queries that could fit in RAM, here are some mitigation strategies:</p>



<h3 class="wp-block-heading">Update Database Statistics</h3>



<p>workmem is only the estimate of space needed by the query and is based on the statistics the database collects about each table.</p>



<p>Make sure to run analyze on your tables to keep their internal statistics up to date. This reduces extraneous padding from outdated row count estimates.</p>



<h3 class="wp-block-heading">Making More RAM Available</h3>



<p>Adding more memory is done by either adding more nodes to your cluster or by increasing the <a href="https://docs.aws.amazon.com/redshift/latest/dg/r_wlm_query_slot_count.html" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">wlm_query_slot_count</a>.</p>



<p>This can be done per session using set wlm_query_slot_count or per user using <a href="https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-parameter-groups.html" target="_blank" rel="noreferrer noopener" aria-label=" (opens in a new tab)">parameter groups</a>.</p>



<p>Increasing the wlm_query_slot_count from 1 to 5 gives this query access to all of the cluster’s RAM. The workmem has increased for all steps, and most are no longer disk-based.</p>



<figure class="wp-block-image size-full fancybox"><img decoding="async" src="https://cdn.sisense.com/wp-content/uploads/image-04-17.png" alt="Query info" class="wp-image-80975"/></figure>



<p>Requiring all of the query slots means that this query needs to wait until all 5 slots are available before it can run. This waiting can eliminate the performance improvements of the additional RAM on a busy cluster.</p>



<h2 class="wp-block-heading">More efficient queries</h2>



<p>This query counts the distinct number of users who have post or visit events each day. It takes 75 seconds and four of its steps are disk-based. Instead of increasing the query slots, let’s make the query more efficient.</p>



<pre class="wp-block-code"><code>select
  date(event_time) as date,
  count(distinct case when event_type = 'Post'
    then user_id else null end) as posts_created,
  count(distinct case when event_type = 'Visit'
    then user_id else null end) as visits
from events
where event_time >= '2014-05-21'
group by 1</code></pre>



<p>The best way to improve the performance of any query is to reduce the amount of data that is stored and read during its execution. This query only cares about visits with an event_type of ‘Post’ or ‘Visit’. Everything else will resolve to null and not be counted in the distinct.</p>



<p>Adding a where clause to filter for only these two event types greatly speeds up this query. It runs in about 5 seconds and doesn’t hit the disk at all. Hope this helps with your queries!</p>


Demystifying Redshift — What’s Up with My Disks?

LinkedIn

Twitter

GitHub

curve-image-unique-image-unique

curve

3-dark-2-image-unique-image-unique

3 DARK 2

Get the latest in analytics right in your inbox.

Article