Redshift : Sort Key


Choosing the Sort Key

If you query most on
  Recent data  - Choose timestamp column as the leading column for sort key
  Range filter - Use that column as sort key
  Join a table - specify join column as sort key


Loading data

When you load data incrementally, the rows are sorted and reside in temporarily in separate region on disk. To maintain fully sorted table, you have to run the VACUUM command at regular intervals.

Load your data in sort key order to avoid needing to vacuum.
COPY sorts each batch of incoming data as it loads. So if you load the data in sort order then the data will be properly stored in sort order.



Range Scan Query

If the column used for range scan is sort key, then the query processor can use min and max value to skip over large number of blocks during table scans. (Redshift stores columnar data in 1mb blocks and the min/max values of each block is stored as part of metadata)

JoinsMerge join can perform better than hash join when the data is distributed and presorted on the joining columns.

GROUP BY/ORDER BY and Window functions are also will be faster if it is on Sorted column.

Comments