Redsfhit Others1

Distribution Style
Number of slices = number of processor cores on the node * number of nodes

Query Execution
Query optimizer redistributes the rows to compute nodes to perform joins/aggregations.
It might send either specific rows to nodes or the entire table to all nodes.

Master User has access to the Initial Database in the cluster.
Users you create are global across all the databases in cluster

User: rsdb is internal user used by Amazon Redshift
Password:
Password for Database user should have minimum 8 characters. It should contain atleast 1 Uppercase, 1 lowercase and 1 number.

System Tables and Views

-- Only some of the system tabels and views can be accessed by users.
-- In some of the system tables and views, Users are able to see the data for their own activities. SuperUser will be able to see for all the users.

NUL terminator character (\x000 or \x0) to indicate NULL values.
Despite very similar spelling, NUL and NULL are not the same. NUL is a UTF-8 character with codepoint x000 that is often used to indicate end of record (EOR). NULL is a SQL value that represents an absence of data.
By default, COPY treats a NUL terminator character as an EOR character and terminates the record,

Redshift is fully managed datawarehouse service in cloud.

Redshift Cluster
- Redshift Cluster contains a Leader Node and 1 or more Compute Nodes.
- Number of compute Nodes depends on the size of data, number of queries and the performance.
- You can add/remove compute nodes without any interruption.

Leader Node
- Receives queries from client apps, parses the queries and develops the execution plan. It coordiantes parallel execution of these plans with compute ndoe, aggreagates the intermediate results andf inalay returns the result back to clients.

Compute Nodes
- Two types of nodes are available
1) Dense Storage -
2) Dense Compute

- Executes the steps and transmit data among other compute nodes to serve the queries.
- Each compute node has its own dedicated CPU, memory, and attached disk storage.
Node slices

Node Slice
- Compute node is partitioned into slices; one slice for each core of the node's multi-core processor.
- Each slice is allocated a portion of the node's memory and disk space, where it processes a portion of the workload assigned to the node.
- The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. The slices then work in parallel to complete the operation.

Network between Leader and Compute Nodes uses custom communication protocols to provide private, very high-speed network communication. It runs on separate isolated network which client applications cant access.

Databases
- A cluster contains one or more databases. User data is stored on the compute nodes.

Cluster Snapshots
- They are point-in time backups of a cluster and they are stored in S3.
- You can restore it to a new Redshift Cluster.

Resizing Cluster
- It puts the node in Read-only and then provisions the node and then copies the data to new cluster. It switches connections to new cluster, existing queries will lose connections.
- You must wait until resize completes before loading the data.

For operations where performance is heavily affected by the amount of memory allocated, such as
Vacuum, increasing the value of wlm_query_slot_count can improve performance. In particular, for slow
Vacuum commands, inspect the corresponding record in the SVV_VACUUM_SUMMARY view. If you
see high values (close to or higher than 100) for sort_partitions and merge_increments in the
SVV_VACUUM_SUMMARY view, consider increasing the value for wlm_query_slot_count the next time
you run Vacuum against that table.
Increasing the value of wlm_query_slot_count limits the number of concurrent queries that can be run.
For example, suppose the service class has a concurrency level of 5 and wlm_query_slot_count is set
to 3. While a query is running within the session with wlm_query_slot_count set to 3, a maximum of 2
more concurrent queries can be executed within the same service class. Subsequent queries wait in the
queue until currently executing queries complete and slots are freed.

Snapshots / Backups
Snapshots are point-in-time backups of a cluster. There are two types of snapshots: automated and
manual. Redshift stores them in S3.
To restore from a snapshot Redshift creates a new cluster and imports the data from the snapshots.

EXPLAIN [VERBOSE] query
- Displays execution plan of the query.
- Query is breaken down into discrete sequence of stesp and table operations
- You can run explain plan for SELECT and DML. For DDL it will fail

The cost is a measure that compares the relative execution times of the steps within a plan.
It does not provide any actual execution times or memory consumption.
It does not provide a meaningful comparison between execution plans, but it does give you an indication of which steps in a query are consuming the most resources.

STL_EXPLAN
- This will have explain plan for the SELECT Query which was previously run.

system views SVL_QUERY_SUMMARY and SVL_QUERY_REPORT capture query activity at a finer level of granularity than the
EXPLAIN output, and they contain metrics that you can use to monitor query activity.

A vacuum operation might not be able to start if a load or insert operation is already in progress.Vacuum
operations temporarily require exclusive access to tables in order to start. This exclusive access is
required briefly, so vacuum operations do not block concurrent loads and inserts for any significant
period of time

NULL

If your data includes null terminators, also referred to as NUL (UTF-8 0000) or binary zero (0x000),
you can load these characters as NULLS into CHAR or VARCHAR columns by using the NULL AS
option in the COPY command: null as '\0' or null as '\000' . If you do not use NULL AS, null
terminators will cause your COPY to fail

Data Loads

• Query STL_LOAD_ERRORS to discover the errors that occurred during specific loads.
• Query STL_FILE_SCAN to view load times for specific files ant to see what file was read.
* query STL_LOAD_COMMITS system table to verify that the expected files were loaded

Optimizing storage for narrow tables
If you have a table with very few columns but a very large number of rows, the three hidden metadata
identity columns (INSERT_XID, DELETE_XID, ROW_ID) will consume a disproportionate amount of the disk space for the table.
In order to optimize compression of the hidden columns, load the table in a single COPY transaction
where possible. If you load the table with multiple separate COPY commands, the INSERT_XID column will not compress well.You will need to perform a vacuum operation if you use multiple COPY commands,

By default, the COPY command applies automatic compression whenever you run the COPY command with an empty target table and all of the table columns either have RAW encoding or no encoding.

To apply automatic compression to an empty table, regardless of its current compression encodings, run the COPY command with the COMPUPDATE option set to ON. To disable automatic compression, run the COPY command with the COMPUPDATE option set to OFF.

Datawarehousing

Search This Blog

Redsfhit Others1

Comments