Amazon Redshift is the managed data warehouse solution offered by Amazon Web Services. A Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Every cluster has one or more compute nodes; clusters with two or more compute nodes also have a leader node that coordinates them. Each cluster runs an Amazon Redshift engine and contains one or more databases.
- Amazon Redshift is built on ParAccel's massively parallel processing (MPP) data warehouse technology (ParAccel was later acquired by Actian). The product is a simple and cost-effective way for customers to analyze all of their business data using their existing business intelligence tools.
- Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.
Data compression reduces storage requirements, thereby reducing disk I/O, which improves query performance. When you execute a query, the compressed data is read into memory, then uncompressed during query execution.
- Loading less data into memory enables Amazon Redshift to allocate more memory to analyzing the data.
- Because columnar storage stores similar data sequentially, Amazon Redshift is able to apply adaptive compression encodings specifically tied to columnar data types.
- The best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal compression encodings when you load the table with data.
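As a minimal sketch of that approach (the table, bucket, and IAM role names below are placeholders, not from the source), COPY can apply optimal encodings during the initial load, and ANALYZE COMPRESSION reports recommendations afterward:

```sql
-- Load data and let Redshift choose compression encodings automatically.
COPY sales
FROM 's3://example-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
FORMAT AS CSV
COMPUPDATE ON;  -- analyze the incoming data and apply optimal encodings

-- Report recommended encodings for a table that is already loaded.
ANALYZE COMPRESSION sales;
```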
Massively parallel processing (MPP) allows fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data.
- Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel.
- By selecting an appropriate distribution key for each table, customers can optimize the distribution of data to balance the workload and minimize movement of data from node to node.
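For illustration (the table and column names are hypothetical), the distribution key is declared when the table is created:

```sql
-- Distribute rows by customer_id so that joins on that column can run
-- without moving data between nodes.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2),
    sale_date   DATE
)
DISTSTYLE KEY
DISTKEY (customer_id);
```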
Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance.
- Storing database table information in a columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to load from disk.
- Loading less data into memory enables Amazon Redshift to perform more in-memory processing when executing queries.
- When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of data blocks.
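A hypothetical example of that filtering: with a timestamp column as the sort key, rows with similar values land in the same blocks, so a range predicate lets Redshift skip most blocks entirely.

```sql
-- created_at is the sort key, so this range filter can skip any block
-- whose min/max timestamps fall outside January 2023.
CREATE TABLE events (
    event_id   BIGINT,
    event_type VARCHAR(32),
    created_at TIMESTAMP
)
SORTKEY (created_at);

SELECT event_type, COUNT(*)
FROM events
WHERE created_at BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY event_type;
```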
The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases the execution speed, especially for complex queries.
- The compiled code is cached and shared across sessions on the same cluster, so subsequent executions of the same query will be faster, often even with different parameters.
- The execution engine compiles different code for the JDBC connection protocol and for the ODBC and psql (libpq) connection protocols, so two clients using different protocols will each incur the first-time cost of compiling the code.
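One way to observe this behavior is the SVL_COMPILE system view, which records per-segment compile activity (a sketch; interpret compile = 1 as a fresh compilation and compile = 0 as reuse of cached code):

```sql
-- Show whether recent query segments were freshly compiled (compile = 1)
-- or served from the compile cache (compile = 0).
SELECT query, segment, compile, starttime, endtime
FROM svl_compile
ORDER BY starttime DESC
LIMIT 20;
```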
To reduce query execution time and improve system performance, Amazon Redshift caches the results of certain types of queries in memory on the leader node.
- When a user submits a query, Amazon Redshift checks the results cache for a valid, cached copy of the query results. If a match is found in the result cache, Amazon Redshift uses the cached results and doesn’t execute the query; result caching is transparent to the user.
- To maximize cache effectiveness and efficient use of resources, Amazon Redshift doesn’t cache some large query result sets. Amazon Redshift decides whether to cache a result based on factors such as the number of entries already in the cache and the instance type of the customer's Amazon Redshift cluster.
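Cache hits can be confirmed from system tables (a sketch; in SVL_QLOG, the source_query column points at the query whose cached result was reused):

```sql
-- Queries answered from the result cache reference the original
-- query's ID in source_query.
SELECT query, elapsed, source_query
FROM svl_qlog
ORDER BY query DESC
LIMIT 20;

-- Result caching can be turned off per session, e.g. for benchmarking.
SET enable_result_cache_for_session TO off;
```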
Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries.
- Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues.
- From a user perspective, a user-accessible service class and a queue are functionally equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service class as well as a runtime queue.
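As a sketch of routing work into a specific queue (the group name 'reports' is hypothetical and must be listed in the WLM configuration of the target queue):

```sql
-- Route subsequent queries in this session to the queue whose WLM
-- configuration includes the 'reports' query group.
SET query_group TO 'reports';

SELECT COUNT(*) FROM stv_inflight;  -- runs in the matching queue

RESET query_group;
```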
Columnar storage for database tables is an important factor in optimizing analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.
- In a relational database table, each row contains field values for a single record. In row-wise database storage, data blocks store values sequentially for each consecutive column making up the entire row.
- If block size is smaller than the size of a record, storage for an entire record may take more than one block.
- If block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in an inefficient use of disk space.
- Amazon Redshift uses a block size of 1 MB, which is more efficient than the 2 KB to 32 KB block sizes typical of other databases and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution.
The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage.
- The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation.
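The plan the optimizer chooses can be inspected with EXPLAIN (the tables here are hypothetical); the output shows, among other things, which join strategies and data-redistribution steps the MPP-aware optimizer selected:

```sql
-- Display the execution plan, including join types and any
-- redistribution or broadcast steps between compute nodes.
EXPLAIN
SELECT c.customer_id, SUM(s.amount)
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
GROUP BY c.customer_id;
```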
The core infrastructure component of an Amazon Redshift data warehouse is a cluster.
- A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. Customer applications interact directly only with the leader node. The compute nodes are transparent to external applications.
- The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
- The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node.
- The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
- Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type.
- Amazon Redshift provides two node types: dense storage nodes and dense compute nodes. Each node type offers two sizes with different storage capacities.
- A compute node is partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices.
- The slices then work in parallel to complete the operation.
- The number of slices per node is determined by the node size of the cluster.
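The slice layout can be inspected directly (a sketch using the STV_SLICES system view):

```sql
-- List each slice and the compute node it belongs to; the number of
-- slices per node depends on the node size of the cluster.
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;
```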
Data warehousing extracts data in periodic stages, or as it is generated, which makes it more efficient and simpler to process queries over data that originated in different sources. The raw data is turned into high-quality information that meets enterprise reporting requirements for all levels of users.
- Because Amazon Redshift combines big data and data warehousing capabilities, companies can build powerful applications and generate reports that provide all of the data they need to run a business.
- A data warehouse is a database designed to enable business intelligence activities; it exists to help AWS clients understand and enhance their organization’s performance.
- Data warehouses are consumers of data, and they are called online analytical processing (OLAP) systems.
- The data for a data warehouse system can come from Online Transactional Processing (OLTP) systems, Enterprise Resource Planning (ERP) systems such as SAP, internally developed systems, and so on.
- OLTP databases collect a lot of data quickly, but OLAP databases typically import large amounts of data from various source systems by using batch processes and scheduled jobs.
- Data warehouses are distinct from OLTP systems. With a data warehouse, customers separate the analytical workload from the transaction workload. As such, data warehouses are very much read-oriented systems.
- They have a far higher amount of data reading versus writing and updating.
Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly.
Amazon Redshift lets customers quickly and simply work with their data in open formats, and easily connects to the AWS ecosystem. They can query open file formats such as Parquet, ORC, JSON, Avro, CSV, and more directly in S3 using familiar ANSI SQL. To export data to the data lake, they simply use the Redshift UNLOAD command in their SQL code and specify Parquet as the file format; Redshift automatically takes care of data formatting and data movement into S3.
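A minimal sketch of such an export (the bucket, IAM role, and table names are placeholders):

```sql
-- Write query results to the S3 data lake as Parquet; Redshift handles
-- formatting and parallel data movement.
UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2023-01-01''')
TO 's3://example-datalake/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
FORMAT AS PARQUET;
```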
Customers can use AWS Data Pipeline to automate data movement and transformation into and out of Amazon Redshift.
- By using the built-in scheduling capabilities of AWS Data Pipeline, they can schedule and execute recurring jobs without having to write their own complex data transfer or transformation logic.
Amazon DynamoDB is a fully managed NoSQL database service. The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from an Amazon DynamoDB table.
- Customers can take maximum advantage of parallel processing by setting distribution styles on their Amazon Redshift tables.
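A sketch of such a load (the table names are hypothetical; READRATIO caps how much of the DynamoDB table's provisioned read throughput the COPY may consume):

```sql
-- Load from a DynamoDB table in parallel, using at most 50% of its
-- provisioned read capacity.
COPY customer_events
FROM 'dynamodb://CustomerEvents'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
READRATIO 50;
```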
Customers can migrate data to Amazon Redshift using AWS Database Migration Service. AWS DMS can migrate their data to and from most of the widely used commercial and open-source databases, such as Oracle, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Aurora, DynamoDB, Amazon S3, MariaDB, and MySQL.
Amazon Simple Storage Service (Amazon S3) is a web service that stores data in the cloud. Amazon Redshift leverages parallel processing to read and load data from multiple data files stored in Amazon S3 buckets.
- They can also use parallel processing to export data from their Amazon Redshift data warehouse to multiple data files on Amazon S3.
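For example (hypothetical names), a manifest file can pin down exactly which S3 objects are loaded in parallel:

```sql
-- Load multiple S3 files in parallel; the manifest (a JSON file in S3)
-- lists each object to load, guarding against missing or extra files.
COPY sales
FROM 's3://example-bucket/manifests/sales.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
MANIFEST
FORMAT AS CSV;
```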
Customers can use the COPY command in Amazon Redshift to load data from one or more remote hosts, such as Amazon EMR clusters, Amazon EC2 instances, or other computers.
- COPY connects to the remote hosts using SSH and executes commands on the remote hosts to generate data. Amazon Redshift supports multiple simultaneous connections.
- The COPY command reads and loads the output from multiple host sources in parallel.
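A sketch of a COPY over SSH (the manifest location is a placeholder); the SSH manifest is a JSON file in S3 that names each remote host and the command whose output Redshift ingests:

```sql
-- Connect to the hosts listed in the SSH manifest in parallel and load
-- the output of the commands it specifies.
COPY remote_events
FROM 's3://example-bucket/ssh/events.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
SSH;
```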
Lake House Architecture
Amazon Redshift powers the lake house architecture, which enables customers to query data across their data warehouse, data lake, and operational databases to gain faster and deeper insights not possible otherwise. With a lake house architecture, customers can store data in open file formats in an Amazon S3 data lake, which allows them to make this data easily available to other analytics and machine learning tools rather than locking it in a new silo. Using the Amazon Redshift lake house architecture, AWS clients can:
- Easily query data in their data lake and write data back to the data lake in open formats.
- Use familiar SQL statements to combine and process data across all their data stores.
- Execute queries on live data in their operational databases without requiring any data loading or ETL pipelines.
Query open format data directly in the Amazon S3 data lake without having to load the data or duplicate infrastructure. Using the Amazon Redshift Spectrum feature, clients can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV.
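A sketch of Spectrum in use (the schema, database, role, and table names are hypothetical): register an external schema backed by the AWS Glue Data Catalog, then query the S3 data with ordinary SQL:

```sql
-- Map an external schema onto a Glue Data Catalog database.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'datalake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';

-- Query Parquet files in S3 directly, without loading them.
SELECT event_type, COUNT(*)
FROM spectrum.clickstream
GROUP BY event_type;
```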
DATA LAKE EXPORT
Save the results of an Amazon Redshift query directly to your S3 data lake in an open file format (Apache Parquet) using Data Lake Export. AWS customers can then analyze this data using the Amazon Redshift Spectrum feature as well as other AWS services such as Amazon SageMaker for machine learning and Amazon EMR for ETL operations.
Federated Query enables Amazon Redshift to query data directly in Amazon RDS and Aurora PostgreSQL databases. This allows customers to incorporate timely, up-to-date operational data in their reporting and BI applications without any ETL operations.
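A sketch of setting up a federated source (the endpoint, secret, and object names are placeholders):

```sql
-- Expose a live Aurora PostgreSQL schema to Redshift; queries against it
-- read current operational data with no ETL.
CREATE EXTERNAL SCHEMA ops
FROM POSTGRES
DATABASE 'orders' SCHEMA 'public'
URI 'example-cluster.cluster-abc123.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:ops-db-abc123';

SELECT COUNT(*) FROM ops.orders WHERE status = 'open';
```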
DATABASE MIGRATION SERVICE (DMS)
AWS Database Migration Service (DMS) is a self-service tool you can use to migrate your data from the most widely used commercial data warehouses to Amazon Redshift. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database.