Amazon Redshift

Amazon Redshift is the managed data warehouse solution offered by Amazon Web Services. It is built from a collection of computing resources called nodes, which are organized into a group called a cluster. Every cluster has a leader node; any additional nodes customers add are compute nodes. Each cluster runs an Amazon Redshift engine and contains one or more databases.

  • Amazon Redshift is built on ParAccel by Actian, a massively parallel processing (MPP) data warehouse technology. The product is a simple and cost-effective way to analyze all business data using customers' existing business intelligence tools.
  • Amazon Redshift is a relational database management system (RDBMS), so it is compatible with other RDBMS applications. Although it provides the same functionality as a typical RDBMS, including online transaction processing (OLTP) functions such as inserting and deleting data, Amazon Redshift is optimized for high-performance analysis and reporting of very large datasets.

Redshift Features

Data compression reduces storage requirements, thereby reducing disk I/O, which improves query performance. When you execute a query, the compressed data is read into memory, then uncompressed during query execution. 

  • Loading less data into memory enables Amazon Redshift to allocate more memory to analyzing the data. 
  • Because columnar storage stores similar data sequentially, Amazon Redshift is able to apply adaptive compression encodings specifically tied to columnar data types. 
  • The best way to enable data compression on table columns is by allowing Amazon Redshift to apply optimal compression encodings when you load the table with data.
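
A minimal sketch of this approach, assuming a psycopg2 connection and hypothetical cluster, table, bucket, and IAM role names: an initial COPY with COMPUPDATE ON lets Amazon Redshift sample the incoming data and choose encodings, and ANALYZE COMPRESSION reports what it would recommend for an already loaded table.

```python
# Hedged sketch: let Redshift pick column encodings during an initial load,
# then inspect its recommendations. Connection details, table names, buckets,
# and role ARNs below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="awsuser", password="...",
)
conn.autocommit = True

with conn.cursor() as cur:
    # COMPUPDATE ON asks Redshift to sample the incoming data and apply
    # optimal compression encodings while loading an empty table.
    cur.execute("""
        COPY sales
        FROM 's3://example-bucket/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV
        COMPUPDATE ON;
    """)
    # ANALYZE COMPRESSION reports the encoding Redshift would recommend
    # for each column of a table that already contains data.
    cur.execute("ANALYZE COMPRESSION sales;")
    for row in cur.fetchall():
        print(row)
```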

Massively parallel processing (MPP) allows  fast execution of the most complex queries operating on large amounts of data. Multiple compute nodes handle all query processing leading up to final result aggregation, with each core of each node executing the same compiled query segments on portions of the entire data.

  • Amazon Redshift distributes the rows of a table to the compute nodes so that the data can be processed in parallel. 
  • By selecting an appropriate distribution key for each table, customers can optimize the distribution of data to balance the workload and minimize movement of data from node to node. 
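
As a concrete illustration of choosing a distribution key, the following DDL sketch (hypothetical table and column names) collocates rows that join on customer_id on the same slice and sorts them by sale_date so the planner can skip blocks:

```python
# Illustrative DDL only; the table, columns, and key choices are placeholders.
# Assumes `conn` is an open psycopg2 connection to the cluster, as in the
# compression sketch above.
create_sales = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY
DISTKEY (customer_id)   -- rows with the same customer_id land on the same slice
SORTKEY (sale_date);    -- range filters on sale_date can skip whole blocks
"""

with conn.cursor() as cur:
    cur.execute(create_sales)
```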

Columnar storage for database tables drastically reduces the overall disk I/O requirements and is an important factor in optimizing analytic query performance. 

  • Storing database table information in a columnar fashion reduces the number of disk I/O requests and reduces the amount of data you need to load from disk. 
  • Loading less data into memory enables Amazon Redshift to perform more in-memory processing when executing queries. 
  • When columns are sorted appropriately, the query processor is able to rapidly filter out a large subset of data blocks.

The leader node distributes fully optimized compiled code across all of the nodes of a cluster. Compiling the query eliminates the overhead associated with an interpreter and therefore increases the execution speed, especially for complex queries. 

  • The compiled code is cached and shared across sessions on the same cluster, so subsequent executions of the same query will be faster, often even with different parameters.
  • The execution engine compiles different code for the JDBC connection protocol and for ODBC and psql (libpq) connection protocols, so two clients using different protocols will each incur the first-time cost of compiling the code.

To reduce query execution time and improve system performance, Amazon Redshift caches the results of certain types of queries in memory on the leader node. 

  • When a user submits a query, Amazon Redshift checks the results cache for a valid, cached copy of the query results. If a match is found in the results cache, Amazon Redshift uses the cached results and doesn’t execute the query; result caching is transparent to the user.
  • To maximize cache effectiveness and efficient use of resources, Amazon Redshift doesn’t cache some large query result sets. Whether a result is cached depends on factors such as the number of entries in the cache and the instance type of the customer’s Amazon Redshift cluster.

Amazon Redshift workload management (WLM) enables users to flexibly manage priorities within workloads so that short, fast-running queries won’t get stuck in queues behind long-running queries.

  • Amazon Redshift WLM creates query queues at runtime according to service classes, which define the configuration parameters for various types of queues, including internal system queues and user-accessible queues. 
  • From a user perspective, a user-accessible service class and a queue are functionally equivalent. For consistency, this documentation uses the term queue to mean a user-accessible service class as well as a runtime queue.

Columnar storage for database tables is an important factor in optimizing analytic query performance because it drastically reduces the overall disk I/O requirements and reduces the amount of data you need to load from disk.

  • In a relational database table, each row contains field values for a single record. In row-wise database storage, data blocks store values sequentially for each consecutive column making up the entire row. 
  • If block size is smaller than the size of a record, storage for an entire record may take more than one block. 
  • If block size is larger than the size of a record, storage for an entire record may take less than one block, resulting in an inefficient use of disk space.
  • Amazon Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database loading or other operations that are part of query execution.
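
The following toy Python sketch (not Redshift code, just an illustration) shows why a columnar layout touches less data when a query reads only one column:

```python
# Toy model of row-wise vs. columnar storage for a three-column table.
rows = [(i, f"customer-{i}", i * 1.5) for i in range(1_000)]  # (id, name, amount)

# Row-wise layout: blocks interleave all columns, so scanning "amount"
# still drags the id and name values through I/O.
row_wise_values_read = sum(len(r) for r in rows)              # 3,000 values

# Columnar layout: each column is stored contiguously, so a scan of
# "amount" reads only that column.
columns = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}
columnar_values_read = len(columns["amount"])                 # 1,000 values

print(row_wise_values_read, columnar_values_read)
```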

The Amazon Redshift query execution engine incorporates a query optimizer that is MPP-aware and also takes advantage of the columnar-oriented data storage. 

  • The Amazon Redshift query optimizer implements significant enhancements and extensions for processing complex analytic queries that often include multi-table joins, subqueries, and aggregation.

Data Warehousing

The core infrastructure component of an Amazon Redshift data warehouse is a cluster.

  • A cluster is composed of one or more compute nodes. If a cluster is provisioned with two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication. Customers’ client applications interact directly only with the leader node. The compute nodes are transparent to external applications.
  • The leader node manages communications with client programs and all communication with compute nodes. It parses and develops execution plans to carry out database operations. Based on the execution plan, the leader node compiles code, distributes the compiled code to the compute nodes, and assigns a portion of the data to each compute node.
    • The leader node distributes SQL statements to the compute nodes only when a query references tables that are stored on the compute nodes. All other queries run exclusively on the leader node.
  • The leader node compiles code for individual elements of the execution plan and assigns the code to individual compute nodes. The compute nodes execute the compiled code and send intermediate results back to the leader node for final aggregation.
    • Each compute node has its own dedicated CPU, memory, and attached disk storage, which are determined by the node type.
    • Amazon Redshift provides two node types: dense storage nodes and dense compute nodes. Each node type is available in two sizes with different storage capacities.
  • A compute node is partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it processes a portion of the workload assigned to the node. The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices. 
    • The slices then work in parallel to complete the operation.
    • The number of slices per node is determined by the node size of the cluster.

Data warehousing extracts data in periodic stages, or as it is generated, which makes it more efficient and simpler to process queries over data that originally came from different sources. The raw data is turned into high-quality information that meets enterprise reporting requirements for all levels of users.

  • Because Amazon Redshift combines big data and data warehousing capabilities, companies can build powerful applications and generate reports that provide all of the data they need to run a business.
  • A data warehouse is a database designed to enable AWS clients to perform business intelligence activities; it exists to help them understand and enhance their organization’s performance.
  • Data warehouses are consumers of data and are known as online analytical processing (OLAP) systems. 
  • The data for a data warehouse system can come from Online Transactional Processing (OLTP) systems, Enterprise Resource Planning (ERP) systems such as SAP, internally developed systems, and so on.
  • OLTP databases collect a lot of data quickly, but OLAP databases typically import large amounts of data from various source systems by using batch processes and scheduled jobs.
  • Data warehouses are distinct from OLTP systems. With a data warehouse, customers separate the analytical workload from the transaction workload. As such they are very much read-oriented systems. 
    • They have a far higher amount of data reading versus writing and updating.

Amazon Redshift takes advantage of high-bandwidth connections, close proximity, and custom communication protocols to provide private, very high-speed network communication between the leader node and compute nodes. The compute nodes run on a separate, isolated network that client applications never access directly.

AWS Integration

Amazon Redshift lets customers quickly and simply work with their data in open formats, and easily connects to the AWS ecosystem. They can query open file formats such as Parquet, ORC, JSON, Avro, CSV, and more directly in S3 using familiar ANSI SQL. To export data to the data lake, they simply use the Redshift UNLOAD command in their SQL code and specify Parquet as the file format; Redshift automatically takes care of data formatting and data movement into S3. 
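
A minimal sketch of such an export, with placeholder table, bucket, and IAM role names, assuming the same psycopg2 connection as in the earlier sketches:

```python
# Hedged sketch: export query results to the data lake as Parquet via UNLOAD.
# Assumes `conn` is an open psycopg2 connection to the Redshift cluster.
unload_to_lake = """
UNLOAD ('SELECT customer_id, SUM(amount) AS total FROM sales GROUP BY customer_id')
TO 's3://example-datalake/sales_summary/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
"""

with conn.cursor() as cur:
    cur.execute(unload_to_lake)
```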

Customers can use AWS Data Pipeline to automate data movement and transformation into and out of Amazon Redshift. 

  • By using the built-in scheduling capabilities of AWS Data Pipeline, they can schedule and execute recurring jobs without having to write their  own complex data transfer or transformation logic.

Amazon DynamoDB is a fully managed NoSQL database service. The COPY command leverages the Amazon Redshift massively parallel processing (MPP) architecture to read and load data in parallel from an Amazon DynamoDB table. 

  • Customers can take maximum advantage of parallel processing by setting distribution styles on their  Amazon Redshift tables.
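
A hedged sketch of such a load (hypothetical table names and role ARN; READRATIO caps how much of the DynamoDB table's provisioned read capacity the COPY may consume):

```python
# Bulk-load a DynamoDB table into Redshift with COPY.
# Assumes `conn` is an open psycopg2 connection, as in the earlier sketches.
copy_from_dynamodb = """
COPY customer_profiles
FROM 'dynamodb://CustomerProfiles'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDynamoRole'
READRATIO 50;  -- consume at most 50% of the table's provisioned read capacity
"""

with conn.cursor() as cur:
    cur.execute(copy_from_dynamodb)
```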

Customers can migrate data to Amazon Redshift using AWS Database Migration Service. AWS DMS can migrate their data to and from most widely used commercial and open-source databases such as Oracle, PostgreSQL, Microsoft SQL Server, Amazon Redshift, Aurora, DynamoDB, Amazon S3, MariaDB, and MySQL.

Amazon Simple Storage Service (Amazon S3) is a web service that stores data in the cloud. Amazon Redshift leverages parallel processing to read and load data from multiple data files stored in Amazon S3 buckets.

  • Customers can also use parallel processing to export data from their Amazon Redshift data warehouse to multiple data files on Amazon S3. 

Customers can use the COPY command in Amazon Redshift to load data from one or more remote hosts, such as Amazon EMR clusters, Amazon EC2 instances, or other computers. 

  • COPY connects to the remote hosts using SSH and executes commands on the remote hosts to generate data. Amazon Redshift supports multiple simultaneous connections. 
  • The COPY command reads and loads the output from multiple host sources in parallel.

Lake House Architecture

Amazon Redshift powers a lake house architecture that enables customers to query data across their data warehouse, data lake, and operational databases to gain faster and deeper insights not possible otherwise. With a lake house architecture, customers can store data in open file formats in an Amazon S3 data lake, which lets them easily make this data available to other analytics and machine learning tools rather than locking it in a new silo. Using the Amazon Redshift lake house architecture, AWS clients can:

  • Easily query data in their data lake and write data back to the data lake in open formats.
  • Use familiar SQL statements to combine and process data across all their data stores.
  • Execute queries on live data in the operational databases without requiring any data loading and ETL pipelines.

Federated Query

Federated Query enables Amazon Redshift to query data directly in Amazon RDS and Aurora PostgreSQL stores. This allows you to incorporate timely and up-to-date operational data in your reporting and BI applications, without any ETL operations. 
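
A sketch of the setup, assuming an Aurora PostgreSQL source with placeholder endpoint, database, and ARNs: an external schema maps the operational database into Redshift, after which its tables can be joined against local warehouse tables.

```python
# Hedged federated-query sketch; all identifiers and ARNs are placeholders.
# Assumes `conn` is an open psycopg2 connection to the Redshift cluster.
federated_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg
FROM POSTGRES
DATABASE 'ordersdb' SCHEMA 'public'
URI 'aurora-pg.cluster-abc123.us-east-1.rds.amazonaws.com' PORT 5432
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:aurora-pg-creds';
"""

with conn.cursor() as cur:
    cur.execute(federated_schema)
    # Join live operational rows against warehouse tables without any ETL.
    cur.execute("""
        SELECT s.customer_id, SUM(s.amount) AS lifetime_value, o.status
        FROM sales s
        JOIN apg.orders o ON o.customer_id = s.customer_id
        GROUP BY s.customer_id, o.status;
    """)
    print(cur.fetchall())
```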

Database Migration Service (DMS)

AWS Database Migration Service (DMS) is a self-service tool you can use to migrate your data from the most widely used commercial data warehouses to Amazon Redshift. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database.

Redshift Spectrum

Query open-format data directly in the Amazon S3 data lake without having to load the data or duplicate your infrastructure. Using the Amazon Redshift Spectrum feature, clients can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. 
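
A hedged sketch of a Spectrum setup, using a placeholder AWS Glue database, S3 location, and IAM role: an external schema points at the data catalog, an external table describes the Parquet files, and ordinary SQL queries them in place.

```python
# Redshift Spectrum sketch; identifiers, paths, and ARNs are placeholders.
# Assumes `conn` is an open psycopg2 connection to the Redshift cluster.
spectrum_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
FROM DATA CATALOG
DATABASE 'clickstream'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

external_table = """
CREATE EXTERNAL TABLE spectrum.page_views (
    user_id   BIGINT,
    url       VARCHAR(2048),
    viewed_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://example-datalake/page_views/';
"""

with conn.cursor() as cur:
    cur.execute(spectrum_schema)
    cur.execute(external_table)
    # The external table is queried in place; nothing is loaded into the cluster.
    cur.execute("SELECT COUNT(*) FROM spectrum.page_views;")
    print(cur.fetchone())
```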

Data Lake Export

Save the results of an Amazon Redshift query directly to your S3 data lake in an open file format (Apache Parquet) using Data Lake Export. AWS customers can then analyze this data using the Amazon Redshift Spectrum feature as well as other AWS services such as Amazon SageMaker for machine learning and Amazon EMR for ETL operations. 

Amazon DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service which provides fast and predictable performance with seamless scalability, and it enables developers to build modern, serverless applications that can start small and scale globally to support petabytes of data and tens of millions of read and write requests per second. DynamoDB also offers encryption at rest, which eliminates the operational burden and complexity involved in protecting sensitive data. DynamoDB is designed to run high performance, internet-scale applications that would overburden traditional relational databases.

  • DynamoDB enables customers to create database tables that can store and retrieve any amount of data and serve any level of request traffic. They can scale up or down their tables’ throughput capacity without downtime or performance degradation. 



DynamoDB Features

DynamoDB is serverless: there are no servers to provision, patch, or manage, and no software to install, maintain, or operate. DynamoDB automatically scales tables to adjust for capacity and maintains performance with zero administration. DynamoDB provides two capacity modes for each table, on-demand and provisioned:

  • For workloads that are less predictable, customers can use on-demand capacity mode. For tables using on-demand capacity mode, DynamoDB instantly accommodates customers’ workloads as they ramp up or down to any previously reached traffic level.
  • Tables using provisioned capacity mode require customers to set read and write capacity, which can be more cost effective for predictable workloads. For tables using provisioned capacity, DynamoDB delivers automatic scaling of throughput and storage based on previously set capacity by monitoring the performance usage of the application.
  • DynamoDB integrates with AWS Lambda to provide triggers. Using triggers, clients can automatically execute a custom function when item-level changes in a DynamoDB table are detected. With triggers, they can build applications that react to data modifications in DynamoDB tables. 
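
A minimal sketch of such a trigger function (the handler logic and the reaction to each event are hypothetical; the Records/eventName/image layout is the shape DynamoDB Streams delivers to Lambda):

```python
# AWS Lambda handler attached to a DynamoDB Streams trigger.
def lambda_handler(event, context):
    for record in event.get("Records", []):
        change = record.get("dynamodb", {})
        if record["eventName"] == "INSERT":
            # React to the item-level change; here we simply log it.
            print("New item:", change.get("NewImage"))
        elif record["eventName"] == "MODIFY":
            print("Item changed:", change.get("OldImage"), "->", change.get("NewImage"))
        elif record["eventName"] == "REMOVE":
            print("Item deleted:", change.get("OldImage"))
    return {"processed": len(event.get("Records", []))}
```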

DynamoDB is built for mission-critical workloads, including support for ACID transactions for a broad set of applications that require complex business logic. DynamoDB helps secure clients’ data with encryption and continuously backs up their data for protection, with guaranteed reliability through a service level agreement.

  • DynamoDB encrypts all customer data at rest by default. Encryption at rest enhances the security of customers’ data by using encryption keys stored in AWS Key Management Service. With encryption at rest, they can build security-sensitive applications that meet strict encryption compliance and regulatory requirements.
  • Point-in-time recovery (PITR) helps protect customer’s DynamoDB tables from accidental write or delete operations. PITR provides continuous backups of their DynamoDB table data, and they can restore that table to any point in time up to the second during the preceding 35 days.
  • On-demand backup and restore allows customers to create full backups of their DynamoDB tables’ data for data archiving, which can help them meet their corporate and governmental regulatory requirements.

DynamoDB is a key-value and document database that can support tables of virtually any size with horizontal scaling. This enables DynamoDB to scale to more than 10 trillion requests per day with peaks greater than 20 million requests per second, over petabytes of storage.

  • DynamoDB supports both key-value and document data models. This enables DynamoDB to have a flexible schema, so each row can have any number of columns at any point in time. This allows customers to easily adapt the tables as their business requirements change, without having to redefine the table schema as they would in relational databases.
  • DynamoDB Accelerator (DAX) is a fully managed in-memory cache that delivers fast read performance for users’ tables at scale.
  • DynamoDB global tables replicate customer’s data automatically across their choice of AWS Regions and automatically scale capacity to accommodate their workloads.
  • DynamoDB Streams capture a time-ordered sequence of item-level modifications in any DynamoDB table and store this information in a log for up to 24 hours.

High Availability and Durability:- DynamoDB automatically spreads the data and traffic for users’ tables over a sufficient number of servers to handle their throughput and storage requirements, while maintaining consistent and fast performance. 

  • All data is stored on solid-state disks (SSDs) and is automatically replicated across multiple Availability Zones in an AWS Region, providing built-in high availability and data durability.

Amazon DynamoDB global tables provide a managed solution for deploying a multiregion, multi-master database. Global tables let customers specify the AWS Regions where they want the table to be available. 

  • DynamoDB performs all of the necessary tasks to create identical tables in these Regions and propagate ongoing data changes to all of them.

Amazon DynamoDB transactions simplify the developer experience of making coordinated, all-or-nothing changes to multiple items both within and across tables. Transactions provide atomicity, consistency, isolation, and durability (ACID) in DynamoDB, helping customers to maintain data correctness in their applications.

  • AWS clients can use the DynamoDB transactional read and write APIs to manage complex business workflows that require adding, updating, or deleting multiple items as a single, all-or-nothing operation. 
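
A hedged sketch of such an all-or-nothing operation across two hypothetical tables, using the boto3 transact_write_items API: the order is recorded and the stock is decremented, or neither change happens.

```python
# All-or-nothing write across two placeholder tables (Orders, Inventory).
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.transact_write_items(
    TransactItems=[
        {   # Record the order only if it does not already exist.
            "Put": {
                "TableName": "Orders",
                "Item": {
                    "OrderId": {"S": "order-1001"},
                    "CustomerId": {"S": "cust-42"},
                    "Status": {"S": "PLACED"},
                },
                "ConditionExpression": "attribute_not_exists(OrderId)",
            }
        },
        {   # Decrement stock only if enough inventory remains.
            "Update": {
                "TableName": "Inventory",
                "Key": {"ProductId": {"S": "prod-7"}},
                "UpdateExpression": "SET Quantity = Quantity - :one",
                "ConditionExpression": "Quantity >= :one",
                "ExpressionAttributeValues": {":one": {"N": "1"}},
            }
        },
    ]
)
```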

DynamoDB Components

  • Partition key and sort key:- Referred to as a composite primary key, because it is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key.
    • DynamoDB uses the partition key value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored. 
    • All items with the same partition key value are stored together, in sorted order by sort key value.
  • Secondary Index:- A secondary index lets customers query the data in the table using an alternate key, in addition to queries against the primary key. DynamoDB supports two kinds of indexes:
    • Global secondary index:- An index with a partition key and sort key that can be different from those on the table.
    • Local secondary index:- An index that has the same partition key as the table, but a different sort key.
    • Each table in DynamoDB has a limit of 20 global secondary indexes (default limit) and 5 local secondary indexes per table.
  • DynamoDB Streams:- DynamoDB Streams is an optional feature that captures data modification events in DynamoDB tables. The data about these events appear in the stream in near-real time, and in the order that the events occurred, and each event is represented by a stream record. When a stream on a table is enabled, DynamoDB Streams writes a stream record whenever one of the following events occurs:
    • A new item is added to the table: The stream captures an image of the entire item, including all of its attributes.
    • An item is updated: The stream captures the “before” and “after” image of any attributes that were modified in the item.
    • An item is deleted from the table: The stream captures an image of the entire item before it was deleted.
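
An illustrative table definition tying these components together (table, attribute, and index names are hypothetical): a composite primary key, one global secondary index, and DynamoDB Streams capturing before and after images.

```python
# Create a placeholder table with a composite key, a GSI, and Streams enabled.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="CustomerOrders",
    AttributeDefinitions=[
        {"AttributeName": "CustomerId", "AttributeType": "S"},
        {"AttributeName": "OrderDate", "AttributeType": "S"},
        {"AttributeName": "OrderStatus", "AttributeType": "S"},
    ],
    KeySchema=[                                    # composite primary key
        {"AttributeName": "CustomerId", "KeyType": "HASH"},   # partition key
        {"AttributeName": "OrderDate", "KeyType": "RANGE"},   # sort key
    ],
    GlobalSecondaryIndexes=[{                      # alternate query path by status
        "IndexName": "StatusIndex",
        "KeySchema": [
            {"AttributeName": "OrderStatus", "KeyType": "HASH"},
            {"AttributeName": "OrderDate", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",                 # on-demand capacity mode
    StreamSpecification={                          # record item-level changes
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
```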

Tables, items, and attributes are the core components of DynamoDB. A table is a collection of items, and each item is a collection of attributes. DynamoDB uses primary keys to uniquely identify each item in a table and secondary indexes to provide more querying flexibility. DynamoDB Streams enables users to capture data modification events in DynamoDB tables.

  • Tables:- DynamoDB stores data in tables, and a table is a collection of data. 
  • Items:- Each table contains zero or more items. An item is a group of attributes that is uniquely identifiable among all of the other items. Items in DynamoDB are similar in many ways to rows, records, or tuples in other database systems. There is no limit to the number of items customers can store in a table.
  • Attributes:- Each item is composed of one or more attributes. An attribute is a fundamental data element, something that does not need to be broken down any further. 
    • Attributes in DynamoDB are similar in many ways to fields or columns in other database systems.
  • Primary Key:- The primary key uniquely identifies each item in the table, so that no two items can have the same key. DynamoDB supports two different kinds of primary keys:
    • Partition key:- A simple primary key, composed of one attribute known as the partition key. DynamoDB uses the partition key’s value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored.
    • Partition key and sort key:- A composite primary key, composed of two attributes (the partition key and the sort key), as described under DynamoDB Components above.
    • Each primary key attribute must be a scalar (meaning that it can hold only a single value). The only data types allowed for primary key attributes are string, number, or binary.
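
Continuing with the hypothetical CustomerOrders table from the sketch above, writing an item and querying by primary key might look like this with the boto3 resource interface:

```python
# Write one item, then fetch a customer's recent orders by partition key.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb", region_name="us-east-1").Table("CustomerOrders")

# Non-key attributes are schemaless: each item can carry different ones.
table.put_item(Item={
    "CustomerId": "cust-42",                 # partition key
    "OrderDate": "2023-05-01T10:15:00Z",     # sort key
    "OrderStatus": "PLACED",
    "Lines": [{"ProductId": "prod-7", "Qty": 1}],
})

# A query returns all items that share a partition key value, optionally
# narrowed by a sort-key condition, in sort-key order.
response = table.query(
    KeyConditionExpression=Key("CustomerId").eq("cust-42")
    & Key("OrderDate").begins_with("2023-05")
)
for item in response["Items"]:
    print(item["OrderDate"], item["OrderStatus"])
```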

Schemaless Web-scale

DynamoDB is schemaless and suited to Web-scale applications, including social networks, gaming, media sharing, and Internet of Things (IoT). Every table must have a primary key to uniquely identify each data item, but there are no similar constraints on other non-key attributes. DynamoDB can manage structured or semistructured data, including JSON documents.

Customers can use the AWS Management Console or the AWS CLI to work with DynamoDB and perform ad hoc tasks. Applications can use the AWS software development kits (SDKs) to work with DynamoDB using object-based, document-centric, or low-level interfaces.

  • DynamoDB is optimized for compute, so performance is mainly a function of the underlying hardware and network latency. As a managed service, DynamoDB insulates customers from these implementation details, so they can focus on designing and building robust, high-performance applications.

DynamoDB is designed to scale out using distributed clusters of hardware. This design allows increased throughput without increased latency. Customers specify their throughput requirements, and DynamoDB allocates sufficient resources to meet those requirements. There are no upper limits on the number of items per table, nor the total size of that table.

SQL to NoSQL

Web-based applications that have hundreds, thousands, or millions of concurrent users, with terabytes or more of new data generated per day need to use a database, which can handle tens (or hundreds) of thousands of reads and writes per second. Amazon DynamoDB is well-suited for such kinds of workloads. Developers can start with a small amount of provisioned throughput and gradually increase it as their application becomes more popular. DynamoDB scales seamlessly to handle very large amounts of data and very large numbers of users.

NoSQL is a term used to describe non-relational database systems that are highly available, scalable, and optimized for high performance. Instead of the relational model, NoSQL databases (like DynamoDB) use alternate models for data management, such as key-value pairs or document storage.

NoSQL Workbench

NoSQL Workbench for Amazon DynamoDB is a cross-platform client-side application for modern database development and operations and is available for Windows and macOS. NoSQL Workbench is a unified visual tool that provides data modeling, data visualization, and query development features to help you design, create, query, and manage DynamoDB tables.

  • Data Modeling:- With NoSQL Workbench for DynamoDB, you can build new data models from, or design models based on, existing data models that satisfy your application’s data access patterns. You can also import and export the designed data model at the end of the process.
  • Data Visualization:- The data model visualizer provides a canvas where you can map queries and visualize the access patterns (facets) of the application without having to write code. Every facet corresponds to a different access pattern in DynamoDB. You can manually add data to your data model or import data from MySQL.
  • Operation Building:- NoSQL Workbench provides a rich graphical user interface for you to develop and test queries. You can use the operation builder to view, explore, and query datasets. You can also use the structured operation builder to build and perform data plane operations. It supports projection and condition expression, and lets you generate sample code in multiple languages.

DynamoDB Global Tables

Amazon DynamoDB global tables provide a fully managed solution for deploying a multiregion, multi-master database, so customers do not have to build and maintain their own replication solutions. With global tables, customers can specify the AWS Regions where they want the table to be available. DynamoDB performs all of the necessary tasks to create identical tables in these Regions and propagate ongoing data changes to all of them.

  • DynamoDB global tables are ideal for massively scaled applications with globally dispersed users. 
  • Global tables provide automatic multi-master replication to AWS Regions worldwide, enabling customers to deliver low-latency data access to their users no matter where they are located.
  • Transactional operations provide atomicity, consistency, isolation, and durability (ACID) guarantees only within the region where the write is made originally. Transactions are not supported across regions in global tables.
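
A hedged sketch of adding a replica Region to the hypothetical CustomerOrders table (global tables version 2019.11.21, which requires Streams with new and old images, as enabled in the earlier table sketch):

```python
# Add a replica in eu-west-1; DynamoDB creates the replica table and
# propagates ongoing changes. Table and Region choices are placeholders.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.update_table(
    TableName="CustomerOrders",
    ReplicaUpdates=[
        {"Create": {"RegionName": "eu-west-1"}},
    ],
)
```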

DynamoDB Accelerator (DAX)

Amazon DynamoDB is designed for scale and performance. In most cases, the DynamoDB response times can be measured in single-digit milliseconds. However, there are certain use cases that require response times in microseconds. For these use cases, DynamoDB Accelerator (DAX) delivers fast response times for accessing eventually consistent data. DAX is a DynamoDB-compatible caching service that enables you to benefit from fast in-memory performance for demanding applications. DAX addresses three core scenarios:

  • As an in-memory cache, DAX reduces the response times of eventually consistent read workloads by an order of magnitude from single-digit milliseconds to microseconds.
  • DAX reduces operational and application complexity by providing a managed service that is API-compatible with DynamoDB. Therefore, it requires only minimal functional changes to use with an existing application.
  • For read-heavy or bursty workloads, DAX provides increased throughput and potential operational cost savings by reducing the need to overprovision read capacity units. This is especially beneficial for applications that require repeated reads for individual keys.
  • DAX provides access to eventually consistent data from DynamoDB tables with microsecond latency. It is a good fit for AWS clients whose applications require the fastest possible response time for reads, read a small number of items more frequently than others, are read-intensive but also cost-sensitive, or require repeated reads against a large set of data.
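
A rough sketch of reading through DAX, assuming the amazondax Python client package and a placeholder cluster endpoint: the application keeps the familiar table interface, so only the client construction changes.

```python
# Hedged sketch: route reads through a DAX cluster endpoint (placeholder URL).
import amazondax

dax = amazondax.AmazonDaxClient.resource(
    endpoint_url="dax://example-dax.abc123.dax-clusters.us-east-1.amazonaws.com"
)
table = dax.Table("CustomerOrders")

# Repeated eventually consistent reads of the same key are served from the
# in-memory cache after the first miss.
item = table.get_item(Key={
    "CustomerId": "cust-42",
    "OrderDate": "2023-05-01T10:15:00Z",
})
print(item.get("Item"))
```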

Best Practice

In Amazon Redshift, certain key table design decisions heavily influence overall query performance. The design choices that customers make also have a significant effect on storage requirements, which in turn affects query performance by reducing the number of I/O operations and minimizing the memory required to process queries. Customers should therefore apply the best practices presented by AWS for optimizing query performance. Here are some of them:

  • Choose the Best Sort Key:- Amazon Redshift stores customers’ data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.
  • Choose the Best Distribution Style:- When customers execute a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is run.
  • Define Primary Key and Foreign Key Constraints:- Define primary key and foreign key constraints between tables wherever appropriate. Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.
  • Use Date/Time Data Types for Date Columns:- Amazon Redshift stores DATE and TIMESTAMP data more efficiently than CHAR or VARCHAR, which results in better query performance. Use the DATE or TIMESTAMP data type, depending on the resolution you need, rather than a character type when storing date/time information.
  • Use a COPY Command to Load Data:- The COPY command loads data in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or multiple data sources on remote hosts. COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.
  • Split Load Data into Multiple Files:- The COPY command loads the data in parallel from multiple files, dividing the workload among the nodes in the customer’s cluster. The number of files should be a multiple of the number of slices in the cluster.
  • Compress Data Files:- Individually compress load files using gzip, lzop, bzip2, or Zstandard for large datasets, as in the sketch below.
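
A sketch of the split-and-compress loading pattern under the assumptions above (placeholder bucket, prefix, table, and role ARN): several gzip parts are uploaded under one S3 prefix, then a single COPY against that prefix lets every slice load a file in parallel.

```python
# Upload a few compressed parts, then COPY them into Redshift in parallel.
import gzip
import boto3

s3 = boto3.client("s3")
bucket, prefix = "example-bucket", "load/sales/part-"

# Ideally the number of parts is a multiple of the number of slices in the cluster.
parts = ["1|101|2023-05-01|19.99\n", "2|102|2023-05-02|5.00\n"]
for i, chunk in enumerate(parts):
    s3.put_object(Bucket=bucket,
                  Key=f"{prefix}{i:04d}.gz",
                  Body=gzip.compress(chunk.encode()))

copy_parts = f"""
COPY sales
FROM 's3://{bucket}/{prefix}'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
DELIMITER '|'
GZIP;
"""

with conn.cursor() as cur:   # psycopg2 connection, as in the earlier sketches
    cur.execute(copy_parts)
```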