Amazon Redshift is the managed data warehouse service offered by Amazon Web Services. It consists of a collection of computing resources called nodes, which are organized into a group called a cluster. The first node is the leader node; any additional nodes customers add are called compute nodes. Each cluster runs an Amazon […]
Lake House Architecture
Amazon Redshift powers the lake house architecture – enabling customers to query data across their data warehouse, data lake, and operational databases to gain faster and deeper insights that would not otherwise be possible. With a lake house architecture, customers can store data in open file formats in their Amazon S3 data lake. This makes the data easily available to other analytics and machine learning tools, rather than locking it in a new silo.
ETL and ELT
There are two common design patterns when moving data from source systems to a data warehouse. The primary difference between the two patterns is the point in the data-processing pipeline at which transformations happen. This also determines the set of tools used to ingest and transform the data, along with the underlying data structures, queries, and optimization engines used to analyze the data. The first pattern is ETL (extract, transform, load), which transforms the data before it is loaded into the data warehouse. The second pattern is ELT (extract, load, transform), which loads the data into the data warehouse and uses the familiar SQL semantics and power of the Massively Parallel Processing (MPP) architecture to perform the transformations within the data warehouse.
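The ELT pattern can be sketched in Redshift SQL as a COPY followed by an in-warehouse transformation. The bucket, IAM role, and table names below are illustrative, not from the original text:

```sql
-- Load (the "L" in ELT): ingest raw files from S3 into a staging table.
-- Bucket, role ARN, and table names are hypothetical examples.
COPY staging_sales
FROM 's3://example-bucket/raw/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole'
FORMAT AS CSV;

-- Transform (the "T"): reshape and aggregate inside the warehouse,
-- letting the MPP engine execute the SQL in parallel across compute nodes.
INSERT INTO sales_fact (sale_date, product_id, total_amount)
SELECT CAST(sale_ts AS DATE), product_id, SUM(amount)
FROM staging_sales
GROUP BY CAST(sale_ts AS DATE), product_id;
```

In the ETL pattern, by contrast, the aggregation above would run in an external tool before the data ever reaches the warehouse.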
Using the Amazon Redshift lake house architecture, customers can:
- Easily query data in their data lake and write data back to it in open formats.
- Use familiar SQL statements to combine and process data across all of their data stores.
- Execute queries on live data in their operational databases without requiring any data loading or ETL pipelines.
- Query open format data directly in the Amazon S3 data lake without having to load or duplicate the data. Using the Amazon Redshift Spectrum feature, customers can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. Follow this step-by-step tutorial to get started.
- Use Federated Query to let Amazon Redshift query data directly in Amazon RDS and Aurora PostgreSQL stores. This allows customers to incorporate timely, up-to-date operational data in their reporting and BI applications, without any ETL operations. Watch this 5-minute video or read this tutorial to get started.
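The Spectrum and Federated Query capabilities above both work by registering an external schema that Redshift can join against local tables. A minimal sketch, assuming hypothetical catalog, role, and endpoint names:

```sql
-- Redshift Spectrum: expose S3 data registered in the Glue Data Catalog
-- as an external schema (database and role ARN are illustrative).
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole';

-- Join open-format data in S3 with a local Redshift table, no loading step.
SELECT c.customer_name, SUM(o.amount) AS total_spend
FROM spectrum_schema.orders_parquet o
JOIN dim_customer c ON c.customer_id = o.customer_id
GROUP BY c.customer_name;

-- Federated Query: attach a live Aurora PostgreSQL database
-- (endpoint, secret ARN, and names are illustrative).
CREATE EXTERNAL SCHEMA ops_schema
FROM POSTGRES
DATABASE 'ops' SCHEMA 'public'
URI 'ops-cluster.example.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleFederatedRole'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:ops-creds';
```

Once the external schemas exist, queries against them use the same SQL as queries against local tables, which is what lets one statement span the warehouse, the data lake, and an operational database.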
Data lake export
Amazon Redshift supports unloading the result of a query to a customer's data lake on S3 in Apache Parquet, an efficient open columnar storage format for analytics. The Parquet format is up to two times faster to unload and consumes up to six times less storage in S3, compared to text formats. Customers can also specify one or more partition columns, so that unloaded data is automatically partitioned into folders in their S3 bucket, improving query performance and lowering the cost for downstream consumption of the unloaded data.
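A partitioned Parquet export like the one described above uses the UNLOAD command. The bucket, role, and column names here are illustrative:

```sql
-- Export query results to the S3 data lake as Parquet, partitioned by
-- sale_date so each date lands in its own folder (names are hypothetical).
UNLOAD ('SELECT sale_date, region, amount FROM sales_fact')
TO 's3://example-bucket/export/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleUnloadRole'
FORMAT AS PARQUET
PARTITION BY (sale_date);
```

Downstream tools that prune on the partition column then read only the folders for the dates they need, which is where the query-performance and cost benefits come from.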