The motivation of this blog is from my work experience. Previously, when I am pretty green hand in engineer path, I thought the database our company is using AWS redshift to host our postgres database and even thought AWS RDS is a short hand of AWS Redshift, which is super funny when I am now looking back. After having more experience, I know RDS and Redshift are completely different two things with different use cases now. In this blog, I will cover these two AWS database services and the comparison between them.
RDS
RDS is actually a short hand of relational databases services. It allows us to host a traditional database in cloud. There are many database choices provided in RDS such as MySQL, MariaDB, PostgreSQL, AWS Aurora, Oracle DB and SQL server. All the typical administrative tasks related to running a database, like hardware provision, database setup, patching updates and backing up data etc are automated in RDS. RDS provides high available option, meaning a replica of the the database is automatically maintained in another region.
Architecturally, RDS works on top of virtual instance. There is no concepts of cluster and node when it comes to RDS. The virtual instances can be scaled for performance or storage with a few clicks in AWS console.
Redshift
Redshift is a completely managed data warehouse as a service and can scale up to petabytes of data while offering lightning-fast querying performance. The data structure of Redshift is columnar. The query engine is Postgres with a little bit difference data type. The scaling can happen through elastic size feature.
Redshift is architected over cluster which is difference from RDS. There is a leader node and this node is responsible for client communication, query optimization, execution plan creation and sending tasks to individual nodes for execution.
Comparison
Scaling
RDS: RDS is based on virtual instances. Its scaling is done by reconfiguring the virtual instance capability.
Redshift: Redshift scales with elastic resize. The architecture is complex so the database unavailable time window is certainly higher then that of RDS while scaling. Redshift also has an option called concurrency scaling which can be used to support a virtually unlimited number of concurrent users without a drop in query performance
Storage Capacity
Storage Capacity is a big differentiator of these two services.
RDS: The storage capacity depends on different query engines.
Engine | Capacity |
---|---|
SQL Server | 16TB |
Aurora Engine | 64TB |
Other | 32TB |
Redshift: The limit of Redshift is up to 2PB with ds2.8xlarge type instance。
Performance
RDS: For queries not scanning across millions of rows, it offers better performance. Redshift has a very sophisticated query optimizer and execution planner at work before the actual query execution so when the query is not complex, the overhead of optimization in redshift will make the query executed slowly due to the cost of optimization. In other words, for simple queries, the optimization time in Redshift will be longer than execution time.
Redshift: Performs good when the queries need to scan million of rows and aggregate them.
Both improve performance through key distribution mechanism. RDS offers sharded capability and with carefully designed key and Redshift has options for SORT KEY and DIST KEY.
Data Loading
Since these services have different architecture, the procedure of loading data is different.
RDS: This is closely tied to the underlying database engine that is being used. Importing the data will use the engine specific commands.
Redshift: Importing data will involve copying the complete data to S3 and loading it use COPY command of AWS
Data Structure
RDS:
- It is basic relation data store and row-oriented
- Querying may be vary according to the engine
- Offers support for unique constraint in all database engines
Redshift:
- It is columnar and is optimized for fast retrieval of columns
- Conforms to Postgres standard
- Not good in enforcing unique constraints in insertion keys and expected that the end-users will manage it
Use Cases
RDS
- Traditional databases in cloud and the only requirement is to offload the database management
- Data volume is in TBs and not anticipate a large increase in the the near future
- Online transaction processing use case
- No query that scan across millions of rows and the query complexity is limited
- Analytics workloads are minial and do not interfere with the OLTP workloads
- The cost of budget is low
Redshift
- Petabyte-scale data warehouse and not happy with traditional database
- Analytics and report workload is heavy and can interfere with OLTP
- Queries scan across millions of rows and anticipate even more complexity
- Anticipate a constant query workload and the cluster will be running for the most part of the day (Since Redshift pricing is relatively high and charged by hours)
- Ready for managing the uniqueness of insertion keys
- Take care of DIST KEYs and SORT KEYs and structure data for best performance