Note: A version of this article was published on Equinox’s tech blog in Jan 2019
On the surface, it would appear there is not much similar between Redshift—AWS’s cloud data warehouse service launched in 2012—and Postgres—one of the most popular open source databases first introduced in 1989. One is optimized for analytic workloads (Redshift); the other performs better when requests come as frequent, small transactions (Postgres). One is proprietary, one is open-source. One has a vertical data storage model, the other horizontal.
It should come as a little bit surprising, then, that an oft-mentioned fact states that Redshift is “based on” Postgres version 8.0.2. “Based on” is very general phrasing and the truth is that Redshift’s being based on Postgres has little practical significance. Nevertheless, the common ancestry of the two technologies does allow for some interoperability between them, one example of which will be explored in this article.
First, some quick background about our usage of Redshift here at Equinox, and a problem we were able to solve with it.
On the data team, we use Redshift’s storage and compute capabilities extensively to support our analytic workloads. The general process involves first landing data from various sources like production databases and our Salesforce instance into our Redshift warehouse (nicknamed Jarvis after Iron Man’s computer). To perform this ELT we use a mix of home-grown tooling and the proprietary Informatica Cloud service. Once landed, transaction blocks of SQL queries are executed on the raw data to transform it into the various fact and dimension tables that serve as the home for all metrics we keep about our users and business.
One of the most widely-consumed metrics at Equinox is membership sales data, which is derived from what we call contract actions.
During peak hours, there can be hundreds of internal users opening sales reports at one time. And every time another user views or refreshes a sales report, a query is dispatched to be executed on Redshift to pull the latest data. Given Redshift’s concurrency limitations, this potential load carries the risk of overwhelming the system and drastically degrading performance (or forcing us to overprovision resources).
With our legacy warehouse backed by SQL server, traffic of this sort was not as immediate an issue. But when switching to Redshift, we were mindful of its limitation in this regard. As great as it is for performing computations on data, Redshift is not an ideal solution for making that data available to downstream consumers. So, we explored other options.
The first proposed solution involved using Redis as a cache layer to hold commonly used data. While Redis would have provided many benefits—like extremely fast performance for retrieval of data—it would also require designing and maintaining a custom ETL solution to map data from a relational structure to Redis’ key-value model, as well as a reporting solution to consume data in this format.
Luckily in our research we found a post by Tony Gibbs on the AWS Big Data Blog titled JOIN Amazon Redshift AND Amazon RDS PostgreSQL WITH dblink. It presented the Postgres module dblink that supports connections to other Postgres-compatible databases. Given Redshift’s compatibility with Postgres, using dblink could allow for querying Redshift data on a Postgres database natively with minimal ETL work.
We quickly realized this feature was an elegant solution to our sales reporting problem and thus project Vision at Equinox was born. Outlined below are the steps we took to implement this solution within our data infrastructure in a way that serves our current needs and can also scale to meet future ones.
Going from right to left, the steps to update the membership_sales table in Postgres are:
- Refresh the stage_contract_action Redshift table with the last couple hours of contract_action_history data
- Refresh the materialized view in Postgres with the latest data from the stage_contract_action table
- Perform standard upsert logic into contract_action Postgres table to add latest data
There you have it! Using nothing other than pure SQL, the downstream consumption burden is transferred to Postgres, and Redshift can focus on what it does best.
Since launched one year ago, the Vision architecture has served us well in providing sales metrics data with low-latency to many concurrent users. With minimal maintenance on the RDS instance, we’ve seen stable performance and have not dealt with data discrepancy issues.
While our current needs are met by a single RDS instance, as food for thought, it is interesting to think about scaling the user-facing part of the system. By introducing multiple read-replicas of RDS placed behind an Elastic Load Balancer, the same design could be used to serve thousands or even tens of thousands of users.