If you’re as sick of this three-letter phrase as I am, you’ll be happy to know there is a way around it.

If you work in data in 2021, the acronym ETL is everywhere.

Ask certain people what they do, and their whole response will be “ETL.” On LinkedIn, there are thousands of people with the title ETL Developer. It can be a noun, verb, adjective, and even a preposition. (Yes, a mouse can ETL a house.)

Standing for “Extract, Transform, and Load,” ETL refers to the general process of taking batches of data out of one database or application and loading it into another.

Data teams are the masters of ETL as they often have to stick their grubby fingers into tools and databases of other teams — the software engineers, marketers, and operations folk — to prep a company’s data for deeper, custom analyses.

The good news is that, with a bit of foresight, data teams can take most of the ETL onus off their plates entirely. How is this possible?

Replacing ETL with IDT

The path forward is IDT, or Intentional Data Transfer. You see, the need for ETL arises because no one builds their user database or CMS with downstream analytics in mind.

Instead of making the data team run select * from purchases_table where event_date > now() - interval '1 hour' every hour… you can add logic in the application code that processes events first and emits them via a pub/sub model.
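To make the idea concrete, here's a minimal in-process sketch of that pub/sub pattern. The `emit`/`subscribe` names are illustrative, not a real library API; in production the publish step would hand off to something like SNS instead of an in-memory list:

```python
import json

# Illustrative in-process pub/sub: application code processes the raw
# purchase, then emits a clean event to whoever has subscribed.
subscribers = []

def subscribe(handler):
    """Register a callback that receives each published event."""
    subscribers.append(handler)
    return handler

def emit(event: dict) -> None:
    """Publish one event; in production this would be e.g. an SNS publish."""
    payload = json.dumps(event)  # events cross the wire as JSON
    for handler in subscribers:
        handler(json.loads(payload))  # each subscriber gets its own copy

def record_purchase(user_id: int, item: str) -> None:
    # ...normal application logic (charge card, update purchases table)...
    emit({
        "event_name": "transaction",
        "user_id": user_id,
        "event_action": "purchase",
        "action_object": item,
    })
```

The key design point is that the application, not the data team's cron job, decides when and what to publish.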

Example IDT architecture on AWS with a real-time Lambda consumer + durable storage to S3

With no wasted effort, the data team can set up a subscriber process to receive these events and process them in real-time (or store them durably in S3 for later use). All it takes is one brave soul on the data team to muster the confidence to ask this of the core engineering squad.
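As a sketch of the durable-storage half, the snippet below partitions events into S3 keys by event time so later batch jobs can read a day or an hour at a time. The key layout and function names are assumptions, and the boto3 S3 client is passed in rather than created here:

```python
import json
from datetime import datetime, timezone

def s3_key_for(event: dict) -> str:
    """Build an S3 key partitioned by UTC event time.
    The events/YYYY/MM/DD/HH layout is an assumption, not a standard."""
    ts = datetime.fromisoformat(event["event_timestamp"]).astimezone(timezone.utc)
    return f"events/{ts:%Y/%m/%d/%H}/{event['event_name']}-{event['user_id']}.json"

def store_durably(s3_client, bucket: str, event: dict) -> str:
    """Write one event to S3 for later use; s3_client is a boto3 S3 client."""
    key = s3_key_for(event)
    s3_client.put_object(Bucket=bucket, Key=key, Body=json.dumps(event))
    return key
```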

Ten years ago, I get it. Data teams were beginning to establish their capabilities and needs, and such an ask might be met with a healthy dose of resistance.

A decade later, however, that excuse no longer flies. And if you are on a data team doing solely traditional ETL on internal datasets, it’s time you upped your game.

Beyond the obvious benefit of no longer maintaining costly, inefficient ETL processes, there are other benefits to IDT.

1. IDT Forces Upfront Agreement on a Data Model Contract

How many times has one team changed a database table’s schema, only to later learn the change broke a downstream analytics report? Any analytics veteran will tell you it’s a data tale as old as time! 

Frankly, it is difficult to establish the cross-team communication and awareness needed to avoid these issues… when your ETL scripts run directly against raw database tables.

Instead, with IDT, when an event occurs, it is published with a set of fields that have been agreed upon and documented in advance. For example, a purchase by a customer might look like this:

{
  "event_name": "transaction",
  "user_id": 12345,
  "event_action": "purchase",
  "action_object": "gift_card",
  "event_timestamp": "2021-01-02T03:04:05+01:00",
  ...
}

Any changes engineering makes to the purchases table's schema will not affect the fields in the IDT publisher's events. And everyone should know that any change to this JSON contract needs to be communicated first.
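A contract like this is also easy to enforce mechanically. The required field names below mirror the sample event above; `validate_event` is a hypothetical helper, not part of any library:

```python
# The agreed-upon contract fields, taken from the sample purchase event.
REQUIRED_FIELDS = {
    "event_name", "user_id", "event_action",
    "action_object", "event_timestamp",
}

def validate_event(event: dict) -> list:
    """Return the sorted list of missing contract fields (empty means valid)."""
    return sorted(REQUIRED_FIELDS - event.keys())
```

Either side of the contract can run this check: the publisher before emitting, or the subscriber before processing.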

2. IDT Removes Data Processing Latencies

Most often, ETL jobs run once per day, overnight. But I've also worked on projects where they ran incrementally every 5 minutes. It all depends on the requirements of the data consumers.

What doesn't change is that there will always be some latency between an event occurring and the data team receiving it. That latency limits what you can do with the data and introduces tricky edge cases into any data application.

With IDT, however, events are published immediately as they happen. Using real-time services like Amazon SNS, SQS, and Lambda, they can be responded to immediately.
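For example, a real-time Lambda consumer subscribed to an SNS topic might look like the sketch below. The envelope shape (`Records[].Sns.Message`) is AWS's documented SNS-to-Lambda event format; what the handler does with each message is illustrative:

```python
import json

def lambda_handler(event, context):
    """Real-time consumer sketch: SNS invokes this Lambda with a batch of
    records; we unwrap each published message and react immediately."""
    processed = 0
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        # React right away, e.g. update a dashboard or trigger an alert.
        print(f"purchase by user {message['user_id']}")
        processed += 1
    return {"processed": processed}
```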

That doesn't mean you must implement a streaming-based process, but at least you have the option.

Taking The First Steps

Moving from ETL to IDT isn’t a transformation that will happen for all your datasets overnight. Such an all-encompassing change would be overwhelming.

Taking one dataset to start, though, and setting up a pub/sub messaging pattern for it is extremely doable. Make a Hackathon out of it if you need to.

My advice is to find a use case that would most benefit from real-time data processing — whether it’s a feed of users’ current locations or cancellation events — then transition it from ETL to the IDT pattern.

And maybe one day, the phrase ETL will never be uttered in your presence again!

1 Comment

  1. Great post Paul! I’d encourage anybody reading it to check out Snowplow – we’ve never called it intentional data transfer – we see it more as Behavioral Data Generation and Management – but it’s the same concept, and we’ve been building this in an open-source way since 2012.
