Building the 'Impossible' ETL Pipeline

Jameson Welch
February 23, 2023

A real estate investment company with properties across the United States and abroad wanted to mine massive amounts of near real-time vacation rental data. Prices and fees can change daily, so keeping up with a country-sized dataset would require a quarter of a billion requests every day. In fact, several agencies said "it can't be done."

Depending on the host servers, an HTTP request can take anywhere from milliseconds to several seconds on average. Our two host servers took 3 to 5 seconds per request. Handled sequentially, the work would take over 23 years, and we needed it to happen in a single day, before even accounting for errors or peak-traffic slowdowns.

We brought the execution time for that quarter of a billion requests down from 23 years to 16 hours.
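
For a sense of scale, the back-of-envelope math (using the 3-second low end of that latency range) looks something like this:

```python
# Rough arithmetic behind the "23 years vs. 16 hours" claim.
REQUESTS = 250_000_000            # a quarter of a billion requests per day
SECONDS_PER_REQUEST = 3           # low end of the observed 3-5 s latency
SECONDS_PER_YEAR = 365.25 * 24 * 3600
TARGET_WINDOW = 16 * 3600         # the 16-hour window we achieved

sequential = REQUESTS * SECONDS_PER_REQUEST
print(f"Sequential: {sequential / SECONDS_PER_YEAR:.1f} years")        # ~23.8 years

# Effective concurrency needed to finish the same work in 16 hours
concurrency = sequential / TARGET_WINDOW
print(f"Concurrent requests needed: {concurrency:,.0f}")               # ~13,000
print(f"Sustained throughput: {REQUESTS / TARGET_WINDOW:,.0f} req/s")  # ~4,340
```

Sustaining thousands of in-flight requests at once is what drove the asynchronous, auto-scaling design described below.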

On top of the monumental challenge of handling the sheer volume of requests, the few agencies that said they could do it were estimating a yearly operating cost of $150,000. We brought that number down to roughly $25,000. That’s an 84% cost reduction with better performance!

Solution: Single-Responsibility Principle

We separated the pipeline into five phases, using a combination of dedicated and auto-scaling AWS EC2 instances hosting asynchronous, multi-threaded Python applications, S3 buckets, SQS queues, and Snowflake:

1. Discover every listing URL in the US from our two sources.
2. Build a fetch queue from the listing URLs in the database.
3. Fetch the raw data from the listing addresses and store it in a data lake.
4. Build a queue of the raw data to be processed.
5. Parse each listing's raw data for the required data points and store them in a data warehouse.

Separating these phases removed the major bottlenecks and minimized costs: parsing data is a CPU task, while HTTP requests are an I/O task. It is tempting to ask, "why not just parse the data as it comes in?" For a standard ETL pipeline, that might be viable. But when you are making hundreds of millions of requests in a very short period of time, even millisecond operations inside each transaction add up to significant delays.
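
As a rough illustration, assume an inline parse step costs even a modest 10 milliseconds per listing (an assumed figure, not a measurement):

```python
# Why "just parse it as it comes in" breaks down at this scale.
# The 10 ms per-listing parse time is an illustrative assumption.
REQUESTS = 250_000_000
INLINE_PARSE_SECONDS = 0.010      # even a "fast" 10 ms parse step

added_compute = REQUESTS * INLINE_PARSE_SECONDS          # seconds of CPU work
print(f"{added_compute / 86_400:.0f} days of CPU time")  # ~29 days

# Done inline, that work blocks the fetch workers between requests;
# done as a separate phase, it runs on a CPU-optimized fleet in parallel.
```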

First Phase: Discovering the listings

The first phase gathered every listing in the US. Our two sources displayed a maximum of 50 listings per query, regardless of how many actually matched. So we wrote a recursive latitude/longitude grid subdivision algorithm that split each grid square into 4 smaller squares until every square contained no more than 50 listings. While much of the Discovery program required Python's Selenium, we found that a portion of this phase could be handled with asynchronous I/O libraries. That cut the duration by several hours and let us use a less expensive EC2 instance. Once we reached our target listing count, each listing's URL was stored in the Snowflake database, ready for the next phase. Storing metadata along the way let us know when listings were added, taken off the market, or had blackout dates, and it eliminated redundant requests.
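
A simplified sketch of that subdivision logic might look like the following; count_listings and collect_urls stand in for the source-specific query and storage calls, which are assumptions for illustration:

```python
MAX_LISTINGS = 50  # both sources cap results at 50 per query

def discover(bbox, count_listings, collect_urls):
    """Recursively split a (min_lat, min_lon, max_lat, max_lon) box
    until each box returns at most 50 listings, then collect them."""
    if count_listings(bbox) <= MAX_LISTINGS:
        collect_urls(bbox)          # store each listing URL plus its metadata
        return
    min_lat, min_lon, max_lat, max_lon = bbox
    mid_lat = (min_lat + max_lat) / 2
    mid_lon = (min_lon + max_lon) / 2
    # Split the box into four quadrants and recurse into each one
    for quadrant in (
        (min_lat, min_lon, mid_lat, mid_lon),
        (min_lat, mid_lon, mid_lat, max_lon),
        (mid_lat, min_lon, max_lat, mid_lon),
        (mid_lat, mid_lon, max_lat, max_lon),
    ):
        discover(quadrant, count_listings, collect_urls)
```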

Second Phase: Building the Fetch Queue

The second phase grabs each listing URL from the database and sends batches of them to the SQS Fetch queue. Decoupling this task from the Discovery phase allowed both to run more efficiently. We had to consider downtime spent waiting on I/O and the cost structure of SQS, which bills in 64 KB chunks and caps each message at 256 KB. To solve this, we wrote a dynamic sizing algorithm that packed the maximum number of URLs into each message without exceeding the 256 KB limit. This program ran on the same EC2 instance as the first phase's Discovery program, since our selected instance had spare vCPU cores and we were well under the first phase's target time limit. Let no thread go to waste.
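
A minimal sketch of that dynamic sizing idea, assuming the URLs are packed as a JSON array into standard boto3 send_message calls, could look like this:

```python
import json
import boto3

MAX_MESSAGE_BYTES = 256 * 1024   # SQS caps a single message body at 256 KB

def send_url_batches(urls, queue_url, sqs=None):
    """Pack as many URLs as possible into each SQS message without
    exceeding the 256 KB limit, then send each packed batch."""
    sqs = sqs or boto3.client("sqs")
    batch = []
    for url in urls:
        candidate = batch + [url]
        if len(json.dumps(candidate).encode("utf-8")) > MAX_MESSAGE_BYTES:
            # Candidate batch would be too large; flush what we have so far
            sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(batch))
            batch = [url]
        else:
            batch = candidate
    if batch:
        sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(batch))
```

Packing each message to just under the limit keeps the number of SQS requests, and therefore the cost, as low as possible for a fixed number of URLs.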

Third Phase: Fetching raw data

Now that we have all the listing URLs, the third phase pulls those URL batches from SQS and makes an HTTP request for each listing's raw data. Because much of the data is returned by event-driven JavaScript, we used Python's Selenium to load the URLs, perform a few actions to render the content in its entirety, and capture the raw data to be stored. Even though this phase doesn't represent the majority of the HTTP requests, the fact that we had to use Selenium exclusively made an auto-scaling EC2 group a good fit. And remember, we needed to keep I/O tasks and CPU tasks on separate compute instances, so raw data wasn't parsed in this phase. Instead, each listing's raw data was stored in S3, our data lake.
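
A stripped-down sketch of a fetch worker, with an illustrative bucket name and the site-specific Selenium actions omitted, might look like this:

```python
import hashlib
import boto3
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

s3 = boto3.client("s3")
BUCKET = "raw-listings-data-lake"   # illustrative bucket name

def fetch_listing(driver, url):
    """Render a listing with Selenium and park the raw HTML in S3."""
    driver.get(url)
    # ...site-specific clicks/scrolls to trigger the event-driven
    # JavaScript and load the full content would go here...
    raw_html = driver.page_source
    key = f"raw/{hashlib.sha256(url.encode()).hexdigest()}.html"
    s3.put_object(Bucket=BUCKET, Key=key, Body=raw_html.encode("utf-8"))
    return key

if __name__ == "__main__":
    options = Options()
    options.add_argument("--headless")   # run without a display on EC2
    driver = webdriver.Chrome(options=options)
    try:
        fetch_listing(driver, "https://example.com/listing/123")
    finally:
        driver.quit()
```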

Fourth Phase: Building the Parse Queue

In the first phase, we described how Discovery metadata is cataloged; each subsequent phase likewise keeps dedicated tables for its own metadata. Using that information, we could determine exactly which S3 objects still needed parsing and add their paths in batches to the Parse queue in SQS. The same dynamic sizing algorithm was used to maximize message efficiency.
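
A rough sketch of that selection step, with illustrative table and column names (the real metadata schema differs), might look like this:

```python
import snowflake.connector

# Table and column names here are illustrative placeholders; the pipeline's
# per-phase metadata records which raw objects have already been parsed.
QUERY = """
    SELECT s3_key
    FROM listing_fetch_metadata
    WHERE parsed = FALSE
"""

def unparsed_s3_keys(conn_params):
    """Return the S3 object paths that have been fetched but not yet parsed."""
    conn = snowflake.connector.connect(**conn_params)
    try:
        cursor = conn.cursor()
        cursor.execute(QUERY)
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()

# The resulting keys are then packed into SQS messages with the same
# dynamic 256 KB sizing helper shown in the second phase, e.g.:
# send_url_batches(unparsed_s3_keys(params), PARSE_QUEUE_URL)
```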

Fifth Phase: Parsing raw content into usable information

With all the raw content parked in the data lake, and our quarter of a billion HTTP requests behind us, all that's left is parsing out the data points relevant to the business. Parsing raw content is a CPU task, which meant we could scale up a fleet of EC2 instances to handle the batch jobs and scale back down when finished, minimizing costs while making the data available to the BI team relatively quickly. At the time, we had only limited experience with AWS Fargate, and it didn't make business sense to halt development to explore it, though we suspected it could turn out to be the best long-term fit. Fortunately, building this phase's application to be container-ready wasn't a stretch and required only a few light modifications. For the parsed data, Snowflake is one of the best options for a data warehouse.
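
A simplified sketch of a parse worker, using multiprocessing to keep every core busy and BeautifulSoup as a stand-in parser with placeholder selectors, might look like this:

```python
from multiprocessing import Pool

import boto3
from bs4 import BeautifulSoup

BUCKET = "raw-listings-data-lake"   # same illustrative bucket as above

def parse_object(s3_key):
    """Pull one raw listing from the data lake and extract the fields
    the business cares about. The selectors here are placeholders."""
    s3 = boto3.client("s3")
    raw_html = s3.get_object(Bucket=BUCKET, Key=s3_key)["Body"].read()
    soup = BeautifulSoup(raw_html, "html.parser")
    return {
        "s3_key": s3_key,
        "title": soup.title.string if soup.title else None,
        # ...price, fees, availability, and other data points extracted here...
    }

def parse_batch(s3_keys, workers=8):
    """Fan the CPU-bound parsing work out across the instance's cores;
    the resulting rows are then bulk-loaded into Snowflake."""
    with Pool(processes=workers) as pool:
        return pool.map(parse_object, s3_keys)
```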
