Running (Py)Spark on Airflow Locally & Processing Good/Bad Data separately (Batch-mode).
Tech Stack: Docker, Airflow, Spark, S3
Github repo: spark-airflow
Part 1: Running Spark on Airflow locally (& quickly)

As part of an interview process a while back, I was asked to create a service that ingests data periodically (🤔 Airflow) and loads it into a database, and one of the requirements was to provide a scalable solution (🤔 Spark). Of course, using Docker was the easiest way to ensure reproducible behaviour.
I had worked with Airflow and Spark separately, but never had a use-case that combined the two in a local non-prod environment.
With a couple of minor customisations, though, it was actually quite easy to run them together in a dev/test environment to verify a pipeline’s behaviour and satisfy the use-case in my scenario.
Airflow’s original Dockerfile is sourced from puckel’s docker-airflow and Spark’s original Dockerfile is sourced from docker-spark. Airflow’s image is based on Python, and docker-spark has Python built in for running PySpark. So all you have to do is change the base image of the Airflow Dockerfile to the docker-spark image, and update PYTHONPATH in the customised Dockerfile using the SPARK_HOME environment variable set inside docker-spark’s Dockerfile.
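A minimal sketch of what that customised Dockerfile could look like. The base-image name/tag and the py4j zip filename are placeholders; use whatever your docker-spark build actually produces and defines.

```dockerfile
# Sketch only: image name/tag and py4j version are placeholders.
# Swap puckel/docker-airflow's base image for the image you built from docker-spark.
FROM <your-docker-spark-image>

# SPARK_HOME is already set inside the docker-spark image; reuse it so that
# the Airflow worker's Python can import pyspark (and its bundled py4j).
ENV PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-<version>-src.zip:${PYTHONPATH}"

# ...the rest of puckel/docker-airflow's Dockerfile (Airflow install,
# entrypoint, airflow user, etc.) follows unchanged...
```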


Just ensure that you have given Docker enough memory and CPU to run Spark.
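To sanity-check the combined image, a throwaway DAG along these lines is enough. The DAG id, schedule and toy DataFrame are purely illustrative, and the imports follow the Airflow 1.10-era layout that puckel’s image ships with.

```python
# Minimal smoke-test DAG: runs a tiny PySpark job inside the Airflow worker
# to confirm pyspark is importable and Spark starts in the combined image.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def run_pyspark_job():
    from pyspark.sql import SparkSession  # importable thanks to PYTHONPATH

    spark = (
        SparkSession.builder
        .master("local[*]")  # run Spark inside the container
        .appName("airflow_spark_smoke_test")
        .getOrCreate()
    )
    df = spark.createDataFrame([(1, "ok"), (2, "ok")], ["id", "status"])
    print(f"row count: {df.count()}")
    spark.stop()


with DAG(
    dag_id="spark_smoke_test",
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,  # trigger manually
    catchup=False,
) as dag:
    PythonOperator(task_id="run_pyspark_job", python_callable=run_pyspark_job)
```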
Part 2: An approach to processing Good/Bad Data separately (Batch-Mode)
Possible use-case: your stakeholders want as much data as possible to be available for analysis as soon as possible.
Assumption: you receive a lot of bad data in your big-data ingestion pipeline, and the issue types can change abruptly (quite common in the financial space).
Approach:
- Your initial ingestion task could segregate bad data from good data based on data-quality checks (for eg: non-null criteria, specific-value criteria like True/False or Approved/Declined, length criteria like currency-code length, primary-key constraints, etc.).
- Load the good data into your database and run the subsequent pipeline (any transformations), and send the bad data to storage such as S3 (you can reconcile counts: bad + good = total rows in the original file). A PySpark sketch of this split follows the list.
- You could do the same for all your different ingestion sources.
- Create a single, separate pipeline (clean-reingest-bad-data) for all data sources that tracks the original ingestions (you can use ExternalTaskSensors in Airflow; see the DAG sketch after this list). Once the original ingestion task finishes, it looks for the bad data, applies your cleansing functions to recover as much as possible, and re-processes the updated data through the same mechanism (as in the first two steps), re-writing the original bad file with whatever eventually could not be processed.
- You could monitor the number of bad rows that eventually end up in the bad file to proactively spot when further investigation is needed.
For eg: some of your source data might always have 10 bad rows at the beginning that you want to ignore, plus some rows in the middle that need to be cleansed, so you generally expect that after all tasks run only those 10 rows end up in the bad file. If there are more on a certain date because some other corruption crept into the data, you will have proactively identified the need to investigate.
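A rough sketch of what the segregation step could look like in PySpark. The column names, data-quality rules, S3 paths and JDBC target are illustrative placeholders (not the pipeline from the repo), and the s3a:// paths assume the hadoop-aws connector and credentials are already configured.

```python
# Sketch of the ingestion task's good/bad split with a count reconciliation.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest_with_dq_split").getOrCreate()

raw = spark.read.option("header", True).csv("s3a://my-bucket/landing/payments.csv").cache()

# Data-quality predicate: non-null key, allowed status values, 3-char currency.
# NULL results from any check are coalesced to False so such rows land in "bad".
is_good = F.coalesce(
    F.col("payment_id").isNotNull()
    & F.col("status").isin("Approved", "Declined")
    & (F.length(F.col("currency")) == 3),
    F.lit(False),
)

good = raw.filter(is_good)
bad = raw.filter(~is_good)

# Reconciliation: good + bad must add up to the original row count.
n_total, n_good, n_bad = raw.count(), good.count(), bad.count()
assert n_good + n_bad == n_total

# Good rows go to the database, bad rows to S3 for the cleanse-and-reingest DAG.
good.write.jdbc(
    url="jdbc:postgresql://db:5432/analytics",  # placeholder connection
    table="payments",
    mode="append",
    properties={"user": "ingest", "password": "<secret>"},
)
bad.write.mode("overwrite").csv("s3a://my-bucket/bad/payments/", header=True)

# n_bad can be emitted as a metric (or XCom) so an unexpected spike beyond the
# usual baseline (e.g. the 10 known junk rows) triggers an investigation.
```

And a matching sketch of the clean-reingest-bad-data DAG waiting on the original ingestion with an ExternalTaskSensor. The DAG/task ids are hypothetical and must match your real DAGs, with aligned schedules (or an execution_delta).

```python
# Sketch of the clean-reingest-bad-data DAG tracking the original ingestion.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG(
    dag_id="clean_reingest_bad_data",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",  # must line up with the upstream DAG's schedule
    catchup=False,
) as dag:
    wait_for_ingestion = ExternalTaskSensor(
        task_id="wait_for_payments_ingestion",
        external_dag_id="payments_ingestion",   # hypothetical upstream DAG
        external_task_id="split_good_bad",      # hypothetical upstream task
    )

    cleanse_and_reprocess = PythonOperator(
        task_id="cleanse_and_reprocess",
        python_callable=lambda: None,  # placeholder: apply cleansing rules, re-run the split
    )

    wait_for_ingestion >> cleanse_and_reprocess
```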
Alternative approach: Why not clean and ingest in one go?
This is definitely a possible approach and can work in many scenarios, as long as your stakeholders are OK with a delay in the data and prefer to receive it all in one go. The other risk is that the ingestion pipeline fails outright whenever a new issue appears in the data, requiring immediate attention to find and fix it before the whole pipeline can run correctly.