Using the Airflow ShortCircuitOperator to Stop Bad Data from Reaching ETL Pipelines

See how to leverage the Airflow ShortCircuitOperator to create data circuit breakers to prevent bad data from reaching your data pipelines.
I’m a huge fan of Apache Airflow and how the open-source tool enables data engineers to scale data pipelines by more precisely orchestrating workloads. 
But what happens when Airflow testing doesn’t catch all of your bad data? What if “unknown unknown” data quality issues fall through the cracks and affect your Airflow jobs? 
One helpful but underutilized solution is to leverage the Airflow ShortCircuitOperator to create data circuit breakers to prevent bad data from flowing across your data pipelines.
Data circuit breakers are powerful, but as with most data quality tactics, the nuances of how they are implemented are critical. Otherwise, you can make a bad problem worse.
In electrical engineering, a circuit breaker is a safety device that protects your home from damage caused by an overcurrent or a short. When the breaker detects one of those faults, it interrupts the current to prevent an even worse issue, like a fire, from occurring.
Data circuit breakers are essentially data tests on steroids, and the philosophy is the same. When the data does not meet the quality or integrity thresholds defined in your Airflow DAG, the pipeline is stopped, preventing a worse outcome, such as a CEO receiving bad information.
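To make the idea concrete, here is a minimal sketch of a data circuit breaker, not a definitive implementation. It assumes Airflow 2.4 or later (for the `schedule` argument and the `airflow.operators.python` import path). The thresholds, the `fetch_batch_metrics` helper, and the task names are all hypothetical and only there for illustration; in a real pipeline the metrics would come from a query against your staging data.

```python
from datetime import datetime


def batch_is_healthy(row_count, null_fraction, min_rows=1000, max_null_fraction=0.05):
    """Return True only if the batch meets both (illustrative) quality thresholds."""
    return row_count >= min_rows and null_fraction <= max_null_fraction


def fetch_batch_metrics():
    """Hypothetical stand-in for a query against the staging table."""
    return {"row_count": 12500, "null_fraction": 0.01}


def circuit_breaker():
    """Callable for ShortCircuitOperator: a falsy return value skips downstream tasks."""
    return batch_is_healthy(**fetch_batch_metrics())


def build_dag():
    # Airflow is imported lazily so the check above can be unit tested without it.
    from airflow import DAG
    from airflow.operators.python import PythonOperator, ShortCircuitOperator

    with DAG(
        dag_id="etl_with_circuit_breaker",  # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule=None,
        catchup=False,
    ) as dag:
        check = ShortCircuitOperator(
            task_id="data_circuit_breaker",
            python_callable=circuit_breaker,
        )
        load = PythonOperator(
            task_id="load_to_warehouse",  # placeholder for the real load step
            python_callable=lambda: print("loading batch"),
        )
        # The load task runs only when the breaker's callable returns a truthy value.
        check >> load
    return dag
```

In a DAG file you would assign the result at module level (for example `dag = build_dag()`) so the scheduler can discover it. Note that when the callable returns a falsy value, the ShortCircuitOperator marks downstream tasks as skipped rather than failed, so you will likely want separate alerting to tell you the breaker has tripped.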
While data circuit breakers are most frequently used to prevent bad data from entering the storage layer, they can be deployed at multiple stages before the BI dashboards are updated: between transformation steps or after an ETL or ELT job executes, for example.
Both data testing and data circuit breakers reduce data downtime most effectively when paired with data observability, that is, end-to-end data monitoring and alerting.
Proactive monitoring and alerting can also help overcome the challenges of Apache Airflow’s native monitoring and logging capabilities at scale, specifically that Airflow pipelines are not data aware.
