
Best Practices for Building Data Pipelines


Using best practices when building data pipelines will significantly improve data quality and reduce the risk of pipeline breakage.
In my previous article, ‘Data Validation to Improve Data Quality’, I shared the importance of data quality and a checklist of validation rules to achieve it. Validation rules alone, however, may not guarantee the best data quality. In this article, we focus on the best practices to follow while building data pipelines to ensure data quality.

1. Idempotency
A data pipeline should be built so that running it multiple times does not duplicate data, and so that when a failure is resolved and the pipeline is run again, no data is lost or improperly altered. Most pipelines are automated and run on a fixed schedule. By capturing the logs of previous successful runs, such as the parameters passed (date range), the count of records inserted/modified/deleted, and the timespan of the run, the parameters for the next run can be set relative to the previous successful run. For example, if a pipeline runs every hour and a failure happens at 2 pm, the next run should automatically capture the data from 1 pm onward, and the timeframe should not be advanced until the current run succeeds.
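As a rough illustration, here is a minimal Python sketch of that window bookkeeping. The run-log structure and its field names (status, window_start, window_end, records_affected) are assumptions made for this example, not part of any specific tool.

```python
from datetime import datetime, timedelta, timezone

def next_window(run_log, interval=timedelta(hours=1), now=None):
    """Derive the next extraction window from the last successful run.

    run_log is assumed to be a list of dicts such as
    {"status": "success", "window_start": ..., "window_end": ...,
     "records_affected": ...} -- hypothetical field names for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    successes = [r for r in run_log if r["status"] == "success"]
    if successes:
        # Anchor to the end of the last *successful* run, so a retry after a
        # failed 2 pm run still re-reads the data from 1 pm onward.
        start = max(r["window_end"] for r in successes)
    else:
        # First run (or no successes yet): fall back to one interval back.
        start = now - interval
    end = min(start + interval, now)
    return start, end
```

For the pipeline to be truly idempotent, the load step should also tolerate re-processing the same window, for example by upserting on a natural key instead of blindly inserting, so a re-run does not create duplicates.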
2. Consistency

In cases where data flows from upstream to downstream databases, if a pipeline run completes successfully but does not add/modify/delete any records, the next run should cover a bigger time frame that includes the previous run's window, to avoid losing data that lands in the source with a slight delay. This helps maintain consistency between the source and target databases. Continuing the example above: if the 2 pm run completes successfully but does not add/modify/delete any records, the 3 pm run should fetch the data from 1 pm–3 pm instead of 2 pm–3 pm.
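Continuing the same sketch, with the same assumed field names, a small adjustment widens the window whenever the previous successful run moved no records:

```python
from datetime import datetime, timezone

def consistent_window(last_run, now=None):
    """Widen the window when the previous run succeeded but moved no records.

    If the 2 pm run succeeded with zero inserts/updates/deletes, the 3 pm run
    re-reads from 1 pm so rows that landed late in the source are not lost.
    (Field names on last_run carry over from the hypothetical sketch above.)
    """
    now = now or datetime.now(timezone.utc)
    if last_run["status"] == "success" and last_run["records_affected"] == 0:
        start = last_run["window_start"]  # include the previous (empty) window
    else:
        start = last_run["window_end"]
    return start, now
```

Because the load step is idempotent, re-reading the overlapping hour is safe: records already present in the target are simply upserted again rather than duplicated.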
