
What is Data Ingestion? The Definitive Guide


Learn what data ingestion is, why it matters, and how you can use it to power your analytics and activate your data as an essential part of the modern data stack.
Data ingestion is an essential step in any modern data stack. At its core, data ingestion is the process of moving data from various data sources to an end destination where it can be stored for analytics purposes. This data can come in many different formats and be generated by a range of external sources (e.g., website data, app data, databases, SaaS tools).

The data ingestion process matters because it moves data from point A to point B. Without a data ingestion pipeline, data stays locked in the source where it originated, and in that state it isn't actionable. The easiest way to understand data ingestion is to think of it as a pipeline: in the same way that oil is transported from the well to the refinery, data is transported from the source to the analytics platform. Data ingestion is important because it gives business teams the ability to extract value from data that would otherwise be inaccessible.

The end goal of the ingestion layer is to power analytics. In most scenarios, data ingestion is used to move data from disparate sources into a specific data platform, whether that is a data warehouse like Snowflake, a data lake, or even a data lakehouse like Databricks. Once the data is consolidated into these cloud platforms, data engineers make sense of it by building robust data models to power Business Intelligence (BI) dashboards and reports, so key stakeholders can use this information to drive business outcomes.

In general, there are only two types of data ingestion methods: real-time and batch-based. Real-time processing focuses on collecting data as soon as it is generated and producing a continuous output stream. Real-time ingestion is extremely important for time-sensitive use cases where new information is vital for decision-making. As an example, large oil companies like Exxon Mobil and Chevron need to monitor their equipment to ensure that their machines are not drilling into rocks, so they generate large amounts of IoT (Internet of Things) data. In the same vein, large financial institutions like Capital One, Discover, Coinbase, and Bank of America need to be able to identify fraudulent activity. These are just two examples, but both rely heavily on real-time data ingestion.

Batch processing focuses on bulk ingestion at a later point (i.e., loading large quantities of data at a scheduled interval or after a specific trigger event). This method of data ingestion is most beneficial when data is not needed in real time. It's also much cheaper and more efficient for processing large amounts of data collected over a set period of time. The sketches at the end of this section illustrate both approaches.

In many scenarios, companies choose to leverage a combination of both batch and real-time data ingestion to ensure that data is constantly available at low latency. In general, real-time ingestion should be used as sparingly as possible because it is much more complex and expensive than batch-based processing. Every company has a slightly different standard for what "real-time data" actually means. For some, it's every ten seconds; for others, it's every five or ten minutes. In practice, real-time data ingestion is only needed for sub-minute use cases; for anything at five minutes or more, batch-based data ingestion should work just fine.
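To make the distinction concrete, here is a minimal sketch of real-time ingestion. It assumes events arrive one at a time from some source (simulated below) and uses a local SQLite table as a stand-in for a warehouse destination; the table name, event fields, and timing are illustrative assumptions, not a reference implementation.

import json
import sqlite3
import time

def event_stream():
    # Stand-in for a real streaming source such as a message queue or a
    # change-data-capture feed; here a few events simply trickle in.
    for i in range(5):
        yield {"event_id": i, "event_type": "page_view", "ts": time.time()}
        time.sleep(0.1)

destination = sqlite3.connect("analytics.db")
destination.execute(
    "CREATE TABLE IF NOT EXISTS events (event_id INTEGER, payload TEXT)"
)

for event in event_stream():
    # Each record is written the moment it arrives, which keeps latency low
    # but means many small, comparatively expensive writes.
    destination.execute(
        "INSERT INTO events VALUES (?, ?)",
        (event["event_id"], json.dumps(event)),
    )
    destination.commit()

destination.close()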
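By contrast, a batch ingestion sketch under the same assumptions accumulates everything produced since the last run and loads it in one bulk write on a schedule. The collect_new_records helper, table name, and interval below are hypothetical placeholders for whatever export or query the source actually supports.

import json
import sqlite3
import time

def collect_new_records(since):
    # Stand-in for pulling everything the source produced since the last run,
    # e.g. an API export or a query against a read replica.
    return [{"event_id": i, "ts": since + i} for i in range(1000)]

destination = sqlite3.connect("analytics.db")
destination.execute(
    "CREATE TABLE IF NOT EXISTS events_batch (event_id INTEGER, payload TEXT)"
)

last_run = time.time() - 3600  # pretend the previous scheduled run was an hour ago
records = collect_new_records(last_run)

# One bulk insert per scheduled run is far cheaper than one write per record,
# which is why batch ingestion wins when freshness requirements are relaxed.
destination.executemany(
    "INSERT INTO events_batch VALUES (?, ?)",
    [(r["event_id"], json.dumps(r)) for r in records],
)
destination.commit()
destination.close()

In practice, a script like this would be triggered by a scheduler or orchestration tool on a fixed interval rather than run by hand.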
The core definition of data ingestion is relatively narrow, as it refers only to moving data from one system to another, so it's better to understand data ingestion through the overarching lens of data integration.
