
The Future of the Data Engineer


Is the data engineer still the "worst seat at the table"? Maxime Beauchemin, creator of Apache Airflow, weighs in on the future of data engineering.
In the world of data engineering, Maxime Beauchemin is someone who needs no introduction. One of the first data engineers at Facebook and Airbnb, he wrote and open-sourced the wildly popular orchestrator Apache Airflow, followed shortly thereafter by Apache Superset, a data exploration tool that's taking the data viz landscape by storm. Currently, Maxime is CEO and co-founder of Preset, a fast-growing startup that's paving the way forward for AI-enabled data visualization for modern companies.

It's fair to say that Maxime has experienced, and even architected, many of the most impactful data engineering technologies of the last decade. He also pioneered the data engineering role itself through his landmark 2017 blog post "The Rise of the Data Engineer," in which he chronicles many of his observations. In short, Maxime argues that to effectively scale data science and analytics, teams need a specialized engineer to manage ETL, build pipelines, and scale data infrastructure. Enter the data engineer: a member of the data team primarily focused on building and optimizing the platform for ingesting, storing, analyzing, visualizing, and activating large amounts of data.

A few months later, Maxime followed up that piece with a reflection on some of the data engineer's biggest challenges: the job was hard, the respect was minimal, and the connection between their work and the actual insights it generated was obvious but rarely recognized. Data engineering was a thankless but increasingly important job, with data engineering teams torn between building infrastructure, running jobs, and fielding ad hoc requests from the analytics and BI teams. As a result, being a data engineer was both a blessing and a curse. In fact, in Maxime's opinion, the data engineer held the "worst seat at the table."
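The pipeline-building work described above can be sketched in miniature. The following is an illustrative toy, not anything from Maxime's projects: a plain-Python extract-transform-load flow using only the standard library, standing in for the kind of job that an orchestrator like Airflow would schedule and monitor (the column names and in-memory "warehouse" are invented for the example).

```python
import csv
import io

def extract(raw_csv: str) -> list[dict]:
    """Read raw CSV rows into dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Normalize types and drop incomplete records."""
    return [
        {"user": r["user"], "spend": float(r["spend"])}
        for r in rows
        if r.get("user") and r.get("spend")
    ]

def load(rows: list[dict], table: dict) -> None:
    """Accumulate spend per user into an in-memory 'table'."""
    for r in rows:
        table[r["user"]] = table.get(r["user"], 0.0) + r["spend"]

# Run the three stages end to end on a small sample.
raw = "user,spend\nalice,10.5\nbob,3.0\nalice,2.5\n"
warehouse: dict = {}
load(transform(extract(raw)), warehouse)
print(warehouse)  # {'alice': 13.0, 'bob': 3.0}
```

In practice, each stage would be a task in a DAG, so failures can be retried and backfilled independently; that operational scaffolding is exactly what tools like Airflow provide.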
So, five years later, where does the field of data engineering stand? What is a data engineer today? What do data engineers do? I sat down with Maxime to discuss the current state of affairs, including the decentralization of the modern data stack, the fragmentation of the data team, the rise of the cloud, and how all these factors have changed the role of the data engineer forever.

Maxime recalls a time not too long ago when data engineering meant running Hive jobs for hours at a time, frequently switching context between jobs, and managing disparate elements of the data pipeline. To put it bluntly, data engineering was boring and exhausting at the same time.
“This never-ending context switching and the sheer length of time it took to run data operations led to burnout,” he says. “All too often, 5-10 minutes of work at 11:30 p.m. could save you 2-4 hours of work the next day — and that’s not necessarily a good thing.”
In 2021, data engineers can run big jobs very quickly thanks to the compute power of BigQuery, Snowflake, Firebolt, Databricks, and other cloud warehousing technologies. This movement away from on-prem and open source solutions toward the cloud and managed SaaS frees up data engineering resources to work on tasks unrelated to database management. On the flip side, costs are more constrained.
“It used to be fairly cheap to run on-prem, but in the cloud, you have to be mindful of your compute costs,” Maxime says. “The resources are elastic, not finite.”
With data engineers no longer responsible for managing compute and storage, their role is changing from infrastructure development to more performance-based elements of the data stack, or even specialized roles.
“We can see this shift in the rise of data reliability engineering, and data engineering being responsible for managing (not building) data infrastructure and overseeing the performance of cloud-based systems.”
In a previous era of data engineering, data team structure was very much centralized, with data engineers and tech-savvy analysts serving as the "librarians" of the data for the entire company. Data governance was a siloed role, and data engineers became the de facto gatekeepers of data trust, whether or not they liked it. Nowadays, Maxime suggests, it's widely accepted that governance is distributed. Every team owns its own analytic domain, forcing decentralized team structures around broadly standardized definitions of what "good" data looks like.
“We’ve accepted that consensus seeking is not necessary in all areas, but that doesn’t make it any easier,” he says. “The data warehouse is the mirror of the organization in many ways. If people don’t agree on what they call things in the data warehouse or what the definition of a metric is, then this lack of consensus will be reflected downstream.”
