Data Engineering in the Age of AI
Much like the introduction of the personal computer, the internet, and the iPhone into the public sphere, recent developments in the AI space, from generative AI to agentic AI, have fundamentally changed the way people live and work. Since its release in late 2022, ChatGPT has reached 700 million users per week, approximately 10% of the global adult population. And according to a 2025 report by Capgemini, agentic AI adoption is expected to grow by 48% by the end of the year. It's quite clear that this latest iteration of AI technology has transformed virtually every industry and profession, and data engineering is no exception.
As Naveen Sharma, SVP and global practice head at Cognizant, observes, "What makes data engineering uniquely pivotal is that it forms the foundation of modern AI systems, it's where these models originate and what enables their intelligence." Thus, it's unsurprising that the latest advances in AI would have a sizable impact on the discipline, perhaps even an existential one. With the increased adoption of AI coding tools leading to the reduction of many entry-level IT positions, should data engineers be wary of a similar outcome for their own profession? Khushbu Shah, associate director at ProjectPro, poses this very question, noting that "we've entered a new phase of data engineering, one where AI tools don't just support a data engineer's work; they start doing it for you. . . . Where does that leave the data engineer? Will AI replace data engineers?"
Despite the growing tide of GenAI and agentic AI, data engineers won't be replaced anytime soon. While the latest AI tools can help automate and complete rote tasks, data engineers are still very much needed to implement and maintain the infrastructure that houses the data required for model training, build data pipelines that ensure accurate and accessible data, and monitor and enable model deployment. And as Shah points out, "Prompt-driven tools are great at writing code but they can't reason about business logic, trade-offs in system design, or the subtle cost of a slow query in a production dashboard." So while their customary daily tasks might shift with the increasing adoption of the latest AI tools, data engineers still have an important role to play in this technological revolution.
The Role of Data Engineers in the New AI Era
In order to adapt to this new era of AI, the most important thing data engineers can do involves a fairly self-evident mindshift. Simply put, data engineers need to understand AI and how data is used in AI systems. As Mike Loukides, VP of content strategy at O'Reilly, put it to me in a recent conversation, "Data engineering isn't going away, but you won't be able to do data engineering for AI if you don't understand the AI part of the equation. And I think that's where people will get stuck. They'll think, 'Same old, same old,' and it isn't. A data pipeline is still a data pipeline, but you have to know what that pipeline is feeding."
So how exactly is data used? Since all models require huge amounts of data for initial training, the first stage involves collecting raw data from various sources, be they databases, public datasets, or APIs. And since raw data is often unorganized or incomplete, the data must be preprocessed to prepare it for training, which involves cleaning, transforming, and organizing it to make it suitable for the AI model. The next stage is training the model, where the preprocessed data is fed into the AI model so it can learn patterns, relationships, and features. After that there's posttraining, where the model is fine-tuned on data important to the organization that's building it, a stage that also requires a significant amount of data. Related to this stage is retrieval-augmented generation (RAG), a technique that provides real-time, contextually relevant information to a model in order to improve the accuracy of its responses.
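To make that flow concrete, here's a minimal, self-contained sketch of the preprocess-retrieve-prompt cycle in Python. The sample records and the term-frequency scoring are toy stand-ins for illustration only; a production RAG system would use an embedding model and a vector store instead.

```python
# A toy end-to-end pass over the stages above: collect raw records,
# preprocess them, then retrieve context for a prompt (a stand-in for RAG).
# The data and the term-frequency scoring are hypothetical simplifications.
import math
import re
from collections import Counter

def preprocess(raw_docs):
    """Clean and normalize raw records before they feed a model."""
    cleaned = []
    for doc in raw_docs:
        text = re.sub(r"\s+", " ", doc or "").strip().lower()
        if text:  # drop empty or incomplete records
            cleaned.append(text)
    return cleaned

def tf_vector(text):
    return Counter(re.findall(r"[a-z0-9]+", text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norms = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0

def retrieve(query, corpus, k=2):
    """Rank preprocessed chunks by similarity to the query; a real RAG
    system would do an embedding search here."""
    q = tf_vector(query)
    return sorted(corpus, key=lambda d: cosine(q, tf_vector(d)), reverse=True)[:k]

raw = ["  Invoices live in the BILLING schema.  ", "", "Refunds post nightly at 2am UTC."]
corpus = preprocess(raw)
context = retrieve("when do refunds post?", corpus)
print(f"Answer using only this context: {context}\nQ: When do refunds post?")
```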
Data engineers can also adapt to this new environment and help support current AI initiatives in other important ways: by improving and maintaining high data quality, designing robust pipelines and operational systems, and ensuring that privacy and security requirements are met.
In his testimony to a US House of Representatives committee on the topic of AI innovation, Gecko Robotics cofounder Troy Demmer affirmed a golden axiom of the industry: "AI applications are only as good as the data they are trained on. Trustworthy AI requires trustworthy data inputs." Poor data quality is a major reason why roughly 85% of all AI projects fail, and many AI professionals flag it as a top concern: without high-quality data, even the most sophisticated models and AI agents can go awry. Since most GenAI models depend upon large datasets to function, data engineers are needed to process and structure this data so that it's clean, labeled, and relevant, ensuring reliable AI outputs.
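In practice, that processing often begins with automated quality gates in the pipeline itself. Below is a minimal sketch of such checks; the record schema, field names, and failure modes shown are hypothetical examples, not a complete framework.

```python
# A minimal sketch of pre-training data-quality gates. The record schema
# and field names are hypothetical; real pipelines would add many more checks.
from dataclasses import dataclass

@dataclass
class Record:
    id: str
    label: str | None
    text: str

def quality_report(records):
    """Count classic failure modes: missing labels, empty text, duplicate IDs."""
    issues = {"missing_label": 0, "empty_text": 0, "duplicate_id": 0}
    seen = set()
    for r in records:
        if not r.label:
            issues["missing_label"] += 1
        if not r.text.strip():
            issues["empty_text"] += 1
        if r.id in seen:
            issues["duplicate_id"] += 1
        seen.add(r.id)
    return issues

records = [
    Record("a1", "refund", "customer asked for a refund"),
    Record("a2", None, "   "),
    Record("a1", "billing", "duplicate id slipped in"),
]
report = quality_report(records)
print(report)  # {'missing_label': 1, 'empty_text': 1, 'duplicate_id': 1}
# In a real pipeline, a nonzero count here would fail the run before
# bad data ever reaches training.
```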
Just as importantly, data engineers need to design and build newer, more robust pipelines and infrastructure that can scale with GenAI requirements. As Adi Polak, director of AI and data streaming at Confluent, notes, "the next generation of AI systems requires real-time context and responsive pipelines that support autonomous decisions across distributed systems," well beyond traditional data pipelines that can only support batch-trained models or power reports. Instead, data engineers are now tasked with creating nimbler pipelines that can process real-time streaming data for inference and historical data for model fine-tuning, while supporting versioning and lineage tracking. They must also have a firm grasp of streaming patterns and concepts, from event-driven architecture to retrieval and feedback loops, in order to build high-throughput pipelines that can support AI agents.
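The sketch below illustrates the event-driven shape of such a pipeline in plain Python: an in-memory queue stands in for a Kafka topic, and a handler maintains rolling state to emit fresh features for real-time inference. The event schema and the feature itself are hypothetical.

```python
# A toy event-driven loop: a queue stands in for a streaming topic, and the
# handler computes a rolling feature for real-time inference. In production
# this role is played by systems like Kafka or Flink, not an in-memory queue.
import json
import queue
from collections import defaultdict, deque

events = queue.Queue()
window = defaultdict(lambda: deque(maxlen=5))  # last 5 amounts per user

def handle(event):
    """Update streaming state, then emit a fresh feature for the model."""
    user, amount = event["user"], event["amount"]
    window[user].append(amount)
    rolling_avg = sum(window[user]) / len(window[user])
    return {"user": user, "rolling_avg_amount": rolling_avg}

# Producer side: events arrive as they happen, not in nightly batches.
for payload in ['{"user": "u1", "amount": 30}', '{"user": "u1", "amount": 90}']:
    events.put(json.loads(payload))

while not events.empty():
    feature = handle(events.get())
    print(feature)  # would be written to a feature store / served for inference
```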
While GenAI's utility is indisputable at this point, the technology is saddled with notable drawbacks. Hallucinations are most likely to occur when a model doesn't have the proper data it needs to answer a given question. Like many systems that rely on vast streams of information, the latest AI systems are not immune to private data exposure, biased outputs, and intellectual property misuse. Thus, it's up to data engineers to ensure that the data used by these systems is properly governed and secured, and that the systems themselves comply with relevant data and AI regulations. As data engineer Axel Schwanke astutely notes, these measures may include "limiting the use of large models to specific data sets, users and applications, documenting hallucinations and their triggers, and ensuring that GenAI applications disclose their data sources and provenance when they generate responses," as well as sanitizing and validating all GenAI inputs and outputs. One system that addresses the latter measures is O'Reilly Answers, one of the first such tools to provide citations for the content it quotes.
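As a concrete illustration of that last measure, here's a minimal sketch of input and output guardrails around a GenAI application. The redaction patterns, injection heuristics, and provenance rule are hypothetical placeholders, not a complete security control.

```python
# A minimal sketch of sanitizing GenAI inputs and validating outputs.
# The PII patterns, injection hints, and provenance policy are hypothetical
# placeholders; production systems would use far more robust controls.
import re

PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED-SSN]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[REDACTED-EMAIL]"),
]
INJECTION_HINTS = ("ignore previous instructions", "reveal your system prompt")

def sanitize_input(prompt: str) -> str:
    """Reject likely prompt injection and redact obvious PII before the model sees it."""
    lowered = prompt.lower()
    if any(hint in lowered for hint in INJECTION_HINTS):
        raise ValueError("possible prompt injection; rejecting input")
    for pattern, replacement in PII_PATTERNS:
        prompt = pattern.sub(replacement, prompt)
    return prompt

def validate_output(answer: str, sources: list[str]) -> str:
    """Require provenance: refuse to return an answer with no cited sources."""
    if not sources:
        return "No supporting sources found; declining to answer."
    return answer + "\n\nSources: " + "; ".join(sources)

clean = sanitize_input("My email is jane@example.com, when do refunds post?")
print(clean)  # PII is redacted before the prompt reaches the model
print(validate_output("Refunds post nightly at 2am UTC.", ["billing-runbook.md"]))
```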
The Road Ahead
Data engineers should remain gainfully employed as the next generation of AI continues on its upward trajectory, but that doesn't mean there aren't significant challenges around the corner. As autonomous agents continue to evolve, questions regarding the best infrastructure and tools to support them have arisen. As Ben Lorica ponders, "What does this mean for our data infrastructure? We are designing intelligent, autonomous systems on top of databases built for predictable, human-driven interactions. What happens when software that writes software also provisions and manages its own data? This is an architectural mismatch waiting to happen, and one that demands a new generation of tools." One such tool has already emerged in the form of AgentDB, a database designed specifically to work effectively with AI agents.
In a similar vein, a recent research paper, "Supporting Our AI Overlords," opines that data systems must be redesigned to be agent-first. Building upon this argument, Ananth Packkildurai observes that "it's tempting to believe that the Model Context Protocol (MCP) and tool integration layers solve the agent-data mismatch problem. . . . However, these improvements don't address the fundamental architectural mismatch. . . . The core issue remains: MCP still primarily exposes existing APIs—precise, single-purpose endpoints designed for human or application use—to agents that operate fundamentally differently." Whatever the outcome of this debate may be, data engineers will likely help shape the future underlying infrastructure used to support autonomous agents.
Another challenge for data engineers will be successfully navigating the ever-shifting landscape of data privacy and AI regulations, particularly in the US. With the One Big Beautiful Bill Act leaving AI regulation under the aegis of individual state laws, data engineers need to keep abreast of any local legislation that might impact their company's data use for AI initiatives, such as the recently signed SB 53 in California, and adjust their data governance strategies accordingly. Furthermore, what data is used and how it's sourced should always be top of mind, with Anthropic's recent settlement of a copyright infringement lawsuit serving as a stark reminder of that imperative.
Lastly, the quicksilver momentum of the latest AI has led to an explosion of new tools and platforms. While data engineers are responsible for keeping up with these innovations, that can be easier said than done, given the steep learning curves involved and the time required to truly upskill in any one tool against AI's perpetual pace of change. It's a precarious balancing act, one that data engineers must master quickly in order to stay relevant.
Despite these challenges, however, the future outlook of the profession isn't doom and gloom. While the field will undergo massive changes in the near future due to AI innovation, it will still be recognizably data engineering, as even technology like GenAI requires clean, governed data and the underlying infrastructure to support it. Rather than being replaced, data engineers are more likely to emerge as key players in the grand design of an AI-forward future.
