Tied into this challenge is the common assumption that machine learning is a predecessor technology, quietly retired after foundation models arrived. This misreads the relationship between ML and AI. As a discipline, ML encompasses everything involved in training models, including the foundation models that form the base of the current AI moment. What has changed is not relevance but scale and complexity.
Traditional ML models were trained on carefully curated, domain-specific datasets. Foundation models are trained on thousands of datasets simultaneously, drawn from sources with inconsistent formats, uncertain provenance and wildly variable quality. An AI system that can genuinely interpret heterogeneous data in context could improve the pipelines built around it, and ultimately improve its own training process. Every downstream domain where AI is applied would feel the effects.
Contextual understanding of data is the near-term milestone for automating model development. But to go further, we need to understand the barriers that stand in the way. The context problem shows three faces.
The first face is fragmentation. In any sufficiently complex organization, the signals, experiments and institutional knowledge relevant to a modeling problem are scattered across systems that were never designed to communicate with each other.
For example, a data science team building a churn model might find that customer interaction logs live in Salesforce, billing history in a legacy Oracle system, support tickets in Zendesk and product usage telemetry in a data warehouse. Each silo is maintained by a different team, on a different update cadence, with no shared key to join them reliably.
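To make the stitching concrete, here is a minimal sketch of the glue code such a team ends up writing, assuming hypothetical extracts from each silo (the field names, sample values and matching threshold are all invented). With no shared key, the join degenerates into normalization plus fuzzy matching:

```python
from difflib import SequenceMatcher

# Hypothetical extracts from each silo; the field names are illustrative,
# not real Salesforce, Oracle or Zendesk schemas.
crm_rows = [{"sf_id": "003A1", "email": "Ada.L@example.com", "name": "Ada Lovelace"}]
billing_rows = [{"acct_no": "77-1204", "email": "ada.l@example.com"}]
ticket_rows = [{"ticket_id": 9321, "requester": "ada  lovelace"}]

def norm(s):
    """Cheap normalization: lowercase and collapse whitespace."""
    return " ".join(s.lower().split())

def same_person(a, b, threshold=0.9):
    """Fuzzy name match standing in for real entity resolution."""
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

# No shared key exists, so the join is email equality into billing
# and fuzzy name matching into support tickets.
for c in crm_rows:
    bills = [b["acct_no"] for b in billing_rows if norm(b["email"]) == norm(c["email"])]
    tickets = [t["ticket_id"] for t in ticket_rows if same_person(t["requester"], c["name"])]
    print(c["sf_id"], bills, tickets)
```

Every such script is brittle in exactly the way the paragraph above describes: it encodes one team's guess about how the silos line up, and it breaks silently when any silo changes.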
The second face is semantic ambiguity, and it is subtler and, ultimately, more damaging than fragmentation. Meaning is contextual, organizational and unstable across teams. Radiology offers a vivid illustration of how quickly contextual ambiguity compounds. A single training dataset might contain patient records where the same ID field maps to three different identification systems across facilities; “acquisition date” could mean something different depending on which platform recorded it; and outcome labels of “normal” could reflect a clean scan in one department and a stable-but-abnormal baseline in another. None of these ambiguities announce themselves in the data. They are invisible without the institutional knowledge of how a specific department, system or clinician operates.
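One way to see the remedy is that the disambiguating knowledge has to be written down before any pooling happens. Below is a minimal sketch, assuming a hypothetical per-facility codebook (every facility name and convention in it is invented) that records what the raw fields silently mean:

```python
# A hypothetical codebook capturing per-facility semantics that the raw
# data never states; all names and conventions here are invented.
CODEBOOK = {
    "facility_a": {
        "id_system": "medical_record_number",
        "acquisition_date": "when the scan was performed",
        "normal_means": "clean scan, no findings",
    },
    "facility_b": {
        "id_system": "accession_number",
        "acquisition_date": "when the report was finalized",
        "normal_means": "stable relative to an abnormal baseline",
    },
}

def can_pool_labels(fac_x, fac_y):
    """Labels are comparable only if 'normal' denotes the same clinical state."""
    return CODEBOOK[fac_x]["normal_means"] == CODEBOOK[fac_y]["normal_means"]

print(can_pool_labels("facility_a", "facility_b"))  # False: pooling would corrupt labels
```

The check itself is trivial; the hard part is that nothing in the datasets supplies the codebook.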
The third face is the absence of institutional memory. The AI team at a bank might discover over several years that a specific data vendor’s income verification feed behaves erratically during tax season, that self-reported employment data from one acquisition channel is systematically inflated, and that a regulatory change in a prior year makes any data before it came into effect unreliable as a training signal. All of this knowledge is critical to understand when building a model, but it all lives in one analyst’s notebook and a few buried email threads. When the team reorganizes, that memory vanishes.
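A hedged illustration of the alternative: move that memory out of notebooks and into versioned code, for instance as a sample-weighting function that encodes the team's caveats. The cutoff date, tax-season window and weights below are assumptions invented for the sketch, not recommendations:

```python
from datetime import date

# Hypothetical encodings of the bank team's hard-won caveats.
REGULATION_EFFECTIVE = date(2020, 1, 1)        # assumed regulatory cutoff
TAX_SEASON_MONTHS = {2, 3, 4}                  # assumed window of feed instability
INFLATED_CHANNELS = {"acquisition_channel_x"}  # invented channel name

def training_weight(source, channel, observed):
    """Return a sample weight that applies recorded caveats, not tribal memory."""
    if observed < REGULATION_EFFECTIVE:
        return 0.0   # pre-regulation data is unusable as a training signal
    if source == "vendor_income_feed" and observed.month in TAX_SEASON_MONTHS:
        return 0.25  # income feed is erratic in tax season: downweight, don't drop
    if channel in INFLATED_CHANNELS:
        return 0.5   # self-reported employment from this channel is inflated
    return 1.0

print(training_weight("vendor_income_feed", "web", date(2023, 3, 15)))  # 0.25
```

When the team reorganizes, this file survives; the notebook and the email threads do not.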
The field has not been standing still. AutoML, which emerged around 2014, addressed hyperparameter tuning effectively. MLOps, which saw widespread adoption around 2017, made production pipelines more robust and easier to monitor. More recently, coding agents have begun generating code with impressive fluency. Each of these advances solved a real and meaningful problem, but none has solved the context problem.
The field has built progressively better tools for doing what humans specify, while leaving untouched the harder question of whether the specification itself reflects what the data actually means. AutoML could not handle objective mismatches or reason about organizational intent. MLOps tools execute a strategy rather than define one. Coding agents operate without organizational context or institutional memory.
A system capable of truly autonomous ML engineering would need to translate business goals into model objectives. This translation cannot be inferred from data alone; it requires genuine understanding of organizational intent. Such a system would also need to maintain rigorous audit trails tracking provenance across data versions, feature definitions and code commits, not as an administrative record but as a mechanism for grounding decisions in what actually happened.
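As a sketch of what an audit trail as a grounding mechanism could mean in practice (the field names and identifiers are hypothetical), each decision is tied to a data snapshot, a feature-definition version and a code commit, and the record is content-hashed so it is tamper-evident:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ProvenanceRecord:
    data_version: str   # e.g. a dataset snapshot identifier
    feature_defs: str   # e.g. a hash of the feature definition file
    code_commit: str    # e.g. a git SHA
    decision: str       # what was decided, and why

    def fingerprint(self):
        """Content hash over the whole record, making it tamper-evident."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

rec = ProvenanceRecord("snapshot-2024-06-01", "features-v12", "a1b2c3d",
                       "excluded pre-regulation rows from training")
print(rec.fingerprint()[:12])  # short prefix of the record's hash
```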
Critically, model automation would need to be designed with human judgment built into its architecture rather than bolted on as an afterthought. The system requires calibrated support for varying levels of human involvement depending on the task, the stakes and its own confidence at each decision point. Automation that bypasses human judgment at critical moments is a failure mode dressed up as efficiency.
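What built-in rather than bolted-on could look like, in a deliberately simplified sketch: every decision point passes through a gate keyed on the stakes of the task and the system's own confidence. The stakes taxonomy and thresholds are assumptions for illustration only:

```python
# Hypothetical escalation gate: autonomy is earned per decision, not assumed.
THRESHOLDS = {"low": 0.70, "medium": 0.90, "high": 1.01}  # high stakes always escalate

def route_decision(stakes, confidence):
    """Auto-apply only when confidence clears the bar for these stakes."""
    if confidence >= THRESHOLDS[stakes]:
        return "auto_apply"
    return "escalate_to_human"

print(route_decision("low", 0.85))   # auto_apply
print(route_decision("high", 0.99))  # escalate_to_human: no confidence suffices
```

The point of the gate is structural: the human is in the loop by design at the moments that matter, rather than summoned after the fact.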
The connectivity layer, which enables disparate systems to talk to each other, is a largely tractable problem. What remains genuinely open is the semantic layer, which builds systems that understand what organizational data means in a specific institutional context, not just what it contains.
There is a reason to care about all of this that goes beyond operational efficiency or competitive advantage. The challenges of semantic ambiguity, missing institutional memory and contextual fragmentation are not unique to enterprise ML teams. They manifest at even greater scale in the training of foundation models themselves, where thousands of heterogeneous datasets must be aggregated, filtered, and iteratively refined.
The tools and techniques built to automate ML engineering in organizational settings are also needed, under different constraints, in the training pipelines of foundation models. Progress on the enterprise problem therefore translates into progress on the larger one: solving the enterprise setting addresses a central part of it.
The economic implications follow naturally. Custom ML development today requires specialist practitioners and weeks of iteration even for well-scoped problems. A system that could navigate the full workflow autonomously, from problem definition through feature engineering and model evaluation, would compress those timelines dramatically and open high-value use cases that are currently too resource-intensive to pursue.
The teams that crack the semantic understanding problem first will have unlocked the mechanism by which AI systems can begin to improve themselves. And unlike the milestones that get most of the attention, this one has a clear and specific barrier standing in front of it: not more compute, not a better architecture, but the unsolved problem of what data means and who gets to decide.
Doris Xin has served as an ML engineer at LinkedIn, contributed to Apache Spark MLlib as Databricks’ first intern, and earned her PhD at UC Berkeley’s RISELab as an NSF Graduate Research Fellow. Her research, including work with Google Research and Microsoft Research, has appeared at ACM SIGMOD, VLDB, ACM CHI, ACM RecSys, and JMLR. At Disarray, she’s building a system that addresses fragmented data context, reinvented solutions, and institutional knowledge trapped in people’s heads.