We’re delighted to hand the mic over to Gautham Krishnadas for our latest guest blog. Gautham is the Co-founder of Dtechtive, which helps organisations such as the Scottish Government make data more discoverable, trustworthy, and AI-ready. His background spans industry-academia research, full-stack data science, user-centric product design, and commercial delivery. He has a PhD in Applied AI from the University of Edinburgh.
The Abundance Trap
We have more open data than ever before. Governments, research institutions, and international bodies publish millions of datasets covering everything from climate to public spending. And yet a troubling gap is emerging. As AI agents become capable of autonomously discovering, retrieving, and reasoning over data, the distance between data that is published and data that is genuinely usable by AI is widening fast.
According to Oxford Insights’ Government AI Readiness Index 2025, which assessed 195 governments across every region, data readiness is the primary constraint on converting AI investment into real-world value. IBM adds that only 26% of Chief Data Officers feel confident their data can support AI applications. For open data, that figure is almost certainly lower.
Opening data is necessary. But it is no longer sufficient.
From Open to Agent-Ready: What’s the Difference?
AI agents demand far more from a dataset than a human analyst does. A human can interpret an ambiguous column header or work around an inconsistent date format. An agent either finds what it needs in a machine-interpretable form, or it moves on to a proprietary dataset that is better prepared.
Agent-ready open data is structured and well-documented, findable and accessible via APIs, carries provenance and licensing that machines can interpret, and is semantically enriched through knowledge graphs and vector embeddings.
The FAIR-R2 framework captures this well for the AI age: data must be Findable, Accessible, Interoperable, and Reusable, while adding two new dimensions:
- AI-Readiness (AIR) — technical criteria ensuring that data can be efficiently used in machine learning (covering structure, labeling, versioning, APIs, scalability), and
- Responsible AI (RAI) — ethical and accountability safeguards, including bias assessment, explainability, and human oversight.
How to Get There: A Practical Framework
Treat Datasets as Products
The most important shift is organisational, not technical. Agent-ready data requires dedicated ownership: someone accountable for quality, freshness, and usability throughout a dataset’s lifecycle. EY calls this a “data product mindset”. For publishers, it means maintaining up-to-date metadata catalogues, publishing update schedules, and actively engaging with the communities who use their data.
Give Your Dataset a Passport
The field now has a dedicated standard for machine-readable dataset metadata. Croissant, developed by MLCommons with contributions from Google, Kaggle, and Hugging Face, describes dataset contents, structure, provenance, and usage restrictions in a form AI tools can read directly. A Croissant-tagged dataset is discoverable via Google Dataset Search and loadable into TensorFlow or PyTorch with minimal code. Its Responsible AI (RAI) vocabulary extension also captures fairness, explainability, and compliance metadata. Alongside Croissant, publishers should produce Datasheets for Datasets: structured documentation covering how data was collected, known limitations, and recommended uses.
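To make this concrete, here is a minimal sketch of what a Croissant-style record looks like: JSON-LD that pairs schema.org dataset fields with Croissant’s file descriptions. The dataset name, description, and URL are hypothetical, and field names here only approximate the published vocabulary — validate any real record against the Croissant specification before publishing.

```python
import json

# Illustrative Croissant-style dataset description (JSON-LD).
# The dataset, URL, and licence shown are made up for this sketch;
# consult the MLCommons Croissant spec for the authoritative schema.
croissant = {
    "@context": {
        "@vocab": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "Dataset",
    "name": "air-quality-scotland",  # hypothetical example dataset
    "description": "Hourly air quality readings from monitoring stations.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "cr:FileObject",
            "name": "readings.csv",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/data/readings.csv",
        }
    ],
}

# Serialise for publishing alongside the dataset.
print(json.dumps(croissant, indent=2))
```

Because the record is plain JSON-LD, a crawler or agent can parse the licence, format, and download location without ever opening the data file itself — which is exactly what “a passport for your dataset” means in practice.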
APIs are the Front Door
AI agents interact with data through APIs, and the design of those APIs determines how easily an agent can discover, filter, and retrieve what it needs. Adopt an API-first approach: expose datasets through versioned, well-documented endpoints with consistent schemas and support for filtering by date, geography, and category. Poor API design is one of the most overlooked barriers to agent-readiness.
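As a sketch of what “filtering by date, geography, and category” looks like behind a versioned endpoint, the handler below serves an imaginary `/v1/readings` route over a toy in-memory dataset. The record fields and endpoint name are assumptions for illustration; the point is that every filter is optional and composable, mirroring query parameters an agent would send.

```python
from dataclasses import dataclass
from datetime import date

# Toy in-memory dataset; fields are illustrative, not a real schema.
@dataclass
class Record:
    day: date
    region: str
    category: str
    value: float

DATA = [
    Record(date(2025, 1, 1), "Lothian", "pm2.5", 8.1),
    Record(date(2025, 1, 2), "Tayside", "pm2.5", 6.4),
    Record(date(2025, 2, 1), "Lothian", "no2", 21.0),
]

def query_v1(since=None, region=None, category=None):
    """Handler logic for a versioned endpoint such as /v1/readings.

    Each filter is optional, mirroring query parameters like
    ?since=2025-01-15&region=Lothian&category=no2. An agent can
    combine them freely instead of downloading the whole file.
    """
    results = DATA
    if since is not None:
        results = [r for r in results if r.day >= since]
    if region is not None:
        results = [r for r in results if r.region == region]
    if category is not None:
        results = [r for r in results if r.category == category]
    return results

# e.g. query_v1(region="Lothian") returns the two Lothian records.
```

Versioning the path (`/v1/`) means the schema can evolve without silently breaking every agent that learned the old one — the kind of stability a human analyst tolerates losing, but an autonomous agent does not.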
Help Agents Find Your Datasets
Even well-prepared datasets are useless if an agent cannot find them. The Model Context Protocol (MCP), open-sourced by Anthropic and now governed under the Linux Foundation, provides a standardised interface for agents to connect with data sources. Eclair, built on Croissant and MCP, lets agents search datasets across major repositories, retrieve metadata, and load data into workflows automatically. For publishers, adopting Croissant metadata and MCP-compatible APIs is a concrete step towards genuine agent-accessibility.
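The interaction pattern MCP standardises can be sketched in plain Python: a server exposes named tools with descriptions, and an agent lists them and calls one by name with JSON arguments. This is not the official MCP SDK — the catalogue, tool names, and dispatch below are illustrative assumptions showing the shape of the exchange.

```python
# Plain-Python sketch of the request/response pattern MCP standardises.
# NOT the official MCP SDK: catalogue entries and tool names are made up.
CATALOGUE = {
    "scottish-air-quality": {"title": "Scottish Air Quality", "format": "csv"},
    "public-spending-2024": {"title": "Public Spending 2024", "format": "json"},
}

TOOLS = {
    "search_datasets": {
        "description": "Find dataset ids whose title matches a query string.",
        "handler": lambda args: [
            ds_id
            for ds_id, meta in CATALOGUE.items()
            if args["query"].lower() in meta["title"].lower()
        ],
    },
    "get_metadata": {
        "description": "Return stored metadata for a dataset id.",
        "handler": lambda args: CATALOGUE[args["dataset_id"]],
    },
}

def call_tool(name, arguments):
    """Dispatch a named tool call, as an MCP server routes requests."""
    return TOOLS[name]["handler"](arguments)

# An agent's two-step flow: discover, then inspect.
found = call_tool("search_datasets", {"query": "air"})
meta = call_tool("get_metadata", {"dataset_id": found[0]})
```

The value of the standard is that the agent never needs publisher-specific client code: any MCP-compatible source answers the same “list tools, call tool” protocol, so discovery scales across repositories.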
Make Data Retrieval-Ready
For open data to participate in Retrieval-Augmented Generation (RAG) architectures, it needs to be enriched with metadata context at ingestion time, chunked meaningfully, and indexed with vector embeddings. AI-driven techniques are now available to enrich metadata and generate context such as titles, descriptions, and tags at scale, saving months of manual effort for large data estates. Using semantic search and keyword search together, with metadata filtering on top, gives agents far more accurate results.
Embed Governance and Audit for Bias
Static governance policies cannot keep pace with AI. Embed quality guarantees, access controls, and provenance tracking directly into data pipelines so governance travels with the data. And audit proactively for bias: open data often reflects historical inequalities in collection. UNESCO’s AI Ethics Recommendation and the OECD AI Principles both call for proactive fairness audits and diverse, representative datasets. Agent-ready open data must be trustworthy, not just accessible.
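Two of these ideas lend themselves to a small sketch: stamping provenance onto records inside the pipeline so lineage travels with the data, and a crude representation audit that flags under-represented categories. Function names, thresholds, and fields are illustrative assumptions, and a real fairness audit goes far beyond counting shares.

```python
from datetime import datetime, timezone

def with_provenance(record, source, step):
    """Return a copy of `record` with a provenance entry appended,
    so lineage travels with the data through each pipeline step."""
    stamped = dict(record)
    stamped["_provenance"] = record.get("_provenance", []) + [{
        "source": source,
        "step": step,
        "at": datetime.now(timezone.utc).isoformat(),
    }]
    return stamped

def representation_audit(records, field, min_share=0.1):
    """Flag categories below a minimum share of records -- a crude
    first check, not a substitute for a full fairness audit."""
    counts = {}
    for r in records:
        counts[r[field]] = counts.get(r[field], 0) + 1
    total = sum(counts.values())
    return sorted(k for k, n in counts.items() if n / total < min_share)

# Each pipeline stage stamps its own entry, building an audit trail:
rec = with_provenance({"value": 1}, "open-data-portal", "ingest")
rec = with_provenance(rec, "open-data-portal", "validate")
```

Because the stamps live in the record itself rather than in a separate policy document, any downstream consumer — human or agent — can inspect where the data came from and which checks it passed.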
What’s at Stake
An agent-ready open data future is one where publicly available datasets power publicly beneficial AI. Getting there requires action from multiple stakeholders. Governments must invest in data-metadata infrastructure and standards. Publishers must adopt product thinking, generate metadata context, and adopt API-first design. The AI community must build tools that work with open data, not around it.
If we do not act, AI agents will increasingly route around poorly maintained open data in favour of well-curated proprietary sources. The result is AI that serves proprietary interests rather than public ones, and a digital divide that maps almost perfectly onto existing inequalities. As the Oxford Insights index warns, the gap between AI-ready and AI-aspiring nations is already widening faster than the gap in computing power.
Open data was built on the principle that knowledge should be shared. Making it agent-ready is how we honour that principle in an age of intelligent machines.
Open Data Month at The Data Lab Community
In April, we’re shining a spotlight on open data – showcasing interesting and innovative uses of data and AI!
Here’s what we’ve got lined up:
- We Build Together – Public Sector Open Data for Scotland – 23 April 13:00 – 13:50 BST [Online]
- Open Data: Dundee Data Meetup x The Data Lab Community – 28 April 18:00 – 20:00 BST [Dundee, UK]
- Open Isn’t Enough: Rethinking Data for AI Agents – 1 May 12:30 – 13:30 BST [Online] (Gautham will be joining us as a speaker for this session on making open data usable for AI!)
