What Is Dataset Schema? The Complete Guide (2026)
What Is Dataset Schema?
Dataset schema is a type of structured data — defined in the Schema.org Dataset vocabulary — that describes a collection of data, typically used in scientific, academic, or government contexts. It powers Google Dataset Search, a specialized search engine for open datasets, and makes research data discoverable by AI systems and data scientists worldwide.
- Required by Google: name and description
- Recommended: creator with ORCID (persons) or ROR (organizations) in sameAs, identifier (DOI), license (URL), keywords, isAccessibleForFree, temporalCoverage, spatialCoverage, distribution (DataDownload with contentUrl and encodingFormat)
- Greadme warns if description is shorter than 50 characters or longer than 5000 characters. Google guidelines recommend a sufficient, focused description
Why Dataset Schema Matters
Google Dataset Search indexes datasets from repositories like Zenodo, Figshare, Harvard Dataverse, and government data portals. Datasets without structured markup are either not indexed or indexed with poor metadata — making them effectively invisible to data discovery systems.
AI research assistants (including those built on GPT-4, Claude, and Gemini) increasingly use Dataset Search as a retrieval source when answering questions about available research data. Properly marked-up datasets with complete metadata are far more likely to be cited in AI-generated research summaries.
Required Properties
| Property | Greadme rule |
|---|---|
| name | Error if missing. Should be specific and descriptive; generic names like "dataset" or "data" trigger a warning |
| description | Error if missing. Greadme warns if under 50 characters (too brief to be useful) or over 5000 characters (likely exceeds display limits). Google guidelines recommend a clear, focused description |
Recommended Properties
| Property | Notes |
|---|---|
| creator | Person or Organization who created the dataset. Use ORCID ID for persons, ROR ID for organizations in sameAs. Greadme warns if missing |
| identifier | Unique persistent identifier such as a DOI (e.g., https://doi.org/10.1234/abc) or Compact Identifier. Greadme warns if missing |
| license | URL of the license (e.g., Creative Commons). Must be a valid URL when given as a string. Greadme warns and deducts 10 points if missing |
| keywords | Keywords describing the dataset's subject matter |
| isAccessibleForFree | Boolean: true or false. Indicates whether the data is freely accessible |
| temporalCoverage | ISO 8601 date, year, or range (e.g., "2008", "1990/2020", "2018-01-01/..") |
| spatialCoverage | Named place string or Place object with GeoCoordinates or GeoShape |
| distribution | DataDownload object(s) with contentUrl (download link) and encodingFormat (e.g., "CSV", "JSON") |
| url | URL of the page describing this dataset. Greadme warns if missing |
| includedInDataCatalog | DataCatalog object if the dataset belongs to a larger catalog |
| funder | Person or Organization that funded the dataset creation |
| citation | Text string or CreativeWork object citing academic papers that describe or use this dataset |
| hasPart | Sub-datasets (nested Dataset objects) if the dataset has components |
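As a rough illustration of the two object forms spatialCoverage can take, the Python sketch below builds a Place with GeoCoordinates and a Place with a GeoShape bounding box. The place name, the coordinates, and the helper `with_spatial_coverage` are assumptions made up for this example. The box string lists points as latitude then longitude, lower corner first.

```python
import json

# Illustrative values only: the place name and coordinates are assumptions.
spatial_as_point = {
    "@type": "Place",
    "name": "North Pole",
    "geo": {"@type": "GeoCoordinates", "latitude": 90.0, "longitude": 0.0},
}

# A bounding box: "lat lon lat lon", lower corner first, then upper corner.
spatial_as_box = {
    "@type": "Place",
    "geo": {"@type": "GeoShape", "box": "66.5 -180.0 90.0 180.0"},
}


def with_spatial_coverage(dataset, coverage):
    """Return a copy of the dataset with spatialCoverage set (hypothetical helper)."""
    updated = dict(dataset)
    updated["spatialCoverage"] = coverage
    return updated


dataset = with_spatial_coverage(
    {"@type": "Dataset", "name": "Arctic Sea Ice Extent 1979-2024"},
    spatial_as_box,
)
print(json.dumps(dataset, indent=2))
```

A plain string such as "Arctic Ocean" is also accepted when no coordinates are available.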
ORCID and ROR: Persistent Researcher and Institution Identifiers
Dataset schema has specific conventions for identifying creators that differ from other content schemas. The scientific community uses:
- ORCID (Open Researcher and Contributor ID): A persistent digital identifier for individual researchers. Format: https://orcid.org/0000-0000-0000-0000.
- ROR (Research Organization Registry): A persistent identifier for research organizations. Format: https://ror.org/xxxxxxxxx.
These identifiers go in the sameAs property of the creator object. Greadme warns when a creator is missing sameAs, and the message specifies ORCID for persons and ROR for organizations.
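A minimal Python sketch of this convention follows. It builds a creator object and checks that sameAs looks like an ORCID URL for a Person or a ROR URL for an Organization. The helper names and both regular expressions are assumptions for illustration (the ROR pattern in particular is deliberately loose), and they are not Greadme's actual checks. The ORCID used in the usage lines is the sample iD from ORCID's own documentation.

```python
import re

# Assumed patterns: an ORCID iD is four groups of four characters (the last
# character may be X); a ROR ID is treated here as "0" plus 8 alphanumerics.
ORCID_RE = re.compile(r"https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]")
ROR_RE = re.compile(r"https://ror\.org/0[a-z0-9]{8}")


def make_creator(kind, name, same_as):
    """Build a creator object; kind is "Person" or "Organization"."""
    return {"@type": kind, "name": name, "sameAs": same_as}


def has_expected_identifier(creator):
    """True if sameAs matches the identifier type expected for the creator."""
    same_as = creator.get("sameAs")
    if not isinstance(same_as, str):
        return False  # this sketch ignores sameAs given as a list
    if creator.get("@type") == "Person":
        return ORCID_RE.fullmatch(same_as) is not None
    if creator.get("@type") == "Organization":
        return ROR_RE.fullmatch(same_as) is not None
    return False


person = make_creator("Person", "Josiah Carberry",
                      "https://orcid.org/0000-0002-1825-0097")
org = make_creator("Organization", "National Snow and Ice Data Center",
                   "https://ror.org/02nv7yv05")
print(has_expected_identifier(person), has_expected_identifier(org))
```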
Temporal Coverage Formats
The temporalCoverage property accepts four ISO 8601 formats. Greadme errors on anything that does not match one of these patterns:
| Format | Example | Meaning |
|---|---|---|
| Year only | "2020" | Data covers a single year |
| Full date | "2020-03-15" | Data covers a single day |
| Date range | "1990/2020" | Data spans from 1990 to 2020 |
| Open-ended range | "2018-01-01/.." | Data starts in 2018 and is continuously updated |
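The four formats in the table can be checked with a short pattern match. The Python sketch below is one possible way to do it, not Greadme's implementation. The function name `is_valid_temporal_coverage` is hypothetical, and allowing a range to mix a year with a full date is an assumption of this sketch.

```python
import re
from datetime import date

# A "point" is either a 4-digit year or a full YYYY-MM-DD date.
_POINT = r"\d{4}(?:-\d{2}-\d{2})?"
# A value is a point, optionally followed by "/" and a second point or "..".
_PATTERN = re.compile(rf"({_POINT})(?:/({_POINT}|\.\.))?")


def _is_real_point(value):
    """Years pass as-is; full dates must exist on the calendar."""
    if len(value) == 4:
        return True
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def is_valid_temporal_coverage(value):
    """True for a year, a full date, a start/end range, or an open-ended range."""
    if not isinstance(value, str):
        return False
    match = _PATTERN.fullmatch(value)
    if match is None:
        return False
    start, end = match.group(1), match.group(2)
    if not _is_real_point(start):
        return False
    return end in (None, "..") or _is_real_point(end)


for sample in ["2020", "2020-03-15", "1990/2020", "2018-01-01/..", "1990-2020"]:
    print(sample, is_valid_temporal_coverage(sample))
```

The last sample uses a hyphen instead of a slash between the years, so it is rejected.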
Dataset Schema Code Example
A complete dataset schema for a climate research dataset:
```json
{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Arctic Sea Ice Extent 1979-2024",
  "description": "Monthly satellite measurements of Arctic sea ice extent from 1979 to 2024, collected by NSIDC using passive microwave remote sensing. Covers the Arctic Ocean and surrounding seas.",
  "url": "https://data.example.edu/arctic-ice",
  "identifier": "https://doi.org/10.1234/arctic-ice",
  "license": "https://creativecommons.org/licenses/by/4.0",
  "isAccessibleForFree": true,
  "keywords": ["Arctic", "sea ice", "climate", "remote sensing"],
  "temporalCoverage": "1979/2024",
  "creator": {
    "@type": "Organization",
    "name": "National Snow and Ice Data Center",
    "sameAs": "https://ror.org/02nv7yv05"
  },
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "https://data.example.edu/arctic-ice.csv",
    "encodingFormat": "CSV"
  }
}
```
Common Mistakes to Avoid
- Description under 50 characters: Too brief to be useful in dataset discovery. Greadme warns on short descriptions — expand with methodological or subject-matter context.
- Description over 5000 characters: Greadme warns — trim to a clear summary and use documentation pages for extended methodology.
- Generic dataset name: Names like "dataset" or "data" are flagged by Greadme. Use a specific, descriptive title.
- License as text instead of URL: Writing "license": "CC BY 4.0" instead of the actual URL causes a Greadme warning. Always use the full license URL.
- Missing distribution: Without a DataDownload, users and AI systems cannot access the data. Greadme warns and deducts 10 points.
- Invalid temporalCoverage format: Date ranges must use the ISO 8601 interval format (YYYY/YYYY or YYYY-MM-DD/YYYY-MM-DD). Using a slash with just years (e.g., "1990/2020") is valid, but "1990-2020" (hyphen instead of slash) is not.
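The license mistake in the list above is easy to catch before publishing. The Python sketch below shows one way to tell a license URL from a plain-text license name, using only the standard library. The function name `is_license_url` is hypothetical, and requiring an http or https scheme is an assumption of this sketch rather than a documented Greadme rule.

```python
from urllib.parse import urlparse


def is_license_url(value):
    """True if value is a string that parses as an http(s) URL with a host."""
    if not isinstance(value, str):
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_license_url("https://creativecommons.org/licenses/by/4.0"))  # URL form
print(is_license_url("CC BY 4.0"))  # plain-text name, not a URL
```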
How Greadme Validates Dataset Schema
Greadme runs a dedicated Dataset validator aligned with Google Dataset Search requirements. The score starts at 100 with the following key deductions:
| Issue | Points lost |
|---|---|
| Missing name | −20 |
| Missing description | −20 |
| Description under 50 characters | −15 |
| Description over 5000 characters | −15 |
| Missing creator | −10 |
| Missing license | −10 |
| Missing distribution | −10 |
| Missing url | −8 |
| Missing identifier | −8 |
| Missing keywords | −5 |
| Missing isAccessibleForFree | −5 |
| Missing temporalCoverage | −5 |
| Invalid temporalCoverage format | Error |
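To make the arithmetic concrete, here is a rough Python sketch that applies the deductions from the table to a dataset object. It is only an approximation built from the table, not Greadme's code. The names `MISSING_PENALTIES`, `is_missing`, and `score_dataset` are hypothetical, flooring the score at 0 is an assumption, and the invalid temporalCoverage case is left out because the table lists it as an error rather than a point value.

```python
# Points lost when a property is absent, copied from the table above.
MISSING_PENALTIES = {
    "name": 20, "description": 20,
    "creator": 10, "license": 10, "distribution": 10,
    "url": 8, "identifier": 8,
    "keywords": 5, "isAccessibleForFree": 5, "temporalCoverage": 5,
}


def is_missing(dataset, prop):
    """Absent or empty counts as missing; the boolean False does not."""
    value = dataset.get(prop)
    return value is None or value == "" or value == [] or value == {}


def score_dataset(dataset):
    """Start at 100 and subtract each applicable deduction (floored at 0)."""
    score = 100
    for prop, penalty in MISSING_PENALTIES.items():
        if is_missing(dataset, prop):
            score -= penalty
    description = dataset.get("description")
    if isinstance(description, str) and description:
        if len(description) < 50 or len(description) > 5000:
            score -= 15
    return max(score, 0)


print(score_dataset({"name": "Arctic Sea Ice Extent 1979-2024"}))
```

Note that isAccessibleForFree set to false is a valid value, so the presence check must not treat it as missing.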
Frequently Asked Questions
Is Dataset schema only for scientific data?
No. While it is most commonly used by academic and government organizations, Dataset schema can be used for any structured collection of data, including business datasets, marketing statistics, or public APIs. Google Dataset Search is not restricted to a particular domain, so any dataset page with valid schema markup is eligible for indexing.
What if my dataset has multiple file formats?
Use an array for the distribution property, with one DataDownload object per format. Each should have its own contentUrl and encodingFormat. For example, if your dataset is available as both CSV and JSON, include two distribution objects.
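As a small illustration, the Python sketch below builds such an array and prints it as JSON-LD. The helper `make_distributions` is hypothetical, and the JSON download URL is an assumed example that follows the pattern of the CSV URL used earlier in this guide.

```python
import json


def make_distributions(base_url, formats):
    """Build one DataDownload per (encodingFormat, file extension) pair."""
    return [
        {
            "@type": "DataDownload",
            "contentUrl": f"{base_url}.{extension}",
            "encodingFormat": encoding_format,
        }
        for encoding_format, extension in formats
    ]


dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Arctic Sea Ice Extent 1979-2024",
    # Assumed URLs: the same file offered as CSV and as JSON.
    "distribution": make_distributions(
        "https://data.example.edu/arctic-ice",
        [("CSV", "csv"), ("JSON", "json")],
    ),
}
print(json.dumps(dataset, indent=2))
```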
Can I nest datasets inside other datasets?
Yes. Use the hasPart property to embed sub-dataset objects. Each nested Dataset must have its own name, description, and license. Greadme validates each nested Dataset and warns on missing required fields.
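A recursive check along these lines can be sketched in a few lines of Python. This is an illustration of the idea, not Greadme's validator. The function name `find_missing_fields`, the message format, and the sample sub-dataset are assumptions made up for this example.

```python
# Fields each Dataset, nested or not, is expected to carry (per the answer above).
EXPECTED_FIELDS = ("name", "description", "license")


def find_missing_fields(dataset, path="Dataset"):
    """Walk a Dataset and its hasPart children, listing absent fields."""
    problems = []
    for prop in EXPECTED_FIELDS:
        if not dataset.get(prop):
            problems.append(f"{path}: missing {prop}")
    parts = dataset.get("hasPart", [])
    if isinstance(parts, dict):
        parts = [parts]  # a single nested Dataset may be given without a list
    for index, part in enumerate(parts):
        problems.extend(find_missing_fields(part, f"{path}.hasPart[{index}]"))
    return problems


parent = {
    "@type": "Dataset",
    "name": "Arctic Sea Ice Extent 1979-2024",
    "description": "Monthly satellite measurements of Arctic sea ice extent.",
    "license": "https://creativecommons.org/licenses/by/4.0",
    "hasPart": [
        {
            "@type": "Dataset",
            "name": "Hypothetical monthly subset",
            "description": "An illustrative sub-dataset with no license set.",
        }
    ],
}
print(find_missing_fields(parent))
```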
