What Is Dataset Schema?

Saar Twito | Published May 21, 2026 | 9 min read

Saar Twito, Founder & SEO Engineer

Hi, I'm Saar - a software engineer, SEO specialist, and lecturer who loves building tools and teaching tech.

Dataset schema is a type of structured data — defined in the Schema.org Dataset vocabulary — that describes a collection of data, typically used in scientific, academic, or government contexts. It powers Google Dataset Search, a specialized search engine for open datasets, and makes research data discoverable by AI systems and data scientists worldwide.

TL;DR

Required by Google: name and description
Recommended: creator with ORCID (persons) or ROR (organizations) in sameAs, identifier (DOI), license (URL), keywords, isAccessibleForFree, temporalCoverage, spatialCoverage, distribution (DataDownload with contentUrl and encodingFormat)
Greadme warns if description is shorter than 50 characters or longer than 5000 characters, in line with Google's guideline that the summary be between 50 and 5000 characters long

Why Dataset Schema Matters

Google Dataset Search indexes datasets from repositories like Zenodo, Figshare, Harvard Dataverse, and government data portals. Datasets without structured markup are either not indexed or indexed with poor metadata — making them effectively invisible to data discovery systems.

AI research assistants (including those built on GPT-4, Claude, and Gemini) may draw on Dataset Search and similar indexes when answering questions about available research data. Datasets with complete, well-formed metadata are easier for these systems to find and cite in AI-generated research summaries.

Required Properties

Property	Greadme rule
`name`	Error if missing. Should be specific and descriptive — generic names like "dataset" or "data" trigger a warning
`description`	Error if missing. Greadme warns if under 50 characters (too brief to be useful) or over 5000 characters (likely exceeds display limits). Google's guidelines ask for a summary between 50 and 5000 characters long
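The two required-property rules can be sketched as a small check. This is an illustrative approximation, not Greadme's actual implementation: the function name `check_required`, the message wording, and the `GENERIC_NAMES` set are assumptions made for the example.

```python
from typing import List

# Assumption: the generic names mentioned in the table above.
GENERIC_NAMES = {"dataset", "data"}


def check_required(dataset: dict) -> List[str]:
    """Return 'error: ...' / 'warning: ...' messages for name and description."""
    messages = []
    name = str(dataset.get("name") or "").strip()
    description = str(dataset.get("description") or "").strip()

    if not name:
        messages.append("error: name is missing")
    elif name.lower() in GENERIC_NAMES:
        messages.append("warning: name is too generic")

    if not description:
        messages.append("error: description is missing")
    elif len(description) < 50:
        messages.append("warning: description is under 50 characters")
    elif len(description) > 5000:
        messages.append("warning: description is over 5000 characters")

    return messages
```

An empty list means both required properties passed these checks.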

Recommended Properties

Property	Notes
`creator`	Person or Organization who created the dataset. Use ORCID ID for persons, ROR ID for organizations in `sameAs`. Greadme warns if missing
`identifier`	Unique persistent identifier such as a DOI (e.g., `https://doi.org/10.1234/abc`) or Compact Identifier. Greadme warns if missing
`license`	URL of the license (e.g., Creative Commons). Must be a valid URL when given as a string. Greadme warns and deducts 10 points if missing
`keywords`	Keywords describing the dataset's subject matter
`isAccessibleForFree`	Boolean: `true` or `false`. Indicates whether the data is freely accessible
`temporalCoverage`	ISO 8601 date, year, or range (e.g., `"2008"`, `"1990/2020"`, `"2018-01-01/.."`)
`spatialCoverage`	Named place string or Place object with GeoCoordinates or GeoShape
`distribution`	DataDownload object(s) with `contentUrl` (download link) and `encodingFormat` (e.g., "CSV", "JSON")
`url`	URL of the page describing this dataset. Greadme warns if missing
`includedInDataCatalog`	DataCatalog object if the dataset belongs to a larger catalog
`funder`	Person or Organization that funded the dataset creation
`citation`	Text string or CreativeWork object citing academic papers that describe or use this dataset
`hasPart`	Sub-datasets (nested Dataset objects) if the dataset has components

ORCID and ROR: Persistent Researcher and Institution Identifiers

Dataset schema has specific conventions for identifying creators that differ from other content schemas. The scientific community uses:

ORCID (Open Researcher and Contributor ID): A persistent digital identifier for individual researchers. Format: https://orcid.org/0000-0000-0000-0000.
ROR (Research Organization Registry): A persistent identifier for research organizations. Format: https://ror.org/xxxxxxxxx.

These identifiers go in the sameAs property of the creator object. Greadme warns when a creator is missing sameAs, and the message specifies ORCID for persons and ROR for organizations.
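A minimal sketch of the sameAs check described above. The regular expressions are assumptions based on the published ORCID and ROR identifier shapes (four groups of four digits with an optional trailing X for ORCID; a leading 0, six base-32 characters, and two check digits for ROR). The function name and messages are hypothetical, and the sketch does not verify checksums or that the identifier actually resolves.

```python
import re
from typing import Optional

# Assumed identifier shapes; neither pattern validates the checksum.
ORCID_RE = re.compile(r"https://orcid\.org/[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")
ROR_RE = re.compile(r"https://ror\.org/0[a-hj-km-np-tv-z0-9]{6}[0-9]{2}")


def creator_warning(creator: dict) -> Optional[str]:
    """Return a warning string, or None if sameAs holds a matching identifier."""
    same_as = creator.get("sameAs")
    # sameAs may be a single URL string or a list of URLs.
    urls = [same_as] if isinstance(same_as, str) else list(same_as or [])

    kind = creator.get("@type")
    if kind == "Person":
        pattern, label = ORCID_RE, "ORCID"
    elif kind == "Organization":
        pattern, label = ROR_RE, "ROR"
    else:
        return "creator @type should be Person or Organization"

    if any(isinstance(u, str) and pattern.fullmatch(u) for u in urls):
        return None
    return f"{kind} creator has no {label} URL in sameAs"
```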

Temporal Coverage Formats

The temporalCoverage property accepts four ISO 8601 formats. Greadme errors on anything that does not match one of these patterns:

Format	Example	Meaning
Year only	`"2020"`	Data covers a single year
Full date	`"2020-03-15"`	Data covers a single day
Date range	`"1990/2020"`	Data spans from 1990 to 2020
Open-ended range	`"2018-01-01/.."`	Data starts in 2018 and is continuously updated
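The four formats in the table can be matched with one regular expression. The sketch below is an assumption about how such a check might look, not Greadme's own pattern; it tests only the shape of the string, so an impossible date such as "2020-13-45" would still pass.

```python
import re

# YYYY or YYYY-MM-DD
_DATE = r"[0-9]{4}(?:-[0-9]{2}-[0-9]{2})?"
# A single date, or date/date, or date/.. for an open-ended range
_TEMPORAL_RE = re.compile(rf"{_DATE}(?:/(?:{_DATE}|\.\.))?")


def is_valid_temporal_coverage(value: str) -> bool:
    """Return True if value has one of the four shapes listed in the table."""
    return _TEMPORAL_RE.fullmatch(value) is not None
```

Using `fullmatch` rather than `match` rejects trailing text, so a hyphenated range like "1990-2020" fails instead of being read as the year 1990.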

Dataset Schema Code Example

A complete dataset schema for a climate research dataset:

{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Arctic Sea Ice Extent 1979-2024",
  "description": "Monthly satellite measurements of Arctic sea ice extent from 1979 to 2024, collected by NSIDC using passive microwave remote sensing. Covers the Arctic Ocean and surrounding seas.",
  "url": "https://data.example.edu/arctic-ice",
  "identifier": "https://doi.org/10.1234/arctic-ice",
  "license": "https://creativecommons.org/licenses/by/4.0",
  "isAccessibleForFree": true,
  "keywords": ["Arctic", "sea ice", "climate", "remote sensing"],
  "temporalCoverage": "1979/2024",
  "creator": {
    "@type": "Organization",
    "name": "National Snow and Ice Data Center",
    "sameAs": "https://ror.org/02nv7yv05"
  },
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "https://data.example.edu/arctic-ice.csv",
    "encodingFormat": "CSV"
  }
}
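To publish markup like this, the JSON-LD is normally placed in a script element of type application/ld+json on the dataset's landing page. The helper below is a minimal sketch of generating that element from a Python dictionary; the function name `to_jsonld_script` is hypothetical, and how the tag is injected into a page depends on your templating setup.

```python
import json


def to_jsonld_script(dataset: dict) -> str:
    """Serialize a Dataset dict into an embeddable JSON-LD script element."""
    payload = json.dumps(dataset, indent=2, ensure_ascii=False)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape, so parsers still read the original text.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```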

Common Mistakes to Avoid

Description under 50 characters: Too brief to be useful in dataset discovery. Greadme warns on short descriptions — expand with methodological or subject-matter context.
Description over 5000 characters: Greadme warns — trim to a clear summary and use documentation pages for extended methodology.
Generic dataset name: Names like "dataset" or "data" are flagged by Greadme. Use a specific, descriptive title.
License as text instead of URL: Writing "license": "CC BY 4.0" instead of the actual URL causes a Greadme warning. Always use the full license URL.
Missing distribution: Without a DataDownload, users and AI systems cannot access the data. Greadme warns and deducts 10 points.
Invalid temporalCoverage format: Date ranges must use the ISO 8601 interval format (YYYY/YYYY or YYYY-MM-DD/YYYY-MM-DD). A slash between two years (e.g., "1990/2020") is valid, but "1990-2020" (hyphen instead of slash) is not.
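The license mistake in the list above is easy to catch before publishing. A minimal sketch, assuming that "valid URL" means an http or https URL with a host; the function name is hypothetical and Greadme's own rule may be stricter or looser.

```python
from urllib.parse import urlparse


def is_license_url(value: str) -> bool:
    """Return True if value looks like an absolute http(s) URL."""
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

With this check, "CC BY 4.0" is rejected and "https://creativecommons.org/licenses/by/4.0" is accepted.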

How Greadme Validates Dataset Schema

Greadme runs a dedicated Dataset validator aligned with Google Dataset Search requirements. The score starts at 100 with the following key deductions:

Issue	Points lost
Missing `name`	−20
Missing `description`	−20
Description under 50 characters	−15
Description over 5000 characters	−15
Missing `creator`	−10
Missing `license`	−10
Missing `distribution`	−10
Missing `url`	−8
Missing `identifier`	−8
Missing `keywords`	−5
Missing `isAccessibleForFree`	−5
Missing `temporalCoverage`	−5
Invalid `temporalCoverage` format	Error
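The deduction table can be read as a simple subtraction from 100. The sketch below reproduces only the point values listed above and should not be taken as Greadme's scoring code. Two things are assumptions: the score is clamped at 0, and an invalid temporalCoverage (listed as an error with no point value) is left out.

```python
# Point values copied from the table above.
MISSING_DEDUCTIONS = {
    "name": 20,
    "description": 20,
    "creator": 10,
    "license": 10,
    "distribution": 10,
    "url": 8,
    "identifier": 8,
    "keywords": 5,
    "isAccessibleForFree": 5,
    "temporalCoverage": 5,
}
DESCRIPTION_LENGTH_DEDUCTION = 15


def score_dataset(dataset: dict) -> int:
    """Start at 100 and subtract the listed deductions."""
    score = 100
    for prop, points in MISSING_DEDUCTIONS.items():
        value = dataset.get(prop)
        # False is a legitimate value for isAccessibleForFree, so only
        # None, empty strings, and empty lists count as missing.
        if value is None or value == "" or value == []:
            score -= points

    description = dataset.get("description") or ""
    if description and not 50 <= len(description) <= 5000:
        score -= DESCRIPTION_LENGTH_DEDUCTION

    return max(score, 0)  # Assumption: the score never goes below zero.
```

For example, a dataset that is complete except for its license would score 90 under this sketch.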

Frequently Asked Questions

Is Dataset schema only for scientific data?

No. Although it is most commonly used by academic and government organizations, Dataset schema can describe any structured collection of data, including business datasets, marketing statistics, or data exposed through public APIs. Google Dataset Search can index datasets from any domain as long as the markup is valid, although indexing is never guaranteed.

What if my dataset has multiple file formats?

Use an array for the distribution property, with one DataDownload object per format. Each should have its own contentUrl and encodingFormat. For example, if your dataset is available as both CSV and JSON, include two distribution objects.
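As a rough sketch, the distribution array can be generated from a list of formats. The helper name and the convention that each file lives at the base URL plus a lowercase extension are assumptions for illustration; use whatever download URLs your files actually have.

```python
from typing import Dict, List


def build_distributions(base_url: str, formats: List[str]) -> List[Dict[str, str]]:
    """Return one DataDownload object per file format."""
    return [
        {
            "@type": "DataDownload",
            # Assumption: files are named <base_url>.<extension>
            "contentUrl": f"{base_url}.{fmt.lower()}",
            "encodingFormat": fmt,
        }
        for fmt in formats
    ]
```

Calling `build_distributions("https://data.example.edu/arctic-ice", ["CSV", "JSON"])` yields two DataDownload objects that can be assigned to the distribution property.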

Can I nest datasets inside other datasets?

Yes. Use the hasPart property to embed sub-dataset objects. Each nested Dataset must have its own name, description, and license. Greadme validates each nested Dataset and warns on missing required fields.
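Checking nested datasets is naturally recursive. The sketch below walks hasPart and reports any dataset missing the three fields named above. It is an illustration under assumptions, not Greadme's validator: the function name, the path notation, and the message format are hypothetical.

```python
from typing import List

# Fields the paragraph above says every nested Dataset should carry.
FIELDS_TO_CHECK = ("name", "description", "license")


def missing_fields(dataset: dict, path: str = "Dataset") -> List[str]:
    """Recursively list missing fields for a dataset and its hasPart children."""
    problems = [
        f"{path}: missing {field}"
        for field in FIELDS_TO_CHECK
        if not dataset.get(field)
    ]

    parts = dataset.get("hasPart") or []
    if isinstance(parts, dict):  # a single nested Dataset instead of a list
        parts = [parts]
    for index, part in enumerate(parts):
        problems.extend(missing_fields(part, f"{path}.hasPart[{index}]"))
    return problems
```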