What Is Dataset Schema? The Complete Guide (2026)

Saar Twito, Founder & SEO Engineer · 9 min read

Hi, I'm Saar - a software engineer, SEO specialist, and lecturer who loves building tools and teaching tech.

What Is Dataset Schema?

Dataset schema is a type of structured data — defined in the Schema.org Dataset vocabulary — that describes a collection of data, typically used in scientific, academic, or government contexts. It powers Google Dataset Search, a specialized search engine for open datasets, and makes research data discoverable by AI systems and data scientists worldwide.

TL;DR
  • Required by Google: name and description
  • Recommended: creator with ORCID (persons) or ROR (organizations) in sameAs, identifier (DOI), license (URL), keywords, isAccessibleForFree, temporalCoverage, spatialCoverage, distribution (DataDownload with contentUrl and encodingFormat)
  • Greadme warns if description is shorter than 50 characters or longer than 5000 characters, the same 50 to 5000 character range that Google's guidelines give for the dataset summary

Why Dataset Schema Matters

Google Dataset Search indexes datasets from repositories like Zenodo, Figshare, Harvard Dataverse, and government data portals. Datasets without structured markup are either not indexed or indexed with poor metadata — making them effectively invisible to data discovery systems.

AI research assistants (including those built on GPT-4, Claude, and Gemini) increasingly use Dataset Search as a retrieval source when answering questions about available research data. Properly marked-up datasets with complete metadata are far more likely to be cited in AI-generated research summaries.

Required Properties

Property | Greadme rule
name | Error if missing. Should be specific and descriptive; generic names like "dataset" or "data" trigger a warning
description | Error if missing. Greadme warns if under 50 characters (too brief to be useful) or over 5000 characters (likely exceeds display limits). Google guidelines recommend a clear, focused description
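
The two rules in the table can be sketched as a small Python check. This is a hypothetical illustration of the rules as described above, not Greadme's actual code; the function name, the message strings, and the exact set of "generic" names are assumptions.

```python
# Hypothetical checker that mirrors the rules in the table above.
# The only generic names listed are the two the article mentions.
GENERIC_NAMES = {"dataset", "data"}


def check_required(dataset: dict) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for the name and description rules."""
    findings = []

    name = str(dataset.get("name") or "").strip()
    if not name:
        findings.append(("error", "name is missing"))
    elif name.lower() in GENERIC_NAMES:
        findings.append(("warning", "name is too generic"))

    description = str(dataset.get("description") or "").strip()
    if not description:
        findings.append(("error", "description is missing"))
    elif len(description) < 50:
        findings.append(("warning", "description is under 50 characters"))
    elif len(description) > 5000:
        findings.append(("warning", "description is over 5000 characters"))

    return findings
```

Calling check_required on an empty dict returns two errors, while a dataset with a specific name and a description of 50 to 5000 characters returns an empty list.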

Recommended Properties

Property | Notes
creator | Person or Organization who created the dataset. Use ORCID ID for persons, ROR ID for organizations in sameAs. Greadme warns if missing
identifier | Unique persistent identifier such as a DOI (e.g., https://doi.org/10.1234/abc) or Compact Identifier. Greadme warns if missing
license | URL of the license (e.g., Creative Commons). Must be a valid URL when given as a string. Greadme warns and deducts 10 points if missing
keywords | Keywords describing the dataset's subject matter
isAccessibleForFree | Boolean: true or false. Indicates whether the data is freely accessible
temporalCoverage | ISO 8601 date, year, or range (e.g., "2008", "1990/2020", "2018-01-01/..")
spatialCoverage | Named place string or Place object with GeoCoordinates or GeoShape
distribution | DataDownload object(s) with contentUrl (download link) and encodingFormat (e.g., "CSV", "JSON")
url | URL of the page describing this dataset. Greadme warns if missing
includedInDataCatalog | DataCatalog object if the dataset belongs to a larger catalog
funder | Person or Organization that funded the dataset creation
citation | Text string or CreativeWork object citing academic papers that describe or use this dataset
hasPart | Sub-datasets (nested Dataset objects) if the dataset has components
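
The spatialCoverage row allows three shapes, which can be easier to see side by side. The sketch below builds each one as a Python dict and serializes one of them to JSON-LD. The coordinates are made-up illustrative values, and the variable names are assumptions.

```python
import json

# Option 1: a named place as a plain string.
named_place = "Arctic Ocean"

# Option 2: a Place with a single point (GeoCoordinates).
point = {
    "@type": "Place",
    "geo": {"@type": "GeoCoordinates", "latitude": 90.0, "longitude": 0.0},
}

# Option 3: a Place with a bounding box (GeoShape).
# A box is two space-separated "latitude longitude" corner points,
# lower corner first. These values are illustrative only.
bounding_box = {
    "@type": "Place",
    "geo": {"@type": "GeoShape", "box": "60.0 -180.0 90.0 180.0"},
}

dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Arctic Sea Ice Extent 1979-2024",
    "spatialCoverage": bounding_box,  # swap in named_place or point as needed
}

print(json.dumps(dataset, indent=2))
```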

ORCID and ROR: Persistent Researcher and Institution Identifiers

Dataset schema has specific conventions for identifying creators that differ from other content schemas. The scientific community uses:

  • ORCID (Open Researcher and Contributor ID): A persistent digital identifier for individual researchers. Format: https://orcid.org/0000-0000-0000-0000.
  • ROR (Research Organization Registry): A persistent identifier for research organizations. Format: https://ror.org/xxxxxxxxx.

These identifiers go in the sameAs property of the creator object. Greadme warns when a creator is missing sameAs, and the message specifies ORCID for persons and ROR for organizations.
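
As a rough illustration of that convention, the sketch below checks whether a creator object carries the expected kind of identifier in sameAs. It is an assumption about how such a check could look, not Greadme's implementation; the function name is hypothetical and the identifiers are the placeholder formats shown above.

```python
# Expected identifier prefix per creator type, following the convention above.
EXPECTED_PREFIX = {
    "Person": "https://orcid.org/",
    "Organization": "https://ror.org/",
}


def creator_has_persistent_id(creator: dict) -> bool:
    """True if sameAs holds an ORCID (Person) or ROR (Organization) URL."""
    creator_type = creator.get("@type")
    if not isinstance(creator_type, str) or creator_type not in EXPECTED_PREFIX:
        return False
    prefix = EXPECTED_PREFIX[creator_type]

    same_as = creator.get("sameAs", [])
    if isinstance(same_as, str):  # sameAs may be one URL or a list of URLs
        same_as = [same_as]
    return any(str(url).startswith(prefix) for url in same_as)


# Placeholder identifiers in the formats given above, not real records.
person = {
    "@type": "Person",
    "name": "Jane Researcher",
    "sameAs": "https://orcid.org/0000-0000-0000-0000",
}
organization = {
    "@type": "Organization",
    "name": "Example Research Institute",
    "sameAs": ["https://ror.org/xxxxxxxxx"],
}
```

The check only looks at the URL prefix. It does not confirm that the identifier resolves to a real researcher or institution.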

Temporal Coverage Formats

The temporalCoverage property accepts four ISO 8601 formats. Greadme errors on anything that does not match one of these patterns:

Format | Example | Meaning
Year only | "2020" | Data covers a single year
Full date | "2020-03-15" | Data covers a single day
Date range | "1990/2020" | Data spans from 1990 to 2020
Open-ended range | "2018-01-01/.." | Data starts in 2018 and is continuously updated
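
A minimal sketch of a pattern check for these four shapes follows. The regular expression is an assumption about what "matches one of these patterns" could mean, not the pattern Greadme uses, and it checks shape only: a value like "2020-13-45" would still pass.

```python
import re

# One endpoint: a year (YYYY) or a full date (YYYY-MM-DD).
_DATE = r"\d{4}(?:-\d{2}-\d{2})?"

# A single endpoint, optionally followed by "/" and either a second
# endpoint or ".." for an open-ended range.
_TEMPORAL = re.compile(rf"{_DATE}(?:/(?:{_DATE}|\.\.))?")


def is_valid_temporal_coverage(value: str) -> bool:
    """Shape check only; does not verify that the dates exist on a calendar."""
    return _TEMPORAL.fullmatch(value) is not None


for sample in ["2020", "2020-03-15", "1990/2020", "2018-01-01/..", "1990-2020"]:
    print(sample, is_valid_temporal_coverage(sample))
```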

Dataset Schema Code Example

A complete dataset schema for a climate research dataset:

{
  "@context": "https://schema.org",
  "@type": "Dataset",
  "name": "Arctic Sea Ice Extent 1979-2024",
  "description": "Monthly satellite measurements of Arctic sea ice extent from 1979 to 2024, collected by NSIDC using passive microwave remote sensing. Covers the Arctic Ocean and surrounding seas.",
  "url": "https://data.example.edu/arctic-ice",
  "identifier": "https://doi.org/10.1234/arctic-ice",
  "license": "https://creativecommons.org/licenses/by/4.0",
  "isAccessibleForFree": true,
  "keywords": ["Arctic", "sea ice", "climate", "remote sensing"],
  "temporalCoverage": "1979/2024",
  "creator": {
    "@type": "Organization",
    "name": "National Snow and Ice Data Center",
    "sameAs": "https://ror.org/02nv7yv05"
  },
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "https://data.example.edu/arctic-ice.csv",
    "encodingFormat": "CSV"
  }
}

Common Mistakes to Avoid

  • Description under 50 characters: Too brief to be useful in dataset discovery. Greadme warns on short descriptions — expand with methodological or subject-matter context.
  • Description over 5000 characters: Greadme warns — trim to a clear summary and use documentation pages for extended methodology.
  • Generic dataset name: Names like "dataset" or "data" are flagged by Greadme. Use a specific, descriptive title.
  • License as text instead of URL: Writing "license": "CC BY 4.0" instead of the actual URL causes a Greadme warning. Always use the full license URL.
  • Missing distribution: Without a DataDownload, users and AI systems cannot access the data. Greadme warns and deducts 10 points.
  • Invalid temporalCoverage format: Date ranges must use the ISO 8601 interval format, with a slash between the start and end (YYYY/YYYY or YYYY-MM-DD/YYYY-MM-DD). "1990/2020" is valid, but "1990-2020" (hyphen instead of slash) is not.
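
The license mistake in the list above is easy to catch before publishing. The sketch below uses Python's standard urllib.parse to tell a license URL from a license label; the function name is hypothetical, and it covers only string values, not license given as a CreativeWork object.

```python
from urllib.parse import urlparse


def looks_like_url(value: str) -> bool:
    """True if value is an absolute http(s) URL with a host."""
    try:
        parsed = urlparse(value.strip())
    except ValueError:  # raised for some malformed inputs
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(looks_like_url("https://creativecommons.org/licenses/by/4.0"))
print(looks_like_url("CC BY 4.0"))
```

The first call prints True and the second prints False, since "CC BY 4.0" has neither a scheme nor a host.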

How Greadme Validates Dataset Schema

Greadme runs a dedicated Dataset validator aligned with Google Dataset Search requirements. The score starts at 100 with the following key deductions:

Issue | Points lost
Missing name | −20
Missing description | −20
Description under 50 characters | −15
Description over 5000 characters | −15
Missing creator | −10
Missing license | −10
Missing distribution | −10
Missing url | −8
Missing identifier | −8
Missing keywords | −5
Missing isAccessibleForFree | −5
Missing temporalCoverage | −5
Invalid temporalCoverage format | Error
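
To make the arithmetic concrete, here is a sketch that applies the deductions from the table to a dataset dict. It is an approximation built only from the listed numbers, not Greadme's scoring code. Two details are assumptions: the score is clamped at 0 (the listed "missing" deductions add up to 101), and the temporalCoverage format error is left out because the table gives no point value for it.

```python
# Points lost when a property is missing, copied from the table above.
MISSING_DEDUCTIONS = {
    "name": 20,
    "description": 20,
    "creator": 10,
    "license": 10,
    "distribution": 10,
    "url": 8,
    "identifier": 8,
    "keywords": 5,
    "isAccessibleForFree": 5,
    "temporalCoverage": 5,
}
DESCRIPTION_LENGTH_DEDUCTION = 15


def score(dataset: dict) -> int:
    total = 100
    for prop, points in MISSING_DEDUCTIONS.items():
        value = dataset.get(prop)
        # Test for absence explicitly: False is a legitimate value
        # for isAccessibleForFree and must not count as missing.
        if value is None or value == "" or value == []:
            total -= points

    description = str(dataset.get("description") or "")
    if description and not 50 <= len(description) <= 5000:
        total -= DESCRIPTION_LENGTH_DEDUCTION

    return max(total, 0)  # clamping at 0 is an assumption
```

For example, a dataset with only a specific name and a 60-character description loses 61 points for the eight missing recommended properties and scores 39.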

Frequently Asked Questions

Is Dataset schema only for scientific data?

No. While it is most commonly used by academic and government organizations, Dataset schema can be used for any structured collection of data — including business datasets, marketing statistics, or public APIs. Google Dataset Search indexes any dataset with valid schema markup, regardless of domain.

What if my dataset has multiple file formats?

Use an array for the distribution property, with one DataDownload object per format. Each should have its own contentUrl and encodingFormat. For example, if your dataset is available as both CSV and JSON, include two distribution objects.
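
A short sketch of that pattern, building the distribution array in Python and printing it as JSON-LD. The file list is hypothetical; the JSON download URL in particular is an invented companion to the CSV link used in the earlier example.

```python
import json

# Hypothetical (download URL, format label) pairs for one dataset.
files = [
    ("https://data.example.edu/arctic-ice.csv", "CSV"),
    ("https://data.example.edu/arctic-ice.json", "JSON"),
]

# One DataDownload object per format, each with its own
# contentUrl and encodingFormat.
distribution = [
    {"@type": "DataDownload", "contentUrl": url, "encodingFormat": fmt}
    for url, fmt in files
]

print(json.dumps({"distribution": distribution}, indent=2))
```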

Can I nest datasets inside other datasets?

Yes. Use the hasPart property to embed sub-dataset objects. Each nested Dataset must have its own name, description, and license. Greadme validates each nested Dataset and warns on missing required fields.
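
One way to picture that nested validation is a recursive walk over hasPart. The sketch below is an assumption about how such a check could be written, not Greadme's code; the function name and message format are hypothetical, and it checks only the three fields named above.

```python
# Fields each nested Dataset is expected to carry, per the answer above.
NESTED_FIELDS = ("name", "description", "license")


def missing_in_parts(dataset: dict, path: str = "Dataset") -> list[str]:
    """List missing fields on every nested Dataset, at any depth."""
    problems = []
    parts = dataset.get("hasPart", [])
    if isinstance(parts, dict):  # hasPart may be one object or a list
        parts = [parts]

    for index, part in enumerate(parts):
        part_path = f"{path}.hasPart[{index}]"
        for field in NESTED_FIELDS:
            if not part.get(field):
                problems.append(f"{part_path}: missing {field}")
        # Recurse so sub-datasets of sub-datasets are checked too.
        problems.extend(missing_in_parts(part, part_path))

    return problems
```

A parent whose second part has only a name would yield two messages, one for the missing description and one for the missing license.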