Captions are time-synchronized text tracks that display the spoken dialogue, sound effects, and important non-verbal audio cues from a video. WCAG 1.2.2 (Level A) requires them for prerecorded video with audio, and 1.2.4 (Level AA) requires them for live video. Captions differ from subtitles: captions include sound effects and speaker IDs, while subtitles are typically translation-only and assume the viewer can hear non-speech audio.
On the web, captions are typically WebVTT files (.vtt) loaded via the HTML5 <track kind="captions"> element.
Think of captions as a transcript that arrives at the right moment. Without them, a video with audio is effectively a silent film for a deaf viewer, and much the same for the 85% of social-feed viewers watching with sound off.
Captions are usually framed as a disability accommodation, but the audience that benefits is much larger: deaf and hard-of-hearing viewers, sound-off mobile users, ESL viewers, and AI search engines.
On the open web, captions are delivered as WebVTT files (the W3C standard) and attached to HTML5 video via the <track> element. Each cue has a start and end timestamp, the text, and optional speaker IDs and styling cues.
```html
<!-- Bad: <track> without kind="captions" defaults to "subtitles" -->
<video controls>
  <source src="demo.mp4" type="video/mp4">
  <track src="demo.en.vtt" srclang="en" label="English">
</video>

<!-- Good: explicit kind="captions" with English as the default track -->
<video controls>
  <source src="demo.mp4" type="video/mp4">
  <track
    kind="captions"
    src="demo.en.vtt"
    srclang="en"
    label="English"
    default
  >
  <track
    kind="captions"
    src="demo.es.vtt"
    srclang="es"
    label="Español"
  >
</video>
```
A WebVTT file is plain text. Cues include speaker IDs, sound effects in brackets, and music descriptions:
```vtt
WEBVTT

NOTE Product launch demo - English captions

1
00:00:00.500 --> 00:00:03.000
[upbeat instrumental music]

2
00:00:03.000 --> 00:00:06.500
SARAH: Welcome to the 2026 product launch.

3
00:00:06.500 --> 00:00:08.000
[applause]

4
00:00:08.500 --> 00:00:12.000
SARAH: Today we're shipping three things
you've been asking for.

5
00:00:12.000 --> 00:00:14.500
[door slams off-screen]

6
00:00:14.500 --> 00:00:17.000
MIKE: [speaking French] Where is the demo?

7
00:00:17.000 --> 00:00:19.000
SARAH: Right on cue.
```
Captions fail silently. A page can pass every visual QA check while shipping videos with no caption track, with a track marked kind="subtitles" instead of captions, or with auto-generated captions that mangle proper nouns. Use these checks:
Scan every page for a <video> element without a <track kind="captions">, plus tracks with the wrong kind attribute or missing srclang/label.
The HTML5 <track> element defaults to kind="subtitles" if you do not specify it. Subtitles are translation-only; captions include sound effects and speaker IDs. WCAG 1.2.2 requires captions, not subtitles, so set the attribute explicitly.
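Both points above (a video with no captions track, and a track that silently falls back to kind="subtitles") can be checked mechanically. The following is a minimal sketch using only Python's standard-library `html.parser`; the class name `CaptionTrackAudit` and the helper `audit` are hypothetical names chosen for illustration, and the sketch assumes each `<video>` has an explicit closing tag.

```python
from html.parser import HTMLParser


class CaptionTrackAudit(HTMLParser):
    """Collects one finding per <video> or <track> problem (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.findings = []
        self._in_video = False
        self._has_captions = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video":
            self._in_video = True
            self._has_captions = False
        elif tag == "track" and self._in_video:
            # A missing kind attribute is treated as "subtitles" by the browser.
            kind = (attrs.get("kind") or "subtitles").lower()
            if kind == "captions":
                self._has_captions = True
            for required in ("srclang", "label"):
                if not attrs.get(required):
                    self.findings.append(
                        f"track {attrs.get('src')!r}: missing {required}"
                    )

    def handle_endtag(self, tag):
        if tag == "video":
            if not self._has_captions:
                self.findings.append('video: no <track kind="captions">')
            self._in_video = False


def audit(html):
    """Return a list of human-readable findings for one HTML document."""
    parser = CaptionTrackAudit()
    parser.feed(html)
    parser.close()
    return parser.findings
```

Calling `audit(page_source)` on the "Bad" markup shown earlier should report the video as having no captions track, while the "Good" markup should produce an empty list. A real audit would also need to handle players that inject tracks with JavaScript, which static parsing cannot see.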
Caption what is actually said, including filler words and stutters when they carry meaning. Paraphrasing erases nuance and can invite ADA complaints in regulated industries.
Use ALL CAPS speaker names followed by a colon (SARAH:) at the start of each cue when speakers change. Without speaker IDs, multi-person dialogue is impossible to follow with captions alone.
Sound effects that affect comprehension or atmosphere, such as [door slams], [applause], or [phone ringing], must appear in the captions. Decorative ambient noise can be omitted.
Music that sets tone, signals a scene change, or carries lyrics relevant to the video must be captioned, for example [upbeat instrumental music], [somber piano music], or [song lyrics: We will rock you].
When a speaker switches languages, use [speaking French] Where is the hotel? or transcribe the dialogue in its original language with a translation in brackets, depending on the editorial policy of the video.
Long captions that fill the screen overlap visual content and run faster than viewers can read. Break at natural clause boundaries; keep cues on screen for at least 1 second and no longer than 6 seconds.
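The 1-to-6-second guideline above is easy to check against a WebVTT file. The sketch below is a deliberately small parser, not a full WebVTT implementation: it assumes cues are separated by blank lines and ignores styling and region blocks. The function names `to_seconds`, `parse_cues`, and `duration_problems` are hypothetical.

```python
import re

# hh:mm:ss.ttt, with the hours part optional as WebVTT allows
TIMESTAMP = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})")


def to_seconds(stamp):
    """Convert a WebVTT timestamp to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not a WebVTT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cues(vtt_text):
    """Yield (start, end, text) per cue; header and NOTE blocks are skipped."""
    for block in re.split(r"\n\s*\n", vtt_text.strip()):
        lines = block.splitlines()
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue  # no timing line, so this is not a cue
        start, rest = lines[timing].split("-->", 1)
        end = rest.split()[0]  # drop cue settings after the end timestamp
        yield to_seconds(start), to_seconds(end), "\n".join(lines[timing + 1:])


def duration_problems(vtt_text, min_seconds=1.0, max_seconds=6.0):
    """Return (start, duration, text) for cues outside the allowed range."""
    problems = []
    for start, end, text in parse_cues(vtt_text):
        length = end - start
        if length < min_seconds or length > max_seconds:
            problems.append((start, round(length, 3), text))
    return problems
```

Passing the contents of a .vtt file to `duration_problems` returns the cues that flash by too quickly or linger too long, which gives an editor a short list to re-time by hand.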
Automatic-caption services typically produce 70–85% accuracy on conversational speech, dropping sharply on proper nouns, brand names, technical terms, and accented speech. Auto-captions are a starting point, not a final track.
Open captions are burned into the video and cannot be turned off. Use them only for short clips on platforms where toggle support is unreliable (autoplay social posts, embedded ads). For long-form content, closed captions via a <track> element are far more flexible.
What is happening: The team enabled automatic captions and never reviewed the output. Brand names are misspelled, technical terms are wrong, and punctuation is missing. Accuracy is typically 70–85% — far below the 99% threshold most accessibility audits expect.
Fix: Treat auto-captions as a first draft. Have a human editor pass through every video to fix proper nouns, restore punctuation, and add speaker IDs and sound effects.
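One way to put a number on how far an auto-generated draft is from the edited track is word error rate (WER). The sketch below assumes accuracy is measured at the word level as 1 minus WER; the source does not say how the 70–85% and 99% figures are computed, so treat that as an assumption. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only the previous row
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For example, comparing the human-edited line "Welcome to the 2026 product launch" with an auto-caption that reads "Welcome to the 2026 product lunch" gives one substitution in six words, so accuracy on that line is about 83%. This simple version does not normalize punctuation, so a real comparison would strip it first.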
What is happening: A <track> element exists, but the kind attribute is missing or set to subtitles. Browsers and players then treat the track as subtitles, so it may be labeled as subtitles in the player menu, and assistive technology may not identify it as captions.
Fix: Set kind="captions" explicitly on every English (or original-language) caption track. Reserve kind="subtitles" for translation tracks only.
What is happening: Two or more people speak, but the captions show only the dialogue with no indication of who said what. A deaf viewer cannot tell whether a single person is monologuing or several people are in conversation.
Fix: Add ALL CAPS speaker names with a colon to every cue where the speaker changes (SARAH:, MIKE:, NARRATOR:). For off-screen speakers, prefix with VOICE-OVER: or OFF-SCREEN:.
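A quick way to spot a track with no speaker IDs is to list the speakers it names. The sketch below assumes the convention described above (an ALL CAPS name followed by a colon at the start of a cue); the pattern and the function names `speaker_of` and `speaker_ids` are illustrative, not part of any standard.

```python
import re

# Assumption: a speaker ID is ALL CAPS (digits, spaces, dots, apostrophes and
# hyphens allowed) followed by a colon, at the very start of the cue text.
SPEAKER_ID = re.compile(r"^([A-Z][A-Z0-9 .'\-]*):\s")


def speaker_of(cue_text):
    """Return the speaker name that opens a cue, or None if there is none."""
    match = SPEAKER_ID.match(cue_text)
    return match.group(1) if match else None


def speaker_ids(cue_texts):
    """Return the distinct speaker names in order of first appearance."""
    seen = []
    for text in cue_texts:
        name = speaker_of(text)
        if name is not None and name not in seen:
            seen.append(name)
    return seen
```

If `speaker_ids` returns an empty list for a video in which several people talk, the track has the problem described above. The check cannot tell whether every speaker change is labeled, so it is a first filter rather than a replacement for a human review.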
What is happening: The track contains only translated dialogue — no sound effects, no music cues, no speaker IDs. It is a subtitle track being passed off as captions, and it fails WCAG 1.2.2.
Fix: Maintain two separate tracks: a captions track in the original language (with sound effects and speaker IDs) and subtitle tracks for each translation. Set kind="captions" on the first and kind="subtitles" on the others.
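A track that is really subtitles tends to contain no bracketed sound or music descriptions at all. As a rough heuristic, assuming the square-bracket convention used in this article, the sketch below collects every bracketed description in a list of cue texts; the function name `non_speech_cues` is hypothetical.

```python
import re

# Matches bracketed descriptions such as [applause] or [door slams off-screen]
NON_SPEECH = re.compile(r"\[[^\]]+\]")


def non_speech_cues(cue_texts):
    """Return every bracketed description found in the given cue texts."""
    found = []
    for text in cue_texts:
        found.extend(NON_SPEECH.findall(text))
    return found
```

An empty result for a video that clearly has music or sound effects suggests the track was written as subtitles. A non-empty result does not prove the captions are complete; it only shows that non-speech audio was considered.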
These four media accessibility artifacts are often confused, but each one solves a different problem and is required by a different WCAG criterion.
| Artifact | What It Contains | Primary Audience | WCAG Criterion |
|---|---|---|---|
| Captions | Dialogue + speaker IDs + sound effects + music cues, time-synced | Deaf and hard-of-hearing viewers; sound-off viewers | 1.2.2 (Level A) prerecorded; 1.2.4 (AA) live |
| Subtitles | Dialogue only, translated into another language | Viewers who do not speak the original language | Not a WCAG requirement (i18n feature) |
| Transcripts | Full text of dialogue, sound effects, and (sometimes) visuals — not time-synced | Search engines, AI, deaf-blind users with refreshable braille | 1.2.1 (Level A) for audio-only content |
| Audio Description | Spoken narration describing key visual content during pauses in dialogue | Blind and low-vision viewers | 1.2.3 (A) and 1.2.5 (AA) |
Captions are written for viewers who cannot hear the audio. They include dialogue, speaker IDs, sound effects, and music cues. Subtitles are written for viewers who can hear but do not speak the language — they translate the dialogue and assume the viewer can hear non-speech audio. WCAG 1.2.2 requires captions; subtitles alone do not meet the criterion.
On their own, auto-generated captions are not enough. Auto-caption services typically deliver 70–85% accuracy on conversational speech, drop sharply on proper nouns, brand names, and technical terms, and rarely include speaker IDs or sound effects. WCAG 1.2.2 requires captions to convey the audio content equivalently, and a 70–85%-accurate track does not. Use auto-captions as a draft, then edit them.
WebVTT (Web Video Text Tracks) is the W3C standard for time-synchronized text on the web. It is plain text, supports cues with timestamps, speaker IDs, sound effects, positioning, and styling, and is loaded via the HTML5 <track> element. Every modern browser supports it natively. Files use the .vtt extension.
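Because the format is plain text, a caption file can be produced with a few lines of code. The sketch below serializes a list of cues into the structure shown earlier in this article: the WEBVTT header, then one block per cue separated by blank lines. The function names `format_timestamp` and `to_webvtt` are hypothetical, and the numeric cue identifiers are optional in WebVTT.

```python
def format_timestamp(seconds):
    """Format seconds as a WebVTT timestamp, hh:mm:ss.ttt."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(cues):
    """Serialize (start_seconds, end_seconds, text) tuples as WebVTT text."""
    blocks = ["WEBVTT"]
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}")
    # Blank lines separate the header and each cue block
    return "\n\n".join(blocks) + "\n"
```

Writing the returned string to a file with a .vtt extension, encoded as UTF-8, gives a file that can be referenced from a <track> element.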
Open captions (burned into the video) are usually safer on social feeds because many platforms either auto-play with sound off or do not surface caption controls reliably. For long-form content on your own site, closed captions via <track> give viewers control (font size, language, on/off) and remain accessible to assistive technology. Many teams ship both — open for social, closed everywhere else.
Captions help SEO. Google can index caption text as content associated with the video, so a captioned video can rank for the words spoken in it, while an uncaptioned video is largely opaque to search. Captions also feed video schema (VideoObject with a transcript property) and may improve dwell-time signals.
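As a rough illustration of the VideoObject markup mentioned above, the sketch below builds a JSON-LD string whose `transcript` property is assembled from caption cue texts. The property names come from schema.org; the function name `video_jsonld` and any values passed in are placeholders, and how much weight a given search engine gives the transcript is not something this sketch can show.

```python
import json


def video_jsonld(name, description, upload_date, thumbnail_url, content_url, cue_texts):
    """Build schema.org VideoObject JSON-LD with a transcript from cue texts."""
    # Collapse line breaks inside multi-line cues so the transcript reads as prose
    transcript = " ".join(" ".join(text.split()) for text in cue_texts)
    data = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "uploadDate": upload_date,
        "thumbnailUrl": thumbnail_url,
        "contentUrl": content_url,
        "transcript": transcript,
    }
    return json.dumps(data, indent=2, ensure_ascii=False)
```

The returned string is meant to be placed inside a `<script type="application/ld+json">` element on the page that hosts the video.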
AI search engines cite the transcripts, not the video. Generative engines do not watch video, but they can ingest WebVTT files, video transcripts, and on-page descriptions. A video with a complete, well-formed caption track is more likely to be summarized or cited in an AI answer than the same video with no captions. Without captions, the video effectively does not exist to AI search.
The legal exposure for uncaptioned video is substantial. The ADA has been applied to digital video in US lawsuits (notably National Association of the Deaf v. Netflix in 2012). The European Accessibility Act became enforceable in June 2025 and covers private-sector video on commercial sites. AODA (Ontario) and the Accessible Canada Act apply in Canada. In regulated industries (government, education, finance, healthcare), uncaptioned video is a documented compliance failure that can trigger remediation orders.
Captions are not optional in 2026. WCAG 1.2.2 has been Level A for over a decade, the European Accessibility Act now enforces it across the EU, and the audience that depends on captions — deaf viewers, sound-off mobile users, ESL viewers, AI search engines — is far larger than the audience that does not. A captioned video reaches more people, ranks better in search, gets cited more often by AI, and removes a major source of legal exposure.