How to Add Captions to Videos

Saar Twito · Published May 21, 2025 · Updated April 30, 2026 · 10 min read

Saar Twito, Founder & SEO Engineer

Hi, I'm Saar - a software engineer, SEO specialist, and lecturer who loves building tools and teaching tech.

What Are Video Captions?

Captions are time-synchronized text tracks that display the spoken dialogue, sound effects, and important non-verbal audio cues from a video. WCAG requires them for prerecorded video with an audio track (SC 1.2.2, Level A) and for live video (SC 1.2.4, Level AA). Captions differ from subtitles: captions include sound effects and speaker IDs, while subtitles are typically translation-only and assume the viewer can hear non-speech audio.

Key Facts (TL;DR)

WCAG 2.2 SC 1.2.2 Captions (Prerecorded) — Level A: required for all prerecorded video that has an audio track.
WCAG 2.2 SC 1.2.4 Captions (Live) — Level AA: required for live broadcasts; the legal threshold for most regulated industries.
WCAG 2.2 SC 1.2.3 Audio Description or Media Alternative — Level A: covers important visual content not conveyed in audio.
Audience size: the WHO estimates that more than 430 million people worldwide have disabling hearing loss (an earlier WHO estimate put the figure at 466 million), and many of them depend on captions for video content.
Sound-off viewing: ~85% of Facebook video views happen with sound off (industry data). Captions multiply view-completion rates by 2–3×.
Standard format: WebVTT (.vtt) loaded via the HTML5 <track kind="captions"> element.
Legal exposure: required by the ADA (US), the European Accessibility Act (effective June 2025), and AODA (Ontario). The 2012 NAD v. Netflix consent decree established caption obligations for streaming platforms.

Think of captions as a transcript that arrives at the right moment. Without them, a video with audio is the same as a silent film for a deaf viewer — and effectively the same for the 85% of social-feed viewers watching with sound off.

Why Captions Matter — Beyond Accessibility

Captions are usually framed as a disability accommodation, but the audience that benefits is much larger:

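Deaf and hard-of-hearing users: more than 430 million people globally have disabling hearing loss according to the WHO (an earlier WHO estimate put the figure at 466 million). Without captions, video content is inaccessible to them.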
Cognitive disabilities and ESL viewers: users who process text faster than rapid speech, or whose first language differs from the audio, comprehend captioned video at significantly higher rates.
Sound-off mobile users: ~85% of Facebook video views happen muted (industry data). Captions multiply average view-completion by 2–3× on social feeds.
SEO: Google indexes caption transcript text. A video with English captions ranks for the words spoken in it; without captions, the video is opaque to search.
AI search visibility: generative engines (Google AI Overviews, ChatGPT, Perplexity, Claude) use caption transcripts to summarize, cite, and recommend videos. No captions = no citation.
Legal compliance: the ADA, European Accessibility Act (EAA, effective June 2025), and AODA all mandate captions for public-facing video. The 2012 NAD v. Netflix consent decree established caption obligations for streaming platforms.

How Captions Work: WebVTT and the <track> Element

On the open web, captions are delivered as WebVTT files (the W3C standard) and attached to HTML5 video via the <track> element. Each cue has a start and end timestamp, the text, and optional speaker IDs and styling cues.

<!-- Bad: <track> without kind="captions" defaults to "subtitles" -->
<video controls>
  <source src="demo.mp4" type="video/mp4">
  <track src="demo.en.vtt" srclang="en" label="English">
</video>

<!-- Good: explicit kind="captions" with English as the default track -->
<video controls>
  <source src="demo.mp4" type="video/mp4">
  <track
    kind="captions"
    src="demo.en.vtt"
    srclang="en"
    label="English"
    default
  >
  <track
    kind="captions"
    src="demo.es.vtt"
    srclang="es"
    label="Español"
  >
</video>

A WebVTT file is plain text. Cues include speaker IDs, sound effects in brackets, and music descriptions:

WEBVTT

NOTE Product launch demo - English captions

1
00:00:00.500 --> 00:00:03.000
[upbeat instrumental music]

2
00:00:03.000 --> 00:00:06.500
SARAH: Welcome to the 2026 product launch.

3
00:00:06.500 --> 00:00:08.000
[applause]

4
00:00:08.500 --> 00:00:12.000
SARAH: Today we're shipping three things
you've been asking for.

5
00:00:12.000 --> 00:00:14.500
[door slams off-screen]

6
00:00:14.500 --> 00:00:17.000
MIKE: [speaking French] Where is the demo?

7
00:00:17.000 --> 00:00:19.000
SARAH: Right on cue.
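
As a rough illustration of the cue structure above, the following Python sketch reads the text of a WebVTT file and extracts each cue's start time, end time, and text lines. It is a minimal reader for files shaped like the sample, not an implementation of the full WebVTT parsing algorithm: cue settings, styling, and regions are ignored, and the names Cue, to_seconds, and parse_vtt are invented for this example.

```python
import re
from dataclasses import dataclass

# Matches "HH:MM:SS.mmm --> HH:MM:SS.mmm"; the hours part is optional in WebVTT.
TIMING = re.compile(
    r"^((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d{2,}:)?\d{2}:\d{2}\.\d{3})"
)


@dataclass
class Cue:
    start: float      # seconds from the start of the video
    end: float        # seconds from the start of the video
    lines: list       # caption text, one entry per displayed line


def to_seconds(stamp: str) -> float:
    """Convert a WebVTT timestamp such as 00:00:03.000 to seconds."""
    parts = stamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[-3]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def parse_vtt(text: str) -> list:
    """Return the cues in a WebVTT document; header and NOTE blocks are skipped."""
    cues = []
    normalized = text.replace("\r\n", "\n").strip()
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.match(line.strip())
            if match:
                # Everything after the timing line is the cue text.
                cues.append(Cue(to_seconds(match.group(1)),
                                to_seconds(match.group(2)),
                                lines[i + 1:]))
                break
    return cues
```

Calling parse_vtt on the contents of a .vtt file gives a list of Cue objects that later checks (line length, duration, speaker IDs) can work from.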

How to Check Whether Your Videos Have Proper Captions

Captions fail silently. A page can pass every visual QA check while shipping videos with no caption track, with a track marked kind="subtitles" instead of captions, or with auto-generated captions that mangle proper nouns. Use these checks:

Greadme deep scan — flags every <video> element without a <track kind="captions">, plus tracks with the wrong kind attribute or missing srclang/label.
Greadme crawler scan — runs the same checks across every page on your site, so you can audit a full library of videos in one pass.
Greadme AI visibility analyzer — shows whether AI engines are extracting your video transcripts (a strong downstream signal that captions exist and are well-formed).
Sound-off playback test — mute the video and try to follow it using only the captions. If you cannot follow speakers, sound effects, or scene cues, the captions are incomplete.
Screen reader test — VoiceOver, NVDA, and JAWS expose caption tracks via the player's controls. Confirm the captions menu lists your tracks and that the default track is correct.
WebVTT validation — the W3C's WebVTT validator catches malformed cues, broken timestamps, and overlapping cues.
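
For a quick local check before running a full scan, the markup test described above can be approximated in a few lines of Python. The sketch below uses only the standard library's html.parser and reports, for each video element, whether it contains a track with kind="captions" and whether that track is missing src, srclang, or label. The class and function names (CaptionAudit, audit) are hypothetical, and this is not how any particular scanner works internally; it only inspects static HTML, so tracks added by JavaScript at runtime will not be seen.

```python
from html.parser import HTMLParser


class CaptionAudit(HTMLParser):
    """Collects one report per <video>: does it have a captions track?"""

    def __init__(self):
        super().__init__()
        self.reports = []      # one dict per <video> element, in page order
        self._current = None   # report for the <video> currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "video":
            self._current = {"has_captions": False, "problems": []}
            self.reports.append(self._current)
        elif tag == "track" and self._current is not None:
            # A <track> with no kind attribute is treated as subtitles.
            kind = (attrs.get("kind") or "subtitles").lower()
            if kind == "captions":
                self._current["has_captions"] = True
                for required in ("src", "srclang", "label"):
                    if not attrs.get(required):
                        self._current["problems"].append(
                            f"captions track missing {required}")

    def handle_endtag(self, tag):
        if tag == "video":
            self._current = None


def audit(html: str) -> list:
    """Return a list of reports, one per <video> found in the HTML string."""
    parser = CaptionAudit()
    parser.feed(html)
    parser.close()
    return parser.reports
```

Any report with has_captions set to False, or with a non-empty problems list, points at a video that needs attention.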

9 Rules for High-Quality Captions

1. Use kind="captions", Not the Default Subtitle Mode

The HTML5 <track> element defaults to kind="subtitles" if you do not specify it. Subtitles are translation-only; captions include sound effects and speaker IDs. WCAG 1.2.2 requires captions, not subtitles — set the attribute explicitly.

2. Caption Verbatim, Not Paraphrased

Caption what is actually said, including filler words and stutters when they carry meaning. Paraphrasing erases nuance and can become grounds for ADA complaints in regulated industries.

3. Identify Every Speaker When There Are Two or More

Use ALL CAPS speaker names followed by a colon (SARAH:) at the start of each cue when speakers change. Without speaker IDs, multi-person dialogue is impossible to follow with captions alone.

4. Describe Sound Effects in Square Brackets

[door slams], [applause], [phone ringing] — sound effects that affect comprehension or atmosphere must appear in the captions. Decorative ambient noise can be omitted.

5. Describe Music Cues

[upbeat instrumental music], [somber piano music], [song lyrics: We will rock you]. Music that sets tone, signals a scene change, or carries lyrics relevant to the video must be captioned.

6. Caption Foreign Language With a Bracketed Note

When a speaker switches languages, use [speaking French] Where is the hotel? or transcribe the dialogue in its original language with a translation in brackets, depending on the editorial policy of the video.

7. Limit Each Cue to 1–2 Lines, ~32 Characters per Line

Long captions that fill the screen overlap visual content and run faster than viewers can read. Break at natural clause boundaries; keep cues on screen for at least 1 second and no longer than 6 seconds.
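
The limits in this rule are easy to check mechanically. The sketch below is one possible way to do it in Python, assuming each cue is available as a start time and end time in seconds plus its text; the function name check_cue and its parameter names are illustrative, and the defaults simply restate the numbers above (2 lines, 32 characters, 1 to 6 seconds).

```python
def check_cue(start, end, text,
              max_lines=2, max_chars=32,
              min_seconds=1.0, max_seconds=6.0):
    """Return a list of readability problems for one cue (empty list = OK)."""
    problems = []

    lines = text.split("\n")
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines (max {max_lines})")

    for line in lines:
        if len(line) > max_chars:
            problems.append(f"line has {len(line)} chars (max {max_chars})")

    duration = end - start
    if duration < min_seconds:
        problems.append(f"on screen {duration:.2f}s (min {min_seconds}s)")
    if duration > max_seconds:
        problems.append(f"on screen {duration:.2f}s (max {max_seconds}s)")

    return problems
```

With these defaults, the first line of cue 4 in the sample file earlier in this article (40 characters) would be reported, so teams that treat ~32 as a soft target may want to raise max_chars.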

8. Edit Auto-Generated Captions — Never Ship Them Raw

Automatic-caption services typically reach 70–85% accuracy on conversational speech, and accuracy drops sharply on proper nouns, brand names, technical terms, and accented speech. Auto-captions are a starting point, not a final track.

9. Use Open Captions Only Where the Player Cannot Be Trusted

Open captions are burned into the video and cannot be turned off. Use them only for short clips on platforms where toggle support is unreliable (autoplay social posts, embedded ads). For long-form content, closed captions via a <track> element are far more flexible.

Common Caption Mistakes and How to Fix Them

Problem: Auto-Generated Captions Left Unedited

What is happening: The team enabled automatic captions and never reviewed the output. Brand names are misspelled, technical terms are wrong, and punctuation is missing. Accuracy is typically 70–85% — far below the 99% threshold most accessibility audits expect.

Fix: Treat auto-captions as a first draft. Have a human editor pass through every video to fix proper nouns, restore punctuation, and add speaker IDs and sound effects.
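
To put the accuracy figures in concrete terms: at 85% accuracy, a 1,000-word video contains roughly 150 wrong words, while at 99% it contains roughly 10. The article does not say how accuracy is measured; a common convention, assumed here, is word accuracy = 1 - WER, where the word error rate (WER) is the number of word substitutions, deletions, and insertions divided by the number of words in a human-verified reference transcript. The Python sketch below computes that with a standard word-level edit distance. The function names are illustrative, and the comparison is deliberately simple: it lowercases the text but does not strip punctuation.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word dropped
                            curr[j - 1] + 1,      # extra word inserted
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference, hypothesis):
    """Accuracy as 1 - WER (can go below 0 if there are many insertions)."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

Comparing a hand-corrected transcript of a short sample against the raw auto-caption output gives a rough sense of how much editing the rest of the library will need.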

Problem: <track> Element With No kind="captions"

What is happening: A <track> element exists, but the kind attribute is missing or set to subtitles. The browser treats the track as subtitles, so players that distinguish the two will not list it as captions, and automated accessibility checks will not count it as a caption track.

Fix: Set kind="captions" explicitly on every English (or original-language) caption track. Reserve kind="subtitles" for translation tracks only.

Problem: Captions Without Speaker IDs in Multi-Person Dialogue

What is happening: Two or more people speak, but the captions show only the dialogue with no indication of who said what. A deaf viewer cannot tell whether a single person is monologuing or several people are in conversation.

Fix: Add ALL CAPS speaker names with a colon to every cue where the speaker changes (SARAH:, MIKE:, NARRATOR:). For off-screen speakers, prefix with VOICE-OVER: or OFF-SCREEN:.

Problem: Subtitles Shipped as Captions

What is happening: The track contains only translated dialogue — no sound effects, no music cues, no speaker IDs. It is a subtitle track being passed off as captions, and it fails WCAG 1.2.2.

Fix: Maintain two separate tracks: a captions track in the original language (with sound effects and speaker IDs) and subtitle tracks for each translation. Set kind="captions" on the first and kind="subtitles" on the others.
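
A simple heuristic can help spot a subtitle track labeled as captions: count how many cues carry a speaker ID or a bracketed non-speech description. The Python sketch below assumes the conventions used in this article (ALL CAPS name plus colon, square brackets for sounds and music); the names SPEAKER_ID, NON_SPEECH, and caption_signals are made up for illustration. It is only a signal, since a single-speaker video with no relevant sounds can legitimately have neither.

```python
import re

# "SARAH: ", "VOICE-OVER: ", "SPEAKER 2: " at the start of a cue.
SPEAKER_ID = re.compile(r"^[A-Z][A-Z0-9 .'-]*:\s")
# "[applause]", "[upbeat instrumental music]" anywhere in a cue.
NON_SPEECH = re.compile(r"\[[^\]]+\]")


def caption_signals(cue_texts):
    """Count cues with speaker IDs and with bracketed non-speech cues."""
    speaker_ids = sum(1 for text in cue_texts if SPEAKER_ID.match(text))
    non_speech = sum(1 for text in cue_texts if NON_SPEECH.search(text))
    return {
        "speaker_ids": speaker_ids,
        "non_speech": non_speech,
        # Neither signal present: the track may be dialogue-only subtitles.
        "looks_like_subtitles": speaker_ids == 0 and non_speech == 0,
    }
```

A track flagged with looks_like_subtitles is worth a manual review against the audio before it is published with kind="captions".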

Captions vs Subtitles vs Transcripts vs Audio Description

These four media accessibility artifacts are often confused, but each one solves a different problem and is required by a different WCAG criterion.

Artifact	What It Contains	Primary Audience	WCAG Criterion
Captions	Dialogue + speaker IDs + sound effects + music cues, time-synced	Deaf and hard-of-hearing viewers; sound-off viewers	1.2.2 (Level A) prerecorded; 1.2.4 (Level AA) live
Subtitles	Dialogue only, translated into another language	Viewers who do not speak the original language	Not a WCAG requirement (i18n feature)
Transcripts	Full text of dialogue, sound effects, and (sometimes) visuals — not time-synced	Search engines, AI, deaf-blind users with refreshable braille	1.2.1 (Level A) for audio-only content
Audio Description	Spoken narration describing key visual content during pauses in dialogue	Blind and low-vision viewers	1.2.3 (Level A) and 1.2.5 (Level AA)

FAQ

What is the difference between captions and subtitles?

Captions are written for viewers who cannot hear the audio. They include dialogue, speaker IDs, sound effects, and music cues. Subtitles are written for viewers who can hear but do not speak the language — they translate the dialogue and assume the viewer can hear non-speech audio. WCAG 1.2.2 requires captions; subtitles alone do not meet the criterion.

Are auto-generated captions WCAG compliant?

On their own, no. Auto-caption services typically deliver 70–85% accuracy on conversational speech, drop sharply on proper nouns, brand names, and technical terms, and rarely include speaker IDs or sound effects. WCAG 1.2.2 requires captions to convey the audio content equivalently — a 70–85%-accurate track does not. Use auto-captions as a draft, then edit them.

What is WebVTT and why does it matter?

WebVTT (Web Video Text Tracks) is the W3C standard for time-synchronized text on the web. It is plain text, supports cues with timestamps, speaker IDs, sound effects, positioning, and styling, and is loaded via the HTML5 <track> element. Every modern browser supports it natively. Files use the .vtt extension.

Should social media videos use open or closed captions?

Open captions (burned into the video) are usually safer on social feeds because many platforms either auto-play with sound off or do not surface caption controls reliably. For long-form content on your own site, closed captions via <track> give viewers control (font size, language, on/off) and remain accessible to assistive technology. Many teams ship both — open for social, closed everywhere else.

Do captions help SEO?

Yes. Google indexes caption transcripts as text content associated with the video. A captioned video can rank for the words spoken in it; an uncaptioned video is opaque to search. Captions also feed video schema (VideoObject with a transcript property) and improve dwell-time signals.
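
As a sketch of the schema point above, the Python snippet below builds VideoObject JSON-LD with a transcript property from a list of cue texts. The schema.org vocabulary does define transcript on VideoObject; whether and how a given search engine uses it is not something this example can guarantee. The function name video_jsonld and its parameters are hypothetical, and the values you pass (URLs, dates) are your own.

```python
import json


def video_jsonld(name, description, upload_date,
                 content_url, thumbnail_url, cue_texts):
    """Build VideoObject JSON-LD whose transcript is the joined cue text."""
    # Collapse line breaks inside each cue, then join cues with spaces.
    transcript = " ".join(" ".join(text.split()) for text in cue_texts)
    data = {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "uploadDate": upload_date,      # ISO 8601 date string
        "contentUrl": content_url,
        "thumbnailUrl": thumbnail_url,
        "transcript": transcript,
    }
    return json.dumps(data, indent=2, ensure_ascii=False)
```

The returned string can be placed inside a script element of type application/ld+json on the page that hosts the video.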

Do AI search engines like ChatGPT and Perplexity cite videos?

They cite the transcripts. Generative engines do not watch video, but they readily ingest WebVTT files, video transcripts, and on-page descriptions. A video with a complete, well-formed caption track is significantly more likely to be summarized or cited in an AI answer than the same video with no captions. Without captions, the video effectively does not exist to AI search.

What is the legal exposure for shipping uncaptioned video in 2026?

Substantial. The ADA has been applied to digital video in dozens of US lawsuits (notably NAD v. Netflix, which ended in a 2012 consent decree). The European Accessibility Act became enforceable in June 2025 and covers private-sector video on commercial sites. AODA (Ontario) and the Accessible Canada Act apply in Canada. In regulated industries (government, education, finance, healthcare), uncaptioned video is a documented compliance failure that can trigger remediation orders.

Conclusion

Captions are not optional in 2026. WCAG 1.2.2 has been Level A for over a decade, the European Accessibility Act now enforces it across the EU, and the audience that depends on captions — deaf viewers, sound-off mobile users, ESL viewers, AI search engines — is far larger than the audience that does not. A captioned video reaches more people, ranks better in search, gets cited more often by AI, and removes a major source of legal exposure.

Run a Greadme deep scan to find every video on your site that is missing a caption track, has the wrong kind attribute, or is shipping subtitles where captions are required.