Video and Multimedia Accessibility: Captions, Audio Description, and Beyond

Video and multimedia content have become central to how organizations communicate. Product demos, training courses, marketing campaigns, customer support resources, news coverage, and entertainment — all increasingly delivered through video and audio. Yet multimedia is inherently inaccessible in its default form. A deaf user cannot hear dialogue. A blind user cannot see on-screen action. A user with a cognitive disability may struggle with fast-paced content that offers no pause mechanism. Making multimedia accessible is not a niche concern — it is a requirement that affects millions of people and is mandated by law in most jurisdictions.

Why Multimedia Accessibility Matters

The scale of the issue is significant. Approximately 466 million people worldwide have disabling hearing loss. Over 2.2 billion people have some form of visual impairment. Beyond permanent disabilities, situational barriers are equally common: a commuter watching a video in a noisy train, a parent with a sleeping baby who cannot turn on sound, a non-native speaker who follows written text more easily than spoken language. Research consistently shows that captions benefit all viewers — studies indicate that 80% of people who use captions are not deaf or hard of hearing.

From a legal perspective, inaccessible multimedia is one of the most frequently cited violations in accessibility lawsuits. The landmark NAD v. Netflix case established that streaming video services must provide closed captions. The European Accessibility Act and national implementations across EU member states explicitly require multimedia accessibility for products and services offered to consumers.

WCAG Criteria for Multimedia

WCAG dedicates an entire guideline — Guideline 1.2: Time-Based Media — to multimedia accessibility. Understanding the specific success criteria and their conformance levels is essential for planning your approach.

1.2.1 Audio-only and Video-only (Prerecorded) — Level A. For prerecorded audio-only content (such as a podcast episode), provide a text transcript. For prerecorded video-only content (such as a silent animation or a surveillance-style recording), provide either a text description or an audio track that describes the visual information. This is a baseline requirement. If you publish a podcast without a transcript or a silent instructional animation without a text alternative, you fail Level A.

1.2.2 Captions (Prerecorded) — Level A. All prerecorded audio content in synchronized media must have captions. This applies to any video that includes spoken dialogue, narration, or meaningful sound effects. The captions must be synchronized with the audio and must convey not just speech but also relevant non-speech audio — music, sound effects, speaker identification. This is Level A, meaning it is a fundamental requirement for any level of WCAG conformance.

1.2.3 Audio Description or Media Alternative (Prerecorded) — Level A. For prerecorded synchronized media, provide either audio description of visual content or a full text alternative (such as a screenplay-style transcript that describes both dialogue and visual action). At Level A, you have the choice between these two approaches. A detailed transcript that includes visual descriptions can satisfy this criterion without producing a separate audio description track.

1.2.4 Captions (Live) — Level AA. Live audio content in synchronized media must have real-time captions. This applies to live webinars, live-streamed events, video calls, and live broadcasts. Real-time captioning is technically demanding — it typically requires either a professional CART (Communication Access Realtime Translation) provider or increasingly capable AI-powered live captioning tools. Some delay and reduced accuracy are expected compared to prerecorded captions, but the captions must still be provided.

1.2.5 Audio Description (Prerecorded) — Level AA. At Level AA, a full audio description track is required for prerecorded synchronized media. Unlike 1.2.3, a text alternative alone does not satisfy this criterion — you must provide an actual audio track that describes important visual content during natural pauses in dialogue. This is the most resource-intensive multimedia accessibility requirement at the AA level.

Level AAA requirements (1.2.6 through 1.2.9) go further. 1.2.6 requires sign language interpretation for prerecorded audio content. 1.2.7 requires extended audio description — where the video is paused to allow time for description when natural pauses in dialogue are insufficient. 1.2.8 requires a full text alternative for all prerecorded synchronized media, providing a complete description of all visual and auditory information in a readable format. 1.2.9 requires a text alternative for live audio-only content. While Level AAA conformance across an entire site is rarely required by law, these criteria represent best practice and may be necessary for specific content or audiences.

Captions vs. Subtitles

The terms "captions" and "subtitles" are often used interchangeably, but they serve different purposes and audiences.

Captions are designed for viewers who cannot hear the audio. They include all spoken dialogue, but also identify speakers, describe relevant sound effects ("[door slams]", "[phone ringing]"), indicate music ("[suspenseful music plays]"), and note other auditory information essential to understanding the content. Captions are typically in the same language as the audio. In the United States, the term "closed captions" refers to captions that can be toggled on and off, while "open captions" are burned into the video and always visible.

Subtitles are designed for viewers who can hear the audio but need a text translation. They typically include only spoken dialogue, translated into another language. They assume the viewer can hear sound effects and music, so these are usually omitted.

For WCAG compliance, captions are what is required — not subtitles. The key difference is that captions must convey all relevant audio information, not just dialogue. If your "captions" only include speech and ignore sound effects and speaker identification, they are functionally subtitles and do not fully meet WCAG 1.2.2.

Creating Quality Captions

Quality captions go beyond verbatim transcription. Several factors determine whether captions are genuinely useful to their audience.

Timing and synchronization. Captions must appear when the corresponding audio occurs and disappear at appropriate points. A caption that appears two seconds late or lingers on screen after the speaker has moved on is disorienting. Industry guidance recommends that captions appear as close to the corresponding audio as possible, ideally frame-accurate. Each caption should remain on screen long enough to be read — typically a minimum of one second for short captions, following a reading speed of approximately 160-180 words per minute.

Speaker identification. When multiple speakers are present, identify who is speaking. This is typically done by prefixing with the speaker's name or role in brackets, such as "[Interviewer]" or "[Sarah]". When a speaker is off-screen, indicate this as well. Without speaker identification, a deaf viewer cannot follow a conversation between multiple people.

Sound effects and non-speech audio. Describe sounds that are important to understanding the content. A doorbell ringing that prompts a character to stand up, background music that establishes mood, an alarm that creates urgency — these should all be captioned. Use brackets or italics to distinguish sound descriptions from spoken dialogue: "[alarm blaring]", "[soft piano music]". Do not over-describe — not every ambient sound needs captioning, only those relevant to comprehension.
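These conventions can be seen together in a short WebVTT caption file (the WebVTT format is covered later under caption file formats). The dialogue, speakers, and timings here are invented purely for illustration:

```vtt
WEBVTT

00:00:01.000 --> 00:00:04.000
[Interviewer] Thanks for joining us today.

00:00:04.200 --> 00:00:06.500
[Sarah] Happy to be here.

00:00:06.800 --> 00:00:08.000
[phone ringing]

00:00:08.200 --> 00:00:11.000
[Sarah] Sorry, let me silence that.
```

Note how each cue stays within two lines, speakers are identified in brackets, and a relevant sound effect gets its own bracketed cue rather than being omitted.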

Formatting and readability. Keep captions to a maximum of two lines on screen. Line breaks should follow natural linguistic boundaries — do not break in the middle of a phrase or clause. Use standard capitalization and punctuation. Avoid all-caps except for emphasis that mirrors the speaker shouting. Position captions to avoid obscuring important visual content when possible.

Accuracy. Captions must be accurate. Misspelled words, incorrect homophones, missing words, and garbled phrases undermine comprehension. Professional captioning services typically aim for 99% accuracy. Review captions before publishing, particularly for proper nouns, technical terminology, and numbers.

Auto-Generated Captions

Automatic speech recognition (ASR) technology has improved dramatically, but auto-generated captions alone do not meet WCAG requirements in most cases.

YouTube auto-captions use Google's speech recognition and can produce reasonable results for clear, single-speaker English content. However, accuracy degrades significantly with accents, multiple speakers, background noise, technical jargon, proper nouns, and non-English languages. YouTube's auto-captions also lack speaker identification, sound effect descriptions, and proper punctuation in many cases.

AI-powered captioning tools such as Otter.ai, Rev AI, Whisper (OpenAI), and Microsoft's captioning services have raised the accuracy floor considerably. Some achieve word error rates below 5% for ideal audio conditions. However, even 95% accuracy means one error every twenty words — roughly one error every two or three captions. For dense or technical content, this is noticeable.

Best practice: Use auto-generated captions as a starting point, then edit them manually for accuracy, add speaker identification, add sound effect descriptions, and verify timing. This hybrid approach is significantly faster than captioning from scratch while producing results that meet WCAG requirements. Never publish unreviewed auto-generated captions on content that needs to meet accessibility standards.

Audio Description

Audio description (also called video description or described video) is a narration track that describes important visual information for people who cannot see the video. It fills the gaps between dialogue with descriptions of actions, scene changes, on-screen text, facial expressions, and other visual elements essential to understanding the content.

When audio description is required. Audio description is necessary when the video conveys information visually that is not available through the audio track alone. A talking-head interview where the speaker describes everything verbally may not need audio description. A product demonstration where someone silently manipulates an object, a tutorial with on-screen text instructions, or a narrative film with significant visual storytelling all require audio description.

Standard audio description fits descriptions into natural pauses in the existing audio. The describer must prioritize the most important visual information and fit it into available gaps. This requires skill — writing concise descriptions that convey meaning without overwhelming the listener or talking over dialogue.

Extended audio description (WCAG 1.2.7, Level AAA) pauses the video playback to allow time for longer descriptions when natural pauses are insufficient. This is particularly useful for visually dense content where standard audio description cannot convey enough information. Extended audio description requires a custom player or specially prepared video.

Writing effective audio descriptions:

  • Describe what is seen, not what you interpret — "Sarah frowns" rather than "Sarah is displeased".
  • Prioritize information essential to understanding the narrative or instructional content.
  • Describe on-screen text verbatim when it is important to the content.
  • Use present tense — "A woman walks to the door" rather than "A woman walked to the door".
  • Be concise. Audio description competes for the same auditory channel as the main content.
  • Identify characters by visual characteristics at first, then by name once established.

Transcripts

Transcripts are full text versions of audio and video content. They serve multiple audiences: deaf users, deafblind users (who can use braille displays with transcripts but not with captions), users who prefer reading to watching, users with cognitive disabilities who process text at their own pace, and anyone who wants to search within or reference specific content.

For audio-only content (podcasts, audio recordings), a transcript is the primary accessible alternative required by WCAG 1.2.1.

For video content, a descriptive transcript — one that includes both spoken content and descriptions of important visual information — satisfies WCAG 1.2.3 at Level A. At Level AAA, WCAG 1.2.8 requires such transcripts for all prerecorded synchronized media.

Transcript best practices:

  • Identify speakers by name or role.
  • Include descriptions of relevant visual content in brackets.
  • Add headings and timestamps to allow navigation through long content.
  • Make transcripts available directly on the same page as the media, or link to them prominently.
  • Use proper HTML semantics — headings, paragraphs, lists — rather than dumping text in a single block.
  • Consider providing an interactive transcript that highlights the current position as the media plays.
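The core of an interactive transcript is a small piece of logic: given the media's current playback time, find which transcript segment should be highlighted. A minimal sketch follows; the segment shape and function name are illustrative, not a standard API:

```javascript
// Given transcript segments sorted by start time (in seconds) and the
// media's current time, return the index of the segment to highlight,
// or -1 if playback has not reached the first segment yet.
function activeSegmentIndex(segments, currentTime) {
  let active = -1;
  for (let i = 0; i < segments.length; i++) {
    if (segments[i].start <= currentTime) {
      active = i; // the last segment that has started is the current one
    } else {
      break; // segments are sorted, so nothing later can match
    }
  }
  return active;
}

// Wiring it to a player would look roughly like this (browser-only):
// video.addEventListener("timeupdate", () => {
//   const i = activeSegmentIndex(segments, video.currentTime);
//   // ...move a "current" CSS class to the i-th transcript element...
// });
```

Highlighting via a CSS class (rather than rewriting text) keeps the transcript readable for screen reader users while sighted users get the visual cue.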

Accessible Video Players

The video player itself must be accessible, regardless of how well the content within it is captioned or described. An inaccessible player creates a barrier even when the media files meet every WCAG criterion.

Keyboard accessibility. All player controls — play, pause, stop, volume, mute, captions toggle, full screen, progress scrubbing — must be operable with the keyboard alone. Users must be able to tab to the player, navigate between controls, and activate them with Enter or Space. The tab order should follow a logical sequence.

Screen reader compatibility. Controls must have accessible names and roles. A play button must be identified as a button with the label "Play" (or "Pause" when toggled). The volume slider must be identified as a slider with its current value. Caption and audio description toggles must indicate their current state (on/off). Time indicators should be announced in a meaningful format.

Visible focus indicators. When navigating the player with a keyboard, the currently focused control must have a visible focus outline. Many custom video players strip default focus styles without providing replacements — this is a significant accessibility failure.

No keyboard traps. Users must be able to navigate into and out of the video player with the keyboard. A player that captures focus and prevents tabbing away traps keyboard users.
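A minimal sketch of an accessible custom play/pause control illustrates several of these requirements at once. The IDs and selectors are placeholders, not part of any standard player:

```html
<!-- A native <button> is keyboard-operable (Enter/Space) and exposes the
     button role automatically; we only need to manage the accessible name. -->
<button id="play-toggle" type="button" aria-label="Play">
  <svg aria-hidden="true" focusable="false"><!-- play/pause icon --></svg>
</button>

<script>
  const video = document.querySelector("video");
  const toggle = document.getElementById("play-toggle");
  toggle.addEventListener("click", () => {
    if (video.paused) {
      video.play();
      toggle.setAttribute("aria-label", "Pause");
    } else {
      video.pause();
      toggle.setAttribute("aria-label", "Play");
    }
  });
</script>
```

Using a real button element (rather than a clickable div) avoids having to reimplement keyboard handling and focus behavior, and the toggled aria-label keeps screen readers informed of the control's current action.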

YouTube, Vimeo, and Custom HTML5 Video

YouTube provides a reasonably accessible embedded player. Its controls are keyboard-operable, and it supports both auto-generated and uploaded caption files. To maximize accessibility when embedding YouTube videos: upload edited caption files rather than relying solely on auto-captions, provide a link to a full transcript, and ensure the iframe embedding includes a meaningful title attribute — for example, title="Product demo: Setting up your account".

Vimeo also offers an accessible player with caption support. Vimeo allows uploading of SRT or VTT caption files and supports multiple caption tracks. Its player controls are keyboard-accessible. As with YouTube, provide a transcript alongside the embedded video.

Custom HTML5 video. Building a custom player using the HTML5 video element gives you full control but full responsibility. The native HTML5 video controls vary between browsers in accessibility. If you build custom controls, you must implement full keyboard support, ARIA labels and roles, focus management, and caption rendering. Libraries such as Able Player, Video.js with accessibility plugins, and Plyr are designed with accessibility in mind and provide a strong foundation.
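For reference, a minimal native HTML5 setup with caption and description tracks looks like this. File paths and labels are placeholders:

```html
<video controls preload="metadata">
  <source src="demo.mp4" type="video/mp4">
  <track kind="captions" src="demo-captions.vtt" srclang="en"
         label="English captions" default>
  <track kind="descriptions" src="demo-descriptions.vtt" srclang="en"
         label="English audio descriptions">
  <!-- Fallback content for browsers without HTML5 video support -->
  <p>Your browser does not support HTML5 video.
     <a href="demo-transcript.html">Read the transcript</a>.</p>
</video>
```

Note that browser support for rendering kind="descriptions" tracks natively is limited; players such as Able Player can read these tracks and voice them via text-to-speech, which is one reason to prefer an accessibility-focused player library over raw native controls.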

Regardless of platform, always ensure that:

  • The embed or player has a descriptive title or label.
  • Captions can be turned on and off by the user.
  • Audio description tracks can be selected where available.
  • The player does not start playing automatically (or if it does, meets autoplay requirements).

Background Video and Autoplay

Background videos on hero sections, landing pages, and decorative areas have become a common design pattern. These present specific accessibility challenges.

WCAG 1.4.2 Audio Control (Level A) requires that if any audio plays automatically for more than three seconds, there must be a mechanism to pause or stop the audio, or to control the volume independently of the system volume. Autoplaying video with sound is one of the most jarring accessibility failures — it can disorient screen reader users, startle people with anxiety disorders, and create confusion for anyone who does not expect audio.

Best practices for background video:

  • Mute background videos by default. Use the muted attribute on the HTML5 video element.
  • Provide a visible pause button so users can stop the video.
  • Do not convey essential information through background video alone — it is decorative, not informational.
  • Use prefers-reduced-motion media query to disable background video for users who have indicated they prefer reduced motion.
  • Keep background videos short and low-contrast to minimize distraction.
  • Ensure text overlaid on background video maintains adequate contrast ratios at all times.
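The list above translates into a fairly small amount of markup. This is a sketch only; class names and file paths are placeholders, and the pause button needs a few lines of JavaScript to call video.pause() and video.play():

```html
<!-- Muted, decorative background video: aria-hidden keeps it out of the
     accessibility tree since it conveys no information. -->
<video class="hero-bg" autoplay muted loop playsinline aria-hidden="true">
  <source src="hero-loop.mp4" type="video/mp4">
</video>
<button type="button" aria-label="Pause background video">Pause</button>

<style>
  /* Hide the background video entirely for users who prefer reduced motion */
  @media (prefers-reduced-motion: reduce) {
    .hero-bg { display: none; }
  }
</style>
```

Because the video is muted and decorative, hiding it under prefers-reduced-motion loses nothing; any essential content should live outside the video per the guidance above.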

Animations and GIFs

WCAG 2.2.2 Pause, Stop, Hide (Level A) requires that for any moving, blinking, or scrolling information that starts automatically, lasts more than five seconds, and is presented alongside other content, users must be able to pause, stop, or hide it. This applies to animated GIFs, CSS animations, JavaScript-driven motion, and auto-playing video content.

Animated GIFs are a particular concern. They loop indefinitely by default, cannot be paused by the user without browser extensions, and can cause seizures in people with photosensitive epilepsy if they flash at certain frequencies (see WCAG 2.3.1: Three Flashes or Below Threshold).

Making animations accessible:

  • Provide a pause mechanism for any animation that runs longer than five seconds.
  • Respect the prefers-reduced-motion media query — disable or reduce animations for users who have enabled this setting.
  • Avoid flashing content. Nothing should flash more than three times per second.
  • For animated GIFs, consider replacing them with short video elements that offer pause controls, or use the picture element to show a static image when reduced motion is preferred.
  • Decorative animations should be subtle and non-distracting. If a user cannot focus on the primary content because of a looping animation, the animation is a barrier.
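The GIF-replacement suggestion above can be sketched with the picture element, which lets the browser choose a static image when the user prefers reduced motion. File names here are placeholders:

```html
<picture>
  <!-- Served when the user has requested reduced motion -->
  <source srcset="diagram-static.png"
          media="(prefers-reduced-motion: reduce)">
  <!-- Default: the animated version -->
  <img src="diagram-animated.gif" alt="Animated diagram of the signup flow">
</picture>
```

This approach needs no JavaScript: the media query is evaluated by the browser, and users who have not set the preference see the animation unchanged.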

Podcasts and Audio Content

Podcasts and audio content fall under WCAG 1.2.1 (Audio-only, Prerecorded). The primary requirement is a text transcript.

Transcript requirements for podcasts:

  • Include all spoken content — dialogue, narration, and meaningful utterances.
  • Identify speakers throughout.
  • Note significant non-speech sounds that contribute to meaning.
  • Publish transcripts alongside the audio — on the same page, linked from the episode page, or included in the RSS feed show notes.

Additional podcast accessibility considerations:

  • Provide clear episode titles and descriptions so users can determine whether to listen.
  • Ensure the podcast player is keyboard-accessible with labeled controls.
  • Offer playback speed controls — users with cognitive disabilities or non-native speakers may benefit from slower playback.
  • Consider providing chapter markers for long episodes to aid navigation.
  • If the podcast includes references to visual materials ("as you can see on screen"), describe those materials verbally.

Live Streaming Accessibility

Live streaming introduces unique challenges because content cannot be prepared or reviewed in advance.

Live captions are required at WCAG Level AA (1.2.4). Options include:

  • CART providers: Professional stenographers who caption in real time with high accuracy (typically 98%+). They work remotely and can caption any live event via audio feed. This is the gold standard for accuracy.
  • AI-powered live captioning: Services built into platforms like Zoom, Microsoft Teams, Google Meet, and YouTube Live offer automatic live captions. Quality has improved significantly but still falls short of professional CART for complex content, heavy accents, or technical terminology.
  • Hybrid approaches: An AI-generated caption stream reviewed and corrected by a human operator in real time offers a balance between cost and accuracy.

Additional live streaming considerations:

  • Plan for sign language interpreters when your audience includes sign language users. Show the interpreter in a picture-in-picture view that is large enough to be clearly visible.
  • Describe visual content verbally. If a presenter shares a slide, they should read or summarize its content rather than saying "as you can see here".
  • Provide accessible chat or Q&A mechanisms alongside the stream.
  • Record and archive the stream with full captions and, where possible, audio description added post-event.
  • Test your streaming setup with assistive technologies before going live.

Tools for Creating Captions and Audio Descriptions

Captioning tools:

  • Amara: A free, open-source platform for creating and editing captions collaboratively. Supports multiple formats including SRT and VTT.
  • YouTube Studio: Offers auto-generated captions as a starting point and provides an editor for correcting and timing them. Exports SRT files.
  • Subtitle Edit: A free desktop application (Windows, with Mono support on other platforms) for creating, editing, and converting caption files. Supports numerous formats.
  • Rev: A professional service offering both human-generated and AI-generated captions. Human captions meet 99% accuracy standards.
  • Otter.ai: AI-powered transcription useful as a first draft. Best for clear, single-speaker English content.
  • Descript: Combines transcription, editing, and caption generation. Allows text-based editing of video and audio.
  • Whisper (OpenAI): Open-source speech recognition model that can be run locally. Strong multilingual support and high accuracy for a free tool.

Audio description tools:

  • YouDescribe: A free platform for adding audio descriptions to YouTube videos collaboratively.
  • Able Player: An open-source accessible media player that supports audio description tracks and can handle extended audio description by pausing the video.
  • Professional audio description services: Companies like Audio Eyes, Descriptive Video Works, and CaptionMax provide professional description services including scripting, voice talent, and mixing.

Caption file formats:

  • WebVTT (.vtt): The web standard for HTML5 video. Supports styling, positioning, and metadata. Recommended for web use.
  • SRT (.srt): The most widely supported format. Simple text with timestamps. Works on virtually every platform.
  • TTML (.ttml): An XML-based format used primarily in broadcast. More complex but supports advanced styling.
  • SCC (.scc): A broadcast format used in North American television. Relevant if your content airs on TV.
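The practical differences between the two most common formats are small. The same illustrative cue in SRT looks like this:

```srt
1
00:00:01,000 --> 00:00:04,000
[Interviewer] Thanks for joining us today.
```

SRT uses numeric cue indices and a comma as the decimal separator in timestamps; WebVTT drops the indices, uses a period, and requires a "WEBVTT" header line at the top of the file. Many tools convert between the two losslessly for simple caption content.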

Multimedia Accessibility Checklist

Use this checklist to verify that your multimedia content meets accessibility requirements.

For all prerecorded video with audio:

  • Synchronized captions are provided and include all dialogue, speaker identification, and relevant sound effects (WCAG 1.2.2).
  • Captions are accurate — reviewed and corrected, not only auto-generated.
  • Audio description or a descriptive transcript is available for visual content not described in the audio (WCAG 1.2.3 / 1.2.5).
  • A full transcript is available, ideally on the same page as the video.

For audio-only content (podcasts, recordings):

  • A text transcript is provided (WCAG 1.2.1).
  • The transcript identifies speakers and includes relevant non-speech sounds.
  • The transcript is easy to find — linked directly from the audio player.

For video-only content (silent video, animation):

  • A text description or audio track describing the visual content is provided (WCAG 1.2.1).

For live audio/video content:

  • Real-time captions are provided (WCAG 1.2.4).
  • Visual content is described verbally by the presenter where possible.
  • A captioned recording is made available after the event.

For the video player:

  • All controls are keyboard-accessible.
  • Controls have accessible names and roles for screen readers.
  • Focus indicators are visible.
  • No keyboard traps exist.
  • Caption and audio description toggles are available and labeled.

For autoplay, background video, and animation:

  • Audio does not play automatically for more than three seconds, or a mechanism to pause or stop it (or control its volume independently) is provided (WCAG 1.4.2).
  • A pause button is available for background video and looping animations (WCAG 2.2.2).
  • Animations respect the prefers-reduced-motion setting.
  • Nothing flashes more than three times per second (WCAG 2.3.1).
