Using WebVTT to carry media streams

Some of you may know WebVTT as a subtitling format. But WebVTT can also be used to carry video-synchronized metadata, i.e. any data that is not meant to be displayed as is (or at all), but that should be processed at a given time in the video by some JavaScript code in the HTML page.

In this post, I describe the results of the modifications I made to MP4Box to export any MP4 track to WebVTT, including audio or video data. The idea here is not to replace the MP4 file format with WebVTT, because WebVTT was not meant for that and is not really good at it (for one, it is a textual format). The idea is rather to provide a (temporary) way for people to experiment with JavaScript decoders for their data. You can already start using MP4Box to test, but this work is still preliminary and is subject to change.

Currently browsers don’t support all media formats that exist in the world and don’t even expose that data to the JavaScript layer. In the future, this will probably be feasible with the HTML5 DataCue interface. But for now, it’s not possible to exploit that data in your HTML page if it comes from an MP4 file. With the work described here, you can export any MP4 track in WebVTT, retrieve it using the existing TextTrackCue interface and process it with your own JavaScript decoder/processor.

WebVTT metadata

HTML5 allows the use of metadata tracks. In such a case, a page could simply look like the following:

And the associated WebVTT file would look like this:

The data in each cue of a metadata track can be free-formed, but has to be text. It can be XML, or even binary data if encoded properly (e.g. with base64).
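To make the cue structure concrete, here is a minimal sketch that builds a small WebVTT metadata file from a list of cues. The function names (`formatVttTime`, `buildMetadataVtt`) and the sample payloads are illustrative assumptions, not something MP4Box produces; only the `WEBVTT` signature line and the `hh:mm:ss.ttt --> hh:mm:ss.ttt` timing syntax come from the WebVTT format itself.

```javascript
// Format a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt).
function formatVttTime(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  const pad = (n, width) => String(n).padStart(width, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Build a WebVTT document from cues of the form { start, end, payload }.
// Payloads must be text and must not contain blank lines or "-->".
function buildMetadataVtt(cues) {
  const blocks = cues.map(
    (cue) =>
      `${formatVttTime(cue.start)} --> ${formatVttTime(cue.end)}\n${cue.payload}`
  );
  return ["WEBVTT", ...blocks].join("\n\n") + "\n";
}

// Hypothetical sample data: one JSON payload and one base64 payload.
const vtt = buildMetadataVtt([
  { start: 0, end: 1.5, payload: JSON.stringify({ event: "intro" }) },
  { start: 1.5, end: 3, payload: "AQIDBA==" }, // base64 of bytes 01 02 03 04
]);
console.log(vtt);
```

Each cue is separated from the next by a blank line, which is why a payload cannot contain one.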

The interesting aspect of using a metadata track is that synchronization almost comes for free. The browser will trigger events on the JavaScript TextTrack interface associated with the track element when a cue becomes active. So with this approach, you can actually carry any private data in out-of-band tracks.
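As a rough sketch of how a page might consume these events, the helper below listens for `cuechange` on a TextTrack and hands every active cue to a callback. The helper name `watchMetadataTrack` is an assumption for illustration; `mode`, `activeCues` and the `cuechange` event are part of the standard TextTrack interface.

```javascript
// Call onCue(cue) for every cue that is active each time the track
// fires "cuechange". `track` is expected to behave like a TextTrack.
function watchMetadataTrack(track, onCue) {
  // "hidden" keeps the track loaded and firing events without rendering it.
  track.mode = "hidden";
  track.addEventListener("cuechange", () => {
    const active = track.activeCues;
    for (let i = 0; i < active.length; i++) {
      onCue(active[i]);
    }
  });
}

// In a page, this might be wired up as follows (the element lookup is
// an assumption about the page structure):
//   const track = document.querySelector("track").track;
//   watchMetadataTrack(track, (cue) => console.log(cue.startTime, cue.text));
```

If cues overlap in time, the same cue can be reported more than once, so a real decoder would probably need to remember which cues it has already processed.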

Exporting MP4 tracks as WebVTT

To experiment with that approach, I’ve landed a new feature in MP4Box: you can now export the content of any MP4 track into WebVTT. You can decide to embed the media track data in the WebVTT file (it will be base64 encoded if it is binary data) or to put the media data in a file separate from the WebVTT file and reference it from there.
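On the JavaScript side, an embedded base64 payload has to be turned back into bytes before it can be handed to a decoder. A minimal sketch, assuming the cue text contains nothing but the base64 string (the function name `base64ToBytes` is illustrative):

```javascript
// Decode a base64 string into a Uint8Array.
// atob is available in browsers; Buffer is used as a fallback in Node.js.
function base64ToBytes(b64) {
  const binary =
    typeof atob === "function"
      ? atob(b64)
      : Buffer.from(b64, "base64").toString("binary");
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Hypothetical payload: base64 of the four bytes 01 02 03 04.
const sample = base64ToBytes("AQIDBA==");
console.log(Array.from(sample));
```

Keep in mind that base64 inflates the data by about a third, which is one reason the separate media file option exists.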

The following command line shows how to export the content of the track with TrackID 1 from the file named file.mp4 into a WebVTT file named raw.vtt. Here by default the content of the track will not be put in the WebVTT file but in a media file called

As an example, I’ve exported an AAC track from an MP4 file and the result is as follows:

As you can see, I have added a lot of things in the WebVTT header. Here is a short description:

  • the ‘kind’ attribute (as defined in HTML5) indicates that this WebVTT contains metadata,
  • the ‘language’ attribute indicates the language of the track (not yet conformant to the HTML5 spec because it uses 3 characters to represent the language code per ISO 639-2 and not per BCP-47),
  • the ‘label’ attribute which could be used in the video player GUI. In this case, the label is just the string contained in the ‘hdlr’ box of the MP4 track. In MP4Box this can be set when importing a track, as follows:
  • the ‘trackID’ attribute contains the original track identifier in the MP4 file, which might be handy to have in JS if you have multiple metadata tracks and dependencies between them,
  • the ‘baseMediaFile’ attribute indicates that the media data is not embedded in the WebVTT file but in a separate file and gives the name of that file.
  • In this particular case, because AAC is carried in MP4 using the MPEG-4 Systems approach, the MPEG-4 stream type and object type indication values are given in the ‘MPEG-4-streamType’ and ‘MPEG-4-objectTypeIndication’ attributes, as decimal values. In other cases, these attributes will not be present.
  • Similarly, the configuration information of the decoder is given in the ‘MPEG-4-DecoderSpecificInfo’ attribute and the ‘encoding’ attribute indicates that it is encoded using base64.
  • In this case, because this is an audio track, I’ve added the ‘sampleRate’ and ‘numChannels’ attributes. For video tracks, you would have width, height …
  • Finally, I have added the ‘inBandMetadataTrackDispatchType’ attribute to provide the MIME type of the content of the track if any.
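Since browsers do not expose the WebVTT header to scripts, one way to read these attributes is to fetch the file as text and parse the header by hand. The sketch below assumes each attribute sits on its own `name: value` line between the `WEBVTT` line and the first blank line; that layout, the function name `parseVttHeader` and the sample values are assumptions made for illustration and may not match the exact syntax MP4Box writes.

```javascript
// Parse "name: value" lines found between the first line ("WEBVTT")
// and the first blank line. The header layout is an assumption.
function parseVttHeader(vttText) {
  const lines = vttText.split(/\r?\n/);
  const attributes = {};
  for (let i = 1; i < lines.length && lines[i].trim() !== ""; i++) {
    const sep = lines[i].indexOf(":");
    if (sep === -1) continue; // not an attribute line, skip it
    const name = lines[i].slice(0, sep).trim();
    attributes[name] = lines[i].slice(sep + 1).trim();
  }
  return attributes;
}

// Hypothetical header using attribute names described above.
const header = parseVttHeader(
  [
    "WEBVTT",
    "kind: metadata",
    "trackID: 1",
    "sampleRate: 44100",
    "numChannels: 2",
    "",
    "00:00:00.000 --> 00:00:00.023",
  ].join("\n")
);
console.log(header.kind, Number(header.sampleRate));
```

All values come back as strings, so numeric attributes such as ‘sampleRate’ need an explicit conversion.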

Then, because the data is not included in this file, the cues are empty and only contain additional settings, as shown below. You can also notice that there is no overlap in the cue timing (as in the MP4 track). The settings are:

  • the offset and size of the cue data in the ‘baseMediaFile’ given by the ‘mediaOffset’ and ‘dataLength’ attributes
  • information about the properties of the cue data. In this case of audio, all cues are random access points as indicated by the ‘isRAP’ attribute. Other attributes could be provided (redundant cue …).
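Given these two values, a script can pull just one cue's bytes out of the base media file. The sketch below shows two options: building an HTTP Range header value for a partial download, or slicing a copy of the file that is already in memory. The function names are illustrative, and it assumes ‘mediaOffset’ and ‘dataLength’ are byte counts and that the server honours range requests.

```javascript
// Build the value of an HTTP Range header covering one cue's data.
// HTTP byte ranges are inclusive on both ends.
function rangeHeaderForCue(mediaOffset, dataLength) {
  if (dataLength <= 0) throw new RangeError("dataLength must be positive");
  return `bytes=${mediaOffset}-${mediaOffset + dataLength - 1}`;
}

// View one cue's data inside a base media file held in an ArrayBuffer.
// No bytes are copied; the returned array shares the buffer's memory.
function sliceCueData(mediaBuffer, mediaOffset, dataLength) {
  return new Uint8Array(mediaBuffer, mediaOffset, dataLength);
}

// Hypothetical values: a cue whose data is 50 bytes long at offset 100.
console.log(rangeHeaderForCue(100, 50));
```

In a page, the header value could be passed to `fetch` through its `headers` option, with the response read using `arrayBuffer()`.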

If I now export the same track using the ‘embedded’ parameter, with the following command line, the media data will be in the WebVTT file.

The WebVTT file header will have just one change (no ‘baseMediaFile’ attribute) and the cues will look like:


I think this is a good step towards using any media data track in HTML5 but for this to be usable, some things need to change. The WebVTT spec currently misses two things:

  • An API to access the header information. I’m currently considering a hack, where I would generate a ‘dummy’ cue at the beginning of the file with the header information in it.
  • An API to access additional settings per cue.
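Until such APIs exist, a workaround is to fetch the WebVTT file as text and parse the cue timing lines yourself. The sketch below assumes the extra settings follow the usual WebVTT `name:value` form, separated by spaces after the end timestamp; the function name `parseCueTimingLine` and the sample values are hypothetical.

```javascript
// Parse a cue timing line into { start, end, settings }.
// Settings are assumed to be space-separated "name:value" tokens.
function parseCueTimingLine(line) {
  const match = /^(\S+)\s+-->\s+(\S+)(.*)$/.exec(line.trim());
  if (!match) return null; // not a timing line
  const settings = {};
  for (const token of match[3].trim().split(/\s+/)) {
    const sep = token.indexOf(":");
    if (sep > 0) settings[token.slice(0, sep)] = token.slice(sep + 1);
  }
  return { start: match[1], end: match[2], settings };
}

// Hypothetical timing line with the settings described earlier.
const parsed = parseCueTimingLine(
  "00:00:00.000 --> 00:00:00.023 mediaOffset:0 dataLength:371 isRAP:true"
);
console.log(parsed.settings);
```

This duplicates work the browser's own parser already does, which is exactly why a proper API would be preferable.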

Ideally, it would be good if the HTML5 spec allowed formats other than WebVTT in the track element, for instance a single-track MP4 file.

As for the next steps in MP4Box, I would like to see how to align the attributes in the header with the SDP description, to provide less base64-encoded data and help reuse existing code.

2 thoughts on “Using WebVTT to carry media streams”

  1. Hi Cyril.

    Did you consider using the timing object and the sequencer for this?

    The point would be that timing object + sequencer already provides the data cue mechanism that you mention. In other words, you could use whatever data format you like and not have to worry about WebVTT at all. The timing would also be more precise.

    I can see that it would be tempting to use the track mechanism if you depend on built-in GUI support for subtitle visualisation in the media element. On the other hand, if the price to pay is an extra conversion step via a suboptimal, non-JSON format, making a subtitle viewer yourself (using the sequencer) seems like an attractive alternative.

    1. Hi Ingar,
      Thanks for the pointer. I’ll check. The initial idea was indeed to consume data synchronized with audio and/or video streams, with a typical media player with a GUI. I don’t see WebVTT as a sub-optimal format for that purpose. WebVTT is a good candidate here. It can be packaged and consumed as a side-car file or in MP4 files, simplifying delivery. It cannot simply be JSON anyway, because JSON is not streamable. There has to be a streamable wrapper around JSON objects.
