HTML5 allows the use of metadata tracks. In that case, a page could simply look like the following:
<video controls width="1920" height="1080">
  <source src="file.mp4" type="video/mp4">
  <source src="file.webm" type="video/webm">
  <track src="file.vtt" kind="metadata">
</video>
And the associated WebVTT file would look like this:
WEBVTT

00:00:00.000 --> 00:00:01.000
Some data

00:00:03.000 --> 00:00:10.000
Some additional data
The data in each cue of a metadata track can be free-form, but it has to be text. It can be XML, or even binary data if encoded properly (e.g. with base64).
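As a rough illustration of the point above, the following Python sketch builds a small metadata WebVTT document in which text payloads are written as-is and binary payloads are base64-encoded. The helper names `format_timestamp` and `build_metadata_vtt` are made up for this example and are not part of MP4Box or of any WebVTT library.

```python
import base64


def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_metadata_vtt(cues) -> str:
    """Build a WebVTT document from (start, end, payload) tuples.

    str payloads are written unchanged (they must not contain blank
    lines or the string "-->"); bytes payloads are base64-encoded so
    that they become plain text.
    """
    blocks = ["WEBVTT"]
    for start, end, payload in cues:
        if isinstance(payload, bytes):
            payload = base64.b64encode(payload).decode("ascii")
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{timing}\n{payload}")
    # Blocks are separated by one blank line.
    return "\n\n".join(blocks) + "\n"


vtt = build_metadata_vtt([
    (0.0, 1.0, "Some data"),
    (3.0, 10.0, b"\x00\x01\xfe\xff"),  # arbitrary binary payload
])
print(vtt)
```

On the receiving side, the script that reads the cue would have to know (or be told) that a given payload is base64 and decode it before use.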
Exporting MP4 tracks as WebVTT
To experiment with that approach, I’ve landed a new feature in MP4Box: you can now export the content of any MP4 track into WebVTT. You can decide either to embed the media track data in the WebVTT file (it will be base64 encoded if it is binary data) or to put the media data in a separate file and reference it from the WebVTT file.
The following command line shows how to export the content of the track with TrackID 1 from the file named file.mp4 into a WebVTT file named raw.vtt. By default, the content of the track is not put in the WebVTT file but in a media file called raw.media.
MP4Box -webvtt-raw 1:output=raw file.mp4
As an example, I’ve exported an AAC track from an MP4 file and the result is as follows:
WEBVTT Metadata track generated by GPAC MP4Box
kind:metadata
language:eng
label: counter-10mn_aac_16k.aac - Imported with GPAC 0.5.1-DEV-rev4127:4130M
trackID: 1
baseMediaFile: mp4-onDemand-aaclc_low_track1.media
MPEG-4-streamType: 5
MPEG-4-objectTypeIndication: 64
sampleRate: 44100
numChannels: 1
MPEG-4-DecoderSpecificInfo: Egg=
inBandMetadataTrackDispatchType: application/octet-stream
encoding: base64
As you can see, I have added a lot of things in the WebVTT header. Here is a short description:
- the ‘kind’ attribute (as defined in HTML5) indicates that this WebVTT contains metadata,
- the ‘language’ attribute indicates the language of the track (not yet conformant to the HTML5 spec because it uses 3 characters to represent the language code per ISO 639-2 and not per BCP-47),
- the ‘label’ attribute which could be used in the video player GUI. In this case, the label is just the string contained in the ‘hdlr’ box of the MP4 track. In MP4Box this can be set when importing a track, as follows:
MP4Box -add file.aac:name="This is the track label" file.mp4
- the ‘trackID’ attribute contains the original track identifier in the MP4 file, which might be handy to have in JS if you have multiple metadata tracks and dependencies between them,
- the ‘baseMediaFile’ attribute indicates that the media data is not embedded in the WebVTT file but in a separate file and gives the name of that file.
- In this particular case, because AAC is carried in MP4 using the MPEG-4 Systems approach, the MPEG-4 stream type and object type indication values are given as decimal values in the ‘MPEG-4-streamType’ and ‘MPEG-4-objectTypeIndication’ attributes. In other cases, these attributes will not be present.
- Similarly, the configuration information of the decoder is given in the ‘MPEG-4-DecoderSpecificInfo’ attribute and the ‘encoding’ attribute indicates that it is encoded using base64.
- In this case, because this is an audio track, I’ve added the ‘sampleRate’ and ‘numChannels’ attributes. For video tracks, you would have width, height …
- Finally, I have added the ‘inBandMetadataTrackDispatchType’ attribute to provide the MIME type of the content of the track if any.
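To make the attribute list above more concrete, here is a minimal Python sketch that reads such a header into a dictionary and decodes the base64 ‘MPEG-4-DecoderSpecificInfo’ value. It assumes the header carries one ‘name: value’ pair per line after the signature line, and it interprets the decoded bytes as a basic two-byte MPEG-4 AudioSpecificConfig; the function names are invented for the example and escape values (explicit sample rates, extended object types) are not handled.

```python
import base64

HEADER = """\
WEBVTT Metadata track generated by GPAC MP4Box
kind:metadata
language:eng
label: counter-10mn_aac_16k.aac - Imported with GPAC 0.5.1-DEV-rev4127:4130M
trackID: 1
baseMediaFile: mp4-onDemand-aaclc_low_track1.media
MPEG-4-streamType: 5
MPEG-4-objectTypeIndication: 64
sampleRate: 44100
numChannels: 1
MPEG-4-DecoderSpecificInfo: Egg=
inBandMetadataTrackDispatchType: application/octet-stream
encoding: base64
"""

# Sampling rates indexed by the 4-bit samplingFrequencyIndex field.
SAMPLE_RATES = [96000, 88200, 64000, 48000, 44100, 32000, 24000,
                22050, 16000, 12000, 11025, 8000, 7350]


def parse_header(text: str) -> dict:
    """Collect the 'name: value' lines that follow the signature line."""
    attributes = {}
    for line in text.splitlines()[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        # Split on the first colon only: values may contain colons.
        name, sep, value = line.partition(":")
        if sep:
            attributes[name.strip()] = value.strip()
    return attributes


def parse_audio_specific_config(data: bytes) -> dict:
    """Read the first three fields of a two-byte AudioSpecificConfig."""
    bits = int.from_bytes(data[:2], "big")
    return {
        "audioObjectType": bits >> 11,                   # 5 bits
        "sampleRate": SAMPLE_RATES[(bits >> 7) & 0x0F],  # 4 bits (index)
        "channelConfiguration": (bits >> 3) & 0x0F,      # 4 bits
    }


header = parse_header(HEADER)
config = parse_audio_specific_config(
    base64.b64decode(header["MPEG-4-DecoderSpecificInfo"]))
print(header["label"])
print(config)
```

For the header shown earlier, the two decoded bytes describe an AAC-LC stream (object type 2) whose sample rate and channel count agree with the ‘sampleRate’ and ‘numChannels’ attributes.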
Then, because the data is not included in this file, the cues carry no payload and only contain additional settings, as shown below. Note also that the cue timings do not overlap (as in the MP4 track). The settings are:
- the offset and size of the cue data in the ‘baseMediaFile’ given by the ‘mediaOffset’ and ‘dataLength’ attributes
- information about the properties of the cue data. In this case of audio, all cues are random access points as indicated by the ‘isRAP’ attribute. Other attributes could be provided (redundant cue …).
00:00:00.000 --> 00:00:01.024 mediaOffset:0 dataLength:104 isRAP:true

00:00:01.024 --> 00:00:02.048 mediaOffset:104 dataLength:92 isRAP:true

00:00:02.048 --> 00:00:03.072 mediaOffset:196 dataLength:6 isRAP:true

00:00:03.072 --> 00:00:04.096 mediaOffset:202 dataLength:36 isRAP:true

00:00:04.096 --> 00:00:05.120 mediaOffset:238 dataLength:31 isRAP:true

00:00:05.120 --> 00:00:06.144 mediaOffset:269 dataLength:48 isRAP:true

00:00:06.144 --> 00:00:07.168 mediaOffset:317 dataLength:38 isRAP:true
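The way a client could use these settings can be sketched in Python as follows: parse each cue timing line, then slice the corresponding bytes out of the base media file. This is only a sketch under a few assumptions: the settings sit on the same line as the timing, timestamps always use the hh:mm:ss.ttt form seen above, and the media file is stood in for by an in-memory buffer of dummy bytes. `parse_cue_line` and `read_sample` are hypothetical helpers.

```python
import re

# Timing line: "hh:mm:ss.ttt --> hh:mm:ss.ttt" followed by optional settings.
CUE_LINE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> "
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})(.*)"
)


def to_seconds(h: str, m: str, s: str, ms: str) -> float:
    """Convert the four timestamp fields to seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_cue_line(line: str) -> dict:
    """Parse one timing line and its 'name:value' settings."""
    match = CUE_LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not a cue timing line: {line!r}")
    groups = match.groups()
    settings = dict(item.split(":", 1) for item in groups[8].split())
    return {
        "start": to_seconds(*groups[0:4]),
        "end": to_seconds(*groups[4:8]),
        "mediaOffset": int(settings["mediaOffset"]),
        "dataLength": int(settings["dataLength"]),
        "isRAP": settings.get("isRAP") == "true",
    }


def read_sample(media: bytes, cue: dict) -> bytes:
    """Return the bytes of one cue from the base media data."""
    offset = cue["mediaOffset"]
    return media[offset:offset + cue["dataLength"]]


lines = [
    "00:00:00.000 --> 00:00:01.024 mediaOffset:0 dataLength:104 isRAP:true",
    "00:00:01.024 --> 00:00:02.048 mediaOffset:104 dataLength:92 isRAP:true",
    "00:00:02.048 --> 00:00:03.072 mediaOffset:196 dataLength:6 isRAP:true",
]
cues = [parse_cue_line(line) for line in lines]

# Dummy stand-in for the content of the 'baseMediaFile' (202 bytes).
media = bytes(i % 256 for i in range(202))
samples = [read_sample(media, cue) for cue in cues]
print([len(sample) for sample in samples])
```

With a real export, `media` would be the content of the file named by ‘baseMediaFile’. Since each cue gives an offset and a length, a client could presumably also fetch individual samples with HTTP byte-range requests instead of loading the whole file.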
If I now export the same track using the ‘embedded’ parameter, with the following command line, the media data will be in the WebVTT file.
MP4Box -webvtt-raw 1:embedded:output=raw file.mp4
The WebVTT file header will have just one change (no ‘baseMediaFile’ attribute) and the cues will look like:
00:00:00.000 --> 00:00:01.024 isRAP:true
3gQAAGxpYmZhYWMgMS4yNQAAAqSfExBQFgoEiCEgigBRWM8ebmly0RpofT6fT6fT6fT6fT6a/pr+gTrnlv1kSPzxu2aujjZvjlLLFFEhERFq1VRREUNVTVCBqqEIVEIA1QgFawgFagc=

00:00:01.024 --> 00:00:02.048 isRAP:true
ATKT/ZUVJkSZFPn5+uPa59wKPSV6C1Z56+oAGHpLalR7i8vdYASOjzSG3IhVotnW5qvGCii5RiNOE5RX8uBCFY0/A/omp/NvbI/BFh8gAHYvJUAAPRkBAESJEA4=
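Reading the embedded variant back could look like the Python sketch below. It assumes that blocks are separated by blank lines, that the first block is the header, and that the base64 payload follows the timing line of each cue. To keep the example self-checking, it uses made-up payloads rather than the real AAC frames above; `parse_embedded_cues` is a hypothetical helper.

```python
import base64


def parse_embedded_cues(vtt_text: str) -> list:
    """Return (timing_line, decoded_bytes) for each cue in the file."""
    cues = []
    text = vtt_text.replace("\r\n", "\n").strip()
    # Blocks are separated by blank lines; the first block is the header.
    for block in text.split("\n\n")[1:]:
        lines = block.splitlines()
        timing_index = next(
            (i for i, line in enumerate(lines) if "-->" in line), None)
        if timing_index is None:
            continue  # not a cue block
        # Everything after the timing line is the base64 payload.
        payload = "".join(lines[timing_index + 1:])
        cues.append((lines[timing_index], base64.b64decode(payload)))
    return cues


# Made-up frame data standing in for real media samples.
frames = [b"first frame (dummy)", b"second frame (dummy)"]
sample_vtt = "\n".join([
    "WEBVTT Metadata track generated by GPAC MP4Box",
    "kind:metadata",
    "encoding: base64",
    "",
    "00:00:00.000 --> 00:00:01.024 isRAP:true",
    base64.b64encode(frames[0]).decode("ascii"),
    "",
    "00:00:01.024 --> 00:00:02.048 isRAP:true",
    base64.b64encode(frames[1]).decode("ascii"),
    "",
])

decoded = parse_embedded_cues(sample_vtt)
for timing, data in decoded:
    print(timing, len(data), "bytes")
```

The trade-off between the two export modes is visible here: the embedded form needs no second file or offset bookkeeping, at the cost of the usual base64 size overhead of about one third.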
I think this is a good step towards using any media data track in HTML5, but for this to be usable, some things need to change. The WebVTT spec currently lacks two things:
- An API to access the header information. I’m currently considering a hack, where I would generate a ‘dummy’ cue at the beginning of the file with the header information in it.
- An API to access additional settings per cue.
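The ‘dummy’ cue hack mentioned in the first item could take many forms. One possible shape, sketched here in Python, serializes the header attributes as a single line of JSON and places them in a very short cue at the start of the file. The JSON encoding, the 1 ms duration and the helper name `header_as_dummy_cue` are all assumptions made for this illustration, not something MP4Box does.

```python
import json


def header_as_dummy_cue(attributes: dict) -> str:
    """Serialize header attributes into a short cue at time zero."""
    # Compact, single-line JSON: a cue payload cannot contain blank lines.
    payload = json.dumps(attributes, separators=(",", ":"))
    if "-->" in payload:
        raise ValueError("a cue payload must not contain '-->'")
    return f"00:00:00.000 --> 00:00:00.001\n{payload}"


cue = header_as_dummy_cue({
    "kind": "metadata",
    "trackID": "1",
    "sampleRate": "44100",
    "numChannels": "1",
})
print(cue)
```

A page script could then read the text of the first cue of the track and parse it as JSON. The obvious drawback is that this cue overlaps the first real cue in time, so the script has to recognize it and treat it specially.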
Ideally, it would be good if the HTML5 spec allowed formats other than WebVTT in the track element, for instance a single-track MP4 file.
As for the next steps in MP4Box, I would like to see how to align the attributes in the header with the SDP description, to provide less base64 encoded data and help reuse existing code.
2 thoughts on “Using WebVTT to carry media streams”
Did you consider using the timing object and the sequencer for this?
The point would be that timing object + sequencer already provides the data cue mechanism that you mention. In other words, you could use whatever data format you like and not have to worry about WebVTT at all. The timing would also be more precise.
I can see that it would be tempting to use the track mechanism if you depend on built-in GUI support for subtitle visualisation in the media element. On the other hand, if the price to pay is an extra conversion step via a suboptimal, non-JSON format, making a subtitle viewer yourself (using the sequencer) seems like an attractive alternative.
Thanks for the pointer. I’ll check. The initial idea was indeed to consume data synchronized with audio and/or video streams, in a typical media player with a GUI. I don’t see WebVTT as a sub-optimal format for that purpose; it is a good candidate here. It can be consumed as a side-car file or packaged in MP4 files, which simplifies delivery. It cannot simply be JSON anyway, because JSON is not streamable: there has to be a streamable wrapper around the JSON objects.