WebVTT is a new subtitling format that is becoming popular amongst browser implementors. Chrome (v23), Opera (v12.5) and IE 10 already support it, and Firefox soon will too. As opposed to previous subtitle formats such as DVB subtitles or 3GPP Timed Text, it is being defined by the WHATWG/W3C primarily for the Web. However, the Web being almost ubiquitous, Web technologies now have to be usable in different delivery environments, not only in download-and-play mode. In particular, just like all the previous subtitle formats, WebVTT also has to be streamable, and for instance usable in the context of Dynamic Adaptive Streaming over HTTP (DASH). This post is about my experiments on this topic. For those who don’t want to read the whole post: in summary, it seems possible to generate WebVTT streams with good random access properties that can be delivered in chunks and processed by standard browsers.
Scenarios
WebVTT content is typically meant to be used with a video. Hence, delivery of WebVTT content should be compatible with the major use cases of video delivery: On-demand and live.
For on-demand scenarios, the entire video and subtitles are available when the playback request is made. In such scenarios, the client will typically download the whole WebVTT file and start processing it (progressively or not) in sync with the video download and playback, which itself could be progressive depending on the video file size. This is the major Web-related use case for which WebVTT was designed.
For live scenarios, the video is typically delivered as it is produced, for instance using RTP, HTTP streaming technologies such as MPEG DASH, or broadcasting technologies such as MPEG-2 TS. For these scenarios, the WebVTT content has to be delivered in chunks (in other words, as a stream). Each chunk is delivered soon after it is produced.
Additionally, for pure broadcast scenarios (i.e. when a return channel is not available), WebVTT content has to be provided within the broadcast channel. A WebVTT file can be sent periodically using DSM-CC Object Carousel or FLUTE, but sending the whole subtitle file for a given program would be rather inefficient in terms of bandwidth. So, even for non-live scenarios in broadcast, the WebVTT file also has to be cut into smaller chunks, each sent prior to its display time, in a streaming manner.
Technical aspects
WebVTT was not designed with all the above scenarios in mind. However, I think it is worth looking at what needs to be done to accommodate them and, in general, to stream WebVTT content. In particular, one has to take special care of the use of headers and the notion of Random Access Point.
File Signature, Header and Streaming
The delivery of headers in streaming formats is a well-known problem. Traditional solutions require carouselling the header in-band or sending it out-of-band. For instance, recent changes in the MP4 file format now enable the H.264/AVC configuration header to be stored in-band, in the video track, for easier track management.
Of course, for WebVTT the out-of-band solution could work. In this case, it is up to the protocol or file format in use to carry the signature and header. For the in-band approach, the header should be repeated within the WebVTT stream (or file). Fortunately, the WebVTT standard parsing algorithm tolerates that. So you can construct a WebVTT stream by concatenating cues and inserting the WebVTT signature and header at some locations. See this example (works in Opera only; Chrome has a bug here and drops the next cue after the repeated header).
WEBVTT Result of a concatenation

1
00:00:00.000 --> 00:00:05.000
First cue from 0s to 5s

2
00:00:05.000 --> 00:00:10.000
Second cue from 5s to 10s

3
00:00:10.000 --> 00:00:15.000
3rd cue from 10s to 15s

4
00:00:15.000 --> 00:00:20.000
4th cue from 15s to 20s

5
00:00:20.000 --> 00:00:25.000
5th cue from 20s to 25s

6
00:00:25.000 --> 00:00:30.000
6th cue from 25s to 30s
<u>Last cue before concatenation</u>

WEBVTT Second WebVTT file being concatenated

7
00:00:30.000 --> 00:00:35.000
7th cue from 30s to 35s

8
00:00:35.000 --> 00:00:40.000
8th cue from 35s to 40s

9
00:00:40.000 --> 00:00:45.000
9th cue from 40s to 45s

10
00:00:45.000 --> 00:00:50.000
10th cue from 45s to 50s

11
00:00:50.000 --> 00:00:55.000
11th cue from 50s to 55s

12
00:00:55.000 --> 00:01:00.000
12th cue from 55s to 60s
<u>Last cue</u>
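As a rough illustration of the in-band approach, here is a minimal Python sketch that builds such a stream by prefixing each chunk of cues with the signature line. The function name `build_stream` and the idea of repeating the header before every chunk (rather than only at selected locations) are assumptions made for this sketch, not something the experiments above prescribe.

```python
def build_stream(chunks, header="WEBVTT"):
    """Join chunks of cues, repeating the signature line before each chunk.

    `chunks` is a list of strings, each holding one or more complete cues.
    Repeating the header before every chunk is a choice made for this
    sketch; a real packager might repeat it less often.
    """
    parts = []
    for chunk in chunks:
        # A blank line must separate the header from the first cue.
        parts.append(header + "\n\n" + chunk.strip() + "\n")
    # Joining with "\n" leaves a blank line between a cue and the next header.
    return "\n".join(parts)


chunk_a = "1\n00:00:00.000 --> 00:00:05.000\nFirst cue from 0s to 5s"
chunk_b = "7\n00:00:30.000 --> 00:00:35.000\n7th cue from 30s to 35s"
stream = build_stream([chunk_a, chunk_b])
print(stream)
```

A client joining at the second chunk then sees a self-contained WebVTT document, while a client that has been connected from the start is expected to skip the repeated signature.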
Random Access in WebVTT
The second problem in the streaming of WebVTT is the use of overlapping cues, in particular with regard to the notion of Random Access Point (RAP).
A RAP in a stream provides the ability for a client to tune in (or to seek) at that point in the stream and to render the stream as if the client had joined (or played) from the beginning, without processing data before that point. Typically in a streaming/broadcast session, the client will start receiving data but only start processing it after the protocol has signaled a RAP (and possibly after some header has been received too). From a global point of view, in a broadcast session, all clients will display the same thing after they have processed a RAP. In general, streamable formats provide some means to represent the data in such a way that some points can be signaled as RAPs.
The problem with overlapping cues and RAP is illustrated by the following WebVTT example (and HTML with video):
WEBVTT Example of overlapping cues

1
00:00:00.000 --> 00:00:05.000
First cue from 0s to 5s

2
00:00:03.000 --> 00:00:09.000
Second cue from 3s to 9s

3
00:00:08.000 --> 00:00:14.000
3rd cue from 8s to 14s

4
00:00:10.000 --> 00:00:18.000
4th cue from 10s to 18s

5
00:00:12.000 --> 00:00:24.000
5th cue from 12s to 24s

6
00:00:16.000 --> 00:00:26.000
6th cue from 16s to 26s

You can see that at some points in time, multiple cues overlap and should be displayed together. For instance, at time 4, cue 1 and cue 2 should be displayed. At time 13, cues 3, 4 and 5 should be displayed. In terms of streaming, this means that if a client joins at time 2, it won’t have cue 1 (and won’t be able to display it) but will only receive cue 2 (and be able to process it if the signature and header are there, see above). But even if cue 2 can be processed, the result will not be correct. Indeed, the right rendering should display cue 1 and cue 2 between time 3 and 5. Displaying only cue 2 during that period is not correct, or at least not identical to what a client that received cue 1 will render. Additionally, the position of cue 2 can be affected by the presence (or not) of cue 1. So, strictly speaking, some cues cannot be considered as RAPs.
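The mismatch can be checked with a few lines of Python. This is only a sketch under two assumptions that are mine: a cue is treated as active from its start time (inclusive) to its end time (exclusive), and a cue is assumed to be delivered at its start time, so a late joiner only receives cues that start at or after the join time. The names `active_cues` and `visible_after_join` are hypothetical.

```python
# (identifier, start, end) in seconds, taken from the overlapping example.
CUES = [
    ("1", 0, 5), ("2", 3, 9), ("3", 8, 14),
    ("4", 10, 18), ("5", 12, 24), ("6", 16, 26),
]


def active_cues(cues, t):
    """Identifiers of the cues that should be on screen at time t."""
    return [cue_id for cue_id, start, end in cues if start <= t < end]


def visible_after_join(cues, join_time, t):
    """Cues shown at time t by a client that joined at join_time.

    Assumption: each cue is sent once, at its start time, so cues that
    started before join_time are never received.
    """
    received = [cue for cue in cues if cue[1] >= join_time]
    return active_cues(received, t)


print(active_cues(CUES, 4))            # what a client present from 0s shows
print(active_cues(CUES, 13))
print(visible_after_join(CUES, 2, 4))  # what a client joining at 2s shows
```

Comparing the first and last calls shows the two clients disagreeing at time 4, which is exactly why cue 2 cannot be signaled as a RAP.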
Some people say: “Random Access into a WebVTT file is easy: just start from the beginning, keep the header, discard all cues that end before your desired point in time and you’re done”. True, if you have access to the entire file, but that’s not the case in the above scenarios.
So I asked myself: how can we transform a given WebVTT stream to ensure that some points can be signaled as RAPs? At FOMS 2012, I discussed this with Andrew Scherkus from the Chromium team and we came up with an idea. Because the concatenation works well (see above), it seems possible to split overlapping cues into non-overlapping ones and create RAPs.
WEBVTT Example of overlapping cues being rewritten to be able to signal RAP

1 RAP here
00:00:00.000 --> 00:00:03.000
First cue from 0s to 5s

1b RAP here
00:00:03.000 --> 00:00:05.000
First cue from 0s to 5s

2
00:00:03.000 --> 00:00:05.000
Second cue from 3s to 9s

2b RAP here
00:00:05.000 --> 00:00:08.000
Second cue from 3s to 9s

2c RAP here
00:00:08.000 --> 00:00:09.000
Second cue from 3s to 9s

3
00:00:08.000 --> 00:00:09.000
3rd cue from 8s to 14s

3b RAP here
00:00:09.000 --> 00:00:10.000
3rd cue from 8s to 14s

3c RAP here
00:00:10.000 --> 00:00:12.000
3rd cue from 8s to 14s

4
00:00:10.000 --> 00:00:12.000
4th cue from 10s to 18s

3d RAP here
00:00:12.000 --> 00:00:14.000
3rd cue from 8s to 14s

4b
00:00:12.000 --> 00:00:14.000
4th cue from 10s to 18s

5
00:00:12.000 --> 00:00:14.000
5th cue from 12s to 24s

4c RAP here
00:00:14.000 --> 00:00:16.000
4th cue from 10s to 18s

5b
00:00:14.000 --> 00:00:16.000
5th cue from 12s to 24s

4d RAP here
00:00:16.000 --> 00:00:18.000
4th cue from 10s to 18s

5b
00:00:16.000 --> 00:00:18.000
5th cue from 12s to 24s

6
00:00:16.000 --> 00:00:18.000
6th cue from 16s to 26s

5c RAP here
00:00:18.000 --> 00:00:24.000
5th cue from 12s to 24s

6b
00:00:18.000 --> 00:00:24.000
6th cue from 16s to 26s

6c
00:00:24.000 --> 00:00:26.000
6th cue from 16s to 26s
This seems to work pretty well in Opera (almost in Chrome).
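The rewriting above was done by hand, but it can be sketched in Python. The idea is to collect every cue start and end time as a boundary, then emit one fragment per cue for each interval between two consecutive boundaries in which that cue is active. The first fragment of each interval is flagged as a RAP. The function names, the millisecond time unit and the `-2`, `-3` identifier suffixes (instead of the `b`, `c` suffixes used above) are all assumptions of this sketch.

```python
def fmt(ms):
    """Format integer milliseconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    s, ms = divmod(ms, 1000)
    m, s = divmod(s, 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def split_cues(cues):
    """Split (id, start_ms, end_ms, text) cues into non-overlapping fragments.

    Returns (fragment_id, start_ms, end_ms, text, is_rap) tuples. Fragments
    either share exactly the same interval or do not overlap at all.
    """
    boundaries = sorted({t for _, start, end, _ in cues for t in (start, end)})
    seen = {}  # how many fragments each original cue has produced so far
    fragments = []
    for seg_start, seg_end in zip(boundaries, boundaries[1:]):
        first_in_segment = True
        for cue_id, start, end, text in cues:
            if start < seg_end and end > seg_start:  # cue active in segment
                n = seen.get(cue_id, 0) + 1
                seen[cue_id] = n
                frag_id = cue_id if n == 1 else f"{cue_id}-{n}"
                fragments.append(
                    (frag_id, seg_start, seg_end, text, first_in_segment))
                first_in_segment = False
    return fragments


def render(fragments):
    """Serialize fragments as WebVTT text, marking RAPs in the identifier."""
    lines = ["WEBVTT", ""]
    for frag_id, start, end, text, is_rap in fragments:
        lines.append(f"{frag_id} RAP here" if is_rap else frag_id)
        lines.append(f"{fmt(start)} --> {fmt(end)}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)


CUES = [
    ("1", 0, 5000, "First cue from 0s to 5s"),
    ("2", 3000, 9000, "Second cue from 3s to 9s"),
    ("3", 8000, 14000, "3rd cue from 8s to 14s"),
    ("4", 10000, 18000, "4th cue from 10s to 18s"),
    ("5", 12000, 24000, "5th cue from 12s to 24s"),
    ("6", 16000, 26000, "6th cue from 16s to 26s"),
]
fragments = split_cues(CUES)
print(render(fragments))
```

Unlike the hand-written file, this sketch also flags the final fragment as a RAP, since no other cue is active at that point. In a real packager the RAP flag would be signaled by the transport layer rather than written into the cue identifier.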
Timestamps in cues
The only problem with the above approach is that it may create cues with inline timestamps that do not fall in the range of the cue timestamps. See this example, which has this content:
WEBVTT Cue with a late cue timestamp

1
00:00:00.000 --> 00:00:05.000
First cue from 0s to 5s

2
00:00:05.000 --> 00:00:10.000
Second cue from 5s to 10s<00:00:11.000> late text

3
00:00:10.000 --> 00:00:15.000
3rd cue from 10s to 15s
Here Opera and Chrome don’t seem to agree. Opera displays the text as if the inline timestamp was in the cue time range. This is what I prefer, as this means that the process of rewriting the cue doesn’t have to touch the cue payload content. Otherwise, the rewriting would have to simply remove the inline timestamps.
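If the payload does have to be touched, the removal could look like the following Python sketch. It drops only the inline timestamps that fall outside the cue time range and keeps the others. The regular expression, the function names and the choice of a strict "inside the range" test are assumptions of this sketch and have not been checked against any browser implementation.

```python
import re

# Inline timestamp tag such as <00:00:11.000> or <00:11.000> (hours optional).
TS_TAG = re.compile(r"<(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})>")


def to_ms(hours, minutes, seconds, millis):
    """Convert the captured timestamp fields to integer milliseconds."""
    total_s = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_s * 1000 + int(millis)


def drop_out_of_range(payload, start_ms, end_ms):
    """Remove inline timestamps that are not strictly inside the cue range."""
    def keep_or_drop(match):
        t = to_ms(*match.groups())
        return match.group(0) if start_ms < t < end_ms else ""
    return TS_TAG.sub(keep_or_drop, payload)


late = "Second cue from 5s to 10s<00:00:11.000> late text"
print(drop_out_of_range(late, 5000, 10000))
```

Applied to each fragment produced by the splitting step, this would leave the text intact and only discard the timing markers that no longer make sense for that fragment.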
Conclusion
So, using the above transformation or rewriting of a WebVTT file, and using the repetition of headers, it seems possible to create a WebVTT stream with random access properties. This way, new clients joining a broadcast session or a live session at any time will just wait for the signaling of a RAP provided by the underlying transport layer (MP4, RTP, MPEG-2 TS) to start processing WebVTT chunks. These clients will get the WebVTT signature and header (either in-band at the beginning of the RAP or by some out-of-band means) and then will start processing the WebVTT stream correctly. Already “connected” clients will receive additional signature/header data in between cues (if in-band) and will (according to the standard) silently ignore it. The two types of clients will therefore be in sync.
Hi Cyril,
It’s great to see you picking up on the discussion at FOMS 2012 and making some experiments. I’ve just registered a bug on Chrome for what you noticed: https://bugs.webkit.org/show_bug.cgi?id=97097 .
I actually think there are two cases that we have to regard separately for streaming WebVTT: one is where we have the WebVTT file grow as an individual resource independent of the video, and the other is where WebVTT is provided in-band.
I think the first case, where the WebVTT file is just a text file that grows, is not too difficult to resolve. The video player would connect to the video stream at a certain offset, get the time offset of that point and then get the WebVTT file and display the cues from that time offset onwards. Right now, this is possible in a Web browser by writing JS code that pulls, through XHR, a WebVTT file that continues to grow on the server. I don’t think, though, that the browsers will do the right thing with a track element yet. That’s why we have a bug at the W3C: https://www.w3.org/Bugs/Public/show_bug.cgi?id=14104 .
However, I agree that the in-band use case is more challenging, since there it all depends on how you encapsulate the cues. It’s well possible (and indeed typical) that at the time that you connect to a live stream, the cues that are active at that time have already been encapsulated into the stream before you joined and you’ve therefore missed them. This can only be overcome by frequently repeating past and still active cues in the live stream and marking them as “repeats”. Repeats would only be picked up by a video player if it hadn’t already received that same cue before. This applies to both overlapping and non-overlapping cues, since they are all encapsulated at their start time as a packet and not visible during their duration any more.
You are correct that a “repeat” could just be done by splitting cues at the repeat time (which is then also a RAP). Maybe that’s what some encapsulation formats need in order to create a valid stream. It may, however, not be necessary.
It might be interesting to look at Ogg Kate in this context: https://wiki.xiph.org/OggKate . It was developed exactly with this aim of being able to put “repeat” packets into a stream and it does so by repeating them as a complete cue.
If at a later stage somebody were to extract a WebVTT file again from a multiplexed and recorded live stream, the repeated cues would need to be thrown away and would thus not pollute the text WebVTT file.