TL;DR: OMAF is a storage and streaming format for omnidirectional media, including 360° video and images, spatial audio, and associated timed text, which is arguably the first virtual reality (VR) system standard.
Abstract: During recent years, there have been product launches and research for enabling immersive audio–visual media experiences. For example, a variety of head-mounted displays and 360° cameras are available in the market. To facilitate interoperability between devices and media system components by different vendors, the Moving Picture Experts Group (MPEG) developed the Omnidirectional MediA Format (OMAF), which is arguably the first virtual reality (VR) system standard. OMAF is a storage and streaming format for omnidirectional media, including 360° video and images, spatial audio, and associated timed text. This article provides a comprehensive overview of OMAF.
TL;DR: A PES (packetized elementary stream) packet is given a data structure that carries the text track of an MP4 file, so that a data receiving apparatus can reproduce the text in streaming fashion.
Abstract: PROBLEM TO BE SOLVED: To provide a transmission data structure suitable for using Timed Text in streaming-type delivery. SOLUTION: A PES (packetized elementary stream) packet 1 has a data structure for transferring a text track of an MP4 file and enabling a data receiving apparatus to perform streaming-type reproduction. A track header 111, a sample description 112 and configuration information 113 are information related to reproduction of the whole text track. A text sample 1142 includes a text 1142a. A segment text header is information provided for every text sample in relation to reproduction of that text sample. COPYRIGHT: (C)2004,JPO&NCIPI
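The split between track-level and per-sample information described above can be pictured with a small data model. This is only an illustrative sketch: the class and field names (TrackLevelInfo, TextSample, TextPesPayload, sample_texts) are hypothetical, the byte-level layout is not given in the abstract, and whether a single packet carries both kinds of information is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrackLevelInfo:
    """Information that applies to reproduction of the whole text track."""
    track_header: bytes
    sample_description: bytes
    configuration: bytes


@dataclass
class TextSample:
    """One text sample plus the header that governs its reproduction."""
    segment_text_header: bytes  # per-sample information (contents assumed opaque here)
    text: str


@dataclass
class TextPesPayload:
    """Hypothetical payload of a PES packet carrying a text track."""
    track_info: TrackLevelInfo
    samples: List[TextSample] = field(default_factory=list)


def sample_texts(payload: TextPesPayload) -> List[str]:
    """Return the text of every sample in arrival order."""
    return [sample.text for sample in payload.samples]
```

Keeping the track-level fields separate from the samples mirrors the abstract's distinction: a receiver needs the former once, while each sample brings its own header.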
TL;DR: A method for automatically translating timed text for web video is presented. On a client request identifying a web video, the method retrieves the video's timed text track from a database, translates its text to a target language, and sends the translated text to the client for display with the video.
Abstract: This invention relates to translating timed text in web video. In a first embodiment, a method automatically translates timed text for web video. The method includes receiving a request identifying a web video from a client. In response to the request, a timed text track for the web video is retrieved from a timed text database. Each timed text track in the timed text database specifies text to display at particular times in a video. Text from the timed text track is automatically translated to a target language. Finally, the translated text is sent to the client to display with the web video.
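The request, retrieve, translate, send sequence above can be sketched in a few lines. Everything named here is an assumption made for illustration: the in-memory TIMED_TEXT_DB stands in for the timed text database, toy_translate is a lookup table rather than a real machine translation system, and handle_request is a hypothetical entry point.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Cue:
    start: float  # seconds from the start of the video
    end: float
    text: str


# Hypothetical stand-in for the timed text database, keyed by video id.
TIMED_TEXT_DB: Dict[str, List[Cue]] = {
    "video-1": [Cue(0.0, 2.0, "hello"), Cue(2.0, 4.5, "goodbye")],
}


def toy_translate(text: str, target_lang: str) -> str:
    """Placeholder translator: a fixed lookup table, not real MT."""
    table = {("hello", "fr"): "bonjour", ("goodbye", "fr"): "au revoir"}
    return table.get((text, target_lang), text)


def handle_request(
    video_id: str,
    target_lang: str,
    translate: Callable[[str, str], str] = toy_translate,
) -> List[Cue]:
    """Retrieve the track for video_id and translate each cue's text.

    Timing is left untouched so the translated text is displayed at the
    same moments as the original.
    """
    track = TIMED_TEXT_DB[video_id]
    return [replace(cue, text=translate(cue.text, target_lang)) for cue in track]
```

Calling handle_request("video-1", "fr") returns the same two cues with their text replaced, which is what would then be sent back to the client.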
TL;DR: A method is described that receives source timed text data with an associated time stamp, renders a representation of it within a textual array of one or more rows, and produces a data document containing the row data whenever a row's text has changed from the previously rendered on-screen representation.
Abstract: A method is provided in certain example embodiments, and may include receiving source timed text data and an associated time stamp, and rendering a representation of the received source timed text data within a textual array. The textual array includes at least one row having textual data associated with the received source timed text data contained therein. The method may further include producing at least one data document including row data associated with one or more rows of the textual array when the textual data of the at least one row has changed from a previously rendered on-screen representation of previously received source timed text data. The row data includes a change in textual data for one or more rows from a previously produced caption data document.
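The change-detection step above amounts to comparing the current textual array with the previously rendered one, row by row. The following is a minimal sketch under stated assumptions: the textual array is modeled as a list of strings, and the function names and the dictionary shape of the "data document" are hypothetical, not taken from the source.

```python
from typing import Dict, List, Optional


def changed_rows(previous: List[str], current: List[str]) -> Dict[int, str]:
    """Return {row_index: new_text} for every row that differs.

    Rows missing from either array are treated as empty strings.
    """
    height = max(len(previous), len(current))
    changes: Dict[int, str] = {}
    for i in range(height):
        old = previous[i] if i < len(previous) else ""
        new = current[i] if i < len(current) else ""
        if old != new:
            changes[i] = new
    return changes


def make_caption_document(
    timestamp: float, previous: List[str], current: List[str]
) -> Optional[dict]:
    """Produce a document only when at least one row has changed."""
    changes = changed_rows(previous, current)
    if not changes:
        return None  # nothing changed on screen, so no document is produced
    return {"timestamp": timestamp, "rows": changes}
```

For example, going from ["HELLO", ""] to ["HELLO", "WORLD"] yields a document that lists only row 1, while rendering the same array twice yields no document.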
TL;DR: This demo enables the automatic creation of semantically annotated YouTube media fragments: a video is first ingested in the Synote system, and NERD is then used to extract named entities from the transcripts, which are temporally aligned with the video.
Abstract: This demo enables the automatic creation of semantically annotated YouTube media fragments. A video is first ingested in the Synote system, and a new method retrieves its associated subtitles or closed captions. Next, NERD is used to extract named entities from the transcripts, which are then temporally aligned with the video. The entities are disambiguated in the LOD cloud, and a user interface lets users browse the entities detected in a video or get more information. We evaluated our application with 60 videos from 3 YouTube channels.
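The temporal alignment step can be illustrated by attaching each detected entity to the time span of the subtitle cue that mentions it. This is a rough sketch, not the Synote or NERD implementation: entity detection is replaced by a case-insensitive substring match, align_entities is a hypothetical helper, and the fragment addresses use the temporal `#t=start,end` form of media fragment URIs.

```python
from typing import Dict, List, Tuple

# A subtitle cue: (start_seconds, end_seconds, text).
Cue = Tuple[float, float, str]


def align_entities(
    video_url: str, cues: List[Cue], entities: List[str]
) -> Dict[str, List[str]]:
    """Map each entity to the media fragment URIs of the cues mentioning it.

    Substring matching stands in for a real named entity extractor.
    """
    fragments: Dict[str, List[str]] = {}
    for start, end, text in cues:
        for entity in entities:
            if entity.lower() in text.lower():
                uri = f"{video_url}#t={start:g},{end:g}"
                fragments.setdefault(entity, []).append(uri)
    return fragments
```

With this shape, a browsing interface can jump straight to every moment at which a given entity is mentioned.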