News video story segmentation using fusion of multi-level multi-modal features in TRECVID 2003

Question

1. What are the contributions in "News video story segmentation using fusion of multi-level multi-modal features in TRECVID 2003"?

2. What future work is mentioned in the paper "News video story segmentation using fusion of multi-level multi-modal features in TRECVID 2003"?

3. How many binary features are generated at a candidate point?

4. What is the relevant feature in the induced binary feature?

Accepted Answer

In this paper, the authors present new results in news video story segmentation and classification in the context of the TRECVID 2003 video retrieval benchmarking event. Using the large news video set from the TRECVID 2003 benchmark, they demonstrate satisfactory performance (F1 measure up to 0.76) and, more importantly, observe an interesting opportunity for further improvement.

Accepted Answer

According to their observation, a maximum entropy (ME) model extended with temporal states would be a promising solution, since the statistical behavior of features in relation to the story transition dynamics may change over the course of a news program.

Accepted Answer

All the binary features generated at a candidate point are sequentially collected into {g_j} and then fed into the ME model. For example, a pitch jump raw feature with 4 threshold levels generates 3 · 4 = 12 binary features, since the authors check whether the feature is "on" in each of the 3 observation windows, and each is binarized at 4 different levels.
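The 3 · 4 = 12 count can be illustrated with a minimal sketch. The window offsets, the "peak value meets the threshold" rule, the sample data, and the name `binarize_feature` are all assumptions made for illustration; the answer above only states that there are 3 observation windows and 4 threshold levels.

```python
import numpy as np

def binarize_feature(times, values, t_k, windows, thresholds):
    """Return one 0/1 flag per (window, threshold) pair at candidate t_k."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    flags = []
    for start, end in windows:  # offsets in seconds relative to t_k (assumed)
        in_window = (times >= t_k + start) & (times <= t_k + end)
        # Assumed rule: the feature is "on" if its peak in the window
        # reaches the threshold.
        peak = values[in_window].max() if in_window.any() else -np.inf
        for v in thresholds:
            flags.append(int(peak >= v))
    return flags

# Hypothetical pitch-jump measurements (time in seconds, unitless magnitude).
times = [9.0, 9.8, 10.3, 11.5]
values = [0.1, 0.6, 0.3, 0.9]
windows = [(-2.0, 0.0), (0.0, 2.0), (-1.0, 1.0)]  # before, after, surrounding
thresholds = [0.2, 0.4, 0.6, 0.8]

g = binarize_feature(times, values, t_k=10.0, windows=windows, thresholds=thresholds)
print(len(g))  # 3 windows x 4 thresholds = 12 binary features
```

Each raw feature contributes one such block of flags, and the blocks from all features are concatenated into the vector passed to the ME model.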

Accepted Answer

As for the induced binary features, the anchor face feature in a certain observation window is the most relevant; the next feature induced is the significant pause within a non-commercial section.

Accepted Answer

Their strategy should be to find the effective binarization threshold level in terms of the fitness gain (i.e., divergence reduction defined in Equation 2) of the constructed model rather than the data distribution within the feature itself.

Accepted Answer

In this experiment, the authors use 111 half-hour video programs for development, 66 of which are used for detector training and threshold determination.

Accepted Answer

A good candidate set should have a very high recall rate on the reference (annotated) boundaries, and its points should be the places where salient and effective features occur.
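One way to check the recall of a candidate set against annotated boundaries is sketched below. The matching rule (a reference boundary counts as covered if some candidate lies within a fixed tolerance), the 2.5-second tolerance, and the timestamps are assumptions for illustration, not values taken from the paper.

```python
def candidate_recall(candidates, references, tolerance=2.5):
    """Fraction of reference boundaries with a candidate within `tolerance` seconds."""
    if not references:
        return 0.0
    covered = sum(
        any(abs(c - r) <= tolerance for c in candidates) for r in references
    )
    return covered / len(references)

# Hypothetical annotated boundaries and candidate points, in seconds.
refs = [12.0, 95.5, 180.0, 260.0]
cands = [11.0, 50.0, 96.0, 150.0, 181.5, 300.0]

recall = candidate_recall(cands, refs)
print(recall)  # three of the four reference boundaries are covered
```

A boundary missed at this stage can never be recovered by the downstream model, which is why recall matters more than precision for the candidate set.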

Accepted Answer

There are other perceptual features that might improve this work. For example, inter-chunk energy variation might be highly correlated with the pitch reset feature discussed earlier. Another is a more precise measure of speech rapidity at the phoneme level, since towards the end of news stories anchors may tend to decrease their rate of speech or stretch out the last few words.

Accepted Answer

As the raw feature f_i^r is taken into the feature wrapper, it is rendered into sets of binary features at each candidate point {t_k} with the function F_w(f_i^r, t_k, dt, v, B), which takes features from observation windows of various lengths B, computes delta values of some features over the time interval dt, and finally binarizes the feature values against multiple possible thresholds v.
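The delta step of the wrapper can be sketched as follows. The answer above does not say how the delta is computed, so this sketch assumes a centered difference of the linearly interpolated feature over the interval dt around the candidate point; `delta_feature` and the sample series are hypothetical.

```python
import numpy as np

def delta_feature(times, values, t_k, dt):
    """Change in a raw feature across candidate t_k over an interval of length dt."""
    # Assumed definition: value half an interval after t_k minus the value
    # half an interval before it, using linear interpolation between samples.
    after = np.interp(t_k + dt / 2, times, values)
    before = np.interp(t_k - dt / 2, times, values)
    return float(after - before)

# Hypothetical raw feature sampled once per second.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [1.0, 1.0, 2.0, 4.0, 4.0]

d = delta_feature(times, values, t_k=2.0, dt=2.0)
print(d)  # value at t=3 minus value at t=1
```

The resulting delta value would then be binarized against the thresholds v in the same way as the raw feature.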

Accepted Answer

It takes a variety of lexical, semantic and structural features as inputs and generates boundary scores at non-speech candidates, where no ASR words are transcribed [7].

Accepted Answer

When dealing with videos from unknown sources, identification of the source channel can be done through logo detection or calculating model likelihood (fitness) with individual statistical station models.

Accepted Answer

The ME model [1, 3] constructs an exponential log-linear function that fuses multiple features to approximate the posterior probability of an event (i.e., story boundary) given the audio, visual or text data surrounding the point under examination, as shown in Equation 1.

Accepted Answer

The dimension of the binary features {g_ij} generated from raw feature f_i^r or delta feature Δf_i^r is the product of the number of threshold levels and the number of observation windows (3, in their experiment).

Accepted Answer

The estimated model, a posterior probability, is represented as qλ(b|x), where b ∈ {0, 1} is a random variable corresponding to the presence or absence of a story boundary in the context x and λ is the estimated parameter set.
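The posterior q_λ(b|x) of an exponential log-linear model can be evaluated numerically as in the sketch below. It uses the standard maximum entropy form, q_λ(b|x) = exp(Σ_j λ_bj g_j(x)) / Z(x), with one weight per outcome and binary feature; this parameterization, the weights, and the feature vector are assumptions for illustration and are not taken from the paper's Equation 1.

```python
import math

def me_posterior(g, lambdas):
    """Return q(b | x) for b in {0, 1} given binary features g.

    lambdas[b][j] is the (hypothetical) weight of feature g[j] for outcome b.
    """
    scores = {b: sum(w * gj for w, gj in zip(lambdas[b], g)) for b in (0, 1)}
    z = sum(math.exp(s) for s in scores.values())  # normalizer Z(x)
    return {b: math.exp(s) / z for b, s in scores.items()}

# Hypothetical binary features at one candidate point and made-up weights.
g = [1, 0, 1]
lambdas = {0: [0.0, 0.0, 0.0], 1: [1.2, -0.5, 0.3]}

q = me_posterior(g, lambdas)
print(q[1])  # estimated probability that the candidate is a story boundary
```

With the weights for b = 0 fixed at zero, the model reduces to logistic regression on the binary features, which makes the role of each λ easy to read: a positive weight raises the boundary probability whenever its feature is "on".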

Accepted Answer

The intuition is that boundary detection might be erroneous but the authors could still classify the segment by checking the surrounding context, commercial or non-commercial.

Accepted Answer

The authors found that taking the shot boundaries only is not complete.

Footnote 1: F1 = 2 · P · R / (P + R), where P and R are the precision and recall rates.

Footnote 2: TRECVID 2003: http://www-nlpir.nist.gov/projects/tv2003/tv2003.html
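The F1 measure, F1 = 2 · P · R / (P + R), is straightforward to compute. In the sketch below the precision and recall values are hypothetical and are not results from the paper; returning 0 when both rates are 0 is a common convention, adopted here as an assumption.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention assumed here to avoid division by zero
    return 2 * precision * recall / (precision + recall)

# Hypothetical precision and recall, for illustration only.
score = f1_measure(0.8, 0.5)
print(round(score, 4))
```

Because F1 is a harmonic mean, it stays close to the lower of the two rates, so a detector cannot score well by trading away recall for precision or the reverse.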

Most frequently asked questions

1. What are the contributions in "News video story segmentation using fusion of multi-level multi-modal features in TRECVID 2003"?

2. What future work is mentioned in the paper "News video story segmentation using fusion of multi-level multi-modal features in TRECVID 2003"?

3. How many binary features are generated at a candidate point?

4. What is the relevant feature in the induced binary feature?

5. What is the strategy for finding the effective binarization threshold level?

6. How many half-hour video programs are used for development?

7. What is the way to determine the candidate set?

8. What are some other perceptual features that might improve this work?

9. What is the function used to estimate the delta feature?

10. What is the purpose of the ASR-based story segmentation scheme?

11. What is the way to identify a source channel?

12. What is the function that is used to approximate the posterior probability of an event?

13. What is the dimension of the binary features g_ij generated from raw feature f_i^r or delta feature Δf_i^r?

14. How is the estimated model, a posterior probability, represented?

15. What is the intuition behind classifying a segment by its surrounding context?

16. What is the recall rate for the shot boundaries?

References

Statistical Models for Text Segmentation

Language-Independent Prosodic Features

Discovery and fusion of salient multimodal features toward news story segmentation

A statistical framework for fusing mid-level perceptual features in news story segmentation

Segmentation, structure detection and summarization of multimedia sequences
