Exploring Temporal Concurrency for Video-Language Representation Learning

doi:10.1109/iccv51070.2023.01427

Proceedings Article•10.1109/iccv51070.2023.01427


Heng Zhang, +4 more

- 01 Oct 2023

pp 15522-15532

6

TL;DR: This paper learns video-language representations by modeling video-language pairs as Temporal Concurrent Processes (TCP) within a process-wise distance metric learning framework, and adds a regularization term that constrains the embeddings of each modality to approximate a stochastic process so that their inherent dynamics are preserved.

Abstract: Paired video and language data are naturally temporally concurrent, which requires modeling the temporal dynamics within each modality and the temporal alignment across modalities simultaneously. However, most existing video-language representation learning methods focus only on discrete semantic alignment, which encourages aligned semantics to be close in the latent space, or on temporal context dependency, which captures short-range coherence; both fail to model temporal concurrency. In this paper, we propose to learn video-language representations by modeling video-language pairs as Temporal Concurrent Processes (TCP) via a process-wise distance metric learning framework. Specifically, we employ soft Dynamic Time Warping (DTW) to measure the distance between two processes across modalities and then optimize the DTW costs. We further introduce a regularization term that constrains the embeddings of each modality to approximate a stochastic process so that their inherent dynamics are preserved. Experimental results on three benchmarks demonstrate that TCP is a state-of-the-art method for various video-language understanding tasks, including paragraph-to-video retrieval, video moment retrieval, and video question-answering. Code is available at https://github.com/hengRUC/TCP.
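To make the soft-DTW step in the abstract concrete, here is a minimal NumPy sketch of a generic soft-DTW cost between a sequence of clip embeddings and a sequence of sentence embeddings. It is an illustration under stated assumptions, not the authors' implementation: the cosine-distance cost, the smoothing parameter `gamma`, and the function names `soft_min`, `pairwise_cost`, and `soft_dtw` are all hypothetical choices, and the stochastic-process regularizer is omitted because the abstract does not specify its form.

```python
import numpy as np


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), shifted for stability."""
    values = np.asarray(values, dtype=float)
    m = values.min()
    return m - gamma * np.log(np.sum(np.exp(-(values - m) / gamma)))


def pairwise_cost(video_emb, text_emb):
    """Cosine distance between every clip and every sentence embedding.

    Assumption: the abstract does not name a ground cost; cosine distance
    is used here only for illustration.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return 1.0 - v @ t.T


def soft_dtw(cost, gamma=0.1):
    """Soft-DTW value for an (n, m) cost matrix via dynamic programming.

    R[i, j] = cost[i-1, j-1] + soft_min(R[i-1, j-1], R[i-1, j], R[i, j-1]).
    As gamma approaches 0 this approaches the classic (hard) DTW cost.
    """
    n, m = cost.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [R[i - 1, j - 1], R[i - 1, j], R[i, j - 1]]
            R[i, j] = cost[i - 1, j - 1] + soft_min(prev, gamma)
    return R[n, m]
```

Calling `soft_dtw(pairwise_cost(video_emb, text_emb))` on a temporally aligned pair should give a lower value than on the same pair with one sequence reversed, which is the property a process-level distance needs. In training, the same recursion would be written in an autodiff framework so that the cost can be minimized with respect to the encoders.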


Citations

Journal Article•10.1007/978-3-031-73636-0_19

Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

Kent Fujiwara, +2 more

- 04 Nov 2024

- Lecture Notes in Computer Science

Journal Article•10.18653/v1/2024.findings-emnlp.788

Multilingual Synopses of Movie Narratives: A Dataset for Vision-Language Story Understanding

Yidan Sun, +2 more

- 01 Jan 2024

Journal Article•10.48550/arxiv.2407.07478

EA-VTR: Event-Aware Video-Text Retrieval

Zongyang Ma, +10 more

- 10 Jul 2024

TL;DR: EA-VTR improves video-text retrieval by supplementing pre-training data with event augmentation strategies and constructing an event-aware model that encodes frame-level and video-level visual representations for detailed event content and temporal cross-modal alignment.


Journal Article•10.48550/arxiv.2402.16769

Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval

Haowei Liu, +10 more

- 26 Feb 2024

TL;DR: Unify framework learns lexicon representations to capture fine-grained semantics and combines latent and lexicon representations for effective video-text retrieval.


Journal Article•10.48550/arxiv.2407.15408

Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models

Kent Fujiwara, +2 more

- 22 Jul 2024

- arXiv.org

TL;DR: Researchers propose Chronologically Accurate Retrieval (CAR) to evaluate temporal understanding in motion-language models, revealing chronological inaccuracies in current models, and propose a training method using shuffled event sequences to improve temporal alignment between text and motion.


References

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.


81.7K

•Posted Content

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

Ze Liu, +7 more

- 25 Mar 2021

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This paper proposes Swin Transformer, a hierarchical vision Transformer whose representation is computed with shifted windows, to address differences between the language and vision domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text.


15.7K

•Book

Probability, random variables, and stochastic processes

Athanasios Papoulis, +1 more

- 01 Jan 2002

TL;DR: This book discusses the meaning and axioms of probability, repeated trials, and the concept of a random variable.


12.7K

•Posted Content

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu, +3 more

- 06 Aug 2019

- arXiv: Computer Vision and Pattern Recog...

TL;DR: ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language, is presented, extending the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.


3.3K

•Proceedings Article•10.1109/CVPR.2015.7298698

ActivityNet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, +3 more

- 07 Jun 2015

TL;DR: This paper introduces ActivityNet, a new large-scale video benchmark for human activity understanding that aims at covering a wide range of complex human activities that are of interest to people in their daily living.


3.2K
