Proceedings Article, DOI: 10.1109/iccv51070.2023.01427
Exploring Temporal Concurrency for Video-Language Representation Learning
Heng Zhang, Daqing Liu, Zezhong Lv, Bing Su, Dacheng Tao
- 01 Oct 2023
pp 15522-15532
TL;DR: This paper learns video-language representations by modeling video-language pairs as Temporal Concurrent Processes (TCP) within a process-wise distance metric learning framework, and adds a regularization term that pushes the embeddings of each modality to approximate a stochastic process so as to preserve the inherent dynamics.
Abstract: Paired video and language data are naturally temporally concurrent, which requires modeling the temporal dynamics within each modality and the temporal alignment across modalities simultaneously. However, most existing video-language representation learning methods focus only on discrete semantic alignment, which encourages aligned semantics to be close in the latent space, or on temporal context dependency, which captures short-range coherence, and thus fail to build temporal concurrency. In this paper, we propose to learn video-language representations by modeling video-language pairs as Temporal Concurrent Processes (TCP) via a process-wise distance metric learning framework. Specifically, we employ soft Dynamic Time Warping (DTW) to measure the distance between two processes across modalities and then optimize the DTW costs. Meanwhile, we introduce a regularization term that enforces the embeddings of each modality to approximate a stochastic process, guaranteeing the inherent dynamics. Experimental results on three benchmarks demonstrate that TCP is a state-of-the-art method for various video-language understanding tasks, including paragraph-to-video retrieval, video moment retrieval, and video question-answering. Code is available at https://github.com/hengRUC/TCP.
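As a rough illustration of the soft-DTW distance mentioned in the abstract, the sketch below computes a soft-DTW cost between a sequence of clip embeddings and a sequence of sentence embeddings. It is not the authors' implementation (see their repository for that). The squared Euclidean frame cost, the smoothing parameter `gamma`, and the names `softmin` and `soft_dtw` are assumptions chosen for illustration.

```python
import numpy as np


def softmin(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    values = np.asarray(values, dtype=float)
    lowest = values.min()
    return lowest - gamma * np.log(np.sum(np.exp(-(values - lowest) / gamma)))


def soft_dtw(video, text, gamma=0.1):
    """Soft-DTW cost between two embedding sequences of shape (n, d) and (m, d).

    Assumption: the local cost is the squared Euclidean distance between
    one video-clip embedding and one sentence embedding.
    """
    n, m = len(video), len(text)
    # Pairwise local costs, shape (n, m).
    cost = ((video[:, None, :] - text[None, :, :]) ** 2).sum(axis=-1)

    # acc[i, j] holds the soft-minimal accumulated cost of aligning
    # the first i video steps with the first j text steps.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            previous = [acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]]
            acc[i, j] = cost[i - 1, j - 1] + softmin(previous, gamma)
    return acc[n, m]


# Toy 1-D "embeddings": a temporally aligned pair and a reversed pair.
video = np.array([[0.0], [1.0], [2.0], [3.0]])
aligned = soft_dtw(video, video.copy(), gamma=0.01)
reversed_order = soft_dtw(video, video[::-1].copy(), gamma=0.01)
```

In this toy case the temporally aligned pair receives a lower cost than the reversed pair, which is the property a process-level distance needs in order to penalize pairs whose events occur in a different order. Because the soft minimum is differentiable, such a cost could in principle be minimized by gradient descent in an autodiff framework.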
Citations
Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models
Kent Fujiwara, M. Tanaka, Qing Yu
TL;DR: Researchers propose Chronologically Accurate Retrieval (CAR) to evaluate temporal understanding in motion-language models, revealing chronological inaccuracies in current models, and propose a training method using shuffled event sequences to improve temporal alignment between text and motion.
Multilingual Synopses of Movie Narratives: A Dataset for Vision-Language Story Understanding
Yidan Sun, Jianfei Yu, Boyang Li
- 01 Jan 2024
EA-VTR: Event-Aware Video-Text Retrieval
Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Yingmin Luo, Xu Li, Xiaojuan Qi, Ying Shan, Weiming Hu
- 10 Jul 2024
TL;DR: EA-VTR improves video-text retrieval by supplementing pre-training data with event augmentation strategies and constructing an event-aware model that encodes frame-level and video-level visual representations for detailed event content and temporal cross-modal alignment.
Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval
Haowei Liu, Yaya Shi, Haiyang Xu, Chunfeng Yuan, Qinghao Ye, Chenliang Li, Ming Yan, Ji Zhang, Fei Huang, Bing Li, Weiming Hu
- 26 Feb 2024
TL;DR: Unify framework learns lexicon representations to capture fine-grained semantics and combines latent and lexicon representations for effective video-text retrieval.