TL;DR: This paper focuses on human motion transfer - generation of a video depicting a particular subject, observed in a single image, performing a series of motions exemplified by an auxiliary (driving) video.
Abstract: Generation of realistic high-resolution videos of human subjects is a challenging and important task in computer vision. In this paper, we focus on human motion transfer - generation of a video depicting a particular subject, observed in a single image, performing a series of motions exemplified by an auxiliary (driving) video. Our GAN-based architecture, DwNet, leverages a dense intermediate pose-guided representation and a refinement process to warp the required subject appearance, in the form of the texture, from a source image into a desired pose. Temporal consistency is maintained by further conditioning the decoding process within the GAN on the previously generated frame. In this way, a video is generated in an iterative and recurrent fashion. We illustrate the efficacy of our approach by showing state-of-the-art quantitative and qualitative performance on two benchmark datasets: TaiChi and Fashion Modeling. The latter was collected by us and will be made publicly available to the community.
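The iterative, recurrent generation described in this abstract can be illustrated with a minimal Python sketch. It is not the DwNet implementation: `generate_video` and the `generator` callable are hypothetical names, and seeding the first step with the source image as the "previous frame" is an assumption made only so that the loop is well defined.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")
Pose = TypeVar("Pose")


def generate_video(
    source: Frame,
    driving_poses: Sequence[Pose],
    generator: Callable[[Frame, Pose, Frame], Frame],
) -> List[Frame]:
    """Generate one frame per driving pose, feeding each output back in.

    `generator(source, pose, previous)` is a hypothetical stand-in for a
    trained network that warps the source appearance into `pose` while
    being conditioned on the previously generated frame.
    """
    frames: List[Frame] = []
    previous = source  # assumption: the recurrence starts from the source image
    for pose in driving_poses:
        frame = generator(source, pose, previous)
        frames.append(frame)
        previous = frame  # the next step is conditioned on this output
    return frames


def toy_generator(source: str, pose: str, previous: str) -> str:
    """Toy stand-in that records which previous frame it was given."""
    return f"{previous}>{pose}"
```

Calling `generate_video("S", ["p1", "p2"], toy_generator)` shows the dependency chain: the second frame is built from the first generated frame rather than directly from the source.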
TL;DR: A new evaluation framework, Story Oriented Dense video cAptioning evaluation framework (SODA), is proposed for measuring the performance of video story description systems and it is shown that SODA tends to give lower scores than the current evaluation framework in evaluating captions in the incorrect order.
Abstract: Dense Video Captioning (DVC) is a challenging task that localizes all events in a short video and describes them with natural language sentences. The main goal of DVC is video story description, that is, to generate a concise video story that supports human video comprehension without watching it. In recent years, DVC has attracted increasing attention in the vision and language research community, and has been adopted as a task in the ActivityNet Challenge workshop. In the current research community, the official scorer provided by the ActivityNet Challenge is the de facto standard evaluation framework for DVC systems. It computes averaged METEOR scores for matched pairs of generated and reference captions whose Intersection over Union (IoU) exceeds a specific threshold value. However, the current framework does not take into account the story of the video or the ordering of captions. It also tends to give high scores to systems that generate several hundred redundant captions that humans cannot read. This paper proposes a new evaluation framework, Story Oriented Dense video cAptioning evaluation framework (SODA), for measuring the performance of video story description systems. SODA first tries to find a temporally optimal matching between generated and reference captions to capture the story of a video. Then, it computes METEOR scores for the matching and derives F-measure scores from the METEOR scores to penalize redundant captions. To demonstrate that SODA gives low scores for captions that are inadequate in terms of video story description, we evaluate two state-of-the-art systems with it, varying the number of captions. The results show that SODA gives low scores to too many or too few captions and high scores to captions whose number equals that of the reference, while the current framework gives good scores in all cases. Furthermore, we show that SODA tends to give lower scores than the current evaluation framework when evaluating captions in the incorrect order.
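The ideas in this abstract (temporal IoU, an order-preserving matching, and an F-measure that penalizes redundant captions) can be sketched in Python as follows. This is an illustration under stated assumptions, not the SODA scorer: the function names are hypothetical, weighting each matched pair by IoU times a text score is an assumption, and `token_overlap` is a simple stand-in that is not METEOR.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]   # (start, end) in seconds
Caption = Tuple[Segment, str]   # a timestamped sentence


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over Union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def token_overlap(hyp: str, ref: str) -> float:
    """Stand-in text score (Jaccard over tokens); this is NOT METEOR."""
    h, r = set(hyp.lower().split()), set(ref.lower().split())
    return len(h & r) / len(h | r) if h | r else 0.0


def ordered_matching_score(
    gen: List[Caption],
    ref: List[Caption],
    text_score: Callable[[str, str], float] = token_overlap,
) -> float:
    """Best total pair score over one-to-one, order-preserving matchings."""
    gen = sorted(gen, key=lambda c: c[0])
    ref = sorted(ref, key=lambda c: c[0])
    # best[i][j]: best total using the first i generated and j reference captions
    best = [[0.0] * (len(ref) + 1) for _ in range(len(gen) + 1)]
    for i, (g_seg, g_txt) in enumerate(gen, 1):
        for j, (r_seg, r_txt) in enumerate(ref, 1):
            # assumption: a pair is worth IoU times the text score
            pair = temporal_iou(g_seg, r_seg) * text_score(g_txt, r_txt)
            best[i][j] = max(best[i - 1][j], best[i][j - 1],
                             best[i - 1][j - 1] + pair)
    return best[-1][-1]


def f_measure(
    gen: List[Caption],
    ref: List[Caption],
    text_score: Callable[[str, str], float] = token_overlap,
) -> float:
    """Harmonic mean of precision (per generated) and recall (per reference)."""
    if not gen or not ref:
        return 0.0
    total = ordered_matching_score(gen, ref, text_score)
    precision, recall = total / len(gen), total / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because precision is divided by the number of generated captions, adding redundant captions lowers the score even when every reference caption is still matched, and because the matching must preserve temporal order, captions placed in the wrong order cannot all be matched.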
TL;DR: This paper presents a simple yet effective way of fusing information from multiple cameras in basic behavior unit (BBU) discovery, and analyzes results on artificial mouse video using single, stereo, and three cameras.
Abstract: It has become increasingly popular to study animal behaviors with the assistance of video recordings. Traditional manual video annotation is a time- and labor-consuming process, and the observation results vary between observers. Hence, an automated video processing and behavior analysis system is desirable. We propose a framework for automatic video-based behavior analysis systems, which consists of four major modules: behavior modeling, feature extraction from video sequences, basic behavior unit (BBU) discovery, and complex behavior recognition. In this paper, we focus on BBU discovery using the affinity graph method on the feature data extracted from video images. We present a simple yet effective way of fusing information from multiple cameras in BBU discovery, and we present and analyze results on artificial mouse video using single, stereo, and three cameras. Overall, the results are encouraging.
TL;DR: Zhang et al. propose General Appearance-Controllable GAN (GAC-GAN), a general method for appearance-controllable human video motion transfer.
Abstract: Human video motion transfer has a wide range of applications in multimedia, computer vision, and graphics. Recently, due to the rapid development of Generative Adversarial Networks (GANs), there has been significant progress in the field. However, almost all existing GAN-based works tend to address the mapping from human motions to video scenes, with scene appearances encoded individually in the trained models. Therefore, each trained model can only generate videos with a specific scene appearance, and new models must be trained to generate new appearances. Besides, existing works lack the capability of appearance control. For example, users have to provide video recordings of themselves wearing new clothes or performing in new backgrounds to enable clothes or background changes in their synthetic videos, which greatly limits application flexibility. In this paper, we propose General Appearance-Controllable GAN (GAC-GAN), a general method for appearance-controllable human video motion transfer. To enable general-purpose appearance synthesis, we propose to include appearance information in the conditioning inputs. Thus, once trained, our model can generate new appearances by altering the input appearance information. To achieve appearance control, we first obtain the appearance-controllable conditioning inputs, and then utilize a two-stage GAC-GAN to generate the corresponding appearance-controllable outputs, where we utilize an Appearance-Consistency GAN (ACGAN) loss and a shadow extraction module for output foreground and background appearance control, respectively. We further build a solo dance dataset containing a large number of dance videos for training and evaluation. Experimental results on our solo dance dataset and the iPER dataset show that our proposed GAC-GAN not only supports appearance-controllable human video motion transfer but also achieves higher video quality than state-of-the-art methods.
TL;DR: The synthetic dictionary created in this work can be used in a translation system in which a spoken or written sentence is converted into sign language animation.
Abstract: Objective: Development of an Indian Sign Language video dictionary is essential in today’s world of computerization. Though many human-video sign language dictionaries are available, we aim to develop an Indian Sign Language dictionary using synthetic animation, which uses a computer-generated cartoon rather than a real human. Methods/Statistical Analysis: Sign language cannot be spoken or written, unlike other languages such as English, Punjabi, and Hindi. The most commonly used words in Indian Sign Language are categorized, and these words are then converted into the sign language writing notation HamNoSys (Hamburg Notation System). The HamNoSys notation is then converted into SiGML (Signing Gesture Markup Language), from which the synthetic animation of the sign (using a computer-generated cartoon) is generated. Findings: Synthetic animations are better than human videos in terms of memory consumption, standardization, and flexibility. Synthetic animations can be modified as required, whereas human videos cannot. The only apparent drawback is that these synthetic animations may lack the natural non-manual component of signs. Applications/Improvements: The synthetic dictionary created in this work can be used in a translation system in which a spoken or written sentence is converted into sign language animation. The dictionary can also be used to educate hard-of-hearing people. Display boards can be created to display important messages in Indian Sign Language at public gatherings.