TL;DR: A technique is introduced that manipulates small movements in videos by analyzing motion in complex-valued image pyramids; it supports larger amplification factors and is significantly less sensitive to noise than the previous Eulerian Video Magnification method.
Abstract: We introduce a technique to manipulate small movements in videos based on an analysis of motion in complex-valued image pyramids. Phase variations of the coefficients of a complex-valued steerable pyramid over time correspond to motion, and can be temporally processed and amplified to reveal imperceptible motions, or attenuated to remove distracting changes. This processing does not involve the computation of optical flow, and in comparison to the previous Eulerian Video Magnification method it supports larger amplification factors and is significantly less sensitive to noise. These improved capabilities broaden the set of applications for motion processing in videos. We demonstrate the advantages of this approach on synthetic and natural video sequences, and explore applications in scientific analysis, visualization and video enhancement.
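The link between phase variation and motion can be illustrated with a deliberately tiny sketch. It assumes a single complex sinusoid in place of the complex-valued steerable pyramid and skips the temporal filtering step entirely; the names `magnify_phase`, `omega`, `delta` and `alpha` are illustrative and not taken from the paper.

```python
import cmath
import math


def magnify_phase(ref, cur, alpha):
    """Scale the per-coefficient phase change between two frames by (1 + alpha)."""
    out = []
    for r, c in zip(ref, cur):
        # Wrapped phase difference between the current and reference coefficient.
        dphi = cmath.phase(c * r.conjugate())
        # Keep the current amplitude, exaggerate the phase change.
        out.append(abs(c) * cmath.exp(1j * (cmath.phase(r) + (1 + alpha) * dphi)))
    return out


omega = 2 * math.pi / 32   # spatial frequency of the toy signal (assumption)
delta = 0.1                # true sub-pixel shift between the two frames
alpha = 9.0                # amplification factor

xs = range(64)
ref = [cmath.exp(1j * omega * x) for x in xs]
cur = [cmath.exp(1j * omega * (x + delta)) for x in xs]

magnified = magnify_phase(ref, cur, alpha)
# For a single sinusoid, a phase change of omega * shift corresponds to a shift.
apparent_shift = cmath.phase(magnified[0] * ref[0].conjugate()) / omega
```

In this toy setting the apparent shift comes out as `(1 + alpha) * delta`, without any optical flow being computed. The sketch only holds while the amplified phase change stays below pi, which hints at why amplification factors are bounded in practice.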
TL;DR: The proposed method imposes spatial and temporal constraints on the video volumes so that an inference mechanism can estimate the probability density functions of their arrangements, and it outperformed all other BOV-based approaches that do not account for contextual information.
TL;DR: An abnormal video event detection system that considers both spatial and temporal contexts is presented and a new region-based descriptor called “Motion Context” is proposed to describe both motion and appearance information of the spatio-temporal segment.
Abstract: Video anomaly detection plays a critical role for intelligent video surveillance. We present an abnormal video event detection system that considers both spatial and temporal contexts. To characterize the video, we first perform the spatio-temporal video segmentation and then propose a new region-based descriptor called “Motion Context,” to describe both motion and appearance information of the spatio-temporal segment. For anomaly measurements, we formulate the abnormal event detection as a matching problem, which is more robust than statistic model-based methods, especially when the training dataset is of limited size. For each testing spatio-temporal segment, we search for its best match in the training dataset, and determine how normal it is using a dynamic threshold. To speed up the search process, compact random projections are also adopted. Experiments on the benchmark dataset and comparisons with the state-of-the-art methods validate the advantages of our algorithm.
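The matching formulation above can be sketched as a nearest-neighbour search in a randomly projected space. This is only an illustration under stated assumptions: the descriptors are plain lists of numbers rather than the "Motion Context" descriptor, the projection is a seeded Gaussian matrix, and a fixed `threshold` stands in for the dynamic threshold, whose rule the abstract does not give. All function names are hypothetical.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build a Gaussian random projection matrix (out_dim rows, in_dim columns)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vec):
    """Map a descriptor into the low-dimensional space."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def best_match_distance(query, train_projected):
    """Distance from the query to its closest training descriptor."""
    return min(math.dist(query, t) for t in train_projected)


def is_abnormal(descriptor, matrix, train_projected, threshold):
    """Flag a test segment whose best training match is farther than threshold."""
    query = project(matrix, descriptor)
    return best_match_distance(query, train_projected) > threshold


train = [[1.0] * 8, [0.0, 1.0] * 4]          # toy "normal" descriptors
matrix = make_projection(in_dim=8, out_dim=4)
train_projected = [project(matrix, t) for t in train]

seen_before = is_abnormal(train[0], matrix, train_projected, threshold=0.5)
far_away = is_abnormal([100.0] * 8, matrix, train_projected, threshold=0.5)
```

Projecting once and searching in the smaller space is what makes the search cheaper; a real system would tune `out_dim` against the accuracy it loses.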
TL;DR: This paper proposes an automated method of video key frame extraction using dynamic Delaunay graph clustering via an iterative edge pruning strategy that improves the video summary.
TL;DR: An extensive review of BBME algorithms proposed within the last three decades is presented, dividing them into five categories: search position number reduction, multiresolution, simplification of matching criterion, fast full search, and computation-aware.
Abstract: In multi-view video coding, both temporal and inter-view redundancies can be exploited by using standard block-based motion estimation (BBME) techniques. In this paper, an extensive review of BBME algorithms proposed within the last three decades is presented. Algorithms are divided into five categories: 1) based on the search position number reduction; 2) multiresolution; 3) based on the simplification of matching criterion; 4) fast full search; 5) computation-aware. Algorithms are compared in terms of their efficiency and computational complexity.
TL;DR: This paper examines the authenticity of digital video evidence and proposes a machine learning approach to detecting frame deletion; the proposed solution is shown to detect forged videos regardless of the number of deleted frames.
TL;DR: A method for processing video signals determines a motion vector list including at least one of spatial motion vectors, temporal motion vectors, and variation vectors as motion vector candidates of a target block, and extracts motion vector identifying information that specifies the motion vector candidates used as the predicted motion vectors of the target block.
Abstract: The method for processing video signals according to the present invention determines a motion vector list including at least one of spatial motion vectors, temporal motion vectors, and variation vectors as motion vector candidates of a target block, extracts motion vector identifying information which specifies the motion vector candidates used as the predicted motion vectors of the target block, determines the motion vector candidates corresponding to the motion vector identifying information as the predicted motion vectors of the target block, and performs motion compensation on the basis of the predicted motion vectors. The present invention enables accurate motion vector prediction by forming motion vector candidates and induces motion vectors of a target block therefrom, and can enhance coding efficiency by reducing the amount of transmitted residual data.
TL;DR: A novel multivariate sparse representation method for video-to-video face recognition that simultaneously takes into account correlations as well as coupling information among the video frames, and is modified to be robust in the presence of noise and occlusion.
Abstract: In video-based face recognition, a key challenge is in exploiting the extra information available in a video; e.g., face, body, and motion identity cues. In addition, different video sequences of the same subject may contain variations in resolution, illumination, pose, and facial expressions. These variations contribute to the challenges in designing an effective video-based face-recognition algorithm. We propose a novel multivariate sparse representation method for video-to-video face recognition. Our method simultaneously takes into account correlations as well as coupling information among the video frames. Our method jointly represents all the video data by a sparse linear combination of training data. In addition, we modify our model so that it is robust in the presence of noise and occlusion. Furthermore, we kernelize the algorithm to handle the non-linearities present in video data. Numerous experiments using unconstrained video sequences show that our method is effective and performs significantly better than many state-of-the-art video-based face recognition algorithms in the literature.
TL;DR: A novel motion-compensated frame interpolation (MCFI) algorithm to increase video temporal resolutions based on multihypothesis motion estimation and texture optimization is proposed.
Abstract: A novel motion-compensated frame interpolation (MCFI) algorithm to increase video temporal resolutions based on multihypothesis motion estimation and texture optimization is proposed in this paper. Initially, we form multiple motion hypotheses for each pixel by employing different motion estimation parameters, i.e., different block sizes and directions. Then, we determine the best motion hypothesis for each pixel by solving a labeling problem and optimizing the parameters. In the labeling problem, the cost function is composed of color, shape, and smoothness terms. Finally, we refine the motion hypothesis field based on the texture optimization technique and blend multiple source pixels to interpolate each pixel in the intermediate frame. Simulation results demonstrate that the proposed algorithm provides significantly better MCFI performance than conventional algorithms.
TL;DR: An approach for tracking articulated motions that "links" articulated shape models of people in adjacent frames through the dense optical flow that provides a way of integrating image evidence across frames to improve pose inference.
Abstract: We address the problem of upper-body human pose estimation in uncontrolled monocular video sequences, without manual initialization. Most current methods focus on isolated video frames and often fail to correctly localize arms and hands. Inferring pose over a video sequence is advantageous because poses of people in adjacent frames exhibit properties of smooth variation due to the nature of human and camera motion. To exploit this, previous methods have used prior knowledge about distinctive actions or generic temporal priors combined with static image likelihoods to track people in motion. Here we take a different approach based on a simple observation: Information about how a person moves from frame to frame is present in the optical flow field. We develop an approach for tracking articulated motions that "links" articulated shape models of people in adjacent frames through the dense optical flow. Key to this approach is a 2D shape model of the body that we use to compute how the body moves over time. The resulting "flowing puppets" provide a way of integrating image evidence across frames to improve pose inference. We apply our method on a challenging dataset of TV video sequences and show state-of-the-art performance.
TL;DR: A novel multi-level frame interpolation scheme by exploiting the interactions among different levels based on their distinct characteristics and intertwined relationships that has superior performance over several classical schemes in both subjective visual quality and objective peak signal-to-noise ratio/structure similarity measurements.
Abstract: This paper proposes a novel multi-level frame interpolation scheme by exploiting the interactions among different levels. The proposed scheme includes three major stages that work at block level, pixel level, and sequence level, respectively. Effective algorithms are designed for each stage, i.e., block-level motion estimation with dropping unreliable motion vectors, pixel-level motion vector-guided partial scale-invariant feature transform flow matching, and sequence-level 3-D total variation regularized completion. Compared to traditional methods that focus mostly at one single level, the proposed scheme manages to recognize and utilize the interactions among the three levels based on their distinct characteristics and intertwined relationships. With a proper exploitation of interactions, unique advantages for each level can be effectively preserved while inherent limitations of a given level can be overcome by utilizing information from other levels. Extensive experiments have confirmed its superior performance over several classical schemes, in both subjective visual quality and objective peak signal-to-noise ratio/structure similarity measurements, and typical artifacts can be significantly reduced.
TL;DR: A Video SAR (Synthetic Aperture Radar) mode that provides a persistent view of a scene centered at the Motion Compensation Point (MCP) is described, along with the generation of synthetic targets with linear motion, including both constant velocity and constant acceleration.
Abstract: This paper details a Video SAR (Synthetic Aperture Radar) mode that provides a persistent view of a scene centered at the Motion Compensation Point (MCP). The radar platform follows a circular flight path. An objective is to form a sequence of SAR images while observing dynamic scene changes at a selectable video frame rate. A formulation of backprojection meets this objective. Modified backprojection equations take into account changes in the grazing angle or squint angle that result from non-ideal flight paths. The algorithm forms a new video frame relying upon much of the signal processing performed in prior frames. The method described applies an appropriate azimuth window to each video frame for window sidelobe rejection. A Cardinal Direction Up (CDU) coordinate frame forms images with the top of the image oriented along a given cardinal direction for all video frames. Using this coordinate frame helps characterize a moving target’s target response. Generation of synthetic targets with linear motion including both constant velocity and constant acceleration is described. The synthetic target video imagery demonstrates dynamic SAR imagery with expected moving target responses. The paper presents 2011 flight data collected by General Atomics Aeronautical Systems, Inc. (GA-ASI) implementing the video SAR mode. The flight data demonstrates good video quality showing moving vehicles. The flight imagery demonstrates the real-time capability of the video SAR mode. The video SAR mode uses a circular shift register of subapertures. The radar employs a Graphics Processing Unit (GPU) in order to implement this algorithm.
TL;DR: Experimental results show that the quality of the interpolated frames using the proposed method is better when compared with the MCFRUC techniques.
Abstract: In this paper, a new low-complexity true-motion estimation (TME) algorithm is proposed for video processing applications, such as motion-compensated temporal frame interpolation (MCTFI) or motion-compensated frame rate up-conversion (MCFRUC). Regular motion estimation, which is often used in video coding, aims to find the motion vectors (MVs) to reduce the temporal redundancy, whereas TME aims to track the projected object motion as closely as possible. TME is obtained by imposing implicit and/or explicit smoothness constraints on the block-matching algorithm. To produce better quality-interpolated frames, the dense motion field at interpolation time is obtained for both forward and backward MVs; then, bidirectional motion compensation using forward and backward MVs is applied by mixing both elegantly. Finally, the performance of the proposed algorithm for MCTFI is demonstrated against recently proposed methods and smoothness constraint optical flow employed by a professional video production suite. Experimental results show that the quality of the interpolated frames using the proposed method is better when compared with the MCFRUC techniques.
TL;DR: This paper presents an automatic video summarization technique based on motion analysis that defines motion metrics estimated from two optical flow algorithms, each using two different key frame selection criteria.
TL;DR: A novel optimized hierarchical block matching algorithm in which the computational cost is minimized for the scale factor and the number of levels in the hierarchy, based on a generalized version of the Gaussian pyramid and its inter-layer transformation of coordinates is presented.
Abstract: Camera resolution has increased considerably in recent years, and registration between high-resolution images is computationally expensive even with hierarchical block matching. This paper presents a novel optimized hierarchical block matching algorithm in which the computational cost is minimized for the scale factor and the number of levels in the hierarchy. The algorithm is based on a generalized version of the Gaussian pyramid and its inter-layer transformation of coordinates. The search window size is properly determined to resolve possible error propagation in hierarchical block matching. In addition, we also propose a simple but effective method for aligning colors between two images based on color distribution adjustment as a preprocessing step. Simplifying a general color imaging model, we show that much of the color inconsistency can be compensated by our color alignment method. The experimental results show that the optimized hierarchical block matching and color alignment methods increase the block matching speed and accuracy, and thus improve image registration. Using our algorithm, the overall registration process takes about 1.28 s for a pair of 5-megapixel images.
TL;DR: The proposed motion estimation algorithm improves the average peak signal-to-noise ratio and the average structural similarity of the interpolated frames by up to 5.31 dB and 0.053, respectively, compared to conventional motion estimation algorithms.
Abstract: In this paper, we propose a new motion estimation algorithm to be used for motion-compensated frame rate up-conversion. The proposed algorithm independently carries out motion estimations in both forward and backward directions, and selects the more reliable of the forward and backward motion vectors by evaluating the motion vector reliability from the viewpoint of the interpolated frame. The proposed algorithm smooths and refines both the forward and backward motion vectors before selecting the reliable one. This procedure helps to select the reasonable motion estimation direction. In identifying the motion vector outliers, the proposed algorithm uses a circular range whose center is located at the mean of the eight neighboring motion vectors of the motion vector being processed. Experimental results using 1720 test images show that the proposed motion estimation algorithm improves the average peak signal-to-noise ratio and the average structural similarity of the interpolated frames by up to 5.31 dB and 0.053, respectively, compared to conventional motion estimation algorithms.
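The outlier test described in the abstract can be sketched as follows. The abstract does not state how the radius of the circular range is chosen, so `radius` is a free parameter here, and the function only handles interior positions of the motion vector field; `is_outlier` and the `(x, y)` tuple layout are illustrative choices.

```python
def is_outlier(field, row, col, radius):
    """Return True if the motion vector at (row, col) lies outside a circle of
    the given radius centred at the mean of its eight neighbours.

    field is a 2-D list of (x, y) motion vectors; (row, col) must be interior.
    """
    neighbours = [
        field[r][c]
        for r in (row - 1, row, row + 1)
        for c in (col - 1, col, col + 1)
        if (r, c) != (row, col)
    ]
    mean_x = sum(v[0] for v in neighbours) / len(neighbours)
    mean_y = sum(v[1] for v in neighbours) / len(neighbours)
    dx = field[row][col][0] - mean_x
    dy = field[row][col][1] - mean_y
    # Compare squared distances to avoid a square root.
    return dx * dx + dy * dy > radius * radius


# A smooth field moving one pixel to the right, with one deviating vector.
smooth = [[(1, 0)] * 3 for _ in range(3)]
noisy = [row[:] for row in smooth]
noisy[1][1] = (8, -6)

flag_smooth = is_outlier(smooth, 1, 1, radius=2.0)
flag_noisy = is_outlier(noisy, 1, 1, radius=2.0)
```

Here the deviating vector sits at distance sqrt(85) from the neighbourhood mean `(1, 0)`, well outside the circle, while the undisturbed centre sits exactly on the mean.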
TL;DR: This comparative study of Block Matching search algorithms gives an objective and subjective quality assessment of the decoded video frames produced by each search algorithm.
Abstract: Block Matching is a temporal compression technique used in video encoding. The main purpose of this method is to determine the displacements of each block of pixels between two successive frames. This technique, performed in the motion estimation step, occupies the majority of the total video coding time. The aim of this work is to give a comparative study of various Block Matching search algorithms. This study does not focus only on the complexity and computation time of each algorithm; it also gives an objective and subjective quality assessment of the decoded video frames produced by each search algorithm.
TL;DR: In this paper, a hand tracking application configures the processor to obtain a reference frame of video data and an alternate frame from the image capture system, identify corresponding pixels within the reference and alternate frames of the video data, and detect at least one candidate finger within a bounded region in the reference frame.
Abstract: Systems and methods for tracking human hands using parts based template matching within bounded regions are described. One embodiment of the invention includes a processor; an image capture system configured to capture multiple images of a scene; and memory containing a plurality of templates that are rotated and scaled versions of a finger template. A hand tracking application configures the processor to: obtain a reference frame of video data and an alternate frame of video data from the image capture system; identify corresponding pixels within the reference and alternate frames of video data; identify at least one bounded region within the reference frame of video data containing pixels having corresponding pixels in the alternate frame of video data satisfying a predetermined criterion; and detect at least one candidate finger within the at least one bounded region in the reference frame of video data.
TL;DR: In this article, a source video stream is processed to extract a desired object from the remainder of video stream to produce a segmented video of the object, which is then displayed over the target video stream.
Abstract: A source video stream is processed to extract a desired object from the remainder of video stream to produce a segmented video of the object. Additional relevant information, such as the orientation of the source camera for each frame in the resulting segmented video of the object, is also determined and stored. During replay, the segmented video of the object, as well as the source camera orientation are obtained. Using the source camera orientation for each frame of the segmented video of the object, as well as target camera orientation for each frame of a target video stream, a transformation for the segmented video of the object may be produced. The segmented video of the object may be displayed over the target video stream, which may be a live video stream of a scene, using the transformation to spatially register the segmented video to the target video stream.
TL;DR: A new algorithm based on differential evolution (DE) is proposed to reduce the number of search locations in the block-matching process, and deploys more accurate motion vectors, yet delivering competitive time rates.
TL;DR: A new and more general parametric model is presented, which takes into account bit rate, frame rate, display resolution, video content and the percentage of packet loss.
Abstract: During the last few years, different parametric models were proposed for video quality estimation. Each model uses different parameters as inputs, such as bit rate, frame rate and percentage of packet loss, and each model was designed and tested by their authors for a particular codec, display resolution and/or application. This paper presents a review of the parametric models published by ten different groups of authors. Each model is briefly described, and the relevant parametric formulas are presented. The performance of each model is evaluated and contrasted to the other models, using a common video clips set, in different coding and transmission scenarios. Based on the results, a new and more general parametric model is presented, which takes into account bit rate, frame rate, display resolution, video content and the percentage of packet loss.
TL;DR: The proposed segmentation and graph-based video sequence matching method can automatically find optimal sequence matching result from the disordered matching results based on spatial feature and can also reduce the noise caused by spatial feature matching.
Abstract: We propose in this paper a segmentation and graph-based video sequence matching method for video copy detection. Specifically, due to the good stability and discriminative ability of local features, we use SIFT descriptor for video content description. However, matching based on SIFT descriptor is computationally expensive for large number of points and the high dimension. Thus, to reduce the computational complexity, we first use the dual-threshold method to segment the videos into segments with homogeneous content and extract keyframes from each segment. SIFT features are extracted from the keyframes of the segments. Then, we propose an SVD-based method to match two video frames with SIFT point set descriptors. To obtain the video sequence matching result, we propose a graph-based method. It can convert the video sequence matching into finding the longest path in the frame matching-result graph with time constraint. Experimental results demonstrate that the segmentation and graph-based video sequence matching method can detect video copies effectively. Also, the proposed method has advantages. Specifically, it can automatically find optimal sequence matching result from the disordered matching results based on spatial feature. It can also reduce the noise caused by spatial feature matching. And it is adaptive to video frame rate changes. Experimental results also demonstrate that the proposed method can obtain a better tradeoff between the effectiveness and the efficiency of video copy detection.
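The graph step above can be sketched as a longest-path search over frame matches. This is an assumption-laden illustration, not the paper's exact construction: each match is a `(query_index, reference_index)` pair, and the time constraint is modelled as "both indices strictly increase by at most `max_gap`". The name `longest_consistent_path` is hypothetical.

```python
def longest_consistent_path(matches, max_gap):
    """Find the longest chain of frame matches that advances in time in both
    videos, with each step no larger than max_gap frames."""
    nodes = sorted(set(matches))
    if not nodes:
        return []
    best_len = [1] * len(nodes)
    prev = [-1] * len(nodes)
    # Nodes are sorted by query index, so every edge goes from a lower list
    # position to a higher one and a single forward pass suffices.
    for j, (qj, rj) in enumerate(nodes):
        for i in range(j):
            qi, ri = nodes[i]
            in_time = 0 < qj - qi <= max_gap and 0 < rj - ri <= max_gap
            if in_time and best_len[i] + 1 > best_len[j]:
                best_len[j] = best_len[i] + 1
                prev[j] = i
    end = max(range(len(nodes)), key=best_len.__getitem__)
    path = []
    while end != -1:
        path.append(nodes[end])
        end = prev[end]
    return path[::-1]


# Disordered frame-level matches: two of them, (2, 50) and (4, 3), are noise.
matches = [(0, 10), (1, 11), (2, 50), (2, 12), (3, 13), (4, 3)]
copy_path = longest_consistent_path(matches, max_gap=2)
```

The spurious matches drop out because they cannot be chained in temporal order, which is the noise-reduction effect the abstract attributes to the graph stage.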
TL;DR: A degradation control management scheme is provided for a plurality of video streams associated with a plurality of user terminals in a communication network, based at least in part on an overall video quality metric.
Abstract: Degradation control management is provided for a plurality of video streams associated with a plurality of user terminals in a communication network, based at least in part on an overall video quality metric, by determining a video quality metric for each video stream based on at least a set of video quality metric input parameters, and calculating an overall video quality metric based on the determined video quality metrics for the video streams, determining, with an objective function, at least one objective parameter based on at least the overall video quality metric, calculating a scheduling parameter for each video stream using a degradation control algorithm based on at least the determined video quality metric for the respective video stream and on the at least one objective parameter, and scheduling network resources for each video stream based on at least the scheduling parameter for the video stream.
TL;DR: A new approach that consists of combining global and local motion compensation at the decoder side is proposed, which improves significantly the quality of the side information, especially for sequences containing high global motion.
Abstract: The quality of side information plays a key role in distributed video coding. In this paper, we propose a new approach that consists of combining global and local motion compensation at the decoder side. The parameters of the global motion are estimated at the encoder using scale invariant feature transform features. Those estimated parameters are sent to the decoder in order to generate a globally motion compensated side information. Conversely, a locally motion compensated side information is generated at the decoder based on motion-compensated temporal interpolation of neighboring reference frames. Moreover, an improved fusion of global and local side information during the decoding process is achieved using the partially decoded Wyner-Ziv frame and decoded reference frames. The proposed technique improves significantly the quality of the side information, especially for sequences containing high global motion. Experimental results show that, as far as the rate-distortion performance is concerned, the proposed approach can achieve a PSNR improvement of up to 1.9 dB for a Group of Pictures (GOP) size of 2, and up to 4.65 dB for larger GOP sizes, with respect to the reference DISCOVER codec.
TL;DR: The experimental results show that video stabilization using the proposed method outperforms the conventional stabilization methods, especially when the moving foreground (FG) objects occupy a large part of the image.
Abstract: The performance of video stabilization is dependent on the accuracy of global motion estimation between two successive frames. In this paper, we propose a novel method to estimate the global motion accurately using the classified background (BG) feature points (FPs). In the proposed method, global motion estimation and FP classification are jointly performed using both the FP correspondences and the global motion parameters of the previous frame. The experimental results show that video stabilization using the proposed method outperforms the conventional stabilization methods, especially when the moving foreground (FG) objects occupy a large part of the image.
TL;DR: The proposed steganography algorithm is based on color histograms and embeds data directly into video clips: each pixel in each video frame is divided into two parts, and the number of bits to be embedded in the right part is counted in the left part of the pixel.
Abstract: This paper focuses on the utilization of digital video/images as cover to hide data. The proposed steganography algorithm is based on color histograms and embeds data directly into video clips: each pixel in each video frame is divided into two parts, and the number of bits to be embedded in the right part is counted in the left part of the pixel. The algorithm is characterized by the ability to hide larger amounts of data and to extract the written text without errors; it also gives a high level of authentication to guarantee the integrity of the video/images before extraction. Furthermore, the data are embedded inside the video/images randomly, which gives the video/images higher security and resistance against extraction by attackers.
TL;DR: A new block-matching algorithm is proposed that combines Harmony Search (HS), a population-based optimization method inspired by the music improvisation process, with a fitness approximation model that decides which motion vectors are only estimated and which are actually evaluated.
Abstract: Motion estimation is one of the major problems in developing video coding applications. Among all motion estimation approaches, Block-matching (BM) algorithms are the most popular methods due to their effectiveness and simplicity for both software and hardware implementations. A BM approach assumes that the movement of pixels within a defined region of the current frame can be modeled as a translation of pixels contained in the previous frame. In this procedure, the motion vector is obtained by minimizing a certain matching metric that is produced for the current frame over a determined search window from the previous frame. Unfortunately, the evaluation of such matching measurement is computationally expensive and represents the most consuming operation in the BM process. Therefore, BM motion estimation can be viewed as an optimization problem whose goal is to find the best-matching block within a search space. The simplest available BM method is the Full Search Algorithm (FSA) which finds the most accurate motion vector through an exhaustive computation of all the elements of the search space. Recently, several fast BM algorithms have been proposed to reduce the search positions by calculating only a fixed subset of motion vectors despite lowering its accuracy. On the other hand, the Harmony Search (HS) algorithm is a population-based optimization method that is inspired by the music improvisation process in which a musician searches for harmony and continues to polish the pitches to obtain a better harmony. In this paper, a new BM algorithm that combines HS with a fitness approximation model is proposed. The approach uses motion vectors belonging to the search window as potential solutions. A fitness function evaluates the matching quality of each motion vector candidate. In order to save computational time, the approach incorporates a fitness calculation strategy to decide which motion vectors can be only estimated or actually evaluated. Guided by the values of such fitness calculation strategy, the set of motion vectors is evolved through HS operators until the best possible motion vector is identified. The proposed method has been compared to other BM algorithms in terms of velocity and coding quality. Experimental results demonstrate that the proposed algorithm exhibits the best balance between coding efficiency and computational complexity.
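The Full Search Algorithm that serves as the baseline above can be sketched in a few lines. The abstract only speaks of "a certain matching metric", so the sum of absolute differences (SAD) used here is an assumption, as are the function names and the `(dy, dx)` motion vector convention. The Harmony Search variant itself is not reproduced.

```python
def sad(cur, ref, cy, cx, ry, rx, n):
    """Sum of absolute differences between an n x n block of the current
    frame at (cy, cx) and an n x n block of the previous frame at (ry, rx)."""
    return sum(
        abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
        for i in range(n)
        for j in range(n)
    )


def full_search(cur, ref, by, bx, n, search):
    """Evaluate every displacement in a (2*search+1)^2 window and return the
    motion vector (dy, dx) with the lowest SAD, together with that SAD."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = by + dy, bx + dx
            # Skip candidates whose block would fall outside the frame.
            if 0 <= ry <= height - n and 0 <= rx <= width - n:
                cost = sad(cur, ref, by, bx, ry, rx, n)
                if cost < best_cost:
                    best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost


# Toy 8 x 8 previous frame with distinct local structure.
ref = [[10 * y * y + x * x * x for x in range(8)] for y in range(8)]
# Current frame: content moved one row down and two columns left, zero padded.
cur = [
    [ref[y - 1][x + 2] if y >= 1 and x + 2 < 8 else 0 for x in range(8)]
    for y in range(8)
]

mv, cost = full_search(cur, ref, by=3, bx=3, n=2, search=2)
```

With a search range of 2 this evaluates up to 25 candidates per block, which is the cost that fast BM methods and the HS-based approach try to cut down.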
TL;DR: Simulation results show that improvements over other fast block matching motion estimation algorithms could be achieved with 31%~63% of search point reduction, without degradation of image quality.
TL;DR: This work considers a motion-adaptive linear dynamical model for videos that leverages the inherent spatial and temporal redundancies in a video sequence to reduce video-encoder complexity and proposes a CS-based video compression scheme.
Abstract: Compressive sensing (CS) provides a general signal acquisition framework that enables the reconstruction of sparse signals from a small number of linear measurements. To reduce video-encoder complexity, we present a CS-based video compression scheme. Modern video-encoder complexity arises mainly from the transform-coding and motion-estimation blocks. In our proposed scheme, we eliminate these blocks from the encoder, which achieves compression by merely taking a few linear measurements of each image in a video sequence. To guarantee stable reconstruction of the video sequence from only a few measurements, the decoder must effectively exploit the inherent spatial and temporal redundancies in a video sequence. To leverage these redundancies, we consider a motion-adaptive linear dynamical model for videos. The recovery process involves solving an l1-regularized optimization problem, which iteratively updates estimates for the video frames and motion within adjacent frames. To evaluate the performance of our proposed scheme, we performed experiments on various standard test sequences.
TL;DR: In this paper, a video encoder signals, in a bitstream, a syntax element that indicates whether a current video unit is predicted from a VSP picture, and a video decoder decodes the syntax element from the bitstream and determines, based at least in part on the syntax element, whether the bitstream includes the motion information.
Abstract: A video encoder signals, in a bitstream, a syntax element that indicates whether a current video unit is predicted from a VSP picture. The current video unit is a macroblock or a macroblock partition. The video encoder determines, based at least in part on whether the current video unit is predicted from the VSP picture, whether to signal, in the bitstream, motion information for the current video unit. A video decoder decodes the syntax element from the bitstream and determines, based at least in part on the syntax element, whether the bitstream includes the motion information.
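The conditional signaling described above can be sketched with a list of symbols standing in for the bitstream. The abstract does not say in which case the motion information is omitted, so this sketch assumes it is skipped when the unit is predicted from the VSP picture; `encode_unit`, `decode_unit` and the symbol layout are purely illustrative and do not correspond to any real codec syntax.

```python
def encode_unit(is_vsp_predicted, motion_info):
    """Write the syntax element, then the motion information only if needed."""
    symbols = [1 if is_vsp_predicted else 0]   # the syntax element
    if not is_vsp_predicted:                   # assumed rule, see lead-in
        symbols.append(motion_info)
    return symbols


def decode_unit(symbols):
    """Read the syntax element and use it to decide whether motion
    information follows in the stream."""
    is_vsp_predicted = symbols[0] == 1
    motion_info = None if is_vsp_predicted else symbols[1]
    return is_vsp_predicted, motion_info


vsp_stream = encode_unit(True, motion_info=(3, -1))
regular_stream = encode_unit(False, motion_info=(3, -1))
```

The point of the flag is that the decoder never has to guess the stream layout: `vsp_stream` carries one symbol, `regular_stream` carries two, and the first symbol tells the decoder which case it is in.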