TL;DR: A new approach to motion estimation in video using a set of particles that is useful for a variety of applications and cannot be directly obtained using existing methods such as optical flow or feature tracking.
Abstract: This paper describes a new approach to motion estimation in video. We represent video motion using a set of particles. Each particle is an image point sample with a long-duration trajectory and other properties. To optimize these particles, we measure point-based matching along the particle trajectories and distortion between the particles. The resulting motion representation is useful for a variety of applications and cannot be directly obtained using existing methods such as optical flow or feature tracking. We demonstrate the algorithm on challenging real-world videos that include complex scene geometry, multiple types of occlusion, regions with low texture, and non-rigid deformations.
TL;DR: In this paper, a method for head pose estimation is proposed, where the average motion vectors over time (all past frames of video) are combined to determine an accumulated average motion vector, estimating the orientation of a user's head in the video frame based on the accumulated average vector, and outputting at least one parameter indicating the estimated orientation.
Abstract: A method for head pose estimation may include receiving block motion vectors for a frame of video from a block motion estimator, selecting at least one block for analysis, determining an average motion vector for the at least one selected block, combining the average motion vectors over time (all past frames of video) to determine an accumulated average motion vector, estimating the orientation of a user's head in the video frame based on the accumulated average motion vector, and outputting at least one parameter indicative of the estimated orientation.
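The accumulation step described above can be sketched as follows. This is a minimal illustration, not the patented method: combining the per-frame averages by summation, the `gain` parameter, and the linear mapping from accumulated displacement to orientation are all assumptions introduced here.

```python
from typing import List, Tuple

Vector = Tuple[float, float]


def average_vector(vectors: List[Vector]) -> Vector:
    """Mean of the block motion vectors selected for analysis in one frame."""
    n = len(vectors)
    return (sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n)


def accumulate(per_frame_blocks: List[List[Vector]]) -> Vector:
    """Combine the per-frame averages over all past frames.

    Assumption: "combining" is modeled as a running sum.
    """
    acc_x, acc_y = 0.0, 0.0
    for blocks in per_frame_blocks:
        avg_x, avg_y = average_vector(blocks)
        acc_x += avg_x
        acc_y += avg_y
    return (acc_x, acc_y)


def orientation(acc: Vector, gain: float = 1.0) -> Vector:
    """Hypothetical mapping: orientation proportional to accumulated motion."""
    return (gain * acc[0], gain * acc[1])
```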
TL;DR: In this paper, an appearance model is generated for blobs that are close to one another in terms of image position, in combination with a depth factor representing the depth order of the occluded objects, to segment the resulting group blob into regions which are classified as representing one or other of the merged objects.
Abstract: A video surveillance system (10) comprises a camera (25), a personal computer (PC) (27) and a video monitor (29). Video processing software is provided on the hard disk drive of the PC (27). The software is arranged to perform a number of processing operations on video data received from the camera, the video data representing individual frames of captured video. In particular, the software is arranged to identify one or more foreground blobs in a current frame, to match the or each blob with an object identified in one or more previous frames, and to track the motion of the or each object as more frames are received. In order to maintain the identity of objects during an occlusion event, an appearance model is generated for blobs that are close to one another in terms of image position. Once occlusion takes place, the respective appearance models are used, in combination with a depth factor representing the depth order of the occluded objects, to segment the resulting group blob into regions which are classified as representing one or other of the merged objects.
TL;DR: The algorithm, which escapes the complexity of existing methods based, for example, on clustering or optimization strategies, dynamically and rapidly selects a variable number of key frames within each sequence by analyzing the differences between two consecutive frames of a video sequence.
Abstract: Video summarization, aimed at reducing the amount of data that must be examined in order to retrieve the information desired from information in a video, is an essential task in video analysis and indexing applications. We propose an innovative approach for the selection of representative (key) frames of a video sequence for video summarization. By analyzing the differences between two consecutive frames of a video sequence, the algorithm determines the complexity of the sequence in terms of changes in the visual content expressed by different frame descriptors. The algorithm, which escapes the complexity of existing methods based, for example, on clustering or optimization strategies, dynamically and rapidly selects a variable number of key frames within each sequence. The key frames are extracted by detecting curvature points within the curve of the cumulative frame differences. Another advantage is that it can extract the key frames on the fly: curvature points can be determined while computing the frame differences and the key frames can be extracted as soon as a second high curvature point has been detected. We compare the performance of this algorithm with that of other key frame extraction algorithms based on different approaches. The summaries obtained have been objectively evaluated by three quality measures: the Fidelity measure, the Shot Reconstruction Degree measure and the Compression Ratio measure.
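As a rough sketch of the curvature idea, the following assumes frames are flat lists of pixel values, uses the sum of absolute pixel differences as the frame descriptor, approximates curvature by the discrete second difference of the cumulative curve, and picks the midpoint between two consecutive high-curvature points as the key frame. All of these choices are illustrative assumptions; the abstract does not specify them.

```python
def frame_differences(frames):
    """Sum of absolute pixel differences between consecutive frames."""
    return [
        sum(abs(a - b) for a, b in zip(f0, f1))
        for f0, f1 in zip(frames, frames[1:])
    ]


def cumulative(values):
    """Running total of the frame differences."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


def curvature_points(curve, threshold):
    """Indices where the discrete second difference exceeds the threshold."""
    return [
        i
        for i in range(1, len(curve) - 1)
        if abs(curve[i + 1] - 2 * curve[i] + curve[i - 1]) > threshold
    ]


def key_frames(frames, threshold):
    """Midpoints between consecutive high-curvature points.

    Indices refer to the difference curve, where index i is the
    transition between frame i and frame i + 1.
    """
    curve = cumulative(frame_differences(frames))
    points = curvature_points(curve, threshold)
    return [(a + b) // 2 for a, b in zip(points, points[1:])]
```

Because each key frame depends only on the two most recent curvature points, the same logic could run incrementally while frames arrive, which matches the on-the-fly property the abstract mentions.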
TL;DR: A novel space-time video summarization method, called space-time video montage, that simultaneously analyzes both the spatial and temporal information distribution in a video sequence and extracts the visually informative space-time portions of the input videos.
Abstract: Conventional video summarization methods focus predominantly on summarizing videos along the time axis, such as building a movie trailer. The resulting video trailer tends to retain much empty space in the background of the video frames while discarding much informative video content due to size limit. In this paper we propose a novel space-time video summarization method which we call space-time video montage. The method simultaneously analyzes both the spatial and temporal information distribution in a video sequence, and extracts the visually informative space-time portions of the input videos. The informative video portions are represented in volumetric layers. The layers are then packed together in a small output video volume such that the total amount of visual information in the video volume is maximized. To achieve the packing process, we develop a new algorithm based upon the first-fit and graph cut optimization techniques. Since our method is able to cut off spatially and temporally less informative portions, it is able to generate much more compact yet highly informative output videos. The effectiveness of our method is validated by extensive experiments over a wide variety of videos.
TL;DR: In this paper, the singular value decomposition (SVD) is used for video segmentation, classification, and summarization, together with a metric that measures the amount of information contained in each video shot of the input video sequence.
Abstract: In a technique for video segmentation, classification and summarization based on the singular value decomposition, frames of the input video sequence are represented by vectors composed of concatenated histograms descriptive of the spatial distributions of colors within the video frames. The singular value decomposition maps these vectors into a refined feature space. In the refined feature space produced by the singular value decomposition, the invention uses a metric to measure the amount of information contained in each video shot of the input video sequence. The most static video shot is defined as an information unit, and the content value computed from this shot is used as a threshold to cluster the remaining frames. The clustered frames are displayed using a set of static keyframes or a summary video sequence. The video segmentation technique relies on the distance between the frames in the refined feature space to calculate the similarity between frames in the input video sequence. The input video sequence is segmented based on the values of the calculated similarities. Finally, average video attribute values in each segment are used in classifying the segments.
TL;DR: Experimental results show that the proposed steganographic algorithm causes little degradation of visual quality, offers larger embedding capacity, and resists video processing such as frame adding or frame dropping.
Abstract: In this paper, a steganographic algorithm for MPEG compressed video streams is proposed. In each GOP, the control information that facilitates data extraction is embedded in the I frame; in the P and B frames, the transmitted data are repeatedly embedded in the motion vectors of macroblocks that move faster, in order to resist video processing. Data extraction is also performed on the compressed video stream without requiring the original video. On a GOP-by-GOP basis, the control information in the I frame is extracted first; the embedded data in the P and B frames can then be extracted based on that control information. Experimental results show that the proposed algorithm causes little degradation of visual quality, offers larger embedding capacity, and resists video processing such as frame adding or frame dropping.
TL;DR: This paper presents a high-performance sum of absolute difference (SAD) architecture for motion estimation, which is the most time-consuming and compute-intensive part of video coding, and outperforms contemporary architectures in terms of execution speed and area efficiency.
Abstract: This paper presents a high-performance sum of absolute difference (SAD) architecture for motion estimation, which is the most time-consuming and compute-intensive part of video coding. The proposed architecture contains novel and efficient optimizations to overcome bottlenecks discovered in existing approaches. In addition, sophisticated control logic with multiple early termination mechanisms further enhances execution speed and makes the architecture suitable for general-purpose usage. Hence, the proposed architecture is not restricted to a single block-matching algorithm in motion estimation; a wide range of algorithms is supported. The proposed SAD architecture outperforms contemporary architectures in terms of execution speed and area efficiency. The proposed architecture with three pipeline stages, synthesized to a 0.18-μm CMOS technology, can attain 770-MHz operating frequency at a cost of less than 5600 gates. Correspondingly, performance metrics for the proposed low-latency 2-stage architecture are 730 MHz and 7500 gates.
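To make the early termination idea concrete, here is a small software sketch of a SAD computation that stops as soon as the partial sum can no longer beat the best candidate found so far. It illustrates the general principle only; the paper describes a hardware architecture, and the row-by-row check and the `None` return convention are assumptions made for this example.

```python
def sad_early_exit(block, candidate, best_so_far):
    """Sum of absolute differences with row-wise early termination.

    Returns the SAD if it is strictly below best_so_far, otherwise None.
    """
    total = 0
    for row_b, row_c in zip(block, candidate):
        for b, c in zip(row_b, row_c):
            total += abs(b - c)
        # The SAD can only grow, so this candidate is already beaten.
        if total >= best_so_far:
            return None
    return total
```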
TL;DR: It is shown that the algorithm effectively reduces the computations of MRF-ME, and achieves similar coding gain compared to the motion search approaches in the reference software.
Abstract: Multiple reference frame motion compensation is a new feature introduced in H.264/MPEG-4 AVC to improve video coding performance. However, the computational cost of multiple reference frame motion estimation (MRF-ME) is very high. In this paper, we propose an algorithm that takes into account the correlation/continuity of motion vectors among different reference frames. We show that the algorithm effectively reduces the computations of MRF-ME, and achieves similar coding gain compared to the motion search approaches in the reference software.
TL;DR: This paper presents a new approach for video completion using motion field transfer, which fills in missing video parts by sampling spatio-temporal patches of local motion instead of directly sampling color.
Abstract: Existing methods for video completion typically rely on periodic color transitions, layer extraction, or temporally local motion. However, periodicity may be imperceptible or absent, layer extraction is difficult, and temporally local motion cannot handle large holes. This paper presents a new approach for video completion using motion field transfer to avoid such problems. Unlike prior methods, we fill in missing video parts by sampling spatio-temporal patches of local motion instead of directly sampling color. Once the local motion field has been computed within the missing parts of the video, color can then be propagated to produce a seamless hole-free video. We have validated our method on many videos spanning a variety of scenes. We can also use the same approach to perform frame interpolation using motion fields from different videos.
TL;DR: In this paper, a method and system for counting moving objects in a digital video stream is presented, where areas of motion are determined by threshold subtracting a current video frame from a short-term average video scene.
Abstract: A method and system are provided for counting moving objects in a digital video stream. In contrast to known computationally-expensive methods, areas of motion are determined by threshold subtracting a current video frame from a short term average video scene. An object box surrounding an object is determined by threshold subtracting the current video frame from a long term average video scene. Coordinates of the moving object are identified by associating the area of motion with the object box, if it overlaps the area of motion, to define a moving object box. An event counter can be incremented when the moving object box is in a buffer zone in the current frame, and was in a detection zone in an earlier frame, and was initially detected in a buffer zone on the opposite side of the detection zone.
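Threshold subtraction against an average scene can be sketched as below. The exponential running average and its `alpha` weight are assumptions made for illustration (a larger `alpha` would behave like a short-term average, a smaller one like a long-term average); the abstract does not say how the averages are maintained. Frames are modeled as flat lists of pixel intensities.

```python
def update_average(avg, frame, alpha):
    """Exponential running average of the scene (assumed update rule)."""
    return [(1 - alpha) * a + alpha * f for a, f in zip(avg, frame)]


def motion_mask(frame, avg, threshold):
    """True where the current frame differs from the average by more than threshold."""
    return [abs(f - a) > threshold for f, a in zip(frame, avg)]
```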
TL;DR: An automatic video enhancement system and method that uses frame-to-frame motion estimation, including global alignment transforms and optic flow vectors, as the basis for enhancing video.
Abstract: An automatic video enhancement system and method for automatically enhancing video. The automated video enhancement method uses frame-to-frame motion estimation as the basis of the video enhancement. Motion estimation includes the computation of global motion (such as camera motion) and the computation of local motion (such as pixel motion). The automated video enhancement method includes generating global alignment transforms, generating optic flow vectors, and using these global alignment transforms and optic flow vectors to enhance the video. The invention also includes video processing and enhancement techniques that use the frame-to-frame motion estimation. These techniques include a deinterlace process, a denoise process, and a warp stabilization process that performs both damped and locked stabilization.
TL;DR: In this paper, a dual-buffer based estimation of a frame budget that defines a number of encoding bits available for a frame of the video is used to control the source video encoding rate.
Abstract: The disclosure relates to techniques for video source rate control for video telephony (VT) applications. The source video encoding rate may be controlled using a dual-buffer based estimation of a frame budget that defines a number of encoding bits available for a frame of the video. The dual-buffer based estimation technique may track the fullness of a physical video buffer and the fullness of a virtual video buffer. The source video encoding rate is then controlled based on the resulting frame budget. The contents of the virtual buffer depend on constraints imposed by a target encoding rate, while the contents of the physical buffer depend on constraints imposed by varying channel conditions. Consideration of physical video buffer fullness permits the video source rate control technique to be channel-adaptive. Consideration of virtual video buffer fullness permits the video source rate control technique to avoid encoding excessive video that could overwhelm the channel.
TL;DR: An effective data-hiding scheme that embeds data in digital videos using the phase angle of the motion vector of the macroblock in the inter-frame and can be applied to either compressed or uncompressed videos is proposed.
Abstract: Many schemes have been proposed for embedding data into digital video. However, most of them extend data-hiding techniques for still images to video by treating each frame as a still image and embedding data within the frame (intra-frame). In this paper, we propose an effective data-hiding scheme that embeds data in digital videos using the phase angle of the motion vector of the macroblock in the inter-frame. The scheme can be applied to either compressed or uncompressed videos. Furthermore, the embedded data can be extracted directly without using the original video sequences. Our experimental results demonstrate the feasibility of the proposed method.
TL;DR: In this article, a low-complexity automatic region-of-interest (ROI) detection method for video frames of video sequences is proposed, based either on characteristics of the video sensor or on motion information for a video frame and a different video frame of the video sequence.
Abstract: The disclosure is directed to techniques for region-of-interest (ROI) video processing based on low-complexity automatic ROI detection within video frames of video sequences. The low-complexity automatic ROI detection may be based on characteristics of video sensors within video communication devices. In other cases, the low-complexity automatic ROI detection may be based on motion information for a video frame and a different video frame of the video sequence. The disclosed techniques include a video processing technique capable of tuning and enhancing video sensor calibration, camera processing, ROI detection, and ROI video processing within a video communication device based on characteristics of a specific video sensor. The disclosed techniques also include a sensor-based ROI detection technique that uses video sensor statistics and camera processing side-information to improve ROI detection accuracy. The disclosed techniques also include a motion-based ROI detection technique that uses motion information obtained during motion estimation in video processing.
TL;DR: A robust real-time video stabilization algorithm that alleviates the undesirable jitter motions from the unstable video to produce a stabilized video is proposed.
TL;DR: In this article, a methodology of rate control for video encoding is provided, which is implementable by means of a method, a device, a computer program and/or a video encoder.
Abstract: In general, a methodology of rate control for video encoding is provided, which is implementable by means of a method, a device, a computer program and/or a video encoder. A frame encoding process is performed for each frame, in which an initial quantization parameter is calculated and used as the quantization parameter for encoding the current frame. The groups of macroblocks within the current frame are encoded group by group, i.e. group-wise. A score value is determined after macroblock encoding of the current group of macroblocks. If the score value exceeds a pre-defined threshold, the quantization parameter for encoding the next group of macroblocks is adjusted; otherwise, macroblock encoding continues with the quantization parameter currently used for the current group of macroblocks.
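The group-wise control loop can be sketched as follows. Several details are assumptions for illustration only: the `encode_group` callback that returns the score, the choice to raise the quantization parameter (rather than lower it) when the threshold is exceeded, the step size, and the upper bound of 51, which is borrowed from H.264 and is not stated in the abstract.

```python
def encode_frame(groups, initial_qp, threshold, encode_group, qp_step=1, qp_max=51):
    """Encode macroblock groups in order, adjusting QP between groups.

    encode_group(group, qp) is a hypothetical callback that encodes one
    group and returns its score. Returns the QP used for each group.
    """
    qp = initial_qp
    used_qps = []
    for group in groups:
        used_qps.append(qp)
        score = encode_group(group, qp)
        if score > threshold:
            # Assumed adjustment direction: coarser quantization.
            qp = min(qp + qp_step, qp_max)
    return used_qps
```

In this sketch the adjustment only affects the next group, so the group that triggered the change is never re-encoded.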
TL;DR: A fast algorithm which detects the predominant edge orientations within a block in order to pre-select candidate wedge lines is proposed and a comparison among macroblock partition methods is performed, which points to the higher performance of the wedge partition method.
Abstract: In the H.264/AVC video coding standard, motion compensation can be performed by partitioning macroblocks into square or rectangular sub-macroblocks in a quadtree decomposition. This paper studies a motion compensation method using wedges, i.e. partitioning macroblocks or sub-macroblocks into two regions by an arbitrary line segment. This technique allows the shapes of the divided regions to better match the boundaries between moving objects. However, there are a large number of ways to slice a block and searching exhaustively over all of them would be an extremely computer-intensive task. Thus, we propose a fast algorithm which detects the predominant edge orientations within a block in order to pre-select candidate wedge lines. Finally a comparison among macroblock partition methods is performed, which points to the higher performance of the wedge partition method.
TL;DR: The algorithm is based on the k-medoid clustering algorithms to find the best representative frame for each video shot without visual redundancy, and thus it is an effective tool for video indexing and retrieval.
Abstract: In this paper, we propose a video summarization algorithm by multiple extractions of key frames in each shot. This algorithm is based on the k-medoid clustering algorithms to find the best representative frame for each video shot. This algorithm, which is applicable to all types of descriptors, consists of extracting key frames by similarity clustering according to the given index. In our proposal, the distance between frames is calculated using a fast full search block matching algorithm based on the frequency domain. The proposed approach is computationally tractable and robust with respect to sudden changes in mean intensity within a shot. Additionally, this approach produces different key frames even in the presence of large motion. The experimental results show that our algorithm extracts multiple representative frames in each video shot without visual redundancy, and thus it is an effective tool for video indexing and retrieval.
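A basic k-medoid loop over frame indices might look like the sketch below. It is a generic alternating assignment and update scheme, not the authors' implementation: the naive initialization with the first `k` frames, the iteration cap, and the pluggable `dist` function (standing in for the block-matching distance mentioned above) are assumptions.

```python
def medoid(indices, dist):
    """Member with the smallest total distance to the other members."""
    return min(indices, key=lambda i: sum(dist(i, j) for j in indices))


def k_medoids(n, k, dist, iterations=10):
    """Cluster frame indices 0..n-1 and return the sorted medoid indices."""
    medoids = list(range(k))  # naive initialization (assumption)
    for _ in range(iterations):
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = min(medoids, key=lambda m: dist(i, m))
            clusters[nearest].append(i)
        # Keep the old medoid if its cluster ended up empty.
        updated = [
            medoid(members, dist) if members else m
            for m, members in clusters.items()
        ]
        if sorted(updated) == sorted(medoids):
            break
        medoids = updated
    return sorted(medoids)
```

Because medoids are always actual frames, the result can be used directly as the key frame indices of the shot.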
TL;DR: In this paper, the authors define an efficient, new method of searching only a very sparse subset of possible displacement positions (or motion vectors) among all possible ones, to see if we can get a good enough match, and terminate early.
Abstract: Motion estimation is the science of predicting the current frame in a video sequence from the past frame (or frames), by slicing it into rectangular blocks of pixels, and matching these to past such blocks. The displacement in the spatial position of the block in the current frame with respect to the past frame is called the motion vector. This method of temporally decorrelating the video sequence by finding the best matching blocks from past reference frames (motion estimation) makes up about 80% or more of the computation in a video encoder. That is, it is enormously expensive, and methods that do so efficiently are in high demand. Thus the field of motion estimation within video coding is rich in the breadth and diversity of approaches that have been put forward. Yet it is often the simplest methods that are the most effective. So it is in this case. While it is well-known that a full search over all possible positions within a fixed window is an optimal method in terms of performance, it is generally prohibitive in computation. In this patent disclosure, we define an efficient, new method of searching only a very sparse subset of possible displacement positions (or motion vectors) among all possible ones, to see if we can get a good enough match, and terminate early. This sparse subset of motion vectors is preselected, using a priori knowledge and extensive testing on video sequences, so that these “predictors” for the motion vector are essentially magic. The art of this method is the preselection of excellent sparse subsets of vectors, the smart thresholds for acceptance or rejection, and even the order of testing prior to decision.
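The sparse search with early termination can be sketched as follows. The predictor list, the SAD matching cost, and the single acceptance threshold are placeholders chosen for illustration; the disclosure's actual predictor sets, thresholds, and test order are not given in the abstract.

```python
def block_sad(cur, ref, bx, by, dx, dy, size):
    """SAD between the current block and the reference block displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(size)
        for x in range(size)
    )


def predictor_search(cur, ref, bx, by, size, predictors, accept_threshold):
    """Test candidate motion vectors in order; stop at the first good enough match."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dx, dy in predictors:
        # Skip candidates that would read outside the reference frame.
        if not (0 <= bx + dx and bx + dx + size <= width):
            continue
        if not (0 <= by + dy and by + dy + size <= height):
            continue
        cost = block_sad(cur, ref, bx, by, dx, dy, size)
        if best_cost is None or cost < best_cost:
            best_mv, best_cost = (dx, dy), cost
        if cost <= accept_threshold:
            break  # good enough: terminate early
    return best_mv, best_cost
```

The order of `predictors` matters here: the earlier a good candidate appears, the fewer block comparisons are performed.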
TL;DR: A fast inter mode decision algorithm for the H.264 encoder that efficiently determines a suitable block mode according to the motion field distribution and correlation within a macroblock, with rate-distortion performance comparable to that of full mode search.
Abstract: Variable block size motion compensation has been adopted by the emerging video coding standard H.264. It can represent the motion characteristic in a macroblock more accurately and, therefore, reduces the prediction error to achieve high compression gains. On the other hand, it causes high computational complexity in motion estimation at the encoder. The motion estimation exhaustively performed over all modes to find the best mode for inter coding is slow and computationally involved. In order to reduce the complexity, we propose a fast inter mode decision algorithm for the H.264 encoder. The proposed method efficiently determines a suitable block mode according to the motion field distribution and correlation within a macroblock. The experimental results show that the proposed method considerably reduces encoder complexity, while the rate-distortion performance of the proposed method is comparable to that of full mode search.
TL;DR: In this article, a method for compensating for perceived blur due to motion between a current frame and a previous frame of a digital video sequence comprises estimating a motion vector between the frames for each of a plurality of pixel blocks in the current and previous frames.
Abstract: A method for compensating for perceived blur due to motion between a current frame and a previous frame of a digital video sequence comprises estimating a motion vector between the frames for each of a plurality of pixel blocks in the current and previous frames. A cluster motion vector is then estimated for each of a plurality of clusters of the motion vectors based on one of vectors in each cluster and motion vectors in proximate clusters. The cluster motion vector of its corresponding cluster is allocated to each pixel in the current frame. An initial guess frame is generated based on the current frame and pixels in the guess frame are blurred as a function of their respective allocated cluster motion vectors. Each blurred pixel is compared with a respective pixel in the current frame to generate an error pixel for each respective pixel. Each error pixel is blurred and weighted and then each error pixel and its respective pixel is combined in the initial guess frame thereby to update the guess frame and compensate for blur. A system and computer program for perceived blur compensation is also provided.
TL;DR: Experiments show that the proposed method significantly outperforms the conventional peak signal-to-noise ratio (PSNR) and was included in international recommendations for objective video quality measurement.
Abstract: We propose a new method for an objective measurement of video quality. By analyzing subjective scores of various video sequences, we find that the human visual system is particularly sensitive to degradation around edges. In other words, when edge areas of a video sequence are degraded, evaluators tend to give low quality scores to the video, even though the overall mean squared error is not large. Based on this observation, we propose an objective video quality measurement method based on degradation around edges. In the proposed method, we first apply an edge detection algorithm to videos and locate edge areas. Then, we measure degradation of those edge areas by computing mean squared errors and use it as a video quality metric after some postprocessing. Experiments show that the proposed method significantly outperforms the conventional peak signal-to-noise ratio (PSNR). This method was also evaluated by independent laboratory groups in the Video Quality Experts Group (VQEG) Phase 2 test. The method consistently provided good performance. As a result, the method was included in international recommendations for objective video quality measurement.
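The edge-restricted error measure can be illustrated with the sketch below. The forward-difference gradient used as the edge detector and its threshold are stand-ins chosen for brevity, and the postprocessing step mentioned in the abstract is omitted; this is not the metric adopted in the recommendations.

```python
def edge_mask(image, threshold):
    """Mark pixels whose forward-difference gradient magnitude exceeds threshold."""
    h, w = len(image), len(image[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = image[y][min(x + 1, w - 1)] - image[y][x]
            gy = image[min(y + 1, h - 1)][x] - image[y][x]
            mask[y][x] = abs(gx) + abs(gy) > threshold
    return mask


def edge_mse(reference, degraded, threshold):
    """Mean squared error computed only over edge pixels of the reference."""
    mask = edge_mask(reference, threshold)
    errors = [
        (reference[y][x] - degraded[y][x]) ** 2
        for y in range(len(reference))
        for x in range(len(reference[0]))
        if mask[y][x]
    ]
    return sum(errors) / len(errors) if errors else 0.0
```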
TL;DR: In this article, a method and apparatus for video encoding/decoding are provided that improve compression efficiency by generating a prediction block using an intra-inter hybrid predictor.
Abstract: A method and apparatus for video encoding/decoding are provided to improve compression efficiency by generating a prediction block using an intra-inter hybrid predictor. A video encoding method includes dividing an input video into a plurality of blocks, forming a first predictor for an edge region of a current block to be encoded among the divided blocks through intraprediction, forming a second predictor for the remaining region of the current block through interprediction, and forming a prediction block of the current block by combining the first predictor and the second predictor.
TL;DR: In this article, a flag bit is coded to indicate which predictive motion vector is chosen only if it is not possible to infer the layer from which the predicted motion vector for the current block comes, such as when both predictive motion vectors are substantially the same, or only one of the vectors is reliable or available.
Abstract: In scalable video coding, two predictive motion vectors are calculated: one from the current layer neighboring motion vectors and one from the co-located base layer motion vectors. One of the two predictive motion vectors is chosen as the predictive motion vector for the current block. A flag bit is coded to indicate which predictive motion vector is chosen only if it is not possible to infer the layer from which the predictive motion vector for the current block comes. Such inference is possible in many situations, such as when both predictive motion vectors are substantially the same, or only one of the vectors is reliable or available.
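The decision of whether a flag bit must be coded can be sketched as a small predicate. Representing an unreliable or unavailable predictor as `None` and interpreting "substantially the same" as a per-component `tolerance` are assumptions made for this example.

```python
def needs_flag(mv_current_layer, mv_base_layer, tolerance=0):
    """Return True only when the decoder cannot infer which predictor was chosen."""
    # Only one predictor is usable: the choice is implied, no flag needed.
    if mv_current_layer is None or mv_base_layer is None:
        return False
    dx = abs(mv_current_layer[0] - mv_base_layer[0])
    dy = abs(mv_current_layer[1] - mv_base_layer[1])
    # Predictors are substantially the same: either choice gives the same result.
    return max(dx, dy) > tolerance
```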
TL;DR: A loss-aware rate-distortion optimized macroblock mode decision algorithm for scalable video coding, wherein more macroblock coding modes than intra and inter are involved, which has been adopted into the joint scalable video model by the joint video team.
Abstract: Error resilient macroblock mode decision has been extensively investigated in the literature for single-layer video coding, for which error resilient mode decision is also called intra refresh. In this paper, we present a loss-aware rate-distortion optimized macroblock mode decision algorithm for scalable video coding, wherein more macroblock coding modes than intra and inter are involved. Thanks to its good performance, the proposed method has been adopted into the joint scalable video model by the joint video team.
TL;DR: In this article, an Encoder Assisted Frame Rate Up Conversion (EA-FRUC) system was proposed to improve the modeling of moving objects, compression efficiency and reconstructed video quality.
Abstract: An Encoder Assisted Frame Rate Up Conversion (EA-FRUC) system that utilizes various motion models, such as affine models, in addition to video coding and pre-processing operations at the video encoder to exploit the FRUC processing that will occur in the decoder in order to improve the modeling of moving objects, compression efficiency and reconstructed video quality. Furthermore, objects are identified in a way that reduces the amount of information necessary for encoding to render the objects on the decoder device.
TL;DR: In this paper, a video encoder selects a start layer for motion estimation from among multiple available start layers, each of which represents a reference video picture at a different spatial resolution.
Abstract: Techniques and tools for adaptive, unit co-location-based motion estimation are described. For example, in a layered block matching framework, a video encoder selects a start layer for motion estimation from among multiple available start layers. Each of the available start layers represents a reference video picture at a different spatial resolution. For a current macroblock in a current video picture, the encoder performs motion estimation relative to the reference video picture starting at the selected start layer. Or, a video encoder computes a contextual similarity metric for a current macroblock. The contextual similarity metric is based at least in part upon a texture measure for the current macroblock and a texture measure for one or more neighboring macroblocks. For the current macroblock, the motion estimation changes depending on the contextual similarity metric for the current macroblock.
TL;DR: A frame rate up-conversion (FRUC) algorithm that increases the temporal resolution of video sequences at the decoder side by segmenting a frame into several objects and motion-compensating each object with a perspective transform, while the translational block matching algorithm (BMA) is applied to the background.
Abstract: In this paper, we propose a frame rate up-conversion (FRUC) algorithm to increase the temporal resolution of video sequences at the decoder side. First, the proposed algorithm segments a frame into several objects. Then, perspective transforms are used to motion-compensate each object, while the translational block matching algorithm (BMA) is applied to the background. Furthermore, the overlapped block motion compensation (OBMC) technique is used to reduce blocking artifacts in boundary blocks. Experimental results show that the proposed algorithm provides better performance than the conventional approach.
TL;DR: The proposed algorithms can obtain an average speed up ratio of four for encoding, thus benefiting from the prediction of the motion vector for the reference frames in advance and maintaining good performance.
Abstract: The MPEG-4/AVC/H.264 video coding standard adopts various coding schemes such as multiple reference frames and variable block sizes for motion estimation. Hence, MPEG-4/AVC/H.264 provides gains in compression efficiency of up to 50% over a wide range of bit rates and video resolutions compared to previous standards. However, these features result in a considerable increase in encoder complexity, mainly in mode decision and motion estimation. The proposed algorithms use the stored motion vectors to compose the motion vector without performing the full search in each reference frame. Therefore, the proposed algorithms can obtain an average speed up ratio of four for encoding, thus benefiting from the prediction of the motion vector for the reference frames in advance and maintaining good performance. Any fast search algorithm can be utilized to further reduce the computational load.