TL;DR: A generic framework of video summarization based on the modeling of viewer's attention is presented, which takes advantage of computational attention models and eliminates the need for complex heuristic rules in video summarization.
Abstract: Automatic generation of video summarization is one of the key techniques in video management and browsing. In this paper, we present a generic framework of video summarization based on the modeling of viewer's attention. Without full semantic understanding of video content, this framework takes advantage of computational attention models and eliminates the need for complex heuristic rules in video summarization. A set of methods for audio-visual attention model features is proposed. The experimental evaluations indicate that the computational attention based approach is an effective alternative to video semantic analysis for video summarization.
TL;DR: A novel algorithm for segmentation of moving objects in video sequences and extraction of video object planes (VOPs) based on connected components analysis and smoothness of VO displacement in successive frames is proposed.
Abstract: The new video-coding standard MPEG-4 enables content-based functionality, as well as high coding efficiency, by taking into account shape information of moving objects. A novel algorithm for segmentation of moving objects in video sequences and extraction of video object planes (VOPs) is proposed. For the case of multiple video objects in a scene, the extraction of a specific single video object (VO) based on connected components analysis and smoothness of VO displacement in successive frames is also discussed. Our algorithm begins with a robust double-edge map derived from the difference between two successive frames. After removing edge points which belong to the previous frame, the remaining edge map, moving edge (ME), is used to extract the VOP. The proposed algorithm is evaluated on an indoor sequence captured by a low-end camera as well as MPEG-4 test sequences and produces promising results.
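The moving-edge step described in this abstract can be illustrated with a minimal sketch. It assumes edge maps are already available as sets of (row, col) points; the function name `moving_edges` and the `tolerance` neighbourhood are hypothetical choices for illustration, not details taken from the paper.

```python
# Hypothetical sketch: edge maps are plain sets of (row, col) points.

def moving_edges(diff_edges, prev_edges, tolerance=1):
    """Keep difference-image edge points that are not within `tolerance`
    pixels (Chebyshev distance) of an edge point of the previous frame."""
    def near_prev(point):
        r, c = point
        return any(
            (r + dr, c + dc) in prev_edges
            for dr in range(-tolerance, tolerance + 1)
            for dc in range(-tolerance, tolerance + 1)
        )
    return {p for p in diff_edges if not near_prev(p)}

# Edges of the difference between frames n-1 and n: the object's old
# outline and its new outline both appear (a "double-edge" map).
diff_edges = {(2, 2), (2, 3), (2, 6), (2, 7)}
# Edges of frame n-1: only the old outline.
prev_edges = {(2, 2), (2, 3)}

me = moving_edges(diff_edges, prev_edges)
print(sorted(me))
```

What remains after the removal is the outline at the object's current position, which is the part a VOP extractor would work from.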
TL;DR: In this paper, a video-based security system is described, where the video inputs are each configured to receive an electronic video signal from a video camera, and the processor operates on a digital representation of the electronic video signals from the video inputs.
Abstract: A computer especially suitable for use as a video-based security system includes video inputs, a processor and a network connection. The video inputs are each configured to receive an electronic video signal from a video camera. The processor operates on a digital representation of the electronic video signals from the video inputs. When the computer detects motion in the electronic video signals it generates a compressed representation of the video signal that includes the motion. The compressed representation is transmitted through the network connection.
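The detect-then-compress flow in this abstract might look roughly like the following sketch. Frames are flat lists of 8-bit grey values, motion is detected by simple frame differencing, and `zlib` stands in for a real video codec; the thresholds and function names are assumptions for illustration only.

```python
import zlib

def motion_detected(prev, curr, pixel_delta=25, min_changed=3):
    """Report motion when enough pixels change by more than pixel_delta."""
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > pixel_delta)
    return changed >= min_changed

def packets_for_network(frames):
    """Return (frame index, compressed bytes) for frames that show motion
    relative to their predecessor. zlib is a stand-in for a video codec."""
    packets = []
    for i in range(1, len(frames)):
        if motion_detected(frames[i - 1], frames[i]):
            packets.append((i, zlib.compress(bytes(frames[i]))))
    return packets

static = [10] * 64                 # unchanged scene
moved = [10] * 56 + [200] * 8      # eight pixels brighten sharply
packets = packets_for_network([static, static, moved])
print([index for index, _ in packets])
```

Only the frame containing motion produces a packet, so a quiet scene generates no network traffic in this sketch.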
TL;DR: In this paper, the concept of video fingerprinting is presented as a tool for persistent video identification, together with a technique for extracting essential perceptual features from moving image sequences and for identifying any sufficiently long unknown video segment by efficiently matching the fingerprint of the short segment with a large database of pre-computed fingerprints.
Abstract: This paper presents the concept of video fingerprinting as a tool for video identification. As such, video fingerprinting is an important tool for persistent identification as proposed in MPEG-21. Applications range from video monitoring on broadcast channels to filtering on peer-to-peer networks to meta-data restoration in large digital libraries. We present considerations and a technique for (i) extracting essential perceptual features from moving image sequences and (ii) for identifying any sufficiently long unknown video segment by efficiently matching the fingerprint of the short segment with a large database of pre-computed fingerprints.
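As a rough illustration of the two steps (i) and (ii), the sketch below derives one bit per frame transition from mean luminance and matches a query by Hamming distance. Both the feature and the exhaustive search are assumptions chosen for brevity; the abstract does not specify the actual features, and a large database would need an index rather than a linear scan.

```python
def fingerprint(frame_means):
    """One bit per frame transition: 1 if mean luminance rises, else 0."""
    return [1 if b > a else 0 for a, b in zip(frame_means, frame_means[1:])]

def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def best_match(query_bits, database):
    """Return (title, offset, distance) of the best alignment found by
    sliding the query over every stored fingerprint."""
    best = None
    for title, bits in database.items():
        for offset in range(len(bits) - len(query_bits) + 1):
            d = hamming(query_bits, bits[offset:offset + len(query_bits)])
            if best is None or d < best[2]:
                best = (title, offset, d)
    return best

database = {
    "clip_a": fingerprint([10, 12, 11, 15, 14, 13, 18, 20, 19]),
    "clip_b": fingerprint([30, 29, 28, 27, 31, 35, 34, 33, 32]),
}
# Frames 3..7 of clip_b, uniformly brightened by 2 grey levels.
query = fingerprint([29, 33, 37, 36, 35])
match = best_match(query, database)
print(match)
```

Because the bits depend only on whether luminance rises or falls, the uniform brightness change in the query does not affect the match.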
TL;DR: In this paper, an integrated, fully automated video production system that provides a video director with total control over all of the video production devices used in producing a show is presented.
Abstract: An integrated, fully automated video production system that provides a video director with total control over all of the video production devices used in producing a show. Such devices include, but are not limited to, cameras, robotic pan/tilt heads, video tape players and recorders (VTRs), video servers and virtual recorders, character generators, still stores, digital video disk players (DVDs), audio mixers, digital video effects (DVE), video switchers, and teleprompting systems. The video production system provides an automation capability that allows the video director to pre-produce a show, review the show in advance of “air time,” and then, with a touch of a button, produce the live show. In one embodiment, the invention provides a video production system having a processing unit in communication with one or more of the video production devices mentioned above. The processing unit displays on a monitor graphical controls for controlling the variety of video production devices that it is in communication with. A video director uses a keyboard and mouse that are interfaced with the processing unit to activate the graphical controls, and thereby remotely control the video production devices from one location. The processing unit also enables the video director to automate the production of a show. According to one embodiment, the video director pre-produces the show, defines a set of video production commands or instructions (hereafter “transition macro”) to be executed by the processing unit, and then, by activating a control button displayed by the processing unit, the video director instructs the processing unit to execute the transition macro. Each video production command in a transition macro directs the processing unit to transmit in series and/or parallel one or more control commands to one or more of the video production devices when required.
TL;DR: A new algorithm, called SoftPOSIT, for determining the pose of a 3D object from a single 2D image in the case that correspondences between model points and image points are unknown, which has a run-time complexity that is better than previous methods by a factor equal to the number of image points.
Abstract: The problem of pose estimation arises in many areas of computer vision, including object recognition, object tracking, site inspection and updating, and autonomous navigation using scene models. We present a new algorithm, called SoftPOSIT, for determining the pose of a 3D object from a single 2D image in the case that correspondences between model points and image points are unknown. The algorithm combines Gold's iterative SoftAssign algorithm [19, 20] for computing correspondences and DeMenthon's iterative POSIT algorithm [13] for computing object pose under a full-perspective camera model. Our algorithm, unlike most previous algorithms for this problem, does not have to hypothesize small sets of matches and then verify the remaining image points. Instead, all possible matches are treated identically throughout the search for an optimal pose. The performance of the algorithm is extensively evaluated in Monte Carlo simulations on synthetic data under a variety of levels of clutter, occlusion, and image noise. These tests show that the algorithm performs well in a variety of difficult scenarios, and empirical evidence suggests that the algorithm has a run-time complexity that is better than previous methods by a factor equal to the number of image points. The algorithm is being applied to the practical problem of autonomous vehicle navigation in a city through registration of 3D architectural models of buildings to images obtained from an on-board camera.
TL;DR: In this article, a vehicle video switcher receives inputs from multiple video cameras on a vehicle and independently routes any of these video inputs to multiple video monitors in the vehicle; the driver may select the default view for the front monitor by pressing a button or by using a hand-held remote control.
Abstract: A vehicle video switcher receives inputs from multiple video cameras on a vehicle, and independently routes any of these video inputs to multiple video monitors in the vehicle. A front video monitor is within the view of the driver of the vehicle. When the driver activates the left turn signal, the view from the left side video camera is displayed on the front video monitor. When the driver activates the right turn signal, the view from the right side video camera is displayed on the front video monitor. When the driver puts the vehicle in reverse, thereby activating the backup light, the view from the rear video camera is displayed on the front video monitor. The driver may select the default view for the front video monitor by pressing on a button or by using a hand-held remote control. Graphical view indicators superimposed on the displayed image indicate which view is currently being displayed. A remote video monitor is also provided in the vehicle in a different location, such as in the rear living area of a recreational vehicle. A passenger can direct different video camera views to be displayed on the remote video monitor using a user interface, such as push-buttons on the monitor or handheld remote control, independent of the video image displayed on the front video monitor. In addition, the default view for the remote video monitor may be selected independently from the default view of the front video monitor using a user interface.
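The routing rules in this abstract amount to a small piece of selection logic, sketched below. The class and view names are hypothetical, and the priority order when several signals are active at once (reverse, then left, then right) is an assumption, since the abstract does not say how conflicts are resolved.

```python
class VideoSwitcher:
    """Routes camera views to a front monitor and a remote monitor."""

    def __init__(self, front_default="rear", remote_default="rear"):
        self.front_default = front_default   # driver-selected default view
        self.remote_view = remote_default    # passenger-selected view

    def front_view(self, left_signal=False, right_signal=False, in_reverse=False):
        # Assumed priority: reverse gear, then left signal, then right signal.
        if in_reverse:
            return "rear"
        if left_signal:
            return "left"
        if right_signal:
            return "right"
        return self.front_default

    def select_remote(self, view):
        # The remote monitor is set independently of the front monitor.
        self.remote_view = view

switcher = VideoSwitcher(front_default="right")
switcher.select_remote("left")
print(switcher.front_view(left_signal=True))   # left camera while signalling left
print(switcher.front_view())                   # falls back to the default view
print(switcher.remote_view)                    # unaffected by the driver's signals
```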
TL;DR: A set of methods for multi view image tracking using a set of calibrated cameras is presented and shown to be robust in resolving dynamic and static object occlusions and in tracking objects between overlapping and non-overlapping camera views.
Abstract: The paper presents a set of methods for multi view image tracking using a set of calibrated cameras. We demonstrate how effective the approach is for resolving occlusions and tracking objects between overlapping and non-overlapping camera views. Moving objects are initially detected using background subtraction. Temporal alignment is then performed between each video sequence in order to compensate for the different processing rates of each camera. A Kalman filter is used to track each object in 3D world coordinates and 2D image coordinates. Information is shared between the 2D/3D trackers of each camera view in order to improve the performance of object tracking and trajectory prediction. The system is shown to be robust in resolving dynamic and static object occlusions. Results are presented from a variety of outdoor surveillance video sequences.
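To make the Kalman filtering step concrete, here is a minimal one-dimensional constant-velocity filter written out by hand. It is only a sketch: the paper tracks objects in 3D world and 2D image coordinates, whereas this example tracks a single coordinate, and the noise parameters `q` and `r` are arbitrary assumed values.

```python
def kalman_step(state, cov, z, dt=1.0, q=1e-3, r=1.0):
    """One predict/update cycle for a state (position, velocity) with a
    position-only measurement z. q and r are assumed noise variances."""
    pos, vel = state
    (p00, p01), (p10, p11) = cov

    # Predict: x' = F x, P' = F P F^T + Q, with F = [[1, dt], [0, 1]].
    pos_p = pos + dt * vel
    vel_p = vel
    a00 = p00 + dt * (p10 + p01) + dt * dt * p11 + q
    a01 = p01 + dt * p11
    a10 = p10 + dt * p11
    a11 = p11 + q

    # Update: measurement matrix H = [1, 0].
    innovation = z - pos_p
    s = a00 + r
    k0, k1 = a00 / s, a10 / s
    new_state = (pos_p + k0 * innovation, vel_p + k1 * innovation)
    new_cov = (
        ((1 - k0) * a00, (1 - k0) * a01),
        (a10 - k1 * a00, a11 - k1 * a01),
    )
    return new_state, new_cov

# An object moving at 2 units per frame, observed without noise.
state = (0.0, 0.0)
cov = ((100.0, 0.0), (0.0, 100.0))   # large initial uncertainty
for t in range(1, 31):
    state, cov = kalman_step(state, cov, z=2.0 * t)
print(state)
```

The velocity is never measured directly; the filter infers it from successive positions, which is what makes short-term trajectory prediction through an occlusion possible.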
TL;DR: A new computational model of motion attention and the approach to applying this model in video skimming is presented and the effectiveness of the architecture and model is demonstrated by user studies of visual skimming experiments.
Abstract: One of the key issues in video manipulation is video abstraction in the form of skimmed video. For this purpose, an important task is to determine the content significance of each chunk of frames in a video sequence. In this paper, we present a new computational model of motion attention and the approach to applying this model in video skimming. The effectiveness of our architecture and model is demonstrated by user studies of visual skimming experiments. The results indicate that the precision of motion attention detection is over 80%, and the user satisfaction of visual skimming is beyond 70%.
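Once each chunk of frames has a significance score, a skim can be assembled by keeping the highest-scoring chunks. The sketch below uses a greedy selection under a frame budget; the greedy rule, the budget, and the scores are assumptions for illustration and are not taken from the paper.

```python
def skim(chunks, budget):
    """chunks: list of (start_frame, end_frame, attention_score).
    Greedily keep the highest-scoring chunks that fit within `budget`
    frames, then return them in temporal order."""
    selected, used = [], 0
    for start, end, score in sorted(chunks, key=lambda c: c[2], reverse=True):
        length = end - start
        if used + length <= budget:
            selected.append((start, end))
            used += length
    return sorted(selected)

chunks = [
    (0, 100, 0.2),
    (100, 150, 0.9),
    (150, 300, 0.5),
    (300, 340, 0.8),
]
summary = skim(chunks, budget=100)
print(summary)
```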
TL;DR: A videophone can include an imaging mechanism for taking a video picture of a scene and a mechanism for producing a first video stream and a second video stream of the scene from the video picture and sending the first video stream and the second video stream onto a network at the same time.
Abstract: A videophone. The videophone can include an imaging mechanism for taking a video picture of a scene. The videophone can include a mechanism for producing a first video stream of the scene and a second video stream of the scene from the video picture and sending the first video stream and the second video stream onto a network at the same time, the producing mechanism in communication with the imaging mechanism. The videophone can include a mechanism for receiving a plurality of video streams of different scenes from a network. The videophone can include a mechanism for displaying the different scenes of the plurality of video streams alongside each other. A method for a video call.
TL;DR: An integrated system for automatic recording of activity, movement and interactions of insects, with special emphasis on file management, experiment design, arena and zone definition, object detection, experiment control, visualisation of tracks and calculation of analysis parameters is developed.
TL;DR: A new method for automatic segmentation of moving objects in image sequences for VOP extraction using a Markov random field, based on motion information, spatial information and the memory is presented.
Abstract: The emerging video coding standard MPEG-4 enables various content-based functionalities for multimedia applications. To support such functionalities, as well as to improve coding efficiency, MPEG-4 relies on a decomposition of each frame of an image sequence into video object planes (VOP). Each VOP corresponds to a single moving object in the scene. This paper presents a new method for automatic segmentation of moving objects in image sequences for VOP extraction. We formulate the problem as graph labeling over a region adjacency graph (RAG), based on motion information. The label field is modeled as a Markov random field (MRF). An initial spatial partition of each frame is obtained by a fast, floating-point based implementation of the watershed algorithm. The motion of each region is estimated by hierarchical region matching. To avoid inaccuracies in occlusion areas, a novel motion validation scheme is presented. A dynamic memory, based on object tracking, is incorporated into the segmentation process to maintain temporal coherence of the segmentation. Finally, a labeling is obtained by maximization of the a posteriori probability of the MRF using motion information, spatial information and the memory. The optimization is carried out by highest confidence first (HCF). Experimental results for several video sequences demonstrate the effectiveness of the proposed approach.
TL;DR: In this paper, a three-dimensional point cloud is generated from a time series of video frames, partitioned into a set of vertically-oriented bins, and mapped into plan-view images, each containing for each bin a corresponding pixel having one or more values computed based upon attributes of the point cloud members occupying the corresponding bin.
Abstract: Object tracking systems and methods are described. In one aspect, a three-dimensional point cloud is generated from a time series of video frames and partitioned into a set of vertically-oriented bins. The point cloud is mapped into one or more plan-view images each containing for each vertically-oriented bin a corresponding pixel having one or more values computed based upon one or more attributes of the point cloud members occupying the corresponding vertically-oriented bin. The object is tracked based at least in part upon one or more of the plan-view images. In another aspect, one or more original object templates are extracted from at least one of the one or more plan-view images, and the object is tracked based at least in part upon a comparison of at least one of the object templates with regions of the corresponding plan-view images. In another aspect, the point cloud may be discretized along the vertical axis into two or more horizontal partitions.
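The mapping from a point cloud to plan-view images can be sketched as a binning step. In the example below each vertical bin yields two pixel values, an occupancy count and a maximum height; these two attributes, the bin size, and the function name are assumptions chosen for illustration.

```python
def plan_view(points, bin_size, width, depth):
    """points: (x, y, z) tuples with z as the vertical axis.
    Returns an occupancy map and a height map, one pixel per vertical bin."""
    occupancy = [[0] * width for _ in range(depth)]
    height = [[0.0] * width for _ in range(depth)]
    for x, y, z in points:
        col = int(x // bin_size)
        row = int(y // bin_size)
        if 0 <= row < depth and 0 <= col < width:   # ignore points off the map
            occupancy[row][col] += 1
            height[row][col] = max(height[row][col], z)
    return occupancy, height

points = [
    (0.2, 0.1, 1.7),   # head of a person standing in the first bin
    (0.3, 0.4, 0.9),   # torso of the same person
    (1.6, 0.2, 1.2),   # a second, shorter object
    (5.0, 5.0, 1.0),   # outside the mapped area
]
occupancy, height = plan_view(points, bin_size=0.5, width=4, depth=2)
print(occupancy)
print(height)
```

Tracking can then operate on these small overhead images, for example by comparing a stored template with regions of the current map.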
TL;DR: The integration of color distributions into particle filtering, which has typically used edge-based image features, is presented; color distributions are used because they are robust to partial occlusion, rotation and scale invariant, and computationally efficient.
Abstract: Robust real-time tracking of non-rigid objects is a challenging task. Particle filtering has been proven very successful for non-linear and non-Gaussian estimation problems. However, for the tracking of non-rigid objects, the selection of reliable image features is also essential. This paper presents the integration of color distributions into particle filtering, which has typically used edge-based image features. Color distributions are applied as they are robust to partial occlusion, are rotation and scale invariant and computationally efficient. Thus, the target model of the particle filter is defined by the color information of the tracked object. As the tracker should find the most probable sample distribution, the model is compared with the current hypotheses of the particle filter using the Bhattacharyya coefficient, which is a popular similarity measure between two distributions. The proposed tracking method directly incorporates the scale and motion changes of the objects. Comparisons with the well known mean shift tracker show the advantages and limitations of the new approach. Keywords— Object tracking, Condensation algorithm, Color filtering, Bhattacharyya coefficient, Mean shift tracking
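The similarity measure named in this abstract is short enough to write out. The sketch below computes the Bhattacharyya coefficient between normalized histograms and uses it to rank two hypotheses; the four-bin histograms are made-up data, and the way a tracker would convert the coefficient into particle weights is not shown.

```python
import math

def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two discrete distributions:
    1.0 for identical distributions, 0.0 for non-overlapping ones."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

target = normalize([8, 2, 0, 0])          # color histogram of the tracked object
hypotheses = {
    "near_target": normalize([7, 3, 0, 0]),
    "background": normalize([0, 0, 5, 5]),
}
scores = {name: bhattacharyya(target, h) for name, h in hypotheses.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```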
TL;DR: In this article, the authors present a system for generating panoramic and spatially indexed video for virtual reality applications, where a video is rendered in response to a specified action.
Abstract: Systems and methods generate a video for virtual reality wherein the video is both panoramic and spatially indexed. In embodiments, a video system includes a controller, a database including spatial data, and a user interface in which a video is rendered in response to a specified action. The video includes a plurality of images retrieved from the database. Each of the images is panoramic and spatially indexed in accordance with a predetermined position along a virtual path in a virtual environment.
TL;DR: In this paper, a system for identifying vehicles of traffic violators is described, the system having elements that include a video camera that provides a real-time video signal, a traffic violation detector that provides a trigger signal, and a video recorder that records the video signal in a buffer until receipt of the trigger signal, so that the video signal is preserved during a sliding time interval prior to the trigger.
Abstract: Provided is a system for identifying vehicles of traffic violators, the system having elements that include: a video camera for providing, in real-time, a video signal that represents plural sequential video image frames (either perceptually continuous video, such as 30 frames per second, or non-perceptually continuous video, such as 1-2 fps); a traffic violation detector (e.g., a radar gun, an in-ground loop, a pair of self-powered wireless transponders or transmitters, a camera-based speed detection system, or any other speed sensor) that provides a trigger signal (e.g., based on vehicle speed and detection of the state of a traffic signal); a video recorder that receives the video signal provided by the camera and records the video signal in a buffer until receipt of a trigger signal, at which point at least a portion of the video signal stored in the buffer is preserved for recording and direct real-time storage of the video signal to a hard drive, or other high-capacity storage medium, commences. As a result, the video signal is preserved during a pre-programmable sliding (or rolling) time interval prior to provision of the trigger signal.
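The buffering behaviour described here, a rolling window kept until a trigger arrives, maps naturally onto a bounded queue. The sketch below is a minimal illustration using `collections.deque`; the class name is hypothetical, and a plain list stands in for the hard drive.

```python
from collections import deque

class PreTriggerRecorder:
    """Keeps the most recent frames in a rolling buffer; on a trigger the
    buffered frames are preserved and later frames go straight to storage."""

    def __init__(self, pre_trigger_frames):
        self.buffer = deque(maxlen=pre_trigger_frames)  # oldest frames drop off
        self.storage = []        # stand-in for the high-capacity medium
        self.triggered = False

    def on_frame(self, frame):
        if self.triggered:
            self.storage.append(frame)
        else:
            self.buffer.append(frame)

    def on_trigger(self):
        if not self.triggered:
            self.storage.extend(self.buffer)   # preserve the pre-trigger window
            self.buffer.clear()
            self.triggered = True

recorder = PreTriggerRecorder(pre_trigger_frames=3)
for frame in range(6):          # frames 0..5 arrive before the violation
    recorder.on_frame(frame)
recorder.on_trigger()           # e.g. the speed sensor fires
for frame in (6, 7):            # frames after the trigger
    recorder.on_frame(frame)
print(recorder.storage)
```

The length of the pre-trigger window is set by the buffer size, which corresponds to the pre-programmable interval mentioned in the abstract.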
TL;DR: In this article, a computer-generated annotation or graphic overlay can be registered to the segment, and therefore track the segment from the user's field of view of the segment without prior knowledge of the spatial relationship of a segment to the real-world environment according to a centroid for an interframe difference of the video image associated with the selected object.
Abstract: Video images of objects in a real-world environment are taken from the perspective of a viewer. The user's field of view may be captured in the video images that are processed to select a segment of the video image or an object depicted in the video image. An image such as a computer-generated annotation or graphic overlay may be registered to the segment, and therefore track the segment from the user's field of view of the segment, without prior knowledge of the spatial relationship of the segment to the real-world environment according to a centroid for an interframe difference of the video image associated with the selected object. The image may be displayed in the user's field of view or in the video image. The computer-generated image tracks the movement of the segment with respect to the video image.
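The centroid of an interframe difference can be computed directly from two frames. The sketch below thresholds the absolute difference and averages the coordinates of the changed pixels; the threshold value and the unweighted average are assumptions, since the abstract does not say how the centroid is computed.

```python
def difference_centroid(prev, curr, threshold=20):
    """Return the (row, col) centroid of pixels whose grey value changed by
    more than `threshold` between two frames, or None if nothing changed."""
    count = row_sum = col_sum = 0
    for r, (prev_row, curr_row) in enumerate(zip(prev, curr)):
        for c, (a, b) in enumerate(zip(prev_row, curr_row)):
            if abs(a - b) > threshold:
                count += 1
                row_sum += r
                col_sum += c
    if count == 0:
        return None
    return (row_sum / count, col_sum / count)

prev = [[0] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
for r, c in [(1, 1), (1, 2), (2, 1), (2, 2)]:   # a 2x2 patch changes
    curr[r][c] = 255

anchor = difference_centroid(prev, curr)
print(anchor)
```

An overlay drawn at this point in every frame follows the changing region without any model of where the object sits in the real world.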
TL;DR: Improved high speed object detection and high performance object tracking algorithms for real-time data processing and a system architecture for detection and modelling of dynamic traffic scenes is introduced.
Abstract: Vehicle-mounted laser scanners are able to observe the vehicle's environment in order to detect, track and classify the surrounding objects and thus provide data for active safety systems. The latest development of IBEO combines several innovations. The receiver diodes are arranged in an array, which enables simultaneous measurements in 4 horizontal planes, e.g. to compensate pitching of the vehicle. In addition, a multi-target capability is integrated. This technique enables the detection of two distances with a single measurement, thus enhancing the robustness against rain. This paper introduces improved high speed object detection and high performance object tracking algorithms for real-time data processing. Additionally, a classification of the road users is possible. A system architecture for detection and modelling of dynamic traffic scenes is introduced in order to provide a general idea of the different tasks necessary to reach the aim of a complete environmental model using a sensor for a wide range of applications.
TL;DR: In this article, a video clip is captured from a television broadcast on each of a plurality of channels and provided to a display interface, which successively displays the captured video clips within a focus area of a user interface in response to an initiating action by a user.
Abstract: A video clip is captured from a television broadcast on each of a plurality of channels. The captured video clips are provided to a display interface, which successively displays the captured video clips within a focus area of a user interface in response to an initiating action by a user. The display interface then discontinues the successive display of video clips to show a particular video clip corresponding to a selected channel in response to a terminating action by the user.
TL;DR: A motion-compensated, transform-domain super-resolution procedure for creating high-quality video or still images that directly incorporates the transform-domain quantization information by working with the compressed bit stream is proposed.
Abstract: There are a number of useful methods for creating high-quality video or still images from a lower quality video source. The best of these involve motion compensating a number of video frames to produce the desired video or still. These methods are formulated in the space domain and they require that the input be expressed in that format. More and more frequently, however, video sources are presented in a compressed format, such as MPEG, H.263, or DV. Ironically, there is important information in the compressed domain representation that is lost if the video is first decompressed and then used with a spatial-domain method. In particular, quantization information is lost once the video has been decompressed. Here, we propose a motion-compensated, transform-domain super-resolution procedure for creating high-quality video or still images that directly incorporates the transform-domain quantization information by working with the compressed bit stream. We apply this new formulation to MPEG-compressed video and demonstrate its effectiveness.
TL;DR: A general system that tracks the position and orientation of a camera observing a scene without visual markers and can employ any available feature tracking and pose estimation system for learning and tracking is described.
Abstract: Estimating the pose of a camera (virtual or real) in which some augmentation takes place is one of the most important parts of an augmented reality (AR) system. Availability of powerful processors and fast frame grabbers have made vision-based trackers commonly used due to their accuracy as well as flexibility and ease of use. Current vision-based trackers are based on tracking of markers. The use of markers increases robustness and reduces computational requirements. However, their use can be very complicated, as they require certain maintenance. Direct use of scene features for tracking, therefore, is desirable. To this end, we describe a general system that tracks the position and orientation of a camera observing a scene without any visual markers. Our method is based on a two-stage process. In the first stage, a set of features is learned with the help of an external tracking system while in action. The second stage uses these learned features for camera tracking when the system in the first stage decides that it is possible to do so. The system is very general so that it can employ any available feature tracking and pose estimation system for learning and tracking. We experimentally demonstrate the viability of the method in real-life examples.
TL;DR: In this article, the authors present a system and techniques for recording video in a mobile environment, in which camera means mounted at a first location in a vehicle generates a video signal based upon an observed scene.
Abstract: Provided are systems and techniques for recording video in a mobile environment, in which camera means mounted at a first location in a vehicle generates a video signal based upon an observed scene. Video recording means mounted at a second location in the vehicle inputs and records the video signal on a tangible medium. General-purpose computing means, mounted at a third location in the vehicle and running a general operating system and user-installed application programs, communicates with the video recording means, is loaded with software to provide a user interface to control recording and playback by the video recording means, and includes means for wireless communication with a central base station.
TL;DR: In this article, an apparatus and method for video object generation and selective encoding is provided, which includes a detection module for detecting a first object in at least one image frame of a series of image frames; a tracking module for tracking the first object and segmenting the object from a background, the background being a second object; and an encoder for encoding the first and second objects to be transmitted to a receiver, wherein the first object is compressed at a high compression rate and the second object is compressed at a low compression rate.
Abstract: An apparatus and method for video object generation and selective encoding is provided. The apparatus includes a detection module for detecting a first object in at least one image frame of a series of image frames; a tracking module for tracking the first object in successive image frames and segmenting the first object from a background, the background being a second object; and an encoder for encoding the first and second objects to be transmitted to a receiver, wherein the first object is compressed at a high compression rate and the second object is compressed at a low compression rate. The receiver merges the first and second object to form a composite image frame. The method provides for detecting, tracking and segmenting one or more objects, such as a face, from a background to be encoded at the same or different compression rates to conserve bandwidth.
TL;DR: This paper derives quantitative measures for the spatial uncertainty of the results provided by SSD-based feature trackers by scaling the SSD correlation surface, fitting a Gaussian distribution to this surface, and using this distribution to estimate values for a covariance matrix.
TL;DR: In this paper, a method is described for manipulating virtual objects displayed on a video conference broadcast by generating a computerized three dimensional image of an object to be superimposed on a first video broadcast signal from a local video camera for display on a remote video monitor, and then superimposing the same object on a second video broadcast signal from a remote camera for display on a local monitor.
Abstract: This invention is a method for manipulating virtual objects displayed on a video conference broadcast by generating a computerized three dimensional image of an object to be superimposed on a first video broadcast signal from a local video camera for display on a remote video monitor, and superimposing the same object on a second video broadcast signal from a remote video camera for display on a local video monitor, grabbing a portion of the three dimensional image by placing a hand in close proximity to the portion of the image, moving the hand while maintaining the hand in close proximity to the image, and regenerating the three dimensional image to a new perspective view corresponding to the movement of the image with the hand to create the appearance that the hand is manipulating a virtual object displayed over the video broadcast signal.
TL;DR: Experimental results on the proposed online processing scheme combined with efficient VOS show the proposed integrated scheme generates desirable summarizations of surveillance videos.
Abstract: Key frames are the subset of still images which best represent the content of a video sequence in an abstracted manner. In other words, video abstraction transforms an entire video clip to a small number of representative images. We present a scheme for object-based video abstraction facilitated by an efficient video-object segmentation (VOS) system. In such a framework, the concept of a "key frame" is replaced by that of a "key video-object plane (VOP)." In order to achieve an online object-based framework such as an object-based video surveillance system, it becomes essential that semantically meaningful video objects are directly accessed from video sequences. Moreover, the extraction of key VOPs needs to be automated and context dependent so that they maintain the important contents of the video while removing all redundancies. Once a VOP is extracted, the shape of the VOP needs to be well described. To this end, both region-based and contour-based shape descriptors are investigated, and the region-based descriptor is selected for the proposed system. The key VOPs are extracted in a sequential manner by successive comparison with the previously declared key VOP. Experimental results on the proposed online processing scheme combined with efficient VOS show the proposed integrated scheme generates desirable summarizations of surveillance videos.
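The sequential extraction rule, comparing each VOP with the previously declared key VOP, can be sketched in a few lines. The shape descriptors here are short made-up feature vectors, and the L1 distance and fixed threshold are assumptions; the paper's region-based descriptor and decision rule are not specified in the abstract.

```python
def descriptor_distance(a, b):
    """L1 distance between two shape descriptors (an assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def select_key_vops(descriptors, threshold):
    """Declare the first VOP a key VOP, then declare a new key VOP whenever
    the current descriptor differs enough from the last declared key."""
    keys = []
    last_key = None
    for index, descriptor in enumerate(descriptors):
        if last_key is None or descriptor_distance(descriptor, last_key) > threshold:
            keys.append(index)
            last_key = descriptor
    return keys

# Hypothetical two-component shape descriptors for five consecutive VOPs.
descriptors = [
    (1.0, 0.0),
    (1.0, 0.1),   # nearly the same shape as the first
    (0.5, 0.5),   # shape has changed noticeably
    (0.5, 0.6),
    (0.0, 1.0),   # changed again
]
keys = select_key_vops(descriptors, threshold=0.5)
print(keys)
```

Because each VOP is compared only with the most recent key, the selection can run online as frames arrive, which matches the surveillance setting the abstract targets.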
TL;DR: The 3D video recorder is presented, a system capable of recording, processing, and playing three-dimensional video from multiple points of view, and the player builds upon point-based rendering techniques and is thus capable of rendering high-quality images in real-time.
Abstract: We present the 3D video recorder, a system capable of recording, processing, and playing three-dimensional video from multiple points of view. We first record 2D video streams from several synchronized digital video cameras and store pre-processed images to disk. An off-line processing stage converts these images into a time-varying three-dimensional hierarchical point-based data structure and stores this 3D video to disk. We show how we can trade off 3D video quality with processing performance and devise efficient compression and coding schemes for our novel 3D video representation. A typical sequence is encoded at less than 7 megabits per second at a frame rate of 8.5 frames per second. The 3D video player decodes and renders 3D videos from hard-disk in real-time, providing interaction features known from common video cassette recorders, like variable-speed forward and reverse, and slow motion. 3D video playback can be enhanced with novel 3D video effects such as freeze-and-rotate and arbitrary scaling. The player builds upon point-based rendering techniques and is thus capable of rendering high-quality images in real-time. Finally, we demonstrate the 3D video recorder on multiple real-life video sequences.
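The quoted bit rate and frame rate imply an upper bound on the average size of one encoded 3D frame. The short calculation below works it out, assuming that a megabit means 10^6 bits and that the bit rate is spread evenly over frames.

```python
bitrate_bits_per_second = 7_000_000   # upper bound quoted in the abstract
frames_per_second = 8.5

bits_per_frame = bitrate_bits_per_second / frames_per_second
kilobytes_per_frame = bits_per_frame / 8 / 1000

print(f"at most about {bits_per_frame:,.0f} bits per frame")
print(f"at most about {kilobytes_per_frame:.1f} kB per frame")
```

So a typical sequence needs on average somewhat less than 103 kB per 3D frame under these assumptions.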
TL;DR: In this article, a method for real-time video communication is described, where a plurality of video streams of a local participant is captured from different viewpoints, and a new view synthesis technique is applied to generate a video image stream in real time.
Abstract: A method for real-time video communication. Specifically, one embodiment of the present invention discloses a method of video conferencing that captures a plurality of real-time video streams of a local participant from a plurality of sample viewpoints. From the plurality of video streams, a new view synthesis technique can be applied to generate a video image stream in real-time of the local participant rendered from a second location of a second participant with respect to a first location of the local participant in a coordinate space of a virtual environment. A change in either of the locations leads to the modifying of the video image stream, thereby enabling real-time video communication from the local participant to the second participant.
TL;DR: In this paper, a method and system provide an interactive video stream technique that allows pre-determination of interactive objects, on a frame-by-frame basis, within a video stream.
Abstract: A method and system provide an interactive video stream technique that allows pre-determination of interactive objects, on a frame-by-frame basis, within a video stream. The interactive technique allows designation of interactive objects as carried by key-frames, representing the end of a scene, within the video stream. Pre-determined information about the interactive object is provided to the user in response to user selection of the object. The interactive technique may include a video stream player software application that may receive a digital video stream and allow a user to designate the interactive objects within the video stream, and allow a user to select the interactive objects within the video stream during display and be provided with the pre-determined information about the object in response to the user selection.
TL;DR: A system is described for acquiring multi-view video of a person moving through the environment that adjusts the pan, tilt, zoom and focus parameters of multiple active cameras to keep the moving person centered in each view.
Abstract: A system is described for acquiring multi-view video of a person moving through the environment. A real-time tracking algorithm adjusts the pan, tilt, zoom and focus parameters of multiple active cameras to keep the moving person centered in each view. The output of the system is a set of synchronized, time-stamped video streams, showing the person simultaneously from several viewpoints.