TL;DR: In this paper, the Laplacian pyramid super-resolution network (LapSRN) is proposed to progressively reconstruct the sub-band residuals of high-resolution images.
Abstract: Convolutional neural networks have recently demonstrated high-quality reconstruction for single-image super-resolution. In this paper, we propose the Laplacian Pyramid Super-Resolution Network (LapSRN) to progressively reconstruct the sub-band residuals of high-resolution images. At each pyramid level, our model takes coarse-resolution feature maps as input, predicts the high-frequency residuals, and uses transposed convolutions for upsampling to the finer level. Our method does not require bicubic interpolation as a pre-processing step and thus dramatically reduces the computational complexity. We train the proposed LapSRN with deep supervision using a robust Charbonnier loss function and achieve high-quality reconstruction. Furthermore, our network generates multi-scale predictions in one feed-forward pass through the progressive reconstruction, thereby facilitating resource-aware applications. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of speed and accuracy.
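The Charbonnier loss named in the abstract above can be illustrated with a minimal sketch. It assumes the common form sqrt(d^2 + eps^2) averaged over pixels, with images flattened to plain lists of floats. The function names, the default eps, and the summation over pyramid levels used to mimic deep supervision are illustrative assumptions, not the authors' implementation.

```python
import math


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean Charbonnier penalty sqrt(d^2 + eps^2) over paired pixel values.

    eps is an assumed small constant; it keeps the penalty smooth at d = 0.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)


def deep_supervision_loss(preds_per_level, targets_per_level, eps=1e-3):
    """Sum the Charbonnier loss over all pyramid levels (hypothetical helper)."""
    return sum(
        charbonnier_loss(p, t, eps)
        for p, t in zip(preds_per_level, targets_per_level)
    )
```

For large residuals the penalty grows roughly linearly, like an L1 loss, which is why it is usually described as more robust to outliers than a squared error.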
TL;DR: The Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters, which makes it more efficient and appropriate for embedded applications.
Abstract: We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Unlike the recent FlowNet approach, the networks do not need to deal with large motions; these are dealt with by the pyramid. This has several advantages. First, our Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters. This makes it more efficient and appropriate for embedded applications. Second, since the flow at each pyramid level is small (
TL;DR: In this article, a structured segment network (SSN) is proposed to model the temporal structure of each action instance via a structured temporal pyramid, together with a decomposed discriminative model comprising two classifiers, one for classifying actions and one for determining completeness.
Abstract: Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we present the structured segment network (SSN), a novel framework which models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping (TAG), is devised to generate high quality action proposals. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
TL;DR: A novel method called Contextual Pyramid CNN (CP-CNN) for generating high-quality crowd density and count estimation by explicitly incorporating global and local contextual information of crowd images is presented.
Abstract: We present a novel method called Contextual Pyramid CNN (CP-CNN) for generating high-quality crowd density and count estimation by explicitly incorporating global and local contextual information of crowd images. The proposed CP-CNN consists of four modules: Global Context Estimator (GCE), Local Context Estimator (LCE), Density Map Estimator (DME) and a Fusion-CNN (F-CNN). GCE is a VGG-16 based CNN that encodes global context and it is trained to classify input images into different density classes, whereas LCE is another CNN that encodes local context information and it is trained to perform patch-wise classification of input images into different density classes. DME is a multi-column architecture-based CNN that aims to generate high-dimensional feature maps from the input image which are fused with the contextual information estimated by GCE and LCE using F-CNN. To generate high resolution and high-quality density maps, F-CNN uses a set of convolutional and fractionally-strided convolutional layers and it is trained along with the DME in an end-to-end fashion using a combination of adversarial loss and pixel-level Euclidean loss. Extensive experiments on highly challenging datasets show that the proposed method achieves significant improvements over the state-of-the-art methods.
TL;DR: This work designs Pyramid Residual Modules (PRMs) to enhance the scale invariance of DCNNs and provides a theoretical derivation that extends the current weight initialization scheme to multi-branch network structures.
Abstract: Articulated human pose estimation is a fundamental yet challenging task in computer vision. The difficulty is particularly pronounced in scale variations of human body parts when camera view changes or severe foreshortening happens. Although pyramid methods are widely used to handle scale changes at inference time, learning feature pyramids in deep convolutional neural networks (DCNNs) is still not well explored. In this work, we design Pyramid Residual Modules (PRMs) to enhance the invariance in scales of DCNNs. Given input features, the PRMs learn convolutional filters on various scales of input features, which are obtained with different subsampling ratios in a multi-branch network. Moreover, we observe that it is inappropriate to adopt existing methods to initialize the weights of multi-branch networks, which have recently achieved performance superior to plain networks in many tasks. Therefore, we provide a theoretical derivation to extend the current weight initialization scheme to multi-branch network structures. We investigate our method on two standard benchmarks for human pose estimation. Our approach obtains state-of-the-art results on both benchmarks. Code is available at https://github.com/bearpaw/PyraNet.
TL;DR: This work proposes to augment feedforward neural networks with a novel pyramid pooling module and a multi-stage refinement mechanism for saliency detection and shows that the proposed method compares favorably against the state-of-the-art approaches.
Abstract: Deep convolutional neural networks (CNNs) have been successfully applied to a wide variety of problems in computer vision, including salient object detection. To detect and segment salient objects accurately, it is necessary to extract and combine high-level semantic features with low-level fine details simultaneously. This happens to be a challenge for CNNs as repeated subsampling operations such as pooling and convolution lead to a significant decrease in the initial image resolution, which results in loss of spatial details and finer structures. To remedy this problem, here we propose to augment feedforward neural networks with a novel pyramid pooling module and a multi-stage refinement mechanism for saliency detection. First, our deep feedforward net is used to generate a coarse prediction map in which many detailed structures are lost. Then, refinement nets are integrated with local context information to refine the preceding saliency maps generated in the master branch in a stagewise manner. Further, a pyramid pooling module is applied for different-region-based global context aggregation. Empirical evaluations over six benchmark datasets show that our proposed method compares favorably against the state-of-the-art approaches.
TL;DR: The Single Stage Headless (SSH) face detector detects faces in a single stage directly from the early convolutional layers in a classification network, and achieves state-of-the-art results while removing the head of its underlying classification network.
Abstract: We introduce the Single Stage Headless (SSH) face detector. Unlike two stage proposal-classification detectors, SSH detects faces in a single stage directly from the early convolutional layers in a classification network. SSH is headless. That is, it is able to achieve state-of-the-art results while removing the “head” of its underlying classification network – i.e. all fully connected layers in the VGG-16, which contain a large number of parameters. Additionally, instead of relying on an image pyramid to detect faces with various scales, SSH is scale-invariant by design. We simultaneously detect faces with different scales in a single forward pass of the network, but from different layers. These properties make SSH fast and light-weight. Surprisingly, with a headless VGG-16, SSH beats the ResNet-101-based state-of-the-art on the WIDER dataset, even though, unlike the current state-of-the-art, SSH does not use an image pyramid and is 5X faster. Moreover, if an image pyramid is deployed, our light-weight network achieves state-of-the-art on all subsets of the WIDER dataset, improving the AP by 2.5%. SSH also reaches state-of-the-art results on the FDDB and Pascal-Faces datasets while using a small input size, leading to a speed of 50 frames/second on a GPU.
TL;DR: In this article, the authors propose the deep Laplacian Pyramid Super-Resolution Network, which progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels.
Abstract: Convolutional neural networks have recently demonstrated high-quality reconstruction for single image super-resolution. However, existing methods often require a large number of network parameters and entail heavy computational loads at runtime for generating high-accuracy super-resolution results. In this paper, we propose the deep Laplacian Pyramid Super-Resolution Network for fast and accurate image super-resolution. The proposed network progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels. In contrast to existing methods that involve the bicubic interpolation for pre-processing (which results in large feature maps), the proposed method directly extracts features from the low-resolution input space and thereby entails low computational loads. We train the proposed network with deep supervision using the robust Charbonnier loss functions and achieve high-quality image reconstruction. Furthermore, we utilize the recursive layers to share parameters across as well as within pyramid levels, and thus drastically reduce the number of parameters. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of run-time and image quality.
TL;DR: An improved pre-trained AlexNet architecture named pre-trained AlexNet-SPP-SS is proposed, which incorporates scale pooling (spatial pyramid pooling, SPP) and side supervision (SS) to address the lack of multi-scale fusion and of appropriate supervision.
Abstract: The rapid development of high spatial resolution (HSR) remote sensing imagery techniques not only provides a considerable number of datasets for scene classification tasks but also calls for an appropriate scene classification method when labeled samples are limited. AlexNet, as a relatively simple convolutional neural network (CNN) architecture, has obtained great success in scene classification tasks and has been proven to be an excellent foundational hierarchical and automatic scene classification technique. However, current HSR remote sensing imagery scene classification datasets are typically small and have simple categories, and the limited number of annotated samples easily causes non-convergence. For HSR remote sensing imagery, multi-scale information of the same scenes can represent the scene semantics to a certain extent, but an efficient way to fuse it is lacking. Meanwhile, the current pre-trained AlexNet architecture lacks appropriate supervision for enhancing the performance of the model, which easily causes overfitting. In this paper, an improved pre-trained AlexNet architecture named pre-trained AlexNet-SPP-SS is proposed, which incorporates scale pooling (spatial pyramid pooling, SPP) and side supervision (SS) to address these two problems. Extensive experimental results obtained on the UC Merced dataset and the Google Image dataset of SIRI-WHU demonstrate that the proposed pre-trained AlexNet-SPP-SS model is superior to the original AlexNet architecture as well as the traditional scene classification methods.
TL;DR: An unsupervised representation learning method is proposed to investigate deconvolution networks for remote sensing scene classification; its results outperform most state-of-the-art methods, which demonstrates the effectiveness of this method.
Abstract: With the rapid development of satellite sensor technology, high spatial resolution remote sensing (HSR) data have attracted extensive attention in military and civilian applications. In order to make full use of these data, remote sensing scene classification becomes an important and necessary precedent task. In this paper, an unsupervised representation learning method is proposed to investigate deconvolution networks for remote sensing scene classification. First, a shallow weighted deconvolution network is utilized to learn a set of feature maps and filters for each image by minimizing the reconstruction error between the input image and the convolution result. The learned feature maps can capture the abundant edge and texture information of high spatial resolution images, which is definitely important for remote sensing images. After that, the spatial pyramid model (SPM) is used to aggregate features at different scales to maintain the spatial layout of the HSR image scene. A discriminative representation for the HSR image is obtained by combining the proposed weighted deconvolution model and SPM. Finally, the representation vector is input into a support vector machine to finish classification. We apply our method on two challenging HSR image data sets: the UCMerced data set with 21 scene categories and the Sydney data set with seven land-use categories. All the experimental results achieved by the proposed method outperform most state-of-the-art methods, which demonstrates the effectiveness of the proposed method.
TL;DR: This work proposes a novel spatiotemporal pyramid network to fuse the spatial and temporal features in a pyramid structure such that they can reinforce each other and achieves state-of-the-art results on standard video datasets.
Abstract: Two-stream convolutional networks have shown strong performance in video action recognition tasks. The key idea is to learn spatiotemporal features by fusing convolutional networks spatially and temporally. However, it remains unclear how to model the correlations between the spatial and temporal structures at multiple abstraction levels. First, the spatial stream tends to fail if two videos share similar backgrounds. Second, the temporal stream may be fooled if two actions resemble each other in short snippets, though they appear to be distinct in the long term. We propose a novel spatiotemporal pyramid network to fuse the spatial and temporal features in a pyramid structure such that they can reinforce each other. From the architecture perspective, our network constitutes hierarchical fusion strategies which can be trained as a whole using a unified spatiotemporal loss. A series of ablation experiments support the importance of each fusion strategy. From the technical perspective, we introduce the spatiotemporal compact bilinear operator into video analysis tasks. This operator enables efficient training of bilinear fusion operations which can capture full interactions between the spatial and temporal features. Our final network achieves state-of-the-art results on standard video datasets.
TL;DR: In this paper, a convolutional neural network based approach for estimating the relative pose between two cameras is presented, which takes RGB images from both cameras as input and directly produces the relative rotation and translation as output.
Abstract: This paper presents a convolutional neural network based approach for estimating the relative pose between two cameras. The proposed network takes RGB images from both cameras as input and directly produces the relative rotation and translation as output. The system is trained in an end-to-end manner utilising transfer learning from a large scale classification dataset. The introduced approach is compared with widely used local feature based methods (SURF, ORB) and the results indicate a clear improvement over the baseline. In addition, a variant of the proposed architecture containing a spatial pyramid pooling (SPP) layer is evaluated and shown to further improve the performance.
TL;DR: A novel network structure, which allows an arbitrary number of frames as the network input, is proposed and can be learned on a small target data set because it can leverage the off-the-shelf image-level CNN for model parameter initialization.
Abstract: Encouraged by the success of convolutional neural networks (CNNs) in image classification, much effort has recently been spent on applying CNNs to video-based action recognition problems. One challenge is that a video contains a varying number of frames, which is incompatible with the standard input format of CNNs. Existing methods handle this issue either by directly sampling a fixed number of frames or by bypassing it with a 3D convolutional layer, which conducts convolution in the spatial-temporal domain. In this paper, we propose a novel network structure, which allows an arbitrary number of frames as the network input. The key to our solution is to introduce a module consisting of an encoding layer and a temporal pyramid pooling layer. The encoding layer maps the activation from the previous layers to a feature vector suitable for pooling, whereas the temporal pyramid pooling layer converts multiple frame-level activations into a fixed-length video-level representation. In addition, we adopt a feature concatenation layer that combines the appearance and motion information. Compared with the frame sampling strategy, our method avoids the risk of missing any important frames. Compared with the 3D convolutional method, which requires a huge video data set for network training, our model can be learned on a small target data set because we can leverage the off-the-shelf image-level CNN for model parameter initialization. Experiments on three challenging data sets, Hollywood2, HMDB51, and UCF101, demonstrate the effectiveness of the proposed network.
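The way a temporal pyramid pooling layer turns a variable number of frame-level activations into a fixed-length vector can be sketched as follows. This is a minimal illustration under stated assumptions: frame features are plain lists of floats, the pyramid levels (1, 2, 4) and the use of max pooling are hypothetical choices, and the function name is not taken from the paper.

```python
def temporal_pyramid_pool(frames, levels=(1, 2, 4)):
    """Pool T frame-level feature vectors into one fixed-length vector.

    frames: list of T feature vectors, each of dimension D (T >= 1).
    levels: assumed number of temporal segments at each pyramid level.
    Returns a list of length D * sum(levels), independent of T.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    num_frames = len(frames)
    dim = len(frames[0])
    pooled = []
    for segments in levels:
        for i in range(segments):
            # Segment boundaries: floor for the start, ceiling for the end,
            # so every segment holds at least one frame even when T < segments.
            start = (i * num_frames) // segments
            end = -(-((i + 1) * num_frames) // segments)
            window = frames[start:end]
            # Max-pool each feature dimension over the frames in the segment.
            pooled.extend(max(f[d] for f in window) for d in range(dim))
    return pooled
```

With 2-dimensional frame features and levels (1, 2, 4), both a 3-frame and a 10-frame video map to a vector of length 2 * (1 + 2 + 4) = 14, which is what lets a fixed-size classifier follow the pooling layer.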
TL;DR: A deep learning approach to remove motion blur from a single image captured in the wild, i.e., in an uncontrolled setting, is proposed and both a novel convolutional neural network architecture and a dataset for blurry images with ground truth are designed.
Abstract: We propose a deep learning approach to remove motion blur from a single image captured in the wild, i.e., in an uncontrolled setting. Thus, we consider motion blur degradations that are due to both camera and object motion, as well as to objects being occluded or coming into view. In this scenario, a model-based approach would require a very large set of parameters, whose fitting is a challenge on its own. Hence, we take a data-driven approach and design both a novel convolutional neural network architecture and a dataset for blurry images with ground truth. The network directly produces the sharp image as output and is built in three pyramid stages, which allow blur to be removed gradually, from a small amount at the lowest scale to the full amount at the scale of the input image. To obtain corresponding blurry and sharp image pairs, we use videos from a high frame-rate video camera. For each small video clip we select the central frame as the sharp image and use the frame average as the corresponding blurred image. Finally, to ensure that the averaging process is a sufficient approximation to real blurry images, we estimate optical flow and select frames with pixel displacements smaller than a pixel. We demonstrate state-of-the-art performance on datasets with both synthetic and real images.
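The data-collection step described above (central frame as the sharp image, frame average as the blurred image) can be sketched in a few lines. The sketch assumes a clip is an odd-length list of equally sized grayscale frames stored as nested lists; the function name is hypothetical, and the optical-flow filtering step mentioned in the abstract is not included.

```python
def make_blur_pair(frames):
    """Build a (blurred, sharp) training pair from one short video clip.

    frames: list of equally sized 2D grayscale images (lists of rows).
    The blurred image is the per-pixel mean over all frames; the sharp
    image is the central frame of the clip.
    """
    if not frames:
        raise ValueError("clip must contain at least one frame")
    count = len(frames)
    sharp = frames[count // 2]
    height, width = len(sharp), len(sharp[0])
    blurred = [
        [sum(f[r][c] for f in frames) / count for c in range(width)]
        for r in range(height)
    ]
    return blurred, sharp
```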
TL;DR: It is shown that faces with different scales can be modeled through a specialized set of deep convolutional networks with different structures and these detectors can be seamlessly integrated into a single unified network that can be trained end-to-end.
Abstract: In this paper, we share our experience in designing a convolutional network-based face detector that could handle faces of an extremely wide range of scales. We show that faces with different scales can be modeled through a specialized set of deep convolutional networks with different structures. These detectors can be seamlessly integrated into a single unified network that can be trained end-to-end. In contrast to existing deep models that are designed for wide scale range, our network does not require an image pyramid input and the model is of modest complexity. Our network, dubbed ScaleFace, achieves promising performance on WIDER FACE and FDDB datasets with practical runtime speed. Specifically, our method achieves 76.4 average precision on the challenging WIDER FACE dataset and 96% recall rate on the FDDB dataset at 7 frames per second (fps) for a 900 × 1300 input image.
TL;DR: A novel framework for recognizing human activities from video sequences captured by depth cameras is presented, including a general scheme of super normal vector (SNV) to aggregate the low-level polynormals into a discriminative representation, which can be viewed as a simplified version of the Fisher kernel representation.
Abstract: The advent of cost-effective and easy-to-operate depth cameras has facilitated a variety of visual recognition tasks including human activity recognition. This paper presents a novel framework for recognizing human activities from video sequences captured by depth cameras. We extend the surface normal to polynormal by assembling local neighboring hypersurface normals from a depth sequence to jointly characterize local motion and shape information. We then propose a general scheme of super normal vector (SNV) to aggregate the low-level polynormals into a discriminative representation, which can be viewed as a simplified version of the Fisher kernel representation. In order to globally capture the spatial layout and temporal order, an adaptive spatio-temporal pyramid is introduced to subdivide a depth video into a set of space-time cells. In the extensive experiments, the proposed approach achieves superior performance to the state-of-the-art methods on the four public benchmark datasets, i.e., MSRAction3D, MSRDailyActivity3D, MSRGesture3D, and MSRActionPairs3D.
TL;DR: Benefiting from the edge-preserving property of the filter used in the algorithm, the details in the brightest/darkest regions are preserved well and no halo artifacts are produced in the fused image.
Abstract: Multi-scale exposure fusion is an efficient way to fuse differently exposed low dynamic range (LDR) images of a high dynamic range (HDR) scene into a high quality LDR image directly. It can produce images with higher quality than single-scale exposure fusion, but has a risk of producing halo artifacts and cannot preserve details in brightest or darkest regions well in the fused image. In this paper, an edge-preserving smoothing pyramid is introduced for the multi-scale exposure fusion. Benefiting from the edge-preserving property of the filter used in the algorithm, the details in the brightest/darkest regions are preserved well and no halo artifacts are produced in the fused image. The experimental results prove that the proposed algorithm produces better fused images than the state-of-the-art algorithms both qualitatively and quantitatively.
TL;DR: This work proposes a sketch framework, the Pyramid sketch, which can significantly improve accuracy as well as update and query speed, and verifies the effectiveness and efficiency of the framework.
Abstract: Sketch is a probabilistic data structure, and is used to store and query the frequency of any item in a given multiset. Due to its high memory efficiency, it has been applied to various fields in computer science, such as stream database, network traffic measurement, etc. The key metrics of sketches for data streams are accuracy, speed, and memory usage. Various sketches have been proposed, but they cannot achieve both high accuracy and high speed using limited memory, especially for skewed datasets. To address this issue, we propose a sketch framework, the Pyramid sketch, which can significantly improve accuracy as well as update and query speed. To verify the effectiveness and efficiency of our framework, we applied our framework to four typical sketches. Extensive experimental results show that the accuracy is improved up to 3.50 times, while the speed is improved up to 2.10 times. We have released our source codes at Github [1].
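For readers unfamiliar with sketches, the following minimal example shows a Count-Min sketch, one widely used sketch of the kind the abstract describes (a compact structure for storing and querying item frequencies). It is not the Pyramid sketch framework itself; the class name, the table sizes, and the use of salted BLAKE2b hashing are illustrative assumptions.

```python
import hashlib


class CountMinSketch:
    """Baseline Count-Min sketch (illustrative; not the Pyramid sketch)."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # A different salt per row gives an independent-looking hash per row.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        """Add count occurrences of item to every row."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def query(self, item):
        """Return the estimated frequency; it never underestimates."""
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )
```

Hash collisions can only inflate a counter, so taking the minimum across rows gives an estimate that is at least the true count. Trade-offs among accuracy, speed, and memory in structures like this one are what the abstract says the Pyramid sketch aims to improve.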
TL;DR: This work designs a scale-forecast network to globally predict potential scales in the image since there is no need to compute maps on all levels of the pyramid and proposes a landmark retracing network (LRN) to retrace back locations of the regressed landmarks and generate a confidence score for each landmark.
Abstract: Since convolutional neural networks (CNNs) lack an inherent mechanism to handle large scale variations, we always need to compute feature maps multiple times for multiscale object detection, which creates a computational bottleneck in practice. To address this, we devise a recurrent scale approximation (RSA) to compute the feature map only once, and from this map alone we approximate the remaining maps on other levels. At the core of RSA is the recursive rolling out mechanism: given an initial map on a particular scale, it generates the prediction on a smaller scale that is half the size of the input. To further increase efficiency and accuracy, we (a) design a scale-forecast network to globally predict potential scales in the image, since there is no need to compute maps on all levels of the pyramid, and (b) propose a landmark retracing network (LRN) to retrace back locations of the regressed landmarks and generate a confidence score for each landmark; LRN can effectively alleviate false positives due to the accumulated error in RSA. The whole system could be trained end-to-end in a unified CNN framework. Experiments demonstrate that our proposed algorithm is superior to state-of-the-art methods on face detection benchmarks and achieves comparable results for generic proposal generation. The source code of our system is available.
TL;DR: This work introduces a new image dataset along with expert annotated diagnoses for evaluating image-based cervical disease classification algorithms and investigates the performance of convolutional neural network features for cervical disease classification.
TL;DR: Together, the results suggest that continuous evolution of features on a multigrid pyramid is a more powerful alternative to existing CNN designs on a flat grid.
Abstract: We propose a multigrid extension of convolutional neural networks (CNNs). Rather than manipulating representations living on a single spatial grid, our network layers operate across scale space, on a pyramid of grids. They consume multigrid inputs and produce multigrid outputs; convolutional filters themselves have both within-scale and cross-scale extent. This aspect is distinct from simple multiscale designs, which only process the input at different scales. Viewed in terms of information flow, a multigrid network passes messages across a spatial pyramid. As a consequence, receptive field size grows exponentially with depth, facilitating rapid integration of context. Most critically, multigrid structure enables networks to learn internal attention and dynamic routing mechanisms, and use them to accomplish tasks on which modern CNNs fail. Experiments demonstrate wide-ranging performance advantages of multigrid. On CIFAR and ImageNet classification tasks, flipping from a single grid to multigrid within the standard CNN paradigm improves accuracy, while being compute and parameter efficient. Multigrid is independent of other architectural choices; we show synergy in combination with residual connections. Multigrid yields dramatic improvement on a synthetic semantic segmentation dataset. Most strikingly, relatively shallow multigrid networks can learn to directly perform spatial transformation tasks, where, in contrast, current CNNs fail. Together, our results suggest that continuous evolution of features on a multigrid pyramid is a more powerful alternative to existing CNN designs on a flat grid.
TL;DR: In this paper, the authors propose a novel CNN architecture that directly generates the sharp image from a blurred input. Realistic training data are captured with a high frame rate video camera, keeping one frame as the sharp image and the frame average as the corresponding blurred image.
Abstract: The task of image deblurring is a very ill-posed problem as both the image and the blur are unknown. Moreover, when pictures are taken in the wild, this task becomes even more challenging due to the blur varying spatially and the occlusions between objects. Due to the complexity of the general image model, we propose a novel convolutional network architecture which directly generates the sharp image. This network is built in three stages, and exploits the benefits of pyramid schemes often used in blind deconvolution. One of the main difficulties in training such a network is to design a suitable dataset. While useful data can be obtained by synthetically blurring a collection of images, more realistic data must be collected in the wild. To obtain such data we use a high frame rate video camera and keep one frame as the sharp image and frame average as the corresponding blurred image. We show that this realistic dataset is key in achieving state-of-the-art performance and dealing with occlusions.
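The data-collection step in the deblurring abstract above (one frame kept as the sharp image, the frame average used as the blurred image) can be illustrated with a minimal sketch. The function name `make_blur_pair`, the choice of the middle frame as the sharp target, and the list-of-lists grayscale representation are assumptions made for illustration; the abstract does not say which frame is kept or how frames are stored.

```python
def make_blur_pair(frames):
    """Return (sharp, blurred) from a burst of equally sized grayscale frames.

    frames: list of 2D lists of floats, one per video frame.
    The middle frame is kept as the sharp target (an assumption; the
    abstract only says "one frame").
    """
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    # Average pixel-wise over time to mimic a longer exposure.
    blurred = [
        [sum(f[y][x] for f in frames) / n for x in range(width)]
        for y in range(height)
    ]
    sharp = frames[n // 2]
    return sharp, blurred


# A bright dot moving one pixel per frame along a single-row image.
frames = [[[1.0 if x == t else 0.0 for x in range(5)]] for t in range(3)]
sharp, blurred = make_blur_pair(frames)
```

In this toy burst the sharp frame contains a single bright pixel, while the averaged frame smears the same energy over the three positions the dot visited, which is the kind of motion blur the network would be trained to undo.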
TL;DR: A new end-to-end network based on ResNet and U-Net that replaces the pooling layer with a convolutional layer, which can reduce information loss to some extent, and uses LeakyReLU instead of ReLU along the downsampling path to increase the expressiveness of the model.
Abstract: Various deep convolutional neural networks (CNNs) have been applied to the task of medical image segmentation. Many CNNs have been shown to perform better than traditional algorithms. The deep residual network (ResNet) has drastically improved performance through a trainable deep structure. In this paper, we propose a new end-to-end network based on ResNet and U-Net. Our CNN effectively combines the features from shallow and deep layers through multi-path information confusion. In order to exploit global context features and enlarge the receptive field in deep layers without losing resolution, we designed a new structure called pyramid dilated convolution. Unlike traditional CNNs, our network replaces the pooling layer with a convolutional layer, which can reduce information loss to some extent. We also introduce LeakyReLU instead of ReLU along the downsampling path to increase the expressiveness of our model. Experiments show that our proposed method can successfully extract features for medical image segmentation.
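The abstract above does not specify how the pyramid dilated convolution is built, so the following is only a minimal sketch of the underlying idea it relies on: dilation enlarges the receptive field while the output keeps the input resolution. The 1D setting, the 3-tap all-ones kernel, and the dilation rates 1, 2, and 4 are hypothetical choices for illustration, not the authors' configuration.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1D cross-correlation with zero padding and a dilation rate."""
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Taps are spaced `dilation` samples apart around position i.
            idx = i + (j - center) * dilation
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def effective_kernel_size(kernel_size, dilation):
    """Span of input samples covered by one dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Response to a unit impulse at several (hypothetical) dilation rates.
impulse = [0.0] * 4 + [1.0] + [0.0] * 4
responses = {d: dilated_conv1d(impulse, [1.0, 1.0, 1.0], d) for d in (1, 2, 4)}
```

Each response has the same length as the input, but the positions that see the impulse move further apart as the dilation rate grows, which is how context is gathered without pooling.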
TL;DR: Simulation results verify that the proposed image compression–encryption hybrid algorithm provides considerable compression performance with good security.
TL;DR: In this paper, a structured segment network (SSN) is proposed to model the temporal structure of each action instance via a structured temporal pyramid, together with a decomposed discriminative model comprising two classifiers, one for classifying actions and one for determining completeness.
Abstract: Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we present the structured segment network (SSN), a novel framework which models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping (TAG), is devised to generate high quality action proposals. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
TL;DR: Experimental results show that the proposed detail enhanced exposure fusion algorithm can preserve details in saturated regions, especially the brightest regions, better than the state-of-the-art multiscale exposure fusion algorithms.
Abstract: Multiscale exposure fusion is a fast approach to fuse several differently exposed images of the same high dynamic range (HDR) scene into a high-quality low dynamic range (LDR) image. The fused image is expected to include all details of the input images. However, the details in the brightest and darkest regions are usually not well preserved. Adding details that are extracted from the input images to the fused image is an efficient approach to overcome the problem. In this paper, a new gradient domain weighted least squares based image smoothing algorithm is proposed to extract the details in the brightest and darkest regions of the HDR scene. The extracted details are then added to an image that is produced using an edge-preserving smoothing pyramid based multiscale exposure fusion algorithm. Experimental results show that the proposed detail enhanced exposure fusion algorithm can preserve details in saturated regions, especially the brightest regions, better than the state-of-the-art multiscale exposure fusion algorithms.
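The detail-enhancement idea in the exposure fusion abstract above (extract a detail layer by smoothing, then add it to the fused image) can be sketched in one dimension. This is not the proposed method: a plain box filter stands in for the gradient domain weighted least squares smoothing, the fused signal is simply given as input rather than computed by pyramid fusion, and the names `box_smooth`, `extract_detail`, `add_detail`, and the `gain` parameter are hypothetical.

```python
def box_smooth(signal, radius=1):
    """Moving-average smoothing; a simple stand-in for the WLS smoother."""
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def extract_detail(signal, radius=1):
    """Detail layer = input minus its smoothed (base) version."""
    base = box_smooth(signal, radius)
    return [s - b for s, b in zip(signal, base)]


def add_detail(fused, detail, gain=1.0):
    """Add the extracted detail layer back onto an already fused signal."""
    return [f + gain * d for f, d in zip(fused, detail)]


flat_detail = extract_detail([2.0] * 5)             # no texture, no detail
spike_detail = extract_detail([0.0, 0.0, 3.0, 0.0, 0.0])
enhanced = add_detail([1.0] * 5, spike_detail)
```

A flat region yields an all-zero detail layer, so only local structure is transferred to the fused result. An edge-preserving smoother would replace `box_smooth` in practice so that the detail layer does not include halos around strong edges.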
TL;DR: AnchorNet is a deep network that produces image representations well suited to semantic matching, relying on a set of filters whose response is geometrically consistent across different object instances, even in the presence of strong intra-class, scale, or viewpoint variations.
Abstract: Despite significant progress of deep learning in recent years, state-of-the-art semantic matching methods still rely on legacy features such as SIFT or HoG. We argue that the strong invariance properties that are key to the success of recent deep architectures on the classification task make them unfit for dense correspondence tasks, unless a large amount of supervision is used. In this work, we propose a deep network, termed AnchorNet, that produces image representations that are well-suited for semantic matching. It relies on a set of filters whose response is geometrically consistent across different object instances, even in the presence of strong intra-class, scale, or viewpoint variations. Trained only with weak image-level labels, the final representation successfully captures information about the object structure and improves results of state-of-the-art semantic matching methods such as the Deformable Spatial Pyramid or the Proposal Flow methods. We show positive results on the cross-instance matching task where different instances of the same object category are matched as well as on a new cross-category semantic matching task aligning pairs of instances each from a different object class.
TL;DR: Experimental results indicate that the multi-scale SPP based DCNN can better adapt to input images of different sizes to learn the multi-scale characteristics of objects, thus further improving the detection effect.
Abstract: In recent years, vehicle detection from aerial images obtained using unmanned aerial vehicles (UAVs) has become a research focus in image processing as remote sensing platforms on UAVs are rapidly popularised. This study proposes a detection algorithm using a deep convolutional neural network (DCNN) based on multi-scale spatial pyramid pooling (SPP). By using multi-scale SPP models to sample characteristic patterns with different sizes, feature vectors with a fixed length are generated. This avoids the stretching- or cropping-induced deformation of input images of different sizes, thus improving the detection effect. In addition, an image pre-processing algorithm based on maximum normed gradient (NG) with multiple thresholds is proposed. By using this algorithm, this research restores the edges of objects disturbed by clutter in the environment. Meanwhile, the proposed candidate object extraction algorithm based on the maximum binarized NG entails fewer computations as it generates fewer candidate windows. Experimental results indicate that the multi-scale SPP based DCNN can better adapt to input images of different sizes to learn the multi-scale characteristics of objects, thus further improving the detection effect.
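The fixed-length property of spatial pyramid pooling mentioned in the abstract above can be shown with a minimal sketch. The pyramid levels (1, 2, 4), the use of max pooling, and the single-channel list-of-lists feature map are assumptions for illustration; the abstract does not state the configuration used in the study.

```python
def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a non-empty 2D feature map into a fixed-length vector.

    For each level n the map is divided into an n x n grid of bins and the
    maximum of each bin is kept, so the output length is sum(n * n over
    levels) regardless of the input height and width.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for by in range(n):
            # Bin edges: floor for the start, ceil for the end, so bins
            # are never empty even when the map is smaller than the grid.
            y0 = (by * height) // n
            y1 = ((by + 1) * height + n - 1) // n
            for bx in range(n):
                x0 = (bx * width) // n
                x1 = ((bx + 1) * width + n - 1) // n
                pooled.append(
                    max(feature_map[y][x]
                        for y in range(y0, y1) for x in range(x0, x1))
                )
    return pooled


# Two feature maps of different sizes yield vectors of the same length.
small = [[float(y * 7 + x) for x in range(7)] for y in range(5)]
large = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
vec_small = spatial_pyramid_pool(small)
vec_large = spatial_pyramid_pool(large)
```

Because both vectors have 1 + 4 + 16 = 21 entries, a fully connected layer can follow directly, without stretching or cropping the input image to a fixed size.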