No Arabic abstract
Attributed to the ever-increasing large image datasets, Convolutional Neural Networks (CNNs) have become popular for vision-based tasks. It is generally admirable to have larger-sized datasets for higher network training accuracies. However, the impact of dataset quality has not to be involved. It is reasonable to assume the near-duplicate images exist in the datasets. For instance, the Street View House Numbers (SVHN) dataset having cropped house plate digits from 0 to 9 are likely to have repetitive digits from the same/similar house plates. Redundant images may take up a certain portion of the dataset without consciousness. While contributing little to no accuracy improvement for the CNNs training, these duplicated images unnecessarily pose extra resource and computation consumption. To this end, this paper proposes a framework to assess the impact of the near-duplicate images on CNN training performance, called CE-Dedup. Specifically, CE-Dedup associates a hashing-based image deduplication approach with downstream CNNs-based image classification tasks. CE-Dedup balances the tradeoff between a large deduplication ratio and a stable accuracy by adjusting the deduplication threshold. The effectiveness of CE-Dedup is validated through extensive experiments on well-known CNN benchmarks. On one hand, while maintaining the same validation accuracy, CE-Dedup can reduce the dataset size by 23%. On the other hand, when allowing a small validation accuracy drop (by 5%), CE-Dedup can trim the dataset size by 75%.
Deep Neural Networks (DNNs) have achieved im- pressive accuracy in many application domains including im- age classification. Training of DNNs is an extremely compute- intensive process and is solved using variants of the stochastic gradient descent (SGD) algorithm. A lot of recent research has focussed on improving the performance of DNN training. In this paper, we present optimization techniques to improve the performance of the data parallel synchronous SGD algorithm using the Torch framework: (i) we maintain data in-memory to avoid file I/O overheads, (ii) we present a multi-color based MPI Allreduce algorithm to minimize communication overheads, and (iii) we propose optimizations to the Torch data parallel table framework that handles multi-threading. We evaluate the performance of our optimizations on a Power 8 Minsky cluster with 32 nodes and 128 NVidia Pascal P100 GPUs. With our optimizations, we are able to train 90 epochs of the ResNet-50 model on the Imagenet-1k dataset using 256 GPUs in just 48 minutes. This significantly improves on the previously best known performance of training 90 epochs of the ResNet-50 model on the same dataset using 256 GPUs in 65 minutes. To the best of our knowledge, this is the best known training performance demonstrated for the Imagenet- 1k dataset.
In this paper, we propose to utilize Convolutional Neural Networks (CNNs) and the segmentation-based multi-scale analysis to locate tampered areas in digital images. First, to deal with color input sliding windows of different scales, a unified CNN architecture is designed. Then, we elaborately design the training procedures of CNNs on sampled training patches. With a set of robust multi-scale tampering detectors based on CNNs, complementary tampering possibility maps can be generated. Last but not least, a segmentation-based method is proposed to fuse the maps and generate the final decision map. By exploiting the benefits of both the small-scale and large-scale analyses, the segmentation-based multi-scale analysis can lead to a performance leap in forgery localization of CNNs. Numerous experiments are conducted to demonstrate the effectiveness and efficiency of our method.
In the problems of image retrieval and annotation, complete textual tag lists of images play critical roles. However, in real-world applications, the image tags are usually incomplete, thus it is important to learn the complete tags for images. In this paper, we study the problem of image tag complete and proposed a novel method for this problem based on a popular image representation method, convolutional neural network (CNN). The method estimates the complete tags from the convolutional filtering outputs of images based on a linear predictor. The CNN parameters, linear predictor, and the complete tags are learned jointly by our method. We build a minimization problem to encourage the consistency between the complete tags and the available incomplete tags, reduce the estimation error, and reduce the model complexity. An iterative algorithm is developed to solve the minimization problem. Experiments over benchmark image data sets show its effectiveness.
Convolutional Neural Networks (CNN) are state-of-the-art models for many image classification tasks. However, to recognize cancer subtypes automatically, training a CNN on gigapixel resolution Whole Slide Tissue Images (WSI) is currently computationally impossible. The differentiation of cancer subtypes is based on cellular-level visual features observed on image patch scale. Therefore, we argue that in this situation, training a patch-level classifier on image patches will perform better than or similar to an image-level classifier. The challenge becomes how to intelligently combine patch-level classification results and model the fact that not all patches will be discriminative. We propose to train a decision fusion model to aggregate patch-level predictions given by patch-level CNNs, which to the best of our knowledge has not been shown before. Furthermore, we formulate a novel Expectation-Maximization (EM) based method that automatically locates discriminative patches robustly by utilizing the spatial relationships of patches. We apply our method to the classification of glioma and non-small-cell lung carcinoma cases into subtypes. The classification accuracy of our method is similar to the inter-observer agreement between pathologists. Although it is impossible to train CNNs on WSIs, we experimentally demonstrate using a comparable non-cancer dataset of smaller images that a patch-based CNN can outperform an image-based CNN.
To train deep convolutional neural networks, the input data and the intermediate activations need to be kept in memory to calculate the gradient descent step. Given the limited memory available in the current generation accelerator cards, this limits the maximum dimensions of the input data. We demonstrate a method to train convolutional neural networks holding only parts of the image in memory while giving equivalent results. We quantitatively compare this new way of training convolutional neural networks with conventional training. In addition, as a proof of concept, we train a convolutional neural network with 64 megapixel images, which requires 97% less memory than the conventional approach.