Attention for Image Registration (AiR): an unsupervised Transformer approach

78 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Zihao Wang

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Zihao Wang - Herve Delingette

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Image registration as an important basis in signal processing task often encounter the problem of stability and efficiency. Non-learning registration approaches rely on the optimization of the similarity metrics between the fix and moving images. Yet, those approaches are usually costly in both time and space complexity. The problem can be worse when the size of the image is large or the deformations between the images are severe. Recently, deep learning, or precisely saying, the convolutional neural network (CNN) based image registration methods have been widely investigated in the research community and show promising effectiveness to overcome the weakness of non-learning based methods. To explore the advanced learning approaches in image registration problem for solving practical issues, we present in this paper a method of introducing attention mechanism in deformable image registration problem. The proposed approach is based on learning the deformation field with a Transformer framework (AiR) that does not rely on the CNN but can be efficiently trained on GPGPU devices also. In a more vivid interpretation: we treat the image registration problem as the same as a language translation task and introducing a Transformer to tackle the problem. Our method learns an unsupervised generated deformation map and is tested on two benchmark datasets. The source code of the AiR will be released at Gitlab.

قيم البحث

297 - Youssef A. Mejjati , Christian Richardt , James Tompkin 2018

Current unsupervised image-to-image translation techniques struggle to focus their attention on individual objects without altering the background or the way multiple objects interact within a scene. Motivated by the important role of attention in hu man perception, we tackle this limitation by introducing unsupervised attention mechanisms that are jointly adversarialy trained with the generators and discriminators. We demonstrate qualitatively and quantitatively that our approach is able to attend to relevant regions in the image without requiring supervision, and that by doing so it achieves more realistic mappings compared to recent approaches.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي

Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning

84 - Shikha Dubey , Farrukh Olimov , Muhammad Aasim Rafique 2021

Automatic transcription of scene understanding in images and videos is a step towards artificial general intelligence. Image captioning is a nomenclature for describing meaningful information in an image using computer vision techniques. Automated im age captioning techniques utilize encoder and decoder architecture, where the encoder extracts features from an image and the decoder generates a transcript. In this work, we investigate two unexplored ideas for image captioning using transformers: First, we demonstrate the enforcement of using objects relevance in the surrounding environment. Second, learning an explicit association between labels and language constructs. We propose label-attention Transformer with geometrically coherent objects (LATGeO). The proposed technique acquires a proposal of geometrically coherent objects using a deep neural network (DNN) and generates captions by investigating their relationships using a label-attention module. Object coherence is defined using the localized ratio of the geometrical properties of the proposals. The label-attention module associates the extracted objects classes to the available dictionary using self-attention layers. The experimentation results show that objects relevance in surroundings and binding of their visual feature with their geometrically localized ratios combined with its associated labels help in defining meaningful captions. The proposed framework is tested on the MSCOCO dataset, and a thorough evaluation resulting in overall better quantitative scores pronounces its superiority.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي

An Unsupervised Sampling Approach for Image-Sentence Matching Using Document-Level Structural Information

66 - Zejun Li , Zhongyu Wei , Zhihao Fan 2021

In this paper, we focus on the problem of unsupervised image-sentence matching. Existing research explores to utilize document-level structural information to sample positive and negative instances for model training. Although the approach achieves p ositive results, it introduces a sampling bias and fails to distinguish instances with high semantic similarity. To alleviate the bias, we propose a new sampling strategy to select additional intra-document image-sentence pairs as positive or negative samples. Furthermore, to recognize the complex pattern in intra-document samples, we propose a Transformer based model to capture fine-grained features and implicitly construct a graph for each document, where concepts in a document are introduced to bridge the representation learning of images and sentences in the context of a document. Experimental results show the effectiveness of our approach to alleviate the bias and learn well-aligned multimodal representations.

الرؤية الحاسوبية وتمييز الأنماط الذكاء الاصطناعي الوسائط المتعددة

ViT-V-Net: Vision Transformer for Unsupervised Volumetric Medical Image Registration

126 - Junyu Chen , Yufan He , Eric C. Frey 2021

In the last decade, convolutional neural networks (ConvNets) have dominated and achieved state-of-the-art performances in a variety of medical imaging applications. However, the performances of ConvNets are still limited by lacking the understanding of long-range spatial relations in an image. The recently proposed Vision Transformer (ViT) for image classification uses a purely self-attention-based model that learns long-range spatial relations to focus on the relevant parts of an image. Nevertheless, ViT emphasizes the low-resolution features because of the consecutive downsamplings, result in a lack of detailed localization information, making it unsuitable for image registration. Recently, several ViT-based image segmentation methods have been combined with ConvNets to improve the recovery of detailed localization information. Inspired by them, we present ViT-V-Net, which bridges ViT and ConvNet to provide volumetric medical image registration. The experimental results presented here demonstrate that the proposed architecture achieves superior performance to several top-performing registration methods.

معالجة الصور والفيديو الرؤية الحاسوبية وتمييز الأنماط

Conv2Warp: An unsupervised deformable image registration with continuous convolution and warping

414 - Sharib Ali , Jens Rittscher 2019

Recent successes in deep learning based deformable image registration (DIR) methods have demonstrated that complex deformation can be learnt directly from data while reducing computation time when compared to traditional methods. However, the relianc e on fully linear convolutional layers imposes a uniform sampling of pixel/voxel locations which ultimately limits their performance. To address this problem, we propose a novel approach of learning a continuous warp of the source image. Here, the required deformation vector fields are obtained from a concatenated linear and non-linear convolution layers and a learnable bicubic Catmull-Rom spline resampler. This allows to compute smooth deformation field and more accurate alignment compared to using only linear convolutions and linear resampling. In addition, the continuous warping technique penalizes disagreements that are due to topological changes. Our experiments demonstrate that this approach manages to capture large non-linear deformations and minimizes the propagation of interpolation errors. While improving accuracy the method is computationally efficient. We present comparative results on a range of public 4D CT lung (POPI) and brain datasets (CUMC12, MGH10).

الرؤية الحاسوبية وتمييز الأنماط التعلم الآلي معالجة الصور والفيديو