أوراق بحثية, رسائل ماجستير ودكتوراه منشورة من قبل Bowen Zhang

Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation

136 - Bowen Zhang , Yifan Liu , Zhi Tian 2021

Semantic segmentation requires per-pixel prediction for a given image. Typically, the output resolution of a segmentation network is severely reduced due to the downsampling operations in the CNN backbone. Most previous methods employ upsampling deco ders to recover the spatial resolution. Various decoders were designed in the literature. Here, we propose a novel decoder, termed dynamic neural representational decoder (NRD), which is simple yet significantly more efficient. As each location on the encoders output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks. This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient. Furthermore, these neural representations are dynamically generated and conditioned on the outputs of the encoder networks. The desired semantic labels can be efficiently decoded from the neural representations, resulting in high-resolution semantic segmentation predictions. We empirically show that our proposed decoder can outperform the decoder in DeeplabV3+ with only 30% computational complexity, and achieve competitive performance with the methods using dilated encoders with only 15% computation. Experiments on the Cityscapes, ADE20K, and PASCAL Context datasets demonstrate the effectiveness and efficiency of our proposed method.

الرؤية الحاسوبية وتمييز الأنماط

Enhanced Hyperspectral Image Super-Resolution via RGB Fusion and TV-TV Minimization

76 - Marija Vella , Bowen Zhang , Wei Chen 2021

Hyperspectral (HS) images contain detailed spectral information that has proven crucial in applications like remote sensing, surveillance, and astronomy. However, because of hardware limitations of HS cameras, the captured images have low spatial res olution. To improve them, the low-resolution hyperspectral images are fused with conventional high-resolution RGB images via a technique known as fusion based HS image super-resolution. Currently, the best performance in this task is achieved by deep learning (DL) methods. Such methods, however, cannot guarantee that the input measurements are satisfied in the recovered image, since the learned parameters by the network are applied to every test image. Conversely, model-based algorithms can typically guarantee such measurement consistency. Inspired by these observations, we propose a framework that integrates learning and model based methods. Experimental results show that our method produces images of superior spatial and spectral resolution compared to the current leading methods, whether model- or DL-based.

معالجة الصور والفيديو التعلم الآلي معالجة الإشارات

CREATe: Clinical Report Extraction and Annotation Technology

90 - Yichao Zhou , Wei-Ting Chen , Bowen Zhang 2021

Clinical case reports are written descriptions of the unique aspects of a particular clinical case, playing an essential role in sharing clinical experiences about atypical disease phenotypes and new therapies. However, to our knowledge, there has be en no attempt to develop an end-to-end system to annotate, index, or otherwise curate these reports. In this paper, we propose a novel computational resource platform, CREATe, for extracting, indexing, and querying the contents of clinical case reports. CREATe fosters an environment of sustainable resource support and discovery, enabling researchers to overcome the challenges of information science. An online video of the demonstration can be viewed at https://youtu.be/Q8owBQYTjDc.

الحساب واللغة التعلم الآلي

Instance and Panoptic Segmentation Using Conditional Convolutions

76 - Zhi Tian , Bowen Zhang , Hao Chen 2021

We propose a simple yet effective framework for instance and panoptic segmentation, termed CondInst (conditional convolutions for instance and panoptic segmentation). In the literature, top-performing instance segmentation methods typically follow th e paradigm of Mask R-CNN and rely on ROI operations (typically ROIAlign) to attend to each instance. In contrast, we propose to attend to the instances with dynamic conditional convolutions. Instead of using instance-wise ROIs as inputs to the instance mask head of fixed weights, we design dynamic instance-aware mask heads, conditioned on the instances to be predicted. CondInst enjoys three advantages: 1.) Instance and panoptic segmentation are unified into a fully convolutional network, eliminating the need for ROI cropping and feature alignment. 2.) The elimination of the ROI cropping also significantly improves the output instance mask resolution. 3.) Due to the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference time per instance and making the overall inference time almost constant, irrelevant to the number of instances. We demonstrate a simpler method that can achieve improved accuracy and inference speed on both instance and panoptic segmentation tasks. On the COCO dataset, we outperform a few state-of-the-art methods. We hope that CondInst can be a strong baseline for instance and panoptic segmentation. Code is available at: https://git.io/AdelaiDet

الرؤية الحاسوبية وتمييز الأنماط معالجة الصور والفيديو

A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus

130 - Bowen Zhang , Hexiang Hu , Joonseok Lee 2020

Identifying a short segment in a long video that semantically matches a text query is a challenging task that has important application potentials in language-based video search, browsing, and navigation. Typical retrieval systems respond to a query with either a whole video or a pre-defined video segment, but it is challenging to localize undefined segments in untrimmed and unsegmented videos where exhaustively searching over all possible segments is intractable. The outstanding challenge is that the representation of a video must account for different levels of granularity in the temporal domain. To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely, video retrieval, segment temporal localization, and masked language modeling. We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets. Our approach outperforms the previous methods as well as strong baselines, establishing new state-of-the-art for this task.

الرؤية الحاسوبية وتمييز الأنماط الحساب واللغة

Solving Sparse Linear Inverse Problems in Communication Systems: A Deep Learning Approach With Adaptive Depth

82 - Wei Chen , Bowen Zhang , Shi Jin 2020

Sparse signal recovery problems from noisy linear measurements appear in many areas of wireless communications. In recent years, deep learning (DL) based approaches have attracted interests of researchers to solve the sparse linear inverse problem by unfolding iterative algorithms as neural networks. Typically, research concerning DL assume a fixed number of network layers. However, it ignores a key character in traditional iterative algorithms, where the number of iterations required for convergence changes with varying sparsity levels. By investigating on the projected gradient descent, we unveil the drawbacks of the existing DL methods with fixed depth. Then we propose an end-to-end trainable DL architecture, which involves an extra halting score at each layer. Therefore, the proposed method learns how many layers to execute to emit an output, and the network depth is dynamically adjusted for each task in the inference phase. We conduct experiments using both synthetic data and applications including random access in massive MTC and massive MIMO channel estimation, and the results demonstrate the improved efficiency for the proposed approach.

معالجة الإشارات الذكاء الاصطناعي

Online Action Detection in Streaming Videos with Time Buffers

99 - Bowen Zhang , Hao Chen , Meng Wang 2020

We formulate the problem of online temporal action detection in live streaming videos, acknowledging one important property of live streaming videos that there is normally a broadcast delay between the latest captured frame and the actual frame viewe d by the audience. The standard setting of the online action detection task requires immediate prediction after a new frame is captured. We illustrate that its lack of consideration of the delay is imposing unnecessary constraints on the models and thus not suitable for this problem. We propose to adopt the problem setting that allows models to make use of the small `buffer time incurred by the delay in live streaming videos. We design an action start and end detection framework for this online with buffer setting with two major components: flattened I3D and window-based suppression. Experiments on three standard temporal action detection benchmarks under the proposed setting demonstrate the effectiveness of the proposed framework. We show that by having a suitable problem setting for this problem with wide-applications, we can achieve much better detection accuracy than off-the-shelf online action detection models.

الرؤية الحاسوبية وتمييز الأنماط

Learning to Represent Image and Text with Denotation Graph

88 - Bowen Zhang , Hexiang Hu , Vihan Jain 2020

Learning to fuse vision and language information and representing them is an important research problem with many applications. Recent progresses have leveraged the ideas of pre-training (from language modeling) and attention layers in Transformers t o learn representation from datasets containing images aligned with linguistic expressions that describe the images. In this paper, we propose learning representations from a set of implied, visually grounded expressions between image and text, automatically mined from those datasets. In particular, we use denotation graphs to represent how specific concepts (such as sentences describing images) can be linked to abstract and generic concepts (such as short phrases) that are also visually grounded. This type of generic-to-specific relations can be discovered using linguistic analysis tools. We propose methods to incorporate such relations into learning representation. We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations. The representations lead to stronger empirical results on downstream tasks of cross-modal image retrieval, referring expression, and compositional attribute-object recognition. Both our codes and the extracted denotation graphs on the Flickr30K and the COCO datasets are publically available on https://sha-lab.github.io/DG.

الرؤية الحاسوبية وتمييز الأنماط الحساب واللغة التعلم الآلي

Challenging the adversarial robustness of DNNs based on error-correcting output codes

42 - Bowen Zhang , Benedetta Tondi , Xixiang Lv 2020

The existence of adversarial examples and the easiness with which they can be generated raise several security concerns with regard to deep learning systems, pushing researchers to develop suitable defense mechanisms. The use of networks adopting err or-correcting output codes (ECOC) has recently been proposed to counter the creation of adversarial examples in a white-box setting. In this paper, we carry out an in-depth investigation of the adversarial robustness achieved by the ECOC approach. We do so by proposing a new adversarial attack specifically designed for multi-label classification architectures, like the ECOC-based one, and by applying two existing attacks. In contrast to previous findings, our analysis reveals that ECOC-based networks can be attacked quite easily by introducing a small adversarial perturbation. Moreover, the adversarial examples can be generated in such a way to achieve high probabilities for the predicted target class, hence making it difficult to use the prediction confidence to detect them. Our findings are proven by means of experimental results obtained on MNIST, CIFAR-10 and GTSRB classification tasks.

التشفير والأمن التعلم الآلي

Characterizing Cryptocurrency Exchange Scams

90 - Pengcheng Xia , Bowen Zhang , Ru Ji 2020

As the indispensable trading platforms of the ecosystem, hundreds of cryptocurrency exchanges are emerging to facilitate the trading of digital assets. While, it also attracts the attentions of attackers. A number of scam attacks were reported target ing cryptocurrency exchanges, leading to a huge mount of financial loss. However, no previous work in our research community has systematically studied this problem. In this paper, we make the first effort to identify and characterize the cryptocurrency exchange scams. We first identify over 1,500 scam domains and over 300 fake apps, by collecting existing reports and using typosquatting generation techniques. Then we investigate the relationship between them, and identify 94 scam domain families and 30 fake app families. We further characterize the impacts of such scams, and reveal that these scams have incurred financial loss of 520k US dollars at least. We further observe that the fake apps have been sneaked to major app markets (including Google Play) to infect unsuspicious users. Our findings demonstrate the urgency to identify and prevent cryptocurrency exchange scams. To facilitate future research, we have publicly released all the identified scam domains and fake apps to the community.

التشفير والأمن هندسة البرمجيات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد