Recent deep learning methods for object detection rely on a large amount of bounding box annotations. Collecting these annotations is laborious and costly, yet supervised models do not generalize well when tested on images from a different distribution. Domain adaptation provides a solution by adapting existing labels to the target testing data. However, a large gap between domains could make adaptation a challenging task, which leads to unstable training processes and sub-optimal results. In this paper, we propose to bridge the domain gap with an intermediate domain and progressively solve easier adaptation subtasks. This intermediate domain is constructed by translating the source images to mimic the ones in the target domain. To tackle the domain-shift problem, we adopt adversarial learning to align distributions at the feature level. In addition, a weighted task loss is applied to deal with unbalanced image quality in the intermediate domain. Experimental results show that our method performs favorably against the state-of-the-art method in terms of the performance on the target domain.
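To make the two key ingredients concrete, the following is a minimal sketch (not the authors' code) of feature-level adversarial alignment combined with a per-image weighted task loss. The module names, the discriminator architecture, and the scalar `quality_weight` are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainDiscriminator(nn.Module):
    """Predicts whether a feature map comes from the source or target domain."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 3, stride=2, padding=1),
        )

    def forward(self, feat):
        return self.net(feat)  # per-location domain logits

def adaptation_step(features_int, det_loss_int, quality_weight, disc):
    """One training step: align intermediate-domain features with the source,
    and down-weight the detection loss on low-quality translated images."""
    # Adversarial loss: the detector tries to make intermediate-domain
    # features indistinguishable from source features (label 0 = "source").
    logits_int = disc(features_int)
    adv_loss = F.binary_cross_entropy_with_logits(
        logits_int, torch.zeros_like(logits_int))
    # Weighted task loss: quality_weight in [0, 1] reflects how well the
    # translated image mimics the target domain (an assumed scalar here).
    return quality_weight * det_loss_int + 0.1 * adv_loss
```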
Predicting structured outputs such as semantic segmentation relies on expensive per-pixel annotations to learn supervised models like convolutional neural networks. However, models trained on one data domain may not generalize well to other domains without annotations for model finetuning. To avoid the labor-intensive process of annotation, we develop a domain adaptation method to adapt the source data to the unlabeled target domain. We propose to learn discriminative feature representations of patches in the source domain by discovering multiple modes of patch-wise output distribution through the construction of a clustered space. With such representations as guidance, we use an adversarial learning scheme to push the feature representations of target patches in the clustered space closer to the distributions of source patches. In addition, we show that our framework is complementary to existing domain adaptation techniques and achieves consistent improvements on semantic segmentation. Extensive ablations and results are demonstrated on numerous benchmark datasets with various settings, such as synthetic-to-real and cross-city scenarios.
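One plausible way to construct the clustered space is to run K-means over per-patch category histograms of the source ground truth, so that each cluster corresponds to one mode of the patch-wise output distribution. The sketch below illustrates this under assumed choices (patch size, cluster count, non-negative integer label maps); it is not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def patch_label_histograms(label_map, num_classes, patch=32):
    """Split a (H, W) ground-truth label map into patches and compute the
    per-patch category histogram used as the clustering feature."""
    H, W = label_map.shape
    feats = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            tile = label_map[y:y + patch, x:x + patch].ravel()
            hist = np.bincount(tile, minlength=num_classes).astype(np.float64)
            feats.append(hist / hist.sum())
    return np.stack(feats)

def build_clustered_space(label_maps, num_classes, k=50):
    """Cluster source-domain patches into k modes; each patch's cluster index
    then serves as a discriminative pseudo-label for representation learning."""
    feats = np.concatenate(
        [patch_label_histograms(m, num_classes) for m in label_maps])
    km = KMeans(n_clusters=k, n_init=10).fit(feats)
    return km  # km.predict(...) assigns new patches to modes
```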
In order to learn object segmentation models in videos, conventional methods require a large amount of pixel-wise ground truth annotations. However, collecting such supervised data is time-consuming and labor-intensive. In this paper, we exploit existing annotations in source images and transfer such visual information to segment videos with unseen object categories. Without using any annotations in the target video, we propose a method to jointly mine useful segments and learn feature representations that better adapt to the target frames. The entire process is decomposed into two tasks: 1) solving a submodular function for selecting object-like segments, and 2) learning a CNN model with a transferable module for adapting seen categories in the source domain to the unseen target video. We present an iterative update scheme between two tasks to self-learn the final solution for object segmentation. Experimental results on numerous benchmark datasets show that the proposed method performs favorably against the state-of-the-art algorithms.
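The first task can be approached with greedy maximization, which for monotone submodular objectives enjoys the classic (1 - 1/e) approximation guarantee. Below is a minimal sketch of such a greedy selection with a facility-location-style objective; the similarity matrix, the objectness term, and the trade-off weight `lam` are assumptions for illustration, not the paper's exact function.

```python
import numpy as np

def greedy_submodular_selection(similarity, objectness, budget, lam=0.5):
    """similarity: (N, N) pairwise segment similarities;
    objectness: (N,) per-segment object-likeness scores;
    budget: number of segments to select."""
    n = similarity.shape[0]
    selected, covered = [], np.zeros(n)
    for _ in range(budget):
        best_gain, best_j = -np.inf, -1
        for j in range(n):
            if j in selected:
                continue
            # Facility-location gain: improvement in how well every
            # segment is represented by the selected set, plus a bonus
            # for picking object-like segments.
            new_cover = np.maximum(covered, similarity[:, j])
            gain = (new_cover - covered).sum() + lam * objectness[j]
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        covered = np.maximum(covered, similarity[:, best_j])
    return selected
```

The selected segments then supply pseudo-labels for the CNN, whose improved features in turn refine the similarity and objectness terms on the next iteration of the update scheme.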
Online video object segmentation is a challenging task as it requires processing the image sequence both promptly and accurately. To segment a target object through the video, numerous CNN-based methods have been developed by heavily finetuning on the object mask in the first frame, which is time-consuming for online applications. In this paper, we propose a fast and accurate video object segmentation algorithm that can immediately start the segmentation process once it receives the images. We first utilize a part-based tracking method to deal with challenging factors such as large deformation, occlusion, and cluttered background. Based on the tracked bounding boxes of parts, we construct a region-of-interest segmentation network to generate part masks. Finally, a similarity-based scoring function is adopted to refine these object parts by comparing them to the visual information in the first frame. Our method performs favorably against state-of-the-art algorithms in accuracy on the DAVIS benchmark dataset, while achieving much faster runtime performance.
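A hedged sketch of the similarity-based scoring idea: compare each tracked part's feature to template features from the annotated first frame with cosine similarity, and keep only high-scoring part masks. Feature extraction is left abstract, and the tensor shapes and threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def score_parts(part_feats, template_feats):
    """part_feats: (P, D) features of current-frame part proposals;
    template_feats: (T, D) features of parts from the annotated first frame.
    Returns a (P,) score: each part's best match to the template set."""
    part_feats = F.normalize(part_feats, dim=1)
    template_feats = F.normalize(template_feats, dim=1)
    sim = part_feats @ template_feats.t()   # (P, T) cosine similarities
    return sim.max(dim=1).values            # best template match per part

# Parts whose score falls below a threshold can be discarded before
# merging the remaining part masks into the final object mask.
```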
Convolutional neural network-based approaches for semantic segmentation rely on supervision with pixel-level ground truth, but may not generalize well to unseen image domains. As the labeling process is tedious and labor-intensive, developing algorithms that can adapt source ground truth labels to the target domain is of great interest. In this paper, we propose an adversarial learning method for domain adaptation in the context of semantic segmentation. Considering semantic segmentations as structured outputs that contain spatial similarities between the source and target domains, we adopt adversarial learning in the output space. To further enhance the adapted model, we construct a multi-level adversarial network to effectively perform output space domain adaptation at different feature levels. Extensive experiments and ablation studies are conducted under various domain adaptation settings, including synthetic-to-real and cross-city scenarios. We show that the proposed method performs favorably against the state-of-the-art methods in terms of accuracy and visual quality.
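The core of output-space adaptation at a single level can be sketched as follows: a supervised loss on source images plus an adversarial loss that pushes target softmax predictions toward the source output distribution. The discriminator architecture and the weight `lam` are illustrative assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputSpaceDiscriminator(nn.Module):
    """Classifies a softmax segmentation map as source-like or target-like."""
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=2, padding=1),
        )

    def forward(self, prob):
        return self.net(prob)

def adversarial_step(seg_net, disc, img_src, label_src, img_tgt, lam=0.001):
    """Supervised loss on source + adversarial loss that makes target
    predictions look source-like to the discriminator."""
    pred_src = seg_net(img_src)                       # (B, C, H, W) logits
    seg_loss = F.cross_entropy(pred_src, label_src)   # label_src: (B, H, W)
    prob_tgt = F.softmax(seg_net(img_tgt), dim=1)
    d_out = disc(prob_tgt)
    # Fool the discriminator: label target outputs as source (0).
    adv_loss = F.binary_cross_entropy_with_logits(
        d_out, torch.zeros_like(d_out))
    return seg_loss + lam * adv_loss
```

The multi-level variant repeats this scheme with an additional discriminator attached to an earlier feature level, summing the adversarial losses with separate weights.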
We study domain-specific video streaming. Specifically, we target a streaming setting where the videos to be streamed from a server to a client are all in the same domain and they have to be compressed to a small size for low-latency transmission. Several popular video streaming services, such as the video game streaming services of GeForce Now and Twitch, fall in this category. While conventional video compression standards such as H.264 are commonly used for this task, we hypothesize that one can leverage the property that the videos are all in the same domain to achieve better video quality. Based on this hypothesis, we propose a novel video compression pipeline. Specifically, we first apply H.264 to compress domain-specific videos. We then train a novel binary autoencoder to encode the leftover domain-specific residual information frame-by-frame into binary representations. These binary representations are then compressed and sent to the client together with the H.264 stream. In our experiments, we show that our pipeline yields consistent gains over standard H.264 compression across several benchmark datasets while using the same channel bandwidth.
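A hedged sketch of the binary-autoencoder idea: encode the H.264 residual into {-1, +1} codes, using a straight-through estimator so gradients pass through the non-differentiable binarization. The layer configuration and code size are illustrative stand-ins, not the paper's exact model.

```python
import torch
import torch.nn as nn

class Binarizer(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)       # quantize tanh activations to {-1, +1}

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output          # straight-through gradient

class BinaryResidualAE(nn.Module):
    def __init__(self, code_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, code_channels, 4, stride=2, padding=1), nn.Tanh(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(code_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, residual):
        code = Binarizer.apply(self.encoder(residual))  # binary bottleneck
        return self.decoder(code), code

# The binary codes are further compressed and transmitted alongside the
# H.264 stream; the client decodes them and adds the reconstructed
# residual back to the decoded H.264 frame.
```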
We present a scene parsing method that utilizes global context information based on both parametric and non-parametric models. Compared to previous methods that only exploit the local relationship between objects, we train a context network based on scene similarities to generate feature representations for global contexts. In addition, these learned features are utilized to generate global and spatial priors for explicit class inference. We then design modules to embed the feature representations and the priors into the segmentation network as additional global context cues. We show that the proposed method can eliminate false positives that are not compatible with the global context representations. Experiments on both the MIT ADE20K and PASCAL Context datasets show that the proposed method performs favorably against existing methods.
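One simple way to embed a per-image global context feature into a segmentation network is to broadcast it spatially and fuse it with local features; the sketch below illustrates this under assumed shapes and a 1x1-convolution fusion scheme, which need not match the paper's modules.

```python
import torch
import torch.nn as nn

class GlobalContextFusion(nn.Module):
    def __init__(self, local_ch=512, context_dim=256):
        super().__init__()
        self.fuse = nn.Conv2d(local_ch + context_dim, local_ch, 1)

    def forward(self, local_feat, context_vec):
        """local_feat: (B, C, H, W) from the segmentation backbone;
        context_vec: (B, D) from the scene-similarity context network."""
        B, D = context_vec.shape
        H, W = local_feat.shape[2:]
        # Broadcast the global context to every spatial location, then fuse.
        ctx = context_vec.view(B, D, 1, 1).expand(B, D, H, W)
        return self.fuse(torch.cat([local_feat, ctx], dim=1))
```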
This paper proposes an end-to-end trainable network, SegFlow, for simultaneously predicting pixel-wise object segmentation and optical flow in videos. The proposed SegFlow has two branches where useful information of object segmentation and optical flow is propagated bidirectionally in a unified framework. The segmentation branch is based on a fully convolutional network, which has been proven effective in the image segmentation task, and the optical flow branch takes advantage of the FlowNet model. The unified framework is trained iteratively offline to learn a generic notion, and fine-tuned online for specific objects. Extensive experiments on both the video object segmentation and optical flow datasets demonstrate that introducing optical flow improves the performance of segmentation and vice versa, performing favorably against the state-of-the-art algorithms.
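The bidirectional propagation between the two branches can be sketched as a feature exchange before each branch's prediction head. The module internals below are placeholders under assumed channel sizes, not the actual SegFlow layers.

```python
import torch
import torch.nn as nn

class BidirectionalBranches(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.seg_to_flow = nn.Conv2d(ch, ch, 1)  # pass segmentation cues to flow
        self.flow_to_seg = nn.Conv2d(ch, ch, 1)  # pass motion cues to segmentation
        self.seg_head = nn.Conv2d(ch, 1, 1)      # per-pixel object mask logits
        self.flow_head = nn.Conv2d(ch, 2, 1)     # per-pixel (u, v) flow

    def forward(self, seg_feat, flow_feat):
        # Each branch receives the other branch's features additively,
        # so segmentation and motion information flow in both directions.
        seg_fused = seg_feat + self.flow_to_seg(flow_feat)
        flow_fused = flow_feat + self.seg_to_flow(seg_feat)
        return self.seg_head(seg_fused), self.flow_head(flow_fused)
```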
We propose a deep learning-based framework for instance-level object segmentation. Our method mainly consists of three steps. First, we train a generic model based on ResNet-101 for foreground/background segmentation. Second, we fine-tune this generic model to learn instance-level models and segment individual objects by using augmented object annotations in the first frames of test videos. To distinguish different instances in the same video, we compute a pixel-level score map for each object from these instance-level models. Each score map indicates the objectness likelihood and is only computed within the foreground mask obtained in the first step. To further refine this per-frame score map, we learn a spatial propagation network. This network aims to learn how to propagate a coarse segmentation mask spatially based on the pairwise similarities in each frame. In addition, we apply a filter on the refined score map that aims to recognize the best connected region using spatial and temporal consistencies in the video. Finally, we decide the instance-level object segmentation in each video by comparing score maps of different instances.
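The final decision step can be sketched as follows: restrict each instance's refined score map to the foreground mask, then label every pixel by the instance whose score is highest. The function name and the `min_score` threshold are hypothetical additions for illustration.

```python
import torch

def assign_instances(score_maps, foreground_mask, min_score=0.5):
    """score_maps: (K, H, W) refined per-instance objectness scores;
    foreground_mask: (H, W) binary {0, 1} mask from the generic model.
    Returns an (H, W) map with values in {0 (background), 1..K}."""
    masked = score_maps * foreground_mask.unsqueeze(0)  # zero out background
    best_score, best_idx = masked.max(dim=0)            # per-pixel winner
    labels = best_idx + 1                               # instance ids start at 1
    labels[best_score < min_score] = 0                  # low confidence -> bg
    return labels
```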
Compositing is one of the most common operations in photo editing. To generate realistic composites, the appearances of foreground and background need to be adjusted to make them compatible. Previous approaches to harmonizing composites have focused on learning statistical relationships between hand-crafted appearance features of the foreground and background, which is unreliable, especially when the contents in the two layers are vastly different. In this work, we propose an end-to-end deep convolutional neural network for image harmonization, which can capture both the context and semantic information of the composite images during harmonization. We also introduce an efficient way to collect large-scale and high-quality training data that can facilitate the training process. Experiments on the synthesized dataset and real composite images show that the proposed network outperforms previous state-of-the-art methods.
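A minimal sketch of the harmonization setup: an encoder-decoder takes the composite image together with the foreground mask marking the pasted region and predicts the color-adjusted image. The layer configuration is an assumed illustration, not the paper's network.

```python
import torch
import torch.nn as nn

class HarmonizationNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Input: 3 RGB channels + 1 mask channel marking the pasted region.
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, composite, mask):
        # Conditioning on the mask lets the network adjust the foreground
        # while drawing context and semantics from the background.
        x = torch.cat([composite, mask], dim=1)
        return self.decoder(self.encoder(x))  # harmonized RGB output
```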