Popular unsupervised embedding learning methods focus on enhancing the instance-level local discrimination of the given unlabeled images by exploring various negative data. However, sample outliers that exhibit large intra-class divergences or small inter-class variations severely limit their learning performance. We show that this limitation is caused by vanishing gradients on these sample outliers. Moreover, the shortage of positive data and the disregard for global discrimination also pose critical issues for unsupervised learning, yet existing methods largely ignore them. To handle these issues, we propose a novel solution that explicitly models and directly explores the uncertainty of the given unlabeled learning samples. Instead of learning a deterministic feature point for each sample in the embedding space, we represent a sample by a stochastic Gaussian whose mean vector gives its location in the space and whose covariance vector represents the sample uncertainty. We leverage this uncertainty modeling as momentum for learning, which helps to tackle the outliers. Furthermore, abundant positive candidates can be readily drawn from the learned instance-specific distributions, and these are adopted to mitigate the aforementioned issues. Thorough analyses of the rationale and extensive experiments are presented to verify the effectiveness of our approach.
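As a rough illustration of drawing positive candidates from an instance-specific Gaussian, the following sketch samples from a diagonal Gaussian by shifting and scaling standard normal noise. The log-variance parameterization, the function name `sample_positives`, and the example values are assumptions made for illustration; the abstract does not specify how the covariance vector is parameterized or how samples are used in the loss.

```python
import math
import random

def sample_positives(mean, log_var, k, rng):
    """Draw k samples from N(mean, diag(exp(log_var))).

    Each sample is mean + std * eps with eps ~ N(0, 1) per dimension.
    `log_var` (log of the per-dimension variance) is an assumed
    parameterization, chosen here only to keep the variance positive.
    """
    std = [math.exp(0.5 * lv) for lv in log_var]
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
        for _ in range(k)
    ]

# Hypothetical 3-dimensional embedding with small per-dimension uncertainty.
rng = random.Random(0)
mean = [0.5, -1.0, 2.0]
log_var = [-2.0, -2.0, -4.0]
positives = sample_positives(mean, log_var, k=4, rng=rng)
```

A sample with larger variance yields more widely scattered candidates, which is one way to read the abstract's link between uncertainty and outliers.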
Previous cycle-consistency correspondence learning methods usually leverage image patches for training. In this paper, we present a fully convolutional method, which is simpler and more consistent with the inference process. Directly applying fully convolutional training results in model collapse. We study the underlying reason for this collapse and find that the absolute positions of pixels provide a shortcut to easily accomplish cycle-consistency, which hinders the learning of meaningful visual representations. To break this absolute position shortcut, we propose to apply different crops for forward and backward frames, and adopt feature warping to establish correspondence between two crops of the same frame. The former technique forces the corresponding pixels on the forward and backward tracks to have different absolute positions, and the latter effectively blocks the shortcuts between the forward and backward tracks. On three label propagation benchmarks for pose tracking, face landmark tracking and video object segmentation, our method substantially improves the results of the vanilla fully convolutional cycle-consistency method, achieving performance that is very competitive with self-supervised state-of-the-art approaches.
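To make the cycle-consistency idea concrete, here is a minimal sketch of a generic round-trip objective between two sets of per-pixel features. It is not the authors' exact loss: the softmax affinity, the temperature value, and all function names are assumptions, and the cropping and feature-warping steps described above are not modeled.

```python
import math

def softmax_rows(matrix):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in matrix:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def affinity(src, dst, temperature):
    """Row-stochastic transition matrix from src pixels to dst pixels."""
    sims = [
        [sum(a * b for a, b in zip(u, v)) / temperature for v in dst]
        for u in src
    ]
    return softmax_rows(sims)

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def cycle_loss(feat_a, feat_b, temperature=0.1):
    """Mean negative log-probability that each pixel of frame A returns
    to itself after walking A -> B -> A."""
    round_trip = matmul(
        affinity(feat_a, feat_b, temperature),
        affinity(feat_b, feat_a, temperature),
    )
    n = len(feat_a)
    return -sum(math.log(round_trip[i][i]) for i in range(n)) / n
```

When frame B carries no distinguishing information (all of its pixels share one feature), the round trip is uniform and the loss for two pixels equals log 2; when the features match up to a permutation, the loss is close to zero.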
Assessing action quality from videos has attracted growing attention in recent years. Most existing approaches tackle this problem with regression algorithms, which ignore the intrinsic ambiguity in the score labels caused by multiple judges or their subjective appraisals. To address this issue, we propose an uncertainty-aware score distribution learning (USDL) approach for action quality assessment (AQA). Specifically, we regard an action as an instance associated with a score distribution, which describes the probability of different evaluated scores. Moreover, when fine-grained score labels are available (e.g., the difficulty degree of an action or multiple scores from different judges), we further devise a multi-path uncertainty-aware score distribution learning (MUSDL) method to explore the disentangled components of a score. We conduct experiments on three AQA datasets containing various Olympic actions and surgical activities, where our approaches set a new state of the art under Spearman's rank correlation.
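The idea of treating a score as a distribution rather than a single number can be sketched as follows. The Gaussian shape, the bin layout, the value of `sigma`, and the use of KL divergence as the training signal are assumptions for illustration; the abstract only states that an action is associated with a distribution over possible scores.

```python
import math

def score_to_distribution(score, bins, sigma):
    """Turn a scalar score into a normalized distribution over score bins,
    using a Gaussian bump centered on the labeled score (assumed shape)."""
    weights = [math.exp(-0.5 * ((b - score) / sigma) ** 2) for b in bins]
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions on the same bins."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )

# Hypothetical score range 0..100 with one bin per integer score.
bins = [float(b) for b in range(0, 101)]
target = score_to_distribution(85.0, bins, sigma=5.0)
predicted = score_to_distribution(80.0, bins, sigma=5.0)
loss = kl_divergence(target, predicted)
```

A wider `sigma` expresses more disagreement among judges about the same performance, which a single regression target cannot represent.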
Yansong Tang, Jiwen Lu, Jie Zhou (2020)
Thanks to the large and rapidly growing number of instructional videos on the Internet, novices are able to acquire knowledge for completing various tasks. Over the past decade, growing efforts have been devoted to the problem of instructional video analysis. However, most existing datasets in this area are limited in diversity and scale, which keeps them far from many real-world applications where more diverse activities occur. To address this, we present a large-scale dataset named COIN for COmprehensive INstructional video analysis. Organized with a hierarchical structure, the COIN dataset contains 11,827 videos of 180 tasks in 12 domains (e.g., vehicles, gadgets, etc.) related to our daily life. With a newly developed toolbox, all the videos are annotated efficiently with a series of step labels and the corresponding temporal boundaries. To provide a benchmark for instructional video analysis, we evaluate a range of approaches on the COIN dataset under five different settings. Furthermore, we exploit two important characteristics (i.e., task-consistency and ordering-dependency) for localizing important steps in instructional videos. Accordingly, we propose two simple yet effective methods, which can be easily plugged into conventional proposal-based action detection models. We believe the introduction of the COIN dataset will promote future in-depth research on instructional video analysis for the community. Our dataset, annotation toolbox and source code are available at http://coin-dataset.github.io.
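Step localization with temporal boundaries is commonly scored by the overlap between a predicted segment and an annotated one. The sketch below computes temporal intersection-over-union for two segments given as (start, end) pairs in seconds. The abstract does not name the metric used for COIN, so treating temporal IoU as the matching criterion, along with the function name and tuple layout, is an assumption.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two time segments given as (start, end)."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical predicted and annotated boundaries for one step, in seconds.
overlap = temporal_iou((0.0, 10.0), (5.0, 15.0))
```

A proposal is then typically counted as correct when its overlap with an annotated step of the same label exceeds a chosen threshold.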
There is a substantial number of instructional videos on the Internet, which enable us to acquire knowledge for completing various tasks. However, most existing datasets for instructional video analysis are limited in diversity and scale, which keeps them far from many real-world applications where more diverse activities occur. Moreover, it remains a great challenge to organize and harness such data. To address these problems, we introduce a large-scale dataset called COIN for COmprehensive INstructional video analysis. Organized with a hierarchical structure, the COIN dataset contains 11,827 videos of 180 tasks in 12 domains (e.g., vehicles, gadgets, etc.) related to our daily life. With a newly developed toolbox, all the videos are annotated effectively with a series of step descriptions and the corresponding temporal boundaries. Furthermore, we propose a simple yet effective method to capture the dependencies among different steps, which can be easily plugged into conventional proposal-based action detection methods for localizing important steps in instructional videos. To provide a benchmark for instructional video analysis, we evaluate a range of approaches on the COIN dataset under different evaluation criteria. We expect the introduction of the COIN dataset will promote future in-depth research on instructional video analysis for the community.