Graph Convolution Network (GCN) has been successfully used for 3D human pose estimation in videos. However, it is often built on a fixed human-joint affinity defined by the human skeleton, which may limit the capacity of GCN to adapt to complex spatio-temporal pose variations in videos. To alleviate this problem, we propose a novel Dynamical Graph Network (DG-Net), which can dynamically identify human-joint affinity and estimate 3D pose by adaptively learning spatial/temporal joint relations from videos. Different from traditional graph convolution, we introduce Dynamical Spatial/Temporal Graph convolution (DSG/DTG) to discover the spatial/temporal human-joint affinity for each video exemplar, depending on the spatial distance/temporal movement similarity between human joints in this video. Hence, they can effectively understand which joints are spatially closer and/or have consistent motion, reducing depth ambiguity and/or motion uncertainty when lifting 2D pose to 3D pose. We conduct extensive experiments on three popular benchmarks, i.e., Human3.6M, HumanEva-I, and MPI-INF-3DHP, where DG-Net outperforms a number of recent SOTA approaches with fewer input frames and a smaller model size.
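The DSG/DTG mechanism is only described at a high level above. As a rough illustration of the core idea, here is a minimal sketch (in PyTorch) of a spatial graph convolution whose joint affinity is predicted per sample from pairwise feature distances instead of a fixed skeleton graph; the class name, shapes, and the softmax-over-distance choice are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicalSpatialGraphConv(nn.Module):
    """Sketch: graph conv whose joint affinity is computed per sample from
    pairwise feature distances, rather than read from a fixed skeleton."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)        # feature transform
        self.scale = nn.Parameter(torch.tensor(1.0))  # learnable softness

    def forward(self, x):
        # x: (batch, joints, in_dim) per-joint features lifted from 2D pose
        dist = torch.cdist(x, x, p=2) ** 2             # (B, J, J) pairwise distances
        # Spatially closer joints receive larger affinity.
        affinity = F.softmax(-self.scale * dist, dim=-1)
        # Aggregate neighbor features with the dynamic affinity.
        return F.relu(affinity @ self.proj(x))         # (B, J, out_dim)

# Example: a batch of 8 poses with 17 joints and 64-dim joint features.
layer = DynamicalSpatialGraphConv(64, 128)
print(layer(torch.randn(8, 17, 64)).shape)             # torch.Size([8, 17, 128])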
Self-supervised Multi-view Stereo (MVS) with a pretext task of image reconstruction has achieved significant progress recently. However, previous methods are built upon intuitions, lacking comprehensive explanations of why the pretext task is effective in self-supervised MVS. To this end, we propose to estimate epistemic uncertainty in self-supervised MVS, accounting for what the model ignores. Specifically, the limitations can be categorized into two types: ambiguous supervision in the foreground and invalid supervision in the background. To address these issues, we propose a novel Uncertainty-reduction Multi-view Stereo (U-MVS) framework for self-supervised learning. To alleviate ambiguous supervision in the foreground, we involve an extra correspondence prior with a flow-depth consistency loss: the dense 2D correspondence of optical flow is used to regularize the 3D stereo correspondence in MVS. To handle invalid supervision in the background, we use Monte-Carlo Dropout to acquire an uncertainty map and further filter out unreliable supervision signals on invalid regions. Extensive experiments on the DTU and Tanks&Temples benchmarks show that our U-MVS framework achieves the best performance among unsupervised MVS methods, with competitive performance against its supervised counterparts.
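As a concrete illustration of the Monte-Carlo Dropout step mentioned above, the sketch below shows one common way to turn repeated stochastic forward passes into a per-pixel uncertainty map that can mask out unreliable self-supervision; the function name, threshold, and sample count are assumptions for illustration, not the U-MVS code.

import torch
import torch.nn as nn

def mc_dropout_uncertainty(depth_net: nn.Module, image: torch.Tensor,
                           num_samples: int = 8, threshold: float = 0.05):
    """Sketch: run a dropout-equipped depth network several times and use the
    per-pixel variance of the predictions as an epistemic uncertainty map."""
    depth_net.train()  # keep dropout layers stochastic at inference time
    with torch.no_grad():
        depths = torch.stack([depth_net(image) for _ in range(num_samples)], dim=0)
    mean_depth = depths.mean(dim=0)
    uncertainty = depths.var(dim=0)              # high variance = unreliable pixel
    valid_mask = (uncertainty < threshold).float()
    # valid_mask can then gate the photometric / reconstruction loss.
    return mean_depth, uncertainty, valid_mask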
3D convolution is powerful for video classification but often computationally expensive; recent studies therefore mainly focus on decomposing it along the spatial-temporal and/or channel dimensions. Unfortunately, most approaches fail to achieve a preferable balance between convolutional efficiency and feature-interaction sufficiency. For this reason, we propose a concise and novel Channel Tensorization Network (CT-Net), which treats the channel dimension of the input feature as a product of K sub-dimensions. On one hand, it naturally factorizes convolution in a multi-dimensional manner, leading to a light computational burden. On the other hand, it can effectively enhance feature interaction across different channels, and progressively enlarge the 3D receptive field of such interaction to boost classification accuracy. Furthermore, we equip our CT-Module with a Tensor Excitation (TE) mechanism. It can learn to exploit spatial, temporal and channel attention in a high-dimensional manner, improving the cooperative power of all the feature dimensions in our CT-Module. Finally, we flexibly adapt ResNet as our CT-Net. Extensive experiments are conducted on several challenging video benchmarks, e.g., Kinetics-400, Something-Something V1 and V2. Our CT-Net outperforms a number of recent SOTA approaches in terms of accuracy and/or efficiency. The code and models will be available at https://github.com/Andy1621/CT-Net.
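To make the channel-tensorization idea more tangible, the following sketch factorizes the channel dimension into two sub-dimensions (K = 2) and lets cheap grouped convolutions interact along one sub-dimension at a time, with a channel permutation in between; this is a simplified reading of the abstract with assumed names and kernel choices, not the actual CT-Net block.

import torch
import torch.nn as nn

class ChannelTensorizedConv(nn.Module):
    """Sketch: treat C = k1 * k2 and mix one channel sub-dimension per grouped
    conv, with a permutation so both sub-dimensions eventually interact."""

    def __init__(self, channels: int, k1: int):
        super().__init__()
        assert channels % k1 == 0
        self.k1, self.k2 = k1, channels // k1
        # groups=k2: each group only mixes the k1 sub-dimension (cheap).
        self.conv_a = nn.Conv3d(channels, channels, 1, groups=self.k2)
        # groups=k1: after permutation, this mixes the k2 sub-dimension.
        self.conv_b = nn.Conv3d(channels, channels, 1, groups=self.k1)

    def _permute_subdims(self, x):
        b, c, t, h, w = x.shape
        # Reorder the tensorized channel axis from (k2, k1) to (k1, k2).
        return x.view(b, self.k2, self.k1, t, h, w).transpose(1, 2).reshape(b, c, t, h, w)

    def forward(self, x):                          # x: (B, C, T, H, W)
        return self.conv_b(self._permute_subdims(self.conv_a(x)))

video_feat = torch.randn(2, 64, 8, 14, 14)
print(ChannelTensorizedConv(64, k1=8)(video_feat).shape)   # (2, 64, 8, 14, 14)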
The end-to-end Human Mesh Recovery (HMR) approach has been successfully used for 3D body reconstruction. However, most HMR-based frameworks reconstruct the human body by directly learning mesh parameters from images or videos, while lacking explicit guidance from 3D human pose in the visual data. As a result, the generated mesh often exhibits incorrect pose for complex activities. To tackle this problem, we propose to exploit 3D pose to calibrate the human mesh. Specifically, we develop two novel Pose Calibration frameworks, i.e., Serial PC-HMR and Parallel PC-HMR. By coupling advanced 3D pose estimators and HMR in a serial or parallel manner, these two frameworks can effectively correct the human mesh with the guidance of a concise pose calibration module. Furthermore, since the calibration module is designed via non-rigid pose transformation, our PC-HMR frameworks can flexibly tackle bone-length variations to alleviate misplacement in the calibrated mesh. Finally, our frameworks are based on a generic and complementary integration of data-driven learning and geometrical modeling. Via plug-and-play modules, they can be efficiently adapted for both image- and video-based human mesh recovery. Additionally, they require no extra 3D pose annotations in the testing phase, which relieves inference difficulties in practice. We perform extensive experiments on the popular benchmarks, i.e., Human3.6M, 3DPW and SURREAL, where our PC-HMR frameworks achieve the SOTA results.
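The serial variant can be pictured as a pipeline in which an external 3D pose estimate guides a small correction network applied to the joints regressed from the mesh. The sketch below is only a schematic of that calibration step under assumed shapes and names (SerialPoseCalibration, 17 joints); the actual PC-HMR module uses a non-rigid pose transformation rather than this plain MLP.

import torch
import torch.nn as nn

class SerialPoseCalibration(nn.Module):
    """Sketch: refine HMR mesh joints using an externally estimated 3D pose."""

    def __init__(self, num_joints: int = 17):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Linear(num_joints * 6, 256), nn.ReLU(),
            nn.Linear(256, num_joints * 3),
        )

    def forward(self, mesh_joints, pose_3d):
        # mesh_joints, pose_3d: (B, J, 3) from the HMR branch / pose estimator.
        b, j, _ = mesh_joints.shape
        inp = torch.cat([mesh_joints, pose_3d], dim=-1).flatten(1)
        offsets = self.refine(inp).view(b, j, 3)   # pose-guided corrections
        return mesh_joints + offsets               # calibrated joints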
A preference-based multi-objective evolutionary algorithm is proposed for generating solutions in an automatically detected knee-point region. It is named Automatic Preference based DI-MOEA (AP-DI-MOEA), where DI-MOEA stands for Diversity-Indicator based Multi-Objective Evolutionary Algorithm. AP-DI-MOEA has two main characteristics: firstly, it generates the preference region automatically during the optimization; secondly, it concentrates the solution set in this preference region. Moreover, the real-world vehicle fleet maintenance scheduling optimization (VFMSO) problem is formulated, and a customized multi-objective evolutionary algorithm (MOEA) is proposed to optimize maintenance schedules of vehicle fleets based on the predicted failure distribution of the components of cars. Furthermore, the customized MOEA for VFMSO is combined with AP-DI-MOEA to find maintenance schedules in the automatically generated preference region. Experimental results on multi-objective benchmark problems and our three-objective real-world application problems show that the newly proposed algorithm can generate the preference region accurately and that it can obtain better solutions in the preference region. In particular, in many cases, under the same budget, the Pareto-optimal solutions obtained by AP-DI-MOEA dominate the solutions obtained by MOEAs that pursue the entire Pareto front.
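For readers unfamiliar with knee-point detection, the sketch below shows one standard way to locate a knee on a bi-objective Pareto front (the point farthest from the line through the two extreme solutions), which is the kind of region AP-DI-MOEA concentrates on; the exact detection rule used by AP-DI-MOEA may differ, so treat this purely as an illustration.

import numpy as np

def knee_point(front: np.ndarray) -> np.ndarray:
    """Sketch: pick the knee of a bi-objective front as the point with the
    largest distance to the line through the two extreme solutions."""
    front = front[np.argsort(front[:, 0])]        # sort by the first objective
    a, b = front[0], front[-1]                    # extreme points of the front
    line = (b - a) / np.linalg.norm(b - a)
    rel = front - a
    # Perpendicular distance of each point to the extreme line (2D cross product).
    dist = np.abs(rel[:, 0] * line[1] - rel[:, 1] * line[0])
    return front[np.argmax(dist)]

front = np.array([[0.0, 1.0], [0.1, 0.55], [0.3, 0.3], [0.55, 0.1], [1.0, 0.0]])
print(knee_point(front))   # [0.3 0.3], the point bulging most toward the ideal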
Temporal convolution has been widely used for video classification. However, it is performed on spatio-temporal contexts in a limited view, which often weakens its capacity to learn video representations. To alleviate this problem, we propose a concise and novel SmallBig network, built on the cooperation of small and big views. For the current time step, the small-view branch is used to learn the core semantics, while the big-view branch is used to capture the contextual semantics. Unlike traditional temporal convolution, the big-view branch can provide the small-view branch with the most activated video features from a broader 3D receptive field. Via aggregating such big-view contexts, the small-view branch can learn more robust and discriminative spatio-temporal representations for video classification. Furthermore, we propose to share convolution between the small- and big-view branches, which improves model compactness as well as alleviates overfitting. As a result, our SmallBigNet achieves a model size comparable to 2D CNNs, while boosting accuracy like 3D CNNs. We conduct extensive experiments on the large-scale video benchmarks, e.g., Kinetics-400, Something-Something V1 and V2. Our SmallBig network outperforms a number of recent state-of-the-art approaches in terms of accuracy and/or efficiency. The code and models will be available at https://github.com/xhl-video/SmallBigNet.
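The cooperation of the two views can be sketched as follows: a shared convolution processes both the current position (small view) and a max-pooled 3D neighborhood that forwards the most activated context (big view). This is a simplified reading of the abstract with assumed kernel sizes, not the released SmallBigNet block.

import torch
import torch.nn as nn

class SmallBigUnit(nn.Module):
    """Sketch: small view learns core semantics, big view max-pools a broader
    3D neighborhood to supply the most activated context; conv is shared."""

    def __init__(self, channels: int):
        super().__init__()
        # Shared 1x3x3 convolution used by both views.
        self.shared_conv = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))
        # Big view: 3x3x3 max pooling keeps the strongest responses in context.
        self.big_pool = nn.MaxPool3d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):                           # x: (B, C, T, H, W)
        small = self.shared_conv(x)                 # core semantics
        big = self.shared_conv(self.big_pool(x))    # most activated context
        return torch.relu(small + big)

clip = torch.randn(2, 64, 8, 28, 28)
print(SmallBigUnit(64)(clip).shape)                 # torch.Size([2, 64, 8, 28, 28])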
Developing angular trapping methods, which enable optical tweezers to rotate a micron-sized bead, is of great importance for studies of biomacromolecules during a wide range of torque-generation processes. Here we report a novel controlled angular trapping method based on composite Janus particles. We used a chemically synthesized Janus particle, which consists of two hemispheres made of polystyrene (PS) and poly(methyl methacrylate) (PMMA) respectively, as a model system to demonstrate this method. Through computational and experimental studies, we demonstrated the feasibility of controlling the rotation of a Janus particle in a linearly polarized laser trap. Our results showed that the Janus particle aligns the interface between its two hemispheres parallel to both the laser propagation direction and the laser polarization direction. In our experiments, the rotational state of the particle can be easily and directly visualized with a CMOS camera, without requiring a complex optical detection system. The rotation of the Janus particle in the laser trap can be fully controlled in real time by controlling the laser polarization direction. Our newly developed angular trapping technique has the great advantage of easy implementation and real-time controllability. Considering the easy chemical synthesis of Janus particles and the straightforward implementation of the angular trapping, this novel method has the potential of becoming a general angular trapping method. We anticipate that this new method will significantly broaden the availability of angular trapping in the biophysics community, and expand the scope of the research that can be enabled by the angular trapping approach.
Ze Yang, 2020
Few-shot object detection is a challenging but realistic scenario, where only a few annotated training images are available for training detectors. A popular approach to handle this problem is transfer learning, i.e., fine-tuning a detector pretrained on a source-domain benchmark. However, such a transferred detector often fails to recognize new objects in the target domain, due to the low data diversity of training samples. To tackle this problem, we propose a novel Context-Transformer within a concise deep transfer framework. Specifically, Context-Transformer can effectively leverage source-domain object knowledge as guidance, and automatically exploit contexts from only a few training images in the target domain. Subsequently, it can adaptively integrate these relational clues to enhance the discriminative power of the detector, in order to reduce object confusion in few-shot scenarios. Moreover, Context-Transformer can be flexibly embedded in popular SSD-style detectors, which makes it a plug-and-play module for end-to-end few-shot learning. Finally, we evaluate Context-Transformer on the challenging settings of few-shot detection and incremental few-shot detection. The experimental results show that our framework outperforms the recent state-of-the-art approaches.
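The context-exploitation step can be pictured as a light attention between per-box features and a handful of pooled context features from the same image, whose result is added back to sharpen classification. The sketch below is an assumed simplification (names, dimensions, and the plain dot-product attention are illustrative), not the published Context-Transformer module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAttention(nn.Module):
    """Sketch: each prior-box feature attends to pooled context features and
    the aggregated context is added back to the box feature (plug-in style)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, box_feats, ctx_feats):
        # box_feats: (B, N_boxes, D), ctx_feats: (B, N_ctx, D)
        attn = self.query(box_feats) @ self.key(ctx_feats).transpose(1, 2)
        attn = F.softmax(attn / box_feats.shape[-1] ** 0.5, dim=-1)
        return box_feats + attn @ ctx_feats         # context-enhanced box features

boxes, context = torch.randn(2, 100, 256), torch.randn(2, 9, 256)
print(ContextAttention(256)(boxes, context).shape)  # torch.Size([2, 100, 256])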
Fine-grained classification is a challenging problem, due to subtle differences among highly confusable categories. Most approaches address this difficulty by learning a discriminative representation of each individual input image. Humans, on the other hand, can effectively identify contrastive clues by comparing image pairs. Inspired by this fact, this paper proposes a simple but effective Attentive Pairwise Interaction Network (API-Net), which can progressively recognize a pair of fine-grained images by interaction. Specifically, API-Net first learns a mutual feature vector to capture semantic differences in the input pair. It then compares this mutual vector with the individual vectors to generate gates for each input image. These distinct gate vectors inherit mutual context on semantic differences, which allows API-Net to attentively capture contrastive clues via pairwise interaction between the two images. Additionally, we train API-Net in an end-to-end manner with a score-ranking regularization, which can further generalize API-Net by taking feature priorities into account. We conduct extensive experiments on five popular benchmarks in fine-grained classification. API-Net outperforms the recent SOTA methods, achieving 90.0% on CUB-200-2011, 93.9% on Aircraft, 95.3% on Stanford Cars, 90.3% on Stanford Dogs, and 88.1% on NABirds.
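The pairwise interaction described above (mutual vector, then per-image gates) can be sketched in a few lines; the layer sizes and the exact way the gated features are combined are assumptions for illustration and may differ in detail from the released API-Net.

import torch
import torch.nn as nn

class PairwiseInteraction(nn.Module):
    """Sketch: a mutual vector summarizes the pair, then sigmoid gates highlight
    the contrastive channels of each image."""

    def __init__(self, dim: int = 2048):
        super().__init__()
        self.mutual = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, dim))

    def forward(self, x1, x2):
        # x1, x2: (B, D) backbone features of the two images in a pair.
        xm = self.mutual(torch.cat([x1, x2], dim=-1))   # mutual feature vector
        g1 = torch.sigmoid(xm * x1)                     # gate for image 1
        g2 = torch.sigmoid(xm * x2)                     # gate for image 2
        # Each image is enhanced by its own gate and by its partner's gate.
        return (x1 + x1 * g1, x1 + x1 * g2,
                x2 + x2 * g2, x2 + x2 * g1)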
Hao Chen, Yali Wang, Guoyou Wang, 2020
Recent development of object detection mainly depends on deep learning with large-scale benchmarks. However, collecting such fully annotated data is often difficult or expensive for real-world applications, which restricts the power of deep neural networks in practice. Alternatively, humans can detect new objects with little annotation burden, since humans often use prior knowledge to identify new objects from a few elaborately annotated examples, and subsequently generalize this capacity by exploiting objects from wild images. Inspired by this procedure of learning to detect, we propose a novel Progressive Object Transfer Detection (POTD) framework. Specifically, we make three main contributions in this paper. First, POTD can effectively leverage various object supervision from different domains in a progressive detection procedure. Via such human-like learning, one can boost a target detection task with few annotations. Second, POTD consists of two delicate transfer stages, i.e., Low-Shot Transfer Detection (LSTD) and Weakly-Supervised Transfer Detection (WSTD). In LSTD, we distill the implicit object knowledge of the source detector to enhance the target detector with few annotations; this effectively warms up WSTD later on. In WSTD, we design a recurrent object labelling mechanism for learning to annotate weakly-labeled images. More importantly, we exploit the reliable object supervision from LSTD, which can further enhance the robustness of the target detector in the WSTD stage. Finally, we perform extensive experiments on a number of challenging detection benchmarks with different settings. The results demonstrate that our POTD outperforms the recent state-of-the-art approaches.
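As a small illustration of the distillation idea in the LSTD stage, the function below regularizes a target (student) detection head toward the source (teacher) head's softened class predictions on the same proposals; the name, temperature, and loss form are generic distillation choices assumed here, not the exact POTD objective.

import torch
import torch.nn.functional as F

def object_knowledge_distillation(student_logits: torch.Tensor,
                                  teacher_logits: torch.Tensor,
                                  temperature: float = 2.0) -> torch.Tensor:
    """Sketch: KL distillation of the source detector's class predictions."""
    # logits: (num_proposals, num_classes) from the two detection heads.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Standard T^2 scaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2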