
Visual question answering (VQA) is challenging not only because the model has to handle multi-modal information, but also because it is hard to collect sufficient training examples -- there are too many questions one can ask about an image. As a result, a VQA model trained solely on human-annotated examples can easily overfit the specific question styles or image contents being asked about, leaving the model largely ignorant of the sheer diversity of questions. Existing methods address this issue primarily by introducing an auxiliary task such as visual grounding, cycle consistency, or debiasing. In this paper, we take a drastically different approach. We find that many of the unknowns to the learned VQA model are in fact implicitly known in the dataset. For instance, questions asking about the same object in different images are likely paraphrases, and the number of detected or annotated objects in an image already provides the answer to a "how many" question, even if that question has not been annotated for the image. Building upon these insights, we present SimpleAug, a simple data augmentation pipeline that turns this known knowledge into training examples for VQA. We show that these augmented examples can notably improve the learned VQA model's performance, not only on the VQA-CP dataset with language prior shifts but also on the VQA v2 dataset without such shifts. Our method further opens the door to leveraging weakly-labeled or unlabeled images in a principled way to enhance VQA models. Our code and data are publicly available at https://github.com/heendung/simpleAUG.
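The counting-based augmentation can be sketched in a few lines; the annotation format and function name below are hypothetical illustrations, and the real SimpleAug pipeline also generates paraphrase-based examples as described above:

```python
from collections import Counter

def augment_counting_questions(object_annotations):
    """Generate 'How many ...?' QA pairs from object annotations.

    `object_annotations` maps an image id to a list of annotated
    object category names (a hypothetical format for illustration).
    Counting the annotations yields the answer for free, even when
    no counting question was ever asked about that image.
    """
    augmented = []
    for image_id, objects in object_annotations.items():
        for category, count in Counter(objects).items():
            augmented.append({
                "image_id": image_id,
                "question": f"How many {category}s are in the image?",
                "answer": str(count),
            })
    return augmented

examples = augment_counting_questions({"img_001": ["dog", "dog", "cat"]})
```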
We consider an optomechanical system comprising a single cavity mode and a dense spectrum of acoustic modes and solve for the quantum dynamics of initial cavity mode Fock (i.e., photon number) superposition states and thermal acoustic states. The optomechanical interaction results in dephasing without damping and bears some analogy to gravitational decoherence. For a cavity mode locally coupled to a one-dimensional (1D) elastic string-like environment or two-dimensional (2D) elastic membrane-like environment, we find that the dephasing dynamics depends respectively on the string length and membrane area--a consequence of an infrared divergence in the limit of an infinite-sized string or membrane. On the other hand, for a cavity mode locally coupled to a three-dimensional (3D) bulk elastic solid, the dephasing dynamics is independent of the solid volume (i.e., is infrared finite), but dependent on the local geometry of the coupled cavity--a consequence of an ultraviolet divergence in the limit of a pointlike coupled cavity. As possible realizations of the cavity-coupled 1D and 2D acoustic environments, we consider, respectively, an LC oscillator capacitively coupled to a partially metallized strip and a cavity light mode interacting via light pressure with a membrane.
The reionization process is expected to be prolonged by the small-scale absorbers (SSAs) of ionizing photons, which have been seen as Lyman-limit systems in quasar absorption line observations. We use a set of semi-numerical simulations to investigate the effects of absorption systems on the reionization process, especially their impacts on the neutral islands during the late epoch of reionization (EoR). Three models are studied, i.e., the extreme no-SSA model with a high level of ionizing background, the moderate-SSA model with a relatively high level of ionizing background, and the dense-SSA model with a low level of ionizing background. We find that while the characteristic scale of neutral regions decreases during the early and middle stages of reionization, it stays nearly unchanged at about 10 comoving Mpc during the late stage for the no-SSA and moderate-SSA models. However, in the case of the weak ionizing background in the dense-SSA model, the characteristic island scale shows obvious evolution, as large islands break into many small ones that are slowly ionized. The evolutionary behavior of neutral islands during the late EoR thus provides a novel way to constrain the abundance of SSAs. We discuss the 21-cm observation with the upcoming Square Kilometre Array (SKA). The different models can be distinguished by the 21-cm power spectrum measurement, and it is also possible to extract the characteristic island scale from the imaging observation with a proper choice of the 21-cm brightness threshold.
Qingjia Zhou, Lei Gao, Yadong Xu (2021)
Phase gradient metagratings/metasurfaces (PGMs) have provided a new paradigm for light manipulation. In this work, we show the existence of a gauge invariance in PGMs, i.e., the diffraction law of PGMs is independent of the choice of the initial value of the abrupt phase shift that induces the phase gradient. This gauge invariance ensures that the well-studied ordinary metallic grating can be regarded as a PGM, with diffraction properties that can be fully predicted by the generalized diffraction law with a phase gradient. The generalized diffraction law offers new insight into the famous effect of Wood's anomalies and the Rayleigh conjecture.
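The gauge-invariance claim is easiest to parse against the well-known generalized law of refraction for a phase-discontinuity interface (a standard result quoted here for context, not this paper's full PGM diffraction law): a phase profile $\Phi(x)$ bends light according to

```latex
\sin\theta_t - \sin\theta_i = \frac{\lambda_0}{2\pi}\,\frac{\mathrm{d}\Phi}{\mathrm{d}x},
```

with higher diffraction orders of a periodic PGM of period $\Lambda$ adding the usual grating term $m\lambda_0/\Lambda$. Since only the gradient $\mathrm{d}\Phi/\mathrm{d}x$ enters, adding a constant offset to the phase profile (its "initial value") leaves the diffraction angles unchanged, which is the gauge invariance in question.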
Minghan Yang, Dong Xu, Qiwen Cui (2021)
In this paper, a novel second-order method called NG+ is proposed. Following the rule ``the shape of the gradient equals the shape of the parameter'', we define a generalized Fisher information matrix (GFIM) using products of gradients in matrix form rather than the traditional vectorization. Our generalized natural gradient direction is then simply the inverse of the GFIM multiplied by the gradient in matrix form. Moreover, the GFIM and its inverse are kept fixed for multiple steps, so the computational cost can be controlled and is comparable with first-order methods. Global convergence is established under mild conditions, and a regret bound is also given for the online learning setting. Numerical results on image classification with ResNet50, quantum chemistry modeling with SchNet, neural machine translation with Transformer, and a recommendation system with DLRM illustrate that NG+ is competitive with state-of-the-art methods.
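The matrix-form direction can be sketched in a few lines of NumPy. This is only an illustration of the shape-preserving idea: the actual NG+ update (GFIM refresh interval, block structure, step sizes) is more involved, and the `damping` term here is an assumption added for numerical stability:

```python
import numpy as np

def ngplus_direction(grads, damping=1e-3):
    """Matrix-form natural-gradient direction (illustrative sketch).

    `grads`: list of per-sample gradients, each of shape (m, n), so
    the returned direction has 'the shape of the parameter' (m, n).
    """
    m = grads[0].shape[0]
    # GFIM built from products of matrix-form gradients: an (m x m)
    # matrix instead of the (mn x mn) Fisher of the vectorized view.
    gfim = sum(g @ g.T for g in grads) / len(grads) + damping * np.eye(m)
    mean_grad = sum(grads) / len(grads)
    # Direction = GFIM^{-1} times the gradient, still shaped (m, n).
    return np.linalg.solve(gfim, mean_grad)
```

Because the curvature matrix is only m x m, inverting (or caching) it for several steps keeps the per-step cost close to a first-order method, which is the point the abstract makes.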
Jinyang Guo, Dong Xu, Guo Lu (2021)
In this paper, we propose a new deep image compression framework called Complexity and Bitrate Adaptive Network (CBANet), which aims to learn one single network to support variable bitrate coding under different computational complexity constraints. In contrast to the existing state-of-the-art learning based image compression frameworks that only consider the rate-distortion trade-off without introducing any constraint related to computational complexity, our CBANet considers the trade-off between rate and distortion under dynamic computational complexity constraints. Specifically, to decode images with one single decoder under various computational complexity constraints, we propose a new multi-branch complexity adaptive module, in which each branch only takes a small portion of the computational budget of the decoder. Reconstructed images with different visual qualities can then be readily generated by using different numbers of branches. Furthermore, to achieve variable bitrate decoding with one single decoder, we propose a bitrate adaptive module to project the representation from a base bitrate to the expected representation at a target bitrate for transmission. It then projects the transmitted representation at the target bitrate back to that at the base bitrate for the decoding process. The proposed bitrate adaptive module can significantly reduce the storage requirement for deployment platforms. As a result, our CBANet enables one single codec to support multiple bitrate decoding under various computational complexity constraints. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of our CBANet for deep image compression.
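The multi-branch idea can be illustrated with a toy sketch. Each branch here is just a linear map whose outputs are summed (the real decoder branches are convolutional networks, and the names are hypothetical); activating more branches spends more compute and refines the reconstruction:

```python
import numpy as np

def multi_branch_decode(feature, branch_weights, num_branches):
    """Decode with only the first `num_branches` branches active.

    `feature`: 1D latent vector; `branch_weights`: list of (out, in)
    matrices, one per branch. Summing more branch outputs costs more
    FLOPs but yields a richer reconstruction (illustrative only).
    """
    out = np.zeros(branch_weights[0].shape[0])
    for w in branch_weights[:num_branches]:
        out += w @ feature
    return out
```

The same decoder weights serve every complexity budget: the deployment platform simply picks `num_branches` to match its available compute, which mirrors CBANet's one-codec-many-budgets goal.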
Zhihao Hu, Guo Lu, Dong Xu (2021)
Learning based video compression has attracted increasing attention in the past few years. Previous hybrid coding approaches rely on pixel space operations to reduce spatial and temporal redundancy, which may suffer from inaccurate motion estimation or less effective motion compensation. In this work, we propose a feature-space video coding network (FVC) by performing all major operations (i.e., motion estimation, motion compression, motion compensation and residual compression) in the feature space. Specifically, in the proposed deformable compensation module, we first apply motion estimation in the feature space to produce motion information (i.e., the offset maps), which is compressed by using an auto-encoder style network. Then we perform motion compensation by using deformable convolution and generate the predicted feature. After that, we compress the residual feature between the feature from the current frame and the predicted feature from our deformable compensation module. For better frame reconstruction, the reference features from multiple previously reconstructed frames are also fused by using the non-local attention mechanism in the multi-frame feature fusion module. Comprehensive experimental results demonstrate that the proposed framework achieves state-of-the-art performance on four benchmark datasets including HEVC, UVG, VTL and MCL-JCV.
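A toy sketch of feature-space motion compensation: FVC uses learned deformable convolution on multi-channel features, whereas this nearest-neighbour gather with integer offsets only illustrates the mechanism of warping a reference feature map by per-position motion:

```python
import numpy as np

def warp_features(reference, offsets):
    """Motion-compensate a feature map with per-position offsets.

    `reference`: (H, W) feature map; `offsets`: (H, W, 2) integer
    displacements (dy, dx) telling each output position where to
    sample from in the reference (clipped at the borders).
    """
    h, w = reference.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys + offsets[..., 0], 0, h - 1)
    src_x = np.clip(xs + offsets[..., 1], 0, w - 1)
    return reference[src_y, src_x]
```

In the actual codec the offset maps are themselves compressed and transmitted, and the residual between the warped prediction and the current frame's features is what remains to be coded.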
In recommendation systems, the existence of the missing-not-at-random (MNAR) problem results in the selection bias issue, ultimately degrading recommendation performance. A common practice to address MNAR is to treat missing entries from the so-called exposure perspective, i.e., modeling how an item is exposed (provided) to a user. Most of the existing approaches use heuristic models or a re-weighting strategy on observed ratings to mimic the missing-at-random setting. However, little research has been done to reveal how the ratings are missing from a causal perspective. To bridge the gap, we propose an unbiased and robust method called DENC (De-bias Network Confounding in Recommendation) inspired by confounder analysis in causal inference. In general, DENC provides a causal analysis of MNAR from both the inherent factors (e.g., latent user or item factors) and the auxiliary network perspective. In particular, the proposed exposure model in DENC can control the social network confounder while preserving the observed exposure information. We also develop a deconfounding model through balanced representation learning to retain the primary user and item features, which enables DENC to generalize well on the rating prediction. Extensive experiments on three datasets validate that our proposed model outperforms the state-of-the-art baselines.
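DENC's exposure model plays a role analogous to a propensity score. A minimal inverse-propensity-weighting sketch (a standard MNAR debiasing device mentioned above as the "re-weighting strategy", not DENC's actual objective) looks like this:

```python
import numpy as np

def ipw_loss(ratings, predictions, exposure_probs, eps=1e-6):
    """Inverse-propensity-weighted squared error over observed entries.

    `exposure_probs[i]` estimates how likely rating i was to be
    observed; in DENC such probabilities would come from an exposure
    model that controls the social-network confounder, while here
    they are simply given. Weighting by 1/propensity up-weights
    rarely exposed items, counteracting MNAR selection bias.
    """
    weights = 1.0 / np.clip(exposure_probs, eps, 1.0)
    return np.mean(weights * (ratings - predictions) ** 2)
```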
Zizheng Que, Guo Lu, Dong Xu (2021)
In this paper, we propose a two-stage deep learning framework called VoxelContext-Net for both static and dynamic point cloud compression. Taking advantage of both octree based methods and voxel based schemes, our approach employs the voxel context to compress the octree structured data. Specifically, we first extract the local voxel representation that encodes the spatial neighbouring context information for each node in the constructed octree. Then, in the entropy coding stage, we propose a voxel context based deep entropy model to compress the symbols of non-leaf nodes in a lossless way. Furthermore, for dynamic point cloud compression, we additionally introduce the local voxel representations from the temporally neighbouring point clouds to exploit temporal dependency. More importantly, to alleviate the distortion from the octree construction procedure, we propose a voxel context based 3D coordinate refinement method to produce a more accurate reconstructed point cloud at the decoder side, which is applicable to both static and dynamic point cloud compression. Comprehensive experiments on both static and dynamic point cloud benchmark datasets (e.g., ScanNet and Semantic KITTI) clearly demonstrate the effectiveness of our newly proposed method VoxelContext-Net for 3D point cloud geometry compression.
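The "local voxel representation" can be sketched as a neighbourhood crop from a dense occupancy grid; the actual method works on octree nodes at multiple depths, and the grid-based formulation and names below are illustrative assumptions:

```python
import numpy as np

def local_voxel_context(occupancy, center, size=3):
    """Extract the size^3 occupancy neighbourhood around an octree node.

    `occupancy`: dense (D, H, W) binary grid; `center`: (z, y, x)
    voxel coordinate of the node. The context cube (zero-padded at
    the borders) is what a deep entropy model would condition on
    when predicting the node's occupancy symbol.
    """
    r = size // 2
    padded = np.pad(occupancy, r)
    z, y, x = (c + r for c in center)
    return padded[z - r:z + r + 1, y - r:y + r + 1, x - r:x + r + 1]
```

Conditioning the entropy model on this spatial context (and, for dynamic clouds, on the same crop from the previous frame) sharpens the symbol probabilities and hence lowers the lossless coding rate.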
Object handover is a common human collaborative behavior that attracts attention from researchers in robotics and cognitive science. Though visual perception plays an important role in the object handover task, the whole handover process has rarely been specifically explored. In this work, we propose a novel richly annotated dataset, H2O, for visual analysis of human-human object handovers. H2O, which contains 18K video clips involving 15 people who hand over 30 objects to each other, is a multi-purpose benchmark. It can support several vision-based tasks, among which we specifically provide a baseline method, RGPNet, for a less-explored task named Receiver Grasp Prediction. Extensive experiments show that RGPNet can produce plausible grasps based on the giver's hand-object states in the pre-handover phase. Besides, we also report hand and object pose errors with existing baselines and show that the dataset can serve as video demonstrations for robot imitation learning on the handover task. Dataset, model and code will be made public.