Identifying common patterns among events is a key ability in human and machine perception, as it underlies intelligent decision making. We propose an approach for learning semantic relational set abstractions on videos, inspired by human learning. We combine visual features with natural language supervision to generate high-level representations of similarities across a set of videos. This allows our model to perform cognitive tasks such as set abstraction (which general concept is in common among a set of videos?), set completion (which new video goes well with the set?), and odd one out detection (which video does not belong to the set?). Experiments on two video benchmarks, Kinetics and Multi-Moments in Time, show that robust and versatile representations emerge when learning to recognize commonalities among sets. We compare our model to several baseline algorithms and show that significant improvements result from explicitly learning relational abstractions with semantic supervision.
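As a rough illustration of the odd-one-out task only, the sketch below picks the item whose embedding is least similar, on average, to the rest of the set. This is a generic similarity baseline, not the model proposed in the abstract; the function names and the toy 3-dimensional "video embeddings" are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def odd_one_out(embeddings):
    """Return the index of the item with the lowest mean similarity to the others."""
    def mean_similarity(i):
        sims = [cosine(embeddings[i], e)
                for j, e in enumerate(embeddings) if j != i]
        return sum(sims) / len(sims)
    return min(range(len(embeddings)), key=mean_similarity)

# Hypothetical embeddings: three similar items and one outlier (index 2).
videos = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 0.1, 1.0],
    [1.0, 0.0, 0.1],
]
odd_index = odd_one_out(videos)
print(odd_index)
```

A learned relational abstraction, as described in the abstract, would replace the fixed cosine comparison with a representation trained to capture what the set has in common.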
Temporal relational reasoning, the ability to link meaningful transformations of objects or entities over time, is a fundamental property of intelligent species. In this paper, we introduce an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales. We evaluate TRN-equipped networks on activity recognition tasks using three recent video datasets - Something-Something, Jester, and Charades - which fundamentally depend on temporal relational reasoning. Our results demonstrate that the proposed TRN gives convolutional neural networks a remarkable capacity to discover temporal relations in videos. Through only sparsely sampled video frames, TRN-equipped networks can accurately predict human-object interactions in the Something-Something dataset and identify various human gestures on the Jester dataset with very competitive performance. TRN-equipped networks also outperform two-stream networks and 3D convolution networks in recognizing daily activities in the Charades dataset. Further analyses show that the models learn intuitive and interpretable visual common sense knowledge in videos.
We analyze the additional effect on planetary atmospheres of recently detected gamma-ray burst afterglow photons in the range up to 1 TeV. For an Earth-like atmosphere we find that there is a small additional depletion in ozone versus that modeled for only prompt emission. We also find a small enhancement of muon flux at the planet surface. Overall, we conclude that the additional afterglow emission, even with TeV photons, does not result in a significantly larger impact over that found in past studies.
We report the XMM-Newton detection of a moderately bright X-ray source superimposed on the outer arms of the inactive spiral galaxy MCG-03-34-63 (z=0.0213). It is clearly offset from the nucleus (by about 19) but well within the D25 ellipse of the galaxy, just along its bar axis. The field has also been observed with the HST, enabling us to compute a lower limit of > 94 on the X-ray to optical flux ratio which, together with the X-ray spectrum of the source, argues against a background AGN. On the other hand, the detection of excess X-ray absorption and the lack of a bright optical counterpart argue against foreground contamination. Short-timescale variability is observed, ruling out the hypothesis of a particularly powerful supernova. If it is associated with the apparent host galaxy, the source is the most powerful ULX detected so far, with a peak luminosity of 1.35x10^41 erg/s in the 0.5-7 keV band. If confirmed by future multi-wavelength observations, the inferred bolometric luminosity (about 3x10^41 erg/s) requires a rather extreme beaming factor (larger than 115) to accommodate accretion onto a stellar-mass black hole of 20 solar masses, and the source could instead represent one of the best intermediate-mass black hole candidates so far. If beaming is excluded, the Eddington limit implies a mass of >2300 solar masses for the accreting compact object.
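The beaming factor and minimum mass quoted above can be checked with a short calculation. The sketch below assumes an Eddington luminosity of about 1.3x10^38 erg/s per solar mass; that coefficient is not stated in the abstract, and it is adopted here only because it reproduces the quoted figures. The function names are illustrative.

```python
# Assumed Eddington luminosity per solar mass, in erg/s (not given in the abstract).
L_EDD_PER_MSUN = 1.3e38

def eddington_mass(l_bol):
    """Minimum mass (solar masses) for isotropic emission at the Eddington limit."""
    return l_bol / L_EDD_PER_MSUN

def beaming_factor(l_bol, mass_msun):
    """Factor by which the apparent luminosity exceeds the Eddington luminosity."""
    return l_bol / (L_EDD_PER_MSUN * mass_msun)

L_BOL = 3e41  # inferred bolometric luminosity from the abstract, erg/s

min_mass = eddington_mass(L_BOL)        # about 2.3e3 solar masses
beaming = beaming_factor(L_BOL, 20.0)   # about 115 for a 20 solar-mass black hole

print(f"minimum mass without beaming: {min_mass:.0f} Msun")
print(f"beaming factor for 20 Msun:   {beaming:.1f}")
```

With the more commonly quoted coefficient of roughly 1.26x10^38 erg/s per solar mass, both numbers shift upward by a few percent, which is still consistent with the lower limits given in the abstract.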
Many optimizers have been proposed for training deep neural networks, and they often have multiple hyperparameters, which makes it tricky to benchmark their performance. In this work, we propose a new benchmarking protocol to evaluate both end-to-end efficiency (training a model from scratch without knowing the best hyperparameters) and data-addition training efficiency (the previously selected hyperparameters are used for periodically re-training the model with newly collected data). For end-to-end efficiency, unlike previous work that assumes random hyperparameter tuning, which over-emphasizes the tuning time, we propose to evaluate with a bandit hyperparameter tuning strategy. A human study is conducted to show that our evaluation protocol matches human tuning behavior better than random search. For data-addition training, we propose a new protocol for assessing the hyperparameter sensitivity to data shift. We then apply the proposed benchmarking framework to 7 optimizers and various tasks, including computer vision, natural language processing, reinforcement learning, and graph mining. Our results show that there is no clear winner across all the tasks.
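The abstract does not say which bandit strategy is used. As one example of the family, the sketch below implements successive halving: all candidates are evaluated on a small budget, the best fraction is kept, and the budget grows each round. The `toy_loss` objective, the learning-rate search range, and all parameter names are hypothetical stand-ins, not the paper's protocol.

```python
import math
import random

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the best 1/eta of the configs each round while multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Lower score is better (e.g. validation loss after `budget` epochs).
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[:max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0]

def toy_loss(lr, budget):
    """Hypothetical validation loss: minimized at lr = 1e-2, shrinking with more budget."""
    return (math.log10(lr) + 2.0) ** 2 + 1.0 / budget

rng = random.Random(0)
# Sample 16 learning rates log-uniformly from [1e-5, 1].
candidates = [10 ** rng.uniform(-5, 0) for _ in range(16)]
best_lr = successive_halving(candidates, toy_loss)
print(f"selected learning rate: {best_lr:.4g}")
```

Compared with random search, which spends a full training run on every sampled configuration, this kind of strategy stops poor configurations early. That matches the abstract's concern that random search over-emphasizes tuning time.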
Semantic segmentation is a crucial task for robot navigation and safety. However, it requires huge amounts of pixelwise annotations to yield accurate results. While recent progress in computer vision algorithms has been heavily boosted by large ground-level datasets, the labeling time has hampered progress in low-altitude UAV applications, mostly due to the difficulty imposed by large object scales and pose variations. Motivated by the lack of a large video aerial dataset, we introduce a new one, with high resolution (4K) images and manually annotated dense labels every 50 frames. To help the video labeling process, we make an important step towards automatic annotation and propose SegProp, an iterative flow-based method with geometric constraints to propagate the semantic labels to frames that lack human annotations. This results in a dataset with more than 50k annotated frames - the largest of its kind, to the best of our knowledge. Our experiments show that SegProp surpasses current state-of-the-art label propagation methods by a significant margin. Furthermore, when training a semantic segmentation deep neural net using the automatically annotated frames, we obtain a compelling overall performance boost at test time of 16.8% mean F-measure over a baseline trained only with manually labeled frames. Our Ruralscapes dataset, the label propagation code and a fast segmentation tool are available at our website: https://sites.google.com/site/aerialimageunderstanding/