
A Multigrid Method for Efficiently Training Video Models

Published by Chao-Yuan Wu
Publication date: 2019
Research field: Informatics Engineering
Research language: English





Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training assumes a fixed mini-batch shape: a specific number of clips, frames, and spatial size. However, what is the optimal shape? High resolution models perform well, but train slowly. Low resolution models train faster, but they are inaccurate. Inspired by multigrid methods in numerical optimization, we propose to use variable mini-batch shapes with different spatial-temporal resolutions that are varied according to a schedule. The different shapes arise from resampling the training data on multiple sampling grids. Training is accelerated by scaling up the mini-batch size and learning rate when shrinking the other dimensions. We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU). As an illustrative example, the proposed multigrid method trains a ResNet-50 SlowFast network 4.5x faster (wall-clock time, same hardware) while also improving accuracy (+0.8% absolute) on Kinetics-400 compared to the baseline training method. Code is available online.
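The trade-off described in the abstract, a larger mini-batch and learning rate in exchange for fewer frames or a smaller spatial size, can be illustrated with a minimal sketch. It is not the authors' released code. It assumes that the batch size is chosen to keep the product clips × frames × height × width roughly constant and that the learning rate scales linearly with the batch size; the abstract states neither rule explicitly. The names `BatchShape` and `rescale` and the example schedule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchShape:
    clips: int   # mini-batch size (number of clips)
    frames: int  # temporal length of each clip
    size: int    # spatial side length (height == width assumed)


def rescale(base: BatchShape, base_lr: float, frames: int, size: int):
    """Return a mini-batch shape for the given frames/size plus a scaled LR.

    Assumption: per-iteration cost is proportional to
    clips * frames * size**2, and we keep it at or below the baseline.
    Assumption: learning rate scales linearly with the number of clips.
    """
    base_cost = base.clips * base.frames * base.size ** 2
    per_clip_cost = frames * size ** 2
    clips = max(1, base_cost // per_clip_cost)
    lr = base_lr * clips / base.clips
    return BatchShape(clips, frames, size), lr


if __name__ == "__main__":
    base = BatchShape(clips=8, frames=32, size=224)
    # Hypothetical coarse-to-fine schedule of (frames, size) grids.
    for frames, size in [(8, 112), (16, 160), (32, 224)]:
        shape, lr = rescale(base, 0.1, frames, size)
        print(shape, round(lr, 4))
```

Under these assumptions the coarsest grid processes 16 times as many clips per iteration as the baseline shape, and the final grid reproduces the baseline shape and learning rate.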




Read also

Algorithmic decision making based on computer vision and machine learning technologies continues to permeate our lives. But issues related to biases of these models and the extent to which they treat certain segments of the population unfairly have led to concern in the general public. It is now accepted that because of biases in the datasets we present to the models, a fairness-oblivious training will lead to unfair models. An interesting topic is the study of mechanisms via which the de novo design or training of the model can be informed by fairness measures. Here, we study mechanisms that impose fairness concurrently while training the model. While existing fairness-based approaches in vision have largely relied on training adversarial modules together with the primary classification/regression task, in an effort to remove the influence of the protected attribute or variable, we show how ideas based on well-known optimization concepts can provide a simpler alternative. In our proposed scheme, imposing fairness just requires specifying the protected attribute and utilizing our optimization routine. We provide a detailed technical analysis and present experiments demonstrating that various fairness measures from the literature can be reliably imposed on a number of training tasks in vision in a manner that is interpretable.
An emulator is a fast-to-evaluate statistical approximation of a detailed mathematical model (simulator). When used in lieu of simulators, emulators can expedite tasks that require many repeated evaluations, such as sensitivity analyses, policy optimization, model calibration, and value-of-information analyses. Emulators are developed using the output of simulators at specific input values (design points). Developing an emulator that closely approximates the simulator can require many design points, which becomes computationally expensive. We describe a self-terminating active learning algorithm to efficiently develop emulators tailored to a specific emulation task, and compare it with algorithms that optimize geometric criteria (random Latin hypercube sampling and maximum projection designs) and other active learning algorithms (treed Gaussian Processes that optimize typical active learning criteria). We compared the algorithms' root mean square error (RMSE) and maximum absolute deviation from the simulator (MAX) for seven benchmark functions and in a prostate cancer screening model. In the empirical analyses, in simulators with greatly varying smoothness over the input domain, active learning algorithms resulted in emulators with smaller RMSE and MAX for the same number of design points. In all other cases, all algorithms performed comparably. The proposed algorithm attained satisfactory performance in all analyses, had smaller variability than the treed Gaussian Processes (it is deterministic), and, on average, performed similarly to or better than the treed Gaussian Processes in 6 out of 7 benchmark functions and in the prostate cancer model.
Abnormal event detection in video is a complex computer vision problem that has attracted significant attention in recent years. The complexity of the task arises from the commonly-adopted definition of an abnormal event, that is, a rarely occurring event that typically depends on the surrounding context. Following the standard formulation of abnormal event detection as outlier detection, we propose a background-agnostic framework that learns from training videos containing only normal events. Our framework is composed of an object detector, a set of appearance and motion auto-encoders, and a set of classifiers. Since our framework only looks at object detections, it can be applied to different scenes, provided that normal events are defined identically across scenes and that the single main factor of variation is the background. To overcome the lack of abnormal data during training, we propose an adversarial learning strategy for the auto-encoders. We create a scene-agnostic set of out-of-domain pseudo-abnormal examples, which are correctly reconstructed by the auto-encoders before applying gradient ascent on the pseudo-abnormal examples. We further utilize the pseudo-abnormal examples to serve as abnormal examples when training appearance-based and motion-based binary classifiers to discriminate between normal and abnormal latent features and reconstructions. We compare our framework with the state-of-the-art methods on four benchmark data sets, using various evaluation metrics. Compared to existing methods, the empirical results indicate that our approach achieves favorable performance on all data sets. In addition, we provide region-based and track-based annotations for two large-scale abnormal event detection data sets from the literature, namely ShanghaiTech and Subway.
Weakly supervised video anomaly detection (WS-VAD) aims to distinguish anomalies from normal events based on discriminative representations. Most existing works are limited by insufficient video representations. In this work, we develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations with only video-level annotations. In particular, MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder that aims to automatically focus on anomalous regions in frames while extracting task-specific representations. Moreover, we adopt a self-training scheme to optimize both components and finally obtain a task-specific feature encoder. Extensive experiments on two public datasets demonstrate the efficacy of our method, and our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC of 94.83% on ShanghaiTech.
We present a first step towards a multigrid method for solving the min-cost flow problem. Specifically, we present a strategy that takes advantage of existing black-box fast iterative linear solvers, i.e., algebraic multigrid methods. We show with standard benchmarks that, while less competitive than combinatorial techniques on small problems that run on a single core, our approach scales well with problem size, complexity, and number of processors, allowing for tackling large-scale problems on modern parallel architectures. Our approach is based on combining interior-point with multigrid methods for solving the nonlinear KKT equations via Newton's method. However, the Jacobian matrix arising in the Newton iteration is indefinite and its condition number cannot be expected to be bounded. In fact, the eigenvalues of the Jacobian can both vanish and blow up near the solution, leading to a significant slow-down of the convergence speed of iterative solvers, or to the loss of convergence altogether. In order to allow for the application of multigrid methods, which were originally designed for elliptic problems, we furthermore show that the occurring Jacobian can be interpreted as the stiffness matrix of a mixed formulation of the weighted graph Laplacian of the network, whose metric depends on the slack variables and the multipliers of the inequality constraints. Together with our regularization, this allows for the application of a black-box algebraic multigrid method on the Schur complement of the system.