No Arabic abstract
We present a clock-driven Spiking Neural Network simulator which is up to 3x faster than the state of the art while, at the same time, being more general and requiring less programming effort on both the users and maintainers side. This is made possible by designing our pipeline around work queues which act as interfaces between stages and greatly reduce implementation complexity. We evaluate our work using three well-established SNN models on a series of benchmarks.
We present two novel optimizations that accelerate clock-based spiking neural network (SNN) simulators. The first one targets spike timing dependent plasticity (STDP). It combines lazy- with event-driven plasticity and efficiently facilitates the computation of pre- and post-synaptic spikes using bitfields and integer intrinsics. It offers higher bandwidth than event-driven plasticity alone and achieves a 1.5x-2x speedup over our closest competitor. The second optimization targets spike delivery. We partition our graph representation in a way that bounds the number of neurons that need be updated at any given time which allows us to perform said update in shared memory instead of global memory. This is 2x-2.5x faster than our closest competitor. Both optimizations represent the final evolutionary stages of years of iteration on STDP and spike delivery inside Spice (/spaIk/), our state of the art SNN simulator. The proposed optimizations are not exclusive to our graph representation or pipeline but are applicable to a multitude of simulator designs. We evaluate our performance on three well-established models and compare ourselves against three other state of the art simulators.
We present a SNN simulator which scales to millions of neurons, billions of synapses, and 8 GPUs. This is made possible by 1) a novel, cache-aware spike transmission algorithm 2) a model parallel multi-GPU distribution scheme and 3) a static, yet very effective load balancing strategy. The simulator further features an easy to use API and the ability to create custom models. We compare the proposed simulator against two state of the art ones on a series of benchmarks using three well-established models. We find that our simulator is faster, consumes less memory, and scales linearly with the number of GPUs.
Single object tracking (SOT) is currently one of the most important tasks in computer vision. With the development of the deep network and the release for a series of large scale datasets for single object tracking, siamese networks have been proposed and perform better than most of the traditional methods. However, recent siamese networks get deeper and slower to obtain better performance. Most of these methods could only meet the needs of real-time object tracking in ideal environments. In order to achieve a better balance between efficiency and accuracy, we propose a simpler siamese network for single object tracking, which runs fast in poor hardware configurations while remaining an excellent accuracy. We use a more efficient regression method to compute the location of the tracked object in a shorter time without losing much precision. For improving the accuracy and speeding up the training progress, we introduce the Squeeze-and-excitation (SE) network into the feature extractor. In this paper, we compare the proposed method with some state-of-the-art trackers and analysis their performances. Using our method, a siamese network could be trained with shorter time and less data. The fast processing speed enables combining object tracking with object detection or other tasks in real time.
In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by excluding all the GMM bootstrapping, decision tree building and force alignment steps, while still achieving very competitive word-error-rate. Additionally, using wordpieces as modeling units can significantly improve runtime efficiency since we can use larger stride without losing accuracy. We further confirm these findings on two internal VideoASR datasets: German, which is similar to English as a fusional language, and Turkish, which is an agglutinative language.
In the decremental $(1+epsilon)$-approximate Single-Source Shortest Path (SSSP) problem, we are given a graph $G=(V,E)$ with $n = |V|, m = |E|$, undergoing edge deletions, and a distinguished source $s in V$, and we are asked to process edge deletions efficiently and answer queries for distance estimates $widetilde{mathbf{dist}}_G(s,v)$ for each $v in V$, at any stage, such that $mathbf{dist}_G(s,v) leq widetilde{mathbf{dist}}_G(s,v) leq (1+ epsilon)mathbf{dist}_G(s,v)$. In the decremental $(1+epsilon)$-approximate All-Pairs Shortest Path (APSP) problem, we are asked to answer queries for distance estimates $widetilde{mathbf{dist}}_G(u,v)$ for every $u,v in V$. In this article, we consider the problems for undirected, unweighted graphs. We present a new emph{deterministic} algorithm for the decremental $(1+epsilon)$-approximate SSSP problem that takes total update time $O(mn^{0.5 + o(1)})$. Our algorithm improves on the currently best algorithm for dense graphs by Chechik and Bernstein [STOC 2016] with total update time $tilde{O}(n^2)$ and the best existing algorithm for sparse graphs with running time $tilde{O}(n^{1.25}sqrt{m})$ [SODA 2017] whenever $m = O(n^{1.5 - o(1)})$. In order to obtain this new algorithm, we develop several new techniques including improved decremental cover data structures for graphs, a more efficient notion of the heavy/light decomposition framework introduced by Chechik and Bernstein and the first clustering technique to maintain a dynamic emph{sparse} emulator in the deterministic setting. As a by-product, we also obtain a new simple deterministic algorithm for the decremental $(1+epsilon)$-approximate APSP problem with near-optimal total running time $tilde{O}(mn /epsilon)$ matching the time complexity of the sophisticated but rather involved algorithm by Henzinger, Forster and Nanongkai [FOCS 2013].