No Arabic abstract
We consider the problem of predicting the next observation given a sequence of past observations, and consider the extent to which accurate prediction requires complex algorithms that explicitly leverage long-range dependencies. Perhaps surprisingly, our positive results show that for a broad class of sequences, there is an algorithm that predicts well on average, and bases its predictions only on the most recent few observation together with a set of simple summary statistics of the past observations. Specifically, we show that for any distribution over observations, if the mutual information between past observations and future observations is upper bounded by $I$, then a simple Markov model over the most recent $I/epsilon$ observations obtains expected KL error $epsilon$---and hence $ell_1$ error $sqrt{epsilon}$---with respect to the optimal predictor that has access to the entire past and knows the data generating distribution. For a Hidden Markov Model with $n$ hidden states, $I$ is bounded by $log n$, a quantity that does not depend on the mixing time, and we show that the trivial prediction algorithm based on the empirical frequencies of length $O(log n/epsilon)$ windows of observations achieves this error, provided the length of the sequence is $d^{Omega(log n/epsilon)}$, where $d$ is the size of the observation alphabet. We also establish that this result cannot be improved upon, even for the class of HMMs, in the following two senses: First, for HMMs with $n$ hidden states, a window length of $log n/epsilon$ is information-theoretically necessary to achieve expected $ell_1$ error $sqrt{epsilon}$. Second, the $d^{Theta(log n/epsilon)}$ samples required to estimate the Markov model for an observation alphabet of size $d$ is necessary for any computationally tractable learning algorithm, assuming the hardness of strongly refuting a certain class of CSPs.
Tropical cyclones can be of varied intensity and cause a huge loss of lives and property if the intensity is high enough. Therefore, the prediction of the intensity of tropical cyclones advance in time is of utmost importance. We propose a novel stacked bidirectional long short-term memory network (BiLSTM) based model architecture to predict the intensity of a tropical cyclone in terms of Maximum surface sustained wind speed (MSWS). The proposed model can predict MSWS well advance in time (up to 72 h) with very high accuracy. We have applied the model on tropical cyclones in the North Indian Ocean from 1982 to 2018 and checked its performance on two recent tropical cyclones, namely, Fani and Vayu. The model predicts MSWS (in knots) for the next 3, 12, 24, 36, 48, 60, and 72 hours with a mean absolute error of 1.52, 3.66, 5.88, 7.42, 8.96, 10.15, and 11.92, respectively.
It is well known that recurrent neural networks (RNNs) faced limitations in learning long-term dependencies that have been addressed by memory structures in long short-term memory (LSTM) networks. Matrix neural networks feature matrix representation which inherently preserves the spatial structure of data and has the potential to provide better memory structures when compared to canonical neural networks that use vector representation. Neural Turing machines (NTMs) are novel RNNs that implement notion of programmable computers with neural network controllers to feature algorithms that have copying, sorting, and associative recall tasks. In this paper, we study the augmentation of memory capacity with a matrix representation of RNNs and NTMs (MatNTMs). We investigate if matrix representation has a better memory capacity than the vector representations in conventional neural networks. We use a probabilistic model of the memory capacity using Fisher information and investigate how the memory capacity for matrix representation networks are limited under various constraints, and in general, without any constraints. In the case of memory capacity without any constraints, we found that the upper bound on memory capacity to be $N^2$ for an $Ntimes N$ state matrix. The results from our experiments using synthetic algorithmic tasks show that MatNTMs have a better learning capacity when compared to its counterparts.
Advanced travel information and warning, if provided accurately, can help road users avoid traffic congestion through dynamic route planning and behavior change. It also enables traffic control centres mitigate the impact of congestion by activating Intelligent Transport System (ITS) proactively. Deep learning has become increasingly popular in recent years, following a surge of innovative GPU technology, high-resolution, big datasets and thriving machine learning algorithms. However, there are few examples exploiting this emerging technology to develop applications for traffic prediction. This is largely due to the difficulty in capturing random, seasonal, non-linear, and spatio-temporal correlated nature of traffic data. In this paper, we propose a data-driven modelling approach with a novel hierarchical D-CLSTM-t deep learning model for short-term traffic speed prediction, a framework combined with convolutional neural network (CNN) and long short-term memory (LSTM) models. A deep CNN model is employed to learn the spatio-temporal traffic patterns of the input graphs, which are then fed into a deep LSTM model for sequence learning. To capture traffic seasonal variations, time of the day and day of the week indicators are fused with trained features. The model is trained end-to-end to predict travel speed in 15 to 90 minutes in the future. We compare the model performance against other baseline models including CNN, LGBM, LSTM, and traditional speed-flow curves. Experiment results show that the D-CLSTM-t outperforms other models considerably. Model tests show that speed upstream also responds sensibly to a sudden accident occurring downstream. Our D-CLSTM-t model framework is also highly scalable for future extension such as for network-wide traffic prediction, which can also be improved by including additional features such as weather, long term seasonality and accident information.
Model compression is significant for the wide adoption of Recurrent Neural Networks (RNNs) in both user devices possessing limited resources and business clusters requiring quick responses to large-scale service requests. This work aims to learn structurally-sparse Long Short-Term Memory (LSTM) by reducing the sizes of basic structures within LSTM units, including input updates, gates, hidden states, cell states and outputs. Independently reducing the sizes of basic structures can result in inconsistent dimensions among them, and consequently, end up with invalid LSTM units. To overcome the problem, we propose Intrinsic Sparse Structures (ISS) in LSTMs. Removing a component of ISS will simultaneously decrease the sizes of all basic structures by one and thereby always maintain the dimension consistency. By learning ISS within LSTM units, the obtained LSTMs remain regular while having much smaller basic structures. Based on group Lasso regularization, our method achieves 10.59x speedup without losing any perplexity of a language modeling of Penn TreeBank dataset. It is also successfully evaluated through a compact model with only 2.69M weights for machine Question Answering of SQuAD dataset. Our approach is successfully extended to non- LSTM RNNs, like Recurrent Highway Networks (RHNs). Our source code is publicly available at https://github.com/wenwei202/iss-rnns
The problem of devising learning strategies for discrete losses (e.g., multilabeling, ranking) is currently addressed with methods and theoretical analyses ad-hoc for each loss. In this paper we study a least-squares framework to systematically design learning algorithms for discrete losses, with quantitative characterizations in terms of statistical and computational complexity. In particular we improve existing results by providing explicit dependence on the number of labels for a wide class of losses and faster learning rates in conditions of low-noise. Theoretical results are complemented with experiments on real datasets, showing the effectiveness of the proposed general approach.