Do you want to publish a course? Click here

Efficient recursive least squares solver for rank-deficient matrices

102   0   0.0 ( 0 )
 Added by Stephan Steinmann
 Publication date 2021
and research's language is English
 Authors Ruben Staub




Ask ChatGPT about the research

Updating a linear least squares solution can be critical for near real-time signalprocessing applications. The Greville algorithm proposes a simple formula for updating the pseudoinverse of a matrix A $in$ R nxm with rank r. In this paper, we explicitly derive a similar formula by maintaining a general rank factorization, which we call rank-Greville. Based on this formula, we implemented a recursive least squares algorithm exploiting the rank-deficiency of A, achieving the update of the minimum-norm least-squares solution in O(mr) operations and, therefore, solving the linear least-squares problem from scratch in O(nmr) operations. We empirically confirmed that this algorithm displays a better asymptotic time complexity than LAPACK solvers for rank-deficient matrices. The numerical stability of rank-Greville was found to be comparable to Cholesky-based solvers. Nonetheless, our implementation supports exact numerical representations of rationals, due to its remarkable algebraic simplicity.



rate research

Read More

Tensor decompositions, such as CANDECOMP/PARAFAC (CP), are widely used in a variety of applications, such as chemometrics, signal processing, and machine learning. A broadly used method for computing such decompositions relies on the Alternating Least Squares (ALS) algorithm. When the number of components is small, regardless of its implementation, ALS exhibits low arithmetic intensity, which severely hinders its performance and makes GPU offloading ineffective. We observe that, in practice, experts often have to compute multiple decompositions of the same tensor, each with a small number of components (typically fewer than 20), to ultimately find the best ones to use for the application at hand. In this paper, we illustrate how multiple decompositions of the same tensor can be fused together at the algorithmic level to increase the arithmetic intensity. Therefore, it becomes possible to make efficient use of GPUs for further speedups; at the same time the technique is compatible with many enhancements typically used in ALS, such as line search, extrapolation, and non-negativity constraints. We introduce the Concurrent ALS algorithm and library, which offers an interface to Matlab, and a mechanism to effectively deal with the issue that decompositions complete at different times. Experimental results on artificial and real datasets demonstrate a shorter time to completion due to increased arithmetic intensity.
Recursive least squares (RLS) algorithms were once widely used for training small-scale neural networks, due to their fast convergence. However, previous RLS algorithms are unsuitable for training deep neural networks (DNNs), since they have high computational complexity and too many preconditions. In this paper, to overcome these drawbacks, we propose three novel RLS optimization algorithms for training feedforward neural networks, convolutional neural networks and recurrent neural networks (including long short-term memory networks), by using the error backpropagation and our average-approximation RLS method, together with the equivalent gradients of the linear least squares loss function with respect to the linear outputs of hidden layers. Compared with previous RLS optimization algorithms, our algorithms are simple and elegant. They can be viewed as an improved stochastic gradient descent (SGD) algorithm, which uses the inverse autocorrelation matrix of each layer as the adaptive learning rate. Their time and space complexities are only several times those of SGD. They only require the loss function to be the mean squared error and the activation function of the output layer to be invertible. In fact, our algorithms can be also used in combination with other first-order optimization algorithms without requiring these two preconditions. In addition, we present two improved methods for our algorithms. Finally, we demonstrate their effectiveness compared to the Adam algorithm on MNIST, CIFAR-10 and IMDB datasets, and investigate the influences of their hyperparameters experimentally.
Trajectory prediction plays a pivotal role in the field of intelligent vehicles. It currently suffers from several challenges,e.g., accumulative error in rollout process and weak adaptability in various scenarios. This paper proposes a parametric-learning recursive least squares (RLS) estimation based on deep neural network for trajectory prediction. We design a flexible plug-in module which can be readily implanted into rollout approaches. Goal points are proposed to capture the long-term prediction stability from the global perspective. We carried experiments out on the NGSIM dataset. The promising results indicate that our method could improve rollout trajectory prediction methods effectively.
In this work we describe an efficient implementation of a hierarchy of algorithms for the decomposition of dense matrices over the field with two elements (GF(2)). Matrix decomposition is an essential building block for solving dense systems of linear and non-linear equations and thus much research has been devoted to improve the asymptotic complexity of such algorithms. In this work we discuss an implementation of both well-known and improved algorithms in the M4RI library. The focus of our discussion is on a new variant of the M4RI algorithm - denoted MMPF in this work -- which allows for considerable performance gains in practice when compared to the previously fastest implementation. We provide performance figures on x86_64 CPUs to demonstrate the viability of our approach.
We describe an efficient implementation of a hierarchy of algorithms for multiplication of dense matrices over the field with two elements (GF(2)). In particular we present our implementation -- in the M4RI library -- of Strassen-Winograd matrix multiplication and the Method of the Four Russians multiplication (M4RM) and compare it against other available implementations. Good performance is demonstrated on on AMDs Opteron and particulary good performance on Intels Core 2 Duo. The open-source M4RI library is available stand-alone as well as part of the Sage mathematics software. In machine terms, addition in GF(2) is logical-XOR, and multiplication is logical-AND, thus a machine word of 64-bits allows one to operate on 64 elements of GF(2) in parallel: at most one CPU cycle for 64 parallel additions or multiplications. As such, element-wise operations over GF(2) are relatively cheap. In fact, in this paper, we conclude that the actual bottlenecks are memory reads and writes and issues of data locality. We present our empirical findings in relation to minimizing these and give an analysis thereof.
comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا