No Arabic abstract
Matrix Completion is the problem of recovering an unknown real-valued low-rank matrix from a subsample of its entries. Important recent results show that the problem can be solved efficiently under the assumption that the unknown matrix is incoherent and the subsample is drawn uniformly at random. Are these assumptions necessary? It is well known that Matrix Completion in its full generality is NP-hard. However, little is known if make additional assumptions such as incoherence and permit the algorithm to output a matrix of slightly higher rank. In this paper we prove that Matrix Completion remains computationally intractable even if the unknown matrix has rank $4$ but we are allowed to output any constant rank matrix, and even if additionally we assume that the unknown matrix is incoherent and are shown $90%$ of the entries. This result relies on the conjectured hardness of the $4$-Coloring problem. We also consider the positive semidefinite Matrix Completion problem. Here we show a similar hardness result under the standard assumption that $mathrm{P} e mathrm{NP}.$ Our results greatly narrow the gap between existing feasibility results and computational lower bounds. In particular, we believe that our results give the first complexity-theoretic justification for why distributional assumptions are needed beyond the incoherence assumption in order to obtain positive results. On the technical side, we contribute several new ideas on how to encode hard combinatorial problems in low-rank optimization problems. We hope that these techniques will be helpful in further understanding the computational limits of Matrix Completion and related problems.
Matrix completion aims to reconstruct a data matrix based on observations of a small number of its entries. Usually in matrix completion a single matrix is considered, which can be, for example, a rating matrix in recommendation system. However, in practical situations, data is often obtained from multiple sources which results in a collection of matrices rather than a single one. In this work, we consider the problem of collective matrix completion with multiple and heterogeneous matrices, which can be count, binary, continuous, etc. We first investigate the setting where, for each source, the matrix entries are sampled from an exponential family distribution. Then, we relax the assumption of exponential family distribution for the noise and we investigate the distribution-free case. In this setting, we do not assume any specific model for the observations. The estimation procedures are based on minimizing the sum of a goodness-of-fit term and the nuclear norm penalization of the whole collective matrix. We prove that the proposed estimators achieve fast rates of convergence under the two considered settings and we corroborate our results with numerical experiments.
We propose a novel algorithm for sequential matrix completion in a recommender system setting, where the $(i,j)$th entry of the matrix corresponds to a user $i$s rating of product $j$. The objective of the algorithm is to provide a sequential policy for user-product pair recommendation which will yield the highest possible ratings after a finite time horizon. The algorithm uses a Gamma process factor model with two posterior-focused bandit policies, Thompson Sampling and Information-Directed Sampling. While Thompson Sampling shows competitive performance in simulations, state-of-the-art performance is obtained from Information-Directed Sampling, which makes its recommendations based off a ratio between the expected reward and a measure of information gain. To our knowledge, this is the first implementation of Information Directed Sampling on large real datasets. This approach contributes to a recent line of research on bandit approaches to collaborative filtering including Kawale et al. (2015), Li et al. (2010), Bresler et al. (2014), Li et al. (2016), Deshpande & Montanari (2012), and Zhao et al. (2013). The setting of this paper, as has been noted in Kawale et al. (2015) and Zhao et al. (2013), presents significant challenges to bounding regret after finite horizons. We discuss these challenges in relation to simpler models for bandits with side information, such as linear or gaussian process bandits, and hope the experiments presented here motivate further research toward theoretical guarantees.
The recently proposed SPARse Factor Analysis (SPARFA) framework for personalized learning performs factor analysis on ordinal or binary-valued (e.g., correct/incorrect) graded learner responses to questions. The underlying factors are termed concepts (or knowledge components) and are used for learning analytics (LA), the estimation of learner concept-knowledge profiles, and for content analytics (CA), the estimation of question-concept associations and question difficulties. While SPARFA is a powerful tool for LA and CA, it requires a number of algorithm parameters (including the number of concepts), which are difficult to determine in practice. In this paper, we propose SPARFA-Lite, a convex optimization-based method for LA that builds on matrix completion, which only requires a single algorithm parameter and enables us to automatically identify the required number of concepts. Using a variety of educational datasets, we demonstrate that SPARFALite (i) achieves comparable performance in predicting unobserved learner responses to existing methods, including item response theory (IRT) and SPARFA, and (ii) is computationally more efficient.
In the low-rank matrix completion (LRMC) problem, the low-rank assumption means that the columns (or rows) of the matrix to be completed are points on a low-dimensional linear algebraic variety. This paper extends this thinking to cases where the columns are points on a low-dimensional nonlinear algebraic variety, a problem we call Low Algebraic Dimension Matrix Completion (LADMC). Matrices whose columns belong to a union of subspaces are an important special case. We propose a LADMC algorithm that leverages existing LRMC methods on a tensorized representation of the data. For example, a second-order tensorized representation is formed by taking the Kronecker product of each column with itself, and we consider higher order tensorizations as well. This approach will succeed in many cases where traditional LRMC is guaranteed to fail because the data are low-rank in the tensorized representation but not in the original representation. We provide a formal mathematical justification for the success of our method. In particular, we give bounds of the rank of these data in the tensorized representation, and we prove sampling requirements to guarantee uniqueness of the solution. We also provide experimental results showing that the new approach outperforms existing state-of-the-art methods for matrix completion under a union of subspaces model.
This paper addresses the problem of low-rank distance matrix completion. This problem amounts to recover the missing entries of a distance matrix when the dimension of the data embedding space is possibly unknown but small compared to the number of considered data points. The focus is on high-dimensional problems. We recast the considered problem into an optimization problem over the set of low-rank positive semidefinite matrices and propose two efficient algorithms for low-rank distance matrix completion. In addition, we propose a strategy to determine the dimension of the embedding space. The resulting algorithms scale to high-dimensional problems and monotonically converge to a global solution of the problem. Finally, numerical experiments illustrate the good performance of the proposed algorithms on benchmarks.