Given a sufficiently large amount of labeled data, the non-convex low-rank matrix recovery problem contains no spurious local minima, so a local optimization algorithm is guaranteed to converge to a global minimum from any initial guess. However, the amount of data needed for this theoretical guarantee is very pessimistic, because it must prevent spurious local minima from existing anywhere, including at adversarial locations. In contrast, prior work based on good initial guesses has more realistic data requirements, because it allows spurious local minima to exist outside of a neighborhood of the solution. In this paper, we quantify the relationship between the quality of the initial guess and the corresponding reduction in data requirements. Using the restricted isometry constant as a surrogate for sample complexity, we compute a sharp threshold number of samples needed to prevent each specific point on the optimization landscape from becoming a spurious local minimum. Optimizing this threshold over regions of the landscape, we find that, for initial points near the ground truth, a linear improvement in the quality of the initial guess yields a constant-factor improvement in the sample complexity.
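For reference, the nonconvex formulation and the restricted isometry property (RIP) underlying these guarantees can be written as follows; the notation here is ours, following the standard matrix-sensing setup rather than the paper's exact conventions:

```latex
% Burer-Monteiro (factored) formulation: recover a rank-r ground truth
% M^* from linear measurements b = \mathcal{A}(M^*) by solving
\min_{X \in \mathbb{R}^{n \times r}} \; f(X) \;=\; \tfrac{1}{2}\,\bigl\| \mathcal{A}\bigl(XX^{T}\bigr) - b \bigr\|_{2}^{2}.
% The operator \mathcal{A} satisfies \delta-RIP when, for every matrix M
% with \mathrm{rank}(M) \le 2r,
(1 - \delta)\,\| M \|_{F}^{2} \;\le\; \bigl\| \mathcal{A}(M) \bigr\|_{2}^{2} \;\le\; (1 + \delta)\,\| M \|_{F}^{2}.
```

A smaller RIP constant $\delta$ generally demands more measurements, which is why the abstract treats it as a surrogate for sample complexity.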
This paper develops a new class of nonconvex regularizers for low-rank matrix recovery. Many regularizers are motivated as convex relaxations of the matrix rank function. Our new factor group-sparse regularizers are instead motivated as a relaxation of the number of nonzero columns in a factorization of the matrix. These nonconvex regularizers are sharper than the nuclear norm; indeed, we show they are related to Schatten-$p$ norms with arbitrarily small $0 < p \leq 1$. Moreover, these factor group-sparse regularizers can be written in a factored form that enables efficient and effective nonconvex optimization; notably, the method does not use singular value decomposition. We provide generalization error bounds for low-rank matrix completion which show improved upper bounds for Schatten-$p$ norm regularization as $p$ decreases. Compared to the max norm and the factored formulation of the nuclear norm, factor group-sparse regularizers are more efficient, accurate, and robust to the initial guess of rank. Experiments show promising performance of factor group-sparse regularization for low-rank matrix completion and robust principal component analysis.
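To illustrate the factored idea: the nuclear norm itself admits the well-known factored form $\|X\|_* = \min_{X=UV^T} \frac{1}{2}(\|U\|_F^2 + \|V\|_F^2)$, and a group-sparse penalty on the columns of the factors plays the analogous role for counting nonzero columns. Below is a minimal sketch of such an objective and its (sub)gradients for matrix completion; the paper's exact penalty, weighting, and Schatten-$p$ correspondence may differ:

```python
import numpy as np

def fgsr_objective(U, V, M_obs, mask, lam):
    """Matrix-completion loss with a column group-sparse penalty on the
    factors of X = U @ V.T. Zeroing a column of U (or V) drops one rank,
    so the penalty relaxes the number of nonzero columns; no SVD is ever
    formed. (Sketch only; not the paper's exact regularizer.)"""
    X = U @ V.T
    fit = 0.5 * np.sum(mask * (X - M_obs) ** 2)
    reg = np.linalg.norm(U, axis=0).sum() + np.linalg.norm(V, axis=0).sum()
    return fit + lam * reg

def fgsr_gradients(U, V, M_obs, mask, lam, eps=1e-12):
    """(Sub)gradients of the objective above, for plain gradient descent."""
    R = mask * (U @ V.T - M_obs)
    gU = R @ V + lam * U / (np.linalg.norm(U, axis=0) + eps)
    gV = R.T @ U + lam * V / (np.linalg.norm(V, axis=0) + eps)
    return gU, gV
```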
When the linear measurements of an instance of low-rank matrix recovery satisfy a restricted isometry property (RIP)---i.e. they are approximately norm-preserving---the problem is known to contain no spurious local minima, so exact recovery is guaranteed. In this paper, we show that moderate RIP is not enough to eliminate spurious local minima, so existing results can only hold for near-perfect RIP. In fact, counterexamples are ubiquitous: we prove that every $x$ is a spurious local minimum of some rank-1 instance of matrix recovery that satisfies RIP. One specific counterexample has RIP constant $\delta=1/2$, but causes randomly initialized stochastic gradient descent (SGD) to fail 12% of the time. SGD is frequently able to avoid and escape spurious local minima, but this empirical result shows that it can occasionally be defeated by their existence. Hence, while exact recovery guarantees will likely require a proof of no spurious local minima, arguments based solely on norm preservation will only be applicable to a narrow set of nearly-isotropic instances.
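The style of experiment described can be sketched as follows: run first-order descent from random initializations on a rank-1 matrix-sensing loss and count how often it fails to recover the ground truth. This is a hedged sketch only; the paper's specific counterexample instance, SGD variant, and step-size schedule are not reproduced here:

```python
import numpy as np

def gd_rank1(A, b, n, iters=3000, lr=0.05, rng=None):
    """Gradient descent on f(u) = (1/4) * sum_i (u^T A_i u - b_i)^2,
    the rank-1 matrix-sensing loss, from a random initial point.
    A: (m, n, n) sensing matrices, b: (m,) measurements."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.standard_normal(n)
    As = A + A.transpose(0, 2, 1)                   # symmetrized A_i + A_i^T
    for _ in range(iters):
        r = np.einsum('j,ijk,k->i', u, A, u) - b    # residuals <A_i, u u^T> - b_i
        u -= lr * 0.5 * np.einsum('i,ijk,k->j', r, As, u)
    return u

def failure_rate(A, b, u_star, trials=100, tol=1e-2):
    """Fraction of random initializations that miss u* u*^T."""
    rng = np.random.default_rng(0)
    fails = 0
    for _ in range(trials):
        u = gd_rank1(A, b, len(u_star), rng=rng)
        fails += np.linalg.norm(np.outer(u, u) - np.outer(u_star, u_star)) > tol
    return fails / trials
```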
The discriminator from generative adversarial nets (GAN) has been used by researchers as a feature extractor in transfer learning and appears to work well. However, some studies argue that this is the wrong research direction, because intuitively the discriminator's task is to separate real samples from generated ones, which should make features extracted in this way useless for most downstream tasks. To resolve this dilemma, we first conduct a thorough theoretical analysis of the relationship between the discriminator's task and the features it extracts. We find that the connection between the task and the features is weaker than commonly thought: the main factor restricting the features learned by the discriminator is not the task, but the need to prevent the entire GAN model from mode collapse during training. From this perspective, and combined with further analyses, we find that to avoid mode collapse, the features extracted by the discriminator are not forced to differ across real samples; rather, noise-free divergence is allowed and occupies a large proportion of the feature space. This makes the features more robust and helps explain why the discriminator can succeed as a feature extractor in related research. Consequently, to expose what distinguishes the discriminator extractor from other extractors, we analyze its counterpart, the classifier extractor, which assigns samples to different categories. We find that the discriminator extractor may be inferior to the classifier-based extractor when the source classification task is similar to the target task, which is the common case, but its ability to avoid noise prevents the discriminator from being replaced by the classifier.
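The mechanism in question, reusing a trained discriminator as a feature extractor, is simple to state in code. The sketch below uses an illustrative toy architecture (the layer names and sizes are ours, not the paper's); the point is that the real/fake head is dropped and the penultimate representation is kept:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Toy GAN discriminator: a feature trunk plus a real/fake head.
    Any discriminator with this two-part structure works the same way."""
    def __init__(self, in_dim=784, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(in_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, feat_dim), nn.LeakyReLU(0.2),
        )
        self.head = nn.Linear(feat_dim, 1)  # real/fake score

    def forward(self, x):
        return self.head(self.features(x))

def extract_features(disc, x):
    """After GAN training, discard the head and reuse the trunk as the
    transfer-learning extractor discussed in the abstract."""
    with torch.no_grad():
        return disc.features(x)
```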
Since the early 1980s, the research community has developed ever more sophisticated algorithms for the problem of energy disaggregation, but despite decades of research, there is still a dearth of applications with demonstrated value. In this work, we explore a question that is highly pertinent to this research community: how good does energy disaggregation need to be in order to infer characteristics of a household? We present novel techniques that use unsupervised energy disaggregation to predict both household occupancy and static properties of the household, such as the size of the home and the number of occupants. Results show that basic disaggregation approaches perform up to 30% better at occupancy estimation than approaches using aggregate power data alone, and up to 10% better at estimating static household characteristics. These results show that even rudimentary energy disaggregation techniques are sufficient for improved inference of household characteristics. To conclude, we re-evaluate the bar set by the community for energy disaggregation accuracy and attempt to answer the question: how good is good enough?
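As a rough illustration of the pipeline, per-window features built from disaggregated appliance traces can be fed to an off-the-shelf classifier for occupancy. The feature set and model below are our assumptions for illustration, not the paper's:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def occupancy_features(aggregate, appliances):
    """Per-window features from aggregate power plus disaggregated
    appliance traces. aggregate: (n_windows, T); appliances:
    (n_windows, T, K) from any unsupervised disaggregator.
    (Illustrative feature set; the paper's features may differ.)"""
    feats = [aggregate.mean(axis=1), aggregate.std(axis=1)]
    for k in range(appliances.shape[2]):
        trace = appliances[:, :, k]
        feats += [trace.mean(axis=1), (trace > 0).mean(axis=1)]  # duty cycle
    return np.column_stack(feats)

# Usage: clf = RandomForestClassifier().fit(occupancy_features(agg, app), occupied)
```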
Many contemporary machine learning models require extensive tuning of hyperparameters to perform well. A variety of methods, such as Bayesian optimization, have been developed to automate and expedite this process. However, tuning remains extremely costly, as it typically requires repeatedly training models to completion. We propose to accelerate the Bayesian optimization approach to hyperparameter tuning for neural networks by taking into account the relative amount of information contributed by each training example. To do so, we leverage importance sampling (IS); this significantly increases the quality of the black-box function evaluations, but also their runtime, and so must be done carefully. Casting hyperparameter search as a multi-task Bayesian optimization problem over both hyperparameters and importance sampling design achieves the best of both worlds: by learning a parameterization of IS that trades off evaluation complexity and quality, we improve upon state-of-the-art Bayesian optimization in both runtime and final validation error across a variety of datasets and complex neural architectures.
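The joint search space can be pictured as an ordinary Bayesian optimization loop in which one extra coordinate controls the IS design. The sketch below uses a generic Gaussian-process surrogate with a simple lower-confidence-bound rule; the paper's multi-task model, kernel, and acquisition function are not reproduced, and `train_with_is` is a user-supplied training routine assumed for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def joint_bo(bounds, train_with_is, n_init=5, n_iter=20, seed=0):
    """Bayesian optimization over the joint space (hyperparameters, IS knob).
    `bounds` is a list of (low, high) pairs; the last coordinate is the
    importance-sampling design parameter in [0, 1], and
    train_with_is(hp, is_level) trains the model and returns validation
    error. (A sketch of the idea, not the paper's exact method.)"""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    X = rng.uniform(lo, hi, (n_init, len(bounds)))
    y = np.array([train_with_is(x[:-1], x[-1]) for x in X])
    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(lo, hi, (256, len(bounds)))
        mu, sd = gp.predict(cand, return_std=True)
        x_next = cand[np.argmin(mu - sd)]    # lower-confidence-bound choice
        X = np.vstack([X, x_next])
        y = np.append(y, train_with_is(x_next[:-1], x_next[-1]))
    return X[np.argmin(y)], y.min()
```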