We analyze alternating minimization for column space recovery of a partially observed, approximately low rank matrix with a growing number of columns and a fixed budget of observations per column. In this work, we prove that if the budget is greater than the rank of the matrix, column space recovery succeeds -- as the number of columns grows, the estimate from alternating minimization converges to the true column space with probability tending to one. From our proof techniques, we naturally formulate an active sampling strategy for choosing entries of a column that is theoretically and empirically (on synthetic and real data) better than the commonly studied uniformly random sampling strategy.
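As a rough illustration of the alternating minimization idea described above, the following is a minimal sketch for the simplest setting only: an exactly rank-1 matrix, noiseless observations, and uniformly random sampling of a fixed budget of entries per column. The function names (`make_data`, `alt_min_rank1`), the data generator, and all parameter values are assumptions made for this example; this is not the paper's algorithm or its active sampling strategy.

```python
import math
import random


def make_data(u_true, n_cols, budget, seed=0):
    """Hypothetical generator: each column is u_true * v_j, observed on `budget` random rows."""
    rng = random.Random(seed)
    cols = []
    for _ in range(n_cols):
        vj = rng.uniform(0.5, 1.5)
        rows = rng.sample(range(len(u_true)), budget)
        cols.append({i: u_true[i] * vj for i in rows})
    return cols


def alt_min_rank1(cols, n_rows, iters=50):
    """Estimate a rank-1 column space; cols[j] maps row index -> observed value."""
    u = [1.0] * n_rows
    for _ in range(iters):
        # Fix u and solve the least-squares problem for each column coefficient v_j.
        v = []
        for obs in cols:
            den = sum(u[i] ** 2 for i in obs)
            num = sum(u[i] * x for i, x in obs.items())
            v.append(num / den if den > 0 else 0.0)
        # Fix v and solve the least-squares problem for each row entry u_i.
        num_u = [0.0] * n_rows
        den_u = [0.0] * n_rows
        for vj, obs in zip(v, cols):
            for i, x in obs.items():
                num_u[i] += vj * x
                den_u[i] += vj * vj
        u = [num_u[i] / den_u[i] if den_u[i] > 0 else u[i] for i in range(n_rows)]
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]
```

With a budget of 2 observations per column (greater than the rank, 1) and a few hundred columns, the returned unit vector should align closely with the true direction in this toy setting.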
Recent studies have shown that the attention mechanism improves sequential recommendation. Following this development, we propose Relation-Aware Kernelized Self-Attention (RKSA), which adopts the self-attention mechanism of the Transformer and augments it with a probabilistic model. The original self-attention of the Transformer is a deterministic measure without relation-awareness. Therefore, we introduce a latent space to the self-attention; the latent space models the recommendation context from relations as a multivariate skew-normal distribution with a kernelized covariance matrix built from co-occurrences, item characteristics, and user information. This work merges the self-attention of the Transformer with sequential recommendation by adding a probabilistic model of the specifics of the recommendation task. We evaluated RKSA on benchmark datasets, and RKSA shows significant improvements over recent baseline models. RKSA was also able to produce a latent space model that explains the reasons for a recommendation.
In randomized experiments, interactions between units might generate a treatment diffusion process. This is common when the treatment of interest is an actual object or product that can be shared among peers (e.g., flyers, booklets, videos). For instance, if the intervention of interest is an information campaign realized through the distribution of a video to targeted individuals, some of these treated individuals might share the video they received with their friends. Such a phenomenon is usually unobserved, causing a misallocation of individuals in the two treatment arms: some of the initially untreated units might have actually received the treatment by diffusion. Treatment misclassification can, in turn, introduce a bias in the estimation of the causal effect. Inspired by a recent field experiment on the effect of different types of school incentives aimed at encouraging students to attend cultural events, we present a novel approach to deal with a hidden diffusion process on observed or partially observed networks. Specifically, we develop a simulation-based sensitivity analysis that assesses the robustness of the estimates against the possible presence of treatment diffusion. We simulate several diffusion scenarios within a plausible range of sensitivity parameters, and we compare the treatment effect estimated in each scenario with the one obtained while ignoring the diffusion process. Results suggest that even a small treatment diffusion parameter may lead to a significant bias in the estimation of the treatment effect.
Knowledge distillation (KD) is a well-known method to reduce inference latency by compressing a cumbersome teacher model into a small student model. Despite the success of KD in classification tasks, applying KD to recommender models is challenging due to the sparsity of positive feedback, the ambiguity of missing feedback, and the ranking problem associated with top-N recommendation. To address these issues, we propose a new KD model for the collaborative filtering approach, namely collaborative distillation (CD). Specifically, (1) we reformulate a loss function to deal with the ambiguity of missing feedback. (2) We exploit probabilistic rank-aware sampling for top-N recommendation. (3) To train the proposed model effectively, we develop two training strategies for the student model, called the teacher-guided and the student-guided training methods, which select the most useful feedback from the teacher model. Via experimental results, we demonstrate that the proposed model outperforms the state-of-the-art method by 2.7-33.2% and 2.7-29.1% in hit rate (HR) and normalized discounted cumulative gain (NDCG), respectively. Moreover, the proposed model achieves performance comparable to that of the teacher model.
Random features provide a practical framework for large-scale kernel approximation and supervised learning. It has been shown that data-dependent sampling of random features using leverage scores can significantly reduce the number of features required to achieve optimal learning bounds. Leverage scores introduce an optimized distribution for features based on an infinite-dimensional integral operator (depending on the input distribution), which is impractical to sample from. Focusing on empirical leverage scores in this paper, we establish an out-of-sample performance bound, revealing an interesting trade-off between the approximated kernel and the eigenvalue decay of another kernel in the domain of random features defined based on the data distribution. Our experiments verify that the empirical algorithm consistently outperforms vanilla Monte Carlo sampling, and with a minor modification the method is even competitive with supervised data-dependent kernel learning, without using the output (label) information.
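For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of vanilla Monte Carlo random Fourier features for a Gaussian kernel. It shows only the data-independent baseline, not the leverage-score sampling studied in the paper. The function names, the choice of the Gaussian kernel, and the parameters `sigma` and `n_features` are assumptions made for this example.

```python
import math
import random


def sample_features(dim, n_features, sigma=1.0, seed=0):
    """Draw frequencies w ~ N(0, I / sigma^2) and phases b ~ Uniform[0, 2*pi]."""
    rng = random.Random(seed)
    ws = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)] for _ in range(n_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    return ws, bs


def feature_map(x, ws, bs):
    """Map x to z(x) so that z(x) . z(y) approximates the Gaussian kernel k(x, y)."""
    scale = math.sqrt(2.0 / len(ws))
    return [
        scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
        for w, b in zip(ws, bs)
    ]


def gaussian_kernel(x, y, sigma=1.0):
    """Exact kernel value exp(-||x - y||^2 / (2 * sigma^2)) for comparison."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

The inner product of `feature_map(x, ws, bs)` and `feature_map(y, ws, bs)` approximates `gaussian_kernel(x, y)`, with an error that shrinks as the number of features grows. Data-dependent schemes aim to reach a given accuracy with fewer features.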
Random projection is often used to project higher-dimensional vectors onto a lower-dimensional space, while approximately preserving their pairwise distances. It has emerged as a powerful tool in various data processing tasks and has attracted considerable research interest. Partly motivated by recent discoveries in neuroscience, in this paper we study the problem of random projection using binary matrices with controllable sparsity patterns. Specifically, we propose two sparse binary projection models that work on general data vectors. Compared with conventional random projection models with dense projection matrices, our proposed models enjoy significant computational advantages due to their sparsity structure, as well as improved accuracy in empirical evaluations.
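To make the computational advantage of sparsity concrete, the following is a minimal sketch of one possible sparse binary projection, in which every row of the projection matrix contains a fixed number of ones at random positions. This construction, the function names, and the parameter `ones_per_row` are assumptions made for this example and are not necessarily either of the two models proposed in the paper. The sketch illustrates only the cost structure, not the distance-preservation properties.

```python
import random


def sparse_binary_rows(out_dim, in_dim, ones_per_row, seed=0):
    """Store each binary row by the sorted indices of its ones (hypothetical construction)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(in_dim), ones_per_row)) for _ in range(out_dim)]


def project(rows, x):
    """Each output coordinate sums only the selected inputs, so no multiplications are needed."""
    return [sum(x[j] for j in idx) for idx in rows]
```

Projecting one vector costs `out_dim * ones_per_row` additions, compared with `out_dim * in_dim` multiply-adds for a dense projection matrix.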