ترغب بنشر مسار تعليمي؟ اضغط هنا

Dimensionality Reduction using Similarity-induced Embeddings

74   0   0.0 ( 0 )
 نشر من قبل Nikolaos Passalis
 تاريخ النشر 2017
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

The vast majority of Dimensionality Reduction (DR) techniques rely on second-order statistics to define their optimization objective. Even though this provides adequate results in most cases, it comes with several shortcomings. The methods require carefully designed regularizers and they are usually prone to outliers. In this work, a new DR framework, that can directly model the target distribution using the notion of similarity instead of distance, is introduced. The proposed framework, called Similarity Embedding Framework, can overcome the aforementioned limitations and provides a conceptually simpler way to express optimization targets similar to existing DR techniques. Deriving a new DR technique using the Similarity Embedding Framework becomes simply a matter of choosing an appropriate target similarity matrix. A variety of classical tasks, such as performing supervised dimensionality reduction and providing out-of-of-sample extensions, as well as, new novel techniques, such as providing fast linear embeddings for complex techniques, are demonstrated in this paper using the proposed framework. Six datasets from a diverse range of domains are used to evaluate the proposed method and it is demonstrated that it can outperform many existing DR techniques.



قيم البحث

اقرأ أيضاً

85 - Vikas Raunak 2017
Word embeddings have become the basic building blocks for several natural language processing and information retrieval tasks. Pre-trained word embeddings are used in several downstream applications as well as for constructing representations for sen tences, paragraphs and documents. Recently, there has been an emphasis on further improving the pre-trained word vectors through post-processing algorithms. One such area of improvement is the dimensionality reduction of the word embeddings. Reducing the size of word embeddings through dimensionality reduction can improve their utility in memory constrained devices, benefiting several real-world applications. In this work, we present a novel algorithm that effectively combines PCA based dimensionality reduction with a recently proposed post-processing algorithm, to construct word embeddings of lower dimensions. Empirical evaluations on 12 standard word similarity benchmarks show that our algorithm reduces the embedding dimensionality by 50%, while achieving similar or (more often) better performance than the higher dimension embeddings.
VARCLUST algorithm is proposed for clustering variables under the assumption that variables in a given cluster are linear combinations of a small number of hidden latent variables, corrupted by the random noise. The entire clustering task is viewed a s the problem of selection of the statistical model, which is defined by the number of clusters, the partition of variables into these clusters and the cluster dimensions, i.e. the vector of dimensions of linear subspaces spanning each of the clusters. The optimal model is selected using the approximate Bayesian criterion based on the Laplace approximations and using a non-informative uniform prior on the number of clusters. To solve the problem of the search over a huge space of possible models we propose an extension of the ClustOfVar algorithm which was dedicated to subspaces of dimension only 1, and which is similar in structure to the $K$-centroid algorithm. We provide a complete methodology with theoretical guarantees, extensive numerical experimentations, complete data analyses and implementation. Our algorithm assigns variables to appropriate clusterse based on the consistent Bayesian Information Criterion (BIC), and estimates the dimensionality of each cluster by the PEnalized SEmi-integrated Likelihood Criterion (PESEL), whose consistency we prove. Additionally, we prove that each iteration of our algorithm leads to an increase of the Laplace approximation to the model posterior probability and provide the criterion for the estimation of the number of clusters. Numerical comparisons with other algorithms show that VARCLUST may outperform some popular machine learning tools for sparse subspace clustering. We also report the results of real data analysis including TCGA breast cancer data and meteorological data. The proposed method is implemented in the publicly available R package varclust.
Feature selection is a pattern recognition approach to choose important variables according to some criteria to distinguish or explain certain phenomena. There are many genomic and proteomic applications which rely on feature selection to answer ques tions such as: selecting signature genes which are informative about some biological state, e.g. normal tissues and several types of cancer; or defining a network of prediction or inference among elements such as genes, proteins, external stimuli and other elements of interest. In these applications, a recurrent problem is the lack of samples to perform an adequate estimate of the joint probabilities between element states. A myriad of feature selection algorithms and criterion functions are proposed, although it is difficult to point the best solution in general. The intent of this work is to provide an open-source multiplataform graphical environment to apply, test and compare many feature selection approaches suitable to be used in bioinformatics problems.
Existing dimensionality reduction methods are adept at revealing hidden underlying manifolds arising from high-dimensional data and thereby producing a low-dimensional representation. However, the smoothness of the manifolds produced by classic techn iques over sparse and noisy data is not guaranteed. In fact, the embedding generated using such data may distort the geometry of the manifold and thereby produce an unfaithful embedding. Herein, we propose a framework for nonlinear dimensionality reduction that generates a manifold in terms of smooth geodesics that is designed to treat problems in which manifold measurements are either sparse or corrupted by noise. Our method generates a network structure for given high-dimensional data using a nearest neighbors search and then produces piecewise linear shortest paths that are defined as geodesics. Then, we fit points in each geodesic by a smoothing spline to emphasize the smoothness. The robustness of this approach for sparse and noisy datasets is demonstrated by the implementation of the method on synthetic and real-world datasets.
138 - Hendrik Heuer 2016
This paper describes a technique to compare large text sources using word vector representations (word2vec) and dimensionality reduction (t-SNE) and how it can be implemented using Python. The technique provides a birds-eye view of text sources, e.g. text summaries and their source material, and enables users to explore text sources like a geographical map. Word vector representations capture many linguistic properties such as gender, tense, plurality and even semantic concepts like capital city of. Using dimensionality reduction, a 2D map can be computed where semantically similar words are close to each other. The technique uses the word2vec model from the gensim Python library and t-SNE from scikit-learn.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا