With the rapid advance of the Internet, search engines (e.g., Google, Bing, Yahoo!) are used by billions of users every day. The main function of a search engine is to locate the webpages most relevant to a user's request. This report focuses on the core problem of information retrieval: how to learn the relevance between a document (very often a webpage) and a query given by a user. Our analysis consists of two parts: 1) we use standard statistical methods to select important features among 137 candidates provided by information retrieval researchers at Microsoft. We find that not all of the features are useful, and we give interpretations of the top-selected features; 2) we report prediction baselines on the real-world dataset MSLR-WEB using various learning algorithms. We find that boosted-tree and random-forest models generally achieve the best predictive performance. This agrees with the mainstream opinion in the information retrieval community that tree-based algorithms outperform the other candidates for this problem.
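The two-part workflow described above (importance-based feature selection, then tree-based baselines) can be sketched in a few lines. This is a minimal illustration, not the report's actual pipeline: synthetic data stands in for MSLR-WEB, only the feature count (137) and the 5-level relevance labels match the text, and a pointwise regression objective is assumed in place of a full ranking loss.

```python
# Hedged sketch of pointwise baselines for MSLR-WEB-style data.
# The data here is randomly generated; only n_features=137 and the
# relevance labels in {0,...,4} mirror the setting in the text.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 2000, 137
X = rng.normal(size=(n_samples, n_features))

# Labels driven by a small subset of features, mimicking the finding
# that not all 137 candidate features are actually useful.
signal = X[:, :10].sum(axis=1)
y = np.clip(np.digitize(signal, [-3.0, -1.0, 1.0, 3.0]), 0, 4).astype(float)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [
    ("random forest", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("boosted trees", GradientBoostingRegressor(random_state=0)),
]:
    model.fit(X_tr, y_tr)
    mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name}: test MSE = {mse:.3f}")

# Tree-based feature importances give a simple selection signal:
# features with near-zero importance are candidates for dropping.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
top = np.argsort(rf.feature_importances_)[::-1][:10]
print("top-10 features by importance:", sorted(top.tolist()))
```

In this synthetic setup the importance ranking recovers (most of) the ten signal features; on the real dataset the same importance scores would be one of several criteria to compare against the statistical selection methods mentioned above.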