The 2013 Défi de Fouille de Textes (DEFT) campaign addresses two types of language analysis tasks: document classification and information extraction in the specialized domain of cooking recipes. We present the systems that the LIA used in DEFT 2013. Our systems achieve interesting results despite the complexity of the proposed tasks.
In the geometric situation of some simple unitary Shimura varieties studied by Harris and Taylor, I have built two filtrations of the perverse sheaf of vanishing cycles. The graded pieces of the first are the $p$-intermediate extensions of some Harris-Taylor local systems, while for the second, obtained by duality, they are the $p+$-intermediate extensions. In this work, we describe the difference between these $p$ and $p+$ intermediate extensions. Precisely, we show, in the case where the local system is associated to an irreducible cuspidal representation whose reduction modulo $l$ is supercuspidal, that the two intermediate extensions are the same. Otherwise, if the reduction modulo $l$ is just cuspidal, we describe the $l$-torsion of their difference.
The goal of this Habilitation à diriger des recherches is to present two different applications, namely computations of certain partition functions in probability and applications to integrable systems, of the topological recursion developed by B. Eynard and N. Orantin in 2007. Since its creation, the range of applications of the topological recursion has been growing and many results in different fields have been obtained. The first aspect that I will develop deals with the historical domain of the topological recursion: random matrix integrals. I will review the formalism of the topological recursion as well as how it can be used to obtain asymptotic $\frac{1}{N}$ series expansions of various matrix integrals. In particular, a key feature of the topological recursion is that it can recover from the leading order of the asymptotics all sub-leading orders with elementary computations. This method is particularly well known and fruitful in the case of hermitian matrix integrals, but I will also show that the general method can be used to cover integrals with hard edges, integrals over unitary matrices and much more. In the end, I will also briefly mention the generalization to $\beta$-ensembles. In a second chapter, I will review the connection between the topological recursion and the study of integrable systems having a Lax pair representation. Most of the results presented there will be illustrated by the case of the famous six Painlevé equations. Though the formalism used in this chapter may look completely disconnected from the previous one, it is well known that the local statistics of eigenvalues in random matrix theory exhibit a universality phenomenon and that the encountered universal systems are precisely driven by some solutions of the Painlevé equations. As I will show, the connection can be made very explicit with the topological recursion formalism.
We investigate the performance of two machine learning algorithms in the context of anti-spam filtering. The increasing volume of unsolicited bulk e-mail (spam) has generated a need for reliable anti-spam filters. Filters of this type have so far been based mostly on keyword patterns that are constructed by hand and perform poorly. The Naive Bayesian classifier has recently been suggested as an effective method to automatically construct anti-spam filters with superior performance. We thoroughly investigate the performance of the Naive Bayesian filter on a publicly available corpus, contributing towards standard benchmarks. At the same time, we compare the performance of the Naive Bayesian filter to an alternative memory-based learning approach, after introducing suitable cost-sensitive evaluation measures. Both methods achieve very accurate spam filtering, clearly outperforming the keyword-based filter of a widely used e-mail reader.
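As an illustration of the kind of filter discussed in the abstract above, the following is a minimal sketch of a Naive Bayesian spam classifier with a cost-sensitive decision threshold. It is not the authors' implementation: the multinomial model with add-one smoothing, the toy messages, and the function names `train`, `spam_probability`, and `is_spam` are all assumptions made for this example. The parameter `cost_ratio` is likewise hypothetical; it encodes the assumption that blocking a legitimate message is `cost_ratio` times as costly as letting a spam message through, which leads to the threshold `cost_ratio / (1 + cost_ratio)` on the spam posterior.

```python
import math
from collections import Counter


def train(messages, labels):
    """Fit a multinomial Naive Bayes model with add-one smoothing.

    `messages` is a list of token lists; `labels` holds "spam" or "ham".
    """
    classes = sorted(set(labels))
    doc_counts = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for tokens, label in zip(messages, labels):
        word_counts[label].update(tokens)
    vocab = {w for c in classes for w in word_counts[c]}
    return {
        "classes": classes,
        "log_prior": {c: math.log(doc_counts[c] / len(labels)) for c in classes},
        "word_counts": word_counts,
        "totals": {c: sum(word_counts[c].values()) for c in classes},
        "vocab": vocab,
    }


def spam_probability(model, tokens):
    """Return P(spam | tokens) under the Naive Bayes independence assumption."""
    vocab_size = len(model["vocab"])
    scores = {}
    for c in model["classes"]:
        score = model["log_prior"][c]
        for w in tokens:
            if w in model["vocab"]:  # words never seen in training are ignored
                count = model["word_counts"][c][w]
                score += math.log((count + 1) / (model["totals"][c] + vocab_size))
        scores[c] = score
    # Normalize in log space to avoid underflow on long messages.
    top = max(scores.values())
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    return weights["spam"] / sum(weights.values())


def is_spam(model, tokens, cost_ratio=1.0):
    """Flag a message as spam only if the posterior exceeds the cost-based threshold."""
    threshold = cost_ratio / (1.0 + cost_ratio)
    return spam_probability(model, tokens) > threshold


# Toy training data (invented for this sketch, not taken from any corpus).
train_messages = [
    ["cheap", "pills", "buy", "now"],
    ["win", "money", "now"],
    ["meeting", "agenda", "tomorrow"],
    ["project", "meeting", "notes"],
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train(train_messages, train_labels)

suspicious = ["buy", "cheap", "money"]
print(spam_probability(model, suspicious))
print(is_spam(model, suspicious, cost_ratio=1.0))
print(is_spam(model, suspicious, cost_ratio=9.0))
```

Raising `cost_ratio` makes the filter more conservative: the same message that is flagged when both error types cost the same may be let through when misclassifying legitimate mail is treated as far more expensive.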
Estimating the semantic similarity between text data is one of the challenging and open research problems in the field of Natural Language Processing (NLP). The versatility of natural language makes it difficult to define rule-based methods for determining semantic similarity measures. In order to address this issue, various semantic similarity methods have been proposed over the years. This survey article traces the evolution of such methods, categorizing them based on their underlying principles as knowledge-based, corpus-based, deep neural network-based methods, and hybrid methods. Discussing the strengths and weaknesses of each method, this survey provides a comprehensive view of existing systems in place, for new researchers to experiment and develop innovative ideas to address the issue of semantic similarity.
Probabilistic topic models are generative models that describe the content of documents by discovering the latent topics underlying them. However, the structure of the textual input, for instance the grouping of words into coherent text spans such as sentences, carries much information that is generally lost with these models. In this paper, we propose sentenceLDA, an extension of LDA whose goal is to overcome this limitation by incorporating the structure of the text in the generative and inference processes. We illustrate the advantages of sentenceLDA by comparing it with LDA using both intrinsic (perplexity) and extrinsic (text classification) evaluation tasks on different text collections.