Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language?

96 0 0.0 ( 0 )

Download Cite

Added by Zhuoren Jiang

Publication date 2017

fields Informatics Engineering

and research's language is English

Authors Liangcai Gao - Zhuoren Jiang - Yue Yin

Information Retrieval

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

While neural network approaches are achieving breakthrough performance in the natural language related fields, there have been few similar attempts at mathematical language related tasks. In this study, we explore the potential of applying neural representation techniques to Mathematical Information Retrieval (MIR) tasks. In more detail, we first briefly analyze the characteristic differences between natural language and mathematical language. Then we design a symbol2vec method to learn the vector representations of formula symbols (numbers, variables, operators, functions, etc.) Finally, we propose a formula2vec based MIR approach and evaluate its performance. Preliminary experiment results show that there is a promising potential for applying formula embedding models to mathematical language representation and MIR tasks.

rate research

NaturalProofs: Mathematical Theorem Proving in Natural Language

64 - Sean Welleck , Jiacheng Liu , Ronan Le Bras 2021

Understanding and creating mathematics using natural mathematical language - the mixture of symbolic and natural language used by humans - is a challenging and important problem for driving progress in machine learning. As a step in this direction, we develop NaturalProofs, a multi-domain corpus of mathematical statements and their proofs, written in natural mathematical language. NaturalProofs unifies broad coverage, deep coverage, and low-resource mathematical sources, allowing for evaluating both in-distribution and zero-shot generalization. Using NaturalProofs, we benchmark strong neural methods on mathematical reference retrieval and generation tasks which test a systems ability to determine key results that appear in a proof. Large-scale sequence models show promise compared to classical information retrieval methods, yet their performance and out-of-domain generalization leave substantial room for improvement. NaturalProofs opens many avenues for research on challenging mathematical tasks.

Information Retrieval Machine Learning

MIaS: Math-Aware Retrieval in Digital Mathematical Libraries

92 - Petr Sojka Faculty of Informatics 2018

Digital mathematical libraries (DMLs) such as arXiv, Numdam, and EuDML contain mainly documents from STEM fields, where mathematical formulae are often more important than text for understanding. Conventional information retrieval (IR) systems are unable to represent formulae and they are therefore ill-suited for math information retrieval (MIR). To fill the gap, we have developed, and open-sourced the MIaS MIR system. MIaS is based on the full-text search engine Apache Lucene. On top of text retrieval, MIaS also incorporates a set of tools for preprocessing mathematical formulae. We describe the design of the system and present speed, and quality evaluation results. We show that MIaS is both efficient, and effective, as evidenced by our victory in the NTCIR-11 Math-2 task.

Information Retrieval

Prob2Vec: Mathematical Semantic Embedding for Problem Retrieval in Adaptive Tutoring

108 - Du Su , Ali Yekkehkhany , Yi Lu 2020

We propose a new application of embedding techniques for problem retrieval in adaptive tutoring. The objective is to retrieve problems whose mathematical concepts are similar. There are two challenges: First, like sentences, problems helpful to tutoring are never exactly the same in terms of the underlying concepts. Instead, good problems mix concepts in innovative ways, while still displaying continuity in their relationships. Second, it is difficult for humans to determine a similarity score that is consistent across a large enough training set. We propose a hierarchical problem embedding algorithm, called Prob2Vec, that consists of abstraction and embedding steps. Prob2Vec achieves 96.88% accuracy on a problem similarity test, in contrast to 75% from directly applying state-of-the-art sentence embedding methods. It is interesting that Prob2Vec is able to distinguish very fine-grained differences among problems, an ability humans need time and effort to acquire. In addition, the sub-problem of concept labeling with imbalanced training data set is interesting in its own right. It is a multi-label problem suffering from dimensionality explosion, which we propose ways to ameliorate. We propose the novel negative pre-training algorithm that dramatically reduces false negative and positive ratios for classification, using an imbalanced training data set.

Computers and Society Machine Learning Machine Learning

Audio Retrieval with Natural Language Queries

411 - Andreea-Maria Oncescu , A. Sophia Koepke , Jo~ao F. Henriques 2021

We consider the task of retrieving audio using free-form natural language queries. To study this problem, which has received limited attention in the existing literature, we introduce challenging new benchmarks for text-based audio retrieval using text annotations sourced from the Audiocaps and Clotho datasets. We then employ these benchmarks to establish baselines for cross-modal audio retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into cross-modal text-based audio retrieval with free-form text queries.

Information Retrieval Sound Audio and Speech Processing

A Mathematical Exploration of Why Language Models Help Solve Downstream Tasks

325 - Nikunj Saunshi , Sadhika Malladi , Sanjeev Arora 2020

Autoregressive language models, pretrained using large text corpora to do well on next word prediction, have been successful at solving many downstream tasks, even with zero-shot usage. However, there is little theoretical understanding of this success. This paper initiates a mathematical study of this phenomenon for the downstream task of text classification by considering the following questions: (1) What is the intuitive connection between the pretraining task of next word prediction and text classification? (2) How can we mathematically formalize this connection and quantify the benefit of language modeling? For (1), we hypothesize, and verify empirically, that classification tasks of interest can be reformulated as sentence completion tasks, thus making language modeling a meaningful pretraining task. With a mathematical formalization of this hypothesis, we make progress towards (2) and show that language models that are $epsilon$-optimal in cross-entropy (log-perplexity) learn features that can linearly solve such classification tasks with $mathcal{O}(sqrt{epsilon})$ error, thus demonstrating that doing well on language modeling can be beneficial for downstream tasks. We experimentally verify various assumptions and theoretical findings, and also use insights from the analysis to design a new objective function that performs well on some classification tasks.

Computation and Language Artificial Intelligence Machine Learning

comments

Fetching comments

Tishreen University

Additional details More universities

Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language?

Ask ChatGPT about the research

No Arabic abstract

Read More