ArGoT: A Glossary of Terms extracted from the arXiv


Published by EPTCS

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Author: Luis Berlioz

Digital Libraries, Machine Learning



We introduce ArGoT, a data set of mathematical terms extracted from the articles hosted on the arXiv website. A term is any mathematical concept defined in an article. Using labels in the articles' source code and examples from other popular math websites, we mine all the terms in the arXiv data and compile a comprehensive vocabulary of mathematical terms. The terms can then be organized in a dependency graph by using the terms' definitions and the arXiv's metadata. Using both hyperbolic and standard word embeddings, we demonstrate how this structure is reflected in the text's vector representation and how the embeddings capture relations of entailment between mathematical concepts. This data set is part of an ongoing effort to align natural mathematical text with existing Interactive Theorem Prover Libraries (ITPs) of formally verified statements.
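As a rough illustration of the dependency-graph idea in the abstract, the sketch below links each term to the other known terms that appear in its definition. It is not the paper's pipeline: the function name `build_dependency_graph`, the whole-phrase matching rule, and the three toy definitions are all assumptions made for this example, and the arXiv metadata that the abstract also mentions is not used here.

```python
import re
from collections import defaultdict


def build_dependency_graph(definitions):
    """Map each term to the set of other known terms found in its definition.

    Hypothetical helper for illustration only; a term "depends on" another
    term if that term occurs as a whole phrase in the definition text.
    """
    graph = defaultdict(set)
    for term, text in definitions.items():
        lowered = text.lower()
        graph.setdefault(term, set())  # keep terms that have no dependencies
        for other in definitions:
            if other == term:
                continue
            # Whole-phrase match, so "ring" does not match inside "string".
            pattern = r"\b" + re.escape(other.lower()) + r"\b"
            if re.search(pattern, lowered):
                graph[term].add(other)
    return dict(graph)


# Toy, hand-written definitions (not taken from the ArGoT data set).
definitions = {
    "group": "a set with an associative binary operation, an identity and inverses",
    "abelian group": "a group whose operation is commutative",
    "ring": "an abelian group with a second associative operation that distributes over addition",
}

graph = build_dependency_graph(definitions)
for term, deps in graph.items():
    print(term, "->", sorted(deps))
```

In this toy input, "ring" depends on both "abelian group" and "group", because the shorter term also occurs inside the longer one. A real extraction step would need to decide whether such nested matches count as separate dependencies.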


Alberto Pepe, Matteo Cantiello, Josh Nicholson 2017

The arXiv is the most popular preprint repository in the world. Since its inception in 1991, the arXiv has allowed researchers to freely share publication-ready articles prior to formal peer review. The growth and popularity of the arXiv emerged as a result of new technologies that made document creation and dissemination easy, and of cultural practices in which collaboration and data sharing were dominant. The arXiv occupies a unique place in the history of research communication and of the Web itself; however, it has arguably changed very little since its creation. Here we look at the strengths and weaknesses of the arXiv in an effort to identify what improvements can be made using new technologies that were not previously available. Based on this, we argue that a modern arXiv might in fact not look at all like the arXiv of today.

Digital Libraries, Solar and Stellar Astrophysics

Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

Harshdeep Singh, Robert West, Giovanni Colavizza 2020

Wikipedia's contents are based on reliable and published sources. To date, relatively little is known about what sources Wikipedia relies on, in part because extracting citations and identifying cited sources is challenging. To close this gap, we release Wikipedia Citations, a comprehensive dataset of citations extracted from Wikipedia. A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020, and classified as being to books, journal articles or Web contents. We were thus able to extract 4.0M citations to scholarly publications with known identifiers (including DOI, PMC, PMID, and ISBN) and further equip an extra 261K citations with DOIs from Crossref. As a result, we find that 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and that Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science. We release our code to allow the community to extend our work and update the dataset in the future.

Digital Libraries

The Diskionary: A Glossary of Terms Commonly Used for Disks and Related Objects, First Edition

Neal Evans, Nuria Calvet, Lucas Cieza 2009

Based on a panel discussion at the meeting "New Light on Young Stars: Spitzer's View of Circumstellar Disks", we provide some definitions reflecting the common usage of terms describing disks and related objects.

Solar and Stellar Astrophysics

On the Use of ArXiv as a Dataset

Colin B. Clement, Matthew Bierbaum, Kevin P. O'Keeffe 2019

The arXiv has collected 1.5 million pre-print articles over 28 years, hosting literature from scientific fields including Physics, Mathematics, and Computer Science. Each pre-print features text, figures, authors, citations, categories, and other metadata. These rich, multi-modal features, combined with the natural graph structure created by citation, affiliation, and co-authorship, make the arXiv an exciting candidate for benchmarking next-generation models. Here we take the first necessary steps toward this goal by providing a pipeline which standardizes and simplifies access to the arXiv's publicly available data. We use this pipeline to extract and analyze a 6.7-million-edge citation graph, with an 11-billion-word corpus of full-text research articles. We present some baseline classification results, and motivate the application of more exciting generative graph models.

Information Retrieval, Machine Learning, Social and Information Networks

Embedding technique and network analysis of scientific innovations emergence in an arXiv-based concept network

Serhii Brodiuk, Vasyl Palchykov 2020

Novelty is an inherent part of innovations and discoveries. Such processes may be considered as the appearance of new ideas or as the emergence of atypical connections between existing ones. The importance of such connections suggests investigating innovations through a network or graph representation in the space of ideas. In such a representation, a graph node corresponds to a relevant concept (idea), whereas an edge between two nodes means that the corresponding concepts have been used in a common context. In this study we ask whether it is possible to identify the edges between existing concepts where innovations may emerge. To this end, we use a well-documented scientific knowledge landscape of 1.2M arXiv.org manuscripts dated from April 2007 until September 2019. We extract relevant concepts for them using the ScienceWISE.info platform. Combining approaches developed in complex network science and graph embedding, we discuss the predictability of edges (links) on the scientific knowledge landscape where innovations may appear.
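The graph representation described in this abstract (concepts as nodes, an edge when two concepts are used in a common context) can be sketched as follows. This is only a minimal illustration under stated assumptions: a "common context" is taken to mean "the same manuscript", the function name `cooccurrence_edges` is hypothetical, and the concept lists are invented rather than extracted with ScienceWISE.info.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(papers):
    """Count, for each unordered concept pair, how many papers mention both.

    Hypothetical helper: `papers` is a list of concept lists, one per paper.
    """
    edges = Counter()
    for concepts in papers:
        # Deduplicate and sort so each pair has one canonical orientation.
        for a, b in combinations(sorted(set(concepts)), 2):
            edges[(a, b)] += 1
    return edges


# Invented concept lists, one per manuscript (illustration only).
papers = [
    ["graph embedding", "link prediction"],
    ["complex network", "link prediction", "graph embedding"],
    ["complex network", "innovation"],
]

edges = cooccurrence_edges(papers)
for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight}")
```

Concept pairs that are absent from `edges` are the candidate links whose future appearance a link-prediction method would try to anticipate.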

Digital Libraries, Physics and Society