Research papers, master's theses, and doctoral dissertations published by Daniel Campos

IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System

142 - Daniel Campos , Heng Ji 2021

As in many scientific fields, new chemistry literature has grown at a staggering pace, with thousands of papers released every month. A large portion of chemistry literature focuses on new molecules and reactions between molecules. Most vital information is conveyed through 2-D images of molecules, representing the underlying molecules or reactions described. In order to ensure reproducible and machine-readable molecule representations, text-based molecule descriptors like SMILES and SELFIES were created. These text-based molecule representations provide molecule generation but are unfortunately rarely present in published literature. In the absence of molecule descriptors, the generation of molecule descriptors from the 2-D images present in the literature is necessary to understand chemistry literature at scale. Successful methods such as Optical Structure Recognition Application (OSRA) and ChemSchematicResolver are able to extract the locations of molecular structures in chemistry papers and infer molecular descriptions and reactions. While effective, existing systems expect chemists to correct outputs, making them unsuitable for unsupervised large-scale data mining. Leveraging the task formulation of image captioning introduced by DECIMER, we introduce IMG2SMI, a model which leverages Deep Residual Networks for image feature extraction and encoder-decoder Transformer layers for molecule description generation. Unlike previous neural-network-based systems, IMG2SMI builds around the task of molecule description generation, which enables IMG2SMI to outperform OSRA-based systems by 163% in molecule similarity prediction as measured by the molecular MACCS Fingerprint Tanimoto Similarity. Additionally, to facilitate further research on this task, we release a new molecule prediction dataset including 81 million molecules for molecule description generation.
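
The Tanimoto similarity named in the abstract can be illustrated with a minimal sketch. It assumes each fingerprint is given as the set of indices of its "on" bits; the function name `tanimoto_similarity`, the example bit indices, and the convention for two empty fingerprints are illustrative assumptions, not taken from the paper. In practice the MACCS fingerprints themselves would come from a cheminformatics toolkit.

```python
def tanimoto_similarity(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Assumed convention: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical on-bit indices for a predicted and a reference molecule.
predicted = {1, 5, 9, 42}
reference = {1, 5, 42, 77, 120}
score = tanimoto_similarity(predicted, reference)
print(score)
```

A score of 1 means the two fingerprints share every on-bit, and 0 means they share none.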

Quantitative Methods, Computer Vision and Pattern Recognition, Machine Learning

Curriculum learning for language modeling

79 - Daniel Campos 2021

Language models like ELMo and BERT have provided robust representations of natural language, which serve as the language understanding component for a diverse range of downstream tasks. Curriculum learning is a method that employs a structured training regime, and it has been leveraged in computer vision and machine translation to improve model training speed and model performance. While language models have proven transformational for the natural language processing community, these models have proven expensive, energy-intensive, and challenging to train. In this work, we explore the effect of curriculum learning on language model pretraining using various linguistically motivated curricula and evaluate transfer performance on the GLUE Benchmark. Despite a broad variety of training methodologies and experiments, we do not find compelling evidence that curriculum learning methods improve language model training.

Computation and Language, Artificial Intelligence

Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard

96 - Jimmy Lin , Daniel Campos , Nick Craswell 2021

Leaderboards are a ubiquitous part of modern research in applied machine learning. By design, they sort entries into some linear order, where the top-scoring entry is recognized as the state of the art (SOTA). Due to the rapid progress being made in information retrieval today, particularly with neural models, the top entry in a leaderboard is replaced with some regularity. These are touted as improvements in the state of the art. Such pronouncements, however, are almost never qualified with significance testing. In the context of the MS MARCO document ranking leaderboard, we pose a specific question: How do we know if a run is significantly better than the current SOTA? We ask this question against the backdrop of recent IR debates on scale types: in particular, whether commonly used significance tests are even mathematically permissible. Recognizing these potential pitfalls in evaluation methodology, our study proposes an evaluation framework that explicitly treats certain outcomes as distinct and avoids aggregating them into a single-point metric. Empirical analysis of SOTA runs from the MS MARCO document ranking leaderboard reveals insights about how one run can be significantly better than another that are obscured by the current official evaluation metric (MRR@100).
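
For readers unfamiliar with the official metric, the following is a minimal sketch of mean reciprocal rank with a cutoff (MRR@k). The function name `mrr_at_k` and the toy document identifiers are assumptions made for illustration; this is not the leaderboard's evaluation script.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=100):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


# Three hypothetical queries: hit at rank 2, hit at rank 1, no hit.
rankings = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8"]]
relevant = [{"d2"}, {"d4"}, {"d9"}]
print(mrr_at_k(rankings, relevant, k=100))
```

Because the metric averages per-query reciprocal ranks into one number, two runs with very different per-query behaviour can receive similar scores, which is the kind of aggregation the abstract questions.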

Information Retrieval

Informational entropy thresholds as a physical mechanism to explain power-law time distributions in sequential decision-making

114 - Javier Cristin , Vicenç Mendez , Daniel Campos 2021

While frameworks based on physical grounds (like the Drift-Diffusion Model) have been exhaustively used in psychology and neuroscience to describe perceptual decision-making in humans, analogous approaches for more complex situations like sequential (tree-like) decision-making are still absent. For such scenarios, which involve a reflective prospection of future options to reach a decision, we offer a plausible mechanism based on the internal computation of the Shannon entropy for the different options available to the subjects. Reaching a threshold in the entropy triggers the decision, which means that the amount of information gathered through sensory evidence is enough to assess the options accurately. Experimental evidence in favour of this mechanism is provided by exploring human performance during navigation through a maze on a computer screen, monitored with the help of eye-trackers. In particular, our analysis allows us to prove that: (i) prospection is effectively being used by humans during such navigation tasks, and a quantification of the level of prospection used is attainable, (ii) the distribution of decision times during the task exhibits power-law tails, a feature that our entropy-based mechanism is able to explain, in contrast to classical decision-making frameworks.
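
The entropy-threshold idea can be sketched as follows. This is a toy illustration, not the authors' model: it assumes the subject's belief over the available options is a probability vector updated over time, and that the decision fires the first time the Shannon entropy falls to or below a threshold. The function names, belief values, and threshold are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def decision_step(belief_sequence, threshold):
    """Index of the first belief whose entropy is <= threshold, or None if never reached."""
    for step, belief in enumerate(belief_sequence):
        if shannon_entropy(belief) <= threshold:
            return step
    return None


# Hypothetical beliefs over two options, sharpening as evidence accumulates.
beliefs = [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1], [0.99, 0.01]]
print(decision_step(beliefs, threshold=0.5))
```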

Physics and Society, Disordered Systems and Neural Networks, Adaptation and Self-Organizing Systems

ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search

62 - Nick Craswell , Daniel Campos , Bhaskar Mitra 2020

Users of Web search engines reveal their information needs through queries and clicks, making click logs a useful asset for information retrieval. However, click logs have not been publicly released for academic use, because they can be too revealing of personally or commercially sensitive information. This paper describes a click data release related to the TREC Deep Learning Track document corpus. After aggregation and filtering, including a k-anonymity requirement, we find 1.4 million of the TREC DL URLs have 18 million connections to 10 million distinct queries. Our dataset of these queries and connections to TREC documents is of similar size to proprietary datasets used in previous papers on query mining and ranking. We perform some preliminary experiments using the click data to augment the TREC DL training data, offering by comparison: 28x more queries, with 49x more connections to 4.4x more URLs in the corpus. We present a description of the dataset's generation process, characteristics, and use in ranking, and suggest other potential uses.
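
A k-anonymity requirement on a click log can be sketched as below. The abstract does not spell out the exact rule, so this example assumes one common reading: a query is kept only if at least k distinct users issued it. The record layout `(user, query, url)` and the function name are hypothetical.

```python
from collections import defaultdict


def k_anonymous_pairs(click_log, k):
    """Return sorted (query, url) pairs whose query was issued by at least k distinct users."""
    users_per_query = defaultdict(set)
    for user, query, _url in click_log:
        users_per_query[query].add(user)
    kept = {
        (query, url)
        for _user, query, url in click_log
        if len(users_per_query[query]) >= k
    }
    return sorted(kept)


# Hypothetical log: "weather" comes from two users, "my home address" from one.
log = [
    ("u1", "weather", "a.com"),
    ("u2", "weather", "b.com"),
    ("u3", "my home address", "c.com"),
]
print(k_anonymous_pairs(log, k=2))
```

Aggregating to query-URL pairs and dropping user identifiers after filtering is what removes the personally revealing part of the log in this sketch.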

Information Retrieval, Machine Learning

Optimal management of impaired self-avoiding random walks for minimizing spatial coverage

78 - Daniel Campos , Javier Cristin , Vicenç Mendez 2019

Self-avoidance is a common mechanism to improve the efficiency of a random walker for covering a spatial domain. However, how this efficiency decreases when self-avoidance is impaired or limited by other processes has remained largely unexplored. Here we use simulations to study the case when the self-avoiding signal left by a walker both (i) saturates after successive revisits to a site, and (ii) evaporates, or disappears, after some characteristic time. Surprisingly, we find that the mean cover time reaches a minimum for intermediate values of the evaporation time, leading to the existence of a nontrivial optimum management of the self-avoiding signal. We argue that this is a consequence of complex blocking effects caused by the interplay with the signal saturation and, remarkably, we show that the optimum becomes more and more significant as the domain size increases.

Statistical Mechanics

Transport properties of random walks under stochastic non-instantaneous resetting

138 - Axel Maso-Puigdellosas , Daniel Campos , Vicenç Mendez 2019

Random walks with stochastic resetting provide a treatable framework to study interesting features of central-place motion. In this work, we introduce non-instantaneous resetting as a two-state model combining an exploring state, where the walker moves randomly according to a propagator, and a returning state, where the walker performs a ballistic motion with constant velocity towards the origin. We study the emerging transport properties for two types of reset time probability density functions (PDFs): exponential and Pareto. In the first case, we find the stationary distribution and a general expression for the stationary mean square displacement (MSD) in terms of the propagator. We find that the stationary MSD may increase, decrease or remain constant with the returning velocity, depending on the moments of the propagator. Regarding the Pareto resetting PDF, we also study the stationary distribution and the asymptotic scaling of the MSD for diffusive motion. In this case, we see that the resetting modifies the transport regime, making the overall transport sub-diffusive and even reaching a stationary MSD, i.e., a stochastic localization. This phenomenon is also observed in diffusion under instantaneous Pareto resetting. We check the main results with stochastic simulations of the process.
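
The two-state model described above can be sketched as a toy one-dimensional, discrete-time simulation. This is not the authors' code. It assumes Gaussian steps in the exploring state, a constant per-step reset probability (the discrete analogue of exponentially distributed reset times), and a return at constant speed that ends when the walker reaches the origin. All names and parameter values are hypothetical.

```python
import random


def simulate_walk(n_steps, reset_prob, return_speed, step_std=1.0, seed=0):
    """Simulate a 1-D walker; returns a list of (position, state) after each step."""
    rng = random.Random(seed)
    x, state = 0.0, "exploring"
    path = [(x, state)]
    for _ in range(n_steps):
        if state == "exploring":
            if rng.random() < reset_prob:
                state = "returning"  # a reset event starts the return trip
            else:
                x += rng.gauss(0.0, step_std)  # diffusive exploring step
        if state == "returning":
            if abs(x) <= return_speed:
                x, state = 0.0, "exploring"  # origin reached, resume exploring
            else:
                x -= return_speed if x > 0 else -return_speed  # ballistic return
        path.append((x, state))
    return path


# Estimate the mean square displacement at the final step over many seeds.
finals = [simulate_walk(500, 0.02, 0.5, seed=s)[-1][0] for s in range(200)]
msd = sum(x * x for x in finals) / len(finals)
print(msd)
```

Varying `return_speed` in such a sketch gives a rough feel for how the return velocity can affect the long-time MSD, although the paper's results are analytical and more general.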

Statistical Mechanics

Anomalous diffusion in random-walks with memory-induced relocations

479 - Axel Maso-Puigdellosas , Daniel Campos , Vicenç Mendez 2019

In this minireview we present the main results regarding the transport properties of stochastic movement with relocations to known positions. To do so, we formulate the problem in a general manner, so that several cases extensively studied in recent years appear as particular situations within a framework of random walks with memory. We focus on (i) stochastic motion with resets to its initial position followed by a waiting period, and (ii) diffusive motion with memory-driven relocations to previously visited positions. For both of them we show how the overall transport regime may be actively modified by the details of the relocation mechanism.

Statistical Mechanics

Experiments in Inferring Social Networks of Diffusion

79 - Daniel Campos , Zoe Konrad 2019

Information diffusion is a fundamental process that takes place over networks. While it is rarely realistic to observe the individual transmissions of the information diffusion process, it is typically possible to observe when individuals first publish the information. We look specifically at the previously published NETINF algorithm, which probabilistically identifies the optimal network that best explains the observed infection times. We explore how the algorithm could perform on a range of intrinsically different social and information network topologies, from news blogs and websites to Twitter to Reddit.

Social and Information Networks, Information Theory

Reconstruction of the magnetic field for a Schrodinger operator in a cylindrical setting

96 - Daniel Campos 2019

In this thesis we consider a magnetic Schrodinger inverse problem over a compact domain contained in an infinite cylindrical manifold. We show that, under certain conditions on the electromagnetic potentials, we can recover the magnetic field from boundary measurements in a constructive way. A fundamental tool for this procedure is a global Carleman estimate for the magnetic Schrodinger operator. We prove this by conjugating the magnetic operator essentially into the Laplacian, and using the Carleman estimates for it proven by Kenig-Salo-Uhlmann in the anisotropic setting, see [KSU11a]. The conjugation is achieved through pseudodifferential operators over the cylinder, for which we develop the necessary results. The main motivations to attempt this question are the following results concerning the magnetic Schrodinger operator: first, the solution to the uniqueness problem in the cylindrical setting in [DSFKSU09], and, second, the reconstruction algorithm in the Euclidean setting from [Sal06]. We will also borrow ideas from the reconstruction of the electric potential in the cylindrical setting from [KSU11b]. These two new results answer partially the Carleman estimate problem (Question 4.3.) proposed in [Sal13] and the reconstruction for the magnetic Schrodinger operator mentioned in the introduction of [KSU11b]. To our knowledge, these are the first global Carleman estimates and reconstruction procedure for the magnetic Schrodinger operator available in the cylindrical setting.

Analysis of PDEs
