MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

Published by: Kunhao Zheng

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Kunhao Zheng, Jesse Michael Han, Stanislas Polu

We present miniF2F, a dataset of formal Olympiad-level mathematics problem statements intended to provide a unified cross-system benchmark for neural theorem proving. The miniF2F benchmark currently targets Metamath, Lean, and Isabelle and consists of 488 problem statements drawn from the AIME, AMC, and the International Mathematical Olympiad (IMO), as well as material from high-school and undergraduate mathematics courses. We report baseline results using GPT-f, a neural theorem prover based on GPT-3, and provide an analysis of its performance. We intend for miniF2F to be a community-driven effort and hope that our benchmark will help spur advances in neural theorem proving.
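To give a sense of what a "formal problem statement" means here, the following is a minimal sketch written in Lean 4 with Mathlib. Both the Lean version and the library are assumptions; the abstract says only that Lean is one of the targeted systems. The theorem name `illustrative_linear_eq` and the problem itself are hypothetical and are not taken from the miniF2F dataset.

```lean
import Mathlib

-- Hypothetical illustration only: NOT a problem from miniF2F.
-- Informal statement: "If 2x + 3 = 11 for a real number x, then x = 4."
-- A benchmark entry would supply the statement; a prover must supply the proof.
theorem illustrative_linear_eq (x : ℝ) (h : 2 * x + 3 = 11) : x = 4 := by
  -- Linear arithmetic over the reals closes the goal from hypothesis h.
  linarith
```

In a benchmark of this kind, only the statement (everything before `:=`) would be given, and the prover's task is to produce a proof that the system accepts.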

Weili Nie, Zhiding Yu, Lei Mao (2020)

Humans have an inherent ability to learn novel concepts from only a few samples and generalize these concepts to different situations. Even though today's machine learning models excel with a plethora of training data on standard recognition tasks, a considerable gap exists between machine-level pattern recognition and human-level concept learning. To narrow this gap, the Bongard problems (BPs) were introduced as an inspirational challenge for visual cognition in intelligent systems. Despite new advances in representation learning and learning to learn, BPs remain a daunting challenge for modern AI. Inspired by the original one hundred BPs, we propose a new benchmark Bongard-LOGO for human-level concept learning and reasoning. We develop a program-guided generation technique to produce a large set of human-interpretable visual cognition problems in action-oriented LOGO language. Our benchmark captures three core properties of human cognition: 1) context-dependent perception, in which the same object may have disparate interpretations given different contexts; 2) analogy-making perception, in which some meaningful concepts are traded off for other meaningful concepts; and 3) perception with a few samples but infinite vocabulary. In experiments, we show that the state-of-the-art deep learning methods perform substantially worse than human subjects, implying that they fail to capture core human cognition properties. Finally, we discuss research directions towards a general architecture for visual reasoning to tackle this benchmark.

Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning

HyFlex: A Benchmark Framework for Cross-domain Heuristic Search

Edmund Burke, Tim Curtois, Matthew Hyde (2011)

Automating the design of heuristic search methods is an active research field within computer science, artificial intelligence and operational research. In order to make these methods more generally applicable, it is important to eliminate or reduce the role of the human expert in the process of designing an effective methodology to solve a given computational search problem. Researchers developing such methodologies are often constrained in the number of problem domains on which they can test their adaptive, self-configuring algorithms, which can be explained by the inherent difficulty of implementing the corresponding domain-specific software components. This paper presents HyFlex, a software framework for the development of cross-domain search methodologies. The framework features a common software interface for dealing with different combinatorial optimisation problems, and provides the algorithm components that are problem-specific. In this way, the algorithm designer does not require detailed knowledge of the problem domains, and thus can concentrate on designing adaptive general-purpose heuristic search algorithms. Four hard combinatorial problems are fully implemented (maximum satisfiability, one-dimensional bin packing, permutation flow shop and personnel scheduling), each containing a varied set of instance data (including real-world industrial applications) and an extensive set of problem-specific heuristics and search operators. The framework forms the basis for the first International Cross-domain Heuristic Search Challenge (CHeSC), and it is currently in use by the international research community. In summary, HyFlex represents a valuable new benchmark of heuristic search generality, with which adaptive cross-domain algorithms are being easily developed and reliably compared.

Artificial Intelligence

AGENT: A Benchmark for Core Psychological Reasoning

Tianmin Shu, Abhishek Bhandwaldar, Chuang Gan (2021)

For machine agents to successfully interact with humans in real-world settings, they will need to develop an understanding of human mental life. Intuitive psychology, the ability to reason about hidden mental variables that drive observable actions, comes naturally to people: even pre-verbal infants can tell agents from objects, expecting agents to act efficiently to achieve goals given constraints. Despite recent interest in machine agents that reason about other agents, it is not clear if such agents learn or hold the core psychology principles that drive human reasoning. Inspired by cognitive development studies on intuitive psychology, we present a benchmark consisting of a large dataset of procedurally generated 3D animations, AGENT (Action, Goal, Efficiency, coNstraint, uTility), structured around four scenarios (goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs) that probe key concepts of core intuitive psychology. We validate AGENT with human ratings, propose an evaluation protocol emphasizing generalization, and compare two strong baselines built on Bayesian inverse planning and a Theory of Mind neural network. Our results suggest that to pass the designed tests of core intuitive psychology at human levels, a model must acquire or have built-in representations of how agents plan, combining utility computations and core knowledge of objects and physics.

Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning

A Unifying Framework for Formal Theories of Novelty: Framework, Examples and Discussion

T. E. Boult, P. A. Grabowicz, D. S. Prijatelj (2020)

Managing inputs that are novel, unknown, or out-of-distribution is critical as an agent moves from the lab to the open world. Novelty-related problems include being tolerant to novel perturbations of the normal input, detecting when the input includes novel items, and adapting to novel inputs. While significant research has been undertaken in these areas, a noticeable gap exists in the lack of a formalized definition of novelty that transcends problem domains. As a team of researchers spanning multiple research groups and different domains, we have seen, first hand, the difficulties that arise from ill-specified novelty problems, as well as inconsistent definitions and terminology. Therefore, we present the first unified framework for formal theories of novelty and use the framework to formally define a family of novelty types. Our framework can be applied across a wide range of domains, from symbolic AI to reinforcement learning, and beyond to open world image recognition. Thus, it can be used to help kick-start new research efforts and accelerate ongoing work on these important novelty-related problems. This extended version of our AAAI 2021 paper includes more details and examples in multiple domains.

Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning

Use of L-system mathematics for making new subfamily members of olfactory receptor full length genes, OR1D2, OR1D4 and OR1D5

Sk. Sarif Hassan, Pabitra Pal Choudhury, Amita Pal (2009)

Ligands for only two human olfactory receptors are known. One of them, OR1D2, binds to Bourgeonal [Malnic B, Godfrey P-A, Buck L-B (2004) The human olfactory receptor gene family. Proc. Natl. Acad. Sci. U. S. A. 101: 2584-2589 and Erratum in: Proc. Natl. Acad. Sci. U. S. A. (2004) 101: 7205]. OR1D2, OR1D4 and OR1D5 are three full-length olfactory receptors present in an olfactory locus in the human genome. These receptors are more than 80% identical in DNA sequence and have 108 base pair mismatches among them. We have used L-system mathematics and have been able to show a closely related subfamily of OR1D2, OR1D4 and OR1D5.
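The abstract does not state which L-system rules the authors used, so the following is only a generic sketch of how an L-system (a parallel string-rewriting system) works. It uses Lindenmayer's classic two-symbol "algae" example rather than anything specific to DNA sequences; the names `rewrite` and `ALGAE_RULES` are hypothetical and chosen for illustration.

```python
def rewrite(axiom: str, rules: dict[str, str], steps: int) -> str:
    """Apply the production rules to every symbol in parallel, `steps` times.

    Symbols that have no rule are copied through unchanged.
    """
    word = axiom
    for _ in range(steps):
        word = "".join(rules.get(symbol, symbol) for symbol in word)
    return word


# Classic textbook rules (A -> AB, B -> A); NOT the rules from the paper.
ALGAE_RULES = {"A": "AB", "B": "A"}

if __name__ == "__main__":
    # Print the first few generations starting from the axiom "A".
    for n in range(5):
        print(n, rewrite("A", ALGAE_RULES, n))
```

Starting from the axiom `A`, the successive words have lengths 1, 2, 3, 5, 8, which shows how a small rule set can generate structured, growing sequences. Applying the idea to nucleotide sequences would need an alphabet and rules that the abstract does not give.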

Computer Science, Formal Languages and Automata Theory