While programming is one of the most broadly applicable skills in modern society, today's machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it is difficult to assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we evaluate models by checking their generated code against test cases. Our benchmark includes 10,000 problems, ranging from simple one-line solutions to substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors decreases exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.
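To make the evaluation protocol concrete, here is a minimal sketch of this style of judging: run a candidate program against (input, expected output) pairs and report the fraction passed. The helper names (run_candidate, pass_rate) and the sample problem are illustrative assumptions, not part of the official benchmark harness.

import subprocess
import sys

def run_candidate(source, stdin_text, timeout=4.0):
    """Run a candidate Python program on one input and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()

def pass_rate(source, test_cases):
    """Fraction of (input, expected output) pairs the candidate passes."""
    passed = 0
    for stdin_text, expected in test_cases:
        try:
            if run_candidate(source, stdin_text) == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # a hung candidate counts as a failed test
    return passed / len(test_cases)

# Illustrative problem: read two integers from stdin and print their sum.
candidate = "a, b = map(int, input().split())\nprint(a + b)"
tests = [("1 2", "3"), ("10 -4", "6")]
print(pass_rate(candidate, tests))  # 1.0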
Automated unit test case generation tools facilitate test-driven development and support developers by suggesting tests intended to identify flaws in their code. Existing approaches are usually guided by test coverage criteria, generating synthetic test cases that are often difficult for developers to read and understand.
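As a toy illustration of coverage-guided generation (not any particular tool such as EvoSuite, which uses search-based techniques and bytecode instrumentation), the sketch below keeps a random input only when it exercises a branch no earlier input has covered; the hand-instrumented function is purely for demonstration.

import random

covered = set()  # branch ids reached so far, across all executed inputs

def classify(n):
    """Function under test, instrumented by hand with branch markers."""
    if n < 0:
        covered.add("negative")
        return "negative"
    if n % 2 == 0:
        covered.add("even")
        return "even"
    covered.add("odd")
    return "odd"

suite = []
for _ in range(1000):
    n = random.randint(-100, 100)
    before = len(covered)
    classify(n)
    if len(covered) > before:  # this input reached a new branch: keep it
        suite.append(n)

print(f"{len(suite)} kept inputs cover branches {sorted(covered)}")

Note that the kept inputs are arbitrary magic numbers with no descriptive intent, which is exactly the readability problem the abstract points at.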
Statistical language models are powerful tools that have been used for many tasks within natural language processing. Recently, they have also been applied to other sequential data, such as source code. Ray et al. (2015) showed that it is possible to train an n-gram language model on source code and that such models find buggy code less "natural" (higher entropy) than correct code.
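A minimal sketch of the idea, assuming a deliberately simplistic tokenizer and two-line training corpus: train a bigram model over code tokens and score snippets by cross-entropy, where higher values mean the code looks less "natural" to the model.

import math
import re
from collections import Counter

TOKEN = re.compile(r"\w+|[^\w\s]")

def tokens(code):
    return ["<s>"] + TOKEN.findall(code) + ["</s>"]

# Tiny illustrative training corpus of code lines.
corpus = [
    "for i in range(n): total += x[i]",
    "for j in range(m): total += y[j]",
]

unigrams = Counter()
bigrams = Counter()
for line in corpus:
    toks = tokens(line)
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
vocab = len(unigrams)

def cross_entropy(code):
    """Average negative log2-probability per token, with Laplace smoothing.
    Higher values mean the snippet looks less 'natural' to the model."""
    toks = tokens(code)
    total = 0.0
    for prev, cur in zip(toks, toks[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        total -= math.log2(p)
    return total / (len(toks) - 1)

print(cross_entropy("for k in range(n): total += x[k]"))  # familiar style: low
print(cross_entropy("while foo do begin end"))            # unfamiliar style: high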
The process of developing a mobile application typically starts with the ideation and conceptualization of its user interface. This concept is then translated into a set of mock-ups to help determine how well the user interface embodies the intended features of the app.
Mobile apps are now ubiquitous. Before developing a new app, the development team usually expends painstaking effort reviewing many existing apps with similar purposes. This review process is crucial in that it reduces market risk and provides inspiration for the new app's design.
Validation of Android apps via testing is difficult owing to the presence of flaky tests. Due to non-deterministic execution environments, a sequence of events (a test) may lead to success or failure in unpredictable ways. In this work, we present an approach for detecting flaky tests through systematic exploration of event orders.
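As a toy illustration of why event order makes tests flaky, the sketch below races two events and applies the simplest detection strategy: re-run the test many times and flag it if outcomes ever disagree. This is not the paper's tool, which explores event orders systematically rather than relying on scheduler randomness as this sketch does.

import random
import threading
import time

def racy_test():
    """Passes or fails depending on which UI event the scheduler runs first."""
    events = []
    def fire(name):
        time.sleep(random.uniform(0, 0.001))  # non-deterministic scheduling
        events.append(name)
    threads = [threading.Thread(target=fire, args=(n,)) for n in ("load", "click")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events[0] == "load"  # silently assumes "load" always happens first

def looks_flaky(test, runs=200):
    """Naive detector: flag the test if repeated runs ever disagree."""
    outcomes = {test() for _ in range(runs)}
    return len(outcomes) > 1

print("flaky" if looks_flaky(racy_test) else "no divergence observed")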