GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Although topic labels have been introduced, the majority of GitHub repositories do not have any, which impedes search and topic-based analysis. This work formulates automatic repository classification as \textit{keyword-driven hierarchical classification}. Specifically, users only need to provide a label hierarchy with keywords as supervision. This setting is flexible, adapts to users' needs, accounts for the different granularities of topic labels, and requires minimal human effort. We identify three key challenges of this problem, namely (1) the presence of multi-modal signals; (2) supervision scarcity and bias; (3) supervision format mismatch. To address these challenges, we propose the \textsc{HiGitClass} framework, comprising three modules: heterogeneous information network embedding; keyword enrichment; topic modeling and pseudo document generation. Experimental results on two GitHub repository collections confirm that \textsc{HiGitClass} is superior to existing weakly-supervised and dataless hierarchical classification methods, especially in its ability to integrate both structured and unstructured data for repository classification.
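To make the supervision format concrete, the sketch below shows one possible way to write down a label hierarchy with keywords and to use it in a naive top-down keyword match. It is not the HiGitClass method: the label names, the keywords, the `HIERARCHY` structure, and the `classify` function are all hypothetical and serve only to illustrate what "a label hierarchy with keywords" might look like as input.

```python
# Hypothetical label hierarchy: every label and keyword here is illustrative only.
HIERARCHY = {
    "machine-learning": {
        "keywords": ["learning", "model"],
        "children": {
            "computer-vision": {"keywords": ["image", "vision"], "children": {}},
            "nlp": {"keywords": ["text", "language"], "children": {}},
        },
    },
    "bioinformatics": {
        "keywords": ["genome", "protein"],
        "children": {},
    },
}


def classify(text, nodes):
    """Greedy top-down keyword matching (a naive baseline, not HiGitClass).

    At each level, pick the label whose keywords occur most often in the
    text, then descend into its children. Stop when no keyword matches.
    """
    tokens = text.lower().split()
    path = []
    while nodes:
        # Count keyword occurrences for every label at the current level.
        scores = {
            name: sum(tokens.count(keyword) for keyword in node["keywords"])
            for name, node in nodes.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] == 0:
            break  # no keyword evidence at this level
        path.append(best)
        nodes = nodes[best]["children"]
    return path
```

Calling `classify(description, HIERARCHY)` returns the list of labels from the root level downward, or an empty list when no keyword occurs in the description.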
Software development is becoming increasingly open and collaborative with the advent of platforms such as GitHub. Given its crucial role, there is a need to better understand and model the dynamics of GitHub as a social platform. Previous work has mostly considered the dynamics of traditional social networking sites like Twitter and Facebook. We propose GitEvolve, a system to predict the evolution of GitHub repositories and the different ways in which users interact with them. To this end, we develop an end-to-end multi-task sequential deep neural network that, given some seed events, simultaneously predicts which user group will interact with a given repository next, what the type of the interaction is, and when it happens. To facilitate learning, we use graph-based representation learning to encode relationships between repositories. We map users to groups by modelling common interests to better predict popularity and to generalize to unseen users during inference. We introduce an artificial event type to better model the varying levels of activity of repositories in the dataset. The proposed multi-task architecture is generic and can be extended to model information diffusion in other social networks. In a series of experiments, we demonstrate the effectiveness of the proposed model using multiple metrics and baselines. Qualitative analysis of the model's ability to predict popularity and forecast trends demonstrates its applicability.
Institutional repositories (IR) have become established in scientific and academic institutions, as shown by the existing directories of open access repositories and by the daily deposits of articles made in different ways, such as self-archiving by registered users and cataloging by librarians. IR systems are based on various conceptual models, so this paper presents a bibliographic survey of Model-Driven Development (MDD) in systems and applications for IR in order to expose the benefits of applying MDD to IR. MDD is a paradigm for building software that assigns models a central and active role: models are derived from the most abstract to the most concrete through successive transformations. This paradigm provides a framework that allows interested parties to share their views and directly manipulate representations of the entities of the domain. The benefits are therefore grouped by the actors involved, namely developers, business owners, and domain experts. In conclusion, these benefits help make software implementations more formal, resulting in a consolidation of such systems, where the main beneficiaries are the end users, through the services offered to them.
This report is a high-level summary analysis of the 2017 GitHub Open Source Survey dataset, presenting frequency counts, proportions, and frequency or proportion bar plots for every question asked in the survey.
When a group of people strives to understand new information, struggle ensues as various ideas compete for attention. Steep learning curves are surmounted as teams learn together. To understand how these team dynamics play out in software development, we explore Git logs, which provide a complete change history of software repositories. In these repositories, we observe code additions, which represent successfully implemented ideas, and code deletions, which represent ideas that have failed or been superseded. By examining the patterns between these commit types, we can begin to understand how teams adopt new information. We specifically study what happens after a software library is adopted by a project, i.e., when a library is used for the first time in the project. We find that a variety of factors, including team size, library popularity, and prevalence on Stack Overflow are associated with how quickly teams learn and successfully adopt new software libraries.
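As a rough illustration of how additions and deletions can be read from a Git log, the sketch below sums added and deleted lines per commit from the text produced by `git log --numstat --format=%H`. It is a minimal sketch under stated assumptions, not the authors' analysis pipeline: the function name `parse_numstat` and the sample log `SAMPLE_LOG` (including its commit identifiers and file paths) are hypothetical.

```python
def parse_numstat(log_text):
    """Sum (added, deleted) line counts per commit.

    Expects the text output of `git log --numstat --format=%H`:
    a commit hash line, followed by tab-separated
    "<added>\t<deleted>\t<path>" lines for each changed file.
    """
    totals = {}
    current = None
    for raw_line in log_text.splitlines():
        line = raw_line.rstrip()
        if not line:
            continue  # blank separator lines
        parts = line.split("\t")
        if len(parts) == 3:
            if current is None:
                continue  # stat line before any commit header; ignore
            added, deleted, _path = parts
            if added == "-" or deleted == "-":
                continue  # Git reports "-" for binary files
            totals[current][0] += int(added)
            totals[current][1] += int(deleted)
        else:
            # Any line without tabs is treated as a commit hash.
            current = line
            totals.setdefault(current, [0, 0])
    return {sha: (added, deleted) for sha, (added, deleted) in totals.items()}


# Hypothetical sample input: shortened hashes and made-up paths.
SAMPLE_LOG = (
    "a1b2c3\n\n"
    "10\t2\tsrc/app.py\n"
    "-\t-\tlogo.png\n\n"
    "d4e5f6\n\n"
    "0\t7\tsrc/old.py\n"
    "3\t1\tREADME.md\n"
)
```

In practice the input text could come from running the Git command in a repository and capturing its standard output; relating these per-commit counts to the first use of a library would require additional analysis not shown here.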
Deep neural networks have been shown to face a new threat called backdoor attacks, in which an adversary injects backdoors into a neural network model by poisoning the training dataset. When the input contains a special pattern called the backdoor trigger, the backdoored model carries out a malicious task specified by the adversary, such as misclassification. In text classification systems, backdoors inserted into the models can cause spam or malicious speech to escape detection. Previous work has mainly focused on defending against backdoor attacks in computer vision; little attention has been paid to defense methods against RNN backdoor attacks in text classification. In this paper, by analyzing the changes in inner LSTM neurons, we propose a defense method called Backdoor Keyword Identification (BKI) to mitigate backdoor attacks that an adversary performs against LSTM-based text classification by data poisoning. This method can identify poisoning samples crafted to insert a backdoor into the model and exclude them from the training data, without requiring a verified and trusted dataset. We evaluate our method on four text classification datasets: IMDB, DBpedia ontology, 20 newsgroups, and Reuters-21578. It achieves good performance on all of them regardless of the trigger sentences.