Community detection techniques are widely used to infer hidden structures within interconnected systems. Despite demonstrating high accuracy on benchmarks, for many real-world systems they reproduce the external classification with a significant level of discrepancy. A widely accepted reason for this outcome is the unavoidable loss of non-topological information (such as node attributes) when the original complex system is represented as a network. In this article we emphasize that the observed discrepancies may also have a different cause: the external classification itself. To this end, we use scientific publication data which (i) exhibit a well-defined modular structure and (ii) come with an expert-made classification of research articles. Having represented the articles and the extracted scientific concepts both as a bipartite network and as its unipartite projection, we applied modularity optimization to uncover the inner thematic structure. The resulting clusters are shown to partly reflect the author-made classification, although some significant discrepancies are observed. A detailed analysis of these discrepancies shows that they carry essential information about the system, mainly related to the use of similar techniques and methods across different (sub)disciplines, which is omitted when only the external classification is considered.
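A minimal sketch of this kind of pipeline, assuming the article-concept pairs are available as a simple edge list (the data, node names, and choice of the greedy modularity algorithm from networkx are illustrative assumptions, not the authors' exact setup):

```python
# Bipartite article-concept network, its unipartite projection onto articles,
# and modularity optimization on the projection. Toy edge list only.
import networkx as nx
from networkx.algorithms import bipartite, community

edges = [
    ("art1", "modularity"), ("art1", "community detection"),
    ("art2", "community detection"), ("art2", "random graphs"),
    ("art3", "random graphs"), ("art3", "percolation"),
]

B = nx.Graph()
B.add_nodes_from({a for a, _ in edges}, bipartite=0)   # articles
B.add_nodes_from({c for _, c in edges}, bipartite=1)   # concepts
B.add_edges_from(edges)

# Project onto the article side: articles become linked by shared concepts
articles = {n for n, d in B.nodes(data=True) if d["bipartite"] == 0}
P = bipartite.weighted_projected_graph(B, articles)

# Greedy modularity optimization on the weighted projection
clusters = community.greedy_modularity_communities(P, weight="weight")
for i, c in enumerate(clusters):
    print(i, sorted(c))
```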
Novelty is an inherent part of innovations and discoveries. Such processes may be viewed as the appearance of new ideas or as the emergence of atypical connections between existing ones. The importance of such connections suggests investigating innovations through a network or graph representation in the space of ideas. In this representation, a graph node corresponds to a relevant concept (idea), whereas an edge between two nodes means that the corresponding concepts have been used in a common context. In this study we address whether it is possible to identify the edges between existing concepts along which innovations may emerge. To this end, we use the well-documented scientific knowledge landscape of 1.2M arXiv.org manuscripts dated from April 2007 to September 2019, for which we extract the relevant concepts using the ScienceWISE.info platform. Combining approaches developed in complex network science and graph embedding, we discuss the predictability of edges (links) on the scientific knowledge landscape where innovations may appear.
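For illustration only, the sketch below scores non-existing concept pairs on a toy co-occurrence graph with a classical similarity index (Adamic-Adar) from networkx; the study itself combines such network-science indices with graph embeddings, which are not shown here:

```python
# Toy link-prediction sketch on a concept co-occurrence graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("neural network", "deep learning"),
    ("deep learning", "image classification"),
    ("neural network", "optimization"),
    ("optimization", "gradient descent"),
])

# High scores flag concept pairs where new combinations of ideas
# (potential innovations) are more likely to appear.
candidates = list(nx.non_edges(G))
scores = sorted(nx.adamic_adar_index(G, candidates), key=lambda t: -t[2])
for u, v, s in scores[:5]:
    print(f"{u} -- {v}: {s:.3f}")
```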
This article presents a study that compares detected structural communities in a coauthorship network to the socioacademic characteristics of the scholars that compose the network. The coauthorship network was created from the bibliographic record of a multi-institution, interdisciplinary research group focused on the study of sensor networks and wireless communication. Four different community detection algorithms were employed to assign a structural community to each scholar in the network: leading eigenvector, walktrap, edge betweenness, and spinglass. Socioacademic characteristics were gathered from the scholars and include information such as their academic department, academic affiliation, country of origin, and academic position. A Pearson's $\chi^2$ test with a Monte Carlo-simulated $p$-value revealed that structural communities best represent groupings of individuals working in the same academic department and at the same institution. A generalization of this result suggests that, even in interdisciplinary, multi-institutional research groups, coauthorship is primarily driven by departmental and institutional affiliation.
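A hedged sketch of the two ingredients named above, using python-igraph for the four community detection algorithms and a plain Pearson's chi-squared test of association between detected communities and a categorical attribute; the network and department labels are placeholders, and the Monte Carlo simulation of the p-value used in the paper is not reproduced here:

```python
import igraph as ig
import numpy as np
from scipy.stats import chi2_contingency

g = ig.Graph.Famous("Zachary")                               # stand-in coauthorship network
dept = np.random.default_rng(0).integers(0, 3, g.vcount())   # fake department labels

partitions = {
    "leading eigenvector": g.community_leading_eigenvector().membership,
    "walktrap":            g.community_walktrap().as_clustering().membership,
    "edge betweenness":    g.community_edge_betweenness().as_clustering().membership,
    "spinglass":           g.community_spinglass().membership,
}

for name, memb in partitions.items():
    # Contingency table: detected communities x departments
    k, d = max(memb) + 1, dept.max() + 1
    table = np.zeros((k, d), dtype=int)
    for c, a in zip(memb, dept):
        table[c, a] += 1
    chi2, p, _, _ = chi2_contingency(table)
    print(f"{name}: chi2={chi2:.2f}, p={p:.3f}")
```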
Home detection, assigning a phone device to its home antenna, is a ubiquitous part of most studies in the literature on mobile phone data. Despite its widespread use, home detection relies on a few assumptions that are difficult to check without ground truth, i.e., knowing where the individual that owns the device resides. In this paper, we provide an unprecedented evaluation of the accuracy of home detection algorithms on a group of sixty-five participants for whom we know their exact home address and the antennas that might serve them. In addition, we analyze not only Call Detail Records (CDRs) but also two other mobile phone streams: eXtended Detail Records (XDRs, the ``data channel'') and Control Plane Records (CPRs, the network stream). These data streams differ not only in their temporal granularity but also in their data-generation mechanism: for example, CDRs are purely human-triggered events, while CPRs are purely machine-triggered. Finally, we quantify the amount of data needed by each stream to carry out successful home detection. We find that the choice of stream and algorithm heavily influences home detection, with an hour-of-day algorithm on the XDRs performing best in terms of accuracy, and with the CPRs performing best in terms of the amount of data needed. Our work is useful for researchers and practitioners who want to minimize data requests while maximizing the accuracy of home antenna location.
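As a rough illustration, one common variant of an hour-of-day rule assigns as home the antenna most used during nighttime hours. The column names, the night window (22:00-07:00), and the records below are assumptions for the sketch, not the exact algorithm evaluated in the paper:

```python
import pandas as pd

records = pd.DataFrame({
    "user":    ["u1", "u1", "u1", "u1", "u2", "u2"],
    "antenna": ["A",  "B",  "A",  "A",  "C",  "C"],
    "time": pd.to_datetime([
        "2019-05-01 02:10", "2019-05-01 14:30", "2019-05-02 23:40",
        "2019-05-03 01:05", "2019-05-01 22:15", "2019-05-02 03:50",
    ]),
})

# Keep only nighttime records, then pick each user's most-used antenna
night = records[(records.time.dt.hour >= 22) | (records.time.dt.hour < 7)]
home = (night.groupby(["user", "antenna"]).size()
             .groupby("user").idxmax()
             .apply(lambda pair: pair[1]))
print(home)
```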
The characteristics of the $h$-index in the field of condensed matter physics are studied using high-quality data from ResearcherID. The results are examined in terms of theoretical descriptions of the overall dependence of the $h$-index on a researcher's total number of published papers and total number of citations. In particular, the models by Hirsch, by Egghe and Rousseau, as well as by Glänzel and Schubert are examined. Special emphasis is placed on the deviations from such statistical descriptions, and it is argued that the deviation of a particular researcher's $h$ value from the Egghe-Rousseau model's prediction can be used as a supplementary measure of impact. A corresponding analysis with similar results is performed using the multi-author $h_m$-index.
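For reference, the sketch below computes the $h$-index from a list of per-paper citation counts (its definition) and the implied coefficient of Hirsch's statistical relation $N_c \approx a\,h^2$, with $a$ treated as an empirical constant; the citation counts are made up:

```python
def h_index(citations):
    # h = largest h such that at least h papers have >= h citations each
    cits = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(cits, start=1) if c >= rank)

citations = [45, 30, 22, 18, 12, 9, 7, 5, 3, 2, 1, 0]
h = h_index(citations)
total = sum(citations)
print(f"h = {h}, total citations = {total}")
print(f"implied Hirsch coefficient a = N_c / h^2 = {total / h**2:.2f}")
```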
NLP interpretability aims to increase trust in model predictions, which makes evaluating interpretability approaches a pressing issue. There are multiple datasets for evaluating NLP interpretability, but their dependence on human-provided ground truths raises questions about their unbiasedness. In this work, we take a different approach and formulate a specific classification task by repurposing question-answering datasets. For this custom classification task, the interpretability ground truth arises directly from the definition of the classification problem. We use this method to propose a benchmark and lay the groundwork for future research in NLP interpretability by evaluating a wide range of current state-of-the-art methods.
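The following toy sketch only illustrates the general idea of deriving an interpretability ground truth from the task definition: a (question, context) pair is labelled by whether the context contains the answer, and the answer tokens then form the rationale a faithful explanation should highlight. The data and labelling scheme are illustrative assumptions, not the benchmark's actual construction:

```python
qa_items = [
    {"question": "Who wrote Hamlet?",
     "context": "Hamlet is a tragedy written by William Shakespeare around 1600.",
     "answer": "William Shakespeare"},
    {"question": "Who wrote Hamlet?",
     "context": "The Eiffel Tower was completed in 1889 in Paris.",
     "answer": "William Shakespeare"},
]

def to_classification(item):
    tokens = item["context"].split()
    label = int(item["answer"] in item["context"])        # 1 = answer present
    answer_tokens = set(item["answer"].split())
    # Token-level rationale: marks the answer span the explanation should find
    rationale = [int(t.strip(".,") in answer_tokens) for t in tokens]
    return {"tokens": tokens, "label": label, "rationale": rationale}

for ex in map(to_classification, qa_items):
    print(ex["label"], list(zip(ex["tokens"], ex["rationale"])))
```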