No Arabic abstract
We introduce GraSPy, a Python library devoted to statistical inference, machine learning, and visualization of random graphs and graph populations. This package provides flexible and easy-to-use algorithms for analyzing and understanding graphs with a scikit-learn compliant API. GraSPy can be downloaded from Python Package Index (PyPi), and is released under the Apache 2.0 open-source license. The documentation and all releases are available at https://neurodata.io/graspy.
We present TurbuStat (v1.0): a Python package for computing turbulence statistics in spectral-line data cubes. TurbuStat includes implementations of fourteen methods for recovering turbulent properties from observational data. Additional features of the software include: distance metrics for comparing two data sets; a segmented linear model for fitting lines with a break-point; a two-dimensional elliptical power-law model; multi-core fast-fourier-transform support; a suite for producing simulated observations of fractional Brownian Motion fields, including two-dimensional images and optically-thin HI data cubes; and functions for creating realistic world coordinate system information for synthetic observations. This paper summarizes the TurbuStat package and provides representative examples using several different methods. TurbuStat is an open-source package and we welcome community feedback and contributions.
OpenML is an online platform for open science collaboration in machine learning, used to share datasets and results of machine learning experiments. In this paper we introduce OpenML-Python, a client API for Python, opening up the OpenML platform for a wide range of Python-based tools. It provides easy access to all datasets, tasks and experiments on OpenML from within Python. It also provides functionality to conduct machine learning experiments, upload the results to OpenML, and reproduce results which are stored on OpenML. Furthermore, it comes with a scikit-learn plugin and a plugin mechanism to easily integrate other machine learning libraries written in Python into the OpenML ecosystem. Source code and documentation is available at https://github.com/openml/openml-python/.
Supervised machine learning methods usually require a large set of labeled examples for model training. However, in many real applications, there are plentiful unlabeled data but limited labeled data; and the acquisition of labels is costly. Active learning (AL) reduces the labeling cost by iteratively selecting the most valuable data to query their labels from the annotator. This article introduces a Python toobox ALiPy for active learning. ALiPy provides a module based implementation of active learning framework, which allows users to conveniently evaluate, compare and analyze the performance of active learning methods. In the toolbox, multiple options are available for each component of the learning framework, including data process, active selection, label query, results visualization, etc. In addition to the implementations of more than 20 state-of-the-art active learning algorithms, ALiPy also supports users to easily configure and implement their own approaches under different active learning settings, such as AL for multi-label data, AL with noisy annotators, AL with different costs and so on. The toolbox is well-documented and open-source on Github, and can be easily installed through PyPI.
Collecting statistic from graph-based data is an increasingly studied topic in the data mining community. We argue that these statistics have great value as well in dynamic IoT contexts: they can support complex computational activities involving distributed coordination and provision of situation recognition. We show that the HyperANF algorithm for calculating the neighbourhood function of vertices of a graph naturally allows for a fully distributed and asynchronous implementation, thanks to a mapping to the field calculus, a distribution model proposed for collective adaptive systems. This mapping gives evidence that the field calculus framework is well-suited to accommodate massively parallel computations over graphs. Furthermore, it provides a new self-stabilising building block which can be used in aggregate computing in several contexts, there including improved leader election or network vulnerabilities detection.
Many have argued that statistics students need additional facility to express statistical computations. By introducing students to commonplace tools for data management, visualization, and reproducible analysis in data science and applying these to real-world scenarios, we prepare them to think statistically. In an era of increasingly big data, it is imperative that students develop data-related capacities, beginning with the introductory course. We believe that the integration of these precursors to data science into our curricula-early and often-will help statisticians be part of the dialogue regarding Big Data and Big Questions.