In this study we focus on the prediction of basketball games in the Euroleague competition using machine learning modelling. The prediction is a binary classification problem: predicting whether a match finishes 1 (home win) or 2 (away win). Data are collected from the Euroleague's official website for the seasons 2016-2017, 2017-2018 and 2018-2019, i.e. the new-format era. Features are extracted from the match data and off-the-shelf supervised machine learning techniques are applied. We calibrate and validate our models. We find that simple machine learning models achieve an accuracy no greater than 67% on the test set, worse than some sophisticated benchmark models. A further contribution of this study concerns the wisdom of the basketball crowd: we demonstrate how the predictive power of a collective group of basketball enthusiasts can outperform the machine learning models discussed in this study. We argue why the accuracy level of this group of experts should be set as the benchmark for future studies on the prediction of (European) basketball games using machine learning.
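As a hedged illustration of such an off-the-shelf pipeline (not the authors' exact models; the file name, feature columns, and classifier below are hypothetical), a calibrated binary home/away-win classifier could be sketched as follows:

```python
# Minimal sketch of a binary home-win / away-win classifier on engineered
# match features. Column names and the CSV file are hypothetical; the
# original study's exact features and models are not reproduced here.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score

matches = pd.read_csv("euroleague_matches.csv")            # hypothetical file
features = ["home_form", "away_form", "home_off_rating",   # hypothetical features
            "away_off_rating", "rest_days_diff"]
X, y = matches[features], matches["home_win"]               # hypothetical binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False)                    # keep chronological order

# Calibrate the classifier's probabilities, then validate on held-out matches.
clf = CalibratedClassifierCV(LogisticRegression(max_iter=1000), cv=5)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```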
Learning kinetic systems from data is one of the core challenges in many fields. Identifying stable models is essential for the generalization capabilities of data-driven inference. We introduce a computationally efficient framework, called CausalKinetiX, that identifies structure from discrete-time, noisy observations generated from heterogeneous experiments. The algorithm assumes the existence of an underlying, invariant kinetic model, a key criterion for reproducible research. Results on both simulated and real-world examples suggest that learning the structure of kinetic systems benefits from a causal perspective. The identified variables and models allow for a concise description of the dynamics across multiple experimental settings and can be used for prediction in unseen experiments. We observe significant improvements compared to well-established approaches focusing solely on predictive performance, especially for out-of-sample generalization.
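As a rough sketch of the invariance idea only (this is not the CausalKinetiX algorithm itself; the data layout and the linear, least-squares scoring are simplifying assumptions), candidate predictor sets for a target derivative can be ranked by how well one pooled model fits every experiment:

```python
# Toy sketch of the invariance principle behind structure learning across
# heterogeneous experiments: a candidate set of predictor variables is ranked
# by how well a *single* pooled model for the target derivative fits *every*
# experiment, i.e. by its worst-case per-experiment residual.
import itertools
import numpy as np

def worst_case_score(X_per_env, dy_per_env, subset):
    """Fit one least-squares model on pooled data, return worst per-env MSE."""
    X_all = np.vstack([X[:, subset] for X in X_per_env])
    dy_all = np.concatenate(dy_per_env)
    beta, *_ = np.linalg.lstsq(X_all, dy_all, rcond=None)
    return max(np.mean((dy - X[:, subset] @ beta) ** 2)
               for X, dy in zip(X_per_env, dy_per_env))

def rank_subsets(X_per_env, dy_per_env, d, max_size=2):
    """Return candidate variable subsets ordered from most to least invariant."""
    candidates = [s for k in range(1, max_size + 1)
                  for s in itertools.combinations(range(d), k)]
    return sorted(candidates,
                  key=lambda s: worst_case_score(X_per_env, dy_per_env, list(s)))
```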
One of the emerging trends in sports analytics is the growing use of player and ball tracking data. A parallel development is deep learning predictive approaches that use vast quantities of data with less reliance on feature engineering. This paper applies recurrent neural networks in the form of sequence modeling to predict whether a three-point shot is successful. The models are capable of learning the trajectory of a basketball without any knowledge of physics. For comparison, a baseline static machine learning model is also tested, using a full set of features, such as angle and velocity, in addition to the positional data. Using a dataset of over 20,000 three-pointers from NBA SportVu data, the models based solely on sequential positional data outperform a static, feature-rich machine learning model in predicting whether a three-point shot is successful. This suggests deep learning models may offer an improvement over traditional feature-based machine learning methods for tracking data.
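A minimal sketch of such a sequence model, assuming PyTorch and a hypothetical batch of ball-tracking trajectories of shape (batch, timesteps, 3); the architecture and hyperparameters below are illustrative, not the paper's:

```python
# Sketch of a sequence model that classifies a three-point attempt from raw
# ball-tracking positions alone. Input: trajectories of shape (batch, timesteps, 3)
# with x, y, z coordinates; output: probability the shot is made.
import torch
import torch.nn as nn

class ShotLSTM(nn.Module):
    def __init__(self, input_size=3, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, traj):                       # traj: (batch, timesteps, 3)
        _, (h_n, _) = self.lstm(traj)              # h_n: (1, batch, hidden_size)
        return torch.sigmoid(self.head(h_n[-1]))   # (batch, 1) probability of a make

model = ShotLSTM()
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

dummy_batch = torch.randn(32, 50, 3)               # 32 shots, 50 tracked frames each
dummy_labels = torch.randint(0, 2, (32, 1)).float()
loss = criterion(model(dummy_batch), dummy_labels)
loss.backward()
optimizer.step()
```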
This article deals with the enumeration of directed lattice walks on the integers with any finite set of steps, starting at a given altitude $j$ and ending at a given altitude $k$, with additional constraints such as, for example, to never attain altitude $0$ in-between. We first discuss the case of walks on the integers with steps $-h, \dots, -1, +1, \dots, +h$. The case $h=1$ is equivalent to the classical Dyck paths, for which many ways of getting explicit formulas involving Catalan-like numbers are known. The case $h=2$ corresponds to basketball walks, which we treat in full detail. Then we move on to the more general case of walks with any finite set of steps, also allowing some weights/probabilities associated with each step. We show how a method of wide applicability, the so-called kernel method, leads to explicit formulas for the number of walks of length $n$, for any $h$, in terms of nested sums of binomials. We finally relate some special cases to other combinatorial problems, or to problems arising in queuing theory.
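As a worked example for the classical case $h=1$ (this is the standard reflection-principle count for $\pm 1$ steps, not a formula quoted from the article): the number of walks of length $n$ from altitude $j$ to altitude $k$ (with $j,k \geq 1$) that never attain altitude $0$ is
$$W_n(j \to k) = \binom{n}{\tfrac{n+k-j}{2}} - \binom{n}{\tfrac{n+k+j}{2}},$$
whenever $n+k-j$ is even (and $0$ otherwise). For $j=k=1$ and $n=2m$ this reduces to the Catalan number $C_m = \frac{1}{m+1}\binom{2m}{m}$, recovering the Dyck-path count.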
Machine learning models, now commonly developed to screen, diagnose, or predict health conditions, are evaluated with a variety of performance metrics. An important first step in assessing the practical utility of a model is to evaluate its average performance over an entire population of interest. In many settings, it is also critical that the model makes good predictions within predefined subpopulations. For instance, showing that a model is fair or equitable requires evaluating the model's performance in different demographic subgroups. However, subpopulation performance metrics are typically computed using only data from that subgroup, resulting in higher-variance estimates for smaller groups. We devise a procedure to measure subpopulation performance that can be more sample-efficient than the typical subsample estimates. We propose using an evaluation model, a model that describes the conditional distribution of the predictive model score, to form model-based metric (MBM) estimates. Our procedure incorporates model checking and validation, and we propose a computationally efficient approximation of the traditional nonparametric bootstrap to form confidence intervals. We evaluate MBMs on two main tasks: a semi-synthetic setting where ground-truth metrics are available and a real-world hospital readmission prediction task. We find that MBMs consistently produce more accurate and lower-variance estimates of model performance for small subpopulations.
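As an illustrative sketch only, under one simplified reading of the model-based idea (this is not the authors' MBM procedure; the data are synthetic and the evaluation model is reduced to a logistic fit of the outcome given the score), a subgroup accuracy estimate that borrows strength from the pooled data can be contrasted with the plain subsample estimate:

```python
# Simplified model-based estimate of subgroup accuracy vs. the usual subsample
# estimate. This is one possible reading of borrowing strength via an evaluation
# model fitted on the pooled data; it is not the authors' MBM procedure.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
scores = rng.uniform(0, 1, n)                       # predictive model scores
labels = rng.binomial(1, scores)                    # outcomes (well calibrated here)
group = rng.binomial(1, 0.05, n).astype(bool)       # a small subpopulation (~5%)

# Subsample estimate: accuracy of thresholded scores using only subgroup labels.
preds = (scores >= 0.5).astype(int)
subsample_acc = np.mean(preds[group] == labels[group])

# Model-based estimate: fit P(label = 1 | score) on ALL data, then average the
# implied probability of a correct prediction over the subgroup's scores.
eval_model = LogisticRegression().fit(scores.reshape(-1, 1), labels)
p1 = eval_model.predict_proba(scores[group].reshape(-1, 1))[:, 1]
mbm_acc = np.mean(np.where(preds[group] == 1, p1, 1 - p1))

print(f"subsample estimate: {subsample_acc:.3f}, model-based estimate: {mbm_acc:.3f}")
```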
Human groups can perform extraordinarily accurate estimations compared to individuals by simply using the mean, median or geometric mean of the individual estimations [Galton 1907, Surowiecki 2005, Page 2008]. However, this is true only for some tasks, and in general these collective estimations show strong biases. The method also fails when social interactions are allowed, which makes the collective estimation worse, as individuals tend to converge to the biased result [Lorenz et al. 2011]. Here we show that there is a bright side to this apparently negative impact of social interactions on collective intelligence. We found that some individuals resist the social influence and, when using the median of this subgroup, we can eliminate the bias of the wisdom of the full crowd. To find this subgroup of individuals, who are more confident in their private estimations than in the social influence, we model individuals as estimators that combine private and social information with different relative weights [Perez-Escudero & de Polavieja 2011, Arganda et al. 2012]. We then computed the geometric mean for increasingly smaller groups, progressively eliminating those whose estimations used higher values of the social influence weight. The trend obtained in this procedure gives unbiased results, in contrast to the simpler method of computing the median of the complete group. Our results show that, while a simple operation like the mean, median or geometric mean of a group may not allow groups to make good estimations, a more complex operation that takes individuality in the social dynamics into account can lead to a better collective intelligence.
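A hedged sketch of the filtering mechanics described above, assuming hypothetical arrays of private estimates, a shared social signal, and revised estimates (the weight-inference step is a simplified stand-in for the cited estimator models):

```python
# Infer each individual's social-influence weight from how far their revised
# estimate moved toward the social information, then aggregate (geometric mean)
# over progressively smaller subgroups keeping only the least socially
# influenced individuals. The data arrays here are hypothetical.
import numpy as np

def social_weights(private, revised, social):
    """Weight s in revised = (1 - s) * private + s * social, per individual."""
    denom = social - private
    return np.clip((revised - private) / np.where(denom == 0, np.nan, denom), 0, 1)

def filtered_geometric_means(private, revised, social, keep_fractions):
    """Geometric mean of revised estimates of the least influenced fraction."""
    s = social_weights(private, revised, social)
    order = np.argsort(np.nan_to_num(s, nan=1.0))        # least influenced first
    out = []
    for f in keep_fractions:
        kept = order[: max(1, int(f * len(order)))]
        out.append(np.exp(np.mean(np.log(revised[kept]))))
    return out

# Example with hypothetical data: individuals combine a biased private guess
# with a shared social signal using unknown weights.
rng = np.random.default_rng(1)
private = rng.lognormal(np.log(80), 0.4, 500)            # biased private guesses
social = np.full(500, np.median(private))                # shared social signal
s_true = rng.uniform(0, 1, 500)
revised = (1 - s_true) * private + s_true * social
print(filtered_geometric_means(private, revised, social, [1.0, 0.5, 0.2, 0.1]))
```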