Development of sustainable insurance for cyber risks, with its associated benefits, requires, inter alia, reducing the ambiguity of the risk. Considering cyber risk, and data breaches in particular, as a man-made catastrophe clarifies the actuarial need for multiple levels of analysis - going beyond claims-driven loss statistics alone to include exposure, hazard, breach size, and so on - and necessitates specific advances in the scope, quality, and standards of both data and models. The prominent human element, together with the dynamic, networked, and multi-type nature of cyber risk, makes it perhaps uniquely challenging. Complementary top-down statistical and bottom-up analytical approaches are discussed. Focusing on data breach severity, measured in private information items (ids) extracted, we exploit relatively mature open data for U.S. data breaches. We show that this extremely heavy-tailed risk is worsening for external attacker (hack) events - both in frequency and severity. Writing in Q2-2018, the median predicted number of ids breached in the U.S. due to hacking, for the last 6 months of 2018, is 0.5 billion, but with a 5% chance that the figure exceeds 7 billion - doubling the historical total. Fortunately, the total breach in that period turned out to be near the median.
After the 2016 peace agreement with FARC, the killings of social leaders have emerged as an important post-conflict challenge for Colombia. We present a data analysis based on official records obtained from the Colombian Attorney General's Office, spanning the period from 2012 to 2017. The results of the analysis show a drastic increase in the officially recorded number of killings of democratically elected leaders of community organizations, in particular those belonging to Juntas de Accion Comunal [Community Action Boards]. These are important entities that have been part of the Colombian democratic apparatus since 1958 and that enable communities to advocate for their needs. We also describe how the data analysis guided a journalistic investigation that was motivated by the Colombian government's denial of the systematic nature of the killings of social leaders.
Anonymized smartphone-based mobility data has been widely adopted in devising and evaluating COVID-19 response strategies such as the targeting of public health resources. Yet little attention has been paid to measurement validity and demographic bias, due in part to the lack of documentation about which users are represented as well as the challenge of obtaining ground truth data on unique visits and demographics. We illustrate how linking large-scale administrative data can enable auditing mobility data for bias in the absence of demographic information and ground truth labels. More precisely, we show that linking voter roll data -- containing individual-level voter turnout for specific voting locations along with race and age -- can facilitate the construction of rigorous bias and reliability tests. These tests illuminate a sampling bias that is particularly noteworthy in the pandemic context: older and non-white voters are less likely to be captured by mobility data. We show that allocating public health resources based on such mobility data could disproportionately harm high-risk elderly and minority groups.
The need to forecast COVID-19 related variables continues to be pressing as the epidemic unfolds. Different efforts have been made, using compartmental models in epidemiology and statistical models such as AutoRegressive Integrated Moving Average (ARIMA), Exponential Smoothing (ETS), or computational intelligence models. These efforts have proved useful in some instances by allowing decision makers to distinguish different scenarios during the emergency, but their accuracy has been disappointing, their forecasts ignore uncertainty, and little attention is given to local areas. In this study, we propose a simple Multiple Linear Regression model, optimised to use call data to forecast the number of daily confirmed cases. Moreover, we produce a probabilistic forecast that allows decision makers to better deal with risk. Our proposed approach outperforms ARIMA, ETS, and a regression model without call data, as evaluated by three point forecast error metrics, one prediction interval measure, and two probabilistic forecast accuracy measures. The simplicity, interpretability, and reliability of the model, obtained in a careful forecasting exercise, are a meaningful contribution to decision makers at the local level who acutely need to organise resources in already strained health services. We hope that this model will serve as a building block for other forecasting efforts that, on the one hand, would help front-line personnel and decision makers at the local level and, on the other, would facilitate communication with other modelling efforts being made at the national level to improve the way we tackle this pandemic and similar future challenges.
Predicting pregnancy has been a fundamental problem in women's health for more than 50 years. Previous datasets have been collected via carefully curated medical studies, but the recent growth of women's health tracking mobile apps offers potential for reaching a much broader population. However, the feasibility of predicting pregnancy from mobile health tracking data is unclear. Here we develop four models -- a logistic regression model and three LSTM models -- to predict a woman's probability of becoming pregnant using data from a women's health tracking app, Clue by BioWink GmbH. Evaluating our models on a dataset of 79 million logs from 65,276 women with ground truth pregnancy test data, we show that our predicted pregnancy probabilities meaningfully stratify women: women in the top 10% of predicted probabilities have an 89% chance of becoming pregnant over 6 menstrual cycles, as compared to a 27% chance for women in the bottom 10%. We develop a technique for extracting interpretable time trends from our deep learning models, and show that these trends are consistent with previous fertility research. Our findings illustrate the potential that women's health tracking data offers for predicting pregnancy in a broader population; we conclude by discussing the steps needed to fulfill this potential.
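As an illustration of the kind of stratification check described in this abstract, the following minimal sketch compares the observed outcome rate among the top 10% and bottom 10% of predicted probabilities. It is not the authors' evaluation code: the function name `decile_rates` is hypothetical, and the probabilities and outcomes are synthetic values generated only so the example runs.

```python
import numpy as np

def decile_rates(pred_probs, outcomes):
    """Observed outcome rate in the top and bottom 10% of predicted probabilities."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Cut points at the 10th and 90th percentiles of the predictions.
    lo_cut, hi_cut = np.quantile(pred_probs, [0.1, 0.9])
    top = outcomes[pred_probs >= hi_cut]
    bottom = outcomes[pred_probs <= lo_cut]
    return top.mean(), bottom.mean()

# Synthetic data (an assumption for illustration, not taken from the study):
# outcomes are drawn so that they follow the predicted probabilities.
rng = np.random.default_rng(0)
probs = rng.uniform(0.05, 0.95, size=10_000)
became_pregnant = rng.random(10_000) < probs

top_rate, bottom_rate = decile_rates(probs, became_pregnant)
print(f"top 10%: {top_rate:.2f}, bottom 10%: {bottom_rate:.2f}")
```

A large gap between the two rates indicates that the predicted probabilities separate high- and low-likelihood groups; it says nothing by itself about calibration in the middle of the range.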
Because word semantics can substantially change across communities and contexts, capturing domain-specific word semantics is an important challenge. Here, we propose SEMAXIS, a simple yet powerful framework to characterize word semantics using many semantic axes in word-vector spaces beyond sentiment. We demonstrate that SEMAXIS can capture nuanced semantic representations in multiple online communities. We also show that, when the sentiment axis is examined, SEMAXIS outperforms the state-of-the-art approaches in building domain-specific sentiment lexicons.
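To make the idea of a semantic axis in a word-vector space concrete, here is a minimal sketch. It assumes one common construction, which the abstract itself does not spell out: the axis is the difference between the mean vectors of two sets of pole words, and a word is scored by its cosine similarity to that axis. The function names and the three-dimensional toy vectors are hypothetical and were invented for this example.

```python
import numpy as np

def semantic_axis(pos_vecs, neg_vecs):
    """Axis vector: mean of positive-pole vectors minus mean of negative-pole vectors."""
    return np.mean(pos_vecs, axis=0) - np.mean(neg_vecs, axis=0)

def axis_score(word_vec, axis):
    """Cosine similarity between a word vector and the axis."""
    denom = np.linalg.norm(word_vec) * np.linalg.norm(axis)
    return float(np.dot(word_vec, axis) / denom)

# Toy 3-dimensional embeddings, invented for illustration only.
toy_vectors = {
    "good": np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad": np.array([-0.9, 0.1, 0.0]),
    "awful": np.array([-0.8, 0.2, 0.1]),
    "pleasant": np.array([0.7, 0.3, 0.2]),
    "nasty": np.array([-0.7, 0.3, 0.2]),
}

# Build a sentiment-like axis from two pole-word sets.
sentiment_axis = semantic_axis(
    [toy_vectors["good"], toy_vectors["great"]],
    [toy_vectors["bad"], toy_vectors["awful"]],
)

# Words aligned with the positive pole score above zero, and vice versa.
pleasant_score = axis_score(toy_vectors["pleasant"], sentiment_axis)
nasty_score = axis_score(toy_vectors["nasty"], sentiment_axis)
print(pleasant_score, nasty_score)
```

With embeddings trained on a specific community's text, the same pole words can yield different scores for a given word, which is how an axis of this kind could expose domain-specific meaning.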