No Arabic abstract
In a world where traditional notions of privacy are increasingly challenged by the myriad companies that collect and analyze our data, it is important that decision-making entities are held accountable for unfair treatments arising from irresponsible data usage. Unfortunately, a lack of appropriate methodologies and tools means that even identifying unfair or discriminatory effects can be a challenge in practice. We introduce the unwarranted associations (UA) framework, a principled methodology for the discovery of unfair, discriminatory, or offensive user treatment in data-driven applications. The UA framework unifies and rationalizes a number of prior attempts at formalizing algorithmic fairness. It uniquely combines multiple investigative primitives and fairness metrics with broad applicability, granular exploration of unfair treatment in user subgroups, and incorporation of natural notions of utility that may account for observed disparities. We instantiate the UA framework in FairTest, the first comprehensive tool that helps developers check data-driven applications for unfair user treatment. It enables scalable and statistically rigorous investigation of associations between application outcomes (such as prices or premiums) and sensitive user attributes (such as race or gender). Furthermore, FairTest provides debugging capabilities that let programmers rule out potential confounders for observed unfair effects. We report on use of FairTest to investigate and in some cases address disparate impact, offensive labeling, and uneven rates of algorithmic error in four data-driven applications. As examples, our results reveal subtle biases against older populations in the distribution of error in a predictive health application and offensive racial labeling in an image tagger.
We investigate how Simpsons paradox affects analysis of trends in social data. According to the paradox, the trends observed in data that has been aggregated over an entire population may be different from, and even opposite to, those of the underlying subgroups. Failure to take this effect into account can lead analysis to wrong conclusions. We present a statistical method to automatically identify Simpsons paradox in data by comparing statistical trends in the aggregate data to those in the disaggregated subgroups. We apply the approach to data from Stack Exchange, a popular question-answering platform, to analyze factors affecting answerer performance, specifically, the likelihood that an answer written by a user will be accepted by the asker as the best answer to his or her question. Our analysis confirms a known Simpsons paradox and identifies several new instances. These paradoxes provide novel insights into user behavior on Stack Exchange.
With the increasing availability of mobility-related data, such as GPS-traces, Web queries and climate conditions, there is a growing demand to utilize this data to better understand and support urban mobility needs. However, data available from the individual actors, such as providers of information, navigation and transportation systems, is mostly restricted to isolated mobility modes, whereas holistic data analytics over integrated data sources is not sufficiently supported. In this paper we present our ongoing research in the context of holistic data analytics to support urban mobility applications in the Data4UrbanMobility (D4UM) project. First, we discuss challenges in urban mobility analytics and present the D4UM platform we are currently developing to facilitate holistic urban data analytics over integrated heterogeneous data sources along with the available data sources. Second, we present the MiC app - a tool we developed to complement available datasets with intermodal mobility data (i.e. data about journeys that involve more than one mode of mobility) using a citizen science approach. Finally, we present selected use cases and discuss our future work.
The increasing generation and collection of personal data has created a complex ecosystem, often collaborative but sometimes combative, around companies and individuals engaging in the use of these data. We propose that the interactions between these agents warrants a new topic of study: Human-Data Interaction (HDI). In this paper we discuss how HDI sits at the intersection of various disciplines, including computer science, statistics, sociology, psychology and behavioural economics. We expose the challenges that HDI raises, organised into three core themes of legibility, agency and negotiability, and we present the HDI agenda to open up a dialogue amongst interested parties in the personal and big data ecosystems.
In recent years, the amount of information collected about human beings has increased dramatically. This development has been partially driven by individuals posting and storing data about themselves and friends using online social networks or collecting their data for self-tracking purposes (quantified-self movement). Across the sciences, researchers conduct studies collecting data with an unprecedented resolution and scale. Using computational power combined with mathematical models, such rich datasets can be mined to infer underlying patterns, thereby providing insights into human nature. Much of the data collected is sensitive. It is private in the sense that most individuals would feel uncomfortable sharing their collected personal data publicly. For this reason, the need for solutions to ensure the privacy of the individuals generating data has grown alongside the data collection efforts. Out of all the massive data collection efforts, this paper focuses on efforts directly instrumenting human behavior, and notes that -- in many cases -- the privacy of participants is not sufficiently addressed. For example, study purposes are often not explicit, informed consent is ill-defined, and security and sharing protocols are only partially disclosed. This paper provides a survey of the work related to addressing privacy issues in research studies that collect detailed sensor data on human behavior. Reflections on the key problems and recommendations for future work are included. We hope the overview of the privacy-related practices in massive data collection studies can be used as a frame of reference for practitioners in the field. Although focused on data collection in an academic context, we believe that many of the challenges and solutions we identify are also relevant and useful for other domains where massive data collection takes place, including businesses and governments.
Effective patient queue management to minimize patient wait delays and patient overcrowding is one of the major challenges faced by hospitals. Unnecessary and annoying waits for long periods result in substantial human resource and time wastage and increase the frustration endured by patients. For each patient in the queue, the total treatment time of all patients before him is the time that he must wait. It would be convenient and preferable if the patients could receive the most efficient treatment plan and know the predicted waiting time through a mobile application that updates in real-time. Therefore, we propose a Patient Treatment Time Prediction (PTTP) algorithm to predict the waiting time for each treatment task for a patient. We use realistic patient data from various hospitals to obtain a patient treatment time model for each task. Based on this large-scale, realistic dataset, the treatment time for each patient in the current queue of each task is predicted. Based on the predicted waiting time, a Hospital Queuing-Recommendation (HQR) system is developed. HQR calculates and predicts an efficient and convenient treatment plan recommended for the patient. Because of the large-scale, realistic dataset and the requirement for real-time response, the PTTP algorithm and HQR system mandate efficiency and low-latency response. We use an Apache Spark-based cloud implementation at the National Supercomputing Center in Changsha (NSCC) to achieve the aforementioned goals. Extensive experimentation and simulation results demonstrate the effectiveness and applicability of our proposed model to recommend an effective treatment plan for patients to minimize their wait times in hospitals.