The following paper explores the possibility of using machine learning algorithms to detect cases of corruption and malpractice by governments. The dataset used by the authors contains information about government contracts in Colombia from 2007 to 2012. The authors begin by exploring and cleaning the data, then perform feature engineering, and finally implement machine learning models to detect anomalies in the dataset.
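The abstract does not state which models or features the authors used; the sketch below only illustrates the general shape of such a pipeline, using an Isolation Forest over a few invented contract features. The file name and column names (value, duration_days, n_bidders, agency) are assumptions for illustration, not the actual schema of the Colombian contracts dataset.

```python
# Hedged sketch of the kind of pipeline the abstract describes: clean a
# contracts table, engineer simple features, and flag anomalous contracts.
import pandas as pd
from sklearn.ensemble import IsolationForest

contracts = pd.read_csv("contracts_2007_2012.csv")  # hypothetical file
contracts = contracts.dropna(subset=["value", "duration_days", "n_bidders"])

# Feature engineering: per-agency deviation of contract value and a
# single-bidder indicator, two signals often associated with malpractice.
contracts["value_z"] = (
    contracts.groupby("agency")["value"]
    .transform(lambda v: (v - v.mean()) / (v.std() + 1e-9))
)
contracts["single_bidder"] = (contracts["n_bidders"] == 1).astype(int)

features = contracts[["value_z", "duration_days", "n_bidders", "single_bidder"]]

# Unsupervised anomaly detection; contracts scored -1 are candidates for
# manual review, not proven cases of corruption.
model = IsolationForest(contamination=0.01, random_state=0)
contracts["anomaly"] = model.fit_predict(features)
print(contracts[contracts["anomaly"] == -1].head())
```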
After the 2016 peace agreement with the FARC, the killings of social leaders have emerged as an important post-conflict challenge for Colombia. We present a data analysis based on official records obtained from the Colombian Attorney General's Office, spanning the period from 2012 to 2017. The results of the analysis show a drastic increase in the officially recorded number of killings of democratically elected leaders of community organizations, in particular those belonging to Juntas de Accion Comunal [Community Action Boards]. These are important entities that have been part of the Colombian democratic apparatus since 1958 and enable communities to advocate for their needs. We also describe how the data analysis guided a journalistic investigation that was motivated by the Colombian government's denial of the systematic nature of the killings of social leaders.
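The abstract does not detail how the records were analyzed; a minimal sketch of the kind of descriptive analysis it mentions is counting officially recorded killings per year and per type of leader. The file name and column names (year, organization) are assumptions for illustration, not the Attorney General's Office schema.

```python
# Toy sketch: tabulate recorded killings by year and organization, then look
# at the year-over-year change in the totals.
import pandas as pd

records = pd.read_csv("killings_2012_2017.csv")  # hypothetical export of the records
per_year = (
    records.groupby(["year", "organization"])
    .size()
    .unstack(fill_value=0)
)
print(per_year)                            # rows 2012..2017, one column per organization
print(per_year.sum(axis=1).pct_change())   # year-over-year change in total recorded killings
```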
With the growing abundance of unlabeled data in real-world tasks, researchers have to rely on the predictions given by black-box computational models. However, it is an often neglected fact that these models may score high on accuracy for the wrong reasons. In this paper, we present a practical impact analysis of enabling model transparency through various presentation forms. For this purpose, we developed an environment that empowers non-computer scientists to become practicing data scientists in their own research field. We demonstrate the gradually increasing understanding of journalism historians through a real-world case study on automatic genre classification of newspaper articles. This study is a first step towards trusted, responsible usage of machine learning pipelines.
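The paper's own environment and presentation forms are not reproduced here; the sketch below only illustrates one simple way to make a genre classifier inspectable, by training a linear model over TF-IDF features and showing domain experts the terms that drive each genre. The genre labels and example texts are placeholders, not the paper's data or pipeline.

```python
# Hedged sketch: a transparent (linear) genre classifier whose top-weighted
# terms can be shown to journalism historians to judge whether the model is
# right for the right reasons.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["stock prices fell sharply today",
         "the minister announced new reforms",
         "our correspondent reflects on the war",
         "editorial: why the law must change"]
genres = ["report", "report", "feature", "opinion"]  # invented labels

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, genres)

# Transparency step: for each genre, list the terms that push predictions
# towards it.
terms = np.array(vectorizer.get_feature_names_out())
for genre, weights in zip(clf.classes_, clf.coef_):
    top = terms[np.argsort(weights)[-3:]]
    print(genre, "->", ", ".join(top))
```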
Digital data is a gold mine for modern journalism. However, the datasets that interest journalists are extremely heterogeneous, ranging from highly structured (relational databases) and semi-structured (JSON, XML, HTML) formats to graphs (e.g., RDF) and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows, especially for dynamically varying sets of data sources. We describe a complete approach for integrating dynamic sets of heterogeneous datasets along the lines described above: the challenges we faced in making such graphs useful and in scaling their integration, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.
Nowadays, journalism is facilitated by the existence of large amounts of digital data sources, including many Open Data ones. Such data sources are extremely heterogeneous, ranging from highly structured (relational databases) and semi-structured (JSON, XML, HTML) formats to graphs (e.g., RDF) and text. Journalists (and other classes of users lacking advanced IT expertise, such as most non-governmental organizations or small public administrations) need to be able to make sense of such heterogeneous corpora, even if they lack the ability to define and deploy custom extract-transform-load workflows. These are difficult to set up not only for arbitrary heterogeneous inputs, but also given that users may want to add (or remove) datasets to (from) the corpus. We describe a complete approach for integrating dynamic sets of heterogeneous data sources along the lines described above: the challenges we faced in making such graphs useful and in scaling their integration, and the solutions we proposed for these problems. Our approach is implemented within the ConnectionLens system; we validate it through a set of experiments.
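ConnectionLens itself is a full system, and the two abstracts above do not spell out its data model; the toy sketch below only illustrates the core idea of turning heterogeneous inputs (here, a CSV table and a JSON document) into one graph in which nodes sharing the same literal value become interconnected, so that cross-dataset links can be found. Node naming, the sample data, and the use of networkx are assumptions for illustration, not ConnectionLens's actual implementation.

```python
# Toy sketch: heterogeneous records as nodes and edges of a single graph,
# with shared values acting as connection points between datasets.
import csv, io, json
import networkx as nx

csv_data = "name,company\nAlice,Acme\nBob,Globex\n"
json_data = '{"contract": {"supplier": "Acme", "amount": 120000}}'

G = nx.Graph()

# CSV rows become record nodes linked to value nodes.
for i, row in enumerate(csv.DictReader(io.StringIO(csv_data))):
    rec = f"csv:row{i}"
    for key, value in row.items():
        G.add_edge(rec, f"val:{value}", label=key)

# JSON trees are flattened the same way: leaves become value nodes.
def add_json(node_id, obj):
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{node_id}/{key}"
            G.add_edge(node_id, child, label=key)
            add_json(child, value)
    else:
        G.add_edge(node_id, f"val:{obj}", label="value")

add_json("json:doc", json.loads(json_data))

# Both datasets produced the node "val:Acme", so a path now connects Alice's
# row to the contract document, the kind of cross-dataset connection a
# journalist would look for.
print(nx.shortest_path(G, "csv:row0", "json:doc"))
```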
Data-driven approaches, most prominently deep learning, have become powerful tools for prediction in many domains. A natural question to ask is whether data-driven methods could also be used to predict global weather patterns days in advance. First studies show promise, but the lack of a common dataset and evaluation metrics makes inter-comparison between studies difficult. Here we present a benchmark dataset for data-driven medium-range weather forecasting, a topic of high scientific interest for atmospheric and computer scientists alike. We provide data derived from the ERA5 archive that has been processed to facilitate its use in machine learning models. We propose simple and clear evaluation metrics which will enable a direct comparison between different methods. Further, we provide baseline scores from simple linear regression techniques, deep learning models, and purely physical forecasting models. The dataset is publicly available at https://github.com/pangeo-data/WeatherBench and the companion code is reproducible, with tutorials for getting started. We hope that this dataset will accelerate research in data-driven weather forecasting.
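As an illustration of the kind of simple, clearly defined metric such a benchmark relies on, the sketch below computes a latitude-weighted RMSE over gridded forecasts, weighting each grid cell by the cosine of its latitude so that polar cells do not dominate the score. This is a generic sketch on toy data; consult the WeatherBench repository for the exact metric definitions used in the companion code.

```python
# Latitude-weighted RMSE sketch for gridded forecasts.
import numpy as np

def weighted_rmse(forecast, truth, lats_deg):
    """forecast, truth: arrays of shape (time, lat, lon); lats_deg: (lat,)."""
    weights = np.cos(np.deg2rad(lats_deg))
    weights = weights / weights.mean()                 # normalise so weights average to 1
    sq_err = (forecast - truth) ** 2 * weights[None, :, None]
    return np.sqrt(sq_err.mean(axis=(1, 2))).mean()    # mean over grid, then over time

# Toy usage with random fields on a coarse 32 x 64 grid.
rng = np.random.default_rng(0)
lats = np.linspace(-87.2, 87.2, 32)
forecast, truth = rng.normal(size=(2, 10, 32, 64))
print(weighted_rmse(forecast, truth, lats))
```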