No Arabic abstract
The study of causality or causal inference - how much a given treatment causally affects a given outcome in a population - goes way beyond correlation or association analysis of variables, and is critical in making sound data driven decisions and policies in a multitude of applications. The gold standard in causal inference is performing controlled experiments, which often is not possible due to logistical or ethical reasons. As an alternative, inferring causality on observational data based on the Neyman-Rubin potential outcome model has been extensively used in statistics, economics, and social sciences over several decades. In this paper, we present a formal framework for sound causal analysis on observational datasets that are given as multiple relations and where the population under study is obtained by joining these base relations. We study a crucial condition for inferring causality from observational data, called the strong ignorability assumption (the treatment and outcome variables should be independent in the joined relation given the observed covariates), using known conditional independences that hold in the base relations. We also discuss how the structure of the conditional independences in base relations given as graphical models help infer new conditional independences in the joined relation. The proposed framework combines concepts from databases, statistics, and graphical models, and aims to initiate new research directions spanning these fields to facilitate powerful data-driven decisions in todays big data world.
Convergent Cross-Mapping (CCM) has shown high potential to perform causal inference in the absence of models. We assess the strengths and weaknesses of the method by varying coupling strength and noise levels in coupled logistic maps. We find that CCM fails to infer accurate coupling strength and even causality direction in synchronized time-series and in the presence of intermediate coupling. We find that the presence of noise deterministically reduces the level of cross-mapping fidelity, while the convergence rate exhibits higher levels of robustness. Finally, we propose that controlled noise injections in intermediate-to-strongly coupled systems could enable more accurate causal inferences. Given the inherent noisy nature of real-world systems, our findings enable a more accurate evaluation of CCM applicability and advance suggestions on how to overcome its weaknesses.
Inference in current domains of application are often complex and require us to integrate the expertise of a variety of disparate panels of experts and models coherently. In this paper we develop a formal statistical methodology to guide the networking together of a diverse collection of probabilistic models. In particular, we derive sufficient conditions that ensure inference remains coherent across the composite before and after accommodating relevant evidence.
Acting on time-critical events by processing ever growing social media or news streams is a major technical challenge. Many of these data sources can be modeled as multi-relational graphs. Continuous queries or techniques to search for rare events that typically arise in monitoring applications have been studied extensively for relational databases. This work is dedicated to answer the question that emerges naturally: how can we efficiently execute a continuous query on a dynamic graph? This paper presents an exact subgraph search algorithm that exploits the temporal characteristics of representative queries for online news or social media monitoring. The algorithm is based on a novel data structure called the Subgraph Join Tree (SJ-Tree) that leverages the structural and semantic characteristics of the underlying multi-relational graph. The paper concludes with extensive experimentation on several real-world datasets that demonstrates the validity of this approach.
Tagging based methods are one of the mainstream methods in relational triple extraction. However, most of them suffer from the class imbalance issue greatly. Here we propose a novel tagging based model that addresses this issue from following two aspects. First, at the model level, we propose a three-step extraction framework that can reduce the total number of samples greatly, which implicitly decreases the severity of the mentioned issue. Second, at the intra-model level, we propose a confidence threshold based cross entropy loss that can directly neglect some samples in the major classes. We evaluate the proposed model on NYT and WebNLG. Extensive experiments show that it can address the mentioned issue effectively and achieves state-of-the-art results on both datasets. The source code of our model is available at: https://github.com/neukg/ConCasRTE.
Understanding the functioning of a neural system in terms of its underlying circuitry is an important problem in neuroscience. Recent developments in electrophysiology and imaging allow one to simultaneously record activities of hundreds of neurons. Inferring the underlying neuronal connectivity patterns from such multi-neuronal spike train data streams is a challenging statistical and computational problem. This task involves finding significant temporal patterns from vast amounts of symbolic time series data. In this paper we show that the frequent episode mining methods from the field of temporal data mining can be very useful in this context. In the frequent episode discovery framework, the data is viewed as a sequence of events, each of which is characterized by an event type and its time of occurrence and episodes are certain types of temporal patterns in such data. Here we show that, using the set of discovered frequent episodes from multi-neuronal data, one can infer different types of connectivity patterns in the neural system that generated it. For this purpose, we introduce the notion of mining for frequent episodes under certain temporal constraints; the structure of these temporal constraints is motivated by the application. We present algorithms for discovering serial and parallel episodes under these temporal constraints. Through extensive simulation studies we demonstrate that these methods are useful for unearthing patterns of neuronal network connectivity.