Structure learning algorithms that learn the graph of a Bayesian network from observational data typically assume that the data correctly reflect the true distribution of the variables. However, this assumption does not hold in the presence of measurement error, which can lead to spurious edges. This is one of the reasons why the performance of these algorithms on synthetic data often overestimates their real-world performance. This paper describes an algorithm that can be appended as an additional learning phase to any structure learning algorithm, serving as a correction phase that removes potential false positive edges. The results show that the proposed correction algorithm improves the graphical score of four well-established structure learning algorithms, spanning different classes of learning, in the presence of measurement error.
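The paper's correction phase is not reproduced here, but the general idea of pruning potential false positives after learning can be sketched as follows; the greedy deletion rule, the decomposable `score` function, and the tolerance `eps` are illustrative assumptions rather than the authors' actual procedure:

```python
def correction_phase(graph, data, score, eps=0.0):
    """Illustrative sketch: greedily delete edges whose removal does not
    reduce the graph score by more than `eps`; such edges are treated as
    likely false positives. `graph` is assumed to expose a
    networkx.DiGraph-like interface; `score` is any graph score (e.g. BIC).
    """
    improved = True
    while improved:
        improved = False
        base = score(graph, data)
        for edge in list(graph.edges()):     # copy: we mutate while iterating
            graph.remove_edge(*edge)
            if score(graph, data) >= base - eps:
                improved = True              # keep the deletion, restart scan
                break
            graph.add_edge(*edge)            # deletion hurt the score: undo
    return graph
```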
We present a new approach for learning the structure of a treewidth-bounded Bayesian Network (BN). The key to our approach is applying an exact method (based on MaxSAT) locally, to improve the score of a heuristically computed BN. This allows us to scale the power of exact methods, so far applicable only to BNs with a few dozen random variables, to large BNs with several thousand random variables. Our experiments show that our method improves, often significantly, the score of BNs provided by state-of-the-art heuristic methods.
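As a rough sketch of the "exact method applied locally" idea, the improvement loop could be organised as below; `select_subnet` and `solve_exactly` are hypothetical placeholders standing in for the paper's subnetwork selection and MaxSAT-based exact solver:

```python
def local_improvement(bn, data, score, select_subnet, solve_exactly, rounds=100):
    """Sketch of local exact improvement: repeatedly carve out a small
    subnetwork, re-solve it optimally under the treewidth bound, and keep
    the result if the global score improves. `select_subnet` and
    `solve_exactly` are hypothetical placeholders, not the paper's actual
    MaxSAT machinery."""
    for _ in range(rounds):
        nodes = select_subnet(bn)                    # small window of variables
        candidate = solve_exactly(bn, nodes, data)   # exact, e.g. MaxSAT-based
        if score(candidate, data) > score(bn, data):
            bn = candidate                           # accept the improvement
    return bn
```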
Bayesian Networks (BNs) have become a powerful technology for reasoning under uncertainty, particularly in areas that require causal assumptions enabling us to simulate the effect of intervention. The graphical structure of these models can be determined by causal knowledge, learnt from data, or a combination of both. While it seems plausible that the best approach to constructing a causal graph combines knowledge with machine learning, this approach remains underused in practice. We implement and evaluate 10 knowledge approaches, applied to different case studies and BN structure learning algorithms available in the open-source Bayesys structure learning system. The approaches enable us to specify pre-existing knowledge, obtainable from heterogeneous sources, that constrains or guides structure learning. Each approach is assessed in terms of structure learning effectiveness and efficiency, including graphical accuracy, model fitting, complexity, and runtime, making this the first paper to provide a comparative evaluation of a wide range of knowledge approaches for BN structure learning. Because the value of knowledge depends on what data are available, we illustrate the results with both limited and big data. While the overall results show that knowledge becomes less important with big data, because higher learning accuracy renders it less necessary, some of the knowledge approaches are actually found to be more important with big data. Amongst the main conclusions is the observation that a reduced search space obtained from knowledge does not always imply reduced computational complexity, perhaps because the relationships implied by the data and by the knowledge are in tension.
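As a rough illustration of what such knowledge constraints can look like in code, the sketch below uses a generic representation (required edges, forbidden edges, temporal tiers); the field names are illustrative and not Bayesys's actual input format:

```python
from dataclasses import dataclass, field

@dataclass
class Knowledge:
    """Generic container for pre-existing structural knowledge; the fields
    are illustrative, not Bayesys's actual constraint format."""
    required: set = field(default_factory=set)   # edges that must appear, e.g. {("A", "B")}
    forbidden: set = field(default_factory=set)  # edges that must not appear
    tiers: dict = field(default_factory=dict)    # node -> temporal tier; later tiers cannot cause earlier ones

    def allows(self, parent, child):
        """True if adding the edge parent -> child is consistent with the knowledge."""
        if (parent, child) in self.forbidden:
            return False
        if parent in self.tiers and child in self.tiers:
            return self.tiers[parent] <= self.tiers[child]
        return True
```

A structure learning algorithm could then consult `allows` before scoring each candidate edge, which is one way a reduced search space arises from knowledge.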
Latent variables may lead to spurious relationships that can be misinterpreted as causal. In Bayesian Networks (BNs), this challenge is known as learning under causal insufficiency. Structure learning algorithms that assume causal insufficiency tend to reconstruct the ancestral graph of a BN, where bi-directed edges represent confounding and directed edges represent direct or ancestral relationships. This paper describes a hybrid structure learning algorithm, called CCHM, which combines the constraint-based part of cFCI with hill-climbing score-based learning. The score-based process incorporates Pearl's do-calculus to measure causal effects and to orient edges that would otherwise remain undirected, under the assumption that the BN is a linear Structural Equation Model whose data follow a multivariate Gaussian distribution. Experiments based on both randomised and well-known networks show that CCHM improves on the state of the art in terms of reconstructing the true ancestral graph.
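Under the stated linear-Gaussian assumption, an interventional effect reduces to a regression coefficient given a valid adjustment set, so a simplified version of do-calculus-based edge orientation (not CCHM's exact scoring rule) might look like this:

```python
import numpy as np

def adjusted_effect(data, x, y, adjust):
    """OLS coefficient of column `x` in a regression of column `y` on `x`
    plus the adjustment set. In a linear-Gaussian SEM, with a valid
    adjustment set, this equals the interventional effect
    E[Y | do(X = x+1)] - E[Y | do(X = x)]."""
    cols = [x] + list(adjust)
    X = np.column_stack([data[:, c] for c in cols] + [np.ones(len(data))])
    beta, *_ = np.linalg.lstsq(X, data[:, y], rcond=None)
    return beta[0]

def orient(data, x, y, pa_x, pa_y):
    """Illustrative rule (simplified relative to CCHM): point the edge in
    the direction with the larger estimated causal effect, adjusting for
    each candidate cause's parents."""
    fwd = abs(adjusted_effect(data, x, y, pa_x))
    rev = abs(adjusted_effect(data, y, x, pa_y))
    return (x, y) if fwd >= rev else (y, x)
```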
We present a novel method for variable selection in regression models when covariates are measured with error. The iterative algorithm we propose, MEBoost, follows a path defined by estimating equations that correct for covariate measurement error. Via simulation, we evaluate our method and compare its performance to the recently proposed Convex Conditioned Lasso (CoCoLasso) and to the naive Lasso, which does not correct for measurement error. Increasing the degree of measurement error increased prediction error and decreased the probability of accurate covariate selection, but this loss of accuracy was least pronounced with MEBoost. We illustrate the use of MEBoost in practice by analyzing data from the Box Lunch Study, a clinical trial in nutrition in which several variables are based on self-report and hence measured with error.
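A minimal sketch of a MEBoost-style path follows, assuming additive measurement error with known covariance Sigma_u; the corrected estimating equation used below is the standard corrected score for linear regression, while the step size and iteration count are illustrative choices rather than the paper's tuning:

```python
import numpy as np

def meboost_path(W, y, Sigma_u, n_steps=200, step=0.01):
    """Hedged sketch of a MEBoost-style path. W is the error-contaminated
    design matrix, Sigma_u the (assumed known) measurement-error covariance.
    Each iteration nudges the coefficient with the largest component of the
    corrected estimating equation
        psi(beta) = W'(y - W beta) / n + Sigma_u beta,
    which removes the attenuation bias of the naive least-squares score."""
    n, p = W.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(n_steps):
        psi = W.T @ (y - W @ beta) / n + Sigma_u @ beta
        j = np.argmax(np.abs(psi))           # most violated coordinate
        beta[j] += step * np.sign(psi[j])    # small boosting step
        path.append(beta.copy())
    return np.array(path)                    # candidate models along the path
```

A model along the path would then be selected by a criterion such as cross-validated prediction error, as is usual for boosting-type paths.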
Measurement error in the observed values of variables can greatly change the output of various causal discovery methods. This problem has received much attention in multiple fields, but it is not clear to what extent the causal model over the measurement-error-free variables can be identified in the presence of measurement error with unknown variance. In this paper, we study precise sufficient conditions under which the measurement-error-free causal model is identifiable, and show what information about the causal model can be recovered from observed data. In particular, we present two different sets of identifiability conditions, based on the second-order statistics and the higher-order statistics of the data, respectively. The former is inspired by the relationship between the generating model of the measurement-error-contaminated data and the factor analysis model, while the latter makes use of the identifiability result for the over-complete independent component analysis problem.
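To make the stated factor-analysis connection concrete, here is a minimal formalisation under a linear causal model; the notation (B, Omega, Phi) is ours, introduced for illustration:

```latex
% Illustrative formalisation (notation ours, not the paper's):
% X are the measurement-error-free variables, generated by a linear
% causal model; we observe the contaminated variables \tilde{X}.
\begin{align*}
  X &= BX + \varepsilon, \qquad \operatorname{Cov}(\varepsilon) = \Omega \ \text{diagonal},\\
  \tilde{X} &= X + E,    \qquad \operatorname{Cov}(E) = \Phi \ \text{diagonal},\\
  \operatorname{Cov}(\tilde{X}) &= (I - B)^{-1}\,\Omega\,(I - B)^{-\top} + \Phi.
\end{align*}
% The last line is a "structured part plus diagonal noise" decomposition,
% the same covariance form exploited by factor analysis, which is what the
% second-order identifiability conditions build on.
```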