In this paper, we extend graph-based identification methods by allowing background knowledge in the form of non-zero parameter values. Such information could be obtained, for example, from a previously conducted randomized experiment, from substantive understanding of the domain, or even an identification technique. To incorporate such information systematically, we propose the addition of auxiliary variables to the model, which are constructed so that certain paths will be conveniently cancelled. This cancellation allows the auxiliary variables to help conventional methods of identification (e.g., single-door criterion, instrumental variables, half-trek criterion), as well as model testing (e.g., d-separation, over-identification). Moreover, by iteratively alternating steps of identification and adding auxiliary variables, we can improve the power of existing identification methods via a bootstrapping approach that does not require external knowledge. We operationalize this method for simple instrumental sets (a generalization of instrumental variables) and show that the resulting method is able to identify at least as many models as the most general identification method for linear systems known to date. We further discuss the application of auxiliary variables to the tasks of model testing and z-identification.
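The path-cancellation idea can be illustrated with a minimal simulation sketch. It assumes a hypothetical three-variable linear model that is not taken from the paper: Z affects X (coefficient `a`), X affects Y (coefficient `beta`, the target), Z also affects Y directly (coefficient `c`, assumed known from background knowledge), and X and Y share an unobserved confounder. The direct edge from Z to Y prevents Z from serving as an instrument, but the auxiliary variable Y* = Y - cZ cancels that path. All names and parameter values below are illustrative assumptions.

```python
import numpy as np

def simulate(n, a, beta, c, seed=0):
    """Draw samples from the hypothetical linear model described above."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)                      # unobserved confounder of X and Y
    x = a * z + u + rng.normal(size=n)
    y = beta * x + c * z + u + rng.normal(size=n)
    return z, x, y

def iv_ratio(z, x, y):
    """Instrumental-variable ratio cov(Z, Y) / cov(Z, X)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

a, beta, c = 1.0, 2.0, 0.5                      # illustrative values only
z, x, y = simulate(200_000, a, beta, c)

# Using Z as an instrument directly is biased by the Z -> Y edge:
# the ratio converges to beta + c / a rather than beta.
naive = iv_ratio(z, x, y)

# Auxiliary variable: subtract the known direct effect, cancelling the path.
y_star = y - c * z
av_estimate = iv_ratio(z, x, y_star)            # should land close to beta

print(f"naive IV: {naive:.3f}, with auxiliary variable: {av_estimate:.3f}")
```

In this sketch the known coefficient is simply given. In the bootstrapping setting described above, it would instead come from an earlier identification step.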
We develop a novel approach to identification and model testing in linear structural equation models (SEMs) based on auxiliary variables (AVs), which generalizes a widely used family of methods known as instrumental variables. The identification problem is concerned with the conditions under which causal parameters can be uniquely estimated from an observational, non-causal covariance matrix. In this paper, we provide an algorithm for the identification of causal parameters in linear structural models that subsumes previous state-of-the-art methods. In other words, our algorithm identifies strictly more coefficients and models than methods previously known in the literature. Our algorithm builds on a graph-theoretic characterization of conditional independence relations between auxiliary and model variables, which is developed in this paper. Further, we leverage this new characterization to allow identification when limited experimental data or new substantive knowledge about the domain is available. Lastly, we develop a new procedure for model testing using AVs.
In causal inference, principal stratification is a framework for dealing with a posttreatment intermediate variable between a treatment and an outcome, in which the principal strata are defined by the joint potential values of the intermediate variable. Because the principal strata are not fully observable, the causal effects within them, also known as the principal causal effects, are not identifiable without additional assumptions. Several previous empirical studies leveraged auxiliary variables to improve the inference of principal causal effects. We establish a general theory for identification and estimation of the principal causal effects with auxiliary variables, which provides a solid foundation for statistical inference and more insights for model building in empirical research. In particular, we consider two commonly used strategies for principal stratification problems: principal ignorability, and the conditional independence between the auxiliary variable and the outcome given principal strata and covariates. For these two strategies, we give non-parametric and semi-parametric identification results without modeling assumptions on the outcome. When the assumptions of neither strategy are plausible, we propose a large class of flexible parametric and semi-parametric models for identifying principal causal effects. Our theory not only establishes formal identification results for several models that have been used in previous empirical studies but also generalizes them to allow for different types of outcomes and intermediate variables.
A rich literature has explored the modeling of homophily and other forms of nonuniform mixing associated with individual-level covariates within the exponential family random graph model (ERGM) framework. Such differential mixing does not fully explain phenomena such as stigma, however, which involve the active maintenance of social boundaries by ostracism of persons with out-group ties. Here, we introduce a new statistic that allows such effects to be captured, making it possible to probe for the potential presence of boundary maintenance above and beyond simple differences in nomination rates. We demonstrate this statistic in the context of gender segregation in a school classroom.
Many AutoML problems involve optimizing discrete objects under a black-box reward. Neural-guided search provides a flexible means of searching these combinatorial spaces using an autoregressive recurrent neural network. A major benefit of this approach is that it builds up objects sequentially, which provides an opportunity to incorporate domain knowledge into the search by directly modifying the logits emitted during sampling. In this work, we formalize a framework for incorporating such in situ priors and constraints into neural-guided search, and provide sufficient conditions for enforcing constraints. We integrate several priors and constraints from existing works into this framework, propose several new ones, and demonstrate their efficacy in informing the task of symbolic regression.
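The logit-modification mechanism can be sketched as follows. This is an illustrative toy, not the framework from the paper: random logits stand in for the recurrent network's output, and the soft prior (a bonus for token 0) and the hard constraint (no token may immediately repeat) are both hypothetical. A prior is added to the logits, and a constraint sets the logits of disallowed tokens to negative infinity so that they receive zero probability.

```python
import numpy as np

def apply_prior_and_constraints(logits, prior, allowed):
    """Add a soft prior to the logits, then mask disallowed tokens with -inf."""
    adjusted = logits + prior
    return np.where(allowed, adjusted, -np.inf)

def sample_token(logits, rng):
    """Sample one token index from a softmax over the (possibly masked) logits."""
    shifted = logits - np.max(logits)           # numerical stability
    probs = np.exp(shifted)                     # exp(-inf) evaluates to 0
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def sample_sequence(length, vocab_size, rng):
    """Build a sequence token by token, modifying the logits at every step."""
    tokens = []
    for _ in range(length):
        logits = rng.normal(size=vocab_size)    # stand-in for the network output
        prior = np.zeros(vocab_size)
        prior[0] = 1.0                          # hypothetical soft prior favoring token 0
        allowed = np.ones(vocab_size, dtype=bool)
        if tokens:
            allowed[tokens[-1]] = False         # hypothetical constraint: no immediate repeat
        masked = apply_prior_and_constraints(logits, prior, allowed)
        tokens.append(sample_token(masked, rng))
    return tokens

sequence = sample_sequence(50, 4, np.random.default_rng(0))
print(sequence)
```

Because the mask depends only on the tokens sampled so far, the constraint is enforced during sampling rather than by rejecting finished sequences. The sketch assumes at least one token is allowed at every step.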
The trouble with data is that it often provides only an imperfect representation of the phenomenon of interest. When reading and interpreting data, personal knowledge about the data plays an important role. Data visualization, however, has neither a concept defining personal knowledge about datasets nor the methods or tools to robustly integrate it into an analysis process, thus hampering analysts' ability to express their personal knowledge about datasets and others' ability to learn from such knowledge. In this work, we define such personal knowledge about datasets as data hunches and elevate this knowledge to another form of data that can be externalized, visualized, and used for collaboration. We establish the implications of data hunches and provide a design space for externalizing and communicating data hunches through visualization techniques. We envision that such a design space will empower users to externalize their personal knowledge and support the ability to learn from others' data hunches.