As causality enjoys increasing attention in various areas of machine learning, this workshop turns the spotlight on the assumptions behind the successful application of causal inference techniques. It is well known that answering causal queries from observational data requires strong and sometimes untestable assumptions. On the theoretical side, a whole host of settings has been established in which causal effects are identifiable and consistently estimable under a set of assumptions that are by now considered "standard". While these can be reasonable in specific scenarios, they were often at least partially motivated by rendering estimation theoretically feasible. Such assumptions tell us what we would need to assert about the data generating process in order to be able to answer causal queries. Unfortunately, in applications we often find them taken for granted as properties that can safely be assumed to hold without further scrutiny. This starts with fundamentally untestable assumptions such as the stable unit treatment value assumption or ignorability, continues to no interference, faithfulness, positivity or overlap, and no unobserved confounding, and even reaches blanket one-size-fits-all assumptions on the linearity of structural equations or the additivity of noise. This situation may lead practitioners to believe either that well-founded causal inference is unattainable altogether, or that established off-the-shelf methods can be trusted to deliver reliable causal estimates in virtually any situation. Similarly, as ideas from causality are increasingly picked up by researchers in deep, reinforcement, or meta-learning, there is a risk that the role of assumptions for causal inference gets lost in translation. One of the main goals of this workshop is to help the research community and practitioners understand the concrete challenges of trustworthy assumptions for effective causal inference.
Fri 7:50 a.m. - 8:00 a.m.

Welcome and Introduction
(intro)

Fri 8:00 a.m. - 8:27 a.m.

Invited Talk 1: Noam Barda and Noa Dagan
(invited talk)
Noam Barda, Noa Dagan 
Fri 8:27 a.m. - 8:53 a.m.

Invited Talk 2: Lina Montoya
(invited talk)
Optimal Dynamic Treatment Rule Estimation and Evaluation with Application to Criminal Justice Interventions in the United States
The optimal dynamic treatment rule (ODTR) framework offers an approach for understanding which kinds of individuals respond best to specific interventions. Recently, there has been a proliferation of methods for estimating the ODTR. One such method is an extension of the SuperLearner algorithm – an ensemble method to optimally combine candidate algorithms extensively used in prediction problems – to ODTRs. Following the "Causal Roadmap," in this talk we causally and statistically define the ODTR, and different parameters to evaluate it. We show how to estimate the ODTR with SuperLearner and evaluate it using cross-validated targeted maximum likelihood estimation. We apply the ODTR SuperLearner to the "Interventions" study, a randomized trial that is currently underway aimed at reducing recidivism among justice-involved adults with mental illness in the United States. Specifically, we show preliminary results for the ODTR SuperLearner applied to this data, which aims to learn for whom Cognitive Behavioral Therapy (CBT) treatment works best to reduce recidivism, instead of Treatment As Usual (TAU; psychiatric services). This is joint work with Drs. Maya Petersen, Mark van der Laan, and Jennifer Skeem.
Lina Montoya 
Fri 8:53 a.m. - 9:00 a.m.

Q&A for invited talks 1 and 2
(live Q&A)

Fri 9:00 a.m. - 10:00 a.m.

Poster Session 1
(poster session)
Fri 10:00 a.m. - 10:20 a.m.

Coffee Break 1
(break)


Fri 10:20 a.m. - 10:45 a.m.

Invited Talk 3: Lihua Lei and Avi Feller
(invited talk)
Distribution-Free Assessment of Population Overlap in Observational Studies
Overlap in baseline covariates between treated and control groups, also known as positivity or common support, is a common assumption in observational causal inference. Assessing this assumption is often ad hoc, however, and can give misleading results. For example, the common practice of examining the empirical distribution of estimated propensity scores is heavily dependent on model specification and has poor uncertainty quantification. In this paper, we propose a formal statistical framework for assessing the extrema of the population propensity score; e.g., the propensity score lies in [0.1, 0.9] almost surely. We develop a family of upper confidence bounds, which we term O-values, for this quantity. We show these bounds are valid in finite samples so long as the observations are independent and identically distributed, without requiring any further modeling assumptions on the data generating process. We also use extensive simulations to show that these bounds are reasonably tight in practice. Finally, we demonstrate this approach using several benchmark observational studies, showing how to build our proposed method into the observational causal inference workflow.
Lihua Lei, Avi Feller 
Fri 10:45 a.m. - 11:10 a.m.

Invited Talk 4: Frederick Eberhardt
(invited talk)
Frederick Eberhardt 
Fri 11:10 a.m. - 11:20 a.m.

Q&A for invited talks 3 and 4
(live Q&A)

Fri 11:20 a.m. - 11:25 a.m.

Theo Saarinen - Nonparametric identification is not enough, but randomized controlled trials are
(contributed talk)
We argue that randomized controlled trials (RCTs) are special even among settings where average treatment effects are identified by a nonparametric unconfoundedness assumption. We argue that this claim follows from two results of Robins and Ritov (1997): (1) with at least one continuous covariate control, no estimator of the average treatment effect exists which is uniformly consistent without further assumptions, (2) knowledge of the propensity score yields a consistent estimator and confidence intervals at parametric rates, regardless of how complicated the propensity score function is. We emphasize the latter point, and note that successfully conducted RCTs provide knowledge of the propensity score to the researcher. We discuss modern developments in covariate adjustment for RCTs, noting that statistical models and machine learning methods can be used to improve efficiency while preserving finite-sample unbiasedness. We conclude that statistical inference may be fundamentally more difficult in observational settings than it is in RCTs, even when all confounders are measured.
Theo Saarinen 
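The point about a known propensity score can be illustrated with a small simulation. The sketch below is not taken from the talk: it assumes a hypothetical stratified trial in which the assignment probability depends on a covariate and is known by design, and it uses the textbook inverse-probability-weighted (Horvitz-Thompson) estimator. The function names, the data generating process, and the true effect of 2 are all illustrative assumptions.

```python
import numpy as np


def ipw_ate(y, t, e):
    """Inverse-probability-weighted (Horvitz-Thompson) estimate of the ATE.

    y: observed outcomes, t: binary treatment indicators,
    e: known assignment probabilities P(T=1 | X) for each unit.
    """
    y, t, e = (np.asarray(a, dtype=float) for a in (y, t, e))
    return float(np.mean(t * y / e - (1.0 - t) * y / (1.0 - e)))


def simulate_trial(n=200_000, seed=0):
    """Hypothetical trial: assignment probability depends on x and is known."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    e = np.where(x > 0, 0.8, 0.2)          # design probabilities, known exactly
    t = rng.binomial(1, e)
    y0 = 3.0 * x + rng.normal(size=n)      # x also drives the outcome
    y1 = y0 + 2.0                          # assumed true ATE = 2
    y = np.where(t == 1, y1, y0)
    return y, t, e


y, t, e = simulate_trial()
naive = y[t == 1].mean() - y[t == 0].mean()
print(f"naive difference in means: {naive:.2f}")
print(f"IPW with known propensity: {ipw_ate(y, t, e):.2f}")
```

In this setup the naive contrast is biased because x influences both assignment and outcome, while the weighted estimate recovers the assumed effect without any outcome model.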
Fri 11:25 a.m. - 11:32 a.m.

Alexander Franks - Bayesian Inference for Partial Identification of Multiple Treatment Effects
(contributed talk)
In Bayesian causal inference for partially identified parameters, there is a delicate balance between parameterizing models in terms of the fully identified and unidentified parameters directly versus modeling the parameters of primary scientific interest. We explore the challenges of Bayesian inference for partially identified models in the context of multi-treatment causal inference with unobserved confounding in the linear model, where the treatment effects are partially identified. We demonstrate how carefully chosen priors can be used to incorporate additional scientific assumptions which further constrain the set of causal conclusions, and describe how our approach can be used to assess robustness and sensitivity of the outcomes. We illustrate our approach to multi-treatment causal inference in an example quantifying the effect of gene expression levels on mouse obesity.
Alexander Franks 
Fri 11:32 a.m. - 11:40 a.m.

Smitha Milli - Causal Inference Struggles with Agency on Online Platforms
(contributed talk)
Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and propensity score weighting while controlling for plausible confounding variables. In all cases, all observational estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we present a “Catch-22” that undermines the use of causal inference in these settings: we give users control because we postulate that there is no adequate model for predicting user behavior, but performing observational causal inference successfully requires exactly that.
Smitha Milli 
Fri 11:40 a.m. - 11:45 a.m.

Philippe Brouillard - Typing assumptions improve identification in causal discovery
(contributed talk)
Causal discovery from observational data is a challenging task to which an exact solution cannot always be identified. Under assumptions about the data-generative process, the causal graph can often be identified up to an equivalence class. Proposing new realistic assumptions to circumscribe such equivalence classes is an active field of research. In this work, we propose a new set of assumptions that constrain possible causal relationships based on the nature of the variables. We thus introduce typed directed acyclic graphs, in which variable types are used to determine the validity of causal relationships. We demonstrate, both theoretically and empirically, that the proposed assumptions can result in significant gains in the identification of the causal graph.
Philippe Brouillard 
Fri 11:45 a.m. - 11:50 a.m.

Q&A for contributed talks 1, 2, 3, 4
(live Q&A)

Fri 11:50 a.m. - 12:20 p.m.

Lunch Break
(break)


Fri 12:20 p.m. - 12:42 p.m.

Invited Talk 5: Margarita Moreno-Betancur
(invited talk)
Margarita Moreno-Betancur
Fri 12:42 p.m. - 1:12 p.m.

Invited Talk 6: Daniel Malinsky
(invited talk)
Daniel Malinsky 
Fri 1:12 p.m. - 1:20 p.m.

Q&A for invited talks 5 and 6
(live Q&A)

Fri 1:20 p.m. - 2:15 p.m.

Panel Discussion
(live panel discussion)

Fri 2:15 p.m. - 2:30 p.m.

Wrap up
(wrap up)



Hölder Bounds for Sensitivity Analysis in Causal Reasoning
(Poster)
[ Visit Poster at Spot B3 in Virtual World ]
We examine interval estimation of the effect of a treatment T on an outcome Y given the existence of an unobserved confounder U. Using Hölder's inequality, we derive a set of bounds on the confounding bias E[Y|T=t] - E[Y|do(T=t)] based on the degree of unmeasured confounding (i.e., the strength of the connection U -> T, and the strength of U -> Y). These bounds are tight either when U ⊥ T or U ⊥ Y | T (when there is no unobserved confounding). We focus on a special case of this bound depending on the total variation distance between the distributions p(U) and p(U|T=t), as well as the maximum (over all possible values of U) deviation of the conditional expected outcome E[Y|U=u,T=t] from the average expected outcome E[Y|T=t]. We discuss possible calibration strategies for this bound to get interval estimates for treatment effects, and experimentally validate the bound using synthetic and semi-synthetic datasets.
Serge Assaad, Shuxi Zeng, Henry Pfister, Fan Li, Lawrence Carin 
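The total-variation special case described in this abstract can be checked numerically for a discrete confounder. The sketch below is a reconstruction under stated assumptions, not the authors' code: it assumes U is the only confounder, so that E[Y|do(T=t)] equals the sum over u of E[Y|U=u,T=t] p(u), and it applies Hölder's inequality with the L1 and L-infinity norms. The constant in the paper may be stated differently; the function name and example numbers are hypothetical.

```python
import numpy as np


def confounding_bias_and_bound(p_u, p_u_given_t, mu_ut):
    """Confounding bias and a Hölder-type bound for a discrete confounder U.

    p_u:          p(U=u) for each value u
    p_u_given_t:  p(U=u | T=t) for each value u
    mu_ut:        E[Y | U=u, T=t] for each value u
    """
    p_u, p_u_given_t, mu_ut = (
        np.asarray(a, dtype=float) for a in (p_u, p_u_given_t, mu_ut)
    )
    observational = np.sum(mu_ut * p_u_given_t)   # E[Y | T=t]
    interventional = np.sum(mu_ut * p_u)          # E[Y | do(T=t)], U assumed sufficient
    bias = observational - interventional
    # Both distributions sum to one, so E[Y | T=t] can be subtracted inside
    # the sum before applying Hölder's inequality (L-infinity times L1).
    total_variation = 0.5 * np.sum(np.abs(p_u_given_t - p_u))
    max_deviation = np.max(np.abs(mu_ut - observational))
    bound = 2.0 * total_variation * max_deviation
    return float(bias), float(bound)


# Hypothetical three-valued confounder.
bias, bound = confounding_bias_and_bound(
    p_u=[0.5, 0.3, 0.2],
    p_u_given_t=[0.2, 0.3, 0.5],
    mu_ut=[1.0, 2.0, 4.0],
)
print(f"bias = {bias:.3f}, bound = {bound:.3f}")
```

The bound collapses to zero when p(U|T=t) equals p(U), or when E[Y|U=u,T=t] does not vary with u, matching the two no-confounding cases named in the abstract.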


Causal Gradient Boosting: Boosted Instrumental Variables Regression
(Poster)
[ Visit Poster at Spot A2 in Virtual World ]
Recent advances in the literature have demonstrated that standard supervised learning algorithms are ill-suited for problems with endogenous explanatory variables. To correct for the endogeneity bias, many variants of nonparametric instrumental variable regression methods have been developed. In this paper, we propose an alternative algorithm called boostIV that builds on the traditional gradient boosting algorithm and corrects for the endogeneity bias. The algorithm is very intuitive and resembles an iterative version of the standard 2SLS estimator. We demonstrate that our estimator is consistent under mild conditions and that it achieves outstanding finite-sample performance.
Edvard Bakhitov, Aman Singh 


Strategic Instrumental Variable Regression: Recovering Causal Relationships From Strategic Responses
(Poster)
[ Visit Poster at Spot C3 in Virtual World ]
In social domains, Machine Learning algorithms often prompt individuals to strategically modify their observable attributes to receive more favorable predictions. As a result, the distribution the predictive model is trained on may differ from the one it operates on in deployment. While such distribution shifts, in general, hinder accurate predictions, our work identifies a unique opportunity associated with shifts due to strategic responses: We show that we can use strategic responses effectively to recover causal relationships between the observable features and outcomes we wish to predict. More specifically, we study a game-theoretic model in which a principal deploys a sequence of models to predict an outcome of interest (e.g., college GPA) for a sequence of strategic agents (e.g., college applicants). In response, strategic agents invest efforts and modify their features for better predictions. In such settings, unobserved confounding variables (e.g., family educational background) can influence both an agent's observable features (e.g., high school records) and outcomes (e.g., college GPA). Therefore, standard regression methods (such as OLS) generally produce biased estimators. In order to address this issue, our work establishes a novel connection between strategic responses to machine learning models and instrumental variable (IV) regression, by observing that the sequence of deployed models can be viewed as an instrument that affects agents' observable features but does not directly influence their outcomes. Therefore, two-stage least squares (2SLS) regression can recover the causal relationships between observable features and outcomes.
Keegan Harris, Daniel Ngo, Logan Stapleton, Hoda Heidari, Steven Wu 
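The contrast between OLS and 2SLS under unobserved confounding can be sketched in a few lines. This is a generic textbook illustration rather than the authors' strategic setting: the scalar instrument z merely stands in for the deployed model, and the linear data generating process, its coefficients, and the function names are assumptions made for the example.

```python
import numpy as np


def ols(x, y):
    """Least-squares fit of y on [1, x]; returns (intercept, slope)."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def two_stage_least_squares(z, x, y):
    """2SLS with one instrument z, one endogenous feature x, outcome y."""
    a0, a1 = ols(z, x)            # stage 1: project the feature on the instrument
    x_hat = a0 + a1 * z
    return ols(x_hat, y)          # stage 2: regress the outcome on the projection


def simulate(n=100_000, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)                        # unobserved confounder
    z = rng.normal(size=n)                        # instrument, independent of u
    x = z + u + rng.normal(size=n)                # feature moved by both z and u
    y = 2.0 * x + 3.0 * u + rng.normal(size=n)    # assumed causal slope = 2
    return z, x, y


z, x, y = simulate()
print(f"OLS slope:  {ols(x, y)[1]:.2f}")
print(f"2SLS slope: {two_stage_least_squares(z, x, y)[1]:.2f}")
```

Because u affects both x and y, the OLS slope is pulled away from 2, while the instrumented fit stays close to it; the instrument only helps because it shifts x without entering the outcome equation directly.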


Deep Causal Inequalities: Demand Estimation in Differentiated Products Markets
(Poster)
[ Visit Poster at Spot A6 in Virtual World ]
Supervised machine learning algorithms fail to perform well in the presence of endogeneity in the explanatory variables. In this paper, we borrow from the literature on partial identification to propose deep causal inequalities that overcome this issue. Instead of relying on observed labels, the DeepCI estimator uses inferred inequalities from the observed behavior of agents in the data. This by construction can allow us to circumvent the issue of endogenous explanatory variables in many cases. We provide theoretical guarantees for our estimator and demonstrate it is consistent under very mild conditions. We demonstrate through extensive simulations that our estimator outperforms standard supervised machine learning algorithms and existing partial identification methods.
Edvard Bakhitov, Aman Singh, Jiding Zhang 


Causal Inference Struggles with Agency on Online Platforms
(Poster)
[ Visit Poster at Spot A3 in Virtual World ]
Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and propensity score weighting while controlling for plausible confounding variables. In all cases, all observational estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we present a “Catch-22” that undermines the use of causal inference in these settings: we give users control because we postulate that there is no adequate model for predicting user behavior, but performing observational causal inference successfully requires exactly that.
Smitha Milli, Luca Belli, Moritz Hardt 


Randomization does not imply unconfoundedness
(Poster)
[ Visit Poster at Spot C1 in Virtual World ]
A common assumption in causal inference is that random treatment assignment ensures that potential outcomes are independent of treatment, or in one word, unconfoundedness. This paper highlights that randomization and unconfoundedness are separate properties, and neither implies the other. A study with random treatment assignment does not have to be unconfounded, and a study with deterministic assignment can still be unconfounded. A corollary is that a propensity score is not the same thing as a treatment assignment probability. These facts should not be taken as arguments against randomization. The moral of this paper is that randomization is useful only when investigators know or can reconstruct the assignment process. 
Fredrik Savje 


Models, identifiability, and estimability in causal inference
(Poster)
[ Visit Poster at Spot B5 in Virtual World ]
Here we discuss two common but, in our view, misguided assumptions in causal inference. The first assumption is that one requires potential outcomes, directed acyclic graphs (DAGs), or structural causal models (SCMs) for thinking about causal inference in statistics. The second is that identifiability of a quantity implies estimability of that quantity. These views are not universal, but we believe they are sufficiently common to warrant comment. 
Oliver Maclaren 


Variational Causal Networks: Approximate Bayesian Inference over Causal Structures
(Poster)
[ Visit Poster at Spot D1 in Virtual World ]
Learning the causal structure that underlies data is a crucial step towards robust real-world decision making. The majority of existing work in causal inference focuses on determining a single directed acyclic graph (DAG) or a Markov equivalence class thereof. However, acting intelligently upon knowledge about causal structure that has been inferred from finite data demands reasoning about its uncertainty. For instance, planning interventions to find out more about the causal mechanisms that govern our data requires quantifying epistemic uncertainty over DAGs. While Bayesian causal inference allows one to do so, the posterior over DAGs becomes intractable even for a small number of variables. Aiming to overcome this issue, we propose a form of variational inference over the graphs of Structural Causal Models (SCMs). To this end, we introduce a parametric variational family modelled by an autoregressive distribution over the space of discrete DAGs. Its number of parameters does not grow exponentially with the number of variables and can be tractably learned by maximising an Evidence Lower Bound (ELBO). In our experiments, we demonstrate that the proposed variational posterior is able to provide a good approximation of the true posterior.
Yashas Annadani, Jonas Rothfuss, Alexandre Lacoste, Nino Scherrer, Anirudh Goyal, Yoshua Bengio, Stefan Bauer 


Optimal transport for causal discovery
(Poster)
[ Visit Poster at Spot C0 in Virtual World ]
Recently, approaches based on Functional Causal Models (FCMs) have been proposed to determine causal direction between two variables, by restricting model classes; however, their performance is sensitive to the model assumptions. In this paper, we provide a dynamical-system view of FCMs and propose a new framework for identifying causal direction in the bivariate case. We first interpret FCMs in the bivariate case as an optimal transport problem under proper structural constraints. By exploiting the dynamical interpretation of optimal transport, we then derive the underlying time evolution of static cause-effect pair data under the least action principle. It provides a new dimension for describing static causal discovery tasks, while enjoying more freedom for modeling the quantitative causal influences. In particular, we show that additive noise models correspond to volume-preserving pressureless flows. Consequently, based on their velocity field divergence, we derive a criterion to determine causal direction. With this criterion, we propose a novel optimal transport based algorithm which is robust to the choice of models. Our method demonstrated promising results on both synthetic and real cause-effect pair datasets.
Ruibo Tu, Kun Zhang, Hedvig Kjellström, Cheng Zhang 


Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators
(Poster)
[ Visit Poster at Spot B1 in Virtual World ]
The machine learning toolbox for estimation of heterogeneous treatment effects from observational data is expanding rapidly, yet many of its algorithms have been evaluated only on a very limited set of semi-synthetic benchmark datasets. In this paper, we show that even in arguably the simplest setting (estimation under ignorability assumptions) the results of such empirical evaluations can be misleading if (i) the assumptions underlying the data-generating mechanisms in benchmark datasets and (ii) their interplay with baseline algorithms are inadequately discussed. We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators, the IHDP and ACIC2016 datasets, in detail. We identify problems with their current use and highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others, a fact that is rarely acknowledged but of immense relevance for interpretation of empirical results. We close by discussing implications and possible next steps.
Alicia Curth, Mihaela van der Schaar 


Underexploring in Bandits with Confounded Data
(Poster)
[ Visit Poster at Spot C6 in Virtual World ]
We study the problem of Multi-Armed Bandits with mean bounds where each arm is associated with an interval in which its mean reward lies. We develop the GLobal UnderExplore (GLUE) algorithm which, for each arm, uses these intervals to infer "pseudo-variances" that instruct the rate of exploration. We provide regret guarantees for GLUE and show that it is never worse than the standard Upper Confidence Bound Algorithm. Further, we show regimes in which GLUE improves upon existing regret guarantees for structured bandit problems. Finally, we present the practical setting of learning adaptive interventions using prior confounded data in which unrecorded variables affect rewards. We show that mean bounds for each intervention can be extracted from such logs and can thus be used to improve the learning process. We also provide semi-synthetic experiments on real-world data sets to validate our findings.
Nihal Sharma, Soumya Basu, Karthikeyan Shanmugam, Sanjay Shakkottai 


The Adaptive Doubly Robust Estimator for Policy Evaluation in Adaptive Experiments
(Poster)
[ Visit Poster at Spot C4 in Virtual World ]
We consider policy evaluation with dependent samples gathered from adaptive experiments. To deal with the dependency, existing studies, such as van der Laan (2008), proposed estimators including an inverse probability weight, whose score function has a martingale property. However, these estimators require the true logging policy (the probability of choosing an action) for using the martingale property. To mitigate this neglected assumption, we propose the doubly robust (DR) estimator, which consists of two nuisance estimators of the conditional mean outcome and the logging policy, for the dependent samples. To obtain an asymptotically normal semiparametric estimator from dependent samples without Donsker nuisance estimators and martingale property, we propose adaptive-fitting as a variant of sample-splitting proposed by Chernozhukov et al. (2018) for independent and identically distributed samples. We confirm the empirical performance through simulation studies and report that the DR estimator also has a stabilization effect.
Masahiro Kato, Shota Yasui, Kenichiro McAlinn 
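For readers unfamiliar with the doubly robust construction, the sketch below shows the standard augmented inverse-probability-weighted form for independent samples with plug-in nuisance values. It does not implement the adaptive-fitting procedure for dependent samples proposed in this abstract; the simulated logging policy, outcome model, assumed effect of 2, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def aipw_ate(y, t, e_hat, m1_hat, m0_hat):
    """Doubly robust (AIPW) estimate of the average treatment effect.

    e_hat:  estimated probability of treatment (the logging policy)
    m1_hat: estimated conditional mean outcome under treatment
    m0_hat: estimated conditional mean outcome under control
    """
    y, t, e_hat, m1_hat, m0_hat = (
        np.asarray(a, dtype=float) for a in (y, t, e_hat, m1_hat, m0_hat)
    )
    correction = (
        t * (y - m1_hat) / e_hat - (1.0 - t) * (y - m0_hat) / (1.0 - e_hat)
    )
    return float(np.mean(m1_hat - m0_hat + correction))


def simulate_logged_data(n=200_000, seed=0):
    """Hypothetical i.i.d. logs; the assumed true effect is 2."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    e = np.where(x > 0, 0.8, 0.2)
    t = rng.binomial(1, e)
    y = 3.0 * x + 2.0 * t + rng.normal(size=n)
    return x, t, y, e


x, t, y, e = simulate_logged_data()
zeros = np.zeros_like(x)          # deliberately wrong outcome model
half = np.full_like(x, 0.5)       # deliberately wrong logging policy

right_e_wrong_m = aipw_ate(y, t, e, zeros, zeros)
wrong_e_right_m = aipw_ate(y, t, half, 3.0 * x + 2.0, 3.0 * x)
both_wrong = aipw_ate(y, t, half, zeros, zeros)
print(right_e_wrong_m, wrong_e_right_m, both_wrong)
```

The estimate stays near the assumed effect when either nuisance is correct and drifts only when both are misspecified, which is the sense in which the estimator is doubly robust.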


Causal-BALD: Deep Bayesian Active Learning of Outcomes to Infer Treatment-Effects from Observational Data
(Poster)
[ Visit Poster at Spot A4 in Virtual World ]
Estimating personalized treatment effects from high-dimensional observational data is essential in situations where experimental designs are infeasible, unethical or expensive. Existing approaches rely on fitting deep models on outcomes observed for treated and control populations, but when measuring the outcome for an individual is costly (e.g. biopsy) a sample efficient strategy for acquiring outcomes is required. Deep Bayesian active learning provides a framework for efficient data acquisition by selecting points with high uncertainty. However, naive application of existing methods selects training data that is biased toward regions where the treatment effect cannot be identified because there is non-overlapping support between the treated and control populations. To maximize sample efficiency for learning personalized treatment effects, we introduce new acquisition functions grounded in information theory that bias data acquisition towards regions where overlap is satisfied, by combining insights from deep Bayesian active learning and causal inference. We demonstrate the performance of the proposed acquisition strategies on synthetic and semi-synthetic datasets IHDP and CMNIST and their extensions which aim to simulate common dataset biases and pathologies.
Andrew Jesson, Panagiotis Tigas, Joost van Amersfoort, Andreas Kirsch, Uri Shalit, Yarin Gal 


VICAUSE: Simultaneous missing value imputation and causal discovery
(Poster)
[ Visit Poster at Spot D2 in Virtual World ]
Missing values constitute an important challenge in real-world machine learning for both prediction and causal discovery tasks. However, only a few methods in causal discovery can handle missing data efficiently, while existing imputation methods are agnostic to causality. In this work we propose VICAUSE, a novel approach to simultaneously tackle missing value imputation and causal discovery efficiently with deep learning. In particular, we propose a generative model with a structured latent space and a graph neural network-based architecture, scaling to a large number of variables. Moreover, our method can discover relationships between groups of variables, which is useful in many real-world applications. VICAUSE shows improved performance compared to popular and recent approaches in both missing value imputation and causal discovery.
Pablo Morales-Alvarez, Angus Lamb, Simon Woodhead, Simon Peyton Jones, Miltiadis Allamanis, Cheng Zhang


Leveraging molecular negative controls for effect estimation in nonrandomized human health and disease studies: a demonstrative simulation study
(Poster)
[ Visit Poster at Spot B4 in Virtual World ]
Background: Exploratory null-hypothesis significance testing (e.g. GWAS, EWAS) forms the backbone of molecular mechanism discovery; however, methods to identify true causal signals are underdeveloped. We evaluate two negative control approaches to quantitatively control for shared unmeasured confounding and recover unbiased effects using epigenomic data and biologically-informed structural assumptions.
Methods: We consider the application of the control outcome calibration approach (COCA) and proximal g-computation (PGC) to case studies in reproductive genomics. COCA may be employed when maternal epigenome has no direct effects on phenotype and proxy shared unmeasured confounders and PG further with suitable genetic instruments (e.g. mQTLs). Baseline covariates were extracted from 777 mother-child pairs in a birth cohort with maternal blood and fetal cord DNA methylation array data. Treatment, negative control, and outcome values were simulated in 2000 bootstraps under a plasmode simulation framework. Bootstrapped, ordinary (COCA) and 2-stage (PGC) least squares were fitted to estimate treatment effects and standard errors under various settings of missing confounders (e.g. paternal data). Regression adjustment and a naive application of doubly-robust, ensemble learning efficient estimators were compared.
Results: COCA and PGC performed well in simplistic data generating processes. However, in real-world cohort simulations, COCA performed acceptably only in settings with strong proxy confounders, but otherwise poorly (median bias 610%; coverage 29%). PGC performed slightly better. Alternatively, simple covariate adjustments generally outperformed all others in bias and confidence interval coverage across scenarios (median bias 22%; 71% coverage).
Discussion: Molecular epidemiology provides a key opportunity to leverage biological knowledge against unmeasured confounding, but these identification strategies are underutilized and understudied in this context. Negative control calibration or adjustments may help under limited scenarios where assumptions are fulfilled, but should be tested with simulations closer to real-world conditions.
Jon Huang 


On the Distinction Between "Conditional Average Treatment Effects" (CATE) and "Individual Treatment Effects" (ITE) Under Ignorability Assumptions
(Poster)
[ Visit Poster at Spot B6 in Virtual World ]
Recent years have seen a swell in methods that focus on estimating 
Brian Vegetabile 


Signal Manipulation and the Causal Status of Race
(Poster)
See Attached 
Naftali Weinberger 


Typing assumptions improve identification in causal discovery
(Poster)
[ Visit Poster at Spot C5 in Virtual World ]
Causal discovery from observational data is a challenging task to which an exact solution cannot always be identified. Under assumptions about the data-generative process, the causal graph can often be identified up to an equivalence class. Proposing new realistic assumptions to circumscribe such equivalence classes is an active field of research. In this work, we propose a new set of assumptions that constrain possible causal relationships based on the nature of the variables. We thus introduce typed directed acyclic graphs, in which variable types are used to determine the validity of causal relationships. We demonstrate, both theoretically and empirically, that the proposed assumptions can result in significant gains in the identification of the causal graph.
Philippe Brouillard, Perouz Taslakian, Alexandre Lacoste, Sebastien Lachapelle, Alexandre Drouin 
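To make the idea of typed graphs concrete, here is a minimal sketch of one possible typing constraint. It assumes (this is an illustration, not necessarily the exact definition used in the work above) that once some edge runs from a variable of type a to a variable of type b, no edge may run from type b back to type a, and that edges within a type are unconstrained. The function name and the example variables are hypothetical.

```python
def is_type_consistent(edges, var_type):
    """Check an assumed typing constraint on a directed graph.

    edges    : iterable of (parent, child) variable names
    var_type : dict mapping each variable name to its type label

    Assumed rule (illustrative only): if any edge goes from type a to
    type b (a != b), then no edge may go from type b to type a.
    """
    seen_orientations = set()
    for parent, child in edges:
        t_parent, t_child = var_type[parent], var_type[child]
        if t_parent == t_child:
            continue  # within-type edges are left unconstrained here
        if (t_child, t_parent) in seen_orientations:
            return False  # two types are connected in both directions
        seen_orientations.add((t_parent, t_child))
    return True


# Hypothetical example: two "gene" variables and two "phenotype" variables.
types = {"g1": "gene", "g2": "gene", "p1": "phenotype", "p2": "phenotype"}
consistent_graph = [("g1", "p1"), ("g2", "p1"), ("g1", "g2")]
inconsistent_graph = [("g1", "p1"), ("p2", "g1")]

print(is_type_consistent(consistent_graph, types))
print(is_type_consistent(inconsistent_graph, types))
```

Under such a rule, orienting a single edge between two types fixes the orientation of every other edge between those types, which is one way a typing assumption can shrink an equivalence class.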


Understanding the Role of Prognostic Factors and Effect Modifiers in Heterogeneity of Treatment Effect using a Within-Subjects Analysis of Variance
(Poster)
[ Visit Poster at Spot D0 in Virtual World ]
Personalized Evidence-Based Medicine (EBM) aims to estimate patient-specific causal effects using covariate information. In order to adequately estimate these Individual Treatment Effects (ITEs), a thorough understanding of the role of covariates in heterogeneous datasets is necessary. In this preliminary work, we distinguish prognostic factors, which influence the outcome variable, from effect modifiers, which influence the treatment effect. By means of a small synthetic data experiment where we temporarily disregard the fundamental problem of causal inference, we evaluate within-subjects variance for three possible distributions of ITEs, while keeping the Average Treatment Effect (ATE) fixed. The hypothetical nature of the experiment allows us to further understand the role of prognostic factors and effect modifiers in estimating ATEs and ITEs. 
Rianne Schouten, Mykola Pechenizkiy 
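The point that very different ITE distributions can share one ATE can be sketched in a few lines. The three distributions below (constant, Gaussian, bimodal), the sample size, and the ATE value are illustrative assumptions, not the settings used in the work above; as in the abstract, the sketch pretends that individual effects are directly observable.

```python
import random
import statistics


def simulate_ites(kind, n=10_000, ate=2.0, seed=0):
    """Draw n hypothetical individual treatment effects with a fixed mean."""
    rng = random.Random(seed)
    if kind == "constant":
        ites = [ate] * n
    elif kind == "gaussian":
        ites = [rng.gauss(ate, 1.0) for _ in range(n)]
    elif kind == "bimodal":
        # Half of the units benefit strongly, half are harmed.
        ites = [ate + (3.0 if rng.random() < 0.5 else -3.0) for _ in range(n)]
    else:
        raise ValueError(f"unknown kind: {kind}")
    # Re-centre so that the sample ATE equals `ate` in every scenario.
    shift = ate - statistics.fmean(ites)
    return [t + shift for t in ites]


summary = {}
for kind in ("constant", "gaussian", "bimodal"):
    ites = simulate_ites(kind)
    summary[kind] = (statistics.fmean(ites), statistics.pvariance(ites))
    print(kind, "ATE:", round(summary[kind][0], 3),
          "ITE variance:", round(summary[kind][1], 3))
```

All three scenarios report the same ATE, while the spread of individual effects differs widely, which is why an ATE alone says little about patient-specific effects.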


Discovering Latent Causal Variables via Mechanism Sparsity: A New Principle for Nonlinear ICA
(Poster)
[ Visit Poster at Spot B0 in Virtual World ]
It can be argued that finding an interpretable low-dimensional representation of a potentially high-dimensional phenomenon is central to the scientific enterprise. Independent component analysis (ICA) refers to an ensemble of methods which formalize this goal and provide estimation procedures for practical application. This work proposes mechanism sparsity regularization as a new principle to achieve nonlinear ICA when latent factors depend sparsely on observed auxiliary variables and/or past latent factors. We show that the latent variables can be recovered up to a permutation if one regularizes the latent mechanisms to be sparse and if some graphical criterion is satisfied by the data-generating process. As a special case, our framework shows how one can leverage unknown-target interventions on the latent factors to disentangle them, thus drawing further connections between ICA and causality. We validate our theoretical results with toy experiments. 
Sebastien Lachapelle, Pau Rodriguez, Remi Le Priol, Alexandre Lacoste 


Bayesian Inference for Partial Identification of Multiple Treatment Effects
(Poster)
In Bayesian causal inference for partially identified parameters, there is a delicate balance between parameterizing models in terms of the fully identified and unidentified parameters directly versus modeling the parameters of primary scientific interest. We explore the challenges of Bayesian inference for partially identified models in the context of multi-treatment causal inference with unobserved confounding in the linear model, where the treatment effects are partially identified. We demonstrate how carefully chosen priors can be used to incorporate additional scientific assumptions which further constrain the set of causal conclusions, and describe how our approach can be used to assess the robustness and sensitivity of the outcomes. We illustrate our approach to multi-treatment causal inference in an example quantifying the effect of gene expression levels on mouse obesity. 
Alexander Franks, Jiajing Zheng 


DRTCI: Learning Disentangled Representations for Temporal Causal Inference
(Poster)
[ Visit Poster at Spot D5 in Virtual World ]
Medical professionals evaluating alternative treatment plans for a patient often encounter time-varying confounders, i.e., covariates that affect both the future treatment assignment and the patient outcome. The recently proposed Counterfactual Recurrent Network (CRN) accounts for time-varying confounders by using adversarial training to balance recurrent historical representations of patient data. However, this work assumes that all time-varying covariates are confounding and thus attempts to balance the full state representation. Given that the actual subset of covariates that may in fact be confounding is in general unknown, recent work on counterfactual evaluation in the static, non-temporal setting has suggested disentangling the covariate representation into separate factors, each of which influences treatment selection, patient outcome, or both. This can help isolate selection bias and restrict balancing efforts to factors that influence the outcome, allowing the remaining factors to predict treatment without needlessly being balanced. We hypothesize that such disentanglement should be possible in the temporal setting as well, and would be beneficial when dealing with time-varying confounders. We propose DRTCI, a model for temporal causal inference which uses a recurrent neural network to learn a hidden representation of the patient's evolving covariates that disentangles into three factors, each of which causally determines treatment, outcome, or both. The model is evaluated on the same simulated model of tumour growth used to evaluate the CRN, with varying degrees of time-dependent confounding. The resulting outcome predictions from DRTCI significantly outperform the predictions from existing baselines, especially for cases with high confounding and minimal historical data (early prediction). Ablation experiments are additionally performed to identify the key contributing factors to the performance of DRTCI. 
Garima Gupta, Lovekesh Vig, Gautam Shroff 


Data-Driven Exclusion Criteria for Instrumental Variables
(Poster)
[ Visit Poster at Spot A5 in Virtual World ]
When conducting instrumental variable studies, practitioners may exclude units in data processing prior to estimation. This exclusion step is critical for the study design but is often neglected. Here we view this problem as a well-defined tradeoff between statistical power and external validity, which can be navigated with a data-driven strategy. Our method estimates the probability of units being compliant and increases statistical power by excluding units with low compliance probability. This data-driven exclusion criterion can help navigate the tradeoff between power and external validity for many quasi-experimental settings. 
Tony Liu, Lyle Ungar, Konrad Kording 
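The power side of this tradeoff can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not taken from the work above: each unit's compliance probability is treated as known (rather than estimated), non-compliers never take the treatment, and the treatment effect is constant, so exclusion costs nothing in external validity here. The 0.5 cutoff and all names are hypothetical.

```python
import random


def wald(rows):
    """Wald IV estimate and first-stage strength from (z, d, y) rows."""
    z1 = [(d, y) for z, d, y in rows if z == 1]
    z0 = [(d, y) for z, d, y in rows if z == 0]

    def mean(values):
        return sum(values) / len(values)

    first_stage = mean([d for d, _ in z1]) - mean([d for d, _ in z0])
    reduced_form = mean([y for _, y in z1]) - mean([y for _, y in z0])
    return reduced_form / first_stage, first_stage


rng = random.Random(0)
TRUE_EFFECT = 1.0
units = []
for _ in range(20_000):
    p = rng.random()                   # compliance probability (assumed known)
    z = 1 if rng.random() < 0.5 else 0  # randomly assigned instrument
    complier = rng.random() < p
    d = z if complier else 0            # non-compliers never take treatment
    y = TRUE_EFFECT * d + rng.gauss(0, 1)
    units.append((p, z, d, y))

est_all, fs_all = wald([(z, d, y) for p, z, d, y in units])
est_kept, fs_kept = wald([(z, d, y) for p, z, d, y in units if p >= 0.5])

print("all units : estimate", round(est_all, 3), "first stage", round(fs_all, 3))
print("p >= 0.5  : estimate", round(est_kept, 3), "first stage", round(fs_kept, 3))
```

Dropping low-compliance units strengthens the first stage at the cost of a smaller sample. With heterogeneous effects, the retained subgroup would also answer a question about a narrower population, which is the external-validity cost the abstract refers to.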


Nonparametric identification is not enough, but randomized controlled trials are
(Poster)
We argue that randomized controlled trials (RCTs) are special even among settings where average treatment effects are identified by a nonparametric unconfoundedness assumption. We argue that this claim follows from two results of Robins and Ritov (1997): (1) with at least one continuous covariate control, no estimator of the average treatment effect exists which is uniformly consistent without further assumptions; (2) knowledge of the propensity score yields a consistent estimator and confidence intervals at parametric rates, regardless of how complicated the propensity score function is. We emphasize the latter point, and note that successfully conducted RCTs provide knowledge of the propensity score to the researcher. We discuss modern developments in covariate adjustment for RCTs, noting that statistical models and machine learning methods can be used to improve efficiency while preserving finite-sample unbiasedness. We conclude that statistical inference may be fundamentally more difficult in observational settings than it is in RCTs, even when all confounders are measured. 
P Aronow 
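As a rough illustration of point (2), the sketch below simulates a trial in which the assignment probability depends on a covariate but is known by design, and estimates the ATE with a Horvitz-Thompson (inverse-probability-weighted) estimator that uses only that known propensity score. The outcome model, propensity function, and sample size are made-up assumptions for the example, not taken from the work above.

```python
import math
import random

rng = random.Random(1)


def propensity(x):
    """Assignment probability, known to the researcher by design."""
    return 0.2 + 0.6 * x


TRUE_ATE = 1.5  # E[1 + x] for x ~ Uniform(0, 1)
n = 50_000
ht_terms = []
for _ in range(n):
    x = rng.random()
    e = propensity(x)
    treated = rng.random() < e
    y0 = math.sin(6 * x) + rng.gauss(0, 0.5)  # nonlinear baseline outcome
    y1 = y0 + 1.0 + x                          # heterogeneous treatment effect
    y = y1 if treated else y0                  # only one outcome is observed
    # Horvitz-Thompson contribution: no outcome model is fitted.
    ht_terms.append(y / e if treated else -y / (1 - e))

ht_estimate = sum(ht_terms) / n
print("Horvitz-Thompson ATE estimate:", round(ht_estimate, 3))
```

The estimator never models how the outcome depends on the covariate; it relies only on the known assignment probabilities, which is what a successfully conducted trial supplies.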


When Can We Achieve Small Error in Observational Causal Inference?
(Poster)
[ Visit Poster at Spot D3 in Virtual World ]
We explore the conditions necessary to guarantee sharp upper bounds on the mean squared error when estimating mean counterfactual outcomes from observational data. In particular, we analyze the large family of design-based weighting estimators, which includes balancing weights and matching. Beginning from the bias-variance decomposition, we argue that assumptions have to be made about the outcome function in order to choose a high-performance estimator. For a theoretical framework, we use integral probability metrics and $\phi$-divergences to analyze the bias-variance tradeoff. Finally, we consider conditions under which our mean squared error bounds are robust to failure of our assumptions.
David Bruns-Smith 


A Topological Perspective on Causal Inference
(Poster)
[ Visit Poster at Spot A1 in Virtual World ]
As an approach to the workshop theme of causal assumptions, we offer a topological learning-theoretic perspective on causal inference by introducing a series of topologies defined on general spaces of structural causal models (SCMs). To illustrate the power of the framework, we prove a topological causal hierarchy theorem, showing that substantive assumption-free causal inference is possible only in a meager set of SCMs. Thanks to a correspondence between open sets in the weak topology and statistically verifiable hypotheses, our results show that inductive assumptions sufficient to license valid causal inferences are statistically unverifiable in principle. Similar to no-free-lunch theorems for statistical inference, the present results clarify the inevitability of substantial assumptions for causal inference. We furthermore suggest that the framework may be helpful for the positive project of exploring and assessing alternative causal-inductive assumptions. 
Duligur Ibeling, Thomas Icard 


Scalable Algorithms for Nonlinear Causal Inference
(Poster)
[ Visit Poster at Spot C2 in Virtual World ]
We derive and implement nonlinear extensions of the classical instrumental variable regression (IVR) technique. Our key insight is that even in the nonlinear setting, finding a causally consistent estimate of a structural equation is equivalent to satisfying constraints on conditional outcome moments. This insight allows us to leverage standard constrained optimization techniques to reframe the work of Dikkala et al. as optimizing a regularized Lagrangian and to reveal underlying smoothness assumptions. We then propose a new algorithm, CausAL, that instead optimizes an augmented Lagrangian, requiring a different definition of smoothness and no adversarial training. We then extend our method to handle matching outcome distributions instead of just expected values, propose an efficient no-regret procedure, and implement a practical realization via a modification of an Integral Probability Metric (IPM) GAN, which we call ACADIMI. 
Gokul Swamy, Sanjiban Choudhury, Drew Bagnell, Steven Wu 


DoWhy: Addressing Challenges in Expressing and Validating Causal Assumptions
(Poster)
[ Visit Poster at Spot B2 in Virtual World ]
Estimation of causal effects involves crucial assumptions about the data-generating process, such as directionality of effect, presence of instrumental variables or mediators, and whether all relevant confounders are observed. Violation of any of these assumptions leads to significant error in the effect estimate. However, unlike cross-validation for predictive models, there is no global validator method for a causal estimate. As a result, expressing different causal assumptions formally and validating them (to the extent possible) becomes critical for any analysis. We present DoWhy, a framework that allows explicit declaration of assumptions through a causal graph and provides multiple validation tests to check a subset of these assumptions. Our experience with DoWhy highlights a number of open questions for future research: developing new ways beyond causal graphs to express assumptions, the role of causal discovery in learning relevant parts of the graph, and developing validation tests that can better detect errors, both for average and conditional treatment effects. DoWhy is available at https://github.com/microsoft/dowhy. 
Amit Sharma, Vasilis Syrgkanis, Cheng Zhang, Emre Kiciman 


On formalizing causal off-policy sequential decision-making
(Poster)
[ Visit Poster at Spot D6 in Virtual World ]
Assessing the effects of deploying a policy based on retrospective data collected from a different policy is a common problem across several high-stakes decision-making domains. A number of off-policy evaluation (OPE) techniques have been proposed for this purpose with different bias-variance tradeoffs. However, these methods largely formulate OPE as a problem disassociated from the process used to generate the data. Posing OPE instead as a causal estimand has strong implications ranging from our fundamental understanding of the complexity of the OPE problem to which methods we apply in practice, and can help highlight gaps in existing literature in terms of the overall objective of OPE. Many formalisms of OPE additionally overlook the role of uncertainty entirely in the estimation process, which can significantly bias the estimation of counterfactuals and produce large errors in OPE as a result. Finally, depending on how we formalise OPE, human expertise can be particularly helpful in assessing the validity of OPE estimates or improving estimation from a finite number of samples to achieve certain efficiency guarantees. In this position paper, we discuss each of these issues in terms of the role they play in OPE. Importantly, each of these aspects may be viewed as a means of assessing the validity of various other common assumptions made in causal inference. 
Sonali Parbhoo, Shalmali Joshi, Finale Doshi-Velez 
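For readers unfamiliar with OPE, the sketch below shows one of the simplest estimators in that family, importance sampling, in a two-action setting without context. It is an illustration only and not a method proposed in the work above. It assumes that the behaviour policy's action probabilities are known, that both actions have positive probability under it (overlap), and that nothing unobserved drives both action choice and reward; these are exactly the kind of data-generating assumptions the abstract argues should be made explicit. All numbers are made up.

```python
import random

rng = random.Random(2)

behaviour = [0.7, 0.3]    # policy that generated the logged data (assumed known)
target = [0.2, 0.8]       # policy we would like to evaluate
mean_reward = [0.0, 1.0]  # unknown to the analyst; used only to simulate

n = 20_000
weighted_rewards = []
for _ in range(n):
    action = 0 if rng.random() < behaviour[0] else 1
    reward = mean_reward[action] + rng.gauss(0, 1)
    # Reweight each logged reward by how much more (or less) likely the
    # target policy is to take the logged action.
    weight = target[action] / behaviour[action]
    weighted_rewards.append(weight * reward)

is_estimate = sum(weighted_rewards) / n
true_value = sum(p * m for p, m in zip(target, mean_reward))

print("importance-sampling estimate:", round(is_estimate, 3))
print("true value of target policy :", round(true_value, 3))
```

If the logged actions had been influenced by an unrecorded variable that also affects reward, the same computation would run without complaint and return a biased answer, which is why tying OPE to the data-generating process matters.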


Statistical Decidability in Confounded, Linear Non-Gaussian Models
(Poster)
Since Spirtes et al. (2000), it has been well known that if causal relationships are linear and noise terms are independent and Gaussian, causal orientation is not identified from observational data, even if causal faithfulness is satisfied. Shimizu et al. (2006) showed that linear, non-Gaussian (LiNGAM) causal models are identified from observational data, so long as no latent confounders are present. That holds even when faithfulness fails. Genin and Mayo-Wilson (2020) refine that identifiability result: not only are causal relationships identified, but causal orientation is statistically decidable. That means that for every α > 0, there is a method that converges in probability to the correct orientation and, at every sample size, outputs an incorrect orientation with probability less than α. These results naturally raise questions about what happens in the presence of latent confounders. Hoyer et al. (2008) and Salehkaleybar et al. (2020) show that, although the causal model is not uniquely identified, causal orientation among observed variables is identified in the presence of latent confounders, so long as faithfulness is satisfied. This paper refines these results. When we allow for the presence of latent confounders, causal orientation is no longer statistically decidable. Although it is possible to converge in probability to the correct orientation, it is not possible to do so with finite-sample bounds on the probability of orientation errors. That is true even if causal faithfulness is satisfied. 
Konstantin Genin 
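The intuition behind orientation in the unconfounded linear non-Gaussian case can be sketched as follows: regressing the effect on the cause leaves a residual that is independent of the regressor, whereas regressing in the wrong direction leaves a residual that is uncorrelated with the regressor but still dependent on it. The sketch below uses a crude dependence measure, the correlation between squared residuals and squared regressor, chosen for simplicity; it is a heuristic for illustration, not the LiNGAM algorithm and not a procedure from the work above. It assumes uniform noise and no latent confounders.

```python
import math
import random


def _centred(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


def _corr(a, b):
    a, b = _centred(a), _centred(b)
    cov = sum(x * y for x, y in zip(a, b))
    return cov / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def dependence_score(regressor, response):
    """|corr(residual^2, regressor^2)| after a least-squares fit."""
    r, s = _centred(regressor), _centred(response)
    slope = sum(x * y for x, y in zip(r, s)) / sum(x * x for x in r)
    residual = [y - slope * x for x, y in zip(r, s)]
    return abs(_corr([e * e for e in residual], [x * x for x in r]))


def infer_direction(x, y):
    """Pick the direction whose residual looks less dependent on the regressor."""
    return "x->y" if dependence_score(x, y) < dependence_score(y, x) else "y->x"


rng = random.Random(3)
x = [rng.uniform(-1, 1) for _ in range(5_000)]
y = [xi + rng.uniform(-1, 1) for xi in x]  # true model: x causes y

print("score x->y:", round(dependence_score(x, y), 3))
print("score y->x:", round(dependence_score(y, x), 3))
print("inferred  :", infer_direction(x, y))
```

With Gaussian noise both scores would be close to zero and the comparison would be uninformative, which matches the non-identifiability result cited at the start of the abstract.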


Lie interventions in complex systems with cycles
(Poster)
[ Visit Poster at Spot D4 in Virtual World ]
Complex systems often contain feedback loops that can be described as cyclic causal models. Intervening in such systems may lead to counterintuitive effects, which cannot be inferred directly from the graph structure. After establishing a framework for differentiable interventions based on Lie groups, we take advantage of modern automatic differentiation techniques and their application to implicit functions in order to optimize interventions in cyclic causal models. We illustrate the use of this framework by investigating scenarios of transition to sustainable economies. 
Michel Besserve, Bernhard Schölkopf 


A Survey on Deep Learning of Potential Outcomes From a Social Science Perspective
(Poster)
[ Visit Poster at Spot A0 in Virtual World ]
This abstract describes a survey on deep causal estimators for a social science audience. While the machine learning community has moved quickly to leverage causal reasoning to improve predictive models, adoption of deep learning has been slower in areas of science that prioritize interpretability and robust evidence of causality for inference (e.g., epidemiology, social science, social statistics). Here we summarize deep learning models that adjust for confounding in creative ways (e.g., representation learning and generative modeling) to estimate/predict unbiased treatment effects, and/or extend causal inference beyond tabular data to text and networks. We discuss the strengths and weaknesses of these models from an applied social science perspective, and how the machine learning community might better frame/support their contributions to increase adoption by social and data scientists. 
Bernard Koch, Niki Kilbertus 