Consider a setting where there are N heterogeneous units (e.g.,
individuals, sub-populations) and D interventions (e.g., socio-economic
policies). Our goal is to learn the potential outcome associated with
every intervention on every unit (i.e., N × D causal parameters).
Towards this, we present a causal framework, synthetic interventions
(SI), to infer these N × D causal parameters while only observing each
of the N units under at most two interventions, independent of D. This
saving can be significant as the number of interventions, i.e., the
level of personalization, grows. Importantly, our estimator also allows for
latent confounders that determine how interventions are assigned.
Theoretically, under a novel tensor factor model across units,
measurements, and interventions, we formally establish an identification
result for each of these N × D causal parameters, and establish
finite-sample consistency and asymptotic normality of our estimator. The
estimator is furnished with a data-driven test to verify its
suitability. Empirically, we validate our framework through both
experimental and observational case studies: a large-scale A/B test on
an e-commerce platform, a personalized clinical trial for an
Alzheimer's drug, and an evaluation of mobility restrictions on
COVID-19 morbidity outcomes. We believe this has important implications
for program evaluation and the design of data-efficient RCTs with
heterogeneous units and multiple interventions.
Time
permitting, I will discuss how this work opens up avenues for
developing "panel-style" causal estimation in the regression
discontinuity design setting, for incorporating time-varying aspects,
and for estimating higher-order distributional properties of the
potential outcomes.
Synthetic Interventions is based on a joint work
with Anish Agarwal (MIT) and Dennis Shen (UC Berkeley). The latter part
of the talk, discussing future avenues, is based on ongoing
conversations with Alberto Abadie (MIT), Anish Agarwal (MIT), and
Dennis Shen (UC Berkeley).
Paper: https://arxiv.org/pdf/2006.07691.pdf
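To make the SI idea concrete, here is a minimal sketch of its core step in Python. This is a deliberate simplification of the estimator in the paper (which uses principal component regression to denoise the observations, and comes with the identification guarantees described above): the counterfactual outcome of a target unit under intervention d is estimated by learning a linear combination of "donor" units from outcomes observed under control, then transferring those weights to the donors' outcomes under d. The function name and the toy data are illustrative, not from the paper.

```python
import numpy as np

def synthetic_intervention(Y_control, Y_d, target, donors):
    """Simplified SI sketch (the actual estimator uses principal
    component regression rather than plain least squares).

    Y_control: (N, T) outcomes for all N units under control.
    Y_d:       (N, T) outcomes under intervention d (observed for donors).
    target:    index of the unit whose counterfactual under d we want.
    donors:    indices of units actually observed under intervention d.
    """
    # Learn weights w so that the target's control outcomes are
    # approximated by a linear combination of the donors' control outcomes.
    A = Y_control[donors].T              # (T, |donors|)
    b = Y_control[target]                # (T,)
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    # Transfer the same weights to the intervention-d regime.
    return Y_d[donors].T @ w             # estimated (T,) counterfactual

# Toy check: the target unit is an exact linear combination of the donors.
rng = np.random.default_rng(1)
Y_control = rng.normal(size=(4, 10))
Y_control[3] = 0.5 * Y_control[0] + 0.3 * Y_control[1] + 0.2 * Y_control[2]
Y_d = rng.normal(size=(4, 10))
est = synthetic_intervention(Y_control, Y_d, target=3, donors=[0, 1, 2])
```

In this noiseless, exactly low-rank toy the regression recovers the donor weights exactly; the tensor factor model in the paper is what makes such weight transfer across interventions valid in general.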
SGD without replacement: optimal rate analysis and more
Stochastic gradient descent (SGD) is omnipresent in machine learning. Two fundamental versions of SGD exist: (i) one that picks stochastic gradients with replacement, and (ii) one that picks gradients without replacement. Ironically, version (ii) is what is used in practice, while version (i) is what most theoretical works analyze. This mismatch is well-known. It arises because without-replacement sampling leads to non-independent stochastic gradients, which makes analysis hard.
In this talk I will present recent progress on analyzing without replacement SGD, focusing in particular on two key variants: RandomShuffle and SingleShuffle. I will summarize the best known convergence rates, as well as important refinements that result under additional assumptions on the loss functions. The results presented remove drawbacks common to most previous works on this topic.
Based on work with: Chulhee Yun and Kwangjun Ahn
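The two without-replacement variants mentioned above differ only in how the pass order is drawn: RandomShuffle draws a fresh permutation every epoch, while SingleShuffle draws one permutation and reuses it. A minimal sketch (the function name and toy problem are illustrative, not from the talk):

```python
import numpy as np

def sgd_without_replacement(grads, x0, lr, epochs, variant="RandomShuffle"):
    """Run SGD on a finite sum f(x) = (1/n) * sum_i f_i(x).

    grads:   list of per-component gradient functions f_i'.
    variant: "RandomShuffle" reshuffles every epoch;
             "SingleShuffle" shuffles once and reuses the order.
    """
    rng = np.random.default_rng(0)
    n = len(grads)
    order = rng.permutation(n)          # SingleShuffle keeps this forever
    x = x0
    for _ in range(epochs):
        if variant == "RandomShuffle":
            order = rng.permutation(n)  # fresh permutation each epoch
        for i in order:                 # each f_i used exactly once per pass
            x = x - lr * grads[i](x)
    return x

# Toy finite sum: f_i(x) = (x - a_i)^2 / 2, whose minimizer is mean(a) = 2.5.
a = np.array([1.0, 2.0, 3.0, 4.0])
gs = [lambda x, ai=ai: x - ai for ai in a]
x_hat = sgd_without_replacement(gs, x0=0.0, lr=0.05, epochs=200)
```

Both variants converge to a small neighborhood of the minimizer here; the analyses discussed in the talk quantify how fast, and how the two permutation schemes differ.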
We establish sharp minimax bounds on Imitation Learning (IL) in episodic Markov Decision Processes (MDPs), where the learner is provided a dataset of demonstrations from an expert. It is known that Behavior Cloning (BC) achieves suboptimality growing quadratically in the horizon, a phenomenon termed error compounding in the literature. We show that when the MDP transition function is unknown, every algorithm must suffer suboptimality that grows quadratically with the horizon, even if the algorithm can interactively query the expert, as in the setting of DAGGER. We then consider the setting of known transitions and show that one can provably break the quadratic dependence, improving the exponent to 3/2, which we show is tight. Our upper bound is established via a computationally efficient algorithm which we call Mimic-MD; the lower bound is established by proving a two-way reduction between IL and the problem of estimating the value of the unknown expert policy under a given reward function, together with linear functional estimation with subsampled observations. We further show that, under the additional assumption that the expert is optimal for the true reward function, there exists an efficient algorithm, which we call Mimic-Mixture, that provably achieves suboptimality independent of the horizon for arbitrary 3-state MDPs with rewards only at the terminal layer. In contrast, if the expert is not constrained to be optimal, no algorithm can achieve suboptimality growing slower than the square root of the horizon with high probability. We thus formally establish the benefit of the expert-optimality assumption in the known-transition setting, and show that this additional assumption does not help when the transition functions are unknown.
Based on joint work with Nived Rajaraman, Yanjun Han, Lin F. Yang, and Kannan Ramchandran.
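For intuition about the Behavior Cloning baseline discussed above, here is a minimal tabular sketch. It is not Mimic-MD or Mimic-Mixture; it only illustrates BC's majority-vote rule and the unvisited-state fallback that drives error compounding: once the learner reaches a state absent from the demonstrations, it can only guess, and errors accumulate over the horizon. Function names and the toy data are illustrative.

```python
import numpy as np
from collections import Counter, defaultdict

def behavior_cloning(demos, n_actions, rng=None):
    """Tabular Behavior Cloning: at each state visited in the demos, play
    the expert's majority action; at states never seen, fall back to a
    uniformly random action (a source of compounding error)."""
    rng = rng or np.random.default_rng(0)
    counts = defaultdict(Counter)
    for traj in demos:                     # traj is a list of (state, action)
        for s, a in traj:
            counts[s][a] += 1

    def policy(s):
        if s in counts:
            return counts[s].most_common(1)[0][0]
        return rng.integers(n_actions)     # unvisited state: pure guess
    return policy

# Toy demos over integer states; the expert always plays action 1.
demos = [[(0, 1), (1, 1), (2, 1)], [(0, 1), (2, 1)]]
pi = behavior_cloning(demos, n_actions=2)
```

On visited states the cloned policy matches the expert exactly; the quadratic-in-horizon lower bound discussed above formalizes the cost of the guesses it must make elsewhere.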
Equilibrium Computation and the Foundations of Deep Learning
Deep Learning has recently yielded important advances in single-agent learning challenges, much of that progress fueled by the empirical success of gradient descent and its variants in computing local optima of non-convex optimization problems. In multi-agent learning applications, the role of single-objective optimization is played by equilibrium computation, yet our understanding of its complexity in settings relevant to Deep Learning remains sparse. In this talk we focus on min-max optimization of nonconvex-nonconcave objectives, which has found applications in GANs and other adversarial learning problems. Here, not only are there no known gradient-descent-based methods that converge even to local, approximate min-max equilibria, but the computational complexity of identifying them remains poorly understood. We show that finding approximate local min-max equilibria of Lipschitz and smooth objectives requires a number of queries to the function and its gradient that is exponential in the relevant parameters, in sharp contrast to the polynomial number of queries required to find approximate local minima of non-convex objectives. Our oracle lower bound is a byproduct of a complexity-theoretic result showing that finding approximate local min-max equilibria is computationally equivalent to finding Brouwer fixed points and Nash equilibria in non-zero-sum games, and is thus PPAD-complete.
Minimal complexity theory knowledge will be assumed in the talk. Joint work with Stratis Skoulakis and Manolis Zampetakis.
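The failure of gradient methods alluded to above can already be seen on a toy bilinear objective. Below is an illustrative sketch (my own example, not from the talk): simultaneous gradient descent-ascent on f(x, y) = x * y spirals away from the unique min-max equilibrium at the origin.

```python
import numpy as np

def simultaneous_gda(x0, y0, lr, steps):
    """Simultaneous gradient descent-ascent on the bilinear toy objective
    f(x, y) = x * y, whose unique min-max equilibrium is (0, 0).
    Here grad_x f = y and grad_y f = x."""
    x, y = x0, y0
    dists = [np.hypot(x, y)]            # distance to the equilibrium
    for _ in range(steps):
        x, y = x - lr * y, y + lr * x   # descend in x, ascend in y
        dists.append(np.hypot(x, y))
    return x, y, dists

x, y, dists = simultaneous_gda(1.0, 0.0, lr=0.1, steps=50)
```

Each update multiplies the distance to the origin by sqrt(1 + lr^2) > 1, so the iterates diverge for any positive step size; even this convex-concave special case defeats naive GDA, and the talk's lower bounds show the nonconvex-nonconcave case is far harder still.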
Multiply robust estimation of causal effects under principal ignorability
Causal inference concerns not only the average effect of the treatment on the outcome but also the underlying mechanism operating through an intermediate variable of interest. Principal stratification characterizes such a mechanism by targeting subgroup causal effects within principal strata, which are defined by the joint potential values of an intermediate variable. Due to the fundamental problem of causal inference, principal strata are inherently latent, rendering it challenging to identify and estimate subgroup effects within them. A line of research leverages the principal ignorability assumption that the latent principal strata are mean independent of the potential outcomes conditional on the observed covariates. Under principal ignorability, we derive various nonparametric identification formulas for causal effects within principal strata in observational studies, which motivate estimators relying on correct specification of different parts of the observed-data distribution. Appropriately combining these estimators further yields new triply robust estimators for the causal effects within principal strata. These new estimators are consistent if two of the treatment, intermediate-variable, and outcome models are correctly specified, and they are locally efficient if all three models are correctly specified. We show that these estimators arise naturally from either the efficient influence functions in semiparametric theory or the model-assisted estimators in survey sampling theory. We evaluate the different estimators' finite-sample performance through simulation, apply them to two observational studies, and implement them in an open-source software package. Joint work with Zhichao Jiang and Shu Yang.
Much of the recent focus in machine learning has been on the pattern-recognition side of the field. The decision-making side of the field is in need of a similar level of attention, given that it is in the decisions supported by, or made by, learning systems that many consequential, real-world problems arise. Many of the fundamental challenges in decision-making involve learning systems that must cope with scarcity, competition, and allocations among many decision makers. Progress on such problems requires new blends of machine learning and microeconomics. I'll exemplify this blending with a discussion of exploration-exploitation tradeoffs for bandits that compete over a scarce resource, focusing on decentralized versions of this problem.