Jean Pouget-Abadie

Back to home.

In the 1920s, Fisher and Neyman laid out the framework of causal inference theory, the study of randomized experiments. Later continued by Rubin, this theory is at the root of today's clinical trials, public policy experiments, and internet giants' A/B tests. Below is a short introduction to their work and what can be done when a fundamental independence assumption, known as SUTVA, does not hold.

How is "treatment effect" defined?

Randomized experiments -- A/B tests -- aim to determine the treatment effect of a particular policy, feature, or drug. But what is meant by "treatment effect"? Intuitively, the effect of a particular treatment is the difference between receiving that treatment and not receiving it. In this sense, the Neyman-Rubin potential outcomes framework is a useful formalism to properly define treatment effect. It supposes that every unit $i$ in a population is associated with a potential outcomes function, which maps treatment assignments $Z$ to outcomes: $Y_i: Z_i \mapsto Y_i(Z_i)$. If each unit can be placed in either treatment or control, then the potential outcomes table is the following:

Units	Treatment	Control
$1$	$Y_1(1)$	$Y_1(0)$
$2$	$Y_2(1)$	$Y_2(0)$
$\vdots$	$\vdots$	$\vdots$
$N$	$Y_N(1)$	$Y_N(0)$

The quantity of interest is the effect of getting assigned to treatment over control, which, with the notation above, can be written as the average difference between potential outcomes, or average treatment effect (ATE): $$ATE = \frac{1}{N} \sum_{i=1}^N Y_i(1) - Y_i(0)$$

In other words, the "treatment effect" is the direct comparison of a universe where every unit receives treatment and a parallel universe where no unit receives treatment. The potential outcomes are considered fixed in each universe, but for each unit, only one potential outcome per unit is ever observable. If unit $i$ is placed in treatment, we observe $Y_i(1)$; if it is placed in control, we observe $Y_i(0)$. As a result, the ATE is never computable in practice since it requires knowledge of both $Y_i(1)$ and $Y_i(0)$. However, it can be shown that running a randomized experiment provides the right answer in expectation.

How and why do A/B tests work?

The main idea behind an A/B test is to randomly observe either $Y_i(1)$ or $Y_i(0)$. In expectation over our assignment, we can deduce the treatment effect. More formally, suppose a random set of $N_t$ units are assigned to treatment and $N_c = N - N_t$ units are assigned to control. Let vector $Z \in \{0, 1\}^N$ be the assignment vector such that $Z_i = 1$ if unit $i$ is in treatment and $Z_i = 0$ if unit $i$ is in control.

The difference-in-means estimator $\hat \mu$ averages the outcomes of the units in treatment $Y_t$ and the outcomes of the units in control $Y_c$: $$\hat \mu = \overline Y_t - \overline Y_c$$ If we assign $N_t$ randomly selected units to treatment and $N_c$ to control, known as a Completely Randomized assignment, then a fundamental result of causal inference states that, $$\mathbb{E}_Z \left[ \hat \mu \right] = ATE$$ In other words, an A/B test which randomly places a set portion of the total units in treatment and every other unit in control provides on average the true difference between a universe where every unit receives treatment and a universe where none do. This confirms the intuition that A/B testing is a good way to estimate treatment effect.

What is interference?

The previous result does make the assumption that a unit's outcome does not depend on the treatment assignment of another unit in the population. This assumption is known as SUTVA for Stable Unit Treatment Value Assumption and allows us to write the potential outcomes as a function of just $Z_i$ instead of the entire vector $Z$. When SUTVA does not hold, we talk about interference, and sometimes network interference if we imagine the interference structure to be a network. In this case, the notation becomes $Y_i: Z \mapsto Y_i(Z)$ instead of $Y_i: Z_i \mapsto Y_i(Z_i)$. The definition of treatment effect must also be redefined. A popular estimand in this case is the Total Treatment Effect defined as $$TTE = \frac{1}{N} \sum_{i = 1}^N Y_i(\vec 1) - Y_i(\vec 0)$$ which still captures the comparison between a universe where every unit receives treatment and a universe where every unit receives control. Alternative estimands might compare a universe where only one unit receives treatment to a universe where no unit receives treatment: $$\frac{1}{N} \sum_{i=1}^N Y_i(Z_i = 1, Z_{-i} = \vec 0) - Y_i(\vec 0)$$

Vaccinations are an excellent example of randomized experiments where SUTVA does not hold. If Alice's family is vaccinated, the probability that Alice becomes sick is lower than if her family were not vaccinated. In other words, $Y_{Alice}$ depends not only on $Z_{Alice}$ but also on $Z_{family}$. Another example is that of a feature that requires communication between units: the more of Bob's friends also have the new feature, the more Bob will tend to use it and the higher his outcome. Thus, $Y_{Bob}$ depends not only on $Z_{Bob}$ but also on $Z_{friends}$.

Why does interference matter?

If interference is present, then our A/B tests may no longer be valid: $E_{Z}[\hat \mu] \neq TTE$. Consider the following viral marketing example where units adopt a product if they see the ad and more than half their friends have also seen the ad. In fact, the total treatment effect of the ad is 100%. In the universe where everyone sees the ad, the adoption rate is 100% and it is 0% in the parallel universe where no one sees the ad. However, if I assign 10% of people at random to see the ad, it may very well be that only very few users will also have more than half their friends who see the ad. As a result, a smaller fraction of treated users will end up adopting the product and the estimated adoption rate is much lower than 100%.

Interference issues can also be more subtle. For example, a small change in a Facebook feed ranking algorithm can affect Alice's interaction with her feed and impact what her friends see on their own feed (e.g. "Alice liked this video"), thereby affecting their outcomes. To be more specific, suppose that Facebook wishes to test the effect of increasing the ranking of text posts made by users. Alice, seeing more text posts in her feed, begins to post more text posts of her own. Bob, friend's with Alice, is not treated but notices more text posts from Alice, and begins to post more text posts as well. An A/B test would compare Alice and Bob who may have both increased their text posts in a similar way, and thus might underestimate the true treatment effect.

How can we know if there is interference?

When the presence of interference is not immediately obvious, it is an ongoing research question to know whether interference is present in a significant way. Some great work by Athey et al. [1] (and prior to that Aronow [2]) looks at the regression of a unit's outcome on their neighbors' treatment assignment, rejecting SUTVA when the parameter is significant. Some of the more difficult questions that face this kind of work are:

What if we don't know who is friends with whom? Is there a procedure that does not incorrectly reject SUTVA even if our knowledge of the graph is incomplete and possibly wrong?
There are multiple ways to combine the regression coefficients into a test statistic. The authors suggest one, which produces different results depending on the core set chosen, making the reliance on p-values problematic.

An alternative approach is to design a particular experiment which tests for interference. How might one go about designing such an experiment? Let's first explore popular alternative experimental designs. One such design assigns entire clusters of units to treatment or control, instead of individual units. This is known as a Cluster-based randomized design (CBR), and is often used when graph information is available in the hope of mitigating interference. For example, in the vaccination example, we might wish to vaccinate entire households rather than individual units in order to reduce interference between units of the same household.

Cluster-based randomized designs are popular for their simplicity and the intuitive way in which they might reduce interference. Furthermore, if no interference exists, $\mathbb{E}_Z[\hat \mu'] = ATE$ for a slightly modified difference-in-means estimator. In other words, when no interference is present, running a completely randomized design or a cluster-based design produces the same result in expectation. This property can be exploited to design an interference test.

When there is interference, it is easy to see why running one design over another might produce different results. In the vaccination example, the vaccine might seem more effective under a cluster-based assignment than under a completely randomized one. In fact, if the vaccinated units cohabit with other vaccinated units, they are less likely to fall sick than if they cohabit with a mix of vaccinated and non-vaccinated units. In other words, not only the treatment but also the way treatment is assigned randomly to the population affects outcomes, and thus the final estimate of treatment effect. When there is NO interference however, the result of running one design or another does not change the final estimate (in expectation).

Significant assignment strategy effect implies interference.

The idea of applying both experimental designs on the whole graph and comparing the estimates is of course not feasible, much like we cannot place units in both treatment and control. A solution is therefore to A/B test the A/B test strategies, much like randomizing over treatment and control solved the missing potential outcome problem.

How can we best randomize over two assignment strategies?

The best way to randomize over both assignment strategies is to randomly select a part of the graph to receive treatment (or control) according to a cluster-based randomized assignment and the rest to receive treatment (or control) according to a completely-randomized assignment. We can talk about arms (CR vs. CBR) and buckets (treatment vs. control). The following design finds a good compromise between maintaining sufficient graph structure in each arm, whilst ensuring that no bias is introduced in expectation:

Cluster the graph using balanced clusters and assign units to either arm in a cluster-based randomized way.
Apply a completely-randomized assignment in the CR arm and compute $\hat \mu_{cr}$.
Apply a cluster-based randomized assignment in the CBR arm using the initial clustering and compute $\hat \mu_{cbr}$.

How can I conclude?

It is relatively easy to prove that if $E_{W, Z}[\hat \mu_{cr}] \neq E_{W,Z}[\hat \mu_{cbr}]$, then there is interference. The main idea is therefore to determine if the difference $\hat \mu_{cr} - \hat \mu_{cbr}$ is significantly different from 0. For that, we need an estimator for the variance of $\hat \mu_{cr} - \hat \mu_{cbr}$. As it turns out, it is almost never the case in causal inference that we can know the variance exactly, and it is common to look for empirical upper-bounds instead. It can be shown that summing $$\frac{\hat S_{cr, t}}{n_{cr,t}} + \frac{\hat S_{cr, c}}{n_{cr, c}}$$ and $$\frac{m_{cbr}}{n_{cbr}} \left( \frac{\hat S_{cbr, t}}{m_{cr, t}} + \frac{\hat S_{cbr, c}}{m_{cbr,c}} \right)$$ provides an upper-bound in expectation over $W$ and $Z$ of $var_{W, Z}[\hat \mu_{cr} - \hat \mu_{cbr}]$, where

$\hat S_{cr, t}$ is the variance of the outcomes in the treatment bucket of the completely randomized arm and $n_{cr,t}$ is the number of units in that bucket.
$\hat S_{cr, c}$ is the variance of the outcomes in the control bucket of the completely randomized arm and $n_{cr,c}$ is the number of units in that bucket.
$\hat S_{cbr, t}$ is the variance of the aggregated outcomes in the treatment bucket of the cluster-based randomized arm and $m_{cbr,t}$ is the number of clusters in that bucket.
$\hat S_{cbr, c}$ is the variance of the aggregated outcomes in the control bucket of the cluster-based randomized arm and $m_{cbr,c}$ is the number of clusters in that bucket.
$m_{cbr}$ (resp. $n_{cbr}$) is the total number of clusters (resp. units) in the cluster-based randomized arm

By aggregated outcomes, we mean that outcomes are summed over every cluster, which become the new unit size. To conclude, we consider $$T = \frac{\hat \mu_{cr} - \hat \mu_{cbr}}{\frac{\hat S_{cr, t}}{n_{cr,t}} + \frac{\hat S_{cr, c}}{n_{cr, c}} + \frac{m_{cbr}}{n_{cbr}} \left( \frac{\hat S_{cbr, t}}{m_{cr, t}} + \frac{\hat S_{cbr, c}}{m_{cbr,c}} \right)}$$ If $T$ is significantly different from 0, we conclude with Chebychev's inequality or using a normal approximation on whether there is interference. For a more technical introduction to the methodology, see [3] which introduces the design above. In fact, the design was implemented at LinkedIn during two experiments. See [4] for the details on the implementation.

Acknowledgements. Martin Saveski did the bulk of the data exploration, implementation of the design, and data analysis whilst interning at LinkedIn. See [4].

[1] Athey, Susan, Dean Eckles, and Guido W. Imbens. "Exact P-values for Network Interference*." Journal of the American Statistical Association (2016). [arxiv]
[2] Aronow, Peter M. "A general method for detecting interference between units in randomized experiments." Sociological Methods & Research 41.1 (2012): 3-16.
[3] Pouget-Abadie, Jean, Martin Saveski, Guillaume Saint-Jacques, Weitao Duan, Ya Xu, Souvik Ghosh, Edoardo M. Airoldi. "Testing for network interference on experimentation platforms" (2017) [arXiv]
[4] Saveski, Martin, Jean Pouget-Abadie, Guillaume Saint-Jacques, Weitao Duan, Ya Xu, Souvik Ghosh, Edoardo M. Airoldi. "Detecting Network effects: randomizing over randomized experiments" (2017)

Back to home.

A quick introduction to non-independent A/B tests

by Jean Pouget-Abadie on February 27, 2017