# An Introduction to Expectation Maximization


We perform a simple experiment $n = 12$ times. In the $i$’th experiment, a coin is drawn and $Z_i$ is used to denote its color ($c_1$ or $c_2$). Next, the coin is flipped $k = 10$ times and we let $X_i$ count the number of heads. If in step i we have drawn a coin of type $z \in \{c_1, c_2\}$, then $X_i$ can be modelled by a binomial distribution:

$$p(X_i = x \mid Z_i = z, \theta) = \binom{k}{x} \, \pi_z^{x} \, (1 - \pi_z)^{k-x},$$

where we abbreviate $\pi_z := \pi_j$, if $z = c_j$. We can model the probability of drawing a coin of a certain color $z$ using a categorical distribution:

$$Z_i \mid \theta \;\sim\; \mathrm{Categorical}(w_1, w_2),$$

which is to say that $p(Z_i = c_1 \mid \theta) = w_1$ and $p(Z_i = c_2 \mid \theta) = w_2$. Furthermore, we assume that the mixture components are independent; that the coins are independently chosen for all experiments. This means that the current model factorizes in components:

$$p(x_1, \dots, x_n, z_1, \dots, z_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid z_i, \theta) \, p(z_i \mid \theta),$$

and we will later see why this is such a helpful assumption.

To get an idea of what outcomes one should expect in this experiment, we look at the distribution $p(X_i \mid \theta)$. We don’t know what that looks like yet, since only $p(X_i \mid Z_i)$ is given, but marginalizing over the coin types $Z_i$ gives the desired distribution:

$$p(X_i = x \mid \theta) = \sum_{j=1}^{2} p(Z_i = c_j \mid \theta) \, p(X_i = x \mid Z_i = c_j, \theta) = \sum_{j=1}^{2} w_j \binom{k}{x} \pi_j^{x} (1 - \pi_j)^{k-x}.$$

This is just a weighted sum of two binomial distributions. If one of the weights were to increase, probability mass would shift to that component; if one of the biases were to increase, the mass of that component would move to the right. The outcomes in the experiment thus crucially depend on the weights $w_j$ and the biases $\pi_j$, both of which are unknown. All this is illustrated in the two figures below (which will hopefully be interactive soon):

The marginal distribution (black) of the data $p(x \mid \params)$ for parameters $\params = (w_1 = 0.4,w_2 = 0.6,\pi_1 = 0.25,\pi_2 = 0.6)$. The conditional binomial distributions $p(x \mid c_1, \params)$ and $p(x \mid c_2, \params)$ corresponding to the two coins are shown in red and blue respectively.

Marginal distributions $p(x \mid \params)$ for different parameters. On the left side, the biases are kept constant to $\pi_1 = 0.3$ and $\pi_2 = 0.8$, while the weights vary. On the right side, the weights and the second bias are fixed to $w_1 = w_2 = 0.5$ and $\pi_2 = 0.1$, while the first bias $\pi_1$ varies.
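These marginals are cheap to evaluate directly. Below is a minimal Python sketch (the function names are mine, not from the text) that computes the weighted sum of two binomial pmfs for the parameters of the first figure:

```python
from math import comb

def binom_pmf(x, k, p):
    # Probability of x heads in k flips of a coin with bias p.
    return comb(k, x) * p**x * (1 - p)**(k - x)

def mixture_pmf(x, k, weights, biases):
    # Marginal p(x | theta): weighted sum of the component binomials.
    return sum(w * binom_pmf(x, k, pi) for w, pi in zip(weights, biases))

# Parameters of the first figure.
k = 10
weights = [0.4, 0.6]
biases = [0.25, 0.6]

marginal = [mixture_pmf(x, k, weights, biases) for x in range(k + 1)]
```

Summing `marginal` over all outcomes $0, \dots, k$ gives 1, as it should for a probability distribution.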

## Parameter estimation on complete data

After performing the experiments, the following data were reported.

| $x_i$ | 8 | 6 | 3 | 4 | 3 | 4 | 4 | 4 | 4 | 5 | 5 | 6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $z_i$ | $c_2$ | $c_2$ | $c_1$ | $c_2$ | $c_1$ | $c_1$ | $c_2$ | $c_2$ | $c_1$ | $c_2$ | $c_2$ | $c_2$ |

Before we compute the maximum likelihood estimates for $w_j$ and $\pi_j$ from these data, we introduce a notational device that will come in handy later. Suppose we have two lists $\mathbf{a} = (a_1, \dots, a_n)$ and $\mathbf{b} = (b_1, \dots, b_n)$ of equal length. We often need the elements $a_i$ of $\mathbf{a}$ for which the corresponding elements $b_i$ have a certain value $b^*$. Therefore define the restriction of $\mathbf{a}$ to $\mathbf{b} = b^*$ as

$$\restr{\aa}{\bb = b^*} := (a_i : b_i = b^*, \; 1 \le i \le n).$$

If $\bb$ is clear from the context, we abbreviate $\restr{\aa}{b^*} := \restr{\aa}{\bb=b^*}$. Furthermore, we write $\#(\aa)$ for the length of the list $\aa$, $\Sigma(\aa)$ for the sum of its items and things like $\overline{\aa}$ for its average, i.e. $\overline{\aa} = \Sigma(\aa) \; / \; \#(\aa)$. Getting back to the observed data, the sum of all $x_i$ that stem from tosses with coin $c_1$ is $\Sigma(\restr{\xx}{\zz = c_1}) = 14$, which can be abbreviated to $\Sigma(\restr{\xx}{c_1})$. Or take the number of occurrences of $c_2$, which can be denoted as $\#(\restr{\zz}{\zz=c_2}) = 8$ or simply $\#(\restr{\zz}{c_2})$.
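In code, this restriction notation is just filtering. A small Python sketch (the helper name `restrict` is mine) applied to the observed data:

```python
def restrict(a, b, b_star):
    # Restriction of a to b = b_star: the a_i whose b_i equals b_star.
    return [a_i for a_i, b_i in zip(a, b) if b_i == b_star]

x = [8, 6, 3, 4, 3, 4, 4, 4, 4, 5, 5, 6]
z = ['c2', 'c2', 'c1', 'c2', 'c1', 'c1', 'c2', 'c2', 'c1', 'c2', 'c2', 'c2']

sum(restrict(x, z, 'c1'))   # Sigma(x | z = c1) = 14
len(restrict(z, z, 'c2'))   # #(z | z = c2) = 8
```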

Maximum likelihood estimates are easily computed once the sufficient statistics have been calculated. For the categorical distribution, these are the numbers of occurrences per category,¹

$$\#(\restr{\zz}{c_1}) = 4 \qquad \text{and} \qquad \#(\restr{\zz}{c_2}) = 8.$$

The ML-estimates of the weights of a categorical distribution are the relative frequencies of the categories, hence²

$$\hat{w}_1 = \frac{\#(\restr{\zz}{c_1})}{n} = \frac{4}{12} = \frac{1}{3} \qquad \text{and} \qquad \hat{w}_2 = \frac{\#(\restr{\zz}{c_2})}{n} = \frac{8}{12} = \frac{2}{3}.$$

The sufficient statistic for binomially distributed $X_1, \dots, X_n$ is the sum $\sum_i X_i$ together with the number of datapoints $n$.³ As the experiment involves two binomial distributions whose parameters are to be estimated, there are two sets of sufficient statistics: the sums $\Sigma(\restr{\xx}{c_1})$ and $\Sigma(\restr{\xx}{c_2})$ of all $X_i$ generated from $c_1$ and $c_2$ respectively, plus the numbers of observations $\#(\restr{\xx}{c_1})$ and $\#(\restr{\xx}{c_2})$ for both types of coins. The latter were already computed in \eqref{eq:suff-stats-cat} and the sums are

$$\Sigma(\restr{\xx}{c_1}) = 3 + 3 + 4 + 4 = 14 \qquad \text{and} \qquad \Sigma(\restr{\xx}{c_2}) = 56 - 14 = 42.$$

Since the ML-estimate of the bias of a binomial distribution is $\hat\pi = \frac{1}{k} \cdot \overline{\xx}$,⁴ we obtain

$$\hat{\pi}_1 = \frac{\Sigma(\restr{\xx}{c_1})}{k \cdot \#(\restr{\xx}{c_1})} = \frac{14}{10 \cdot 4} = 0.35 \qquad \text{and} \qquad \hat{\pi}_2 = \frac{\Sigma(\restr{\xx}{c_2})}{k \cdot \#(\restr{\xx}{c_2})} = \frac{42}{10 \cdot 8} = 0.525.$$

Note that these ML-estimates make sense: $\hat{w}_1$ is smaller than $\hat{w}_2$, which is what one would expect given that there are twice as many observations from $c_2$. Similarly, $\hat{\pi}_1 < \hat{\pi}_2$ is consistent with the fact that the outcomes from $c_1$ tend to be lower than those from $c_2$.
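The complete-data estimates are easy to verify in a few lines of Python (a sketch following the formulas above, using the reported data):

```python
x = [8, 6, 3, 4, 3, 4, 4, 4, 4, 5, 5, 6]
z = ['c2', 'c2', 'c1', 'c2', 'c1', 'c1', 'c2', 'c2', 'c1', 'c2', 'c2', 'c2']
n, k = len(x), 10

w_hat, pi_hat = {}, {}
for c in ('c1', 'c2'):
    xs = [xi for xi, zi in zip(x, z) if zi == c]   # restriction x | z = c
    w_hat[c] = len(xs) / n                         # relative frequency
    pi_hat[c] = sum(xs) / (k * len(xs))            # sample mean divided by k
```

This reproduces $\hat{w}_1 = 1/3$, $\hat{\pi}_1 = 0.35$ and $\hat{\pi}_2 = 0.525$.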

In conclusion, when both the outcomes $x_i$ and the cointype $z_i$ are observed, parameter estimation is straightforward. Just compute the sufficient statistics and then the maximum likelihood estimates follow directly. But what if we do not know which cointype generated which outcome?

## Parameter estimation on ‘nearly incomplete’ data

In the previous section we dealt with complete data: both $X_i$ and $Z_i$ were observed. Can we extend this to the case when only the $X_i$ are observed, and we are completely ignorant about the mixture components $Z_i$ that generated $X_i$? Can we do ML-estimation for incomplete data? Before we get there, we consider an intermediate step one might call, for lack of a better name, the case of nearly incomplete data. Consider this:

| $x_i$ | 8 | 6 | 3 | 4 | 3 | 4 | 4 | 4 | 4 | 5 | 5 | 6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $p(c_1 \mid x_i)$ | 0.1 | 0.3 | 0.8 | 0.5 | 0.8 | 0.5 | 0.5 | 0.5 | 0.5 | 0.4 | 0.4 | 0.3 |
| $p(c_2 \mid x_i)$ | 0.9 | 0.7 | 0.2 | 0.5 | 0.2 | 0.5 | 0.5 | 0.5 | 0.5 | 0.6 | 0.6 | 0.7 |

In contrast to the previous dataset, this ‘dataset’ allows for uncertainty about which component generated which datapoint. Every outcome $x_i$ comes with a probability $p(c_j \mid x_i)$ that component $c_j$ generated $x_i$. These probabilities are made up, but we will later be able to calculate them. Do note that they are conditional probabilities and that e.g. $p(c_1 \mid x_i) + p(c_2 \mid x_i) = 1$. Also note that the table contains a lot of redundancy as all columns with the same $x_i$ are identical. A neater representation would group the $x_i$:

| $x_i$ | $\#(\restr{\xx}{x_i})$ | $p(c_1 \mid x_i)$ | $p(c_2 \mid x_i)$ |
|---|---|---|---|
| 3 | 2 | 0.8 | 0.2 |
| 4 | 5 | 0.5 | 0.5 |
| 5 | 2 | 0.4 | 0.6 |
| 6 | 2 | 0.3 | 0.7 |
| 8 | 1 | 0.1 | 0.9 |

It is worth pondering these probabilities $p(c_j \mid x_i)$ a bit more. We observed two 5’s in our data, each of which was generated by a coin of type $c_1$ with probability $p(c_1 \mid 5) = 0.4$ and by a coin of type $c_2$ with probability $p(c_2 \mid 5) = 0.6$. Another way of saying the same thing is that component $c_1$ is responsible for 40% of the 5’s, while component $c_2$ is responsible for the remaining 60%. For that reason the probabilities $p(c_j \mid x_i)$ are sometimes called responsibilities. Spelling that out further, we would expect $0.4 \times 2 = 0.8$ observed 5’s to be coming from component $c_1$ and $0.6 \times 2 = 1.2$ from component $c_2$. Coin type $c_1$ is responsible for 0.8 observed 5’s, while $c_2$ is responsible for the remaining 1.2 — or at least on average.

Why would this be of the slightest import? Well…

When computing the ML-parameters for complete data, we really only needed two things: the counts $\#(\restr{\xx}{c_j})$ and sums $\Sigma(\restr{\xx}{c_j})$ for each of the categories $c_j$. These we can no longer compute, since we no longer know which components $z_i$ generated which datapoints. But we do have expectations about the $z_i$ and we can use those.

Forget about the actual value of $\#(\restr{\zz}{c_1})$ for a moment. What would we expect it to be? Based on the ‘responsibilities’ $p(c_j \mid x_i)$, we expect $c_1$ to have generated 0.8 of the two 3’s, 0.5 of the five 4’s, 0.4 of the two 5’s, 0.3 of the two 6’s and 0.1 of the one 8. So, writing $\mathcal X = \{3,4,5,6,8\}$, we expect

$$\mathbb{E}\big[\#(\restr{\zz}{c_1})\big] = \sum_{x \in \mathcal{X}} p(c_1 \mid x) \cdot \#(\restr{\xx}{x}) = 0.8 \cdot 2 + 0.5 \cdot 5 + 0.4 \cdot 2 + 0.3 \cdot 2 + 0.1 \cdot 1 = 5.6$$

of the observed datapoints $x_i$ to be generated by $z_i = c_1$. Accordingly, $c_2$ is expected to be responsible for the remaining $12 - 5.6 = 6.4$ observations.

Admittedly, it may seem strange to conclude that 5.6 datapoints come from $c_1$ and 6.4 from $c_2$: counts should be integers! True as that may be, expected counts need not be. On average, 6.4 of the datapoints in this particular dataset might very well come from $c_2$. Assuming these expected counts are correct, the maximum likelihood estimates for the mixture weights are easily computed:

$$\hat{w}_1 = \frac{5.6}{12} \approx 0.47 \qquad \text{and} \qquad \hat{w}_2 = \frac{6.4}{12} \approx 0.53.$$

Compare this to the earlier estimates in \eqref{eq:ml-estimate-w1-complete} and \eqref{eq:ml-estimate-w2-complete}. The formulae look very similar, but the numbers are slightly different. And they should be, as we currently know much less about the unobserved variables $z_i$ than we did before.

In a similar spirit, we can try to estimate $\hat{\pi}_1$ and $\hat{\pi}_2$, for which we need $\Sigma(\restr{\xx}{c_j})$, or what we expect that to be. In words, what would we expect the sum of the observations generated from component $c_1$ to be? Well, we expect $0.8 \times 2$ threes, which contribute $0.8 \times 2 \times 3$ to the sum; we expect $0.5 \times 5$ fours, contributing $0.5 \times 5 \times 4$ to the sum, etc. In short, we expect the sum to be

$$\mathbb{E}\big[\Sigma(\restr{\xx}{c_1})\big] = \sum_{x \in \mathcal{X}} p(c_1 \mid x) \cdot \#(\restr{\xx}{x}) \cdot x = 0.8 \cdot 2 \cdot 3 + 0.5 \cdot 5 \cdot 4 + 0.4 \cdot 2 \cdot 5 + 0.3 \cdot 2 \cdot 6 + 0.1 \cdot 1 \cdot 8 = 23.2.$$

If this is the expected sum of the $x_i$ generated by $c_1$, then the expected sum of the $x_i$ from $c_2$ is also known, using that the sum of all observed datapoints is 56:

$$\mathbb{E}\big[\Sigma(\restr{\xx}{c_2})\big] = 56 - 23.2 = 32.8.$$

The ML-estimates follow directly from this:

$$\hat{\pi}_1 = \frac{23.2}{10 \cdot 5.6} \approx 0.41 \qquad \text{and} \qquad \hat{\pi}_2 = \frac{32.8}{10 \cdot 6.4} \approx 0.51.$$

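Putting the last few steps together: given only the grouped outcomes and the responsibilities from the table, the expected sufficient statistics and the resulting estimates follow mechanically. A Python sketch (using the made-up responsibilities from above):

```python
# Grouped data: outcome -> multiplicity, plus responsibilities p(c1 | x).
counts  = {3: 2, 4: 5, 5: 2, 6: 2, 8: 1}
resp_c1 = {3: 0.8, 4: 0.5, 5: 0.4, 6: 0.3, 8: 0.1}

n = sum(counts.values())                        # 12 datapoints
total = sum(x * c for x, c in counts.items())   # 56, sum of all outcomes
k = 10

# Expected count and expected sum for component c1.
exp_count_c1 = sum(resp_c1[x] * counts[x] for x in counts)       # 5.6
exp_sum_c1   = sum(resp_c1[x] * counts[x] * x for x in counts)   # 23.2

w1_hat  = exp_count_c1 / n                                  # ~0.47
pi1_hat = exp_sum_c1 / (k * exp_count_c1)                   # ~0.41
pi2_hat = (total - exp_sum_c1) / (k * (n - exp_count_c1))   # ~0.51
```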
The take home message is this: even if you only have expectations about which component generated which datapoint, you can still estimate the parameters. Only the responsibilities $p(z_i \mid x_i)$ that component $z_i$ generated $x_i$ are needed. So how could one get those responsibilities? In fact, they are just the posterior probabilities

$$p(z_i \mid x_i, \theta) = \frac{p(x_i \mid z_i, \theta) \, p(z_i \mid \theta)}{p(x_i \mid \theta)}.$$

Notice that these responsibilities can only be computed once some estimates of $w_1$, $w_2$, $\pi_1$, $\pi_2$ are known. But we need the responsibilities to estimate the parameters! Chickens and eggs! Where do we start?

## Parameter estimation on incomplete data

Anywhere, really.⁵ We just have to pick an arbitrary starting point, some initial parameter settings, use those to find the responsibilities, then compute better parameter estimates, new responsibilities, better parameter estimates, yet new responsibilities, and so on, and so on. The two alternating steps have names: the E(xpectation) step and the M(aximization) step. In the E-step you compute the expected component assignments, i.e., the responsibilities. Use those in the M-step to find better parameters. If you repeatedly follow this procedure, you get the Expectation-Maximization (EM) algorithm.

We will go through the first step now and start by choosing

$$w_1^{(0)} = w_2^{(0)} = 0.5, \qquad \pi_1^{(0)} = 0.2, \qquad \pi_2^{(0)} = 0.7,$$

where we use the superscript $\tau$ to indicate the timestep in the algorithm. To calculate the posteriors \eqref{eq:posteriors}, it is often convenient to use a table. After computing the joint probabilities $p(x, c_j)$, the marginal probability $p(x) = \sum_j p(x, c_j)$ follows immediately, from which you can compute the posteriors $p(c_j \mid x)$.

If that was too quick, here is the same thing again, slowly. Suppose you are at step $\tau = 0$ of the algorithm. Start by calculating the joint from (1). For this you need both the current estimates of the weights and the current biases,

$$p(x, c_j \mid \theta^{(0)}) = w_j^{(0)} \binom{k}{x} \big(\pi_j^{(0)}\big)^{x} \big(1 - \pi_j^{(0)}\big)^{k - x}.$$

Then we can compute the likelihood of the data, writing $m = 2$ for the number of mixture components,

$$p(x \mid \theta^{(0)}) = \sum_{j=1}^{m} p(x, c_j \mid \theta^{(0)}).$$

Finally, the responsibilities, the posteriors, follow from this:

$$p(c_j \mid x, \theta^{(0)}) = \frac{p(x, c_j \mid \theta^{(0)})}{p(x \mid \theta^{(0)})}.$$

All this, remember, we can actually compute. The table below collects these computations for the first step $\tau = 0$; the posteriors are in columns 6 and 7.

Calculating the responsibilities from the initial parameter guesses $w_1^{(0)} = w_2^{(0)} = 0.5$ and $\pi_1^{(0)}=0.2, \pi_2^{(0)} = 0.7$.
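The posteriors in this table are fully determined by the initial guess, so they can be recomputed from scratch. A minimal Python sketch of this E-step, working on the grouped data:

```python
from math import comb

k = 10
counts = {3: 2, 4: 5, 5: 2, 6: 2, 8: 1}   # grouped observations
w  = [0.5, 0.5]   # initial weights w^(0)
pi = [0.2, 0.7]   # initial biases pi^(0)

def binom_pmf(x, k, p):
    return comb(k, x) * p**x * (1 - p)**(k - x)

resp = {}
for x in counts:
    joint = [w[j] * binom_pmf(x, k, pi[j]) for j in (0, 1)]  # p(x, c_j)
    marginal = sum(joint)                                    # p(x)
    resp[x] = [jx / marginal for jx in joint]                # p(c_j | x)

# Expected count per component, as a weighted sum over the grouped outcomes.
exp_count = [sum(counts[x] * resp[x][j] for x in counts) for j in (0, 1)]
```

For instance, `resp[3][0]` comes out around 0.96: an outcome of 3 heads is far more plausible under the coin with bias 0.2 than under the one with bias 0.7.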

In fact, the table already contains all computations needed for the expected sufficient statistics: the sums at the bottom of columns 8 and 9 are precisely the expected counts from \eqref{eq:exp-counts}, and those at the bottom of columns 10 and 11 the expected sums from \eqref{eq:exp-sums}. Indeed, they add up to the total number of datapoints (12) and the sum of all observed datapoints (56). In short,

$$\mathbb{E}\big[\#(\restr{\zz}{c_1})\big] \approx 5.90, \quad \mathbb{E}\big[\#(\restr{\zz}{c_2})\big] \approx 6.10, \quad \mathbb{E}\big[\Sigma(\restr{\xx}{c_1})\big] \approx 22.22, \quad \mathbb{E}\big[\Sigma(\restr{\xx}{c_2})\big] \approx 33.78.$$

And using the same calculations as in (17) and (18)

$$w_1^{(1)} = \frac{5.90}{12} \approx 0.49 \qquad \text{and} \qquad w_2^{(1)} = \frac{6.10}{12} \approx 0.51.$$

Note that the superscripts indicate that these are the new parameter estimates. The same computations as in (21) and (3) give the new biases:

$$\pi_1^{(1)} = \frac{22.22}{10 \cdot 5.90} \approx 0.38 \qquad \text{and} \qquad \pi_2^{(1)} = \frac{33.78}{10 \cdot 6.10} \approx 0.55.$$

The first step of the EM algorithm is now completed. Using the new parameters $\params^{(1)}$, one could now compute new posterior probabilities, and estimate even better parameters $\params^{(2)}$, and so on. Hopefully, the model parameters $\params^{(\tau)}$ will eventually converge to the maximum likelihood estimates of the parameters as $\tau\to \infty$ and explain the data better and better. The crucial question is thus: does this happen? Is there a guarantee that the parameters will not get worse? There is, but we leave that for later.
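To make the whole procedure concrete, here is a compact EM loop for this two-coin model. It is a sketch only (initial values as above, a fixed iteration count, no convergence check):

```python
from math import comb

def em_binomial_mixture(data, k, w, pi, n_iter=50):
    # EM for a two-component binomial mixture model.
    n = len(data)
    for _ in range(n_iter):
        # E-step: responsibilities p(c_j | x_i) under the current parameters.
        resp = []
        for x in data:
            joint = [w[j] * comb(k, x) * pi[j]**x * (1 - pi[j])**(k - x)
                     for j in range(2)]
            total = sum(joint)
            resp.append([jx / total for jx in joint])
        # M-step: re-estimate from expected counts and expected sums.
        for j in range(2):
            exp_count = sum(r[j] for r in resp)
            exp_sum = sum(r[j] * x for r, x in zip(resp, data))
            w[j] = exp_count / n
            pi[j] = exp_sum / (k * exp_count)
    return w, pi

data = [8, 6, 3, 4, 3, 4, 4, 4, 4, 5, 5, 6]
w, pi = em_binomial_mixture(data, k=10, w=[0.5, 0.5], pi=[0.2, 0.7])
```

One reassuring invariant: after every M-step, the mean of the fitted mixture, $k \sum_j w_j \pi_j$, equals the sample mean of the data. Whether the parameter sequence also converges to a maximum of the likelihood is exactly the question raised above.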

1. Note that when counting we can use restrictions of $\xx$ and $\zz$ interchangeably, as $\#(\restr{\xx}{\zz = c_1}) = \#(\restr{\zz}{\zz = c_1})$.

2. I always use hats, as in $\hat{w}$, to indicate maximum likelihood estimates.

3. The number of observations $n$ is typically not counted as a sufficient statistic, but implicitly assumed. As we will see, we cannot implicitly assume its value for the binomial distribution and therefore included it. For the categorical distribution, $n$ is always fixed and we therefore implicitly assume its value.

4. Here $\overline{\xx}$ again denotes the sample mean $\frac{1}{n} \sum_{i=1}^n x_i$.

5. In theory, at least. In practice, finding a good starting point can be very difficult.

Written on January 8, 2017