An Introduction to Expectation Maximization
Consider an urn with coins in two colors: red ones () and blue ones (). All red coins have bias , while all blue coins have bias . The number of red and blue coins could be different so we let denote the fraction of red coins and the fraction of blue coins. The urn, in other words, contains a mixture of red and blue coins, with what one calls mixture weights and . We know that there are two types of coins, two mixture components, but importantly, we do not know what the mixture weights are, nor what the biases are. That’s the problem: how can one infer these parameters from the data?
We perform a simple experiment times. In the ’th experiment, a coin is drawn and is used to denote its color ( or ). Next, the coin is flipped times and we let count the number of heads. If in step i we have drawn a coin of type , then can be modelled by a binomial distribution:
where we abbreviate , if . We can model the probability of drawing a coin of a certain color using a categorical distribution:
which is to say that and . Furthermore, we assume that the mixture components are independent; that the coins are independently chosen for all experiments. This means that the current model factorizes in components:
and we will later see why this is such a helpful assumption.
To get an idea of what outcomes one should expect in this experiment, we look at the distribution . We don’t know what that looks like yet, since only is given, but marginalizing over the coin types gives the desired distribution:
This is just a weighted sum of two binomial distributions. If one of the weights would increase, the probability mass would shift to that component. And if one bias would increase, the probability mass of that component would move to the right. The outcomes in the experiment thus crucially depend on the weights and the biases , both of which are unknown. All this is illustrated in the two figures below (which will hopefully be interactive soon):
The marginal distribution (black) of the data for parameters . The conditional binomial distributions and corresponding to the two coins are shown in red and blue respectively.
Marginal distributions for different parameters. On the left side, the biases are kept constant to and , while the weights vary. On the right side, the weights and the second bias are fixed to and , while the first bias varies.
Parameter estimation on complete data
After performing the experiments, the following data were reported.
8  6  3  4  3  4  4  4  4  5  5  6  
Before we compute the maximum likelihood estimates for and from these data, we introduce a notational device that will come in handy later. Suppose we have two lists and of equal length. We often need elements of for which the corresponding elements have a certain value . Therefore define the restriction of to as
If is clear from the context, we abbreviate . Furthermore, we write for the length of list, of the sum of its items and things like for its average, i.e. . Getting back to the observed data, the sum of all that stem from tosses with coin is which can be abbreviated to . Or take the number of occurrences of , which can be denoted as or simply .
Maximum likelihood estimates are easily computed after the sufficient statistics have first been calculated. For the categorical distribution, these are the number of occurrences per category,^{1}
The MLestimates of the weights of a categorical distribution are the relative frequencies of the categories, hence^{2}
The sufficient statistic for binomially distributed is the sum and the number of datapoints .^{3} As the experiment consists of two binomial distributions of which the parameters are to be estimated, there are two sets of sufficient statistics: the sums and of all generated from and respectively, plus the number of observations and for both types of coins. The latter were already computed in \eqref{eq:suffstatscat} and the sums are
Since the MLparameter of a binomial distribution is ^{4},
Note that these MLestimates make sense: is smaller than , which one would expect if there are two times as many observations from . Similarly, is consistent with the fact that the outcomes in seem to be lower than those in .
In conclusion, when both the outcomes and the cointype are observed, parameter estimation is straightforward. Just compute the sufficient statistics and then the maximum likelihood estimates follow directly. But what if we do not know which cointype generated which outcome?
Parameter estimation for ‘nearly incomplete’ data.
In the previous section we dealt with complete data: both and were observed. Can we extend this to the case when only the are observed, and we are completely ignorant about the mixture components that generated ? Can we do MLestimation for incomplete data? Before we get there, we consider an intermediate step one might call, for lack of a better name, the case of nearly incomplete data. Consider this:
8  6  3  4  3  4  4  4  4  5  5  6  
0.1  0.3  0.8  0.5  0.8  0.5  0.5  0.5  0.5  0.4  0.4  0.3  
0.9  0.7  0.2  0.5  0.2  0.5  0.5  0.5  0.5  0.6  0.6  0.7 
In contrast to the previous dataset, this ‘dataset’ allows for uncertainty about which component generated which datapoint. Every outcome comes with a probability that component generated . These probabilities are made up, but we will later be able to calculate them. Do note that they are conditional probabilities and that e.g. . Also note that the table contains a lot of redundancy as all columns with the same are identical. A neater representation would group the :
3  2  0.8  0.2 
4  5  0.5  0.5 
5  2  0.4  0.6 
6  2  0.3  0.7 
8  1  0.1  0.9 
It is worth pondering these probabilities a bit more. We observed two 5’s in our data, each of which was generated by a coin of type with probability and by a coin of type with probability . Another way of saying the same thing is that component is responsible for 40% of the 5’s, while component is responsible for the remaining 60%. For that reason the probabilities are sometimes called responsibilities. Spelling that out further, we would expect observed 5’s to be coming from component and from component . Coin type is responsible for 0.8 observed 5’s, while is responsible for the remaining 1.2 — or at least on average.
Why would this be of the slightest import? Well…
When computing the MLparameters for complete data, we really only needed two things: the counts and sums for each of the categories . These we can no longer compute, since we no longer know which components generated which datapoints. But we do have expectations about the and we can use those.
Forget about the actual value of for a moment. What would we expect it to be? Based on the ‘responsibilities’ , we expect to have generated 0.8 of the two 3’s, 0.5 of the five 4’s, 0.4 of the two 5’s, 0.3 of the two 6’s and 0.1 of the one 8. So, writing , we expect
of the observed datapoints to be generated by . Accordingly is expected to be responsible for the remaining observations.
Indeed, it is a strange conclusion that 5.6 datapoints come from and 6.4 from — counts should be integers! True as that may be, expected counts need not. On average, 6.4 of the datapoints in this particular dataset might very well come from . Assuming these expected counts are correct, the maximum likelihood estimates for the mixture weights are easily computed:
Compare this to the earlier estimates in \eqref{eq:mlestimatew1complete} and \eqref{eq:mlestimatew2complete}. The formulae look very similar, but the numbers are slightly different. And they should be, as we currently know much less about the unobserved variables than we did before.
In a similar spirit, we can try to estimate and , for which we need , or what we expect that to be. In words, what would we expect the sum of the observations generated from component to be? Well, we expect threes, which contribute to the sum; we expect fours, contributing to the sum, etc. In short, we expect the sum to be
If this is the expected sum of the generated by , then the sum of the from is also know, using that the sum of all observed datapoints 56:
The MLestimates follow directly from this:
The take home message is this: even if you only have expectations about which component generated which datapoint, you can still estimate the parameters. Only the responsibilities that component generated are needed. So how could one get those responsibilities? In fact, they are just the posterior probabilities
Notice that these responsibilities can only be computed once some estimates of , , , are known. But we need the responsbilities to estimate the parameters! Chickens and eggs! Where do we start?
Parameter estimation on incomplete data
Anywhere, really.^{5} We just have to pick an arbitrary starting point, some initial parameter settings, use those to find the responsibilities, then compute better parameter estimates, new responsibilities, better parameter estimates, yet new responsibilities, and so on, and so on. The two alternating steps have names: the E(xpectation) step and the M(aximization) step. In the Estep you compute the expected component assignments, i.e., the responsibilities. Use those in the Mstep to find better parameters. If you repeatedly follow this procedure, you get the ExpectationMaximization (EM) algorithm.
We will go through the first step now and start by choosing
where we use the superscript to indicate the timestep in the algorithm. To calculate the posteriors \eqref{eq:posteriors}, it is often convenient to use a table. After computing the joint probabilities , the marginal probability follows immediately, from which you can compute the posteriors .
If that was too quick, here is the same thing again, slowly. Suppose you are in step τ = 0 of the algorithm. Start by calculating the joint from (1). For this you need both the current estimates of the weights and the current biases,
Then we can compute the likelihood of the data, writing for the number of mixture components,
Finally, the responsibilities, the posteriors, follow from this:
All this, remember, we can actually compute. The table below collects these computations for the first step ; the posteriors are in column and .
Calculating the responsibilities from the initial parameter guesses and .
In fact, the table also already contains all computations needed to get the expected sufficient statistics: just compare the sums of column 8 and 9 with equation \eqref{eq:expcounts} and column 10 and 11 with \eqref{eq:expsums}. The numbers at the bottom of those columns are the sums of values in the columns and these are precisely the expected counts and expected sums from \eqref{eq:expcounts} and \eqref{eq:expsums}. Indeed, they add up to the total number of datapoints (12) and the sum of all observed datapoints (56). In short,
And using the same calculations as in (17) and (18)
Note that the superscripts indicate that these are the new parameter es timates. The same computations as in (21) and (3) give the new biases:
The first step of the EM algorithm is now completed. Using the new parameters , one could now compute new posterior probabilities, and estimate even better parameters , and so on. Hopefully, the model parameters will eventually converge to the maximum likelihood estimates of the parameters as and explain the data better and better. The crucial question is thus: does this happen? Is there a guarantee that the parameters will not get worse? There is, but we leave that for later.

Note that when counting we can use restrictions of and interchangeably, as . ↩

I always use hats, as in , to indicate maximum likelihood estimates ↩

The number of observations is typically not counted as a sufficient statistic, but implicitly assumed. As we will see, we cannot implicitly assume its value for the binomial distribution and therefore included it. For the categorical distribution, is always fixed and we therefore implicitly assume its value. ↩

Here again denotes the sample mean . ↩

In theory, at least. In practice, finding a good starting point can be very difficult. ↩