# Bayesian models of iterated learning

Iterated learning is a kind of transmission chain: every learner in the chain has to produce output based on what it learned from the previous learner's output. The first learner is, for example, shown some objects together with (artificial) words for them. It learns the word-object pairings (the language, really) and then (re)produces words for the objects. The second learner is shown the objects with the words the first learner produced for them and is asked to produce new words for the objects. If the learners are really learning something, you would not expect the new words to differ much from the original ones, but errors might nonetheless creep in.

The idea underlying iterated learning is that these errors can exhibit structure, leading to the emergence of a more systematic language over time. If, for example, all learners share an initial bias about which kinds of words go with which objects, then every learner will be inclined to produce certain object-word pairs, that is, a certain language. Over time, the languages used by the agents can be expected to shift towards this bias. It is of crucial importance here that all learners share the same prior. Empirically, a convergence to prior expectations is not always found, suggesting that this assumption is unrealistic in some cases. But is the argument even valid: does a chain of biased learners over time settle on the language they were initially biased towards?

@@Griffiths2005 showed that there is indeed a convergence to the prior if all agents are identical, perfectly Bayesian reasoners. Instead of humans, they put Bayesian agents in a transmission chain. Abstracting away from the previous example, the task[^1] of every agent is simple: it is presented with data $\mathbf{d}_0 = (d_1, \ldots, d_b)$ from which it infers a hypothesis $h_1$ that explains how that data was produced. Using that hypothesis, the agent can produce new data $\mathbf{d}_1$. That is presented to the next learner, who in turn infers a hypothesis $h_2$ and produces new data $\mathbf{d}_2$, which is presented to the third learner, and so on. Every agent thus faces two problems: learning a new hypothesis based on the data and producing new data using the hypothesis. It therefore needs a learning algorithm (LA) and a production algorithm (PA), both of which can be modelled using probability distributions:

$$p_{\text{LA}}(h \mid \mathbf{d}) \qquad \text{and} \qquad p_{\text{PA}}(\mathbf{d} \mid h).$$

The production algorithm simply samples data from $p_{\text{PA}}(\mathbf{d} \mid h)$. Similarly, the learning algorithm can sample from $p_{\text{LA}}(h \mid \mathbf{d})$. But there are other possibilities. The learning algorithm could also choose the hypothesis that is most likely under $p_{\text{LA}}(h \mid \mathbf{d})$. Agents with such a strategy are called maximizers, whereas the former are called samplers. In this post, we are only concerned with samplers.
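
To make the distinction concrete, here is a minimal sketch in Python. The hypotheses and posterior probabilities are invented purely for illustration:

```python
import random

# A toy posterior over three hypotheses, p_LA(h | d) for some fixed data d.
# (These numbers are made up purely for illustration.)
hypotheses = ["h1", "h2", "h3"]
posterior = [0.2, 0.5, 0.3]

def sample_hypothesis(rng):
    """A sampler draws a hypothesis in proportion to its posterior probability."""
    return rng.choices(hypotheses, weights=posterior, k=1)[0]

def maximize_hypothesis():
    """A maximizer always picks the most probable hypothesis (the MAP estimate)."""
    return hypotheses[posterior.index(max(posterior))]

rng = random.Random(0)
print(maximize_hypothesis())  # always "h2", since it has the highest posterior

counts = {h: 0 for h in hypotheses}
for _ in range(10_000):
    counts[sample_hypothesis(rng)] += 1
print(counts)  # frequencies roughly proportional to the posterior
```

A maximizer is deterministic given the data, while repeated calls to the sampler reproduce the posterior in the long run, which is exactly the property the convergence argument below relies on.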

Since our agents are Bayesian reasoners, the whole model can be simplified further. If every agent uses the same production algorithm, Bayes' rule implies that the distribution over hypotheses is the posterior distribution

$$p_{\text{LA}}(h \mid \mathbf{d}) = \frac{p_{\text{PA}}(\mathbf{d} \mid h)\, p(h)}{\sum_{h'} p_{\text{PA}}(\mathbf{d} \mid h')\, p(h')},$$

where $p(h)$ is the prior we talked about earlier. This means we no longer need the subscripts ‘PA’ and ‘LA’, since we are working with one big joint probabilistic model $p(\mathbf{d}, h)$ over both data and hypotheses. That model completely characterizes each of our agents.

Now back to the chain. Recall that agents successively form hypotheses and produce new data in the following way:

$$\mathbf{d}_0 \xrightarrow{\;\text{LA}\;} h_1 \xrightarrow{\;\text{PA}\;} \mathbf{d}_1 \xrightarrow{\;\text{LA}\;} h_2 \xrightarrow{\;\text{PA}\;} \mathbf{d}_2 \xrightarrow{\;\text{LA}\;} h_3 \longrightarrow \cdots$$

The arrows indicate conditional dependence. Importantly, we assume that $h_n$ does not depend on the previous hypotheses $h_m$ ($m < n$) once $\mathbf{d}_{n-1}$ is given. This is a Markov assumption that reveals several Markov chains inside the transmission chain. The one of interest here is the chain on hypotheses

$$h_1 \longrightarrow h_2 \longrightarrow h_3 \longrightarrow \cdots$$

The transition probabilities of this chain can be calculated from the joint model $p(\mathbf{d}, h)$ by marginalizing out the data $\mathbf{d}$:

$$p(h_{n+1} \mid h_n) = \sum_{\mathbf{d}} p(h_{n+1} \mid \mathbf{d})\, p(\mathbf{d} \mid h_n).$$

This equation specifies a Markov chain whose long-term behaviour can be analysed using standard techniques. Briefly, if the chain satisfies some additional assumptions, it will converge to a stationary distribution $p(h^\star)$ for which

$$p(h^\star) = \sum_{h} p(h^\star \mid h)\, p(h).$$
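
As a sketch of one such standard technique, the stationary distribution of a small chain can be found numerically by power iteration. The transition matrix below is an arbitrary toy example, not derived from any particular learning model:

```python
# T[i][j] = probability of moving from state j to state i (columns sum to 1).
# The entries are arbitrary toy values.
T = [
    [0.90, 0.20, 0.10],
    [0.05, 0.70, 0.30],
    [0.05, 0.10, 0.60],
]

def step(T, p):
    """One application of the transition matrix: p'_i = sum_j T[i][j] * p_j."""
    return [sum(T[i][j] * p[j] for j in range(len(p))) for i in range(len(T))]

p = [1.0, 0.0, 0.0]   # start from an arbitrary initial distribution
for _ in range(1000):
    p = step(T, p)

print(p)              # the stationary distribution
print(step(T, p))     # applying T once more leaves it (numerically) unchanged
```

Because all entries of this matrix are positive, the chain is irreducible and aperiodic, so the iteration converges to the same stationary distribution from any starting point.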

The important observation of Griffiths and Kalish was this: the prior $p(h)$ is the stationary distribution of the Markov chain over hypotheses. So Bayesian agents in a transmission chain will, over time, start to sample hypotheses according to their prior probability: convergence to the prior.
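
This can be checked on a toy model. The sketch below (with invented numbers for the prior and the production probabilities) builds the transition kernel $p(h_{n+1} \mid h_n)$ for two hypotheses and three possible data items, pushes the prior through it once, and gets the prior back:

```python
# Toy joint model (all numbers invented): two hypotheses, three data items.
prior = [0.7, 0.3]        # p(h) for hypotheses h0, h1
likelihood = [            # p_PA(d | h): rows = hypotheses, cols = data items
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

def posterior(d):
    """p_LA(h | d) via Bayes' rule."""
    joint = [likelihood[h][d] * prior[h] for h in range(len(prior))]
    z = sum(joint)
    return [x / z for x in joint]

def transition(h):
    """Kernel on hypotheses: p(h_next | h) = sum_d p_LA(h_next | d) p_PA(d | h)."""
    probs = [0.0] * len(prior)
    for d in range(len(likelihood[0])):
        post = posterior(d)
        for h_next in range(len(prior)):
            probs[h_next] += post[h_next] * likelihood[h][d]
    return probs

# Push the prior through the kernel: it should come back unchanged.
pushed = [sum(transition(h)[h_next] * prior[h] for h in range(len(prior)))
          for h_next in range(len(prior))]
print(pushed)  # equals the prior, up to floating-point error
```

The check works for any choice of prior and likelihood, not just these numbers: summing $p(h_{n+1} \mid \mathbf{d})\, p(\mathbf{d} \mid h_n)\, p(h_n)$ over $h_n$ and $\mathbf{d}$ collapses, via Bayes' rule, to the marginal $p(h_{n+1})$.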

This finding sparked a lot of debate. An obvious issue with the model is the rather artificial structure of the population as one infinitely long, linear chain @@(Smith2009). @@Ferdinand2008b showed that for chains with larger generations the convergence disappears, although one might object that their agents were not Bayesian reasoners. @@Burkett2010 made that objection and presented a model of Bayesian learners who assume they are learning from a mixture of different languages, in which case the convergence again holds. And indeed, the mathematical proof leaves no doubt: if the Markov model applies, convergence is guaranteed. Another line of attack argued that the model might not be very informative about actual language evolution: since it exhibits only very limited dynamics, bifurcations can never arise, although they arguably do occur in actual evolution @@(Niyogi2009). Finally, one might wonder whether the assumption that all learners are essentially identical is warranted. Despite all these criticisms, something like a convergence to the prior can be observed in certain experiments. Very recently, @@Jacoby2017 for example used a transmission-chain design to elucidate priors in rhythm perception.

[^1]: This discussion largely follows @Griffiths2007a and @Griffiths2005.