Bayesian models of iterated learning Part 2

Iterated Learning with Bayesian Programs

Bayesian learners in a transmission chain will, over time, converge to the prior, as we have seen in the previous section; at least, they do if every learner infers a hypothesis from the data it observes (in our simulation, by sampling from its posterior) and then produces new data according to that hypothesis. The result was introduced for general transmission chains, but we are interested in the more concrete case of language learning. We therefore consider a simulation of iterated learning using the toy language from @@Griffiths2005. This is easy to implement if we represent every agent as a program in the probabilistic programming language WebPPL.

A toy language

In the toy language from @@Griffiths2005, objects (meanings) and symbols “vary along two binary dimensions.” Say objects can be either red (0) or blue (1) and round (0) or square (1). The set of meanings can then be summarized as $\{00, 01, 10, 11\}$. Similarly, we can use the words rood (‘red’, a) or blauw (‘blue’, b) and rond (‘round’, a) or vierkant (‘square’, b), such that $\{aa, ab, ba, bb\}$ are all possible symbols. A language is a mapping from meanings to symbols, such as

\begin{equation}
    00 \mapsto aa, \qquad 01 \mapsto ab, \qquad 10 \mapsto ba, \qquad 11 \mapsto bb.
\end{equation}

Note that this language is compositional: it is built up from the simpler mappings $0 \mapsto a$ and $1 \mapsto b$, applied to each dimension of the meaning separately. Here’s an example of a language that is not compositional in this sense:

\begin{equation}
    00 \mapsto ab, \qquad 01 \mapsto aa, \qquad 10 \mapsto ba, \qquad 11 \mapsto bb.
\end{equation}

In this language, the second letter of the symbol no longer consistently tracks the second dimension of the meaning.

In total, there are $4^4 = 256$ mappings from meanings to symbols. Only four of these are compositional, namely the ones generated by the sub-mappings

\begin{equation}
    \{0 \mapsto a,\ 1 \mapsto a\}, \qquad \{0 \mapsto a,\ 1 \mapsto b\}, \qquad \{0 \mapsto b,\ 1 \mapsto a\}, \qquad \{0 \mapsto b,\ 1 \mapsto b\}.
\end{equation}

(And two of these are degenerate in the sense that they map all meanings to $aa$ or to $bb$, so there really are only two nontrivial compositional languages.) We add the 256 languages and the 4 compositional languages together to get a total of 260 languages (which indeed contains 4 duplicates, but that doesn’t really matter).
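If you want to verify these counts, you can let WebPPL enumerate all languages and check how many of them are compositional. This is only a quick sanity check and none of it is used later; it assumes lodash’s _.isEqual, which WebPPL exposes as _.

var symbols   = ['aa', 'ab', 'ba', 'bb'];
var compLangs = [['aa', 'aa', 'aa', 'aa'],  // 0 -> a, 1 -> a
                 ['aa', 'ab', 'ba', 'bb'],  // 0 -> a, 1 -> b
                 ['bb', 'ba', 'ab', 'aa'],  // 0 -> b, 1 -> a
                 ['bb', 'bb', 'bb', 'bb']]; // 0 -> b, 1 -> b

// Enumerate all mappings from the four meanings to symbols
var langDist = Infer({method: 'enumerate'}, function() {
  return repeat(4, function() { return uniformDraw(symbols) });
});
var allLangs = langDist.support();

// A language is compositional if it equals one of the four above
var isCompositional = function(lang) {
  return any(function(c) { return _.isEqual(c, lang) }, compLangs);
};

print('Languages:               ' + allLangs.length);                          // 256
print('Compositional languages: ' + filter(isCompositional, allLangs).length); // 4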

In every step of the transmission chain, the transmitted data consist of pairs of meanings and symbols, $(x_1, y_1), \dots, (x_b, y_b)$. The number of pairs $b$ is the transmission bottleneck. The hypotheses that agents form are exactly the languages $h$ we just introduced: mappings from meanings to symbols. Not all languages (hypotheses) are equally likely. We use a hierarchical prior that puts a fraction $\alpha$ of the probability mass on the compositional languages:

\begin{equation}
    p(h) = \begin{cases}
        \dfrac{\alpha}{4} + \dfrac{1-\alpha}{256} & \text{if } h \text{ is compositional},\\[6pt]
        \dfrac{1-\alpha}{256} & \text{otherwise}.
    \end{cases}
\end{equation}
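With the value $\alpha = 0.3$ that we will use below, each of the four compositional languages therefore receives prior probability $0.3/4 + 0.7/256 \approx 0.078$, whereas every other language only receives $0.7/256 \approx 0.003$.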

Once a hypothesis (a language) $h$ has been fixed, the agent can be presented with a new meaning $x$, for which it then produces a symbol $y$ by sampling from the predictive distribution

\begin{equation}
    p(y \mid x, h) = \begin{cases}
        1 - \varepsilon & \text{if } y = h(x),\\
        \varepsilon / 3 & \text{otherwise}.
    \end{cases}
\end{equation}

This means the agent will pick the symbol corresponding to $x$ under language $h$ most of the time, but has a small probability $\varepsilon$ of making an error, in which case it uniformly picks one of the three other symbols. In the next section, we write a simulation of this process; hopefully things will become clearer then.
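To make the error model concrete: with $\varepsilon = 0.1$ (the value used in the code below) and the compositional language shown earlier, the meaning $01$ is mapped to $ab$ with probability $0.9$ and to each of the other three symbols with probability $0.1/3 \approx 0.033$.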

Simulating iterated language learning

We start by defining the meanings, the symbols and the compositional languages:

var symbols   =  ['aa', 'ab', 'ba', 'bb'];
var meanings  =  ['00', '01', '10', '11'];
var compLangs = [['aa', 'aa', 'aa', 'aa'],  // 0 -> a, 1 -> a
                 ['aa', 'ab', 'ba', 'bb'],  // 0 -> a, 1 -> b
                 ['bb', 'ba', 'ab', 'aa'],  // 0 -> b, 1 -> a
                 ['bb', 'bb', 'bb', 'bb']]; // 0 -> b, 1 -> b

As you can see, we represent a language by a 4-tuple of symbols corresponding to the meanings 00, 01, 10 and 11, respectively. The next thing we need is a function Prior that returns the prior distribution over hypotheses (using an optional parameter for $\alpha$).

///fold:
var symbols   =  ['aa', 'ab', 'ba', 'bb'];
var meanings  =  ['00', '01', '10', '11'];
var compLangs = [['aa', 'aa', 'aa', 'aa'],  // 0 -> a, 1 -> a
                 ['aa', 'ab', 'ba', 'bb'],  // 0 -> a, 1 -> b
                 ['bb', 'ba', 'ab', 'aa'],  // 0 -> b, 1 -> a
                 ['bb', 'bb', 'bb', 'bb']]; // 0 -> b, 1 -> b
///
// A global value for alpha
var alpha = .3;

var Prior = function(_alpha) {
  
  // Default to the global value of alpha
  var _alpha = _alpha || alpha;
  
  // Infer the prior distribution over languages.
  // It needs a function that returns hypotheses
  // in the right proportion and then automatically
  // infers the distribution over them.
  return Infer({method: 'enumerate'}, function() {
    
    // Flip a coin to decide whether to use a compositional language
    if( flip(_alpha) ) {
      // Randomly pick a compositional language with prob alpha
      return uniformDraw(compLangs);
      
    } else {
      // Randomly pick any of the 256 languages with prob 1-alpha
      return repeat(4, function() { return uniformDraw(symbols) });  
    }
  });
};

// Sample from a prior using the global alpha
var myPrior = Prior();
print('alpha = 0.3:  ' + sample(myPrior))

// A prior strongly favouring non-compositional languages
var myOtherPrior = Prior(.01);
print('alpha = 0.01: ' + sample(myOtherPrior))

Next, we need a function Predictive that returns a distribution over symbols given a meaning $x$ and a hypothesis (language) $h$. The function definition is very similar to Prior:

///fold:
var symbols   =  ['aa', 'ab', 'ba', 'bb'];
var meanings  =  ['00', '01', '10', '11'];
var compLangs = [['aa', 'aa', 'aa', 'aa'],  // 0 -> a, 1 -> a
                 ['aa', 'ab', 'ba', 'bb'],  // 0 -> a, 1 -> b
                 ['bb', 'ba', 'ab', 'aa'],  // 0 -> b, 1 -> a
                 ['bb', 'bb', 'bb', 'bb']]; // 0 -> b, 1 -> b
///
var eps = 0.1;

var Predictive = function(x, hyp, _eps) {
  var _eps = _eps || eps; // Prob. of error, default to global value
  var y = hyp[meanings.indexOf(x)]; // 'Correct' symbol for x
  return Infer({ method: 'enumerate'}, function() {
    if(flip(_eps)) {
      // With prob eps return another symbol
      return uniformDraw(remove(y, symbols));
    } else {
      return y;
    }
  })
}

// Plot the predictive distribution
var hyp = ['ab', 'aa', 'ba', 'bb']
var x = '10';
viz(Predictive(x, hyp))

The agent can produce new symbols for meanings by repeatedly sampling from $p(y \mid x, h)$. Two helper functions will be convenient:

///fold:
var symbols   =  ['aa', 'ab', 'ba', 'bb'];
var meanings  =  ['00', '01', '10', '11'];
var compLangs = [['aa', 'aa', 'aa', 'aa'],  // 0 -> a, 1 -> a
                 ['aa', 'ab', 'ba', 'bb'],  // 0 -> a, 1 -> b
                 ['bb', 'ba', 'ab', 'aa'],  // 0 -> b, 1 -> a
                 ['bb', 'bb', 'bb', 'bb']]; // 0 -> b, 1 -> b
// A global value for alpha

var Prior = function(_alpha) {
  
  // Default to the global value of alpha
  var _alpha = _alpha || alpha;
  
  // Infer the prior distribution over languages.
  // It needs a function that returns hypotheses
  // in the right proportion and then automatically
  // infers the distribution over them.
  return Infer({method: 'enumerate'}, function() {
    
    // Flip a coin to decide whether to use a compositional language
    if( flip(_alpha) ) {
      // Randomly pick a compositional language with prob alpha
      return uniformDraw(compLangs);
      
    } else {
      // Randomly pick any of the 256 languages with prob 1-alpha
      return repeat(4, function() { return uniformDraw(symbols) });  
    }
  });
};

var Predictive = function(x, hyp, _eps) {
  var _eps = _eps || eps; // Prob. of error, default to global value
  var y = hyp[meanings.indexOf(x)]; // 'Correct' symbol for x
  return Infer({ method: 'enumerate'}, function() {
    if(flip(_eps)) {
      // With prob eps return another symbol
      return uniformDraw(remove(y, symbols));
    } else {
      return y;
    }
  })
}
///
var alpha = .3;
var eps = 0.1;
var b = 2;

// Sample input x_1, ..., x_b
var sampleInput = function(_b) {
  var _b = _b || b; // Bottleneck
  return repeat(_b, function() { return uniformDraw(meanings) });
}

// Sample corresponding outputs y_1, ..., y_b
var sampleOutput = function(xs, hyp, _eps) {
  var _eps = _eps || eps;
  return map(function(x){ sample(Predictive(x, hyp, _eps)) }, xs);
}

// Sample a hypothesis, some meanings and symbols
var hyp = sample(Prior());
var xs  = sampleInput(8)
var ys  = sampleOutput(xs, hyp)
var _   = mapIndexed(function(i, x_y){ print(i+1+': '+x_y[0]+' --> '+x_y[1]) }, zip(xs, ys));

We are nearly there. The last and most important thing we need is the posterior distribution: when an agent sees data $(x_1, y_1), \dots, (x_b, y_b)$, which languages are most likely? The function Posterior returns a distribution telling you exactly that. It takes the data and the prior distribution (not a sample from it) as inputs. There is also an optional argument toString that reduces hypotheses such as ['aa', 'ab', 'ba', 'bb'] to strings such as 'aa ab ba bb', which makes visualization easier.

///fold:
var symbols   =  ['aa', 'ab', 'ba', 'bb'];
var meanings  =  ['00', '01', '10', '11'];
var compLangs = [['aa', 'aa', 'aa', 'aa'],  // 0 -> a, 1 -> a
                 ['aa', 'ab', 'ba', 'bb'],  // 0 -> a, 1 -> b
                 ['bb', 'ba', 'ab', 'aa'],  // 0 -> b, 1 -> a
                 ['bb', 'bb', 'bb', 'bb']]; // 0 -> b, 1 -> b
// A global value for alpha

var Prior = function(_alpha) {
  
  // Default to the global value of alpha
  var _alpha = _alpha || alpha;
  
  // Infer the prior distribution over languages.
  // It needs a function that returns hypotheses
  // in the right proportion and then automatically
  // infers the distribution over them.
  return Infer({method: 'enumerate'}, function() {
    
    // Flip a coin to decide whether to use a compositional language
    if( flip(_alpha) ) {
      // Randomly pick a compositional language with prob alpha
      return uniformDraw(compLangs);
      
    } else {
      // Randomly pick any of the 256 languages with prob 1-alpha
      return repeat(4, function() { return uniformDraw(symbols) });  
    }
  });
};

var Predictive = function(x, hyp, _eps) {
  var _eps = _eps || eps; // Prob. of error, default to global value
  var y = hyp[meanings.indexOf(x)]; // 'Correct' symbol for x
  return Infer({ method: 'enumerate'}, function() {
    if(flip(_eps)) {
      // With prob eps return another symbol
      return uniformDraw(remove(y, symbols));
    } else {
      return y;
    }
  })
}

// Sample input x_1, ..., x_b
var sampleInput = function(_b) {
  var _b = _b || b; // Bottleneck
  return repeat(_b, function() { return uniformDraw(meanings) });
}

// Sample corresponding outputs y_1, ..., y_b
var sampleOutput = function(xs, hyp, _eps) {
  var _eps = _eps || eps;
  return map(function(x){ sample(Predictive(x, hyp, _eps)) }, xs);
}
///
var alpha = .3;
var eps = 0.1;
var b = 2;
var samples = 300;

var Posterior = function(xs, ys, prior, _toString, _eps, _samples) {
  var _eps = _eps || eps;
  var _samples = _samples || samples;
  var _toString = _toString || false;
  return Infer({method:'MCMC', samples: _samples}, function() {
    var hyp = sample(prior);
    // Predictive distribution for every x in xs
    var predictives = map(function(x){ Predictive(x, hyp, _eps) }, xs)
    // Log likelihood: sum of p_pred(y | x, h) over all (x,y)
    var likelihood = sum(map(function(pred_y) { pred_y[0].score(pred_y[1]) }, zip(predictives, ys) ))
    // Condition on the data
    factor(likelihood)
    if(_toString) return join(hyp, ' ');
    return hyp
  })
}

// Let's look at a posterior
var xs = ['00', '01', '10', '11', '11'];
var ys = ['aa', 'aa', 'aa', 'ab', 'bb'];
var prior = Prior();
var posterior = Posterior(xs, ys, prior);
sample(posterior)

To get a better feel for the posterior, we can also plot it:

///fold:
var symbols   =  ['aa', 'ab', 'ba', 'bb'];
var meanings  =  ['00', '01', '10', '11'];
var compLangs = [['aa', 'aa', 'aa', 'aa'],  // 0 -> a, 1 -> a
                 ['aa', 'ab', 'ba', 'bb'],  // 0 -> a, 1 -> b
                 ['bb', 'ba', 'ab', 'aa'],  // 0 -> b, 1 -> a
                 ['bb', 'bb', 'bb', 'bb']]; // 0 -> b, 1 -> b
// A global value for alpha

var Prior = function(_alpha) {
  
  // Default to the global value of alpha
  var _alpha = _alpha || alpha;
  
  // Infer the prior distribution over languages.
  // It needs a function that returns hypotheses
  // in the right proportion and then automatically
  // infers the distribution over them.
  return Infer({method: 'enumerate'}, function() {
    
    // Flip a coin to decide whether to use a compositional language
    if( flip(_alpha) ) {
      // Randomly pick a compositional language with prob alpha
      return uniformDraw(compLangs);
      
    } else {
      // Randomly pick any of the 256 languages with prob 1-alpha
      return repeat(4, function() { return uniformDraw(symbols) });  
    }
  });
};

var Predictive = function(x, hyp, _eps) {
  var _eps = _eps || eps; // Prob. of error, default to global value
  var y = hyp[meanings.indexOf(x)]; // 'Correct' symbol for x
  return Infer({ method: 'enumerate'}, function() {
    if(flip(_eps)) {
      // With prob eps return another symbol
      return uniformDraw(remove(y, symbols));
    } else {
      return y;
    }
  })
}

// Sample input x_1, ..., x_b
var sampleInput = function(_b) {
  var _b = _b || b; // Bottleneck
  return repeat(_b, function() { return uniformDraw(meanings) });
}

// Sample corresponding outputs y_1, ..., y_b
var sampleOutput = function(xs, hyp, _eps) {
  var _eps = _eps || eps;
  return map(function(x){ sample(Predictive(x, hyp, _eps)) }, xs);
}


var Posterior = function(xs, ys, prior, _toString, _eps, _samples) {
  var _eps = _eps || eps;
  var _samples = _samples || samples;
  var _toString = _toString || false;
  return Infer({method:'MCMC', samples: _samples}, function() {
    var hyp = sample(prior);
    // Predictive distribution for every x in xs
    var predictives = map(function(x){ Predictive(x, hyp, _eps) }, xs)
    // Log likelihood: sum of p_pred(y | x, h) over all (x,y)
    var likelihood = sum(map(function(pred_y) { pred_y[0].score(pred_y[1]) }, zip(predictives, ys) ))
    // Condition on the data
    factor(likelihood)
    if(_toString) return join(hyp, ' ');
    return hyp
  })
}
// Concatenate an array of strings with a 
// separator in between (defaults to space)
var join = function(strings, sep) { 
  var sep = sep || ' ';
  return reduce(function(total,part){ 
    if(part == '') return total;
    return append(total, sep+part) 
  }, '', strings)
}
///
var alpha = .3, eps = 0.1, b = 2, samples = 300;
var xs = ['00', '01', '10', '11', '11'];
var ys = ['aa', 'aa', 'aa', 'ab', 'bb'];
var prior = Prior();
viz(Posterior(xs, ys, prior, true));

You will notice that the support doesn’t include all of the 260 languages. In fact, it can’t, since the MCMC sampler draws only 300 samples, so most languages never appear and end up with zero probability. Try increasing samples and see what happens.
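For example, assuming the definitions from the block above are still in scope, you can request more samples through the optional arguments of Posterior (eps has to be passed explicitly, since _samples comes last):

// More samples give a better approximation of the posterior
viz(Posterior(xs, ys, prior, true, eps, 2000));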

We are now ready to put everything together and simulate an iterated learning chain. A function simulate recursively (yes, functional programming) passes through the chain and stores the hypothesis that each learner formulates.

///fold:
var symbols   =  ['aa', 'ab', 'ba', 'bb'];
var meanings  =  ['00', '01', '10', '11'];
var compLangs = [['aa', 'aa', 'aa', 'aa'],  // 0 -> a, 1 -> a
                 ['aa', 'ab', 'ba', 'bb'],  // 0 -> a, 1 -> b
                 ['bb', 'ba', 'ab', 'aa'],  // 0 -> b, 1 -> a
                 ['bb', 'bb', 'bb', 'bb']]; // 0 -> b, 1 -> b
// A global value for alpha

var Prior = function(_alpha) {
  
  // Default to the global value of alpha
  var _alpha = _alpha || alpha;
  
  // Infer the prior distribution over languages.
  // It needs a function that returns hypotheses
  // in the right proportion and then automatically
  // infers the distribution over them.
  return Infer({method: 'enumerate'}, function() {
    
    // Flip a coin to decide whether to use a compositional language
    if( flip(_alpha) ) {
      // Randomly pick a compositional language with prob alpha
      return uniformDraw(compLangs);
      
    } else {
      // Randomly pick any of the 256 languages with prob 1-alpha
      return repeat(4, function() { return uniformDraw(symbols) });  
    }
  });
};

var Predictive = function(x, hyp, _eps) {
  var _eps = _eps || eps; // Prob. of error, default to global value
  var y = hyp[meanings.indexOf(x)]; // 'Correct' symbol for x
  return Infer({ method: 'enumerate'}, function() {
    if(flip(_eps)) {
      // With prob eps return another symbol
      return uniformDraw(remove(y, symbols));
    } else {
      return y;
    }
  })
}

// Sample input x_1, ..., x_b
var sampleInput = function(_b) {
  var _b = _b || b; // Bottleneck
  return repeat(_b, function() { return uniformDraw(meanings) });
}

// Sample corresponding outputs y_1, ..., y_b
var sampleOutput = function(xs, hyp, _eps) {
  var _eps = _eps || eps;
  return map(function(x){ sample(Predictive(x, hyp, _eps)) }, xs);
}


var Posterior = function(xs, ys, prior, _toString, _eps, _samples) {
  var _eps = _eps || eps;
  var _samples = _samples || samples;
  var _toString = _toString || false;
  return Infer({method:'MCMC', samples: _samples}, function() {
    var hyp = sample(prior);
    // Predictive distribution for every x in xs
    var predictives = map(function(x){ Predictive(x, hyp, _eps) }, xs)
    // Log likelihood: sum of p_pred(y | x, h) over all (x,y)
    var likelihood = sum(map(function(pred_y) { pred_y[0].score(pred_y[1]) }, zip(predictives, ys) ))
    // Condition on the data
    factor(likelihood)
    if(_toString) return join(hyp, ' ');
    return hyp
  })
}
// Concatenate an array of strings with a 
// separator in between (defaults to space)
var join = function(strings, sep) { 
  var sep = sep || ' ';
  return reduce(function(total,part){ 
    if(part == '') return total;
    return append(total, sep+part) 
  }, '', strings)
}
///
var alpha = .3, eps = 0.1, b = 2, samples = 500;
var prior = Prior();

var simulate = function(xs, ys, n, hypotheses) {
  // Infer a hypothesis about the data
  var hyp = sample(Posterior(xs, ys, prior))

  // Store it and return if we've reached the end of the chain
  var hypotheses = append(hypotheses || [], join(hyp, ' '));
  if(n == 1) return hypotheses;
  
  // Generate new data
  var newXs = sampleInput();
  var newYs = sampleOutput(newXs, hyp);
  
  // Pass data to next agent
  return simulate(newXs, newYs, n-1, hypotheses);
}

// Initial unbiased data
var xs = ['00', '00', '00', '00'];
var ys = ['aa', 'ab', 'ba', 'bb'];
var results = simulate(xs, ys, 50);
// Store for later use in other code blocks
editor.put('results', results); 
viz(results)

Indeed! You are looking at the relative frequencies of the hypotheses adopted throughout the chain. The distribution is roughly flat, apart from four peaks, each corresponding to one of the compositional languages. That this is indeed very close to the prior becomes clear when we aggregate the non-compositional languages:

///fold:
var symbols   =  ['aa', 'ab', 'ba', 'bb'];
var meanings  =  ['00', '01', '10', '11'];
var compLangs = [['aa', 'aa', 'aa', 'aa'],  // 0 -> a, 1 -> a
                 ['aa', 'ab', 'ba', 'bb'],  // 0 -> a, 1 -> b
                 ['bb', 'ba', 'ab', 'aa'],  // 0 -> b, 1 -> a
                 ['bb', 'bb', 'bb', 'bb']]; // 0 -> b, 1 -> b
var join = function(strings, sep) { 
  var sep = sep || ' ';
  return reduce(function(total,part){ 
    if(part == '') return total;
    return append(total, sep+part) 
  }, '', strings)
}
///
var results = editor.get('results');
var compLangsStr = map(join, compLangs)
viz(map(function(r){ 
  return (compLangsStr.indexOf(r) == -1) ? 'other' : r;
}, results))

Compare this to the actual prior (recall that $\alpha = 0.3$ means that only 0.3 of the probability mass is given to the compositional languages).

///fold:
var symbols   =  ['aa', 'ab', 'ba', 'bb'];
var meanings  =  ['00', '01', '10', '11'];
var compLangs = [['aa', 'aa', 'aa', 'aa'],  // 0 -> a, 1 -> a
                 ['aa', 'ab', 'ba', 'bb'],  // 0 -> a, 1 -> b
                 ['bb', 'ba', 'ab', 'aa'],  // 0 -> b, 1 -> a
                 ['bb', 'bb', 'bb', 'bb']]; // 0 -> b, 1 -> b
var join = function(strings, sep) { 
  var sep = sep || ' ';
  return reduce(function(total,part){ 
    if(part == '') return total;
    return append(total, sep+part) 
  }, '', strings)
}
///
var alpha = .3
viz(Infer({method:'enumerate'}, function() {
  return flip(alpha) ? join(uniformDraw(compLangs)) : 'other';
}));

The simulation illustrates the theoretical result that Bayesian learners in a transmission chain will, over time, choose hypotheses according to the prior: the distribution of languages in the chain converges to the learners’ prior expectations.

Concluding remarks

A transmission chain of identical Bayesian reasoners converges to the prior. What does that mean? The formal meaning is crystal clear: if $h_n$ denotes the hypothesis of the $n$-th learner in the chain, then $p(h_n = h)$ approaches the prior $p(h)$ as $n$ grows. But the implications for real transmission chains are less clear. Two concerns immediately cast some shadow over this result. First, transmission chains in practice are rarely completely linear. Besides vertical transmission, they often also allow for horizontal transmission between agents of the same generation, and the result does not immediately generalize to such groups of agents. Second, it is unrealistic to assume that all agents share the same prior, and again it is not clear what happens if you drop that assumption.1 Both of these concerns will be addressed in more depth in a future post; here I want to conclude with another point.

The first paper in which @@Griffiths2005 announced the convergence to the prior contains a section titled “An example: evolving compositionality”. This post was about precisely that example — so where was the evolving compositionality? Nowhere, really. No property of any language evolved in this example, since we never changed any of the languages. Only our expectations of these languages, the probabilities we assigned to them, changed. The compositional languages came to the surface only because we put more weight on them initially. We could have picked any other set of languages. What you put in is what you get out.

This is still interesting, as it shows how certain (compositional) languages can gain importance by accumulating individual preferences. But it also suggests that individual preferences primarily shape a language and not, say, the interactions between agents. That is not an unproblematic assumption. Consider two agents discussing the four objects in our example: red and blue circles and squares. If they use a compositional language, they can refer to “all blue things”, which seems more efficient than anything a non-compositional language can offer. Communicative efficiency, for example, might also drive the emergence of compositionality. And that is not a property of the individuals, but of their interaction.

In short, the point is that Kalish and Griffiths’ account of iterated learning reduces the emergence of compositionality to individual priors and does not explain how those preferences came to be in the first place. Moreover, if compositionality also emerges from interactions between agents (and not only from individual priors), the model cannot account for it.

  1. In any case, the proof of the original convergence result breaks down, since the Markov chain is no longer (time-)homogeneous. Homogeneity means that the transition probabilities do not change over time: \begin{equation} p(h_n = i \mid h_{n-1} = j) = p(h_{m} = i \mid h_{m-1}=j), \qquad m\neq n. \end{equation} The original proof crucially relies on this homogeneity.