This project explores how a simple neural network can learn to represent and generate Gregorian chant. We focus on a good old-fashioned long short-term memory (LSTM) network rather than, say, a transformer.
The model architecture and its implementation are inspired by Gulordava et al. (2018). We train a two-layer LSTM on next-character prediction using a cross-entropy loss. The training data is split into mini-batches of 32 sequences of 64 characters each, and the learning rate is adjusted dynamically using Adam with default parameters.

For each of the three representations, we first broadly tuned the embedding size, hidden size, sequence length, learning rate, batch size, initialization range, dropout, and gradient clip with HyperOpt using ASHA scheduling, and then fine-tuned the learning rate and sequence length using population-based training. We then fixed the batch size to 32, the initialization range to (-0.15, 0.15), the dropout to 0.15, and the gradient clip to 0.5. All models are implemented in PyTorch, and tuning is done using Ray Tune.

We train two classes of models: small ones with an embedding size of 8 and a hidden size of 64, and 'large' ones with an embedding size of 32 and a hidden size of 256. All models were trained to predict the next character, but using three different chant representations: plain volpiano, intervals, and contour.
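To make the setup concrete, here is a minimal PyTorch sketch of this kind of training loop. It is not the project's actual code: the vocabulary size, the data, and the exact placement of dropout are placeholder assumptions, but the overall shape matches what is described above (a two-layer LSTM over character embeddings, cross-entropy on next-character prediction, Adam with default parameters, dropout of 0.15, gradient clipping at 0.5, and uniform weight initialization in (-0.15, 0.15)).

```python
import torch
import torch.nn as nn


class CharLSTM(nn.Module):
    """Sketch of a character-level LSTM language model (the 'large'
    configuration: embedding size 32, hidden size 256, two layers)."""

    def __init__(self, vocab_size, embedding_size=32, hidden_size=256,
                 num_layers=2, dropout=0.15, init_range=0.15):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers,
                            dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_size, vocab_size)
        # Uniform initialization in (-init_range, init_range)
        for p in self.parameters():
            nn.init.uniform_(p, -init_range, init_range)

    def forward(self, x, state=None):
        # x: (batch, seq_len) tensor of character indices
        emb = self.dropout(self.embedding(x))
        out, state = self.lstm(emb, state)
        logits = self.head(self.dropout(out))   # (batch, seq_len, vocab)
        return logits, state


def train_step(model, optimizer, criterion, x, y, clip=0.5):
    """One mini-batch update; y is x shifted one character to the right."""
    optimizer.zero_grad()
    logits, _ = model(x)
    loss = criterion(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    vocab_size = 40                             # placeholder alphabet size
    model = CharLSTM(vocab_size)
    optimizer = torch.optim.Adam(model.parameters())  # default Adam parameters
    criterion = nn.CrossEntropyLoss()

    # Dummy mini-batch: 32 sequences of 64 characters
    x = torch.randint(0, vocab_size, (32, 64))
    y = torch.roll(x, shifts=-1, dims=1)        # dummy next-character targets
    print(train_step(model, optimizer, criterion, x, y))
```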
You can read more about this project in chapter 5 of my PhD dissertation.