
THE SOUND OF MAIA

AI for Music Creation

Join maia on her musical journey of self-discovery and development. Experience maia’s creativity and diversity as she continues to mature into a professional composer. Browse her latest works and discover the practical and theoretical underpinnings that power maia.


OUR STORY

We started out with the intention of creating an AI that could complete Mozart’s unfinished composition Lacrimosa — the eighth movement of the Requiem — which had been written only up to the eighth bar at the time of his passing.

We ended up creating a deep neural network called maia — a play on the term ‘music ai’ — that can generate original solo piano compositions by learning patterns of harmony, rhythm, and style from a corpus of classical music by composers such as Mozart, Chopin, and Bach.

We approached this problem by framing music generation as a language modeling problem. The idea is to encode MIDI files into a vocabulary of tokens and have the neural network learn to predict the next token in a sequence, using thousands of MIDI files as training data.

[Image: Lacrimosa]

ENCODING

We used MIT’s music21 library (http://web.mit.edu/music21/) — a toolkit for computer-aided musicology — to deconstruct the MIDI files into their fundamental elements: duration, pitch, and dynamics.
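For illustration, a minimal music21 sketch of this step might look like the following (the file path and variable names are ours, not our actual pipeline code):

from music21 import converter, note, chord

# Parse a MIDI file into a music21 stream (path is illustrative).
score = converter.parse('data/mozart_sonata.mid')

# Walk through every note and chord, pulling out pitch, duration, and dynamics.
for element in score.flatten().notes:
    if isinstance(element, note.Note):
        pitches = [element.pitch.midi]                 # single MIDI pitch number
    elif isinstance(element, chord.Chord):
        pitches = [p.midi for p in element.pitches]    # every pitch in the chord
    duration = element.duration.quarterLength          # length in quarter notes
    velocity = element.volume.velocity                 # MIDI velocity (may be None)
    print(pitches, duration, velocity)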


Next, we adopted the ‘Notewise’ method [2] proposed by Christine Payne — a member of the technical staff at OpenAI — to encode each composition’s duration, pitch, and dynamics into a text sequence, resulting in a vocabulary of 150 words. Each MIDI file is sampled 12 times per quarter note so that triplets and 16th notes can be encoded.
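As a toy illustration of the idea, a short passage sampled at 12 steps per quarter note could be encoded as below. The token names follow a note-on / note-off / wait scheme and are illustrative rather than the exact vocabulary of [2]:

def encode_notewise(notes):
    """Toy Notewise-style encoder.
    `notes` is a list of (start_step, end_step, midi_pitch) tuples,
    with steps sampled at 12 per quarter note."""
    events = []
    for start, end, pitch in notes:
        events.append((start, f'p{pitch}'))      # note-on token
        events.append((end, f'endp{pitch}'))     # note-off token
    events.sort()

    tokens, current = [], 0
    for step, token in events:
        if step > current:                       # advance time with a wait token
            tokens.append(f'wait{step - current}')
            current = step
        tokens.append(token)
    return tokens

# Middle C held for a quarter note, then E held for a half note:
print(encode_notewise([(0, 12, 60), (12, 36, 64)]))
# ['p60', 'wait12', 'endp60', 'p64', 'wait24', 'endp64']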


Additionally, to augment our dataset, we used modulation to transpose every piece into all twelve keys, each copy one semitone lower than the previous one.
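A rough sketch of this augmentation step with music21 (the file paths are illustrative, and in practice we loop over the whole corpus):

from music21 import converter

score = converter.parse('data/chopin_nocturne.mid')

# Duplicate the piece twelve times, each copy transposed one semitone
# lower than the previous one (0 to 11 semitones down).
for semitones in range(12):
    transposed = score.transpose(-semitones)
    transposed.write('midi', fp=f'data/augmented/nocturne_down{semitones}.mid')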


TOKENIZING & SEQUENCING


STACKED LSTM

The first version of maia was built using a recurrent neural network architecture. In particular, we used an LSTM because its additional forget gate and cell state can carry information about longer-term structure in music better than vanilla RNNs and GRUs — allowing us to predict longer sequences, of up to about a minute, that still sounded coherent. Our baseline model is a 2-layer stacked LSTM with 512 hidden units in each LSTM cell. Instead of one-hot encoding the input, we used an embedding layer to transform each token into a vector — the embedding dimension we used is 150^(0.25) ≈ 4. The loss function for our LSTM model is cross-entropy.
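A minimal PyTorch sketch of this baseline, with the dimensions taken from the description above (the class and variable names are ours, not our exact training code):

import torch.nn as nn

VOCAB_SIZE = 150
EMBED_DIM = 4        # roughly 150 ** 0.25
HIDDEN_DIM = 512

class MaiaLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        # Two stacked LSTM layers, each with 512 hidden units.
        self.lstm = nn.LSTM(EMBED_DIM, HIDDEN_DIM, num_layers=2, batch_first=True)
        self.head = nn.Linear(HIDDEN_DIM, VOCAB_SIZE)

    def forward(self, tokens, state=None):
        x = self.embed(tokens)             # (batch, seq) -> (batch, seq, 4)
        out, state = self.lstm(x, state)   # (batch, seq, 512)
        return self.head(out), state       # logits over the 150-token vocabulary

model = MaiaLSTM()
criterion = nn.CrossEntropyLoss()          # next-token prediction loss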


We explored adding special tokens to each piece so that the sequence would contain information about the composer and about where the music starts and ends. This becomes important for the GPT model that we use later.

We also explored n-gram tokenization, which treats a string of n consecutive ‘words’ as a single token. The motivation was to see if we could better capture the semantics of compound ‘words’ that represent common chords or melodic patterns. In the end, we stuck with unigram tokens for our final model, since any higher-order n-gram increases the vocabulary size substantially relative to the size of our dataset.

The encoded text data were batched into sequences of 512 tokens for training. Instead of chopping them up into mutually exclusive sequences, we overlapped them: every subsequent sequence shares a 50% overlap with the previous one. This way, we don’t lose continuity information at the points where the sequences are split.
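For example, a piece encoded as one long list of token ids can be split into 50%-overlapping training sequences like this (a sketch assuming 512-token sequences):

SEQ_LEN = 512
STRIDE = SEQ_LEN // 2                      # 50% overlap between consecutive sequences

def make_sequences(token_ids):
    """Split one encoded piece into overlapping 512-token training sequences."""
    return [token_ids[start:start + SEQ_LEN]
            for start in range(0, len(token_ids) - SEQ_LEN + 1, STRIDE)]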

[Figure: LSTM]

To ensure that our generated sequences are diverse, instead of always selecting the most likely next token, the model randomly samples from the top k most likely next tokens according to their corresponding probabilities, where k is between 1 and 5.
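A short sketch of this top-k sampling step, assuming a 1-D tensor of logits over the 150-token vocabulary:

import torch

def sample_top_k(logits, k=5):
    """Sample the next token id from the k most likely candidates,
    weighted by their renormalized probabilities."""
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = torch.topk(probs, k)
    choice = torch.multinomial(top_probs / top_probs.sum(), num_samples=1)
    return top_ids[choice].item()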

[Music: Neural Rhapsody]

TUNING

There were hyperparameters that we had to tune, such as the sequence length per sample. Too short, and the model does not learn enough to produce a string of music that sounds coherent to humans. Too long, and training takes much longer without the model learning any more information.

Optimizing the batch size also lets us trade off the number of learning iterations against how much we leverage the GPU’s concurrent computation. We used a grid search to find the best combination of sequence length and batch size.
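Conceptually, the search looked something like the sketch below, where train_and_evaluate is a hypothetical helper standing in for a full training run that returns a validation loss, and the grid values shown are illustrative rather than the exact ones we tried:

from itertools import product

seq_lengths = [128, 256, 512]       # illustrative grid values
batch_sizes = [16, 32, 64]

best = None
for seq_len, batch_size in product(seq_lengths, batch_sizes):
    # train_and_evaluate is a hypothetical helper: it trains a model with
    # these settings and returns its validation loss.
    val_loss = train_and_evaluate(seq_len=seq_len, batch_size=batch_size)
    if best is None or val_loss < best[0]:
        best = (val_loss, seq_len, batch_size)

print('best (val_loss, seq_len, batch_size):', best)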


L2 REGULARIZATION

We also set the weight decay to a non-zero value when training our LSTM in PyTorch, so that the model penalizes large weights and ultimately overfits less. This is the equivalent of applying L2 regularization to the LSTM.
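In PyTorch, weight decay is passed to the optimizer; a sketch (the optimizer choice and values here are illustrative, reusing the model from the LSTM sketch above):

import torch

# A non-zero weight_decay adds a penalty on large weights at every update,
# which for plain SGD is equivalent to L2 regularization.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-5)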

We observed that the LSTM model’s ability to generate coherent sequences started to break down after 512 tokens. Moreover, in sequences longer than 512 tokens generated by the LSTM model, there was no discernible pattern of musical form or structure. Our empirical results agree with the theoretical expectation that, even with LSTMs, recurrent neural networks have difficulty learning dependencies over long paths.

For this reason, we decided to use the Transformer model, which leverages self-attention to better model long-term dependencies [3].


TRANSFORMER - GPT

We used the Generative Pre-Training (GPT) variant of the Transformer model proposed by OpenAI [4]. The effectiveness of GPT was first demonstrated in generative language modeling through training on a diverse corpus of unlabeled text. The GPT model is essentially the vanilla Transformer with its encoder block and cross-attention mechanism stripped away, so that it can operate more efficiently on unsupervised tasks.

[Figure: Transformer]

We used a 6-layer GPT model consisting of one embedding layer, 6 decoder blocks, and a final linear softmax layer that returns the logits for the next predicted token. Each decoder block has 8 self-attention heads with 256-dimensional states and 1024-dimensional inner states for the feed-forward layers.
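A rough PyTorch sketch of a decoder-only stack with these dimensions. We build it here from nn.TransformerEncoderLayer plus a causal mask, since that layer contains only self-attention and a feed-forward sublayer and no cross-attention; this is our illustration, not the exact training code:

import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, N_HEADS, D_FF, N_LAYERS, MAX_LEN = 150, 256, 8, 1024, 6, 512

class MaiaGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_embed = nn.Embedding(MAX_LEN, D_MODEL)
        block = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=D_FF, batch_first=True)
        # Six decoder blocks: masked self-attention + feed-forward, no cross-attention.
        self.blocks = nn.TransformerEncoder(block, num_layers=N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.tok_embed(tokens) + self.pos_embed(positions)
        # Causal mask: each position may only attend to itself and earlier positions.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float('-inf'), device=tokens.device),
            diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.head(x)                # logits for the next token at each position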


Of the 500k sequences obtained from preparing the data, 95% were set aside as training data and 5% were used for validation. None of the data was allocated to a test set, since we are working with a generative model.


Using data from all composers, the training loss converged after about 5,000 iterations. However, we decided to push the training to 200k iterations (30 epochs) — which took approximately 60 hours — to see if we could get higher-quality generated samples.

Our initial results were not particularly coherent, so we constrained the training set to a single composer.

Overall, our attempt to model longer sequences using GPT was hampered by GPU memory, which could only support up to 512 tokens in each sequence.


SPARSE ATTENTION

In vanilla self-attention, each token attends to every other token, which makes the complexity quadratic in the sequence length, O(n²). As a result, our GPU memory can only support up to 512 tokens in each sequence.

However, in sparse attention, each token attends to only a subset of the other tokens. This reduces the complexity to O(n√n), making computation tractable even for long sequences.
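As a toy illustration of such a pattern (a fixed local-window-plus-strided scheme; real implementations use custom kernels rather than dense masks like this one):

import math
import torch

def sparse_causal_mask(seq_len):
    """Boolean mask where True marks positions a query may attend to:
    a local window of the previous ~sqrt(n) positions plus every
    sqrt(n)-th earlier position, so roughly O(n * sqrt(n)) pairs in total."""
    block = max(1, int(math.sqrt(seq_len)))
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        mask[i, max(0, i - block + 1):i + 1] = True   # local (causal) window
        mask[i, 0:i + 1:block] = True                 # strided positions
    return mask

mask = sparse_causal_mask(512)
print(mask.sum().item(), 'attended pairs vs', 512 * 512, 'for dense attention')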


"Music is a language. Our approach involves encoding music into tokens and framing music generation as a language modeling problem. Original music is then generated from models that were trained in an unsupervised fashion."

Team maia


TEAM

edward | yishuang | louis | cale | frank | yiran | andy

SUPPORTED BY:

Ikhlaq Sidhu & Alex Fred-Ojala | UC Berkeley Data-X Lab

Christine Payne | OpenAI


CONTACT

2451 Ridge Rd, Berkeley, CA 94709

510-944-9938

