Sing Bot Sing
Context
In the fall of 2021, I was lucky enough to take a course that introduced partial differential equations (PDEs) and Fourier series. Dr. Bhat taught the section I took, and I am forever grateful I took this class with him. For the final project for this class, some peers and I decided to model the oscillations of guitar strings using PDEs and to determine which specific PDE modeled a guitar string most accurately.
Through the class, and that project, Dr. Bhat illustrated the mathematical basis behind notes and some simple chords, and the beautiful patterns in the mathematics of music. So that brings us to summer of 2022. One of my friends from high school, Caleb Stevens, and I were curious about whether a computer could be taught these patterns and symmetries that are present throughout music.
Approach to Music Modeling with LSTM
Driving Question
Can a computer make its own music? Is it possible to translate music into a mathematical format that can be modeled accurately by a computer?
Background
Music
I won’t go too much into the mathematics behind music, but generally music can be thought of as a specific ordered sequence of notes played over time. Chords are just multiple notes played at the same time, and rests are moments when no notes are played at all. While in musical convention these notes are often expressed as letters, they can also be expressed as the frequencies of the waves that correspond to solutions of a PDE.
String PDE with Stiffness
While all humans are generally restricted to the same range of audible sound (approximately 20 Hz to 20,000 Hz), different cultures have unique ways of dividing this range up into notes and have different musical traditions. Regardless of the culture, however, these patterns exist, and melodies or combinations of notes that abide by them tend to be pleasing to the human ear. Some of these patterns are mathematically expressed as ratios between the frequencies of notes: combinations of notes whose frequencies fall into certain ratios are pleasing to hear.
The soundwaves that represent these notes can be represented as an amplitude versus time plot.
Music Files
Now that we have some idea of what music is and how it can be represented, the way music is stored makes more sense. The first strategy for storing music in a digital file is to sample the audio wave a given number of times per second and store those amplitudes. This is the general idea behind WAV files; MP3 files follow the same idea but compress the sampled audio to reduce the file size.
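As a quick aside (this wasn’t part of our pipeline, and the file name below is made up), a sampled wave can be loaded and viewed as an amplitude-versus-time plot in just a few lines of Python:

```python
# Minimal sketch: load a WAV file and plot amplitude against time.
# "some_song.wav" is a hypothetical file name.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

sample_rate, samples = wavfile.read("some_song.wav")   # e.g. 44,100 samples per second
if samples.ndim > 1:                                    # keep a single channel if the file is stereo
    samples = samples[:, 0]

time = np.arange(len(samples)) / sample_rate            # the moment (in seconds) each sample was taken
plt.plot(time, samples)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()
```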
Sampling Rates
There’s a bit of an issue, however. These files tend to be very large, and so storing large quantities of music is a bit tricky. Also, editing music would be quite difficult with these files (imagine trying to alter amplitudes one by one to change music).
The next strategy is more focused on music editing and storage. MID or MIDI files store the audio as event data: which notes are played, when they are played, and how long they last. Running the file through playback software then reproduces the music. The data is stored as compact binary (typically viewed in hexadecimal), so the file size is incredibly small. A one-minute song might require 10 kB of storage as a MIDI file but 1.44 MB as an MP3 file.
Ableton Interface
For this project, the input data were in MIDI (or MID) format, and the output files were also MIDI, so that information such as the current note, duration, and so on could be retrieved.
Data Source
The goal was to use machine learning to predict music and to determine whether a computer could effectively model it, so we needed a large dataset. Luckily, Kaggle hosts a few large datasets of pop music in MIDI format.
Now why pop music? Pop music is much more repetitive than something like classical music or jazz, so we thought it was likely a good place to start for a machine.
The dataset we ended up using was POP909 by Wang et al. We could have spent the time collecting MIDI files ourselves and uploading them to Kaggle directly, but it seemed best to keep some steps simple where possible.
Neural Network Architecture
Neural Networks
To-do: I’ll record a video to explain what neural networks are
RNNs and LSTMs
The best way to understand the appeal of Recurrent Neural Networks (RNNs) is through the concept of memory. When humans are confronted with something new, we don’t usually change the way we think about everything and start over completely.
We compare that event with our past memories and past lessons that we’ve learned from similar circumstances. Base neural networks lack this ability to remember, and RNNs are an attempt to address this downside.
If I were to make a wish list for an ideal RNN, I’d want it to be able to arbitrarily “remember” or store past information, to be resistant to stochastic fluctuations in the data, and to be trainable. RNNs accomplish this by feeding the information stored in memory cells forward through time.
RNN Structure
Traditional RNNs run into issues when dealing with time lags greater than 5-10 time steps, where the gradient either explodes or vanishes entirely; since neural networks rely on this gradient to train, that presents quite a problem.
Long Short-Term Memory networks (LSTMs) are a type of RNN that addresses some of these downsides.
“LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1).”
-Hochreiter and Schmidhuber, Long Short-Term Memory, 1997.
An LSTM differs from other RNNs because of the composition of each memory unit or cell. Within the cell there are four neural network layers, three with a sigmoid activation function and one with a tanh activation function. In the above diagram, the important part to focus on is the highest horizontal line, which represents the “cell state.”
The LSTM has the ability to change the information flowing through, but that ability is regulated by gates (the circles with plus signs or multiplication signs). In the case of multiplication, the sigmoid layer outputs values between zero and one, where one means let all of the information through and zero means let none of it through.
The tanh layer (which scales values to between negative one and one) creates the updates that need to be made to the cell state (expressed in the form of a vector) which are then added to the cell state (the plus gate).
Lastly, a value needs to be output, and so the tanh function maps the cell state to values between negative one and one, and the sigmoid function allows only certain parts of that mapped data to be output.
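Putting that description into symbols, the commonly used modern LSTM cell (the formulation with a forget gate, which matches the gates described above) computes the following at each time step t, where σ is the sigmoid function and ⊙ is elementwise multiplication:

```latex
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)          % forget gate: how much of the old cell state to keep
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)          % input gate: how much of the candidate update to let in
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)   % candidate update to the cell state
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t % new cell state (the multiply and add gates)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)          % output gate: which parts of the cell state to output
h_t = o_t \odot \tanh(C_t)                      % the cell's output
```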
Literature Review
Music generation through machine learning is a relatively young technique, but not completely new. Many papers go into more depth concerning overall musical structure and motifs, and modeling these larger scale patterns in novel pieces.
For this first foray, our goal was much simpler: an approach to music that uses LSTMs to generate predictions.
That goal narrows the field of research down considerably (some projects use more complex deep learning, such as a GPT architecture). Two papers that were crucial for understanding the approach we should use were LSTM Based Music Generation System by Sanidhya Mangal, Rahul Modak, and Poorva Joshi, and Generation of music pieces using machine learning: long short-term memory neural networks approach by Nabil Hewahi, Salman Al Saigal, and Sulaiman Al Janahi.
Both papers provide the architecture (layers, dropout levels, number of cells, best optimizer, etc.) they used and which performed optimally with the task given, as well as the data they input to the LSTM model. The latter paper went further into a mechanism for bringing songs into the same key to avoid confusing the model, but for this project, we filtered out songs that were not of the same key (since the note patterns would vary between keys).
In the first paper, pitch, velocity (how loud the note is), and time interval were the three pieces of information stored in a given timestep. In the second paper, pitch, start time, and end time were stored instead.
For this project, Caleb and I decided to test two different approaches. The first was a frequency-based approach, where the frequency of a note, its duration, and its offset (where in the song the note takes place) are stored in a given timestep. The second was a note-based approach, where a string representing the given note is stored, followed by the duration and the offset.
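Concretely (with made-up values), a single timestep under each approach might look like this:

```python
# Toy illustration of the two timestep representations; the values here are invented.
freq_based_timestep = {"frequency": 440.0, "duration": 0.5, "offset": 12.0}  # A4 as a raw frequency in Hz
note_based_timestep = {"note_name": "A4",  "duration": 0.5, "offset": 12.0}  # the same event, note stored as a string
```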
Packages
Music Processing
At this point we had some idea of what the goal was, and how we were going to reach it, but we needed to find a package that could decode MIDI files into the information we wanted.
We believed that it was essential to be able to decode a MIDI file into some vector of numbers in order to parse that data through an LSTM model.
The package we settled on was music21, a package created largely by professors in the Music and Theater Arts Section at MIT.
Model Building and Data Preparation
In order to build the LSTM models, we used the Keras API of TensorFlow. To determine how many lags to use, we used statsmodels, and to one-hot encode the data, we used sklearn.
Data Preparation
Retrieval
The first step was to retrieve note, frequency, duration, and offset information from each MIDI file and assemble it into one large data frame. Generally, music21 has plenty of useful built-in functions to retrieve musical information or to determine what key a song was written in; however, there are some interesting quirks to this package.
For instance, retrieving duration information was difficult at times because music21 does not return duration in seconds as a decimal; instead, it provides duration in quarter lengths (where four correspond to a full measure). That would have been fine, except that, unless otherwise specified, accessing the quarterLength attribute of a note or chord object’s duration returns the duration rounded to some fraction of a quarter length.
Functions for Decoding MIDI
Additionally, chords were not stored as individual notes with the same offset but as their own object, which caused some headaches. The names of notes were also not stored as strings, and to get unique notes, the note name had to be combined with the octave.
The decomposed notes also had to be ordered by offset (since initially the MIDI object did not have them ordered by offset) to make the data represent a time series (albeit with nonuniform time steps).
In hindsight, much of our issues came from using the package in a much different way than I think it was initially intended to be used.
Overall though, our approach was to iterate through a list of file paths, determine the key of each MIDI file, and only break down songs that belonged to the most common musical key.
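A rough sketch of that loop is below; the file paths are hypothetical and the real code handled more edge cases, but the music21 calls show the general idea:

```python
# Sketch: pull note name, frequency, duration, and offset out of MIDI files with music21,
# keeping only songs written in the most common key.  The file paths are hypothetical.
from collections import Counter
import pandas as pd
from music21 import converter, note, chord

paths = ["pop909/001.mid", "pop909/002.mid", "pop909/003.mid"]

scores = {p: converter.parse(p) for p in paths}                # parse each MIDI file once
keys = {p: str(s.analyze("key")) for p, s in scores.items()}   # e.g. "C major"
most_common_key = Counter(keys.values()).most_common(1)[0][0]

rows = []
for p, score in scores.items():
    if keys[p] != most_common_key:                             # skip songs in other keys
        continue
    for el in score.flatten().notes:                           # notes and chords, with their offsets
        if isinstance(el, note.Note):
            pitches = [el.pitch]
        elif isinstance(el, chord.Chord):
            pitches = list(el.pitches)                         # decompose a chord into its notes
        else:
            continue
        for pch in pitches:
            rows.append({"note": pch.nameWithOctave,           # e.g. "C4" (name combined with octave)
                         "frequency": pch.frequency,
                         "duration": float(el.duration.quarterLength),
                         "offset": float(el.offset)})

df = pd.DataFrame(rows).sort_values("offset").reset_index(drop=True)
```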
Preprocessing
The data we had at this point contained note name, frequency, duration, and offset; the latter three are numerical (ideally continuous for duration and frequency, discrete for offset), while the note name is a categorical variable.
So we needed two subdivided data sources, since the different variable types require different approaches. The first was composed of frequency, duration, and offset (all numerical), and the second consisted of two parts: one for the categorical variable, note name, and one for duration and offset.
Generally, deep learning methods work best with scaled data, so the numerical data then needed to be scaled, specifically to values between zero and one.
Formula for Scaled Data
Additionally, since the neural network package we used can’t interpret categorical data directly, the categorical data also needed to be encoded in some way. We used a one hot encoding, where each unique string value in the note names data becomes a new column, then each data point is given a value of zero or one in each column (representing whether or not the data point has the given note name).
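In code, both steps are short with sklearn (assuming the data frame df from the extraction sketch above):

```python
# Sketch: scale the numerical columns to [0, 1] and one-hot encode the note names,
# using the data frame `df` built in the extraction sketch above.
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric = MinMaxScaler().fit_transform(df[["frequency", "duration", "offset"]])  # each column mapped to [0, 1]
note_onehot = OneHotEncoder().fit_transform(df[["note"]]).toarray()              # one column per unique note name
```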
Data Splits
Before we built lags for the data, we needed to separate the data into different objects. Now why would we need to do this? Well, we were asking for two separate tasks to be completed on our data.
The first was a regression style problem, where we’re trying to predict some numerical variables (frequency, duration, and offset).
The second was a classification style problem (organized into a numerical form the LSTM can understand) where we’re trying to classify notes, a fundamentally categorical variable.
Frequency, Duration, and Offset Data (normalized)
Note Data
Duration and Offset Data (normalized)
After we separated the data into the various tasks that we wanted to perform, we could move on to creating lagged data.
Lagged Data
The data was more or less ready, minus some testing and training splits (we specified the validation split in the LSTM call to train the model). However, we still needed to create lagged data and decide how many lags to use.
From previous work on time series stock forecasting, I was aware of the partial autocorrelation plot but there was an issue: what should be done about the one hot encoded note name data? It doesn’t make sense to apply an autocorrelation function to a series of categorical data.
Cramer’s V Formula
There are some more complicated measures for determining how “correlated” categorical data is (such as Cramer’s V), but we wanted to be a bit sneakier. We knew two things: first, pop music is incredibly repetitive (from our own experiences and from research done by Elizabeth Hellmuth Margulis), and second, the melody in pop music tends to be relatively simple.
Pop music commonly has chord progressions that are between two and four chords long. Additionally, for more complicated information about harmony, we knew we could get some idea of how many lags to use for notes (and certainly for frequencies) by running a partial autocorrelation on the frequency data (since the plot would capture the patterns and dependencies that exist between the frequencies of notes).
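Producing those plots with statsmodels only takes a few lines (again using the extracted data frame df; the number of lags shown is arbitrary):

```python
# Sketch: partial autocorrelation plots for frequency, duration, and offset with statsmodels.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

fig, axes = plt.subplots(3, 1, figsize=(8, 9))
for ax, col in zip(axes, ["frequency", "duration", "offset"]):
    plot_pacf(df[col], lags=30, ax=ax, title=f"Partial autocorrelation: {col}")
plt.tight_layout()
plt.show()
```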
After looking at the above partial autocorrelation plots, we determined around 16 lags should be used for frequency, while only a few are significant for duration, and one is significant for offset (this was to be expected, but just a sanity check).
So we decided to go with 16 lags for the categorical note names, considering that the length of chord progressions in many pop songs is between two and four, and the frequency partial autocorrelation plot indicates around 16 lags are significant.
In order to build the lags, we needed to split the data into a two dimensional data set of the current time value (our response variable) and a three dimensional tensor consisting of the lagged data (our predictors), and impute the lagged values for time step 0 as zero (otherwise they would be NA values).
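A minimal numpy version of that reshaping might look like the following (the helper name build_lags is ours, just for illustration):

```python
# Sketch: build lagged predictors X with shape (timesteps, n_lags, features) and
# responses y with shape (timesteps, features), zero-padding the missing early history.
import numpy as np

def build_lags(data, n_lags):
    data = np.asarray(data, dtype=float)
    padded = np.vstack([np.zeros((n_lags, data.shape[1])), data])    # zeros stand in for lags before the song starts
    X = np.stack([padded[i:i + n_lags] for i in range(len(data))])   # the n_lags values preceding each timestep
    y = data                                                         # the current timestep is the response
    return X, y

X_freq, y_freq = build_lags(numeric, n_lags=16)        # frequency, duration, and offset with 16 lags
X_notes, y_notes = build_lags(note_onehot, n_lags=16)  # one-hot note names with 16 lags
```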
Model Building and Predictions
A Tale of Two Bots
Here is where we tested two approaches to modeling and predicting music.
One model was a composite model, made from one LSTM to predict duration and offset and another LSTM to predict discrete notes.
The other was a single model: one LSTM predicting frequency, duration, and offset together.
Freq Bot
The accuracy of this model on the training data was lower than that of the note bot, but it outperformed the note model on the validation data.
Model Architecture and Performance
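For readers who want to see the shape of the model in code, here is a rough Keras sketch; the layer sizes, dropout, and training settings are illustrative rather than our exact architecture:

```python
# Sketch of the frequency-based regression LSTM (layer sizes and epochs are illustrative).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

n_lags, n_features = X_freq.shape[1], X_freq.shape[2]   # e.g. 16 lags, 3 features

freq_model = Sequential([
    LSTM(256, input_shape=(n_lags, n_features), return_sequences=True),
    Dropout(0.3),
    LSTM(256),
    Dense(n_features)                                    # next timestep's frequency, duration, and offset
])
freq_model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
freq_model.fit(X_freq, y_freq, epochs=50, batch_size=64, validation_split=0.2)
```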
This was super exciting! The model supposedly had incredibly low mean squared error, and high accuracy (the final model had a performance of 93.76% accuracy, with a mean squared error of 0.000854 on the training data and an accuracy of 96.16% and a mean squared error of 0.0011 on the validation data).
So we decided to throw the predictions into a MIDI file and see what it sounded like:
Honestly, that sounds horrible. When my friend and I heard this we felt so discouraged: if that was 96% accuracy, then the missing 4% must be incredibly important.
So we began to think about why the music sounded so horrible. Something my friend was worried about was microtonality: musical notes, while they might be different from culture to culture, are discrete. However, they can also be described in the form of a continuous variable, frequency.
Microtones (notice the notes between notes)
However, the computer doesn’t know that. So the computer was treating frequency as a purely continuous variable, generating microtones or notes between notes, which to our trained ears sounds awful.
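A quick way to see the problem is to convert a predicted frequency into a (fractional) MIDI pitch number; the 452 Hz value below is made up, but it is the kind of in-between output a regression on frequency can produce:

```python
# Illustration of the microtone problem: a predicted frequency can land between semitones.
import math

def freq_to_midi(freq_hz):
    return 69 + 12 * math.log2(freq_hz / 440.0)   # 69 is A4 (440 Hz) in MIDI pitch numbering

print(freq_to_midi(440.0))   # 69.0   -> exactly A4
print(freq_to_midi(452.0))   # ~69.47 -> nearly halfway between A4 and A#4, i.e. a microtone
```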
I thought the microtones might be an enjoyable feature of the model, one that would open our ears to new music. Caleb thought otherwise, and he was right.
Just in case, we tested many different parts of this model: we tried giving the predictions the original durations and offsets, but that didn’t improve how the song sounded. We tried to squeeze out a little more accuracy, but even then the model sounded horrible.
Onto the notes classification approach then!
Note Bot
At this point we moved on to a note based model, where we constructed a one hot encoding from the notes of the most common musical key in the data.
Model Architecture and Performance
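In Keras, the classification model looks almost the same as the frequency model, swapping the regression head for a softmax over note names (sizes again illustrative):

```python
# Sketch of the note-classification LSTM (layer sizes and epochs are illustrative).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

n_lags, n_notes = X_notes.shape[1], X_notes.shape[2]   # lagged one-hot note data from earlier

note_model = Sequential([
    LSTM(256, input_shape=(n_lags, n_notes), return_sequences=True),
    Dropout(0.3),
    LSTM(256),
    Dense(n_notes, activation="softmax")               # a probability for each note name
])
note_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
note_model.fit(X_notes, y_notes, epochs=50, batch_size=64, validation_split=0.2)
```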
While less exciting in its performance on the validation data, this model was still promising. It had a low categorical cross-entropy and a high accuracy (the final model achieved 98.08% accuracy with a categorical cross-entropy of 0.0658 on the training data, and 90.71% accuracy with a cross-entropy of 0.7869 on the validation data).
We then threw the predictions of a song into a MIDI file and this is what came out:
Wow! That sounds like actual music. Now notably, this is not the same song as was predicted by the frequency bot, but it’s clear the sound is much more pleasing and less discordant. Also I changed the instrument to a nylon guitar, just for fun.
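For reference, writing predictions back out to MIDI takes only a few lines of music21; predictions here stands for a hypothetical list of (note name, duration, offset) tuples decoded from the model's output:

```python
# Sketch: write predicted notes back to a MIDI file with music21.
# `predictions` is a hypothetical list of (note_name, duration, offset) tuples.
from music21 import stream, note, instrument

part = stream.Part()
part.insert(0, instrument.AcousticGuitar())   # the nylon-string guitar sound mentioned above
for name, dur, off in predictions:
    n = note.Note(name)                       # e.g. "C4"
    n.duration.quarterLength = dur
    part.insert(off, n)                       # place the note at its predicted offset

part.write("midi", fp="predicted_song.mid")
```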
This is by far the best case scenario though. There are some songs this model predicted that are pretty bad:
This is clearly music, but it feels a little rushed and poorly constructed.
We were pretty stumped: the accuracy was high and the loss was small, so what was the big issue? We tried a number of things, none of which noticeably improved the result.
Finally, we thought the issue might be with duration. In music, rhythm and repetition are incredibly important, so maybe the poorer performance with the duration and offset prediction (only 89.79% accurate on training data and 92.33% accurate on validation data) was the issue.
We tested this by using the original duration information in the data, and comparing the results:
That’s a significant improvement! It seems as though we underestimated the value of rhythm in music (which seems obvious in hindsight). In the future, if we improve the duration and offset model, we can significantly improve the result.
What Now?
At this point both Caleb and I were satisfied with how the model was working: the output actually sounded like the music it was predicting, the accuracy was high, and the loss functions for the various models were small.
As for next steps, we’re going to really unchain the model from any old songs, to allow the model to iteratively generate its own music from some initial notes, durations, and offsets.
So this project is still a work in progress, but I’ll update this page soon. For now, you can enjoy some bloopers and incredibly bad music:
A Frequency Based Model Output
Our Attempts with Classical Music
Another Previous Attempt