Mandy Wong

Music Generation with Magenta

Problem Statement
Given the immense evolution of Classical music over time in both style and form, we would like to generate pieces of piano music that combine the characteristics of all the classical music periods (Baroque, Classical, Romantic, and Contemporary) and listen to what such pieces would sound like.
Technology
Magenta, an open source Python library provided by Google, uses TensorFlow and enables users to train deep learning models and generate new artistic content from them, be it music or art.
I used Google Colab to write code and most importantly to avoid any installation compatibility issues. All of the installations can be easily done by using the “!pip install” command.
I used TensorBoard to visualize my model’s performance.
Most importantly, I had to have access to a MIDI player like GarageBand or Windows Media Player to listen to my music samples.
Installation Instructions
Go to Google Colab and click on new notebook:

You have to first mount your Google Drive so the notebook knows where to access your files:
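In a Colab cell, mounting Drive looks like this (the /content/drive mount point is the standard one):

from google.colab import drive

# Mount your Google Drive so the notebook can read and write your files.
drive.mount('/content/drive')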

Next, you have to clone the Magenta Git repository. Without this step you will not have access to the files I am referencing, so this step is crucial.
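For example, a cell like this clones the repository into a Drive project folder (the folder path is just my layout; use your own):

%cd "/content/drive/My Drive/Colab Notebooks/DeepLearningProject"
!git clone https://github.com/magenta/magenta.git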

The "!pip install -e ." command allows us to develop on Magenta. After running the cell with "!pip install -e .", you must restart the runtime and then re-run the cell with %cd to change into the folder where setup.py is located (the top-level folder created when you first cloned the Magenta Git repository). The installation command will not work if you are not in the folder containing setup.py.
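For reference, the cell I re-ran after restarting the runtime looked roughly like this (the Drive path is my own folder layout; adjust it to where you cloned the repository):

%cd "/content/drive/My Drive/Colab Notebooks/DeepLearningProject/magenta"
!pip install -e .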

Now when you re-run the "!pip install -e ." cell, you should get a message saying Magenta was successfully installed.

Next, run the cell with "!pip install magenta" to ensure we have the latest required packages.

Now we run the “!sudo apt-get install build-essential libasound2-dev libjack-dev portaudio19-dev” command to get the sound libraries required to generate music MIDI files.

In order to use the Performance Model I built, which is named Mandy_performance_model.py, please put it under the performance_rnn model directory you cloned from Git:

Now you have to go into these performance_rnn files and make sure you import Mandy_performance_model at the top of each: performance_rnn_generate.py, performance_rnn_create_dataset.py, and performance_rnn_train.py. If you do not import my model, it cannot be run.
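The import itself is an ordinary Python import at the top of each script; assuming the file sits in the performance_rnn directory as described above, it would look something like this:

# Hypothetical import path, assuming Mandy_performance_model.py lives in
# magenta/models/performance_rnn/ alongside the other scripts.
from magenta.models.performance_rnn import Mandy_performance_model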

Data
I selected 72 piano MIDI tracks from yamahaden.com's "Signature MIDI collection" (http://www.yamahaden.com/midi-files). 18 of these MIDI files were from the Baroque period (1600-1750); 18 were from the Classical period (1750-1820); 18 were from the Romantic period (1800-1920); and the last 18 were considered "Contemporary" classical music, from 1900 onwards. After selecting our MIDI files, we are ready to create NoteSequences from them.
Processing the Data
We transform these MIDI tracks into NoteSequences, which are protocol buffers, a faster and more efficient data format to work with than MIDI files. This processing is done using the command below; in Google Colab, it is prepended by !python because you are calling a script:

# Folder containing your MIDI files.
INPUT_DIRECTORY=

# TFRecord file that will contain NoteSequence protocol buffers.
SEQUENCES_TFRECORD=/tmp/notesequences.tfrecord

convert_dir_to_note_sequences \
  --input_dir=$INPUT_DIRECTORY \
  --output_file=$SEQUENCES_TFRECORD \
  --recursive
In Google Colab, you must specify the directory you are referring to in quotation marks, such as "/content/drive/My Drive/Colab Notebooks/DeepLearningProject/magenta/magenta/scripts/convert_dir_to_note_sequences.py", and type all your commands in one continuous line without any "\".
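For example, the full Colab cell for this step looks roughly like this (both Drive paths are placeholders for my own folder layout):

!python "/content/drive/My Drive/Colab Notebooks/DeepLearningProject/magenta/magenta/scripts/convert_dir_to_note_sequences.py" --input_dir="/content/drive/My Drive/Colab Notebooks/DeepLearningProject/midi_files" --output_file="/tmp/notesequences.tfrecord" --recursive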

Once we create these NoteSequences, we can turn those into SequenceExamples by extracting performances from the NoteSequences. Each SequenceExample will contain a sequence of inputs and a sequence of labels that represent a performance.
I created two collections of SequenceExamples: one for training, and one for evaluation. I elected to go with the train/validation split ratio of 80/20, so 80% of the SequenceExamples are saved in the training collection, and 20% are saved in the eval collection.

# Configuration to use; mine is "performance_Mandy" (see the Model section).
CONFIG=performance_Mandy

performance_rnn_create_dataset \
  --config=${CONFIG} \
  --input=/tmp/notesequences.tfrecord \
  --output_dir=/tmp/performance_rnn/sequence_examples \
  --eval_ratio=0.20

In Google Colab, this command is prepended by !python because you are running the performance_rnn_create_dataset script.

Again, when running this command in Google Colab, you must specify the directories in quotation marks, like "/content/drive/My Drive/Colab Notebooks/DeepLearningProject/…", and run everything in one line without any "\".
The configuration flag is set to "performance_Mandy", the configuration I created; we will go into detail about it in the next section.
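Putting that together, the Colab cell looks roughly like this (the script path is a placeholder for wherever you cloned the repository):

!python "/content/drive/My Drive/Colab Notebooks/DeepLearningProject/magenta/magenta/models/performance_rnn/performance_rnn_create_dataset.py" --config="performance_Mandy" --input="/tmp/notesequences.tfrecord" --output_dir="/tmp/performance_rnn/sequence_examples" --eval_ratio=0.20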
Data augmentation happens in this step, while the NoteSequences are converted to SequenceExamples. The performance_rnn_create_dataset script applies data augmentation by stretching and transposing each NoteSequence within a limited range.
During this step, performances can get thrown out for being too short, or they can get truncated, so it is very important to have enough data (i.e. MIDI files) to start with; otherwise you will not be able to complete your evaluation phase. By "enough data" I am mostly referring to the length of the MIDI files. Below, I show how the data is processed and how NoteSequences are discarded for being too short or truncated.

Model
I used the Performance RNN model in Magenta as the base model for my performance_Mandy model. This model is capable of supporting timing and dynamics that mimic those of a trained classical pianist, and most importantly it can generate polyphonic music. It uses LSTM networks and abstracts the use of a basic LSTM cell by calling the contrib_rnn.BasicLSTMCell utility. This takes place in a function called make_rnn_cell within the events_rnn_graph.py module.
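To give a sense of what that helper does, here is a simplified sketch (not the exact Magenta source) of building a stacked LSTM cell with dropout, assuming TensorFlow 1.x's contrib rnn module:

from tensorflow.contrib import rnn as contrib_rnn

# Simplified sketch of the idea behind make_rnn_cell: stack one
# BasicLSTMCell per entry in rnn_layer_sizes and wrap each with dropout.
def make_rnn_cell_sketch(rnn_layer_sizes, dropout_keep_prob=1.0):
    cells = []
    for num_units in rnn_layer_sizes:
        cell = contrib_rnn.BasicLSTMCell(num_units)
        cell = contrib_rnn.DropoutWrapper(cell, output_keep_prob=dropout_keep_prob)
        cells.append(cell)
    # The stacked layers behave as a single multi-layer RNN cell.
    return contrib_rnn.MultiRNNCell(cells)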
In my performance_Mandy.py file, I used the ModuloPerformanceEventSequenceEncoderDecoder structure to read the MIDI input files and translate the events into 12 different pitch classes. There are 12 semitones in an octave, so this encoding method preserves more pitch information. Each note is encoded as a position on a unit circle of 144 notes, creating 12 octaves instead of the normal 7, so that each MIDI note can be mapped to a position on the circle. In fact, there are more positions on the circle than MIDI notes, which only range from 0 to 127, so the last 16 positions on the circle are not used.
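As a toy illustration of that idea (this is not the library's actual encoder, just the geometry described above), each MIDI pitch gets an angle on a 144-position circle and can be represented by its cosine and sine:

import math

# Toy example: map a MIDI pitch (0-127) onto a circle with 144 positions,
# i.e. 12 "octaves" of 12 semitones, and return its (cos, sin) coordinates.
def pitch_to_circle(midi_pitch, positions=144):
    angle = 2.0 * math.pi * midi_pitch / positions
    return math.cos(angle), math.sin(angle)

print(pitch_to_circle(60))  # middle C
print(pitch_to_circle(72))  # C one octave higher lands on a different point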
I increased the network size by doubling each RNN layer relative to the original Performance RNN's layer size of 512. I applied a dropout rate of 20% so the model would not overfit on the training data. Due to the increased network size, I also increased the batch size to 128 and the learning rate to 0.002.
I added two optional configurations to my model: the user can specify the number of generated notes (known as density conditioning) and the pitch distribution of the generated notes (known as pitch conditioning).
These optional conditioning signals provide the user with more control over the model’s generated performances.
'performance_Mandy':
    PerformanceRnnConfig(
        magenta.music.protobuf.generator_pb2.GeneratorDetails(
            id='performance_Mandy',
            description='Performance RNN Mandy'),
        # Use the modulo encoder described above.
        magenta.music.ModuloPerformanceEventSequenceEncoderDecoder(
            num_velocity_bins=32),
        contrib_training.HParams(
            batch_size=128,
            rnn_layer_sizes=[1024, 1024, 1024],
            # A keep probability of 0.8 gives a dropout rate of 20%.
            dropout_keep_prob=0.8,
            clip_norm=3,
            learning_rate=0.002),
        num_velocity_bins=32,
        control_signals=[
            magenta.music.NoteDensityPerformanceControlSignal(
                window_size_seconds=3.0,
                density_bin_ranges=[1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]),
            magenta.music.PitchHistogramPerformanceControlSignal(
                window_size_seconds=5.0)
        ],
        optional_conditioning=True),

I initially trained the model for 100 steps, but the model metrics were very poor; the accuracy was a mere 0.06! The loss-per-step graph was clearly still erratic and had not stabilized. This could be attributed to the large batch size compared to the number of steps: with a batch size of 128 and only 100 steps, the model didn't have time to converge.
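For reference, training is launched with the performance_rnn_train script; in Colab my invocation looked roughly like this (the script path is a placeholder, and the exact name of the training TFRecord may differ in your setup):

!python "/content/drive/My Drive/Colab Notebooks/DeepLearningProject/magenta/magenta/models/performance_rnn/performance_rnn_train.py" --config="performance_Mandy" --run_dir="/tmp/performance_rnn/logdir/run1" --sequence_example_file="/tmp/performance_rnn/sequence_examples/training_performances.tfrecord" --num_training_steps=100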
I visualized these metrics by loading the TensorBoard extension in Google Colab:
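The two cells for this are the standard TensorBoard magics; the log directory here is an assumed path matching the --run_dir used for training:

%load_ext tensorboard
%tensorboard --logdir "/tmp/performance_rnn/logdir"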

For my next training run, I trained the model for 300 steps, hypothesizing that it would produce better results. This took 10 hours! The model metrics were slightly better, but the model was still trying to converge, and the loss per step was still very erratic.

Results
The generated MIDI files will show up in your specified folder in Google Drive like this:

I cannot embed MIDI files in this post, but this is what the generated files look like.
As you can see, I specified 10 MIDI files to be generated for the model trained on 300 steps.
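Generation is done with the performance_rnn_generate script; a Colab cell along these lines produced the 10 files (the paths are placeholders, and --num_steps just controls roughly how long each generated piece is):

!python "/content/drive/My Drive/Colab Notebooks/DeepLearningProject/magenta/magenta/models/performance_rnn/performance_rnn_generate.py" --config="performance_Mandy" --run_dir="/tmp/performance_rnn/logdir/run1" --output_dir="/content/drive/My Drive/Colab Notebooks/DeepLearningProject/generated" --num_outputs=10 --num_steps=3000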
The 10 pieces of music generated from the model trained on 100 steps did not sound too computerized! In fact, they sounded very much like jazz pieces. This makes sense because the form, harmonies, and melodies of jazz were derived from Classical music.
The 10 pieces of music generated from the model trained on 300 steps had weightier textures and more dramatic contrasts. The fun feeling of the jazz-like pieces from the 100-step model seemed to have been eclipsed by more serious harmonies. This makes me wonder: if we kept training the model up to 1,000 steps, would it start sounding like the layered melodies of the Baroque period? The model seems to incorporate more of the style of each preceding musical period with increased training steps.
Lessons Learned
Pros of Using Magenta
There are many modules and functions available for you to do a lot of exploratory data analysis in the artistic space. The code was very well documented, and you could do a lot of self-discovery to see which libraries, modules, and models could assist you in solving your problem.
There is no “right” or “wrong” when using these models, because they are designed for the artistic space which does not have clear definitions of what is correct.
The Magenta library allows for one to play to their heart’s content with the different models and the goal is to generate a variety of musical samples, rather than to get to a “correct” answer.
The Performance RNN model was powerful and able to generate polyphonic music that resembled the playing of a trained classical pianist. That was the main reason I chose to base my model on it: the musical possibilities you could generate from it were truly endless.
Magenta allowed me to embark on this exploratory project where there was no right or wrong and I could keep experimenting with the model. My next step would be to get the model accuracy close to double digits and see what kind of music samples would be generated from it. What would the style sound like compared to models that have only a single digit accuracy?
Cons of Using Magenta
Many of the installation instructions were just specific to a local machine, so when I tried to use the same ones with Google Colab, they did not work. There is no support for users who choose to run Magenta on Google Colab. I had to figure out how to run commands in the Google Colab space myself, and when I ran into errors a lot of times it was trial and error to get them fixed.
Even though the code was very well documented, the details of how certain functions worked under the hood were not clear. For example, while converting my NoteSequences into SequenceExamples, I had no idea what counted as "too short" or as having "too many time shift steps." These criteria were not defined in the documentation or the module, so it was a guessing game how much data you needed to provide in order not to end up with a shortage of data after processing.
There were cryptic errors I encountered when I was training models. It turns out that you cannot run different models within the same directory where checkpoints and TensorBoard data will be stored. A directory containing these checkpoints and TensorBoard data is unique to each model, so if you want to train a different model, you must create a new directory for it. If you don’t, you will get an “Assign requires shapes of both tensors to match” error.

If your music data files were not long enough to begin with, or they got truncated heavily during the conversion to SequenceExamples, you will get a "num_batches must be greater than 0" error.

All the files in the Performance RNN model were .py files, so I had to create and edit them in the Google Colab editor, which was cumbersome. I had to be careful about deleting code because there was no way to retrieve the history of the files. Instead, I learned to comment code out rather than delete it whenever I needed to change something.

The training time was just way too long. It was difficult to make efficient tweaks to the model when each run took hours just to see whether it would converge. Training the model for 100 steps took 5 hours, and for 300 steps it took 10 hours.
Future Work
This project's next phase could be a classification problem: classify newly generated pieces by estimating how much Baroque, Classical, Romantic, and Contemporary style each contains. Additionally, I am interested in learning how the Performance RNN model selects which SequenceExamples to feed into the model.