CoreML Drummer
Generating media with deep learning is a double-edged sword, with both interesting and scary implications (e.g. deepfakes). However, one of the more uplifting domains is music generation. In this tutorial, you'll learn how to create a CoreML model that generates a rhythm sequence of MIDI notes.
Creating this kind of model doesn't fall under any of the task-oriented models provided by either CreateML or TuriCreate, so you'll have to create and train the model using Keras, and then use a coremltools converter to convert the model to CoreML.
This tutorial was inspired by Matthijs Hollemans' tutorial, in which he wrote an LSTM in Swift using the Accelerate framework. It's a very good tutorial, and many of its explanations won't be repeated here. Plenty of tutorials have covered MIDI note generation before; one of the interesting aspects of Matthijs' approach is that he accounts for timing information during training.
Whereas Matthijs creates his model using TensorFlow APIs and implements his LSTM directly in Swift with the Accelerate framework, this tutorial focuses on creating a model with Keras and then converting it with coremltools.
In this tutorial, you'll learn how to:
- Create an LSTM model using Keras
- Add a custom activation layer to the model
- Use a custom loss function to train the model
- Convert the model to CoreML
- Write a Swift application to generate a rhythm.
This tutorial makes use of Google Colab; however, feel free to use whatever development environment you wish. You can always look at the completed notebook here.
The Data
The dataset you'll be using comes from groovemonkee.com. The dataset is freely available; however, groovemonkee requires sign-in credentials to download it. You're free to use your own dataset if you have a library of MIDI files, but there are some properties inherent to the groovemonkee dataset that you'll leverage in this tutorial.
Processing the data
In order to prepare the data for training, Matthijs processes the MIDI files directly, working with the raw bytes of each file. Unfortunately, this didn't work with the dataset from groovemonkee because the MIDI file type was unsupported by his script. In this tutorial, you'll instead make use of the MIDO Python library for working with MIDI.
The entire script for processing the MIDI files can be found in the notebook; however, the core MIDO calls for reading MIDI files are as follows:
from mido import MidiFile

for file in files:
    midi = MidiFile(file)
    for track in midi.tracks:
        for message in track:
            data = message.dict()
            print(data)  # process the message's data dictionary here
In the code above, you iterate over a list of file paths to all the MIDI files. The analyze_midi_files method in the notebook does this for every file and collects some basic information along the way. Here's a dump of what that method prints:
Total files 599
MIDI file type {1: 599}
Ticks per Beat {240: 597, 480: 2}
Tracks per MIDI {2: 599}
Tempo {631579: 26, 500000: 66, 444444: 45, 508475: 2, 461538: 48, 618557: 6, 800000: 19, 857143: 17, 600000: 31, 375000: 5, 428571: 24, 545455: 61, 413793: 46, 300000: 24, 666667: 11, 1000000: 21, 625000: 20, 560748: 19, 352941: 3, 571428: 20, 517241: 4, 400000: 8, 750000: 7, 333333: 1, 638298: 1, 571429: 5, 566038: 1, 923077: 2, 387097: 2, 422535: 1, 487805: 2, 681818: 23, 410959: 27, 483871: 1}
time_signature {'4/4/24/8': 523, '6/8/12/8': 26, '12/8/12/8': 26, '3/4/24/8': 11, '7/4/24/8': 4, '2/4/24/8': 3, '6/4/24/8': 6}
Notes 23 (unique notes): 2624 (max), 4 (min), {36: 14764, 42: 11638, 38: 12034, 46: 2723, 44: 886, 43: 1439, 48: 558, 45: 569, 57: 597, 51: 6423, 49: 596, 41: 101, 37: 643, 59: 118, 52: 198, 47: 74, 53: 226, 60: 47, 61: 26, 56: 73, 64: 33, 63: 27, 62: 6}
Ticks 280 (unique ticks), 0 (min) 960 (max)
You can see that the dataset consists of nearly 600 tracks. Most of the tracks use 240 ticks_per_beat. There's a wide variety of tempos across the files, however a large majority fall into the 4/4 time signature. In this tutorial, you'll only work with files that have a tempo of 500000 microseconds per beat (which is 120 beats per minute) and 240 ticks per beat. The filter_files method filters the original files down to match those parameters. Running the same analysis on the filtered files gives a sanity check on the dataset:
Total files 60
MIDI file type {1: 60}
Ticks per Beat {240: 60}
Tracks per MIDI {2: 60}
Tempo {500000: 60}
time_signature {'4/4/24/8': 60}
Notes 15 (unique notes): 428 (max), 12 (min), {36: 788, 43: 239, 38: 719, 48: 45, 45: 26, 49: 40, 42: 648, 46: 268, 57: 24, 51: 270, 44: 30, 37: 96, 56: 37, 64: 10, 63: 11, 47: 12}
Ticks 147 (unique ticks), 0 (min) 687 (max)
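The filter_files method itself lives in the notebook; as a rough idea of what such a filter could look like with MIDO, here's a minimal sketch (the notebook's actual implementation may differ):
from mido import MidiFile

def filter_files(files, tempo=500000, ticks_per_beat=240):
    """Keep only MIDI files that match the target tempo and resolution."""
    kept = []
    for file in files:
        midi = MidiFile(file)
        if midi.ticks_per_beat != ticks_per_beat:
            continue
        # Collect every set_tempo meta message in the file.
        tempos = {msg.tempo for track in midi.tracks for msg in track if msg.type == 'set_tempo'}
        if tempos == {tempo}:
            kept.append(file)
    return kept

filtered_files = filter_files(files)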
Now that the dataset has been curated, it'll be a lot easier to work with. In the original blog post, the author combines all the tracks into one mega track, and then generates a matrix where each row is a one-hot encoded vector containing both the note and tick information. You can see the same approach taken in the create_mega_track method in the notebook. In the notebook, you'll see that you generate 10 different "mega tracks", and then one-hot encode them in the generate_one_hot_encoding method. Here's an example of a single row of the processed dataset:
# OHE = One hot encoded
# |---- OHE note ----||------ OHE tick ------|
[0, 0, 0, 0, 1, 0, ... 0, 0, 0, 1, 0, 0, 0, 0]
Essentially, when you analyzed all the MIDI files, you found a total of 15 unique notes and 147 unique ticks. Since the number of values for each feature is finite, you can safely one-hot encode each and concatenate them into one large array. The array will be all zeros except for 2 entries, which will be 1. This means that the full dataset will consist of N notes, some defined number of time steps T, and 15 + 147 = 162 features.
Note: Your results for the number of notes and ticks may differ because the script randomly shuffles the MIDI files before generating the mega tracks and one-hot encoding them.
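To make the encoding concrete, here's a small illustrative sketch (not taken verbatim from the notebook) of how a single (note, tick) event maps to one of those rows; note_to_ix and tick_to_ix are the lookup dictionaries the notebook builds:
import numpy as np

def encode_event(note, tick, note_to_ix, tick_to_ix):
    """One-hot encode a (note, tick) pair into a single row of length notes + ticks."""
    row = np.zeros(len(note_to_ix) + len(tick_to_ix))
    row[note_to_ix[note]] = 1
    row[len(note_to_ix) + tick_to_ix[tick]] = 1
    return row

# e.g. a kick drum (MIDI note 36) played at the same tick as the previous event (tick 0)
row = encode_event(36, 0, note_to_ix, tick_to_ix)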
In order to feed the data into your Keras model (something you'll do in the next section), you'll wrap the data in a Keras Sequence generator. Essentially, you leverage the on_epoch_end method to advance to the next mega track you created:
class KerasDataGenerator(keras.utils.Sequence):
    def __init__(self, data, step_size, batch_size):
        self.batch_size = batch_size
        self.step_size = step_size
        self.current_song = -1
        self.data = data
        self.on_epoch_end()

    def __len__(self):
        current = self.data[self.current_song].shape
        return current[0] - self.step_size - 1

    def __getitem__(self, index):
        current = self.data[self.current_song]
        x = current[index : index + self.step_size]
        y = current[index + 1 : index + self.step_size + 1]
        x = np.expand_dims(x, axis=0)
        y = np.expand_dims(y, axis=0)
        return x, y

    def on_epoch_end(self):
        self.current_song += 1
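As a quick check of what the generator produces, you can index it directly; this is a sketch, and the shapes assume the 162-feature encoding described above:
gen = KerasDataGenerator(data, step_size=21, batch_size=1)
x, y = gen[0]
print(x.shape, y.shape)  # (1, 21, 162), (1, 21, 162) — y is x shifted by one time step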
With the data processed, you'll now turn your focus to creating the Keras model.
The Keras Model
With the data ready to go, you can start working on the model that will generate MIDI notes. In this tutorial, you'll create a fairly simple model with a single bidirectional LSTM and a single Dense layer. However, since you're trying to output both the timing and the notes using one-hot encoding, it's the output layer and the loss function that are the more interesting problems to solve.
What's the right Activation?
The first question you should ask yourself is: what activation should your model use? Recall that you're expecting a single array where two indices have the highest probability, i.e. this is a little like a classification problem with two independent classes. One way to solve this is to use a custom activation layer. Add the following code to an empty cell:
import keras
from keras.models import *
from keras.layers import *
from keras import backend as K
import numpy as np
import tensorflow as tf

features = notes + ticks
input_shape = (None, features)

def custom_activation(x):
    # 2
    notes_slice, ticks_slice = tf.split(x, [notes, ticks], -1)
    # 3
    notes_slice = K.softmax(notes_slice)
    ticks_slice = K.softmax(ticks_slice)
    # 4
    return tf.concat([notes_slice, ticks_slice], -1)

model = Sequential()
model.add(Bidirectional(LSTM(200, return_sequences=True), input_shape=input_shape))
model.add(TimeDistributed(Dense(features)))
# 1
model.add(Lambda(custom_activation))
For the most part, this is standard code for creating a model. The notes and ticks variables hold the number of unique notes and ticks you found in the previous section. The standout code here is the addition of a Lambda layer using a custom activation function. Briefly, this is how things break down:
- A Lambda layer is added to the end of the model, passing in a custom function.
- The Lambda layer splits the input into two arrays. The first half contains the one-hot encoded notes, and the second half the one-hot encoded ticks.
- Then, the Lambda layer performs a softmax on each array individually.
- Finally, the results are concatenated back into a single output.
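As an optional sanity check (a sketch, not part of the original notebook), you can run random logits through custom_activation and confirm that each half of the output sums to 1:
import numpy as np
from keras import backend as K

# Random "logits" shaped (batch, time, features)
logits = K.constant(np.random.randn(1, 1, notes + ticks).astype("float32"))
activated = K.eval(custom_activation(logits))

print(activated[0, 0, :notes].sum())  # ~1.0 — the notes softmax
print(activated[0, 0, notes:].sum())  # ~1.0 — the ticks softmax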
The other curious layer, if you've never come across it, is the TimeDistributed layer. Essentially, this wrapper applies the Dense layer to every time step, so the model generates an output for each sample provided. CoreML generally ignores this layer because in CoreML there's always a sequence, or "temporal", slice.
You might be wondering how coremltools deals with Lambda layers. You'll go over that in a later section.
Note that this technique of splitting the array to get at the notes and ticks separately is a recurring theme in this tutorial. In fact, with the custom activation out of the way, it's time to think about the loss function.
How do you Compute Loss?
In theory, you could use any generic loss function to train your model; however, there's a better approach. Since the output of the model is essentially two classification results, you can compute a cross-entropy loss for each half by first splitting the output array. Keras allows you to pass a custom loss function during the compile step:
from keras import backend as K
import tensorflow as tf

def custom_loss(y_actual, y_predicted):
    y_actual_notes_slice, y_actual_ticks_slice = tf.split(y_actual, [notes, ticks], -1)
    y_predicted_notes_slice, y_predicted_ticks_slice = tf.split(y_predicted, [notes, ticks], -1)
    note_loss = K.categorical_crossentropy(y_actual_notes_slice, y_predicted_notes_slice)
    tick_loss = K.categorical_crossentropy(y_actual_ticks_slice, y_predicted_ticks_slice)
    return note_loss + tick_loss

model.compile(loss=custom_loss, optimizer='adam', metrics=['accuracy'])
model.summary()
model.summary()
When you compile your model and print its summary, you should get output that looks as follows:
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bidirectional_1 (Bidirection (None, None, 400) 580800
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 162) 64962
_________________________________________________________________
lambda_1 (Lambda) (None, None, 162) 0
=================================================================
Total params: 645,762
Trainable params: 645,762
Non-trainable params: 0
With the model ready to go, the next logical step is to train it, which you'll do in the next section.
Train the model
Now that you have your model ready, it's simply a matter of training it. Using the data you generated earlier, invoke fit_generator on your model, and allow it to finish training:
time_steps = 21
batch_size = 1
epochs = 10
unique_notes = len(note_to_ix)
unique_ticks = len(tick_to_ix)
generator = KerasDataGenerator(data, time_steps, batch_size)
model = build_model((None, unique_notes + unique_ticks), unique_notes, unique_ticks)
history = model.fit_generator(generator, epochs=epochs)
Your results may vary; however, typical results from training look as follows:
Epoch 1/10
3241/3241 [==============================] - 220s 68ms/step - loss: 1.3910 - accuracy: 0.7348
Epoch 2/10
3241/3241 [==============================] - 205s 63ms/step - loss: 0.1635 - accuracy: 0.9660
Epoch 3/10
...
Epoch 9/10
3241/3241 [==============================] - 204s 63ms/step - loss: 0.0536 - accuracy: 0.9856
Epoch 10/10
3241/3241 [==============================] - 204s 63ms/step - loss: 0.0458 - accuracy: 0.9877
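Before moving on to conversion, you can optionally sanity-check the trained model in Python. The sketch below (not from the original notebook) seeds the model with a random one-hot note/tick vector, runs a prediction, and uses argmax on each half of the output to read off the next note and tick:
import numpy as np

# Build a random seed event: one note index and one tick index set to 1.
seed = np.zeros((1, 1, unique_notes + unique_ticks))
seed[0, 0, np.random.randint(unique_notes)] = 1
seed[0, 0, unique_notes + np.random.randint(unique_ticks)] = 1

prediction = model.predict(seed)[0, -1]  # last time step of the output sequence
next_note = int(np.argmax(prediction[:unique_notes]))
next_tick = int(np.argmax(prediction[unique_notes:]))
print(next_note, next_tick)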
Converting the Keras Model to CoreML
Converting the model to CoreML is pretty straightforward. You'll start by installing the coremltools package:
!pip install -U git+https://github.com/apple/coremltools.git
Once it's been installed, you'll make use of the general keras.convert converter to convert the model. The one caveat here is the Lambda layer. Since this is a custom layer, coremltools won't know how to convert it. Add the following code to a free cell, and you'll break it down after:
import coremltools
from coremltools.proto import NeuralNetwork_pb2

def convert_lambda(layer):
    # 3
    params = NeuralNetwork_pb2.CustomLayerParams()
    params.className = "MidiNoteTickActivation"
    params.description = "A fancy new activation function"
    # 4
    params.parameters["notes"].doubleValue = notes
    params.parameters["ticks"].doubleValue = ticks
    return params

coreml_model = coremltools.converters.keras.convert(
    model,
    input_names=['input'],
    output_names=["output"],
    # 1
    add_custom_layers=True,
    # 2
    custom_conversion_functions={ "Lambda": convert_lambda }
)

coreml_model.license = "MIT"
coreml_model.short_description = "MIDI Notes + Ticks generator"
coreml_model.save("Midi.mlmodel")
There's a lot going on here, so look at each line individually:
- You first need to pass a parameter to the coremltools converter notifying it that your model has a custom layer.
- Next, you pass a map to the converter that will be invoked every time it sees a layer of type Lambda.
- Since there's only one Lambda layer in your model, you can safely assume the layer passed in to convert_lambda is the right one. At minimum, CoreML expects a CustomLayerParams instance to be returned with the className field set. The className field is the name of the Swift class that will implement this layer at runtime. More on that in the next section.
- Finally, since the Swift class defined by className is initialized at runtime, you won't have access to the number of notes or ticks in your input data. You could use global variables, however, CustomLayerParams also provides a parameters map which allows you to pass values through to your Swift code. In this case, you pass in the number of unique notes and ticks defined when preprocessing the MIDI data.
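If you'd like to confirm the Lambda layer actually became a custom layer in the converted model, you can inspect the model spec; this is an optional check, not part of the original tutorial:
# List the layers that were converted as custom layers.
spec = coreml_model.get_spec()
for layer in spec.neuralNetwork.layers:
    if layer.WhichOneof("layer") == "custom":
        print(layer.name, layer.custom.className)  # expect MidiNoteTickActivation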
Ok, with the model converted, it's time to use it in a Swift app.
Predictions in Swift
You've converted your model, and you're ready to use it in your Swift application. In this section, you'll create a macOS Command Line Tool, but know that CoreML models can be used on any Apple device that supports CoreML. You'll print everything to the console.
Create a macOS Command Line Tool in Xcode, and drag the CoreML model you created previously into the project.
Note: Be sure "Copy items if needed" is selected when adding the file to the project.
Xcode should have generated a class based on the name of the file; in this case, the class should be named Midi. You should see the following in Xcode:
Notice that there's a section named "Dependencies". In that section, you can see that there's a "Custom Layer" named "MidiNoteTickActivation". You'll start by adding this dependency. Create a new file named MidiNoteTickActivation.swift, and paste the following code.
import Foundation
import CoreML
import Accelerate

// 1
@objc(MidiNoteTickActivation) class MidiNoteTickActivation: NSObject, MLCustomLayer {
    private let notes: Int
    private let ticks: Int

    // 2
    required init(parameters: [String : Any]) throws {
        print(#function, parameters)
        // 3
        self.notes = parameters["notes"] as! Int
        self.ticks = parameters["ticks"] as! Int
        super.init()
    }

    // 4
    func setWeightData(_ weights: [Data]) throws {
        print(#function, weights)
    }

    // 5
    func outputShapes(forInputShapes inputShapes: [[NSNumber]]) throws -> [[NSNumber]] {
        print(#function, inputShapes)
        return inputShapes
    }

    // 6
    func evaluate(inputs: [MLMultiArray], outputs: [MLMultiArray]) throws {
        // 7
        for i in 0..<inputs.count {
            let input = inputs[i]
            let output = outputs[i]

            // 8
            var notes_slice = [Float](repeating: 0.0, count: self.notes)
            var ticks_slice = [Float](repeating: 0.0, count: self.ticks)
            for j in 0..<input.count {
                if j < self.notes {
                    notes_slice[j] = input[j].floatValue
                } else {
                    ticks_slice[j - self.notes] = input[j].floatValue
                }
            }

            // 9
            notes_slice = softmax(z: notes_slice)
            ticks_slice = softmax(z: ticks_slice)

            // 10
            for j in 0..<output.count {
                if j < self.notes {
                    output[j] = notes_slice[j] as NSNumber
                } else {
                    output[j] = ticks_slice[j - self.notes] as NSNumber
                }
            }
        }
    }

    // Plain-Swift softmax helper. The original post doesn't show its implementation,
    // so this numerically stable version is a minimal stand-in.
    private func softmax(z: [Float]) -> [Float] {
        let maxValue = z.max() ?? 0
        let exps = z.map { exp($0 - maxValue) }
        let sum = exps.reduce(0, +)
        return exps.map { $0 / sum }
    }
}
- In order to implement a custom layer, you first create a class that extends NSObject and conforms to MLCustomLayer. What's important is that you annotate the class with @objc(MidiNoteTickActivation). When CoreML loads the model, it will look for a class with that annotation to fill in the custom layer.
- This is the required constructor that CoreML looks for when initializing the custom layer. As a convenience, it passes in a parameters map containing any extra parameters that were supplied when creating the model in Python.
- If you recall, you added the number of unique notes and ticks to the custom layer. Here, you extract those values and store them for use when it comes time to parse the data seen by the layer.
- This is a method of the MLCustomLayer protocol. Since the custom layer doesn't have any weights associated with it, you can skip the implementation here.
- This is another method of the MLCustomLayer protocol. Here, you define what the output should look like; it's your chance to reshape the output. In this case, the output doesn't change in this layer. What's interesting is that the rank of the shapes is 5. Back when CoreML was first introduced, model inputs always needed to be of rank 5. Perhaps this is a hold-over from that time, but it's certainly interesting that it's the rank of the inputs given you never defined any data with that rank.
- This is the last required method of MLCustomLayer.
- First, you iterate over the list of inputs passed in. CoreML layers support multiple inputs and outputs; however, in this case, you should receive a size of 1 for both the inputs and outputs.
- Here, you iterate over the input and split the data into a notes array and a ticks array, using the parameters passed into the constructor to determine where the notes end and the ticks begin.
- With the two vectors extracted from the input, you run a softmax on each one, mirroring the custom activation from the Keras model. A simple softmax helper is included at the bottom of the class above.
- Finally, you copy the data back into the output.
Now, is there a better way to implement this method? Perhaps. There's probably a way to run the softmax on the MLMultiArray by getting access to the underlying pointers directly. Moreover, MLCustomLayer also provides a version of evaluate that runs using Metal shaders directly. This tutorial won't implement that interface.
With the Custom Layer complete, it's time to focus your attention on running a prediction in Swift. Start by pasting the following code into your main.swift file:
private let noteVectorSize = 15
private let tickVectorSize = 147
private let featureSize = noteVectorSize + tickVectorSize
These constants define the parameters of your data. If you recall, your dataset has 15 unique notes and 147 unique ticks. Next, you have to create the input array, which is of type MLMultiArray. You'll also create some helper functions. Paste the following code into the same file:
// 1
guard let array = try? MLMultiArray(shape: [1, 1, featureSize as NSNumber], dataType: .double) else {
    print("Failed to create MLMultiArray")
    exit(-1)
}

// 2
func assign(_ array: MLMultiArray, noteIndex: Int, tickIndex: Int) {
    for index in 0..<array.shape[2].intValue {
        array[[0, 0, index as NSNumber]] = 0
    }
    array[[0, 0, noteIndex as NSNumber]] = 1
    array[[0, 0, tickIndex as NSNumber]] = 1
}

// 3
func argmax(_ input: MLMultiArray, start: Int, stop: Int) -> (maxIndex: Int, maxValue: Float) {
    var maxValue = input[start].floatValue
    var maxIndex = -1
    for index in start..<stop {
        let value = input[index].floatValue
        if value >= maxValue {
            maxValue = value
            maxIndex = index
        }
    }
    return (maxIndex, maxValue)
}
- Create an MLMultiArray that will hold a single note/tick combination.
- assign is a helper function that takes an MLMultiArray and sets the indices for the note and tick respectively.
- argmax is a useful function for determining the index in an array with the highest value. If you recall, the model runs a softmax, which fills an array with the probability that a given index is the expected value. argmax takes such an array and finds the index with the highest value. In your case, you'll use it to determine which note and which tick the model predicts next.
Next, you'll run a prediction, first randomly selecting a note and tick value. Copy the following code into your file:
var seedIndexNote = Int(arc4random_uniform(UInt32(noteVectorSize)))
var seedIndexTick = Int(arc4random_uniform(UInt32(tickVectorSize)))
assign(array, noteIndex: seedIndexNote, tickIndex: (noteVectorSize + seedIndexTick))
Here, you use the size of the note and tick vocabularies to generate a random seed and assign it to the input MLMultiArray.
var bidirectional_1_h: MLMultiArray? = nil
var bidirectional_1_c: MLMultiArray? = nil
var bidirectional_1_h_rev: MLMultiArray? = nil
var bidirectional_1_c_rev: MLMultiArray? = nil

let input = MidiInput(
    input: array,
    bidirectional_1_h_in: bidirectional_1_h,
    bidirectional_1_c_in: bidirectional_1_c,
    bidirectional_1_h_in_rev: bidirectional_1_h_rev,
    bidirectional_1_c_in_rev: bidirectional_1_c_rev
)
Next, you use the generated MidiInput class to wrap the MLMultiArray, along with the h and c parameters for the LSTM, to pass into the prediction. The h and c parameters are the "memory" of the LSTM. Note that after running a prediction, you'll get output versions of the h and c parameters, which you can pass back into the model to get the next prediction.
Finally, in order to run a prediction, instantiate a Midi object and pass the MidiInput into the model:
let model = Midi()
let prediction = try model.prediction(input: input)
let output = prediction.output
let note = argmax(output, start: 0, stop: noteVectorSize)
let tick = argmax(output, start: noteVectorSize, stop: output.count)
print("\(note.maxIndex), \(tick.maxIndex)")
The result is an object of type MidiOutput. It contains an output MLMultiArray named output, the name you defined when converting the model from Keras. You can use the argmax function defined earlier to find the index of the note and tick of the next event.
If all went well, you should see output showing the next note and tick. You can turn around and feed this value back into your model to get the next consecutive note and tick. Do this a number of times, and you'll have an entire MIDI track with a nice rhythm!
Congratulations, you've gone from creating an LSTM model in Keras to using that model in a Swift project.
Takeaways
In this tutorial, you managed to create a model that generates MIDI notes, built from a drum dataset. You analyzed the MIDI files, converted them into a dataset that you fed into a Keras model, trained the model, and then converted it into a CoreML model.
There are two key points in this tutorial:
- CoreML allows for custom layers from Keras
- Custom Layers allow for extra parameters which you can use from Swift
If you enjoyed this tutorial on working with CoreML LSTMs, consider checking out mlfairy.com. MLFairy is a service that helps you create better CoreML models for all your Apple edge devices.