Music has been a way for us to express ourselves for quite a while now. And as we evolve, our society evolves, our technology evolves, and also, our instruments evolve. Last century we started using electronics to create music and this has provided us with so many new sounds and ways to create and experience music. These days it’s normal for us when some person stands behind their machinery with some prepared set of songs. Music blasts out of the speakers. The bass beats at a ridiculous volume, causing the earth to tremble. And we rave on and about. As time progresses we get better and better at synthesizing sound exactly how we want sound to sound. Tuning more and more to our hearts and minds.

That’s great and all, but new centuries offer new developments. With the rise of AI (Artificial Intelligence) we foresee the rise of new instruments. Obviously, fine-tuning synthesized sound by hand will remain a craft and a form of expression. But we do not have to do it by hand. We can let AI do this for us too! And we’re certainly not the first. Google shares our vision and has provided the world with the NSynth dataset: over 300,000 samples of various instruments played at different pitches (explained later) and velocities, and with different techniques, such as a bowed or plucked cello.

As experts in AI we intend to enter the field of generated sound. More specifically, we intend to work on creating generated instruments. We hope that by using GANs (generative adversarial networks) we can let an AI create beautiful instruments and sounds we’ve never heard before. We are not experts in the physics of sound, nor are we very experienced in analysing sound with neural networks. We are enthusiastic AI experts starting a new journey. With this blog, we give you a sneak peek into the first steps of our journey. In these first steps we explain the basics of sound and the options we have for analysing sound so that we can extract the instrument, pitch and velocity from a sound sample. This is followed by a quick overview of the dataset we use. Lastly, we will build a neural network that classifies pitches.

At the upcoming ODSC Europe conference in our session, “Exploring the Rimworld of Sound Space Using Generative Adversarial Networks,” we will show our first results of generated instruments using generative adversarial networks and we intend to take you with us on our journey of exploring new sounds and instruments.

The talk and this article are mainly aimed at people who have not yet delved deep into sound technology and analysis of sound using machine learning but do have a basic understanding of the idea behind machine learning and Python code development.

### Requirements

Our setup, a cheap laptop with a modest GPU, runs Python 3.8 with the following libraries:

- TensorFlow 2.4.1
- Keras 2.4.3
- Librosa 0.8
- Matplotlib
- Jupyter / IPython

## Creating Sound

We all know what sound is, we hear it all the time. But how can we create sound on our computer? Audio on our computer, in its most basic form, is represented by a vector of numbers. These numbers indicate the position of our speaker cones. So, when processed by our sound card they will make the speakers bounce. Which in turn will move the air. Which we pick up with our ears. Which in turn sometimes makes us bounce.

The number of values our sound card processes each second dictates the resolution of our sound. This is also called the sample rate. A higher sample rate means higher quality, much like a higher resolution (more pixels) means a higher-quality image.

Instead of using more words, let’s construct a sound using numpy! We are going to construct a wave with a frequency of 400 Hz, meaning the wave shape will have 400 ups and 400 downs per second. The sample will be 4 seconds long and have a sample rate of 16k values per second, which means the entire sample consists of 4 x 16k = 64k values in total.

```python
import numpy as np
from IPython.display import Audio

sample_rate = 16_000  # values / second
length = 4            # seconds
frequency = 400       # Hz

# Time axis covering 4 seconds with a total of 4 x 16k values.
# endpoint=False keeps the spacing at exactly 1 / sample_rate.
x = np.linspace(
    start=0,
    stop=length,
    num=sample_rate * length,
    endpoint=False
)

# Construct a sine wave
radian_x = x * 2 * np.pi
y = np.sin(radian_x * frequency)

# Display an audio player (in a Jupyter notebook)
Audio(y, rate=sample_rate)
```

*Above image illustrates how badly humans want to click on triangles. Please run the code if you want to hear the sample!*

We can easily visualize the sound by plotting time (x) and amplitude (y):

```python
from matplotlib import pyplot as plt
%matplotlib inline

plt.figure(figsize=(20, 5))

sample_size = int(sample_rate * 0.1)  # 0.1 second
sample_x = x[:sample_size]
sample_y = y[:sample_size]

plt.plot(sample_x, sample_y)
plt.ylabel('amplitude')
plt.xlabel('time (s)')
```

Shifting a continuous signal over time (shifting left or right) will not result in a different sound.

However! If we use value@time as input features for a machine learning model it could have a huge impact! Can you already figure out why?
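One way to see the effect is sketched below, assuming only numpy. The names `shift_samples`, `original` and `shifted` are ours, chosen for illustration. Shifting the 400 Hz tone by half a period leaves the tone sounding the same, yet every raw value changes, while the magnitudes of the frequency components stay the same.

```python
import numpy as np

sample_rate = 16_000  # values / second
length = 4            # seconds
frequency = 400       # Hz

t = np.arange(sample_rate * length) / sample_rate
original = np.sin(2 * np.pi * frequency * t)

# Shift by half a period: 20 samples at 16 kHz for a 400 Hz tone
shift_samples = sample_rate // (2 * frequency)
shifted = np.sin(2 * np.pi * frequency * (t + shift_samples / sample_rate))

# Raw values at the same time index are now far apart ...
max_raw_difference = np.max(np.abs(original - shifted))

# ... but the magnitude of each frequency component is unchanged
spectrum_original = np.abs(np.fft.rfft(original))
spectrum_shifted = np.abs(np.fft.rfft(shifted))
max_spectrum_difference = np.max(np.abs(spectrum_original - spectrum_shifted))

print(max_raw_difference, max_spectrum_difference)
```

A model fed with raw values has to learn that these two very different input vectors represent the same sound.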

## Understanding sound

We process sound with little hairs in our ears that respond to specific frequencies. This preprocessing helps our brain understand sound. So, it does sound logical to also try using similar techniques when we let our AI analyze sound.

A Fourier transform does exactly this. From a (complex) waveform, such as the one we constructed above, it extracts the separate frequencies and the amplitude with which each frequency is present. In the above sample there is 1 component of 400 Hz. Feel free to look up the mathematical background of the Fourier transform; it is not the focus of this article.

```python
import librosa

# Sum the magnitudes over all time windows to get one value per frequency bin
intensities = np.abs(librosa.stft(y)).sum(axis=1)
frequencies = librosa.fft_frequencies(sr=sample_rate)

plt.plot(frequencies, intensities)
plt.xlabel('Frequency (Hz)')
plt.ylabel('Intensity')
```

To include the dimension of time we repeatedly execute a Fourier transform on a window sliding over the timeline. In turn we can visualize this data by turning it into a spectrogram, giving us a visual representation of frequencies and amplitudes at each window of time. This is called the short-time Fourier transform. (The previous code actually used the same function, as librosa does not have a “normal” Fourier transform, and we want you to spend your time having fun with us and not installing more libraries!)

```python
import librosa.display

spectrogram = np.abs(librosa.stft(y))
librosa.display.specshow(spectrogram, y_axis='log', x_axis='time', sr=sample_rate)
```

*The figure on the left shows our boring spectrogram of a sine wave (continuous signal), on the right we see a more interesting spectrogram of an acoustic bass*
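To make the sliding window idea concrete, here is a rough sketch of a short-time Fourier transform written with plain numpy. It is not how librosa implements it: the helper `manual_stft` is hypothetical, the window size of 2048 and hop of 512 samples are illustrative choices, and the sketch skips the padding librosa applies by default, so the number of time frames will differ.

```python
import numpy as np

def manual_stft(signal, window_size=2048, hop=512):
    """Magnitude spectrogram with shape (frequency bins, time frames)."""
    window = np.hanning(window_size)  # fade each slice in and out
    starts = range(0, len(signal) - window_size + 1, hop)
    frames = np.stack([signal[s:s + window_size] * window for s in starts])
    # One Fourier transform per window position
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 16_000
t = np.arange(4 * sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 400 * t)

spectrogram = manual_stft(tone)
bin_width = sample_rate / 2048             # Hz covered by one frequency bin
loudest_bins = spectrogram.argmax(axis=0)  # strongest bin per time frame

print(spectrogram.shape)
print(loudest_bins[0] * bin_width)  # close to 400 Hz, limited by the bin width
```

For the sine wave every time frame peaks in the same frequency bin, which is exactly why its spectrogram looks so boring.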

## Data set

We consider ourselves very lucky. Google has taken the effort to record and label a dataset of over 300,000 sounds, so that we can put our neural networks to work. The dataset can be found here: https://magenta.tensorflow.org/datasets/nsynth

The dataset can be downloaded in two forms: as a set of JSON files with corresponding audio files, or as TFRecords, a data format made specifically for working with the TensorFlow library. Either way we are gifted labels such as the instrument and pitch. The dataset is quite substantial. For example, the TFRecords training set consists of 289,205 elements taking up almost 70GB of space.

## Network setup

**Important choice!** Will we use the raw audio signal, the Fourier spectrum or a frequency-over-time representation as input for our generative adversarial network?

Let’s go in blind and do NO preprocessing at all on the input data! Raw audio signal it is. Our first model will be a neural net trying to predict the pitch of an audio sample.

We use the term pitch to distinguish whether a sound is higher or lower than some other sound. Take for example the piano. In the figure below we see the 4th A-note on the piano, colored yellow; it has a higher pitch than the blue note, the 4th C. The dataset has 128 unique values for pitch. 0 is the lowest pitch, 127 the highest.
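The NSynth dataset description states that these pitch values are MIDI pitch numbers. Assuming the common convention where the 4th A is MIDI pitch 69, tuned to 440 Hz, and each step is one semitone in equal temperament, a pitch number can be converted to a frequency as sketched below. The helper `pitch_to_frequency` is ours and not part of the dataset tooling.

```python
def pitch_to_frequency(pitch: int) -> float:
    """Frequency in Hz of a MIDI pitch number, assuming A4 (pitch 69) = 440 Hz."""
    return 440.0 * 2 ** ((pitch - 69) / 12)

# The two piano notes from the example: the 4th C and the 4th A
for name, pitch in [("C4", 60), ("A4", 69)]:
    print(f"{name}: pitch {pitch} -> {pitch_to_frequency(pitch):.2f} Hz")
```

Every 12 steps the frequency doubles, so the 128 classes span a bit more than 10 octaves.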

Recall our samples last 4 seconds and have a sample rate of 16,000 values per second.

Let’s approach this as a classification problem with 128 possible classes, which correspond to the 128 possible pitches. This gives us the network outlined in the figure below, where values in X can be any intensity and values in y are either 0 or 1.

*Detailed setup of our network*

Combining everything we learned about TFRecords with the setup above, the following code prepares the dataset:

```python
import tensorflow as tf


def _one_hot(tensor: tf.Tensor, size: int) -> tf.Tensor:
    """
    One hot encode a tensor and return it as 1D tensor
    :param tensor: tensor holding a single integer label
    :param size: number of unique values in tensor
    :return: 1D tensor of length `size`
    """
    hot_tensor = tf.one_hot(tensor, size)
    shaped_tensor = tf.reshape(hot_tensor, (size,))
    return shaped_tensor


@tf.autograph.experimental.do_not_convert
def _parse_function(example_proto):
    # Schema
    features = {
        "pitch": tf.io.FixedLenFeature([1], dtype=tf.int64),
        "audio": tf.io.FixedLenFeature([64000], dtype=tf.float32)
    }
    example = tf.io.parse_single_example(example_proto, features)

    data = example['audio']

    label_name = 'pitch'
    label_value_count = 128
    label = _one_hot(example[label_name], label_value_count)

    return data, label


# Chained calls spanning several lines need surrounding parentheses
path_to_train_data = 'path/to/data/nsynth-train.tfrecord'
train_dataset = (
    tf.data.TFRecordDataset(path_to_train_data)
    .map(_parse_function)
    .batch(128)
)

path_to_test_data = 'path/to/data/nsynth-valid.tfrecord'
test_dataset = (
    tf.data.TFRecordDataset(path_to_test_data)
    .map(_parse_function)
    .batch(128)
)
```

## Convolution

The pitch we want to detect is based on the various frequencies and their relative occurrence in the samples. To be able to detect frequencies, we are looking for patterns in the data rather than specific values. Intuition: does a high value at a certain time correspond to a frequency? This is a typical use case for convolution. Very similar to a wavelet transformation, we can convolve various kernels over the audio signal to find these patterns. As our signal is one-dimensional, we will use one-dimensional convolution with a one-dimensional kernel.
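As a small illustration of a kernel responding to a pattern, the sketch below convolves a hand-made kernel over two tones with numpy. The kernel, one period of a 400 Hz sine, is a hypothetical choice for this example. In the actual network the kernels are much shorter and are learned during training.

```python
import numpy as np

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate  # 1 second of audio

# Hand-made kernel: exactly one period of a 400 Hz sine (40 values)
kernel_length = sample_rate // 400
kernel = np.sin(2 * np.pi * 400 * np.arange(kernel_length) / sample_rate)

matching_tone = np.sin(2 * np.pi * 400 * t)
other_tone = np.sin(2 * np.pi * 1600 * t)

# Slide the kernel over each signal and keep the strongest response
matching_response = np.abs(np.convolve(matching_tone, kernel, mode="valid")).max()
other_response = np.abs(np.convolve(other_tone, kernel, mode="valid")).max()

print(matching_response, other_response)
```

The kernel responds strongly to the tone that contains its pattern and stays practically silent for the other tone, regardless of where in the signal the pattern occurs.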

### Pooling

To find lower frequencies we have to use larger kernels in our convolutional steps. Imagine a frequency as low as 1 Hz. This means an entire wave will take 1 second, which is 16,000 samples. If our kernel has size 9, it is easy to imagine the signal might get drowned out by higher frequency signals. Instead of increasing our kernel size, we could “zoom out” the signal and not change the kernel size. This “zooming out” is actually the same as downsampling, i.e. lowering the sample rate. As the maximum detectable frequency in a signal is half the sample rate (the Nyquist frequency), this step acts as a low-pass filter; lower frequencies remain and higher frequencies are lost forever 🙁

Example:

- Raw signal: sample rate 16 kHz
- 1D convolution: kernel size k, max detectable frequency 8 kHz
- Pooling 4x: sample rate drops to 4 kHz
- 1D convolution: kernel size k, max detectable frequency 2 kHz
- etc.
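The low-pass effect of pooling can be sketched with numpy. The helper `average_pool` below is a hypothetical stand-in for what an average pooling layer with pool size 4 does to a single channel. The two tones are illustrative choices: 400 Hz stays below the new limit of 2 kHz, while 4 kHz lies above it.

```python
import numpy as np

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate  # 1 second of audio

def average_pool(signal, pool_size=4):
    """Replace every group of pool_size values by their mean."""
    usable = len(signal) - len(signal) % pool_size
    return signal[:usable].reshape(-1, pool_size).mean(axis=1)

low_tone = np.sin(2 * np.pi * 400 * t)
high_tone = np.sin(2 * np.pi * 4000 * t)

pooled_low = average_pool(low_tone)
pooled_high = average_pool(high_tone)
pooled_sample_rate = sample_rate // 4

print(pooled_sample_rate)
print(np.abs(pooled_low).max(), np.abs(pooled_high).max())
```

The 400 Hz tone keeps almost all of its amplitude, while the 4 kHz tone averages out to practically nothing.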

### Final network

```python
import tensorflow as tf
from tensorflow.keras import layers

audio_length = sample_rate * length

model = tf.keras.Sequential([
    layers.Input(shape=(audio_length,)),
    # Conv1D expects a channel dimension: (time steps, channels)
    layers.Reshape(target_shape=(audio_length, 1)),

    # Four blocks of convolution (kernel size 9) followed by 4x pooling
    layers.Conv1D(4, 9, activation='relu'),
    layers.AveragePooling1D(4),
    layers.Conv1D(8, 9, activation='relu'),
    layers.AveragePooling1D(4),
    layers.Conv1D(12, 9, activation='relu'),
    layers.AveragePooling1D(4),
    layers.Conv1D(16, 9, activation='relu'),
    layers.AveragePooling1D(4),

    # Classifier on top of the extracted features
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dense(128, activation='softmax')
])
```
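To get a feeling for how quickly the signal shrinks, the arithmetic below follows the number of time steps through the four blocks. It assumes the Keras defaults used above: convolution without padding and with stride 1, and pooling with a stride equal to the pool size. The helper names are ours.

```python
def conv_output_length(time_steps, kernel_size):
    """Convolution without padding loses kernel_size - 1 time steps."""
    return time_steps - kernel_size + 1

def pool_output_length(time_steps, pool_size):
    """Pooling keeps one value per full group of pool_size time steps."""
    return time_steps // pool_size

time_steps = 64_000  # 4 seconds at 16,000 values per second
filter_counts = [4, 8, 12, 16]

for filters in filter_counts:
    time_steps = pool_output_length(conv_output_length(time_steps, 9), 4)
    print(f"after the block with {filters} filters: {time_steps} time steps")

# The Flatten layer hands time steps x filters values to the first Dense layer
flattened_size = time_steps * filter_counts[-1]
print(flattened_size)
```

The same numbers should show up in the output of `model.summary()`.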

## Performance Metrics

As we are dealing with a multi-class classification problem, where only one class should be selected, we picked the following settings:

The final layer of the model is a softmax layer. This leads to an outcome vector of size 128. The sum of this vector is 1, and each of these 128 values is the estimated likelihood of that value being picked. An example output for a 3 class classification problem could be [0.2, 0.3, 0.5].
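A minimal numpy sketch of what a softmax does to a vector of raw scores is shown here. The input values are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary scores into positive values that sum to 1."""
    shifted = scores - np.max(scores)  # avoids overflow, result is unchanged
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()

scores = np.array([1.0, 2.0, 3.0])
probabilities = softmax(scores)

print(probabilities, probabilities.sum())
```

The highest score ends up with the highest value, and the values can be read as the model's estimate per class.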

Categorical accuracy is our weapon of choice as a human-interpretable metric. This is the number of times the highest value in the prediction vector is in the right place divided by the total number of predictions. If our previously mentioned prediction of [0.2, 0.3, 0.5] were compared to the label [0, 0, 1], and this were our only example, we would have 100% categorical accuracy (once right divided by a total of one prediction). The categorical accuracy is described in more detail in the Keras documentation [src].
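The calculation can be sketched in a few lines of numpy. The first prediction is the one from the text, the second one is made up so that the example also contains a miss.

```python
import numpy as np

predictions = np.array([
    [0.2, 0.3, 0.5],  # highest value in position 2
    [0.6, 0.3, 0.1],  # highest value in position 0 (made-up example)
])
labels = np.array([
    [0, 0, 1],        # position 2 is correct: hit
    [0, 1, 0],        # position 1 is correct: miss
])

hits = predictions.argmax(axis=1) == labels.argmax(axis=1)
categorical_accuracy = hits.mean()

print(categorical_accuracy)
```

One hit out of two predictions gives a categorical accuracy of 50%.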

To optimize our model we are using a loss function which relates to the categorical accuracy. This loss function is called categorical cross-entropy and, as opposed to categorical accuracy, it also takes into account how right or how wrong a prediction is. For example, the prediction [0.2, 0.3, 0.5] compared to the actual label [0, 0, 1] would have a loss of 0.69, whereas the prediction [0, 0.1, 0.9] has a loss of 0.11. Categorical cross-entropy is described in more detail in the Keras documentation [src].
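The two loss values can be reproduced with a small numpy sketch. The helper `categorical_cross_entropy` is our simplified version for a single sample, not the Keras implementation. The clipping with a small `epsilon` is an assumption we add to avoid taking the logarithm of 0.

```python
import numpy as np

def categorical_cross_entropy(label, prediction, epsilon=1e-7):
    """Negative log of the value predicted for the correct class."""
    clipped = np.clip(prediction, epsilon, 1.0)  # avoid log(0)
    return float(-np.sum(label * np.log(clipped)))

label = np.array([0, 0, 1])
loss_unsure = categorical_cross_entropy(label, np.array([0.2, 0.3, 0.5]))
loss_confident = categorical_cross_entropy(label, np.array([0.0, 0.1, 0.9]))

print(round(loss_unsure, 2), round(loss_confident, 2))
```

Both predictions count as equally correct for the categorical accuracy, but the more confident one is rewarded with a lower loss.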

## Results

So what categorical accuracy do we hope to achieve? Let’s be a little bit easy on ourselves, and put the baseline, or threshold for celebration, at random performance. We have 128 possible classes, so a 1 in 128 chance to accidentally have it right. That is slightly below 1%. Easy enough right?

We have slightly lowered the default learning rate, as we would not want to scare humankind with the overwhelming intelligence of 21st century AI. Also it gives better visuals. Below is the code to get this neat graphic:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss='categorical_crossentropy',
    metrics=[tf.keras.metrics.CategoricalAccuracy()]
)
model.summary()

history = model.fit(train_dataset, validation_data=test_dataset, epochs=30)

plt.plot(history.history['val_categorical_accuracy'], label='Validation categorical accuracy')
plt.plot(history.history['categorical_accuracy'], label='Training categorical accuracy')
plt.legend()
plt.title('NSynth Pitch Detection Performance')
plt.xlabel('Epochs')
plt.ylabel('Categorical accuracy')
plt.savefig('model')
```

The figure above depicts the results of our first attempts at using raw audio for classification. We are above the celebration threshold after epoch 1. We are off to celebrate, and welcome you to do the same if you made it all the way here, in mind & code.

We have seen various ways to look at sound, and have shown that training a model on the rawest form does actually work. Next up: using our pitch-detector as discriminator for a generative adversarial network!

### About the authors/ODSC Europe 2021 speakers on Generative Adversarial Networks:

Laurens Koppenol: I am Laurens, Machine Learning Engineer at Dataworkz. I have a background in Artificial Intelligence and worked for several years as lead data scientist (at ProRail). I like to be on the edge where business meets tech, where AI and machine learning make an impact. Currently I am working as a data engineer / data architect at the Port of Rotterdam, via Dataworkz. I am intrigued by the creative side of artificial intelligence; can a machine show symptoms of creativity or can it only reproduce what it has seen before? How is that different from what we can do? The subject we are talking about is inspired by David’s love for music and my urge to do cool stuff in the field of data science.

David Isaacs Paternostro: I’m David, I work as a teacher, teaching artificial intelligence, at the applied university of Utrecht. Before I started teaching I worked as a prototyping developer for AI research at Philips Research, developing intelligent agents using machine learning and reinforcement learning for consumer healthcare products. It was very interesting but my need for more social engagement has driven me towards teaching. A step that has made me a very happy person so far.