Classify Music or Voice Using a Neural Network

Recently I found an interesting site that deals with music genres. The site seems to be unmaintained, but it still hosts some interesting datasets. One of these datasets contains sample wave files split between human voice and music. I downloaded it out of interest. It is a very small dataset (60 files each for voice and music), and the idea here is to see whether I can use it to train a model that distinguishes between the two.

Please go here for the dataset (GTZAN music speech collection):

http://marsyas.info/downloads/datasets.html 

The zip extract contains four directories: music, music_wav, speech, speech_wav. The music and speech directories contain AU (.au) audio files, while the directories with _wav appended to their names contain Wave (.wav) files.

Sample files present in the dataset

Mel Spectrogram

Most audio machine learning tasks convert audio to Mel spectrograms. A Mel spectrogram is a spectrogram whose frequency axis is converted to the Mel scale, a perceptual scale based on how humans hear pitch. On the Mel scale, a given frequency gap at the low end corresponds to a much greater Mel distance than the same gap at the high end, which mirrors the fact that we are better at telling low frequencies apart.
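To make that concrete, here is a tiny sketch (my own addition, not part of the original walkthrough) that uses librosa.hz_to_mel to show how the same 100 Hz gap covers several times more Mel units at the low end of the spectrum than at the high end.

import numpy as np
import librosa

# The same 100 Hz gap, measured in Mel units, at the low and high end
low = librosa.hz_to_mel(np.array([100.0, 200.0]))
high = librosa.hz_to_mel(np.array([7000.0, 7100.0]))
print('Mel distance for 100 -> 200 Hz  : %.1f' % (low[1] - low[0]))
print('Mel distance for 7000 -> 7100 Hz: %.1f' % (high[1] - high[0]))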

We will not go into too much detail here because the good people who made ‘librosa’ gave us a one-liner to compute Mel spectrograms. Let’s plot a Mel spectrogram next. We will use matplotlib for the plotting part.

Let’s define a function to get a random file from the directory.

def get_random_filename(self, dirname):
  # Pick one file at random from the given directory
  files = glob.glob(dirname + '/*')
  return random.sample(files, k=1)[0]

Next let’s define a function to accept an array of files and draw the Mel spectrograms for all of them.

def plot_spectrogram(self, files):
  plt.figure(figsize=(6, 7), dpi=100)
  plt.subplots_adjust(hspace=0.3)
  for i, filename in enumerate(files):
    plt.subplot(len(files), 1, i+1)
    y, sr = librosa.load(filename)
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=1024)
    # Convert the power spectrogram to decibels for plotting
    mel_spec = librosa.power_to_db(spec, ref=np.max)
    plt.title(os.path.basename(filename))
    librosa.display.specshow(mel_spec, x_axis='time', y_axis='mel', cmap='magma')
    plt.colorbar(format='%+2.0f dB')

  plt.suptitle('Mel Spectrogram')
  plt.show()

We are using librosa.feature.melspectrogram to compute the spectrogram from the audio signal. Here y is the audio time series and sr is its sample rate. n_fft is the length of the windowed signal; 2048 is the default value and corresponds to a physical duration of about 93 milliseconds at a sample rate of 22050 Hz (the default sample rate of librosa). We also pass a hop_length of 1024, which is the number of samples librosa advances between two consecutive frames.
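As a quick back-of-the-envelope check (my own sketch, assuming librosa's default 22050 Hz sample rate), these numbers work out as follows:

sr = 22050          # librosa's default sample rate
n_fft = 2048        # samples per analysis window
hop_length = 1024   # samples between consecutive frames

print('Window duration: %.1f ms' % (1000.0 * n_fft / sr))        # ~92.9 ms
print('Hop duration   : %.1f ms' % (1000.0 * hop_length / sr))   # ~46.4 ms
# Approximate number of spectrogram frames for a 30 second clip
print('Frames in 30 s : %d' % (1 + (30 * sr) // hop_length))     # ~646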

Here are Mel spectrograms for two random files. The file on top is a music file, and the one below is speech.

Extract Mel Spectrogram for Training

We will save each file's Mel spectrogram to disk and record its path and label in a CSV file. For this we are going to use Pandas. We will assign all music files a 1 and all speech files a 0 (binary encoding), which turns this into a binary classification problem.

Let us first write a method to calculate the Mel spectrogram for a sound file.

def extract_mel_spectrogram_mean(self, infile):
  print('Processing: %s' % infile)
  file_parts = os.path.splitext(os.path.basename(infile))
  out_path = os.path.join(SPEC_FILE_LOCATION, file_parts[0] + '.npy')
  X, sample_rate = librosa.core.load(infile)
  # Average the Mel spectrogram over time to get a fixed-length feature vector
  mel_spec = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
  np.save(out_path, mel_spec)
  return out_path

The code above averages the Mel spectrogram over time, collapsing it into a fixed-length vector (128 values, librosa's default number of Mel bands), and saves that vector as a .npy file in a dedicated directory. Now let us calculate this for all files and record them in a CSV.

def prepare_data(self):
  """
  As a first step we will extract the Mel spectrograms and store them.
  We will also create a data file with all the file names and type.
  Identify: MUSIC - 1, SPEECH - 0
  """
  MiscUtils.ensure_path(DATA_FILE_LOCATION)
  MiscUtils.ensure_path(SPEC_FILE_LOCATION)

  df = pd.DataFrame(columns=['filepath', 'type'])
  # Let's start reading the music location now
  music_list = glob.glob(MUSIC_FILE_LOCATION + '/*.wav')
  speech_list = glob.glob(SPEECH_FILE_LOCATION + '/*.wav')
  for music in music_list:
    save_path = self.extract_mel_spectrogram_mean(music)
    df = df.append({'filepath': save_path, 'type': 1}, ignore_index=True)

  # And speech
  for speech in speech_list:
    save_path = self.extract_mel_spectrogram_mean(speech)
    df = df.append({'filepath': save_path, 'type': 0}, ignore_index=True)

  csv_file = os.path.join(DATA_FILE_LOCATION, EXPORT_FILE)
  df.to_csv(csv_file, index=False)

We start by creating a pandas data frame, then iterate through each directory, extract the Mel spectrogram for every file, and add the saved file's location and label to the data frame. Finally, we store the data frame as a CSV file.

filepath,type
/Users/suvendra/codes/ML/temp/music/npy/opera.npy,1
/Users/suvendra/codes/ML/temp/music/npy/madradeus.npy,1
/Users/suvendra/codes/ML/temp/music/npy/bartok.npy,1
/Users/suvendra/codes/ML/temp/music/npy/winds.npy,1
/Users/suvendra/codes/ML/temp/music/npy/gravity2.npy,1
/Users/suvendra/codes/ML/temp/music/npy/god.npy,0
/Users/suvendra/codes/ML/temp/music/npy/teachers1.npy,0
/Users/suvendra/codes/ML/temp/music/npy/dialogue1.npy,0
/Users/suvendra/codes/ML/temp/music/npy/amal.npy,0
/Users/suvendra/codes/ML/temp/music/npy/serbian.npy,0
/Users/suvendra/codes/ML/temp/music/npy/undergrad.npy,0
Sample CSV data

Train Model

The next step in the process is to train the model. For training, I simply loaded the data into memory: the Mel vectors from the .npy files (X, the features) and the speech/music indicator (y, the labels). Then I split the data into training, validation, and test sets. Since this code is boilerplate for any training run, I am not including it here; you can refer to the full source code uploaded to my repository.
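For reference, here is a minimal sketch of what the loading and splitting helpers might look like. This is my own illustration, not the exact code from the repository; it assumes the CSV produced by prepare_data and uses scikit-learn's train_test_split.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def load_data(csv_file):
  # Load the mean Mel vectors referenced in the CSV along with their labels
  df = pd.read_csv(csv_file)
  X = np.array([np.load(path) for path in df['filepath']])
  y = df['type'].values.astype('float32')
  return X, y

def split_data(X, y, test_size=0.1, valid_size=0.1):
  # Hold out a test set first, then carve a validation set out of the remainder
  X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, stratify=y, random_state=42)
  X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=valid_size / (1 - test_size),
    stratify=y_train, random_state=42)
  return {'X_train': X_train, 'y_train': y_train,
          'X_valid': X_valid, 'y_valid': y_valid,
          'X_test': X_test, 'y_test': y_test}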

Next I define an ANN model as given below.

@staticmethod
def create_ann_model():
  model = models.Sequential()
  model.add(layers.Dense(128, input_dim=MEL_VECTOR_SIZE, activation='relu'))
  model.add(layers.Dropout(0.3))
  model.add(layers.Dense(256, activation='relu'))
  model.add(layers.Dropout(0.3))
  model.add(layers.Dense(128, activation='relu'))
  model.add(layers.Dropout(0.3))
  model.add(layers.Dense(64, activation='relu'))
  model.add(layers.Dropout(0.3))
  model.add(layers.Dense(1, activation='sigmoid'))
  model.compile(loss='binary_crossentropy', metrics=['accuracy'], optimizer='adam')
  model.summary()
  return model

Since the number of files in this dataset is small, I added quite a few Dropout layers so that the model does not overfit the data. The loss is binary cross-entropy, since this is a binary output (music / not music).
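To make the loss concrete, here is a small hand-worked example (my own addition): binary cross-entropy is small when the predicted probability agrees with the label and grows quickly when it does not.

import numpy as np

def bce(y_true, p_pred, eps=1e-7):
  # Binary cross-entropy for a single sample; y_true is 0 or 1,
  # p_pred is the predicted probability that the clip is music
  p_pred = np.clip(p_pred, eps, 1 - eps)
  return -(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print('%.3f' % bce(1, 0.95))  # confident and correct -> about 0.051
print('%.3f' % bce(1, 0.10))  # confident but wrong   -> about 2.303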

def start_training(self, X, y, batch_size=32, epochs=10):
  data = self.split_data(X, y)
  model = DataTrainer.create_ann_model()

  # Stop training when validation loss stops improving; keep the best weights seen
  early_stopping = callbacks.EarlyStopping(monitor='val_loss', mode='min', patience=5, restore_best_weights=True)
  model.fit(data['X_train'], data['y_train'], epochs=epochs, batch_size=batch_size,
            validation_data=(data['X_valid'], data['y_valid']), callbacks=[early_stopping])

  # save the model to a file
  save_file = os.path.join(DATA_FILE_LOCATION, H5_FILE_NAME)
  model.save(save_file)

  # Calculate loss
  print('Fit done... continuing to calculate loss...')
  loss, accuracy = model.evaluate(data['X_test'], data['y_test'], verbose=0)
  print('Loss: %.4f, Accuracy: %.2f %%' % (loss, accuracy * 100))

Again, not much explanation is needed here. We run the trainer on the model we created above, and after training completes we evaluate the loss and accuracy on the held-out test set. Below are the run logs for this training.

Loading data...
Start training...
Metal device set to: Apple M1
::::
=================================================================
Total params: 90,753
Trainable params: 90,753
Non-trainable params: 0
_________________________________________________________________
::::
Epoch 1/100
2022-02-03 23:36:49.009521: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
1/4 [======>.......................] - ETA: 0s - loss: 0.9451 - accuracy: 0.4688
4/4 [==============================] - 1s 63ms/step - loss: 0.7368 - accuracy: 0.6311 - val_loss: 0.6618 - val_accuracy: 0.6667
Epoch 2/100
4/4 [==============================] - 0s 11ms/step - loss: 0.8225 - accuracy: 0.6311 - val_loss: 0.6803 - val_accuracy: 0.5000
Epoch 3/100
4/4 [==============================] - 0s 11ms/step - loss: 0.6893 - accuracy: 0.6990 - val_loss: 0.6120 - val_accuracy: 0.8333
::::
Epoch 18/100
4/4 [==============================] - 0s 12ms/step - loss: 0.2368 - accuracy: 0.9029 - val_loss: 0.5786 - val_accuracy: 0.7500
Epoch 19/100
4/4 [==============================] - 0s 13ms/step - loss: 0.2344 - accuracy: 0.9126 - val_loss: 0.5648 - val_accuracy: 0.7500
Fit done... continuing to calculate loss...
Loss: 0.1948, Accuracy: 100.00 %

During this run, the accuracy came out at 100.00% and the loss was minimal. This is probably the first time I got that result; in most of my previous runs the accuracy was somewhere above 82%. Either way, we can say that the model achieved a good fit.

Overfitting and Dropouts

When we train models on small datasets, there is always a risk that the model fits the training points too closely. In that case, accuracy degrades when we run it against real-world data. One way to reduce this overfitting is to use Dropout layers in the model: during training, a random subset of the nodes in the fully connected layers is switched off, which acts like noise and prevents any single node from being relied on too heavily.

Artificial Neural Net – fully connected layers
Artificial Neural Net with dropouts introduced

The first image shows a fully connected neural network. In this case there are two hidden layers and every node is connected to every node in the next layer. The second image shows the same model with dropout applied: some of the nodes are randomly dropped out during each training step. Because the next layer cannot rely on those missing nodes, it has to compensate, and the resulting training is much more generalized.
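A quick way to see what a Dropout layer actually does (again, my own sketch, not from the original post): during training it zeroes a random subset of activations and scales the survivors up to keep the expected sum constant, while at inference time it passes its input through unchanged.

import numpy as np
import tensorflow as tf

layer = tf.keras.layers.Dropout(0.3)
x = np.ones((1, 10), dtype='float32')

# Training mode: roughly 30% of the values become 0, the rest are scaled by 1/0.7
print(layer(x, training=True).numpy())
# Inference mode: the input is returned unchanged
print(layer(x, training=False).numpy())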

Testing the Model

A model is only as good as its performance on real-world data, so let's find some and test our model against it. For this I went to a different website and grabbed some audio files.

https://www2.cs.uic.edu/~i101/SoundFiles/

This is the website I got some audio files from. I randomly downloaded four files for testing.

Gettysburg.wav (Speech)
CantinaBand60.wav (Music)
Taunt.wav (Speech)
PinkPanther30.wav (Music)

Also, I wrote a small piece of code to run a test on these files.

def run_test(self, files):
  # First load the model
  model_file = os.path.join(DATA_FILE_LOCATION, H5_FILE_NAME)
  model = models.load_model(model_file)
  model.summary()

  for afile in files:
    print('Analyzing: %s' % afile)
    X, sample_rate = librosa.core.load(afile)
    # Same feature extraction as in training: mean Mel spectrogram, reshaped to a single-row batch
    mel_spec = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0).reshape(1, -1)
    music_prob = float(model.predict(mel_spec)[0][0])
    type_media = 'Music' if music_prob > 0.50 else 'Voice'
    print('%s is %s. Probability it is music: %.3f' % (afile, type_media, music_prob))

The test results were very promising.

gettysburg.wav is Voice. Probability it is music: 0.258
CantinaBand60.wav is Music. Probability it is music: 0.954
taunt.wav is Voice. Probability it is music: 0.000
PinkPanther30.wav is Music. Probability it is music: 0.812

All of these results were accurate and matched what the files contained. Not a bad model, considering how small the dataset was.

Conclusion

I would consider this experiment a success. When I started, the intent was to find out whether we could come up with a good model given such a small dataset. One of the side goals was to try a CNN if the ANN did not perform well; based on the results the dense model achieved, I gave up on that idea. You will not often see a perfect accuracy score from an ANN model. You can get the code here. Ciao for now!