We have seen basics of Machine Learning, Classification and Regression. In this article, we will dive a little deeper and work on how we can do audio classification. We will train Convolution Neural Network, Multi Layer Perceptron and SVM for this task. The same code can be easily extended to train other classification models as well. I strongly recommend you to go through previous articles on basics of Classification if you haven't already done.
The main question here is to how we can handle audio files and convert it into a form which we can feed into our neural networks.
It will take less than an hour to setup and get your first working audio classifier! So let's get started! 😉
We will be using python. Before we can begin coding, we need to have below modules. This can be easily downloaded using pip.
We are going to use ESC-10 dataset for sound classification. It is a labeled set of 400 environmental recordings (10 classes, 40 clips per class, 5 seconds per clip). It is a subset of the larger ESC-50 dataset
Each class contains 40 .ogg files. The ESC-10 and ESC-50 datasets have been prearranged into 5 uniformly sized folds so that clips extracted from the same original source recording are always contained in a single fold.
Before we can extract features and train our model, we need to visualize waveform for the different classes present in our dataset.
import matplotlib.pyplot as plt import numpy as np import wave import soundfile as sf
The below function
visualize_wav() takes an ogg file, reads it using soundfile module and returns the data and sample rate. We can use
sf.wav() function to write wav file for the corresponding ogg file. Using matplotlib, we are plotting signal wave across time and generating the plot.
def visualize_wav(oggfile): data, samplerate = sf.read(oggfile) if not os.path.exists('sample_wav'): os.mkdir('sample_wav') sf.write('sample_wav/new_file.wav', data, samplerate) spf = wave.open('sample_wav/new_file_Fire.wav') signal = spf.readframes(-1) signal = np.fromstring(signal,'Int16') if spf.getnchannels() == 2: print('just mono files. not stereo') sys.exit(0) # plotting x axis in seconds. create time vector spaced linearly with size of audio file. divide size of signal by frame rate to get stop limit Time = np.linspace(0,len(signal)/samplerate, num = len(signal)) plt.figure(1) plt.title('Signal Wave Vs Time(in sec)') plt.plot(Time, signal) plt.savefig('sample_wav/sample_waveplot_Fire.png', bbox_inches='tight') plt.show()
You can run the same code to generate wave plot for different classes and visualize the difference.
For each audio file in the dataset, we will extract MFCC (mel-frequency cepstrum - we will have an image representation for each audio sample) along with it’s classification label. For this we will use Librosa’s
mfcc() function which generates an MFCC from time series audio data.
get_features() takes an .ogg file and extracts mfcc using Librosa library.
def get_features(file_name): if file_name: X, sample_rate = sf.read(file_name, dtype='float32') # mfcc (mel-frequency cepstrum) mfccs = librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40) mfccs_scaled = np.mean(mfccs.T,axis=0) return mfccs_scaled
The dataset is downloaded inside a folder "dataset". We will iterate through the subdirectories (each class) and extract features from their ogg files. Finally we will create a dataframe with mfcc feature and corresponding class label.
def extract_features(): # path to dataset containing 10 subdirectories of .ogg files sub_dirs = os.listdir('dataset') sub_dirs.sort() features_list =  for label, sub_dir in enumerate(sub_dirs): for file_name in glob.glob(os.path.join('dataset',sub_dir,"*.ogg")): print("Extracting file ", file_name) try: mfccs = get_features(file_name) except Exception as e: print("Extraction error") continue features_list.append([mfccs,label]) features_df = pd.DataFrame(features_list,columns = ['feature','class_label']) print(features_df.head()) return features_df
Once we have extracted features, we need to convert them into numpy array so that they can be feeded into neural network.
def get_numpy_array(features_df): X = np.array(features_df.feature.tolist()) y = np.array(features_df.class_label.tolist()) # encode classification labels le = LabelEncoder() # one hot encoded labels yy = to_categorical(le.fit_transform(y)) return X,yy,le
yy are splitted into train and test data in ratio 80-20.
def get_train_test(X,y): X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42) return X_train, X_test, y_train, y_test
Now we will define our model architecture. We will use Keras for creating our Multi Layer Perceptron network.
def create_mlp(num_labels): model = Sequential() model.add(Dense(256,input_shape = (40,))) model.add(Activation('relu')) model.add(Dropout(0.5)) model.add(Dense(256,input_shape = (40,))) model.add(Activation('relu')) model.add(Dropout(0.5)) model.add(Dense(num_labels)) model.add(Activation('softmax')) return model
Once the model is defined, we need to compile it by defining loss, metrics and optimizer. The model is then fitted to training data
y_train. Our model is trained for 100 epochs with a batch size of 32. The trained model is finally saved as .hd5 file in the disk. This model can be loaded later for prediction.
def train(model,X_train, X_test, y_train, y_test,model_file): # compile the model model.compile(loss = 'categorical_crossentropy',metrics=['accuracy'],optimizer='adam') print(model.summary()) print("training for 100 epochs with batch size 32") model.fit(X_train,y_train,batch_size= 32, epochs = 100, validation_data=(X_test,y_test)) # save model to disk print("Saving model to disk") model.save(model_file)
This is it! We have trained our Environmental Sound Classifier!! 😃
Now obviously we want to check how well our model is performing 😛
def compute(X_test,y_test,model_file): # load model from disk loaded_model = load_model(model_file) score = loaded_model.evaluate(X_test,y_test) return score,score*100
Test loss 1.5628961682319642 Test accuracy 78.7
We can also predict the class label for any input file we provide using below code -
def predict(filename,le,model_file): model = load_model(model_file) prediction_feature = extract_features.get_features(filename) if model_file == "trained_mlp.h5": prediction_feature = np.array([prediction_feature]) elif model_file == "trained_cnn.h5": prediction_feature = np.expand_dims(np.array([prediction_feature]),axis=2) predicted_vector = model.predict_classes(prediction_feature) predicted_class = le.inverse_transform(predicted_vector) print("Predicted class",predicted_class) predicted_proba_vector = model.predict_proba([prediction_feature]) predicted_proba = predicted_proba_vector for i in range(len(predicted_proba)): category = le.inverse_transform(np.array([i])) print(category, "\t\t : ", format(predicted_proba[i], '.32f') )
This function will load our pre-trained model, extract mfcc from the input ogg file you have provided and output range of probabilities for each class. The one with maximum probability is our desired class! 😃
For a sample ogg file of dog class, following were the probability predictions -
Predicted class 0 0 : 0.96639919281005859375000000000000 1 : 0.00000196780410988139919936656952 2 : 0.00000063572736053174594417214394 3 : 0.00000597824555370607413351535797 4 : 0.02464177832007408142089843750000 5 : 0.00003698830187204293906688690186 6 : 0.00031352625228464603424072265625 7 : 0.00013375715934671461582183837891 8 : 0.00846461206674575805664062500000 9 : 0.00000165236258453660411760210991
The class predicted is
0 which was the class label for
Working on audio files is not that tough as it sounded in the first place. Audio files can easily be represented in form of time series data. We have predefined libraries in python which makes our task more simpler.
You can also check the entire code for this in my Github repo. Here I have trained SVM, MLP and CNN for the same dataset and code is arranged in proper files which makes it easy to understand.
Though I have trained 3 different models for this, there was very little variance in accuracy among them. Do leave comments if you find any way to improve this score.
If you liked the article do show some ❤ Stay tuned for more! Till then happy learning 😸