ML: Working with Audio Data

Sameer Mahajan
4 min read · May 3, 2023

Background

In machine learning, engineers deal with a wide variety of data: numerical, categorical, images, videos, speech, and so on. This article focuses on audio data, which is involved in activities like speech recognition, speech-to-text, text-to-speech, and audio classification. Working with this data involves, but is not limited to:

  • capturing audio
  • saving captured audio
  • playing saved audio
  • concatenating multiple audio files
  • splitting audio into multiple segments
  • looking at various metadata
  • converting data (e.g. from one frame rate to another)
  • removing silence
  • engineering features (like constructing waveforms, spectrograms etc.)
  • building models
  • making predictions

We will look at them with sample code and complete examples.

Capturing Audio

Different packages can be used for this purpose. Here we will use the SpeechRecognition package.

import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source)
    print("Please say something: \n")
    audio = r.listen(source)

fname = input("Please enter file name to save recording to: \n")
with open(fname + ".wav", "wb") as f:
    f.write(audio.get_wav_data())
print("Recorded successfully\n")

Saving Captured Audio

We already saw saving audio as a WAV file above:

with open(fname + ".wav", "wb") as f:
    f.write(audio.get_wav_data())

There are various formats for audio files; a few common ones are MP3, WAV, AAC, etc. Here is a nice article describing them. There are various packages and methods for reading and writing these formats.
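
As one example of moving between formats, pydub (used in the next section) can re-encode a WAV recording as MP3. This is a minimal sketch, assuming ffmpeg is installed and available on the PATH; the file names are illustrative.

from pydub import AudioSegment

# Read the WAV recording and re-encode it as MP3 (requires ffmpeg).
sound = AudioSegment.from_wav("1.wav")
sound.export("1.mp3", format="mp3", bitrate="192k")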

Playing Saved Audio

We will use the pydub package for this. Here we read a WAV file and play it.

from pydub import AudioSegment
from pydub.playback import play

playaudio = AudioSegment.from_file("1.wav", format="wav")
play(playaudio)

Concatenating Multiple Audio Files

This can be done with AudioSegment from the pydub package that we used above (see the sketch after the wave-based version below). It can also be done with the standard library's wave package, which we show here.

import wave

infiles = ["1.wav", "2.wav"]
outfile = "out.wav"

data = []
for infile in infiles:
    w = wave.open(infile, 'rb')
    data.append([w.getparams(), w.readframes(w.getnframes())])
    w.close()

output = wave.open(outfile, 'wb')
output.setparams(data[0][0])
output.writeframes(data[0][1])
output.writeframes(data[1][1])
output.close()
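
And here is the pydub alternative mentioned above, as a minimal sketch over the same two input files. AudioSegment overloads the + operator so that adding two segments appends one to the other.

from pydub import AudioSegment

# Concatenate the two clips by appending one AudioSegment to the other.
combined = AudioSegment.from_wav("1.wav") + AudioSegment.from_wav("2.wav")
combined.export("out.wav", format="wav")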

Splitting into Multiple Segments

Here we cut a segment out of an existing file into a new one; the segment's start and end times in milliseconds are given by the first and second command-line arguments respectively (pydub slices AudioSegments by milliseconds).

from pydub import AudioSegment
import sys

newAudio = AudioSegment.from_wav("oldSong.wav")
newAudio = newAudio[int(sys.argv[1]):int(sys.argv[2])]
newAudio.export('newSong.wav', format="wav")

Metadata

You can inspect various metadata attributes like this:

import wave
import sys

wf = wave.open(sys.argv[1], 'rb')
print(sys.argv[1],
      "#channels =", wf.getnchannels(),
      "#frames =", wf.getnframes(),
      "sample width =", wf.getsampwidth(),
      "sample rate =", wf.getframerate())

You can convert files with one metadata attribute value to another using pydub and its associated methods. The following code snippet changes the frame rate from 44100 to 48000.

from pydub import AudioSegment as am
import sys

sound = am.from_file(sys.argv[1], format='wav', frame_rate=44100)
sound = sound.set_frame_rate(48000)
sound.export(sys.argv[2], format='wav')

Removing Silence

Sometimes audio files contain long stretches of silence, which the feature engineering and model building we look at next do not handle well. Such silence can be removed using a Voice Activity Detection (VAD) technique, as sketched below.
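
Here is a minimal sketch of that idea, assuming the webrtcvad package and a 16-bit mono PCM WAV file at one of the sample rates webrtcvad supports (8, 16, 32, or 48 kHz); the file names are illustrative. It keeps only the frames the VAD classifies as speech.

import wave
import webrtcvad

def remove_silence(in_path, out_path, aggressiveness=2, frame_ms=30):
    # webrtcvad expects 16-bit mono PCM frames of 10, 20, or 30 ms.
    vad = webrtcvad.Vad(aggressiveness)
    with wave.open(in_path, 'rb') as wf:
        rate, width, channels = wf.getframerate(), wf.getsampwidth(), wf.getnchannels()
        pcm = wf.readframes(wf.getnframes())
    frame_bytes = int(rate * frame_ms / 1000) * width * channels
    # Keep only the frames that the VAD classifies as speech.
    voiced = b"".join(
        pcm[i:i + frame_bytes]
        for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)
        if vad.is_speech(pcm[i:i + frame_bytes], rate)
    )
    with wave.open(out_path, 'wb') as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        out.writeframes(voiced)

remove_silence("with_silence.wav", "no_silence.wav")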

Feature Engineering

WAV files already contain the raw waveform. Here we will see how to use TensorFlow and Keras to read a number of such files from a folder, split them into training and validation sets, pad or trim them to a fixed length, and feed them into the ML pipeline.

import tensorflow as tf
from keras import utils

train_ds, val_ds = utils.audio_dataset_from_directory(
    directory='audio_samples',
    batch_size=64,
    validation_split=0.2,
    seed=0,
    output_sequence_length=16000,
    subset='both')
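
One detail assumed here, following TensorFlow's audio recognition tutorial: for mono clips the dataset yields audio of shape (batch, samples, 1), and the trailing channel axis is squeezed out before computing the STFT so that tf.signal.stft frames the time axis. We also capture the class names for later, since a mapped dataset loses that attribute.

# Capture the class names before mapping (a mapped dataset loses this attribute).
label_names = train_ds.class_names

def squeeze(audio, labels):
    # Drop the trailing channel axis: (batch, samples, 1) -> (batch, samples).
    return tf.squeeze(audio, axis=-1), labels

train_ds = train_ds.map(squeeze, num_parallel_calls=tf.data.AUTOTUNE)
val_ds = val_ds.map(squeeze, num_parallel_calls=tf.data.AUTOTUNE)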

The following code computes a spectrogram from a waveform in TensorFlow and maps the datasets so that spectrograms, rather than raw waveforms, flow down the ML pipeline.

import tensorflow as tf

def get_spectrogram(waveform):
    # Convert the waveform to a spectrogram via an STFT.
    spectrogram = tf.signal.stft(
        waveform, frame_length=255, frame_step=128)
    # Obtain the magnitude of the STFT.
    spectrogram = tf.abs(spectrogram)
    # Add a `channels` dimension, so that the spectrogram can be used
    # as image-like input data with convolution layers (which expect
    # shape (`batch_size`, `height`, `width`, `channels`)).
    spectrogram = spectrogram[..., tf.newaxis]
    return spectrogram

def make_spec_ds(ds):
    return ds.map(
        map_func=lambda audio, label: (get_spectrogram(audio), label),
        num_parallel_calls=tf.data.AUTOTUNE)

train_spectrogram_ds = make_spec_ds(train_ds)
val_spectrogram_ds = make_spec_ds(val_ds)

Building Models

We can use various models based on our requirements. There are speech-to-text models like Whisper and Wav2Vec2, and phoneme-based models like Wav2Vec2Phoneme. We can use a variety of pre-trained models directly. We can also leverage pre-trained models for transfer learning and fine-tune them on our own training data, which does not have to be large. Or we can build and train a model from scratch. Below is first a brief sketch of using a pre-trained model directly, followed by the approach we take here: a Convolutional Neural Network (CNN) built and trained from scratch in TensorFlow.
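
As an illustration of using a pre-trained model directly, here is a minimal sketch with the Hugging Face transformers pipeline; the openai/whisper-tiny checkpoint and the file name are illustrative choices, and decoding the audio file relies on ffmpeg being installed.

from transformers import pipeline

# Load a small pre-trained speech-to-text model (illustrative checkpoint).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("test_file1.wav")["text"])

The from-scratch CNN we use here follows.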

import tensorflow as tf
from tensorflow.keras import layers, models

# Derive the model's input shape from one example batch of spectrograms and
# the number of output classes from the label names captured earlier.
for example_spectrograms, example_labels in train_spectrogram_ds.take(1):
    input_shape = example_spectrograms.shape[1:]
num_labels = len(label_names)

# Instantiate the `tf.keras.layers.Normalization` layer.
norm_layer = layers.Normalization()
# Fit the state of the layer to the spectrograms
# with `Normalization.adapt`.
norm_layer.adapt(data=train_spectrogram_ds.map(map_func=lambda spec, label: spec))

model = models.Sequential([
    layers.Input(shape=input_shape),
    # Downsample the input.
    layers.Resizing(32, 32),
    # Normalize.
    norm_layer,
    layers.Conv2D(32, 3, activation='relu'),
    layers.Conv2D(64, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_labels),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

EPOCHS = 10
history = model.fit(
    train_spectrogram_ds,
    validation_data=val_spectrogram_ds,
    epochs=EPOCHS,
    callbacks=tf.keras.callbacks.EarlyStopping(verbose=1, patience=2),
)

model.save("model-dir")

Here we have used a normalization layer, two convolutional layers, a pooling layer, dropout regularization, and ReLU activations.

Inferencing

Inferencing can be done on a saved test audio sample as below:

import numpy as np
import speech_recognition as sr
import tensorflow as tf
from tensorflow import keras

model = keras.models.load_model("model-dir")
r = sr.Recognizer()
test_file = sr.AudioFile('test_file1.wav')
with test_file as source:
    audio = r.record(source)

# Convert the 16-bit PCM bytes to a float32 waveform in [-1, 1] at the 16 kHz used in training.
pcm = audio.get_raw_data(convert_rate=16000, convert_width=2)
waveform = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
# Get the spectrogram for the sample as above and add a batch dimension.
spectrogram = get_spectrogram(tf.convert_to_tensor(waveform[:16000]))
prediction = model(spectrogram[tf.newaxis, ...])
idx = np.argmax(np.array(prediction[0]))

We can also run inference on live audio as it is being recorded, as sketched below.
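
A minimal sketch of that, reusing the capture code from the beginning of the article; the captured audio is then converted and classified exactly as in the file-based example above.

import speech_recognition as sr

r = sr.Recognizer()
# Record a short clip from the microphone at the training sample rate.
with sr.Microphone(sample_rate=16000) as source:
    r.adjust_for_ambient_noise(source)
    audio = r.listen(source)
# From here, convert `audio` to a float32 waveform and classify it
# exactly as in the file-based inference example above.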

You can find all these snippets of code and complete working examples in my GitHub repo.
