Additional Audio Feature Extraction Tips
I'm trying to create a speech emotion recognition model using Keras, I've done all of the code and have trained the model. It sits around 50% validation and is overfitting. When i
Solution 1:
A few observations for starter:
- Likely you have far too few samples to efficiently use neural networks. Use a simple algorithm for starter to understand well how your model is making prediction.
- Make sure you have enough (30% or more) samples from different speakers put aside for final testing. You can use this test set only once, so think about building a pipeline to generate train, validation and test sets. Make sure you don't put the same speaker into more than 1 set.
- First coefficient from
librosa
gives you AFAIK an offset. I'd recommend plotting how your features correlate with labels and how far they overlap, some can be easily confused I guess. Find if there are any feature that would differentiate your classes. Don't do this by running your model, do visual inspection first.
To the actual features! You're right to assume pitch should play a vital role. I'd recommend checking out aubio - it has Python bindings.
Yaafe also offers excellent selection of features.
You might easily end up with 150+ features. You might want to reduce dimensionality of the problem, perhaps even compress it to 2d and see if you can somehow separate the classes. Here is my own example with Dash.
Last but not least, some basic code to extract frequencies from the audio. In this case I am also trying to find three peak frequencies.
import numpy as np
def spectral_statistics(y: np.ndarray, fs: int, lowcut: int = 0) -> dict:
"""
Compute selected statistical properties of spectrum
:param y: 1-d signsl
:param fs: sampling frequency [Hz]
:param lowcut: lowest frequency [Hz]
:return: spectral features (dict)
"""
spec = np.abs(np.fft.rfft(y))
freq = np.fft.rfftfreq(len(y), d=1 / fs)
idx = int(lowcut / fs * len(freq) * 2)
spec = np.abs(spec[idx:])
freq = freq[idx:]
amp = spec / spec.sum()
mean = (freq * amp).sum()
sd = np.sqrt(np.sum(amp * ((freq - mean) ** 2)))
amp_cumsum = np.cumsum(amp)
median = freq[len(amp_cumsum[amp_cumsum <= 0.5]) + 1]
mode = freq[amp.argmax()]
Q25 = freq[len(amp_cumsum[amp_cumsum <= 0.25]) + 1]
Q75 = freq[len(amp_cumsum[amp_cumsum <= 0.75]) + 1]
IQR = Q75 - Q25
z = amp - amp.mean()
w = amp.std()
skew = ((z ** 3).sum() / (len(spec) - 1)) / w ** 3
kurt = ((z ** 4).sum() / (len(spec) - 1)) / w ** 4
top_peaks_ordered_by_power = {'stat_freq_peak_by_power_1': 0, 'stat_freq_peak_by_power_2': 0, 'stat_freq_peak_by_power_3': 0}
top_peaks_ordered_by_order = {'stat_freq_peak_by_order_1': 0, 'stat_freq_peak_by_order_2': 0, 'stat_freq_peak_by_order_3': 0}
amp_smooth = signal.medfilt(amp, kernel_size=15)
peaks, height_d = signal.find_peaks(amp_smooth, distance=100, height=0.002)
if peaks.size != 0:
peak_f = freq[peaks]for peak, peak_name in zip(peak_f, top_peaks_ordered_by_order.keys()):
top_peaks_ordered_by_order[peak_name] = peak
idx_three_top_peaks = height_d['peak_heights'].argsort()[-3:][::-1]
top_3_freq = peak_f[idx_three_top_peaks]
for peak, peak_name in zip(top_3_freq, top_peaks_ordered_by_power.keys()):
top_peaks_ordered_by_power[peak_name] = peak
specprops = {
'stat_mean': mean,
'stat_sd': sd,
'stat_median': median,
'stat_mode': mode,
'stat_Q25': Q25,
'stat_Q75': Q75,
'stat_IQR': IQR,
'stat_skew': skew,
'stat_kurt': kurt
}
specprops.update(top_peaks_ordered_by_power)
specprops.update(top_peaks_ordered_by_order)
return specprops
Post a Comment for "Additional Audio Feature Extraction Tips"