Delta MFCC Calculation

I'm currently experimenting with various feature vectors in order to maximise my speech recognition classification accuracy. I've read that using delta MFCCs and delta-delta MFCCs can improve the classification results.
My cross-validation test without deltas gave 98% accuracy, but it decreased by 3% when I added deltas.
Can the delta calculation for MFCCs be done with a simple difference operation?
Sharing my code below:
import numpy as np

def getDeltaMFCC(mfcc_feat, index, n):
    # Simple first-order difference between consecutive frames.
    deltas = []
    for i in range(index, index + n):
        delt = np.subtract(mfcc_feat[i + 1], mfcc_feat[i])
        deltas.append(delt)
    return np.array(deltas)

mfcc_delta = getDeltaMFCC(mfcc_normalised, 0, 13)

Usually you don't just take two adjacent feature frames to compute the delta, but perform a regression over multiple frames to come up with more stable deltas.
See here for a corresponding formula.
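As an illustration, here is a minimal sketch of such a regression-style delta, using the common formula d_t = sum_{k=1..N} k * (c_{t+k} - c_{t-k}) / (2 * sum_{k=1..N} k^2), typically with N = 2; it assumes mfcc_feat is a 2-D array of shape (frames, coefficients):

import numpy as np

def regression_deltas(mfcc_feat, N=2):
    # d_t = sum_{k=1..N} k * (c_{t+k} - c_{t-k}) / (2 * sum_{k=1..N} k^2)
    denom = 2 * sum(k * k for k in range(1, N + 1))
    # Repeat the first/last frame so t-k and t+k always exist.
    padded = np.pad(mfcc_feat, ((N, N), (0, 0)), mode="edge")
    deltas = np.zeros_like(mfcc_feat, dtype=float)
    for t in range(len(mfcc_feat)):
        deltas[t] = sum(k * (padded[t + N + k] - padded[t + N - k])
                        for k in range(1, N + 1)) / denom
    return deltas

Delta-deltas are then just the same operation applied to the deltas.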

Related

Multi-feature modeling based on one binary feature which is rarely 1 (imbalanced data) when there is a cost

I need to model multivariate time-series data to predict a binary target that is rarely 1 (imbalanced data).
This means we want to model based on one binary feature (outbreak) that is rarely 1.
All of the features are binary and rarely 1.
What is the suggested solution?
This feature drives the cost function given below. We want to decide, for each day, whether to prepare or not, given the following costs.
Problem Definition:
Model based on outbreak, which is rarely 1.
Decide whether or not to prepare for a disease outbreak, where the cost of an unprepared outbreak is 20 times the cost of preparation.
Cost for each day (the next day):
cost = 20 * outbreak * (not prepared) + prepared
Model: on which days should we prepare (for the next day) for an outbreak?
Questions:
Build a model to predict outbreaks.
Report the cost estimate for every year.
A CSV file is uploaded; the data are end-of-day values.
Each row of the CSV file is one day with its features; some of them are binary, and the last feature is outbreak, which is rarely 1 and is the main feature entering the cost.
You are describing class imbalance. A typical approach is to generate balanced training data by repeatedly running through the examples containing your (rare) positive class and, each time, choosing a new random sample from the negative class; a sketch of this follows below.
Also, pay attention to your cost function. You wouldn't want to reward a simple model for always choosing the majority class.
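A minimal sketch of that resampling idea (hypothetical names; X and y are assumed to be NumPy arrays with y in {0, 1}): each training round keeps all positives and draws a fresh, equally sized random subset of negatives.

import numpy as np

def balanced_epoch(X, y, rng=np.random.default_rng()):
    # Keep every positive example, sample an equal number of negatives.
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    sampled_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, sampled_neg])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Call once per training round so each round sees a different negative sample.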
My suggestions:
Supervised Approach
SMOTE for upsampling
XGBoost, tuning scale_pos_weight (see the sketch after this list)
Replicate the minority class, e.g. 10 times
Try ensemble tree algorithms; trying to fit a linear decision surface is risky in your case.
Since your data is a time series, you can generate extra minority-class days just before the real disease event. For example, if you have a minority-class day at 2010-07-20 and the previous observation is at 2010-06-27, you can generate synthetic observations for 2010-07-15, 2010-07-18, etc. by slightly perturbing the feature values.
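A minimal sketch of the scale_pos_weight suggestion above (assuming xgboost is installed; X and y are hypothetical names for the feature matrix and the rare binary target):

import numpy as np
from xgboost import XGBClassifier

# Weight positives by the negative/positive ratio so the rare class is not ignored.
ratio = float(np.sum(y == 0)) / np.sum(y == 1)
model = XGBClassifier(n_estimators=200, scale_pos_weight=ratio)
model.fit(X, y)
outbreak_proba = model.predict_proba(X)[:, 1]   # probability of an outbreak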
Unsupervised Approach
Try anomaly detection algorithms such as IsolationForest (try the extended version of it as well).
Cluster your observations and check whether the minority class becomes a cluster of its own. If it does, you can label your data with the cluster names (cluster1, cluster2, cluster3, etc.) and then train a decision tree to see the split patterns (KMeans + DecisionTreeClassifier); a sketch follows below.
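A minimal sketch of the KMeans + DecisionTreeClassifier idea (scikit-learn assumed; X is the same hypothetical feature matrix, n_clusters is just an example value):

from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Cluster the observations, then fit a shallow tree on the cluster labels
# to read off which feature splits define each cluster.
clusters = KMeans(n_clusters=3, random_state=0).fit_predict(X)
tree = DecisionTreeClassifier(max_depth=3).fit(X, clusters)
print(export_text(tree))
# Compare clusters against the rare label, e.g. with pd.crosstab(clusters, y).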
Model Evaluation
Set up a cost matrix. Do not use confusion-matrix metrics such as precision directly. You can find further information about cost matrices here: http://mlwiki.org/index.php/Cost_Matrix
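For this problem the cost matrix follows directly from the stated costs; a minimal sketch of the resulting evaluation (y_true and prepared are hypothetical 0/1 arrays for actual outbreaks and preparation decisions):

import numpy as np

def total_cost(y_true, prepared):
    # cost = 20 * outbreak * (not prepared) + prepared, summed over days
    y_true = np.asarray(y_true)
    prepared = np.asarray(prepared)
    return np.sum(20 * y_true * (1 - prepared) + prepared)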
Note:
According to the OP's question in the comments, a group-by over years could be done like this:
df["date"] = pd.to_datetime(df["date"])
df.groupby(df["date"].dt.year).mean()
You can use other aggregators as well (sum, count, etc.).

How to split the training data and test data for LSTM for time series prediction in Tensorflow

I recently learned about LSTMs for time series prediction from
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
In his tutorial, he says: Instead of training the Recurrent Neural Network on the complete sequences of almost 300k observations, we will use the following function to create a batch of shorter sub-sequences picked at random from the training-data.
def batch_generator(batch_size, sequence_length):
    """
    Generator function for creating random batches of training-data.
    """
    # Infinite loop.
    while True:
        # Allocate a new array for the batch of input-signals.
        x_shape = (batch_size, sequence_length, num_x_signals)
        x_batch = np.zeros(shape=x_shape, dtype=np.float16)
        # Allocate a new array for the batch of output-signals.
        y_shape = (batch_size, sequence_length, num_y_signals)
        y_batch = np.zeros(shape=y_shape, dtype=np.float16)
        # Fill the batch with random sequences of data.
        for i in range(batch_size):
            # Get a random start-index.
            # This points somewhere into the training-data.
            idx = np.random.randint(num_train - sequence_length)
            # Copy the sequences of data starting at this index.
            x_batch[i] = x_train_scaled[idx:idx+sequence_length]
            y_batch[i] = y_train_scaled[idx:idx+sequence_length]
        yield (x_batch, y_batch)
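(For context, such a generator is consumed one batch at a time; a minimal usage sketch with example values, assuming the notebook's num_train, num_x_signals, etc. are defined:)

generator = batch_generator(batch_size=256, sequence_length=24 * 7)
x_batch, y_batch = next(generator)
print(x_batch.shape, y_batch.shape)  # (256, 168, num_x_signals), (256, 168, num_y_signals)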
He creates several batch samples for training in this way.
My question is: can we first randomly shuffle x_train_scaled and y_train_scaled, and then start sampling batches using the batch_generator above?
My motivation for this question is that, for time series prediction, we want to train on the past and predict the future. Therefore, is it legitimate to shuffle the training samples?
In the tutorial, the author picks a stretch of contiguous samples, such as
x_batch[i] = x_train_scaled[idx:idx+sequence_length]
y_batch[i] = y_train_scaled[idx:idx+sequence_length]
Can we pick x_batch and y_batch that are not contiguous? For example, x_batch[0] is picked at 10:00 am and x_batch[1] is picked at 9:00 am on the same day?
In summary, the two questions are:
(1) Can we first randomly shuffle x_train_scaled and y_train_scaled, and then sample batches using the batch_generator above?
(2) When we train an LSTM, do we need to consider the influence of time order? What parameters does the LSTM learn?
Thanks
(1) We cannot. Imagine trying to predict the weather for tomorrow. Would you want a sequence of temperature values for the last 10 hours or would you want random temperature values of the last 5 years?
Your dataset is a long sequence of values at 1-hour intervals. Your LSTM takes in a sequence of samples that are chronologically connected. For example, with sequence_length = 10 it can take the data from 2018-03-01 09:00:00 to 2018-03-01 19:00:00 as input. If you shuffle the dataset before generating the batches that consist of these sequences, you will train your LSTM to predict based on a sequence of random samples from your whole dataset.
(2) Yes, we need to consider temporal ordering for time series. You can find ways to test your time series LSTM in python here: https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
The train/test data must be split in a way that respects the temporal ordering, so that the model is never trained on data from the future and is only tested on data from the future; a sketch of such a split is below.
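A minimal sketch of such a chronological split (hypothetical names; x_scaled and y_scaled are arrays already ordered by time, and the 80/20 cut is just an example):

import numpy as np

def chronological_split(x, y, train_fraction=0.8):
    # Earlier part for training, later part for testing; no shuffling.
    split = int(len(x) * train_fraction)
    return x[:split], y[:split], x[split:], y[split:]

x_train, y_train, x_test, y_test = chronological_split(x_scaled, y_scaled)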
It depends a lot on the dataset. For example, the weather on a random day in the dataset is highly related to the weather of the surrounding days. So, in this case, you should try a stateful LSTM (i.e., an LSTM that uses the previous records as input to the next one) and train in order.
However, if your records (or a transformation of them) are independent of each other, but depend on some notion of time, such as the inter-arrival time of the items in a record or a subset of these records, there should be noticeable differences when using shuffling. In some cases, it will improve the robustness of the model; in other cases, it will not generalize. Noticing these differences is part of the evaluation of the model.
In the end, the question is: is the "time series" as given really a time series (i.e., do records really depend on their neighbours), or is there some transformation that can break this dependency but preserve the structure of the problem? And for this question there is only one way to get to the answer: explore the dataset.
As for authoritative references, I will have to let you down. I learned this from a seasoned researcher in the field; however, according to him, he learned it through a lot of experimentation and failure. As he told me: these aren't rules, they are guidelines; try all the solutions that fit your budget; improve on the best ones; try again.

Scaling data real time for LibSVM

I am using LibSVM to classify data. I train and test the classifier with linearly scaled feature data on the interval [-1 1]. After establishing a model which produces acceptable accuracy, I want to classify new data which arrives periodically, almost in real time.
I don't know how to rescale the feature columns of the 'real time' data to the interval [-1, 1], since I'm only generating one row of features for this input data. If I were to store the min/max values of the training/testing feature columns (in order to scale new data), there is the possibility that the new real-time data falls outside this min/max range; the model would then no longer be valid, and I would have to rescale all prior data to accommodate the new min/max and generate a new model.
I have thought about using other scaling techniques such as mean normalization, but I have read that SVM works particularly well with linearly scaled features so I am hesitant to apply another methodology.
How does one deal with the rescaling of new features to a linear interval, when the new features are a single row vector, and could have higher/lower feature values than the max/min feature values used in rescaling the training data?
This is the equation I'm using to rescale the training/testing feature set (linear scaling to [-1, 1]): x_scaled = -1 + 2 * (x - x_min) / (x_max - x_min).
Even if one were to use another feature-scaling technique (such as mean normalization), would it be prudent, with each additional 'real-time' classification, to recalculate the mean, min, and max over ALL data (new, test, and train) before rescaling, or is it acceptable to use the stored scaling values from training/testing for new samples, until the classifier is retrained to account for all the newly acquired data?
All in all, I think what I'm having trouble with is: how does one deal with linear feature scaling in an 'online' classification problem?
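A minimal sketch of one common way to handle this (an assumption, not from the question: keep the per-feature min/max from the training set and apply the same transform to each new row, optionally clipping values that fall outside the training range); X_train and new_row are hypothetical names:

import numpy as np

def fit_scaler(X_train):
    # Per-feature min/max learned once, on the training set only.
    return X_train.min(axis=0), X_train.max(axis=0)

def scale_row(x, x_min, x_max, clip=True):
    scaled = -1.0 + 2.0 * (x - x_min) / (x_max - x_min)
    # Optionally clamp values outside the training range to [-1, 1].
    return np.clip(scaled, -1.0, 1.0) if clip else scaled

x_min, x_max = fit_scaler(X_train)
new_row_scaled = scale_row(new_row, x_min, x_max)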

scikit-learn SVM with a lot of samples / mini batch possible?

According to http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html I read:
"The fit time complexity is more than quadratic with the number of samples which makes it hard to scale to dataset with more than a couple of 10000 samples."
I currently have 350,000 samples and 4,500 classes, and these numbers will grow further to 1-2 million samples and 10k+ classes.
My problem is that I am running out of memory. All is working as it should when I use just 200,000 samples with less than 1000 classes.
Is there a way to build in or use something like mini-batches with SVM? I saw that MiniBatchKMeans exists, but I don't think it's for SVM.
Any input welcome!
I mentioned this problem in my answer to this question.
You can split your large dataset into batches that can be safely consumed by an SVM algorithm, then find support vectors for each batch separately, and then build a resulting SVM model on a dataset consisting of all the support vectors found in all the batches.
Also, if there is no need for kernels in your case, you can use sklearn's SGDClassifier, which implements stochastic gradient descent and fits a linear SVM by default; a mini-batch sketch follows below.
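A minimal sketch of mini-batch training with SGDClassifier (hinge loss, i.e. a linear SVM objective; batch_iterator and all_classes are hypothetical names for your data pipeline and the full label set):

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge")   # hinge loss => linear SVM objective

first = True
for X_batch, y_batch in batch_iterator():
    if first:
        # partial_fit needs the full set of class labels on the first call.
        clf.partial_fit(X_batch, y_batch, classes=all_classes)
        first = False
    else:
        clf.partial_fit(X_batch, y_batch)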

How to do multi-label classification in Apache Spark

I want to do multi-label text classification on a big data set, and it seems that big-data machine learning tools such as Apache Mahout or Spark MLlib do not currently support that. I would like to know whether anyone has done multi-label classification on big data sets before. Are there any plans to integrate multi-label classification into either Mahout or Spark in the near future?
This paper addresses the nature of the benefits you would receive from multioutput forecasting... namely:
The ability to account for multiple independent input parameters when making a prediction, rather than having to continuously update your metrics for each nth-index prediction you are trying to make within a given forecast.
Computational speed is increased.
Based on your needs, I would recommend trying to down-sample to a smaller group for your current problem, and then creating multiple models around bespoke groups within your dataset if performance does not match what you are looking for.
I am still encountering this challenge myself (4 years since your post...).
Here is a list of helpful articles that I have collected while trying to address this:
Long-term forecasting with machine learning models
Sorry ARIMA, but I’m Going Bayesian
Multiple-Output Modelling for Multi-Step-Ahead Time Series Forecasting
Can we first transform the labels into a single class, and then, after prediction, transform it back to the original labels? For example, I have 3 labels to predict, [y1, y2, y3]. If [y1, y2, y3] = [1, 0, 1], then I give it the label 101 (binary) = 5. During prediction, I recover the probability of y1 in the following way:
p(y1=1) = p(100) + p(101) + p(110) + p(111). In this way a multi-label problem becomes a multi-class problem (the "label powerset" approach).
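A minimal sketch of that transformation (hypothetical names; Y is an (n_samples, 3) binary label matrix, X the features, and any multi-class classifier with predict_proba could be plugged in):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Encode each label combination as a single integer class: [1, 0, 1] -> 0b101 = 5.
powers = 2 ** np.arange(Y.shape[1])[::-1]          # [4, 2, 1] for 3 labels
y_class = Y.dot(powers)

clf = RandomForestClassifier().fit(X, y_class)
proba = clf.predict_proba(X)                       # columns follow clf.classes_

# Recover the marginal p(y1 = 1) by summing over classes whose first bit is set.
mask = (clf.classes_ // powers[0]) % 2 == 1
p_y1 = proba[:, mask].sum(axis=1)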
