imbalanced-learn "balanced_batch_generator" for > 2 dimensional data - Keras

I am using imbalanced-learn's "balanced_batch_generator" to try to perform undersampling on an image array which has 4 dimensions. I ran the code below:
training_generator, steps_per_epoch = balanced_batch_generator(x_train, y_train, sampler=NearMiss(), batch_size=10)
and got the following error:
ValueError: Found array with dim 4. Estimator expected <= 2.
I am aware that this function does not accept > 2 dimensional data; however, I am wondering if there is a workaround. I could perform the under/over-sampling myself by manually splitting the data, but I want to make use of imbalanced-learn's nicely implemented samplers such as NearMiss to intelligently sample my data.

You need to reshape x_train before handing it to the sampler:
x_train.shape  # so you know the values of all 4 dims (1st dim, 2nd dim, 3rd dim, 4th dim)
x_train = x_train.reshape(x_train.shape[0], -1)
Then use balanced_batch_generator. Afterwards, reshape each generated batch back to the original dimensions:
for x_batch, y_batch in training_generator:
    x_batch = x_batch.reshape(x_batch.shape[0], initial value of the 2nd dim, initial value of the 3rd, initial value of the 4th)
    break
Now x_batch contains a balanced batch with the original image dimensions. You don't need to reshape y_train because it only consists of the labels (dim <= 2).
If you do not want to type in the dimension values every time, store them in variables beforehand, for example as in the sketch below.
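A minimal end-to-end sketch of this workaround, assuming x_train is a 4-D image array of shape (n_samples, height, width, channels) and y_train holds the labels:

import numpy as np
from imblearn.keras import balanced_batch_generator
from imblearn.under_sampling import NearMiss

# remember the original image dimensions so every batch can be restored later
orig_dims = x_train.shape[1:]                        # e.g. (height, width, channels)

# flatten each sample into one feature vector so the sampler accepts the data
x_train_flat = x_train.reshape(x_train.shape[0], -1)

training_generator, steps_per_epoch = balanced_batch_generator(
    x_train_flat, y_train, sampler=NearMiss(), batch_size=10)

# every generated batch is 2-D; reshape it back to images before feeding the model
for x_batch, y_batch in training_generator:
    x_batch = x_batch.reshape((x_batch.shape[0],) + orig_dims)
    break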

Related

LSTM input shape through json file

I am working on an LSTM, and after pre-processing the data I get the data X in the form of a list that contains 3 lists of features, and each of those lists contains a sequence of 50 points, also as a list.
X = [list:100 [list:3 [list:50]]]
Y = [list:100]
Since it's a multivariate LSTM, I am not sure how to give all 3 sequences as an input to the Keras LSTM. Do I need to convert it into a Pandas DataFrame?
model = models.Sequential()
model.add(layers.Bidirectional(layers.LSTM(units=32,
                                           input_shape=(?, ?, ?))))
You can do the following to convert the lists into NumPy arrays:
X = np.array(X)
Y = np.array(Y)
Calling the following after this conversion:
print(X.shape)
print(Y.shape)
should output: (100, 3, 50) and (100,), respectively. Finally, the input_shape of the LSTM layer can be (None, 50).
LSTM Call arguments Doc:
inputs: A 3D tensor with shape [batch, timesteps, feature].
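Putting the pieces together, a short sketch (the Dense head, loss and training settings are assumptions for a scalar target per sample, not something stated in the question):

import numpy as np
from tensorflow.keras import models, layers

X = np.array(X)   # shape (100, 3, 50): samples, timesteps, features
Y = np.array(Y)   # shape (100,)

model = models.Sequential()
# input_shape excludes the batch axis: (timesteps, features); None also works for the timestep axis
model.add(layers.Bidirectional(layers.LSTM(units=32), input_shape=(None, 50)))
model.add(layers.Dense(1))                   # assumed regression-style head, adapt to your task
model.compile(optimizer='adam', loss='mse')  # assumed loss
model.fit(X, Y, epochs=5, batch_size=16)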
You would have to transform that list into a NumPy array to work with Keras.
As per the shape of X you have provided, it should work in theory. However, you do have to figure out what the 3 dimensions of your array actually contain.
The 1st dimension is the sample axis, i.e. how many samples of data you have.
The 2nd dimension is your timestep data.
E.g. for words in the sentence "cat sat on dog" -> 'cat' is timestep 1, 'sat' is timestep 2, 'on' is timestep 3, and so on.
The 3rd dimension represents the features of your data at each timestep. For the sentence above, we can vectorize each word into a fixed-length vector, and the length of that vector is the feature size, as in the example below.
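For instance, a toy batch for that sentence might look like this (the 8-dimensional word vectors are purely illustrative):

import numpy as np

embedding_dim = 8                                # illustrative feature size per word
sentence = np.random.rand(1, 4, embedding_dim)   # (batch=1, timesteps=4 words, features=8)
print(sentence.shape)                            # (1, 4, 8) -> [batch, timesteps, feature]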

GridSearchCV Pipeline MultiOutputClassifier with XGBoostClassifier - how to pass early_stopping_rounds and eval_set?

I want to do multioutput prediction of labels and continuous data. My data consists of time series: one series of 10 time points with 30 observables per sample. I want to predict 10 labels that are binary, and 5 that are continuous, based on this data.
For the sake of simplicity I have flattened the time series data, ending up with one row per sample.
Since there are many labels to predict about the same system, and since relationships exist between them, I want to use multioutput prediction to do so. My idea is to divide the task into two parts: one for MultiOutputClassification, another for MultiOutputRegression.
I generally like XGBoost and wish to use it for this task, but of course I want to prevent overfitting when doing so. So I have a piece of code as follows, and I wish to pass early_stopping_rounds to the fit method of the XGBClassifier, but don't know how to.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
pipeline = Pipeline([
    ('imputer', SimpleImputer()),  # XGBoost can deal with NaNs, but MultiOutputClassifier cannot
    ('classifier', MultiOutputClassifier(XGBClassifier()))
])
param_grid = dict(
    classifier__estimator__n_estimators=[100],  # this works
    # classifier__estimator__early_stopping_rounds=[30],  # needs to be passed to .fit
    # classifier__estimator__scale_pos_weight=[scale_pos_weight],  # XGBoostError: Invalid Parameter format for scale_pos_weight expect float
)
clf = GridSearchCV(estimator=pipeline, param_grid=param_grid, scoring='roc_auc', refit='roc_auc', cv=5, n_jobs=-1)
clf.fit(X_train, y_train[CLASSIFICATION_LABELS])
y_hat_proba = np.array(clf.predict_proba(X_test))
y_hat = pd.DataFrame(np.array([y_hat_proba[:, i, 0] for i in range(y_hat_proba.shape[1])]), columns=CLASSIFICATION_LABELS)
auc_roc_scores = np.array([roc_auc_score(y_test[label], (y_hat[label] > 0.5).astype(int)) for label in y_hat.columns])
print(f'average ROC AUC score: {np.mean(auc_roc_scores).round(3)}+/-{np.std(auc_roc_scores).round(3)}')
>>> average ROC AUC score: 0.499+/-0.002
I tried passing it to fit as follows:
classifier__estimator__early_stopping_rounds=30
classifier__early_stopping_rounds=30
I get ROC AUC scores of 0.5 on the labels, which means this clearly isn't working, hence why I want to pass the early_stopping_rounds parameter and the eval_set. I suppose that being able to pass scale_pos_weight could also be useful, but it probably doesn't work for multioutput prediction. At the moment I get the feeling that this is not the way to go to solve this, and if you agree I would appreciate alternative suggestions.
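For reference, this is how early_stopping_rounds and eval_set are normally used with a plain XGBClassifier on a single label; whether and how these arguments can be routed through GridSearchCV, Pipeline and MultiOutputClassifier depends on the sklearn/xgboost versions, so the snippet below is only a sketch of the standalone API (the validation split and the choice of label column are assumptions):

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# hold out part of the training data as the evaluation set for early stopping
y_single = y_train[CLASSIFICATION_LABELS[0]]   # one label column, for illustration only
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_single, test_size=0.2)

model = XGBClassifier(n_estimators=100)
# in xgboost < 2.0 early_stopping_rounds is a fit() argument;
# newer releases expect it in the constructor instead
model.fit(X_tr, y_tr,
          eval_set=[(X_val, y_val)],
          early_stopping_rounds=30,
          verbose=False)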

How to predict one data point at a time then update the network using all the data including the last using an LSTM

I have a data set of 27 features, 1012 training samples and 125 for testing.
Using an LSTM network I trained on the training set. But when testing I don't want it to predict all 125 points at once, because I'm working with time series. Instead I would like the network to iterate through the test data, predict one point at a time and update itself incrementally.
For that purpose I wrote the following code, which iterates over the test data using the index:
Predictions = list()
for i in range(len(x_test_t)):  # iterate over the test samples by index
    model = load_model('model %s' % i)
    y_pred = model.predict(x_test_t[i], batch_size=BATCH_SIZE)
    y_pred = y_pred.flatten()
    # Descaling the predicted values
    Dynamic_Trainer.pred = (y_pred * min_max_scaler.data_range_[3]) + min_max_scaler.data_min_[3]
    Dynamic_Trainer.test = (y_test_tt * min_max_scaler.data_range_[3]) + min_max_scaler.data_min_[3]
    # Saving the model for each new data point predicted and added to training
    u = i + 1
    model.save(Output_path + 'Model %d' % u)
    # Saving each new prediction (Dynamic_Trainer is the function I made for the LSTM)
    Predictions.append(Dynamic_Trainer.pred)
However I get this error:
ValueError: Error when checking input: expected lstm_1_input to have 3 dimensions, but got array with shape (4, 27)
TL;DR: How can I iterate over 3-dimensional data, extract one 3-D sample at a time, and feed it to the network?
EDIT: If there's a more efficient way to achieve the same goal, I am open to suggestions.
The solution I found, thanks to @Marco Cerliani, for anyone with the same issue, is to simply use y_pred = model.predict(x_test_t[i][None, :, :]) inside the loop.
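In other words, indexing with x_test_t[i] drops the sample axis, so the batch dimension has to be added back before calling predict. Both lines below keep the input 3-D (a sketch, assuming x_test_t has shape (n_samples, timesteps, features)):

import numpy as np

sample = x_test_t[i][None, :, :]           # shape (1, timesteps, features)
sample = np.expand_dims(x_test_t[i], 0)    # equivalent alternative
y_pred = model.predict(sample, batch_size=BATCH_SIZE)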

How to write code so that addition and subtraction of x_train row values affect the prediction of the next future value

I have a dataset with three inputs and I am trying to predict the next value of X1 from the combination of the previous input values.
My inputs are X1, X2, X3, X4.
So here I am trying to predict the next future value of X1, and the combination of these four inputs affects that prediction.
What I mean is that, during prediction, these four inputs are combined by addition and subtraction to give the predicted value.
Here I wrote the code for the addition and subtraction inside x_train. It then runs in the LSTM model.
Then I tried to predict with the x_test_n value, but it gave me an error: Error when checking input: expected lstm_16_input to have 3 dimensions, but got array with shape (1530, 1)
Here is my code:
def predict(x_train):
    # the original snippet passed arr=data and omitted the axis argument; this assumes
    # the combination is meant to be applied across the last (feature) axis of x_train
    s = np.apply_along_axis(lambda row: row[0] + row[1] - row[2] - row[3], axis=-1, arr=x_train)
    return model.predict(s)
The LSTM model:
model = Sequential()
model.add(LSTM(4, return_sequences=True, input_shape=(None, x_train.shape[2])))
model.add(LSTM(8, return_sequences=True))  # returns a sequence of vectors of dimension 8
model.add(LSTM(8))  # returns a single vector of dimension 8
model.add(Dense(1))
Predict with:
pred = predict(x_test)
This gave me the error shown above.
The data you supply to model.predict needs to have the same feature dimension as x_train.shape[2].
Looking at the error message, this is 3. In the predict function you are summing 4 values, giving an input vector of size 1 instead.
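If the engineered combination really is what the network should consume, one way to keep the shapes consistent is to compute it while preserving the 3-D (samples, timesteps, features) layout and to train the model on that same layout. The sketch below is only a hypothetical reworking under that assumption; the column indices mirror the lambda in the question:

def combine(x):
    # collapse the four input columns into one engineered feature,
    # keeping the (samples, timesteps, 1) layout the LSTM expects
    s = x[:, :, 0] + x[:, :, 1] - x[:, :, 2] - x[:, :, 3]
    return s[..., None]

s_train = combine(x_train)   # the model would then use input_shape=(None, 1) and be trained on this
s_test = combine(x_test)     # so that model.predict(s_test) sees matching dimensions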

Keras: LSTM with class weights

My question is quite closely related to this question, but also goes beyond it.
I am trying to implement the following LSTM in Keras where
the number of timesteps is nb_tsteps=10
the number of input features is nb_feat=40
the number of LSTM cells at each time step is 120
the LSTM layer is followed by TimeDistributedDense layers
From the question referenced above I understand that I have to present the input data as
(nb_samples, 10, 40)
where I get nb_samples by rolling a window of length nb_tsteps=10 across the original timeseries of shape (5932720, 40). The code is hence
model = Sequential()
model.add(LSTM(120, input_shape=(X_train.shape[1], X_train.shape[2]),
               return_sequences=True, consume_less='gpu'))
model.add(TimeDistributed(Dense(50, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(20, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(10, activation='relu')))
model.add(Dropout(0.2))
model.add(TimeDistributed(Dense(3, activation='relu')))
model.add(TimeDistributed(Dense(1, activation='sigmoid')))
Now to my question (assuming the above is correct so far):
The binary responses (0/1) are heavily imbalanced and I need to pass a class_weight dictionary like cw = {0: 1, 1: 25} to model.fit(). However I get an exception class_weight not supported for 3+ dimensional targets. This is because I present the response data as (nb_samples, 1, 1). If I reshape it into a 2D array (nb_samples, 1) I get the exception Error when checking model target: expected timedistributed_5 to have 3 dimensions, but got array with shape (5932720, 1).
Thanks a lot for any help!
I think you should use sample_weight with sample_weight_mode='temporal'.
From the Keras docs:
sample_weight: Numpy array of weights for the training samples, used for scaling the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. In this case you should make sure to specify sample_weight_mode="temporal" in compile().
In your case you would need to supply a 2D array with the same shape as your labels.
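A sketch of that idea, assuming the targets have been shaped as (nb_samples, nb_tsteps, 1) to match the TimeDistributed output (y_train is an assumed name for that response array), and using the class weights from the question:

import numpy as np

cw = {0: 1, 1: 25}

# one weight per timestep of every sample: shape (nb_samples, nb_tsteps)
sample_weight = np.where(y_train[:, :, 0] == 1, cw[1], cw[0]).astype('float32')

model.compile(optimizer='adam', loss='binary_crossentropy',
              sample_weight_mode='temporal')
model.fit(X_train, y_train, epochs=10, batch_size=256,
          sample_weight=sample_weight)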
If this is still an issue: I think the TimeDistributed layer expects and returns a 3D array (similar to having return_sequences=True in a regular LSTM layer). Try adding a Flatten() layer or another LSTM layer at the end, before the prediction layer.
d = TimeDistributed(Dense(10))(input_from_previous_layer)
lstm_out = Bidirectional(LSTM(10))(d)
output = Dense(1, activation='sigmoid')(lstm_out)
Using sample_weight_mode='temporal' is a workaround. Check out this Stack Overflow thread; the issue is also documented on GitHub.
