Can't get gridSearchCV to work for hmmlearn estimator - scikit-learn

I have an HMM that I can train by passing the fit function a list 'merged' containing all training sequences concatenated one after another, and a list 'all_lengths' holding the individual sequence lengths:
model = hmm.MultinomialHMM(n_components=3).fit(np.atleast_2d(merged).T, all_lengths)
This works, but I want to determine the optimal n_components using sklearn's GridSearchCV, which keeps raising errors when I try the following:
tuned_parameters = [{'n_components': [1,2,3]}]
test = GridSearchCV(hmm.MultinomialHMM(), tuned_parameters, cv=5,)
test.fit(np.atleast_2d(merged).T, all_lengths)
outputs
ValueError: Found input variables with inconsistent numbers of samples: [515031, 28923]
Here 515031 is the length of merged, and 28923 is the length of all_lengths.
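GridSearchCV.fit treats its second positional argument as y and checks that it has as many entries as X, which is likely where the mismatch between 515031 and 28923 comes from. One possible workaround is to build the folds per sequence rather than per sample and score each n_components by hand. The sketch below shows only the fold construction, in plain Python; the helper names split_sequences and sequence_folds are hypothetical and not part of hmmlearn or scikit-learn.

```python
from itertools import accumulate


def split_sequences(merged, lengths):
    """Cut a concatenated list back into its individual sequences."""
    ends = list(accumulate(lengths))
    starts = [0] + ends[:-1]
    return [merged[s:e] for s, e in zip(starts, ends)]


def sequence_folds(merged, lengths, n_folds):
    """Yield (train, test) pairs; each side is (concatenated samples, lengths)."""
    seqs = split_sequences(merged, lengths)
    for k in range(n_folds):
        # Every n_folds-th sequence goes to the test side of fold k.
        test = seqs[k::n_folds]
        train = [s for i, s in enumerate(seqs) if i % n_folds != k]
        yield (
            ([x for s in train for x in s], [len(s) for s in train]),
            ([x for s in test for x in s], [len(s) for s in test]),
        )


# Toy data standing in for 'merged' and 'all_lengths'.
merged = [0, 1, 2, 1, 0, 2, 2, 1, 0]
all_lengths = [3, 2, 4]
folds = list(sequence_folds(merged, all_lengths, n_folds=3))
```

Each (samples, lengths) pair has the same shape as the arguments already passed to fit in the working call above, so a model could be fitted on the train side and scored on the test side for every candidate n_components.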

Related

MultiOutput Classification with TensorFlow Extended (TFX)

I'm quite new to TFX (TensorFlow Extended) and have been going through the sample tutorial on the TensorFlow portal to understand it well enough to apply it to my dataset.
In my scenario, instead of predicting a single label, I need to predict 2 outputs (category 1, category 2).
I've done this using the pure TensorFlow Keras Functional API and that works fine, but I'm now looking to see whether it can be fitted into the TFX pipeline.
The error occurs at the Trainer stage of the pipeline, inside _input_fn, and I suspect it's because I'm not correctly splitting the given data into a (features, labels) tensor pair.
Scenario:
Each row of the input data comes in the form of
[Col1, Col2, Col3, ClassificationA, ClassificationB]
ClassificationA and ClassificationB are the categorical labels which I'm trying to predict using the Keras Functional model.
The output layer of the Keras functional model looks like the code below, where 2 outputs are joined to a single dense layer (note: the _xf suffix just indicates that I've encoded the classes to int representations):
output_1 = tf.keras.layers.Dense(
    TargetA_Class, activation='sigmoid',
    name='ClassificationA_xf')(dense)
output_2 = tf.keras.layers.Dense(
    TargetB_Class, activation='sigmoid',
    name='ClassificationB_xf')(dense)
model = tf.keras.Model(inputs=inputs,
                       outputs=[output_1, output_2])
In the trainer module file, I've imported the required packages at the start:
import tensorflow_transform as tft
from tfx.components.tuner.component import TunerFnResult
import tensorflow as tf
from typing import List, Text
from tfx.components.trainer.executor import TrainerFnArgs
from tfx.components.trainer.fn_args_utils import DataAccessor, FnArgs
from tfx_bsl.tfxio import dataset_options
The current input_fn in the trainer module file looks like the below (by following the tutorial)
def _input_fn(file_pattern: List[Text],
              data_accessor: DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200) -> tf.data.Dataset:
    """Helper function that Generates features and label dataset for tuning/training.
    Args:
        file_pattern: List of paths or patterns of input tfrecord files.
        data_accessor: DataAccessor for converting input to RecordBatch.
        tf_transform_output: A TFTransformOutput.
        batch_size: representing the number of consecutive elements of returned
            dataset to combine in a single batch
    Returns:
        A dataset that contains (features, indices) tuple where features is a
        dictionary of Tensors, and indices is a single Tensor of label indices.
    """
    return data_accessor.tf_dataset_factory(
        file_pattern,
        dataset_options.TensorFlowDatasetOptions(
            batch_size=batch_size,
            #label_key=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS]),
            label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), _transformed_name(_CATEGORICAL_LABEL_KEYS[1])),
        tf_transform_output.transformed_metadata.schema)
When I run the trainer component, the error that comes up is:
label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), _transformed_name(_CATEGORICAL_LABEL_KEYS[1])),
^
SyntaxError: positional argument follows keyword argument
I've also tried label_key=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS], which also gives an error.
However, if I pass in just a single label key, label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), then it works fine.
FYI, _CATEGORICAL_LABEL_KEYS is simply a list containing the names of the 2 outputs I'm trying to predict (ClassificationA, ClassificationB).
_transformed_name is simply a function that returns an updated name/key for the transformed data:
def _transformed_name(key):
    return key + '_xf'
Question:
From what I can see, the label_key argument of dataset_options.TensorFlowDatasetOptions accepts only a single string (the name of one label), which means it may not be able to output a dataset with multiple labels.
Is there a way to modify _input_fn so that the dataset it returns carries both output labels? The returned tensors would look something like:
Feature_Tensor: {Col1_xf: Col1_transformedfeature_values,
                 Col2_xf: Col2_transformedfeature_values,
                 Col3_xf: Col3_transformedfeature_values}
Label_Tensor: {ClassificationA_xf: ClassA_encodedlabels,
               ClassificationB_xf: ClassB_encodedlabels}
I would appreciate advice from the wider TFX community!
Since the label key is optional, instead of specifying it in TensorFlowDatasetOptions you can use dataset.map afterwards and split both labels out of the feature dictionary yourself.
I haven't tested it, but something like:
def _data_augmentation(feature_dict):
    # Copy so the dict handed over by dataset.map is left untouched.
    features = dict(feature_dict)
    # Move both transformed label columns out of the features.
    labels = {_transformed_name(x): features.pop(_transformed_name(x))
              for x in _CATEGORICAL_LABEL_KEYS}
    return features, labels
def _input_fn(file_pattern: List[Text],
              data_accessor: DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200) -> tf.data.Dataset:
    """Helper function that generates features and label dataset for tuning/training.
    Args:
        file_pattern: List of paths or patterns of input tfrecord files.
        data_accessor: DataAccessor for converting input to RecordBatch.
        tf_transform_output: A TFTransformOutput.
        batch_size: representing the number of consecutive elements of returned
            dataset to combine in a single batch
    Returns:
        A dataset that contains (features, labels) tuples where features is a
        dictionary of Tensors and labels holds both label outputs.
    """
    # No label_key here: the labels are split out by dataset.map below.
    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        dataset_options.TensorFlowDatasetOptions(batch_size=batch_size),
        tf_transform_output.transformed_metadata.schema)
    dataset = dataset.map(_data_augmentation)
    return dataset
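The splitting step suggested above can be checked in plain Python before wiring it into dataset.map. This is only a sketch: it uses an ordinary dict with made-up values in place of a batch of tensors, and the names LABEL_KEYS and split_features_and_labels are hypothetical stand-ins for _CATEGORICAL_LABEL_KEYS and the mapping function.

```python
# Assumed label names, mirroring the two outputs described in the question.
LABEL_KEYS = ["ClassificationA", "ClassificationB"]


def transformed_name(key):
    """Return the key used for the transformed column."""
    return key + "_xf"


def split_features_and_labels(feature_dict):
    """Split one dict of columns into (features, labels) dicts."""
    features = dict(feature_dict)  # copy, so the input is not mutated
    labels = {
        transformed_name(k): features.pop(transformed_name(k))
        for k in LABEL_KEYS
    }
    return features, labels


# Made-up row standing in for one element of the dataset.
row = {
    "Col1_xf": 0.5,
    "Col2_xf": 1.5,
    "Col3_xf": 2.0,
    "ClassificationA_xf": 3,
    "ClassificationB_xf": 1,
}
features, labels = split_features_and_labels(row)
```

The label dict is keyed by ClassificationA_xf and ClassificationB_xf, the same names given to the two output layers in the question, which is what lets Keras match each label to its output.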

GridSearchCV gives different results than LassoCV for optimal alpha

I am aware of the standard process for finding the optimal value of alpha/lambda by cross-validation with the GridSearchCV class in sklearn.model_selection. Here's my code for that:
alphas=np.arange(0.0001,0.01,0.0005)
cv=RepeatedKFold(n_splits=10,n_repeats=3, random_state=100)
hyper_param = {'alpha':alphas}
model = Lasso()
model_cv = GridSearchCV(estimator=model,
                        param_grid=hyper_param,
                        scoring='r2',
                        cv=cv,
                        verbose=1,
                        return_train_score=True
                        )
model_cv.fit(X_train,y_train)
#checking the bestscore
model_cv.best_params_
This gives me alpha=0.01
Now, looking at LassoCV: as I understand it, this class builds the model by selecting the best alpha from the list of alphas passed in. Note that I have used the same cross-validation scheme for both. But when I try sklearn.linear_model.LassoCV with the RepeatedKFold cross-validation scheme:
alphas=np.arange(0.0001,0.01,0.0005)
cv=RepeatedKFold(n_splits=10,n_repeats=3,random_state=100)
ls_cv_m=LassoCV(alphas,cv=cv,n_jobs=1,verbose=True,random_state=100)
ls_cv_m.fit(X_train_reduced,y_train)
print('Alpha Value %d'%ls_cv_m.alpha_)
print('The coefficients are {}',ls_cv_m.coef_)
I get alpha=0 for the same data, and this value is not present in the list of decimal values passed in the alphas argument.
This has confused me about the actual implementation of LassoCV.
My questions are:
Why do I get an optimal alpha of 0 in LassoCV when the list passed to the argument does not have zero in it?
What is the difference between LassoCV and Lasso then, if I have to find the most suitable alpha with GridSearchCV anyway?
First, you should pass your alphas as a keyword argument rather than a positional one, since the first positional parameter of LassoCV is eps.
ls_cv_m=LassoCV(alphas=alphas,cv=cv,n_jobs=1,verbose=True,random_state=100)
Second, the model does return one of the alphas you defined as the optimal parameter; you are simply printing it as an integer, which truncates the float to 0. Replace %d with %f to print it as a float:
print('Alpha Value %f'%ls_cv_m.alpha_)
See the Python documentation on string formatting for more details about printing formats and styles.
As for your second question, Lasso is the linear model itself, while LassoCV wraps an iterative search that finds the optimal alpha for a Lasso model using cross-validation.
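The %d point above can be reproduced without scikit-learn. In this small sketch, the value 0.0096 is a made-up alpha chosen only because it lies in the range of the grid from the question; it is not the result of any fit.

```python
alpha = 0.0096  # hypothetical value, for illustration only

# %d converts the float to an integer, discarding the fractional part.
as_int = 'Alpha Value %d' % alpha

# %f keeps the fractional part (six decimal places by default).
as_float = 'Alpha Value %f' % alpha

print(as_int)    # Alpha Value 0
print(as_float)  # Alpha Value 0.009600
```

Any alpha below 1 prints as 0 with %d, which matches what the question reports.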

How to set Keras TimeseriesGenerator to predict the second next value?

Currently I have the following code using TimeseriesGenerator from Keras:
TimeseriesGenerator(train, prediction, length=TIME_STEPS, batch_size=1)
Currently this pairs the train data up to t with the output at t+1. That makes sense, but I want to predict t+2, so the train data for t should have the output of t+2.
Is there any way to do it using TimeseriesGenerator?
The quickest solution is to shift your predictions by 1, i.e.:
TimeseriesGenerator(train[:-1], prediction[1:], length=TIME_STEPS, batch_size=1)
Note that you have to trim the train set, so both datasets have equal lengths.
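To see why a shift of 1 is enough, the pairing rule can be imitated in plain Python. The sketch below does not call Keras; the helper windows is hypothetical and assumes the generator pairs the inputs data[i-length:i] with targets[i], with stride 1 and batch size 1.

```python
def windows(data, targets, length):
    """Imitate the pairing rule: inputs data[i-length:i], target targets[i]."""
    return [(data[i - length:i], targets[i]) for i in range(length, len(data))]


series = list(range(10))  # toy series where the value equals the timestep
TIME_STEPS = 3

# Default alignment: the window ending at t is paired with t+1.
one_ahead = windows(series, series, TIME_STEPS)

# Shifted alignment from the answer: the window ending at t is paired with t+2.
two_ahead = windows(series[:-1], series[1:], TIME_STEPS)

print(one_ahead[0])  # ([0, 1, 2], 3)
print(two_ahead[0])  # ([0, 1, 2], 4)
```

Trimming the last element of the inputs keeps both lists the same length, at the cost of one fewer training window.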
You can also use the timeseries_dataset_from_array function where you can align the data and targets according to your needs as you can read in the documentation:
data: Numpy array or eager tensor containing consecutive data points
(timesteps). Axis 0 is expected to be the time dimension.
targets:
Targets corresponding to timesteps in data. It should have same length
as data. targets[i] should be the target corresponding to the window
that starts at index i (see example 2 below). Pass None if you don't
have target data (in this case the dataset will only yield the input
data).
So in your case it would be something like this:
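# targets[i] belongs to the window starting at index i, whose last step is
# i + TIME_STEPS - 1, so the value two steps later sits at i + TIME_STEPS + 1.
tf.keras.preprocessing.timeseries_dataset_from_array(
    train[:-(TIME_STEPS + 1)],
    prediction[TIME_STEPS + 1:],
    sequence_length=TIME_STEPS,
    batch_size=1
)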

How to include multiple input tensor in keras.model.fit_generator

I am a Keras rookie and need some help after struggling with this problem for many days. Please ask for further information if anything is ambiguous.
Currently, I am trying to modify the code from a link. According to their network model, 2 input tensors are expected, and I am having trouble passing 2 input tensors to the source code they provide.
The function Boneage_prediction_model() creates a model with 2 input tensors.
def Boneage_prediction_model():
    i1 = Input(shape=(500, 500, 1), name='input_img')  # the 1st input tensor
    i2 = Input(shape=(1,), name='input_gender')  # the 2nd input tensor
    ... ...
    model = Model(inputs=(i1, i2), outputs=o)  # define model input with both i1 and i2
    ... ...
# using model.fit_generator to instantiate
# datagen is initiated by keras.preprocessing.image.ImageDataGenerator
# img_train is the 1st network input, and boneage_train is the training label
# gender_train is the 2nd network input
model.fit_generator(
    (datagen.flow(img_train, boneage_train, batch_size=10),
     gender_train),
    ... ...
)
I tried many ways to combine the two (datagen.flow(img_train, boneage_train, batch_size=10) and gender_train) as shown above, but each attempt failed with errors such as the following:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[[[-0.26078433],
[-0.26078433],
[-0.26078433],
...,
[-0.26078433],
[-0.26078433],
[-0.26078433]],
[[-0.26078433],
[-0.26...
If I understand you correctly, you want to have two inputs for one network and have one label for the combined output. In the official documentation for the fit_generator there is an example with multiple inputs.
Using a dictionary to map the multiple inputs would result in:
model.fit_generator(
datagen.flow({'input_img':img_train, 'input_gender':gender_train}, boneage_train, batch_size=10),
...
)
After failing both to combine the 2 inputs naively and, as another contributor suggested, to map them with a dictionary, I realized the problem seems to be datagen.flow, which keeps me from combining an image tensor input with a categorical tensor input. datagen.flow comes from keras.preprocessing.image.ImageDataGenerator, whose goal is preprocessing the input images, so it is probably inappropriate to combine the 2 inputs inside datagen.flow. Additionally, fit_generator seems to expect a generator as input, so what I proposed in my question is wrong, though I do not fully understand the mechanism of this function.
After looking carefully through other code written by the team, I learned that I need to write a generator to combine the two. The solution is as follows:
from itertools import cycle
def combined_generators(image_generator, gender_data, batch_size):
    # Repeat the gender batches forever so they keep pace with the image generator.
    gender_generator = cycle(batch(gender_data, batch_size))
    while True:
        nextImage = next(image_generator)
        nextGender = next(gender_generator)
        assert len(nextImage[0]) == len(nextGender)
        # Yield ([image batch, gender batch], label batch).
        yield [nextImage[0], nextGender], nextImage[1]
def batch(iterable, n=1):
    # Yield consecutive slices of at most n items.
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]
train_gen_wrapper = combined_generators(train_gen_boneage, train_df_boneage['male'], BATCH_SIZE_TRAIN)
model.fit_generator(train_gen_wrapper, ... )
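The behaviour of the wrapper can be tried out without Keras. In the sketch below, fake_image_batches is a hypothetical stand-in for the image generator, assumed to yield (images, labels) batches in a fixed order and to loop forever; the strings and numbers are made-up data.

```python
from itertools import cycle


def batch(items, n=1):
    """Yield consecutive slices of at most n items."""
    for start in range(0, len(items), n):
        yield items[start:start + n]


def combined_generators(image_generator, gender_data, batch_size):
    """Pair each (images, labels) batch with the matching gender batch."""
    gender_generator = cycle(batch(gender_data, batch_size))
    while True:
        images, labels = next(image_generator)
        genders = next(gender_generator)
        assert len(images) == len(genders)
        yield [images, genders], labels


def fake_image_batches(images, labels, batch_size):
    """Hypothetical stand-in for an unshuffled image generator."""
    while True:
        for start in range(0, len(images), batch_size):
            yield (images[start:start + batch_size],
                   labels[start:start + batch_size])


images = ["img0", "img1", "img2", "img3"]
ages = [10, 20, 30, 40]
genders = [1, 0, 0, 1]

gen = combined_generators(fake_image_batches(images, ages, 2), genders, 2)
first = next(gen)
second = next(gen)
third = next(gen)  # both generators have wrapped around to the start
```

The pairing stays correct only while both sides walk through the data in the same order, so an image generator that shuffles its samples would need the gender values shuffled identically.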

How can I generate classification report by removing this error?

I want to generate a classification report for the movie_reviews dataset from the corpus, which already has the target names [pos, neg], but I get an error.
Code:
movie_train_clf = Pipeline([('vect',CountVectorizer(stop_words='english')),('tfidf',TfidfTransformer()),('clas',BernoulliNB(fit_prior=True))])
movie_train_clas = movie_train_clf.fit(movie_train.data ,movie_train.target)
predict = movie_train_clas.predict(movie_train.data)
np.mean(predict==movie_train.target)
Now I use classification report
from sklearn.metrics import classification_report
print(classification_report(predict, movie_train_clas,target_names==target_names))
Error:
TypeError: iteration over a 0-d array.
Please help me with the correct syntax.
There are multiple errors in your code:
1) You have the wrong order of arguments in classification_report. As per the documentation:
classification_report(y_true, y_pred, ...
First argument is the true labels and second one is the predicted labels.
2) You are using movie_train_clas in place of the true labels. As per your code, movie_train_clas is the return value of movie_train_clf.fit(), so it's movie_train_clf itself. fit() returns the estimator, so you cannot use that in place of the ground truth labels.
3) As @AmiTavory spotted, the current error is due to the comparison operator (==) being used in place of assignment (=). The correct call to classification_report should be:
classification_report(movie_train.target, predict, target_names=target_names)
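The effect of writing == instead of = can be illustrated with a stand-in function. The sketch below does not call scikit-learn: fake_report is hypothetical and only records how its arguments were bound, assuming a signature in which labels is the third positional parameter, as in older releases of classification_report.

```python
def fake_report(y_true, y_pred, labels=None, target_names=None):
    """Hypothetical stand-in that records how its arguments were bound."""
    return {"labels": labels, "target_names": target_names}


target_names = ["neg", "pos"]

# '==' evaluates to the boolean True, which is then passed positionally.
wrong = fake_report([0, 1], [0, 1], target_names == target_names)

# '=' binds the list to the intended keyword parameter.
right = fake_report([0, 1], [0, 1], target_names=target_names)

print(wrong)  # {'labels': True, 'target_names': None}
print(right)  # {'labels': None, 'target_names': ['neg', 'pos']}
```

A single boolean where a list of labels is expected is consistent with the "iteration over a 0-d array" message in the question.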
