Using Sktime Regressor - python-3.x

I am trying to use a regressor model from sktime, but I couldn't figure out what data type and format the input needs to have. Assume I want to use 2 columns as input and 1 column as target.
import numpy as np
import pandas as pd
from sktime.regression.interval_based import TimeSeriesForestRegressor
rand = np.random.random((200, 3))
X = pd.DataFrame(rand)[[0, 1]]
y = pd.DataFrame(rand)[2]
forecaster = TimeSeriesForestRegressor()
forecaster.fit(X=X, y=pd.Series(y))
The code block above raises this error: "X is not of a supported input data type. X must be in a supported mtype format for Panel, found <class 'pandas.core.frame.DataFrame'>. Use datatypes.check_is_mtype to check conformance with specifications."
How can I solve that problem?
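One possible direction, offered as a hedged sketch rather than a confirmed answer: the error says X must be a Panel type, meaning a collection of time series rather than a flat table. One Panel format that sktime documents is a 3D NumPy array of shape (n_instances, n_variables, n_timepoints). The example below only builds such an array with NumPy. The window length of 10 and the choice to predict the target at the end of each window are assumptions made for illustration, and the commented-out fit call is not tested here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
rand = rng.random((200, 3))      # same layout as in the question

window = 10                      # assumed window length, chosen for illustration
features = rand[:, :2]           # columns 0 and 1 are the inputs

# Slide a window over the time axis; the window axis is appended last,
# giving (n_instances, n_variables, n_timepoints).
X_panel = sliding_window_view(features, window, axis=0)

# Assumed target: the value of column 2 at the last step of each window.
y_panel = rand[window - 1:, 2]

print(X_panel.shape, y_panel.shape)

# Assumption: a regressor that accepts multivariate panels could then be fit with
# regressor.fit(X_panel, y_panel)
```

Whether this windowing matches the intended task depends on what each row of the original table represents, so treat the slicing as a placeholder.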

Related

How add scalar to tensor in Keras or create tensor from scalar?

I need to somehow run something like this:
x = Input(shape=(img_height, img_width, img_channels))
x1 = Add()([x, 127.5])
x2 = Multiply()([x1, -127.5])
But, error emerges:
ValueError: Layer add_1 was called with an input that isn't a symbolic tensor. Received type: <class 'float'>. Full input: [<tf.Tensor 'input_1:0' shape=(?, 400, 300, 3) dtype=float32>, 0.00784313725490196]. All inputs to the layer should be tensors.
I can't use a Lambda() layer, because I need to convert the final model to CoreML, and I would be unable to rewrite the lambdas in Swift.
Is there any way to create Keras tensor from float?
Maybe there is a different solution for this problem?
UPD: backend is TensorFlow
Well, based on the comments above, I've tested 2 approaches. A custom layer was not an option, because I would need to write it in Swift for conversion to a CoreML model (and I do not know Swift).
Additional input
As far as I know, there is no way to predefine an input value, so I need to pass the additional parameters as inputs, which is not very convenient.
Consider example code below:
input1 = keras.layers.Input(shape=(1,), tensor=t_b, name='11')
input2 = keras.layers.Input(shape=(1,))
input3 = keras.layers.Input(shape=(1,), tensor=t_a, name='22')
# x1 = keras.layers.Dense(4, activation='relu')(input1)
# x2 = keras.layers.Dense(4, activation='relu')(input2)
added = keras.layers.Add()([input1, input3]) # equivalent to keras.layers.add([input1, input3])
added2 = keras.layers.Add()([input2, added]) # equivalent to keras.layers.add([input2, added])
# out = keras.layers.Dense(4)(added2)
model = keras.models.Model(inputs=[input1, input2, input3], outputs=added2)
If you load that model in a clean environment, then you will actually need to pass 3 values to it: my_model.predict([np.array([1]), np.array([1]), np.array([1])]), or an error will be raised.
CoreML tools
I was able to achieve the desired effect by using the *_bias and image_scale parameters of the converter function. Example below.
coreml_model = coremltools.converters.keras.convert(
model_path,
input_names='image',
image_input_names='image',
output_names=['cla','bo'],
image_scale=1/127.5, # divide every value in the matrix by 127.5
# subtract 1 from every value in the matrix
red_bias=-1.0, # add -1.0 to the channel, i.e. subtract 1
blue_bias=-1.0,
green_bias=-1.0
)
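As a quick check of the arithmetic, the sketch below applies the same scale and bias with NumPy. It assumes the preprocessing multiplies each pixel by image_scale and then adds the per-channel bias, as the comments in the call above suggest; the sample pixel values are made up.

```python
import numpy as np

image_scale = 1 / 127.5
bias = -1.0  # same value used for the red, green and blue channels

# Made-up pixel intensities covering the 8-bit range.
pixels = np.array([0.0, 127.5, 255.0])

# Assumed preprocessing: scale first, then add the bias.
normalized = pixels * image_scale + bias
print(normalized)
```

Under that assumption the 8-bit range [0, 255] is mapped onto roughly [-1, 1].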
If somebody knows how to predefine a constant in Keras that does not have to be fed through an input layer, please write how (the tf.constant() solution is not working).

What to pass to clf.predict()?

I started playing with Decision Trees lately and I wanted to train my own simple model with some manufactured data. I wanted to use this model to predict some further mock data, just to get a feel of how it works, but then I got stuck. Once your model is trained, how do you pass data to predict()?
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Docs state:
clf.predict(X)
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
But when trying to pass np.array, np.ndarray, list, tuple or DataFrame it just throws an error. Can you help me understand why please?
Code below:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import graphviz
import pandas as pd
import numpy as np
import random
from sklearn import tree
pd.options.display.max_seq_items=5000
pd.options.display.max_rows=20
pd.options.display.max_columns=150
lenght = 50000
miles_commuting = [random.choice([2,3,4,5,7,10,20,25,30]) for x in range(lenght)]
salary = [random.choice([1300,1600,1800,1900,2300,2500,2700,3300,4000]) for x in range(lenght)]
full_time = [random.choice([1,0,1,1,0,1]) for x in range(lenght)]
DataFrame = pd.DataFrame({'CommuteInMiles':miles_commuting,'Salary':salary,'FullTimeEmployee':full_time})
DataFrame['Moving'] = np.where((DataFrame.CommuteInMiles > 20) & (DataFrame.Salary > 2000) & (DataFrame.FullTimeEmployee == 1),1,0)
DataFrame['TargetLabel'] = np.where((DataFrame.Moving == 1),'Considering move','Not moving')
target = DataFrame.loc[:,'Moving']
data = DataFrame.loc[:,['CommuteInMiles','Salary','FullTimeEmployee']]
target_names = DataFrame.TargetLabel
features = data.columns.values
clf = tree.DecisionTreeClassifier()
clf = clf.fit(data, target)
clf.predict(?????) #### <===== What should go here?
clf.predict([30,4000,1])
ValueError: Expected 2D array, got 1D array instead:
array=[3.e+01 4.e+03 1.e+00].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
clf.predict(np.array(30,4000,1))
ValueError: only 2 non-keyword arguments accepted
Where is your "mock data" that you want to predict?
Your data should have the same shape as the data you used when calling fit(). From the code above, I see that your X has three columns: ['CommuteInMiles','Salary','FullTimeEmployee']. You need that many columns in your prediction data; the number of rows can be arbitrary.
Now when you do
clf.predict([30,4000,1])
The model cannot tell whether these are columns of the same row or values from different rows.
So you need to convert the input into a 2-d array, where each inner array represents a single row.
Do this:
clf.predict([[30,4000,1]]) #<== Observe the two square brackets
You can predict multiple rows at once, each in its own inner list. Something like this:
X_test = [[30,4000,1],
[35,15000,0],
[40,2000,1],]
clf.predict(X_test)
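If the sample already lives in a flat list or 1-D array, reshaping it is an alternative to typing the extra brackets. A minimal NumPy-only sketch (the variable names are illustrative):

```python
import numpy as np

# One sample with three features, as a flat 1-D array.
sample = np.array([30, 4000, 1])

# reshape(1, -1) means: one row, as many columns as needed.
single_row = sample.reshape(1, -1)

# Several samples are simply stacked as rows.
batch = np.array([[30, 4000, 1],
                  [35, 15000, 0],
                  [40, 2000, 1]])

print(sample.shape, single_row.shape, batch.shape)
# Both single_row and batch have the (n_samples, n_features) layout
# that predict() expects, e.g. clf.predict(single_row).
```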
Now as for your last error, clf.predict(np.array(30,4000,1)): this has nothing to do with predict(). You are using np.array() incorrectly.
According to the documentation, the signature of np.array is:
(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
Apart from the leading parameters, the rest are keyword arguments, so they need to be passed by name. When you call np.array(30,4000,1), each value is matched to a separate parameter (object=30, dtype=4000), and the third positional argument is rejected, which is what the message "only 2 non-keyword arguments accepted" is telling you. If you want to make a numpy array from a list, you need to pass a list.
Like this: np.array([30,4000,1])
Now this will be considered correctly as input to object param.
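For illustration, here are a few calls that do work; the dtype and ndmin values are just examples.

```python
import numpy as np

# The data goes in as a single list: this is the `object` parameter.
a = np.array([30, 4000, 1])

# Other parameters are best passed by keyword.
b = np.array([30, 4000, 1], dtype=float)

# ndmin=2 prepends an axis, giving one row with three columns.
c = np.array([30, 4000, 1], ndmin=2)

print(a.shape, b.dtype, c.shape)
```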

How to use categorical data neural network in tensorflow without estimator?

I am trying to build a neural network without using estimators. I have defined the layers as:
x_categorical = tf.placeholder(tf.string)
x_numeric = tf.placeholder(tf.float32)
l1 = tf.add(tf.matmul(x_numeric,weights), biases)
l2 = tf.add(tf.matmul(x_categorical,weights), biases)
tf.matmul works well for the numeric features, but I also have some categorical features, and I am unable to use them.
I tried tf.string_to_hash_bucket_fast, but it converts the string to int64, which is not supported by tf.matmul. I also tried tf.decode_raw, which did not work either. Please help me with this; I want to use the categorical features as well.
To handle categorical values in a neural network, you have to represent them in one-hot form. If they are strings (as seems to be your case), you first have to convert them to an integer representation. Step by step:
Using from sklearn.preprocessing import LabelEncoder,OneHotEncoder
Define your categorical string values:
categorical_values = np.array([['Foo','bar','values'],['more','foo','bar'],['many','foo','bar']])
Then encode them as integers:
categorical_values[:,0] = LabelEncoder().fit_transform(categorical_values[:,0])
categorical_values[:,1] = LabelEncoder().fit_transform(categorical_values[:,1])
categorical_values[:,2] = LabelEncoder().fit_transform(categorical_values[:,2])
And use OneHotEncoder to obtain the OneHot representation:
oneHot_values = OneHotEncoder().fit_transform(categorical_values).toarray()
Define your graph:
x_categorical = tf.placeholder(shape=[NUM_OBSERVATIONS,NUM_FEATURES],dtype=tf.float32)
weights = tf.Variable(tf.truncated_normal([NUM_FEATURES,NUM_CLASSES]),dtype=tf.float32)
bias = tf.Variable(tf.zeros([NUM_CLASSES]),dtype=tf.float32) # one bias per class, initialised to zero
l2 = tf.add(tf.matmul(x_categorical,weights),bias)
And execute it obtaining the results:
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    _l2 = sess.run(l2,feed_dict={x_categorical : oneHot_values})
Edit: As requested, no-sklearn version.
Using just numpy.unique() and tensorflow.one_hot()
categorical_values = np.array(['Foo','bar','values']) #For one observation
lookup, labeledValues = np.unique(categorical_values, return_inverse=True)
oneHotValues = tf.one_hot(labeledValues,depth=NUM_FEATURES)
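The same idea can be sketched with NumPy alone, which makes it easy to inspect the intermediate values. The example strings are made up, and indexing an identity matrix with the integer labels stands in for tf.one_hot here.

```python
import numpy as np

categorical_values = np.array(['foo', 'bar', 'foo', 'baz'])

# lookup holds the sorted unique strings; labels holds, for each
# observation, the index of its string in lookup.
lookup, labels = np.unique(categorical_values, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for label i.
one_hot = np.eye(len(lookup), dtype=np.float32)[labels]

print(lookup)
print(labels)
print(one_hot)
```

Keeping lookup around is what lets you map the columns back to the original strings later.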
A full example is available as a Jupyter Notebook on my GitHub.

scikit-learn: Is there a way to provide an object as an input to predict function of a classifier?

I am planning to use an SGDClassifier in production. The idea is to train the classifier on some training data, use cPickle to dump it to a .pkl file, and reuse it later in a script. However, there are certain high-cardinality categorical fields that are translated to a one-hot matrix representation, which creates around 5000 features. The input that I get for predict will have only one of these features set; all the rest will be zeroes. It will of course also include the other numerical features. From the docs, it appears that the predict function expects an array of arrays as input. Is there any way I can transform my input to the format expected by the predict function without having to store the fields every time I train the model?
Update
So, let us say my input contains 3 fields:
{
rate: 10, // numeric
flagged: 0, //binary
host: 'somehost.com' // keeping this categorical
}
host can have around 5000 different values. Now I loaded the file to a pandas dataframe, used the get_dummies function to transform the host field to around 5000 new fields which are binary fields.
Then I trained my model and stored it using cPickle.
Now, when I need to use the predict function, for the input, I only have 3 fields (shown above). However, as per my understanding the predict endpoint will expect an array of vectors and each vector is supposed to have those 5000 fields.
For the entry that I need to predict, I know only one field for that entry which will be the value of host itself.
For example, if my input is
{
rate: 5,
flagged: 1,
host: 'new_host.com'
}
I know that the fields expected by predict should be:
{
rate: 5,
flagged: 1,
new_host: 1
}
But if I translate it to vector format, I don't know at which index to place the new_host field. Also, I don't know in advance what the other hosts are (unless I store them somewhere during the training phase).
I hope I am making some sense. Let me know if I am doing it the wrong way.
"I don't know which index to place the new_host field"
A good approach that has worked for me is to build a pipeline which you then use for training and prediction. This way you do not have to concern yourself with the column index of whatever output is produced by your transformation:
# in training
pipl = Pipeline(steps=[('binarizer', LabelBinarizer()),
('clf', SGDClassifier())])
model = pipl.fit(X, Y)
pickle.dump(model, mf) # mf is a file object opened in binary write mode
# in production
model = pickle.load(mf)
y = model.predict(X)
As X, Y inputs you need to pass an array-like object. Make sure the input is the same structure for both training and test, e.g.
X = [[data.get('rate'), data.get('flagged'), data.get('host')]]
Y = [[y-cols]] # your example doesn't specify what is Y in your data
More flexible: Pandas DataFrame + Pipeline
What also works nicely is to use a Pandas DataFrame in combination with sklearn-pandas as it allows you to use different transformations on different column names. E.g.
df = pd.DataFrame.from_dict(data)
mapper = DataFrameMapper([
('host', sklearn.preprocessing.LabelBinarizer()),
('rate', sklearn.preprocessing.StandardScaler())
])
pipl = Pipeline(steps=[('mapper', mapper),
('clf', SGDClassifier())])
X = df[x-cols]
y = df[y-col(s)]
pipl.fit(X, y)
Note that x-cols and y-col(s) are the list of the feature and target columns respectively.
You should use a scikit-learn transformer instead of get_dummies. In this case, LabelBinarizer makes sense. Seeing as LabelBinarizer doesn't work in a pipeline, this is one way to do what you want:
binarizer = LabelBinarizer()
# fitting LabelBinarizer means it remembers all the columns it's seen
one_hot_data = binarizer.fit_transform(X_train[:, categorical_col])
# replace string column with one-hot representation
X_train = np.concatenate([np.delete(X_train, categorical_col, axis=1),
one_hot_data], axis=1)
clf = SGDClassifier()
clf.fit(X_train, y)
pickle.dump({'clf': clf, 'binarizer': binarizer}, f)
then at prediction time:
estimators = pickle.load(f)
clf = estimators['clf']
binarizer = estimators['binarizer']
one_hot_data = binarizer.transform(X_test[:, categorical_col])
X_test = np.concatenate([np.delete(X_test, categorical_col, axis=1),
one_hot_data], axis=1)
clf.predict(X_test)
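Stripped of scikit-learn, the idea in both answers is to store the category-to-column mapping together with the model, so the index of each host is fixed at training time. The sketch below shows this with a plain dictionary. The helper encode_row and the host names are hypothetical, and mapping an unseen host to all zeroes is one possible policy, not the only one.

```python
import pickle

# Training time: fix the column order of the one-hot block once.
train_hosts = ['a.com', 'b.com', 'c.com']          # made-up host names
host_index = {h: i for i, h in enumerate(sorted(set(train_hosts)))}

def encode_row(row, host_index):
    """Hypothetical helper: numeric fields followed by the one-hot host block."""
    one_hot = [0] * len(host_index)
    idx = host_index.get(row['host'])   # None for a host not seen in training
    if idx is not None:
        one_hot[idx] = 1
    return [row['rate'], row['flagged']] + one_hot

# Persist the mapping next to the model (bytes here instead of a file).
blob = pickle.dumps({'host_index': host_index})

# Prediction time: reload the mapping and encode incoming rows with it.
restored = pickle.loads(blob)['host_index']
known = encode_row({'rate': 5, 'flagged': 1, 'host': 'b.com'}, restored)
unseen = encode_row({'rate': 5, 'flagged': 1, 'host': 'new_host.com'}, restored)
print(known)
print(unseen)
```

Wrapping each encoded row in an outer list, as in [known], gives the 2-D input that predict expects.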

Scikit-Learn Multiple Regression Fails with ElasticNetCV

According to the documentation and other SO questions, ElasticNetCV accepts multiple output regression. When I try it, though, it fails. Code:
from sklearn import linear_model
import numpy as np
import numpy.random as rnd
nsubj = 10
nfeat_train = 5
nfeat_predict = 20
x = rnd.random((nsubj, nfeat_train))
y = rnd.random((nsubj, nfeat_predict))
lm = linear_model.LinearRegression()
lm.fit(x,y) # works
el = linear_model.ElasticNetCV()
el.fit(x,y) # fails
Error message:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
This is with scikit-learn version 0.14.1. Is this a mismatch between the documentation and implementation?
You may want to take a look at sklearn.linear_model.MultiTaskElasticNetCV. But beware, this object assumes that your multiple targets share features. Thus, a feature is either active for all tasks (with variable activation for each, which can be small), or active for none of them. Before using this object, make sure this is the functionality you need.
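A minimal sketch of that suggestion, reusing the shapes from the question. The random seed and cv=3 are arbitrary choices, and with random data the fit itself is not meaningful; the point is only that a 2-D y is accepted.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNetCV

rng = np.random.default_rng(0)
nsubj, nfeat_train, nfeat_predict = 10, 5, 20
x = rng.random((nsubj, nfeat_train))
y = rng.random((nsubj, nfeat_predict))

# cv=3 is an arbitrary choice to keep the folds usable with 10 samples.
el = MultiTaskElasticNetCV(cv=3)
el.fit(x, y)

# One row of coefficients per target, one column per feature.
print(el.coef_.shape)
print(el.predict(x).shape)
```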

Resources