Create a fork on a sklearn transformer pipeline to allow data to pass through - scikit-learn

I have an sklearn pipeline that looks like this
You'll notice the duplicate step, features_to_vectorize, in the left and right side of the FeatureUnion. features_to_vectorize is the result of applying a DictVectorizer to a pandas DataFrame column. I'd like to then take features_to_vectorize and concatenate it with a transformation on itself. My current setup duplicates the transformation because I'm not sure how to create a fork at features_to_vectorize where I can create a passthrough for that data but also apply a transformation on that data and later FeatureUnion it. Any ideas how to better set this up to avoid duplicate computation? Thanks
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

sum_along_columns = FunctionTransformer(np.sum, kw_args={"axis": 1})
col_trans = ColumnTransformer([("features_to_vectorize", DictVectorizer(), "col")])
out = FeatureUnion(
    [
        ("pipeline", Pipeline([("d_vec", col_trans), ("sum", sum_along_columns)])),
        ("column_transformer", col_trans),
    ]
)
Ideally it should look like
SOLUTION:
col_trans = ColumnTransformer([("features_to_vectorize", DictVectorizer(), "col")])
ident = FunctionTransformer()
fts = FeatureUnion([("sum", SumColumns()), ("ident", ident)])
out = Pipeline([("dv", col_trans), ("sum_and_pass", fts)])
where SumColumns is a simple transformer that applies np.sum(axis=1).reshape(-1, 1); the reshape keeps the output 2-D, as scikit-learn requires.
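For reference, here is a minimal sketch of such a SumColumns transformer (the solution above only describes it, so the exact implementation below is an assumption):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SumColumns(BaseEstimator, TransformerMixin):
    """Sum each row into a single column, keeping the output 2-D."""
    def fit(self, X, y=None):
        return self  # stateless, nothing to learn
    def transform(self, X):
        # np.asarray handles both dense arrays and the sparse output of DictVectorizer
        return np.asarray(X.sum(axis=1)).reshape(-1, 1)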

ColumnTransformer can send the same column to multiple transformers, so this should do:
sum_along_columns = FunctionTransformer(np.sum, kw_args={"axis": 1})
col_trans = ColumnTransformer([("features_to_vectorize", DictVectorizer(), "col")])
split = ColumnTransformer([
    ('sum', sum_along_columns, [0]),
    ('ident', 'passthrough', [0]),
])
out = Pipeline([
    ('vectorize', col_trans),
    ('split', split),
])
One issue is that after the 'vectorize' step in the pipeline you have an array, not a DataFrame, so we can't select by feature name in split; hence the positional [0].
You could also stick to FeatureUnion and implement your own simple identity transformer, e.g. using FunctionTransformer again, instead of using the ColumnTransformer's 'passthrough'.

Related

How to access the get_feature_names_out() or columns in a specific transformer in SKlearn's ColumnTransformer?

Consider the following ColumnTransformer code:
tree_preprocessor = ColumnTransformer(
    [
        (
            "categorical",
            OrdinalEncoder(),
            ["region", "trade", "shipping", "payment", "channel", "device"],
        ),
        (
            "numeric",
            "passthrough",
            ["bag_total", "hero_total", "acc_ind"],
        ),
        # Custom function for array hot encoding
        (
            "array_one_hot_encode",
            MultiHotEncoder(df=df_final),
            ["model_list"],
        ),
    ],
    remainder="drop",
)
How can I access the features of the categorical transformer?
What I'm trying to do is create a DataFrame post fit_transform().
SKLearn does not have get_feature_names_out() for all its transformers, so I would like to loop through each transformer in the ColumnTransformer and pull the features post fit (if possible). Otherwise, pull the input feature names out.
For example, because OrdinalEncoder() does not have get_feature_names_out(), I would like to access the original feature list (["region","trade","shipping","payment","channel","device"])
However, for MultiHotEncoder, I built a custom get_feature_names_out() function, so it should definitely work and will return a list of columns post transformation.
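One way to do what is described above, as a rough sketch: loop over the fitted ColumnTransformer, use get_feature_names_out() where it exists, and fall back to the input columns otherwise. The helper name collect_feature_names is ours, not scikit-learn's, and this assumes tree_preprocessor has already been fitted.

def collect_feature_names(fitted_ct):
    # Gather output feature names per transformer, falling back to the input columns
    names = []
    for name, transformer, columns in fitted_ct.transformers_:
        if transformer == "drop":
            continue
        if transformer == "passthrough":
            names.extend(columns)
        elif hasattr(transformer, "get_feature_names_out"):
            names.extend(transformer.get_feature_names_out(columns))
        else:
            names.extend(columns)  # e.g. an encoder without get_feature_names_out()
    return names

# e.g. building a DataFrame after fit_transform():
# X_out = tree_preprocessor.fit_transform(df_final)
# df_out = pd.DataFrame(X_out, columns=collect_feature_names(tree_preprocessor))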

Saving LinearRegression (from sklearn.linear_model) coefficients in a list

I'm stuck on a problem that should be very simple. I'm running four simple linear regressions (changing only the x variables) and I need to store both the intercept and the slope coefficient in a list, for all regressions.
I thought it would be very easy, but it seems I'm not good at handling lists. The list ends up holding the same coefficients for all four models.
This is my code:
from sklearn.linear_model import LinearRegression
variables = ['Number_of_likes','Number_of_comments','Number_of_followers','Number_of_repplies']
models = [None] * 4
lm = LinearRegression()
#Fit regressions
models[0] = lm.fit(X[[variables[0]]],y)
models[1] = lm.fit(X[[variables[1]]],y)
models[2] = lm.fit(X[[variables[2]]],y)
models[3] = lm.fit(X[[variables[3]]],y)
When I look at "models", it seems to be storing the results only for the last regression, in all four slots.
Hope I explained my problem well.
lm.fit() will modify the existing instance, not create a new copy of it. Also, the models list will store these instances by reference, which yields the behavior you are seeing.
To solve this, you need to create a new LinearRegression every time you want to fit it to a new input, not re-use the same old model. For example:
models = []  # just an empty list; we will append our models to it one by one
for var in variables:
    lm = LinearRegression()  # create a new object
    lm.fit(X[[var]], y)      # fit it
    models.append(lm)        # add it to the list
Or, a more faithful version to your original code would be (using sklearn.base.clone):
from sklearn.base import clone # to create a new copy of the lm object
lm = LinearRegression()
#Fit regressions
models[0] = clone(lm).fit(X[[variables[0]]],y)
models[1] = clone(lm).fit(X[[variables[1]]],y)
models[2] = clone(lm).fit(X[[variables[2]]],y)
models[3] = clone(lm).fit(X[[variables[3]]],y)
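From there, pulling the intercept and slope out of each fitted model into plain lists is straightforward (a small sketch of what the question seems to be after):

intercepts = [m.intercept_ for m in models]
slopes = [m.coef_[0] for m in models]  # each model has a single x variable, so coef_ has one entry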

Obtaining Same Result as Sklearn Pipeline without Using It

How would one correctly standardize the data without using a pipeline? I just want to make sure my code is correct and that there is no data leakage.
So if I standardize the entire dataset once, right at the beginning of my project, and then go on to try different CV tests with different ML algorithms, will that be the same as creating an Sklearn Pipeline and performing the same standardization in conjunction with each ML algorithm?
y = df['y']
X = df.drop(columns=['y', 'Date'])
scaler = preprocessing.StandardScaler().fit(X)
X_transformed = scaler.transform(X)
clf1 = DecisionTreeClassifier()
clf1.fit(X_transformed, y)
clf2 = SVC()
clf2.fit(X_transformed, y)
####Is this the same as the below code?####
pipeline1 = []
pipeline1.append(('standardize', StandardScaler()))
pipeline1.append(('clf1', DecisionTreeClassifier()))
pipeline1 = Pipeline(pipeline1)
pipeline1.fit(X_transformed, y)

pipeline2 = []
pipeline2.append(('standardize', StandardScaler()))
pipeline2.append(('clf2', SVC()))
pipeline2 = Pipeline(pipeline2)
pipeline2.fit(X_transformed, y)
Why would anybody choose the latter other than personal preference?
They are the same. It is possible that you may want one or the other from a maintainability standpoint, but the outcome of a test set prediction will be identical.
Edit: Note that this is only the case because the StandardScaler is idempotent. It is strange that you fit the pipeline on the data that has already been scaled...
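A quick way to see the idempotence point (just a sketch, assuming X holds numeric features): standardizing already-standardized data changes nothing, because it already has zero mean and unit variance.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_once = StandardScaler().fit_transform(X)
X_twice = StandardScaler().fit_transform(X_once)
print(np.allclose(X_once, X_twice))  # True: scaling a second time has no effect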

What to pass to clf.predict()?

I started playing with Decision Trees lately and I wanted to train my own simple model with some manufactured data. I wanted to use this model to predict some further mock data, just to get a feel of how it works, but then I got stuck. Once your model is trained, how do you pass data to predict()?
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Docs state:
clf.predict(X)
Parameters:
X : array-like or sparse matrix of shape = [n_samples, n_features]
But when trying to pass np.array, np.ndarray, list, tuple or DataFrame it just throws an error. Can you help me understand why please?
Code below:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
import graphviz
import pandas as pd
import numpy as np
import random
from sklearn import tree
pd.options.display.max_seq_items=5000
pd.options.display.max_rows=20
pd.options.display.max_columns=150
length = 50000
miles_commuting = [random.choice([2,3,4,5,7,10,20,25,30]) for x in range(length)]
salary = [random.choice([1300,1600,1800,1900,2300,2500,2700,3300,4000]) for x in range(length)]
full_time = [random.choice([1,0,1,1,0,1]) for x in range(length)]
DataFrame = pd.DataFrame({'CommuteInMiles':miles_commuting,'Salary':salary,'FullTimeEmployee':full_time})
DataFrame['Moving'] = np.where((DataFrame.CommuteInMiles > 20) & (DataFrame.Salary > 2000) & (DataFrame.FullTimeEmployee == 1),1,0)
DataFrame['TargetLabel'] = np.where((DataFrame.Moving == 1),'Considering move','Not moving')
target = DataFrame.loc[:,'Moving']
data = DataFrame.loc[:,['CommuteInMiles','Salary','FullTimeEmployee']]
target_names = DataFrame.TargetLabel
features = data.columns.values
clf = tree.DecisionTreeClassifier()
clf = clf.fit(data, target)
clf.predict(?????) #### <===== What should go here?
clf.predict([30,4000,1])
ValueError: Expected 2D array, got 1D array instead:
array=[3.e+01 4.e+03 1.e+00].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
clf.predict(np.array(30,4000,1))
ValueError: only 2 non-keyword arguments accepted
Where is your "mock data" that you want to predict?
Your data should have the same shape as the data you used when calling fit(). From the code above, I see that your X has three columns: ['CommuteInMiles','Salary','FullTimeEmployee']. You need that many columns in your prediction data; the number of rows can be arbitrary.
Now when you do
clf.predict([30,4000,1])
The model cannot tell whether these are the columns of a single row or data from different rows.
So you need to convert it into a 2-D array, where each inner array represents a single row.
Do this:
clf.predict([[30,4000,1]]) #<== Observe the two square brackets
You can have multiple rows to be predicted, each in an inner list. Something like this:
X_test = [[30, 4000, 1],
          [35, 15000, 0],
          [40, 2000, 1]]
clf.predict(X_test)
Now, as for your last error, clf.predict(np.array(30,4000,1)): this has nothing to do with predict(). You are using np.array() incorrectly.
According to the documentation, the signature of np.array is:
(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
Apart from the first (object), all the others are keyword arguments, so they need to be used as such. But when you write np.array(30,4000,1), each value is treated as the input to a separate parameter: object=30, dtype=4000, copy=1. That is not allowed, hence the error. If you want to make a numpy array from a list, you need to pass the list itself.
Like this: np.array([30,4000,1])
Now this will be considered correctly as input to object param.
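For completeness, since the question also mentions passing a DataFrame: a small sketch of predicting from a pandas DataFrame with the same three columns used for fitting (the values here are made up):

X_new = pd.DataFrame({'CommuteInMiles': [30, 40],
                      'Salary': [4000, 2000],
                      'FullTimeEmployee': [1, 1]})
clf.predict(X_new)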

scikit-learn: Is there a way to provide an object as an input to predict function of a classifier?

I am planning to use an SGDClassifier in production. The idea is to train the classifier on some training data, use cPickle to dump it to a .pkl file and reuse it later in a script. However, there are certain high-cardinality fields which are categorical in nature and get translated to a one-hot matrix representation, which creates around 5000 features. Now the input that I get for predict will only have one of these features set and all the rest will be zeroes. It will of course also include the other numerical features. From the docs, it appears that the predict function expects an array of arrays as input. Is there any way I can transform my input to the format expected by the predict function without having to store the fields every time I train the model?
Update
So, let us say my input contains 3 fields:
{
    rate: 10,              // numeric
    flagged: 0,            // binary
    host: 'somehost.com'   // keeping this categorical
}
host can have around 5000 different values. I loaded the data into a pandas DataFrame and used the get_dummies function to transform the host field into around 5000 new binary fields.
Then I trained my model and stored it using cPickle.
Now, when I need to use the predict function, for the input I only have 3 fields (shown above). However, as per my understanding, predict will expect an array of vectors, and each vector is supposed to have those 5000 fields.
For the entry that I need to predict, the only one of those host fields I know is the one given by the entry's own host value.
For example, if my input is
{
    rate: 5,
    flagged: 1,
    host: 'new_host.com'
}
I know that the fields expected by the predict should be:
{
    rate: 5,
    flagged: 1,
    new_host: 1
}
But if I translate it to vector format, I don't know at which index to place the new_host field. Also, I don't know in advance what the other hosts are (unless I store them somewhere during the training phase).
I hope I am making some sense. Let me know if I am doing it the wrong way.
I don't know which index to place the new_host field
A good approach that has worked for me is to build a pipeline which you then use for training and prediction. This way you do not have to concern yourself with the column index of whatever output is produced by your transformation:
# in training
pipl = Pipeline(steps=[('binarizer', LabelBinarizer()),
                       ('clf', SGDClassifier())])
model = pipl.fit(X, Y)
pickle.dump(model, mf)

# in production
model = pickle.load(mf)
y = model.predict(X)
As the X, Y inputs you need to pass array-like objects. Make sure the input has the same structure for both training and prediction, e.g.
X = [[data.get('rate'), data.get('flagged'), data.get('host')]]
Y = [[y-cols]] # your example doesn't specify what is Y in your data
More flexible: Pandas DataFrame + Pipeline
What also works nicely is to use a pandas DataFrame in combination with sklearn-pandas, as it allows you to apply different transformations to different columns. E.g.
df = pd.DataFrame.from_dict(data)
mapper = DataFrameMapper([
    ('host', sklearn.preprocessing.LabelBinarizer()),
    ('rate', sklearn.preprocessing.StandardScaler())
])
pipl = Pipeline(steps=[('mapper', mapper),
                       ('clf', SGDClassifier())])
X = df[x-cols]
y = df[y-col(s)]
pipl.fit(X, y)
Note that x-cols and y-col(s) are the list of the feature and target columns respectively.
You should use a scikit-learn transformer instead of get_dummies. In this case, LabelBinarizer makes sense. Seeing as LabelBinarizer doesn't work in a pipeline, this is one way to do what you want:
binarizer = LabelBinarizer()

# fitting LabelBinarizer means it remembers all the categories it's seen
one_hot_data = binarizer.fit_transform(X_train[:, categorical_col])

# replace the string column with its one-hot representation
X_train = np.concatenate([np.delete(X_train, categorical_col, axis=1),
                          one_hot_data], axis=1)

clf = SGDClassifier()
clf.fit(X_train, y)

pickle.dump({'clf': clf, 'binarizer': binarizer}, f)
then at prediction time:
estimators = pickle.load(f)
clf = estimators['clf']
binarizer = estimators['binarizer']

one_hot_data = binarizer.transform(X_test[:, categorical_col])
X_test = np.concatenate([np.delete(X_test, categorical_col, axis=1),
                         one_hot_data], axis=1)
clf.predict(X_test)
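As a side note that is not taken from the answers above: in current scikit-learn the same idea is usually expressed with OneHotEncoder inside a ColumnTransformer, which also remembers the categories seen at fit time and can simply ignore hosts it never saw. A rough sketch, where X_train_df and y_train are assumed to be a DataFrame with rate, flagged and host columns and its target:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
import pickle

pipe = Pipeline([
    ('encode', ColumnTransformer(
        [('host', OneHotEncoder(handle_unknown='ignore'), ['host'])],
        remainder='passthrough')),  # rate and flagged pass through untouched
    ('clf', SGDClassifier()),
])
pipe.fit(X_train_df, y_train)

with open('model.pkl', 'wb') as fh:
    pickle.dump(pipe, fh)

# later, in production, a row with an unseen host still predicts fine:
# pipe.predict(pd.DataFrame([{'rate': 5, 'flagged': 1, 'host': 'new_host.com'}]))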
