How to predict unseen data? - scikit-learn

Hi, I am practicing ML models and facing an issue while trying to predict on unseen data.
The error occurs while one-hot encoding the categorical data.
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder_x_1 = LabelEncoder() #will encode country
X[:,1] = labelencoder_x_1.fit_transform(X[:,1])
labelencoder_x_2 = LabelEncoder() #will encode Gender
X[:,2] = labelencoder_x_2.fit_transform(X[:,2])
onehotencoder_x = OneHotEncoder(categorical_features=[1])
X= onehotencoder_x.fit_transform(X).toarray()
X = X[:,1:]
My X has 11 columns, and columns 2 and 3 are categorical (Country and Gender).
The model trains fine, but when I try to test it against a random input it fails at the one-hot encoding step:
input = [[619], ['France'], ['Male'], [42], [2], [0.0], [1], [1], [1],[101348.88]]
input[1] = labelencoder_x_1.fit_transform(input[1])
input[2] = labelencoder_x_2.fit_transform(input[2])
input= onehotencoder_x.fit_transform(input).toarray()
Error:
C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:451:
DeprecationWarning: The 'categorical_features' keyword is deprecated in version 0.20
and will be removed in 0.22. You can use the ColumnTransformer instead.
"use the ColumnTransformer instead.", DeprecationWarning)
Traceback (most recent call last):
File "<ipython-input-44-44a43edf17aa>", line 1, in <module>
input= onehotencoder_x.fit_transform(input).toarray()
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 624, in
fit_transform
self._handle_deprecations(X)
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py", line 453, in
_handle_deprecations
n_features = X.shape[1]
AttributeError: 'list' object has no attribute 'shape'

I believe this is because you have nested lists.
You should flatten your input list and use that for the prediction.
input[1] = labelencoder_x_1.fit_transform(input[1])
input[2] = labelencoder_x_2.fit_transform(input[2])
input = [item for sublist in input for item in sublist]
input= onehotencoder_x.fit_transform(input).toarray()
If you pass a nested list, each inner list is treated as a separate sample that needs to go through fit_transform, but since each one holds a single element, it does not match the shape fit_transform expects, which is [1, 10] (1 row, 10 columns).
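For reference, here is a minimal sketch of that idea (not part of the original answer): build the row as a flat list, give it the 2-D shape (1, 10) with numpy, and reuse the encoders already fitted on the training data (transform rather than fit_transform) so the unseen row is encoded consistently with the training set. The values below simply restate the ones from the question.
import numpy as np
new_row = [619, 'France', 'Male', 42, 2, 0.0, 1, 1, 1, 101348.88]
new_row = np.array(new_row, dtype=object).reshape(1, -1)   # shape (1, 10)
# reuse the encoders fitted on the training data instead of refitting them
new_row[:, 1] = labelencoder_x_1.transform(new_row[:, 1])
new_row[:, 2] = labelencoder_x_2.transform(new_row[:, 2])
new_row = onehotencoder_x.transform(new_row).toarray()
new_row = new_row[:, 1:]   # drop the first dummy column, as was done for X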

Related

Incompatible row dimensions when using passthrough in GridSearch over sklearn Pipeline with FeatureUnion

I am trying to do grid search over a sklearn pipeline that uses a custom transformer in a pipeline with FeatureUnion. It works fine when the pipeline uses the custom transformer class in FeatureUnion; however, it fails when the custom class is ignored in the pipeline by setting passthrough in the grid search parameters.
The full pipeline is defined as follows:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, FeatureUnion
ngram_vectorizer = Pipeline([
    ("vectorizer", CountVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
    ("tfidf", TfidfTransformer())
])
pipe_full = Pipeline(
    [
        ("features", FeatureUnion(
            [
                ("ngrams", ngram_vectorizer),
                ("lengths", TextLengthExtractor())
            ]
        )),
        ("classifier", MultinomialNB())
    ]
)
The custom transformer class TextLengthExtractor simply computes the number of characters from an input string:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        string_lengths = np.array([len(doc) for doc in X])
        return string_lengths.reshape(-1, 1)
The tuning parameters for grid search are defined through a dictionary params. Importantly, the parameters for the custom TextLengthExtractor contain the passthrough option to ignore the entire features__lengths step from the pipeline (see also the sklearn's documentation on pipelines):
params = {
    "features__lengths": [TextLengthExtractor(), "passthrough"],
    "features__ngrams__vectorizer__ngram_range": [(1, 3), (2, 6)],
}
When the pipeline is fit on the following dummy data
X_train_dummy = ["a", "ab", "a bc", "aaaaa", "b ab cc b", "ba", "baba", "cc bb aa", "c", "bca"]
y_train_dummy = [1,0,1, 1, 0, 1, 0, 1, 0, 0]
pipe_full.fit(X_train_dummy, y_train_dummy)
it can be seen that the lengths step of the FeatureUnion pipeline works as expected:
pipe_full["features"].get_params()["lengths"].transform(X_train_dummy)
# gives the following output of shape (10,1)
# array([[1], [2], [4], [5], [9], [2], [4], [8], [1], [3]])
However - and now comes the problem - when grid search is performed as follows:
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipe_full, params, cv=5, n_jobs=-1, verbose=10)
grid_search.fit(X_train_dummy, y_train_dummy)
all fits that ignore the lengths step (as defined by the passthrough option from params["features__lengths"]) throw the following error:
5 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.
Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 378, in fit
Xt = self._fit(X, y, **fit_params_steps)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 336, in _fit
X, fitted_transformer = fit_transform_one_cached(
File "C:\dev\NameClassification\venv\lib\site-packages\joblib\memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 870, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 1162, in fit_transform
return self._hstack(Xs)
File "C:\dev\NameClassification\venv\lib\site-packages\sklearn\pipeline.py", line 1216, in _hstack
Xs = sparse.hstack(Xs).tocsr()
File "C:\dev\NameClassification\venv\lib\site-packages\scipy\sparse\_construct.py", line 532, in hstack
return bmat([blocks], format=format, dtype=dtype)
File "C:\dev\NameClassification\venv\lib\site-packages\scipy\sparse\_construct.py", line 665, in bmat
raise ValueError(msg)
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 8.
I do understand that the ngrams and lengths steps in the FeatureUnion must produce feature matrices with identical row dimensions, where the number of rows must equal the number of samples in the respective split. However, I have no idea how to control the shape of the matrices when the lengths part of the FeatureUnion is ignored via the passthrough option in the grid search params.
I have not found a solution to the problem on SE or any other sklearn-related resource. Does anyone have an idea how to solve this issue?
I think I found the solution to the problem: To ignore an individual step in a FeatureUnion, the string drop rather than passthrough must be used. According to sklearn's documentation of FeatureUnion:
Parameters of the transformers may be set using its name and the parameter name separated by a '__'. A transformer may be replaced entirely by setting the parameter with its name to another transformer, removed by setting to 'drop' or disabled by setting to 'passthrough' (features are passed without transformation).
An example of dropping an entire transformer in FeatureUnion is also shown in sklearn's user guide on pipelines.
In conclusion, to solve my problem, I had to replace passthrough with drop in the grid search parameter dictionary as follows
Change from
params = {
    "features__lengths": [TextLengthExtractor(), "passthrough"],
    "features__ngrams__vectorizer__ngram_range": [(1, 3), (2, 6)],
}
to
params = {
    "features__lengths": [TextLengthExtractor(), "drop"],
    "features__ngrams__vectorizer__ngram_range": [(1, 3), (2, 6)],
}
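With that change in place, the same grid search call from above should run without the failed fits. A quick sketch, reusing the objects defined earlier:
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(pipe_full, params, cv=5, n_jobs=-1, verbose=10)
grid_search.fit(X_train_dummy, y_train_dummy)
print(grid_search.best_params_)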

Getting unexpected IndexError when creating a dataframe

I am trying to execute the below code:
heart_df = pd.read_csv(r"location")
X = heart_df.iloc[:, :-1].values
y = heart_df.iloc[:, 11].values
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]].values() #this is line 17
cat_cols = new_df.copy()
and getting IndexError like:
File "***location***", line 17, in <module>
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]].values()
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
As far as I know, this IndexError appears when float numbers are used as indices, but I don't understand why it occurs in this case.
Here, by creating new_df and then cat_cols, I want to separate the categorical columns to apply OneHotEncoding at a later stage.
The dataset is here: https://www.kaggle.com/fedesoriano/heart-failure-prediction.
The error comes from:
X = heart_df.iloc[:, :-1].values
The .values part converts the data frame to a numpy array, and a numpy array has no column labels, so it cannot be indexed with a list of column names.
So we can write the same thing without .values:
X = heart_df.iloc[:, :-1]
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]]
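From there, the categorical columns can be one-hot encoded at a later stage, for example with a sketch along these lines (the sparse_output argument exists only in newer scikit-learn versions; older ones use sparse instead):
from sklearn.preprocessing import OneHotEncoder

# Sketch: encode the categorical columns kept in new_df above
encoder = OneHotEncoder(sparse_output=False)   # on older scikit-learn versions use sparse=False
cat_cols_encoded = encoder.fit_transform(new_df)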

How to transform the data and calculate the TFIDF value?

My data format is:
datas = {[1,2,4,6,7],[2,3],[5,6,8,3,5],[2],[93,23,4,5,11,3,5,2],...}
Each element in datas is a sentence, and each number is a word. I want to get the TFIDF value for each number. How can I do this with sklearn or in some other way?
My code:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
datas = {[1,2,4,6,7],[2,3],[5,6,8,3,5],[2],[93,23,4,5,11,3,5,2]}
vectorizer=CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(datas))
print(tfidf)
My code doesn't work. Error:
Traceback (most recent call last):
File "C:/Users/zhuowei/Desktop/OpenNE-master/OpenNE-master/src/openne/buildTree.py", line 103, in <module>
X = vectorizer.fit_transform(datas)
File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 792, in _count_vocab
for feature in analyze(doc):
File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 266, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\zhuowei\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 232, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'int' object has no attribute 'lower'
You are using CountVectorizer which requires an iterable of strings. Something like:
datas = ['First sentence',
'Second sentence', ...
...
'Yet another sentence']
But your data is a list of lists, which is why the error occurs. You need to turn the inner lists into strings for the CountVectorizer to work. You can do this:
datas = [' '.join(map(str, x)) for x in datas]
This will result in datas like this:
['1 2 4 6 7', '2 3', '5 6 8 3 5', '2', '93 23 4 5 11 3 5 2']
Now this form is consumable by CountVectorizer. But even then you will not get proper results, because of the default token_pattern in CountVectorizer:
token_pattern : string, default = ’(?u)\b\w\w+\b’
Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
In order for it to consider your numbers as words, you will need to change it so that it can accept single letters as words by doing this:
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
Then it should work. But note that your numbers are now represented as strings.
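Putting the pieces together, a minimal end-to-end sketch (assuming datas is a list of lists rather than a set) could look like this:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

datas = [[1, 2, 4, 6, 7], [2, 3], [5, 6, 8, 3, 5], [2], [93, 23, 4, 5, 11, 3, 5, 2]]
datas = [' '.join(map(str, x)) for x in datas]              # '1 2 4 6 7', '2 3', ...

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")  # accept single-character tokens
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(datas))
print(tfidf.toarray())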

numpy code works in REPL, script says type error

Copying and pasting this code into the python3 REPL works, but when I run it as a script, I get a TypeError.
"""Softmax."""
scores = [3.0, 1.0, 0.2]
import numpy as np
from math import e
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    results = []
    x = np.transpose(x)
    for j in range(len(x)):
        exps = [np.exp(s) for s in x[j]]
        _sum = np.sum(np.exp(x[j]))
        softmax = [i / _sum for i in exps]
        results.append(softmax)
    final = np.vstack(results)
    return np.transpose(final)
    # pass # TODO: Compute and return softmax(x)
print(softmax(scores))
# Plot softmax curves
import matplotlib.pyplot as plt
x = np.arange(-2.0, 6.0, 0.1)
scores = np.vstack([x, np.ones_like(x), 0.2 * np.ones_like(x)])
plt.plot(x, softmax(scores).T, linewidth=2)
plt.show()
The error I get running the script via CLI is the following:
bash$ python3 softmax.py
Traceback (most recent call last):
File "softmax.py", line 22, in <module>
print(softmax(scores))
File "softmax.py", line 13, in softmax
exps = [np.exp(s) for s in x[j]]
TypeError: 'numpy.float64' object is not iterable
This kind of crap makes me so nervous about running interpreted code in production with libraries like these, seriously unreliable and undefined behaviour is totally unacceptable IMO.
At the top of your script, you define
scores = [3.0, 1.0, 0.2]
This is the argument in your first call of softmax(scores). When converted to a numpy array, scores is a 1-d array with shape (3,).
You pass scores into the function, and then it is converted to a numpy array by the call
x = np.transpose(x)
However, it is still 1-d, with shape (3,). The transpose function swaps dimensions, but it does not add a dimension to a 1-d array. In effect, transpose is a "no-op" when applied to a 1-d array.
Then, in the loop that follows, x[j] is a scalar of type numpy.float64, so it does not make sense to write [np.exp(s) for s in x[j]]. x[j] is a scalar, not a sequence, so you can't iterate over it.
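A quick demonstration of that point (not from the original answer):
import numpy as np

a = np.array([3.0, 1.0, 0.2])
print(a.shape)                # (3,)
print(np.transpose(a).shape)  # (3,) -- still 1-d; no dimension was added
print(a[0])                   # 3.0, a scalar you cannot iterate over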
In the bottom part of your script, you redefine scores as
x = np.arange(-2.0, 6.0, 0.1)
scores = np.vstack([x, np.ones_like(x), 0.2 * np.ones_like(x)])
Now scores is a 2-d array (scores.shape is (3, 80)), so you don't get an error when you call softmax(scores).
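One way to fix it, as a sketch (not part of the original answer), is to compute the softmax along axis 0 with plain array operations, which handles both the 1-d scores list and the 2-d scores array:
import numpy as np

def softmax(x):
    """Softmax over the first axis; works for 1-d and 2-d inputs alike."""
    x = np.asarray(x, dtype=float)
    exps = np.exp(x - np.max(x, axis=0))   # subtract the max for numerical stability
    return exps / np.sum(exps, axis=0)

print(softmax([3.0, 1.0, 0.2]))            # no error for the 1-d case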

Apply MinMaxScaler() on a pandas column

I am trying to use the sklearn MinMaxScaler to rescale a pandas column as below:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
y = scaler.fit(df['total_amount'])
But got the following errors:
Traceback (most recent call last):
File "/Users/edamame/workspace/git/my-analysis/experiments/my_seq.py", line 54, in <module>
y = scaler.fit(df['total_amount'])
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 308, in fit
return self.partial_fit(X, y)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/utils/validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[3.180000e+00 2.937450e+03 6.023850e+03 2.216292e+04 1.074589e+04
:
0.000000e+00 0.000000e+00 9.000000e+01 1.260000e+03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Any idea what was wrong?
The input to MinMaxScaler needs to be array-like, with shape [n_samples, n_features]. So you can apply it on the column as a dataframe rather than a series (using double square brackets instead of single):
y = scaler.fit(df[['total_amount']])
Though from your description, it sounds like you want fit_transform rather than just fit (but I could be wrong):
y = scaler.fit_transform(df[['total_amount']])
A little more explanation:
If your dataframe had 100 rows, consider the difference in shape when you transform a column to an array:
>>> np.array(df[['total_amount']]).shape
(100, 1)
>>> np.array(df['total_amount']).shape
(100,)
The first returns a shape that matches [n_samples, n_features] (as required by MinMaxScaler), whereas the second does not.
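For example, to scale just that one column and write the result back into the dataframe (a sketch; the column name is taken from the question):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['total_amount_scaled'] = scaler.fit_transform(df[['total_amount']]).ravel()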
Alternatively, try it this way:
import pandas as pd
from sklearn import preprocessing
x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
