How to combine features with different dimensions output using scikit-learn - python-3.x

I am using scikit-learn with Pipeline and FeatureUnion to extract features from different inputs. Each sample (instance) in my dataset refers to a document, and the documents have different lengths. My goal is to compute the top TF-IDF values for each document independently, but I keep getting this error message:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 2000.
2000 is the size of the training data.
This is the main code:
book_summary = Pipeline([
    ('selector', ItemSelector(key='book')),
    ('tfidf', TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True))
])
book_contents = Pipeline([('selector3', book_content_count())])
ppl = Pipeline([
    ('feats', FeatureUnion([
        ('book_summary', book_summary),
        ('book_contents', book_contents)])),
    ('clf', SVC(kernel='linear', class_weight='balanced'))  # classifier with cross fold 5
])
I wrote two classes to handle each pipeline function. My problem is with the book_contents pipeline, which processes each sample and returns a TF-IDF matrix for each book independently.
class book_content_count():
    def count_contents2(self, bookid):
        book = open('C:/TheCorpus/' + str(int(bookid)) + '_book.csv', 'r')
        book_data = pd.read_csv(book, header=0, delimiter=',', encoding='latin1', error_bad_lines=False, dtype=str)
        corpus = (str([book_data['text']]).strip('[]'))
        return corpus
    def transform(self, data_dict, y=None):
        data_dict['bookid']  # from here take the name
        text = data_dict['bookid'].apply(self.count_contents2)
        vec_pipe = Pipeline([('vec', TfidfVectorizer(min_df=1, lowercase=False, ngram_range=(1, 1), use_idf=True, stop_words='english'))])
        Xtr = vec_pipe.fit_transform(text)
        return Xtr
    def fit(self, x, y=None):
        return self
Sample of data (example):
title Summary bookid
The beauty and the beast is a traditional fairy tale... 10
ocean at the end of the lane is a 2013 novel by British 11
Each bookid refers to a text file with the actual contents of that book.
I have tried the toarray and reshape functions, but with no luck. Any idea how to solve this issue?
Thanks

You can use Neuraxle's Feature Union with a custom joiner that you would need to code yourself. The joiner is a class passed to Neuraxle's FeatureUnion to merge results together in the way you expected.
1. Import Neuraxle's classes.
from neuraxle.base import NonFittableMixin, BaseStep
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from neuraxle.union import FeatureUnion
2. Define your custom class by inheriting from BaseStep:
class BookContentCount(BaseStep):
    def transform(self, data_dict, y=None):
        transformed = do_things(...)  # be sure to use SKLearnWrapper if you wrap sklearn items.
        return transformed
    def fit(self, x, y=None):
        return self
3. Create a joiner to join the results of the feature union the way you wish:
import numpy as np
class CustomJoiner(NonFittableMixin, BaseStep):
    def __init__(self):
        BaseStep.__init__(self)
        NonFittableMixin.__init__(self)
    # def fit: is inherited from `NonFittableMixin` and simply returns self.
    def transform(self, data_inputs):
        # TODO: insert your own concatenation method here.
        result = np.concatenate(data_inputs, axis=-1)
        return result
4. Finally create your pipeline by passing the joiner to the FeatureUnion:
book_summary = Pipeline([
    ItemSelector(key='book'),
    TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, lowercase=True, stop_words=my_stopword_list, sublinear_tf=True)
])
p = Pipeline([
    FeatureUnion([
        book_summary,
        BookContentCount()
    ],
        joiner=CustomJoiner()
    ),
    SVC(kernel='linear', class_weight='balanced')
])
Note: to convert your Neuraxle pipeline back into a scikit-learn pipeline, you can call p = p.tosklearn().
To learn more on Neuraxle:
https://github.com/Neuraxio/Neuraxle
More examples from the documentation:
https://www.neuraxle.org/stable/examples/index.html
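For comparison, here is a minimal sketch in plain scikit-learn of the constraint behind the "incompatible row dimensions" error: FeatureUnion stacks its blocks side by side, so each transformer has to return exactly one row per input sample. The TextLength transformer and the toy documents are hypothetical stand-ins and not part of the original code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class TextLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one row per document, one column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Shape (n_samples, 1): the row count must match the other blocks.
        return np.array([[len(doc)] for doc in X], dtype=float)


docs = [
    "a traditional fairy tale",
    "a 2013 novel by a British author",
    "short",
]
union = FeatureUnion([("tfidf", TfidfVectorizer()), ("length", TextLength())])
features = union.fit_transform(docs)
print(features.shape)  # (number of documents, tf-idf columns + 1)
```

If a transformer returns a single row for the whole batch (for example, one joined string vectorized once), the horizontal stack fails with the row-dimension error shown in the question.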

Related

Sklearn Pipeline: One feature automatically missed out

I created a custom classifier (a dummy classifier). Its definition is below. I also added some print statements and global variables to capture values.
class FeaturePassThroughClassifier(ClassifierMixin):
    def __init__(self):
        pass
    def fit(self, X, y):
        global test_arr1
        self.classes_ = np.unique(y)
        test_arr1 = X
        print("1:", X.shape)
        return self
    def predict(self, X):
        global test_arr2
        test_arr2 = X
        print("2:", X.shape)
        return X
    def predict_proba(self, X):
        global test_arr3
        test_arr3 = X
        print("3:", X.shape)
        return X
Below is the StackingClassifier definition, where the custom classifier defined above is one of the base classifiers. There are three more base classifiers (these are fitted estimators). The goal is to pass the training-set input variables through as is (via the custom classifier) together with the predictions from base_classifier2, base_classifier3 and base_classifier4. These features act as input to the meta classifier.
model = StackingClassifier(estimators=[
        ('select_features', Pipeline(steps=[("model_feature_selector", ColumnTransformer([('feature_list', 'passthrough', X_train.columns)])),
                                            ('base(dummy)_classifier1', FeaturePassThroughClassifier())])),
        ('base_classifier2', base_classifier2),
        ('base_classifier3', base_classifier3),
        ('base_classifier4', base_classifier4)
    ],
    final_estimator=Pipeline(memory=None,
        steps=[
            ('save_base_estimator_output_data', FunctionTransformer(save_base_estimator_output_data, validate=False)), ('final_model', RandomForestClassifier())
        ], verbose=True), passthrough=False, stack_method='predict_proba')
Below is the output on fitting the model. There are 230 variables:
Here is the problem: there are 230 variables, but the custom classifier's output shows only 229, which is strange. The print statements above clearly show that 230 variables get passed through the custom classifier.
I need to use stack_method = "predict_proba". I am not sure what's going wrong here. The code works fine when stack_method = "predict".
Since this is a binary classification problem, the stacking classifier expects two probability columns in each estimator's output matrix: one for class label "1" and another for "0".
It drops one of these columns because both are not required, so 230 columns get reduced to 229. Add a dummy column to solve your problem.
In the Notes section of the documentation:
When predict_proba is used by each estimator (i.e. most of the time for stack_method='auto' or specifically for stack_method='predict_proba'), The first column predicted by each estimator will be dropped in the case of a binary classification problem.
Here's the code that eliminates the first column.
You could add a sacrificial first column in your custom estimator's predict_proba, or switch to decision_function (which will cause differences depending on your real base estimators), or use the passthrough option instead of the custom estimator (doing feature selection in the final_estimator object instead).
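The column-dropping behavior quoted from the documentation can be observed on a small example. This is only a sketch on synthetic data: the two base estimators and the dataset are arbitrary choices, not the ones from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (make_classification defaults to 2 classes).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
)
stack.fit(X, y)

# transform() returns the features handed to the final estimator.
# Each base estimator's predict_proba has 2 columns; the first is dropped.
meta_features = stack.transform(X)
print(meta_features.shape)
```

With two base estimators the meta-feature matrix has two columns rather than four, which is the same mechanism that turns 230 pass-through columns into 229.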
Both of the above answers are on point. This is how I implemented the workaround with a dummy column:
Declare a custom transformer whose output is the column that gets dropped for the reasons explained above:
from sklearn.base import BaseEstimator, TransformerMixin
class add_dummy_column(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        print(type(X))
        return X[[self.key]]
Do a feature union in which the custom transformer above and the column transformer are combined to create the final dataframe. This duplicates the column that gets dropped. Below is the altered StackingClassifier definition with the FeatureUnion:
model = StackingClassifier(estimators=[
        ('select_features', Pipeline(steps=[('featureunion', FeatureUnion([('add_dummy_column_to_input_dataframe', add_dummy_column(key='FEATURE_THAT_GETS_DROPPED')),
                                                                          ("model_feature_selector", ColumnTransformer([('feature_list', 'passthrough', X_train.columns)]))])),
                                            ('base(dummy)_classifier1', FeaturePassThroughClassifier())])),
        ('base_classifier2', base_classifier2),
        ('base_classifier3', base_classifier3),
        ('base_classifier4', base_classifier4)
    ],
    final_estimator=Pipeline(memory=None,
        steps=[
            ('save_base_estimator_output_data', FunctionTransformer(save_base_estimator_output_data, validate=False)), ('final_model', RandomForestClassifier())
        ], verbose=True), passthrough=False, stack_method='predict_proba')

Sklearn's TfidfTransformer(use_idf=False, norm=None) returns the same output as CountVectorizer()

I am trying to understand the code behind TfidfTransformer(). From sklearn's documentation, I can get the term frequencies by setting use_idf=False. But when I check the code on Github, I noticed that the TfidfTransformer() will return the same value as CountVectorizer() when not using normalization, which is just the count of each term.
The code that is supposed to calculate term frequencies.
def transform(self, X, copy=True):
    """Transform a count matrix to a tf or tf-idf representation.
    Parameters
    ----------
    X : sparse matrix of (n_samples, n_features)
        A matrix of term/token counts.
    copy : bool, default=True
        Whether to copy X and operate on the copy or perform in-place
        operations.
    Returns
    -------
    vectors : sparse matrix of shape (n_samples, n_features)
        Tf-idf-weighted document-term matrix.
    """
    X = self._validate_data(
        X, accept_sparse="csr", dtype=FLOAT_DTYPES, copy=copy, reset=False
    )
    if not sp.issparse(X):
        X = sp.csr_matrix(X, dtype=np.float64)
    if self.sublinear_tf:
        np.log(X.data, X.data)
        X.data += 1
    if self.use_idf:
        # idf_ being a property, the automatic attributes detection
        # does not work as usual and we need to specify the attribute
        # name:
        check_is_fitted(self, attributes=["idf_"], msg="idf vector is not fitted")
        # *= doesn't work
        X = X * self._idf_diag
    if self.norm is not None:
        X = normalize(X, norm=self.norm, copy=False)
    return X
To investigate more, I ran both classes and compared the output of both CountVectorizer and TfidfTransformer using the following code and the output is equal.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=(
    'headers', 'footers', 'quotes'), subset='train', categories=['sci.electronics', 'rec.autos', 'rec.sport.hockey'])
train_documents = dataset.data
vectorizer = CountVectorizer()
train_documents_mat = vectorizer.fit_transform(train_documents)
tf_vectorizer = TfidfTransformer(use_idf=False, norm=None)
train_documents_mat_2 = tf_vectorizer.fit_transform(train_documents_mat)
equal = np.array_equal(
    train_documents_mat.toarray(),
    train_documents_mat_2.toarray()
)
print(equal)
I am trying to get the term frequencies for my documents rather than just the counts. Any ideas why sklearn implements TF-IDF in this way?
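One way to read the transform code above: with use_idf=False and norm=None, neither the idf step nor the normalization step runs, so the counts pass through unchanged. If "term frequency" is taken to mean the count divided by the document's total count (an assumption about the definition wanted here), L1 normalization gives exactly that. A small sketch on two made-up documents:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog barked"]  # toy documents
counts = CountVectorizer().fit_transform(docs)

# norm="l1" divides each row by the sum of its (non-negative) entries,
# i.e. each count by the total number of counted tokens in that document.
tf = TfidfTransformer(use_idf=False, norm="l1").fit_transform(counts)

dense_counts = counts.toarray()
expected = dense_counts / dense_counts.sum(axis=1, keepdims=True)
print(np.allclose(tf.toarray(), expected))
```

Every row of tf sums to 1, so the values are relative frequencies rather than raw counts.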

Scikit learn GridSearchCV with pipeline with custom transformer

I'm trying to perform a GridSearchCV on a pipeline with a custom transformer. The transformer enriches the features "year" and "odometer" polynomially and one hot encodes the rest of the features. The ML model is a simple linear regression model.
custom transformer code:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PolynomialFeatures
class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree = 2, poly_features = ['year', 'odometer']):
        self.degree_ = degree
        self.poly_features_ = poly_features
    def fit(self, X, y=None):
        # Return the classifier
        return self
    def transform(self, X, y=None):
        poly_feat = PolynomialFeatures(degree=self.degree_)
        OneHot = OneHotEncoder(sparse=False)
        not_poly_features = list(set(X.columns) - set(self.poly_features_))
        poly = poly_feat.fit_transform(X[self.poly_features_].to_numpy())
        poly = np.hstack([poly, OneHot.fit_transform(X[not_poly_features].to_numpy())])
        return poly
    def get_params(self, deep=True):
        return {"degree": self.degree_, "poly_features": self.poly_features_}
pipeline & gridsearch code:
#create pipeline
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
poly_pipeline = Pipeline(steps=[("cpf", custom_poly_features()), ("lin_reg", LinearRegression(n_jobs=-1))])
#perform gridsearch
from sklearn.model_selection import GridSearchCV
param_grid = {"cpf__degree": [3, 4, 5]}
search = GridSearchCV(poly_pipeline, param_grid, n_jobs=-1, cv=3)
search.fit(X_train_ordinal, y_train)
The custom transformer itself works fine and the pipeline also works (although the score is not great, but that is not the topic here).
poly_pipeline.fit(X_train, y_train).score(X_test, y_test)
Output:
0.543546844381771
However, when I perform the gridsearch, the scores are all nan values:
search.cv_results_
Output:
{'mean_fit_time': array([4.46928191, 4.58259885, 4.55605125]),
'std_fit_time': array([0.18111937, 0.03305779, 0.02080789]),
'mean_score_time': array([0.21119197, 0.13816587, 0.11357466]),
'std_score_time': array([0.09206233, 0.02171508, 0.02127906]),
'param_custom_poly_features__degree': masked_array(data=[3, 4, 5],
mask=[False, False, False],
fill_value='?',
dtype=object),
'params': [{'custom_poly_features__degree': 3},
{'custom_poly_features__degree': 4},
{'custom_poly_features__degree': 5}],
'split0_test_score': array([nan, nan, nan]),
'split1_test_score': array([nan, nan, nan]),
'split2_test_score': array([nan, nan, nan]),
'mean_test_score': array([nan, nan, nan]),
'std_test_score': array([nan, nan, nan]),
'rank_test_score': array([1, 2, 3])}
Does anyone know what the problem is? The transformer and the pipeline work fine on their own after all.
To debug searches in general, set error_score='raise', so that you get a full error traceback.
Your issue appears to be data-dependent; I can run this just fine on a custom dataset. That suggests to me that the comment by @Sanjar Adylov not only highlights an important issue, but the issue for your data: the train folds sometimes contain different values in some categorical feature(s) than the test folds, and so the one-hot encodings end up with different numbers of features, and the linear model justifiably breaks.
So the fix there is also as Sanjar says: instantiate the two transformers, store them as attributes, and fit them in your fit method, then use their transform methods in your transform method.
You will find there is another big issue: all the scores in cv_results_ are the same. This is because you can't actually set the hyperparameters correctly, because in __init__ you've used mismatching names (degree as the parameter but degree_ as the attribute). Read more in the developer guide. (I think you can get around this by editing set_params similar to how you edited get_params, but it would be much easier to actually rely on the BaseEstimator versions of those and just match the parameter names to the attribute names.)
Also, note that setting a parameter default to a list can have surprising effects. Consider alternatives to the default of poly_features in __init__.
class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=['year', 'odometer']):
        self.degree = degree
        self.poly_features = poly_features
    def fit(self, X, y=None):
        self.poly_feat = PolynomialFeatures(degree=self.degree)
        # On scikit-learn 1.2 and later this argument is named sparse_output.
        self.onehot = OneHotEncoder(sparse=False)
        self.not_poly_features_ = list(set(X.columns) - set(self.poly_features))
        self.poly_feat.fit(X[self.poly_features])
        self.onehot.fit(X[self.not_poly_features_])
        return self
    def transform(self, X, y=None):
        poly = self.poly_feat.transform(X[self.poly_features])
        poly = np.hstack([poly, self.onehot.transform(X[self.not_poly_features_])])
        return poly
There are some additional things you might want to add, like checks for whether poly_features or not_poly_features_ is empty (which would break the corresponding transformer).
Finally, your custom estimator is just doing what a ColumnTransformer is meant to do. I think the only reason to prefer yours is if you need to search over which columns get which treatment; I don't think that's easy to do with a ColumnTransformer.
from sklearn.compose import ColumnTransformer
custom_poly = ColumnTransformer(
    transformers=[('poly', PolynomialFeatures(), ['year', 'odometer'])],
    remainder=OneHotEncoder(),
)
param_grid = {"cpf__poly__degree": [3, 4, 5]}
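Putting the ColumnTransformer suggestion together with error_score='raise', a complete search might look like the sketch below. The toy DataFrame, the "fuel" column, and the handle_unknown="ignore" and sparse_threshold=0 settings are assumptions added for illustration; they are not taken from the question's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Hypothetical toy data standing in for the car dataset.
rng = np.random.default_rng(0)
n = 60
X_toy = pd.DataFrame({
    "year": rng.integers(2000, 2020, size=n),
    "odometer": rng.uniform(10, 200, size=n),  # assumed unit: thousands of km
    "fuel": rng.choice(["gas", "diesel", "electric"], size=n),
})
y_toy = 0.5 * (X_toy["year"] - 2000) - 0.01 * X_toy["odometer"] + rng.normal(size=n)

custom_poly = ColumnTransformer(
    transformers=[("poly", PolynomialFeatures(), ["year", "odometer"])],
    # handle_unknown="ignore" keeps the encoding width fixed even when a
    # test fold contains a category that the train fold never saw.
    remainder=OneHotEncoder(handle_unknown="ignore"),
    sparse_threshold=0,  # always return a dense array
)
pipe = Pipeline([("cpf", custom_poly), ("lin_reg", LinearRegression())])

search = GridSearchCV(
    pipe,
    {"cpf__poly__degree": [1, 2, 3]},
    cv=3,
    error_score="raise",  # surface the real traceback instead of nan scores
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Because the one-hot encoder is fitted inside each fold and ignores unseen categories at transform time, the train and test folds always produce the same number of columns.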

Torchtext 0.7 shows Field is being deprecated. What is the alternative?

Looks like the previous paradigm of declaring Fields, Examples and using BucketIterator is deprecated and will move to legacy in 0.8. However, I don't seem to be able to find an example of the new paradigm for custom datasets (as in, not the ones included in torch.datasets) that doesn't use Field. Can anyone point me at an up-to-date example?
Reference for deprecation:
https://github.com/pytorch/text/releases
It took me a little while to find the solution myself. The new paradigm is like so for prebuilt datasets:
from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)
or like so for custom built datasets:
from torch.utils.data import DataLoader
def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        texts.append(txt)
        labels.append(label)
    return texts, labels
dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)
I've copied the examples from the Source
Browsing through torchtext's GitHub repo I stumbled over the README in the legacy directory, which is not documented in the official docs. The README links a GitHub issue that explains the rationale behind the change as well as a migration guide.
If you just want to keep your existing code running with torchtext 0.9.0, where the deprecated classes have been moved to the legacy module, you have to adjust your imports:
# from torchtext.data import Field, TabularDataset
from torchtext.legacy.data import Field, TabularDataset
Alternatively, you can import the whole torchtext.legacy module as torchtext as suggested by the README:
import torchtext.legacy as torchtext
There is a post regarding this. Instead of the deprecated Field and BucketIterator classes, it uses the TextClassificationDataset along with the collator and other preprocessing. It reads a txt file and builds a dataset, followed by a model. Inside the post, there is a link to a complete working notebook. The post is at: https://mmg10.github.io/pytorch/2021/02/16/text_torch.html. But you need the 'dev' (or nightly build) of PyTorch for it to work.
From the link above:
After tokenization and building vocabulary, you can build the dataset as follows
def data_to_dataset(data, tokenizer, vocab):
    data = [(text, label) for (text, label) in data]
    text_transform = sequential_transforms(tokenizer.tokenize,
                                           vocab_func(vocab),
                                           totensor(dtype=torch.long)
                                           )
    label_transform = sequential_transforms(lambda x: 1 if x == '1' else (0 if x == '0' else x),
                                            totensor(dtype=torch.long)
                                            )
    transforms = (text_transform, label_transform)
    dataset = TextClassificationDataset(data, vocab, transforms)
    return dataset
The collator is as follows:
class Collator:  # class name assumed; the class line is missing from the excerpt
    def __init__(self, pad_idx):
        self.pad_idx = pad_idx
    def collate(self, batch):
        text, labels = zip(*batch)
        labels = torch.LongTensor(labels)
        text = nn.utils.rnn.pad_sequence(text, padding_value=self.pad_idx, batch_first=True)
        return text, labels
Then, you can build the dataloader with the typical torch.utils.data.DataLoader using the collate_fn argument.
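As a rough, self-contained illustration of that last step, the sketch below feeds variable-length token-id tensors through a DataLoader with a padding collate function. The toy samples, the PAD_IDX value and the pad_collate name are made up for the example and do not come from the linked post.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_IDX = 0  # assumed padding index for this sketch

# Toy (token_ids, label) pairs of different lengths.
# A plain list works as a map-style dataset.
samples = [
    (torch.tensor([4, 5, 6]), 1),
    (torch.tensor([7, 8]), 0),
    (torch.tensor([9]), 1),
    (torch.tensor([3, 2, 5, 8]), 0),
]


def pad_collate(batch):
    """Pad every sequence in the batch to the length of the longest one."""
    texts, labels = zip(*batch)
    padded = pad_sequence(list(texts), batch_first=True, padding_value=PAD_IDX)
    return padded, torch.tensor(labels, dtype=torch.long)


loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)
batches = list(loader)
first_text, first_labels = batches[0]
print(first_text.shape, first_labels)
```

Padding happens per batch, so each batch is only as wide as its own longest sequence.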
It seems the pipeline could look like this:
import torchtext as TT
import torch
from collections import Counter
from torch.utils.data import DataLoader
from torchtext.vocab import Vocab
# read the data
with open('text_data.txt', 'r') as f:
    data = f.readlines()
with open('labels.txt', 'r') as f:
    labels = f.readlines()
tokenizer = TT.data.utils.get_tokenizer('spacy', 'en')  # can remove 'spacy' and use a simple built-in tokenizer
train_iter = zip(labels, data)
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = TT.vocab.Vocab(counter, min_freq=1)
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
# this is data-specific - adapt for your data
label_pipeline = lambda x: 1 if x == 'positive\n' else 0
class TextData(torch.utils.data.Dataset):
    '''
    very basic dataset for processing text data
    '''
    def __init__(self, labels, text):
        super(TextData, self).__init__()
        self.labels = labels
        self.text = text
    def __getitem__(self, index):
        return self.labels[index], self.text[index]
    def __len__(self):
        return len(self.labels)
def tokenize_batch(batch, max_len=200):
    '''
    tokenizer to use in DataLoader
    takes a text batch of text dataset and produces a tensor batch, converting text and labels through tokenizer, labeler
    tokenizer is a global function text_pipeline
    labeler is a global function label_pipeline
    max_len is a fixed len size, if text is shorter than max_len it is padded with ones (pad number)
    if text is longer than max_len it is truncated, keeping the end of the string
    '''
    labels_list, text_list = [], []
    for _label, _text in batch:
        labels_list.append(label_pipeline(_label))
        text_holder = torch.ones(max_len, dtype=torch.int32)  # fixed size tensor of max_len
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int32)
        pos = min(max_len, len(processed_text))  # use max_len rather than a hard-coded 200
        text_holder[-pos:] = processed_text[-pos:]
        text_list.append(text_holder.unsqueeze(dim=0))
    return torch.FloatTensor(labels_list), torch.cat(text_list, dim=0)
train_dataset = TextData(labels, data)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, collate_fn=tokenize_batch)
lbl, txt = next(iter(train_loader))  # Python 3 iterators have no .next() method

How to add new sample to CIFAR10 torchvision?

Hi I want to add my own images to the CIFAR10 dataset in torchvision, how can I do that?
train_data = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
train_data.add # or a workaround!
thanks
You can either create a custom dataset for CIFAR10, using the raw cifar10 images here or you can still use the CIFAR10 dataset inside your new custom dataset and then add your logic in the __getitem__() method.
This is a simple example to get you going :
import torch
import torchvision
class CIFAR10_2(torch.utils.data.Dataset):
    def __init__(self, dataset_path='/cifar10', transformations=None, should_download=True):
        self.dataset_train = torchvision.datasets.CIFAR10(dataset_path, download=should_download)
        self.transformations = transformations
    def __getitem__(self, index):
        # do as you wish, add your logic here
        (img, label) = self.dataset_train[index]
        # for transformations for example
        if self.transformations is not None:
            return self.transformations(img), label
        return img, label
    def __len__(self):
        return len(self.dataset_train)
You can get fancy and add logic for test, validation, etc., and do whatever you like.
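If the goal is only to append extra samples, another option is to keep your own images in a second dataset and join the two with torch.utils.data.ConcatDataset. The sketch below uses random tensors as a stand-in for the CIFAR10 dataset so that it runs without a download; the ExtraImages class and all of the data are hypothetical.

```python
import torch
from torch.utils.data import ConcatDataset, Dataset


class ExtraImages(Dataset):
    """Hypothetical dataset holding additional (image, label) pairs."""

    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __getitem__(self, index):
        img = self.images[index]
        if self.transform is not None:
            img = self.transform(img)
        return img, self.labels[index]

    def __len__(self):
        return len(self.images)


# Stand-in for torchvision.datasets.CIFAR10(...): five random 3x32x32 images.
base = ExtraImages(torch.rand(5, 3, 32, 32), [0, 1, 2, 3, 4])
# Your own images, already converted to the same shape and label space.
extra = ExtraImages(torch.rand(2, 3, 32, 32), [3, 7])

combined = ConcatDataset([base, extra])
print(len(combined))      # 7
print(combined[5][1])     # label of the first extra sample: 3
```

Both datasets need to return items of the same type and shape (for example, apply the same transform to each) so that a DataLoader can batch them together.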

Resources