I'm practicing on the Kaggle news headline dataset paired with the DJIA prices as exported from Yahoo Finance: https://www.kaggle.com/aaron7sun/stocknews#Combined_News_DJIA.csv
There are not many discussions of NLP combined with time series. I attempted this article's approach using CountVectorizer(), but without success. I was wondering if anyone has any resources or suggestions?
My code below is based on the headlines in the dataset above:
def modeller(vect, X_tr, y_tr, X_te):
    X_train_dtm = vect.fit_transform(X_tr.unstack())
    X_test_dtm = vect.fit_transform(X_te.unstack())
    X_tr_arima = [x for x in X_train_dtm]
    print('done with count vectorizer. now modelling.')
    model = ARIMA(X_tr_arima, order=(1, 1, 1))
    print('done modelling. now fitting')
    model_fit = model.fit(X_tr_arima, y_tr)
    y_hat = model.predict(x_te_arima)
    return y_hat
vect = CountVectorizer(stop_words='english')
X_train, X_test, y_train, y_test = X.iloc[0:100], X.iloc[100:X.shape[0]], y[0:100], y[100:len(y)]
modeller(vect, X_train, y_train, X_test)
Output (error from ARIMA line):
ValueError: setting an array element with a sequence.
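For context, the ValueError arises because ARIMA expects a one-dimensional numeric series, while each element of X_tr_arima is an entire sparse row of the document-term matrix. Below is a minimal sketch of one way to hand ARIMA something it can digest, purely illustrative and assuming the current statsmodels API: it collapses each day's headline counts into a single total so the model sees a univariate series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# collapse each document's counts to one number so ARIMA gets a 1-D series
counts_series = np.asarray(X_train_dtm.sum(axis=1)).ravel()

model = ARIMA(counts_series, order=(1, 1, 1))
model_fit = model.fit()                                # fit() takes no X/y arguments
y_hat = model_fit.forecast(steps=X_test_dtm.shape[0])  # one forecast per test day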
I had the same problem and I was able to fix it using this approach (the alias avoids a name clash with sklearn.pipeline.Pipeline). Try changing
from pmdarima.pipeline import Pipeline
to
from pmdarima.pipeline import Pipeline as arimaPip
I have a very imbalanced dataset. I used sklearn's train_test_split function to extract the train dataset. Now I want to oversample the train dataset, so I counted the number of type1 entries (my dataset has two categories, type1 and type2), but approximately all of my train data is type1, so I can't oversample.
Previously I split the train and test datasets with my own code. In that code, 0.8 of all type1 data and 0.8 of all type2 data ended up in the train dataset.
How can I achieve this with the train_test_split function or other splitting methods in sklearn?
*I should only use sklearn or my own written methods.
You're looking for stratification: it preserves each class's proportion in both splits, so even a rare class shows up in the train and test sets.
There's a parameter stratify in the method train_test_split to which you can give the labels list, e.g.:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.2)
There's also StratifiedShuffleSplit.
It seems like we both had similar issues here. Unfortunately, imbalanced-learn isn't always what you need, and scikit-learn does not offer this functionality out of the box, so you will want to implement your own code.
This is what I came up with for my application. Note that I have not had extensive time to debug it, but I believe it works from the testing I have done. Hope it helps:
def equal_sampler(classes, data, target, test_frac):
    # Find the least frequent class and its fraction of the total
    _, count = np.unique(target, return_counts=True)
    fraction_of_total = min(count) / len(target)

    # split that fraction further into train and test shares
    train_frac = (1 - test_frac) * fraction_of_total
    test_frac = test_frac * fraction_of_total

    # initialize index lists and find per-class train and test lengths
    train = []
    train_len = int(train_frac * data.shape[0])
    test = []
    test_len = int(test_frac * data.shape[0])

    # add values to train, drop them from the index pool, then add to test
    for i in classes:
        indices = list(target[target == i].index.copy())
        train_temp = np.random.choice(indices, train_len, replace=False)
        for val in train_temp:
            train.append(val)
            indices.remove(val)
        test_temp = np.random.choice(indices, test_len, replace=False)
        for val in test_temp:
            test.append(val)

    # X_train, y_train, X_test, y_test
    return data.loc[train], target[train], data.loc[test], target[test]
For the input, classes expects a list of the possible class values, data expects the dataframe columns used for prediction, and target expects the target column.
Take care that the algorithm may not be extremely efficient, due to the nested for-loops (list.remove takes linear time). Despite that, it should be reasonably fast.
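A quick usage sketch on toy data (the column and class names here are made up for illustration):
import numpy as np
import pandas as pd

# toy data: 90 samples of class 'a', 10 samples of class 'b'
data = pd.DataFrame({'feat': np.random.randn(100)})
target = pd.Series(['a'] * 90 + ['b'] * 10)

X_train, y_train, X_test, y_test = equal_sampler(['a', 'b'], data, target, test_frac=0.2)
print(y_train.value_counts())  # both classes drawn in equal numbers (8 each here)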
You may also look into stratified shuffle split as follows:
# We use a utility to generate artificial classification data.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=100, n_informative=10, n_classes=2)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# fit a scaled SVM on the last split
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
1. The CSV that contains the data (i.e., text descriptions) along with the categorized labels:
df = pd.read_csv('./output/csv_sanitized_16_.csv', dtype=str)
X = df['description_plus']
y = df['category_id']
2. This CSV contains unseen data (i.e., text descriptions) for which the labels need to be predicted:
df_2 = pd.read_csv('./output/csv_sanitized_2.csv', dtype=str)
X2 = df_2['description_plus']
3. The cross-validation function that operates on the training data (item #1) above:
from sklearn import preprocessing, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

def cross_val():
    cv = KFold(n_splits=20)
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    X_train = vectorizer.fit_transform(X)
    clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
    scores = cross_val_score(clf, X_train, y, cv=cv)
    print(scores)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

cross_val()
I need to know how to pass the unseen data (item #2) to the cross-validation function and how to predict the labels.
Using scores = cross_val_score(clf, X_train, y, cv=cv) you can only get the cross-validated scores of the model. cross_val_score will internally split the data into training and testing based on the cv parameter.
So the values that you get are the cross-validated accuracy of the SVC.
To get the score on the unseen data, you can first fit the model e.g.
clf = make_pipeline(preprocessing.StandardScaler(with_mean=False), svm.SVC(C=1))
clf.fit(X_train, y) # the model is trained now
and then do clf.score(X_unseen, y_unseen), where y_unseen are the true labels of the unseen data.
This will return the accuracy of the model on the unseen data.
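One caveat for the question's setup: X_train was produced by a TfidfVectorizer fitted on the training text, so the unseen descriptions in X2 must go through the same fitted vectorizer. A sketch using the question's names (in the question, vectorizer and clf live inside cross_val, so they would need to be returned or hoisted out):
X2_tfidf = vectorizer.transform(X2)       # transform only, never fit_transform on unseen data
clf.fit(X_train, y)                       # train on all the labelled data
predicted_labels = clf.predict(X2_tfidf)  # predicted labels for the unseen descriptions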
EDIT: The best way to do what you want is the following: use a GridSearch to first find the best model using the training data, and then evaluate that model using the unseen (test) data:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# load some data
iris = datasets.load_iris()
X, y = iris.data, iris.target
# split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# hyperparameter tuning of the SVC model
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
# fit the GridSearch using the TRAINING data
grid_searcher = GridSearchCV(svc, parameters)
grid_searcher.fit(X_train, y_train)
# recover the best estimator (best parameters for the SVC, based on the GridSearch)
best_SVC_model = grid_searcher.best_estimator_
# Now, check how this best model behaves on the test set
cv_scores_on_unseen = cross_val_score(best_SVC_model, X_test, y_test, cv=5)
print(cv_scores_on_unseen.mean())
I used scikit-learn's DictVectorizer to make a feature vector:
X = dataset.drop('Tag', axis=1)
v = DictVectorizer(sparse=False)
X = v.fit_transform(X.to_dict('records'))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=0)
classes = np.unique(y)
classes = classes.tolist()
per = Perceptron(verbose=10, n_jobs=-1, max_iter=5)
per.partial_fit(X_train, y_train, classes)
joblib.dump(per, 'saved_model.pkl')
and saved the trained model to a file.
Then I load the model in another file to predict on new data:
new_X=df
v = DictVectorizer(sparse=False)
new_X = v.fit_transform(new_X.to_dict('records'))
#Load model
per_load = joblib.load('saved_model2.pkl')
per_load.predict(new_X)
When I try to predict the new data by executing this code, I get this error:
ValueError: X has 43 features per sample; expecting 983
How do I save the model properly?
You need to save a pickle for the vectorizer as well, and apply transform rather than fit_transform, because your vectorizer has already learned the vocabulary, and that vocabulary needs to be used for predicting unseen data.
# save the fitted vectorizer
import joblib
joblib.dump(v, 'vectorizer.pkl')

# load the pickle and reuse it for prediction
v = joblib.load('vectorizer.pkl')
per_load.predict(v.transform(["new comment"]))  # don't use fit_transform, use transform only
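Putting it together, the save side and the load side might look like this (a sketch reusing the names from the question; note the dump and load filenames must match):
# --- training script, after per.partial_fit(...) as in the question ---
import joblib
joblib.dump(per, 'saved_model.pkl')  # the trained Perceptron
joblib.dump(v, 'vectorizer.pkl')     # the fitted DictVectorizer

# --- prediction script ---
import joblib
per_load = joblib.load('saved_model.pkl')
v_load = joblib.load('vectorizer.pkl')
new_X = v_load.transform(df.to_dict('records'))  # transform only: reuse the learned vocabulary
print(per_load.predict(new_X))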
I am new to NLP and I am trying to build a text classifier, but my data is currently imbalanced: the largest category has as many as 280 entries while the smallest has only 30.
I am trying to use cross-validation on the current data, but after looking for days I am still unable to implement it, even though it looks pretty straightforward. Here is my code:
y = resample.Subsystem
X = resample['new description']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
#SVM
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
text_clf_svm = Pipeline([('vect', CountVectorizer(stop_words='english')),
                         ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='hinge', penalty='l2',
                                                   alpha=1e-3, n_iter=5, random_state=42))])
text_clf_svm.fit(X_train, y_train)
predicted_svm = text_clf_svm.predict(X_test)
print('The best accuracy is : ',np.mean(predicted_svm == y_test))
I have done some grid search and stemming as well, but right now I would like to add cross-validation to this code. I have cleaned the data pretty well, but I am still only getting an accuracy of about 60%.
Any help would be appreciated.
Try oversampling or undersampling. As the data is highly imbalanced, there is more bias towards the class with more data points; after over/under-sampling the bias will be much smaller and accuracy will go up.
Otherwise, instead of an SVM you can use an MLP; it gives good results even with unbalanced data.
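If you can add a dependency, here is a sketch of random oversampling with the imbalanced-learn package (RandomOverSampler duplicates minority-class rows until the classes are balanced; only ever resample the training split, never the test split):
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
# X_train is the text Series from the question; the sampler needs a 2-D input
X_res, y_res = ros.fit_resample(X_train.to_frame(), y_train)
text_clf_svm.fit(X_res['new description'], y_res)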
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, random_state=None)
# X is the feature set and y is the target

from sklearn.model_selection import RepeatedKFold

kf = RepeatedKFold(n_splits=20, n_repeats=10, random_state=None)
for train_index, test_index in kf.split(X):
    # print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
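To run cross-validation on the code from the question, cross_val_score can also take the whole text_clf_svm pipeline, so the vectorizer is re-fitted on each training fold (a sketch reusing the question's variables):
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(text_clf_svm, X, y, cv=skf)  # each fold vectorizes and fits from scratch
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))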
I am using GridSearchCV like this:
import joblib
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

corpus = load_files('corpus')
with open('stopwords.txt', 'r') as f:
    stop_words = [y for x in f.read().split('\n') for y in (x, x.title())]

x = corpus.data
y = corpus.target

pipeline = Pipeline([
    ('vec', CountVectorizer(stop_words=stop_words)),
    ('classifier', MultinomialNB())])

parameters = {'vec__ngram_range': [(1, 1), (1, 2)],
              'classifier__alpha': [1e-2, 1e-3],
              'classifier__fit_prior': [True, False]}

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=5, scoring="f1", verbose=10)
gs_clf = gs_clf.fit(x, y)
joblib.dump(gs_clf.best_estimator_, 'MultinomialNB.pkl', compress=1)
Then, in another file, to classify new documents (not from the corpus), I do this:
classifier = joblib.load(filepath) # path to .pkl file
result = classifier.predict(tokenlist)
My question is: where do I get the values needed for the classification_report?
In many other examples, I see people split the corpus into a training set and a test set.
However, since I am using GridSearchCV with k-fold cross-validation, I don't need to do that.
So how can I get those values from GridSearchCV?
If you have GridSearchCV object:
from sklearn.metrics import classification_report
clf = GridSearchCV(....)
clf.fit(x_train, y_train)
classification_report(y_test, clf.best_estimator_.predict(x_test))
If you have saved the best estimator and loaded it then:
classifier = joblib.load(filepath)
classification_report(y_test, classifier.predict(x_test))
The best model is in clf.best_estimator_. You need to fit it on the training data, then predict on your test data and pass y_test and those predictions to classification_report.
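A sketch of that flow, assuming you hold out a test split before running the grid search (pipeline and parameters are the objects from the question's snippet):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

# hold out a test set that the grid search never sees
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

gs_clf = GridSearchCV(pipeline, parameters, n_jobs=-1, cv=5, scoring="f1")
gs_clf.fit(x_train, y_train)

# GridSearchCV refits the best estimator on all of x_train by default
y_preds = gs_clf.best_estimator_.predict(x_test)
print(classification_report(y_test, y_preds))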