sklearn LogisticRegression does not accept csr_matrix - scikit-learn

I am a newbie and I have to classify the words of a lexicon according to the De Pauw and Wagacha (1998) method (basically, maxent on char n-grams). The data is very large (500 000 entries and millions of n-grams), so I must load the samples as a sparse matrix. But I ran into a problem.
sklearn.linear_model.LogisticRegression().fit(X,y) says it does not accept scipy.sparse.csr.csr_matrix training vectors. I got this error:
Traceback (most recent call last):
File "test-LR-4.py", line 8, in <module>
clf.fit(X,y)
File "/usr/lib/pymodules/python2.7/sklearn/svm/base.py", line 441, in fit
% type(X))
ValueError: Training vectors should be array-like, not <class 'scipy.sparse.csr.csr_matrix'>
for the following script:
from sklearn.linear_model import LogisticRegression
import numpy as np
import scipy.sparse as sp
X = sp.csr_matrix([[0, 1, 2],[1, 2, 3],[3, 2, 1]])
y = np.array(range(3))
clf=LogisticRegression(dual=True)
clf.fit(X,y)

As mentioned in the comments by @Andreas and @Fred Foo, upgrading sklearn (to a version > 0.13) will solve the problem.
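For reference, here is a minimal sketch of fitting on a CSR matrix with a recent scikit-learn release. It is not the asker's exact script: it assumes the default solver (so dual=True is dropped) and uses a made-up four-row matrix with binary labels.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

# Toy sparse design matrix in CSR format (values are illustrative only).
X = sp.csr_matrix([[0, 1, 2], [1, 2, 3], [3, 2, 1], [2, 0, 1]])
# Binary labels chosen for this sketch; the question used one class per row.
y = np.array([0, 0, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)  # the sparse matrix is passed directly, no .toarray() needed

predictions = clf.predict(X)
print(predictions.shape)
```

The point is only that fit and predict take the sparse matrix as-is, which matters when the dense version would not fit in memory.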

Related

How to get prediction for a single data entry

I have a trained model stored in a pickle file. All I need to do is take a single-row pandas dataframe and get a prediction by passing it to the model.
To handle the categorical columns, I used one-hot encoding during training. So, to convert the pandas dataframe to a numpy array, I also applied one-hot encoding to the single-row dataframe. But it gives me an error.
import pickle
import category_encoders as ce
import pandas as pd
pkl_filename = "pickle_model.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)
ohe = ce.OneHotEncoder(handle_unknown='ignore', use_cat_names=True)
X_t = pd.read_pickle("case1.pkl")
X_t_ohe = ohe.fit_transform(X_t)
X_t_ohe = X_t_ohe.fillna(0)
Ypredict = pickle_model.predict(X_t_ohe)
print(Ypredict[0])
Traceback (most recent call last):
File "Predict.py", line 14, in <module>
Ypredict = pickle_model.predict(X_t_ohe)
File "/home/neo/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 289, in predict
scores = self.decision_function(X)
File "/home/neo/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 270, in decision_function
% (X.shape[1], n_features))
ValueError: X has 93 features per sample; expecting 989
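This happens because OneHotEncoder converts your dataframe into many numerical columns, one per category it sees. Your script fits a new encoder (fit_transform) on the single-row dataframe, so it only produces columns for the categories present in that one row (93 columns), while the pickled model was trained on the columns produced from your original file (989 columns). The dimensions (number of columns) therefore do not match.
To rectify this issue, the data passed to predict must be encoded with exactly the same columns as the training data. Retrain your model after applying the one-hot encoder, save it as a pickle file together with the fitted encoder, and then reuse that model, encoding new rows with the saved encoder's transform rather than fitting a new one.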
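A minimal sketch of that workflow follows. It swaps in scikit-learn's OneHotEncoder instead of category_encoders, and the data, column meanings and variable names are all made up for illustration. The encoder is fitted once on the training data, pickled with the model, and only transform is called on the new row.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with two categorical columns.
X_train = np.array(
    [["red", "S"], ["green", "M"], ["blue", "L"], ["red", "L"]], dtype=object
)
y_train = np.array([0, 1, 1, 0])

# Fit the encoder ONCE on the training data and train on its output.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_ohe = encoder.fit_transform(X_train)
model = LogisticRegression().fit(X_train_ohe, y_train)

# Persist the encoder together with the model (in memory here, a file in practice).
blob = pickle.dumps((encoder, model))
restored_encoder, restored_model = pickle.loads(blob)

# At prediction time, call transform (not fit_transform) on the single row.
single = np.array([["green", "L"]], dtype=object)
single_ohe = restored_encoder.transform(single)
prediction = restored_model.predict(single_ohe)

# For comparison: refitting an encoder on the single row gives fewer columns.
refit_width = OneHotEncoder().fit_transform(single).shape[1]
print(single_ohe.shape[1], refit_width)
```

The refitted encoder only knows the categories in the single row, which is the same kind of mismatch as the 93 versus 989 columns in the traceback.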

New to ML trying to use SVM and SVR for 1st time some syntax/transposition errors

I am trying to run an SVR on some data I got from Yahoo Finance. I want to use the closing prices of Ethereum to predict the path over the next 10-15 days using a supervised learning method. I have already built an autoregressive model (ARIMA), but now I want to try ML techniques like pattern recognition, so I am starting with SVR.
My problem is simply that I don't know how to convert my data column into a row so that the SVR works. I thought it would be simple, but I am new to coding overall. I appreciate your help; see below:
''' building a simple model for using machine learning to do pattern recog on stock prices'''
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.pylab as ply
import numpy as np
from pandas import DataFrame as df
from sklearn.svm import SVR
df = pd.read_csv("C:\Learning\ETH.csv", index_col='Date', parse_dates=True)
prices = df['Adj Close']
dates = df.index
dates = dates.values.reshape(1,len(dates))
# run support vector regressions to get next predicted value
svr_lin = SVR(kernel='linear', C=1e3)
svr_poly = SVR(kernel='poly', C=1e3)
svr_rbf = SVR(kernel='rbf', C=1e3)
svr_lin.fit(dates, prices)
svr_poly.fit(dates, prices)
svr_rbf.fit(dates, prices)
When I run this I get following error:
Traceback (most recent call last):
File "C:/Users/.../Machine Learning/MachLearningStockPrediction.py", line 19, in <module>
svr_lin.fit(dates, prices)
File "C:\Users...\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\svm\base.py", line 149, in fit
X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
File "C:\Users...\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 583, in check_X_y
check_consistent_length(X, y)
File "C:\Users...\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 204, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 1085]
Both prices and dates have the same length: 1085 entries.
Please help.
When you do this:
dates = dates.values.reshape(1,len(dates))
dates is converted to a row vector with 1 row and 1085 columns. scikit-learn expects X (the features) to have shape [n_samples, n_features], so it treats your data as a single sample with 1085 features.
Your prices, however, has shape [1085, ], which scikit-learn reads as [n_samples, ], i.e. 1085 samples. The two sample counts (1 and 1085) disagree, hence the error.
To correct the error, reshape dates into a column instead:
dates = dates.values.reshape(len(dates), 1)
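The difference between the two reshapes can be checked on a small stand-in dataset. This is only a sketch: it assumes 30 samples instead of 1085, a numeric day index in place of real dates, and made-up prices.

```python
import numpy as np
from sklearn.svm import SVR

n_samples = 30  # small stand-in for the 1085 rows in the question
dates = np.arange(n_samples, dtype=float)      # hypothetical numeric day index
prices = np.linspace(100.0, 130.0, n_samples)  # hypothetical closing prices

row_vector = dates.reshape(1, len(dates))     # shape (1, 30): 1 sample, 30 features
column_vector = dates.reshape(len(dates), 1)  # shape (30, 1): 30 samples, 1 feature

# The row vector reproduces the "inconsistent numbers of samples" error.
try:
    SVR(kernel="rbf").fit(row_vector, prices)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# The column vector has one row per price, so fitting works.
model = SVR(kernel="rbf").fit(column_vector, prices)
predictions = model.predict(column_vector)
print(row_vector.shape, column_vector.shape, predictions.shape)
```

Equivalently, dates.reshape(-1, 1) lets numpy infer the number of rows.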

Unable to call the fit function on randomforest regressor python sklearn

I'm unable to call the fit function on the RandomForestRegressor and even the intellisense is only showing the predict and some other parameters. Below is my code, traceback call and an image showing the content of the intellisense.
import pandas
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
def predict():
    Fvector = 'C:/Users/Oussema/Desktop/Cred_Data/VEctors/FinalFeatureVector.csv'
    data = np.genfromtxt(Fvector, dtype=float, delimiter=',', names=True)
    AnnotArr = np.array(data['CredAnnot']) # 1D array containing the ground truth (50000 rows)
    TempTestArr = np.array([data['GrammarV'],data['TweetSentSc'],data['URLState']]) # feature vector, shape (3,50000), values in the range [0-1]
    FeatureVector = TempTestArr.transpose() # transposed to get the shape (50000,3)
    RF_model = RandomForestRegressor(n_estimators=20, max_features = 'auto', n_jobs = -1)
    RF_model.fit(FeatureVector,AnnotArr)
    print(RF_model.oob_score_)
predict()
Intellisense content:
[1]: https://i.stack.imgur.com/XweOo.png
Traceback call
Traceback (most recent call last):
File "C:\Users\Oussema\source\repos\Regression_Models\Regression_Models\Random_forest_TCA.py", line 15, in <module>
predict()
File "C:\Users\Oussema\source\repos\Regression_Models\Regression_Models\Random_forest_TCA.py", line 14, in predict
print(RF_model.oob_score_)
AttributeError: 'RandomForestRegressor' object has no attribute 'oob_score_'
You need to set the oob_score param to True when initializing the RandomForestRegressor.
As per the documentation:
oob_score : bool, optional (default=False)
whether to use out-of-bag samples to estimate the R^2 on unseen data.
So the attribute oob_score_ is only available if you do this:
def predict():
    ....
    ....
    RF_model = RandomForestRegressor(n_estimators=20,
                                     max_features = 'auto',
                                     n_jobs = -1,
                                     oob_score=True) #<= This is what you want
    ....
    ....
    print(RF_model.oob_score_)
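The effect of the flag can be seen on synthetic data. This is a sketch with randomly generated features and targets, not the asker's CSV; n_estimators and random_state are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: 200 samples, 3 features in [0, 1).
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + 0.01 * rng.standard_normal(200)

# Default settings: oob_score=False, so no oob_score_ attribute is created.
without_oob = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# With oob_score=True the out-of-bag R^2 estimate is stored after fit.
with_oob = RandomForestRegressor(
    n_estimators=50, oob_score=True, random_state=0
).fit(X, y)

print(hasattr(without_oob, "oob_score_"))
print(hasattr(with_oob, "oob_score_"))
```

Note that fit itself works in both cases; only the oob_score_ attribute depends on the flag.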

Error TensorFlow "Dimensions must be equal, but..."

To get started with TensorFlow, I am trying to reproduce the basic estimator example with the IRIS data set, but with my own data.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
from six.moves.urllib.request import urlopen
import tensorflow as tf
from pandas import DataFrame, read_csv
import numpy as np
TESTFILE = "test.csv"
TRAINFILE = "train.csv"
# Load datasets
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(filename=TRAINFILE, target_dtype=np.int, features_dtype=np.float32, target_column=0)
test_set = tf.contrib.learn.datasets.base.load_csv_with_header(filename=TESTFILE, target_dtype=np.int, features_dtype=np.float32, target_column=0)
# Specify that all features have real-value data
feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]
# Build 3 layer DNN with 10, 20, 10 units respectively.
classifier = tf.estimator.DNNClassifier(feature_columns=feature_columns,
hidden_units=[10, 20, 10],
n_classes=3,
model_dir="/tmp/Goldman_model")
# Define the training inputs
train_input_fn = tf.estimator.inputs.numpy_input_fn(
x={"x": np.array(training_set.data)},
y=np.array(training_set.target),
num_epochs=None,
shuffle=True)
# Train model.
classifier.train(input_fn=train_input_fn, steps=2000)
But when I try to train the classifier, I get the following error:
File "C:\Users\***\Anaconda3\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 691, in _call_cpp_shape_fn_impl
raise ValueError(err.message)
ValueError: Dimensions must be equal, but are 160 and 128
for 'dnn/head/sparse_softmax_cross_entropy_loss/xentropy/xentropy'
(op: 'SparseSoftmaxCrossEntropyWithLogits') with input shapes: [160,3], [128].
And I have absolutely no idea what to do next.
Thank you very much for the answers,
JF Palomeque
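Your labels and predictions have different dimensions (160 and 128): the error shows logits of shape [160, 3] but labels of shape [128]. numpy_input_fn uses a batch size of 128 by default, so the labels look as expected, and it is the features that end up with 160 rows. A likely cause (an inference from the numbers, not something the traceback states) is that shape=[4] in numeric_column does not match the number of feature columns in your CSV. For example, 5 features per row would give 128 × 5 = 640 values, which is 160 rows of 4. Check that the shape matches the number of feature columns in train.csv.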
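One way such a mismatch could arise is sketched below with plain numpy. This is only a plausible explanation: the feature width of 5 is hypothetical, and the sketch assumes the declared shape=[4] causes the batch to be flattened into rows of 4 values. The batch size of 128 is the numpy_input_fn default.

```python
import numpy as np

batch_size = 128    # default batch size of numpy_input_fn
declared_width = 4  # shape=[4] declared in the feature column
actual_width = 5    # hypothetical number of feature columns in the CSV

# One batch of features as it would come out of the input function.
batch = np.zeros((batch_size, actual_width), dtype=np.float32)

# Reinterpreting the same values as rows of 4 changes the row count.
reshaped = batch.reshape(-1, declared_width)
print(reshaped.shape)
```

If your data has a different number of columns, the arithmetic changes, but the check is the same: the declared shape should equal training_set.data.shape[1].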

Error: Tuple index out of range while plotting learning curve

Here is my code:
import matplotlib.pyplot as plt,matplotlib.colors as clr
import pandas as pd,csv,numpy as np
from sklearn import linear_model
from sklearn.model_selection import ShuffleSplit as ss, learning_curve as lc,StratifiedKFold as skf
from sklearn.utils import shuffle
file=open('C:\\Users\\Anil Satya\\Desktop\\Internship_projects\\BD Influenza\\BD_Influenza_revised_imputed.csv','r+')
flu_data=pd.read_csv(file)
flu_num=flu_data.ix[:,5:13]
features=np.array(flu_num.ix[:,0:7])
label=np.array(flu_num.ix[:,7])
splt=skf(n_splits=2,shuffle=True,random_state=None)
clf=linear_model.LogisticRegression()
model=clf.fit(features,label)
def classifier(clf,x,y):
    accuracy=clf.score(x,y)
    return accuracy
lc(estimator=clf,X=features,y=label,train_sizes=0.75,cv=splt,
scoring=classifier(clf,features,label))
On execution, it shows the following error:
Traceback (most recent call last):
File "C:/Ankur/Python36/Python Files/BD_influenza_learningcurve.py", line 26, in <module>
lc(estimator=clf,X=features,y=label,train_sizes=0.75,cv=splt,
scoring=classifier(clf,features,label))
File "C:\Ankur\Python36\lib\site-packages\sklearn\model_selection\_validation.py", line 756, in learning_curve
n_max_training_samples)
File "C:\Ankur\Python36\lib\site-packages\sklearn\model_selection\_validation.py", line 808, in _translate_train_sizes
n_ticks = train_sizes_abs.shape[0]
IndexError: tuple index out of range
I am not able to identify the problem yet. But, I believe the problem is in the learning curve function because I have executed the program without it and it works fine.
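Either the scoring or the train_sizes parameter causes the problem. scoring expects a string or a scorer callable, but classifier(clf,features,label) is called immediately and passes a float. train_sizes expects an array-like; a bare 0.75 becomes a zero-dimensional array whose shape is an empty tuple, which is why train_sizes_abs.shape[0] raises "tuple index out of range".
Try to replace:
lc(estimator=clf,X=features,y=label,train_sizes=0.75,cv=splt,
scoring=classifier(clf,features,label))
with
1)
lc(estimator=clf,X=features,y=label,train_sizes=0.75,cv=splt,
scoring="accuracy")
or 2)
import numpy as np
lc(estimator=clf,X=features,y=label,train_sizes=np.array([0.75]),cv=splt,
scoring="accuracy")
Since the traceback fails on train_sizes_abs.shape[0], option 2, which also passes train_sizes as an array, is the one that addresses the reported IndexError.
Finally, the strings you can use for the scoring parameter are listed in the scikit-learn documentation under "The scoring parameter".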
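A self-contained sketch of the working call follows. It uses synthetic data from make_classification in place of the influenza CSV, so the sample count, feature count and max_iter are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic stand-in for the 7 feature columns and the label column.
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# A bare float has an empty shape tuple, so shape[0] cannot be indexed.
scalar_shape = np.asarray(0.75).shape
array_shape = np.asarray([0.75]).shape

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
train_sizes_abs, train_scores, test_scores = learning_curve(
    estimator=LogisticRegression(max_iter=1000),
    X=X,
    y=y,
    train_sizes=np.array([0.75]),  # array-like, even for a single size
    cv=cv,
    scoring="accuracy",            # a scorer name, not a precomputed score
)
print(scalar_shape, array_shape, train_scores.shape, test_scores.shape)
```

The returned score arrays have one row per training size and one column per CV split. Passing several sizes, for example np.linspace(0.1, 1.0, 5), gives enough points to actually plot a curve.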
