How to get prediction for a single data entry - scikit-learn

I have a trained model stored in a pickle file. All I need to do is take a single-row pandas dataframe and get a prediction by passing it to the model.
To handle the categorical columns I used one-hot encoding, so to convert the pandas dataframe to a numpy array I also applied one-hot encoding to the single-row dataframe. But it gives me the error below.
import pickle
import category_encoders as ce
import pandas as pd

pkl_filename = "pickle_model.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)

ohe = ce.OneHotEncoder(handle_unknown='ignore', use_cat_names=True)
X_t = pd.read_pickle("case1.pkl")
X_t_ohe = ohe.fit_transform(X_t)
X_t_ohe = X_t_ohe.fillna(0)

Ypredict = pickle_model.predict(X_t_ohe)
print(Ypredict[0])
Traceback (most recent call last):
File "Predict.py", line 14, in
Ypredict = pickle_model.predict(X_t_ohe)
File "/home/neo/anaconda3/lib/python3.6/site-> packages/sklearn/linear_model/base.py", line 289, in predict
scores = self.decision_function(X)
File "/home/neo/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 270, in decision_function
% (X.shape[1], n_features))
ValueError: X has 93 features per sample; expecting 989

This happens because OneHotEncoder expands your dataframe into many different numerical columns, while your pickled model was trained on your original file, which does not have the same dimensions (the same number of columns).
To rectify this issue, you will need to retrain your model after applying the one-hot encoder, then save it as a pickle file and reuse that model.
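One way to keep the dimensions consistent is to pickle the fitted encoder together with the retrained model, and call transform (not fit_transform) at prediction time, so the single row is expanded into the same feature columns the model was trained on. A minimal sketch, assuming the training data sits in a file such as train.pkl with a target column named target (file and column names are illustrative, not from the post, and LogisticRegression is only a placeholder for the original estimator):

import pickle
import category_encoders as ce
import pandas as pd
from sklearn.linear_model import LogisticRegression

# --- training time ---
train = pd.read_pickle("train.pkl")                      # hypothetical training file
X_train, y_train = train.drop(columns=["target"]), train["target"]

ohe = ce.OneHotEncoder(handle_unknown='ignore', use_cat_names=True)
X_train_ohe = ohe.fit_transform(X_train)                 # fit the encoder once, on the training data

model = LogisticRegression(max_iter=1000)
model.fit(X_train_ohe, y_train)

with open("pickle_model.pkl", "wb") as f:
    pickle.dump((ohe, model), f)                         # persist encoder and model together

# --- prediction time ---
with open("pickle_model.pkl", "rb") as f:
    ohe, model = pickle.load(f)

X_t = pd.read_pickle("case1.pkl")                        # the single-row dataframe
X_t_ohe = ohe.transform(X_t)                             # transform only; columns now match training
print(model.predict(X_t_ohe)[0])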

Related

Dask client.persist returns AssertionError when I try to use HashingVectorizer

I am trying to vectorize a dask.dataframe with the dask HashingVectorizer. I want the vectorization results to stay in the cluster (distributed system). That's why I am using client.persist when I try to transform the data. But for some reason, I am getting the error below.
Traceback (most recent call last):
File "/home/dodzilla/my_project/components_with_adapter/vectorizers/base_vectorizer.py", line 112, in hybrid_feature_vectorizer
CLUSTERING_FEATURES=self.clustering_features)
File "/home/dodzilla/my_project/components_with_adapter/vectorizers/text_vectorizer.py", line 143, in vectorize
X = self.client.persist(fitted_vectorizer.transform, combined_data)
File "/home/dodzilla/.local/lib/python3.6/site-packages/distributed/client.py", line 2860, in persist
assert all(map(dask.is_dask_collection, collections))
AssertionError
I can't share the data, but all of the necessary information about it is below:
>>>type(combined_data)
<class 'dask.dataframe.core.Series'>
>>>type(combined_data.compute())
<class 'pandas.core.series.Series'>
>>>combined_data.compute().shape
12
A minimal working example can be found below. In the code snippet, combined_data holds the merged columns, meaning all of the columns are merged into one column. The data has 12 rows, and all of the values inside the rows are strings. This is the code where I am getting the error:
from stop_words import get_stop_words
from dask_ml.feature_extraction.text import HashingVectorizer as daskHashingVectorizer
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

def convert_dataframe_to_single_text(documents):
    """
    Combine all of the columns into 1 column.
    """
    if type(documents) is dask.dataframe.core.DataFrame:
        cols = documents.columns
        documents['combined'] = documents[cols].apply(func=(lambda row: ' '.join(row.values.astype(str))), axis=1,
                                                      meta=('str'))
        document_texts = documents.drop(cols, axis=1)
    else:
        raise TypeError('Wrong type of data. Expected Pandas DF or Dask DF but received ', type(documents))
    return document_texts

# Init the client.
client = Client('localhost:8786')
# Get stopwords
stopwords = get_stop_words(language="english")
# Create dask dataframe from pandas dataframe
data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': ["twenty", "twentyone", "nineteen", "eighteen"]}
df = pd.DataFrame(data)
df = dd.from_pandas(df, npartitions=1)
# Init the vectorizer
vectorizer = daskHashingVectorizer(stop_words=stopwords, alternate_sign=False,
                                   norm=None, binary=False,
                                   n_features=10000)
# Combine all of the columns into 1 column.
combined_data = convert_dataframe_to_single_text(df)
# Fit the vectorizer.
fitted_vectorizer = client.persist(vectorizer.fit(combined_data))
# Transform the data.
X = client.persist(fitted_vectorizer.transform, combined_data)
I hope the information is enough.
Important note: I am not getting any kind of error when I use client.compute, but from what I understand this doesn't run in the cluster of machines and instead runs on the local machine. And it returns a csr matrix instead of a lazily evaluated dask.array.
This is not how client.persist was supposed to be used. The functions I was looking for are client.submit and client.map... In my case client.submit solved my issue.
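A minimal sketch of that fix, reusing the objects from the snippet above (client, fitted_vectorizer, combined_data); client.submit schedules the call on the cluster and returns a Future whose result can be gathered when needed:

# Submit the transform call to the cluster instead of passing it to client.persist.
future = client.submit(fitted_vectorizer.transform, combined_data)
X = future.result()   # gather the transformed data from the cluster when needed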

New to ML, trying to use SVM and SVR for the first time; some syntax/transposition errors

I am trying to run an SVR on some data I got from Yahoo Finance. I want to use the closing prices of Ethereum to predict the path of the next 10-15 days using a supervised learning method. I have already done an autoregressive model (ARIMA), but now I want to try ML techniques like pattern recognition, so I am starting with SVR.
I am simply running into a problem: I don't know how to convert my data column into a row so that the SVR works... I thought it would be simple but I am new to coding overall... appreciate your help; see below:
''' building a simple model for using machine learning to do pattern recognition on stock prices '''
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.pylab as ply
import numpy as np
from pandas import DataFrame as df
from sklearn.svm import SVR

df = pd.read_csv(r"C:\Learning\ETH.csv", index_col='Date', parse_dates=True)
prices = df['Adj Close']
dates = df.index
dates = dates.values.reshape(1, len(dates))

# run support vector regressions to get next predicted value
svr_lin = SVR(kernel='linear', C=1e3)
svr_poly = SVR(kernel='poly', C=1e3)
svr_rbf = SVR(kernel='rbf', C=1e3)
svr_lin.fit(dates, prices)
svr_poly.fit(dates, prices)
svr_rbf.fit(dates, prices)
When I run this I get the following error:
Traceback (most recent call last):
File "C:/Users/.../Machine Learning/MachLearningStockPrediction.py", line 19, in
svr_lin.fit(dates, prices)
File "C:\Users...\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\svm\base.py", line 149, in fit
X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
File "C:\Users...\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 583, in check_X_y
check_consistent_length(X, y)
File "C:\Users...\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 204, in check_consistent_length
" samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [1, 1085]
Both prices and dates have the same length, 1085.
Please help.
When you do this:
dates = dates.values.reshape(1,len(dates))
dates is converted to a row vector, which has 1 row and 1085 columns. In scikit-learn, the required shape for X (the data features) is [n_samples, n_features]. So here, scikit-learn thinks that your data has only 1 sample with 1085 features.
But your prices has shape [1085, ], which according to scikit-learn should be [n_samples, ]. So the number of samples is taken as 1085, and hence the error.
You should do this to correct the error:
dates = dates.values.reshape(len(dates), 1)
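A small self-contained illustration of the shapes involved, using synthetic arrays instead of the Yahoo Finance data (the values below are made up; only the lengths match the post):

import numpy as np
from sklearn.svm import SVR

dates = np.arange(1085)              # stand-in for the 1085 dates
prices = np.random.rand(1085)        # stand-in for the 1085 closing prices

X = dates.reshape(len(dates), 1)     # column vector: 1085 samples, 1 feature
print(X.shape, prices.shape)         # (1085, 1) (1085,)

svr_lin = SVR(kernel='linear', C=1e3)
svr_lin.fit(X, prices)               # shapes are now consistent, so fit succeeds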

Unable to call the fit function on randomforest regressor python sklearn

I'm unable to call the fit function on the RandomForestRegressor, and even IntelliSense is only showing predict and some other parameters. Below are my code, the traceback, and an image showing the contents of IntelliSense.
import pandas
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor

def predict():
    Fvector = 'C:/Users/Oussema/Desktop/Cred_Data/VEctors/FinalFeatureVector.csv'
    data = np.genfromtxt(Fvector, dtype=float, delimiter=',', names=True)
    AnnotArr = np.array(data['CredAnnot'])  # this is a 1D array containing the ground truth (50000 rows)
    TempTestArr = np.array([data['GrammarV'], data['TweetSentSc'], data['URLState']])  # this is the feature vector; the shape is (3, 50000) and the value range is [0-1]
    FeatureVector = TempTestArr.transpose()  # I used the transpose method to get the shape (50000, 3)
    RF_model = RandomForestRegressor(n_estimators=20, max_features='auto', n_jobs=-1)
    RF_model.fit(FeatureVector, AnnotArr)
    print(RF_model.oob_score_)

predict()
IntelliSense content: https://i.stack.imgur.com/XweOo.png
Traceback call
Traceback (most recent call last):
File "C:\Users\Oussema\source\repos\Regression_Models\Regression_Models\Random_forest_TCA.py", line 15, in <module>
predict()
File "C:\Users\Oussema\source\repos\Regression_Models\Regression_Models\Random_forest_TCA.py", line 14, in predict
print(RF_model.oob_score_)
AttributeError: 'RandomForestRegressor' object has no attribute 'oob_score_'
You need to set the oob_score param to True when initializing the RandomForestRegressor.
As per the documentation:
oob_score : bool, optional (default=False)
whether to use out-of-bag samples to estimate the R^2 on unseen data.
So the attribute oob_score_ is only available if you do this:
def predict():
    ....
    ....
    RF_model = RandomForestRegressor(n_estimators=20,
                                     max_features='auto',
                                     n_jobs=-1,
                                     oob_score=True)  # <= This is what you want
    ....
    ....
    print(RF_model.oob_score_)
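For completeness, a small runnable sketch with synthetic data in place of the CSV (the array shapes mirror the post, but the values are made up):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(50000, 3)          # stand-in for the (50000, 3) feature vector with values in [0, 1]
y = np.random.rand(50000)             # stand-in for the 1D ground-truth array

RF_model = RandomForestRegressor(n_estimators=20, n_jobs=-1, oob_score=True)
RF_model.fit(X, y)
print(RF_model.oob_score_)            # available only because oob_score=True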

Getting TypeError: 'DecisionTreeClassifier' object is not iterable in Spark MLlib

I am trying to implement a decision tree in Spark MLlib with the help of the Coursera course "Machine Learning for Big Data". I got the error below:
<class 'pyspark.ml.classification.DecisionTreeClassifier'>
Traceback (most recent call last):
File "C:/sparkcourse/Pycharmproject/Decisiontree.py", line 65, in <module>
model=modelpipeline.fit(traindata)
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\base.py", line 64, in fit
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\pipeline.py", line 93, in _fit
TypeError: 'DecisionTreeClassifier' object is not iterable
Here is the code
from pyspark.sql import SparkSession
from pyspark.sql import DataFrameNaFunctions
#pipeline is estimator or transformer
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import Binarizer
from pyspark.ml.feature import VectorAssembler,VectorIndexer,StringIndexer
spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").enableHiveSupport().getOrCreate()
weatherdata=spark.read.csv("file:///SparkCourse/daily_weather.csv",header="true",inferSchema="true")
#print(weatherdata.columns)
#for input features we explicitly take the columns
featurescolumn=['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am', 'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am']
#print(featurescolumn)
weatherdata=weatherdata.drop("number")
#print(weatherdata.columns)
#missing value dealing
weatherdata=weatherdata.na.drop()
#print(weatherdata.count(),len(weatherdata.columns))
#create a categorical variable to denote if humidity is not low (we will deal here with the relative_humidity_3pm column). If the value is
#less than 25% the categorical value is 0, and if it is higher it will be 1. Using Binarizer will solve this
binarizer=Binarizer(threshold=24.99999,inputCol='relative_humidity_3pm',outputCol='low_humid')
#we transform whole weatherdata into Binarizer categorical value
binarizerDf=binarizer.transform(weatherdata)
#binarizerDf.select("relative_humidity_3pm",'low_humid').show(4)
#aggregating the features that will be used to make the prediction into a single column
#The inputCols argument specifies our list of column names we defined earlier, and outputCol is the name of the new column. The second line creates a new DataFrame with the aggregated features in a column.
assembler=VectorAssembler(inputCols=featurescolumn,outputCol="features")
assembled=assembler.transform(binarizerDf)
#assembled.select("features").show(1)
#splitting train and test data by calling randomSplit
(traindata, testdata)=assembled.randomSplit([0.80,0.20],seed=1234)
#data counting
print(traindata.count(),testdata.count())
#create decision trees Model
#----------------------------------
#The labelCol argument is the column we are trying to predict, featuresCol specifies the aggregated features column, maxDepth is stopping criterion for tree induction based on maximum depth of tree
#minInstancesPerNode is stopping criterion for tree induction based on minimum number of samples in a node
#impurity is the impurity measure used to split nodes.
decisiontree=DecisionTreeClassifier(labelCol="label",featuresCol="features",maxDepth=5,minInstancesPerNode=20,impurity="gini")
print(type(decisiontree))
#creating the model by training the decision tree; a pipeline solves this
modelpipeline=Pipeline(stages=decisiontree)
model=modelpipeline.fit(traindata)
#predicting test data
predictions=model.transform(testdata)
#showing predictedvalue
prediction=predictions.select('prediction','label').show(5)
The course uses Spark 1.6 in a Cloudera VM, but I have integrated Spark 2.1.0 with PyCharm.
stages should be a sequence of PipelineStages (Transformers or Estimators), not a single Estimator. Replace:
Pipeline(stages=decisiontree)
with
Pipeline(stages=[decisiontree])
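A small self-contained sketch of the corrected Pipeline call, using a tiny synthetic DataFrame instead of the weather CSV (the data below is made up and only illustrates that stages takes a list):

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Tiny synthetic training set with the expected "label" and "features" columns
traindata = spark.createDataFrame(
    [(0.0, Vectors.dense([1.0, 0.0])),
     (1.0, Vectors.dense([0.0, 1.0])),
     (0.0, Vectors.dense([1.0, 1.0])),
     (1.0, Vectors.dense([0.0, 0.0]))],
    ["label", "features"])

decisiontree = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=2)

# stages must be a list of pipeline stages, even when there is only one estimator
modelpipeline = Pipeline(stages=[decisiontree])
model = modelpipeline.fit(traindata)
model.transform(traindata).select("prediction", "label").show()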

sklearn LogisticRegression does not accept csr_matrix

I am a newbie and I have to classify the words of a lexicon according to the De Pauw and Wagacha (1998) method (basically, maxent on character n-grams). The data is very large (500,000 entries and millions of n-grams), so I must load the samples as a sparse matrix. But I ran into a problem.
sklearn.linear_model.LogisticRegression().fit(X, y) says it does not accept scipy.sparse.csr.csr_matrix training vectors. I get this error:
Traceback (most recent call last):
File "test-LR-4.py", line 8, in <module>
clf.fit(X,y)
File "/usr/lib/pymodules/python2.7/sklearn/svm/base.py", line 441, in fit
% type(X))
ValueError: Training vectors should be array-like, not <class 'scipy.sparse.csr.csr_matrix'>
for the following script:
from sklearn.linear_model import LogisticRegression
import numpy as np
import scipy.sparse as sp
X = sp.csr_matrix([[0, 1, 2],[1, 2, 3],[3, 2, 1]])
y = np.array(range(3))
clf=LogisticRegression(dual=True)
clf.fit(X,y)
As mentioned in the comments by @Andreas and @Fred Foo, upgrading the sklearn version (> 0.13) will solve the problem.
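A quick check that the snippet works once scikit-learn is new enough. Note that in current releases dual=True is only accepted with the liblinear solver, so the solver is pinned here; otherwise this is the code from the question:

from sklearn.linear_model import LogisticRegression
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix([[0, 1, 2], [1, 2, 3], [3, 2, 1]])
y = np.array(range(3))

# dual=True requires the liblinear solver (with L2 penalty) in recent versions
clf = LogisticRegression(dual=True, solver='liblinear')
clf.fit(X, y)                 # sparse CSR input is accepted directly
print(clf.predict(X))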
