Linear Regression Prediction on Python3 - python-3.x

I am trying to use LinearRegression on a data set in Python 3. I want to see the influence of Order Size on the metric OTIF (On Time In Full), which is the percentage of deliveries that arrive on time and in full. I get an error when I try to use LinearRegression.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# path of data
path = 'c:/Data/ame.csv'
df = pd.read_csv(path)
df.head()
from sklearn.linear_model import LinearRegression
lm = LinearRegression
lm
X = df[['Order Units']]
Y = df['OTIF%']
lm.fit(X,Y)
Yhat=lm.predict(X)
Yhat[0:5]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-39-b4b21bd8b84e> in <module>
----> 1 Yhat=lm.predict(X)
2 Yhat[0:5]
TypeError: predict() missing 1 required positional argument: 'X'

I think the issue is that you are not creating a LinearRegression object: you assigned the class itself to lm. You must call its constructor to get an instance of the class. Try this:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['Order Units']]
Y = df['OTIF%']
lm.fit(X,Y)
Yhat=lm.predict(X)
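Once it is fitted, you can also inspect the learned line directly; a small sketch, reusing the variables above:

# slope: estimated change in OTIF% per additional order unit
print(lm.coef_)
# intercept: predicted OTIF% at zero order units
print(lm.intercept_)

Yhat = lm.predict(X)
print(Yhat[0:5])  # the first five predictions, as in the original attempt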

Related

Getting a ValueError in train/test while eliminating features from the dataset using RFE. What is the solution?

[screenshot of the ValueError]
Here is the code for eliminating features where I am getting a ValueError. I want to use recursive feature elimination without specifying any features: I tried to use the RFE (recursive feature elimination) model to automatically eliminate weak features on each iteration, which I have been unable to do. Here is the link to the dataset: https://drive.google.com/file/d/1neYnunu6a_Mdn3NfRZsF8wE4gwMCpjAY/view?usp=sharing . I will be grateful if you can suggest how to do it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn import datasets
from sklearn.metrics import classification_report
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
# df was presumably loaded from the mounted Drive; the read_csv call is missing from the post
df.keys()
x=pd.DataFrame(df)
x.head()
X = df.drop(["Sub_Cat"],axis=1).values
y = df["Sub_Cat"].values
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
X_train.shape,X_test.shape
sel=SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
sel.fit(X_train,y_train)
sel.get_support()
I am getting the ValueError in this part (see the screenshot above).
Then I tried dropping the string columns as well:
X = df.drop(["Dst_IP","Timestamp","Flow_ID","Src_IP","Sub_Cat"],axis=1).values
y = df["Sub_Cat"].values
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
sel=SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
sel.fit(X_train,y_train)
sel.get_support()
I am still getting the error:
ValueError Traceback (most recent call last)
in ()
1 sel=SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
----> 2 sel.fit(X_train,y_train)
3 sel.get_support()
3 frames
/usr/local/lib/python3.7/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
ValueError: could not convert string to float: 'Anomaly'
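The traceback shows that a string value ('Anomaly') is still reaching the random forest, so non-numeric columns remain in X. A minimal sketch of one possible fix, reusing the column names from the snippets above (the CSV path is hypothetical): one-hot encode the remaining string columns before fitting.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

df = pd.read_csv('/content/drive/MyDrive/dataset.csv')  # hypothetical path

# one-hot encode every remaining string column except the target
features = pd.get_dummies(df.drop(["Sub_Cat"], axis=1))
X = features.values
y = df["Sub_Cat"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
sel = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1))
sel.fit(X_train, y_train)
print(sel.get_support())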

Python 3 and Sklearn: Difficulty to use a NOT-sklearn model as a sklearn model

The code below is working. I have just a routine to run a cross-validation scheme using a linear model previously defined in sklearn; I have no problem with that part. My problem is that if I replace model=linear_model.LinearRegression() with model=RBF('multiquadric') (please see lines 14 and 15 in __main__), it does not work anymore. So my problem is actually in the class RBF, where I try to mimic a sklearn model.
If I replace the code described above, I get the following error:
FitFailedWarning)
/home/daniel/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: All arrays must be equal length.
FitFailedWarning)
1) Should I define a score function in the Class RBF?
2) How do I do that? I am lost. Since I inherit from BaseEstimator and RegressorMixin, I expected this to be handled internally.
3) Is there something else missing?
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin

class RBF(BaseEstimator, RegressorMixin):
    def __init__(self, function):
        self.function = function

    def fit(self, x, y):
        self.rbf = Rbf(x, y, function=self.function)

    def predict(self, x):
        return self.rbf(x)

if __name__ == "__main__":
    # Load Data
    targetName = 'HousePrice'
    data = datasets.load_boston()
    featuresNames = list(data.feature_names)
    featuresData = data.data
    targetData = data.target
    df = pd.DataFrame(featuresData, columns=featuresNames)
    df[targetName] = targetData
    independent_variable_list = featuresNames
    dependent_variable = targetName
    X = df[independent_variable_list].values
    y = np.squeeze(df[[dependent_variable]].values)

    # Model Definition
    model = linear_model.LinearRegression()
    #model = RBF('multiquadric')

    # Cross validation routine
    number_splits = 5
    score_list = ['neg_mean_squared_error', 'neg_mean_absolute_error', 'r2']
    kfold = model_selection.KFold(n_splits=number_splits, shuffle=True, random_state=0)
    scalar = StandardScaler()
    pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
    results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list, return_train_score=True)
    for score in score_list:
        print(score + ':')
        print('Train: ' + 'Mean', np.mean(results['train_' + score]), 'Standard Error', np.std(results['train_' + score]))
        print('Test: ' + 'Mean', np.mean(results['test_' + score]), 'Standard Error', np.std(results['test_' + score]))
Let's look at the documentation of scipy.interpolate.Rbf:
*args : arrays
x, y, z, …, d, where x, y, z, … are the coordinates of the nodes and d is the array of values at the nodes
So it takes a variable-length argument list, with the last argument being the values at the nodes, which is y in your case. The k-th argument holds the k-th coordinate of all the data points (the same for every other coordinate argument).
Following the documentation, your code should be
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin

class RBF(BaseEstimator, RegressorMixin):
    def __init__(self, function):
        self.function = function

    def fit(self, X, y):
        self.rbf = Rbf(*X.T, y, function=self.function)

    def predict(self, X):
        return self.rbf(*X.T)

# Load Data
data = datasets.load_boston()
X = data.data
y = data.target

number_splits = 5
score_list = ['neg_mean_squared_error', 'neg_mean_absolute_error', 'r2']
kfold = model_selection.KFold(n_splits=number_splits, shuffle=True, random_state=0)
scalar = StandardScaler()
model = RBF(function='multiquadric')
pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list, return_train_score=True)
for score in score_list:
    print(score + ':')
    print('Train: ' + 'Mean', np.mean(results['train_' + score]), 'Standard Error', np.std(results['train_' + score]))
    print('Test: ' + 'Mean', np.mean(results['test_' + score]), 'Standard Error', np.std(results['test_' + score]))
Output
neg_mean_squared_error:
Train: Mean -1.552450953914355e-20 Standard Error 7.932530906290208e-21
Test: Mean -23.007377210596463 Standard Error 4.254629143836107
neg_mean_absolute_error:
Train: Mean -9.398502208736061e-11 Standard Error 2.4673749061941226e-11
Test: Mean -3.1319779583728673 Standard Error 0.2162343985534446
r2:
Train: Mean 1.0 Standard Error 0.0
Test: Mean 0.7144217179633185 Standard Error 0.08526294242760363
Why *X.T: as we saw, each argument corresponds to one coordinate axis of the data points, so we transpose X and then use the * operator to unpack each sub-array and pass it as a separate argument to the variadic function.
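A tiny illustration of that unpacking, with made-up 2-D data:

import numpy as np
from scipy.interpolate import Rbf

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 4 points, 2 coordinates
y = np.array([0.0, 1.0, 1.0, 2.0])

# X.T has shape (2, 4): one row per coordinate axis, so
# Rbf(*X.T, y) is equivalent to Rbf(X[:, 0], X[:, 1], y)
rbf = Rbf(*X.T, y, function='multiquadric')
print(rbf(*X.T))  # evaluated at the nodes, this reproduces y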
Looks like the latest implementation has a mode parameter where we can pass the N-D array directly.

swap .has_key for something else in Python 3

I know that in Python 3 ".has_key" is replaced by "in", but in my example I didn't manage to make it work.
The whole code for execution:
from sklearn import model_selection
import pandas as pd
import numpy as np
from sklearn import neighbors, metrics
from matplotlib import pyplot as plt
data = pd.read_csv('your_path/winequality-red.csv', sep=";")
X = data.as_matrix([data.columns[:-1]])
y = data.as_matrix([data.columns[-1]])
y.flatten()
X_train, X_test, y_train, y_test = \
model_selection.train_test_split(X,y, test_size=0.3)
knn= neighbors.KNeighborsRegressor(n_neighbors = 12)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
The part which returns the error:
sizes = {}
for (yt, yp) in zip(list(y_test), list(y_pred)):
    if sizes.has_key((yt, yp)):
        sizes[(yt, yp)] += 1
    else:
        sizes[(yt, yp)] = 1
keys = sizes.keys()
plt.scatter([k[0] for k in keys], [k[1] for k in keys], s=[sizes[k] for k in keys], color='coral')
When I try to swap if sizes.has_key((yt, yp)): for if (yt, yp) in sizes:
I get the error: TypeError: unhashable type: 'numpy.ndarray'
(Download the wine database: winequality-red.csv.)
Thanks in advance for any help. The result I'm looking for:
[scatter plot with point sizes proportional to the pair counts]
I don't think the code you show can actually produce the error you report. Possibly you have redefined some variable in the notebook outside of that code?
In any case, concerning the question, you would want to replace if sizes.has_key((yt, yp)): with
if (yt, yp) in sizes.keys():
(or simply if (yt, yp) in sizes:). This should give you the desired plot.
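Note also that y.flatten() in the question returns a new array without modifying y, so each element of y_test is itself a one-element ndarray, and an ndarray cannot be used as a dict key; that would explain the unhashable-type error. A sketch of a possible fix, assuming the variables and imports from the question:

y = y.flatten()  # flatten() returns a copy; the result must be assigned

sizes = {}
for yt, yp in zip(np.ravel(y_test), np.ravel(y_pred)):
    key = (float(yt), float(yp))  # plain floats are hashable
    sizes[key] = sizes.get(key, 0) + 1  # replaces the old has_key test
keys = sizes.keys()
plt.scatter([k[0] for k in keys], [k[1] for k in keys], s=[sizes[k] for k in keys], color='coral')
plt.show()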

Program is not working: "TypeError: fit() missing 1 required positional argument: 'y'"

from sklearn import tree
from sklearn.datasets import load_iris
iris=load_iris()
dir(iris)
#output data to train setosa, versicolor and virginica
x=iris.data
#fetching data
x=np.delete(x, np.s_[::50], 0)
#print(x)
y=iris.target
#fetching output
y=np.delete(y, np.s_[::50], 0)
algo=tree.DecisionTreeClassifier
When I try to use fit, it does not work:
train=algo.fit(x,y)
res=train.pridict([test_setosa])
print(res)
You need to change something in your code. DecisionTreeClassifier is a class, and the way you call it in your code is wrong: you assigned the class itself instead of creating an instance.
Replace
algo=tree.DecisionTreeClassifier
with
algo=tree.DecisionTreeClassifier()
Full code
from sklearn import tree
from sklearn.datasets import load_iris
import numpy as np
iris=load_iris()
dir(iris)
#output data to train setosa, versicolor and virginica
x=iris.data
#fetching data
x=np.delete(x, np.s_[::50], 0)
#print(x)
y=iris.target
#fetching output
y=np.delete(y, np.s_[::50], 0)
algo=tree.DecisionTreeClassifier()
train=algo.fit(x,y)
test_setosa=[5.1, 3.5, 1.4, 0.2]  # hypothetical example sample; test_setosa was never defined in the question
res=train.predict([test_setosa])
print(res)

Unorderable Types: str() > float error KNN model

I have read quite a bit on this particular error and haven't been able to find an answer that addresses my issue. I have a data set that I have split into train and test sets, and I am looking to run a KNeighborsClassifier. My code is below. My problem is that when I look at the dtypes of my X_train, I don't see any string-formatted columns at all. My y_train is a single categorical variable. This is my first Stack Overflow post, so my apologies if I've overlooked any formalities, and thanks for the help! :)
Error:
TypeError: unorderable types: str() > float()
Dtypes:
X_train.dtypes.value_counts()
Out[54]:
int64 2035
float64 178
dtype: int64
Code:
# Import Packages
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.dummy import DummyRegressor
from sklearn.cross_validation import train_test_split, KFold
from matplotlib.ticker import FormatStrFormatter
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import pdb
# Set Directory Path
path = "file_path"
os.chdir(path)
#Select Import File
data = 'RawData2.csv'
delim = ','
#Import Data File
df = pd.read_csv(data, sep = delim)
print (df.head())
df.columns.get_loc('Categories')
#Model
#Select/Update Features
X = df[df.columns[14:2215]]
#Get Column Index for Target Variable
df.columns.get_loc('Categories')
#Select Target and fill na's with "Small" label
y = df[df.columns[21]]
print(y.values)
y.fillna('Small')
#Training/Test Set
X_sample = X.loc[X.Var1 <1279]
X_valid = X.loc[X.Var1 > 1278]
y_sample = y.head(len(X_sample))
y_test = y.head(len(y)-len(X_sample))
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size = 0.2)
cv = KFold(n = X_train.shape[0], n_folds = 5, random_state = 17)
print(X_train.shape, y_train.shape)
X_train.dtypes.value_counts()
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)  # <-- This is where the error is flagged
accuracy_score(y_test, knn.predict(X_test))
Everything in sklearn is based on numpy, which only handles numbers, so categorical X and y need to be encoded as numbers. For X you can use get_dummies; for y you can use LabelEncoder.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
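A minimal sketch of that encoding step, reusing X and y from the question (note also that fillna returns a new Series, so its result must be assigned; otherwise NaN floats stay mixed in with the string labels, which is one way a str/float comparison can arise):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

y = y.fillna('Small')  # assign the result; fillna is not in-place by default

le = LabelEncoder()
y_encoded = le.fit_transform(y)  # string labels -> integers

X_encoded = pd.get_dummies(X)  # categorical feature columns -> indicator columns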
