swap .has_khey by something else in python 3 - python-3.x

I know that in python3 ".has_khey" is replace by "in"
But in my exemple , i didn't manage for make it working .
the whole quote for execution
from sklearn import model_selection
import pandas as pd
import numpy as np
from sklearn import neighbors, metrics
from matplotlib import pyplot as plt
data = pd.read_csv('your_path/winequality-red.csv', sep=";")
X = data.as_matrix([data.columns[:-1]])
y = data.as_matrix([data.columns[-1]])
y.flatten()
X_train, X_test, y_train, y_test = \
model_selection.train_test_split(X,y, test_size=0.3)
knn= neighbors.KNeighborsRegressor(n_neighbors = 12)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
the part which return me error :
sizes = {}
for (yt, yp) in zip(list(y_test), list(y_pred)):
if sizes.has_key((yt, yp)):
sizes[(yt, yp)] += 1
else:
sizes[(yt, yp)] = 1
keys = sizes.keys()
plt.scatter([k[0] for k in keys], [k[1] for k in keys], s=[sizes[k] for k in keys], color='coral')
when i try to swap if sizes.has_key((yt, yp)): in if (yt, yp) in sizes:
I got the error : TypeError: unhashable type: 'numpy.ndarray'
download the wine database
thanks in advance for any help
the result i'm looking for :
plot scatter size
here the .ipynb or .py file

I don't think the code you show can actually produce the error you report. Possibly you have redefined some variable in the notebook outside of that code?
In any case, concerning the question, you would want to replace if sizes.has_key((yt, yp)): by
if (yt, yp) in sizes.keys():
This should give you the desired plot

Related

How do i fix "If using all scalar values, you must pass an index" error?

I am manually trying to build a linear regression model for understanding purpose without using the builtin function. I am getting the error while plotting the regression line. Kindly help me fix it.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sb
data = {'X': list(np.arange(0,10,1)), 'Y': [1,3,2,5,7,8,8,9,10,12]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(np.ones(10), columns = ['ones'])
df_new = pd.concat([df2,df], axis = 1)
X = df_new.loc[:, ['ones', 'X']].values
Y = df_new['Y'].values.reshape(-1,1)
theta = np.array([0.5, 0.2]).reshape(-1,1)
Y_pred = X.dot(theta)
sb.lineplot(df['X'].values.reshape(-1,1),Y_pred)
plt.show()
Error message:
If using all scalar values, you must pass an index
You are passing a 2d array, while seaborn's lineplot expects a 1d array (or a pandas column which is basically same). So change it to
sb.lineplot(df['X'].values,Y_pred.reshape(-1))

Unable to use "from sklearn.preprocessing import Imputer" , it shows the exception " Data must be 1-dimensional"

I have made a model for the artificial neural network(ANN). I want to preprocess the data before train the model.
I have tried the code given below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Update-Detaset with hacking1.csv')
y=[]
X = dataset.iloc[:,2:7]
y = dataset.iloc[:,8]
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
Y = np.reshape(y,(-1,1))
imputer = imputer.fit(Y)
Y= imputer.transform(Y)
Exception: Data must be 1-dimensional
Here, Update-Detaset with hacking1.csv is the .csv file. The dataset is lookig like:
Please click the link to see the demo of the csv file
It shows the following errors.
How can I solve this?
This has nothing to do with Imputer. You should have been able to tell this from the line number that threw the Exception. The error is from you trying to reshape a pandas DataFrame. Change
y = dataset.iloc[:,8]
to
y = dataset.iloc[:,8].values
and it should work.

error of making a test and train data split in sklearn

I am trying to make a test and train data split by "train_test_split".
Why I got the error "At least one array required as input".
The input of "train_test_split" can be array and dataFrame, right ?
import pandas as pd
import numpy as np
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
import pandas.rpy.common as rpy_common
from sklearn.model_selection import train_test_split
def la():
ro.r('library(MASS)')
pydf = rpy_common.load_data(name = 'Boston', package=None, convert=True)
pddf = pd.DataFrame(pydf)
targetIndex = pddf.columns.get_loc("medv")
# make train and test data
rowNum = pddf.shape[0]
colNum = pddf.shape[1]
print(type(pddf.as_matrix()))
print(pddf.as_matrix().shape)
m = np.asarray(pddf.as_matrix()).reshape(rowNum,colNum)
print(type(m))
x_train, x_test, y_train, y_test = train_test_split(x = m[:, 0:rowNum-2], \
y = m[:, -1],\
test_size = 0.5)
# error: raise ValueError("At least one array required as input")
ValueError: At least one array required as input
From the sklearn docs the arrays are handled with positional item unpacking ("*args").
You are using keyword arguments, "x=" and "y=", which it tries to handle by looking if "x" and "y" are the names of special keyword options.
Try:
train_test_split(m[:, 0:rowNum-2], m[:, -1], test_size=0.5)
(removing the keyword argument names from the arrays).

How to loop through items in pandas col and run and plot a scikit model?

I got some interesting user data from races. I know when the respecitve athletes planed to finish a race and I know when they actaully finished (next to some more stuff). The goal is to find out when the athletes come in late. I want to run a support vector machine for each athlete and plot the decision boundaries.
Here is what I do:
import numpy as np
import pandas as pd
from sklearn import svm
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
df = pd.DataFrame({'User': np.random.random_integers(low=1, high=4, size=50),
'Planned_End': np.random.uniform(low=-5, high=5, size=50),
'Actual_End': np.random.uniform(low=-1, high=1, size=50),
'Late': np.random.random_integers(low=0, high=2, size=50)}
)
# Fit Support Vector Machine Classifier
X = df[['Planned_End', 'Actual_End']]
y = df['Late']
clf = svm.SVC(decision_function_shape='ovo')
for i, y in df['User']:
clf.fit(X, y)
ax = plt.subplot()
fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
plt.title(lab)
plt.show()
I get the following error: TypeError: 'numpy.int64' object is not iterable - that is, I somehow can't loop through the column.
I guess it comes down to the numpy data format? How can I solve that?
try iteritems()
for i, y in df['User'].iteritems():
Your User Series contains numpy.int64 objects so you can only use:
for y in df['User']:
And you don't use i anywhere.
As for the rest of the code, this produces some solution, please edit accordingly:
import numpy as np
import pandas as pd
from sklearn import svm
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
# Create arbitrary dataset for example
df = pd.DataFrame({'User': np.random.random_integers(low=1, high=4, size=50),
'Planned_End': np.random.uniform(low=-5, high=5, size=50),
'Actual_End': np.random.uniform(low=-1, high=1, size=50),
'Late': np.random.random_integers(low=0, high=2, size=50)}
)
# Fit Support Vector Machine Classifier
X = df[['Planned_End', 'Actual_End']].as_matrix()
y = df['Late']
clf = svm.SVC(decision_function_shape='ovo')
y = df['User'].values
clf.fit(X, y)
ax = plt.subplot()
fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)
plt.title('lab')
plt.show()

Unorderable Types: str() > float error KNN model

I have read quite a bit on this particular error and haven't been able to find an answer that addresses my issue. I have a data set that I have split into train and test sets and am looking to run a KNeighborsClassifier. My code is below... My problem is that when I look at the dtypes of my X_train i don't see any string formatted columns at all. My y_train is a single categorical variable. This is my first stackoverflow post so my apologies if I've overlooked any formalities and thanks for the help! :)
Error:
TypeError: unorderable types: str() > float()
Dtypes:
X_train.dtypes.value_counts()
Out[54]:
int64 2035
float64 178
dtype: int64
Code:
# Import Packages
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.dummy import DummyRegressor
from sklearn.cross_validation import train_test_split, KFold
from matplotlib.ticker import FormatStrFormatter
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import pdb
# Set Directory Path
path = "file_path"
os.chdir(path)
#Select Import File
data = 'RawData2.csv'
delim = ','
#Import Data File
df = pd.read_csv(data, sep = delim)
print (df.head())
df.columns.get_loc('Categories')
#Model
#Select/Update Features
X = df[df.columns[14:2215]]
#Get Column Index for Target Variable
df.columns.get_loc('Categories')
#Select Target and fill na's with "Small" label
y = y[y.columns[21]]
print(y.values)
y.fillna('Small')
#Training/Test Set
X_sample = X.loc[X.Var1 <1279]
X_valid = X.loc[X.Var1 > 1278]
y_sample = y.head(len(X_sample))
y_test = y.head(len(y)-len(X_sample))
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size = 0.2)
cv = KFold(n = X_train.shape[0], n_folds = 5, random_state = 17)
print(X_train.shape, y_train.shape)
X_train.dtypes.value_counts()
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train) **<-- This is where the error is flagged**
accuracy_score(knn.predict(X_test))
Everything in sklearn is based on numpy which only uses numbers. Hence categorical X and Y need to be encoded as numbers. For x you can use get_dummies. For y you can use LabelEncoder.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

Resources