Convert string to float error in pandas machine learning - python-3.x

For my machine learning code, I have some unknown values marked with '?' in my CSV file. I am trying to replace them with NaN, but the code throws an error. The following is the code I have used for the replacement. Can anyone please help?
Thanks in advance!
import pandas as pd
import matplotlib as plot
import numpy as np

df = pd.read_csv('cdk.csv')
x = df.iloc[:, 0:24].values
y = df.iloc[:, 24].values

from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0, copy=False)
imputer = imputer.fit(x[:, 0:5])
imputer.fit_transform(x[:, 0:5])
imputer = Imputer(missing_values='normal', strategy='mode', axis=0, copy=False)
imputer = imputer.fit(x[:, 5:7])
imputer.fit_transform(x[:, 5:7])
This is the error it throws:
Traceback (most recent call last):
File "kidney.py", line 10, in <module>
imputer = imputer.fit(x[:,0:5])
File "C:\Users\YAASHI\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\Users\YAASHI\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '?'
Link for the csv file

If you want to replace all ? strings with NaN, do this:
df.replace('?', np.nan, inplace=True)
Or better yet, load them as NaN as you read the CSV:
df = pd.read_csv('cdk.csv', na_values=['?'])
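For completeness, here is a minimal sketch combining the two fixes with the legacy sklearn.preprocessing.Imputer from the question (in modern scikit-learn it has been removed in favour of sklearn.impute.SimpleImputer):
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer

# Treat '?' as missing while reading, so numeric columns parse as floats
df = pd.read_csv('cdk.csv', na_values=['?'])
x = df.iloc[:, 0:24].values

# The legacy Imputer spells np.nan as the string 'NaN'
imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
x[:, 0:5] = imputer.fit_transform(x[:, 0:5])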

Related

Can't append NaN to python list

The error is
Traceback (most recent call last):
File "1.py", line 28, in <module>
buy.append(np.nan)
AttributeError: 'numpy.ndarray' object has no attribute 'nan'
Here is the code (Python 3):
import fxcmpy
import socketio
from pylab import plt
import numpy as np
from finta import TA

TOKEN = 'xxxx'
con = fxcmpy.fxcmpy(access_token=TOKEN, log_level='error', server='real', log_file='log.txt')
#print(con.get_instruments())
data = con.get_candles('US30', period='D1', number=250)
con.close()
df1 = data[['askopen', 'askhigh', 'asklow', 'askclose']]
plt.style.use('seaborn')
np = df1.to_numpy()
df2 = df1.rename(columns={'askopen': 'open', 'askhigh': 'high', 'asklow': 'low', 'askclose': 'close'})
dfhma = TA.HMA(df2, 14)
pr1 = dfhma.shift(1)
pr2 = dfhma.shift(2)
buy = []
sell = []
i = 0
flag = ''
for item in dfhma:
    if item > pr1[i] and item > pr2[i] and flag != 1:
        flag = 1
        buy.append(item)
    else:
        buy.append(np.nan)
    if item < pr1[i] and item < pr2[i] and flag != 0:
        flag = 0
        sell.append(item)
    else:
        sell.append(np.nan)
    i = i + 1
print(buy)
print('buy len=' + str(len(buy)))
mk = []
for item in dfhma:
    print(item)
plt.plot(dfhma)
plt.scatter(dfhma.index, buy, marker='^', color='g')
plt.scatter(dfhma.index, sell, marker='v', color='r')
plt.show()
I searched Google/Stack Overflow and found nothing, and changing nan to NaN or NAN still gives the same error, so I'm guessing it's a newbie mistake. Help! I'm just trying to add NaN to the list as a buy/sell signal and it doesn't work. What could be wrong here?
In the line np = df1.to_numpy() you reassigned the variable np from the package to a numpy array. So when you called np.nan, Python looked for nan on that numpy ndarray instance, not on the package.
Rename the variable to anything else and it will work fine.
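A minimal sketch of the fix (arr is just a hypothetical replacement name; anything that doesn't shadow the module works):
import numpy as np

arr = df1.to_numpy()  # renamed so it no longer shadows the numpy module
buy.append(np.nan)    # np still refers to the numpy package here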

sklearn.impute.SimpleImputer, NaN to mean, not working

I have a dataset Data.csv
Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
I tried to fill the NaN values using sklearn.impute.SimpleImputer with the following code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
# Taking care of missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = 'NaN', strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
But I get an error which says:
File "C:\Users\Krishna Rohith\Machine Learning A-Z\Part 1 - Data Preprocessing\Section 2 ----------- --------- Part 1 - Data Preprocessing --------------------\missing_data.py", line 16, in <module>
imputer = imputer.fit(X[:, 1:3])
File "C:\Users\Krishna Rohith\Anaconda3\lib\site-packages\sklearn\impute\_base.py", line 268, in fit
X = self._validate_input(X)
File "C:\Users\Krishna Rohith\Anaconda3\lib\site-packages\sklearn\impute\_base.py", line 242, in _validate_input
raise ve
File "C:\Users\Krishna Rohith\Anaconda3\lib\site-packages\sklearn\impute\_base.py", line 235, in _validate_input
force_all_finite=force_all_finite, copy=self.copy)
File "C:\Users\Krishna Rohith\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "C:\Users\Krishna Rohith\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 60, in _assert_all_finite
msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I know how to do this with plain NumPy, but can someone please tell me how to do it using sklearn.impute?
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
Replace 'NaN' by numpy default Nan np.nan
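A minimal sketch of the corrected flow, keeping the variable names from the question (np.nan is also SimpleImputer's default for missing_values, so the argument could be omitted entirely):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values

# np.nan, not the string 'NaN', marks missing entries in a float array
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])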

How to get a prediction for a single data entry

I have a trained model stored in a pickle file. All I need to do is take a single-row pandas dataframe and get a prediction by passing it to the model.
To handle the categorical columns, I have used one-hot encoding. So to convert the pandas dataframe to a numpy array, I also used one-hot encoding on the single-row dataframe. But it shows me an error.
import pickle
import category_encoders as ce
import pandas as pd

pkl_filename = "pickle_model.pkl"
with open(pkl_filename, 'rb') as file:
    pickle_model = pickle.load(file)

ohe = ce.OneHotEncoder(handle_unknown='ignore', use_cat_names=True)
X_t = pd.read_pickle("case1.pkl")
X_t_ohe = ohe.fit_transform(X_t)
X_t_ohe = X_t_ohe.fillna(0)
Ypredict = pickle_model.predict(X_t_ohe)
print(Ypredict[0])
Traceback (most recent call last):
File "Predict.py", line 14, in <module>
Ypredict = pickle_model.predict(X_t_ohe)
File "/home/neo/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 289, in predict
scores = self.decision_function(X)
File "/home/neo/anaconda3/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 270, in decision_function
% (X.shape[1], n_features))
ValueError: X has 93 features per sample; expecting 989
This happens because OneHotEncoder converts your dataframe into many numerical columns, and your pickled model was trained on the original file, so the dimensions (number of columns) no longer match. Calling fit_transform on a single-row dataframe only creates columns for the categories present in that one row.
To rectify this, retrain your model after applying the one-hot encoder, save both the fitted encoder and the model as pickle files, and at prediction time reuse the fitted encoder with transform (not fit_transform) before passing the row to the model.
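A minimal sketch of that flow, assuming the original training data is available as X_train and y_train (hypothetical names) and using a hypothetical encoder.pkl file for the fitted encoder:
import pickle
import category_encoders as ce

# Training time: fit the encoder on the full training data, then
# persist both the fitted encoder and the trained model.
ohe = ce.OneHotEncoder(handle_unknown='ignore', use_cat_names=True)
X_train_ohe = ohe.fit_transform(X_train)
pickle_model.fit(X_train_ohe, y_train)
with open("pickle_model.pkl", "wb") as f:
    pickle.dump(pickle_model, f)
with open("encoder.pkl", "wb") as f:
    pickle.dump(ohe, f)

# Prediction time: load both, and call transform (not fit_transform)
# so the single row expands into the same columns seen in training.
with open("encoder.pkl", "rb") as f:
    ohe = pickle.load(f)
X_t_ohe = ohe.transform(X_t)
Ypredict = pickle_model.predict(X_t_ohe)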

sklearn LabelEncoder inverse_transform TypeError: only integer scalar arrays can be converted to a scalar index

I am getting the following error when calling the inverse_transform of LabelEncoder:
Traceback (most recent call last):
File "Test.py", line 31, in <module>
inverted = label_encoder.inverse_transform(integer_encoded['DEST'])
File "...\Python\Python36\lib\site-packages\sklearn\preprocessing\label.py", line 283, in inverse_transform
return self.classes_[y]
TypeError: only integer scalar arrays can be converted to a scalar index
The code that generates this error is the following:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn import preprocessing
import bisect
data_cat = {'ORG': ['A', 'B', 'C', 'D'],
            'DEST': ['A', 'E', 'F', 'G'],
            'OP': ['F1', 'F1', 'F1', 'F2']}
data_cat = pd.DataFrame(data_cat)
# retain a LabelEncoder for every column in a dictionary
label_encoder_dict = defaultdict(preprocessing.LabelEncoder)
integer_encoded = data_cat.apply(lambda x: label_encoder_dict[x.name].fit_transform(x))
print("Integer encoded: ")
print(integer_encoded)
#add a UNK class that will be used for the unseen values from the test dataset
for key, le in label_encoder_dict.items():
    le_classes = np.array(le.classes_).tolist()
    bisect.insort_left(le_classes, 'UNK')
    le.classes_ = le_classes
label_encoder = label_encoder_dict['DEST']
print(label_encoder.classes_)
print(integer_encoded['DEST'])
print(type (integer_encoded['DEST']))
inverted = label_encoder.inverse_transform(integer_encoded['DEST'])
print(inverted)
If I remove the for loop that adds the UNK class to every LabelEncoder, everything works fine. I don't understand why adding a new class affects the call to inverse_transform.
Thanks for any help or guidance.
LabelEncoder.inverse_transform is actually quite simple. The LabelEncoder object stores an array of original values in the classes_ attribute, and the encoded integer is the index of that value in classes_. Normally, classes_ is an np.array type which supports passing a list of indices to get the values at those indices. However, in your for loop you converted that to a regular old python list, which does not support that behavior.
If you change your for loop to keep le.classes_ as an ndarray, it should work:
for key, le in label_encoder_dict.items():
    le_classes = np.array(le.classes_).tolist()
    bisect.insort_left(le_classes, 'UNK')
    le.classes_ = np.asarray(le_classes)
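A minimal standalone demonstration of the difference, with made-up class values:
import numpy as np

classes = np.asarray(['A', 'E', 'F', 'G', 'UNK'])
print(classes[np.array([0, 1, 2])])  # fancy indexing on an ndarray works: ['A' 'E' 'F']

classes_list = list(classes)
print(classes_list[np.array([0, 1, 2])])  # TypeError: only integer scalar arrays can be converted to a scalar index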

Error: Tuple index out of range while plotting learning curve

Here is my code:
import matplotlib.pyplot as plt, matplotlib.colors as clr
import pandas as pd, csv, numpy as np
from sklearn import linear_model
from sklearn.model_selection import ShuffleSplit as ss, learning_curve as lc, StratifiedKFold as skf
from sklearn.utils import shuffle

file = open('C:\\Users\\Anil Satya\\Desktop\\Internship_projects\\BD Influenza\\BD_Influenza_revised_imputed.csv', 'r+')
flu_data = pd.read_csv(file)
flu_num = flu_data.ix[:, 5:13]
features = np.array(flu_num.ix[:, 0:7])
label = np.array(flu_num.ix[:, 7])
splt = skf(n_splits=2, shuffle=True, random_state=None)
clf = linear_model.LogisticRegression()
model = clf.fit(features, label)

def classifier(clf, x, y):
    accuracy = clf.score(x, y)
    return accuracy

lc(estimator=clf, X=features, y=label, train_sizes=0.75, cv=splt,
   scoring=classifier(clf, features, label))
On execution, it shows the following error:
Traceback (most recent call last):
File "C:/Ankur/Python36/Python Files/BD_influenza_learningcurve.py", line 26, in <module>
lc(estimator=clf,X=features,y=label,train_sizes=0.75,cv=splt, scoring=classifier(clf,features,label))
File "C:\Ankur\Python36\lib\site-packages\sklearn\model_selection\_validation.py", line 756, in learning_curve
n_max_training_samples)
File "C:\Ankur\Python36\lib\site-packages\sklearn\model_selection\_validation.py", line 808, in _translate_train_sizes
n_ticks = train_sizes_abs.shape[0]
IndexError: tuple index out of range
I am not able to identify the problem yet, but I believe it is in the learning_curve call, because I have run the program without it and it works fine.
Either the scoring or the train_sizes parameter causes the problem. The traceback points at train_sizes: learning_curve expects an array-like of sizes there, so the bare scalar 0.75 fails when its shape is inspected. scoring, in turn, should be a string or a callable; passing classifier(clf, features, label) passes the plain float that the call returns.
Try replacing:
lc(estimator=clf, X=features, y=label, train_sizes=0.75, cv=splt,
   scoring=classifier(clf, features, label))
with either
1)
lc(estimator=clf, X=features, y=label, train_sizes=0.75, cv=splt,
   scoring="accuracy")
or 2)
import numpy as np
lc(estimator=clf, X=features, y=label, train_sizes=np.array([0.75]), cv=splt,
   scoring="accuracy")
Finally, for the scoring parameter, see the scikit-learn documentation on the scoring parameter for the available strings.
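If a custom scorer is really wanted, here is a minimal sketch, reusing clf, features, label and splt from the question; scoring also accepts any callable with the signature (estimator, X, y):
import numpy as np
from sklearn.model_selection import learning_curve

def accuracy_scorer(estimator, X, y):
    # called by learning_curve on each CV split; must return a float
    return estimator.score(X, y)

train_sizes, train_scores, test_scores = learning_curve(
    estimator=clf, X=features, y=label,
    train_sizes=np.array([0.25, 0.5, 0.75]),
    cv=splt, scoring=accuracy_scorer)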
