Error: Tuple index out of range while plotting learning curve - python-3.x

Here is my code:
import matplotlib.pyplot as plt,matplotlib.colors as clr
import pandas as pd,csv,numpy as np
from sklearn import linear_model
from sklearn.model_selection import ShuffleSplit as ss, learning_curve as lc,StratifiedKFold as skf
from sklearn.utils import shuffle
file=open('C:\\Users\\Anil Satya\\Desktop\\Internship_projects\\BD Influenza\\BD_Influenza_revised_imputed.csv','r+')
flu_data=pd.read_csv(file)
flu_num=flu_data.ix[:,5:13]
features=np.array(flu_num.ix[:,0:7])
label=np.array(flu_num.ix[:,7])
splt=skf(n_splits=2,shuffle=True,random_state=None)
clf=linear_model.LogisticRegression()
model=clf.fit(features,label)
def classifier(clf,x,y):
    accuracy=clf.score(x,y)
    return accuracy
lc(estimator=clf,X=features,y=label,train_sizes=0.75,cv=splt,
scoring=classifier(clf,features,label))
On execution, it shows the following error:
Traceback (most recent call last):
  File "C:/Ankur/Python36/Python Files/BD_influenza_learningcurve.py", line 26, in <module>
    lc(estimator=clf,X=features,y=label,train_sizes=0.75,cv=splt,
    scoring=classifier(clf,features,label))
  File "C:\Ankur\Python36\lib\site-packages\sklearn\model_selection\_validation.py", line 756, in learning_curve
    n_max_training_samples)
  File "C:\Ankur\Python36\lib\site-packages\sklearn\model_selection\_validation.py", line 808, in _translate_train_sizes
    n_ticks = train_sizes_abs.shape[0]
IndexError: tuple index out of range
I have not been able to identify the problem yet, but I believe it is in the learning curve call, because the program runs fine without it.

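Both the scoring and the train_sizes arguments are problematic. The traceback points at train_sizes: a bare 0.75 is converted to a zero-dimensional array whose shape is the empty tuple, so train_sizes_abs.shape[0] raises the IndexError. Separately, scoring=classifier(clf,features,label) passes the float returned by classifier rather than a scorer; scoring expects a string such as "accuracy" or a callable.
Try to replace:
lc(estimator=clf,X=features,y=label,train_sizes=0.75,cv=splt,
scoring=classifier(clf,features,label))
with
1)
lc(estimator=clf,X=features,y=label,train_sizes=0.75,cv=splt,
scoring="accuracy")
or 2)
import numpy as np
lc(estimator=clf,X=features,y=label,train_sizes=np.array([0.75]),cv=splt,
scoring="accuracy")
Option 1 only fixes the scoring argument; if the same IndexError persists, use option 2, which also makes train_sizes array-like.
Finally, for the strings accepted by the scoring parameter, see the section "The scoring parameter" in the scikit-learn documentation.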
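To illustrate the answer above, here is a minimal sketch on synthetic data. The dataset from make_classification, the variable names, and the max_iter value are assumptions made for the example and are not taken from the question; only the shape of the learning_curve call mirrors option 2.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Why a bare float fails: it has no first dimension to index.
scalar_shape = np.asarray(0.75).shape    # empty tuple
array_shape = np.asarray([0.75]).shape   # one entry

# Synthetic stand-in for the influenza data (assumption: 7 numeric features).
X, y = make_classification(n_samples=200, n_features=7, random_state=0)
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

# train_sizes is array-like and scoring is a string, not a precomputed score.
train_sizes, train_scores, test_scores = learning_curve(
    estimator=LogisticRegression(max_iter=1000),
    X=X,
    y=y,
    train_sizes=np.array([0.25, 0.5, 0.75, 1.0]),
    cv=cv,
    scoring="accuracy",
)

# One row per training size, one column per CV fold.
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```

The mean scores per training size are what would normally be plotted against train_sizes to draw the learning curve.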

Related

Getting a ValueError in train/test while eliminating features from a dataset using RFE. What is the solution?

Here is the code for eliminating features, where I am getting a ValueError. I want to use recursive feature elimination (RFE) without specifying any features by hand, so that weak features are eliminated automatically with each iteration, but I have been unable to do so. Here is the link to the dataset: https://drive.google.com/file/d/1neYnunu6a_Mdn3NfRZsF8wE4gwMCpjAY/view?usp=sharing . I would be grateful if you could suggest how to do it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn import datasets
from sklearn.metrics import classification_report
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
df.keys()
x=pd.DataFrame(df)
x.head()
X = df.drop(["Sub_Cat"],axis=1).values
y = df["Sub_Cat"].values
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
X_train.shape,X_test.shape
sel=SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
sel.fit(X_train,y_train)
sel.get_support()
I am getting a ValueError in this part.
Then I also tried dropping more columns:
X = df.drop(["Dst_IP","Timestamp","Flow_ID","Src_IP","Sub_Cat"],axis=1).values
y = df["Sub_Cat"].values
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
sel=SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
sel.fit(X_train,y_train)
sel.get_support()
I am still getting an error:
ValueError Traceback (most recent call last)
in ()
1 sel=SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
----> 2 sel.fit(X_train,y_train)
3 sel.get_support()
3 frames
/usr/local/lib/python3.7/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
ValueError: could not convert string to float: 'Anomaly'
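The message suggests that at least one remaining column still holds strings such as 'Anomaly', which the random forest cannot convert to floats. One possible approach, sketched below on a tiny made-up frame, is to keep only numeric columns before fitting. The column names Flow_Duration, Pkt_Count and Label and all values are hypothetical; whether dropping (rather than encoding) the string columns is appropriate for the real dataset is an assumption.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Hypothetical miniature of the dataset: two numeric columns, one string column.
df = pd.DataFrame({
    "Flow_Duration": [10, 200, 15, 220, 12, 210, 14, 230],
    "Pkt_Count": [1, 9, 2, 8, 1, 9, 2, 8],
    "Label": ["Normal", "Anomaly"] * 4,
    "Sub_Cat": ["a", "b"] * 4,
})

# Drop the target, then keep only columns with a numeric dtype.
X = df.drop(columns=["Sub_Cat"]).select_dtypes(include="number")
y = df["Sub_Cat"]

sel = SelectFromModel(RandomForestClassifier(n_estimators=20, random_state=0))
sel.fit(X, y)

# Names of the features whose importance passed the selector's threshold.
selected = list(X.columns[sel.get_support()])
print(selected)
```

Running df.dtypes on the real frame would show which columns are still of object dtype and therefore responsible for the error.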

Linear Regression Prediction on Python3

I am trying to use LinearRegression on a data set using Python 3. I want to see the influence of Order Size on the metric OTIF (On Time In Full), which is the percentage of deliveries made on time and in full. I get an error when I try to use LinearRegression.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# path of data
path = 'c:/Data/ame.csv'
df = pd.read_csv(path)
df.head()
from sklearn.linear_model import LinearRegression
lm = LinearRegression
lm
X = df[['Order Units']]
Y = df['OTIF%']
lm.fit(X,Y)
Yhat=lm.predict(X)
Yhat[0:5]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-39-b4b21bd8b84e> in <module>
----> 1 Yhat=lm.predict(X)
2 Yhat[0:5]
TypeError: predict() missing 1 required positional argument: 'X'
I think the issue is that you are not creating a LinearRegression object: lm = LinearRegression only binds the class itself, so you must call its constructor to get an instance. Try this:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['Order Units']]
Y = df['OTIF%']
lm.fit(X,Y)
Yhat=lm.predict(X)
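A self-contained version of the same fix, as a sketch: the column names come from the question, but the four rows of data are invented for illustration so that the snippet runs without the original CSV.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data standing in for ame.csv (values are hypothetical).
df = pd.DataFrame({
    "Order Units": [10, 20, 30, 40],
    "OTIF%": [95.0, 90.0, 85.0, 80.0],
})

# Note the parentheses: this creates an estimator instance, not a class reference.
lm = LinearRegression()

X = df[["Order Units"]]   # 2-D feature frame
Y = df["OTIF%"]           # 1-D target series

lm.fit(X, Y)
Yhat = lm.predict(X)
print(Yhat[0:5])
print(lm.coef_, lm.intercept_)
```

Without the parentheses, lm.predict(X) is called on the class, so X is taken as self and the real X argument appears to be missing, which matches the TypeError in the question.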

module 'seaborn' has no attribute 'distplot'

I've some code like:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('StudentsPerformance.csv')
#print(data.isnull().sum())  # checking whether there are any missing values
#print(data.dtypes)  # checking the datatypes of the dataset
# ANALYSIS OF THE COLUMN VALUES
"""print(data['gender'].value_counts())
print(data['parental level of education'].value_counts())
print(data['race/ethnicity'].value_counts())
print(data['lunch'].value_counts())
print(data['test preparation course'].value_counts())"""
# Adding column total and average to the dataset
data['total'] = data['math score'] + data['reading score'] + data['writing score']
data['average'] = data ['total'] / 3
sns.distplot(data['average'])
I would like to see a distplot of the average for visualization, but when I run the program it gives me an error like:
Traceback (most recent call last):
  File "C:/Users/usersample/PycharmProjects/untitled1/sample.py", line 22, in <module>
    sns.distplot(data['average'])
AttributeError: module 'seaborn' has no attribute 'distplot'
I've tried reinstalling seaborn and upgrading it to 0.9.0, but it doesn't work.
Head of my data:
female,"group B","bachelor's degree","standard","none","72","72","74"
female,"group C","some college","standard","completed","69","90","88"
female,"group B","master's degree","standard","none","90","95","93"
male,"group A","associate's degree","free/reduced","none","47","57","44"
This might be due to paths having been removed from the environment variables. Try adding your IDE's scripts folder and your Python folder to the path. I am using the PyCharm IDE, did the same, and it is working fine.
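Before changing environment variables, it may help to check which interpreter is running and which file Python would load for seaborn. The sketch below uses only the standard library; describe_module is a hypothetical helper written for this example, not part of seaborn or PyCharm.

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether a top-level module can be found and where it would load from."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return {"found": False, "origin": None}
    return {"found": True, "origin": spec.origin}


# The interpreter the IDE is actually using.
print(sys.executable)

# Where "import seaborn" would resolve to, if anywhere.
print(describe_module("seaborn"))
```

If the reported origin is not inside the expected site-packages folder (for example, a file in the project directory), the interpreter is picking up something other than the installed library.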

Convert string to float error in pandas machine learning

For my machine learning code, I have some unknown values marked with '?' in my CSV file. I am trying to replace them with NaN, but the code throws an error. The following is the code I have used for the replacement of '?'. Can anyone please help me solve this?
Thanks in advance !
import numpy
import pandas as pd
import matplotlib as plot
import numpy as np
df = pd.read_csv('cdk.csv')
x=df.iloc[:,0:24].values
y=df.iloc[:,24].values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis =0,copy=False)
imputer = imputer.fit(x[:,0:5])
imputer.fit_transform(x[:,0:5])
imputer = Imputer(missing_values='normal', strategy='mode', axis =0,copy=False)
imputer = imputer.fit(x[:,5:7])
imputer.fit_transform(x[:,5:7])
This is the error it throws:
Traceback (most recent call last):
File "kidney.py", line 10, in <module>
imputer = imputer.fit(x[:,0:5])
File "C:\Users\YAASHI\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\preprocessing\imputation.py", line 155, in fit
force_all_finite=False)
File "C:\Users\YAASHI\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '?'
If you want to replace all ? strings with NaN, do this:
df.replace('?', np.nan, inplace=True)
Or better yet, load them as NaN as you read the CSV:
df = pd.read_csv('cdk.csv', na_values=['?'])
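A runnable sketch of the second suggestion, using an in-memory CSV in place of cdk.csv. The column names and values are invented for illustration. It also assumes a recent scikit-learn, where sklearn.impute.SimpleImputer takes the place of the older Imputer class used in the question.

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for cdk.csv, with '?' marking unknown values.
csv_text = "age,bp,class\n48,80,ckd\n?,70,ckd\n62,?,notckd\n48,70,notckd\n"

# na_values turns every '?' into NaN while parsing, so the columns stay numeric.
df = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

x = df.iloc[:, 0:2].values

# Fill each NaN with the most frequent value of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
x_filled = imputer.fit_transform(x)
print(x_filled)
```

Because the '?' strings never reach the imputer, the "could not convert string to float" error does not arise.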

sklearn LogisticRegression does not accept csr_matrix

I am a newbie and I have to classify the words of a lexicon according to the De Pauw and Wagacha (1998) method (basically, maxent on character n-grams). The data is very large (500 000 entries and millions of n-grams), so I must load the samples as a sparse matrix. But I ran into a problem.
sklearn.linear_model.LogisticRegression().fit(X,y) says it does not accept scipy.sparse.csr.csr_matrix training vectors. I got this error:
Traceback (most recent call last):
File "test-LR-4.py", line 8, in <module>
clf.fit(X,y)
File "/usr/lib/pymodules/python2.7/sklearn/svm/base.py", line 441, in fit
% type(X))
ValueError: Training vectors should be array-like, not <class 'scipy.sparse.csr.csr_matrix'>
for the following script:
from sklearn.linear_model import LogisticRegression
import numpy as np
import scipy.sparse as sp
X = sp.csr_matrix([[0, 1, 2],[1, 2, 3],[3, 2, 1]])
y = np.array(range(3))
clf=LogisticRegression(dual=True)
clf.fit(X,y)
As mentioned in the comments by @Andreas and @Fred Foo, upgrading sklearn to a version newer than 0.13 will solve the problem.
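As a quick check after upgrading, the sketch below fits LogisticRegression directly on a csr_matrix. The six rows and labels are made up for the example, and the default solver is used rather than the question's dual=True, which in recent releases is tied to the liblinear solver.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

# Small made-up sparse design matrix with binary labels.
X = sp.csr_matrix([
    [0, 1, 2],
    [1, 2, 3],
    [3, 2, 1],
    [4, 0, 0],
    [0, 3, 4],
    [5, 1, 0],
])
y = np.array([0, 0, 1, 1, 0, 1])

# The sparse matrix is passed as-is; no conversion to a dense array is needed.
clf = LogisticRegression()
clf.fit(X, y)
pred = clf.predict(X)
print(pred)
```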
