Getting unexpected IndexError when creating a dataframe - python-3.x

I am trying to execute the code below:
heart_df = pd.read_csv(r"location")
X = heart_df.iloc[:, :-1].values
y = heart_df.iloc[:, 11].values
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]].values() #this is line 17
cat_cols = new_df.copy()
and I am getting an IndexError:
File "***location***", line 17, in <module>
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]].values()
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
As far as I know, this IndexError appears when floats are used as indices, but I don't understand why it occurs in this case.
Here, by creating new_df and then cat_cols, I want to separate the categorical columns to apply OneHotEncoding at a later stage.
The dataset is here: https://www.kaggle.com/fedesoriano/heart-failure-prediction.

The error is coming from:
X = heart_df.iloc[:, :-1].values
The .values attribute converts the DataFrame to a NumPy array, and a NumPy array cannot be indexed with a list of column names (strings), which is what raises the IndexError on line 17. (Also, .values is an attribute, not a method, so the trailing parentheses on that line are wrong as well.)
So you can keep X as a DataFrame and write the same selection as:
X = heart_df.iloc[:, :-1]
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]]

Related

TypeError: only integer scalar arrays can be converted to a scalar index (object detection)

I am struggling with this one part and am not sure how to fix it. It would be great if someone could tell me what I need to fix in the code. Below are the code and the error message I'm receiving.
This is the code:
categoriesList=["airplane","automobile","bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
import matplotlib.pyplot as plt
import random
def plotImages(x_test, images_arr, labels_arr, n_images=8):
fig, axes = plt.subplots(n_images, n_images, figsize=(9,9))
axes = axes.flatten()
for i in range(100):
rand = random.randint(0, x_test.shape[0] -1)
img = images_arr[rand]
ax = axes[i]
ax.imshow( img, cmap="Greys_r")
ax.set_xticks(())
ax.set_yticks(())
sample = x_test[rand].reshape((1,32,32,3))
predict_x = model2000.predict(sample)
label=categoriesList[predict_x[0]]
if labels_arr[rand][predictions[0]] == 0:
ax.set_title(label, fontsize=18 - n_images, color="red")
else:
ax.set_title(label, fontsize=18 - n_images)
plot = plt.tight_layout()
return plot
display (plotImages(x_test, data_test_picture, y_test, n_images=10))
This is the error message:
TypeError: only integer scalar arrays can be converted to a scalar index
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-2104322840429397> in <module>
28 return plot
29
---> 30 display (plotImages(x_test, data_test_picture, y_test, n_images=10))
<command-2104322840429397> in plotImages(x_test, images_arr, labels_arr, n_images)
18 sample = x_test[rand].reshape((1,32,32,3))
19 predict_x = model2000.predict(sample)
---> 20 label=categoriesList[predict_x[0]]
21
22 if labels_arr[rand][predictions[0]] == 0:
TypeError: only integer scalar arrays can be converted to a scalar index
To fix the "only integer scalar arrays can be converted to a scalar index" error:
Concatenate arrays by passing a list
If you have two arrays, pass them to numpy.concatenate() as a list, e.g. numpy.concatenate([ar1, ar2]).
import numpy
# Create 2 different arrays
ar1 = numpy.array(['Apple', 'Orange', 'Banana', 'Pineapple', 'Grapes'])
ar2 = numpy.array(['Onion', 'Potato'])
# Concatenate array ar1 & ar2 using numpy.concatenate()
ar3 = numpy.concatenate([ar1, ar2])
print(ar3)
# Output
['Apple' 'Orange' 'Banana' 'Pineapple' 'Grapes' 'Onion' 'Potato']
Concatenate arrays by passing a tuple
Alternatively, pass the two arrays to numpy.concatenate() as a tuple, e.g. numpy.concatenate((ar1, ar2)).
import numpy
# Create 2 different arrays
ar1 = numpy.array(['Apple', 'Orange', 'Banana', 'Pineapple', 'Grapes'])
ar2 = numpy.array(['Onion', 'Potato'])
# Concatenate array ar1 & ar2 using numpy.concatenate()
ar3 = numpy.concatenate((ar1, ar2))
print(ar3)
# Output
['Apple' 'Orange' 'Banana' 'Pineapple' 'Grapes' 'Onion' 'Potato']
If you use a plain Python list and index it with an array, you get the same error. To avoid this, convert the plain list into a NumPy array and then perform the required indexing:
categoriesList=numpy.array(["airplane","automobile","bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"])
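As a quick illustration of the difference, indexing a plain Python list with an integer array fails, while the same indexing works on a NumPy array (a minimal sketch; the index array below is just a stand-in for the model output in the question):
import numpy as np

categories = ["airplane", "automobile", "bird"]
idx = np.array([0, 2])          # more than one index at once, like a prediction vector

# categories[idx]               # TypeError: only integer scalar arrays can be converted to a scalar index
categories_arr = np.array(categories)
print(categories_arr[idx])      # ['airplane' 'bird']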

getting error while forming train matrix in book recommendation system

I am new to data science and am facing issues while creating a book recommendation system with collaborative filtering. Can someone please advise on the error below?
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
data = pd.read_csv('BX-Book-Ratings.csv',engine = 'python')
df = data.iloc[1:10000,:]
print(df)
print(df.dtypes)
df['isbn']= pd.to_numeric(df['isbn'], errors = 'coerce')
df = df[np.isfinite(df).all(1)]
df['isbn'] = df['isbn'].astype(np.int64)
from sklearn.model_selection import train_test_split
n_users = df.user_id.unique().shape[0]
n_book = df.isbn.unique().shape[0]
train_data, test_data = train_test_split(df, test_size=0.5)
print(n_users , n_book)
train_data_matrix = np.zeros((n_users, n_book))
for line in train_data.itertuples():
    # [user_id index, book_id index] = given rating
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
train_data_matrix
--------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-125-caa0bcd40167> in <module>
2 for line in train_data.itertuples():
3 #[user_id index, book_id index] = given rating.
----> 4 train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
5 train_data_matrix
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
The most probable cause of the error is that the values being used as matrix indices don't match what the matrix expects: isbn has been converted to an int, but what about user_id?
Fix:
The fix is to create a unique, contiguous index for these n_users * n_book entries.
Method 1: build a separate DataFrame of unique users and unique books and use its index (a sketch of this is given after the code below).
Method 2: create a dict that maps each unique value to an index.
Whichever method you use must be applied consistently across the rest of the process, otherwise the book-item ratings will end up mismatched.
This fix uses Method 2.
# Method 2
user_dict = {}
for item, value in enumerate(df.user_id.unique().tolist()):
    user_dict[value] = item

book_dict = {}
for item, value in enumerate(df.isbn.unique().tolist()):
    book_dict[value] = item

print(len(user_dict.keys()), len(book_dict.keys()))

for line in train_data.itertuples():
    row_index = user_dict[line[1]]
    col_index = book_dict[line[2]]
    train_data_matrix[row_index, col_index] = line[3]
Hope this helps. A snapshot of the data would make it easier to pin down the exact fix.
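For completeness, a minimal sketch of Method 1, using pd.factorize to build the unique user/book indices instead of explicit dicts (this assumes, as in the question, that the third column of the ratings frame is the rating accessed via line[3]):
# Method 1 (sketch)
df['user_index'] = pd.factorize(df['user_id'])[0]
df['book_index'] = pd.factorize(df['isbn'])[0]

train_data, test_data = train_test_split(df, test_size=0.5)
train_data_matrix = np.zeros((df['user_index'].nunique(), df['book_index'].nunique()))
for line in train_data.itertuples():
    # user_index and book_index were appended as the last columns, so line[3] is still the rating
    train_data_matrix[line.user_index, line.book_index] = line[3]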

how to select specific columns in a table by using np.r_ in dataset.loc and deal with string data

I would like to classify a dataset whose rows look something like the sample shown in the original post (feature columns x1 through x26 plus a target column y).
In order to divide it into train and test data:
x_train, x_test, y_train, y_test = train_test_split(X, y,test_size = 0.25, random_state = 0)
Method 1:
X = dataset.loc[np.r_[0:5, 7:26]].values
y = dataset.loc[np.r_[6]].values
Method 2:
X = dataset.loc[:, ['x1', 'x2','x3','x4','x5','x6','x7','x8','x9','x10','x11','x12','x13','x14','x15','x16','x17','x18','x19','x20','x21','x22','x23','x24','x25','x26']].values
y = dataset.loc[:, ['y']].values
The first method encounters this problem:
ValueError: Found input variables with inconsistent numbers of samples: [24, 1]
while the second one is OK. I would rather not write out all of the columns, but I do not know how to fix the first method.
Also, since some of the data are strings, I encounter this error:
ValueError: could not convert string to float: 'id8053'
I tried to solve it with:
X = X.apply(lambda x: pd.factorize(x)[1])
y = y.apply(lambda x: pd.factorize(x)[0])
but I encounter this error:
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
What is wrong?
np.r_ should work fine in your case. Method 1 missed the row selector: you are slicing columns by their integer positions, so you need to use .iloc, with np.r_ for the columns and : for the rows.
Try this (note that the right end of each slice inside np.r_ has been increased by 1, because .iloc excludes the right endpoint while .loc includes it):
Method 1:
X = dataset.iloc[:, np.r_[0:6, 7:27]].values
y = dataset.iloc[:, np.r_[6]].values
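For the string-to-float problem, factorize has to run while X and y are still a DataFrame/Series (before .values), because a plain NumPy array has no .apply method. A minimal sketch under that assumption (column positions taken from the corrected Method 1 above):
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X = dataset.iloc[:, np.r_[0:6, 7:27]]   # keep as a DataFrame for now
y = dataset.iloc[:, 6]

# Encode each column's strings as integer codes
# (you may want to restrict this to the string columns only)
X = X.apply(lambda col: pd.factorize(col)[0])
y = pd.factorize(y)[0]

x_train, x_test, y_train, y_test = train_test_split(
    X.values, y, test_size=0.25, random_state=0)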

Apply MinMaxScaler() on a pandas column

I am trying to use the sklearn MinMaxScaler to rescale a pandas column as below:
scaler = MinMaxScaler()
y = scaler.fit(df['total_amount'])
But got the following errors:
Traceback (most recent call last):
File "/Users/edamame/workspace/git/my-analysis/experiments/my_seq.py", line 54, in <module>
y = scaler.fit(df['total_amount'])
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 308, in fit
return self.partial_fit(X, y)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/utils/validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[3.180000e+00 2.937450e+03 6.023850e+03 2.216292e+04 1.074589e+04
:
0.000000e+00 0.000000e+00 9.000000e+01 1.260000e+03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Any idea what was wrong?
The input to MinMaxScaler needs to be array-like, with shape [n_samples, n_features]. So you can apply it on the column as a dataframe rather than a series (using double square brackets instead of single):
y = scaler.fit(df[['total_amount']])
Though from your description, it sounds like you want fit_transform rather than just fit (but I could be wrong):
y = scaler.fit_transform(df[['total_amount']])
A little more explanation:
If your dataframe had 100 rows, consider the difference in shape when you transform a column to an array:
>>> np.array(df[['total_amount']]).shape
(100, 1)
>>> np.array(df['total_amount']).shape
(100,)
The first returns a shape that matches [n_samples, n_features] (as required by MinMaxScaler), whereas the second does not.
Try doing it this way, scaling the whole dataframe at once:
import pandas as pd
from sklearn import preprocessing
x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
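Note that the final line above drops the original column names and index. If you want to keep them, a small variation (a sketch; it replaces only that final line) is:
df = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)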

slicing error in numpy array

I am trying to run the following code
fs = 1000
data = np.loadtxt("trainingdataset.txt", delimiter=",")
data1 = data[:,2]
data2 = data1.astype(int)
X,Y = data2['521']
but it gives me the following error:
Traceback (most recent call last):
File "C:\Users\hadeer.elziaat\Desktop\testspec.py", line 58, in <module>
X,Y = data2['521']
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
My dataset:
1,4,6,10
2,100,125,10
3,100,7216,254
4,100,527,263
5,100,954,13
6,100,954,23
You're using the string '521' rather than the number 521 for indexing. Try X,Y = data2[521] instead.
If you are only given the string, you could cast it to an int first: X,Y = data2[int('521')], but this might result in some errors and/or unexpected behaviour.
Next problem: you are expecting two variables, one for X and one for Y, yet the data2[521] selection only provides a single value (the number in the 3rd column, 522nd row).
You say you want all the data in the 3rd column.
I assume you also want some kind of x-axis, since you are attempting to do X, Y = .... How about using the first column for that? Then your code would be:
import numpy as np
data = np.loadtxt("trainingdataset.txt", delimiter=',', dtype='int')
x = data[:, 0]
y = data[:, 2]
What remains unclear from your question is why you tried to index your data with '521'; that failed because you cannot use strings as indices on plain NumPy arrays.
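If the point of X and Y is to plot the signal, a minimal follow-up sketch (the fs variable from the question is not needed for a simple index-based plot) could be:
import matplotlib.pyplot as plt

plt.plot(x, y)
plt.xlabel("sample index (column 1)")
plt.ylabel("value (column 3)")
plt.show()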
