Getting an error while forming the train matrix in a book recommendation system

I am new to data science and am facing issues while creating a book recommendation system with collaborative filtering. Can someone please advise on the error below?
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
data = pd.read_csv('BX-Book-Ratings.csv',engine = 'python')
df = data.iloc[1:10000,:]
print(df)
print(df.dtypes)
df['isbn']= pd.to_numeric(df['isbn'], errors = 'coerce')
df = df[np.isfinite(df).all(1)]
df['isbn'] = df['isbn'].astype(np.int64)
from sklearn.model_selection import train_test_split
n_users = df.user_id.unique().shape[0]
n_book = df.isbn.unique().shape[0]
train_data, test_data = train_test_split(df, test_size=0.5)
print(n_users , n_book)
train_data_matrix = np.zeros((n_users, n_book))
for line in train_data.itertuples():
    # [user_id index, book_id index] = given rating.
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
train_data_matrix
--------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-125-caa0bcd40167> in <module>
2 for line in train_data.itertuples():
3 #[user_id index, book_id index] = given rating.
----> 4 train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
5 train_data_matrix
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

The most probable cause of the error is a mismatch in the index values: user_id and isbn are raw identifiers, not positions in the matrix, and NumPy accepts only integers as indices.
I can see isbn is an int type, but what about user_id?
Fix:
Create a unique, zero-based index for each of the n_users users and n_book books.
Method 1: build a separate DataFrame of unique users and another of unique books, and use their indexes.
Method 2: build a dict that maps each unique value to a position.
Whichever method you use must be applied consistently through the rest of the process, or the user-book ratings will be mismatched.
This fix uses method 2.
# Method 2
user_dict = {}
for index, value in enumerate(df.user_id.unique().tolist()):
    user_dict[value] = index
book_dict = {}
for index, value in enumerate(df.isbn.unique().tolist()):
    book_dict[value] = index
print(len(user_dict), len(book_dict))
train_data_matrix = np.zeros((len(user_dict), len(book_dict)))
for line in train_data.itertuples():
    # line[0] is the DataFrame index; line[1], line[2], line[3] are user_id, isbn, rating
    row_index = user_dict[line[1]]
    col_index = book_dict[line[2]]
    train_data_matrix[row_index, col_index] = line[3]
Hope this helps. A snapshot of the data would make it easier to confirm the fix.
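As an alternative to the hand-built dicts, the same zero-based mapping can be sketched with pd.factorize, which returns a position code for every row plus the array of unique values. The sample frame below is hypothetical and only stands in for BX-Book-Ratings.csv; the column names user_id, isbn and rating are assumed from the question. Keeping isbn as a string also avoids losing ISBNs that end in "X".

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for BX-Book-Ratings.csv
ratings = pd.DataFrame({
    "user_id": [276725, 276726, 276727, 276726],
    "isbn": ["034545104X", "0155061224", "0446520802", "034545104X"],
    "rating": [0, 5, 3, 8],
})

# factorize returns (codes, uniques); codes are positions 0..n-1
user_idx, users = pd.factorize(ratings["user_id"])
book_idx, books = pd.factorize(ratings["isbn"])

# One row per unique user, one column per unique book
matrix = np.zeros((len(users), len(books)))
matrix[user_idx, book_idx] = ratings["rating"].to_numpy()
print(matrix)
```

The users and books arrays give the reverse lookup: users[i] is the user_id stored in row i, and books[j] is the ISBN stored in column j.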

Related

TypeError: only integer scalar arrays can be converted to a scalar index (object detection)

I am struggling with this one part. Not sure how to fix it! Would be great if someone could tell me what I need to fix in the code. Down below is the code & error message that I'm receiving.
This is the code:
categoriesList=["airplane","automobile","bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"]
import matplotlib.pyplot as plt
import random
def plotImages(x_test, images_arr, labels_arr, n_images=8):
    fig, axes = plt.subplots(n_images, n_images, figsize=(9,9))
    axes = axes.flatten()
    for i in range(100):
        rand = random.randint(0, x_test.shape[0] -1)
        img = images_arr[rand]
        ax = axes[i]
        ax.imshow( img, cmap="Greys_r")
        ax.set_xticks(())
        ax.set_yticks(())
        sample = x_test[rand].reshape((1,32,32,3))
        predict_x = model2000.predict(sample)
        label=categoriesList[predict_x[0]]
        if labels_arr[rand][predictions[0]] == 0:
            ax.set_title(label, fontsize=18 - n_images, color="red")
        else:
            ax.set_title(label, fontsize=18 - n_images)
    plot = plt.tight_layout()
    return plot
display (plotImages(x_test, data_test_picture, y_test, n_images=10))
This is the error message:
TypeError: only integer scalar arrays can be converted to a scalar index
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-2104322840429397> in <module>
28 return plot
29
---> 30 display (plotImages(x_test, data_test_picture, y_test, n_images=10))
<command-2104322840429397> in plotImages(x_test, images_arr, labels_arr, n_images)
18 sample = x_test[rand].reshape((1,32,32,3))
19 predict_x = model2000.predict(sample)
---> 20 label=categoriesList[predict_x[0]]
21
22 if labels_arr[rand][predictions[0]] == 0:
TypeError: only integer scalar arrays can be converted to a scalar index
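To fix the "only integer scalar arrays can be converted to a scalar index" error:
Concatenate arrays by list
This error also appears when numpy.concatenate() receives the arrays as separate arguments, as in numpy.concatenate(ar1, ar2). Here we have two arrays; pass them to numpy.concatenate() as a single list, like numpy.concatenate([ar1, ar2]).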
import numpy
# Create 2 different arrays
ar1 = numpy.array(['Apple', 'Orange', 'Banana', 'Pineapple', 'Grapes'])
ar2 = numpy.array(['Onion', 'Potato'])
# Concatenate array ar1 & ar2 using numpy.concatenate()
ar3 = numpy.concatenate([ar1, ar2])
print(ar3)
# Output
['Apple' 'Orange' 'Banana' 'Pineapple' 'Grapes' 'Onion' 'Potato']
Concatenate arrays by tuple
Pass array 1 and array 2 to numpy.concatenate() as a single tuple, like numpy.concatenate((ar1, ar2)).
import numpy
# Create 2 different arrays
ar1 = numpy.array(['Apple', 'Orange', 'Banana', 'Pineapple', 'Grapes'])
ar2 = numpy.array(['Onion', 'Potato'])
# Concatenate array ar1 & ar2 using numpy.concatenate()
ar3 = numpy.concatenate((ar1, ar2))
print(ar3)
# Output
['Apple' 'Orange' 'Banana' 'Pineapple' 'Grapes' 'Onion' 'Potato']
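If you index a plain Python list with a NumPy array, you get the same error. To overcome this, you can convert the list into a NumPy array and then perform the required operation. The index itself must still contain integers, so an array of class scores has to be reduced to a class index (for example with numpy.argmax) first.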
categoriesList=numpy.array(["airplane","automobile","bird", "cat", "deer", "dog", "frog", "horse", "ship", "truck"])
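For the plotting code in the question, a likely cause is that predict_x[0] is an array of per-class scores rather than a single integer, so it cannot index a list. The sketch below assumes the model returns one row of 10 scores per sample; the predict_x array is a made-up stand-in for model2000.predict(sample), not real model output.

```python
import numpy as np

categories = ["airplane", "automobile", "bird", "cat", "deer",
              "dog", "frog", "horse", "ship", "truck"]

# Stand-in for model2000.predict(sample): one row of 10 class scores (assumed shape)
predict_x = np.array([[0.01, 0.02, 0.05, 0.70, 0.02,
                       0.10, 0.03, 0.03, 0.02, 0.02]])

# Reduce the score vector to the position of the highest score
class_index = int(np.argmax(predict_x[0]))
label = categories[class_index]
print(label)
```

With an integer class_index, indexing works on a plain list as well as on a NumPy array.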

Getting unexpected IndexError when creating a dataframe

I am trying to execute the below code:
heart_df = pd.read_csv(r"location")
X = heart_df.iloc[:, :-1].values
y = heart_df.iloc[:, 11].values
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]].values() #this is line 17
cat_cols = new_df.copy()
and getting IndexError like:
File "***location***", line 17, in <module>
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]].values()
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
As far as I know this IndexError comes when we use float numbers as indices but don't understand why it is coming in this case.
Here, by creating new_df and then cat_cols, I want to separate the categorical columns to apply OneHotEncoding at a later stage.
The dataset is here: https://www.kaggle.com/fedesoriano/heart-failure-prediction.
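The error comes from:
X = heart_df.iloc[:, :-1].values
The .values attribute converts the DataFrame to a NumPy array, and a NumPy array cannot be indexed by column names. Note also that .values is an attribute, not a method, so the trailing .values() call on line 17 would fail as well.
So we can write the same as: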
X = heart_df.iloc[:, :-1]
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]]
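A small sketch of the difference, using a hypothetical two-row frame with a few of the column names from the question: selecting by name works while X is still a DataFrame, and fails with the IndexError from the question once it has been converted to a NumPy array.

```python
import pandas as pd

# Hypothetical miniature of the heart dataset
heart_df = pd.DataFrame({
    "Age": [40, 49],
    "Sex": ["M", "F"],
    "ChestPainType": ["ATA", "NAP"],
    "HeartDisease": [0, 1],
})

X = heart_df.iloc[:, :-1]                      # still a DataFrame
cat_cols = X[["Sex", "ChestPainType"]].copy()  # label-based selection works

X_array = X.to_numpy()                         # convert only when an array is needed
try:
    X_array[["Sex", "ChestPainType"]]          # arrays have no column labels
    raised = False
except IndexError:
    raised = True

print(cat_cols.columns.tolist(), raised)
```

Keeping the DataFrame until the categorical columns have been separated means the conversion to an array can happen after encoding.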

Create a new variable whose values are the square of the difference between variable1 and variable2

I am trying to create a new variable whose values are the square of the difference between variable1 and variable2. Both are integer data types and I have cleaned the data for missing values, but the code below does not work.
import numpy as np
df['new'] = 0
for i in range(len(df)):
    df['new'].loc[i] = df['imdbVotes'].loc[0] - df['imdbRating'].loc[0]
TypeError Traceback (most recent call last)
4
5 for i in range(len(df)):
----> 6 df['new'].loc[i] = df['imdbVotes'].loc[0] - df['imdbRating'].loc[0]
TypeError: unsupported operand type(s) for -: 'str' and 'str'
I could not check the solution with real data.
# Convert the columns to a numeric type (the error shows they are currently strings)
df['imdbVotes'] = df['imdbVotes'].astype(float) # or int
df['imdbRating'] = df['imdbRating'].astype(float) # or int
# Do not iterate; use vectorized operations and square the difference
df['new'] = (df['imdbVotes'] - df['imdbRating']) ** 2
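If some cells hold text that cannot be parsed as a number, astype(float) raises an error. A hedged variant is pd.to_numeric with errors="coerce", which turns such cells into NaN instead. The frame below is invented for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, plus one unparseable cell
df = pd.DataFrame({
    "imdbVotes": ["10", "85", "N/A"],
    "imdbRating": ["7.5", "6.0", "8.1"],
})

votes = pd.to_numeric(df["imdbVotes"], errors="coerce")
rating = pd.to_numeric(df["imdbRating"], errors="coerce")

# Vectorized square of the difference; rows with NaN stay NaN
df["new"] = (votes - rating) ** 2
print(df)
```

Rows that end up as NaN can then be inspected or dropped explicitly rather than stopping the whole computation.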

How do I fix KeyError bug in my code while implementing K-Nearest Neighbours from scratch?

I am trying to implement K-Nearest Neighbours algorithm from scratch in Python. The code I wrote worked well for the Breast-Cancer-Wisconsin.csv dataset.
However, the same code when I try to run for Iris.csv dataset, my implementation fails and gives KeyError.
The only difference between the 2 datasets is that Breast-Cancer-Wisconsin.csv has only 2 classes ('2' for benign and '4' for malignant) and both labels are integers, whereas Iris.csv has 3 classes ('setosa', 'versicolor', 'virginica') and all 3 labels are strings.
Here is the code I wrote (for Iris.csv) :
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
dataset = {'k':[[1,2],[2,3],[3,1]], 'r':[[6,5],[7,7],[8,6]]}
new_features = [5,7]
#[[plt.scatter(j[0],j[1], s=100, color=i) for j in dataset[i]] for i in dataset]
#plt.scatter(new_features[0], new_features[1], s=100)
#plt.show()
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
#full_data = df.astype(float).values.tolist()
#random.shuffle(full_data)
test_size = 0.2
train_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
test_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy : ', correct/total)
When I run the above code, I get a KeyError message at line number 49.
Could anyone please explain to me where I am going wrong? Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
Also, how do I handle if the classes are in string type instead of integer?
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
REFERENCES
Iris.csv
Breast-Cancer-Wisconsin.csv
Let's start from your last question:
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
Yes, that would work. You shouldn't have to hardcode the names of all the classes of every problem in your code. Instead, you can just write a function that reads all the different values for the class attribute, and assigns a numeric value to each different one.
Could anyone please explain to me where I am going wrong?
Most likely, the problem is that you are reading an instance whose class attribute is not 'setosa', 'versicolor', 'virginica' (something like Iris-setosa perhaps?). The idea above should fix this problem.
Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
As discussed before, you just need to avoid hard-coding the names of the classes in your code.
Also, how do I handle if the classes are in string type instead of integer?
def get_class_values(data):
    classes_seen = {}
    for i in data:
        _class = i[-1]  # the class label is the last element of each row
        if _class not in classes_seen:
            classes_seen[_class] = len(classes_seen)
    return classes_seen
A function like this one would return a mapping between all your classes (no matter the type) and numeric codes (from 0 to N-1). Using this mapping would also solve all the problems mentioned before.
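A sketch of how such a mapping could replace the hard-coded train_set keys, so the same code handles any number of classes. The rows are hypothetical, with "Iris-" prefixed labels chosen to show that no label name needs to be known in advance; using defaultdict for the groups is an assumption of this sketch, not part of the original code.

```python
from collections import defaultdict

def get_class_values(rows):
    """Map each distinct class label (last element of a row) to a code 0..N-1."""
    classes_seen = {}
    for row in rows:
        label = row[-1]
        if label not in classes_seen:
            classes_seen[label] = len(classes_seen)
    return classes_seen

# Hypothetical rows: four features followed by a string label
rows = [
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
    [6.3, 3.3, 6.0, 2.5, "Iris-virginica"],
    [4.9, 3.0, 1.4, 0.2, "Iris-setosa"],
]

class_codes = get_class_values(rows)

# Groups are created on first use, so no class name is hard-coded
train_set = defaultdict(list)
for row in rows:
    train_set[class_codes[row[-1]]].append(row[:-1])

print(class_codes)
print(dict(train_set))
```

A dict built this way can be passed to k_nearest_neighbors unchanged, since that function only iterates over the keys and their feature lists.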
Convert String Labels In CSV Files To Integer Labels
After going through some GitHub repos I came across a very simple yet elegant piece of code that solves the above problem. Hope it helps those who have faced this problem before (beginners especially!)
# read the csv file
df = pd.read_csv('iris.csv')
# clean the data file
df.replace('?', -99999, inplace=True)
# convert the string classes into integer types.
# integers are assigned from 0 to N-1.
# species is the name of the column which has class labels.
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
# pass axis by keyword; recent pandas versions reject it as a positional argument
df.drop(['species'], axis=1, inplace=True)
# convert the data frame to list
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
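Once the species column is dropped, the link between codes and names is gone. A minimal sketch, on a hypothetical three-row frame, of keeping a reverse lookup so that predicted codes can be reported as species names again:

```python
import pandas as pd

# Hypothetical miniature of iris.csv
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "species": ["setosa", "versicolor", "virginica"],
})

species = df["species"].astype("category")

# categories are ordered, and each code is a position in that order
code_to_label = dict(enumerate(species.cat.categories))

df["species_value"] = species.cat.codes
df = df.drop(columns=["species"])

print(code_to_label)
print(df)
```

After voting, code_to_label[int(vote)] would give back the species name for a predicted code.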
Post Debugging
Turns out that we need not use the above piece of code also, i.e I can get the answer without explicitly converting the string labels into integer labels (using the above code).
I have posted the original code after some minor changes (below) and the key error is now fixed. Also, I am now getting an accuracy of 97% to 100% (only on IRIS dataset).
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
That is the only change you need to make to the original code I posted in order to make it work!! Simple!
However, please note that the numbers have to be given as integers and not string (otherwise it would lead to key error!).
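The reason integer keys work can be sketched as follows, assuming the labels were converted to numeric codes and cast to float as in the snippets in this answer: a float such as 0.0 compares equal to the integer 0 and hashes the same, so it finds the integer key, whereas the string "0" is a different key. The row below is hypothetical.

```python
# Hypothetical row: four features followed by a label code cast to float
row = [5.1, 3.5, 1.4, 0.2, 0.0]

int_keys = {0: [], 1: [], 2: []}
int_keys[row[-1]].append(row[:-1])   # works: 0.0 == 0 and both hash alike

str_keys = {"0": [], "1": [], "2": []}
try:
    str_keys[row[-1]].append(row[:-1])
    key_error = False
except KeyError:
    key_error = True                 # 0.0 does not match the string "0"

print(int_keys[0], key_error)
```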
Wrap-Up
There are some commented lines in the original code which I thought would be good to explain in case somebody ran into some issues. Here's one snippet with the comments removed (compare with original code in the question).
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Here's the output you get:
ValueError: could not convert string to float: 'virginica'
What went wrong?
Note that here we did not convert the string labels into integer labels. Therefore, when we tried to convert the data in the CSV to float values, an error was raised because a string such as 'virginica' cannot be converted to float.
One way around this is not to convert the data to floating point at all. However, in many cases you do need all the data as floating point (e.g. for normalisation, accuracy, long mathematical calculations, or to prevent loss of precision).
Hence, after heavy debugging and going through a lot of articles, I finally came up with a simple version of the original code (below):
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy : ', (correct/total)*100,'%')
Hope this helps!

slicing error in numpy array

I am trying to run the following code
fs = 1000
data = np.loadtxt("trainingdataset.txt", delimiter=",")
data1 = data[:,2]
data2 = data1.astype(int)
X,Y = data2['521']
but it gets me the following error
Traceback (most recent call last):
File "C:\Users\hadeer.elziaat\Desktop\testspec.py", line 58, in <module>
X,Y = data2['521']
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
my dataset
1,4,6,10
2,100,125,10
3,100,7216,254
4,100,527,263
5,100,954,13
6,100,954,23
You're using the string '521' rather than the number 521 for indexing. Try X,Y = data2[521] instead.
If you are only given the string, you could cast it to an int first: X,Y = data2[int('521')], but this might result in some errors and/or unexpected behaviour.
Next problem: you are asking for two variables, one for X and one for Y, yet the data2[521] selection only provides a single value (the number in the 3rd column, 522nd row).
You say you want all the data in the 3rd column.
I assume you also want some kind of x-axis, since you are attempting to do X, Y = .... How about using the first column for that? Then your code would be:
import numpy as np
data = np.loadtxt("trainingdataset.txt", delimiter=',', dtype='int')
x = data[:, 0]
y = data[:, 2]
What remains unclear from your question is why you tried to index your data with 521 - which failed because you cannot use strings as indices on plain arrays.
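A self-contained sketch of the same idea, using the first three rows of the dataset from the question fed through StringIO instead of trainingdataset.txt. The row variable is a hypothetical example of an index that arrives as a string and is cast to int before use.

```python
from io import StringIO

import numpy as np

# First rows of the dataset shown in the question, in place of the file
raw = StringIO("1,4,6,10\n2,100,125,10\n3,100,7216,254\n")
data = np.loadtxt(raw, delimiter=",", dtype=int)

x = data[:, 0]   # first column as the x-axis
y = data[:, 2]   # third column as the values

row = int("1")   # a string index must be cast to int before indexing
value = y[row]
print(x, y, value)
```

Indexing with the uncast string, as in y["1"], raises the IndexError from the question.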
