slicing error in numpy array - python-3.x

I am trying to run the following code
fs = 1000
data = np.loadtxt("trainingdataset.txt", delimiter=",")
data1 = data[:,2]
data2 = data1.astype(int)
X,Y = data2['521']
but it gets me the following error
Traceback (most recent call last):
File "C:\Users\hadeer.elziaat\Desktop\testspec.py", line 58, in <module>
X,Y = data2['521']
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
my dataset
1,4,6,10
2,100,125,10
3,100,7216,254
4,100,527,263
5,100,954,13
6,100,954,23

You're using the string '521' rather than the number 521 for indexing. Try X,Y = data2[521] instead.
If you are only given the string, you could cast it to an int first: X,Y = data2[int('521')], though that can still fail or behave unexpectedly if the string is not a clean integer.
Next problem: you are asking for two variables, one for X and one for Y, yet the data2[521] selection only gives you a single value (the number in the 3rd column, 522nd row), so the unpacking will fail as well.
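A quick illustration of both points, using a stand-in array since I don't have your full file:
import numpy as np
data2 = np.arange(1000)   # stand-in for the question's third column
val = data2[521]          # an integer index works and returns a single scalar
# X, Y = data2[521]       # TypeError: cannot unpack a scalar into two names
# X, Y = data2['521']     # IndexError: strings are not valid indices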

You say you want all the data in the 3rd column.
I assume you also want some kind of x-axis, since you are attempting to do X, Y = .... How about using the first column for that? Then your code would be:
import numpy as np
data = np.loadtxt("trainingdataset.txt", delimiter=',', dtype='int')
x = data[:, 0]
y = data[:, 2]
What remains unclear from your question is why you tried to index your data with the string '521' in the first place; it failed because strings are not valid indices for plain NumPy arrays.
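If a plot is indeed the goal (my assumption), a minimal sketch building on the code above:
import numpy as np
import matplotlib.pyplot as plt
data = np.loadtxt("trainingdataset.txt", delimiter=',', dtype='int')
plt.plot(data[:, 0], data[:, 2])   # first column as x, third column as y
plt.xlabel("column 1")
plt.ylabel("column 3")
plt.show()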

Related

Getting unexpected IndexError when creating a dataframe

I am trying to execute the below code:
heart_df = pd.read_csv(r"location")
X = heart_df.iloc[:, :-1].values
y = heart_df.iloc[:, 11].values
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]].values() #this is line 17
cat_cols = new_df.copy()
and I get the following IndexError:
File "***location***", line 17, in <module>
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]].values()
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
As far as I know this IndexError comes up when we use float numbers as indices, but I don't understand why it occurs in this case.
Here, by creating new_df and then cat_cols, I want to separate the categorical columns to apply OneHotEncoding at a later stage.
The dataset is here: https://www.kaggle.com/fedesoriano/heart-failure-prediction.
The error comes from:
X = heart_df.iloc[:, :-1].values
The .values part converts the DataFrame to a NumPy array, and a plain NumPy array cannot be indexed by a list of column names, which is exactly what line 17 attempts. (The trailing .values() would also fail: values is an attribute, not a method.)
So drop .values and keep working with the DataFrame:
X = heart_df.iloc[:, :-1]
new_df = X[["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]]
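From there, the one-hot encoding step could look like this; a sketch using pd.get_dummies (sklearn's OneHotEncoder would work on the same frame, and the file name is an assumption):
import pandas as pd
heart_df = pd.read_csv("heart.csv")   # file name assumed
X = heart_df.iloc[:, :-1]             # keep it as a DataFrame
cat_cols = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]
X_encoded = pd.get_dummies(X, columns=cat_cols)   # one column per category level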

Error trying to use numpy frombuffer function on a large bytes object

I am trying to pass a very, very long bytes object to numpy frombuffer, and it gives me the following error:
ValueError: buffer size must be a multiple of element size
Is there a flag I am missing? How can I specify a larger buffer size?
Edit: The format is like:
x = b'\xdc\x08....\x01'
y = np.frombuffer(x)
You need to tell it what type of data it is, and if it's an array, what the array shape is. By default, frombuffer assumes float64, i.e. 8 bytes per element, so any buffer whose length is not a multiple of 8 raises exactly the ValueError you are seeing. For example
import numpy as np
a = [[1, 2, 3], [2, 4, 6]]
npa = np.array(a)
x = npa.tobytes()
y = np.frombuffer(x, dtype=npa.dtype.name).reshape(npa.shape)
# check that y round-tripped to the same values as npa
print(np.array_equal(y, npa))   # True
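If you only have the raw bytes and do not know the original dtype, reading them as single bytes always passes the size check, since any length is a multiple of one; a sketch with a short stand-in buffer:
import numpy as np
x = b'\xdc\x08\x01\x00\x00\x01'        # stand-in for the long bytes object
y = np.frombuffer(x, dtype=np.uint8)   # 1-byte elements, no multiple-of-8 issue
print(y)                               # [220   8   1   0   0   1]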

getting error while forming train matrix in book recommendation system

I am new to data science and am facing issues while creating a book recommendation system with collaborative filtering. Can someone please advise on the error below?
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
data = pd.read_csv('BX-Book-Ratings.csv',engine = 'python')
df = data.iloc[1:10000,:]
print(df)
print(df.dtypes)
df['isbn']= pd.to_numeric(df['isbn'], errors = 'coerce')
df = df[np.isfinite(df).all(1)]
df['isbn'] = df['isbn'].astype(np.int64)
from sklearn.model_selection import train_test_split
n_users = df.user_id.unique().shape[0]
n_book = df.isbn.unique().shape[0]
train_data, test_data = train_test_split(df, test_size=0.5)
print(n_users , n_book)
train_data_matrix = np.zeros((n_users, n_book))
for line in train_data.itertuples():
    # [user_id index, book_id index] = given rating
    train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
train_data_matrix
--------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-125-caa0bcd40167> in <module>
2 for line in train_data.itertuples():
3 #[user_id index, book_id index] = given rating.
----> 4 train_data_matrix[line[1] - 1, line[2] - 1] = line[3]
5 train_data_matrix
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
The most probable cause of the error is a mismatch in the index values: the traceback message means the values being used as indices are not plain integers. I can see isbn is int type, but what about user_id? A float user_id would trigger exactly this message.
Fix:
The fix is to create a unique zero-based integer index for the n_users and n_book values.
Method 1: build a separate unique dataframe for users and for books and use its index.
Method 2: build a dict with the unique values as keys and an integer index as values.
Whichever method is used must be applied consistently across the rest of the process, or the user-book ratings will be misaligned.
This fix uses Method 2.
# Method 2
user_dict = {}
for index, value in enumerate(df.user_id.unique().tolist()):
    user_dict[value] = index
book_dict = {}
for index, value in enumerate(df.isbn.unique().tolist()):
    book_dict[value] = index
print(len(user_dict.keys()), len(book_dict.keys()))
train_data_matrix = np.zeros((n_users, n_book))
for line in train_data.itertuples():
    row_index = user_dict[line[1]]
    col_index = book_dict[line[2]]
    train_data_matrix[row_index, col_index] = line[3]
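As a side note, pandas can produce the same zero-based integer codes directly; pd.factorize could replace the two loops above. A quick standalone check:
import pandas as pd
# factorize returns (codes, uniques); codes are dense, zero-based integers
codes, uniques = pd.factorize(pd.Series(['u9', 'u3', 'u9', 'u7']))
print(codes)     # [0 1 0 2]
print(uniques)   # Index(['u9', 'u3', 'u7'], dtype='object')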
Hope this helps; a snapshot of the data would make it easier to pin this down exactly.

how to select specific columns in a table by using np.r_ in dataset.loc and deal with string data

I would like to solve a classification problem whose data rows look something like the sample (not shown here). To split into train and test data:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
Method 1:
X = dataset.loc[np.r_[0:5, 7:26]].values
y = dataset.loc[np.r_[6]].values
Method 2:
X = dataset.loc[:, ['x1', 'x2','x3','x4','x5','x6','x7','x8','x9','x10','x11','x12','x13','x14','x15','x16','x17','x18','x19','x20','x21','x22','x23','x24','x25','x26']].values
y = dataset.loc[:, ['y']].values
The first method encounters this problem:
ValueError: Found input variables with inconsistent numbers of samples: [24, 1]
while the second one works. I would rather not write out all of the columns, but I do not know how to fix the first method.
Also, since the data is string I encounter this error:
ValueError: could not convert string to float: 'id8053'
I tried to solve with:
X = X.apply(lambda x: pd.factorize(x)[1])
y = y.apply(lambda x: pd.factorize(x)[0])
but I encounter this error:
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
What is wrong?
np.r_ should work fine in your case, but Method 1 selected rows, not columns; that is where the inconsistent [24, 1] sample counts came from. Since you are slicing by the integer positions of columns, you need .iloc, with : for the rows and np.r_ for the columns.
Try this (note that the right end of each slice in np.r_ is one past the last position you want, because np.r_ expands slices the way Python's range does and excludes the stop value; a single position such as np.r_[6] is not a slice and stays as it is):
Method 1:
X = dataset.iloc[:, np.r_[0:6, 7:27]].values
y = dataset.iloc[:, np.r_[6]].values
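To see why the right ends moved by one, note how np.r_ expands its slices. A quick check:
import numpy as np
print(np.r_[0:3, 5:8])   # [0 1 2 5 6 7] -- the stops 3 and 8 are excluded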

Apply MinMaxScaler() on a pandas column

I am trying to use the sklearn MinMaxScaler to rescale a pandas column like below:
scaler = MinMaxScaler()
y = scaler.fit(df['total_amount'])
But got the following errors:
Traceback (most recent call last):
File "/Users/edamame/workspace/git/my-analysis/experiments/my_seq.py", line 54, in <module>
y = scaler.fit(df['total_amount'])
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 308, in fit
return self.partial_fit(X, y)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/utils/validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[3.180000e+00 2.937450e+03 6.023850e+03 2.216292e+04 1.074589e+04
:
0.000000e+00 0.000000e+00 9.000000e+01 1.260000e+03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Any idea what was wrong?
The input to MinMaxScaler needs to be array-like, with shape [n_samples, n_features]. So you can apply it on the column as a dataframe rather than a series (using double square brackets instead of single):
y = scaler.fit(df[['total_amount']])
Though from your description, it sounds like you want fit_transform rather than just fit (but I could be wrong):
y = scaler.fit_transform(df[['total_amount']])
A little more explanation:
If your dataframe had 100 rows, consider the difference in shape when you transform a column to an array:
>>> np.array(df[['total_amount']]).shape
(100, 1)
>>> np.array(df['total_amount']).shape
(100,)
The first returns a shape that matches [n_samples, n_features] (as required by MinMaxScaler), whereas the second does not.
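You can also keep the Series and follow the reshape hint from the error message itself; a sketch with toy values standing in for the question's df:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame({'total_amount': [3.18, 2937.45, 6023.85]})   # toy stand-in
scaler = MinMaxScaler()
y = scaler.fit_transform(df['total_amount'].values.reshape(-1, 1))
print(y.shape)   # (3, 1)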
Another approach is to scale the whole dataframe at once (note that this rescales every column, not just total_amount):
import pandas as pd
from sklearn import preprocessing
x = df.values  # returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)
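Note that the last line above discards the original column names and index. A common variant that preserves them, sketched with a toy frame (the column names are made up):
import pandas as pd
from sklearn import preprocessing
df = pd.DataFrame({'total_amount': [3.18, 2937.45, 6023.85],
                   'tip_amount': [0.5, 300.0, 120.0]})   # toy stand-in
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)
df_scaled = pd.DataFrame(x_scaled, columns=df.columns, index=df.index)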
