Pandas dataframe float index not self-consistent - python-3.x

I need/want to work with float indices in pandas but I get a keyerror when running something like this:
inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)
df[df.index[0]]
I have seen some errors regarding precision, but shouldn't this work?

You get the KeyError because df[df.index[0]] would try to access a column with label 1.1 in this case - which does not exist here.
What you can do is use loc or iloc to access rows based on indices:
import numpy as np
import pandas as pd
inds = [1.1, 2.2]
cols = [5.4, 6.7]
df = pd.DataFrame(np.random.randn(2, 2), index=inds, columns=cols)
# to access e.g. the first row use
df.loc[df.index[0]]
# or more general
df.iloc[0]
# 5.4 1.531411
# 6.7 -0.341232
# Name: 1.1, dtype: float64
In principle, if you can, avoid equal comparisons for floating point numbers for the reason you already came across: precision. The 1.1 displayed to you might be != 1.1 for the computer - simply because that would theoretically require infinite precision. Most of the time, it will work though because certain tolerance checks will kick in; for example if the difference of the compared numbers is < 10^6.

Related

Calculate similarity of 1-row dataframe and a large dataframe with the same columns in Python?

I have a very large dataframe (millions of rows) and every time I am getting a 1-row dataframe with the same columns.
For example:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,-1], 'c': [-1,0.4,31]})
input = pd.DataFrame([[11, -0.44, 4]], columns=list('abc'))
I would like to calculate cosine similarity between the input and the whole df.
I am using the following:
from scipy.spatial.distance import cosine
df.apply(lambda row: 1 - cosine(row, input), axis=1)
But it's a bit slow. Tried with swifter package, and it seems to run faster.
Please advise what is the best practice for such a task, do it like this or change to another method?
I usually don't do matrix manipulation with DataFrame but with numpy.array. So I will first convert them
df_npy = df.values
input_npy = input.values
And then I don't want to use scipy.spatial.distance.cosine so I will take care of the calculation myself, which is to first normalize each of the vectors
df_npy = df_npy / np.linalg.norm(df_npy, axis=1, keepdims=True)
input_npy = input_npy / np.linalg.norm(input_npy, axis=1, keepdims=True)
And then matrix multiply them together
df_npy # input_npy.T
which will give you
array([[0.213],
[0.524],
[0.431]])
The reason I don't want to use scipy.spatial.distance.cosine is that it only takes care of one pair of vector at a time, but in the way I show, it takes care of all at the same time.

Vectorizing a for loop with a pandas dataframe

I am trying to do a project for my physics class where we are supposed to simulate motion of charged particles. We are supposed to randomly generate their positions and charges but we have to have positively charged particles in one region and negatively charged ones anywhere else. Right now, as a proof of concept, I am trying to do only 10 particles but the final project will have at least 1000.
My thought process is to create a dataframe with the first column containing the randomly generated charges and run a loop to see what value I get and place in the same dataframe as the next three columns their generated positions.
I have tried to do a simple for loop going over the rows and inputting the data as I go, but I run into an IndexingError: too many indexers. I also want this to run as efficiently as possible so that if I scale up the number of particles, it doesn't slow as much.
I also want to vectorize the operations of calculating the motion of each particle since it is based on position of every other particle which, through normal loops would take a lot of computational time.
Any vectorization optimization or offloading to GPU would be very helpful, thanks.
# In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
# In[2]:
num_points=10
df_position = pd.DataFrame(pd,np.empty((num_points,4)),columns=['Charge','X','Y','Z'])
# In[3]:
charge = np.array([np.random.choice(2,num_points)])
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)
# In[4]:
def positive():
return np.random.uniform(low=0, high=5)
def negative():
return np.random.uniform(low=5, high=10)
# In[5]:
for row in df_position.itertuples(index=True,name='Charge'):
if(getattr(row,"Charge")==-1):
df_position.iloc[row,1]=positive()
df_position.iloc[row,2]=positive()
df_position.iloc[row,3]=positive()
else:
df_position.iloc[row,1]=negative()
#this is where I would get the IndexingError and would like to optimize this portion
df_position.iloc[row,2]=negative()
df_position.iloc[row,3]=negative()
df_position.iloc[:,0]=np.where(df_position["Charge"]==0,-1,1)
# In[6]:
ax=plt.axes(projection='3d')
ax.set_xlim(0, 10); ax.set_ylim(0, 10); ax.set_zlim(0,10);
xdata=df_position.iloc[:,1]
ydata=df_position.iloc[:,2]
zdata=df_position.iloc[:,3]
chargedata=df_position.iloc[:11,0]
colors = np.where(df_position["Charge"]==1,'r','b')
ax.scatter3D(xdata,ydata,zdata,c=colors,alpha=1)
EDIT:
The dataframe that I want the results in would be something like this
Charge X Y Z
-1
1
-1
-1
1
With the inital coordinates of each charge listed after in their respective columns. It will be a 3D dataframe as I will need to track of all their new positions after each time step so that I can do animations of the motion. Each layer will be exactly the same format.
Some code for creating your dataframe:
import numpy as np
import pandas as pd
num_points = 1_000
# uniform distribution of int, not sure it is the best one for your problem
# positive_point = np.random.randint(0, num_points)
positive_point = int(num_points / 100 * np.random.randn() + num_points / 2)
negavite_point = num_points - positive_point
positive_df = pd.DataFrame(
np.random.uniform(0.0, 5.0, size=[positive_point, 3]), index=[1] * positive_point, columns=['X', 'Y', 'Z']
)
negative_df = pd.DataFrame(
np.random.uniform(5.0, 10.0, size=[negavite_point, 3]), index=[-1] *negavite_point, columns=['X', 'Y', 'Z']
)
df = pd.concat([positive_df, negative_df])
It is quite fast for 1,000 or 1,000,000.
Edit: with my first answer, I totally miss a big part of the question. This new one should fit better.
Second edit: I use a better distribution for the number of positive point than a uniform distribution of int.

How to encode multiple categorical columns for test data efficiently?

I have multiple category columns (nearly 50). I using custom made frequency encoding and using it on training data. At last i am saving it as nested dictionary. For the test data I am using map function to encode and unseen labels are replaced with 0. But I need more efficient way?
I have already tried pandas replace method but it don't cares of unseen labels and leaves it as it. Further I am much concerned about the time and i want say 80 columns and 1 row to be encoded within 60 ms. Just need the most efficient way I can do it. I have taken my example from here.
import pandas
from sklearn import preprocessing
df = pandas.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
'New_York']})
My dict looks something like this :
enc = {'pets': {'cat': 0, 'dog': 1, 'monkey': 2},
'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
'location': {'New_York': 0, 'San_Diego': 1}}
for col in enc:
if col in input_df.columns:
input_df[col]= input_df[col].map(dict_online['encoding'][col]).fillna(0)
Further I want multiple columns to be encoded at once. I don't want any loop for every column.... I guess we cant do it in map. Hence replace is good choice but in that as said it doesn't cares about unseen labels.
EDIT:
This the code i am using for now, Please note there is only 1 row in test data frame ( Not very sure i should handle it like numpy array to reduce time...). But i need to decrease this time to under 60 ms: Further i have dictionary only for mapping ( Cant use one hot because of use case). Currently time = 331.74 ms. Any idea how to do it more efficiently. Not sure that multiprocessing will work..? Further with replace method i have got many issues like : 1. It does not handle unseen labels and leave them as it is ( for string its issue). 2. It has problem with overlapping of keys and values.
from string import ascii_lowercase
import itertools
import pandas as pd
import numpy as np
import time
def iter_all_strings():
for size in itertools.count(1):
for s in itertools.product(ascii_lowercase, repeat=size):
yield "".join(s)
l = []
for s in iter_all_strings():
l.append(s)
if s == 'gr':
break
columns = l
df = pd.DataFrame(columns=columns)
for col in df.columns:
df[col] = np.random.randint(1, 4000, 3000)
transform_dict = {}
for col in df.columns:
cats = pd.Categorical(df[col]).categories
d = {}
for i, cat in enumerate(cats):
d[cat] = i
transform_dict[col] = d
print(f"The length of the dictionary is {len(transform_dict)}")
# Creating another test data frame
df2 = pd.DataFrame(columns=columns)
for col in df2.columns:
df2[col] = np.random.randint(1, 4000, 1)
print(f"The shape of teh 2nd data frame is {df2.shape}")
t1 = time.time()
for col in df2.columns:
df2[col] = df2[col].map(transform_dict[col]).fillna(0)
print(f"Time taken is {time.time() - t1}")
# print(df)
Firstly, when you want to encode categorical variables, which is not ordinal (meaning: there is no inherent ordering between the values of the variable/column. ex- cat, dog), you must use one hot encoding.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'meo'],
'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego',
'New_York']})
enc = [['cat','dog','monkey'],
['Brick', 'Champ', 'Ron', 'Veronica'],
['New_York', 'San_Diego']]
ohe = OneHotEncoder(categories=enc, handle_unknown='ignore', sparse=False)
Here, I have modified your enc in a way that can be fed into the OneHotEncoder.
Now comes the point of how can we going to handle the unseen
labels?
when you handle_unknown as False, the unseen values will have zeros in all the dummy variables, which in a way would help the model to understand its a unknown value.
colnames= ['{}_{}'.format(col,val) for col,unique_values in zip(df.columns,ohe.categories_) \
for val in unique_values]
pd.DataFrame(ohe.fit_transform(df), columns=colnames)
Update:
If you are fine with ordinal endocing, the following change could help.
df2.apply(lambda row: [transform_dict[val].get(col,0) \
for val,col in row.items()],
axis=1,
result_type='expand')
#1000 loops, best of 3: 1.17 ms per loop

How do I fix KeyError bug in my code while implementing K-Nearest Neighbours from scratch?

I am trying to implement K-Nearest Neighbours algorithm from scratch in Python. The code I wrote worked well for the Breast-Cancer-Wisconsin.csv dataset.
However, the same code when I try to run for Iris.csv dataset, my implementation fails and gives KeyError.
The only difference in the 2 datasets is the fact that in Breast-Cancer-Wisconsin.csv there are only 2 classes ('2' for malignant and '4' for benign) and both the labels are integers wheres in Iris.csv there are 3 classes ('setosa', 'versicolor', 'virginica') and all these 3 labels are in string type.
Here is the code I wrote (for Iris.csv) :
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
dataset = {'k':[[1,2],[2,3],[3,1]], 'r':[[6,5],[7,7],[8,6]]}
new_features = [5,7]
#[[plt.scatter(j[0],j[1], s=100, color=i) for j in dataset[i]] for i in dataset]
#plt.scatter(new_features[0], new_features[1], s=100)
#plt.show()
def k_nearest_neighbors(data, predict, k=3):
if len(data) >= k:
warnings.warn('K is set to a value less than total voting groups!')
distances = []
for group in data:
for features in data[group]:
euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
distances.append([euclidean_distance, group])
votes = [i[1] for i in sorted(distances)[:k]]
vote_result = Counter(votes).most_common(1)[0][0]
return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
#full_data = df.astype(float).values.tolist()
#random.shuffle(full_data)
test_size = 0.2
train_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
test_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
train_set[i[-1]].append(i[:-1])
for i in test_data:
test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
for data in test_set[group]:
vote = k_nearest_neighbors(train_set, data, k=5)
if group == vote:
correct += 1
total += 1
print('Accuracy : ', correct/total)
When I run the above code, I get a KeyError message at line number 49.
Could anyone please explain to me where I am going wrong? Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
Also, how do I handle if the classes are in string type instead of integer?
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
REFERENCES
Iris.csv
Breas-Cancer-Wisconsin.csv
Let's start from your last question:
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
Yes, that would work. You shouldn't have to hardcode the names of all the classes of every problem in your code. Instead, you can just write a function that reads all the different values for the class attribute, and assigns a numeric value to each different one.
Could anyone please explain to me where I am going wrong?
Most likely, the problem is that you are reading an instance whose class attribute is not 'setosa', 'versicolor', 'virginica' (something like Iris-setosa perhaps?). The idea above should fix this problem.
Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
As discuss before, you just need to avoid hard-coding the names of the classes in your code
Also, how do I handle if the classes are in string type instead of integer?
def get_class_values(data):
classes_seen = {}
for i in data:
_class = data[-1]
if _class not in classes_seen:
classes_seen[_class] = len(classes_seen)
return classes_seen
A function like this one would return a mapping between all your classes (no matter the type) and numeric codes (from 0 to N-1). Using this mapping would also solve all the problems mentioned before.
Convert String Labels In CSV Files To Integer Labels
After going through some GitHub repos I came across a very simple yet elegant piece of code that solves the above problem. Hope it helps those who have faced this problem before (beginners especially!)
% read the csv file
df = pd.read_csv('iris.csv')
% clean the data file
df.replace('?', -99999, inplace=True)
% convert the string classes into integer types.
% integers are assigned from 0 to N-1.
% species is the name of the column which has class labels.
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], 1, inplace=True)
% convert the data frame to list
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Post Debugging
Turns out that we need not use the above piece of code also, i.e I can get the answer without explicitly converting the string labels into integer labels (using the above code).
I have posted the original code after some minor changes (below) and the key error is now fixed. Also, I am now getting an accuracy of 97% to 100% (only on IRIS dataset).
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
That is the only change you need to make to the original code I posted in order to make it work!! Simple!
However, please note that the numbers have to be given as integers and not string (otherwise it would lead to key error!).
Wrap-Up
There are some commented lines in the original code which I thought would be good to explain in case somebody ran into some issues. Here's one snippet with the comments removed (compare with original code in the question).
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Here's the output you get:
ValueError: could not convert string to float: 'virginica'
What went wrong?
Note that here we did not convert the string labels into integer labels. Therefore, when we tried to convert the data in the CSV to float values, the kernel threw an error because a string cannot be converted to float!
So one way to go about it is that you don't convert the data into floating point values and then you won't get this error. However in many cases you need to convert all the data into floating point (for eg.. normalisation, accuracy, long mathematical calculations, prevention of loss of precision etc etc..).
Hence after heavy debugging and going through a lot of articles I finally came up with a simple version of the original code (below):
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
def k_nearest_neighbors(data, predict, k=3):
if len(data) >= k:
warnings.warn('K is set to a value less than total voting groups!')
distances = []
for group in data:
for features in data[group]:
euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
distances.append([euclidean_distance, group])
votes = [i[1] for i in sorted(distances)[:k]]
vote_result = Counter(votes).most_common(1)[0][0]
return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], 1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
train_set[i[-1]].append(i[:-1])
for i in test_data:
test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
for data in test_set[group]:
vote = k_nearest_neighbors(train_set, data, k=5)
if group == vote:
correct += 1
total += 1
print('Accuracy : ', (correct/total)*100,'%')
Hope this helps!

Insert field into structured array at a specific column index

I'm currently using np.loadtxt to load some mixed data into a structured numpy array. I do some calculations on a few of the columns to output later. For compatibility reasons I need to maintain a specific output format so I'd like to insert those columns at specific points and use np.savetxt to export the array in one shot.
A simple setup:
import numpy as np
x = np.zeros((2,),dtype=('i4,f4,a10'))
x[:] = [(1,2.,'Hello'),(2,3.,'World')]
newcol = ['abc','def']
For this example I'd like to make newcol the 2nd column. I'm very new to Python (coming from MATLAB). From my searching all I've been able to find so far are ways to append newcol to the end of x to make it the last column, or x to newcol to make it the first column. I also turned up np.insert but it doesn't seem to work on a structured array because it's technically a 1D array (from my understanding).
What's the most efficient way to accomplish this?
EDIT1:
I investigated np.savetxt a little further and it seems like it can't be used with a structured array, so I'm assuming I would need to loop through and write each row with f.write. I could specify each column explicitly (by field name) with that approach and not have to worry about the order in my structured array, but that doesn't seem like a very generic solution.
For the above example my desired output would be:
1, abc, 2.0, Hello
2, def, 3.0, World
This is a way to add a field to the array, at the position you require:
from numpy import zeros, empty
def insert_dtype(x, position, new_dtype, new_column):
if x.dtype.fields is None:
raise ValueError, "`x' must be a structured numpy array"
new_desc = x.dtype.descr
new_desc.insert(position, new_dtype)
y = empty(x.shape, dtype=new_desc)
for name in x.dtype.names:
y[name] = x[name]
y[new_dtype[0]] = new_column
return y
x = zeros((2,), dtype='i4,f4,a10')
x[:] = [(1, 2., 'Hello'), (2, 3., 'World')]
new_dt = ('my_alphabet', '|S3')
new_col = ['abc', 'def']
x = insert_dtype(x, 1, new_dt, new_col)
Now x looks like
array([(1, 'abc', 2.0, 'Hello'), (2, 'def', 3.0, 'World')],
dtype=[('f0', '<i4'), ('my_alphabet', 'S3'), ('f1', '<f4'), ('f2', 'S10')])
The solution is adapted from here.
To print the recarray to file, you could use something like:
from matplotlib.mlab import rec2csv
rec2csv(x,'foo.txt')

Resources