Python pandas search in DataFrame MultiIndex

This is what I do:
import pandas as pd
t = pd.DataFrame(data={'i1':[0,0,0,0,1,1,1,1,2,2,2,2],
'i2':[0,1,2,3,0,1,2,3,0,1,2,3],
'x':[1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,11.,12.]})
t.set_index(['i1','i2'], inplace=True)
t.sort_index(inplace=True)
t.index.values.searchsorted( (1,1) )
This is the error I get:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'tuple' and 'int'
Please, help me to understand what I am doing wrong.
Index values are tuples: type(t.index.values[0]) correctly gives <class 'tuple'>, and I pass a tuple to searchsorted. So where does the 'tuple' to 'int' comparison come from?
>>> print(t)
          x
i1 i2
0  0    1.0
   1    2.0
   2    3.0
   3    4.0
1  0    5.0
   1    6.0
   2    7.0
   3    8.0
2  0    9.0
   1   10.0
   2   11.0
   3   12.0

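searchsorted doesn't work with tuples. NumPy does not treat the tuple as a single search key: it converts (1,1) into an array of two integers and then compares each integer against the tuples stored in t.index.values, which is where the 'tuple' versus 'int' comparison comes from. There's an open issue on GitHub titled "Multiarray searchsorted fails".
On that issue, one of the participants suggests using get_indexer.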
With your code
t.index.get_indexer([(1,1)])[0]
# outputs:
5
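A minimal sketch of the get_indexer approach, rebuilding the frame from the question so it runs on its own. The second key, (5, 5), is a made-up label that is not in the index; it is included only to show that get_indexer reports missing keys as -1 instead of raising.

```python
import pandas as pd

# Rebuild the MultiIndex frame from the question.
t = pd.DataFrame(data={'i1': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                       'i2': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                       'x': [1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.]})
t = t.set_index(['i1', 'i2']).sort_index()

# get_indexer takes a list of keys and returns one integer position per key.
# (5, 5) is a hypothetical key that does not exist in the index.
positions = t.index.get_indexer([(1, 1), (5, 5)])
print(positions)
```

Because the result is positional, it can be fed straight into t.iloc once the -1 entries have been filtered out.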

I have found a solution:
>>> t.index.get_loc( (1,1) )
5
This solution is ~200 times faster than using t.index.get_indexer:
>>> import time
>>> time.clock()
168.56
>>> for i in range(10000): a = t.index.get_indexer([(1,1)])[0]
...
>>> time.clock()
176.76
>>> (176.76 - 168.56) / 10000
0.0008199999999999989 # 820e-6 sec per call
>>> time.clock()
176.76
>>> for i in range(1000000): a = t.index.get_loc( (1,1) )
...
>>> time.clock()
180.94
>>> (180.94-176.76)/1000000
4.1800000000000066e-06 # 4.2e-6 sec per call
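time.clock() was removed in Python 3.8, so the session above will not run on current interpreters. The following is a sketch of the same comparison using the standard timeit module; the loop count n is an arbitrary choice, and the measured ratio will depend on the machine and the pandas version, so no timings are claimed here.

```python
import timeit
import pandas as pd

# Same frame as in the question.
t = pd.DataFrame(data={'i1': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                       'i2': [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3],
                       'x': [1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12.]})
t = t.set_index(['i1', 'i2']).sort_index()
idx = t.index

n = 1000  # number of repetitions; an arbitrary value for this sketch

# Average seconds per call for each lookup method.
per_call_indexer = timeit.timeit(lambda: idx.get_indexer([(1, 1)])[0], number=n) / n
per_call_loc = timeit.timeit(lambda: idx.get_loc((1, 1)), number=n) / n

print(f"get_indexer: {per_call_indexer:.2e} s per call")
print(f"get_loc:     {per_call_loc:.2e} s per call")
```

Note that get_loc raises KeyError for a missing key, whereas get_indexer returns -1, so the two are not interchangeable when keys may be absent.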

Related

Python Pandas indexing provides KeyError: (slice(None, None, None), )

I am indexing and slicing my data using Pandas in Python3 to calculate spatial statistics.
When I run a for loop over the range of latitudes and longitudes using .loc, I get KeyError: (slice(None, None, None), ) for any latitude/longitude combination that has no values in the input file. Instead of skipping those combinations, the code raises the error and stops. Here is my code.
import numpy as np
import pandas as pd
from scipy import stats
filename='input.txt'
df = pd.read_csv(filename,delim_whitespace=True, header=None, names = ['year','month','lat','lon','aod'], index_col = ['year','month','lat','lon'])
idx=pd.IndexSlice
for i in range(1, 13):
    for lat0 in np.arange(0., 40.25, 0.25, dtype=float):
        for lon0 in np.arange(20.0, 75.25, 0.25, dtype=float):
            tmp = df.loc[idx[:,i,lat0,lon0],:]
            if (len(tmp) <= 0):
                continue
            tmp2 = tmp.index.tolist()
In the code above, running tmp = df.loc[idx[:,1,0.0,34.0],:] works well and gives the following output, which I use for further calculation.
                          aod
year month lat lon
2003 1     0.0 34.0  0.032000
2006 1     0.0 34.0  0.114000
2007 1     0.0 34.0  0.035000
2008 1     0.0 34.0  0.026000
2011 1     0.0 34.0  0.097000
2012 1     0.0 34.0  0.106333
2013 1     0.0 34.0  0.081000
2014 1     0.0 34.0  0.038000
2015 1     0.0 34.0  0.278500
2016 1     0.0 34.0  0.033000
2017 1     0.0 34.0  0.036333
2019 1     0.0 34.0  0.064333
2020 1     0.0 34.0  0.109500
But when I run the same code as tmp = df.loc[idx[:,1,0.0,32.75],:], a latitude and longitude for which no values are available in the input file, it raises the following error instead of skipping them:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 925, in __getitem__
    return self._getitem_tuple(key)
  File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
    return self._getitem_lowerdim(tup)
  File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 822, in _getitem_lowerdim
    return self._getitem_nested_tuple(tup)
  File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 906, in _getitem_nested_tuple
    obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
  File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1157, in _getitem_axis
    locs = labels.get_locs(key)
  File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3347, in get_locs
    indexer = _update_indexer(
  File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3296, in _update_indexer
    raise KeyError(key)
KeyError: (slice(None, None, None), 1, 0.0, 32.75)
I tried replacing .loc with .iloc, but that raised a "too many indexers" error. I also tried solutions from the internet using .to_numpy(), .values and .as_matrix(), but none of them worked.

The idiomatic Pandas solution would be to write this as a groupby. Example:
# split df into groups by the keys month, lat, and lon
for index, tmp in df.groupby(['month','lat','lon']):
    # tmp is a dataframe where all rows have identical month, lat, and lon values
    # ... do something with the tmp dataframe ...
    pass
This has three benefits.
Speed. A groupby will be faster because it only needs to loop over the dataframe once, rather than searching the whole dataframe for everything matching the first group, then searching for the second group, etc.
Simplicity.
Robustness. From a robustness perspective, if a dataframe doesn't have, for example, any rows matching "month=1,lat=0.0,lon=32.75", then it will not create that group.
More information: User guide on grouping
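To make the robustness point concrete, here is a small sketch with made-up rows (the values are illustrative only, not taken from the asker's input.txt). It groups on the index levels and records which (month, lat, lon) combinations actually appear; combinations with no rows never show up, so there is nothing to skip and no KeyError to catch.

```python
import pandas as pd

# Tiny stand-in for the asker's data; the rows are hypothetical.
df = pd.DataFrame({
    'year':  [2003, 2006, 2003],
    'month': [1, 1, 2],
    'lat':   [0.0, 0.0, 0.25],
    'lon':   [34.0, 34.0, 20.0],
    'aod':   [0.032, 0.114, 0.050],
}).set_index(['year', 'month', 'lat', 'lon'])

group_sizes = {}
# Group on index levels by name; only combinations present in the data are visited.
for (month, lat, lon), tmp in df.groupby(level=['month', 'lat', 'lon']):
    group_sizes[(int(month), float(lat), float(lon))] = len(tmp)

print(group_sizes)
```

The loop body receives the same kind of sub-frame that df.loc[idx[:, i, lat0, lon0], :] would have returned, so the downstream calculation should carry over unchanged.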
Remark about groupby aggregation functions
You'll also sometimes see groupby used with aggregation functions. For example, suppose you wanted to get the sum of each column within each group.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
     a  c
b
1.0  2  3
2.0  2  5
These aggregation functions are faster and easier to use, but sometimes I need something that is custom and unusual, so I'll write a loop. But if you're doing something common, like getting the average of a group, consider looking for an aggregation function.
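As a sketch of that last suggestion, the per-group average can be computed without any explicit loop. The frame below uses hypothetical values shaped like the aod data from the question.

```python
import pandas as pd

# Hypothetical rows shaped like the aod data in the question.
aod_df = pd.DataFrame({
    'year':  [2003, 2006, 2003],
    'month': [1, 1, 2],
    'lat':   [0.0, 0.0, 0.25],
    'lon':   [34.0, 34.0, 20.0],
    'aod':   [0.032, 0.114, 0.050],
}).set_index(['year', 'month', 'lat', 'lon'])

# Mean aod across years for every (month, lat, lon) that occurs in the data.
mean_aod = aod_df.groupby(level=['month', 'lat', 'lon'])['aod'].mean()
print(mean_aod)
```

The result is a Series indexed by (month, lat, lon), so a single cell can be read back with mean_aod.loc[(1, 0.0, 34.0)].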

Operands could not be broadcast together with shapes Error for Pandas Dataframe

I have looked through the other answers for Operand errors and none seem to fit this example.
The mathematics/equation works, either coding in X values or importing from the DataFrame.
Using the same equation in an np.where expression causes the operand error.
import csv
import pandas as pd
from pandas import DataFrame
import numpy as np
data= pd.read_csv('miniDF.csv')
df=pd.DataFrame(data, columns=['X','Z'])
df['y']=df['Z']*0.01
df['y']=(14.6413819224756*(df['X']**0.5)+64.4092780704338*(np.log(df['X'])**-2)
+1675.7498523727*(np.exp(-df['X']))+3.07221083927051*np.cos(df['X']))
print(df)
df['y']=np.where(df['Z']>=(14.6413819224756*(df['X']**0.5)+64.4092780704338*(np.log(df['X'])**-2)
+1675.7498523727*(np.exp(-df['X']))+3.07221083927051*np.cos(df['X']),8,9))
print(df)
The values in my Dataframe, the output from the first print(df) and the error are as follows.
      X     Z           y
0   1.4     1  999.999293
1   2.0  2000  380.275104
2   3.0     3  159.114194
3   4.0     4   91.481930
4   5.0     5   69.767368
5   6.0     6   63.030212
6   7.0    70   59.591631
7   8.0     8   56.422723
8   9.0     9   54.673108
9  10.0    10   55.946732
Traceback (most recent call last):
  File "/Users/willhutchins/Desktop/minitest.py", line 17, in <module>
    df['y']=np.where(df['Z']>=(14.6413819224756*(df['X']**0.5)+64.4092780704338*(np.log(df['X'])**-2)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 1229, in wrapper
    res = na_op(values, other)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/ops/__init__.py", line 1115, in na_op
    result = method(y)
ValueError: operands could not be broadcast together with shapes (10,) (3,)

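The answer was simply a misplaced parenthesis, as answered here:
https://stackoverflow.com/questions/63046213/complex-curve-equation-giving-error-in-np-where-usage
In the original call the final closing parenthesis came after 8,9, so df['Z'] was compared against a 3-element tuple (the equation, 8, 9), which explains the shapes (10,) and (3,) in the error.
Corrected code is: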
df['y']=np.where(df['Z']>=(14.6413819224756*(df['X']**0.5)+64.4092780704338*(np.log(df['X'])**-2)
+1675.7498523727*(np.exp(-df['X']))+3.07221083927051*np.cos(df['X'])),8,9)
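One way to make this kind of slip harder to commit is to compute the curve into a named variable first and keep the np.where call short. This is only a sketch: the variable name threshold is an assumption, and the three rows are copied from the first lines of the printed DataFrame in the question.

```python
import numpy as np
import pandas as pd

# First three rows of the DataFrame shown in the question.
df = pd.DataFrame({'X': [1.4, 2.0, 3.0], 'Z': [1, 2000, 3]})

# Evaluate the curve once; 'threshold' is a name chosen for this sketch.
threshold = (14.6413819224756 * (df['X'] ** 0.5)
             + 64.4092780704338 * (np.log(df['X']) ** -2)
             + 1675.7498523727 * np.exp(-df['X'])
             + 3.07221083927051 * np.cos(df['X']))

# np.where(condition, value_if_true, value_if_false): three separate arguments.
df['y'] = np.where(df['Z'] >= threshold, 8, 9)
print(df)
```

With the condition, the 8 and the 9 visibly separated by top-level commas, a parenthesis that swallows the last two arguments is easier to spot.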

Why I am getting matrices are not aligned error for DataFrame dot function?

I am trying to implement simple linear regression in Python using Numpy and Pandas. But I am getting a ValueError: matrices are not aligned error for calling the dot function which essentially calculates the matrix multiplication as the documentation says. Following is the code snippet:
import numpy as np
import pandas as pd
#initializing the matrices for X, y and theta
#dataset = pd.read_csv("data1.csv")
dataset = pd.DataFrame([[6.1101,17.592],[5.5277,9.1302],[8.5186,13.662],[7.0032,11.854],[5.8598,6.8233],[8.3829,11.886],[7.4764,4.3483],[8.5781,12]])
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]
X.insert(0, "x_zero", np.ones(X.size), True)
print(X)
print(f"\n{y}")
theta = pd.DataFrame([[0],[1]])
temp = pd.DataFrame([[1],[1]])
print(X.shape)
print(theta.shape)
print(X.dot(theta))
And this is the output for the same:
   x_zero       0
0     1.0  6.1101
1     1.0  5.5277
2     1.0  8.5186
3     1.0  7.0032
4     1.0  5.8598
5     1.0  8.3829
6     1.0  7.4764
7     1.0  8.5781

0    17.5920
1     9.1302
2    13.6620
3    11.8540
4     6.8233
5    11.8860
6     4.3483
7    12.0000
Name: 1, dtype: float64
(8, 2)
(2, 1)
Traceback (most recent call last):
  File "linear.py", line 16, in <module>
    print(X.dot(theta))
  File "/home/tejas/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 1063, in dot
    raise ValueError("matrices are not aligned")
ValueError: matrices are not aligned
As you can see from the shape attributes of both, the second axis of X and the first axis of theta have the same dimension (2), so the dot function should return an 8*1 DataFrame. Then why the error?

The misalignment here does not come from the shapes but from the pandas indexes: DataFrame.dot matches the column labels of X (x_zero, 0) against the index labels of theta (0, 1), and they differ. You have 2 options to fix your problem:
1. Tweak the theta assignment so that the labels you multiply match:
theta = pd.DataFrame([[0],[1]], index=X.columns)
2. Make the indexes irrelevant by converting the second DataFrame to a NumPy array:
X.dot(theta.to_numpy())
This label matching is actually a useful pandas feature; your case is just a quite specific one where it becomes counterproductive ;)
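Both options can be checked side by side on a cut-down version of the data. This is a sketch under a few assumptions: only the first three rows of the question's data are used, and the feature column is renamed to the string 'x1' (the question's frame uses the integer label 0) to keep all column labels the same type.

```python
import numpy as np
import pandas as pd

# First three feature values from the question; 'x1' is a label chosen for this sketch.
X = pd.DataFrame({'x_zero': np.ones(3), 'x1': [6.1101, 5.5277, 8.5186]})
theta_values = np.array([[0], [1]])

# Option 1: give theta an index equal to X's columns so the labels align.
theta = pd.DataFrame(theta_values, index=X.columns)
aligned = X.dot(theta)

# Option 2: pass a plain NumPy array, which is multiplied purely by position.
positional = X.dot(theta_values)

print(aligned)
print(positional)
```

With theta = [0, 1] the product is just the x1 column, which makes it easy to confirm that both routes give the same numbers.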

pandas: Using built-in and customized aggregation function together?

I used the following code:
s=df.groupby('version').agg({'duration':['mean','std'],'ts':['min','max']}).reset_index()
s.columns=s.columns.map("_".join)
The results work fine.
Then I tried to add one more aggregate function quantile(.25)
s=df.groupby('version').agg({'duration':['mean','std', quantile(.25)],'ts':['min','max']}).reset_index()
s.columns=s.columns.map("_".join)
Then get the following error:
NameError Traceback (most recent call last)
<ipython-input-22-d4857cf7740e> in <module>()
----> 1 s=df.groupby('version').agg({'duration':['mean','std', quantile(.25)],'ts':['min','max']}).reset_index()
2 s.columns=s.columns.map("_".join)
3 s
NameError: name 'quantile' is not defined
What would be the proper way to achieve this? Thanks!

You can wrap the quantile operation in a lambda function:
> import pandas as pd
> import numpy as np
> df = pd.DataFrame({'Duration': np.random.rand(10),
                     'Version': [1,1,2,2,3,3,4,4,4,4]})
> df
   Duration  Version
0  0.843479        1
1  0.028724        1
2  0.605053        2
3  0.548231        2
4  0.223244        3
5  0.883418        3
6  0.772413        4
7  0.100166        4
8  0.865734        4
9  0.865839        4
> df.groupby('Version').agg({'Duration' : [min, max, lambda x: x.quantile(.25)]})
         Duration
              min       max  <lambda>
Version
1        0.028724  0.843479  0.232413
2        0.548231  0.605053  0.562437
3        0.223244  0.883418  0.388287
4        0.100166  0.865839  0.604351
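A lambda shows up as <lambda> in the column header, which is awkward once the columns are flattened with "_".join. One alternative, sketched here with made-up data, is pandas named aggregation (available since pandas 0.25), which lets each output column be named directly. The helper name q25 and the sample values are assumptions for illustration.

```python
import pandas as pd

# Made-up sample in the shape of the asker's data.
df = pd.DataFrame({
    'version':  [1, 1, 2, 2],
    'duration': [1.0, 3.0, 2.0, 6.0],
    'ts':       [10, 20, 30, 40],
})

def q25(series):
    """25th percentile of a group's values (hypothetical helper)."""
    return series.quantile(0.25)

# Named aggregation: output_column=(input_column, function or function name).
s = df.groupby('version').agg(
    duration_mean=('duration', 'mean'),
    duration_q25=('duration', q25),
    ts_min=('ts', 'min'),
    ts_max=('ts', 'max'),
).reset_index()
print(s)
```

Because the output columns are already flat, the s.columns.map("_".join) step from the question is not needed with this form.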

applying a lambda function to pandas dataframe

First time posting on stackoverflow, so bear with me if I'm making some faux pas please :)
I'm trying to calculate the distance between two points, using geopy, but I can't quite get the actual application of the calculation to work.
Here's the head of the dataframe I'm working with (there are some missing values later in the dataframe, not sure if this is the issue or how to handle it in general):
   start lat  start long    end_lat   end_long
0  38.902760  -77.038630  38.880300 -76.986200
2  38.895914  -77.026064  38.915400 -77.044600
3  38.888251  -77.049426  38.895914 -77.026064
4  38.892300  -77.043600  38.888251 -77.049426
I've set up a function:
def dist_calc(st_lat, st_long, fin_lat, fin_long):
    from geopy.distance import vincenty
    start = (st_lat, st_long)
    end = (fin_lat, fin_long)
    return vincenty(start, end).miles
This one works fine when given manual input.
However, when I try to apply() the function, I run into trouble with the below code:
distances = df.apply(lambda row: dist_calc(row[-4], row[-3], row[-2], row[-1]), axis=1)
I'm fairly new to python, any help will be much appreciated!
Edit: error message:
distances = df.apply(lambda row: dist_calc2(row[-4], row[-3], row[-2], row[-1]), axis=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
    ignore_failures=ignore_failures)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
    results[i] = func(v)
  File "<stdin>", line 1, in <lambda>
  File "<stdin>", line 5, in dist_calc2
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 322, in __init__
    super(vincenty, self).__init__(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 115, in __init__
    kilometers += self.measure(a, b)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 414, in measure
    u_sq = cos_sq_alpha * (major ** 2 - minor ** 2) / minor ** 2
UnboundLocalError: ("local variable 'cos_sq_alpha' referenced before assignment", 'occurred at index 10')

The default settings for pandas functions typically used to import text data like this (pd.read_table() etc) will interpret the spaces in the first 2 column names as separators, so you'll end up with 6 columns instead of 4, and your data will be misaligned:
In [23]: df = pd.read_clipboard()
In [24]: df
Out[24]:
   start        lat    start.1       long    end_lat  end_long
0      0  38.902760 -77.038630  38.880300 -76.986200       NaN
1      2  38.895914 -77.026064  38.915400 -77.044600       NaN
2      3  38.888251 -77.049426  38.895914 -77.026064       NaN
3      4  38.892300 -77.043600  38.888251 -77.049426       NaN
In [25]: df.columns
Out[25]: Index(['start', 'lat', 'start.1', 'long', 'end_lat', 'end_long'], dtype='object')
Notice column names are wrong, the last column is full of NaNs, etc. If I apply your function to the dataframe in this form, I get the same error as you did.
It's usually better to fix this before the data gets imported as a dataframe. I can think of 2 methods:
1. Clean the data before importing, for example by copying it into an editor and replacing the offending spaces with underscores. This is the easiest.
2. Use a regex to fix it during import. This may be necessary if the dataset is very large, or if it is pulled from a website and has to be refreshed regularly.
Here's an example of case (2):
In [35]: df = pd.read_clipboard(sep=r'\s{2,}|\s(?=-)', engine='python')
In [36]: df = df.rename(columns={'start lat': 'start_lat', 'start long': 'start_long'})
In [37]: df
Out[37]:
   start_lat  start_long    end_lat   end_long
0  38.902760  -77.038630  38.880300 -76.986200
2  38.895914  -77.026064  38.915400 -77.044600
3  38.888251  -77.049426  38.895914 -77.026064
4  38.892300  -77.043600  38.888251 -77.049426
The regex specifies that a separator must be either 2 or more whitespace characters, or 1 whitespace character followed by a hyphen (minus sign). Then I rename the columns to what I assume are the expected values.
From this point your function / apply works fine, but I've changed it a little:
1. PEP8 recommends putting imports at the top of each file, rather than in a function.
2. Extracting the columns by name is more robust, and would have given a much more understandable error than the weird error thrown by geopy.
For example:
In [51]: def dist_calc(row):
    ...:     start = row[['start_lat','start_long']]
    ...:     end = row[['end_lat', 'end_long']]
    ...:     return vincenty(start, end).miles
    ...:
In [52]: df.apply(lambda row: dist_calc(row), axis=1)
Out[52]:
0    3.223232
2    1.674780
3    1.365851
4    0.420305
dtype: float64
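The question also mentions missing values later in the dataframe. The sketch below shows one way a row-wise function could select columns by name and return NaN for incomplete rows. To stay dependency-free it swaps in a plain haversine (great-circle) formula instead of geopy, so its results will differ slightly from vincenty; the Earth radius constant, the helper names and the second (incomplete) row are all assumptions made for illustration.

```python
import math
import pandas as pd

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (stand-in for geopy in this sketch)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def dist_calc(row):
    # Select the four coordinates by column name rather than by position.
    coords = row[['start_lat', 'start_long', 'end_lat', 'end_long']]
    if coords.isna().any():
        return float('nan')  # skip rows with missing coordinates
    return haversine_miles(*coords)

df = pd.DataFrame({
    'start_lat':  [38.902760, 38.895914],
    'start_long': [-77.038630, -77.026064],
    'end_lat':    [38.880300, float('nan')],   # second row is deliberately incomplete
    'end_long':   [-76.986200, -77.044600],
})

distances = df.apply(dist_calc, axis=1)
print(distances)
```

Dropping incomplete rows up front with df.dropna(subset=[...]) would be an alternative if the NaN placeholders are not needed in the result.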
