applying a lambda function to pandas dataframe - python-3.x

First time posting on stackoverflow, so bear with me if I'm making some faux pas please :)
I'm trying to calculate the distance between two points, using geopy, but I can't quite get the actual application of the calculation to work.
Here's the head of the dataframe I'm working with (there are some missing values later in the dataframe, not sure if this is the issue or how to handle it in general):
start lat start long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
I've set up a function:
def dist_calc(st_lat, st_long, fin_lat, fin_long):
    from geopy.distance import vincenty
    start = (st_lat, st_long)
    end = (fin_lat, fin_long)
    return vincenty(start, end).miles
This one works fine when given manual input.
However, when I try to apply() the function, I run into trouble with the below code:
distances = df.apply(lambda row: dist_calc(row[-4], row[-3], row[-2], row[-1]), axis=1)
I'm fairly new to python, any help will be much appreciated!
Edit: error message:
distances = df.apply(lambda row: dist_calc2(row[-4], row[-3], row[-2], row[-1]), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 5, in dist_calc2
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 322, in __init__
super(vincenty, self).__init__(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 115, in __init__
kilometers += self.measure(a, b)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 414, in measure
u_sq = cos_sq_alpha * (major ** 2 - minor ** 2) / minor ** 2
UnboundLocalError: ("local variable 'cos_sq_alpha' referenced before assignment", 'occurred at index 10')

The default settings for pandas functions typically used to import text data like this (pd.read_table() etc) will interpret the spaces in the first 2 column names as separators, so you'll end up with 6 columns instead of 4, and your data will be misaligned:
In [23]: df = pd.read_clipboard()
In [24]: df
Out[24]:
start lat start.1 long end_lat end_long
0 0 38.902760 -77.038630 38.880300 -76.986200 NaN
1 2 38.895914 -77.026064 38.915400 -77.044600 NaN
2 3 38.888251 -77.049426 38.895914 -77.026064 NaN
3 4 38.892300 -77.043600 38.888251 -77.049426 NaN
In [25]: df.columns
Out[25]: Index(['start', 'lat', 'start.1', 'long', 'end_lat', 'end_long'], dtype='object')
Notice column names are wrong, the last column is full of NaNs, etc. If I apply your function to the dataframe in this form, I get the same error as you did.
It's usually better to fix this before the data gets imported as a dataframe. I can think of 2 methods:
(1) Clean the data before importing, for example copy it into an editor and replace the offending spaces with underscores. This is the easiest.
(2) Use a regex to fix it during import. This may be necessary if the dataset is very large, or if it is pulled from a website and has to be refreshed regularly.
Here's an example of case (2):
In [35]: df = pd.read_clipboard(sep=r'\s{2,}|\s(?=-)', engine='python')
In [36]: df = df.rename(columns={'start lat': 'start_lat', 'start long': 'start_long'})
In [37]: df
Out[37]:
start_lat start_long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
The regex specifies that a separator must be either 2 or more whitespace characters, or 1 whitespace character followed by a hyphen (minus sign). Then I rename the columns to what I assume are the expected values.
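As a quick check of how that separator behaves, here is a minimal sketch using only Python's re module. The two sample lines are an assumption about how the copied text was spaced (two spaces between columns, a single space before negative numbers); they are not taken from the actual clipboard.

```python
import re

# Same pattern as the sep argument above: 2+ whitespace characters,
# or a single whitespace character that is followed by a minus sign.
SEP = re.compile(r"\s{2,}|\s(?=-)")

# Assumed spacing, for illustration only.
header = "start lat  start long  end_lat  end_long"
row = "0  38.902760 -77.038630  38.880300 -76.986200"

naive_header = header.split()      # plain whitespace splitting
header_fields = SEP.split(header)  # regex splitting
row_fields = SEP.split(row)

print(naive_header)   # 6 tokens: the column names are broken apart
print(header_fields)  # 4 column names, spaces inside names preserved
print(row_fields)     # index value plus 4 coordinates
```

With this spacing the header yields 4 names and each data row yields 5 fields, which is the shape pandas needs to treat the first field as the index.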
From this point your function / apply works fine, but I've changed it a little:
PEP 8 recommends putting imports at the top of each file, rather than inside a function.
Extracting the columns by name is more robust, and would have given a much more understandable error than the weird error thrown by geopy.
For example:
In [51]: def dist_calc(row):
    ...:     start = row[['start_lat','start_long']]
    ...:     end = row[['end_lat', 'end_long']]
    ...:     return vincenty(start, end).miles
    ...:
In [52]: df.apply(lambda row: dist_calc(row), axis=1)
Out[52]:
0 3.223232
2 1.674780
3 1.365851
4 0.420305
dtype: float64
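The question also mentions missing values later in the dataframe. Here is a minimal sketch of one way to handle them, assuming that skipping incomplete rows is acceptable: drop rows with a NaN coordinate before applying the function, and read the columns by name. To keep the example self-contained it uses a hand-written haversine formula (haversine_miles is a hypothetical helper, not part of geopy), so the results differ slightly from geopy's ellipsoidal distances.

```python
import math

import numpy as np
import pandas as pd


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles on a sphere (stand-in for geopy)."""
    radius_miles = 3958.8  # mean Earth radius, an approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * radius_miles * math.asin(math.sqrt(a))


# The first two rows come from the question; the last row has a made-up gap.
df = pd.DataFrame(
    {
        "start_lat": [38.902760, 38.895914, 38.888251],
        "start_long": [-77.038630, -77.026064, np.nan],
        "end_lat": [38.880300, 38.915400, 38.895914],
        "end_long": [-76.986200, -77.044600, -77.026064],
    },
    index=[0, 2, 3],
)

coord_cols = ["start_lat", "start_long", "end_lat", "end_long"]
complete = df.dropna(subset=coord_cols)  # skip rows with any missing coordinate

distances = complete.apply(
    lambda row: haversine_miles(
        row["start_lat"], row["start_long"], row["end_lat"], row["end_long"]
    ),
    axis=1,
)
print(distances)
```

If a column name is misspelled or missing, the name-based lookup raises a KeyError naming that column, which is easier to diagnose than an error from deep inside the distance calculation.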

Related

Python Pandas indexing provides KeyError: (slice(None, None, None), )

I am indexing and slicing my data using Pandas in Python3 to calculate spatial statistics.
When I run a for loop over the range of latitudes and longitudes using .loc, I get KeyError: (slice(None, None, None), ) for any latitude/longitude pair that has no values in the input file. Instead of skipping those pairs, the code raises the error and stops. Here is my code:
import numpy as np
import pandas as pd
from scipy import stats
filename='input.txt'
df = pd.read_csv(filename,delim_whitespace=True, header=None, names = ['year','month','lat','lon','aod'], index_col = ['year','month','lat','lon'])
idx=pd.IndexSlice
for i in range(1, 13):
    for lat0 in np.arange(0.,40.25,0.25,dtype=float):
        for lon0 in np.arange(20.0,75.25,0.25,dtype=float):
            tmp = df.loc[idx[:,i,lat0,lon0],:]
            if (len(tmp) <= 0):
                continue
            tmp2 = tmp.index.tolist()
In the code above, if I run tmp = df.loc[idx[:,1,0.0,34.0],:], it works well and gives the following output, which I use for further calculation.
aod
year month lat lon
2003 1 0.0 34.0 0.032000
2006 1 0.0 34.0 0.114000
2007 1 0.0 34.0 0.035000
2008 1 0.0 34.0 0.026000
2011 1 0.0 34.0 0.097000
2012 1 0.0 34.0 0.106333
2013 1 0.0 34.0 0.081000
2014 1 0.0 34.0 0.038000
2015 1 0.0 34.0 0.278500
2016 1 0.0 34.0 0.033000
2017 1 0.0 34.0 0.036333
2019 1 0.0 34.0 0.064333
2020 1 0.0 34.0 0.109500
But when I run the same code with tmp = df.loc[idx[:,1,0.0,32.75],:], a latitude/longitude pair for which no values are available in the input file, it does not skip the pair. Instead, it gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 822, in _getitem_lowerdim
return self._getitem_nested_tuple(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 906, in _getitem_nested_tuple
obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1157, in _getitem_axis
locs = labels.get_locs(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3347, in get_locs
indexer = _update_indexer(
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3296, in _update_indexer
raise KeyError(key)
KeyError: (slice(None, None, None), 1, 0.0, 32.75)
I tried replacing .loc with .iloc, but that raised a "too many indexers" error. I also tried solutions from the internet using .to_numpy(), .values and .as_matrix(), but nothing worked.
The idiomatic Pandas solution would be to write this as a groupby. Example:
# split df into groups by the keys month, lat, and lon
for index, tmp in df.groupby(['month','lat','lon']):
    # tmp is a dataframe where all rows have identical month, lat, and lon values
    # ... do something with the tmp dataframe ...
This has three benefits.
Speed. A groupby will be faster because it only needs to loop over the dataframe once, rather than searching the whole dataframe for everything matching the first group, then searching for the second group, etc.
Simplicity.
Robustness. From a robustness perspective, if a dataframe doesn't have, for example, any rows matching "month=1,lat=0.0,lon=32.75", then it will not create that group.
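The robustness point can be sketched on a tiny frame. The values below are a made-up subset shaped like the question's data (same index levels, only three rows), so treat it as an illustration rather than the asker's file.

```python
import pandas as pd

# Hypothetical miniature of the question's dataframe.
df = pd.DataFrame(
    {
        "year": [2003, 2006, 2003],
        "month": [1, 1, 2],
        "lat": [0.0, 0.0, 0.0],
        "lon": [34.0, 34.0, 34.0],
        "aod": [0.032, 0.114, 0.05],
    }
).set_index(["year", "month", "lat", "lon"])

group_means = {}
# Index level names can be used as groupby keys.
for (month, lat, lon), tmp in df.groupby(["month", "lat", "lon"]):
    # Only combinations that actually occur in the data are visited.
    group_means[(month, lat, lon)] = tmp["aod"].mean()

print(group_means)
```

A combination such as month=1, lat=0.0, lon=32.75 never shows up in the loop, so there is no KeyError to catch and no empty result to skip.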
More information: User guide on grouping
Remark about groupby aggregation functions
You'll also sometimes see groupby used with aggregation functions. For example, suppose you wanted to get the sum of each column within each group.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
a c
b
1.0 2 3
2.0 2 5
These aggregation functions are faster and easier to use, but sometimes I need something that is custom and unusual, so I'll write a loop. But if you're doing something common, like getting the average of a group, consider looking for an aggregation function.
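For the average mentioned above, a minimal sketch with a built-in aggregation might look like this. The column names follow the first question in this thread; the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical data with the same column names as the aod example.
df = pd.DataFrame(
    {
        "month": [1, 1, 2],
        "lat": [0.0, 0.0, 0.0],
        "lon": [34.0, 34.0, 34.0],
        "aod": [0.032, 0.114, 0.05],
    }
)

# One mean per (month, lat, lon) combination, without an explicit loop.
mean_aod = df.groupby(["month", "lat", "lon"])["aod"].mean()
print(mean_aod)
```

The result is a Series indexed by the group keys, so a single group can be read back with mean_aod.loc[(1, 0.0, 34.0)].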

How to fill NANs with values tending to zero until the next valid value?

While resampling a dataframe (df) as:
df = pd.DataFrame.from_dict({'2021-03-02': 442,
'2021-03-04': 520,
'2021-03-09': 390,
'2021-03-11': 442,
'2021-03-16': 520,
'2021-03-23': 520,
'2021-03-25': 520,
'2021-03-26': 442,}, orient='index',)
df.index = pd.to_datetime(df.index)
df = df.resample('30Min').asfreq()
How do I fill the NaNs with values that linearly tend to zero from their predecessor? (A plot of the result would look like a sawtooth.)
Is there a built-in method for this operation, or does a custom method need to be used in conjunction with .apply()?
Thank you for your time.
There's no built-in function for this. You can create one quickly like this:
# group of rows starting with non-nan
groups = df[0].groupby(df[0].notnull().cumsum())
# output
out = df[0].ffill().mul(1-groups.cumcount()/ groups.transform('size'))
# plot
out.plot()
Another option is to fill the nan just before a non-nan value with 0 using notnull and shift, then interpolate.
df.loc[df[0].notnull().shift(-1, fill_value=False), 0] = 0
df[0] = df[0].interpolate()
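To see what the groupby-based approach computes, here is a small sketch on a hand-made series (the values are assumptions chosen so the arithmetic is easy to follow, not the resampled data from the question).

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, np.nan, np.nan, 8.0, np.nan])

# Each group starts at a non-NaN value and includes the NaNs that follow it.
groups = s.groupby(s.notnull().cumsum())
position = groups.cumcount()      # 0, 1, 2, ... within each group
size = groups.transform("size")   # length of the group each row belongs to

# Scale the forward-filled value by the remaining fraction of the group.
out = s.ffill().mul(1 - position / size)
print(out.tolist())
```

Each run decays linearly from its starting value, and the last row of a run holds value/size rather than exactly zero; the drop to zero would fall on the slot taken by the next valid value.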

Problem with rolling window: ValueError: Length of passed values is 3, index implies 2

I am facing the following problem with Pandas and can't identify anything to be wrong.
churned_or_dormant_customers_by_month = jobs_by_customer_and_month.fillna(0).rolling(2, 2, axis='columns').apply(lambda window: 1 if not window[1] and window[0] else 0).sum(skipna=True)
The above gives the following traceback:
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "/usr/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 2059, in apply
return super().apply(
File "/usr/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 1388, in apply
return self._apply(
File "/usr/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 586, in _apply
result = np.apply_along_axis(calc, self.axis, values)
File "<__array_function__ internals>", line 5, in apply_along_axis
File "/usr/lib/python3.8/site-packages/numpy/lib/shape_base.py", line 379, in apply_along_axis
res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
File "/usr/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 576, in calc
return func(x, start, end, min_periods)
File "/usr/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 1414, in apply_func
values = Series(values, index=self.obj.index)
File "/usr/lib/python3.8/site-packages/pandas/core/series.py", line 313, in __init__
raise ValueError(
ValueError: Length of passed values is 3, index implies 2.
I'm sure this is not a bug and that I am instead making a silly mistake using the rolling window function. I can't figure out what the mistake is though, and I could swear that this worked with a previous version of Pandas. Which reminds me, the version I am running this code on is 1.1.0rc0.
Example data (available in pickle format) looks like this:
>>> jobs_by_customer_and_month
2019-1 2019-2 2019-3
1.0 1.0 1.0 1.0
2.0 2.0 2.5 2.1
In any version below 0.23, the window values are always passed as an ndarray. The raw option of rolling apply was introduced in version 0.23. From version 0.23 up to (but not including) 1.0.0, raw effectively defaults to True, but leaving it unspecified produces a warning:
C:\Python\Python37-32\Scripts\ipython:3: FutureWarning: Currently, 'apply' passes
the values as ndarrays to the applied function. In the future, this will change
to passing it as Series objects. You need to specify 'raw=True' to keep the current
behaviour, and you can pass 'raw=False' to silence this warning
You don't see any error or warning on your old pandas, so I guess your old version is < 0.23.
From version 1.0.0 onwards, rolling apply passes the values as a Series by default (i.e. raw=False).
As for your error, I believe it is a bug. I checked version 0.24 and the bug is already present there, so it was probably introduced together with the option of passing the values as a Series to the applied function. It only appears when rolling along columns (in other words, axis=1).
When rolling along axis=1 with the values passed as a Series, each Series is a row of df. In your case it has length 3, i.e. df.shape[1]:
df:
2019-1 2019-2 2019-3
1.0 1.0 1.0 1.0
2.0 2.0 2.5 2.1
In [13]: df.loc[1.0].size
Out[13]: 3
In [14]: df.shape[1]
Out[14]: 3
Just look at your error trace-back above:
...
File "/usr/lib/python3.8/site-packages/pandas/core/window/rolling.py", line 1414, in apply_func
values = Series(values, index=self.obj.index)
...
It tries to construct a series from values and use self.obj.index as index. self is the rolling object and obj is its attribute. Let's check what the value of obj is:
In [17]: (df.fillna(0)
...: .rolling(window=3, axis='columns').__dict__
...: )
Out[17]:
{'obj': 2019-1 2019-2 2019-3
1.0 1.0 1.0 1.0
2.0 2.0 2.5 2.1,
'on': None,
'closed': None,
'window': 3,
'min_periods': None,
'center': False,
'win_type': None,
'win_freq': None,
'axis': 1,
'_cache': {'_on': Index(['2019-1', '2019-2', '2019-3'], dtype='object'),
'is_datetimelike': False},
'_numba_func_cache': {}}
So, self.obj is the df itself. That means self.obj.index is df.index and its length is 2
In [19]: df.index.size
Out[19]: 2
The Series constructor checks the length of the data against the length of the index (inside series.py):
...
if index is None:
    if not is_list_like(data):
        data = [data]
    index = ibase.default_index(len(data))
elif is_list_like(data):
    # a scalar numpy array is list-like but doesn't
    # have a proper length
    try:
        if len(index) != len(data):
            raise ValueError(
                f"Length of passed values is {len(data)}, "
                f"index implies {len(index)}."
            )
    except TypeError:
        pass
...
As you can see, the length of each row is 3 and the length of df.index is 2, so it throws the ValueError.
It is a bug, so in the meantime you need to pass raw=True to rolling apply to work around the issue.
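A minimal sketch of the raw=True form follows. Two things here are assumptions rather than part of the answer: the sample numbers are invented so that some customers actually go dormant, and the frame is transposed so the window rolls down the rows instead of using the axis argument.

```python
import pandas as pd

# Hypothetical data: rows are customers, columns are months.
jobs = pd.DataFrame(
    [[1.0, 1.0, 0.0], [2.0, 0.0, 2.1]],
    index=[1.0, 2.0],
    columns=["2019-1", "2019-2", "2019-3"],
)


def churned(window):
    # With raw=True the window arrives as a plain NumPy array.
    return 1.0 if window[0] and not window[1] else 0.0


# Transpose so months run down the rows, roll over pairs of months,
# then transpose back to the original layout.
flags = jobs.fillna(0).T.rolling(2, min_periods=2).apply(churned, raw=True).T
churned_by_month = flags.sum(skipna=True)
print(churned_by_month)
```

The first month has no previous month to compare with, so its flags are NaN and its total is 0 after skipna.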
The solution to my issue was to use the parameter raw=True although I am confused as to why this should solve the issue. The documentation for pandas.core.window.rolling.Rolling.apply states
Must produce a single value from an ndarray input if raw=True or a
single value from a Series if raw=False.
So it seems like the function returning a single value should work either way. This looks like there is a bug either in how Rolling.apply works or else in the documentation

Strange problem when saving to excel pandas

I have a problem writing to Excel. I have 15 columns in my dataframe. I want to write only 7 of them to Excel and, in the process, use different names for the headers.
Here is my code
cols = ['SN', 'Date_x','Material_x', 'Batch_x', 'Qty_x', 'Booked_x', 'State_x']
headers = ['SN', 'Date', 'Material', 'Batch', 'Qty', 'Booked', 'State']
df.style.apply(highlight_changes_ivt2, axis=None).to_excel(writer, columns =cols, header=headers, sheet_name="temp", index = False)
But I have the following errors
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/style.py", line 235, in to_excel
engine=engine,
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 735, in write
freeze_panes=freeze_panes,
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/excel/_xlsxwriter.py", line 214, in write_cells
for cell in cells:
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 684, in get_formatted_cells
for cell in itertools.chain(self._format_header(), self._format_body()):
File "/home/week/anaconda3/envs/SC/lib/python3.7/site-packages/pandas/io/formats/excel.py", line 513, in _format_header_regular
f"Writing {len(self.columns)} cols but got {len(self.header)} "
ValueError: Writing 15 cols but got 7 aliases
I tried debugging by setting pdb.set_trace():
df.style.apply(highlight_changes_ivt2, axis=None).to_excel(writer, columns =cols, header=headers, sheet_name="temp", index = False)
(Pdb) df.columns
Index(['SN', 'Status_x', 'Material_x', 'Batch_x', 'Date_x', 'Quantity_x',
'Booked_x', 'DiffQty_x', 'Status_y', 'Material_y', 'Batch_y',
'Date_y', 'Quantity_y', 'Booked_y', 'DiffQty_y'],
dtype='object')
(Pdb)
This code runs fine on my home laptop though, so I'm wondering what's wrong. The only difference is the Python version: 3.7 here and 3.8 at home.
Thanks
Let me elaborate my idea in the comment by an example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(16).reshape(4,-1))
# this is the reference dataframe
np.random.seed(1)
ref_df = pd.DataFrame(np.random.randint(1,10,(4,4)))
# this is the function
def highlight(col, ref_df=None):
    return ['background-color: yellow' if c>r else ''
            for c,r in zip(col, ref_df[col.name])]
# this works
df[[0,1,3]].style.apply(highlight, ref_df=ref_df).to_excel('style.xlsx', header=list('abc'))
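Applied to the question's column names, the same idea (pick the columns first, then style and write) could be sketched as below. The frame is a hypothetical stand-in with only a few of the 15 columns, and the styling and to_excel steps are left out so the snippet runs without an Excel engine.

```python
import pandas as pd

# Hypothetical stand-in for the merged dataframe from the question.
df = pd.DataFrame(
    {
        "SN": [1, 2],
        "Date_x": ["2020-01-01", "2020-01-02"],
        "Material_x": ["A", "B"],
        "Date_y": ["2020-02-01", "2020-02-02"],
        "Material_y": ["A", "C"],
    }
)

cols = ["SN", "Date_x", "Material_x"]
headers = ["SN", "Date", "Material"]

# Select the wanted columns and rename them before any styling happens.
subset = df[cols].rename(columns=dict(zip(cols, headers)))
print(subset)
```

Since subset already has the final columns and names, subset.style.apply(...).to_excel(...) should no longer need the columns or header arguments; the highlight function would have to work on the renamed subset, though.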

TypeError: conversion from Series to Decimal is not supported

Any ideas on how to convert a series (column) from float to decimal? I am on Python 3.6. I have read the Decimal documentation, but it offers no help.
df['rate'].dtype
Out[158]: dtype('float64')
Decimal(df['rate'])
Traceback (most recent call last):
File "C:\Users\user\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-159-88710e11f7cd>", line 1, in <module>
Decimal(df['rate'])
TypeError: conversion from Series to Decimal is not supported
You can't cast like this, you will need to do
df['rate'] = df['rate'].apply(Decimal)
pandas does support Decimal but you can't cast like that
Example:
In[28]:
from decimal import *
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df['a'] = df['a'].apply(Decimal)
df
Out[28]:
a b c
0 -1.6122557830197199457700207858579233288764953... -1.865243 -0.836893
1 0.96962430214434858211092205237946473062038421... -0.105823 -0.842267
2 -0.9113389075755260471112251252634450793266296... -0.351389 -0.183489
3 1.22765470106414120721183280693367123603820800... -1.232627 -0.067909
4 -0.0376339704393285762185072940155805554240942... -0.445606 -0.080623
the dtype will still show object but the dtype really is Decimal:
In[29]:
type(df['a'].iloc[0])
Out[29]: decimal.Decimal
If you use astype(Decimal) it will look like it worked but it doesn't:
In[38]:
df['b'].astype(Decimal)
Out[38]:
0 -1.86524
1 -0.105823
2 -0.351389
3 -1.23263
4 -0.445606
Name: b, dtype: object
If we try to assign this back:
In[39]:
df['b'] = df['b'].astype(Decimal)
type(df['b'].iloc[0])
Out[39]: float
As pointed out by @JonClements, and I agree, it is ill-advised to use non-native numpy types, as you lose any vectorisation, in particular with arithmetic operations. Additionally, the dtype may be converted when you perform some operation on it, which then loses your original intention.
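The long digit strings in the example above come from converting the binary float exactly. If short decimal values are wanted instead, one possible variant (my assumption, not part of the answer) is to go through the float's string form:

```python
from decimal import Decimal

import pandas as pd

df = pd.DataFrame({"rate": [0.1, 0.25]})

# Exact conversion: keeps every binary digit of the float.
exact = df["rate"].apply(Decimal)

# Via str: uses the short printed form of the float.
rates = df["rate"].apply(lambda x: Decimal(str(x)))

print(exact.iloc[0])
print(rates.iloc[0])
print(rates.dtype)
```

Either way the column ends up with object dtype holding decimal.Decimal values, so the caveat about losing vectorisation still applies.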
