Pandas DataFrame Apply Efficiency - python-3.x

I have a dataframe to which I want to add a status column indicating whether each value has a match in another dataframe. I have the following code, which works:
df1['NewColumn'] = df1['ComparisonColumn'].apply(lambda x: 'Match' if any(df2.ComparisonColumn == x) else ('' if x is None else 'Missing'))
I know the line is ugly, and I get the impression that it's inefficient. Can you suggest a better way to make this comparison?

You can use np.where, isin, and isnull:
Create some dummy data:
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({'ComparisonColumn': np.random.randint(10, 20, 20)})
df.iloc[4] = np.nan  # create missing data
df2 = pd.DataFrame({'ComparisonColumn': np.random.randint(15, 30, 20)})
Do matching with np.where:
df['NewColumn'] = np.where(df.ComparisonColumn.isin(df2.ComparisonColumn), 'Matched',
                  np.where(df.ComparisonColumn.isnull(), 'Missing', ''))
Output:
ComparisonColumn NewColumn
0 12.0
1 12.0
2 16.0 Matched
3 11.0
4 NaN Missing
5 19.0 Matched
6 16.0 Matched
7 11.0
8 10.0
9 11.0
10 19.0 Matched
11 10.0
12 10.0
13 19.0 Matched
14 13.0
15 14.0
16 10.0
17 10.0
18 14.0
19 11.0
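An equivalent formulation of the same logic (a sketch) avoids the nesting by using np.select, which evaluates the conditions in order and takes the first match:
conditions = [df.ComparisonColumn.isin(df2.ComparisonColumn),  # value found in df2
              df.ComparisonColumn.isnull()]                    # value is missing
df['NewColumn'] = np.select(conditions, ['Matched', 'Missing'], default='')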

Related

Get value from grouped data frame maximum in another column

Return and assign the value of column c based on the maximum value of column b in a grouped data frame, grouped on column a.
I'm trying to return and assign (ideally into a separate column) the value of column c taken from the row with the maximum value in column b (which holds dates), within each group on column a, but I keep getting errors. The data frame is relatively large, with over 16,000 rows.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['a','a','a','b','b','b','c','c','c'],
                   'b': ['2008-11-01', '2022-07-01', '2017-02-01', '2017-02-01', '2018-02-01', '2008-11-01', '2014-11-01', '2008-11-01', '2022-07-01'],
                   'c': [np.nan, 6, 8, 2, 1, np.nan, 6, np.nan, 7]})
df['b'] = pd.to_datetime(df['b'])
df
a b c
0 a 2008-11-01 NaN
1 a 2022-07-01 6.0
2 a 2017-02-01 8.0
3 b 2017-02-01 2.0
4 b 2018-02-01 1.0
5 b 2008-11-01 NaN
6 c 2014-11-01 6.0
7 c 2008-11-01 NaN
8 c 2022-07-01 7.0
I want the following result:
a b c d
0 a 2008-11-01 NaN 8.0
1 a 2022-07-01 6.0 8.0
2 a 2017-02-01 8.0 8.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
I've tried the following, but a grouped series ('SeriesGroupBy') cannot be cast to another dtype:
df['d'] = df.groupby('a').b.astype(np.int64).max()[df.c].reset_index()
I've set column b to int before running the above code, but get an error saying none of the values in c are in the index.
None of [Index([....values in column c...\n dtype = 'object', name = 'a', length = 16280)] are in the [index]
I've tried expanding out the command, but it errors (which I just realized relates to no alternative value being given to where - but I don't know how to give the correct value):
grouped = df.groupby('a')
df['d'] = grouped['c'].where(grouped['b'] == grouped['b'].max())
I've also tried using solutions provided here, here, and here.
Any help or direction would be appreciated.
I'm assuming the first three values should be 6.0 (because the maximum date for group a is 2022-07-01, with value 6):
df["d"] = df["a"].map(
df.groupby("a").apply(lambda x: x.loc[x["b"].idxmax(), "c"])
)
print(df)
Prints:
a b c d
0 a 2008-11-01 NaN 6.0
1 a 2022-07-01 6.0 6.0
2 a 2017-02-01 8.0 6.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
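Since the frame in the question has over 16,000 rows, an apply-free alternative may be worth trying. A sketch using sort_values and drop_duplicates (same assumption about group a as above): sort by date so the latest row ends up last in each group, keep that row, and map c back onto the groups.
# c from the row with the latest date in each group 'a'
last_c = (df.sort_values('b')
            .drop_duplicates('a', keep='last')
            .set_index('a')['c'])
df['d'] = df['a'].map(last_c)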

How to dynamically change the shift periods if value is NaN in Pandas?

I have the following DF (simplified):
value
1
4
2
NaN
9
8
7
For example, when I apply the function shift(3)+current_row I get:
value result
1 NaN
4 NaN
2 NaN
NaN NaN
9 13
8 10
7 NaN
But what I need is: if a value is NaN, try again with periods=N-1:
value result
1 1 shift(3)=NAN -> shift(2)=NAN -> shift(1)=NAN -> current_value
4 5 shift(3)=NAN -> shift(2)=NAN -> shift(1)=1 + current_value
2 3 shift(3)=NAN -> shift(2)=1 + current_value
NaN 1 shift(3)=1 + if current_value == NaN then 0 else current_value
9 13 shift(3)=4 + current_value
8 10 shift(3)=2 + current_value
7 16 shift(3)=NaN -> shift(2)=9 + current_value
Ideally in a pythonic way, if possible.
Thanks and regards.
Shift the value column for periods in the range 3..1, build a dataframe from these shifted series, backfill along the index axis to fill the NaN values, then use iloc to select the first row and add it to the value column:
s = pd.DataFrame(df['value'].shift(3 - i) for i in range(3)).bfill().iloc[0]
df['result'] = df['value'].add(s, fill_value=0)
value result
0 1.0 1.0
1 4.0 5.0
2 2.0 3.0
3 NaN 1.0
4 9.0 13.0
5 8.0 10.0
6 7.0 16.0

TypeError: No matching signature found, when trying to pivot dataframe

I have a dataframe,
df
ID UD MD TD
1 1 3 13.0 115
2 1 4 14.0 142
3 2 3 13.0 156
4 2 8 13.5 585
I get the error TypeError: No matching signature found when I try:
df_p = df.pivot('ID','UD','MD')
What could be wrong?
here is the data
df = pd.DataFrame({'ID': [1,1,2,2], 'UD': [3,4,3,8], 'MD':[13,14,13,13.5], 'TD':[115,142,156,585]})
By default (with python 3.9 here) the dtypes are as follows:
ID int64
UD int64
MD float64
TD int64
dtype: object
and you can run df.pivot('ID','UD','MD')
UD 3 4 8
ID
1 13.0 14.0 NaN
2 13.0 NaN 13.5
but if by chance you have read the data as float16
df.astype({'MD':np.float16}).pivot('ID','UD','MD')
you would get the TypeError: No matching signature found
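If you cannot control how the data was read, a sketch of a workaround is to upcast before pivoting (note that recent pandas versions also require keyword arguments for pivot):
# Upcast MD back to float64, then pivot with explicit keywords.
df.astype({'MD': np.float64}).pivot(index='ID', columns='UD', values='MD')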

Looping through a pandas DataFrame accessing previous elements

I have a DataFrame with two columns, FirstColumn and SecondColumn.
How do I create a new column containing the correlation coefficient of the two columns over the previous 5 rows, computed row by row?
For example, the 5th row would hold the correlation coefficient of the two columns over rows 1-5, the 6th row over rows 2-6, and so on.
Additionally, what is the most efficient way to loop through a DataFrame when each row needs access to previous rows?
FirstColumn SecondColumn
0 2 1.0
1 3 3.0
2 4 4.0
3 5 5.0
4 6 2.0
5 7 6.0
6 2 2.0
7 3 3.0
8 5 9.0
9 3 2.0
10 2 3.0
11 4 2.0
12 2 2.0
13 4 2.0
14 2 4.0
15 5 3.0
16 3 1.0
You can do:
df["corr"]=df.rolling(5, min_periods=1).corr()["FirstColumn"].loc[(slice(None), "SecondColumn")]
Outputs:
FirstColumn SecondColumn corr
0 2.0 1.0 NaN
1 3.0 3.0 1.000000
2 4.0 4.0 0.981981
3 5.0 5.0 0.982708
4 6.0 2.0 0.400000
5 7.0 6.0 0.400000
6 2.0 2.0 0.566707
7 3.0 3.0 0.610572
8 5.0 9.0 0.426961
9 3.0 2.0 0.737804
10 2.0 3.0 0.899659
11 4.0 2.0 0.698774
12 2.0 2.0 0.716769
13 4.0 2.0 -0.559017
14 2.0 4.0 -0.612372
15 5.0 3.0 -0.250000
16 3.0 1.0 -0.067267
You can use the shift(n) method to access the element n rows back. One approach would be to create "lag" columns, like so:
for i in range(5):
    df['FirstCol_lag' + str(i)] = df.FirstColumn.shift(i)
Then you can do your formula operations on a row-by-row basis, e.g. (with foo as a placeholder for whatever row-wise statistic you need):
df['R2'] = foo([df.FirstCol_lag1, ... df.SecondCol_lag5])
The most efficient approach would be to not use a loop and do it this way. But if the data is very large this may not be feasible. I think the iterrows() function is pretty efficient too, you can test which is faster if you really care. For that you'd have to offset the row index manually and it would take more code.
Still, you'll have to be careful about handling NaNs, because the shifted values will be null for the first n rows of your dataframe.
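As a concrete sketch of this approach (the 5-period window, the lag-column names, and the use of a plain Pearson correlation as the row-wise statistic are all assumptions here):
import numpy as np

n = 5
xcols, ycols = [], []
for i in range(n):  # lag 0 is the current row itself
    df[f'FirstCol_lag{i}'] = df.FirstColumn.shift(i)
    df[f'SecondCol_lag{i}'] = df.SecondColumn.shift(i)
    xcols.append(f'FirstCol_lag{i}')
    ycols.append(f'SecondCol_lag{i}')

# Row-wise Pearson correlation over the n lagged pairs.
x = df[xcols].to_numpy()
y = df[ycols].to_numpy()
xm = x - x.mean(axis=1, keepdims=True)
ym = y - y.mean(axis=1, keepdims=True)
df['R2'] = (xm * ym).sum(axis=1) / np.sqrt((xm**2).sum(axis=1) * (ym**2).sum(axis=1))
# The first n-1 rows come out NaN, since their lag columns contain NaNs.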

How do I fill these `NaN` values properly?

Here's my original dataframe with NaN values which I'm trying to fill:
https://prnt.sc/i40j33
If I use df.interpolate(axis=1) to fill up the NaN values, only some of the rows fill up properly with a number.
For example:
https://prnt.sc/i40mgq
As you can see in the screenshot, the cell at column 1981, row 3, which had a NaN value, has been filled with a proper number. How do I fill the rest of the NaNs like that as well?
Using DataFrame.interpolate()
In your case it is failing because there are no columns to the left, and therefore the interpolate method doesn't know what to interpolate it to: missing_value = (left_value + right_value)/2
So you could, for example, insert a column of all 0's to the left (if you would like to impute the missing values in the first column with half of the next value), like so:
df.insert(loc=0, column='allZeroes', value=0)
After this, you can interpolate as you were doing and then remove the helper column; a sketch of the full sequence follows.
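Putting the steps together (a sketch; allZeroes is just a throwaway helper name):
df.insert(loc=0, column='allZeroes', value=0)  # give the first column a left neighbour
df = df.interpolate(axis=1)                    # row-wise interpolation can now fill it
df = df.drop(columns='allZeroes')              # discard the helper column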
General missing value imputation
Either use df.fillna('DEFAULT-VALUE') as Alex mentioned in the comments to the question. Docs here
or do something like:
df.loc[df.my_col.isnull(), 'my_col'] = 'DEFAULT-VALUE'
I'd recommend using fillna, since you can use methods such as forward fill (ffill) -- impute the missings with the previous value -- and other similar methods.
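For instance, a minimal forward-fill sketch along the rows (axis=1 assumed, to match the row-wise interpolation above):
# Carry each previous year's value to the right; df.ffill(axis=1) is equivalent.
df = df.fillna(method='ffill', axis=1)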
It seems like you might want to interpolate on axis=0, column-wise:
>>> df = pd.DataFrame(np.arange(35, dtype=float).reshape(5, 7),
...                   columns=[1951, 1961, 1971, 1981, 1991, 2001, 2011],
...                   index=range(0, 5))
>>> df.iloc[1:3, 0] = np.nan
>>> df.iloc[3, 3] = np.nan
>>> df.interpolate(axis=0)
1951 1961 1971 1981 1991 2001 2011
0 0.0 1.0 2.0 3.0 4.0 5.0 6.0
1 7.0 8.0 9.0 10.0 11.0 12.0 13.0
2 14.0 15.0 16.0 17.0 18.0 19.0 20.0
3 21.0 22.0 23.0 24.0 25.0 26.0 27.0
4 28.0 29.0 30.0 31.0 32.0 33.0 34.0
Currently you're interpolating row-wise. NaNs that "begin" a Series aren't padded by a value on either side, making interpolation impossible for them.
Update: pandas is adding some more optionality for this in v 0.23.0.
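For instance (assuming the v0.23+ interpolate API), the new limit_area parameter controls which NaNs are eligible:
# Only fill NaNs that are surrounded by valid values on both sides.
df.interpolate(axis=1, limit_area='inside')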
