pandas groupby apply does not broadcast into a DataFrame - python-3.x

Using pandas 0.19.0. The following code will reproduce the problem:
In [1]: import pandas as pd
        import numpy as np

In [2]: df = pd.DataFrame({'c1': list('AAABBBCCC'),
                           'c2': list('abcdefghi'),
                           'c3': np.random.randn(9),
                           'c4': np.arange(9)})
        df
Out[2]:
  c1 c2        c3  c4
0  A  a  0.819618   0
1  A  b  1.764327   1
2  A  c -0.539010   2
3  B  d  1.430614   3
4  B  e -1.711859   4
5  B  f  1.002522   5
6  C  g  2.257341   6
7  C  h  1.338807   7
8  C  i -0.458534   8
In [3]: def myfun(s):
            """Function does practically nothing"""
            req = s.values
            return pd.Series({'mean': np.mean(req),
                              'std': np.std(req),
                              'foo': 'bar'})
In [4]: res = df.groupby(['c1', 'c2'])['c3'].apply(myfun)
        res.head(10)

Out[4]:
c1  c2
A   a   foo         bar
        mean   0.819618
        std           0
    b   foo         bar
        mean    1.76433
        std           0
    c   foo         bar
        mean   -0.53901
        std           0
B   d   foo         bar
And, of course, I expect this:
Out[4]:
      foo      mean  std
c1 c2
A  a  bar  0.819618    0
   b  bar   1.76433    0
   c  bar  -0.53901    0
B  d  bar   1.43061    0
When a function applied to a Series or a DataFrame returns a Series, pandas automatically broadcasts it into a DataFrame. Why is the behavior different for functions applied to groups?
I am looking for an answer that produces the desired output. Bonus points for explaining the difference in behavior among pandas.Series.apply, pandas.DataFrame.apply, and pandas.core.groupby.GroupBy.apply.

An easy fix would be to unstack:
import pandas as pd
import numpy as np

df = pd.DataFrame({'c1': list('AAABBBCCC'),
                   'c2': list('abcdefghi'),
                   'c3': np.random.randn(9),
                   'c4': np.arange(9)})

def myfun(s):
    """Function does practically nothing"""
    req = s.values
    return pd.Series({'mean': np.mean(req),
                      'std': np.std(req),
                      'foo': 'bar'})

res = df.groupby(['c1', 'c2'])['c3'].apply(myfun)
res.unstack()
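The reason for the long shape: when the applied function returns a Series, GroupBy.apply concatenates the per-group results into one long Series whose innermost index level is the returned Series' index, and unstack() pivots that level back into columns. A hedged alternative sketch that avoids the reshape entirely, using agg for the numeric parts and assigning the constant column afterwards:
res2 = (df.groupby(['c1', 'c2'])['c3']
          .agg(['mean', 'std'])   # DataFrame with 'mean' and 'std' columns
          .assign(foo='bar'))     # add the constant column
# note: pandas 'std' uses ddof=1, so one-row groups give NaN here,
# unlike np.std in myfun, which returns 0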

Related

How do I add an extra column to my dataframe which is a copy of the index? [duplicate]

I have a dataframe which I want to plot with matplotlib, but the index column is the time and I cannot plot it.
This is the dataframe (df3):
But when I try the following:
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
I get an error, as expected:
KeyError: 'YYYY-MO-DD HH-MI-SS_SSS'
So what I want to do is add an extra column to my dataframe (named 'Time') which is just a copy of the index column.
How can I do it?
This is the entire code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Importing the csv file into df
df = pd.read_csv('university2.csv', sep=";", skiprows=1)
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
                                               format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
#Add Magnetic Magnitude Column
df['magnetic_mag'] = np.sqrt(df['MAGNETIC FIELD X (μT)']**2 + df['MAGNETIC FIELD Y (μT)']**2 + df['MAGNETIC FIELD Z (μT)']**2)
#Subtract Earth's Average Magnetic Field from 'magnetic_mag'
df['magnetic_mag'] = df['magnetic_mag'] - 30
#Copy interesting values
df2 = df[['ATMOSPHERIC PRESSURE (hPa)',
          'TEMPERATURE (C)', 'magnetic_mag']].copy()
#Hourly Average and Standard Deviation for interesting values
df3 = df2.resample('H').agg(['mean','std'])
df3.columns = [' '.join(col) for col in df3.columns]
df3.reset_index()
plt.plot(df3['magnetic_mag mean'], df3['YYYY-MO-DD HH-MI-SS_SSS'], label='FDI')
Thank you !!
I think you need reset_index:
df3 = df3.reset_index()
A possible solution, although inplace is generally not considered good practice:
df3.reset_index(inplace=True)
But if you need a new column, use:
df3['new'] = df3.index
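If the new column should carry a specific name, such as the 'Time' mentioned in the question, a small sketch using rename_axis:
df3 = df3.rename_axis('Time').reset_index()   # index becomes a column named 'Time'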
I think you can use read_csv better:
df = pd.read_csv('university2.csv',
                 sep=";",
                 skiprows=1,
                 index_col='YYYY-MO-DD HH-MI-SS_SSS',
                 parse_dates=['YYYY-MO-DD HH-MI-SS_SSS']) #if it doesn't work, use pd.to_datetime
And then omit:
#Changing datetime
df['YYYY-MO-DD HH-MI-SS_SSS'] = pd.to_datetime(df['YYYY-MO-DD HH-MI-SS_SSS'],
                                               format='%Y-%m-%d %H:%M:%S:%f')
#Set index from column
df = df.set_index('YYYY-MO-DD HH-MI-SS_SSS')
EDIT: If the MultiIndex or Index comes from a groupby operation, possible solutions are:
df = pd.DataFrame({'A': list('aaaabbbb'),
                   'B': list('ccddeeff'),
                   'C': range(8),
                   'D': range(4, 12)})
print (df)
A B C D
0 a c 0 4
1 a c 1 5
2 a d 2 6
3 a d 3 7
4 b e 4 8
5 b e 5 9
6 b f 6 10
7 b f 7 11
df1 = df.groupby(['A','B']).sum()
print (df1)
C D
A B
a c 1 9
d 5 13
b e 9 17
f 13 21
Add parameter as_index=False:
df2 = df.groupby(['A','B'], as_index=False).sum()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
Or add reset_index:
df2 = df.groupby(['A','B']).sum().reset_index()
print (df2)
A B C D
0 a c 1 9
1 a d 5 13
2 b e 9 17
3 b f 13 21
You can directly access the index and plot it; the following is an example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
#Get index in horizontal axis
plt.plot(df.index, df[0])
plt.show()
#Get index in vertical axis
plt.plot(df[0], df.index)
plt.show()
You can also use eval to achieve this:
In [2]: df = pd.DataFrame({'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')}, index=list('ABCDE'))
In [3]: df
Out[3]:
num date
A 0 2022-06-30
B 1 2022-07-01
C 2 2022-07-02
D 3 2022-07-03
E 4 2022-07-04
In [4]: df.eval('index_copy = index')
Out[4]:
num date index_copy
A 0 2022-06-30 A
B 1 2022-07-01 B
C 2 2022-07-02 C
D 3 2022-07-03 D
E 4 2022-07-04 E
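Note that eval returns a new DataFrame rather than modifying df in place, so assign the result back if you want to keep the column:
df = df.eval('index_copy = index')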

Do I use a loop, df.melt or df.explode to achieve a flattened dataframe?

Can anyone help with some code that will achieve the following transformation? I have tried variations of df.melt and df.explode, and also a looping statement, but I only get errors. I think it might need nesting, but I don't have the experience to do so.
index A B C D
0 X d 4 2
1 Y b 5 2
Column D represents the frequency of column C, i.e. how many times each row should appear.
desired output is:
index A B C
0 X d 4
1 X d 4
2 Y b 5
3 Y b 5
If you want to repeat rows, why not use index.repeat?
import pandas as pd
#create a sample dataframe (D values chosen for illustration)
df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [3, 2]}, columns=list("ABCD"))
df = df.reindex(df.index.repeat(df["D"])).drop(columns="D").reset_index(drop=True)
print(df)
Sample output
A B C
0 X d 4
1 X d 4
2 X d 4
3 Y b 5
4 Y b 5
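Since the question also asks about df.explode, here is a hedged sketch of that route: turn C into a list column whose length equals D, then explode, which repeats the other columns once per list element (requires pandas 0.25+):
import pandas as pd

df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [3, 2]})
#build lists of length D, then explode; the remaining columns
#are repeated alongside each list element
out = (df.assign(C=[[c] * d for c, d in zip(df["C"], df["D"])])
         .explode("C")
         .drop(columns="D")
         .reset_index(drop=True))
print(out)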

Easily generate edge list from specific structure using pandas

This is a question about how to do things properly with pandas (I use version 1.0).
Let's say I have a DataFrame of missions, each containing an origin and one or more destinations:
mid from to
0 0 A [C]
1 1 A [B, C]
2 2 B [B]
3 3 C [D, E, F]
E.g., for mission mid=1, people will travel from A to B, then from B to C, and finally from C back to A. Note that I have no control over the data model of the input DataFrame.
I would like to compute metrics on each travel of the mission. The expected output would be exactly:
tid mid from to
0 0 0 A C
1 1 0 C A
2 2 1 A B
3 3 1 B C
4 4 1 C A
5 5 2 B B
6 6 2 B B
7 7 3 C D
8 8 3 D E
9 9 3 E F
10 10 3 F C
I have found a way to achieve my goal. Please find below the MCVE:
import pandas as pd
# Input:
df = pd.DataFrame(
    [["A", ["C"]],
     ["A", ["B", "C"]],
     ["B", ["B"]],
     ["C", ["D", "E", "F"]]],
    columns=["from", "to"]
).reset_index().rename(columns={'index': 'mid'})
# Create chain:
df['chain'] = df.apply(lambda x: list(x['from']) + x['to'] + list(x['from']), axis=1)
# Explode chain:
df = df.explode('chain')
# Shift to create travel:
df['end'] = df.groupby("mid")["chain"].shift(-1)
# Remove extra row, clean, reindex and rename:
df = df.dropna(subset=['end']).reset_index(drop=True).reset_index().rename(columns={'index': 'tid'})
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})
My question is: is there a better/easier way to do this with pandas? By better I mean not necessarily more performant (although that would be welcome, of course), but more readable and intuitive.
Your operation is basically explode and concat:
# turn the series of lists into a single series
tmp = df[['mid','to']].explode('to')
# new `from` is the concatenation of `from` and the list
df1 = pd.concat((df[['mid','from']],
                 tmp.rename(columns={'to':'from'}))
                ).sort_index()
# new `to` is the concatenation of the list and the original `from`
df2 = pd.concat((tmp,
                 df[['mid','from']].rename(columns={'from':'to'}))
                ).sort_index()
df1['to'] = df2['to']
Output:
mid from to
0 0 A C
0 0 C A
1 1 A B
1 1 B C
1 1 C A
2 2 B B
2 2 B B
3 3 C D
3 3 D E
3 3 E F
3 3 F C
If you don't mind re-constructing the entire DataFrame, you can clean it up a bit with np.roll to get the pairs of destinations, and then assign the value of mid based on the number of trips (the length of each sublist in l):
import pandas as pd
import numpy as np
from itertools import chain
l = [[fr]+to for fr,to in zip(df['from'], df['to'])]
df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
                    columns=['from', 'to'])
         .assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))
from to mid
0 A C 0
1 C A 0
2 A B 1
3 B C 1
4 C A 1
5 B B 2
6 B B 2
7 C D 3
8 D E 3
9 E F 3
10 F C 3
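A further hedged sketch, assuming Python 3.10+ for itertools.pairwise, that spells out the legs of each round trip directly:
import pandas as pd
from itertools import pairwise

df = pd.DataFrame({'mid': [0, 1, 2, 3],
                   'from': ['A', 'A', 'B', 'C'],
                   'to': [['C'], ['B', 'C'], ['B'], ['D', 'E', 'F']]})
# each chain is origin, destinations, then back to the origin;
# pairwise yields the consecutive (from, to) legs
legs = [(mid, a, b)
        for mid, fr, to in zip(df['mid'], df['from'], df['to'])
        for a, b in pairwise([fr, *to, fr])]
out = pd.DataFrame(legs, columns=['mid', 'from', 'to'])
print(out)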

Python/Pandas return column and row index of found string

I've searched previous answers relating to this, but those answers seem to utilize numpy because the array contains numbers. I am trying to search for a keyword ('Timeframe') in a sentence in a dataframe, where the full sentence is 'Timeframe for wave in ____', and I would like to return the column and row index. For example:
df.iloc[34,0]
returns the string I am looking for, but I am avoiding hard-coding it for dynamic reasons. Is there a way to return [34, 0] when I search the dataframe for the keyword 'Timeframe'?
EDIT:
To check the index you need str.contains with boolean indexing, but then there are 3 possible outcomes:
df = pd.DataFrame({'A':['Timeframe for wave in ____', 'a', 'c']})
print (df)
A
0 Timeframe for wave in ____
1 a
2 c
def check(val):
    a = df.index[df['A'].str.contains(val)]
    if a.empty:
        return 'not found'
    elif len(a) > 1:
        return a.tolist()
    else:
        #only one value - return scalar
        return a.item()
print (check('Timeframe'))
0
print (check('a'))
[0, 1]
print (check('rr'))
not found
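One caveat: str.contains interprets the pattern as a regular expression by default, so for a literal keyword search it is safer to pass regex=False:
a = df.index[df['A'].str.contains(val, regex=False)]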
Old solution:
It seems you need numpy.where to check for the value Timeframe:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 'Timeframe'],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 Timeframe 0 4 b
a = np.where(df.values == 'Timeframe')
print (a)
(array([5], dtype=int64), array([2], dtype=int64))
b = [x[0] for x in a]
print (b)
[5, 2]
In case you have multiple columns to look into, you can use the following code example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3,4],["a","b","Timeframe for wave in____","d"],[5,6,7,8]])
mask = np.column_stack([df[col].str.contains("Timeframe", na=False) for col in df])
find_result = np.where(mask)
result = [find_result[0][0], find_result[1][0]]
Then output for df and result would be:
>>> df
0 1 2 3
0 1 2 3 4
1 a b Timeframe for wave in____ d
2 5 6 7 8
>>> result
[1, 2]
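If there can be several hits and you want every (row, column) position rather than only the first, a short sketch building on the same mask:
#every (row_position, column_position) where the mask is True
hits = list(zip(*np.where(mask)))
#e.g. [(1, 2)] for the frame above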

Pandas Pivot Table Count Values (Exclude "NaN")

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Site': ['a','a','a','b','b','b'],
                   'x': [1, 1, 0, 1, np.nan, 0],
                   'y': [1, np.nan, 0, 1, 1, 0]
                   })
df
Site y x
0 a 1.0 1
1 a NaN 1
2 a 0.0 0
3 b 1.0 1
4 b 1.0 NaN
5 b 0.0 0
I'd like to pivot this data frame to get the count of values (excluding "NaN") for each column.
I tried what I found in other posts, but nothing seems to work (maybe there was a change in pandas 0.18)?
Desired result:
Item count
Site
a y 2
b y 3
a x 3
b x 2
Thanks in advance!
pvt = pd.pivot_table(df, index = "Site", values = ["x", "y"], aggfunc = "count").stack().reset_index(level = 1)
pvt.columns = ["Item", "count"]
pvt
Out[38]:
Item count
Site
a x 3
a y 2
b x 2
b y 3
You can add pvt.sort_values("Item", ascending = False) if you want y's to appear first.
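A hedged alternative without pivot_table, since count already excludes NaN in a plain groupby:
alt = (df.groupby('Site').count()   # non-NaN counts per column
         .stack()                   # columns become an inner index level
         .rename('count')
         .rename_axis(['Site', 'Item'])
         .reset_index(level='Item'))
print(alt)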
