pandas: do not count nan in an aggregate function - python-3.x

I have the following code:
data_agg_df = data_df.groupby("team", as_index=False).player.agg({"player_set": lambda x: set(list(x)), "player_count": "nunique"})
Then my results look like:
team player_set player_count
-------------------------------------------------
A {John, Mary} 2
B {nan} 0
C {Dave,nan} 1
I am wondering how to not show the nans in the player_set? I.e., I want the resulting data frame to look like:
team player_set player_count
-------------------------------------------------
A {John, Mary} 2
B {} 0
C {Dave} 1
Thanks!

Replace
set(list(x))
with
set(i for i in x if pd.notnull(i))
to take out the nans.
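For completeness, here is a minimal, self-contained sketch of the fix, assuming pandas >= 0.25 named aggregation (a different call style from the question's dict-based agg) and a small made-up frame that mirrors the question:
import pandas as pd
import numpy as np

# hypothetical sample data mirroring the question
data_df = pd.DataFrame({"team": ["A", "A", "B", "C", "C"],
                        "player": ["John", "Mary", np.nan, "Dave", np.nan]})

data_agg_df = data_df.groupby("team", as_index=False).agg(
    player_set=("player", lambda x: set(i for i in x if pd.notnull(i))),
    player_count=("player", "nunique"))  # nunique already ignores NaN
print(data_agg_df)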

Related

Check whether the two string columns contain each other in Python

Given a small dataset as follows:
id a b
0 1 lol lolec
1 2 rambo ram
2 3 ki pio
3 4 iloc loc
4 5 strip rstrip
5 6 lambda lambda
I would like to create a new column c based on the following criterion:
If a is equal to or a substring of b, or vice versa, then create a new column c with value 1; otherwise keep it as 0.
How could I do that in Pandas or Python?
The expected result:
id a b c
0 1 lol lolec 1
1 2 rambo ram 1
2 3 ki pio 0
3 4 iloc loc 1
4 5 strip rstrip 1
5 6 lambda lambda 1
To check whether a is in b or b is in a, we can use:
df.apply(lambda x: x.a in x.b, axis=1)
df.apply(lambda x: x.b in x.a, axis=1)
Use zip and list comprehension:
df['c'] = [int(a in b or b in a) for a, b in zip(df.a, df.b)]
df
id a b c
0 1 lol lolec 1
1 2 rambo ram 1
2 3 ki pio 0
3 4 iloc loc 1
4 5 strip rstrip 1
5 6 lambda lambda 1
Or use apply, just combine both conditions with or:
df['c'] = df.apply(lambda r: int(r.a in r.b or r.b in r.a), axis=1)

Do I use a loop, df.melt or df.explode to achieve a flattened dataframe?

Can anyone help with some code that will achieve the following transformation? I have tried variations of df.melt, df.explode, and also a looping statement, but I only get errors. I think it might need nesting, but I don't have the experience to do so.
index A B C D
0 X d 4 2
1 Y b 5 2
Column D represents the frequency of column C, i.e. how many times each row should appear.
The desired output is:
index A B C
0 X d 4
1 X d 4
2 Y b 5
3 Y b 5
If you want to repeat rows, why not use index.repeat?
import pandas as pd
# recreate the sample dataframe from the question
df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [2, 2]}, columns=list("ABCD"))
df = df.reindex(df.index.repeat(df["D"])).drop(columns="D").reset_index(drop=True)
print(df)
Sample output
A B C
0 X d 4
1 X d 4
2 Y b 5
3 Y b 5
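Since the question also mentions df.explode, here is a hedged sketch of that route (needs pandas >= 0.25; the column names are assumed to match the question): build a list column whose length is given by D, explode it into rows, then drop D.
import pandas as pd

df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [2, 2]})
# repeat each C value D times inside a list, then explode the lists into rows
out = (df.assign(C=[[c] * d for c, d in zip(df["C"], df["D"])])
         .explode("C")
         .drop(columns="D")
         .reset_index(drop=True))
print(out)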

groupby value_counts store in a data frame

My data frame looks like -
city
a
f
m
m
m
d
I want to store this data in another data frame -
city total
a 1
f 1
m 3
d 1
my code is -
df_city = df.groupby(['city'])['city'].count()
but I am not getting the proper results.
This will do:
df['city'].value_counts().to_frame(name="Total")
Developing from your code:
df.groupby('city').city.count().rename('total').reset_index()
Out[505]:
city total
0 a 1
1 d 1
2 f 1
3 m 3
The solution with groupby is better if you want to avoid sorting by total - add Series.reset_index with the name parameter:
df_city = df.groupby('city')['city'].count().reset_index(name='total')
print (df_city)
city total
0 a 1
1 d 1
2 f 1
3 m 3
If you use Series.value_counts, the output is sorted by count; for a DataFrame add Series.rename_axis and Series.reset_index:
df_city = df['city'].value_counts().rename_axis('city').reset_index(name="total")
print (df_city)
city total
0 m 3
1 d 1
2 f 1
3 a 1
Could you please try the following:
df['total'] = df.groupby('city').cumcount() + 1
df.drop_duplicates('city', keep='last').reset_index(drop=True)
To store this in a data frame use:
df['total'] = df.groupby('city').cumcount() + 1
df1 = df.drop_duplicates('city', keep='last').reset_index(drop=True)
df1
When we print df1, its value will be as follows:
city total
0 a 1
1 f 1
2 m 3
3 d 1
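For reference, a self-contained version of the groupby/reset_index answer above, with the sample data built by hand (variable and column names follow the question):
import pandas as pd

df = pd.DataFrame({"city": ["a", "f", "m", "m", "m", "d"]})
# count rows per city and keep 'city' as a regular column
df_city = df.groupby('city')['city'].count().reset_index(name='total')
print(df_city)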

Drop whole groups by multiple columns if a specific value does not exist in another column in Pandas

How can I drop the whole group by city and district if the date value 2018/11/1 does not exist for that group in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!
Create a helper column with DataFrame.assign, compare against the date, and test whether at least one value per group is True with GroupBy.transform('any'), so you can filter by boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
If you get an error because of missing values in the mask, one possible idea is to replace missing values in the columns used for the groups:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
                  city=df['city'].fillna(-1),
                  district=df['district'].fillna(-1))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is to add possibly missing index values with reindex and also replace missing values with False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())
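Because the question's date column holds strings like 2018/9/1, a self-contained sketch might parse it first; this is a slight variant of the filter answer above that compares Timestamps directly (input data assumed to be as shown in the question):
import pandas as pd

df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b'],
                   'district': ['c', 'c', 'c', 'd', 'd'],
                   'date': ['2018/9/1', '2018/10/1', '2018/11/1', '2018/9/1', '2018/10/1'],
                   'value': [12, 4, 5, 3, 7]})
df['date'] = pd.to_datetime(df['date'])  # parse the string dates

filter_date = pd.Timestamp('2018-11-01')
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'] == filter_date).any())
print(df)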

Join rows based on particular column value in python [duplicate]

I have a dataframe like this:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
Calling
In [10]: print df.groupby("A")["B"].sum()
will return
A
1 1.615586
2 0.421821
3 0.463468
4 0.643961
Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
I have been trying to find ways to do this.
Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although
df.groupby("A")["B"]
is a
pandas.core.groupby.SeriesGroupBy object
so I was hoping any Series method would work. Any ideas?
In [4]: df = read_csv(StringIO(data),sep='\s+')
In [5]: df
Out[5]:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !
In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object
When you apply your own function, there is no automatic exclusion of non-numeric columns. This is slower, though, than applying .sum() directly to the groupby:
In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
A B C
A
1 2 1.615586 Thisstring
2 4 0.421821 is!
3 3 0.463468 a
4 4 0.643961 random
sum by default concatenates the strings:
In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object
You can do pretty much what you want
In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
dtype: object
Doing this on a whole frame, one group at a time. The key is to return a Series:
def f(x):
    return Series(dict(A = x['A'].sum(),
                       B = x['B'].sum(),
                       C = "{%s}" % ', '.join(x['C'])))
In [14]: df.groupby('A').apply(f)
Out[14]:
A B C
A
1 2 1.615586 {This, string}
2 4 0.421821 {is, !}
3 3 0.463468 {a}
4 4 0.643961 {random}
You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.
>>> d
A B
0 1 This
1 2 is
2 3 a
3 4 random
4 1 string
5 2 !
>>> d.groupby('A')['B'].apply(list)
A
1 [This, string]
2 [is, !]
3 [a]
4 [random]
dtype: object
If you want something else, just write a function that does what you want and then apply that.
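For example, to get a set per group instead of a list, apply set in the same way (a minimal sketch reusing the d frame shown above):
>>> d.groupby('A')['B'].apply(set)  # one set of strings per value of A, e.g. {'This', 'string'} for A == 1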
You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)
df.groupby('A')['B'].agg(lambda col: ''.join(col))
You could try this:
df.groupby('A').agg({'B':'sum','C':'-'.join})
Named aggregations with pandas >= 0.25.0
Since pandas version 0.25.0 we have named aggregations where we can groupby, aggregate and at the same time assign new names to our columns. This way we won't get the MultiIndex columns, and the column names make more sense given the data they contain:
aggregate and get a list of strings
grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', list)).reset_index()
print(grp)
A B_sum C
0 1 1.615586 [This, string]
1 2 0.421821 [is, !]
2 3 0.463468 [a]
3 4 0.643961 [random]
aggregate and join the strings
grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', ', '.join)).reset_index()
print(grp)
A B_sum C
0 1 1.615586 This, string
1 2 0.421821 is, !
2 3 0.463468 a
3 4 0.643961 random
A simple solution would be:
>>> df.groupby('A')['C'].unique().reset_index()
If you'd like to overwrite column B in the dataframe, this should work:
df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))
Following @Erfan's good answer, most of the time in an analysis of aggregated values you want the unique possible combinations of the existing character values:
unique_chars = lambda x: ', '.join(x.unique())
(df
 .groupby(['A'])
 .agg({'C': unique_chars}))
