Pandas how to get top n group by flag column - python-3.x

I have a dataframe like the one below.
df = pd.DataFrame({'group':[1,2,1,3,3,1,4,4,1,4], 'match': [1,1,1,1,1,1,1,1,1,1]})
group match
0 1 1
1 2 1
2 1 1
3 3 1
4 3 1
5 1 1
6 4 1
7 4 1
8 1 1
9 4 1
I want to get the top n groups, like below (n = 3).
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
In my actual data, each row carries additional information I need to keep, so I only want to sort by the number of matches and extract the top n rows. How can I do this?

If you need the top 3 groups per value of the match column, use SeriesGroupBy.value_counts with GroupBy.head to take the top 3 per group, then convert the index to a DataFrame with Index.to_frame and join the original rows back with DataFrame.merge:
s = df.groupby('match')['group'].value_counts().groupby(level=0).head(3).swaplevel()
df = s.index.to_frame().reset_index(drop=True).merge(df)
print (df)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
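For reference, a self-contained run of this approach on the sample frame from the question (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'group': [1, 2, 1, 3, 3, 1, 4, 4, 1, 4],
                   'match': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]})

# count group frequencies per value of `match`, keep the 3 most common per group
s = df.groupby('match')['group'].value_counts().groupby(level=0).head(3).swaplevel()
# turn the (group, match) index into a frame and merge the original rows back in
out = s.index.to_frame().reset_index(drop=True).merge(df)
```

Because merge preserves the order of the left frame, the result comes back sorted by group frequency, as shown above.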
Or, if you only need the rows where match is 1, use Series.value_counts after filtering with boolean indexing:
s = df.loc[df['match'] == 1, 'group'].value_counts().head(3)
df = s.index.to_frame(name='group').merge(df)
print (df)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
Solution with isin and ordered categoricals:
# if you need to filter on match == 1
idx = df.loc[df['match'] == 1, 'group'].value_counts().head(3).index
# if no filter is needed:
# idx = df['group'].value_counts().head(3).index
df = df[df.group.isin(idx)]
df['group'] = pd.CategoricalIndex(df['group'], ordered=True, categories=idx)
df = df.sort_values('group')
print (df)
group match
0 1 1
2 1 1
5 1 1
8 1 1
6 4 1
7 4 1
9 4 1
3 3 1
4 3 1
The difference between the solutions is easiest to see with modified data in the match column:
df = pd.DataFrame({'group':[1,2,1,3,3,1,4,4,1,4,10,20,10,20,10,30,40],
'match': [1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]})
print (df)
group match
0 1 1
1 2 1
2 1 1
3 3 1
4 3 1
5 1 1
6 4 1
7 4 1
8 1 1
9 4 1
10 10 0
11 20 0
12 10 0
13 20 0
14 10 0
15 30 0
16 40 0
Top 3 values per group of match:
s = df.groupby('match')['group'].value_counts().groupby(level=0).head(3).swaplevel()
df1 = s.index.to_frame().reset_index(drop=True).merge(df)
print (df1)
group match
0 10 0
1 10 0
2 10 0
3 20 0
4 20 0
5 30 0
6 1 1
7 1 1
8 1 1
9 1 1
10 4 1
11 4 1
12 4 1
13 3 1
14 3 1
Top 3 values where match == 1:
s = df.loc[df['match'] == 1, 'group'].value_counts().head(3)
df2 = s.index.to_frame(name='group').merge(df)
print (df2)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
Top 3 values, ignoring the match column:
s = df['group'].value_counts().head(3)
df3 = s.index.to_frame(name='group').merge(df)
print (df3)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 10 0
5 10 0
6 10 0
7 4 1
8 4 1
9 4 1

Related

Adding timepoints to a multirow dataframe based on ID and date

As the title says it, my dataframe looks as follows:
ID  Follow up month  Value-x  value -y
1   0                12       12
1   0                11       14
2   0                10       11
2   3                11       0
2   0                12       1
1   3                13       12
2   3                11       5
I want to add another column called Timepoint, which would make the table look as follows:
ID  Follow up month  Value-x  value -y  Timepoint
1   0                12       12        1
1   0                11       14        1
2   0                10       11        1
2   3                11       0         2
2   0                12       1         1
1   3                13       12        2
2   3                11       5         2
So far I tried grouping the rows by ID and follow-up month and then assigning a timepoint with cumcount. This didn't give me any results; any help on how to handle this would be appreciated.
From your table I can only infer that you want to create the Timepoint column based on the corresponding values in Follow up month, which will look like:
from io import StringIO
import pandas as pd

wt = StringIO("""ID  Follow up month  Value-x  value -y
1   0                12       12
1   0                11       14
2   0                10       11
2   3                11       0
2   0                12       1
1   3                13       12
2   3                11       5""")
df = pd.read_csv(wt, sep=r'\s\s+', engine='python')
df['Timepoint'] = df['Follow up month'].apply(lambda x: 1 if x == 0 else 2)
df
Output:
ID Follow up month Value-x value -y Timepoint
0 1 0 12 12 1
1 1 0 11 14 1
2 2 0 10 11 1
3 2 3 11 0 2
4 2 0 12 1 1
5 1 3 13 12 2
6 2 3 11 5 2
Edit
Based on your comment, this should be what you want:
def timepoint(s):
    if not s.isin([0]).any() and s.iloc[0] == 3:
        return 1
    else:
        return s.apply(lambda x: 1 if x == 0 else 2)
df['Timepoint'] = df.groupby('ID')['Follow up month'].transform(timepoint)
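A runnable sketch of the grouped version on the sample data; the ID 3 rows are hypothetical, added only to exercise the first branch (an ID that never visits at month 0 but starts at month 3):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 1, 2, 3, 3],
                   'Follow up month': [0, 0, 0, 3, 0, 3, 3, 3, 3]})

def timepoint(s):
    # an ID with no month-0 visit whose first entry is month 3 stays timepoint 1
    if not s.isin([0]).any() and s.iloc[0] == 3:
        return 1
    return s.apply(lambda x: 1 if x == 0 else 2)

# transform broadcasts a scalar return value across the whole group
df['Timepoint'] = df.groupby('ID')['Follow up month'].transform(timepoint)
```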

Creating a variable using conditionals python using vectorization

I have a pandas dataframe as below,
flag a b c
0 1 5 1 3
1 1 2 1 3
2 1 3 0 3
3 1 4 0 3
4 1 5 5 3
5 1 6 0 3
6 1 7 0 3
7 2 6 1 4
8 2 2 1 4
9 2 3 1 4
10 2 4 1 4
I want to create a column 'd' based on the below condition:
1) For the first row of each flag, if a > c, then d = b, else d = NaN.
2) For subsequent rows of each flag, if (a > c) & ((the previous row of d is NaN) | (b > the previous row of d)), then d = b, else d = the previous row of d.
I am expecting the below output:
flag a b c d
0 1 5 1 3 1
1 1 2 1 3 1
2 1 3 0 3 1
3 1 4 0 3 1
4 1 5 5 3 5
5 1 6 0 3 5
6 1 7 0 3 5
7 2 6 1 4 1
8 2 2 1 4 1
9 2 3 1 4 1
10 2 4 1 4 1
Here's how I would translate your logic:
import numpy as np

df['d'] = np.nan
# first row of each flag
s = df.flag.ne(df.flag.shift())
# where a > c
a_gt_c = df['a'].gt(df['c'])
# fill the first rows where a > c
df.loc[s & a_gt_c, 'd'] = df['b']
# mask for the second fill
mask = ((~s)                          # not first rows
        & a_gt_c                      # a > c
        & (df['d'].shift().isna()     # previous d is null
           | df['b'].gt(df['d']).shift()))  # or b > d, shifted
# fill those values:
df.loc[mask, 'd'] = df['b']
# ffill for the rest
df['d'] = df['d'].ffill()
Output:
flag a b c d
0 1 5 1 3 1.0
1 1 2 1 3 1.0
2 1 3 0 3 1.0
3 1 4 0 3 0.0
4 1 5 5 3 5.0
5 1 6 0 3 0.0
6 1 7 0 3 0.0
7 2 6 1 4 1.0
8 2 2 1 4 1.0
9 2 3 1 4 1.0
10 2 4 1 4 1.0

Table wise value count

I have a table like this, and I want to draw a histogram of the number of 0s, 1s, 2s, and 3s across the whole table. Is there a way to do it?
You can apply melt and then hist.
For example:
df
A B C D
0 3 1 1 1
1 3 3 2 2
2 1 0 1 1
3 3 2 3 0
4 3 1 1 3
5 3 0 3 1
6 3 1 1 0
7 1 3 3 0
8 3 1 3 3
9 3 3 1 3
df.melt()['value'].value_counts()
3 18
1 14
0 5
2 3
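A self-contained version of the above on the sample frame, including the plotting step that turns the counts into a histogram (the `plot.bar` call is left commented out so the snippet runs headless):

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 3, 1, 3, 3, 3, 3, 1, 3, 3],
                   'B': [1, 3, 0, 2, 1, 0, 1, 3, 1, 3],
                   'C': [1, 2, 1, 3, 1, 3, 1, 3, 3, 1],
                   'D': [1, 2, 1, 0, 3, 1, 0, 0, 3, 3]})

# melt stacks every cell into one long 'value' column, then count each value
counts = df.melt()['value'].value_counts().sort_index()
# counts.plot.bar()  # draws the histogram of 0..3 across the whole table
```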

Dataframe concatenate columns

I have a dataframe with a multiindex (ID, Date, LID) and columns from 0 to N that looks something like this:
0 1 2 3 4
ID Date LID
00112 11-02-2014 I 0 1 5 6 7
00112 11-02-2014 II 2 4 5 3 4
00112 30-07-2015 I 5 7 1 1 2
00112 30-07-2015 II 3 2 8 7 1
I would like to group the dataframe by ID and Date and concatenate the columns to the same row such that it looks like this:
0 1 2 3 4 5 6 7 8 9
ID Date
00112 11-02-2014 0 1 5 6 7 2 4 5 3 4
00112 30-07-2015 5 7 1 1 2 3 2 8 7 1
Using pd.concat and pd.DataFrame.xs
pd.concat(
[df.xs(x, level=2) for x in df.index.levels[2]],
axis=1, ignore_index=True
)
0 1 2 3 4 5 6 7 8 9
ID Date
112 11-02-2014 0 1 5 6 7 2 4 5 3 4
30-07-2015 5 7 1 1 2 3 2 8 7 1
Use unstack + sort_index:
df = df.unstack().sort_index(axis=1, level=1)
# for new column names
df.columns = np.arange(len(df.columns))
print (df)
0 1 2 3 4 5 6 7 8 9
ID Date
112 11-02-2014 0 1 5 6 7 2 4 5 3 4
30-07-2015 5 7 1 1 2 3 2 8 7 1
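A runnable sketch of the unstack approach, rebuilding the sample MultiIndex frame from the question (ID is kept as a string here only to preserve the leading zeros shown in the question):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('00112', '11-02-2014', 'I'), ('00112', '11-02-2014', 'II'),
     ('00112', '30-07-2015', 'I'), ('00112', '30-07-2015', 'II')],
    names=['ID', 'Date', 'LID'])
df = pd.DataFrame([[0, 1, 5, 6, 7],
                   [2, 4, 5, 3, 4],
                   [5, 7, 1, 1, 2],
                   [3, 2, 8, 7, 1]], index=idx)

# move LID into the columns, then sort so all level-I columns precede level-II
out = df.unstack().sort_index(axis=1, level=1)
out.columns = np.arange(len(out.columns))
```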

convert dataframe columns value into digital number

I have the following data in a column of my data frame. How can I convert each domain name to a number? I tried to use replace in a for loop, but since I have more than 1200 unique domain names, that doesn't seem like a good way to do it:
for i, v in np.ndenumerate(np.unique(df['domain'])):
    df['domain'] = df['domain'].replace(to_replace=[v], value=i[0]+1, inplace=True)
but it does not work
data frame:
type domain
0 1 yahoo.com
1 1 google.com
2 0 google.com
3 0 aa.com
4 0 google.com
5 0 aa.com
6 1 abc.com
7 1 msn.com
8 1 abc.com
9 1 abc.com
....
I want to convert to
type domain
0 1 1
1 1 2
2 0 2
3 0 3
4 0 2
5 0 3
6 1 4
7 1 5
8 1 4
9 1 4
....
Let's use pd.factorize:
df.assign(domain=pd.factorize(df.domain)[0]+1)
Output:
type domain
0 1 1
1 1 2
2 0 2
3 0 3
4 0 2
5 0 3
6 1 4
7 1 5
8 1 4
9 1 4
If the exact number assigned to each domain does not matter, you can try ordered categorical codes:
import pandas as pd
df.domain.astype('category').cat.codes
Out[154]:
0 4
1 2
2 2
3 0
4 2
5 0
6 1
7 3
8 1
9 1
dtype: int8
If the assignment order does matter, you can build the mapping yourself:
maplist=df[['domain']].drop_duplicates(keep='first').reset_index(drop=True).reset_index().set_index('domain')
maplist['index']=maplist['index']+1
df.domain=df.domain.map(maplist['index'])
Out[177]:
type domain
0 1 1
1 1 2
2 0 2
3 0 3
4 0 2
5 0 3
6 1 4
7 1 5
8 1 4
9 1 4
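For reference, a self-contained run of the factorize approach on the sample data (a sketch; factorize numbers values in order of first appearance starting at 0, hence the +1):

```python
import pandas as pd

df = pd.DataFrame({'type': [1, 1, 0, 0, 0, 0, 1, 1, 1, 1],
                   'domain': ['yahoo.com', 'google.com', 'google.com', 'aa.com',
                              'google.com', 'aa.com', 'abc.com', 'msn.com',
                              'abc.com', 'abc.com']})

# codes are 0-based integers in order of first appearance; uniques is the lookup
codes, uniques = pd.factorize(df['domain'])
df['domain'] = codes + 1
```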