I have a pandas dataframe as below,
flag a b c
0 1 5 1 3
1 1 2 1 3
2 1 3 0 3
3 1 4 0 3
4 1 5 5 3
5 1 6 0 3
6 1 7 0 3
7 2 6 1 4
8 2 2 1 4
9 2 3 1 4
10 2 4 1 4
I want to create a column 'd' based on the below condition:
1) For first row of each flag, if a>c, then d = b, else d = nan
2) For non-first row of each flag, if (a>c) & ((previous row of d is nan) | (b > previous row of d)), d=b, else d = prev row of d
I am expecting the below output:
flag a b c d
0 1 5 1 3 1
1 1 2 1 3 1
2 1 3 0 3 1
3 1 4 0 3 1
4 1 5 5 3 5
5 1 6 0 3 5
6 1 7 0 3 5
7 2 6 1 4 1
8 2 2 1 4 1
9 2 3 1 4 1
10 2 4 1 4 1
Here's how I would translate your logic:
df['d'] = np.nan
# first row of flag
s = df.flag.ne(df.flag.shift())
# where a > c
a_gt_c = df['a'].gt(df['c'])
# fill the first rows with a > c
df.loc[s & a_gt_c, 'd'] = df['b']
# mask for second fill
mask = ((~s) # not first rows
& a_gt_c # a > c
& (df['d'].shift().isna() # previous d not null
| df['b'].gt(df['d']).shift()) # or b > previous d
)
# fill those values:
df.loc[mask, 'd'] = df['b']
# ffill for the rest
df['d'] = df['d'].ffill()
Output:
flag a b c d
0 1 5 1 3 1.0
1 1 2 1 3 1.0
2 1 3 0 3 1.0
3 1 4 0 3 0.0
4 1 5 5 3 5.0
5 1 6 0 3 0.0
6 1 7 0 3 0.0
7 2 6 1 4 1.0
8 2 2 1 4 1.0
9 2 3 1 4 1.0
10 2 4 1 4 1.0
I have dataframe like below.
df = pd.DataFrame({'group':[1,2,1,3,3,1,4,4,1,4], 'match': [1,1,1,1,1,1,1,1,1,1]})
group match
0 1 1
1 2 1
2 1 1
3 3 1
4 3 1
5 1 1
6 4 1
7 4 1
8 1 1
9 4 1
I want to get top n group like below (n=3).
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
My data, in actually, each row have another information to use, so only sort to num of match, and extract top n.
How to do this?
I believe you need if need top3 groups per column match - use SeriesGroupBy.value_counts with GroupBy.head for top3 per groups and then convert index to DataFrame by Index.to_frame and DataFrame.merge:
s = df.groupby('match')['group'].value_counts().groupby(level=0).head(3).swaplevel()
df = s.index.to_frame().reset_index(drop=True).merge(df)
print (df)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
Or if need filter only values if match is 1 use Series.value_counts with filtering by boolean indexing:
s = df.loc[df['match'] == 1, 'group'].value_counts().head(3)
df = s.index.to_frame(name='group').merge(df)
print (df)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
Solution with isin and ordered categoricals:
#if need filter match == 1
idx = df.loc[df['match'] == 1, 'group'].value_counts().head(3).index
#if dont need filter
#idx = df.group.value_counts().head(3).index
df = df[df.group.isin(idx)]
df['group'] = pd.CategoricalIndex(df['group'], ordered=True, categories=idx)
df = df.sort_values('group')
print (df)
group match
0 1 1
2 1 1
5 1 1
8 1 1
6 4 1
7 4 1
9 4 1
3 3 1
4 3 1
Difference in solutions is best seen in changed data of match column:
df = pd.DataFrame({'group':[1,2,1,3,3,1,4,4,1,4,10,20,10,20,10,30,40],
'match': [1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0]})
print (df)
group match
0 1 1
1 2 1
2 1 1
3 3 1
4 3 1
5 1 1
6 4 1
7 4 1
8 1 1
9 4 1
10 10 0
11 20 0
12 10 0
13 20 0
14 10 0
15 30 0
16 40 0
Top3 values per groups by match:
s = df.groupby('match')['group'].value_counts().groupby(level=0).head(3).swaplevel()
df1 = s.index.to_frame().reset_index(drop=True).merge(df)
print (df1)
group match
0 10 0
1 10 0
2 10 0
3 20 0
4 20 0
5 30 0
6 1 1
7 1 1
8 1 1
9 1 1
10 4 1
11 4 1
12 4 1
13 3 1
14 3 1
Top3 values by match == 1:
s = df.loc[df['match'] == 1, 'group'].value_counts().head(3)
df2 = s.index.to_frame(name='group').merge(df)
print (df2)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 4 1
5 4 1
6 4 1
7 3 1
8 3 1
Top3 values, match column is not important:
s = df['group'].value_counts().head(3)
df3 = s.index.to_frame(name='group').merge(df)
print (df3)
group match
0 1 1
1 1 1
2 1 1
3 1 1
4 10 0
5 10 0
6 10 0
7 4 1
8 4 1
9 4 1
I have a table like this, I want to draw a histogram for number of 0, 1, 2, 3 across all table, is there a way to do it?
you can apply melt and hist
for example:
df
A B C D
0 3 1 1 1
1 3 3 2 2
2 1 0 1 1
3 3 2 3 0
4 3 1 1 3
5 3 0 3 1
6 3 1 1 0
7 1 3 3 0
8 3 1 3 3
9 3 3 1 3
df.melt()['value'].value_counts()
3 18
1 14
0 5
2 3
This question already has answers here:
Generalise slicing operation in a NumPy array
(4 answers)
Closed 4 years ago.
I have a 2d numpy matrix, for example:
arr = np.arange(0, 12).reshape(3,4)
This I would like to get in a DataFrame, such that:
X Y Z
0 0 0
0 1 1
0 2 2
0 3 3
1 0 4
1 1 5
1 2 6
1 3 7
2 0 8
2 1 9
2 2 10
2 3 11
How would I do this (efficiently)?
You can use numpy functions:
x1 =np.repeat(np.arange(arr.shape[0]), len(arr.flatten())/len(np.arange(arr.shape[0])))
x2 =np.tile(np.arange(arr.shape[1]), int(len(arr.flatten())/len(np.arange(arr.shape[1]))))
x3= arr.flatten()
pd.DataFrame(np.array([x1,x2,x3]).T, columns=['X','Y','Z'])
Output
X Y Z
0 0 0 0
1 0 1 1
2 0 2 2
3 0 3 3
4 1 0 4
5 1 1 5
6 1 2 6
7 1 3 7
8 2 0 8
9 2 1 9
10 2 2 10
11 2 3 11
I have following data in my column of data frame. How can I convert each domain name by digital number? I try to use replace in a for loop. However, since I have more than 1200 unqie domain name. I do not want to It seems like it is not a idea way to do it
for i, v in np.ndenumerate(np.unique(df['domain'])):
df['domain'] = df['domain'].replace(to_replace=[v], value=i[0]+1, inplace=True)
but it does not work
data frame:
type domain
0 1 yahoo.com
1 1 google.com
2 0 google.com
3 0 aa.com
4 0 google.com
5 0 aa.com
6 1 abc.com
7 1 msn.com
8 1 abc.com
9 1 abc.com
....
I want to convert to
type domain
0 1 1
1 1 2
2 0 2
3 0 3
4 0 2
5 0 3
6 1 4
7 1 5
8 1 4
9 1 4
....
Let's use pd.factorize:
df.assign(domain=pd.factorize(df.domain)[0]+1)
Output:
type domain
0 1 1
1 1 2
2 0 2
3 0 3
4 0 2
5 0 3
6 1 4
7 1 5
8 1 4
9 1 4
If it does really matter for the digital number assignment, you can try this
import pandas as pd
df.domain.astype('category').cat.codes
Out[154]:
0 4
1 2
2 2
3 0
4 2
5 0
6 1
7 3
8 1
9 1
dtype: int8
If that is matter, you can try
maplist=df[['domain']].drop_duplicates(keep='first').reset_index(drop=True).reset_index().set_index('domain')
maplist['index']=maplist['index']+1
df.domain=df.domain.map(maplist['index'])
Out[177]:
type domain
0 1 1
1 1 2
2 0 2
3 0 3
4 0 2
5 0 3
6 1 4
7 1 5
8 1 4
9 1 4