Only drop duplicates if number of duplicates is less than X - python-3.x

I need to drop duplicate rows in my DataFrame only if the number of duplicates is less than x (e.g. 3)
(if there are 3 or more duplicates, keep them!)
Sample, where count is the number of duplicates and the duplicated values are in the data column:
data | count
-------------
a | 1
b | 2
b | 2
c | 1
d | 3
d | 3
d | 3
Desired result:
data | count
-------------
a | 1
b | 1
c | 1
d | 3
d | 3
d | 3
How can I achieve this? Thanks in advance.

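I believe you need to chain two conditions in boolean indexing: DataFrame.duplicated to keep the first row of each group, and a comparison that keeps every row whose count is greater than or equal to N. Last, set the count column to 1 for the remaining small groups: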
N = 3
df1 = df[~df.duplicated('data') | df['count'].ge(N)].copy()
df1.loc[df1['count'] < N, 'count'] = 1
print (df1)
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3

IIUC, you could do the following:
# create mask for non-duplicates and groups with at least 3 rows
mask = (df.groupby('data')['count'].transform('count') >= 3) | ~df.duplicated('data')
# filter
filtered = df.loc[mask].drop('count', axis=1)
# reset count column
filtered['count'] = filtered.groupby('data')['data'].transform('count')
print(filtered)
Output
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3

N = 3
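# set count to 1 for groups smaller than N
df['count'] = df['count'].apply(lambda x: 1 if x < N else x)
# de-duplicate the small groups, keep every row of groups with at least N rows
result = pd.concat([df[df['count'].eq(1)].drop_duplicates(), df[df['count'].ge(N)]])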
result
data count
0 a 1
1 b 1
3 c 1
4 d 3
5 d 3
6 d 3
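The answers above rely on an existing count column. As a minimal, self-contained sketch of the same idea, the group size can also be derived from the data column itself. The variable names (`group_size`, `keep`, `result`) are illustrative and not taken from the answers:

```python
import pandas as pd

N = 3  # groups with fewer than N rows are collapsed to a single row

df = pd.DataFrame({"data": ["a", "b", "b", "c", "d", "d", "d"]})

# Size of the group each row belongs to, computed from the data itself.
group_size = df.groupby("data")["data"].transform("size")

# Keep the first row of every group, plus every row of groups with >= N rows.
keep = ~df.duplicated("data") | group_size.ge(N)
result = df[keep].copy()

# Recompute the count after filtering.
result["count"] = result.groupby("data")["data"].transform("size")
print(result)
```

Recomputing the count after filtering avoids having to patch individual values by hand.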

Related

pandas transform one row into multiple rows

I have a dataframe as below:
ID list
1 a, b, c
2 a, s
3 NA
5 f, j, l
I need to break each item in the list column (a string) into its own row, as below:
ID item
1 a
1 b
1 c
2 a
2 s
3 NA
5 f
5 j
5 l
Thanks.
Use str.split to separate your items then explode:
print (df.assign(list=df["list"].str.split(", ")).explode("list"))
ID list
0 1 a
0 1 b
0 1 c
1 2 a
1 2 s
2 3 NaN
3 5 f
3 5 j
3 5 l
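For reference, here is a runnable sketch of the split-then-explode approach with the sample data built explicitly. It assumes pandas 0.25 or later (where `explode` is available). The `item` column name follows the desired output in the question, and the final `reset_index` is an optional extra:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 5],
    "list": ["a, b, c", "a, s", np.nan, "f, j, l"],
})

out = (
    df.assign(item=df["list"].str.split(", "))  # strings become lists, NaN stays NaN
      .explode("item")                          # one row per list element
      .drop(columns="list")
      .reset_index(drop=True)                   # replace the repeated index with 0..n-1
)
print(out)
```

The row with a missing list survives as a single row with a missing item.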
A beginner's approach: another way of doing the same thing, using pd.DataFrame.stack:
# split on ', ' so the items carry no leading space; str(x) turns NaN into 'nan'
df['list'] = df['list'].map(lambda x : str(x).split(', '))
dfOut = pd.DataFrame(df['list'].values.tolist())
dfOut.index = df['ID']
dfOut = dfOut.stack().reset_index()
del dfOut['level_1']
dfOut.rename(columns = {0 : 'list'}, inplace = True)
Output:
ID list
0 1 a
1 1 b
2 1 c
3 2 a
4 2 s
5 3 nan
6 5 f
7 5 j
8 5 l

Python Pandas: copy several columns at specific row from one dataframe to another with different names

I have dataframe1 with columns a,b,c,d with 5 rows.
I also have another dataframe2 with columns e,f,g,h
Let's say I want to copy columns a,b in row 3 from dataframe1 to columns f,g in row 3 at dataframe2.
I tried to use this code:
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']]
The result was NaN in dataframe2.
Any ideas on how I can solve it?
One idea is to convert to a numpy array, which avoids aligning the data by column names:
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']].values
Sample:
dataframe1 = pd.DataFrame({'a':list('abcdef'),
'b':[4,5,4,5,5,4],
'c':[7,8,9,4,2,3]})
print (dataframe1)
a b c
0 a 4 7
1 b 5 8
2 c 4 9
3 d 5 4
4 e 5 2
5 f 4 3
dataframe2 = pd.DataFrame({'f':list('HIJK'),
'g':[0,0,7,1],
'h':[0,1,0,1]})
print (dataframe2)
f g h
0 H 0 0
1 I 0 1
2 J 7 0
3 K 1 1
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']].values
print (dataframe2)
f g h
0 H 0 0
1 I 0 1
2 J 7 0
3 d 5 1
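To illustrate why the original assignment produced NaN, here is a small sketch. The `source` and `aligned` names are only for illustration. The selected row is a Series labelled `a` and `b`, so looking it up under the labels `f` and `g` finds nothing:

```python
import pandas as pd

dataframe1 = pd.DataFrame({"a": list("abcdef"), "b": [4, 5, 4, 5, 5, 4]})
dataframe2 = pd.DataFrame({"f": list("HIJK"), "g": [0, 0, 7, 1], "h": [0, 1, 0, 1]})

source = dataframe1.loc[3, ["a", "b"]]
# The row is a Series labelled by the source column names.
print(source.index.tolist())

# Label alignment: asking for 'f' and 'g' on that Series yields only NaN.
aligned = source.reindex(["f", "g"])
print(aligned)

# A NumPy array carries no labels, so values are assigned by position.
dataframe2.loc[3, ["f", "g"]] = source.to_numpy()
print(dataframe2)
```

`to_numpy()` is used here in place of `.values`; both return the underlying array.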

pandas how to convert a two-dimension dataframe to a one-dimension dataframe

Suppose I have a dataframe with multiple columns.
a b c
1
2
3
How can I convert it to a single-column dataframe?
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Please note that the former is a DataFrame, not a Panel.
Use melt:
df = df.reset_index().melt('index', var_name='col').set_index('index')[['col']]
print (df)
col
index
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Or numpy.repeat and numpy.tile with the DataFrame constructor:
a = np.repeat(df.columns, len(df))
b = np.tile(df.index, len(df.columns))
df = pd.DataFrame(a, index=b, columns=['col'])
print (df)
col
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Another way is:
import itertools
pd.DataFrame(list(itertools.product(df.index, df.columns.values))).set_index([0])
Output:
1
0
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
For the exact output, use sort_values:
print(pd.DataFrame(list(itertools.product(df.index, df.columns.values))).set_index([0]).sort_values(by=[1]))
1
0
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
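The snippets above assume an existing `df`. A self-contained sketch of the repeat/tile idea follows, with a hypothetical empty frame standing in for the question's data, since the cell values are not shown in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical input: index 1..3, columns a..c, cell values irrelevant here.
df = pd.DataFrame(index=[1, 2, 3], columns=["a", "b", "c"])

# Repeat each column label once per row: a a a b b b c c c
labels = np.repeat(df.columns.to_numpy(), len(df))
# Tile the index once per column: 1 2 3 1 2 3 1 2 3
index = np.tile(df.index.to_numpy(), len(df.columns))

out = pd.DataFrame({"col": labels}, index=index)
print(out)
```

Because only the labels are used, this produces the desired ordering directly, with no sorting step.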

Deleting the first instance in a data frame

I was wondering what's the best way to delete the first instance of a particular index in a Pandas dataframe?
In the example below, I want to delete row 0,5 and 9
Use boolean indexing with Index.duplicated:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')}, index=[0,0,1,2,2,2])
print (df)
A B C D E F
0 a 4 7 1 5 a
0 b 5 8 3 3 a
1 c 4 9 5 6 a
2 d 5 4 7 9 b
2 e 5 2 1 2 b
2 f 4 3 0 4 b
df = df[df.index.duplicated()]
print (df)
A B C D E F
0 b 5 8 3 3 a
2 e 5 2 1 2 b
2 f 4 3 0 4 b
Detail (computed on the original df, before filtering):
print (df.index.duplicated())
[False True False False True True]
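Here's a way to do it using groupby:
# positional row number, used to identify the first row of each index label
df['int_index'] = range(len(df))
firsts = df.groupby(df.index).first()
# drop the first row of every index label
filt = df[~df['int_index'].isin(firsts['int_index'])]
# rows whose index label occurs only once (this approach adds them back)
counts = df.index.value_counts()
missing = df[df.index.isin(counts[counts == 1].index)]
res = pd.concat([filt, missing]).sort_index().drop('int_index', axis=1)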
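The two answers differ in how they treat an index label that occurs only once. The sketch below shows both behaviours side by side on a reduced version of the sample frame. The variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"A": list("abcdef")}, index=[0, 0, 1, 2, 2, 2])

# Index.duplicated() is False for the first occurrence of each label
# and True for every later one (keep="first" is the default).
is_repeat = df.index.duplicated()

# Variant 1: drop the first occurrence of every label, including singletons.
without_first = df[is_repeat]

# Variant 2: drop the first occurrence only when the label is repeated.
counts = df.index.value_counts()
is_single = df.index.isin(counts[counts == 1].index)
keep_singles = df[is_repeat | is_single]

print(without_first)
print(keep_singles)
```

Which variant is appropriate depends on whether a label that appears once counts as a "first instance" to be removed.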

Column label of max in pandas

I am trying to extract the maximum value in each row and the contributing column label from a pandas dataframe. For example,
A B C D
index
x 0 1 2 3
y 3 2 1 0
I expect the following output,
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
I tried the following,
df['Maxv'] = df.apply(max,axis=1)
df['Con'] = df.idxmax(axis='rows')
It returned only the max column and 'NaN' for Con column. What is the error here?
Thanks in Advance.
AP
You need axis='columns' or axis=1 in DataFrame.idxmax:
df['Con'] = df.idxmax(axis='columns')
print (df)
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
Or:
df['Con'] = df.idxmax(axis=1)
print (df)
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
You get NaNs because idxmax(axis='rows') returns a Series indexed by the column labels, which does not align with the index of df:
print (df.idxmax(axis='rows'))
A y
B y
C x
D x
dtype: object
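A complete, runnable sketch of the row-wise version is given below. Restricting both calls to an explicit list of value columns (the `value_cols` name is an assumption for this example) keeps the newly added `Maxv` column out of the `idxmax` computation:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [0, 3], "B": [1, 2], "C": [2, 1], "D": [3, 0]},
    index=["x", "y"],
)

value_cols = ["A", "B", "C", "D"]  # only the original data columns

# axis=1 works across the columns, giving one result per row.
df["Maxv"] = df[value_cols].max(axis=1)
df["Con"] = df[value_cols].idxmax(axis=1)
print(df)
```

Both results are indexed by `x` and `y`, so they align with `df` when assigned as new columns.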
