Pandas DataFrame copy with condition [duplicate] - python-3.x

This question already has answers here: How can I replicate rows of a Pandas DataFrame? (10 answers). Closed 11 months ago.
I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd

what_i_have = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n':  [1, 2, 3],
    'v':  [10, 13, 8]
})

what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'v':  [10, 13, 13, 8, 8, 8]
})
Is this possible?

You can use Index.repeat to get repeated index values based on the column, then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
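For completeness, a minimal self-contained sketch of the positional variant, with the imports and the example frame from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['A', 'B', 'C'],
                   'n':  [1, 2, 3],
                   'v':  [10, 13, 8]})

# repeat each positional row index n times, select by position, then tidy up
out = (df.iloc[np.repeat(np.arange(len(df)), df['n'])]
         .drop(columns='n')
         .reset_index(drop=True))
print(out)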

You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
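Note that Series.repeat only carries the single value column you select; a hedged sketch for keeping several value columns repeats the index labels instead (the same Index.repeat idea as the first answer):
tmp = df.set_index('id')
out = (tmp.drop(columns='n')
          .loc[tmp.index.repeat(df['n'])]   # 'A', 'B', 'B', 'C', 'C', 'C'
          .reset_index())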

This is similar to uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
    f.id, f.n, f.v,
    'A',  1,   10,
    'B',  2,   13,
    'C',  3,   8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
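If you prefer plain pandas, the same uncount behaviour can be sketched as a small helper (hypothetical name, built on the Index.repeat idiom from the answers above):
import pandas as pd

def uncount(frame, counts):
    """Repeat each row by the value in column `counts`, then drop that column."""
    return (frame.loc[frame.index.repeat(frame[counts])]
                 .drop(columns=counts)
                 .reset_index(drop=True))

# uncount(what_i_have, 'n') reproduces what_i_want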

Not the best solution, but I want to share it: you could also use DataFrame.reindex() together with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the .index.
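Chaining the two steps from this answer together:
df.reindex(df.index.repeat(df.n)).drop('n', axis=1).reset_index(drop=True)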

Related

Create new column with a list of max frequency values for each row of a pandas dataframe

Given this Dataframe:
df2 = pd.DataFrame([[3,3,3,3,3,3,5,5,5,5],[2,2,2,2,8,8,8,8,6,6]], columns=list('ABCDEFGHIJ'))
A B C D E F G H I J
0 3 3 3 3 3 3 5 5 5 5
1 2 2 2 2 8 8 8 8 6 6
I created 2 new columns which give, for each row, the max_freq and the max_freq_value:
df2["max_freq_val"] = df2.apply(lambda x: x.mode().agg(list), axis=1)
df2["max_freq"] = df2.loc[:, df2.columns != "max_freq_val"].apply(lambda x: x.value_counts().max(), axis=1)
A B C D E F G H I J max_freq_val max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
EDIT: I've edited my code, inspired by the answer given by @rhug123.
Thanks to all of you for your answers.
Try this, it uses mode()
df2.assign(max_freq=pd.Series(df2.mode(axis=1).stack().groupby(level=0).agg(list)),
           max_freq_value=df2.eq(df2.mode(axis=1)[0].squeeze(), axis=0).sum(axis=1))
or
df2.assign(freq=df2.eq((s := df2.mode(axis=1).stack().groupby(level=0).agg(list)).str[0], axis=0).sum(axis=1),
           val=s)
We can try stack, then count the values per row and use agg to collect tied maxima into a list:
s = df2.stack().groupby(level=0).value_counts()
s = s[s.eq(s.max(level=0), level=0)].reset_index(level=1).groupby(level=0).agg(val=('level_1', list), fre=(0, 'first'))
df2 = df2.join(s)
df2
Out[156]:
A B C D E F G H I J val fre
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
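Newer pandas releases (2.0+) removed the level= argument of reductions like Series.max used above; a hedged sketch of the same idea using groupby/transform, keeping this answer's val/fre names and assuming the df2 and imports from above:
s = df2.stack().groupby(level=0).value_counts()
top = s[s == s.groupby(level=0).transform('max')]   # keep only the tied maxima per row

# level 1 of the index holds the tied cell values, level 0 the original row label
vals = pd.Series(top.index.get_level_values(1), index=top.index.get_level_values(0))
summary = pd.DataFrame({'val': vals.groupby(level=0).agg(list),
                        'fre': top.groupby(level=0).first()})
df2.join(summary)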
Perhaps you could use this function:
import numpy as np

def give_back_maximums(a=[2, 2, 2, 2, 8, 8, 8, 8, 6, 6]):
    values, counts = np.unique(a, return_counts=True)
    return values[counts >= counts.max()].tolist()
Note that max_freq must be computed over the original columns only: the list column added on the line above is unhashable and would break the row-wise value_counts.
df2["max_freq_value"] = df2.apply(lambda x: give_back_maximums(x), axis=1)
df2["max_freq"] = df2.drop(columns="max_freq_value").apply(lambda x: x.value_counts().max(), axis=1)
print(df2)
A B C D E F G H I J max_freq_value max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
Hope it helps : )

Reshape a Pandas dataframe into multilevel columns [duplicate]

This question already has answers here: How can I pivot a dataframe? (5 answers). Closed 3 years ago.
I have a dataframe which looks like:
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
        "mod": ["a", "a", "b", "b"] * 4,
        "qid": [11, 12, 13, 14] * 4,
        "ans": ["Z", "Y", "X", "W", "V", "U", "T", "S",
                "R", "Q", "P", "O", "N", "M", "L", "K"],
    }
)
df
id mod qid ans
0 1 a 11 Z
1 1 a 12 Y
2 1 b 13 X
3 1 b 14 W
4 2 a 11 V
5 2 a 12 U
6 2 b 13 T
7 2 b 14 S
8 3 a 11 R
9 3 a 12 Q
10 3 b 13 P
11 3 b 14 O
12 4 a 11 N
13 4 a 12 M
14 4 b 13 L
15 4 b 14 K
Each value of qid fits within mod entirely. E.g., qid = 11 only occurs in mod = a.
I'd like to reshape the data into wide format, with mod and qid as column levels:
a b
11 12 13 14
id
1 Z Y X W
2 V U T S
3 R Q P O
4 N M L K
Is this possible in Pandas? I've tried pivot() with no luck.
Use pandas.pivot_table
pd.pivot_table(df, index='id', columns=['mod', 'qid'], aggfunc='first')
Output
ans
mod a b
qid 11 12 13 14
id
1 Z Y X W
2 V U T S
3 R Q P O
4 N M L K
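An alternative sketch that builds the (mod, qid) column levels without aggregating (there is exactly one ans per id/mod/qid combination, so no aggregation is needed):
# set_index + unstack moves mod and qid into the column MultiIndex
df.set_index(['id', 'mod', 'qid'])['ans'].unstack(['mod', 'qid'])

# on pandas >= 1.1, pivot also accepts a list for columns=
df.pivot(index='id', columns=['mod', 'qid'], values='ans')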

Why does dropping a column by index remove all columns with the same name?

import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df = pd.concat([df1,df2],axis=1)
Let's look at the concatenated df: the first and third columns share the same column name, A.
df
A B A C
0 14 1 14 5
1 4 2 4 6
2 5 3 5 7
3 4 4 4 8
I want to get the following format.
df
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
Dropping the column by its index:
result = df.drop(df.columns[2],axis=1)
result
B C
0 1 5
1 2 6
2 3 7
3 4 8
I can get what I expect this way:
import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df2 = df2.drop(df2.columns[0],axis=1)
df = pd.concat([df1,df2],axis=1)
It seems strange that both the first and third columns are removed when I drop a single column by its index.
1. Please tell me the reason for this behaviour.
2. How can I remove the third column while keeping the first column?
df.drop(df.columns[2], axis=1) drops by label: df.columns[2] is the label 'A', so every column named 'A' is removed. To drop by position instead, here's a way using the column indexes:
index_to_drop = 2
# get indexes to keep
col_idxs = [en for en, _ in enumerate(df.columns) if en != index_to_drop]
# subset the df
df = df.iloc[:,col_idxs]
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
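Another option, if the goal is simply to keep the first occurrence of each duplicated column label, is this hedged sketch using the concatenated df from the question:
# columns.duplicated() flags the second and later occurrences of each label
result = df.loc[:, ~df.columns.duplicated()]
which keeps the A, B, C layout the question asks for.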

Get value from another dataframe column based on condition

I have a dataframe like below:
>>> df1
a b
0 [1, 2, 3] 10
1 [4, 5, 6] 20
2 [7, 8] 30
and another like:
>>> df2
a
0 1
1 2
2 3
3 4
4 5
I need to create column 'c' in df2 from column 'b' of df1 whenever the 'a' value in df2 appears in the 'a' column of df1. In df1, each entry of column 'a' is a list.
I have tried to implement the approach from the following URL, but have had no luck so far:
https://medium.com/#Imaadmkhan1/using-pandas-to-create-a-conditional-column-by-selecting-multiple-columns-in-two-different-b50886fabb7d
The expected result is:
>>> df2
a c
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
Use Series.map, first flattening the values of df1 into a dictionary:
d = {c: b for a, b in zip(df1['a'], df1['b']) for c in a}
print (d)
{1: 10, 2: 10, 3: 10, 4: 20, 5: 20, 6: 20, 7: 30, 8: 30}
df2['new'] = df2['a'].map(d)
print (df2)
a new
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
EDIT: I think the problem is that column a mixes plain values with lists; the solution is to test for lists with if/else when building the dictionary:
d = {}
for a, b in zip(df1['a'], df1['b']):
    if isinstance(a, list):
        for c in a:
            d[c] = b
    else:
        d[a] = b
df2['new'] = df2['a'].map(d)
Use:
import numpy as np

m = pd.DataFrame({'a': np.concatenate(df1.a.values),
                  'b': df1.b.repeat(df1.a.str.len())})
df2.merge(m, on='a')
a b
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
First we unnest the lists in df1 into rows, then we merge the result with df2 on column a:
df1 = df1.set_index('b').a.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'a'})
print(df1, '\n')
df_final = df2.merge(df1, on='a')
print(df_final)
b a
0 10 1.0
1 10 2.0
2 10 3.0
0 20 4.0
1 20 5.0
2 20 6.0
0 30 7.0
1 30 8.0
a b
0 1 10
1 2 10
2 3 10
3 4 20
4 5 20
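On pandas 0.25+, the unnesting step can also be sketched with DataFrame.explode; the dtype cast is an assumption, since explode leaves the exploded column as object while df2['a'] holds plain ints:
m = df1.explode('a')                       # one row per list element
m['a'] = m['a'].astype(df2['a'].dtype)     # align key dtypes before merging
df2.merge(m, on='a', how='left').rename(columns={'b': 'c'})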

Column name and index of max value

I currently have a pandas dataframe where values between 0 and 1 are saved. I am looking for a function which can provide me the top 5 values of a column, together with the name of the column and the associated index of the values.
Sample Input: data frame with column names a:z, index 1:23, entries are values between 0 and 1
Sample Output: array of 5 highest entries in each column, each with column name and index
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like (for example for the first column):
[[8, b, A], [8, c, A], [6, i, A], [5, h, A], [4, g, A]].
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I'm going to use np.argpartition to separate each column into the 5 smallest and 10 - 5 (also 5) largest
v = df.values
i = df.index.values
k = len(v) - 5
pd.DataFrame(
    i[v.argpartition(k, 0)[-k:]],
    np.arange(k), df.columns
)
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g
# sort_values needs a column to sort by; take the first five rows for the top 5
print(your_dataframe.sort_values(by='A', ascending=False)[0:5])
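A hedged sketch that returns the [value, index label, column name] triples the question asks for, one list per column, using Series.nlargest on the df defined above:
top5 = {col: [[val, idx, col] for idx, val in df[col].nlargest(5).items()]
        for col in df.columns}
print(top5['A'])   # [[8, 'b', 'A'], [8, 'c', 'A'], [6, 'i', 'A'], [5, 'h', 'A'], [4, 'g', 'A']]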
