Reshape a Pandas dataframe into multilevel columns [duplicate] - python-3.x

This question already has answers here:
How can I pivot a dataframe?
I have a dataframe which looks like:
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
        "mod": ["a", "a", "b", "b"] * 4,
        "qid": [11, 12, 13, 14] * 4,
        "ans": ["Z", "Y", "X", "W", "V", "U", "T", "S", "R", "Q", "P", "O", "N", "M", "L", "K"],
    }
)
df
id mod qid ans
0 1 a 11 Z
1 1 a 12 Y
2 1 b 13 X
3 1 b 14 W
4 2 a 11 V
5 2 a 12 U
6 2 b 13 T
7 2 b 14 S
8 3 a 11 R
9 3 a 12 Q
10 3 b 13 P
11 3 b 14 O
12 4 a 11 N
13 4 a 12 M
14 4 b 13 L
15 4 b 14 K
Each value of qid fits within mod entirely. E.g., qid = 11 only occurs in mod = a.
I'd like to reshape the data into wide format, with mod and qid as column levels:
     a       b
    11  12  13  14
id
1    Z   Y   X   W
2    V   U   T   S
3    R   Q   P   O
4    N   M   L   K
Is this possible in Pandas? I've tried pivot() with no luck.

Use pandas.pivot_table
pd.pivot_table(df, index='id', columns=['mod', 'qid'], aggfunc='first')
Output
    ans
mod   a       b
qid  11  12  13  14
id
1     Z   Y   X   W
2     V   U   T   S
3     R   Q   P   O
4     N   M   L   K
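If no aggregation is wanted, the same shape can be reached by moving the key columns into the index and unstacking the two column levels. This is a minimal sketch, assuming each (id, mod, qid) combination occurs at most once, as in the question's data; the name wide is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
        "mod": ["a", "a", "b", "b"] * 4,
        "qid": [11, 12, 13, 14] * 4,
        "ans": list("ZYXWVUTSRQPONMLK"),
    }
)

# Index by all three keys, keep only the answer column as a Series,
# then pivot the "mod" and "qid" index levels into column levels.
wide = df.set_index(["id", "mod", "qid"])["ans"].unstack(["mod", "qid"])
print(wide)
```

Because a single Series is unstacked, the result has no extra top-level "ans" column, unlike the pivot_table output above. Duplicate key combinations would make unstack raise an error instead of silently picking one value.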

Related

Functional Programming: How does one create a new column to a dataframe that contains a multiindex column?

Suppose the simplified dataframe below. (The actual df is much, much bigger.) How does one assign values to a new column f such that f is a function of another column (e.g., e)? I'm pretty sure one needs to use apply or map, but I have never done this with a dataframe that has MultiIndex columns.
df = pd.DataFrame([[1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16]])
df.columns = pd.MultiIndex.from_tuples((("a", "d"), ("a", "e"), ("b", "d"), ("b","e")))
df
    a       b
    d   e   d   e
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
Desired output:
    a           b
    d   e   f   d   e   f
0   1   2   1   3   4   1
1   5   6   1   7   8  -1
2   9  10  -1  11  12  -1
3  13  14  -1  15  16  -1
I would like to apply the following lines and assign the result to a new column f. There are two problems. First, the last line, which contains the apply, doesn't work, but hopefully my intent is clear. Second, I'm unsure how to assign values to a new column of a dataframe with MultiIndex columns. I would like to be able to use functional programming methods.
lt = df.loc(axis=1)[:,'e'] < 8
gt = df.loc(axis=1)[:,'e'] >= 8
conditions = [lt, gt]
choices = [1, -1]
df.loc(axis=1)[:,'f'] = df.loc(axis=1)[:,'e'].apply(np.select(conditions, choices))
nms = [(i, 'f') for i, j in df.columns if j == 'e']
df[nms] = (df.iloc[:, [j == 'e' for i, j in df.columns]] < 8) * 2 - 1
df = df.sort_index(axis=1)
df
    a           b
    d   e   f   d   e   f
0   1   2   1   3   4   1
1   5   6   1   7   8  -1
2   9  10  -1  11  12  -1
3  13  14  -1  15  16  -1
EDIT:
for a custom ordering:
d = {i:j for j, i in enumerate(df.columns.levels[0])}
df1 = df.loc[:, sorted(df.columns, key = lambda x: d[x[0]])]
If the data is symmetric, i.e. every top-level group has the same sub-columns, you could do:
df.stack(0).assign(f = lambda x: 2*(x.e < 8) - 1).stack().unstack([1,2])
Out[]:
    a           b
    d   e   f   d   e   f
0   1   2   1   3   4   1
1   5   6   1   7   8  -1
2   9  10  -1  11  12  -1
3  13  14  -1  15  16  -1
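The np.select idea from the question can also be made to work by selecting the e sub-columns as a cross-section, computing the choices on the underlying array, and attaching the result under a new second-level label. This is a sketch under the question's sample data; the names e, f and out are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
df.columns = pd.MultiIndex.from_tuples([("a", "d"), ("a", "e"), ("b", "d"), ("b", "e")])

# Cross-section of every "e" sub-column; the remaining columns are the top-level labels.
e = df.xs("e", axis=1, level=1)
vals = e.to_numpy()

# Element-wise choice: 1 where e < 8, -1 where e >= 8.
f = pd.DataFrame(
    np.select([vals < 8, vals >= 8], [1, -1]),
    index=e.index,
    columns=pd.MultiIndex.from_product([e.columns, ["f"]]),
)

# Append the new columns and sort so each "f" sits next to its group.
out = pd.concat([df, f], axis=1).sort_index(axis=1)
print(out)
```

Building f as a separate frame with its own MultiIndex columns sidesteps the question of how to assign into non-existent (group, 'f') columns in place.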

Create new column with a list of max frequency values for each row of a pandas dataframe

Given this Dataframe:
df2 = pd.DataFrame([[3,3,3,3,3,3,5,5,5,5],[2,2,2,2,8,8,8,8,6,6]], columns=list('ABCDEFGHIJ'))
A B C D E F G H I J
0 3 3 3 3 3 3 5 5 5 5
1 2 2 2 2 8 8 8 8 6 6
I created two new columns which give, for each row, the most frequent value(s) (max_freq_val) and how often they occur (max_freq):
df2["max_freq_val"] = df2.apply(lambda x: x.mode().agg(list), axis=1)
df2["max_freq"] = df2.loc[:, df2.columns != "max_freq_val"].apply(lambda x: x.value_counts().max(), axis=1)
A B C D E F G H I J max_freq_val max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
EDIT: I've edited my code, inspired by the answer given by @rhug123.
Thanks to all of you for your answers.
Try this, it uses mode()
df2.assign(max_freq_val=pd.Series(df2.mode(axis=1).stack().groupby(level=0).agg(list)),
           max_freq=df2.eq(df2.mode(axis=1)[0].squeeze(), axis=0).sum(axis=1))
or
df2.assign(freq = df2.eq((s := df2.mode(axis=1).stack().groupby(level=0).agg(list)).str[0],axis=0).sum(axis=1),val = s)
We can stack the frame, count the values in each row, keep the values whose count equals the row maximum, and collect ties into a list with agg:
s = df2.stack().groupby(level=0).value_counts().rename('fre')
s = s[s.eq(s.groupby(level=0).transform('max'))].reset_index(level=1).groupby(level=0).agg(val=('level_1', list), fre=('fre', 'first'))
df2 = df2.join(s)
df2
Out[156]:
A B C D E F G H I J val fre
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
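Perhaps you could use this function:
import numpy as np

def give_back_maximums(a=[2, 2, 2, 2, 8, 8, 8, 8, 6, 6]):
    # Count each distinct value and return every value that reaches the top count.
    values, counts = np.unique(a, return_counts=True)
    return values[counts >= counts.max()].tolist()
Restrict both computations to the original columns, so that a newly added column does not affect the result of the other:
cols = list('ABCDEFGHIJ')
df2["max_freq_value"] = df2[cols].apply(lambda x: give_back_maximums(x), axis=1)
df2["max_freq"] = df2[cols].apply(lambda x: x.value_counts().max(), axis=1)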
print(df2)
A B C D E F G H I J max_freq_value max_freq
0 3 3 3 3 3 3 5 5 5 5 [3] 6
1 2 2 2 2 8 8 8 8 6 6 [2, 8] 4
Hope it helps : )

Pandas DataFrame copy with condition [duplicate]

This question already has answers here:
How can I replicate rows of a Pandas DataFrame?
I want to replicate rows in a Pandas Dataframe. Each row should be repeated n times, where n is a field of each row.
import pandas as pd
what_i_have = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [  1,   2,   3],
    'v' : [ 10,  13,   8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'v' : [ 10,  13,  13,   8,   8,   8]
})
Is this possible?
You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:
df2 = df.loc[df.index.repeat(df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
Or you could use np.repeat to get the repeated indices and then use that to index into the frame:
df2 = df.loc[np.repeat(df.index.values, df.n)]
id n v
0 A 1 10
1 B 2 13
1 B 2 13
2 C 3 8
2 C 3 8
2 C 3 8
After which there's only a bit of cleaning up to do:
df2 = df2.drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Note that if you might have duplicate indices to worry about, you could use .iloc instead:
df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
which uses the positions, and not the index labels.
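To see why positions matter, here is a small sketch with a hypothetical frame whose index contains a duplicate label (the duplicated index is an assumption made for illustration; the question's frame has a unique default index). Label-based selection returns every row that matches each repeated label, so it yields more rows than requested, while position-based selection repeats each row exactly n times.

```python
import numpy as np
import pandas as pd

# Hypothetical data: same values as the question, but label 0 appears twice in the index.
df = pd.DataFrame(
    {"id": ["A", "B", "C"], "n": [1, 2, 3], "v": [10, 13, 8]},
    index=[0, 0, 1],
)

# Label-based: each repeated label 0 matches both rows labelled 0.
by_label = df.loc[df.index.repeat(df["n"])]

# Position-based: row i is repeated exactly df["n"].iloc[i] times.
by_position = df.iloc[np.repeat(np.arange(len(df)), df["n"])]

print(len(by_label), len(by_position))
print(by_position.drop("n", axis=1).reset_index(drop=True))
```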
You could use set_index and repeat
In [1057]: df.set_index(['id'])['v'].repeat(df['n']).reset_index()
Out[1057]:
id v
0 A 10
1 B 13
2 B 13
3 C 8
4 C 8
5 C 8
Details
In [1058]: df
Out[1058]:
id n v
0 A 1 10
1 B 2 13
2 C 3 8
It's something like the uncount in tidyr:
https://tidyr.tidyverse.org/reference/uncount.html
I wrote a package (https://github.com/pwwang/datar) that implements this API:
from datar import f
from datar.tibble import tribble
from datar.tidyr import uncount
what_i_have = tribble(
f.id, f.n, f.v,
'A', 1, 10,
'B', 2, 13,
'C', 3, 8
)
what_i_have >> uncount(f.n)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
Not the best solution, but I want to share this: you could also use DataFrame.reindex() together with Index.repeat():
df.reindex(df.index.repeat(df.n)).drop('n', axis=1)
Output:
id v
0 A 10
1 B 13
1 B 13
2 C 8
2 C 8
2 C 8
You can further append .reset_index(drop=True) to reset the .index.

Use a list of indexes to create a subset df

I have a dataframe from which I've selected the index of certain rows:
x=df[df['column'].str.contains('foo')].index
If I then try to make a new df from the selected index values of the original df with:
df2=df[x]
the following message pops up:
KeyError: "Int64Index([ 48, 64, 98, 118, 120, 128, 138, 144, 151,\n 166,\n ...\n 15892, 15893, 15894, 15895, 15896, 15897, 15898, 15899, 15900,\n 15901],\n dtype='int64', length=4711) not in index"
Those index values are in the dataframe, as df.iloc[48] returns a value.
Anyone got any ideas?
I believe you need loc - select by index values:
x=df.index[df['column'].str.contains('foo')]
df2=df.loc[x]
#if default monotonic index - 0,1,..., len(df) - 1
#df2=df.iloc[x]
Sample:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
x=df.index[df['F'].str.contains('b')]
print (x)
Int64Index([3, 4, 5], dtype='int64')
df2=df.loc[x]
print (df2)
A B C D E F
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
Simpler is to use boolean indexing only:
df2=df[df['F'].str.contains('b')]
print (df2)
A B C D E F
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
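The KeyError in the question comes from df[x] interpreting a list-like key as column labels, whereas df.loc[x] looks the values up in the row index. A minimal sketch with a hypothetical three-row frame (the data and the name subset are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with a non-default integer index.
df = pd.DataFrame(
    {"column": ["foo1", "bar", "foo2"], "value": [1, 2, 3]},
    index=[10, 20, 30],
)

# Row labels of the matching rows.
x = df.index[df["column"].str.contains("foo")]

# df[x] searches the COLUMN labels for 10 and 30, which do not exist.
try:
    df[x]
    raised = False
except KeyError:
    raised = True

# df.loc[x] searches the ROW labels instead.
subset = df.loc[x]
print(raised)
print(subset)
```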

Column name and index of max value

I currently have a pandas dataframe where values between 0 and 1 are saved. I am looking for a function which can provide me the top 5 values of a column, together with the name of the column and the associated index of the values.
Sample Input: data frame with column names a:z, index 1:23, entries are values between 0 and 1
Sample Output: array of 5 highest entries in each column, each with column name and index
Edit:
For the following data frame:
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
I would like to get an output like (for example for the first column):
[[8,b,A], [8, c, A], [6,i,A], [5, h, A], [4,g,A]].
consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(
np.random.randint(10, size=(10, 4)), list('abcdefghij'), list('ABCD'))
df
A B C D
a 0 2 7 3
b 8 7 0 6
c 8 6 0 2
d 0 4 9 7
e 3 2 4 3
f 3 6 7 7
g 4 5 3 7
h 5 9 8 7
i 6 4 7 6
j 2 6 6 5
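I'm going to use np.argpartition to separate each column into the 5 smallest and the 10 - 5 (also 5) largest values, then look up the index labels of the largest ones:
v = df.values
i = df.index.values
n = 5            # number of largest values wanted per column
k = len(v) - n   # position that separates the smallest from the n largest
pd.DataFrame(
    i[v.argpartition(k, 0)[k:]],
    np.arange(n), df.columns
)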
A B C D
0 g f i i
1 b c a d
2 h h f h
3 i b d f
4 c j h g
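Alternatively, sort a single column in descending order (here 'A', as an example) and take the first five rows:
print(your_dataframe['A'].sort_values(ascending=False).head(5))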
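To get the [value, index, column] triples the question asks for, one option is Series.nlargest per column. This is a sketch rather than the answerers' method; the helper top_n is a hypothetical name, and the frame is typed out from the question's printed data so that it does not depend on the random seed.

```python
import pandas as pd

# The sample frame from the question, written out explicitly.
df = pd.DataFrame(
    {
        "A": [0, 8, 8, 0, 3, 3, 4, 5, 6, 2],
        "B": [2, 7, 6, 4, 2, 6, 5, 9, 4, 6],
        "C": [7, 0, 0, 9, 4, 7, 3, 8, 7, 6],
        "D": [3, 6, 2, 7, 3, 7, 7, 7, 6, 5],
    },
    index=list("abcdefghij"),
)

def top_n(frame, n=5):
    """Map each column name to a list of [value, index label, column name]."""
    out = {}
    for col in frame.columns:
        largest = frame[col].nlargest(n)  # sorted from largest to smallest
        out[col] = [[int(v), idx, col] for idx, v in largest.items()]
    return out

top = top_n(df)
print(top["A"])
```

When several rows tie for the last place, nlargest keeps only n of them by default; passing keep='all' to nlargest would return every tied row instead.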
