I have the following data in an Excel sheet and I want to read it as a multiindex dataframe:
Y1 Y1 Y2 Y2
B H1 H2 H1 H2
1 80 72 79.2 84.744
2 240 216 237.6 254.232
3 40 36 39.6 42.372
4 160 144 158.4 169.488
5 240 216 237.6 254.232
6 0 0 0 0
I am reading it as:
DATA = pd.read_excel('data.xlsx',sheet_name=None)
since I am reading other sheets too.
Question 1:
This data is not read as a multi-index dataframe. How do I make read_excel parse it as a MultiIndex? Or should I read it as a plain dataframe and then convert it to multi-index?
Current result of reading as dataframe
DATA['Load']
Y1 Y1.1 Y2 Y2.1
bus H1 H2 H1 H2
1 80 72 79.2 84.744
2 240 216 237.6 254.232
3 40 36 39.6 42.372
4 160 144 158.4 169.488
5 240 216 237.6 254.232
6 0 0 0 0
Question 2 and probably the more fundamental question:
How do I deal with multi-indexing when one or more of the index levels are on the columns side? In this example, I want to access the data by specifying B, Y, and H. I know how to work with a MultiIndex when all the levels are on the rows, but I can't get the hang of it when some of them are on the columns.
Thank you very much for your help :)
PS:
Another sheet may look like the following:
from to x ratea
1 2 0.4 10
1 4 0.6 80
1 5 0.2 10
2 3 0.2 10
2 4 0.4 10
2 6 0.3 10
3 5 0.2 10
4 6 0.3 10
where I will set from and to as the index (set_index(['from','to'])) to get a multi-index dataframe.
To read a dataframe like this into a MultiIndex, use the header parameter in pd.read_excel():
df = pd.read_excel('myFile.xlsx', header=[0,1])
Y1 Y2
B H1 H2 H1 H2
1 80 72 79.2 84.744
2 240 216 237.6 254.232
3 40 36 39.6 42.372
4 160 144 158.4 169.488
5 240 216 237.6 254.232
6 0 0 0.0 0.000
This tells pandas that rows 0 and 1 of the sheet are both header rows, and it builds the column MultiIndex from them.
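To answer Question 2 as well: once the columns are a MultiIndex, a cell is addressed by the row label B plus a (Y, H) tuple for the column. A minimal sketch, rebuilding two rows of the sample by hand rather than from the Excel file:

```python
import pandas as pd

# Rebuild the sample frame with a two-level column index,
# as read_excel(header=[0, 1]) would produce it
cols = pd.MultiIndex.from_product([['Y1', 'Y2'], ['H1', 'H2']])
df = pd.DataFrame(
    [[80, 72, 79.2, 84.744],
     [240, 216, 237.6, 254.232]],
    index=pd.Index([1, 2], name='B'),
    columns=cols,
)

# Select a single value by B (row label) and a (Y, H) column tuple
print(df.loc[1, ('Y2', 'H1')])  # 79.2

# Or slice one year across all hours with .xs
print(df.xs('Y1', axis=1, level=0))
```

The tuple form works anywhere a column label would: `df[('Y1', 'H2')]` gives the whole column as a Series.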
Following up after our conversation, here is how to get from the two-level columns to a fully row-indexed frame:
df = pd.read_excel('Book2.xlsx', header=[0,1])
df2 = df.unstack().to_frame()
idx = df2.swaplevel(0,2).swaplevel(1,2).index.set_names(['B', 'Y', 'H'])
df2.set_index(idx, inplace=True)
0
B Y H
1 Y1 H1 80.000
2 Y1 H1 240.000
3 Y1 H1 40.000
4 Y1 H1 160.000
5 Y1 H1 240.000
6 Y1 H1 0.000
1 Y1 H2 72.000
2 Y1 H2 216.000
3 Y1 H2 36.000
4 Y1 H2 144.000
5 Y1 H2 216.000
6 Y1 H2 0.000
1 Y2 H1 79.200
2 Y2 H1 237.600
3 Y2 H1 39.600
4 Y2 H1 158.400
5 Y2 H1 237.600
6 Y2 H1 0.000
1 Y2 H2 84.744
2 Y2 H2 254.232
3 Y2 H2 42.372
4 Y2 H2 169.488
5 Y2 H2 254.232
6 Y2 H2 0.000
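An alternative to the unstack/swaplevel route above is stack, which moves column levels straight into the row index; a sketch on hand-built sample data (the level names Y and H are my labels, matching the ones set above):

```python
import pandas as pd

cols = pd.MultiIndex.from_product([['Y1', 'Y2'], ['H1', 'H2']],
                                  names=['Y', 'H'])
df = pd.DataFrame(
    [[80, 72, 79.2, 84.744],
     [240, 216, 237.6, 254.232]],
    index=pd.Index([1, 2], name='B'),
    columns=cols,
)

# Move both column levels into the row index: a Series indexed by (B, Y, H)
long = df.stack(['Y', 'H'])
print(long.loc[(1, 'Y2', 'H1')])  # 79.2
```

This gives a (B, Y, H)-indexed Series in one call, with no level swapping needed.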
Related
I've a dataframe like this:
pd.DataFrame({'time':['01-01-2020','02-01-2020','01-01-2020','02-01-2020'],'level':['q','q','r','r'],'a':[1,2,3,4],'b':[12,34,54,67],'c':[18,29,39,47],'a_1':[0.1,0.2,0.3,0.4],'a_2':[0,1,0,1],'b_1':[0.28,0.47,0.02,0.05],'b_2':[1,1,0,1],'c_1':[0.18,0.40,0.12,0.01],'c_2':[1,1,0,0]})
>> time level a b c a_1 a_2 b_1 b_2 c_1 c_2
0 01-01-2020 q 1 12 18 0.1 0 0.28 1 0.18 1
1 02-01-2020 q 2 34 29 0.2 1 0.47 1 0.40 1
2 01-01-2020 r 3 54 39 0.3 0 0.02 0 0.12 0
3 02-01-2020 r 4 67 47 0.4 1 0.05 1 0.01 0
I wish to melt the data with time and level as the index and turn all the other columns into rows, keeping only the ones whose corresponding flag column is 1. E.g., I want the values of a and a_1 listed under values and items only when a_2 is 1.
Desired output:
>> time level column values items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
I can get all the values irrespective of the flags and then filter for flag == 1, but I'm not sure how to "melt"/"unstack" in this case. I have tried a lot of ways, but in vain. Please help me out here.
Let's try with melt:
i, c = ['time', 'level'], pd.Index(['a', 'b','c'])
# mask the values where flag=0
m = df[c + '_1'].mask(df[c + '_2'].eq(0).values)
# melt the dataframe & assign the items column
s = df[[*i, *c]].melt(i, var_name='columns')\
.assign(items=m.values.T.reshape((-1, 1)))
# drop the nan values and sort the dataframe
s = s.dropna(subset=['items']).sort_values(i, ignore_index=True)
Details:
mask the values in the columns ending with suffix _1 where the value in the corresponding flag column (suffix _2) equals 0:
a_1 b_1 c_1
0 NaN 0.28 0.18
1 0.2 0.47 0.40
2 NaN NaN NaN
3 0.4 0.05 NaN
melt the dataframe containing the columns a, b, c, then reshape the masked values and assign a new column items in melted dataframe:
time level columns value items
0 01-01-2020 q a 1 NaN
1 02-01-2020 q a 2 0.20
2 01-01-2020 r a 3 NaN
3 02-01-2020 r a 4 0.40
4 01-01-2020 q b 12 0.28
5 02-01-2020 q b 34 0.47
6 01-01-2020 r b 54 NaN
7 02-01-2020 r b 67 0.05
8 01-01-2020 q c 18 0.18
9 02-01-2020 q c 29 0.40
10 01-01-2020 r c 39 NaN
11 02-01-2020 r c 47 NaN
Lastly drop the NaN values in items and sort the values on time and level to get the final result:
time level columns value items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
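Putting the whole approach together as a self-contained script on the question's data (I use reshape(-1) in place of reshape((-1, 1)); the flattened form is equivalent here):

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['01-01-2020', '02-01-2020', '01-01-2020', '02-01-2020'],
    'level': ['q', 'q', 'r', 'r'],
    'a': [1, 2, 3, 4], 'b': [12, 34, 54, 67], 'c': [18, 29, 39, 47],
    'a_1': [0.1, 0.2, 0.3, 0.4], 'a_2': [0, 1, 0, 1],
    'b_1': [0.28, 0.47, 0.02, 0.05], 'b_2': [1, 1, 0, 1],
    'c_1': [0.18, 0.40, 0.12, 0.01], 'c_2': [1, 1, 0, 0],
})

i, c = ['time', 'level'], pd.Index(['a', 'b', 'c'])

# mask the *_1 values where the matching *_2 flag is 0
m = df[c + '_1'].mask(df[c + '_2'].eq(0).values)

# melt a, b, c, then attach the masked items in the same column-major order
s = (df[[*i, *c]]
     .melt(i, var_name='columns')
     .assign(items=m.values.T.reshape(-1)))

# drop the flagged-out rows and sort on time and level
s = s.dropna(subset=['items']).sort_values(i, ignore_index=True)
print(s)
```

Seven rows survive the flag filter, matching the desired output above.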
There may be a more elegant way, but this works. Extract data for each column name (a, b, c), select those that have flags set to 1 and concatenate the results.
df.set_index(['time', 'level'], inplace=True)
parts = []
for name in 'a', 'b', 'c':
    d = df[[name, f'{name}_1', f'{name}_2']]\
        .rename(columns={name: 'values', f'{name}_1': 'items', f'{name}_2': 'flag'})
    d['column'] = name
    parts.append(d[d.flag == 1])
pd.concat(parts)[['column', 'values', 'items']].reset_index()
Step 1: reorder the columns such that the numbers come before the letters:
res = df.copy()
res.columns = ["_".join(entry.split("_")[::-1]) for entry in res]
Step 2: reorder the columns (again) such that "num" is prefixed if the column is in ("a", "b", "c"):
res.columns = [f"num_{letter}" if letter in ("a", "b", "c") else letter
               for letter in res]
res
time level num_a num_b num_c 1_a 2_a 1_b 2_b 1_c 2_c
0 01-01-2020 q 1 12 18 0.1 0 0.28 1 0.18 1
1 02-01-2020 q 2 34 29 0.2 1 0.47 1 0.40 1
2 01-01-2020 r 3 54 39 0.3 0 0.02 0 0.12 0
3 02-01-2020 r 4 67 47 0.4 1 0.05 1 0.01 0
Step 3: Use pandas wide_to_long to reshape the data, filter for the rows where the flag equals 1, rename the columns, and finally reset the index:
(
    pd.wide_to_long(
        res,
        stubnames=["num", "1", "2"],
        i=["time", "level"],
        j="column",
        sep="_",
        suffix=".",
    )
    # this is where the filter for rows equal to 1 occurs
    .query("`2`==1")
    .drop(columns="2")
    .set_axis(["values", "items"], axis="columns")
    .reset_index()
)
time level column values items
0 01-01-2020 q b 12 0.28
1 01-01-2020 q c 18 0.18
2 02-01-2020 q a 2 0.20
3 02-01-2020 q b 34 0.47
4 02-01-2020 q c 29 0.40
5 02-01-2020 r a 4 0.40
6 02-01-2020 r b 67 0.05
This is another way, with the same idea of renaming the columns, which makes it easy to reshape with wide_to_long:
import re

result = df.rename(
    columns=lambda x: f"values_{x}" if x in ("a", "b", "c")
    else f"items_{x[0]}" if re.search(".1$", x)
    else f"equals1_{x[0]}" if re.search(".2$", x)
    else x
)
(
    pd.wide_to_long(
        result,
        stubnames=["values", "items", "equals1"],
        i=["time", "level"],
        j="column",
        sep="_",
        suffix=".",
    )
    .query("equals1==1")
    .iloc[:, :-1]
    .reset_index()
)
Another option is the pivot_longer function from pyjanitor, using the .value placeholder:
# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(index = ['time', 'level'],
names_to = ["column", ".value"],
names_pattern = r"(.)_?(.?)",
sort_by_appearance = True)
.query('`2` == 1')
.drop(columns = '2')
.rename(columns={'':'values', '1':'items'})
)
time level column values items
1 01-01-2020 q b 12 0.28
2 01-01-2020 q c 18 0.18
3 02-01-2020 q a 2 0.20
4 02-01-2020 q b 34 0.47
5 02-01-2020 q c 29 0.40
9 02-01-2020 r a 4 0.40
10 02-01-2020 r b 67 0.05
I have two dataframes df1 and df2:
data1 = {'A':[1,3,2,1.4,2,1,2,4], 'B':[10,30,20,1.4,2,78,2,78],'C':[200,340,20,180,2,201,2,100]}
df1 = pd.DataFrame(data1)
print(df1)
A B C
0 1.0 10.0 200
1 3.0 30.0 340
2 2.0 20.0 20
3 1.4 1.4 180
4 2.0 2.0 2
5 1.0 78.0 201
6 2.0 2.0 2
7 4.0 78.0 100
data2 = {'D':['a1','a2','a3','a4',2,1,'a3',4], 'E':['b1','b2',20,1.4,2,78,2,78],'F':[200,340,'c1',180,2,'c2',2,100]}
df2 = pd.DataFrame(data2)
print(df2)
D E F
0 a1 b1 200
1 a2 b2 340
2 a3 20 c1
3 a4 1.4 180
4 2 2 2
5 1 78 c2
6 a3 2 2
7 4 78 100
I want to insert df2 into df1, replacing column B of df1. How can one dataframe be inserted into another by replacing a column?
desired result:
A D E F C
0 1.0 a1 b1 200 200
1 3.0 a2 b2 340 340
2 2.0 a3 20 c1 20
3 1.4 a4 1.4 180 180
4 2.0 2 2 2 2
5 1.0 1 78 c2 201
6 2.0 a3 2 2 2
7 4.0 4 78 100 100
The idea is to use concat, selecting columns by position with DataFrame.iloc and Index.get_loc, and finally remove the original column with DataFrame.drop:
c = 'B'
pos = df1.columns.get_loc(c)
df = pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]], axis=1).drop(c, axis=1)
print (df)
A D E F C
0 1.0 a1 b1 200 200
1 3.0 a2 b2 340 340
2 2.0 a3 20 c1 20
3 1.4 a4 1.4 180 180
4 2.0 2 2 2 2
5 1.0 1 78 c2 201
6 2.0 a3 2 2 2
7 4.0 4 78 100 100
Tested with other columns:
c = 'A'
pos = df1.columns.get_loc(c)
df = pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]], axis=1).drop(c, axis=1)
print (df)
D E F B C
0 a1 b1 200 10.0 200
1 a2 b2 340 30.0 340
2 a3 20 c1 20.0 20
3 a4 1.4 180 1.4 180
4 2 2 2 2.0 2
5 1 78 c2 78.0 201
6 a3 2 2 2.0 2
7 4 78 100 78.0 100
c = 'C'
pos = df1.columns.get_loc(c)
df = pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]], axis=1).drop(c, axis=1)
print (df)
A B D E F
0 1.0 10.0 a1 b1 200
1 3.0 30.0 a2 b2 340
2 2.0 20.0 a3 20 c1
3 1.4 1.4 a4 1.4 180
4 2.0 2.0 2 2 2
5 1.0 78.0 1 78 c2
6 2.0 2.0 a3 2 2
7 4.0 78.0 4 78 100
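Since the three snippets differ only in the column name, the pattern can be wrapped in a small helper; the function name insert_replacing is mine, not a pandas API:

```python
import pandas as pd

def insert_replacing(df1, df2, col):
    """Insert all columns of df2 at the position of `col` in df1,
    dropping `col` itself."""
    pos = df1.columns.get_loc(col)
    return pd.concat([df1.iloc[:, :pos], df2, df1.iloc[:, pos:]],
                     axis=1).drop(col, axis=1)

df1 = pd.DataFrame({'A': [1, 2], 'B': [10, 20], 'C': [100, 200]})
df2 = pd.DataFrame({'D': ['a', 'b'], 'E': ['c', 'd']})

print(insert_replacing(df1, df2, 'B').columns.tolist())  # ['A', 'D', 'E', 'C']
```

Both frames must share the same row index for the concat alignment to hold.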
IIUC, we can use itertools with reindex:
import itertools

# expand 'B' into df2's column names; keep every other column as-is
l = list(itertools.chain.from_iterable(
    list(df2) if item == 'B' else [item] for item in list(df1)))
pd.concat([df1, df2], axis=1).reindex(columns=l)
A D E F C
0 1.0 a1 b1 200 200
1 3.0 a2 b2 340 340
2 2.0 a3 20 c1 20
3 1.4 a4 1.4 180 180
4 2.0 2 2 2 2
5 1.0 1 78 c2 201
6 2.0 a3 2 2 2
7 4.0 4 78 100 100
I have three columns in a dataframe, X1, X2, X3. For each column I want to count the zero entries that appear right after a value greater than 1; if the value just before the zeros is less than 1, those zeros should not be counted.
input df:
df1=pd.DataFrame({'x1':[3,4,7,0,0,0,0,20,15,16,0,0,70],
'X2':[3,4,7,0,0,0,0,20,15,16,0,0,70],
'X3':[6,3,0.5,0,0,0,0,20,15,16,0,0,70]})
print(df1)
x1 X2 X3
0 3 3 6.0
1 4 4 3.0
2 7 7 0.5
3 0 0 0.0
4 0 0 0.0
5 0 0 0.0
6 0 0 0.0
7 20 20 20.0
8 15 15 15.0
9 16 16 16.0
10 0 0 0.0
11 0 0 0.0
12 70 70 70.0
desired_output
x1_count X2_count X3_count
0 6 6 2
The idea is to replace 0 with missing values and forward fill them, then keep only the positions that were originally 0, check which of those are greater than 1, and count the Trues with sum; the resulting Series is converted to a one-row DataFrame with a transpose:
m = df1.eq(0)
df2 = (df1.mask(m)
          .ffill()
          .where(m)
          .gt(1)
          .sum()
          .add_suffix('_count')
          .to_frame()
          .T)
print (df2)
x1_count X2_count X3_count
0 6 6 2
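To see what each step of the chain contributes, here is the pipeline run on the X3 column alone, the only one where the two kinds of zero runs differ:

```python
import pandas as pd

s = pd.Series([6, 3, 0.5, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70], name='X3')

m = s.eq(0)                  # True at the zero positions
filled = s.mask(m).ffill()   # each zero replaced by the preceding nonzero value
at_zeros = filled.where(m)   # keep only the positions that were originally zero
count = at_zeros.gt(1).sum() # count zeros whose predecessor was > 1

print(count)  # 2: rows 10-11 follow 16; rows 3-6 follow 0.5, so they don't count
```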
I have the following data frame
item_id group price
0 1 A 10
1 3 A 30
2 4 A 40
3 6 A 60
4 2 B 20
5 5 B 50
I am looking to add a quantile column based on the price for each group like below:
item_id group price quantile
01 A 10 0.25
03 A 30 0.5
04 A 40 0.75
06 A 60 1.0
02 B 20 0.5
05 B 50 1.0
I could loop over the entire data frame and perform the computation for each group. However, I am wondering: is there a more elegant way to do this? Thanks!
You need df.rank() with pct=True:
pct : bool, default False
Whether or not to display the returned rankings in percentile form.
df['quantile']=df.groupby('group')['price'].rank(pct=True)
print(df)
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.50
2 4 A 40 0.75
3 6 A 60 1.00
4 2 B 20 0.50
5 5 B 50 1.00
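One caveat worth knowing about rank(pct=True): tied prices receive the average rank by default, and the method parameter controls how ties are broken. A quick sketch with a tie (made-up values, not from the question):

```python
import pandas as pd

s = pd.Series([10, 30, 30, 60])

# Default method='average': the two 30s share rank (2+3)/2 = 2.5
print(s.rank(pct=True).tolist())                  # [0.25, 0.625, 0.625, 1.0]

# method='first' breaks ties by order of appearance
print(s.rank(method='first', pct=True).tolist())  # [0.25, 0.5, 0.75, 1.0]
```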
Although the df.rank method above is probably the way to go for this problem, here's another solution using pd.qcut with GroupBy:
df['quantile'] = (
    df.groupby('group')['price']
      .apply(lambda x: pd.qcut(x, q=len(x), labels=False)
                         .add(1)
                         .div(len(x)))
)
item_id group price quantile
0 1 A 10 0.25
1 3 A 30 0.50
2 4 A 40 0.75
3 6 A 60 1.00
4 2 B 20 0.50
5 5 B 50 1.00
I have a pandas data frame dat as below:
0 1 0 1 0 1
0 A 23 0.1 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.2 235 87 5
3 D 13 0.8 567 42 6
4 E 5 0.9 356 12 2
As you can see above, the column labels are 0, 1, 0, 1, 0, 1, etc. I want to renumber them from the beginning as 0, 1, 2, 3, 4, ..., so I tried:
dat = dat.reset_index(drop=True)
but the labels were not changed. How do I renumber the columns in this case? Thanks in advance.
reset_index operates on the row index, not the columns; to renumber the columns, assign to them directly:
dat.columns = range(dat.shape[1])
There are other ways, but note that label-based renames will not work here because the column labels are duplicated: rename(columns=dict(zip(dat.columns, range(dat.shape[1])))) would map every 0 to the same new name. A positional assignment does work:
dat = dat.set_axis(pd.RangeIndex(dat.shape[1]), axis=1)
which gives:
0 1 2 3 4 5
0 A 23 0.10 122 56 9
1 B 24 0.45 564 36 3
2 C 25 0.20 235 87 5
3 D 13 0.80 567 42 6
4 E 5 0.90 356 12 2
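A sketch of why label-based renaming is fragile with duplicated column labels, and the positional alternative (the data is a trimmed version of the question's first row):

```python
import pandas as pd

dat = pd.DataFrame([['A', 23, 0.1, 122, 56, 9]],
                   columns=[0, 1, 0, 1, 0, 1])

# A label-based mapping collapses duplicates: zip pairs each label with a
# position, but the dict keeps only the *last* pair per label
mapping = dict(zip(dat.columns, range(dat.shape[1])))
print(dat.rename(columns=mapping).columns.tolist())  # [4, 5, 4, 5, 4, 5]

# Positional assignment renumbers each column independently
fixed = dat.set_axis(pd.RangeIndex(dat.shape[1]), axis=1)
print(fixed.columns.tolist())  # [0, 1, 2, 3, 4, 5]
```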