I have a dataset which looks like this:
time channel min sd mag. frequency
12:00 X 12.0 2.3 x11 fx11
12:00 X 12.0 2.3 x12 fx12
12:00 X 12.0 2.3 x13 fx13
12:00 X 12.0 2.3 x14 fx14
12:00 X 12.0 2.3 x15 fx15
12:00 Y 17.0 2.7 y11 fy11
12:00 Y 17.0 2.7 y12 fy12
12:00 Y 17.0 2.7 y13 fy13
12:00 Y 17.0 2.7 y14 fy14
12:00 Y 17.0 2.7 y15 fy15
12:00 Z 15.0 4.3 z11 fz11
12:00 Z 15.0 4.3 z12 fz12
12:00 Z 15.0 4.3 z13 fz13
12:00 Z 15.0 4.3 z14 fz14
12:00 Z 15.0 4.3 z15 fz15
12:01 X 13.0 4.9 x21 fx21
.... ... ... ... ... .....
.... ..... .... ... .... ..... ....
As you can see, for channels X, Y and Z the 'time', 'min' and 'sd' entries repeat 5 times, while 'mag.' and 'frequency' change in each row. The shape of this dataset is (740231, 6), and these 15 rows for channels X, Y, Z keep repeating as described above.
I would like to get rid of this repetition and transform the dataset like this:
time channel min sd m1 f1 m2 f2 m3 f3 m4 f4 m5 f5
12:00 X 12.0 2.3 x11 fx11 x12 fx12 x13 fx13 x14 fx14 x15 fx15
12:00 Y 17.0 2.7 y11 fy11 y12 fy12 y13 fy13 y14 fy14 y15 fy15
12:00 Z 15.0 4.3 z11 fz11 z12 fz12 z13 fz13 z14 fz14 z15 fz15
12:01 X 13.0 4.9 x21 fx21 x22 fx22 x23 fx23 x24 fx24 x25 fx25
.... ... ..... ... .... ..... .... ..... .... .... ....
.... ..... .... .... .... ... .... ..... .... .... ... ... ...
which means that 15 rows x 6 columns are transformed into 3 rows x 14 columns.
Any suggestion is appreciated. Many thanks for your time.
Best Regards,
pooja
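To make the answer below reproducible, here is a minimal sketch reconstructing the sample data; the placeholder strings ('x11', 'fx11', ...) stand in for the real magnitude and frequency values:

import pandas as pd

# minimal reconstruction of the sample above, including the first 12:01 row
df = pd.DataFrame({
    'time':      ['12:00'] * 15 + ['12:01'],
    'channel':   ['X'] * 5 + ['Y'] * 5 + ['Z'] * 5 + ['X'],
    'min':       [12.0] * 5 + [17.0] * 5 + [15.0] * 5 + [13.0],
    'sd':        [2.3] * 5 + [2.7] * 5 + [4.3] * 5 + [4.9],
    'mag.':      [f'x1{i}' for i in range(1, 6)]
                 + [f'y1{i}' for i in range(1, 6)]
                 + [f'z1{i}' for i in range(1, 6)] + ['x21'],
    'frequency': [f'fx1{i}' for i in range(1, 6)]
                 + [f'fy1{i}' for i in range(1, 6)]
                 + [f'fz1{i}' for i in range(1, 6)] + ['fx21'],
})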
If the ordering of the output columns should be swapped - first the f and then the m columns:
cols = ['time','channel','min', 'sd']
d = {'frequency':'f','mag.':'m'}
g = df.groupby(cols).cumcount().add(1).astype(str)
df = df.rename(columns=d).set_index(cols + [g]).unstack().sort_index(axis=1, level=1)
df.columns = df.columns.map(''.join)
df = df.reset_index()
print (df)
time channel min sd f1 m1 f2 m2 f3 m3 f4 m4 f5 \
0 12:00 X 12.0 2.3 fx11 x11 fx12 x12 fx13 x13 fx14 x14 fx15
1 12:00 Y 17.0 2.7 fy11 y11 fy12 y12 fy13 y13 fy14 y14 fy15
2 12:00 Z 15.0 4.3 fz11 z11 fz12 z12 fz13 z13 fz14 z14 fz15
3 12:01 X 13.0 4.9 fx21 x21 NaN NaN NaN NaN NaN NaN NaN
m5
0 x15
1 y15
2 z15
3 NaN
Explanation:
First rename columns by the dictionary
Then set_index by the counter Series created by cumcount, with 1 added and converted to strings
Reshape by unstack
Sort the second level of the MultiIndex by sort_index
Flatten the MultiIndex columns by map and join
Last reset_index to move the index back into columns
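To make the counter step concrete, printing g (built above from the original long-format frame) shows each block of 5 repeated rows numbered '1' through '5'; a small sketch of the expected values:

print(g.head(6).tolist())
# ['1', '2', '3', '4', '5', '1'] - rows numbered within each (time, channel, min, sd) group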
If the requested ordering of the output columns (m before f) is important, it is possible to use a double rename of the columns:
cols = ['time','channel','min', 'sd']
d = {'frequency':2,'mag.':1}
g = df.groupby(cols).cumcount().add(1).astype(str)
df = (df.rename(columns=d)
        .set_index(cols + [g])
        .unstack()
        .sort_index(axis=1, level=1)
        .rename(columns={2:'f', 1:'m'}))
df.columns = df.columns.map(''.join)
df = df.reset_index()
print (df)
time channel min sd m1 f1 m2 f2 m3 f3 m4 f4 m5 \
0 12:00 X 12.0 2.3 x11 fx11 x12 fx12 x13 fx13 x14 fx14 x15
1 12:00 Y 17.0 2.7 y11 fy11 y12 fy12 y13 fy13 y14 fy14 y15
2 12:00 Z 15.0 4.3 z11 fz11 z12 fz12 z13 fz13 z14 fz14 z15
3 12:01 X 13.0 4.9 x21 fx21 NaN NaN NaN NaN NaN NaN NaN
f5
0 fx15
1 fy15
2 fz15
3 NaN
Related
Return and assign the value of column c based on the maximum value of column b in a grouped data frame, grouped on column a.
I'm trying to return and assign (ideally in a different column) the value of column c, determined by the row with the maximum value in column b (which holds dates), grouped by column a, but I keep getting errors. The dataframe is relatively large, with over 16,000 rows.
import numpy
import pandas as pd

df = pd.DataFrame({'a': ['a','a','a','b','b','b','c','c','c'],
                   'b': ['2008-11-01', '2022-07-01', '2017-02-01', '2017-02-01', '2018-02-01', '2008-11-01', '2014-11-01', '2008-11-01', '2022-07-01'],
                   'c': [numpy.nan,6,8,2,1,numpy.nan,6,numpy.nan,7]})
df['b'] = pd.to_datetime(df['b'])
df
a b c
0 a 2008-11-01 NaN
1 a 2022-07-01 6.0
2 a 2017-02-01 8.0
3 b 2017-02-01 2.0
4 b 2018-02-01 1.0
5 b 2008-11-01 NaN
6 c 2014-11-01 6.0
7 c 2008-11-01 NaN
8 c 2022-07-01 7.0
I want the following result:
a b c d
0 a 2008-11-01 NaN 8.0
1 a 2022-07-01 6.0 8.0
2 a 2017-02-01 8.0 8.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
I've tried the following, but grouped series ('SeriesGroupBy') cannot be cast to another dtype:
df['d'] = df.groupby('a').b.astype(np.int64).max()[df.c].reset_index()
I've set column b to int before running the above code, but get an error saying none of the values in c are in the index.
None of [Index([....values in column c...\n dtype = 'object', name = 'a', length = 16280)] are in the [index]
I've tried expanding out the command, but it errors (which I just realized relates to no alternative value being given to the where function - but I don't know how to supply the correct value):
grouped = df.groupby('a')
df['d'] = grouped['c'].where(grouped['b'] == grouped['b'].max())
I've also tried using solutions provided here, here, and here.
Any help or direction would be appreciated.
I'm assuming the first three values should be 6.0 (because the maximum date for group a is 2022-07-01, with value 6):
df["d"] = df["a"].map(
df.groupby("a").apply(lambda x: x.loc[x["b"].idxmax(), "c"])
)
print(df)
Prints:
a b c d
0 a 2008-11-01 NaN 6.0
1 a 2022-07-01 6.0 6.0
2 a 2017-02-01 8.0 6.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
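An apply-free alternative, sketched under the assumption that column b has no missing dates: sort by date, keep the last row per group, and map its c value back onto every row of that group:

# take the row with the latest date per group, then broadcast its 'c' value
latest = df.sort_values('b').drop_duplicates('a', keep='last').set_index('a')['c']
df['d'] = df['a'].map(latest)

This avoids groupby.apply, which can be slow on larger frames such as the 16,000-row one mentioned in the question.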
Hi, I have a dataframe that looks like this:
Unnamed: 0    X1  Unnamed: 1    X2  Unnamed: 1    X3  Unnamed: 2    X4
1970-01-31   5.0  1970-01-31   1.0  1970-01-31   1.0  1980-01-30   1.0
1970-02-26   6.0  1970-02-26   3.0  1970-02-26   3.0  1980-02-26   3.0
I have many columns (631) that look like this.
I would like to have:
      date   X1   X2   X3   X4
1970-01-31  5.0  1.0  1.0  NaN
1970-02-26  6.0  3.0  3.0  NaN
1980-01-30  NaN  NaN  NaN  1.0
1980-02-26  NaN  NaN  NaN  3.0
I tried:
res_df = pd.concat(
    df2[[date, X]].rename(columns={date: "date"}) for date, X in zip(df2.columns[::2],
                                                                     df2.columns[1::2])
).pivot_table(index="date")
It works for small data but does not work for mine, maybe because I have the same column name 'Unnamed: 1' twice in my df.
I get an error message:
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
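For reference, a sketch reconstructing a small frame with the duplicated 'Unnamed: 1' header that triggers this error:

import pandas as pd

df2 = pd.DataFrame(
    [['1970-01-31', 5.0, '1970-01-31', 1.0, '1970-01-31', 1.0, '1980-01-30', 1.0],
     ['1970-02-26', 6.0, '1970-02-26', 3.0, '1970-02-26', 3.0, '1980-02-26', 3.0]],
    columns=['Unnamed: 0', 'X1', 'Unnamed: 1', 'X2', 'Unnamed: 1', 'X3', 'Unnamed: 2', 'X4'])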
Create an index from the date variable and use axis=1 in concat:
res_df = (pd.concat((df2[[date, X]].set_index(date)
                     for date, X in zip(df2.columns[::2], df2.columns[1::2])), axis=1)
            .rename_axis('date')
            .reset_index())
print (res_df)
date X1 X2 X3 X4
0 1970-01-31 5.0 1.0 1.0 NaN
1 1970-02-26 6.0 3.0 3.0 NaN
2 1980-01-30 NaN NaN NaN 1.0
3 1980-02-26 NaN NaN NaN 3.0
EDIT: The error suggests duplicated column names in your DataFrame; a possible solution is to deduplicate them before applying the solution above:
df = pd.DataFrame(columns=['a','a','b'], index=[0])
#you can test for duplicated column names
print (df.columns[df.columns.duplicated(keep=False)])
Index(['a', 'a'], dtype='object')
#https://stackoverflow.com/a/43792894/2901002
df.columns = pd.io.parsers.ParserBase({'names':df.columns})._maybe_dedup_names(df.columns)
print (df.columns)
Index(['a', 'a.1', 'b'], dtype='object')
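Note: _maybe_dedup_names is a private pandas API and has been removed in recent versions, so a hand-rolled sketch of the same mangling may be more durable:

def dedup_columns(cols):
    #append .1, .2, ... to repeated names, mimicking read_csv's handling of duplicates
    seen = {}
    out = []
    for c in cols:
        if c in seen:
            seen[c] += 1
            out.append(f'{c}.{seen[c]}')
        else:
            seen[c] = 0
            out.append(c)
    return out

df.columns = dedup_columns(df.columns)
print(df.columns)
# Index(['a', 'a.1', 'b'], dtype='object')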
I have a problem displaying what I want with pd.crosstab
I tried these lines:
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][df_temp['state'] >= 20], margins=True)
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][df_temp['state'] >= 20], margins=True, aggfunc = lambda x: x.count(), values = df_temp['state'][df_temp['state'] >= 20])
And they both display this:
state 20.0 30.0 32.0 50.0 All
date
2017 303.0 327.0 6.0 118.0 754.0
2018 328.0 167.0 3.0 58.0 556.0
All 631.0 494.0 9.0 176.0 1310.0
But what I want is not, for each state, the count of values equal to that state. For example, for state 20 I want the value in each year to be the count of all values greater than or equal to 20; for 2017 it should be 754. For 30 it should be 754 - 303 = 451, and so on for the other states.
I also tried this command, but it doesn't work either:
pd.crosstab(df_temp['date'].apply(lambda x: pd.to_datetime(x).year), df_temp['state'][(df_temp['state'] >= 20) | (df_temp['state'] == 30)], margins=True, aggfunc = lambda x: x.count(), values = df_temp['state'][(df_temp['state'] == 20) | (df_temp['state'] == 30)])
It displays the following table:
state 20.0 30.0 32.0 50.0 All
date
2017 303.0 327.0 0.0 0.0 630.0
2018 328.0 167.0 0.0 0.0 495.0
All 631.0 494.0 NaN NaN 1125.0
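One way to get those numbers, as a sketch: build a plain crosstab on the filtered states, then take a reversed cumulative sum across the state columns, so each column counts all values greater than or equal to that state (margins are omitted here):

sub = df_temp[df_temp['state'] >= 20]
ct = pd.crosstab(pd.to_datetime(sub['date']).dt.year, sub['state'])
#reverse the column order, cumsum, then restore the order:
#for 2017, column 20 becomes 303 + 327 + 6 + 118 = 754, column 30 becomes 451, etc.
ct_ge = ct.iloc[:, ::-1].cumsum(axis=1).iloc[:, ::-1]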
Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Site':['a','a','a','b','b','b'],
'x':[1,1,0,1,0,0],
'y':[1,np.nan,0,1,1,0]
})
df
Site y x
0 a 1.0 1
1 a NaN 1
2 a 0.0 0
3 b 1.0 1
4 b 1.0 0
5 b 0.0 0
I am looking for the most efficient way, for each numerical column (y and x), to produce a percent per group, label the column name, and stack them in one column.
Here's how I accomplish this for 'y':
df=df.loc[~np.isnan(df['y'])] #do not count non-numbers
t=pd.pivot_table(df,index='Site',values='y',aggfunc=[np.sum,len])
t['Item']='y'
t['Perc']=round(t['sum']/t['len']*100,1)
t
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
Now all I need is a way to add 2 more rows to this; the results for 'x' if I had pivoted with its values above, like this:
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1 2 x 50.0
b 1 3 x 33.3
In reality, I have 48 such numerical data columns that need to be stacked as such.
Thanks in advance!
First you can use notnull. Then omit the values parameter in pivot_table, use stack and sort_values by the new column Item. Last you can use the pandas round function:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1':'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum']/t['len']*100).round(1)
#reorder columns
t = t[['sum','len','Item','Perc']]
print(t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
Another solution, if it is necessary to define the values columns in pivot_table:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', values=['y', 'x'], aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1':'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum']/t['len']*100).round(1)
#reorder columns
t = t[['sum','len','Item','Perc']]
print(t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
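With 48 value columns, a melt-based sketch avoids naming them all; applied to the same filtered df, and assuming every column other than Site is numeric:

m = df.melt(id_vars='Site', var_name='Item', value_name='val').dropna(subset=['val'])
t = (m.groupby(['Site', 'Item'])['val']
      .agg(['sum', 'count'])
      .rename(columns={'count':'len'})
      .reset_index())
t['Perc'] = (t['sum']/t['len']*100).round(1)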
I'm practicing with pandas and I want to get the ranges of one column of a dataframe, grouped by the values of another column.
An example dataset:
Points Grade
1 7.5 C
2 9.3 A
3 NaN A
4 1.3 F
5 8.7 B
6 9.5 A
7 7.9 C
8 4.5 F
9 8.0 B
10 6.8 D
11 5.0 D
I want to group the ranges of points for each grade so I can impute the missing values.
For that goal I need to get something like this:
Grade Points
A [9.5, 9.3]
B [8.7, 8.0]
C [7.5, 7.9]
D [6.8, 5.0]
F [1.3, 4.5]
I can get it with a for loop and that kind of thing, but is it possible with pandas in some easy way?
I tried all the groupby combinations I know, with no luck. Any suggestions?
You can first filter df with notnull and then use groupby and tolist with reset_index:
print(df)
Points Grade
0 7.5 C
1 9.3 A
2 NaN A
3 1.3 F
4 8.7 B
5 9.5 A
6 7.9 C
7 4.5 F
8 8.0 B
9 6.8 D
10 5.0 D
print(df['Points'].notnull())
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
Name: Points, dtype: bool
print(df.loc[df['Points'].notnull()])
Points Grade
0 7.5 C
1 9.3 A
3 1.3 F
4 8.7 B
5 9.5 A
6 7.9 C
7 4.5 F
8 8.0 B
9 6.8 D
10 5.0 D
print(df.loc[df['Points'].notnull()].groupby('Grade')['Points']
        .apply(lambda x: x.tolist()).reset_index())
Grade Points
0 A [9.3, 9.5]
1 B [8.7, 8.0]
2 C [7.5, 7.9]
3 D [6.8, 5.0]
4 F [1.3, 4.5]
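In more recent pandas versions, agg(list) is an equivalent shorthand for the apply/tolist step; a sketch:

print(df.loc[df['Points'].notnull()]
        .groupby('Grade')['Points']
        .agg(list)
        .reset_index())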