Expand Pandas Group By to Many Columns - python-3.x

I have a dataframe like this:
# create a data frame
import pandas as pd
d = {'Machine ID': [100, 100, 101, 101], 'Machine': ["ABC", "ABC", "CDQ", "CDQ"],
     "June": [10, 0, 12, 15], "July": [12, 15, 0, 32], "August": [0, 15, 20, 11]}
data = pd.DataFrame(data=d)
Data Frame:
Machine ID Machine June July August
0 100 ABC 10 12 0
1 100 ABC 0 15 15
2 101 CDQ 12 0 20
3 101 CDQ 15 32 11
Now I want to group by and aggregate by month, so I did this:
machine_group = data.groupby(['Machine ID','Machine'])['June'].sum().reset_index(name = 'June Sum')
I get the following,
Machine ID Machine June Sum
0 100 ABC 10
1 101 CDQ 27
However, I need an output something like this:
Machine ID Machine June Sum July Sum August Sum
0 100 ABC 10 27 15
1 101 CDQ 27 32 31
How can I expand my groupby code to get these lined up? Otherwise, my only option for now is to group by each month and append each grouped column into a new data frame.
Any suggestion will be appreciated.

Try:
machine_group = data.groupby(['Machine ID', 'Machine']).sum()\
                    .add_suffix(' Sum').reset_index()
Output:
Machine ID Machine June Sum July Sum August Sum
0 100 ABC 10 27 15
1 101 CDQ 27 32 31
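If the frame ever contains extra numeric columns that should not be summed, a variant of the same idea that lists the month columns explicitly may be safer. A minimal sketch, reusing the question's data:

import pandas as pd

d = {'Machine ID': [100, 100, 101, 101], 'Machine': ["ABC", "ABC", "CDQ", "CDQ"],
     "June": [10, 0, 12, 15], "July": [12, 15, 0, 32], "August": [0, 15, 20, 11]}
data = pd.DataFrame(data=d)

# Restrict the aggregation to the month columns explicitly, then rename them
months = ['June', 'July', 'August']
machine_group = (data.groupby(['Machine ID', 'Machine'])[months]
                     .sum()
                     .add_suffix(' Sum')
                     .reset_index())
print(machine_group)

This produces the same three "… Sum" columns as the accepted answer, regardless of any other columns in the frame.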

Related

How to replace a column in dataframe for the result of a function

Currently I have a dataframe with a column named age, which holds the age of the person in days. I would like to convert this value to years; how could I achieve that?
At the moment, if one runs this command
df['age']
the result would be something like
0 18393
1 20228
2 18857
3 17623
4 17474
5 21914
6 22113
7 22584
8 17668
9 19834
10 22530
11 18815
12 14791
13 19809
I would like to change the value in each row to the current value / 365 (which would convert days to years).
As suggested:
>>> df['age'] / 365
age
0 50.391781
1 55.419178
2 51.663014
3 48.282192
4 47.873973
Or if you need whole years (integer division):
>>> df['age'] // 365
age
0 50
1 55
2 51
3 48
4 47
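Since the question asks to replace the column, the result still needs to be assigned back. A minimal sketch, using the first few values shown above:

import pandas as pd

# Hypothetical frame mirroring the question's age-in-days column
df = pd.DataFrame({'age': [18393, 20228, 18857, 17623, 17474]})

# Replace the column in place with the age in whole years
df['age'] = df['age'] // 365
print(df)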

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year and then keeps only the entry with the latest date in that month-year, dropping the rest. The data runs until the year 2020.
I was only able to fetch the count by month-year. I am not able to write proper code that groups the data by month-year and indicator and gets the correct result.
Use Series.dt.to_period for month periods, get the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass it to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
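An equivalent approach, if you prefer avoiding idxmax, is to sort by date and keep the last row within each month period. A sketch on a subset of the sample data:

import pandas as pd

df = pd.DataFrame({'Date': ['2000-01-30', '2000-01-31', '2000-03-30',
                            '2000-02-27', '2000-02-28', '2000-03-31'],
                   'Indicator': ['A', 'A', 'C', 'B', 'B', 'C'],
                   'Value': [30, 40, 50, 60, 70, 90]})
df['Date'] = pd.to_datetime(df['Date'])

# Sort by date, then keep the last (latest) row within each month period
s = df.sort_values('Date')
latest = s.groupby(s['Date'].dt.to_period('M')).tail(1)
print(latest)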

Pandas unpivot dataframe using datetime elements from column names

Say I have a pandas dataframe as follows:
Here Store serves as id, the Jan 18 - Mar 18 columns represent sales of said stores in the respective years and months, and Trading Area is an example of a time-invariant feature of a store.
For simplicity assume sales column names are already converted to proper datetime format.
Expected result:
I was thinking about using pandas.melt; however, I'm not sure how to properly use the datetime information contained in the column names to construct columns for year and month. (Obviously this can be done manually in a loop, but I need to apply this to arbitrarily large dataframes, which is where it gets tedious; surely a more elegant solution exists.)
Any help is appreciated.
Edit: data = pd.DataFrame({'Store':['A', 'B', 'C'], 'Jan 18':[100, 50, 60], 'Feb 18':[120, 70, 80], 'Mar 18':[140, 90, 100], 'Trading Area':[500, 800, 700]})
You could use melt in the following way:
# melt
melted = data.melt(id_vars=['Store', 'Trading Area'], var_name='Month', value_name='Sales')
# extract month and year
melted[['Month', 'Year']] = melted.Month.str.split(expand=True)
# format year
melted['Year'] = pd.to_datetime(melted.Year, yearfirst=True, format='%y').dt.year
print(melted.sort_values('Store'))
Output
Store Trading Area Month Sales Year
0 A 500 Jan 100 2018
3 A 500 Feb 120 2018
6 A 500 Mar 140 2018
1 B 800 Jan 50 2018
4 B 800 Feb 70 2018
7 B 800 Mar 90 2018
2 C 700 Jan 60 2018
5 C 700 Feb 80 2018
8 C 700 Mar 100 2018
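If you prefer to lean on the question's note that the labels are parseable dates, here is a sketch building on the melt above that derives Year and Month from a single parsed datetime (assuming the 'Jan 18'-style labels from the Edit):

import pandas as pd

data = pd.DataFrame({'Store': ['A', 'B', 'C'], 'Jan 18': [100, 50, 60],
                     'Feb 18': [120, 70, 80], 'Mar 18': [140, 90, 100],
                     'Trading Area': [500, 800, 700]})

melted = data.melt(id_vars=['Store', 'Trading Area'],
                   var_name='Month', value_name='Sales')

# Parse the 'Jan 18'-style labels into a real datetime, then derive
# year and month from it instead of splitting strings
melted['Date'] = pd.to_datetime(melted['Month'], format='%b %y')
melted['Year'] = melted['Date'].dt.year
melted['Month'] = melted['Date'].dt.month
print(melted.sort_values(['Store', 'Date']))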
You can do a wide_to_long followed by a stack:
(pd.wide_to_long(df=data,
                 stubnames=['Jan', 'Feb', 'Mar'],
                 i=['Store', 'Trading Area'],
                 j='Year',
                 sep=' ')
   .stack()
   .reset_index(name='Sales')
   .rename(columns={'level_3': 'Month'})
)
Output:
Store Trading Area Year Month Sales
0 A 500 18 Jan 100
1 A 500 18 Feb 120
2 A 500 18 Mar 140
3 B 800 18 Jan 50
4 B 800 18 Feb 70
5 B 800 18 Mar 90
6 C 700 18 Jan 60
7 C 700 18 Feb 80
8 C 700 18 Mar 100

How can I delete duplicates across 3 columns using two criteria (the first two columns)?

This is my data set:
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
2 2018 6 62 47 18
3 2018 6 62 47 18
4 2018 6 62 47 18
In the last three columns there is already the sum for the year and week. I need to get rid of duplicates so that the table contains only unique rows (for the example above):
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
4 2018 6 62 47 18
I tried to group the data, but it somehow works incorrectly and does what I need for only one column.
df.groupby(['Year created', 'Week created']).size()
And output:
Year created Week created
2017 48 2
49 25
50 54
51 36
52 1
2018 1 17
2 50
3 37
But that is just one column, and I don't know which one, because even if I split the data into three parts and repeat the same procedure for each part, I get the same result (as above) for all of them.
I believe you need drop_duplicates:
df = df.drop_duplicates(['Year created', 'Week created'])
print (df)
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
df2 = df.drop_duplicates(['Year created', 'Week created', 'SUM_New', 'SUM_Closed'])
print(df2)
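Note that drop_duplicates keeps the first occurrence by default; if you want the last row of each group instead (index 4, as in the desired output), a sketch with keep='last':

import pandas as pd

df = pd.DataFrame({'Year created': [2018, 2018, 2018, 2018, 2018],
                   'Week created': [1, 6, 6, 6, 6],
                   'SUM_New': [17, 62, 62, 62, 62],
                   'SUM_Closed': [0, 47, 47, 47, 47],
                   'SUM_Open': [82, 18, 18, 18, 18]})

# keep='last' retains the final duplicate row instead of the first
print(df.drop_duplicates(['Year created', 'Week created'], keep='last'))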
hope this helps.

row subtraction in lambda pandas dataframe

I have a dataframe with multiple columns. One of the columns is the cumulative revenue column. If the year has not ended, the revenue will be constant for the rest of the period because the incoming daily revenue is 0.
The dataframe looks like this
Now I want to create a new column where the previous row is subtracted from the current row; if the result is 0, then put 0 in that row of the new column, otherwise use the row value. The new dataframe should look like this:
My idea was to do this with the apply lambda method. So this is the thinking:
df['2017new'] = df['2017'].apply(lambda x: 0 if row - lastrow == 0 else x)
But I do not know how to write the row - lastrow part of the code. How can I do this? Thanks in advance!
By using np.where
df2['New']=np.where(df2['2017'].diff().eq(0),0,df2['2017'])
df2
Out[190]:
2016 2017 New
0 10 21 21
1 15 34 34
2 70 40 40
3 90 53 53
4 93 53 0
5 99 53 0
We can shift the data and fill the values based on a condition using np.where, i.e.
df['new'] = np.where(df['2017']-df['2017'].shift(1)==0,0,df['2017'])
or with df.where, i.e.
df['new'] = df['2017'].where(df['2017']-df['2017'].shift(1)!=0,0)
2016 2017 new
0 10 21 21
1 15 34 34
2 70 40 40
3 90 53 53
4 93 53 0
5 99 53 0
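For completeness, a self-contained sketch of the diff-based variant, with the imports the answers assume and the frame reconstructed from the values shown above:

import pandas as pd

# Reconstructed frame, assuming the values shown in the output above
df = pd.DataFrame({'2016': [10, 15, 70, 90, 93, 99],
                   '2017': [21, 34, 40, 53, 53, 53]})

# Keep the value where it changed from the previous row, otherwise write 0;
# the first row has no previous row (diff is NaN) and is kept as-is
df['new'] = df['2017'].where(df['2017'].diff().ne(0), 0)
print(df)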
