For the following DataFrame df1, which is missing January data, cumul_val1 and cumul_val2 are the monthly cumulative values of value1 and value2, respectively.
date cumul_val1 cumul_val2
0 2020-05-31 48702.97 45919.59
1 2020-06-30 69403.68 62780.21
2 2020-07-31 83631.36 75324.61
3 2020-08-31 98485.95 88454.14
4 2020-09-30 117072.67 103484.20
5 2020-10-31 133293.80 116555.76
6 2020-11-30 150834.45 129492.36
7 2020-12-31 176086.22 141442.95
8 2021-02-28 17363.14 13985.87
9 2021-03-31 36007.05 27575.82
10 2021-04-30 50305.00 40239.76
11 2021-05-31 66383.32 54318.08
12 2021-06-30 88635.35 72179.07
13 2021-07-31 101648.18 84895.41
14 2021-08-31 114192.81 98059.73
15 2021-09-30 130331.78 112568.07
16 2021-10-31 143040.71 124933.62
17 2021-11-30 158130.73 137313.96
18 2021-12-31 179433.41 147602.08
19 2022-02-28 15702.61 14499.38
20 2022-03-31 31045.96 27764.95
21 2022-04-30 39768.15 39154.31
22 2022-05-31 50738.38 52133.62
I now want to convert them into monthly values. For example, the monthly value of value1 on 2021-04-30 is calculated as 50305.00 - 36007.05. Since January is missing, the monthly value for February is the cumulative value itself, and the monthly value for March is the March cumulative value minus the February cumulative value.
How can I achieve this?
The expected result:
date cumul_val1 cumul_val2 month_val1 month_val2
0 2020-05-31 48702.97 45919.59 NaN NaN
1 2020-06-30 69403.68 62780.21 20700.71 16860.62
2 2020-07-31 83631.36 75324.61 14227.68 12544.40
3 2020-08-31 98485.95 88454.14 14854.59 13129.53
4 2020-09-30 117072.67 103484.20 18586.72 15030.06
5 2020-10-31 133293.80 116555.76 16221.13 13071.56
6 2020-11-30 150834.45 129492.36 17540.65 12936.60
7 2020-12-31 176086.22 141442.95 25251.77 11950.59
8 2021-02-28 17363.14 13985.87 17363.14 13985.87
9 2021-03-31 36007.05 27575.82 18643.91 13589.95
10 2021-04-30 50305.00 40239.76 14297.96 12663.94
11 2021-05-31 66383.32 54318.08 16078.32 14078.32
12 2021-06-30 88635.35 72179.07 22252.03 17860.99
13 2021-07-31 101648.18 84895.41 13012.83 12716.34
14 2021-08-31 114192.81 98059.73 12544.63 13164.32
15 2021-09-30 130331.78 112568.07 16138.97 14508.34
16 2021-10-31 143040.71 124933.62 12708.94 12365.55
17 2021-11-30 158130.73 137313.96 15090.02 12380.34
18 2021-12-31 179433.41 147602.08 21302.68 10288.12
19 2022-02-28 15702.61 14499.38 15702.61 14499.38
20 2022-03-31 31045.96 27764.95 15343.35 13265.57
21 2022-04-30 39768.15 39154.31 8722.19 11389.36
22 2022-05-31 50738.38 52133.62 10970.22 12979.31
Note: in order to simplify the question, I added an alternative sample DataFrame df2 without missing months:
date cumul_val monthly_val
0 2020-09-30 32144142.46 NaN
1 2020-10-31 36061223.45 3917080.99
2 2020-11-30 40354684.50 4293461.05
3 2020-12-31 44360036.58 4005352.08
4 2021-01-31 4130729.28 4130729.28
5 2021-02-28 7985781.64 3855052.36
6 2021-03-31 12306556.74 4320775.10
7 2021-04-30 16873032.10 4566475.36
8 2021-05-31 21730065.01 4857032.91
9 2021-06-30 26816787.85 5086722.84
10 2021-07-31 31785276.80 4968488.95
11 2021-08-31 37030178.38 5244901.58
12 2021-09-30 42879767.13 5849588.75
13 2021-10-31 48392250.79 5512483.66
14 2021-11-30 53655448.65 5263197.86
15 2021-12-31 59965790.04 6310341.39
16 2022-01-31 5226910.15 5226910.15
17 2022-02-28 9481147.06 4254236.91
18 2022-03-31 14205738.71 4724591.65
19 2022-04-30 19096746.32 4891007.61
20 2022-05-31 24033460.77 4936714.45
21 2022-06-30 28913566.31 4880105.54
22 2022-07-31 34099663.15 5186096.84
23 2022-08-31 39082926.81 4983263.66
24 2022-09-30 44406354.61 5323427.80
25 2022-10-31 48889431.89 4483077.28
26 2022-11-30 52956747.09 4067315.20
27 2022-12-31 57184652.60 4227905.51
Had there been no gap in the data, this would have been an easy .diff(). However, since there are gaps, we need to fill the gap months with 0 (so that the first month after a gap diffs against 0 and keeps its cumulative value), take the diff, then keep only the original months.
import pandas as pd

idx = pd.to_datetime(df["date"])

month_val = (
    df[["cumul_val1", "cumul_val2"]]
    # Fill the gap months with 0
    .set_index(idx)
    .reindex(pd.date_range(idx.min(), idx.max(), freq="M"), fill_value=0)
    # Take the diff
    .diff()
    # Keep only the original months
    .loc[idx]
    # Beat into shape for the subsequent concat
    .set_axis(["month_val1", "month_val2"], axis=1)
    .set_index(df.index)
)
result = pd.concat([df, month_val], axis=1)
Edit: the OP clarified that for the first entry of the year, be it Jan or Feb, the monthly value is the same as the cumulative value. In that case, use this:
import numpy as np
import pandas as pd

cumul_cols = ["cumul_val1", "cumul_val2"]
monthly_cols = [f"month_val{i+1}" for i in range(len(cumul_cols))]

# Make sure `date` is of type Timestamp and the dataframe is sorted. Your data
# may have satisfied both conditions already.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# True if the current row is in the same year as the previous row.
# Repeat the result for each cumul_val column.
is_same_year = np.tile(
    df["date"].dt.year.diff().eq(0).to_numpy()[:, None],
    (1, len(cumul_cols)),
)

# Within a year, take the month-over-month difference; at a year boundary,
# keep the cumulative value itself.
month_val = np.where(
    is_same_year,
    df[cumul_cols].diff(),
    df[cumul_cols],
)
month_val[0, :] = np.nan  # the first row has no previous month

df[monthly_cols] = month_val
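Equivalently, you can diff within each year and fall back to the cumulative value at each year's first row. A more concise sketch under the same assumptions:
month_val = df.groupby(df["date"].dt.year)[cumul_cols].diff()
month_val = month_val.fillna(df[cumul_cols])  # each year's first row keeps its cumulative value
month_val.iloc[0] = np.nan                    # the very first row has no previous month
df[monthly_cols] = month_val.to_numpy()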
It would be much easier to use your dates as a PeriodIndex with monthly frequency:
# set up the date as a monthly period Index
df2 = df.assign(date=pd.to_datetime(df['date']).dt.to_period('M')).set_index('date')
# subtract the previous month
df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2)
Output:
cumul_val1 cumul_val2
date
2020-05 48702.97 45919.59
2020-06 20700.71 16860.62
2020-07 14227.68 12544.40
2020-08 14854.59 13129.53
2020-09 18586.72 15030.06
2020-10 16221.13 13071.56
2020-11 17540.65 12936.60
2020-12 25251.77 11950.59
2021-02 17363.14 13985.87
2021-03 18643.91 13589.95
2021-04 14297.95 12663.94
2021-05 16078.32 14078.32
2021-06 22252.03 17860.99
2021-07 13012.83 12716.34
2021-08 12544.63 13164.32
2021-09 16138.97 14508.34
2021-10 12708.93 12365.55
2021-11 15090.02 12380.34
2021-12 21302.68 10288.12
2022-02 15702.61 14499.38
2022-03 15343.35 13265.57
2022-04 8722.19 11389.36
2022-05 10970.23 12979.31
If you want to assign back to the original DataFrame:
df[['month_val1', 'month_val2']] = df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2).to_numpy()
Updated df:
date cumul_val1 cumul_val2 month_val1 month_val2
0 2020-05-31 48702.97 45919.59 48702.97 45919.59
1 2020-06-30 69403.68 62780.21 20700.71 16860.62
2 2020-07-31 83631.36 75324.61 14227.68 12544.40
3 2020-08-31 98485.95 88454.14 14854.59 13129.53
4 2020-09-30 117072.67 103484.20 18586.72 15030.06
5 2020-10-31 133293.80 116555.76 16221.13 13071.56
6 2020-11-30 150834.45 129492.36 17540.65 12936.60
7 2020-12-31 176086.22 141442.95 25251.77 11950.59
8 2021-02-28 17363.14 13985.87 17363.14 13985.87
9 2021-03-31 36007.05 27575.82 18643.91 13589.95
10 2021-04-30 50305.00 40239.76 14297.95 12663.94
11 2021-05-31 66383.32 54318.08 16078.32 14078.32
12 2021-06-30 88635.35 72179.07 22252.03 17860.99
13 2021-07-31 101648.18 84895.41 13012.83 12716.34
14 2021-08-31 114192.81 98059.73 12544.63 13164.32
15 2021-09-30 130331.78 112568.07 16138.97 14508.34
16 2021-10-31 143040.71 124933.62 12708.93 12365.55
17 2021-11-30 158130.73 137313.96 15090.02 12380.34
18 2021-12-31 179433.41 147602.08 21302.68 10288.12
19 2022-02-28 15702.61 14499.38 15702.61 14499.38
20 2022-03-31 31045.96 27764.95 15343.35 13265.57
21 2022-04-30 39768.15 39154.31 8722.19 11389.36
22 2022-05-31 50738.38 52133.62 10970.23 12979.31
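Note that with this approach the first row gets its cumulative value (48702.97 / 45919.59) rather than the NaN shown in the expected output, because fill_value=0 treats the start of the series like a gap. If you want NaN there, one extra line fixes it (a sketch):
df.loc[df.index[0], ['month_val1', 'month_val2']] = float('nan')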
Fortunately, pandas offers a diff function for this:
import pandas as pd

df = pd.DataFrame(
    [['2020-05-31', 48702.97, 45919.59],
     ['2020-06-30', 69403.68, 62780.21],
     ['2020-07-31', 83631.36, 75324.61]],
    columns=['date', 'cumul_val1', 'cumul_val2'],
)
df['val1'] = df['cumul_val1'].diff()
df['val2'] = df['cumul_val2'].diff()
print(df)
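For this three-row sample, the printed result is:
         date  cumul_val1  cumul_val2      val1      val2
0  2020-05-31    48702.97    45919.59       NaN       NaN
1  2020-06-30    69403.68    62780.21  20700.71  16860.62
2  2020-07-31    83631.36    75324.61  14227.68  12544.40
Bear in mind that a plain diff() does not handle the year boundary in the full data: the February rows would come out as large negative numbers, which is why the gap-aware approaches above are needed.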
Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year, keeps the entry with the latest date in each month-year, and drops the rest. The data runs until 2020.
So far I have only been able to fetch counts by month-year. I have not been able to write code that groups the data by month-year and indicator and gets the correct result.
Use Series.dt.to_period for monthly periods, aggregate the index of the maximal date per group with DataFrameGroupBy.idxmax, and then pass it to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90
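For reference, an equivalent sketch using drop_duplicates instead of idxmax: tag each row with its month period, sort by date, and keep the last row per period.
out = (df.assign(month=df['Date'].dt.to_period('m'))
         .sort_values('Date')
         .drop_duplicates('month', keep='last')
         .drop(columns='month'))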
I have this dataframe (a sample of it, at least):
DATETIME_FROM DATETIME_TO MEAS ROW VEHICLE SPEED
1 2020-02-27 05:19:42.750 2020-02-27 05:20:42.750 2.2844 1 26 85
2 2020-02-27 05:30:06.050 2020-02-27 05:31:06.050 2.5256 1 31 69
3 2020-02-27 05:36:02.370 2020-02-27 05:37:02.370 4.8933 1 37 86
4 2020-02-27 05:41:12.005 2020-02-27 05:42:12.005 2.6998 1 27 86
5 2020-02-27 05:46:30.773 2020-02-27 05:47:30.773 2.2720 1 26 86
6 2020-02-27 05:50:53.862 2020-02-27 05:51:53.862 4.6953 1 3 82
7 2020-02-27 05:59:45.381 2020-02-27 06:00:45.381 2.5942 1 31 86
8 2020-02-27 06:04:12.657 2020-02-27 06:05:12.657 4.9136 1 37 86
The result should be a table with the mean of MEAS for every vehicle on each day, but I would also like a total mean of MEAS per day and per vehicle.
I am using this:
pd.crosstab([valid1low.DATE,valid1low.ROW], [valid1low.VEHICLE], values=valid1low.MEAS, aggfunc=[np.mean], margins=True)
And the total looks like an average, but if I compute the average in Excel, I don't get the same result.
Could this be because Excel is not using the same precision for the MEAS values? And how would I get the same result?
The end user of this table will be using Excel, so if the total average differs from Excel's, I would get questions :)
If I understand correctly, what you are looking for is groupby. I have recreated a similar dataframe with the code below to explain.
import pandas as pd

df = pd.DataFrame()
df['DATETIME_FROM'] = pd.to_datetime(pd.DataFrame({
    'year': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
    'month': [2, 2, 2, 2, 2, 2, 2, 2],
    'day': [27, 27, 27, 27, 28, 28, 28, 28],
    'hour': [24, 26, 28, 30, 32, 34, 36, 38],
    'minute': [2, 4, 6, 8, 10, 12, 14, 16],
    'second': [1, 3, 5, 7, 8, 10, 12, 13],
}))
df['DATETIME_TO'] = pd.to_datetime(pd.DataFrame({
    'year': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
    'month': [2, 2, 2, 2, 2, 2, 2, 2],
    'day': [27, 27, 27, 27, 28, 28, 28, 28],
    'hour': [25, 27, 29, 31, 33, 35, 37, 39],
    'minute': [3, 5, 7, 9, 11, 13, 15, 17],
    'second': [2, 4, 6, 8, 10, 12, 14, 16],
}))
df['MEAS'] = [2.2844, 2.5256, 4.8933, 2.6998, 1, 2, 3, 4]
df['ROW'] = [1, 1, 1, 1, 2, 2, 2, 2]
df['VEHICLE'] = [26, 31, 37, 27, 65, 46, 45, 49]
df['VEHICLE_SPEED'] = [85, 69, 86, 86, 90, 91, 92, 93]
The dataframe that this code creates looks like the following.
DATETIME_FROM DATETIME_TO MEAS ROW VEHICLE VEHICLE_SPEED
0 2020-02-28 00:02:01 2020-02-28 01:03:02 2.2844 1 26 85
1 2020-02-28 02:04:03 2020-02-28 03:05:04 2.5256 1 31 69
2 2020-02-28 04:06:05 2020-02-28 05:07:06 4.8933 1 37 86
3 2020-02-28 06:08:07 2020-02-28 07:09:08 2.6998 1 27 86
4 2020-02-29 08:10:08 2020-02-29 09:11:10 1.0000 2 65 90
5 2020-02-29 10:12:10 2020-02-29 11:13:12 2.0000 2 46 91
6 2020-02-29 12:14:12 2020-02-29 13:15:14 3.0000 2 45 92
7 2020-02-29 14:16:13 2020-02-29 15:17:16 4.0000 2 49 93
You said you need the mean of each vehicle per day as well as the mean of MEAS per day. So I grouped by day using groupby together with pd.Grouper, targeting the DATETIME_FROM column, and then took the mean of each group with the mean function, which sums the values in a given column and divides by the number of rows.
means = df.set_index(["DATETIME_FROM"]).groupby(pd.Grouper(freq='D')).mean()
The dataframe means now contains the following. DATETIME_FROM is now the index, as we have grouped by this column.
MEAS ROW VEHICLE VEHICLE_SPEED
DATETIME_FROM
2020-02-27 3.100775 1.0 30.25 81.5
2020-02-28 2.500000 2.0 51.25 91.5
When you say you want the total means of MEAS and VEHICLE, I am assuming you want the mean of the values of those columns in the means dataframe. This can be done by simply taking the means of those columns; I then created a new dataframe called totals and added these entries.
mean_meas = means['MEAS'].mean()
mean_vehicles = means['VEHICLE'].mean()
totals = pd.DataFrame({'MEAN MEAS': [mean_meas], 'MEAN VEHICLE': [mean_vehicles]})
The totals dataframe then contains the following:
   MEAN MEAS  MEAN VEHICLE
0   2.800388         40.75
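If you would rather keep the crosstab-style layout from the question, pd.pivot_table with margins=True produces the per-day and per-vehicle means in one table. A sketch, assuming a DATE column derived from DATETIME_FROM; note that the margins are means over the underlying MEAS values, not means of the cell means, which is a common source of mismatches with a hand-built Excel average:
df['DATE'] = df['DATETIME_FROM'].dt.date
table = pd.pivot_table(df, values='MEAS', index='DATE', columns='VEHICLE',
                       aggfunc='mean', margins=True, margins_name='Mean')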
I hope this helps; if you have a question, let me know!
This is my data set:
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
2 2018 6 62 47 18
3 2018 6 62 47 18
4 2018 6 62 47 18
The last three columns already contain the sums for the year and week. I need to get rid of the duplicates so that the table contains only unique rows (for the example above):
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
4 2018 6 62 47 18
I tried to group the data, but it somehow works wrong: it does what I need, but only for one column.
df.groupby(['Year created', 'Week created']).size()
And output:
Year created Week created
2017 48 2
49 25
50 54
51 36
52 1
2018 1 17
2 50
3 37
But it is just one column, and I don't know which one, because even if I split the data into three parts and apply the same procedure to each part, I get the same result (as above) for all of them.
I believe you need drop_duplicates:
df = df.drop_duplicates(['Year created', 'Week created'])
print (df)
Year created Week created SUM_New SUM_Closed SUM_Open
0 2018 1 17 0 82
1 2018 6 62 47 18
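Note that the expected output keeps index 4, i.e. the last row of each duplicate group; for that, pass keep='last':
df = df.drop_duplicates(['Year created', 'Week created'], keep='last')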
If rows should only count as duplicates when the sum columns match as well, include them in the subset:
df2 = df.drop_duplicates(['Year created', 'Week created', 'SUM_New', 'SUM_Closed'])
print(df2)
Hope this helps.
The question: when merging two DataFrames that both have a column called A, the result is a DataFrame with A_x and A_y. I am wondering how to keep A from one DataFrame and discard the other, so that I don't have to rename A_x to A after the merge.
Just filter your dataframe columns before merging.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Key': np.arange(12), 'A': np.random.randint(0, 100, 12), 'C': list('ABCD')*3})
df2 = pd.DataFrame({'Key': np.arange(12), 'A': np.random.randint(100, 1000, 12), 'C': list('ABCD')*3})
df1.merge(df2[['Key', 'A']], on='Key')
Output: (Note: C is not duplicated)
A_x C Key A_y
0 60 A 0 440
1 65 B 1 731
2 76 C 2 596
3 67 D 3 580
4 44 A 4 477
5 51 B 5 524
6 7 C 6 572
7 88 D 7 984
8 70 A 8 862
9 13 B 9 158
10 28 C 10 593
11 63 D 11 177
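A_x and A_y still appear above because both frames kept their A column. To keep only df1's A (no renaming needed afterwards), drop it from df2 before merging; any remaining shared columns (here C) will still get suffixes unless you drop or select them too:
df1.merge(df2.drop(columns='A'), on='Key')  # A comes from df1, unsuffixed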
It depends on whether you need to append the columns with duplicated names to the final merged DataFrame.
If so, add the suffixes parameter to merge:
print (df1.merge(df2, on='Key', suffixes=('', '_')))
...if not, use @Scott Boston's solution.
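With suffixes=('', '_'), df1's overlapping columns keep their original names while df2's get a trailing underscore, so A is immediately usable. A quick sketch:
merged = df1.merge(df2, on='Key', suffixes=('', '_'))
print(merged.columns.tolist())  # ['Key', 'A', 'C', 'A_', 'C_']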