Sort through microsecond data and save day - python-3.x
I would like some help figuring out how many days in a month something happened.
This dataset is 30 years of microsecond-resolution occurrences (whenever an event is triggered, the data is recorded).
What I'm trying to do is see how many days per month an event happened.
DataFrame
datetime year month day hour lon lat diy
0 1989-01-07 02:21:55 1989 1 7 2 -122.201 47.577 1989-01-07
1 1989-01-07 02:24:30 1989 1 7 2 -122.190 47.555 1989-01-07
2 1989-01-07 02:24:32 1989 1 7 2 -122.437 47.585 1989-01-07
3 1989-02-17 21:53:13 1989 2 17 21 -120.844 47.438 1989-02-17
4 1989-02-17 21:53:33 1989 2 17 21 -120.844 47.438 1989-02-17
... ... ... ... ... ... ... ... ...
212978 2019-12-14 00:10:41 2019 12 14 0 -124.880 48.605 2019-12-14
212979 2019-12-19 14:38:32 2019 12 19 14 -125.244 48.493 2019-12-19
212980 2019-12-19 14:49:23 2019 12 19 14 -125.200 48.543 2019-12-19
212981 2019-12-19 14:52:09 2019 12 19 14 -125.203 48.551 2019-12-19
212982 2019-12-31 21:00:11 2019 12 31 21 -124.155 47.684 2019-12-31
So we can see that on Jan 7, 1989, 3 events happened. I would like to reduce this to just days.
In plain English: in the month of January there was 1 day on which data was recorded. I'm not worried about how many events happened on that one day; I just want to focus on how many days in that month had an event.
# What I Want: NewDataFrame
datetime days_with_occurrences
0 1989-01 1
1 1989-02 1
2 1989-03 3
I've tried using df.groupby:
x = df.groupby('datetime').size().reset_index().rename(columns={0: 'days_with_occurrences'})
but this is giving me the total counts...
datetime days_with_occurrences
0 1989-01-07 3
1 1989-02-17 2
2 1989-03-07 3
3 1989-03-10 1
4 1989-03-13 2
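Not part of the original post, but a minimal sketch of one way to get the desired output, assuming df is the DataFrame shown above and that the diy column holds the event date: bucket the rows into year-month periods and count the unique event days inside each period.

import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime'])
new_df = (
    df.groupby(df['datetime'].dt.to_period('M'))['diy']  # bucket events by year-month
      .nunique()                                         # count distinct days with events
      .reset_index(name='days_with_occurrences')
)

Here nunique ignores how many events fell on a given day and counts each day once, which matches the "days, not events" requirement.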
Related
Convert monthly cumulative values to current month values in Pandas
For the following data df1 with missing January data, cumul_val1 and cumul_val2 are the monthly cumulative values of value1 and value2 respectively.

    date        cumul_val1  cumul_val2
0   2020-05-31    48702.97    45919.59
1   2020-06-30    69403.68    62780.21
2   2020-07-31    83631.36    75324.61
3   2020-08-31    98485.95    88454.14
4   2020-09-30   117072.67   103484.20
5   2020-10-31   133293.80   116555.76
6   2020-11-30   150834.45   129492.36
7   2020-12-31   176086.22   141442.95
8   2021-02-28    17363.14    13985.87
9   2021-03-31    36007.05    27575.82
10  2021-04-30    50305.00    40239.76
11  2021-05-31    66383.32    54318.08
12  2021-06-30    88635.35    72179.07
13  2021-07-31   101648.18    84895.41
14  2021-08-31   114192.81    98059.73
15  2021-09-30   130331.78   112568.07
16  2021-10-31   143040.71   124933.62
17  2021-11-30   158130.73   137313.96
18  2021-12-31   179433.41   147602.08
19  2022-02-28    15702.61    14499.38
20  2022-03-31    31045.96    27764.95
21  2022-04-30    39768.15    39154.31
22  2022-05-31    50738.38    52133.62

I now want to convert them into monthly values. For example, the value of value1 on 2021-04-30 is calculated as 50305.00 - 36007.05. Since the value for January is missing, the current-month value in February is the accumulated value itself, and the current-month value in March is the accumulated value in March minus the accumulated value in February. May I ask how to achieve this? The expected result:

    date        cumul_val1  cumul_val2  month_val1  month_val2
0   2020-05-31    48702.97    45919.59         NaN         NaN
1   2020-06-30    69403.68    62780.21    20700.71    16860.62
2   2020-07-31    83631.36    75324.61    14227.68    12544.40
3   2020-08-31    98485.95    88454.14    14854.59    13129.53
4   2020-09-30   117072.67   103484.20    18586.72    15030.06
5   2020-10-31   133293.80   116555.76    16221.13    13071.56
6   2020-11-30   150834.45   129492.36    17540.65    12936.60
7   2020-12-31   176086.22   141442.95    25251.77    11950.59
8   2021-02-28    17363.14    13985.87    17363.14    13985.87
9   2021-03-31    36007.05    27575.82    18643.91    13589.95
10  2021-04-30    50305.00    40239.76    14297.96    12663.94
11  2021-05-31    66383.32    54318.08    16078.32    14078.32
12  2021-06-30    88635.35    72179.07    22252.03    17860.99
13  2021-07-31   101648.18    84895.41    13012.83    12716.34
14  2021-08-31   114192.81    98059.73    12544.63    13164.32
15  2021-09-30   130331.78   112568.07    16138.97    14508.34
16  2021-10-31   143040.71   124933.62    12708.94    12365.55
17  2021-11-30   158130.73   137313.96    15090.02    12380.34
18  2021-12-31   179433.41   147602.08    21302.68    10288.12
19  2022-02-28    15702.61    14499.38    15702.61    14499.38
20  2022-03-31    31045.96    27764.95    15343.35    13265.57
21  2022-04-30    39768.15    39154.31     8722.19    11389.36
22  2022-05-31    50738.38    52133.62    10970.22    12979.31

Notes: in order to simplify the question, I added alternative sample data df2 without missing months:

    date         cumul_val  monthly_val
0   2020-09-30  32144142.46          NaN
1   2020-10-31  36061223.45   3917080.99
2   2020-11-30  40354684.50   4293461.05
3   2020-12-31  44360036.58   4005352.08
4   2021-01-31   4130729.28   4130729.28
5   2021-02-28   7985781.64   3855052.36
6   2021-03-31  12306556.74   4320775.10
7   2021-04-30  16873032.10   4566475.36
8   2021-05-31  21730065.01   4857032.91
9   2021-06-30  26816787.85   5086722.84
10  2021-07-31  31785276.80   4968488.95
11  2021-08-31  37030178.38   5244901.58
12  2021-09-30  42879767.13   5849588.75
13  2021-10-31  48392250.79   5512483.66
14  2021-11-30  53655448.65   5263197.86
15  2021-12-31  59965790.04   6310341.39
16  2022-01-31   5226910.15   5226910.15
17  2022-02-28   9481147.06   4254236.91
18  2022-03-31  14205738.71   4724591.65
19  2022-04-30  19096746.32   4891007.61
20  2022-05-31  24033460.77   4936714.45
21  2022-06-30  28913566.31   4880105.54
22  2022-07-31  34099663.15   5186096.84
23  2022-08-31  39082926.81   4983263.66
24  2022-09-30  44406354.61   5323427.80
25  2022-10-31  48889431.89   4483077.28
26  2022-11-30  52956747.09   4067315.20
27  2022-12-31  57184652.60   4227905.51
Had there been no gap in the data, the problem would have been an easy .diff(). However, since there are gaps, we need to fill those gaps with 0, calculate the diff, then keep only the original months.

idx = pd.to_datetime(df["date"])

month_val = (
    df[["cumul_val1", "cumul_val2"]]
    # Fill the gap months with 0
    .set_index(idx)
    .reindex(pd.date_range(idx.min(), idx.max(), freq="M"), fill_value=0)
    # Take the diff
    .diff()
    # Keep only the original months
    .loc[idx]
    # Beat into shape for the subsequent concat
    .set_axis(["month_val1", "month_val2"], axis=1)
    .set_index(df.index)
)
result = pd.concat([df, month_val], axis=1)

Edit: the OP clarified that for the first entry of the year, be it Jan or Feb, the monthly value is the same as the cumulative value. In that case, use this:

cumul_cols = ["cumul_val1", "cumul_val2"]
monthly_cols = [f"month_val{i+1}" for i in range(len(cumul_cols))]

# Make sure `date` is of type Timestamp and the dataframe is sorted. Your data
# may have satisfied both conditions already.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

# True if the current row is in the same year as the previous row.
# Repeat the result for each cumul_val column.
is_same_year = np.tile(
    df["date"].dt.year.diff().eq(0).to_numpy()[:, None],
    (1, len(cumul_cols)),
)

# Within a year, take the diff; at a year boundary, keep the cumulative value.
month_val = np.where(
    is_same_year,
    df[cumul_cols].diff(),
    df[cumul_cols],
)
month_val[0, :] = np.nan

df[monthly_cols] = month_val
It would be much easier to use your date as a PeriodIndex with monthly frequencies:

# set up the date as a monthly period Index
df2 = df.assign(date=pd.to_datetime(df['date']).dt.to_period('M')).set_index('date')

# subtract the previous month
df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2)

Output:

         cumul_val1  cumul_val2
date
2020-05    48702.97    45919.59
2020-06    20700.71    16860.62
2020-07    14227.68    12544.40
2020-08    14854.59    13129.53
2020-09    18586.72    15030.06
2020-10    16221.13    13071.56
2020-11    17540.65    12936.60
2020-12    25251.77    11950.59
2021-02    17363.14    13985.87
2021-03    18643.91    13589.95
2021-04    14297.95    12663.94
2021-05    16078.32    14078.32
2021-06    22252.03    17860.99
2021-07    13012.83    12716.34
2021-08    12544.63    13164.32
2021-09    16138.97    14508.34
2021-10    12708.93    12365.55
2021-11    15090.02    12380.34
2021-12    21302.68    10288.12
2022-02    15702.61    14499.38
2022-03    15343.35    13265.57
2022-04     8722.19    11389.36
2022-05    10970.23    12979.31

If you want to assign back to the original DataFrame:

df[['month_val1', 'month_val2']] = (
    df2.sub(df2.shift(freq='1M'), fill_value=0).reindex_like(df2).to_numpy()
)

Updated df:

    date        cumul_val1  cumul_val2  month_val1  month_val2
0   2020-05-31    48702.97    45919.59    48702.97    45919.59
1   2020-06-30    69403.68    62780.21    20700.71    16860.62
2   2020-07-31    83631.36    75324.61    14227.68    12544.40
3   2020-08-31    98485.95    88454.14    14854.59    13129.53
4   2020-09-30   117072.67   103484.20    18586.72    15030.06
5   2020-10-31   133293.80   116555.76    16221.13    13071.56
6   2020-11-30   150834.45   129492.36    17540.65    12936.60
7   2020-12-31   176086.22   141442.95    25251.77    11950.59
8   2021-02-28    17363.14    13985.87    17363.14    13985.87
9   2021-03-31    36007.05    27575.82    18643.91    13589.95
10  2021-04-30    50305.00    40239.76    14297.95    12663.94
11  2021-05-31    66383.32    54318.08    16078.32    14078.32
12  2021-06-30    88635.35    72179.07    22252.03    17860.99
13  2021-07-31   101648.18    84895.41    13012.83    12716.34
14  2021-08-31   114192.81    98059.73    12544.63    13164.32
15  2021-09-30   130331.78   112568.07    16138.97    14508.34
16  2021-10-31   143040.71   124933.62    12708.93    12365.55
17  2021-11-30   158130.73   137313.96    15090.02    12380.34
18  2021-12-31   179433.41   147602.08    21302.68    10288.12
19  2022-02-28    15702.61    14499.38    15702.61    14499.38
20  2022-03-31    31045.96    27764.95    15343.35    13265.57
21  2022-04-30    39768.15    39154.31     8722.19    11389.36
22  2022-05-31    50738.38    52133.62    10970.23    12979.31
Fortunately pandas offers a diff function for this:

df = pd.DataFrame(
    [['2020-05-31', 48702.97, 45919.59],
     ['2020-06-30', 69403.68, 62780.21],
     ['2020-07-31', 83631.36, 75324.61]],
    columns=['date', 'cumul_val1', 'cumul_val2'],
)
df['val1'] = df['cumul_val1'].diff()
df['val2'] = df['cumul_val2'].diff()
print(df)
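A caveat worth adding (my note, not part of the answer above): on the full df1, a plain .diff() would also subtract across the year boundary, where the cumulative values reset. A minimal sketch of a year-aware variant, assuming df holds df1 as posted and following the OP's rule that a year's first entry keeps its cumulative value:

import pandas as pd

year = pd.to_datetime(df['date']).dt.year
for cumul, monthly in [('cumul_val1', 'month_val1'), ('cumul_val2', 'month_val2')]:
    # diff within each year; NaN at each year's first row
    by_year = df.groupby(year)[cumul].diff()
    # a year's first entry keeps its cumulative value
    df[monthly] = by_year.fillna(df[cumul])

# the very first row has no prior month at all
df.loc[df.index[0], ['month_val1', 'month_val2']] = float('nan')

This relies on the months within each year being consecutive, which holds for df1; the reindex approach above is more general.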
"Max value day" of the week and tallying up each day that was highest Python
I was able to get the highest value of the week. Now, I need to figure out which day of the week it was so I can tally up how many times a certain day of the week is the highest. For example, tallying the day of the week that has the highest value of that week:

Mon: 5
Tue: 2
Wed: 3
Thur: 2
Fri: 1

This is what my dataframe looked like before I parsed the information that I needed.

          Date   Weekdays  Week        Open       Close
0   2019-06-26  Wednesday    26  208.279999  208.509995
1   2019-06-27   Thursday    26  208.970001  212.020004
2   2019-06-28     Friday    26  213.000000  213.169998
3   2019-07-01     Monday    27  214.250000  214.619995
4   2019-07-02    Tuesday    27  214.380005  214.539993
..         ...        ...   ...         ...         ...
500 2021-06-21     Monday    25  275.619995  277.100006
501 2021-06-22    Tuesday    25  277.570007  276.920013
502 2021-06-23  Wednesday    25  276.890015  274.660004
503 2021-06-24   Thursday    25  275.000000  275.489990
504 2021-06-25     Friday    25  276.369995  278.380005

[505 rows x 5 columns]

Now I was able to get the highest value of the week, but I want to get the day and tally which days were the highest.

# Tally up the highest days of the week at OPEN
new_data.groupby(pd.Grouper('Week')).Open.max()

The result was:

Week
26    213.000000
27    215.130005
28    215.210007
29    214.440002
30    208.369995
31    210.000000
32    204.199997
33    214.740005
34    210.050003
35    217.509995
36    222.000000
37    220.539993
38    220.279999
39    214.000000
40    214.300003
41    215.880005
42    216.740005
43    212.429993
44    213.550003
45    222.809998
46    228.500000
47    233.570007
48    233.919998
49    231.190002
50    231.259995
51    227.679993
52    226.860001
1     233.539993
2     234.789993
3     235.220001
4     233.000000
5     236.979996
6     241.429993
7     244.729996
8     248.070007
9     251.080002
10    264.220001
11    260.309998
12    252.750000
13    259.940002
14    264.220001
15    270.470001
16    272.299988
17    276.290009
18    289.970001
19    292.350006
20    290.200012
21    290.190002
22    292.910004
23    292.559998
24    286.660004
25    277.570007
53    230.500000
Name: Open, dtype: float64
I got you. We wrap the groupby in df.loc, then select the indexes for the max values of Open in each group. Finally just take the value_counts of the Weekdays.

df.loc[df.groupby(["Week"]).Open.idxmax()].Weekdays.value_counts()
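A quick self-contained demonstration of the pattern on toy data (the numbers here are made up for illustration, not taken from the question):

import pandas as pd

df = pd.DataFrame({
    'Week':     [26, 26, 26, 27, 27],
    'Weekdays': ['Wednesday', 'Thursday', 'Friday', 'Monday', 'Tuesday'],
    'Open':     [208.28, 208.97, 213.00, 214.25, 214.38],
})

# idxmax returns the row label of each week's maximum Open; .loc pulls those
# rows, and value_counts tallies their weekdays.
print(df.loc[df.groupby(["Week"]).Open.idxmax()].Weekdays.value_counts())
# Friday     1
# Tuesday    1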
How to get back the usual index after pivoting a dataframe?
I created a dataframe from a csv file containing data on the number of deaths by year (running from 1946 to 2021) and month (within year):

dataD = pd.read_csv('MY_FILE.csv', sep=',')

The first rows (out of 902...) of the output are:

   Year  Month  Deaths
0  2021      2   55500
1  2021      1   65400
2  2020     12   62800
3  2020     11   64700
4  2020     10   56900

As expected, the dataframe contains an index numbered 0, 1, 2, ... and so on. Now, I pivot this dataframe in order to have only 1 row per year and months in columns, using the following code:

dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths')

The first rows of the result are now:

Month       1        2        3        4        5        6        7        8        9        10       11       12
Year
1946  70900.0  53958.0  57287.0  45376.0  42591.0  37721.0  37587.0  34880.0  35188.0  37842.0  42954.0  49596.0
1947  60453.0  56891.0  56442.0  45121.0  42605.0  37894.0  38364.0  36763.0  35768.0  40488.0  41361.0  46007.0
1948  46161.0  45412.0  51983.0  43829.0  42003.0  37084.0  39069.0  35272.0  35314.0  39588.0  43596.0  53899.0
1949  87861.0  58592.0  52772.0  44154.0  41896.0  39141.0  40042.0  37372.0  36267.0  40534.0  47049.0  47918.0
1950  51927.0  47749.0  50439.0  47248.0  45515.0  40095.0  39798.0  38124.0  37075.0  42232.0  44418.0  49860.0

My question is: what do I have to change in the previous pivoting code in order to get back the index 0, 1, 2, ... etc. when I output the pivoted file? I think I need to specify index=*** in order to make the pivot instruction run. But afterwards, I would like to recover an index "as usual" (if I can say), exactly like in my first file dataD. Any possibility?
You can reset_index() after pivoting:

dataDW = dataD.pivot(index='Year', columns='Month', values='Deaths').reset_index()

This would give you the following:

Month  Year       1        2        3        4        5        6        7        8        9        10       11       12
0      1946  70900.0  53958.0  57287.0  45376.0  42591.0  37721.0  37587.0  34880.0  35188.0  37842.0  42954.0  49596.0
1      1947  60453.0  56891.0  56442.0  45121.0  42605.0  37894.0  38364.0  36763.0  35768.0  40488.0  41361.0  46007.0
2      1948  46161.0  45412.0  51983.0  43829.0  42003.0  37084.0  39069.0  35272.0  35314.0  39588.0  43596.0  53899.0
3      1949  87861.0  58592.0  52772.0  44154.0  41896.0  39141.0  40042.0  37372.0  36267.0  40534.0  47049.0  47918.0
4      1950  51927.0  47749.0  50439.0  47248.0  45515.0  40095.0  39798.0  38124.0  37075.0  42232.0  44418.0  49860.0

Note that the "Month" here might look like the index name but is actually df.columns.name. You can unset it if preferred:

df.columns.name = None

Which then gives you:

   Year       1        2        3        4        5        6        7        8        9        10       11       12
0  1946  70900.0  53958.0  57287.0  45376.0  42591.0  37721.0  37587.0  34880.0  35188.0  37842.0  42954.0  49596.0
1  1947  60453.0  56891.0  56442.0  45121.0  42605.0  37894.0  38364.0  36763.0  35768.0  40488.0  41361.0  46007.0
2  1948  46161.0  45412.0  51983.0  43829.0  42003.0  37084.0  39069.0  35272.0  35314.0  39588.0  43596.0  53899.0
3  1949  87861.0  58592.0  52772.0  44154.0  41896.0  39141.0  40042.0  37372.0  36267.0  40534.0  47049.0  47918.0
4  1950  51927.0  47749.0  50439.0  47248.0  45515.0  40095.0  39798.0  38124.0  37075.0  42232.0  44418.0  49860.0
Find earliest date within daterange
I have the following market data:

data = pd.DataFrame({
    'year': [2020]*42,
    'month': [10]*22 + [11]*20,
    'day': [1,2,5,6,7,8,9,12,13,14,15,16,19,20,21,22,23,26,27,28,29,30,
            2,3,5,6,9,10,11,12,13,16,17,18,19,20,23,24,25,26,27,30],
})
data['date'] = pd.to_datetime(data)
data['spot'] = [77.3438,78.192,78.1044,78.4357,78.0285,77.3507,76.78,77.13,77.0417,
                77.6525,78.0906,77.91,77.6602,77.3568,76.7243,76.5872,76.1374,76.4435,
                77.2906,79.2239,78.8993,79.5305,80.5313,79.3615,77.0156,77.4226,76.288,
                76.5648,77.1171,77.3568,77.374,76.1758,76.2325,76.0401,76.0529,76.1992,
                76.1648,75.474,75.551,75.7018,75.8639,76.3944]
data = data.set_index('date')

I'm trying to find the spot value for the first day of the month in the date column. I can find the first business day with the code below:

def get_month_beg(d):
    month_beg = (d.index + pd.offsets.BMonthEnd(0) - pd.offsets.MonthBegin(normalize=True))
    return month_beg

data['month_beg'] = get_month_beg(data)

However, due to data issues, sometimes the earliest date in my data does not match up with the first business day of the month. We'll call the earliest spot value of each month the "strike", which is what I'm trying to find. So for October the strike would be 77.3438 (10/1/20), and in November it would be 80.5313 (which is on 11/2/20, NOT 11/1/20).

I tried the below, which only works if my data's earliest date matches up with the first business date of the month (e.g. it works in Oct, but not in Nov):

data['strike'] = data.month_beg.map(data.spot)

As you can see, I get NaN in Nov because the first business day in my data is 11/2 (spot rate 80.5313), not 11/1. Does anyone know how to find the earliest date within a date range (in this case the earliest date of each month)? I was hoping the final df would look like below (the same construction as above, plus a strike column):

data['strike'] = [77.3438]*22 + [80.5313]*20
data = data.set_index('date')
I believe we can get the first() of every year and month combination and later join that with the main data (e.g. as in the sketch below this answer).

data2 = data.groupby(['year','month']).first().reset_index()
# join data2 with data based on month and year later on

   year  month  day     spot
0  2020     10    1  77.3438
1  2020     11    2  80.5313

Based on the question, what I have understood is that we need to take every month's first day and the respective 'spot' column value. Correct me if I have understood it wrong.
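A possible way to do the join-back step this answer leaves open (my sketch, not from the original answer; it assumes data still has its date index and names the result column strike, matching the question):

# Broadcast each month's first spot back onto every row of that month.
strikes = data2[['year', 'month', 'spot']].rename(columns={'spot': 'strike'})
data = (
    data.reset_index()
        .merge(strikes, on=['year', 'month'], how='left')
        .set_index('date')
)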
Strike = spot value from the first day of each month

To do this, we need to do the following:

Step 1: Get the year/month value from the date column. Alternatively, we can use the year and month columns you already have in the DataFrame.

Step 2: Group by year and month. That will give all the records by year+month. From this, we need to get the first record, which will be the earliest date of the month. The earliest date can be the 1st, 2nd, or 3rd of the month, depending on the data in the column.

Step 3: By using transform in the groupby, pandas will send back results matching the dataframe's length, so each record in a group gets the same result. In this example we have only 2 months (Oct & Nov) but 42 rows; transform will send us back 42 rows. The code groupby(['year','month'])['date'].transform('first') will give the first day of the month. Use this:

data['dy'] = data.groupby(['year','month'])['date'].transform('first')

or:

data['dx'] = data.date.dt.to_period('M')  # to get the yyyy-mm value

Step 4: Using transform, we can also get the spot value. This can be assigned to strike, giving us the desired result. Instead of returning the first day of the month, we change the transform to return the spot value, i.e. groupby(['year','month'])['spot'].transform('first'). Use this:

data['strike'] = data.groupby(['year','month'])['spot'].transform('first')

or:

data['strike'] = data.groupby('dx')['spot'].transform('first')

Putting all this together, the full code to get the strike price using the spot price from the first day of the month:

import pandas as pd
import numpy as np

data = pd.DataFrame({
    'year': [2020]*42,
    'month': [10]*22 + [11]*20,
    'day': [1,2,5,6,7,8,9,12,13,14,15,16,19,20,21,22,23,26,27,28,29,30,
            2,3,5,6,9,10,11,12,13,16,17,18,19,20,23,24,25,26,27,30],
})
data['date'] = pd.to_datetime(data)
data['spot'] = [77.3438,78.192,78.1044,78.4357,78.0285,77.3507,76.78,77.13,77.0417,
                77.6525,78.0906,77.91,77.6602,77.3568,76.7243,76.5872,76.1374,76.4435,
                77.2906,79.2239,78.8993,79.5305,80.5313,79.3615,77.0156,77.4226,76.288,
                76.5648,77.1171,77.3568,77.374,76.1758,76.2325,76.0401,76.0529,76.1992,
                76.1648,75.474,75.551,75.7018,75.8639,76.3944]

# Pick the first-day-of-month spot price as the strike price
data['strike'] = data.groupby(['year','month'])['spot'].transform('first')
print(data)

The output of this will be:

    year  month  day       date     spot   strike
0   2020     10    1 2020-10-01  77.3438  77.3438
1   2020     10    2 2020-10-02  78.1920  77.3438
2   2020     10    5 2020-10-05  78.1044  77.3438
3   2020     10    6 2020-10-06  78.4357  77.3438
4   2020     10    7 2020-10-07  78.0285  77.3438
5   2020     10    8 2020-10-08  77.3507  77.3438
6   2020     10    9 2020-10-09  76.7800  77.3438
7   2020     10   12 2020-10-12  77.1300  77.3438
8   2020     10   13 2020-10-13  77.0417  77.3438
9   2020     10   14 2020-10-14  77.6525  77.3438
10  2020     10   15 2020-10-15  78.0906  77.3438
11  2020     10   16 2020-10-16  77.9100  77.3438
12  2020     10   19 2020-10-19  77.6602  77.3438
13  2020     10   20 2020-10-20  77.3568  77.3438
14  2020     10   21 2020-10-21  76.7243  77.3438
15  2020     10   22 2020-10-22  76.5872  77.3438
16  2020     10   23 2020-10-23  76.1374  77.3438
17  2020     10   26 2020-10-26  76.4435  77.3438
18  2020     10   27 2020-10-27  77.2906  77.3438
19  2020     10   28 2020-10-28  79.2239  77.3438
20  2020     10   29 2020-10-29  78.8993  77.3438
21  2020     10   30 2020-10-30  79.5305  77.3438
22  2020     11    2 2020-11-02  80.5313  80.5313
23  2020     11    3 2020-11-03  79.3615  80.5313
24  2020     11    5 2020-11-05  77.0156  80.5313
25  2020     11    6 2020-11-06  77.4226  80.5313
26  2020     11    9 2020-11-09  76.2880  80.5313
27  2020     11   10 2020-11-10  76.5648  80.5313
28  2020     11   11 2020-11-11  77.1171  80.5313
29  2020     11   12 2020-11-12  77.3568  80.5313
30  2020     11   13 2020-11-13  77.3740  80.5313
31  2020     11   16 2020-11-16  76.1758  80.5313
32  2020     11   17 2020-11-17  76.2325  80.5313
33  2020     11   18 2020-11-18  76.0401  80.5313
34  2020     11   19 2020-11-19  76.0529  80.5313
35  2020     11   20 2020-11-20  76.1992  80.5313
36  2020     11   23 2020-11-23  76.1648  80.5313
37  2020     11   24 2020-11-24  75.4740  80.5313
38  2020     11   25 2020-11-25  75.5510  80.5313
39  2020     11   26 2020-11-26  75.7018  80.5313
40  2020     11   27 2020-11-27  75.8639  80.5313
41  2020     11   30 2020-11-30  76.3944  80.5313

Previous answer, to get the first day of each month (within the column data):

One way to do it is to create a dummy column to store the year-month of each row, then use drop_duplicates() and retain only the first row of each month. That will give you the first day of each month.

Key assumption: this logic assumes we have at least 2 rows for each month. If there is only one row for a month, it will not be part of the duplicates and you will NOT get that month's data.

# data constructed exactly as above

# create a dummy column to store the month of each row
data['dx'] = data.date.dt.to_period('M')

# drop duplicates while retaining only the first row of each month
dx = data.drop_duplicates('dx', keep='first')
print(dx)

The output of this will be:

    year  month  day       date     spot       dx
0   2020     10    1 2020-10-01  77.3438  2020-10
22  2020     11    2 2020-11-02  80.5313  2020-11

If there is only one row for a given month, then you can use a groupby on the month and take the first record:

data.groupby(['dx']).first()

This will give you:

         year  month  day       date     spot
dx
2020-10  2020     10    1 2020-10-01  77.3438
2020-11  2020     11    2 2020-11-02  80.5313
data['strike'] = data.groupby(['year','month'])['spot'].transform('first')

I guess this can be achieved with just this line, without creating any other dataframe.
pandas summation per year
python3, pandas version 0.23.4

Let's say we have a pandas DataFrame as follows:

np.random.seed(45)
df = pd.DataFrame({'A': np.random.randint(0, 10, 20)},
                  index=pd.to_datetime(dd).sort_values(ascending=False))
# dd: a list of 20 dates (its definition was not included in the question)

Now, I would like to total the data in column A with respect to each year. I could do:

df_perYear = df.groupby(by=df.index.year)
df_perYear.sum()

        A
2012   11
2013    8
2014   15
2015   44
2016   13
2017   11

However, I am wondering if there is a way to get the result posted in a new column, right by the last day of each year, as shown below:

            A  sum_per_year
2017-12-15  3            11
2017-11-27  0
2017-07-24  5
2017-06-28  3
2016-11-07  4            13
2016-06-03  9
2015-12-18  8            44
2015-10-16  1
2015-09-18  5
2015-07-15  9
2015-04-09  6
2015-03-18  8
2015-02-18  7
2014-10-21  8            15
2014-09-16  5
2014-01-29  2
2013-01-04  8             8
2012-12-28  1            11
2012-08-21  6
2012-03-02  4
You can use transform:

df_perYear = df.groupby(by=df.index.year)
df['new'] = df_perYear.transform('sum')
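A runnable sketch of the transform approach. Since dd is never defined in the question, a handful of dates are assumed here purely for illustration:

import numpy as np
import pandas as pd

# Assumed stand-in for the question's undefined `dd`
dd = ['2017-12-15', '2017-11-27', '2016-11-07', '2016-06-03', '2015-12-18']

np.random.seed(45)
df = pd.DataFrame({'A': np.random.randint(0, 10, len(dd))},
                  index=pd.to_datetime(dd).sort_values(ascending=False))

# Broadcast each year's total onto every row of that year
df['sum_per_year'] = df.groupby(by=df.index.year)['A'].transform('sum')

# Optional: blank out repeats so the total shows only on the first (latest)
# row of each year, as in the desired output above
year = pd.Series(df.index.year, index=df.index)
df.loc[year.duplicated(), 'sum_per_year'] = np.nan

print(df)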