Pandas - Sum of variant number of products - python-3.x
I have two dataframes:
The first one contains information about the number of daily arrivals of people relative to some particular day of interest, e.g. the '-2' column header refers to the number of people arriving 2 days prior to my day of interest, etc. These arrivals are split between two groups ('Rest of US' and 'International') via a column multiindex. Each row of the dataframe gives the expected number of arrivals for a different number of tickets owned. In summary, a single value of this dataframe is the number of people arriving on a certain day, for a particular group and number of tickets held.
import pandas as pd
import numpy as np
arrivals_US=np.array(
[[0.00000000e+00, 1.32853314e+03, 7.92624282e+03, 2.36446211e+04,
4.70225402e+04, 7.01360136e+04, 8.36885523e+04, 8.32165654e+04,
7.09262060e+04, 5.28946471e+04, 3.50642212e+04, 2.09198795e+04,
1.13464885e+04, 5.64124825e+03, 2.58896895e+03, 1.10330043e+03,
4.38831214e+02, 1.63633610e+02, 5.74273229e+01, 1.90344814e+01,
5.97698863e+00, 1.78298388e+00],
[0.00000000e+00, 5.95406456e+03, 2.65918120e+04, 5.93816593e+04,
8.84026873e+04, 9.87051626e+04, 8.81666331e+04, 6.56277324e+04,
4.18720044e+04, 2.33758901e+04, 1.16000548e+04, 5.18077143e+03,
2.10346912e+03, 7.82869343e+02, 2.68955436e+02, 8.57998806e+01,
2.55464064e+01, 7.13089804e+00, 1.87339647e+00, 4.64827256e-01,
1.09262813e-01, 2.43992667e-02]])
arrivals_Int = np.array(
[[1.80595142e+02, 1.01558052e+03, 2.85556902e+03, 5.35278380e+03,
7.52537259e+03, 8.46381767e+03, 7.93274228e+03, 6.37284860e+03,
4.47973074e+03, 2.79909538e+03, 1.57407707e+03, 8.04714088e+02,
3.77110526e+02, 1.63129911e+02, 6.55260304e+01, 2.45657991e+01,
8.63414246e+00, 2.85613404e+00, 8.92307156e-01, 2.64100407e-01,
7.42587051e-02, 1.98854935e-02],
[3.33606865e+03, 1.20883137e+04, 2.19011273e+04, 2.64530626e+04,
2.39633048e+04, 1.73663061e+04, 1.04878615e+04, 5.42899758e+03,
2.45901062e+03, 9.90030648e+02, 3.58739652e+02, 1.18172777e+02,
3.56834584e+01, 9.94613456e+00, 2.57428744e+00, 6.21865585e-01,
1.40833925e-01, 3.00185083e-02, 6.04292696e-03, 1.15245636e-03,
2.08797472e-04, 3.60277122e-05]])
arrivals = [arrivals_US, arrivals_Int]
days=list(range(-5,17))
tickets=[1,2]
arrivals_df = pd.concat([pd.DataFrame(dict(zip(days,arrival_type.T)), index=tickets) for arrival_type in arrivals], axis=1, keys=["Rest of US", "International"])
The second one has a similar structure to the first, but instead of the number of arrivals it represents the distribution of the number of days stayed, given the group type and number of tickets owned, e.g. on average 24% of people who belong to the group 'Rest of US' and hold a single ticket will stay for 3 days, etc.
stays_dist_US=np.array(
[[4.43244820e-03, 5.39982734e-02, 2.42003472e-01, 3.98996272e-01,
2.42003472e-01, 5.39982734e-02, 4.43244820e-03, 1.33848338e-04,
1.48692072e-06, 6.07670514e-09, 9.13595666e-12, 5.05295484e-15,
1.02811648e-18, 7.69563998e-23, 2.11910601e-27, 2.14667422e-32,
7.99991029e-38, 1.09675497e-43, 5.53145805e-50, 1.02630195e-56,
7.00513005e-64, 1.75898756e-71],
[1.33830425e-04, 4.43185500e-03, 5.39910468e-02, 2.41971084e-01,
3.98942874e-01, 2.41971084e-01, 5.39910468e-02, 4.43185500e-03,
1.33830425e-04, 1.48672173e-06, 6.07589189e-09, 9.13473400e-12,
5.05227860e-15, 1.02797889e-18, 7.69461007e-23, 2.11882241e-27,
2.14638693e-32, 7.99883965e-38, 1.09660819e-43, 5.53071778e-50,
1.02616460e-56, 7.00419255e-64]])
stays_dist_Int = np.array(
[[5.05227106e-15, 9.13472036e-12, 6.07588282e-09, 1.48671951e-06,
1.33830225e-04, 4.43184839e-03, 5.39909662e-02, 2.41970723e-01,
3.98942278e-01, 2.41970723e-01, 5.39909662e-02, 4.43184839e-03,
1.33830225e-04, 1.48671951e-06, 6.07588282e-09, 9.13472036e-12,
5.05227106e-15, 1.02797735e-18, 7.69459859e-23, 2.11881924e-27,
2.14638372e-32, 7.99882771e-38],
[1.02797735e-18, 5.05227106e-15, 9.13472036e-12, 6.07588282e-09,
1.48671951e-06, 1.33830225e-04, 4.43184839e-03, 5.39909662e-02,
2.41970723e-01, 3.98942278e-01, 2.41970723e-01, 5.39909662e-02,
4.43184839e-03, 1.33830225e-04, 1.48671951e-06, 6.07588282e-09,
9.13472036e-12, 5.05227106e-15, 1.02797735e-18, 7.69459859e-23,
2.11881924e-27, 2.14638372e-32]])
stays_dist = [stays_dist_US, stays_dist_Int]
len_of_stay = list(range(1,23))
tickets=[1,2]
stays_df = pd.concat([pd.DataFrame(dict(zip(len_of_stay,stays_dist_type.T)), index=tickets) for stays_dist_type in stays_dist], axis=1, keys=["Rest of US", "International"])
I would like to create a third dataframe depicting the number of departures on each day (with exactly the same column and index structure as the first one), by doing some sort of 'variant sum of products' (not sure what to call it). For example:
the number of 'Rest of US' people leaving on day '-4', given that they hold k tickets, should be the product of the number of such people arriving on day '-5' (value from the 1st dataframe) and the percentage of such people leaving after a single day (value from my length-of-stay distribution - the 2nd dataframe),
the number of 'Rest of US' people leaving on day '-3', given that they hold k tickets, should be the sum of two products:
Number of people arriving on day '-5' and percentage of people staying for 2 days,
Number of people arriving on day '-4' and percentage of people staying for 1 day,
the number of 'Rest of US' people leaving on day '-2', given that they hold k tickets, should be the sum of three products:
Number of people arriving on day '-5' and percentage of people staying for 3 days,
Number of people arriving on day '-4' and percentage of people staying for 2 days,
Number of people arriving on day '-3' and percentage of people staying for 1 day.
and so on, all the way until all days are considered for a given group; then the same is repeated for the next group.
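What's described above is a discrete convolution of the arrival counts with the length-of-stay distribution, so numpy can do the whole 'variant sum of products' in one call per (group, tickets) row. A minimal sketch of the idea, using random stand-in numbers for one such row rather than the real dataframes:

```python
import numpy as np

days = list(range(-5, 17))          # 22 arrival days, as in arrivals_df
stay_lengths = list(range(1, 23))   # stays of 1..22 days, as in stays_df

rng = np.random.default_rng(0)
arrivals_row = rng.random(len(days)) * 100   # stand-in arrivals per day
stays_row = rng.random(len(stay_lengths))
stays_row /= stays_row.sum()                 # a proper probability distribution

# full convolution: conv[k] = sum over i+j=k of arrivals_row[i] * stays_row[j]
conv = np.convolve(arrivals_row, stays_row)

# arrival day = days[0] + i and stay length = 1 + j, so the departure day is
# days[0] + 1 + k; day d therefore sits at index d - days[0] - 1 of conv
departures = {d: (conv[d - days[0] - 1] if d > days[0] else 0.0) for d in days}
```

For the full problem you would repeat this for each (group, tickets) combination, e.g. on arrivals_df.loc[t, group].to_numpy() and the matching row of stays_df, and reassemble the results into a dataframe with the same MultiIndex columns; the tail of conv beyond day 16 holds departures after the last arrival day, which you can keep or drop depending on whether you want the exact day range of arrivals_df.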
So far I've done it in Excel by manually expanding the formula with each new day I was considering:
[screenshot: example of the calculation in Excel for day '-2']
I tried to replicate it in Python with some for loops but it's messy and probably not efficient.
Is there any clever way one can do this with some pandas/numpy methods?
Thank you for any help and suggestions!
Related
How to build a simple moving average measure
I want to build a measure to get the simple moving average for each day. I have a cube with a single table containing stock market information [screenshot: schema]. The expected result is that for each date, this measure shows the average closing price of the X previous days up to that date. For example, for the date 2013-02-28 and for X = 5 days, this measure would show the average closing price for the days 2013-02-28, 2013-02-27, 2013-02-26, 2013-02-25 and 2013-02-22. The closing values of those days are summed, and then divided by 5. The same would be done for each of the rows [screenshot: example dashboard]. Maybe it could be achieved just with the function tt.agg.mean(), indicating those X previous days in the scope parameter. The problem is that I am not sure how to obtain the X previous days for each of the dates dynamically so that I can use them in a measure.
To compute a sliding average you can use the cumulative scope, as referenced in the atoti documentation: https://docs.atoti.io/latest/lib/atoti.scope.html#atoti.scope.cumulative. By passing a tuple containing the date range, in your case ("-5D", None), you will be able to calculate a sliding average over the past 5 days for each date in your data. The resulting Python code would be:

    import atoti as tt
    # session setup ...
    m, l = cube.measures, cube.levels
    # measure setup ...
    tt.agg.mean(m["ClosingPrice"], scope=tt.scope.cumulative(l["date"], window=("-5D", None)))
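Outside atoti, the same kind of trailing calendar-day window can be sanity-checked with plain pandas: rolling("5D") on a datetime-indexed series corresponds roughly to the ("-5D", None) cumulative window. A small sketch with hypothetical prices:

```python
import pandas as pd

# hypothetical daily closing prices indexed by date
prices = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    index=pd.date_range("2013-02-23", periods=6, freq="D"),
)

# trailing mean over a 5-calendar-day window ending at each date
sma = prices.rolling("5D").mean()
```

Note that a calendar-day window will not skip non-trading days the way the question's example (which reaches back to 2013-02-22 over a weekend) implies; for "5 most recent rows" semantics you would use rolling(5) on a series restricted to trading days instead.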
Finding average age of incidents in a datetime series
I'm new to Stack Overflow and fairly fresh with Python (some 5 months, give or take), so apologies if I'm not explaining this too clearly! I want to build up a historic trend of the average age of outstanding incidents on a daily basis. I have two dataframes. df1 contains incident data going back 8 years, the two most relevant columns being "opened_at" and "resolved_at", which contain datetime values. df2 contains a column called date with the full date range from 2012-06-13 to now. The goal is to have df2 contain the number of outstanding incidents on each date (as of 00:00:00) and the average age of all those deemed outstanding. I know it's possible to get all rows that exist between two dates, but I believe I want the opposite: to find where each date row in df2 falls between the opened_at and resolved_at dates in df1.
(It would be helpful to have some example code containing an anonymized/randomized short extract of your data to try solutions on.) This is unlikely to be the most efficient solution, but I believe you could do:

    df2["incs_open"] = 0  # Ensure the column exists
    for row_num in range(df2.shape[0]):
        df2.at[row_num, "incs_open"] = sum(
            (df1["opened_at"] < df2.at[row_num, "date"])
            & (df2.at[row_num, "date"] < df1["resolved_at"])
        )

(This assumes you haven't set an index on the data frame other than the default one.) For the second part of your question, the average age, I believe you can calculate that in the body of the loop like this:

    open_incs = (df1["opened_at"] < df2.at[row_num, "date"]) & \
        (df2.at[row_num, "date"] < df1["resolved_at"])
    ages_of_open_incs = df2.at[row_num, "date"] - df1.loc[open_incs, "opened_at"]
    avg_age = ages_of_open_incs.mean()

You'll hit some weirdnesses about rounding and things. If an incident was opened last night at 3am, what is its age "today" -- 1 day, 0 days, 9 hours (if we take noon as the point to count from), etc. -- but I assume once you've got code that works you can adjust that to taste.
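A tiny self-contained version of that counting logic, with made-up incidents (all names and dates here are hypothetical), can be used to check that the comparisons point the right way:

```python
import pandas as pd

# hypothetical incident data standing in for df1
df1 = pd.DataFrame({
    "opened_at":   pd.to_datetime(["2023-01-01", "2023-01-03", "2023-01-05"]),
    "resolved_at": pd.to_datetime(["2023-01-04", "2023-01-10", "2023-01-06"]),
})
df2 = pd.DataFrame({"date": pd.date_range("2023-01-01", "2023-01-07")})

# an incident is outstanding on a date if it opened strictly before that
# date and was resolved strictly after it
df2["incs_open"] = [
    int(((df1["opened_at"] < d) & (d < df1["resolved_at"])).sum())
    for d in df2["date"]
]
```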
Print the first value of a dataframe based on condition, then iterate to the next sequence
I'm looking to perform data analysis on 100 years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100 years. I have a pandas DataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, Precip Total, and then Day, Year, and Month values (and an index based on a date-time value). Right now, I want to set up a for loop to print the first maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work. I've experimented with various iterations of a for loop:

    for year in range(len(climate['Year'])):
        if (climate['Max'][year] >= 90).all():
            print(climate.index[year])
            break

Unsurprisingly, the output of the loop I provided prints the first 90 degree day (from the year 1919, the beginning of my data frame) and breaks:

    1919-06-12 00:00:00

That's fine. If I take out the break statement, all of the 90 degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the years? If I explicitly state the year, as below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out-of-bounds index. I know this logic is incorrect.

    count = 1919
    while count < 2019:
        for year in range(len(climate['Year'])):
            if (climate[climate['Year'] == count]['Max'][year] >= 90).all():
                print(climate.index[year])
        count = count + 1

Any input is sincerely appreciated.
You can achieve this without a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:

    current_year = None
    for i in range(climate.shape[0]):
        if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
            print(climate.index[i])
            current_year = climate['Year'][i]

Notice that we're using the current_year variable to keep track of the latest year that we've already printed a result for. Then, in the if check, we're checking whether we've already printed a result for the year of the current row. That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe containing the first >=90 max day of each year with the following (again assuming climate is ordered chronologically):

    climate[climate.Max >= 90].groupby('Year').first()

This filters the dataframe to only the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same per location per year:

    climate[climate.Max >= 90].groupby(['Location', 'Year']).first()
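A quick check of the groupby approach on a made-up extract (the values below are hypothetical, not real climate data):

```python
import pandas as pd

# hypothetical extract of the climate dataframe
climate = pd.DataFrame(
    {
        "Year": [1919, 1919, 1919, 1920, 1920],
        "Max":  [88, 91, 95, 89, 92],
    },
    index=pd.to_datetime(
        ["1919-06-10", "1919-06-12", "1919-07-01", "1920-06-20", "1920-07-04"]
    ),
)

# keep only >=90 days, then take the first such row per year
firsts = climate[climate.Max >= 90].groupby("Year").first()
```

One caveat: groupby(...).first() replaces the datetime index with Year, so if you need the date of the first hot day, call .reset_index() before the groupby and read it from the resulting column.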
How to Calculate Loan Balance at Any Given Point In Time Without Use of a Table in Excel
I'm trying to calculate the remaining balance of a home loan at any point in time, for multiple home loans. It looks like it is not possible to find the home loan balance without creating one of those long tables (example). Finding the future balance for multiple home loans would require setting up a table for each home (in this case, 25). With a table, when you want the balance after a certain number of payments have been made, you just visually scan the table for that period... But is there any single formula which shows the remaining loan balance by just changing the "time" variable (# of years/months in the future)? An example of the information I'm trying to find is "what would be the remaining balance on a home loan with the following criteria after 10 years":

original loan amt: $100K
term: 30-yr
rate: 5%
mthly pmts: $536.82
pmts per yr: 12

I'd hate to have to create 25 different amortization schedules - a lot of copy-paste-dragging... Thanks in advance!
You're looking for =FV(), or "future value". The function needs 5 inputs, as follows:

=FV(rate, nper, pmt, pv, type)

Where:

rate = interest rate for the period of interest. In this case, you are making payments and compounding interest monthly, so your rate is 0.05/12 = 0.00417.
nper = the number of periods elapsed. This is your 'time' variable, in this case the number of months elapsed.
pmt = the payment in each period, in your case $536.82.
pv = the 'present value', in this case the principal of the loan at the start, or -100,000. Note that for a debt example, you can use a negative value here.
type = whether payments are made at the beginning (1) or end (0) of the period.

In your example, to calculate the balance after 10 years, you could use:

=FV(0.05/12, 10*12, 536.82, -100000, 0)

Which produces:

81,342.32

For a loan this size, you would have $81,342.32 left to pay off after 10 years.
I don't like to post an answer when a brilliant answer already exists, but I want to add some perspective: understanding why the formula works and why you should use FV, as P.J correctly states. The linked example uses PV, and you can always double-check Present Value (PV) against Future Value (FV). Why? Because they are linked to each other: FV is the compounded value of PV, and PV is the value of FV discounted at the interest rate. [graph illustrating PV vs FV over time, source link]

In the example below I replicated the way the example calculates PV (column E, from excel-easy's Loan Amortization Schedule); in column F we use Excel's built-in PV function, and in column J we go the other way, with FV. Since they are linked, they must produce the same cash flows over time (a bit more tricky if the period/interest rate is not constant over time), and they indeed do. The payment number is the number of periods you want to look at (10 years * 12 payments per year = 120, yellow cells).

The PV function is composed of:

rate: discount rate per period
nper: total number of periods left (total periods - current period), here (12*30)-120
pmt: the fixed amount paid every month
fv: the value of the loan at the end, after 360 periods (30 years * 12 payments per year). The future value of a loan at its end is always 0.
type: when payments occur in the period, usually at the end (0).

PV(0.05/12, (12*30)-120, -536.82, 0, 0) = 81,342.06

which matches FV(0.05/12, 120, -536.82, 100000.00, 0) = -81,342.06 (opposite sign, by Excel's cash-flow convention: the payment is money going out, so it is entered as negative).
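The closed form behind Excel's FV can be sketched in Python to cross-check the $81,342 figure (the function name here is my own, not a library call): the balance after n payments is the compounded principal minus the compounded value of the payments made.

```python
def remaining_balance(principal, annual_rate, monthly_payment, months_elapsed):
    """Remaining loan balance, equivalent to FV(rate, n, pmt, -principal, 0)."""
    r = annual_rate / 12
    growth = (1 + r) ** months_elapsed
    # compounded principal minus the future value of the payment stream
    return principal * growth - monthly_payment * (growth - 1) / r

balance = remaining_balance(100_000, 0.05, 536.82, 10 * 12)
```

With the example's inputs this comes out at about 81,342.32, agreeing with the spreadsheet result.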
Find a growth rate that creates values adding to a determined total
I am trying to create a forecast tool that shows a smooth growth rate over a determined number of steps while adding up to a determined total. We have variables tied to certain sales values and want to illustrate different growth patterns. I am looking for a formula that would help us determine the value of each individual step. As an example: say we wanted to illustrate 100 units sold, starting with sales of 19 units, over 4 months with an even growth rate; we would need individual month sales of 19, 23, 27 and 31. We can find these values with a lot of trial and error, but I am hoping there is a formula I could use to calculate them automatically. We will have a starting value (current or last month's sales), a total amount of sales that we want to illustrate, and a period of time that we want to evaluate -- so all I am missing is a way to determine the change needed between individual values.
This is basically a problem in sequences and series. If the starting sales number is a, the difference in sales numbers between consecutive months is d, and the number of months is n, then the total sales is

S = n/2 * [2*a + (n-1) * d]

In your example, a=19, n=4, and S=100, with d unknown. That equation is easy to solve for d, and we get

d = 2 * (S - a * n) / (n * (n - 1))

There are other ways to write that, of course. If you substitute your example values into that expression, you get d=4, so the sales values increase by 4 each month.
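That closed-form solution is a one-liner in Python (step_size is a name I made up for illustration):

```python
def step_size(a, n, S):
    """Common difference d of an arithmetic series with first term a,
    n terms, and total S: solves S = n/2 * (2a + (n-1)d) for d."""
    return 2 * (S - a * n) / (n * (n - 1))

d = step_size(19, 4, 100)                 # the example's numbers -> 4.0
sales = [19 + d * i for i in range(4)]    # [19.0, 23.0, 27.0, 31.0]
```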
In Excel you can use this formula:

=IF(D1<>"",(D1-1)*($B$1-$B$2*$B$3)/SUMPRODUCT(ROW($A$1:INDEX(A:A,$B$3-1)))+$B$2,"")
I would recommend using Excel. This is simply a y = mx + b equation. Assuming you want a steady growth rate over a span of x periods, you can use this formula to determine the slope of your line (the growth rate, designated 'm'). As long as you have your two data points (starting sales value and ending sales value) you can find 'm' using

m = (y2 - y1) / (x2 - x1)

That will calculate the slope. y2 represents your final sales goal and y1 your current sales level. x2 is the number of periods in the period of performance (how many months you are giving yourself to achieve the goal), and x1 = 0 since it represents today, which is time period 0. Once you solve for 'm', it plugs into the formula y = mx + b. Your 'b' in this scenario will always equal your current sales level (it represents the y intercept). Then all you have to do to calculate the new 'y', which represents the sales level at any period, is plug in any x value you choose. So if you are in the first month, x = 1; if you are in the second month, x = 2. The 'm' and 'b' stay the same. [screenshot: Excel template serving as a rudimentary model; the yellow boxes are filled in by the user and the white boxes are left as formulas]