I have two dataframes:
The first one contains information about the number of daily arrivals relative to some particular day of interest, e.g. the '-2' column header refers to the number of people arriving two days prior to my day of interest, etc. These arrivals are split between two groups ('Rest of US' and 'International') via a column MultiIndex. Each row of the dataframe shows the expected number of arrivals for a given number of tickets owned. To sum up, a single value of the dataframe is the number of people arriving on a certain day, for a particular group and a particular number of tickets held.
import pandas as pd
import numpy as np
arrivals_US=np.array(
[[0.00000000e+00, 1.32853314e+03, 7.92624282e+03, 2.36446211e+04,
4.70225402e+04, 7.01360136e+04, 8.36885523e+04, 8.32165654e+04,
7.09262060e+04, 5.28946471e+04, 3.50642212e+04, 2.09198795e+04,
1.13464885e+04, 5.64124825e+03, 2.58896895e+03, 1.10330043e+03,
4.38831214e+02, 1.63633610e+02, 5.74273229e+01, 1.90344814e+01,
5.97698863e+00, 1.78298388e+00],
[0.00000000e+00, 5.95406456e+03, 2.65918120e+04, 5.93816593e+04,
8.84026873e+04, 9.87051626e+04, 8.81666331e+04, 6.56277324e+04,
4.18720044e+04, 2.33758901e+04, 1.16000548e+04, 5.18077143e+03,
2.10346912e+03, 7.82869343e+02, 2.68955436e+02, 8.57998806e+01,
2.55464064e+01, 7.13089804e+00, 1.87339647e+00, 4.64827256e-01,
1.09262813e-01, 2.43992667e-02]])
arrivals_Int = np.array(
[[1.80595142e+02, 1.01558052e+03, 2.85556902e+03, 5.35278380e+03,
7.52537259e+03, 8.46381767e+03, 7.93274228e+03, 6.37284860e+03,
4.47973074e+03, 2.79909538e+03, 1.57407707e+03, 8.04714088e+02,
3.77110526e+02, 1.63129911e+02, 6.55260304e+01, 2.45657991e+01,
8.63414246e+00, 2.85613404e+00, 8.92307156e-01, 2.64100407e-01,
7.42587051e-02, 1.98854935e-02],
[3.33606865e+03, 1.20883137e+04, 2.19011273e+04, 2.64530626e+04,
2.39633048e+04, 1.73663061e+04, 1.04878615e+04, 5.42899758e+03,
2.45901062e+03, 9.90030648e+02, 3.58739652e+02, 1.18172777e+02,
3.56834584e+01, 9.94613456e+00, 2.57428744e+00, 6.21865585e-01,
1.40833925e-01, 3.00185083e-02, 6.04292696e-03, 1.15245636e-03,
2.08797472e-04, 3.60277122e-05]])
arrivals = [arrivals_US, arrivals_Int]
days=list(range(-5,17))
tickets=[1,2]
arrivals_df = pd.concat([pd.DataFrame(dict(zip(days,arrival_type.T)), index=tickets) for arrival_type in arrivals], axis=1, keys=["Rest of US", "International"])
The second one has a similar structure to the first, but instead of the number of arrivals it represents a distribution of the number of days stayed, given the group type and number of tickets owned, e.g. on average 24% of people who are part of the 'Rest of US' group and hold a single ticket will stay for 3 days, etc.
stays_dist_US=np.array(
[[4.43244820e-03, 5.39982734e-02, 2.42003472e-01, 3.98996272e-01,
2.42003472e-01, 5.39982734e-02, 4.43244820e-03, 1.33848338e-04,
1.48692072e-06, 6.07670514e-09, 9.13595666e-12, 5.05295484e-15,
1.02811648e-18, 7.69563998e-23, 2.11910601e-27, 2.14667422e-32,
7.99991029e-38, 1.09675497e-43, 5.53145805e-50, 1.02630195e-56,
7.00513005e-64, 1.75898756e-71],
[1.33830425e-04, 4.43185500e-03, 5.39910468e-02, 2.41971084e-01,
3.98942874e-01, 2.41971084e-01, 5.39910468e-02, 4.43185500e-03,
1.33830425e-04, 1.48672173e-06, 6.07589189e-09, 9.13473400e-12,
5.05227860e-15, 1.02797889e-18, 7.69461007e-23, 2.11882241e-27,
2.14638693e-32, 7.99883965e-38, 1.09660819e-43, 5.53071778e-50,
1.02616460e-56, 7.00419255e-64]])
stays_dist_Int = np.array(
[[5.05227106e-15, 9.13472036e-12, 6.07588282e-09, 1.48671951e-06,
1.33830225e-04, 4.43184839e-03, 5.39909662e-02, 2.41970723e-01,
3.98942278e-01, 2.41970723e-01, 5.39909662e-02, 4.43184839e-03,
1.33830225e-04, 1.48671951e-06, 6.07588282e-09, 9.13472036e-12,
5.05227106e-15, 1.02797735e-18, 7.69459859e-23, 2.11881924e-27,
2.14638372e-32, 7.99882771e-38],
[1.02797735e-18, 5.05227106e-15, 9.13472036e-12, 6.07588282e-09,
1.48671951e-06, 1.33830225e-04, 4.43184839e-03, 5.39909662e-02,
2.41970723e-01, 3.98942278e-01, 2.41970723e-01, 5.39909662e-02,
4.43184839e-03, 1.33830225e-04, 1.48671951e-06, 6.07588282e-09,
9.13472036e-12, 5.05227106e-15, 1.02797735e-18, 7.69459859e-23,
2.11881924e-27, 2.14638372e-32]])
stays_dist = [stays_dist_US, stays_dist_Int]
len_of_stay = list(range(1,23))
tickets=[1,2]
stays_df = pd.concat([pd.DataFrame(dict(zip(len_of_stay,stays_dist_type.T)), index=tickets) for stays_dist_type in stays_dist], axis=1, keys=["Rest of US", "International"])
I would like to create a 3rd dataframe which would depict the number of departures on each day (with exactly the same column and index structure as the 1st one), by doing some sort of 'variant sum of products' (not sure how to name it). For example:
the number of 'Rest of US' people leaving on day '-4', given that they have k tickets, should be the product of the number of such people arriving on day '-5' (value from the 1st dataframe) and the percentage of such people leaving after a single day (value from my length-of-stay distribution, the 2nd dataframe);
the number of 'Rest of US' people leaving on day '-3', given that they have k tickets, should be the sum of two products:
number of people arriving on day '-5' times the percentage of people staying for 2 days,
number of people arriving on day '-4' times the percentage of people staying for 1 day;
the number of 'Rest of US' people leaving on day '-2', given that they have k tickets, should be the sum of three products:
number of people arriving on day '-5' times the percentage of people staying for 3 days,
number of people arriving on day '-4' times the percentage of people staying for 2 days,
number of people arriving on day '-3' times the percentage of people staying for 1 day;
and so on until all days are considered for a given group, and then the same is repeated for the next one.
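To make the pattern concrete: each departure count is a sum of arrival-count × stay-probability products, which is a discrete convolution of the arrivals with the stay-length distribution. A minimal sketch with made-up numbers (not my real dataframes above), convolving one row per ticket count:

```python
import numpy as np
import pandas as pd

# Made-up miniature of the problem: arrivals over days -2..2 and a
# stay-length distribution over 1..3 days, one row per ticket count.
days = list(range(-2, 3))
tickets = [1, 2]
arrivals = np.array([[10., 20., 30., 20., 10.],
                     [ 5., 15., 25., 15.,  5.]])
stay_dist = np.array([[0.5, 0.3, 0.2],
                      [0.2, 0.5, 0.3]])

def departures(arrival_row, stay_row, days):
    # Entry k of the full convolution is the number of people departing
    # on day days[0] + 1 + k (arrival day + length of stay).
    conv = np.convolve(arrival_row, stay_row)
    out = np.zeros(len(days))
    for i, d in enumerate(days):
        k = d - (days[0] + 1)
        if 0 <= k < len(conv):
            out[i] = conv[k]
    return out

dep = np.vstack([departures(a, s, days) for a, s in zip(arrivals, stay_dist)])
dep_df = pd.DataFrame(dep, index=tickets, columns=days)
```

For instance, departures on day '0' for the first row are 10·0.3 + 20·0.5 = 13, matching the hand-expanded sums described above; days later than the last arrival day are simply truncated to keep the same columns as the arrivals frame.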
So far I've done it in Excel by manually expanding the formula with each new day I was considering:
[screenshot: example of the calculation in Excel for day '-2']
I tried to replicate it in Python with some for loops but it's messy and probably not efficient.
Is there any clever way one can do this with some pandas/numpy methods?
Thank you for any help and suggestions!
I'm looking to perform data analysis on 100 years of climatological data for select U.S. locations (8 in particular), for each day spanning the 100 years. I have a pandas DataFrame set up with columns for Max temperature, Min temperature, Avg temperature, Snowfall, and Precip Total, and then Day, Year, and Month values (plus an index based on a date-time value). Right now, I want to set up a for loop to print the first maximum temperature of 90 degrees F or greater from each year, but ONLY the first. Eventually, I want to narrow this down to each of my 8 locations, but first I just want to get the for loop to work.
Experimented with various iterations of a for loop.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
Unsurprisingly, the output of the loop I provided prints the first 90 degree day period (from the year 1919, the beginning of my data frame) and breaks.
for year in range(len(climate['Year'])):
    if (climate['Max'][year] >= 90).all():
        print(climate.index[year])
        break
1919-06-12 00:00:00
That's fine. If I take out the break statement, all of the 90-degree days print, including multiple in the same year. I just want the first value from each year to print. Do I need to set up a second for loop to increment through the years? If I explicitly state the year, as below, while trying to loop through a counter, the loop still begins in 1919 and eventually reaches an out-of-bounds index. I know this logic is incorrect.
count = 1919
while count < 2019:
    for year in range(len(climate['Year'])):
        if (climate[climate['Year'] == count]['Max'][year] >= 90).all():
            print(climate.index[year])
    count = count + 1
Any input is sincerely appreciated.
You can achieve this without having a second for loop. Assuming the climate dataframe is ordered chronologically, this should do what you want:
current_year = None
for i in range(climate.shape[0]):
    if climate['Max'][i] >= 90 and climate['Year'][i] != current_year:
        print(climate.index[i])
        current_year = climate['Year'][i]
Notice that we're using the current_year variable to keep track of the latest year that we've already printed the result for. Then, in the if check, we're checking if we've already printed a result for the year of the current row in the loop.
That's one way to do it, but I would suggest taking a look at pandas.DataFrame.groupby because I think it fits your use case well. You could get a dataframe that contains the first >=90 max days per year with the following (again assuming climate is ordered chronologically):
climate[climate.Max >= 90].groupby('Year').first()
This just filters the dataframe to only contain the >=90 max days, groups rows from the same year together, and retains only the first row from each group. If you had an additional column Location, you could extend this to get the same except per location per year:
climate[climate.Max >= 90].groupby(['Location', 'Year']).first()
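As a self-contained illustration (a tiny made-up frame standing in for the real `climate` DataFrame), note that `groupby('Year').first()` returns a frame indexed by Year, so if you want the dates printed you should first move the date index into a column:

```python
import pandas as pd

# Tiny made-up frame standing in for the real `climate` DataFrame.
climate = pd.DataFrame(
    {"Year": [1919, 1919, 1919, 1920, 1920],
     "Max":  [88, 92, 95, 91, 89]},
    index=pd.to_datetime(["1919-06-10", "1919-06-12", "1919-07-01",
                          "1920-06-20", "1920-08-01"]),
)

# Keep only the >= 90 rows, preserve the date as a column via
# reset_index(), then take the first such row per year.
first_hot = climate[climate.Max >= 90].reset_index().groupby("Year").first()
print(first_hot)
```

Here `first_hot` has one row per year, with the date of the year's first 90-degree day in the `index` column (1919-06-12 and 1920-06-20 in this made-up data).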
I have a table within PowerPivot that tracks a count of customers through our sales pipeline, from first interaction to charged sale (by sales location). So far, I've created a moving 5-day average for each task. Below is the DAX formula I've created thus far, along with an example table.
=
CALCULATE (
    SUM ( [Daily Count] ),
    DATESINPERIOD ( Table1[Date], LASTDATE ( Table1[Date] ), -7, DAY ),
    ALLEXCEPT ( Table1, Table1[Sales Location], Table1[Group] )
)
    / 5
Where I’m struggling is being able to come up with a way to exclude weekends and company observed holidays. Additionally, if a holiday falls on a weekday I would like to remove that from the average and go back an additional day (to smooth the trend).
For example, on 11/26/18 (the Monday after Thanksgiving and Black Friday) I would like to average the five previous business days (11/26/18, 11/21-11/19, and 11/16). In the example above, the moving total and average for the previous 5 days should be Intake = 41 (total) and 8.2 (average), Appointment = 30 (total) and 6 (average), and Sale = 13 (total) and 2.6 (average).
Based on the formula currently, each of these numbers is inaccurate. Is there an easy way to exclude these days?
Side note: I’ve created an ancillary table with all holidays that is related to the sales data that I have.
Thank you for the help!
For this, I'd recommend using a calendar table related to Table1 on the Date column that also has a column IsWorkday with 1 if that day is a workday and 0 otherwise.
Once you have that set up, you can write a measure like this:
Moving Avg =
VAR Last5Workdays =
    SELECTCOLUMNS (
        TOPN (
            5,
            FILTER (
                DateTable,
                DateTable[Date] <= EARLIER ( Table1[Date] )
                    && DateTable[IsWorkday] = 1
            ),
            DateTable[Date], DESC
        ),
        "Workday", DateTable[Date]
    )
RETURN
    CALCULATE (
        SUM ( Table1[Daily Count] ),
        Table1[Date] IN Last5Workdays,
        ALLEXCEPT ( Table1, Table1[Sales Location], Table1[Group] )
    )
        / 5
The TOPN function here returns the top 5 rows of the DateTable where each row must be a workday that is less than or equal to the date in your current Table1 row (the EARLIER function refers to the earlier row context that defines the current row).
I then use SELECTCOLUMNS to turn this table into a list by selecting a single column (which I've named Workday). From there, it's basically your measure with the date filter changed a bit.
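Outside DAX, the "last five workdays" idea is easy to sanity-check. A rough Python/pandas sketch using the Thanksgiving-week dates from the question (the Intake counts here are made up, chosen only to total 41): because non-workdays are simply absent from the series, a plain 5-row rolling mean already averages the last five workdays.

```python
import pandas as pd

# Workday-only dates around Thanksgiving 2018; 11/22-11/23 and the
# weekend are absent. Counts are hypothetical, summing to 41.
dates = pd.to_datetime(["2018-11-16", "2018-11-19", "2018-11-20",
                        "2018-11-21", "2018-11-26"])
intake = pd.Series([8, 9, 8, 8, 8], index=dates)

# Rows exist only for workdays, so a 5-row window spans five workdays.
moving_avg = intake.rolling(5).mean()
```

The value for 11/26 comes out to 41 / 5 = 8.2, matching the Intake example in the question.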
#alexisolson Thank you for the response here. I was actually able to figure this out over the weekend but forgot to close out the thread (sorry about that! Appreciate your help either way). But I did something fairly similar to what you mentioned above.
I created a date table (CorpCalendar) that only includes working days, then added an index column within CorpCalendar to give each row a unique number in ascending order. From there, I linked CorpCalendar to my SalesData table by related dates and used the LOOKUPVALUE function to bring the index value over from CorpCalendar to SalesData. In a separate column I subtracted 4 from the date index value to get an index-adjustment column (giving a range of five days between the actual date index and the adjustment, if that makes sense). I then added an additional LOOKUPVALUE helper column to match the adjusted date index column to the appropriate working day. Lastly, I used the following function to get the 5-day rolling average.
=
CALCULATE (
    SUM ( Combined[Daily Count] ),
    DATESBETWEEN (
        Combined[Date - Adjusted],
        Combined[Date - Adjusted (-5)],
        Combined[Date - Adjusted]
    ),
    ALLEXCEPT ( Combined, Combined[Group] )
)
    / 5
This is probably more convoluted than necessary, however, it got me to the answer I was looking for. Let me know if this makes sense and if you have any suggestions for future scenarios like this.
Thanks again!
I've spent the last 2 days trying to get this, and I really just need a few pointers. I'm using Excel 2010 with Power Pivot and calculating inventories. I am trying to get the amount sold between 2 dates. I recorded the quantity on hand if the item was in stock.
Item # Day Date Qty
Black Thursday 11/6/2014 2
Blue Thursday 11/6/2014 3
Green Thursday 11/6/2014 3
Black Friday 11/7/2014 2
Green Friday 11/7/2014 2
Black Monday 11/10/2014 3
Blue Monday 11/10/2014 4
Green Monday 11/10/2014 3
Is there a way to do this in DAX? I may have to go back and calculate the differences for each record in Excel, but I'd like to avoid that if possible.
Some things that have made this hard for me:
1) I only record the inventory Mon-Fri. I am not sure this will always be the case, so I'd like to avoid a dependency on these being only weekdays.
2) When there are none in stock, I don't have a record for that day.
I've tried CALCULATE with DATEADD, and it gave me results that were nearly right, but it ended up filtering out some of the results. Really odd, but almost right.
Any Help is appreciated.
Bryan, this may not totally answer your question as there are a couple of things that aren't totally clear to me but it should give you a start and I'm happy to expand my answer if you provide further info.
One 'pattern' you can use involves the TOPN function: with n = 1 it can return the earliest or latest value from a table, sorted by date, and the table can be filtered to dates earlier or later than those specified.
For this example I am using a 'disconnected' date table from which the user would select the two dates required in a slicer or report filter:
=
CALCULATE (
    SUM ( inventory[Qty] ),
    TOPN (
        1,
        FILTER ( inventory, inventory[Date] <= MAX ( dates[Date] ) ),
        inventory[Date],
        0
    )
)
In this case the TOPN returns a single-row table holding the latest date earlier than or equal to the latest date provided. The 1st argument of TOPN specifies the number of rows, the 2nd the table to use, the 3rd the column to sort on, and the 4th says to sort descending.
From here it is straightforward to adapt this for a second measure that finds the value for the latest date before or equal to the earliest date selected (i.e. swap MIN for MAX in MAX(dates[Date])).
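For intuition, here is a rough pandas analogue of the same "top 1 row sorted by date descending" idea, using the sample rows from the question (the helper name `qty_on_or_before` is just illustrative):

```python
import pandas as pd

# Sample inventory rows from the question.
inventory = pd.DataFrame({
    "Item": ["Black", "Blue", "Green", "Black", "Green",
             "Black", "Blue", "Green"],
    "Date": pd.to_datetime(["2014-11-06", "2014-11-06", "2014-11-06",
                            "2014-11-07", "2014-11-07",
                            "2014-11-10", "2014-11-10", "2014-11-10"]),
    "Qty":  [2, 3, 3, 2, 2, 3, 4, 3],
})

def qty_on_or_before(df, item, date):
    # Latest recorded quantity for `item` on or before `date` --
    # the pandas counterpart of TOPN(1, ..., descending).
    rows = df[(df["Item"] == item) & (df["Date"] <= date)]
    if rows.empty:
        return 0  # no record on file means none in stock
    return rows.sort_values("Date")["Qty"].iloc[-1]
```

Note how this also handles the gaps: Blue has no 11/7 record, so a lookup on 11/7 falls back to the latest earlier record (11/6, Qty 3), which is the behavior the TOPN filter gives you in DAX.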
Hope this helps.
Jacob
*prettified using daxformatter.com