Select amount of past data when calculating features - featuretools

I'm wondering if there is a way to automatically select the amount of past data when calculating features.
For example, I might want to predict when a customer will make their next purchase, so it would be good to know the count of purchases or the average purchase price over different date cutoffs, e.g. purchases in the last 12 months, last 3 months, last 7 days, etc.
What is the best way to approach this with featuretools?

You can create a feature matrix that uses only a certain amount of historical data using the training_window parameter in featuretools.dfs. When training_window is set, Featuretools will use the historical data between the cutoff time and cutoff_time - training_window. Here's the example from the documentation:
window_fm, window_features = ft.dfs(entityset=es,
                                    target_entity="customers",
                                    cutoff_time=cutoff_times,
                                    cutoff_time_in_index=True,
                                    training_window="1 hour")
When determining which data is valid for use, Featuretools checks whether the time in the time_index column falls within the training window.
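For the multiple-window case in the question (12 months, 3 months, 7 days), one approach is to call ft.dfs once per window and combine the results. This is a minimal sketch, assuming the EntitySet es from the example above and a customer id column named customer_id; day-based window strings sidestep any month-length ambiguity:

import pandas as pd
import featuretools as ft

# assumed to exist already: an EntitySet `es` with a "customers" entity
# cutoff times: one row per customer instance and prediction time
cutoff_times = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "time": pd.to_datetime(["2014-01-01"] * 3),
})

# one feature matrix per lookback window
matrices = {}
for window in ["365 days", "90 days", "7 days"]:
    fm, features = ft.dfs(entityset=es,
                          target_entity="customers",
                          cutoff_time=cutoff_times,
                          cutoff_time_in_index=True,
                          training_window=window)
    matrices[window] = fm.add_suffix(" (last " + window + ")")

# stitch the windows together: one column set per lookback period
combined = pd.concat(list(matrices.values()), axis=1)

Each run reuses the same cutoff times, so the suffixed columns line up row-for-row when concatenated.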

Related

Binomial Options Pricing Calculation in PowerQuery

I'm trying to build an Excel sheet that calculates synthetic options prices and Greeks for time-series data, to model intraday options pricing. The input is simply intraday price data, anywhere from tick level to 5-minute intervals. I found this https://www.thebiccountant.com/2021/12/28/black-scholes-option-pricing-with-power-query-in-power-bi/ which covers Power BI and Black-Scholes, but possibly not very accurately. I prefer the binomial method (I used this excellent tutorial to build a manual version for a large number of strikes, but it takes a long time to calculate, is very complex, and is also inaccurate because Excel tops out before it can compute many steps: https://www.macroption.com/binomial-option-pricing-excel/).
Does anyone have any idea whether it is possible to create an entire column in Power Query that calculates binomially derived options prices using more than 100, even up to 1,000, steps? The reason is that intraday pricing using high-resolution data (5 min, 1 min, seconds, and tick) needs a large number of steps to converge properly, I think. This is just about building a good-enough model that can be used for visualising the progress of a trade on a given day.
Any pointers on how this could be done and calculated using M Language would be much appreciated and useful!
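For reference, here is the algorithm in question outside of M: a minimal Python sketch of a Cox-Ross-Rubinstein tree for a European option (the parameter values at the bottom are illustrative). Each backward-induction step is a single pass over an array, so even 1,000 steps is cheap in a general-purpose language; the challenge in Power Query would be expressing this recurrence efficiently, e.g. with List.Generate:

import math

def crr_price(s, k, t, r, sigma, steps, call=True):
    # Cox-Ross-Rubinstein binomial tree for a European option
    dt = t / steps
    u = math.exp(sigma * math.sqrt(dt))    # up factor
    d = 1.0 / u                            # down factor
    p = (math.exp(r * dt) - d) / (u - d)   # risk-neutral up probability
    disc = math.exp(-r * dt)               # one-step discount factor

    # option values at expiry; node j = j up-moves out of `steps`
    values = []
    for j in range(steps + 1):
        price = s * (u ** j) * (d ** (steps - j))
        values.append(max(0.0, price - k) if call else max(0.0, k - price))

    # backward induction to the root node
    for _ in range(steps):
        values = [disc * (p * values[j + 1] + (1 - p) * values[j])
                  for j in range(len(values) - 1)]
    return values[0]

# e.g. a 30-day call: spot 100, strike 105, 5% rate, 20% vol, 1,000 steps
print(crr_price(100, 105, 30 / 365, 0.05, 0.20, 1000))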

Dynamic Percentile Analysis Across Multiple Categories - PowerPivot / DAX

I've spent a lot of time trying to find a solution to the following issue, but I haven't been able to. There are similar threads both here and on other forums, but they don't seem to be applicable. Please let me know if I'm going against any best practices for posting on this forum.
I would like to be able to dynamically (and hopefully as simply as possible) create measures (ideally NOT via calculated columns) in Power Pivot to carry out percentile analysis (e.g., the value associated with the top quartile, top quintile, third decile, etc.) on different subsets of my data (in a pivot table). For example, I might want to compute the percentile based on the yearly sales associated with a shop (although the records I have are monthly, or for another time period).
Here is what this data could look like as an example, as well as the results it should produce on this data (I worked these out in a makeshift way using Excel). I know there is a way to do this using calculated columns, but I want to try to do it using measures (e.g., maybe using a combination of SUMX, percentile functions, TOPN?).
In case you're not able to view the picture of my data, my data is structured as such:
Shop ID | Value | Metric | Period (e.g., mm/yy) | Franchised or Co-Owned | Year | Quarter
--------|-------|--------|----------------------|------------------------|------|--------
1       | 50    | ...    | ...                  | ...                    | ...  | ...
2       | 70    | ...    | ...                  | ...                    | ...  | ...
3       | 90    | ...    | ...                  | ...                    | ...  | ...
(remaining values elided; please see the screenshot, thank you)
Additional explanation on the data:
Shop ID could have many entries
Value is the value for each metric; the records are structured so there is a value for each metric, for each shop ID, for each month (or other time period)
Metric could be things like sales, EBITDA, car count, etc.
Period is typically a month
Shop status could be "Co-Owned" or "Franchised"
Year and Quarter are derived from the period
I want to be able to get percentile values for sales in a given period (e.g., total yearly sales for a given year, total quarterly sales, etc.) for whatever slicers I have applied to the current pivot table.
Super grateful for any help!
Thanks,
Louis
OK, I think I found an answer. Something like this formula might work:
PERCENTILEX.INC(
    ALLSELECTED(Facts[ID]),
    SUMX(ALLSELECTED(Facts[Period]), [Sum Values]),
    [Percentile Definition]
)
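To sanity-check what that measure computes, here is the same logic as a small pandas sketch (the data and column names below are made up): total the value per shop over the selected periods, then take an inclusive percentile across the per-shop totals:

import pandas as pd

# made-up data in the shape described above (two monthly records per shop)
df = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3, 3],
    "year":    [2023] * 6,
    "value":   [50, 40, 70, 65, 90, 85],
})

# sum per shop across the selected periods, then take the percentile
yearly_per_shop = df.groupby("shop_id")["value"].sum()
print(yearly_per_shop.quantile(0.75))  # linear interpolation, PERCENTILE.INC-style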

Using QDigest over a date range

I need to keep a 28 day history for some dashboard data. Essentially I have an event/action that is recorded through our BI system. I want to count the number of events and the distinct users who do that event for the past 1 day, 7 days and 28 days. I also use grouping sets (cube) to get the fully segmented data by country/browser/platform etc.
The old way was to do this by keeping a 28-day history per user, for all segments. So if a user accessed the site from mobile and desktop every day for all 28 days, they would have 56 rows in the DB. This ends up being a large table, and it is time-consuming even to calculate approx_distinct, let alone exact distinct counts. But the bigger issue is that I also wish to calculate approx_percentiles.
So I started investigating the use of HyperLogLog: https://prestodb.io/docs/current/functions/hyperloglog.html
This works great; it's much more efficient to store the sketches daily rather than the entire list of unique users per day. As I am using approx_distinct, the values are close enough, and it works.
I then noticed a similar structure for medians: qdigest.
https://prestodb.io/docs/current/functions/qdigest.html
Unfortunately, the documentation on this page is not nearly as good as on the previous pages, so it took me a while to figure it out. It works great for calculating daily medians, but it does not work if I want to calculate the median actions per user over the longer time periods. The HyperLogLog docs demonstrate how to calculate approx_distinct users over a time period, but the qdigest docs give no such example.
When I try something analogous to the HLL date-range example with qdigest, I get results similar to the 1-day results.
Because you need medians that are aggregated (summed) across multiple days on a per-user basis, you'll need to perform that aggregation prior to insertion into the qdigest for this to work for 7- and 28-day per-user counts. In other words, the units of the data need to be consistent: if daily values are being inserted into the qdigest, you can't use that qdigest for 7- or 28-day per-user counts of the events.
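A quick way to see why the units matter, as a plain-Python sketch with made-up counts (not Presto SQL): pooling daily values, which is effectively what merging daily qdigests does, answers a different question than the median of per-user weekly totals:

import random
from statistics import median

random.seed(0)

# made-up per-user daily event counts over a 7-day window
daily_counts = {day: {user: random.randint(1, 10) for user in range(1000)}
                for day in range(7)}

# merging daily qdigests effectively pools the daily values:
pooled_daily = [c for day in daily_counts.values() for c in day.values()]
print("median of pooled daily counts:", median(pooled_daily))

# what we actually want: sum per user across the window first
weekly_per_user = [sum(daily_counts[day][user] for day in range(7))
                   for user in range(1000)]
print("median of 7-day per-user counts:", median(weekly_per_user))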

Passing parameter to the drill thru report in cognos

How to Pass a Calculated Member/measure to a Drill-thru Target Report
In order to avoid using calculated members (from Googling, some people were saying you cannot pass them via drill-thru), I went back to my FM model and created 3 new measures (High Risk, Medium Risk and Low Risk). These now show up in the drill-thru definition's parameter list. My only problem is: how can I check which of the three measures the user selected?
Remember, I basically have a line chart with 3 lines, one for each measure above (High, Medium or Low Risk), by time frame. A user will select a data point, e.g., High Risk for March, or Medium Risk for Semester 2. I then need to pass the value for that data point to my target (2nd) report. How can I check which of the three measure values was passed through?

Statistical method for time-course data comparison

I have a question about a statistical method that I can't find in my textbook. I want to compare data from two groups. For example, both groups have data for day 0, but one group has data for day 2 and the other for day 6. How can I analyse the outcome using both the data and the date? I.e., I want to show that if the data taken on day XX are YY, it has an impact on the outcome.
Thanks in advance.
I'd use a repeated-measures ANOVA in this case. However, since you don't have a complete dataset, day X and day Y would just be operationalized as the endpoint of your dependent variable. If you had measures for all days, I'd include all of them in the analysis in order to fully compare the two timelines. You could then also compare the days of interest directly using post-hoc tests (e.g., with Bonferroni correction).
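As a concrete sketch of that suggestion, collapsing the design to baseline vs. endpoint as described above, and using Python's pingouin package (my choice of tool, with made-up numbers, purely for illustration):

import pandas as pd
import pingouin as pg

# made-up long-format data, one row per subject per time point;
# "endpoint" stands in for day 2 (group A) or day 6 (group B)
df = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8],
    "group":   ["A"] * 8 + ["B"] * 8,
    "time":    ["day0", "endpoint"] * 8,
    "value":   [5.1, 6.3, 4.8, 6.0, 5.3, 6.6, 4.9, 6.1,
                5.0, 7.9, 5.2, 8.1, 5.1, 7.5, 4.7, 7.8],
})

# mixed ANOVA: time is the within-subject factor, group is between-subject
aov = pg.mixed_anova(data=df, dv="value", within="time",
                     subject="subject", between="group")
print(aov)

# Bonferroni-corrected post-hoc comparisons
post = pg.pairwise_tests(data=df, dv="value", within="time",
                         subject="subject", between="group",
                         padjust="bonf")
print(post)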
