featuretools timeseries data group by month year - featuretools

I have timeseries data with application number, loan amount. How do I get group by count of applications and average loan amount using featuretools package without adding a relationship of month year back to main entity?
I have already accomplished this using pandas but I am trying to explore featuretools package and was wondering if it has group by like functionality.
Below the example of pandas version: I want to replicate it using featuretools.
#Creating a copy of the existing data frame
new_df=df[:]
#Creating values
new_df['year'] = new_df['DATE'].dt.year
new_df['month'] = new_df['DATE'].dt.month
#Sorting Values
new_df=new_df.drop_duplicates().sort_values(by=['var_1','var2','year','month'])
#Counting Distinct variable across 4 variables then taking cummulative sum across 2 variables and storing it in a new data frame
new_df_count_cummulative=new_df.groupby(['var_1','var_2','year','month']).var_3.nunique().groupby(['var_1','var_2']).cumsum()

Related

PowerBI DAX: logic to use aggregated table as parameter in functions or another workaround to calculate dataset KPI filtered by any field?

In PowerBI, I need to create a Performance Indicator (KPI) measure which evaluates dataset values in a scale from 0 to 1, with target (1) being the MAX value in a 20 years history. It's a national airport trip records open database. The formula is basically [value]/[max value].
My dataset has a lot of fields and I wish I could filter it by any of these fields, with a line chart showing the 0-1 indicator for each month based on the filters.
This is my workaround test solution:
Table 1 - Original dataset: if I filter something here, below tables also update (there are more fields to the left, including YEAR and MONTH
Table 2 - Reference to original dataset, aggregating YEAR-MONTH by the sum of "take-offs" (decolagens)
Table 3 - Reference to above (sum) table, aggregating MONTH by the max of "take-offs" (decolagens)
Table 4 - 'Sum table' merged to 'Max table' by MONTH as new table: then do [Value]/[Max] and we've got the indicator
So if i filter the original dataset by any fields, all other tables update accordingly and the indicators always stays between 0-1, works like a charm.
TL;DR
The problem is: I need to create a dashboard of this on Power Bi. So I need this calculation to be in a measure or another workaround.
My possible solution: by pure DAX code in the measure field, to produce Tables 2 and 3 so I'll divide the month sum values by their month max value (which will both be produced according to PowerBi dashboard slicers) and get the indicator dinamically produced.
I'm stuck at: I don't understand how can I reference a sum/max aggregate table in dax code. Something like = SUM (dataset[take-offs]) / MAX (SUM (dataset[take-offs])). Of course these functions do not work like that, but I hope I made my point clear: how can I produce this four table effect with a single measure?
Other solutions are welcome.
Link to the original dataset: https://www.anac.gov.br/assuntos/dados-e-estatisticas/dados-estatisticos/arquivos/DadosEstatsticos.csv
It's an open dataset, so I guess there's no problem sharing it. Please help! :)
EDIT: please download the dataset and try to solve this. Personally I think it's a quality statistics doubt that will eventually help others. The calculation works, it only needs a Power Bi Measure port.
Add the ALL formula:
Measure = SUMX(ALL('Table'),[Valor])/SUM('Table'[Max])
Example

Select amount of past data when calculating features

I'm wondering if there is a way to automatically select the amount of past data when calculating features.
For example, I might want to predict when a customer is going to make their next purchase, so it would be good to know a count of purchases or average purchase price by different date cutoffs. e.g. Purchases in the last 12 months, last 3 months, 7 days etc.
What is the best way to approach this with featuretools?
You can create a feature matrix thats uses only a certain amount of historical data using the training window parameter in featuretools.dfs. When training window is set, Featuretools will use the historical data between the cutoff time and cutoff_time - training_window. Here's the example from the documentation:
window_fm, window_features = ft.dfs(entityset=es,
target_entity="customers",
cutoff_time=cutoff_times,
cutoff_time_in_index=True,
training_window="1 hour")
When determining which data is valid for use, the training window will check if the time in the time_index column is within the training window.

Converting Hourly Data to Daily Data for many different Excel files

I have been downloading and organizing hourly water quality data into Excel for many different states, and have organized them by year. I have done data prep for them to make sure there are no zeros/every day of the year (DOY) has 24 values, but the time series plots were too noisy so want me to get the daily average values instead.
All of the sites annual data is different in terms of how many days are available, and sometimes they are missing whole months due to no recordings.
So my question is, how can I develop a code to give me the average daily value linked to a specific DOY that I can apply to many different Excel sheets. The data appears like this:
And the files are saved like like CA1_2012 (California Site 1 hourly data from 2012)
I know there is a lot on this topic but I have been trying everything and I can't get a code that works!
You can get the summation of the second column based on values in the first column in matlab using accumarray;
[m,~,n] = unique(data(:,1));
sumdata = [m, accumarray(n,data(:,2))];
for mean I would suggest grpstats:
avgdata = grpstats(data, DOY, {'mean'});
or as #gnovice suggested:
avgdata = accumarray(DOY, data, [], #mean);
You can also get what you want by using Pivot Table in excel and group data by DOY and get the mean value for them in the table. (No coding required).

Cognos - Showing every month on x-axis when some months don't have values

Let me first say I am very new to Cognos and have mainly learned by just manipulating items within active reports. I am having an issue with creating a graph that acts like a time series. I want it to display every month (with multiple values in some months and none in others). I want to visually see gaps between data points (ex: we order products every 3 months starting in January, so we should see gaps in the months we do not order products - like February and March).
I have tried changing the label control to manual and setting display frequency to 1. However, I think my issue is that there is not any data within certain months.
You are correct in that your problem is lack of data. A standard inner join will drop rows where there is not a corresponding row in both tables, resulting in gaps.
There are two solutions available:
Use a union to create "dummy" records for each date
Manually specify an outer join between the date table and the table containing the rest of information
Since the first technique is the most common, I'll outline the basic steps for it here.
Create a new query
Add your month data item to the query
Create a 'dummy' data item for your measure. Use 0 for its expression.
If there is a date range filter in the main query apply it here
Create a union
Drag over your new query into the union
Drag over your original query into the union
Pull in the date and measure data items into the union query
Set the Aggregate Function property of the measure to Total
Use the union query as the source for your chart
For every month with measure data you will have two rows, one with the measure amount and one with 0. The two rows will be combined by the auto-group and summarize function. The measures will be added together. Anything added to 0 will end up as the original amount.
For months with no measure data, there will only be the 'dummy' row with 0 for the measure and it will be represented in your chart.

How automatically to identify series when creating graph

Right now I have three columns of data that I would need on a graph. It's about the score of different countries on non-related evaluations. So there's a column for the year of the evaluation, the name of the country and the score it got.
Since there are hundreds of them, it would take a lot of time to add data series individually, so I was wondering if isn't there a way to just select the columns and Excel could identify each series automatically.
Illustrating:
Supposing I had this table:
And wanted to create a graph like this:
Is there a way to do this easily?
Plot a PivotChart Line type: Years for ROWS, Country for COLUMNS and Sum of Score for VALUES.

Resources