Are there built-in primitives for interactions in Featuretools? - featuretools

Are there built-in primitives that compute absolute and relative differences between two numeric columns? Between two date columns?

This can currently be done for numeric columns, but not datetimes.
With interaction terms, we typically recommend that you manually define the specific features you want. For example, here is how to define the difference and absolute difference between two numeric features:
import featuretools as ft
# load a small sample of the demo retail entity set
es = ft.demo.load_retail(nrows=1000)
# wrap the two numeric variables as features
total = ft.Feature(es["order_products"]["total"])
unit_price = ft.Feature(es["order_products"]["unit_price"])
# interaction terms: difference and absolute difference
difference = unit_price - total
absolute_diff = abs(difference)
fm = ft.calculate_feature_matrix(features=[difference, absolute_diff], entityset=es)
fm.head()
This returns:
                  unit_price - total  ABSOLUTE(unit_price - total)
order_product_id
0                           -21.0375                       21.0375
1                           -27.9675                       27.9675
2                           -31.7625                       31.7625
3                           -27.9675                       27.9675
4                           -27.9675                       27.9675
We could also pass those features to ft.dfs as seed features if we wanted other primitives to stack on top of them; a sketch of that follows.
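For example, here is a minimal sketch of using them as seed features, assuming the same entity set as above (the target_entity value and the "absolute" transform primitive are illustrative choices, and the exact dfs signature depends on your featuretools version):
# reuse the interaction features as seed features so DFS can stack primitives on them
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="order_products",            # illustrative; newer releases call this target_dataframe_name
    seed_features=[difference, absolute_diff],
    trans_primitives=["absolute"],             # illustrative primitive to stack on the seeds
)
feature_matrix.head()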

Related

Annual count index from GAM looking at long-term trends by site

I'm interested in estimating a shared, global trend over time for counts monitored at several different sites, using generalized additive models (GAMs). I've read this great introduction to hierarchical GAMs (HGAMs) by Pedersen et al. (2019), and I believe I can set up the model as follows (the Pedersen et al. (2019) GS model):
fit_model = gam(count ~ s(year, m = 2) + s(year, site, bs = 'fs', m = 2),
                data = count_df,
                family = nb(link = 'log'),
                method = 'REML')
I can plot the partial effect smooths, look at the fit diagnostics, and everything looks reasonable. My question is: how do I extract a non-centered annual relative count index? My first thought would be to add the estimated intercept (the average count across sites at the beginning of the time series) to the s(year) smooth (the shared global smooth). But I'm not sure whether the uncertainty around that smooth already incorporates the uncertainty in the estimated intercept, or whether I need to add that in. All of this was possible thanks to the amazing R libraries mgcv, gratia, and dplyr.
Your way doesn't include the uncertainty in the constant term; it just shifts everything around.
If you want to do this, it would be easier to use the constant argument to gratia:::draw.gam():
draw(fit_model, select = "s(year)", constant = coef(fit_model)[1L])
which does what your code does, without as much effort (on your part).
A better way (with {gratia}, seeing as you are using it already) would be to create a data frame containing a sequence of values over the range of year and then use gratia::fitted_values() to generate estimates from the model for those values of year. To get what you want (which seems to be to exclude the random smooth component of the fit, i.e. setting the random component to 0 on the link scale), you need to pass that smooth to the exclude argument:
## data to predict at
new_year <- with(count_df,
                 tibble(year = gratia::seq_min_max(year, n = 100),
                        site = factor(levels(site)[1], levels = levels(site))))
## predict
fv <- fitted_values(fit_model, data = new_year, exclude = "s(year,site)")
If you want to read more about exclude, see ?predict.gam.

R package rstatix - ANOVA: What is my error?

I wanted to perform ANOVA on my dataset using rstatix package.
This is the command I used
anova_test(data = light3, dv = gene_copies, wid = ID, within = treatment)
And this is the error it gives me:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
My data consists of 4 different groups (treatment, factor class). Per group there are 3x3 values (gene_copies, numeric class). Each value has an individual ID and an assigned timepoint (3 values per timepoint; timepoint, factor class) in a separate column. There are no NAs in the table, and every group + timepoint combination has 3 values, so everything is balanced.
I adapted the command from this script:
https://www.datanovia.com/en/lessons/repeated-measures-anova-in-r/
My dataset has the exact same structure.
Please help

Get feature names for dataframe.corr

I am using the cancer data set from sklearn and I need to find the correlations between features. I am able to find the correlated columns, but I am not able to present them in a "nice" way, so that they can be used as input to DataFrame.drop.
Here is my code:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_data = load_breast_cancer()
df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
corr = df.corr()
# keep only the strict upper triangle, then filter to correlations above 0.6
corr_triu = corr.where(~np.tril(np.ones(corr.shape)).astype(bool))
corr_triu = corr_triu.stack()
corr_result = corr_triu[corr_triu > 0.6]
print(corr_result)
df.drop(columns=[?])
IIUC, you want the columns that correlate with some other column in the dataset, i.e. drop the columns that don't appear in corr_result. So you'll want to get the unique variables from each level of the index of corr_result. There may be repeats, so take care of that as well, for example with sets:
# drop index levels that no longer appear after filtering
corr_result.index = corr_result.index.remove_unused_levels()
# collect the unique variable names from both levels of the MultiIndex
corr_vars = set()
corr_vars.update(corr_result.index.unique(level=0))
corr_vars.update(corr_result.index.unique(level=1))
# drop every column that is not involved in a high correlation
all_vars = set(df.columns)
df.drop(columns=all_vars - corr_vars)
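Equivalently, a minimal sketch relying only on standard pandas MultiIndex methods (no new names beyond those above) builds the same set in one step:
# union of the two index levels gives every variable involved in a strong pair
corr_vars = set(corr_result.index.get_level_values(0)) | set(corr_result.index.get_level_values(1))
df.drop(columns=set(df.columns) - corr_vars)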

Finding the top three relevant categories and their corresponding probabilities

With the script below, I find the highest probability and its corresponding category in a multi-class text classification problem. How do I find the top 3 predicted probabilities and their corresponding categories in an efficient way, without using loops?
probabilities = classifier.predict_proba(X_test)
max_probabilities = probabilities.max(axis=1)
order=np.argsort(probabilities, axis=1)
classification=(classifier.classes_[order[:, -1:]])
print(accuracy_score(classification,y_test))
Thanks in advance.
(I have around 50 categories; I want to extract the 3 most relevant categories among the 50 for each of my narrations and display them in a dataframe.)
You've done most of the hard work here; you're just missing a bit of numpy foo to finish it off. Your line
order = np.argsort(probabilities, axis=1)
contains the indices of the sorted probabilities, i.e. [[lowest_prob_class_1, ..., highest_prob_class_1], ...] for each of your samples, which you have used to get your classification with order[:, -1:], the index of the highest-probability class. So to get the top three classes we can make a simple change:
top_3_classes = classifier.classes_[order[:, -3:]]
Then to get the corresponding probabilities we can use
top_3_probabilities = probabilities[np.repeat(np.arange(order.shape[0]), 3),
                                    order[:, -3:].flatten()].reshape(order.shape[0], 3)
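To then display everything in a dataframe, as asked, one option is a small sketch along these lines (the results column names are made up for illustration; np.take_along_axis returns the same probabilities as the fancy-indexing line above, and the reversed slice puts the best class in the first column):
import numpy as np
import pandas as pd

# take the last three indices in reverse, so column 0 is the most likely class
top_3_idx = order[:, -1:-4:-1]
top_3_classes = classifier.classes_[top_3_idx]
top_3_probabilities = np.take_along_axis(probabilities, top_3_idx, axis=1)

# illustrative column names; adjust to taste
results = pd.DataFrame({
    "pred_1": top_3_classes[:, 0], "prob_1": top_3_probabilities[:, 0],
    "pred_2": top_3_classes[:, 1], "prob_2": top_3_probabilities[:, 1],
    "pred_3": top_3_classes[:, 2], "prob_3": top_3_probabilities[:, 2],
})
print(results.head())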

Spark - Optimize calculation time over a data frame, by using groupBy() instead of filter()

I have a data frame which contains different columns ('features').
My goal is to calculate statistical measures for column X:
mean, standard deviation, and variance,
but to calculate all of those with a dependency on column Y.
E.g. get all rows where Y = 1 and calculate the mean, stddev, and var for them,
then do the same for all rows where Y = 2.
My current implementation is:
print "For CONGESTION_FLAG = 0:"
log_df.filter(log_df[flag_col] == 0).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 1:"
log_df.filter(log_df[flag_col] == 1).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
print "For CONGESTION_FLAG = 2:"
log_df.filter(log_df[flag_col] == 2).select([mean(size_col), stddev(size_col),
pow(stddev(size_col), 2)]).show(20, False)
I was told that the filter() approach is wasteful in terms of computation time, and received advice that, to make these calculations run faster (I'm running this on a 1 GB data file), it would be better to use the groupBy() method.
Can someone please help me transform those lines to do the same calculations using groupBy() instead?
I got mixed up with the syntax and didn't manage to do it correctly.
Thanks.
Filter by itself is not wasteful. The problem is that you are calling it multiple times (once for each value), meaning you are scanning the data 3 times. The operation you are describing is best achieved with groupBy, which basically aggregates data per value of the grouped column.
You could do something like this:
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"), pow(stddev(size_col),2).alias("pow"))
You might also get better performance by calculating stddev^2 after the aggregation (you should try it on your data):
agg_df = log_df.groupBy(flag_col).agg(mean(size_col).alias("mean"), stddev(size_col).alias("stddev"))
agg_df2 = agg_df.withColumn("pow", agg_df["stddev"] * agg_df["stddev"])
You can:
log_df.groupBy(log_df[flag_col]).agg(
    mean(size_col), stddev(size_col), pow(stddev(size_col), 2)
)
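For completeness, here is a self-contained sketch of the groupBy() approach, assuming made-up column names (congestion_flag, size) and a tiny stand-in DataFrame; pyspark.sql.functions.variance computes the sample variance directly, so squaring stddev by hand is optional:
from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, stddev, variance

spark = SparkSession.builder.getOrCreate()

# tiny illustrative frame standing in for log_df; column names are made up
log_df = spark.createDataFrame(
    [(0, 10.0), (0, 12.0), (1, 30.0), (1, 35.0), (2, 50.0)],
    ["congestion_flag", "size"],
)

# one pass over the data: every flag value is aggregated in the same job
agg_df = (log_df
          .groupBy("congestion_flag")
          .agg(mean("size").alias("mean"),
               stddev("size").alias("stddev"),        # sample standard deviation
               variance("size").alias("variance")))   # sample variance, no need to square stddev
agg_df.show(20, False)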
