Using featuretools, how to use dfs primitives individually?

Could you help me? I used featuretools on the iris dataset, which has 4 features: f1, f2, f3, f4. When I use ft.dfs, I have three questions.
1. I found that feature_matrix has too many features. The 'divide_by_feature' and 'modulo_numeric' primitives did not act on the original features individually: 'divide_by_feature' was applied first, producing 4 new features, and then 'modulo_numeric' was applied to both the original and the new features. I want the two primitives to act only on the original features. How can I do that?
2. I use transform primitives like trans_primitives = ['subtract_numeric_scalar', 'modulo_numeric']. I found that subtract_numeric_scalar can take a value, but I don't know how to pass it.
3. How can I use all transform primitives? The default is trans_primitives=None. For now I list them manually, like trans_primitives = ['is_null', 'diff', ...], but that is cumbersome.
Could you give me some advice? Thank you!

You can use max_depth to control the complexity of the features. When max_depth=1, the primitives are applied only to the original features.
features = ft.dfs(
    entityset=es,
    target_entity='data',
    trans_primitives=['divide_by_feature', 'modulo_numeric'],
    features_only=True,
    max_depth=1,
)
[<Feature: f1>,
<Feature: f2>,
<Feature: f3>,
<Feature: f4>,
<Feature: 1 / f3>,
<Feature: 1 / f1>,
<Feature: 1 / f2>,
<Feature: 1 / f4>,
<Feature: f1 % f2>,
<Feature: f4 % f3>,
<Feature: f4 % f2>,
<Feature: f1 % f3>,
<Feature: f2 % f4>,
<Feature: f4 % f1>,
<Feature: f3 % f2>,
<Feature: f3 % f1>,
<Feature: f2 % f1>,
<Feature: f3 % f4>,
<Feature: f2 % f3>,
<Feature: f1 % f4>]
You can create an instance of a primitive with parameters. This is how you pass a value to subtract_numeric_scalar.
from featuretools.primitives import SubtractNumericScalar

ft.dfs(
    ...
    trans_primitives=[SubtractNumericScalar(value=2)],
)
You can use all transform primitives by extracting the names from the primitive list.
primitives = ft.list_primitives()
primitives = primitives.groupby('type')
transforms = primitives.get_group('transform')
transforms = transforms.name.values.tolist()
['less_than_scalar',
'divide_numeric',
'latitude',
'add_numeric',
'week',
'greater_than_equal_to_scalar',
'and',
'multiply_numeric_scalar',
'not',
'second',
'greater_than_scalar',
'modulo_numeric_scalar',
'scalar_subtract_numeric_feature',
'diff',
'day',
'cum_min',
'divide_by_feature',
'less_than_equal_to',
'time_since',
'time_since_previous',
'cum_count',
'year',
'is_null',
'num_characters',
'equal_scalar',
'is_weekend',
'less_than_equal_to_scalar',
'longitude',
'add_numeric_scalar',
'month',
'less_than',
'or',
'multiply_boolean',
'percentile',
'minute',
'not_equal_scalar',
'greater_than_equal_to',
'modulo_by_feature',
'multiply_numeric',
'negate',
'hour',
'cum_max',
'greater_than',
'modulo_numeric',
'subtract_numeric_scalar',
'isin',
'cum_mean',
'divide_numeric_scalar',
'num_words',
'absolute',
'cum_sum',
'not_equal',
'weekday',
'equal',
'haversine',
'subtract_numeric']
Let me know if this helps.

Related

In ANCOVA, the square root of MSE does not match the pooled adjusted SD across groups. Why?

In ANOVA, the square root of MSE is equal to the pooled SD across groups (or, equally, SE of marginal means * sqrt(n)).
But, in ANCOVA, the pooled adjusted SD (SE of the adjusted marginal mean * sqrt(n)) is not equal to sqrt(MSE), although it is very close. What is the difference?
The adjustment, I assume, is applied to both statistics in the same way, so why is there a difference?
(The issue became a practical one when we calculated the SMD from ANCOVA.)
Here is a reproducible example:
library("effects") #for calculating the adjusted statistics
#data
dat <- data.frame(
gp = factor(c(1,1,1,1,2,2,2,2)),
pre = c(6,3,1,3,3,7,2,3),
post = c(7,6,2,4,6,10,5,4)
)
#ANOVA
reg.res.no.cov <- lm(post ~ gp, dat)
anova.res <- anova(reg.res.no.cov)
adj.res.anova <- effect("gp", reg.res.no.cov, se = T)
sqrt(anova.res$`Mean Sq`[2]) #pooled SD from MSE
[1] 2.43242
adj.res.anova$se * sqrt(4) #pooled SD from SE * sqrt(n)
[1] 2.43242 2.43242
#ANCOVA
reg.res.with.cov <- lm(post ~ pre + gp, dat)
ancova.res <- anova(reg.res.with.cov)
adj.res.ancova <- effect("gp", reg.res.with.cov, se = T)
sqrt(ancova.res$`Mean Sq`[3]) #pooled SD from MSE
[1] 1.092121
adj.res.ancova$se * sqrt(4) #pooled SD from SE * sqrt(n)
[1] 1.097073 1.097073

Conditional Probability for fake reviews

I am working on a conditional probability question.
A = probability of being legit review
B = probability of guessing correctly
P(A) = 0.98 → P(A’) = 0.02
P(B|A’) = 0.95
P(B|A) = 0.90
The question should be this: P(A’|B) =?
P(A’|B) = P(B|A’) · P(A’) / P(B)
P(B) = P(B and A’) + P(B and A)
     = P(B|A’) · P(A’) + P(B|A) · P(A)
     = 0.95 × 0.02 + 0.90 × 0.98 = 0.901
P(A’|B) = 0.95 × 0.02 / 0.901 = 0.021
However, my result is not among the answer choices. Can you tell me if I am missing anything, or if my logic is incorrect?
Example with numbers
This example with numbers is meant as an intuitive way to understand how Bayes' formula works:
Let's assume we have 10,000 typical reviews. We calculate what we would expect to happen with these 10,000 reviews:
9,800 are real
200 are fake
To predict how many reviews are classified as fake:
Of the 9,800 real ones, 10% are classified as fake → 9800 * 0.10 = 980
Of the 200 fake ones, 95% are classified as fake → 200 * 0.95 = 190
980 + 190 = 1,170 are classified as fake.
Now we have all the pieces we need to calculate the probability that a review is fake, given that it is classified as such:
All reviews that are classified as fake → 1,170
Of those, actually fake → 190
190 / 1170 = 0.1623, or 16.23%
Using general Bayes' theorem
Let's set up the events. Note that my version of event B is slightly different from yours.
P(A): Real review
P(A'): Fake review
P(B): Predicted real
P(B'): Predicted fake
P(A'|B'): Probability that a review is actually fake, given that it is predicted to be fake
Now that we have our events defined, we can go ahead with Bayes:
P(A'|B') = P(A' and B') / P(B') # Bayes' formula
= P(A' and B') / (P(A and B') + P(A' and B')) # Law of total probability
We also know the following, by the multiplication rule for conditional probabilities:
P(A and B') = P(A) * P(B'|A )
= 0.98 * 0.10
= 0.098
P(A' and B') = P(A') * P(B'|A')
= 0.02 * 0.95
= 0.019
Putting the pieces together yields:
P(A'|B') = 0.019 / (0.098 + 0.019) = 0.1623
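The counting argument and the Bayes calculation can be checked directly with a short Python sketch (the variable names are my own; the probabilities are the ones given in the question):

```python
# Probabilities from the question
p_real = 0.98                  # P(A): review is real
p_fake = 0.02                  # P(A'): review is fake
p_pred_fake_given_real = 0.10  # P(B'|A): a real review is classified as fake
p_pred_fake_given_fake = 0.95  # P(B'|A'): a fake review is classified as fake

# Law of total probability: P(B') = P(A)·P(B'|A) + P(A')·P(B'|A')
p_pred_fake = p_real * p_pred_fake_given_real + p_fake * p_pred_fake_given_fake

# Bayes' rule: P(A'|B') = P(A')·P(B'|A') / P(B')
p_fake_given_pred_fake = p_fake * p_pred_fake_given_fake / p_pred_fake

print(round(p_pred_fake, 3))             # 0.117
print(round(p_fake_given_pred_fake, 4))  # 0.1624
```

This matches the 190 / 1170 from the counting version above, up to rounding.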

Calculating p-values with pnorm ( ). What makes p-values differ if data is transformed?

I am comparing two alternatives for calculating p-values with R's pnorm() function.
xbar <- 2.1
mu <- 2
sigma <- 0.25
n = 35
# z-transformation
z <- (xbar - mu) / (sigma / sqrt(n))
# Alternative I using transformed values
pval1 <- pnorm(q = z)
# Alternative II using untransformed values
pval2 <- pnorm(q = xbar, mean = mu, sd = sigma)
How come the two calculated p-values are not the same? Shouldn't they be?
They are different because you use two different standard deviations.
In the z-transformation calculation you use the standard error sigma / sqrt(n) as the standard deviation, but in the untransformed calculation you pass sd = sigma, ignoring n. To make them match, use pnorm(q = xbar, mean = mu, sd = sigma / sqrt(n)).
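For what it's worth, the same point can be shown in a short Python sketch that mimics R's pnorm() via math.erf (the helper function name is mine): the two values agree once the standard error sigma / sqrt(n) is used in both calls.

```python
from math import sqrt, erf

def pnorm(q, mean=0.0, sd=1.0):
    """Normal CDF, equivalent to R's pnorm()."""
    return 0.5 * (1.0 + erf((q - mean) / (sd * sqrt(2.0))))

xbar, mu, sigma, n = 2.1, 2.0, 0.25, 35

z = (xbar - mu) / (sigma / sqrt(n))
pval1 = pnorm(z)
# Matches only when the standard error, not sigma, is used:
pval2 = pnorm(xbar, mean=mu, sd=sigma / sqrt(n))
print(abs(pval1 - pval2) < 1e-12)  # True
```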

Spectrogram of two audio files (Added together)

Assume for a moment I have two input signals f1 and f2. I could add these signals to produce a third signal f3 = f1 + f2. I would then compute the spectrogram of f3 as log(|stft(f3)|^2).
Unfortunately I don't have the original signals f1 and f2. I have, however, their spectrograms A = log(|stft(f1)|^2) and B = log(|stft(f2)|^2). What I'm looking for is a way to approximate log(|stft(f3)|^2) as closely as possible using A and B. If we do some math we can derive:
log(|stft(f1 + f2)|^2) = log(|stft(f1) + stft(f2)|^2)
express stft(f1) = x1 + i * y1 & stft(f2) = x2 + i * y2 to write
... = log(|x1 + i * y1 + x2 + i * y2|^2)
... = log((x1 + x2)^2 + (y1 + y2)^2)
... = log(x1^2 + x2^2 + y1^2 + y2^2 + 2 * (x1 * x2 + y1 * y2))
... = log(|stft(f1)|^2 + |stft(f2)|^2 + 2 * (x1 * x2 + y1 * y2))
So at this point I could use the approximation:
log(|stft(f3)|^2) ~ log(exp(A) + exp(B))
but that would ignore the cross term 2 * (x1 * x2 + y1 * y2). So my question is: is there a better approximation for this?
Any ideas? Thanks.
I'm not 100% sure I understand your notation, but I'll give it a shot. Addition in the time domain corresponds to addition in the frequency domain. Adding two time-domain signals x1 and x2 produces a third time-domain signal x3, and x1, x2 and x3 all have frequency-domain spectra F(x1), F(x2) and F(x3). F(x3) equals F(x1) + F(x2), where the addition is performed by adding the real parts of F(x1) to the real parts of F(x2) and the imaginary parts of F(x1) to the imaginary parts of F(x2). So if x1[0] is 1+0j and x2[0] is 0.5+0.5j, then the sum is 1.5+0.5j, with magnitude sqrt(1.5^2 + 0.5^2) = sqrt(2.5).
Judging from your notation, you are trying to add the magnitudes, which in this example would give |1+0j| + |0.5+0.5j| = sqrt(1) + sqrt(0.5^2 + 0.5^2) = 1 + sqrt(0.5). Obviously not the same thing. I think you want something like this:
log(|stft(a) + stft(b)|^2) ≈ log(|stft(a)|^2 + |stft(b)|^2)
That is: take the exp() of the 2 log magnitudes, add them, then take the log of the sum.
Stepping back from the math for a minute, we can see that at a fundamental level, this just isn't possible.
Consider a 1st signal f1 that is a pure tone at frequency F and amplitude A.
Consider a 2nd signal f2 that is a pure tone at frequency F and amplitude A, but perfectly out of phase with f1.
In this case, the spectrograms of f1 & f2 are identical.
Now consider two possible combined signals.
f1 added to itself is a pure tone at frequency F and amplitude 2A.
f1 added to f2 is complete silence.
From the spectrograms of f1 and f2 alone (which are identical), you've no way to know which of these very different situations you're in. And this doesn't just hold for pure tones. Any signal and its reflection about the axis suffer the same problem. Generalizing even further, there's just no way to know how much your underlying signals cancel and how much they reinforce each other. That said, there are limits. If, for a particular frequency, your underlying signals had amplitudes A1 and A2, the biggest possible amplitude is A1+A2 and the smallest possible is abs(A1-A2).
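The phase-cancellation argument can be demonstrated numerically; here is a small Python sketch using single, made-up complex STFT bin values:

```python
import cmath

# One STFT bin of a pure tone, and the same tone perfectly out of phase
z1 = 1 + 0j
z2 = -1 + 0j  # same magnitude, opposite phase

# The magnitude spectrograms of the two bins are identical ...
assert abs(z1) == abs(z2) == 1.0

# ... yet the combined magnitude depends entirely on the relative phase:
print(abs(z1 + z1))  # 2.0 (in phase: amplitudes add)
print(abs(z1 + z2))  # 0.0 (out of phase: complete silence)

# In general the combined amplitude lies between |A1 - A2| and A1 + A2:
z2_rotated = abs(z2) * cmath.exp(0.7j)  # some arbitrary relative phase
combined = abs(z1 + z2_rotated)
assert abs(abs(z1) - abs(z2_rotated)) <= combined <= abs(z1) + abs(z2_rotated)
```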

Why is music21 using pitch attributes in an unexpected way?

Consider the following testing code.
from music21 import pitch

C0 = 16.35
for f in [261, 130, 653, 64, 865]:
    p = pitch.Pitch()
    p.frequency = f
    # Compare manual frequency with music21 frequency
    f1 = p.frequency
    f2 = C0 * pow(2, p.octave) * pow(2, p.pitchClass / 12) * pow(2, p.microtone.cents / 1200)
    print(f, f1, f2)
    # Compare manual pitch space with music21 pitch space
    ps1 = p.ps
    ps2 = 12 * (p.octave + 1) + p.pitchClass + p.microtone.cents / 100
    print(ps1, ps2)
    print()
The output of this is
261 260.99999402174154 521.9489797003519
59.958555 71.95855499999999
130 129.99999854289362 259.974590631057
47.892097 59.892097
653 653.0000144741496 652.9362051837928
75.834954 75.834954
64 63.999998381902046 65.86890433005668
35.623683 36.123683
865 864.9999846113213 890.2594167561009
80.702359 81.202359
There is often a difference between my manually computed frequency or pitch space and the music21 value.
Note that sometimes this difference can be about an octave (like the first two C note frequencies), but mostly it is about one tone. Another weird thing is that for the third testing frequency the pitchspace values are the same while the frequencies are not.
What could be wrong about my manual formulas?
So it appears that while the deviation of an octave was a bug, the other deviations are intended behaviour. See https://github.com/cuthbertLab/music21/issues/96 for a detailed explanation.
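For reference, the ps values music21 prints in the output above match the standard MIDI note-number formula anchored at A4 = 440 Hz; a quick sketch in plain Python (the function name is mine) reproduces them:

```python
from math import log2

def freq_to_ps(freq, a4=440.0):
    """Standard MIDI pitch-space number: A4 (440 Hz) = 69."""
    return 69 + 12 * log2(freq / a4)

print(round(freq_to_ps(261), 4))  # 59.9586, matching music21's p.ps for 261 Hz
print(round(freq_to_ps(653), 4))  # 75.835
```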
