I would like to investigate the effects of two independent variables on a dependent variable. Suppose we have independent variables X1 and X2, and a dependent variable Y.
I used two different approaches. In the first approach, to eliminate the effect of X1 on Y, I generate the conditional distribution of Y|X1 and perform a regression on the second variable, X2. When I check the correlations between X2 and Y|X1, I obtain relatively high correlations (R^2 > 0.50). However, when I perform a multiple regression over a wide range of data (X1 and X2), the effect of X2 on Y decreases and becomes insignificant. Why do these two approaches give conflicting results? What is the most appropriate approach to determine the effect of X2 on Y for a given X1 value? Thanks.
It would help to see your code, or the above in mathematical notation.
For instance: did you include the constant terms?
What do you see when you fit:
Y = B0 + B1X1 + B2X2
That will be the easiest to check, and B2 will probably give you what you want.
That model is still simple; you could explore something like:
Y = B0 + B1X1 + B2X2 + B3X1X2
or
Y = B0 + B1X1 + B2X2 + B3X1X2 + B4X1^2 + B5X2^2
And see if there are changes in the coefficients and if there are new significant coefficients.
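If it helps, here is a minimal sketch of fitting those three models with Python's statsmodels formula interface (assuming your data sit in a pandas DataFrame df with columns Y, X1 and X2; the names are just placeholders):

import statsmodels.formula.api as smf

# Y ~ B0 + B1*X1 + B2*X2
m1 = smf.ols("Y ~ X1 + X2", data=df).fit()
# adds the B3*X1*X2 interaction
m2 = smf.ols("Y ~ X1 * X2", data=df).fit()
# adds the quadratic terms B4*X1^2 and B5*X2^2
m3 = smf.ols("Y ~ X1 * X2 + I(X1**2) + I(X2**2)", data=df).fit()

print(m2.summary())  # inspect coefficients and p-values; repeat for m1 and m3

Comparing the three summaries shows directly whether the coefficient on X2 (and on the added terms) changes or gains/loses significance.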
You could go further and explore Structural Equation Models.
In various papers I have seen regressions of the sort Y = f(x1, x2), where f() is usually a simple OLS and, importantly, Y = x1 + x2 + x3. In other words, the regressors are exactly a part of Y.
These papers use the regressors as a way to describe the data rather than to isolate a causal effect of X on Y. I was wondering what the implications of this strategy are. To begin with, do the numbers / significance tests make any sense at all? Thanks.
I understand that the approach mechanically fails, for obvious reasons (perfect collinearity), if the regressors included in the analysis completely describe Y. However, I would like to better understand the implications of including only some of the x's.
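For concreteness, a toy version of the setup I mean (a sketch only; here Y is by construction x1 + x2 + x3, but only x1 and x2 enter the regression):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1, x2, x3 = rng.normal(size=(3, 500))
Y = x1 + x2 + x3                                 # regressors are exactly parts of Y

X = sm.add_constant(np.column_stack([x1, x2]))   # x3 is left out
print(sm.OLS(Y, X).fit().summary())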
I have the following two mixed models
Nested model (the three-way interaction is excluded; two terms are dropped: the three-way interactions with linear age and with quadratic age)
m_Kunkle_pacc3_n <- lmer(pacc3_old ~ PRS_Kunkle*AgeAtVisit +
                           PRS_Kunkle*I(AgeAtVisit^2) +
                           APOE_score*AgeAtVisit + APOE_score*I(AgeAtVisit^2) + PRS_Kunkle*APOE_score + famhist +
                           gender + EdYears_Coded_Max20 + VisNo + X1 + X2 + X3 + X4 + X5 +
                           (1 | family/DBID),
                         data = WRAP_all, REML = FALSE)
Full model (with the three-way interaction)
m_Kunkle_pacc3 <- lmer(pacc3_old ~ PRS_Kunkle*AgeAtVisit*APOE_score +
                         PRS_Kunkle*I(AgeAtVisit^2)*APOE_score +
                         gender + EdYears_Coded_Max20 + VisNo + famhist + X1 + X2 + X3 + X4 + X5 +
                         (1 | family/DBID),
                       data = WRAP_all, REML = FALSE)
I used a likelihood ratio test to compare the full model and the nested model. Am I correct that this tests the significance of the three-way interaction?
pacc3_LRT_Kunkle <- anova(m_Kunkle_pacc3, m_Kunkle_pacc3_n, test = "chisq")
Many thanks
If you are interested in testing the significance of the three-way interaction, I think in general you should do that within the context of a single model. You first select a model based on theoretical considerations and sometimes certain indices, and then look at the parameters of the model you pick. For example, the BIC is related to a model's negative log-likelihood penalized by its complexity (it also depends on your sample size), and you can use the BIC to select a model among competing choices. Once you pick a model that has a certain interaction term within it, you evaluate its coefficient. I should warn you that interpreting three-way interactions can be very challenging, so you should consider that in the context of your problem as well.
TL;DR: comparing a model that has a term with one that does not have it (whether you look at their R^2, compare likelihoods or penalized likelihoods, etc.) will tell you something about the whole model, not about the parameter itself.
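To make the distinction concrete, a generic sketch in Python (plain OLS via statsmodels, purely for illustration; it is not the lmer setup above): compare the two models as wholes, then read the interaction term off the model you keep.

import statsmodels.formula.api as smf
from scipy import stats

# hypothetical DataFrame df with outcome y and predictors a, b, c
full    = smf.ols("y ~ a * b * c", data=df).fit()
reduced = smf.ols("y ~ a * b * c - a:b:c", data=df).fit()

# whole-model comparison: likelihood ratio test and BIC
lr = 2 * (full.llf - reduced.llf)
p_lrt = stats.chi2.sf(lr, df=full.df_model - reduced.df_model)
print(reduced.bic, full.bic, p_lrt)

# parameter-level evidence: the coefficient and its test within the chosen model
print(full.params["a:b:c"], full.pvalues["a:b:c"])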
How can we calculate the correlation and covariance between two variables without using cov and corr in Python3?
At the end, I want to write a function that returns three values:
a boolean that is true if two variables are independent
covariance of two variables
correlation of two variables.
You can find the definition of correlation and covariance here:
https://medium.com/analytics-vidhya/covariance-and-correlation-math-and-python-code-7cbef556baed
I wrote this part for the covariance and correlation (x and y are NumPy arrays of equal length):

import numpy as np

def cov_and_corr(x, y):
    # population covariance: mean product of deviations from the means
    n = len(x)
    mean_x, mean_y = x.mean(), y.mean()
    cov = np.sum((x - mean_x) * (y - mean_y)) / n
    # Pearson correlation: covariance divided by the product of the standard deviations
    std_x = np.sqrt(np.sum((x - mean_x) ** 2) / n)
    std_y = np.sqrt(np.sum((y - mean_y) ** 2) / n)
    if std_x == 0 or std_y == 0:
        return cov, 0.0
    return cov, cov / (std_x * std_y)
For the covariance, just subtract the respective means, take the dot product of the two centered vectors, and divide by n (or n - 1 for the sample estimate). (Of course, make sure whether you want the sample or the population covariance -- if you have "enough" data the difference will be tiny, but you should still account for it if necessary.)
For the correlation, divide the covariance by the product of the two standard deviations.
As for whether or not two columns are independent, that's not quite as easy. For two independent random variables we have $\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right] = 0$, where $\mu_X, \mu_Y$ are the means of the two variables (note that the converse does not hold: zero covariance does not by itself imply independence). But when you have a data set, you are not dealing with the actual probability distributions; you are dealing with a sample. That means that the sample correlation will very likely not be exactly $0$, but rather a value close to $0$. Whether or not this is "close enough" will depend on your sample size and what other assumptions you're willing to make.
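Putting those pieces together, a minimal NumPy/SciPy sketch (the function name is just illustrative, and the "independent" flag only tests for zero linear correlation at level alpha, which is weaker than true independence):

import numpy as np
from scipy import stats

def cov_corr_indep(x, y, alpha=0.05):
    # returns (independent?, covariance, correlation) for two 1-D arrays
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    cov = np.mean((x - x.mean()) * (y - y.mean()))   # population covariance
    corr = cov / (x.std() * y.std())                 # Pearson correlation
    # t-test of H0: the true correlation is zero
    t = corr * np.sqrt((n - 2) / (1.0 - corr ** 2))
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return p > alpha, cov, corr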
Assume for a moment I have two input signals f1 and f2. I could add these signals to produce a third signal f3 = f1 + f2. I would then compute the spectrogram of f3 as log(|stft(f3)|^2).
Unfortunately I don't have the original signals f1 and f2. I have, however, their spectrograms A = log(|stft(f1)|^2) and B = log(|stft(f2)|^2). What I'm looking for is a way to approximate log(|stft(f3)|^2) as closely as possible using A and B. If we do some math we can derive:
log(|stft(f1 + f2)|^2) = log(|stft(f1) + stft(f2)|^2)
express stft(f1) = x1 + i * y1 & stft(f2) = x2 + i * y2 to write
... = log(|x1 + i * y1 + x2 + i * y2|^2)
... = log((x1 + x2)^2 + (y1 + y2)^2)
... = log(x1^2 + x2^2 + y1^2 + y2^2 + 2 * (x1 * x2 + y1 * y2))
... = log(|stft(f1)|^2 + |stft(f2)|^2 + 2 * (x1 * x2 + y1 * y2))
So at this point I could use the approximation:
log(|stft(f3)|^2) ~ log(exp(A) + exp(B))
but that ignores the last term, 2 * (x1 * x2 + y1 * y2). So my question is: is there a better approximation for this?
Any ideas? Thanks.
I'm not 100% understanding your notation but I'll give it a shot. Addition in the time domain corresponds to addition in the frequency domain. Adding two time domain signals x1 and x2 produces a 3rd time domain signal x3. x1, x2 and x3 all have a frequency domain spectrum, F(x1), F(x2) and F(x3). F(x3) is also equal to F(x1) + F(x2), where the addition is performed by adding the real parts of F(x1) to the real parts of F(x2) and adding the imaginary parts of F(x1) to the imaginary parts of F(x2). So if x1[0] is 1+0j and x2[0] is 0.5+0.5j, then the sum is 1.5+0.5j, with magnitude sqrt(1.5^2 + 0.5^2) = sqrt(2.5). Judging from your notation you are trying to add the magnitudes, which with this example would be |1+0j| + |0.5+0.5j| = sqrt(1*1) + sqrt(0.5*0.5 + 0.5*0.5) = 1 + sqrt(0.5). Obviously not the same thing. I think you want something like this:
log(|stft(a) + stft(b)|^2) ≈ log(|stft(a)|^2 + |stft(b)|^2) = log(exp(A) + exp(B))
Take the exp() of the 2 log magnitudes, add them, then take the log of the sum.
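A minimal NumPy sketch of that exp-add-log step, assuming A and B are the log-power spectrograms defined in the question (np.logaddexp computes log(exp(A) + exp(B)) in a numerically stable way):

import numpy as np

def approx_sum_spectrogram(A, B):
    # approximation of log(|stft(f1 + f2)|^2); it ignores the cross term
    # 2*(x1*x2 + y1*y2), i.e. it assumes the two signals have unrelated phase
    return np.logaddexp(A, B)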
Stepping back from the math for a minute, we can see that at a fundamental level, this just isn't possible.
Consider a 1st signal f1 that is a pure tone at frequency F and amplitude A.
Consider a 2nd signal f2 that is a pure tone at frequency F and amplitude A, but perfectly out of phase with f1.
In this case, the spectrograms of f1 & f2 are identical.
Now consider two possible combined signals.
f1 added to itself is a pure tone at frequency F and amplitude 2A.
f1 added to f2 is complete silence.
From the spectrograms of f1 and f2 alone (which are identical), you've no way to know which of these very different situations you're in. And this doesn't just hold for pure tones. Any signal and its reflection about the axis suffer the same problem. Generalizing even further, there's just no way to know how much your underlying signals cancel and how much they reinforce each other. That said, there are limits. If, for a particular frequency, your underlying signals had amplitudes A1 and A2, the biggest possible amplitude is A1+A2 and the smallest possible is abs(A1-A2).
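A tiny NumPy demonstration of that ambiguity (the tone frequency and sample rate are arbitrary): the two components have identical magnitude spectra, yet their sums are completely different.

import numpy as np

fs = 8000
t = np.arange(fs) / fs
f1 = np.sin(2 * np.pi * 440 * t)   # pure tone
f2 = -f1                           # same tone, perfectly out of phase

# identical magnitude spectra ...
print(np.allclose(np.abs(np.fft.rfft(f1)), np.abs(np.fft.rfft(f2))))  # True

# ... but very different sums
print(np.max(np.abs(f1 + f1)))   # ~2.0: amplitude doubles
print(np.max(np.abs(f1 + f2)))   # 0.0: complete silence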
I have a class implementing an audio stream that can be read at varying speed (including reverse and fast-varying / "scratching")... I use linear interpolation for the read part and everything works quite decently.
But now I want to implement writing to the stream at varying speed as well, and that requires me to implement a kind of "reverse interpolation", i.e. deduce the input sample vector Z that, interpolated with vector Y, will produce the output X (which I'm trying to write).
I've managed to do it for constant speeds, but generalising to varying speeds (e.g. accelerating or decelerating) is proving more complicated.
I imagine this problem has been solved repeatedly, but I can't seem to find many clues online, so my specific question is whether anyone has heard of this problem and can point me in the right direction (or, even better, show me a solution :)
Thanks!
I would not call it "reverse interpolation", as that does not exist (my first thought was that you were talking about extrapolation!). What you are doing is still simply interpolation, just at an uneven rate.
Interpolation: finding a value between known values
Extrapolation: finding a value beyond known values
Interpolating to/from constant rates is indeed much much simpler than the generic quest of "finding a value between known values". I propose 2 solutions.
1) Interpolate to a significantly higher rate, and then just sub-sample to the nearest one (try adding dithering)
2) Solve the generic problem: for each point you need to use the neighboring N points and fit an order N-1 polynomial to them.
N=2 would be linear and would add overtones (C0 continuity)
N=3 could leave you with step changes at the halfway point between your source samples (perhaps worse overtones than N=2!)
N=4 will get you C1 continuity (slope will match as you change to the next sample), surely enough for your application.
Let me explain that last one.
For each output sample use the 2 previous and 2 following input samples. Call them S0 to S3 on a unit time scale (multiply by your sample period later), and you are interpolating from time 0 to 1. Y is your output and Y' is the slope.
Y will be calculated from this polynomial and its differential (slope)
Y(t) = At^3 + Bt^2 + Ct + D
Y'(t) = 3At^2 + 2Bt + C
The constraints (the values and slope at the endpoints on either side)
Y(0) = S1
Y'(0) = (S2-S0)/2
Y(1) = S2
Y'(1) = (S3-S1)/2
Expanding the polynomial
Y(0) = D
Y'(0) = C
Y(1) = A+B+C+D
Y'(1) = 3A+2B+C
Plugging in the Samples
D = S1
C = (S2-S0)/2
A + B = S2 - C - D
3A+2B = (S3-S1)/2 - C
The last 2 are a system of equations that are easily solvable. Subtract 2x the first from the second.
3A+2B - 2(A+B)= (S3-S1)/2 - C - 2(S2 - C - D)
A = (S3-S1)/2 + C - 2(S2 - D)
Then B is
B = S2 - A - C - D
Once you have A, B, C and D you can plug a time 't' into the polynomial to find a sample value between your known samples.
Repeat for every output sample, reusing A, B, C and D if the next output sample is still between the same 2 input samples. Calculating t each time is similar to Bresenham's line algorithm; you're just advancing by a different amount each time.
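For concreteness, a short NumPy sketch of that scheme (the function name and the "fractional positions" interface are just illustrative); it evaluates the cubic above at arbitrary, possibly unevenly spaced positions, which is exactly the varying-speed case:

import numpy as np

def cubic_interp(samples, positions):
    # samples: 1-D array; positions: fractional indices into it (any spacing)
    s = np.asarray(samples, dtype=float)
    out = np.empty(len(positions))
    for k, pos in enumerate(positions):
        i = int(np.floor(pos))       # index of S1
        t = pos - i                  # 0 <= t < 1 between S1 and S2
        # neighboring points S0..S3, clamped at the array edges
        S0, S1, S2, S3 = s[np.clip([i - 1, i, i + 1, i + 2], 0, len(s) - 1)]
        # coefficients from the derivation above
        D = S1
        C = (S2 - S0) / 2.0
        A = (S3 - S1) / 2.0 + C - 2.0 * (S2 - D)
        B = S2 - A - C - D
        out[k] = ((A * t + B) * t + C) * t + D
    return out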