Calculating p-values with pnorm(). What makes p-values differ if data is transformed?

I am comparing two alternatives for calculating p-values with R's pnorm() function.
xbar <- 2.1
mu <- 2
sigma <- 0.25
n = 35
# z-transformation
z <- (xbar - mu) / (sigma / sqrt(n))
# Alternative I using transformed values
pval1 <- pnorm(q = z)
# Alternative II using untransformed values
pval2 <- pnorm(q = xbar, mean = mu, sd = sigma)
Why are the two calculated p-values not the same? Shouldn't they be?

They are different because you use two different standard deviations.
In the z-transformation you divide by sigma / sqrt(n), the standard error of the mean, but in the untransformed calculation you pass sd = sigma, ignoring n. Calling pnorm(q = xbar, mean = mu, sd = sigma / sqrt(n)) gives the same value as pnorm(q = z).
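For readers who want to check the arithmetic outside R, here is a minimal sketch in Python using only the standard library. It assumes that `statistics.NormalDist(...).cdf` plays the role of `pnorm()` (lower-tail probability); the names `se` and `pval2_fixed` are introduced here for illustration and are not part of the question.

```python
from math import sqrt
from statistics import NormalDist

# Values from the question
xbar, mu, sigma, n = 2.1, 2.0, 0.25, 35

se = sigma / sqrt(n)      # standard error of the mean
z = (xbar - mu) / se      # z-transformation

pval1 = NormalDist().cdf(z)                  # Alternative I: standard normal, transformed value
pval2 = NormalDist(mu, sigma).cdf(xbar)      # Alternative II as posted: sd ignores n
pval2_fixed = NormalDist(mu, se).cdf(xbar)   # Alternative II with sd = sigma / sqrt(n)

print(pval1, pval2, pval2_fixed)
```

With the standard error used in both places, the first and third values agree up to floating-point rounding, while the second differs.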

Related

How to calculate Covariance and Correlation in Python without using cov and corr?

How can we calculate the correlation and covariance between two variables without using cov and corr in Python 3?
At the end, I want to write a function that returns three values:
a boolean that is true if two variables are independent
covariance of two variables
correlation of two variables.
You can find the definition of correlation and covariance here:
https://medium.com/analytics-vidhya/covariance-and-correlation-math-and-python-code-7cbef556baed
I wrote this part for covariance:
'''
ans=[]
mean_x , mean_y = x.mean() , y.mean()
n = len(x)
Cov = sum((x - mean_x) * (y - mean_y)) / n
sum_x = float(sum(x))
sum_y = float(sum(y))
sum_x_sq = sum(xi*xi for xi in x)
sum_y_sq = sum(yi*yi for yi in y)
psum = sum(xi*yi for xi, yi in zip(x, y))
num = psum - (sum_x * sum_y/n)
den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5)
if den == 0: return 0
return num / den
'''
For the covariance, just subtract the respective means and multiply the vectors together (using the dot product), then divide by n (population) or n - 1 (sample). (Make sure you know which of the two estimates you want: if you have "enough" data the difference will be tiny, but you should still account for it if necessary.)
For the correlation, divide the covariance by the product of the two standard deviations.
As for whether or not two columns are independent, that's not quite as easy. For two independent random variables, we have that $\mathbb{E}\left[(X - \mu_X)(Y - \mu_Y)\right] = 0$, where $\mu_X, \mu_Y$ are the means of the two variables; the converse does not hold in general, so zero covariance alone does not prove independence. Moreover, when you have a data set, you are not dealing with the actual probability distributions; you are dealing with a sample. That means that the sample correlation of independent variables will very likely not be exactly $0$, but rather a value close to $0$. Whether or not this is "close enough" will depend on your sample size and what other assumptions you're willing to make.
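The recipe above can be sketched in plain Python as follows. This is only an illustration: the function name `cov_corr` and its `sample` flag are made up here, and it deliberately does not return an independence flag, since a near-zero correlation is at best a hint of independence rather than proof.

```python
from math import sqrt

def cov_corr(x, y, sample=True):
    """Return (covariance, correlation) of two equal-length sequences.

    sample=True divides by n - 1, sample=False divides by n.
    (Hypothetical helper, not taken from the question.)
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    cov = sxy / (n - 1 if sample else n)
    # The n or n - 1 factors cancel in the correlation
    corr = sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else float("nan")
    return cov, corr

cov, corr = cov_corr([1, 2, 3, 4], [2, 4, 6, 8])
print(cov, corr)
```

For perfectly linearly related data such as the example call, the correlation comes out as 1 up to rounding.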

Prior conditional on other prior

I am trying to sample two parameters (prior) from a categorical distribution ranging from 1 to 5000, theta[1] and theta[2] with the requirement that theta1 < theta2.
I have tried (among other things):
theta[1] ~ dcat(p1[])
p1[1:n] <- 1/n
theta[2] ~ dcat(p2[])
pi2[1:theta[1]] <- 0
pi2[sum(theta[1],1):n] <- 1/sum(n, -pi1)
with n = 5000
so that theta2 is sampled from the categorical distribution ranging from theta1 to n.
The error is: unknown variable theta[1].
Any help would be appreciated.
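If the only requirement for this categorical variable with n = 5000 is that theta1 < theta2, you could use the sort() function:
theta.star[1] ~ dcat(p1[])
theta.star[2] ~ dcat(p1[])
theta[1:2] <- sort(theta.star[1:2])
Sorting unordered draws is a way to impose order constraints in JAGS. (order() returns the sorting permutation rather than the sorted values.) Note that ties are possible, so this guarantees theta[1] <= theta[2] rather than a strict inequality.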
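To see what putting two independent draws into ascending order does, here is a small simulation in Python rather than JAGS. It assumes uniform category probabilities 1/n, as in the question; the function name `sample_ordered_pair` is hypothetical.

```python
import random

def sample_ordered_pair(n=5000, rng=random):
    """Draw two independent uniform categories in 1..n and return them in ascending order."""
    theta_star = [rng.randint(1, n), rng.randint(1, n)]  # unordered draws
    return sorted(theta_star)                            # theta[0] <= theta[1]

random.seed(1)
pairs = [sample_ordered_pair() for _ in range(10000)]
ties = sum(a == b for a, b in pairs)
print(pairs[:3], "ties:", ties)
```

Ties can occur with small probability (about 1/n per pair), so if a strict inequality is required they need separate handling.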

Solving vector second order differential equation while indexing into an array

I'm attempting to solve the differential equation:
m(t) = M(x)x'' + C(x, x') + B x'
where x and x' are vectors with 2 entries representing the angles and angular velocity in a dynamical system. M(x) is a 2x2 matrix that is a function of the components of theta, C is a 2x1 vector that is a function of theta and theta' and B is a 2x2 matrix of constants. m(t) is a 2*1001 array containing the torques applied to each of the two joints at the 1001 time steps and I would like to calculate the evolution of the angles as a function of those 1001 time steps.
I've transformed it to standard form such that:
x'' = M(x)^-1 (m(t) - C(x, x') - B x')
Then substituting y_1 = x and y_2 = x' gives the first-order system of equations:
y_1' = y_2
y_2' = M(y_1)^-1 (m(t) - C(y_1, y_2) - B y_2)
(I've used theta and phi in my code for x and y)
def joint_angles(theta_array, t, torques, B):
    phi_1 = np.array([theta_array[0], theta_array[1]])
    phi_2 = np.array([theta_array[2], theta_array[3]])
    def M_func(phi):
        M = np.array([[a_1+2.*a_2*np.cos(phi[1]), a_3+a_2*np.cos(phi[1])],[a_3+a_2*np.cos(phi[1]), a_3]])
        return np.linalg.inv(M)
    def C_func(phi, phi_dot):
        return a_2 * np.sin(phi[1]) * np.array([-phi_dot[1] * (2. * phi_dot[0] + phi_dot[1]), phi_dot[0]**2])
    dphi_2dt = M_func(phi_1) @ (torques[:, t] - C_func(phi_1, phi_2) - B @ phi_2)
    return dphi_2dt, phi_2
t = np.linspace(0,1,1001)
initial = theta_init[0], theta_init[1], dtheta_init[0], dtheta_init[1]
x = odeint(joint_angles, initial, t, args = (torque_array, B))
I get the error that I cannot index into torques using the t array, which makes perfect sense, however I am not sure how to have it use the current value of the torques at each time step.
I also tried putting odeint command in a for loop and only evaluating it at one time step at a time, using the solution of the function as the initial conditions for the next loop, however the function simply returned the initial conditions, meaning every loop was identical. This leads me to suspect I've made a mistake in my implementation of the standard form but I can't work out what it is. It would be preferable however to not have to call the odeint solver in a for loop every time, and rather do it all as one.
If helpful, my initial conditions and constant values are:
theta_init = np.array([10*np.pi/180, 143.54*np.pi/180])
dtheta_init = np.array([0, 0])
L_1 = 0.3
L_2 = 0.33
I_1 = 0.025
I_2 = 0.045
M_1 = 1.4
M_2 = 1.0
D_2 = 0.16
a_1 = I_1+I_2+M_2*(L_1**2)
a_2 = M_2*L_1*D_2
a_3 = I_2
Thanks for helping!
The solver uses internal step sizes that are adapted to the problem. The time list you pass in is only the list of points at which the internal solution gets interpolated for output samples. The internal and external time lists are in no way related; the internal steps depend only on the given tolerances.
There is therefore no natural relation between array indices and the times at which the solver evaluates your function.
The translation of a given time into an index, and the construction of a sample value from the surrounding table entries, is called interpolation (by a piecewise polynomial function).
Torque as a physical phenomenon is at least continuous, so piecewise linear interpolation is the easiest way to transform the given table of function values into an actual continuous function. Of course, one also needs the time array.
So use numpy.interp, or the more advanced routines of scipy.interpolate such as interp1d, to define a torque function that can be evaluated at arbitrary times as demanded by the solver and its integration method.
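A minimal sketch of this approach, using the constants from the question, might look as follows. Several things are assumptions made only so that the example runs: the damping matrix `B`, the sinusoidal placeholder in `torque_array`, and the helper names `torque_at`, `M_inv` and `t_samples`. The derivative is returned in the same order as the state vector (angular velocities first, then angular accelerations), which is what odeint expects.

```python
import numpy as np
from scipy.integrate import odeint

# Constants from the question
L_1, I_1, I_2, M_2, D_2 = 0.3, 0.025, 0.045, 1.0, 0.16
a_1 = I_1 + I_2 + M_2 * L_1**2
a_2 = M_2 * L_1 * D_2
a_3 = I_2

# ASSUMPTIONS (not given in the question): damping matrix and torque table
B = np.array([[0.05, 0.025], [0.025, 0.05]])
t_samples = np.linspace(0, 1, 1001)
torque_array = np.vstack([0.1 * np.sin(2 * np.pi * t_samples),
                          np.zeros_like(t_samples)])   # shape (2, 1001)

def torque_at(t):
    """Piecewise-linear interpolation of each torque row at an arbitrary time t."""
    return np.array([np.interp(t, t_samples, row) for row in torque_array])

def M_inv(phi):
    c = np.cos(phi[1])
    M = np.array([[a_1 + 2.0 * a_2 * c, a_3 + a_2 * c],
                  [a_3 + a_2 * c, a_3]])
    return np.linalg.inv(M)

def C_func(phi, phi_dot):
    return a_2 * np.sin(phi[1]) * np.array(
        [-phi_dot[1] * (2.0 * phi_dot[0] + phi_dot[1]), phi_dot[0] ** 2])

def joint_angles(state, t):
    phi, phi_dot = state[:2], state[2:]
    phi_ddot = M_inv(phi) @ (torque_at(t) - C_func(phi, phi_dot) - B @ phi_dot)
    # d/dt [phi, phi_dot] = [phi_dot, phi_ddot]
    return np.concatenate([phi_dot, phi_ddot])

initial = np.array([10 * np.pi / 180, 143.54 * np.pi / 180, 0.0, 0.0])
x = odeint(joint_angles, initial, t_samples)   # one call, no Python loop over time steps
print(x.shape)
```

Note that np.interp holds the end values constant outside the sampled time range, which matters if the solver briefly evaluates the right-hand side slightly past the last sample time.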

sklearn customized standardization of data

Suppose I have a 2D numpy array:
X = np.array([
    [..., ...],
    [..., ...]])
And I want to standardize the data either with:
X = StandardScaler().fit_transform(X)
or:
X = (X - X.mean())/X.std()
The results are different. Why are they different?
Assume X is a feature matrix of shape (n x m) (n instances and m features). We want to scale each feature so that its instances are distributed with a mean of zero and with unit variance.
To do this you need to calculate the mean and standard deviation of each feature over the provided instances (each column of X) and then calculate the scaled feature vectors. Currently you are calculating the mean and standard deviation of the whole dataset and scaling the data using these values: this will not match StandardScaler except in a few special cases (e.g., when every column already has the same mean and standard deviation).
Practically, to calculate these statistics for each feature you will need to set the axis parameter of the .mean() and .std() methods to 0. This will perform the calculations along the columns and return a (1 x m) shaped array (actually an (m,) array, but that's another story), where each value is the mean or standard deviation for the given column. You can then use numpy broadcasting to correctly scale the feature vectors.
The below example shows how you can correctly implement it manually. x1 and x2 are 2 features with 100 training instances. We store them in a feature matrix X.
import numpy as np
from sklearn.preprocessing import StandardScaler
x1 = np.linspace(0, 100, 100)
x2 = 10 * np.random.normal(size=100)
X = np.c_[x1, x2]
# scale the data using the sklearn implementation
X_scaled = StandardScaler().fit_transform(X)
# scale the data taking mean and std along columns
X_scaled_manual = (X - X.mean(axis=0)) / X.std(axis=0)
If you compare the two you will see they match up to floating-point rounding; explicitly:
print(np.allclose(X_scaled, X_scaled_manual))
prints True.

How to find the fundamental frequency of a guitar string sound?

I want to build a guitar tuner app for iPhone. My goal is to find the fundamental frequency of the sound generated by a guitar string. I have used bits of code from the aurioTouch sample provided by Apple to calculate the frequency spectrum, and I find the frequency with the highest amplitude. It works fine for pure sounds (the ones that have only one frequency), but for sounds from a guitar string it produces wrong results. I have read that this is because of the overtones generated by the guitar string, which might have higher amplitudes than the fundamental one. How can I find the fundamental frequency so it works for guitar strings? Is there an open-source library in C/C++/Obj-C for sound analysis (or signal processing)?
You can use the signal's autocorrelation, which is the inverse transform of the magnitude squared of the DFT. If you're sampling at 44100 samples/s, then an 82.4 Hz fundamental has a period of about 535 samples, whereas 1479.98 Hz has a period of about 30 samples. Look for the peak positive lag in that range (e.g. from 28 to 560). Make sure your window is at least two periods of the longest fundamental, which would be 1070 samples here. Rounded up to the next power of two, that's a 2048-sample buffer. For better frequency resolution and a less biased estimate, use a longer buffer, but not so long that the signal is no longer approximately stationary. Here's an example in Python:
from pylab import *
import wave
fs = 44100.0   # sample rate
K = 3          # number of windows
L = 8192       # 1st pass window overlap, 50%
M = 16384      # 1st pass window length
N = 32768      # 1st pass DFT length: acyclic correlation
# load a sample of guitar playing an open string 6
# with a fundamental frequency of 82.4 Hz (in theory),
# but this sample is actually at about 81.97 Hz
g = frombuffer(wave.open('dist_gtr_6.wav').readframes(-1),
               dtype='int16')
g = g / float64(max(abs(g)))   # normalize to +/- 1.0
mi = len(g) // 4               # start index
def welch(x, w, L, N):
    # Welch's method
    M = len(w)
    K = (len(x) - L) // (M - L)
    Xsq = zeros(N//2+1)        # len(N-point rfft) = N/2+1
    for k in range(K):
        m = k * (M - L)
        xt = w * x[m:m+M]
        # use rfft for efficiency (assumes x is real-valued)
        Xsq = Xsq + abs(rfft(xt, N)) ** 2
    Xsq = Xsq / K
    Wsq = abs(rfft(w, N)) ** 2
    bias = irfft(Wsq)          # for unbiasing Rxx and Sxx
    p = dot(x,x) / len(x)      # avg power, used as a check
    return Xsq, bias, p
# first pass: acyclic autocorrelation
x = g[mi:mi + K*M - (K-1)*L]   # len(x) = 32768
w = hamming(M)                 # hamming[m] = 0.54 - 0.46*cos(2*pi*m/M)
                               # reduces the side lobes in DFT
Xsq, bias, p = welch(x, w, L, N)
Rxx = irfft(Xsq)               # acyclic autocorrelation
Rxx = Rxx / bias               # unbias (bias is tapered)
mp = argmax(Rxx[28:561]) + 28  # index of 1st peak in 28 to 560
# 2nd pass: cyclic autocorrelation
N = M = L - (L % mp)           # window an integer number of periods
                               # shortened to ~8192 for stationarity
x = g[mi:mi+K*M]               # data for K windows
w = ones(M); L = 0             # rectangular, non-overlapping
Xsq, bias, p = welch(x, w, L, N)
Rxx = irfft(Xsq)               # cyclic autocorrelation
Rxx = Rxx / bias               # unbias (bias is constant)
mp = argmax(Rxx[28:561]) + 28  # index of 1st peak in 28 to 560
Sxx = Xsq / bias[0]
Sxx[1:-1] = 2 * Sxx[1:-1]      # fold the freq axis
Sxx = Sxx / N                  # normalize S for avg power
n0 = N // mp
np = argmax(Sxx[n0-2:n0+3]) + n0-2  # bin of the nearest peak power
# check
print("\nAverage Power")
print("  p:", p)
print("Rxx:", Rxx[0])          # should equal dot product, p
print("Sxx:", sum(Sxx), '\n')  # should equal Rxx[0]
figure().subplots_adjust(hspace=0.5)
subplot2grid((2,1), (0,0))
title('Autocorrelation, R$_{xx}$'); xlabel('Lags')
mr = r_[:3 * mp]
plot(Rxx[mr]); plot(mp, Rxx[mp], 'ro')
xticks(mp//2 * r_[1:6])
grid(); axis('tight'); ylim(1.25*min(Rxx), 1.25*max(Rxx))
subplot2grid((2,1), (1,0))
title('Power Spectral Density, S$_{xx}$'); xlabel('Frequency (Hz)')
fr = r_[:5 * np]; f = fs * fr / N
vlines(f, 0, Sxx[fr], colors='b', linewidth=2)
xticks((fs * np/N * r_[1:5]).round(3))
grid(); axis('tight'); ylim(0,1.25*max(Sxx[fr]))
show()
Output:
Average Power
p: 0.0410611012542
Rxx: 0.0410611012542
Sxx: 0.0410611012542
The peak lag is 538, which is 44100/538 = 81.97 Hz. The first-pass acyclic DFT shows the fundamental at bin 61, which is 82.10 +/- 0.67 Hz. The 2nd pass uses a window length of 538*15 = 8070, so the DFT frequencies include the fundamental period and harmonics of the string. This enables an unbiased cyclic autocorrelation for an improved PSD estimate with less harmonic spreading (i.e. the correlation can wrap around the window periodically).
Edit: Updated to use Welch's method to estimate the autocorrelation. Overlapping the windows compensates for the Hamming window. I also calculate the tapered bias of the Hamming window to unbias the autocorrelation.
Edit: Added a 2nd pass with cyclic correlation to clean up the power spectral density. This pass uses 3 non-overlapping, rectangular windows of length 538*15 = 8070 (short enough to be nearly stationary). The bias for cyclic correlation is a constant, instead of the Hamming window's tapered bias.
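For readers without the guitar recording, the core idea (autocorrelation as the inverse transform of the squared DFT magnitude, then a peak search over lags 28 to 560) can be tried on a synthetic tone. The sketch below is not the Welch-based code above: it uses a single rectangular window, and the overtone amplitudes and the function name `pitch_autocorr` are assumptions chosen so that an overtone is louder than the fundamental.

```python
import numpy as np

fs = 44100.0   # sample rate
f0 = 82.4      # fundamental of the synthetic "string"
N = 8192       # buffer length
t = np.arange(N) / fs

# ASSUMED amplitudes: weak fundamental, stronger 2nd and 3rd harmonics
x = (0.3 * np.sin(2 * np.pi * f0 * t)
     + 1.0 * np.sin(2 * np.pi * 2 * f0 * t)
     + 0.6 * np.sin(2 * np.pi * 3 * f0 * t))

def pitch_autocorr(x, fs, lag_min=28, lag_max=560):
    """Estimate the fundamental from the peak of the unbiased autocorrelation."""
    x = x - x.mean()
    n = len(x)
    X = np.fft.rfft(x, 2 * n)               # zero-pad so the correlation does not wrap around
    r = np.fft.irfft(np.abs(X) ** 2)[:n]    # acyclic autocorrelation, lags 0..n-1
    r = r / (n - np.arange(n))              # unbias: divide by the number of overlapping samples
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    return fs / lag

# Naive estimate from the question: frequency of the largest spectral peak
naive = np.argmax(np.abs(np.fft.rfft(x))) * fs / N
est = pitch_autocorr(x, fs)
print("spectral peak:", naive, "Hz; autocorrelation:", est, "Hz")
```

On this signal the largest spectral peak lands on an overtone, while the autocorrelation peak falls near the period of the fundamental.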
Finding the musical pitches in a chord is far more difficult than estimating the pitch of one single string or note played at a time. The overtones for the multiple notes in a chord might all be overlapping and interleaving. And all the notes in common chords may themselves be at overtone frequencies for one or more non-existent lower pitched notes.
For single notes, autocorrelation is a common technique used by some guitar tuners. But with autocorrelation, you have to be aware of some potential octave uncertainty, as guitars may produce inharmonic and decaying overtones which thus don't exactly match from pitch period to pitch period. Cepstrum and Harmonic Product Spectrum are two other pitch estimation methods which may or may not have different problems, depending on the guitar and the note.
RAPT appears to be one published algorithm for more robust pitch estimation. YIN is another.
Also, Objective-C is a superset of ANSI C, so you can use any C DSP routines you find for pitch estimation within an Objective-C app.
Use libaubio and be happy. Trying to implement a fundamental frequency estimator myself was one of my biggest wastes of time. If you want to do it yourself, I advise you to follow the YINFFT method.
