Causal Inference where the treatment assignment is randomised - statistics

I have mostly worked with observational data where the treatment assignment was not randomized. In the past, I have used PSM and IPTW to balance the groups and then calculate the ATE.
My problem is:
Now I am working on a problem where the treatment assignment is randomized, meaning there won't be a confounding effect. But the treatment and control groups have different sizes: there is a bucket imbalance.
Should I just analyze the data as it is and run statistical significance and statistical power tests?
Or should I first balance the group sizes between treatment and control, using, let's say, covariate matching, and then run significance tests?

In general, you don't need equal group sizes to estimate treatment effects.
Unequal groups will not bias the estimate; they only affect its variance, namely by reducing precision (recall that, for a fixed total sample size, statistical power is limited mostly by the smaller group, so unequal groups are less sample-efficient, but not categorically wrong).
You can further convince yourself with a simple simulation (code below). It shows that over repeated draws the estimate is unbiased under both designs (both distributions are centred on the true effect), but equal groups give better precision (a smaller standard error).
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

n_trials = 100
# (n_control, n_treated) for each design; both designs total 200 units
balanced = {
    True: (100, 100),
    False: (190, 10),
}
n_total = sum(balanced[True])
effect = 2.0
res = []
for i in range(n_trials):
    np.random.seed(i)
    # One noise draw per trial, shared by both designs
    noise = np.random.normal(size=n_total)
    for is_balanced, ratio in balanced.items():
        t = np.array([0]*ratio[0] + [1]*ratio[1])
        y = effect * t + noise
        # Include an intercept so the slope is the difference in group means
        m = sm.OLS(y, sm.add_constant(t)).fit()
        res.append((is_balanced, m.params[1], m.bse[1]))
res = pd.DataFrame(res, columns=["is_balanced", "beta", "se"])
g = sns.jointplot(
    x="se", y="beta",
    hue="is_balanced",
    data=res
)
# Annotate the true effect:
g.ax_joint.axhline(y=effect, color='grey', linestyle='--')
g.ax_joint.text(y=effect, x=res["se"].max(), s="True effect")
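The precision claim can also be checked without simulation. The following is a minimal sketch, assuming two independent groups with a common outcome standard deviation (sigma = 1, as in the simulated noise); the helper name se_diff_in_means is made up for illustration. It uses the textbook standard error of a difference in means, sigma * sqrt(1/n_control + 1/n_treated).

```python
import math

def se_diff_in_means(n_control, n_treated, sigma=1.0):
    """Standard error of the difference in means of two independent groups
    that share the same outcome standard deviation sigma (an assumption)."""
    return sigma * math.sqrt(1.0 / n_control + 1.0 / n_treated)

# Same total sample size (200), different splits
se_balanced = se_diff_in_means(100, 100)
se_unbalanced = se_diff_in_means(190, 10)

print(f"balanced SE:   {se_balanced:.3f}")
print(f"unbalanced SE: {se_unbalanced:.3f}")
print(f"ratio:         {se_unbalanced / se_balanced:.2f}")
```

With the same 200 units, the 190/10 split gives a standard error a bit more than twice that of the 100/100 split, while the expected value of the estimate is unchanged.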


Detecting duplicate audio files

I have snippets of audio that are almost the same that I want to group together (samples 5 and 3 below). There are other portions that are similar, but differ (3 and 4, there is a double drum hit at the end for 3) and completely different ones (sample 8).
How can I group together samples that are almost the same? I tried taking the difference (attempting to minimize it), but that does not work since they are not aligned. I also tried to take audio features like pitch distribution, but since the sounds are similar in pitch those don't get separated well.
The files are available here: https://drive.google.com/drive/folders/14UQQDfIBUNRO_1Pv8bkPf9noi86M7lKd
Here's something that appears to work for the data you are using but may (likely does) have weaknesses when it comes to other data or other sorts of data. But maybe it will be helpful nonetheless.
The basic idea of this solution is to compute the MFCCs of each sample to get feature vectors and then compute a distance (here just the basic Euclidean distance) between those feature vectors. The assumption, which seems to hold for your data, is that the least similar samples will be far apart and the most similar ones will be closest. Here's the code:
import librosa
import matplotlib.pyplot as plt
from scipy.spatial import distance

# sr=None keeps each file's native sampling rate
sample3, rate = librosa.load('sample3.wav', sr=None)
sample4, rate = librosa.load('sample4.wav', sr=None)
sample5, rate = librosa.load('sample5.wav', sr=None)
sample8, rate = librosa.load('sample8.wav', sr=None)
# cut the longer sounds to same length as the shortest
len5 = len(sample5)
sample3 = sample3[:len5]
sample4 = sample4[:len5]
sample8 = sample8[:len5]
# recent librosa versions require the audio to be passed as keyword y
mf3 = librosa.feature.mfcc(y=sample3, sr=rate)
mf4 = librosa.feature.mfcc(y=sample4, sr=rate)
mf5 = librosa.feature.mfcc(y=sample5, sr=rate)
mf8 = librosa.feature.mfcc(y=sample8, sr=rate)
# average over the MFCC coefficients (axis 0), leaving one value per frame. dubious?
amf3 = mf3.mean(axis=0)
amf4 = mf4.mean(axis=0)
amf5 = mf5.mean(axis=0)
amf8 = mf8.mean(axis=0)
f_list = [amf3, amf4, amf5, amf8]
# pairwise Euclidean distances between the feature vectors
results = []
for i, features_a in enumerate(f_list):
    results.append([])
    for features_b in f_list:
        result = distance.euclidean(features_a, features_b)
        results[i].append(result)
plt.ion()
fig, ax = plt.subplots()
ax.imshow(results, cmap='gray_r', interpolation='nearest')
spots = [0, 1, 2, 3]
labels = ['s3', 's4', 's5', 's8']
ax.set_xticks(spots)
ax.set_xticklabels(labels)
ax.set_yticks(spots)
ax.set_yticklabels(labels)
The code plots a heatmap of the pairwise distances between the samples. It is lazy: it computes both halves of the symmetric matrix as well as the diagonal (which should be zero distance). Those redundant entries serve as sanity checks, though, since it is nice to see white down the diagonal and a symmetric matrix.
The real information is that clip 8 is black against all the other clips (i.e. furthest from them) and clip 3 and clip 5 are the least distant from one another.
This basic idea could be done with a feature vector generated in a different sort of way (e.g. instead of MFCCs, you could use the embeddings from something like YAMNet) or with a different way of finding a distance between the feature vectors.
For the grouping part of what you want to do, you could experimentally work out a threshold on the distance metric below which you would consider a clip to be in the same group as another. With more clips, you could compute all these distances and then hand that distance matrix over to a clustering algorithm (like HDBSCAN) to cluster the clips.
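The thresholding idea for grouping could be sketched as follows. This is only an illustration under stated assumptions: the function name group_by_threshold, the toy distance values and the threshold of 2.0 are all made up, and clips are linked transitively (a chain of close pairs ends up in one group), which may or may not be what you want.

```python
def group_by_threshold(dist, threshold):
    """Group items whose pairwise distance is below `threshold`.

    Items are linked transitively (single linkage). Returns a sorted list
    of groups, each group being a sorted list of item indices.
    """
    n = len(dist)
    parent = list(range(n))  # union-find forest, one tree per group

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up distances for four clips, in the order s3, s4, s5, s8
toy = [
    [0.0, 4.0, 1.0, 9.0],
    [4.0, 0.0, 4.5, 9.5],
    [1.0, 4.5, 0.0, 9.2],
    [9.0, 9.5, 9.2, 0.0],
]
groups = group_by_threshold(toy, threshold=2.0)
print(groups)  # [[0, 2], [1], [3]]
```

With these toy numbers only s3 and s5 fall below the threshold, so they form one group and the other two clips stay on their own. A real threshold would have to be tuned on the actual distance matrix.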

How can I compute (for later uses) a wave with a very high frequency?

I'm running a physics simulation related to visible light, and the resulting wave function has a very, very high frequency -- cyclic frequency is on the order of 1.0e15, and the spatial frequency k is on the order of 1.0e7. Thankfully, I only use the spatial frequency, but when I calculate it for later usage (using either math or numpy), I get something that resembles a beat wave, unless I use N ~= k sample points, because I have to calculate it over a much greater range (on the order of 1.0e-3 - 1.0e-1). It produces a beat wave so consistently I spent a few hours to make sure I'm not actually calculating one. I'll also have to use fft() on the resulting wave and I'm afraid it won't work properly with a misrepresented wave.
I've tried using various amounts of sample points, but unless it's extraordinarily high (takes a good minute or two to calculate), only the prominence of beating changes. Just in case I'm misusing numpy, I tried the same thing with appending wave.value calculated by math.sin to a float array, but it had the same result.
import numpy as np
import matplotlib.pyplot as plt

mmScale = 1.0e-3
nmScale = 1.0e-9
c = 3.0e8
N = 1000

class Wave:
    def __init__(self, amplitude, wavelength):
        self.wavelength = wavelength*nmScale
        self.amplitude = amplitude
        self.omega = 2*np.pi*c/self.wavelength
        self.k = 2*np.pi/self.wavelength

    def value(self, time, travel):
        return self.amplitude*np.sin(self.omega*time - self.k*travel)

x = np.linspace(50, 250, N)*mmScale
wave = Wave(1, 400)
y = wave.value(0.1, x)
plt.plot(x,y)
plt.show()
The code above produces a graph of the function, and you can put in different values for N to see how it gives different waveforms.
Your sampling spatial frequency is:
1/Ts = N / ((250-50)*mmScale) = 1000 / 0.2 = 5000 [samples/meter]
Your wave's spatial frequency is:
1/Tw = 1 / wavelength = 1 / (400e-9) = 2500000 [wavelengths/meter]
You fail to satisfy the Nyquist criterion by a factor of (2*2500000) / 5000 = 1000.
Thus you must expect serious aliasing effects. See https://en.wikipedia.org/wiki/Aliasing.
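The arithmetic above can be turned into a small check of how many samples the given range would need. This is a rough sketch, assuming evenly spaced samples and ignoring the off-by-one between points and intervals; the function name min_samples_for_nyquist is made up for illustration.

```python
import math

def min_samples_for_nyquist(span_m, wavelength_m):
    """Roughly the smallest number of evenly spaced samples over `span_m`
    metres that gives more than two samples per wavelength."""
    spatial_freq = 1.0 / wavelength_m  # wavelengths per metre
    return math.floor(2.0 * spatial_freq * span_m) + 1

span = (250 - 50) * 1.0e-3   # sampled range in metres
wavelength = 400 * 1.0e-9    # 400 nm in metres
n_min = min_samples_for_nyquist(span, wavelength)

print(n_min)         # about one million
print(n_min / 1000)  # factor by which N = 1000 falls short
```

For this range and wavelength the result is on the order of a million samples, which matches the factor of 1000 relative to N = 1000.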
Not much can be done to battle it, but some tricks may help depending on the application. One is to represent the wave as a complex envelope around a carrier frequency, here the spatial frequency that corresponds to the 400e-9 m wavelength. Please provide more detail on what you do with the wave.

Choice of IMODE in gekko optimisation problems

I'm seeing here that imode=3 is equivalent to the steady-state simulation (which I guess is imode=2), except that additional degrees of freedom are allowed.
How do I decide whether to use imode=3 instead of imode=2?
I'm doing optimization using imode=2, where I define the variables calculated by the solver to meet the constraints using m.Var and the others using m.Param. What changes do I need to make to the variables to use imode=3?
Niladri,
IMODE 2 is for steady state problems with multiple data points.
Here is an example:
from gekko import GEKKO
import numpy as np
xm = np.array([0,1,2,3,4,5])
ym = np.array([0.1,0.2,0.3,0.5,1.0,0.9])
m = GEKKO()
m.x = m.Param(value=np.linspace(-1,6))
m.y = m.Var()
m.options.IMODE=2
m.cspline(m.x,m.y,xm,ym)
m.solve(disp=False)
This is a cubic spline approximation with multiple data points. IMODE 3 is very similar, but it considers only one instance of your model: every value property should hold a single value, as when you optimize the cubic spline to find its maximum value.
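p = GEKKO()
p.options.IMODE=3  # set explicitly for clarity
p.x = p.Var(value=1,lb=0,ub=5)
p.y = p.Var()
p.cspline(p.x,p.y,xm,ym)  # reuses xm, ym from the example above
p.Obj(-p.y)  # minimizing -y maximizes y
p.solve(disp=False)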
Here is additional information on IMODE:
https://apmonitor.com/wiki/index.php/Main/OptionApmImode
https://apmonitor.com/wiki/index.php/Main/Modes
https://gekko.readthedocs.io/en/latest/imode.html
Best regards,
John Hedengren

Bayesian Linear Regression with PyMC3 and a large dataset - bracket nesting level exceeded maximum and slow performance

I would like to use a Bayesian multivariate linear regression to estimate the strength of players in team sports (e.g. ice hockey, basketball or soccer). For that purpose, I create a matrix, X, containing the players as columns and the matches as rows. For each match the player entry is either 1 (player plays in the home team), -1 (player plays in the away team) or 0 (player does not take part in this game). The dependent variable Y is defined as the scoring differences for both teams in each match (Score_home_team - Score_away_team).
Thus, the number of parameters will be quite large for one season (e.g. X is defined by 300 rows x 450 columns; i.e. 450 player coefficients + y-intercept). When running the fit I came across a compilation error:
('Compilation failed (return status=1): /Users/me/.theano/compiledir_Darwin-17.7.0-x86_64-i386-64bit-i386-3.6.5-64/tmpdxxc2379/mod.cpp:27598:32: fatal error: bracket nesting level exceeded maximum of 256.
I tried to handle this error by setting:
theano.config.gcc.cxxflags = "-fbracket-depth=1024"
Now, the sampling is running. However, it is so slow that even if I take only 35 of 300 rows the sampling is not completed within 20 minutes.
This is my basic code:
import pymc3 as pm
basic_model = pm.Model()
with basic_model:
    # Priors for beta coefficients - these are the coefficients of the players
    dict_betas = {}
    for col in X.columns:
        dict_betas[col] = pm.Normal(col, mu=0, sd=10)
    # Priors for unknown model parameters
    alpha = pm.Normal('alpha', mu=0, sd=10) # alpha is the y-intercept
    sigma = pm.HalfNormal('sigma', sd=1) # standard deviation of the observations
    # Expected value of outcome
    mu = alpha
    for col in X.columns:
        mu = mu + dict_betas[col] * X[col] # mu = alpha + beta_1 * Player_1 + beta_2 * Player_2 + ...
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=Y)
The instantiation of the model runs within one minute for the large dataset. I do the sampling using:
with basic_model:
    # draw 500 posterior samples
    trace = pm.sample(500)
The sampling is completed for small sample sizes (e.g. 9 rows, 80 columns) within 7 minutes. However, the time is increasing substantially with increasing sample size.
Any suggestions on how I can get this Bayesian linear regression to run in a feasible amount of time? Are these kinds of problems doable using PyMC3 (remember I came across a bracket nesting error)? I saw in a recent publication that this kind of analysis is doable in R (https://arxiv.org/pdf/1810.08032.pdf). Therefore, I guess it should also somehow work with Python 3.
Any help is appreciated!
Eliminating the for loops should improve performance and might also take care of the nesting issue you are reporting. Theano TensorVariables and the PyMC3 random variables that derive from them are already multidimensional and support linear algebra operations. Try changing your code to something along the lines of
beta = pm.Normal('beta', mu=0, sd=10, shape=X.shape[1])
...
mu = alpha + pm.math.dot(X, beta)
...
If you need to specify different prior values for mu and/or sd, those arguments accept anything that theano.tensor.as_tensor_variable() accepts, so you can pass a list or numpy array.
I highly recommend getting familiar with the theano.tensor and pymc3.math operations since sometimes you must use these to properly manipulate random variables, and in general it should lead to more efficient code.
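To see that the vectorized form computes the same expected value as the column-by-column loop, here is a NumPy-only sketch. It deliberately leaves PyMC3 out, and the random design matrix, coefficient values and intercept are made up for illustration; only the 300 x 450 shape and the {-1, 0, 1} coding come from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n_matches, n_players = 300, 450

# Hypothetical design matrix with entries in {-1, 0, 1}
X = rng.integers(-1, 2, size=(n_matches, n_players)).astype(float)
beta = rng.normal(size=n_players)  # made-up player coefficients
alpha = 0.5                        # made-up intercept

# Loop form: one multiply-add per player column, as in the question
mu_loop = np.full(n_matches, alpha)
for j in range(n_players):
    mu_loop = mu_loop + beta[j] * X[:, j]

# Vectorized form: a single matrix-vector product
mu_dot = alpha + X @ beta

print(np.allclose(mu_loop, mu_dot))  # True
```

The loop builds a chain of 450 nested additions, while the matrix form is a single operation, which is the reason the vectorized model is expected to compile and sample far more easily.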

How to compare predictive power of PCA and NMF

I would like to compare the output of an algorithm on data preprocessed in two different ways: with NMF and with PCA.
To get a somewhat comparable result, instead of just choosing the same number of components for PCA and NMF, I would like to pick the number that explains, e.g., 95% of the retained variance.
I was wondering if it's possible to identify the variance retained in each component of NMF.
For instance using PCA this would be given by:
retainedVariance(i) = eigenvalue(i) / sum(eigenvalue)
Any ideas?
TL;DR
You should loop over different values of n_components and compute explained_variance_score of the decoded X at each iteration. This will show you how many components you need to explain 95% of the variance.
Now I will explain why.
Relationship between PCA and NMF
NMF and PCA, like many other unsupervised learning algorithms, aim to do two things:
encode input X into a compressed representation H;
decode H back to X', which should be as close to X as possible.
They do it in a somewhat similar way:
Decoding is similar in PCA and NMF: they output X' = dot(H, W), where W is a learned matrix parameter.
Encoding is different. In PCA, it is also linear: H = dot(X, V), where V is also a learned parameter. In NMF, H = argmin(loss(X, H, W)) (with respect to H only), where loss is mean squared error between X and dot(H, W), plus some additional penalties. Minimization is performed by coordinate descent, and result may be nonlinear in X.
Training is also different. PCA learns sequentially: the first component minimizes MSE without constraints, and each subsequent k-th component minimizes the residual MSE subject to being orthogonal to the previous components. NMF minimizes the same loss(X, H, W) as when encoding, but now with respect to both H and W.
How to measure performance of dimensionality reduction
If you want to measure performance of an encoding/decoding algorithm, you can follow the usual steps:
1. Train your encoder+decoder on X_train
2. To measure in-sample performance, compare X_train'=decode(encode(X_train)) with X_train using your preferred metric (e.g. MAE, RMSE, or explained variance)
3. To measure out-of-sample performance (generalizing ability) of your algorithm, do step 2 with the unseen X_test.
Let's try it with PCA and NMF!
from sklearn import decomposition, datasets, model_selection, preprocessing, metrics
# use the well-known Iris dataset
X, _ = datasets.load_iris(return_X_y=True)
# split the dataset, to measure overfitting
X_train, X_test = model_selection.train_test_split(X, test_size=0.5, random_state=1)
# I scale the data in order to give equal importance to all its dimensions
# NMF does not allow negative input, so I don't center the data
scaler = preprocessing.StandardScaler(with_mean=False).fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
# train both decomposers
pca = decomposition.PCA(n_components=2).fit(X_train_sc)
nmf = decomposition.NMF(n_components=2).fit(X_train_sc)
print(sum(pca.explained_variance_ratio_))
This prints an explained variance ratio of 0.9536930834362043, the default metric of PCA, estimated from its eigenvalues. We can measure it in a more direct way, by applying a metric to the actual and "predicted" values:
def get_score(model, data, scorer=metrics.explained_variance_score):
    """ Estimate performance of the model on the data """
    prediction = model.inverse_transform(model.transform(data))
    return scorer(data, prediction)
print('train set performance')
print(get_score(pca, X_train_sc))
print(get_score(nmf, X_train_sc))
print('test set performance')
print(get_score(pca, X_test_sc))
print(get_score(nmf, X_test_sc))
which gives
train set performance
0.9536930834362043 # same as before!
0.937291711378812
test set performance
0.9597828443047842
0.9590555069007827
You can see that on the training set PCA performs better than NMF, but on the test set their performance is almost identical. This happens because NMF applies a lot of regularization:
H and W (the learned parameter) must be non-negative
H should be as small as possible (L1 and L2 penalties, when enabled)
W should be as small as possible (L1 and L2 penalties, when enabled)
These regularizations make NMF fit the training data worse than it otherwise could, but they might improve its generalizing ability, which is what happened in our case.
How to choose the number of components
In PCA, it is simple, because its components h_1, h_2, ... h_k are learned sequentially. If you add the new component h_(k+1), the first k will not change. Thus, you can estimate the performance of each component, and these estimates will not depend on the number of components. This makes it possible for PCA to output the explained_variance_ratio_ array after only a single fit to data.
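For the PCA side, the per-component ratios and the number of components needed for a target can be sketched directly from the eigenvalues of the covariance matrix. This is a NumPy-only illustration, not the scikit-learn implementation; the function name n_components_for_variance and the synthetic data are assumptions made for the example.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Return (k, ratios): the smallest number of principal components whose
    cumulative retained-variance ratio reaches `threshold`, and all ratios."""
    Xc = X - X.mean(axis=0)
    # Eigenvalues of the covariance matrix, sorted in decreasing order
    eigenvalues = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # guard against tiny negatives
    ratios = eigenvalues / eigenvalues.sum()
    cumulative = np.cumsum(ratios)
    # First index where the cumulative ratio reaches the threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(ratios)), ratios

# Synthetic data: three independent columns with very different spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])

k, ratios = n_components_for_variance(X_demo, threshold=0.95)
print(k)
print(ratios)
```

Because the ratios come from a single eigendecomposition, one pass is enough for PCA; the same shortcut is not available for NMF, as explained next.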
NMF is more complex, because all its components are trained at the same time, and each one depends on all the rest. Thus, if you add the k+1th component, the first k components will change, and you cannot match each particular component with its explained variance (or any other metric).
But what you can do is fit a new instance of NMF for each number of components, and compare the total explained variance:
ks = [1,2,3,4]
perfs_train = []
perfs_test = []
for k in ks:
    nmf = decomposition.NMF(n_components=k).fit(X_train_sc)
    perfs_train.append(get_score(nmf, X_train_sc))
    perfs_test.append(get_score(nmf, X_test_sc))
print(perfs_train)
print(perfs_test)
which would give
[0.3236945680665101, 0.937291711378812, 0.995459457205891, 0.9974027602663655]
[0.26186701106012833, 0.9590555069007827, 0.9941424954209546, 0.9968456603914185]
Thus, three components (judging by the train set performance) or two components (by the test set) are required to explain at least 95% of the variance. Please note that this case is unusual and is caused by the small size of the training and test data: performance usually degrades a little on the test set, but here it actually improved a little.
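Reading the answer off the score lists can be automated with a tiny helper. This is a minimal sketch; the function name smallest_k_reaching is made up, and the scores are simply the numbers printed above.

```python
def smallest_k_reaching(ks, scores, threshold=0.95):
    """Return the first k (in the order given) whose score reaches
    `threshold`, or None if no k does."""
    for k, score in zip(ks, scores):
        if score >= threshold:
            return k
    return None

ks = [1, 2, 3, 4]
perfs_train = [0.3236945680665101, 0.937291711378812, 0.995459457205891, 0.9974027602663655]
perfs_test = [0.26186701106012833, 0.9590555069007827, 0.9941424954209546, 0.9968456603914185]

print(smallest_k_reaching(ks, perfs_train))  # 3
print(smallest_k_reaching(ks, perfs_test))   # 2
```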
