Why is the variance of data normalized by sklearn not equal to 1? - python-3.x

I'm using preprocessing from package sklearn to normalize data as follows:
import pandas as pd
import urllib3
from sklearn import preprocessing
decathlon = pd.read_csv("https://raw.githubusercontent.com/leanhdung1994/Deep-Learning/main/decathlon.txt", sep='\t')
decathlon.describe()
nor_df = decathlon.copy()
nor_df.iloc[:, 0:10] = preprocessing.scale(decathlon.iloc[:, 0:10])
nor_df.describe()
The result is
The mean is -1.516402e-16, which is almost 0. The variance, on the other hand, is 1.012423e+00, i.e. 1.012423, which I would not consider close to 1.
Could you please elaborate on this phenomenon?

In this instance sklearn and pandas calculate std differently.
sklearn.preprocessing.scale:
We use a biased estimator for the standard deviation, equivalent to
numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to
affect model performance.
pandas.DataFrame.describe uses pandas.core.series.Series.std where:
Normalized by N-1 by default. This can be changed using the ddof argument
...
ddof : int, default 1
Delta Degrees of Freedom. The divisor used in calculations is N - ddof,
where N represents the number of elements.
Note that, as of 2020-10-28, pandas.DataFrame.describe does not have a ddof parameter, so the default of ddof=1 is always used for Series.
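The difference between the two conventions can be checked with a minimal sketch. It uses synthetic data rather than the decathlon file, and it standardizes by hand with ddof=0 to mimic what the quoted scale() documentation describes. The sample size n = 41 is an assumption, chosen only because sqrt(41/40) ≈ 1.012423 matches the figure in the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 41  # assumed sample size, for illustration only
x = pd.Series(rng.normal(loc=10.0, scale=3.0, size=n))

# Standardize with the biased (population) std, i.e. ddof=0,
# which is what the quoted scale() docs say sklearn uses.
z = (x - x.mean()) / x.std(ddof=0)

std_biased = z.std(ddof=0)         # divisor N: exactly 1 up to rounding
std_unbiased = z.std()             # pandas default ddof=1, divisor N - 1
std_described = z.describe()["std"]  # describe() reports the ddof=1 value
expected = np.sqrt(n / (n - 1))    # ratio between the two conventions

print(std_biased, std_unbiased, expected)
```

The std reported by describe() is therefore larger than 1 by a factor of sqrt(N / (N - 1)), which shrinks toward 1 as the sample grows.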

Related

Causal Inference where the treatment assignment is randomised

I have mostly worked with observational data where the treatment assignment was not randomized. In the past, I have used PSM and IPTW to balance the groups and then calculate the ATE.
My problem is:
Now I am working on a problem where the treatment assignment is randomized, meaning there won't be a confounding effect. But the treatment and control groups have different sizes: there is a bucket imbalance.
Should I just analyze the data as it is and run statistical significance and statistical power tests?
Or should I first balance the sizes of the treatment and control groups, using, say, covariate matching, and then run significance tests?
In general, you don't need equal group sizes to estimate treatment effects.
Unequal groups will not bias the estimate; they only affect its variance, namely by reducing precision (recall that statistical power is limited largely by the smaller group, so unequal groups are less sample-efficient, but not categorically wrong).
You can further convince yourself with a simple simulation (code below). It shows that, over repeated draws, the estimate is not biased (both distributions are centered on the true effect), but equal groups give better precision (a smaller standard error).
import statsmodels.api as sm
import numpy as np
import pandas as pd
import seaborn as sns

n_trials = 100
# (control size, treated size) for the balanced and unbalanced designs
balanced = {
    True: (100, 100),
    False: (190, 10),
}
effect = 2.0
res = []
for i in range(n_trials):
    np.random.seed(i)
    # Both designs have 200 units in total, so they share one noise draw
    noise = np.random.normal(size=sum(balanced[True]))
    for is_balanced, ratio in balanced.items():
        t = np.array([0]*ratio[0] + [1]*ratio[1])
        y = effect * t + noise
        # No intercept: the control mean is 0 by construction
        m = sm.OLS(y, t).fit()
        res.append((is_balanced, m.params[0], m.bse[0]))
res = pd.DataFrame(res, columns=["is_balanced", "beta", "se"])

g = sns.jointplot(
    x="se", y="beta",
    hue="is_balanced",
    data=res
)
# Annotate the true effect on the joint axes:
g.ax_joint.axhline(y=effect, color='grey', linestyle='--')
g.ax_joint.text(x=res["se"].max(), y=effect, s="True effect")
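The loss of precision can also be read off analytically. The sketch below assumes the effect is estimated as a plain difference in group means with a common outcome standard deviation sigma, in which case the textbook standard error is sigma * sqrt(1/n_control + 1/n_treated). The helper name diff_in_means_se is hypothetical, and the group sizes are the ones used in the simulation.

```python
import math

def diff_in_means_se(n_control, n_treated, sigma=1.0):
    """Standard error of a difference in means with common s.d. sigma."""
    return sigma * math.sqrt(1.0 / n_control + 1.0 / n_treated)

# Same total sample size (200), different allocation
se_balanced = diff_in_means_se(100, 100)
se_unbalanced = diff_in_means_se(190, 10)

print(round(se_balanced, 3), round(se_unbalanced, 3))
```

For a fixed total sample size the expression is minimized by an equal split, so imbalance costs precision without introducing bias.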

Is it possible to describe with 1 parameter when a wave is sinusoidal or square in Python?

I am using scipy and I managed to filter the data with the fft package by cutting the high frequencies, but that is only useful for transforming the data. What I want instead is to get just 1 parameter out of the analysis.
Let's have a look at some simple code to explain what I mean:
from scipy import fftpack
import numpy as np
import pandas as pd
from scipy import signal
t = np.linspace(0, 2*np.pi, 100, endpoint=True)
sq1 = signal.square(np.pi*t)
sin1 = np.sin(np.pi*t)
fft_sq1 = fftpack.dct(sq1,norm="ortho")
fft_sin1 = fftpack.dct(sin1, norm="ortho")
After applying the discrete cosine transform (DCT) I get fft_sq1 and fft_sin1, which are arrays 100 elements long. By manipulating those coefficients, I can later use fftpack.idct() to obtain a curve that does not contain noise.
The problem with this is that I get too many frequencies: I get 100 parameters that I have to filter, and after that I just get the curve back.
Instead of that I am interested in a filter that returns me just 1 value:
0 if the curve is completely square
1 if the curve is exactly like a sinusoid
Does anything come to mind?
Obviously there are infinitely many curves in between: the flatter the periodic signal, the closer the number should be to 0, and the rounder the curve, the closer it should be to 1.
Thanks!!

What is the meaning of "value" in a node in sklearn decisiontree plot_tree

I plotted my sklearn decision tree using the plot_tree function. The nodes have the following structure:
But I don't understand what value = [2417, 1059] means. Other nodes have other values. Thanks for explaining.
DecisionTreeClassifier:
value in a DecisionTreeClassifier is the per-class split of the samples in each node.
Keep in mind it might also be weighted if you set class_weight on the classifier or passed sample_weight to fit().
For example:
cw={0: 0.6495288248337029, 1: 2.1719184430027805}
Taking the true node, your true class split is calculated as:
>>> [3819.229 / cw[0], 1216.274 / cw[1]]
[5880, 560]
And if it's not clear, your criterion is calculated on the weighted split:
>>> import math
>>> a, b = 3819.229, 1216.274
>>> ab = a + b
>>> (-(a / ab)*math.log2(a / ab)) - ((b / ab)*math.log2(b / ab))
0.7975914228753467
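The two calculations above can be combined into a short sketch. The class weights and weighted node values are the ones quoted in this answer; the helper names raw_counts and entropy are hypothetical and not part of sklearn.

```python
import math

# Class weights and weighted node values quoted in the answer above
class_weights = {0: 0.6495288248337029, 1: 2.1719184430027805}
weighted_value = [3819.229, 1216.274]

def raw_counts(weighted, weights):
    """Undo class weighting to recover the per-class sample counts."""
    return [round(w / weights[cls]) for cls, w in enumerate(weighted)]

def entropy(value):
    """Shannon entropy (base 2) of a node's class distribution."""
    total = sum(value)
    return -sum((v / total) * math.log2(v / total) for v in value if v > 0)

counts = raw_counts(weighted_value, class_weights)
node_entropy = entropy(weighted_value)  # computed on the weighted split

print(counts, round(node_entropy, 4))
```

Computing the entropy on the unweighted counts instead would give a different number, which is why the impurity shown in a weighted tree can look inconsistent with the raw class counts.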
DecisionTreeRegressor:
value in a DecisionTreeRegressor is the value that the tree would predict for a new example falling in that node. If your criterion is MSE, you'll find that value is the mean of the target over the samples in that node.
For example:
(Data: Seaborn's "dots" example set.)
A depth-1 regressor tree fitted on coherence to predict firing_rate. It's not a very useful tree, but it illustrates the idea.
Taking the true node, value is calculated as:
>>> value = data[data.coherence <= 19.2].firing_rate.mean()
>>> value
40.48326118418657
squared_error for that node is:
>>> ((data[data.coherence <= 19.2].firing_rate - value)**2).mean()
134.6504380931471
It indicates the number of samples of each class that reach that node.
For example, your picture shows that before splitting on "hops<=5" you have 2417 samples of class 0 and 1059 samples of class 1.
Note that if you sum these two values, you obtain the same number (3476) as the "samples" field.
If the tree works well, you will see the data being separated better at every step. In a final leaf you will see clear-cut values like [300, 2], and you can then say that almost all samples in that leaf are class 0.

unbiased variance in Theano

In numpy we can set ddof=1 to get the unbiased variance. How is this done in theano?
I've looked at this page, and it seems the theano.tensor.var function does not support such an option.
theano.tensor.var returns the biased sample variance. I'm not aware of a builtin function that returns the unbiased sample variance, but you can obtain it as follows:
Given a vector x, use Theano's builtin var(), but change the 1/n divisor to 1/(n-1):
v = x.var() * x.size / (x.size - 1)
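The rescaling can be sanity-checked outside Theano. The sketch below uses NumPy as a stand-in, on the assumption that x.var() with no arguments divides by n in the same way the biased Theano variance does, and compares the rescaled value with NumPy's own ddof=1 result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=25)  # arbitrary example vector

biased = x.var()                             # divisor n
rescaled = biased * x.size / (x.size - 1)    # same correction as above
unbiased = x.var(ddof=1)                     # divisor n - 1

print(np.isclose(rescaled, unbiased))
```

The correction factor n / (n - 1) only applies when the variance is taken over all n elements; for a variance along one axis, n would be the length of that axis.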

theano: gradient where cost is imag(x)

If I have a cost that is the imaginary part of a complex number, trying to obtain the gradient with theano I get the following error:
TypeError: Elemwise{imag,no_inplace}.grad illegally returned an integer-valued variable. (Input index 0, dtype complex128)
Is it not possible to use the imaginary part as cost despite it being a real-valued cost?
Edit. Minimal working example
import theano.tensor as T
from theano import function
a = T.zscalar('a')
f = function([a], T.grad(T.imag(a),a))
I would expect this to work, as T.imag(a) is a real scalar cost.
