The relationship between the sample size and the dimension of the parameter in SCAD (model selection)

I was wondering: is there an explicit relationship between the sample size n and the dimension of the parameter d in the SCAD penalty of Fan & Li?
An explicit relationship would be perfect.

Related

WeightedRandomSampler with multi-dimensional batch

I'm working on a classification problem (100 classes) and my dataset has a huge class imbalance. To tackle this, I'm considering using torch's WeightedRandomSampler to oversample the minority classes. I took help from this post, which seemed pretty straightforward. My only concern is the nature of my dataset.
In my case, each sample (1 point in a batch) contains 8 points, and each of these 8 points has one true class out of the 100 classes. So my output shape is (bs x 8). Hence, the final weight variable has length total_dataset_length*8.
Here's my implementation:
import numpy as np
from sklearn.utils import class_weight
from torch.utils.data import WeightedRandomSampler

y_org = np.load('target.npy')  # shape: (5000, 8)
samples_per_class = np.unique(y_org.ravel(), return_counts=True)[1]
class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                  classes=np.unique(y_org.ravel()),
                                                  y=y_org.ravel())
weights = class_weights[y_org.ravel()]  # one weight per point, i.e. 5000 * 8 entries
sampler = WeightedRandomSampler(weights, len(y_org.ravel()), replacement=True)
To count the number of occurrences of each class index, I have to unroll (ravel) the ground-truth array along the first dimension. Since the final weight variable has length total_dataset_length*8, this causes indexing errors during loading:
IndexError: list index out of range
How can I use WeightedRandomSampler in such cases?
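One possible direction (a sketch, not an answer from the original thread): WeightedRandomSampler draws dataset indices, so it expects exactly one weight per sample (5000 here), not one per point (5000 * 8). A simple workaround is to collapse the 8 per-point class weights of each sample into a single per-sample weight, for example by averaging:
import numpy as np
from torch.utils.data import WeightedRandomSampler

# Sketch only: class_weights stands in for the balanced per-class weights
# computed above, and y_org is the (5000, 8) target array from the question.
y_org = np.load('target.npy')                          # shape: (5000, 8)
class_weights = np.ones(100)                           # placeholder for the balanced weights
per_point_weights = class_weights[y_org]               # shape: (5000, 8)
per_sample_weights = per_point_weights.mean(axis=1)    # shape: (5000,)

sampler = WeightedRandomSampler(per_sample_weights,
                                num_samples=len(per_sample_weights),
                                replacement=True)
Whether averaging, summing, or taking the maximum of the per-point weights is most appropriate depends on how the 8 points within a sample should influence its sampling probability.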

Discretizing a PDE in space for use with Modelica

I am currently taking a course called "Modeling of Dynamic Systems" and have been given the task of modeling a warm water tank in Modelica with a distributed temperature description.
Most of the tasks have gone well, but my group is stuck on introducing the heat flux due to buoyancy effects into the model.
The equation given is this:
(image of the given PDE)
But how do we discretize this into something we can use in Modelica?
The discretized version we ended up with was this:
(Qd_pp_b[k+1] - Qd_pp_b[k]) / h_dz = -K_b *(T[k+1] - 2 * T[k] + T[k-1]) / h_dz^2
where Qd_pp_b is the left-hand-side variable, i.e. the heat flux, k is the current slice of the tank, and T is the temperature in the slices.
Are we on the right path, or completely wrong?
This doesn't seem to be a differential equation as it stands, so it does not make sense without the surrounding problem. For the second derivative you should always create auxiliary variables, with a separate equation for each partial derivative. I added dummy values for the parameters and dummy equations for T[k]. This can be simulated; is this about what you expected?
model test
  constant Integer n = 10;
  Real[n] Qd_pp_b "heat flux per slice";
  Real[n] dT;
  Real[n] T "temperature per slice";
  parameter Real K_b = 1 "dummy parameter value";
equation
  for k in 1:n loop
    der(Qd_pp_b[k]) = -K_b*der(dT[k]);
    der(T[k]) = dT[k];
    T[k] = sin(time + k); // dummy equation for T[k]
  end for;
end test;
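As a quick numerical sanity check of the question's stencil (a sketch with dummy values, not part of the original thread), the same central-difference discretization can also be evaluated directly in Python with NumPy; the names mirror the question (h_dz, K_b, T, Qd_pp_b) but all values are placeholders:
import numpy as np

n = 10                       # number of tank slices
h_dz = 0.1                   # slice height
K_b = 1.0                    # dummy buoyancy coefficient
T = np.sin(np.arange(n))     # dummy temperature profile per slice

# central-difference approximation of d2T/dz2 in the interior slices k = 1..n-2
d2T_dz2 = (T[2:] - 2*T[1:-1] + T[:-2]) / h_dz**2

# step (Qd_pp_b[k+1] - Qd_pp_b[k]) / h_dz = -K_b * d2T/dz2 slice by slice,
# taking Qd_pp_b[0] = 0 as an arbitrary boundary reference
Qd_pp_b = np.zeros(n)
for k in range(1, n - 1):
    Qd_pp_b[k + 1] = Qd_pp_b[k] - K_b * d2T_dz2[k - 1] * h_dz
print(Qd_pp_b)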

SARIMAX - Summary table coefficient signs are reversed when calling them

I've fit a SARIMAX model using statsmodels as follows
mod = sm.tsa.statespace.SARIMAX(ratingCountsRSint, order=(2, 0, 0),
                                seasonal_order=(1, 0, 0, 52),
                                enforce_stationarity=False,
                                enforce_invertibility=False, freq='W')
results = mod.fit()
print(results.summary().tables[1])
In the results table I have a coefficient ar.S.L52 that shows as 0.0163. When I try to retrieve the coefficient using
seasonalAR=results.polynomial_seasonal_ar[52]
I get -0.0163. I'm wondering why the sign has turned around. The same thing happens with polynomial_ar. In the documentation it says that polynomial_seasonal_ar gives the "array containing seasonal autoregressive lag polynomial coefficients". I would have guessed that I should get exactly the same as in the summary table. Could someone clarify how that comes about and whether the actual coefficient of the lag is positive or negative?
I'll use an AR(1) model as an example, but the same principle applies to a seasonal model.
We typically write the AR(1) model as:
y_t = \phi_1 y_{t-1} + \varepsilon_t
The parameter estimated by Statsmodels is \phi_1, and that is what is presented in the summary table.
When writing the AR(1) model in lag-polynomial form, we usually write it like:
\phi(L) y_t = \varepsilon_t
where \phi(L) = 1 - \phi_1 L and L is the lag operator. The coefficients of this lag polynomial are (1, -\phi_1). These coefficients are what are presented in the polynomial attributes of the results object.
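To make the sign convention concrete, here is a minimal sketch (simulated data, not from the original post) that fits an AR(1) with \phi_1 = 0.5 and compares the summary-table coefficient with the lag-polynomial entry:
import numpy as np
import statsmodels.api as sm

# simulate an AR(1) series y_t = 0.5 * y_{t-1} + eps_t
np.random.seed(0)
n, phi = 500, 0.5
eps = np.random.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]

res = sm.tsa.statespace.SARIMAX(y, order=(1, 0, 0)).fit(disp=False)

print(res.params[0])       # AR coefficient, close to +0.5 (what the summary table reports)
print(res.polynomial_ar)   # approx. [1., -0.5]: the coefficients of phi(L) = 1 - phi_1 * L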

What are the parameters to the kernel function in an SVM?

I'm trying to understand kernel functions, particularly the Gaussian/RBF kernel K(a, b) = exp(-gamma * ||a - b||**2).
As I understand, this is computing a similarity measure for vectors a and b in part using euclidean distance. My question isn't about the specifics of this kernel, though.
What I don't understand: what are vectors a and b when you use this kernel in an SVM?
SVM is a supervised learning algorithm, so there will be a training phase and a testing phase in which you use a sample of collected data.
A sample of data used for training is usually denoted {x_i, y_i}, where the x_i are the real-valued attributes of each datum and the y_i are the corresponding labels (see the Wikipedia SVM page, section "Linear SVM", for example).
For each kernel evaluation K(a, b), the values a and b are the x_i and x_j of the data you have.
In the testing phase you will have only the set {x_i} and you want to estimate the corresponding y. In this case too, a and b are the x_i and x_j of the data you have.
EDIT
K(a, b) is calculated for every pair (a, b) = (x_i, x_j), varying i and j. The kernel represents a dot product (the kernel trick), defined on the feature space by the so-called feature map phi.
The SVM needs the dot products of all the pairs because the hinge-loss objective involves a sum over i and j of all the dot products (that is, of all the K(x_i, x_j)).
For example, if you have the set {x_i} = {x_1, x_2} you need
K(x_1, x_1), K(x_1, x_2), K(x_2, x_1), K(x_2, x_2)
(Every kernel satisfies K(a, b) = K(b, a), since it is a dot product and therefore symmetric, so in the end you don't actually need K(x_2, x_1).)
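As a small illustration (synthetic vectors, not from the original answer), here is the Gram matrix of the RBF kernel over a two-sample training set, showing that the kernel is evaluated on every pair (x_i, x_j) and is symmetric:
import numpy as np

gamma = 0.5
X = np.array([[0.0, 1.0],     # x_1
              [2.0, 0.5]])    # x_2

def rbf_kernel(a, b, gamma):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Gram matrix: K[i, j] = K(x_i, x_j); symmetric because K(a, b) = K(b, a)
K = np.array([[rbf_kernel(xi, xj, gamma) for xj in X] for xi in X])
print(K)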

Principal Component Analysis (PCA) explained variance remains the same after changing dataframe column position

I have a dataframe where A and B are used to predict C:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.decomposition import PCA

df = df[['A','B','C']]
array = df.values
X = array[:, 0:-1]   # features A and B
Y = array[:, -1]     # target C

# Feature Importance
model = GradientBoostingClassifier()
model.fit(X, Y)
print("Importance:")
print(model.feature_importances_ * 100)

# PCA
pca = PCA(n_components=len(df.columns) - 1)
fit = pca.fit(X)
print("Explained Variance")
print(fit.explained_variance_ratio_)
This prints
Importance:
[ 53.37975706 46.62024294]
Explained Variance
[ 0.98358394 0.01641606]
However, when I change the dataframe column order, swapping A and B, only the importance changes; the explained variance remains the same. Why did the explained variance not change to [0.01641606 0.98358394]?
df = df[['B','A','C']]
Importance:
[ 46.40771024 53.59228976]
Explained Variance
[ 0.98358394 0.01641606]
Explained variance does not refer to A or B or to any column of your dataframe. It refers to the principal components identified by the PCA, which are linear combinations of the columns. These components are sorted in order of decreasing variance, as the documentation says:
components_ : array, shape (n_components, n_features)
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
explained_variance_ : array, shape (n_components,)
The amount of variance explained by each of the selected components.
Equal to n_components largest eigenvalues of the covariance matrix of X.
explained_variance_ratio_ : array, shape (n_components,)
Percentage of variance explained by each of the selected components.
So the order of the features does not affect the order of the components returned. It does affect the components_ array, which is a matrix that can be used to map the principal components back to the feature space.
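A minimal sketch (synthetic data, not the asker's dataframe) showing the same effect: swapping the feature columns leaves explained_variance_ratio_ unchanged and only permutes the columns of components_:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
A = rng.normal(scale=5.0, size=200)      # high-variance feature
B = rng.normal(scale=0.5, size=200)      # low-variance feature

pca_ab = PCA(n_components=2).fit(np.column_stack([A, B]))
pca_ba = PCA(n_components=2).fit(np.column_stack([B, A]))

print(pca_ab.explained_variance_ratio_)  # e.g. [0.99 0.01]
print(pca_ba.explained_variance_ratio_)  # identical: ratios belong to components, not columns
print(pca_ab.components_)                # loadings differ only by column order
print(pca_ba.components_)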
