PCA with sklearn discrepancies - scikit-learn

I am trying to apply a PCA in a very specific context and ran into a behavior that I cannot explain.
As a test I am running the following code with the file data that you can retrieve here: https://www.dropbox.com/s/vdnvxhmvbnssr34/test.npy?dl=0 (numpy array format).
from sklearn.decomposition import PCA
import numpy as np
test = np.load('test.npy')
pca = PCA()
X_proj = pca.fit_transform(test) ### Project in the basis of eigenvectors
proj = pca.inverse_transform(X_proj) ### Reconstruct vector
My issue is the following: because I do not specify a number of components, I should be reconstructing with all the computed components. I therefore expect my output proj to be the same as my input test. But a quick plot shows this is not the case:
import matplotlib.pyplot as plt
plt.figure()
plt.plot(test[0]-proj[0])
plt.show()
The plot here will show some large discrepancies between projection and the input matrix.
Does anyone have an idea or explanation to help me understand why proj is different from test in my case?

I checked your test data and found the following:
mean = test.mean() # 1.9545972004854737e+24
std = test.std() # 9.610595443778275e+26
I interpret the standard deviation to represent, in some sense, the least count or the uncertainty in the values that are reported. By that I mean that if a numerical algorithm reports the answer to be a, then the real answer should be in the interval [a - std, a + std]. This is because numerical algorithms are imprecise by their very nature. They depend on floating-point operations, which can't represent real numbers in all their glory.
So if I plot:
plt.plot((test[0]-proj[0])/std)
plt.show()
I get the following plot which seems more reasonable.
You may be interested in plotting relative errors as well. Alternately, you can normalize your data to have 0 mean and unit variance and then the PCA results should be more accurate.
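As a rough illustration of the point about scale, here is a minimal sketch that runs the same round trip on synthetic data. The array X is a hypothetical stand-in for test.npy (random values on the order of 1e26 to 1e27), not the original file, and svd_solver="full" is chosen only to keep the run deterministic. It compares the absolute reconstruction error with the error relative to the data's magnitude, and repeats the round trip on standardized data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for test.npy: very large values, as in the question.
X = rng.normal(loc=2e24, scale=1e27, size=(200, 10))

# Round trip through the full set of principal components.
pca = PCA(svd_solver="full")
X_rec = pca.inverse_transform(pca.fit_transform(X))

abs_err = np.abs(X - X_rec).max()      # looks huge in absolute terms
rel_err = abs_err / np.abs(X).max()    # tiny relative to the data's scale

# Same round trip after scaling to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
pca_std = PCA(svd_solver="full")
X_std_rec = pca_std.inverse_transform(pca_std.fit_transform(X_std))
std_err = np.abs(X_std - X_std_rec).max()

print("max absolute error:", abs_err)
print("max relative error:", rel_err)
print("max error on standardized data:", std_err)
```

If the relative error is near machine precision, the discrepancy in the plot is most likely floating-point round-off at the data's scale rather than a problem with the PCA itself.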

Related

Create random Numpy array following a given distribution and trend

I want to create data that follows the same distribution and trend of the sample data taken using numpy.
For example, say I have an array x whose trend is increasing and whose distribution is, say, log-normal. Can I create another random array that follows the same distribution and trend using numpy?
Well, numpy doesn't have the capability to fit distributions to your data. You can either do it manually using the method you like (MLE or MM), or you can use scipy, which can fit distributions to your data as shown below:
import scipy.stats as st
# Inferred parameters of the distribution
s, loc, scale = st.lognorm.fit(x)
# Distribution object
dist = st.lognorm(s, loc, scale)
# generate 1000 random samples
samples = dist.rvs(size=1000)
Scipy uses MLE by default.
You will have to explore your data and look into the distributions that fit the best. Numpy or scipy can't do that for you.
Documentation of fit method: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.fit.html
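One possible way to "look into the distributions that fit the best" is to fit a few candidates and compare a goodness-of-fit statistic. The sketch below is only an illustration: the sample x is synthetic, the candidate list is arbitrary, and the Kolmogorov-Smirnov statistic is one choice among several (it is also optimistic when the parameters were fitted on the same data).

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic sample standing in for the real data (assumption).
x = rng.lognormal(mean=1.0, sigma=0.8, size=2000)

# Arbitrary candidate distributions, keyed by their scipy.stats names.
candidates = {"lognorm": st.lognorm, "norm": st.norm, "expon": st.expon}

ks = {}
for name, dist in candidates.items():
    params = dist.fit(x)  # maximum likelihood fit
    # KS statistic between the sample and the fitted distribution's CDF.
    ks[name] = st.kstest(x, name, args=params).statistic

best = min(ks, key=ks.get)  # smallest KS statistic = closest fit
print(ks)
print("best candidate:", best)
```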

Evaluate an idea of generating noise based on standard deviation

I generate synthetic dataset using this method:
import numpy as np
import random
def generate_dataset(size, dim):
    dataset = [random.randint(0, 2 ** dim) for _ in range(size)]
    # Removes duplicates
    dataset = list(set(dataset))
    return dataset
As you can see, the data points are generated randomly from the range [0, 2^dim]. For any dataset generated by this method, I want to add noise to it. Now, I am thinking of a simple way to do so, but I am not sure if it is logically correct, so here it is:
1. Find the standard deviation of data points from the generated dataset.
2. Generate new data points that are NOT within this standard deviation.
3. Add them to your original dataset, and shuffle.
Is this way of generating noise sound?
Thank you.
It seems like you are creating outliers. Noise, to me, is more like adding a small number (+/-) to the data points. For example, how many steps did you walk today? It could be 100, but some tracking device might read 95 or 110. That difference is noise.
Not sure if this helps.
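To make the distinction concrete, here is a small sketch contrasting the two ideas. The noise level (1% of the standard deviation) and the outlier range (3 to 5 standard deviations from the mean) are arbitrary assumptions chosen for illustration, not recommendations.

```python
import random
import numpy as np

random.seed(0)
rng = np.random.default_rng(0)

dim = 10
# Same generation scheme as in the question, with duplicates removed.
data = np.array(sorted({random.randint(0, 2 ** dim) for _ in range(500)}),
                dtype=float)

# Noise: a small random perturbation added to every existing point.
sigma = 0.01 * data.std()  # assumed noise level
noisy = data + rng.normal(0.0, sigma, size=data.shape)

# Outliers: new points placed far from the bulk of the data.
signs = rng.choice([-1, 1], size=20)
outliers = data.mean() + signs * rng.uniform(3, 5, size=20) * data.std()

print("largest perturbation:", np.abs(noisy - data).max())
print("closest outlier to the mean (in std units):",
      (np.abs(outliers - data.mean()) / data.std()).min())
```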

Neural network regression with skewed data

I have been trying to build a machine learning model using Keras which predicts the radiation dose based on pre-treatment parameters. My dataset has approximately 2200 samples of which 20% goes into validation and testing.
The problem with the target variable is that it is very skewed since large radiation doses are much more rare than the small ones. Hence, I suspect that my regression model fails to predict the large values at all, and predicts everything around the mean, which is apparent from the figure. I have tried to log-normalise the target variable to make it more normally distributed, but it has had no effect.
Any suggestion how to fix this?
[Figure: Target variable]
[Figure: Regression predictions]
Computing individual sample weights based on 10 histogram bins helped in my case. See the code below:
import pandas as pd
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight
hist, bin_edges = np.histogram(training_targets, bins = 10)
classes = training_targets.apply(lambda x: pd.cut(x, bin_edges, labels=False,
                                                  include_lowest=True)).values
sample_weights = compute_sample_weight('balanced', classes)
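For readers who want to see what these weights look like, here is a self-contained sketch of the same binning idea on synthetic data. The skewed targets Series is hypothetical, and pd.cut is applied to the Series directly rather than through apply. The resulting array would typically be passed to the model's training call as per-sample weights.

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
# Hypothetical skewed target: many small values, few large ones.
targets = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

# Ten equal-width bins over the target range.
bin_edges = np.histogram_bin_edges(targets, bins=10)

# Integer bin index per sample; include_lowest keeps the minimum in bin 0.
classes = pd.cut(targets, bin_edges, labels=False, include_lowest=True)

# 'balanced' weights are inversely proportional to each bin's population.
sample_weights = compute_sample_weight("balanced", classes.to_numpy())

print("weight of the smallest target:", sample_weights[targets.argmin()])
print("weight of the largest target:", sample_weights[targets.argmax()])
```

Samples in sparsely populated bins (the large targets here) receive larger weights, which pushes the loss to pay more attention to them.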

Binary classifier too confident to plot ROC curve with sklearn?

I have created a binary classifier in Tensorflow that will output a generator object containing predictions. I extract the predictions (e.g. [0.98, 0.02]) from the object into a list, later converting this into a numpy array. I have the corresponding array of labels for these predictions. Using these two arrays I believe I should be able to plot a ROC curve via:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve
fpr, tpr, thr = roc_curve(labels, predictions[:,1])
plt.plot(fpr, tpr)
plt.show()
print(fpr)
print(tpr)
print(thr)
Where predictions[:,1] gives the positive prediction score. However, running this code leads to only a flat line and only three values for each fpr, tpr and thr:
Flat line roc plot and limited function outputs.
The only theory I have as to why this is happening is that my classifier is too sure of its predictions. Many, if not all, of the positive prediction scores are 1.0, or incredibly close to zero:
[[9.9999976e-01 2.8635742e-07]
[3.3693312e-11 1.0000000e+00]
[1.0000000e+00 9.8642090e-09]
...
[1.0106111e-15 1.0000000e+00]
[1.0000000e+00 1.0030269e-09]
[8.6156778e-15 1.0000000e+00]]
According to a few sources including this stackoverflow thread and this stackoverflow thread, the very polar values of my predictions could be creating an issue for roc_curve().
Is my intuition correct? If so is there anything I can do about it to plot my roc_curve?
I've tried to include all the information I think would be relevant to this issue but if you would like any more information about my program please ask.
ROC is generated by changing the threshold on your predictions and finding the sensitivity and specificity for each threshold. This generally means that as you increase the threshold, your sensitivity decreases but your specificity increases and it draws a picture of the overall quality of your predicted probabilities. In your case, since everything is either 0 or 1 (or very close to it) there are no meaningful thresholds to use. That's why the thr value is basically [ 1, 1, 1 ].
You can try to arbitrarily pull the values closer to 0.5 or alternatively implement your own ROC curve calculation with more tolerance for small differences.
On the other hand you might want to review your network because such result values often mean there is a problem there, maybe the labels leaked into the network somehow and therefore it produces perfect results.
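The effect of saturated scores can be reproduced with a toy example. The labels and scores below are made up for illustration. When every score is exactly 0 or 1 there are only two distinct values to threshold on, so roc_curve returns very few points; graded scores produce many more. drop_intermediate=False is passed in the second call only so that no points are pruned.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels with saturated scores: only two distinct score values.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
saturated = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0])

fpr_s, tpr_s, thr_s = roc_curve(labels, saturated)
print("points with saturated scores:", len(thr_s))

# Graded scores: positives tend to score higher, but values are spread out.
rng = np.random.default_rng(0)
labels_g = rng.integers(0, 2, size=200)
graded = np.clip(0.5 * labels_g + rng.normal(0.25, 0.2, size=200), 0.0, 1.0)

fpr_g, tpr_g, thr_g = roc_curve(labels_g, graded, drop_intermediate=False)
print("points with graded scores:", len(thr_g))
```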

Different Linear Regression Coefficients with statsmodels and sklearn

I was planning to use sklearn linear_model to plot a graph of the linear regression result, and statsmodels.api to get a detailed summary of the learning result. However, the two packages produce very different results on the same input.
For example, the constant term from sklearn is 7.8e-14, but the constant term from statsmodels is 48.6. (I added a column of 1's in x for the constant term when using both methods.) My code for both methods is succinct:
import statsmodels.api as sm
from sklearn import linear_model
# Use statsmodels linear regression to get a result (summary) for the model.
def reg_statsmodels(y, x):
    results = sm.OLS(y, x).fit()
    return results
# Use sklearn linear regression to compute the coefficients for the prediction.
def reg_sklearn(y, x):
    lr = linear_model.LinearRegression()
    lr.fit(x, y)
    return lr.coef_
The input is too complicated to post here. Is it possible that a singular input x caused this problem?
By making a 3-d plot using PCA, it seems that the sklearn result is not a good approximation. What are some explanations? I still want to make a visualization, so it will be very helpful to fix the issues in the sklearn linear regression implementation.
You say that
I added a column of 1's in x for constant term when using both methods
But the documentation of LinearRegression says that
LinearRegression(fit_intercept=True, [...])
it fits an intercept by default. This could explain why you have the differences in the constant term.
Now for the other coefficients, differences can occur when two of the variables are highly correlated. Let's consider the most extreme case where two of your columns are identical. Then reducing the coefficient in front of any of the two can be compensated by increasing the other. This is the first thing I'd check.
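The intercept explanation can be checked with a small sketch on synthetic data. The data, the true coefficients, and the use of np.linalg.lstsq as a plain OLS reference (in place of statsmodels) are assumptions made for illustration. With a column of ones in x and fit_intercept left at its default, the constant ends up in intercept_ and the coefficient on the ones column is essentially zero; with fit_intercept=False the coefficients match plain OLS.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# Synthetic target with a known constant term of 48.6 (assumption).
y = 48.6 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Design matrix with an explicit column of ones, as in the question.
X1 = np.column_stack([np.ones(len(X)), X])

# Default: sklearn fits its own intercept on top of the ones column.
lr_default = LinearRegression().fit(X1, y)

# No separate intercept: the ones column carries the constant term.
lr_noint = LinearRegression(fit_intercept=False).fit(X1, y)

# Plain least squares on the same design matrix, for reference.
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("default coef_:     ", lr_default.coef_)
print("default intercept_:", lr_default.intercept_)
print("no-intercept coef_:", lr_noint.coef_)
print("lstsq solution:    ", beta)
```

So when comparing against sm.OLS with an explicit constant column, either pass fit_intercept=False to LinearRegression or read the constant from intercept_ rather than coef_.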
