Please may someone review this code, I cannot seem to use this as the fit
function does not exist?
import torch
import torch.distributions as dist
print(torch.cuda.is_available())
# Sample data
data = torch.randn(1000)
# Fit the data to a log-normal distribution
mu, std = dist.Normal(0, 1).fit(torch.log(data))
# Create a log-normal distribution with the fitted parameters
lognormal = dist.LogNormal(mu, std)
# Get the probability density function (PDF) of the log-normal distribution
pdf = lognormal.log_prob(torch.log(data)).exp()
# Plot the data and the PDF
import matplotlib.pyplot as plt
plt.hist(data, bins=30, density=True)
plt.plot(data, pdf, 'r')
plt.show()
I receive the following error:
AttributeError: 'Normal' object has no attribute 'fit'
Neither the PyTorch Distribution base class nor the Normal distribution offer a fit() method (thus the error).
If you want to fit a normal distribution to given data, you could use the scipy.stats package.
import numpy as np
from scipy.stats import norm
import torch
data = torch.randn(1000)
# ... do other thinngs with the data
data = data.numpy() # torch tensor must be converted to numpy array first
mu, std = norm.fit(data)
# ...
Please also note: By using torch.randn() you generate random numbers from a normal distribution with mean 0 and variance 1.
This means approximately half the numbers are negative. This will cause problems if you take the logarithm of these values in the next step, since logarithms of negative numbers are not defined.
data = torch.randn(1000)
data = torch.log(data) # log(data) produces nan any time a value is less than zero
# --> tensor([ 6.4755e-01, 7.7064e-01, -3.5208e-01, 1.2873e-01, nan,
# nan, -6.5469e-01, 5.8180e-02, nan, nan, ...])
data = data.numpy()
mu, std = norm.fit(data) # this won't work now as the data contains non-finite values.
Related
A bit confused with the probability between categorical target and one-hot encoding target from OneVsRestClassifier of sklean. Using iris data with simple logistic regression as an example. When I use original iris class[0,1,2], the calculated OneVsRestClassifier() probability for each observation will always add up to 1. However, if I converted the target to dummies, this is not the case. I understand that OneVsRestClassifier() compares one vs rest (class 0 vs non class 0, class 1 vs non class 1, etc). It makes more sense that the sum of these probabilities has no relation with 1. Then why I see the difference and how so?
import numpy as np
import pandas as pd
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
np.set_printoptions(suppress=True)
iris = datasets.load_iris()
rng = np.random.RandomState(0)
perm = rng.permutation(iris.target.size)
X = iris.data[perm]
y = iris.target[perm]
# categorical target with no conversion
X_train, y_train1 = X[:80], y[:80]
X_test, y_test1 = X[80:], y[80:]
m3 = LogisticRegression(random_state=0)
clf1 = OneVsRestClassifier(m3).fit(X_train, y_train1)
y_pred1 = clf1.predict(X_test)
print(np.sum(y_pred1 == y_test))
y_prob1 = clf1.predict_proba(X_test)
y_prob1[:5]
#output
array([[0.00014508, 0.17238549, 0.82746943],
[0.03850173, 0.79646817, 0.1650301 ],
[0.73981106, 0.26018067, 0.00000827],
[0.00016332, 0.32231163, 0.67752505],
[0.00029197, 0.2495404 , 0.75016763]])
# one hot encoding for categorical target
y2 = pd.get_dummies(y)
y_train2 = y2[:80]
y_test2 = y2[80:]
clf2 = OneVsRestClassifier(m3).fit(X_train, y_train2)
y_pred2 = clf2.predict(X_test)
y_prob2 = clf2.predict_proba(X_test)
y_prob2[:5]
#output
array([[0.00017194, 0.20430011, 0.98066319],
[0.02152246, 0.44522562, 0.09225181],
[0.96277892, 0.3385952 , 0.00001076],
[0.00023024, 0.45436925, 0.95512082],
[0.00036849, 0.31493725, 0.94676348]])
When you encode the targets, sklearn interprets your problem as a multilabel one instead of just multiclass; that is, that it is possible for a point to have more than one true label. And in that case, it is perfectly acceptable for the total sum of probabilities to be greater (or less) than 1. That's generally true for sklearn, but OneVsRestClassifier calls it out specifically in the docstring:
OneVsRestClassifier can also be used for multilabel classification. To use this feature, provide an indicator matrix for the target y when calling .fit.
As for the first approach, there are indeed three independent models, but the predictions are normalized; see the source code. Indeed, that's the only difference:
(y_prob2 / y_prob2.sum(axis=1)[:, None] == y_prob1).all()
# output
True
It's probably worth pointing out that LogisticRegression also natively supports multiclass. In that case, the weights for each class are independent, so it's similar to three separate models, but the resulting probabilities are the result of a softmax application, and the loss function minimizes the loss for each class simultaneously, so that the resulting coefficients and hence predictions can be different from those obtained from OneVsRestClassifier:
m3.fit(X_train, y_train1)
y_prob0 = m3.predict_proba(X_test)
y_prob0[:5]
# output:
array([[0.00000494, 0.01381671, 0.98617835],
[0.02569699, 0.88835451, 0.0859485 ],
[0.95239985, 0.04759984, 0.00000031],
[0.00001338, 0.04195642, 0.9580302 ],
[0.00002815, 0.04230022, 0.95767163]])
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
digits = datasets.load_digits()
X = digits.data
X = X - X.mean() # centering the data
#### svd
svd = TruncatedSVD(n_components=5)
svd.fit(X)
print(svd.explained_variance_ration)
#### PCA
pca = PCA(n_components=5)
pca.fit(X)
print(pca.explained_variance_ratio_)
svd output is:
array([0.02049911, 0.1489056 , 0.13534811, 0.11738598, 0.08382797])
pca output is:
array([0.14890594, 0.13618771, 0.11794594, 0.08409979, 0.05782415])
is there a bug in the TruncatedSVD implementation? or why is the first explained variance (0.02...) behaving like this? or what is the meaning
Summary:
That is because TruncatedSVD and PCA use different SVD functions!.
Note: Your case is due to Reason 2 below, yet I included another reason for future readers.
Details:
Reason 1: The solver set by user in each algorithm, is different:
PCA internally uses scipy.linalg.svd which sorts singular values, hence the explained_variance_ratio_ is sorted.
Part of Scikit Implementation of PCA:
# Center data
U, S, Vt = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, Vt = svd_flip(U, Vt)
components_ = Vt
# Get variance explained by singular values
explained_variance_ = (S ** 2) / (n_samples - 1)
total_var = explained_variance_.sum()
explained_variance_ratio_ = explained_variance_ / total_var
Screenshot from the above-mentioned scipy.linalg.svd link:
On the other hand, TruncatedSVD uses scipy.sparse.linalg.svds which relies on the ARPACK solver for decomposition.
Screenshot from the above-mentioned scipy.sparse.linalg.svds link:
Reason 2: The TruncatedSVD operates differently compared to PCA:
In your case you chose randomized as a solver (which is set by default) in both algorithms, yet you obtained different results with regards to the order of the variance.
That is because in PCA, the variance is obtained from the actual singular values (called Sigma or S in Scikit-Learn implementation), which are already sorted:
On the other hand, the variance in TruncatedSVD is obtained from X_transformed which results from multiplying the data matrix by the components. The latter does not necessarily preserve order because data are not centered, nor is it the purpose of TruncatedSVD which it is used in first place for sparse matrices:
Now if you center your data, you will get them sorted (note that you did not center data properly, because centering requires dividing by standard deviation):
from sklearn import datasets
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits()
X = digits.data
sc = StandardScaler()
X = sc.fit_transform(X)
### SVD
svd = TruncatedSVD(n_components=5, algorithm='randomized', random_state=2021)
svd.fit(X)
print(svd.explained_variance_ratio_)
Output
[0.12033916 0.09561054 0.08444415 0.06498406 0.04860093]
Important: Further read.
I can obtain the probability for x in a gamma distribution via the following in python
import numpy as np
from scipy.stats import gamma
# Gamma distribution
print(gamma.pdf(x=2, a=2, scale=1, loc=0)) # 0.27067
However, this requires creating a new gamma distribution every time I want to get the probably at a specific x.
I like to create a designated gamma distribution my_gamma_dist, then draw from this time and time again for a specific x value.
import numpy as np
from scipy.stats import gamma
my_gamma_dist = gamma.pdf(x, a=2, scale=1, loc=0)
my_gamma_dist.x = 2 # AttributeError: 'numpy.ndarray' object has no attribute 'x'
print(my_gamma_dist(x=2)) # TypeError: 'numpy.ndarray' object is not callable
I've checked the documentation but I'm still unclear.
You can define the distribution and later call the .pdf method on the values you are looking for, namely:
from scipy.stats import gamma
g = gamma(a=2, scale=1, loc=0)
x=2
g.pdf(x)
0.2706705664732254
x=3
g.pdf(x)
0.14936120510359185
# Or pass a list of values to evaluate
g.pdf([2,3])
array([0.27067057, 0.14936121])
For data with the shape (num_samples,features), MinMaxScaler from sklearn.preprocessing can be used to normalize it easily.
However, when using the same method for time series data with the shape (num_samples, time_steps,features), sklearn will give an error.
from sklearn.preprocessing import MinMaxScaler
import numpy as np
#Making artifical time data
x1 = np.linspace(0,3,4).reshape(-1,1)
x2 = np.linspace(10,13,4).reshape(-1,1)
X1 = np.concatenate((x1*0.1,x2*0.1),axis=1)
X2 = np.concatenate((x1,x2),axis=1)
X = np.stack((X1,X2))
#Trying to normalize
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X) <--- error here
ValueError: Found array with dim 3. MinMaxScaler expected <= 2.
This post suggests something like
(timeseries-timeseries.min())/(timeseries.max()-timeseries.min())
Yet, it only works for data with only 1 feature. Since my data has more than 1 feature, this method doesn't work.
How to normalize time series data with multiple features?
To normalize a 3D tensor of shape (n_samples, timesteps, n_features) use the following:
(timeseries-timeseries.min(axis=2))/(timeseries.max(axis=2)-timeseries.min(axis=2))
Using the argument axis=2 will return the result of the tensor operation performed along the 3rd dimension i.e., the feature axis. Thus each feature will be normalized independently.
I'm using statsmodels.api to compute the statistical parameters for an OLS fit between two variables:
def computeStats(x, y, yName):
'''
Takes as an argument an array, and a string for the array name.
Uses Ordinary Least Squares to compute the statistical parameters for the
array against log(z), and determines the equation for the line of best fit.
Returns the results summary, residuals, statistical parameters in a list, and the
best fit equation.
'''
# Mask NaN values in both axes
mask = ~np.isnan(y) & ~np.isnan(x)
# Compute model parameters
model = sm.OLS(y, sm.add_constant(x), missing= 'drop')
results = model.fit()
residuals = results.resid
# Compute fit parameters
params = stats.linregress(x[mask], y[mask])
fit = params[0]*x + params[1]
fitEquation = '$(%s)=(%.4g \pm %.4g) \\times redshift+%.4g$'%(yName,
params[0], # slope
params[4], # stderr in slope
params[1]) # y-intercept
return results, residuals, params, fit, fitEquation
The second part of the function (using stats.linregress) plays nicely with the masked values, but statsmodels does not. When I try to plot the residuals against the x values with plt.scatter(x, resids), the dimensions do not match:
ValueError: x and y must be the same size
because there are 29007 x-values, and 11763 residuals (that's how many y-values made it through the masking process). I tried changing the model variable to
model = sm.OLS(y[mask], sm.add_constant(x[mask]), missing= 'drop')
but this had no effect.
How can I scatter-plot the residuals against the x-values they match with?
Hi #jim421616 Since statsmodels dropped few missing values, you should use the model's exog variable to plot the scatter as shown.
plt.scatter(model.model.exog[:,1], model.resid)
For reference a complete dummy example
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
#generate data
x = np.random.rand(1000)
y =np.sin( x*25)+0.1*np.random.rand(1000)
# Make some as NAN
y[np.random.choice(np.arange(1000), size=100)]= np.nan
x[np.random.choice(np.arange(1000), size=80)]= np.nan
# fit model
model = sm.OLS(y, sm.add_constant(x) ,missing='drop').fit()
print model.summary()
# plot
plt.scatter(model.model.exog[:,1], model.resid)
plt.show()