n-components doesn't seem to truncate the number of components calculated

n-components doesn't seem to truncate the number of components calculated - scikit-learn

I'm trying to perform Kernal Principal Component Analysis (KPCA) on a large data set that I will want to find the pre-image of after removal of the low energy/high entropy components.
I would had assumed that specifying the n_components parameter would prevent the nxn calculation (and storage thereof), but that doesn't seem to be the case; at least kpca.alphas_ and .lambdas_ still have nxn components stored and calculated.
Is there something I'm doing wrong, or can this function not operate similarly to truncated_svd?
I've read up on streaming KPCA approaches that would assuage the memory and processing time issue, but then I would need to auger a way to form the pre-image which I don't feel well equipped to do.
from sklearn.decomposition import KernelPCA as KPCA
from sklearn.datasets import make_blobs as mb
import numpy as np
X,y=mb(n_samples=400,cluster_std=[1,2,.25,.5,0.1],centers=5,n_features=2)
kpca=KPCA(kernel='rbf',fit_inverse_transform=True,gamma=10,n_components=50)
Xk=kpca.fit_transform(X)
print np.shape(kpca.lambdas_)

It occurred to me that telling sklearn to fit the inverse might also require all eignvalues/vectors to be calculated.
Without this field, it performs the same way as truncated_svd.
Suppose I'll need to make/discover a pre-image approximation scheme after all.
If you know of any, feel free to post in the comments.

Related

Using a LinearOperator as Jacobian in scipy.root

I want to solve a system of nonlinear equations using scipy.root. For performance reason, I want to provide the jacobian of the system using a LinearOperator. However, I cannot get it to work. Here is a minimal example using the gradient of the Rosenbrock function, where I first define the Jacobian (i.e. the Hessian of the Rosenbrock function) as a LinearOperator.
import numpy as np
import scipy.optimize as opt
import scipy.sparse as sp
ndim = 10
def rosen_hess_LO(x):
return sp.linalg.LinearOperator((ndim,ndim) ,matvec = (lambda dx,xl=x : opt.rosen_hess_prod(xl,dx)))
opt_result = opt.root(fun=opt.rosen_der,x0=np.zeros((ndim),float),jac=rosen_hess_LO)
Upon execution, I get the following error :
TypeError: fsolve: there is a mismatch between the input and output shape of the 'fprime' argument 'rosen_hess_LO'.Shape should be (10, 10) but it is (1,).
What am I missing here ?

Partial answer :
I was able to input my "exact" jacobian into scipy.optimize.nonlin.nonlin_solve . This really felt hacky.
Long story short, I defined a class inheriting from scipy.optimize.nonlin.Jacobian, where I defined "update" and "solve" method so that my exact jacobian would be used by the solver.
I expect performance results to greatly vary from problem to problem. Let me detail my experience for a ~10k dimensional critial point solve of an "almost" coercive function (i.e. the problem would be coercive if I had taken the time to remove a 4-dimensional symmetry generator), with many many local minima (and thus presumably many many critical points).
Long story short, this gave terrible results far from the optimum, but local convergence was achieved in fewer optimization cycles. The cost of each of those optimization cycle was (for my personal problem at hand) far greater than the "standard" krylov lgmres, so in the end even close to the optimum, I cannot really say it was worth the trouble.
To be honest, I am very impressed with the Jacobian finite difference approximation of the 'krylov' method of scipy.optimize.root.

Does fitting Weibull distribution to data using scipy.stats perform poor?

I am working on fitting Weibull distribution on some integer data and estimating relevant shape, scale, location parameters. However, I noticed poor performance of scipy.stats library while doing so.
So, I took a different direction and checked the fit performance by using the code below. I first create 100 numbers using Weibull distribution with parameters shape=3, scale=200, location=1. Subsequently, I estimate the best distribution fit using fitter library.
from fitter import Fitter
import numpy as np
from scipy.stats import weibull_min
# generate numbers
x = weibull_min.rvs(3, scale=200, loc=1, size=100)
# make them integers
data = np.asarray(x, dtype=int)
# fit one of the four distributions
f = Fitter(data, distributions=["gamma", "rayleigh", "uniform", "weibull_min"])
f.fit()
f.summary()
I expect the best fit to be Weibull distribution. I have tried re-running this test. Sometimes Weibull fit is a good estimate. However, most of the time Weibull fit is reported as the worst result. In this case, the estimated parameters are = (0.13836651040093312, 66.99999999999999, 1.3200752378443505). I assume these parameters correspond to shape, scale, location in order. Below is the summary of the fit procedure.
$ f.summary()
sumsquare_error aic bic kl_div
gamma 0.001601 1182.739756 -1090.410631 inf
rayleigh 0.001819 1154.204133 -1082.276256 inf
uniform 0.002241 1113.815217 -1061.400668 inf
weibull_min 0.004992 1558.203041 -976.698452 inf
Additionally, the following plot is produced.
Also, Rayleigh distribution is a special case of Weibull with shape parameter = 2. So, I expect the resulting Weibull fit to be at least as good as Rayleigh.
Update
I ran the tests above on Linux/Ubuntu 20.04 machine with numpy version 1.19.2 and scipy version 1.5.2. The code above seems to run as expected and return proper results for Weibull distribution on a Mac machine.
I have also tested fitting a Weibull distribution on data x generated above on the Linux machine by using an R library fitdistrplus as:
fit.weib <- fitdist(x, "weibull")
and observed that the estimated shape and scale values are found to be very close to the initially given values. The best guess so far is that the problem is due to some Python-Ubuntu bug/incompatibility.
I can be considered as a newbie in this area. So, I am wondering, am I doing something wrong here? Or is this result somehow expected? Any help is greatly appreciated.
Thank you.

Library fitter doesn't allow to specify parameters for distributions such as a, loc, etc. And strangely, Mac produces better fit while Linux heavily pains the results for best fit, for the same version of Numpy and Scipy. Underlying reasons may include different BLAS-LAPACK algorithms designed for Linux and Mac, https://stackoverflow.com/a/49274049/6806531, or weibull_min may not initialize parameter a = 1 which is discussed online, or default floating-point accuracy. However, one can solve the error inside fitter library. Knowing the fact that weib_min is expon_weib with parameter a is fixed as 1, changing the run function inside of _timed_run function in fitter.py as
def run(self):
try:
if distribution == "exponweib":
self.result = func(args,floc=0,fa = 1, **kwargs)
else:
self.result = func(args, floc=0, **kwargs)
except Exception as err:
self.exc_info = sys.exc_info()
and using exponweib as weib_min gives nearly same results as R fitdist.

I am not familiar with the Fitter library, but in order to draw some conclusions I would suggest:
Retry your code, but by taking size=10,000. In this case, there are sufficient datapoints for the fitting methods to utilize. Theoretically, you would then expect the Weibull to deliver the best fit.
I noticed that the location parameter can sometimes be a pain. You could try to run your fits by fixing the location parameter with floc=1 (i.e. equal to your sampling parameter for location). What do you get? Aditionally, FYI, with MLE, it suffices to take loc=min(x), where x is your dataset. For the exponential distribution, this in fact the MLE of the location parameter. For other distributions I am not sure, but I wouldn't be surprised if this holds for other distributions as well. This would reduce the fitting procedure with 1 parameter.
Lastly, I noticed that if you take small values for location/scale/shape for some distributions, the functions logpdf and logcdf of scipy.stats distributions result in np.inf values. In this scenario, you could perhaps use the Powell optimization algorithm and set bounds on the values of your parameters.

Point Cloud triangulation using marching-cubes in Python 3

I'm working on a 3D reconstruction system and want to generate a triangular mesh from the registered point cloud data using Python 3. My objects are not convex, so the marching cubes algorithm seems to be the solution.
I prefer to use an existing implementation of such method, so I tried scikit-image and Open3d but both the APIs do not accept raw point clouds as input (note that I'm not expert of those libraries). My attempts to convert my data failed and I'm running out of ideas since the documentation does not clarify the input format of the functions.
These are my desired snippets where pcd_to_volume is what I need.
scikit-image
import numpy as np
from skimage.measure import marching_cubes_lewiner
N = 10000
pcd = np.random.rand(N,3)
def pcd_to_volume(pcd, voxel_size):
#TODO
volume = pcd_to_volume(pcd, voxel_size=0.05)
verts, faces, normals, values = marching_cubes_lewiner(volume, 0)
open3d
import numpy as np
import open3d
N = 10000
pcd = np.random.rand(N,3)
def pcd_to_volume(pcd, voxel_size):
#TODO
volume = pcd_to_volume(pcd, voxel_size=0.05)
mesh = volume.extract_triangle_mesh()
I'm not able to find a way to properly write the pcd_to_volume function. I do not prefer a library over the other, so both the solutions are fine to me.
Do you have any suggestions to properly convert my data? A point cloud is a Nx3 matrix where dtype=float.
Do you know another implementation [of the marching cube algorithm] that works on raw point cloud data? I would prefer libraries like scikit and open3d, but I will also take into account github projects.

Do you know another implementation [of the marching cube algorithm] that works on raw point cloud data?
Hoppe's paper Surface reconstruction from unorganized points might contain the information you needed and it's open sourced.
And latest Open3D seems to be containing surface reconstruction algorithms like alphaShape, ballPivoting and PoissonReconstruction.
From what I know, marching cubes is usually used for extracting a polygonal mesh of an isosurface from a three-dimensional discrete scalar field (that's what you mean by volume). The algorithm does not work on raw point cloud data.
Hoppe's algorithm works by first generating a signed distance function field (a SDF volume), and then passing it to marching cubes. This can be seen as an implementation to you pcd_to_volume and it's not the only way!
If the raw point cloud is all you have, then the situation is a little bit constrained. As you might see, the Poisson reconstruction and Screened Poisson reconstruction algorithm both implement pcd_to_volume in their own way (they are highly related). However, they needs additional point normal information, and the normals have to be consistently oriented. (For consistent orientation you can read this question).
While some Delaunay based algorithm (they do not use marching cubes) like alphaShape and this may not need point normals as input, for surfaces with complex topology, it's hard to get a satisfactory result due to orientation problem. And the graph cuts method can use visibility information to solve that.
Having said that, if your data comes from depth images, you will usually have visibility information. And you can use TSDF to build a good surface mesh. Open3D have already implemented that.

How can I get the initial values from my dataset for a combined lorentzian and gaussian fit?

I try to fit data using standard defined functions (Lorentzian & Gaussian) from lmfit package. The program works quite well for some data set but for another one its not able to fit because the initial values doesnt seem right. Is there any algorithm which can extract the initial values from the data set and do some iterations in order to find the best fit?
I tried some common methods like bruethe-force algorithm but the results are not satisfactory and it cost a lot of time.

It is always recommended to provide a small, complete example script that shows the problem you are having. How could we know why it works in some cases and not in others?
lmfit.GaussianModel and lmfit.LorentzianModel both have guess methods. This should work reasonably well data with an isolated peak, working like
import lmfit
model = lmfit.models.GaussianModel()
params = model.guess(ydata, x=xdata)
for p in params.values():
print(p)
result = model.fit(ydata, params, x=xdata)
print(result.fit_report())
If the data doesn't have a clear isolated peak, that might not work so well.
If finding the peak(s) is the actual problem, try scipy.signal.find_peaks
(https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html)or peakutils (https://peakutils.readthedocs.io/en/latest/). Either of these should give you a good estimate of the center parameter, which is probably the most likely to cause bad fits if a poor initial value is give.

Rolling mean with irregular boundaries

I am using rolling mean on my data to smoothen it. My data can be found here.
An illustration of my original data is;
Currently, I am using
import pandas as pd
import numpy as np
data = pd.read_excel('data.xlsx')
data = np.array(data, dtype=np.float)
window_length = 9
res = pd.rolling_mean(np.array(data[:, 2]), window_length, min_periods=1, center=True)
This is what I get after applying rolling mean with a window_length of 9;
And when i increase the window_length to 20, I get a smoother image but at boundaries, the data seems to be erroneous.
The problem is, as seen in the figures above, the rolling mean introduces some sort of errors at the boundaries of my data which do not exist in the original data.
Is there any way to correct this?
My guess is, at the boundary, since part of the window_length is found outside my data, it exaggerates the mean.
Is there a way to correct this error using pandas rolling mean or is there a better pythonic way in doing this? Thanks.
Ps. I am aware the panda function of rolling mean i am using is deprecated in the new versión.

You can try a native 2D convolution method such as scipy.ndimage.filters.convolve with weights so just make the kernel an average (mean) function.
The weights would be:
n = 3. # size of kernel over which to calculate mean
weights = np.ones(n,n)/n**2
If the white area of your data are represented by nans, this would reduce the footprint of the result by n since any kernel stamp with a nan included will return a nan. If this is really an issue try look at astropy.convolution, which has better nan handling.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string