How to create random distributions? - python-3.x

I see nearly similar posts, though not quite and I can't seem to adapt any of them to Python. This post suggests an R solution using method="SANN" in optim(): though I can't seem to duplicate it in Python. I want to create a model/plot using (n,min,max,mean,sd,25%,50%,75%,skew, and kurtosis) as 10 known parameters WITHOUT using a distribution family. I can do it fine with a dist family, for example:
x = np.linspace(39,401,100)
plt.plot(x, scipystats.???.pdf(x,p[0],p[1],p[2],p[3],p[4],p[5],p[6],p[7],p[8],p[9])
Then use the model for simulation using a random generator. For example:
ModDist = ???.rvs(p0=n,p1=min,p2=max,p3=m,p4=sd,p5=25,p6=50,p7=75,p8=s,p9=k,rs=3)
It does not have to be only scipy if pandas, numpy, etc.., or combined can work.
Any ideas please?

Related

Does fitting Weibull distribution to data using scipy.stats perform poor?

I am working on fitting Weibull distribution on some integer data and estimating relevant shape, scale, location parameters. However, I noticed poor performance of scipy.stats library while doing so.
So, I took a different direction and checked the fit performance by using the code below. I first create 100 numbers using Weibull distribution with parameters shape=3, scale=200, location=1. Subsequently, I estimate the best distribution fit using fitter library.
from fitter import Fitter
import numpy as np
from scipy.stats import weibull_min
# generate numbers
x = weibull_min.rvs(3, scale=200, loc=1, size=100)
# make them integers
data = np.asarray(x, dtype=int)
# fit one of the four distributions
f = Fitter(data, distributions=["gamma", "rayleigh", "uniform", "weibull_min"])
f.fit()
f.summary()
I expect the best fit to be Weibull distribution. I have tried re-running this test. Sometimes Weibull fit is a good estimate. However, most of the time Weibull fit is reported as the worst result. In this case, the estimated parameters are = (0.13836651040093312, 66.99999999999999, 1.3200752378443505). I assume these parameters correspond to shape, scale, location in order. Below is the summary of the fit procedure.
$ f.summary()
sumsquare_error aic bic kl_div
gamma 0.001601 1182.739756 -1090.410631 inf
rayleigh 0.001819 1154.204133 -1082.276256 inf
uniform 0.002241 1113.815217 -1061.400668 inf
weibull_min 0.004992 1558.203041 -976.698452 inf
Additionally, the following plot is produced.
Also, Rayleigh distribution is a special case of Weibull with shape parameter = 2. So, I expect the resulting Weibull fit to be at least as good as Rayleigh.
Update
I ran the tests above on Linux/Ubuntu 20.04 machine with numpy version 1.19.2 and scipy version 1.5.2. The code above seems to run as expected and return proper results for Weibull distribution on a Mac machine.
I have also tested fitting a Weibull distribution on data x generated above on the Linux machine by using an R library fitdistrplus as:
fit.weib <- fitdist(x, "weibull")
and observed that the estimated shape and scale values are found to be very close to the initially given values. The best guess so far is that the problem is due to some Python-Ubuntu bug/incompatibility.
I can be considered as a newbie in this area. So, I am wondering, am I doing something wrong here? Or is this result somehow expected? Any help is greatly appreciated.
Thank you.
Library fitter doesn't allow to specify parameters for distributions such as a, loc, etc. And strangely, Mac produces better fit while Linux heavily pains the results for best fit, for the same version of Numpy and Scipy. Underlying reasons may include different BLAS-LAPACK algorithms designed for Linux and Mac, https://stackoverflow.com/a/49274049/6806531, or weibull_min may not initialize parameter a = 1 which is discussed online, or default floating-point accuracy. However, one can solve the error inside fitter library. Knowing the fact that weib_min is expon_weib with parameter a is fixed as 1, changing the run function inside of _timed_run function in fitter.py as
def run(self):
try:
if distribution == "exponweib":
self.result = func(args,floc=0,fa = 1, **kwargs)
else:
self.result = func(args, floc=0, **kwargs)
except Exception as err:
self.exc_info = sys.exc_info()
and using exponweib as weib_min gives nearly same results as R fitdist.
I am not familiar with the Fitter library, but in order to draw some conclusions I would suggest:
Retry your code, but by taking size=10,000. In this case, there are sufficient datapoints for the fitting methods to utilize. Theoretically, you would then expect the Weibull to deliver the best fit.
I noticed that the location parameter can sometimes be a pain. You could try to run your fits by fixing the location parameter with floc=1 (i.e. equal to your sampling parameter for location). What do you get? Aditionally, FYI, with MLE, it suffices to take loc=min(x), where x is your dataset. For the exponential distribution, this in fact the MLE of the location parameter. For other distributions I am not sure, but I wouldn't be surprised if this holds for other distributions as well. This would reduce the fitting procedure with 1 parameter.
Lastly, I noticed that if you take small values for location/scale/shape for some distributions, the functions logpdf and logcdf of scipy.stats distributions result in np.inf values. In this scenario, you could perhaps use the Powell optimization algorithm and set bounds on the values of your parameters.

Pre-aligning molecules in Rdkit before computing shape similarity with ShapeTanimotoDist() possible?

I am building a script to compare shapes of Rdkit generated conformers for a query ligand to a reference ligand extracted from a template protein-ligand complex. For this I want to use the shape similarity Tanimoto metric ShapeTanimotoDist() provided by Rdkit. It seems however that this function does not pre-align the molecules when computing shape similarity. When I did some searches I stumbled upon this discussion 10 years ago wherein someone attempted something similar: https://sourceforge.net/p/rdkit/mailman/message/21906484/.
Quoting Greg Landrum:
There is no alignment step. If you want reasonable shape comparisons,
you first need a reasonable alignment of the molecules. The RDKit
doesn't currently provide a practical method of doing this alignment.
So I am wondering if since then this issue has been resolved and that it therefore would be reasonable to just use this function in a standalone fashion to compare shapes of molecules? In the documentation it states under ShapeTanimotoDist() that it uses a "predefined alignment", which is not elaborated further. I have looked into documentation for the 2 molecule aligning functions Rdkit provides: AlignMol and Open3DAlign (O3A) https://www.rdkit.org/docs/source/rdkit.Chem.rdMolAlign.html. For some reason AlignMol does not work for me (Runtime error), albeit O3A which is supported in Rdkit since 2014 did allow me to compare the conformers with ref ligands. However, when creating an O3A object, is there a way to somehow retrieve the coordinates of the conformer and ref molecule alignment to feed into ShapeTanimotoDist()? And also perhaps visualize this using PyMol?
Cheers
Also perhaps useful to consult: 3D functionality in RDkit section https://www.rdkit.org/docs/Cookbook.html

Medical image processing using DICOM images

Im new to medical image processing. how can i convert 3D DICOM medical images to numerical matrix format using either python or c++?
Another option, if you really want "3D" dicom image support (ie CT/MR/NM/PET 3d series - as opposed to purely 2D image handling) and you want do anything really 3d related and/or more complex, you might want to check out simple ITK.
That gives you very powerful true 3d handling and is fast (it's wrapped around complied C). It includes, for example, full 3D image registration and various filters/tools etc.
It can read an entire series at once and automatically create a fully spatially aware 3D numpy array for you (ie it takes care of processing all the dicom 3D spatial orientation/spacing etc tags for you)
However, because it's a lot more powerful than pydicom, it also has a much steeper learning curve - but does have many examples and online jupyter notebook tutorials.
...so, depending on your needs it might be good for you. However, if you only really want basic 2d image-at-a-time type processing, pydicom is the way to go.
You can use pydicom package in python. You can install it in python by:
pip install pydicom
Here is a simple example of reading DICOM images and converting to numpy array:
import os
import pydicom
import numpy as np
dicom_dir = your_dicom_folder_of_slices
file_names = os.listdir(dicom_dir)
file_names.sort()
dicom_data = []
for name in file_names:
path = os.path.join(dicom_dir, name)
dicom_data.append(pydicom.read_file(path))
array = [data.pixel_array for data in dicom_data]
array = np.stack(array, axis=-1) # or 0 if 'channel_first'
Here is a detailed example.
I prefer using SimpleElastix for medical image processing. it has many methods for segmentations and many other helpful methods. it is available in both python and C++. In my experience SimpleElastix handled DICOMS and niftis better than other Packages.

How can I get the initial values from my dataset for a combined lorentzian and gaussian fit?

I try to fit data using standard defined functions (Lorentzian & Gaussian) from lmfit package. The program works quite well for some data set but for another one its not able to fit because the initial values doesnt seem right. Is there any algorithm which can extract the initial values from the data set and do some iterations in order to find the best fit?
I tried some common methods like bruethe-force algorithm but the results are not satisfactory and it cost a lot of time.
It is always recommended to provide a small, complete example script that shows the problem you are having. How could we know why it works in some cases and not in others?
lmfit.GaussianModel and lmfit.LorentzianModel both have guess methods. This should work reasonably well data with an isolated peak, working like
import lmfit
model = lmfit.models.GaussianModel()
params = model.guess(ydata, x=xdata)
for p in params.values():
print(p)
result = model.fit(ydata, params, x=xdata)
print(result.fit_report())
If the data doesn't have a clear isolated peak, that might not work so well.
If finding the peak(s) is the actual problem, try scipy.signal.find_peaks
(https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html)or peakutils (https://peakutils.readthedocs.io/en/latest/). Either of these should give you a good estimate of the center parameter, which is probably the most likely to cause bad fits if a poor initial value is give.

Invalid Graphics State - ongoing issue for beginner RStudio/R user

I am working on an assignment for a course. The code creates variables for use in a data.dist function from rms. We then create a simple linear regression model using ols(). Printing/creating the first plot before the data.dist() and ols() functions is simple. We use:
plot(x,y,pch='o')
lines(x,yTrue,lty=2,lwd=.5,col='red')
Then we create the data.dist() and the ols(), here named fit0.
mydat=data.frame(x=x,y=y)
dd=datadist(mydat)
options(datadist='dd')
fit0=ols(y~x,data=mydat)
fit0
anova(fit0)
This all works smoothly, printing out the results of the linear regression and the anova table. Then, we want to predict based on the model, and plot these predictions. The plot prints out nicely, however the lines and points won't show up here. The code:
ff=Predict(fit0)
plot(ff)
lines(x,yTrue,lwd=2,lty=1,col='red')
points(x,y,pch='.')
Note - this works fine in R. I much prefer to use RStudio, though can switch to R if there's no clear solution this issue. I've tried dev.off() several times (i.e. repeat until get, I've tried closing RStudio and re-opening, I've uninstalled and reinstalled R and RStudio, rms package (which includes ggplot2), updated the packages, made my RStudio graphics window larger. Any solution I've seen, doesn't work. Help!

Resources