Measure distance between two atoms, in each state, in PyMOL

I'm using PyMOL to measure the distance between two atoms of a nucleic acid. I have a .pdb file from a molecular dynamics calculation, and when I open that file in PyMOL I get multiple states of the molecule. I want to measure the distance between two specific atoms for each state and export the distances to a file, so I can then visualise the distribution of distances. Can anyone suggest a method to do this using PyMOL?
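A minimal sketch of one way to do this with the PyMOL Python API (not from the original thread): cmd.count_states gives the number of states and cmd.get_distance accepts a state argument. The object name and the two atom selections below are placeholders to replace with your own.

from pymol import cmd

obj = "traj"                           # hypothetical object name, e.g. from cmd.load("md.pdb", "traj")
sel1 = "traj and resi 5 and name N1"   # placeholder: first atom (selection must match exactly one atom)
sel2 = "traj and resi 20 and name N3"  # placeholder: second atom

with open("distances.txt", "w") as out:
    for state in range(1, cmd.count_states(obj) + 1):
        d = cmd.get_distance(sel1, sel2, state=state)  # distance in Angstrom for this state
        out.write("%d\t%.3f\n" % (state, d))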

Related

Deviation analysis color to vertex color using Meshlab

I would like to know whether it is possible to do deviation analysis with MeshLab and transfer the result to vertex colors in a mesh. To expand on those two ideas:
1st. Is it possible to do deviation analysis with MeshLab? I have a scanned mesh and I will compare it with an "ideal" model. The difference between these two will generate a (grey or color) scale that represents the distance from the points of the scanned model to the "ideal" one.
2nd. I want to take this information (a color/grey grading that shows how distant the points are) and transfer it to vertex color information.
I don't know if that was clear, but if you know what deviation analysis means I think you get the idea. The point is that I would like to generate a 3D mesh with the vertex colors provided by this deviation analysis.
It seems that MeshLab can compare two models and can handle vertex colorizing, but I don't know whether it can work with real measurements, transfer that information to vertex color and export a mesh that shows it.
If it's possible, and if you know how, just point me in some direction. I'm not familiar with MeshLab, and clicking here and there attempting an impossible task can be very frustrating, so it would be good if someone could give me some tips.
Thanks.
Yes, MeshLab can compute deviation analysis between two similar surfaces (and the required alignment preprocessing too).
Estimating the deviation between two meshes means computing the Hausdorff distance.
There is a small tutorial on how to compute and visualize it in MeshLab here:
http://meshlabstuff.blogspot.com/2010/01/measuring-difference-between-two-meshes.html
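Outside MeshLab, the same idea can be prototyped on sampled point clouds; a minimal sketch with SciPy's directed_hausdorff, where both point arrays are synthetic stand-ins for vertices sampled from the scanned and "ideal" meshes:

import numpy as np
from scipy.spatial.distance import directed_hausdorff

scanned = np.random.rand(1000, 3)  # placeholder: points sampled from the scanned mesh
ideal = np.random.rand(1200, 3)    # placeholder: points sampled from the "ideal" mesh

# The symmetric Hausdorff distance is the larger of the two directed distances.
d_forward, _, _ = directed_hausdorff(scanned, ideal)
d_backward, _, _ = directed_hausdorff(ideal, scanned)
print("Hausdorff distance:", max(d_forward, d_backward))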

How to select most important features? Feature Engineering

I used the function for Gower distance from this link: https://sourceforge.net/projects/gower-distance-4python/files/. My data (df) is such that each row is a trade and each column is a feature. Since it contains a lot of categorical data, I converted the data using Gower distance to measure "similarity"... I hope this is correct (as below):
import scipy.cluster.hierarchy
import scipy.spatial.distance as ssd

D = gower_distances(df)  # gower_distances comes from the gower-distance-4python script linked above
distArray = ssd.squareform(D)
hierarchal_cluster = scipy.cluster.hierarchy.linkage(distArray, method='ward', metric='euclidean', optimal_ordering=False)
I then plot the hierarchal_cluster from above as a dendrogram:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index or (cluster size)')
plt.ylabel('distance')
dendrogram(
    hierarchal_cluster,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=15,                   # the number of merged clusters to show
    leaf_rotation=90.,
    leaf_font_size=12.,
    show_contracted=True,   # to get a distribution impression in truncated branches
)
plt.show()
I cannot show it, since I do not have enough privilege points, but on the dendrogram I can see separate colors.
What is the main discriminator separating them?
How can I find this out?
How can I use PCA to extract useful features?
Do I pass my 'hierarchal_cluster' into a PCA function?
Something like the below?
from sklearn.decomposition import PCA
import numpy as np

pca = PCA().fit(hierarchal_cluster.T)
plt.plot(np.arange(1, len(pca.explained_variance_ratio_) + 1, 1),
         pca.explained_variance_ratio_.cumsum())
I hope you know that PCA works only for continuous data. Since you mentioned that there are many categorical features, it appears from what you have written that you have mixed data.
A common practice when dealing with mixed data is to separate the continuous and categorical features/variables, then compute the Euclidean distance between data points for the continuous (numerical) features and the Hamming distance for the categorical features [1].
This lets you measure similarity for the continuous and categorical features separately. While you are at it, apply PCA on the continuous variables to extract important features, and apply Multiple Correspondence Analysis (MCA) on the categorical features. Thereafter, you can combine the obtained relevant features and apply any clustering algorithm.
So essentially, I'm suggesting feature selection/feature extraction before clustering.
[1] Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data mining and knowledge discovery, 2(3), pp.283-304.
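A rough sketch of that split, assuming a pandas DataFrame df of trades; the column lists below are placeholders, and the distance and PCA calls are standard SciPy/scikit-learn (MCA would come from a separate package and is omitted here):

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

cont_cols = ["notional", "maturity_years"]    # placeholder continuous columns
cat_cols = ["currency", "counterparty_type"]  # placeholder categorical columns

X_cont = StandardScaler().fit_transform(df[cont_cols])
X_cat = OrdinalEncoder().fit_transform(df[cat_cols])

# Euclidean distances for the continuous block, Hamming distances for the categorical block [1].
d_cont = pdist(X_cont, metric="euclidean")
d_cat = pdist(X_cat, metric="hamming")

# PCA on the continuous block only, keeping enough components for 95% of the variance.
pca = PCA(n_components=0.95).fit(X_cont)
X_cont_reduced = pca.transform(X_cont)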
Quoting the documentation of scipy on the matter of Ward linkage:
Methods ‘centroid’, ‘median’ and ‘ward’ are correctly defined only if Euclidean pairwise metric is used. If y is passed as precomputed pairwise distances, then it is a user responsibility to assure that these distances are in fact Euclidean, otherwise the produced result will be incorrect.
So you can't use Ward linkage with Gower!
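If you want to keep the Gower distances, one hedged alternative is to switch to a linkage criterion that is defined for arbitrary precomputed distances, such as 'average' or 'complete':

import scipy.cluster.hierarchy as sch
import scipy.spatial.distance as ssd

# D is the square Gower distance matrix from the question, condensed as before.
distArray = ssd.squareform(D)

# 'average' (UPGMA) and 'complete' linkage do not assume Euclidean distances,
# unlike 'ward', 'centroid' and 'median'.
hierarchal_cluster = sch.linkage(distArray, method='average')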

Agglomerative Clustering with custom distance metric (alternative to the input correlation metric)

I'm looking to implement a hierarchical clustering model for my set of training variables with respect to their correlation matrix (it's a 100x100 matrix, and I want the largest cluster whose elements are the most uncorrelated). I've been able to use the scipy family of functions to do this; however, for visualization and presentation's sake, I'd like an alternative correlation distance defined for my data.
The built-in 'correlation' distance metric is defined as 1 - r, where r is the Pearson score between two variables. I'd like to change it to 1 - abs(r), as my most interesting variables are the most uncorrelated ones (say, the variables that find themselves at a distance of 1 - 0.8 apart). Thanks!
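A minimal sketch of the described 1 - abs(r) distance with scipy, assuming corr already holds the 100x100 Pearson correlation matrix (the variable names are illustrative):

import numpy as np
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance as ssd

dist = 1.0 - np.abs(corr)    # custom distance: the most uncorrelated pairs are the farthest apart
np.fill_diagonal(dist, 0.0)  # enforce an exact zero diagonal before condensing
condensed = ssd.squareform(dist, checks=False)

# Use a linkage criterion that is valid for precomputed, non-Euclidean distances.
Z = sch.linkage(condensed, method='average')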

Check if numbers form a bell curve (Gaussian distribution) in Python 3

I've got files with irradiance data measured every minute 24 hours a day.
So if there is a day without any clouds in the sky, the data shows a nice continuous bell curve.
When looking for a day without any clouds in the data, I have always plotted month after month with gnuplot and checked for nice bell curves.
I was wondering if there's a Python way to check whether the irradiance measurements form a continuous bell curve.
I don't know if the question is too vague, but I'm simply looking for some ideas on that quest :-)
For a normal distribution, there are normality tests.
In short, we abuse some knowledge we have of what normal distributions look like to identify them.
The kurtosis of any normal distribution is 3. Compute the kurtosis of your data and it should be close to 3.
The skewness of a normal distribution is zero, so your data should have a skewness close to zero.
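A small sketch of those two checks with scipy.stats (the irradiance array below is a synthetic stand-in for one day of measurements):

import numpy as np
from scipy import stats

irradiance = np.random.normal(500, 100, size=1440)  # placeholder: one value per minute for a day

# fisher=False returns "plain" kurtosis, which is 3 for a normal distribution;
# the skewness of a normal distribution is 0.
print("kurtosis:", stats.kurtosis(irradiance, fisher=False))
print("skewness:", stats.skew(irradiance))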
More generally, you could compute a reference distribution and use a Bregman divergence to assess the difference (divergence) between the distributions. Bin your data, create a histogram, and start with the Jensen-Shannon divergence.
With the divergence approach, you can compare to an arbitrary distribution. You might record a thousand sunny days and check if the divergence between the sunny day and your measured day is below some threshold.
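A hedged sketch of that histogram-plus-divergence idea, comparing a measured day against a reference profile; both arrays and the bin edges below are synthetic stand-ins:

import numpy as np
from scipy.spatial.distance import jensenshannon

reference_day = np.random.normal(500, 100, size=1440)  # placeholder reference profile
measured_day = np.random.normal(480, 120, size=1440)   # placeholder measured day

# Bin both days on a common grid; jensenshannon normalises the histograms internally.
bins = np.linspace(0, 1000, 51)
p, _ = np.histogram(reference_day, bins=bins)
q, _ = np.histogram(measured_day, bins=bins)

# jensenshannon returns the square root of the Jensen-Shannon divergence; smaller means more similar.
print("JS distance:", jensenshannon(p, q))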
Just to complement the given answer with a code example: one can use a Kolmogorov-Smirnov test to obtain a measure for the "distance" between two distributions. SciPy offers a neat interface for this, called kstest:
from scipy import stats
import numpy as np
data = np.random.normal(size=100) # Our (synthetic) dataset
D, p = stats.kstest(data, "norm") # Perform a (two-sided) Kolmogorov-Smirnov test against the normal distribution
In the above example, D denotes the distance between our data and a Gaussian normal (norm) distribution (smaller is better), and p denotes the corresponding p-value. Other distributions can be similarly tested by substituting norm with those implemented in scipy.stats.

What descriptive statistics are commonly used for time-series data?

I have a time series of weekly usage data and I'm going to attempt to use some statistics to segment the population. Skewness and kurtosis may allow me to describe the time series and group the people in different ways. But I also notice that some appear to have saw-tooth patterns or bimodal patterns, and I don't think these two aforementioned statistics will describe them well. Distance from the mean would tell me who has continual, steady usage vs. unpredictable usage.
What descriptive statistics are commonly used for time-series data?
Thanks,
The periodogram and the autocorrelation function are two common sources of information used to analyse and model time series. You can use this information to compare the series.
In the periodogram you can detect the frequencies at which the estimated spectral density is the highest. This will tell you which series are dominated by cycles of the same frequency.
The autocorrelation function (the time domain counterpart of the periodogram) and the partial autocorrelation function can similarly be used to compare and group the series. Those series with significant autocorrelations at the same lag orders could be grouped together.
You may need to transform the series in order to discern some of this information, for example taking differences to render the data stationary. Alternatively you can select an ARIMA model for each series and compare the characteristics of each model (those characteristics will be pretty much the same as those observed in the autocorrelation functions).
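A small sketch of how both views could be computed for each series, using scipy.signal for the periodogram and statsmodels for the (partial) autocorrelations; the weekly usage array is a placeholder:

import numpy as np
from scipy.signal import periodogram
from statsmodels.tsa.stattools import acf, pacf

usage = np.random.poisson(20, size=104)  # placeholder: two years of weekly usage for one person

# Periodogram: the frequency with the largest estimated spectral density dominates the series.
freqs, power = periodogram(usage)
dominant_freq = freqs[np.argmax(power[1:]) + 1]  # skip the zero-frequency term

# Autocorrelation and partial autocorrelation at the first few lags.
acf_vals = acf(usage, nlags=12)
pacf_vals = pacf(usage, nlags=12)
print(dominant_freq, acf_vals[:5], pacf_vals[:5])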
