I am attempting to find the most accurate function to give me the quantile of a given value within a data set. The data set will (likely) always be an exponential distribution.
The methodology I am using is as follows (and I apologize if the coding is poor, as I'm really an infrastructure guy, not a stats dude, nor a daily dev):
import sys, scipy, numpy
from matplotlib import pyplot
from scipy.stats.mstats import mquantiles
def FindQuantile(data,findme):
print 'entered FindQuantile'
probset=[]
#cheap hack to make a quick list to get quantiles for each permille value]
for i in numpy.linspace(0,1,10000):
probset.append(i)
#http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mquantiles.html
quantile_results = mquantiles(data,prob=probset)
quantiles = []
i = 0
for value in quantile_results:
print str(i) + ' permille ' + str(value)
quantiles.append(value)
i = i+1
#goal is to figure out which quantile findme falls in:
i = 0
for quantile in quantiles:
if (findme > quantile):
print str(quantile) + ' is too small for ' + str(findme)
else:
print str(quantile) + ' is the quantile value for the ' + str(i) + '-' + str(i + 1) + ' permille quantile range. ' + str(findme) + ' falls within this range.'
break
i = i + 1
During my research, I noticed there are several more advanced functions to use, such as scipy.stats.[distribution type].ppf().
What is the advantage of using these over mquantiles()?
Is there a method available to efficiently determine the distribution of the data in a data set (this is my concern with scipy.stats.[distribution type]())?
Thanks,
Matt
[update]
After discussing with a "stats dude," I believe that this method (what he referred to as the "empirical method") is just as valid if you do not know the distribution. To find the distribution, you could use the Kolmogorov–Smirnov test, which is revealed via scipy.stats.ksone and scipy.stats.kstwobign to determine the distribution, then utilize one of the scipy.stats.[distribution type].ppf() functions. He also said that it doesn't matter at all, that the method above is just as good as doing all this work, with little reward. Although he cautioned that the strength of the above method would rise with the amount of data available in data (meaning the inverse is also true), that no one has solved the problem of applying laws against small data sets.
What I will do, is to consider the strength of the data set, and put a weight on my resultant, and consider it to be much more fuzzy/have less weight when the data set is "small." What is "small?" I'm not sure yet.
I would still like to find other peoples input as to effective use of ppf() versus mquantile().
ppf gives you the quantiles for a particular distribution given the parameters of the distribution. For example, you could fit your data to and exponential distribution, and then you can use the ppf with the estimated parameters to get the quantiles.
When you use mquantiles, then you do not assume that you have a specific distribution.
Estimating the parameters of a given distribution and using ppf will give you better results, with lower variance, than mquantiles, if your data really comes from that distribution or the distribution is at least a very good approximation.
Related
I am interested in which processes/activities contribute most to the Life Cycle Impact Assessment (LCIA) that I am conducting. For this, I run a contribution analysis (see code below). To crosscheck the results of my contribution analysis and to ensure that I get everything right, I wanted to compare the returned contributions with the impact assessment result (lca.score).
The documentation of ca.annotated_top_processes(lca) says: "Returns a list of tuples: (lca score, supply, activity)."
In my understanding, lca.score should be the same value as the sum of all the first values in the tuples that are returned by ca.annotated_top_processes(lca) (the printed values). However, this is not the case. What am I missing? Is there some sort of cut-off applied or did I misunderstand something?
import bw2analyzer as bwa
random_act = db_ei381.random()
lca = bw2data.LCA(
{random_act: 1},
('ReCiPe Midpoint (H) V1.13', 'water depletion', 'WDP')
)
lca.lci()
lca.lcia()
print(lca.score)
# %% Contribution analysis
ca = bwa.ContributionAnalysis()
contributions = ca.annotated_top_processes(lca)
print(sum([i[0] for i in contributions]))
It is not well documented, but you can introduce an argument limit that specifies the number of activities that are considered in the contribution analysis. The default value I think is 25. It is sorted so the most important activities come first. If you write something like this you should see how the result converges to the total score as the number of activities increase:
import matplotlib.pyplot as plt
cutoff = [25,50,100,500,1000,1200]
scores = []
for n in cutoff:
contributions = ca.annotated_top_processes(lca,limit=n)
contr_sum = sum([i[0] for i in contributions])
scores.append(contr_sum)
plt.plot(cutoff,scores)
plt.axhline(lca.score,ls='--',color='r');
I'm interested in estimating a shared, global trend over time for counts monitored at several different sites using generalized additive models (gams). I've read this great introduction to hierarchical gams (hgams) by Pederson et al. (2019), and I believe I can setup the model as follows (the Pederson et al. (2019) GS model),
fit_model = gam(count ~ s(year, m = 2) + s(year, site, bs = 'fs', m = 2),
data = count_df,
family = nb(link = 'log'),
method = 'REML')
I can plot the partial effect smooths, look at the fit diagnostics, and everything looks reasonable. My question is how to extract a non-centered annual relative count index? My first thought would be to add the estimated intercept (the average count across sites at the beginning of the time series) to the s(year) smooth (the shared global smooth). But I'm not sure if the uncertainty around that smooth already incorporates uncertainty in the estimated intercept? Or if I need to add that in? All of this was possible thanks to the amazing R libraries mgcv, gratia, and dplyr.
Your way doesn't include the uncertainty in the constant term, it just shifts everything around.
If you want to do this it would be easier to use the constant argument to gratia:::draw.gam():
draw(fit_model, select = "s(year)", constant = coef(fit_model)[1L])
which does what your code does, without as much effort (on your part).
An better way — with {gratia}, seeing as you are using it already — would be to create a data frame containing a sequence of values over the range of year and then use gratia::fitted_values() to generate estimates from the model for those values of year. To get what you want (which seems to be to exclude the random smooth component of the fit, such that you are setting the random component to equal 0 on the link scale) you need to pass that smooth to the exclude argument:
## data to predict at
new_year <- with(count_df,
tibble(year = gratia::seq_min_max(year, n = 100),
site = factor(levels(site)[1], levels = levels(site)))
## predict
fv <- fitted_values(fit_model, data = new_year, exclude = "s(year,site)")
If you want to read about exclude, see ?predict.gam
So I have a system where I need to be able to determine the exact position of my ions and run an equation on the average position of that ion. I found my ion positions were inconsistent due to some ions wrapping across the periodic boundary and severely changing the position for that one window. Leading me to have an average of say +20 when the ion just shuffled between +40 and -40.
I was wanting to correct that by implementing a way to unwrap my wrapped coordinates for ions on the edge of my box.
Essentially I was thinking that for each frame in my trajectory, MDAnalysis would check the position of ION 1 in frame 1. Then in frame 2 it would check the same ion once more and compare it to the previous position. If it for example goes from + coordinates to - coordinates then I would have a count that adds +1 meaning that it wrapped once. If it goes from - to + I would have it subtract 1. Then by the end of all of the frames I would have a number that could help me identify how I could perform my analysis.
However my coding skills are less than lackluster and I wanted to know how I would go about implementing this? I have essentially gotten the count down, but the comparison between frames is where I am confused. How would I do this comparison?
Thanks in advance
There are a few ways to answer this question. Firstly,
Essentially I was thinking that for each frame in my trajectory, MDAnalysis would check the position of ION 1 in frame 1. Then in frame 2 it would check the same ion once more and compare it to the previous position. If it for example goes from + coordinates to - coordinates then I would have a count that adds +1 meaning that it wrapped once. If it goes from - to + I would have it subtract 1. Then by the end of all of the frames I would have a number that could help me identify how I could perform my analysis.
You could write your own analysis class.
One untested way to do it is prototyped below -- the tutorial goes more into what each method (_prepare, _conclude, etc) does.
from MDAnalysis.analysis.base import AnalysisBase
import numpy as np
class CountWrappings(AnalysisBase):
def __init__(self, universe, select="name NA"):
super().__init__(universe.universe.trajectory)
# these are your selected ions
self.atomgroup = universe.select_atoms(select)
self.n_atoms = len(self.atomgroup)
def _prepare(self):
# self.results is a dictionary of results
self.results.wrapping_per_frame = np.zeros((self.n_frames, self.n_atoms), dtype=bool)
self._last_positions = self.atomgroup.positions
def _single_frame(self):
# does sign change for any element in 2D array?
compare_signs = np.sign(self.atomgroup.positions) == np.sign(self._last_positions)
sign_changes_any_axis = np.any(compare_signs, axis=1)
# _frame_index is the relative index of the frame being currently analyzed
self.results.wrapping_per_frame[self._frame_index] = sign_changes_any_axis
self._last_positions = self.atomgroup.positions
def _conclude(self):
self.results.n_wraps = self.results.wrapping_per_frame.sum(axis=0)
n_wraps = CountWrappings(my_universe, select="name NA CL MG")
n_wraps.run()
print(n_wraps.results.wrapping_per_frame)
print(n_wraps.results.n_wraps)
However, I'm not sure that addresses your actual aim:
I was wanting to correct that by implementing a way to unwrap my wrapped coordinates for ions on the edge of my box.
Are you computing the ion positions relative to anything? Potentially you could add bonds between each ion and the center so that you can use the AtomGroup.unwrap() function. Alternatively, is your data compatible with GROMACS? GROMACS has an unwrapping utility called "nojump" that unwraps atoms jumping across box edges, e.g.
gmx trjconv -f my_trajectory.xtc -s my_topology.gro -pbc nojump -o my_unwrapped_trajectory.xtc
As Lily mentioned, you could write your own analysis to do this or use GROMACS. However, both Lily's example and the GROMACS implementation of 'nojump' fail to account for box size fluctuations under the NPT ensemble (assuming you've used NPT). von Bulow et al. wrote about this widespread problem a couple of years ago. As far as I'm aware, the only implementation of nojump unwrapping that accounts for box size fluctuations is in LiPyphilic (disclaimer: I am the author of LiPyphilic).
Using LiPyphilic, you can unwrap your trajectory like so:
import MDAnalysis as mda
from lipyphilic.transformations import nojump
u = mda.Universe(pdb, xtc)
ions = u.select_atoms('name NA CLA')
u.trajectory.add_transformations(
nojump(
ag=ions,
nojump_x=True,
nojump_y=True,
nojump_z=True)
)
Then, when you do further analysis with your MDAnalysis Universe, the atoms will automatically be unwrapped at each frame.
I have two points to ask about:
1)
I would like to understand what is precisely returned from the np.random.randn from NumPy and torch.randn from PyTorch. They both return a tensor with random numbers from a normal distribution with mean 0 and std 1, hence, a standard normal distribution. However, it is not the same thing as puting x values in the standard normal distribution function here and getting its respective image values y. The values returned by PyTorch and NumPy does not seem like this.
For me, it seems that both np.random.randn and torch.randn from these libraries returns the x values from the functions, not the image y as I calculated below. Is that correct?
normal = np.array([(1/np.sqrt(2*np.pi))*np.exp(-(1/2)*(i**2)) for i in range(-38,39)])
Printing the normal variable shows me something like this.
array([1.10e-314, 2.12e-298, 1.51e-282, 3.94e-267, 3.79e-252, 1.34e-237,
1.75e-223, 8.36e-210, 1.47e-196, 9.55e-184, 2.28e-171, 2.00e-159,
6.45e-148, 7.65e-137, 3.34e-126, 5.37e-116, 3.17e-106, 6.90e-097,
5.52e-088, 1.62e-079, 1.76e-071, 7.00e-064, 1.03e-056, 5.53e-050,
1.10e-043, 8.00e-038, 2.15e-032, 2.12e-027, 7.69e-023, 1.03e-018,
5.05e-015, 9.13e-012, 6.08e-009, 1.49e-006, 1.34e-004, 4.43e-003,
5.40e-002, 2.42e-001, 3.99e-001, 2.42e-001, 5.40e-002, 4.43e-003,
1.34e-004, 1.49e-006, 6.08e-009, 9.13e-012, 5.05e-015, 1.03e-018,
7.69e-023, 2.12e-027, 2.15e-032, 8.00e-038, 1.10e-043, 5.53e-050,
1.03e-056, 7.00e-064, 1.76e-071, 1.62e-079, 5.52e-088, 6.90e-097,
3.17e-106, 5.37e-116, 3.34e-126, 7.65e-137, 6.45e-148, 2.00e-159,
2.28e-171, 9.55e-184, 1.47e-196, 8.36e-210, 1.75e-223, 1.34e-237,
3.79e-252, 3.94e-267, 1.51e-282, 2.12e-298, 1.10e-314])
2) Also, if we ask these libraries that I want a matrix of values from a standard normal distribution, it means that all rows and columns are draw from the same standard distribution? If I want i.i.d distributions in every row, I would need to call np.random.randn over a for loop for each row and then vstack them?
1) Yes, they give you x and not phi(x) since the formula for phi(x) gives the probability density of sampling a value x. If you want to know the probability of getting values in an interval [a,b] you need to integrate phi(x) between a and b. Intuitively, if you look at the function phi(x) you'll see that you're more likely to get values near zero than, say, values near 1.
An easy way to see it, is look at the histogram of the sampled values.
import numpy as np
import matplotlib.pyplot as plt
samples = np.random.normal(size=[1000])
plt.hist(samples)
2) they're iid. Just use a 2d size like so:
samples = np.random.normal(size=[10, 10])
After fitting GARCH model in R and obtain the output, how do I know whether there is any evidence of ARCH effect?
I am not toosure whether I have to check in optimal parameters, Information criteria, Q-statistics on standardized residuals, ARCM LM Tests, Nyblom stability test, Sign Bias Test or Adjusted Pearson Goodness-of-fit test?
I assume I have to check under ARCH LM Tests, and if the p-value is rather high, there is an ARCH effect, am I right?
Thank you
You need to start by looking for second order persistence in the return series itself before going on to fit a GARCH model. Lets work through a quick example of how this will work
Start by getting the return series. Here I will use the quantmod library to load in the data for SPDR S&P 500 ETF or SPY
library(quantmod)
library(PerformanceAnalytics)
rtn<-getSymbols(c('SPY'),return.class='ts')
Next, calculate the return series either yourself or using the Return.calculate function as provided by the PerformanceAnalytics library
Rtn <- diff(log(SPY[,"SPY.Close"])) * 100
#OR
Rtn <- Return.calculate(SPY[,"SPY.Close"], method = c("compound","simple")[2]) * 100
Now, lets have a look at the persistence of the first and second order moments of the series. For second order moments, lets use the squared return series as a proxy.
Plotdata<-cbind(Rtn, Rtn^2)
plot.zoo(Plotdata)
There remains strong first persistence in returns and there is clearly periods of strong second order persistence as seen in the squared returns.
We can now formally start testing for ARCH-effects. A formal test for ARCH effects is LBQ stats on squared returns:
Box.test(coredata(Rtn^2), type = "Ljung-Box", lag = 12)
Box-Ljung test
data: coredata(Rtn^2)
X-squared = 2001.2, df = 12, p-value < 2.2e-16
We can clearly reject the null hypothesis of independence in a given time series. (ARCH-effects)
Fin.Ts also provides the ARCH-LM test for conditional heteroskedasticity in the returns:
library(FinTS)
ArchTest(Rtn)
ARCH LM-test; Null hypothesis: no ARCH effects
data: Rtn
Chi-squared = 722.19, df = 12, p-value < 2.2e-16
This supports the conclusion of the LBQ test that ARCH-effects are present.