documentation for spatstat default.ngrid function missing - spatstat

I'm using the ppm function in spatstat and looking for documentation on the default dummy points used in the quadrature scheme. The default.dummy help page says "If nd is missing, a default value (depending on the data pattern X) is computed by default.ngrid." I am now looking for information on default.ngrid but cannot find anything. For context, I am comparing null random models to models with covariates, where the points are coordinates of class ppp and the covariates are pixel images in a window. To generate the models, I'm using the ppm function with default parameters, as in:
ppm0 <- ppm(myPoints_ppp ~ 1)
ppm1 <- ppm(myPoints_ppp ~ covariate_A)
I'm trying to understand how the dummy points are being generated since I did not specify the nd argument. Thanks!

That reference is out of date. The default number of dummy points is determined by default.n.tiling. This is an internal, undocumented function.
The source code for default.n.tiling can be consulted to find out the exact rules, but here is a sketch:
Currently the default minimum number of grid points in each dimension is the greater of
10 * ceiling(2 * sqrt(npoints(X))/10)
and
spatstat.options('ndummy.min')
This determines a minimum acceptable number of dummy points. With the default value of spatstat.options('ndummy.min') = 32 this means that any point pattern with no more than 225 points will be given a minimum 32 x 32 grid of dummy points. A pattern with 226 to 400 points will have 40 x 40 dummy points. A pattern with 401 to 625 points will have 50 x 50 dummy points, and so on.
The final number of dummy points is determined by applying some other constraints (which depend on the context and user-specified parameters) and may be greater than the minimum number specified above.
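The minimum-grid rule sketched above can be written out in a few lines. This is a Python transcription of the rule as stated in this answer, not of the spatstat source; the function name min_dummy_grid is made up for illustration, and the ndummy_min argument stands in for spatstat.options('ndummy.min') with the default value of 32 quoted above.

```python
import math

def min_dummy_grid(n_points, ndummy_min=32):
    """Minimum number of dummy grid points per dimension (hypothetical helper)."""
    # Round 2 * sqrt(n) up to the next multiple of 10 ...
    from_data = 10 * math.ceil(2 * math.sqrt(n_points) / 10)
    # ... then enforce the configured floor (assumed default: 32).
    return max(from_data, ndummy_min)

if __name__ == "__main__":
    for n in (100, 225, 226, 400, 401, 625):
        side = min_dummy_grid(n)
        print(f"{n} data points -> at least {side} x {side} dummy points")
```

As the answer notes, the grid actually used may be larger than this minimum once the other constraints are applied.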
This default can always be overruled by simply specifying a non-NULL value for the parameter nd.
I should also point out that the default settings are chosen to produce an acceptable result quickly, rather than to produce a highly accurate result. This is necessary because of CRAN's limits on the total time taken to check the package. If you're doing any serious methodological research, you should consider increasing spatstat.options('ndummy.min') or creating your own quadrature schemes. This will also improve the reproducibility of your results (since the defaults in spatstat could change).


Converting intensities to probabilities in ppp

Apologies for the overlap with existing questions; mine is at a more basic skill level. I am working with very sparse occurrences spanning very large areas, so I would like to calculate probability at pixels using the density.ppp function (as opposed to relrisk.ppp, where specifying presences+absences would be computationally intractable). Is there a straightforward way to convert density (intensity) to probabilities at each point?
Maxdist=50
dtruncauchy=function(x,L=60) L/(diff(atan(c(-1,1)*Maxdist/L)) * (L^2 + x^2))
dispersfun=function(x,y) dtruncauchy(sqrt(x^2+y^2))
n=1e3; PPP=ppp(1:n,1:n, c(1,n),c(1,n), marks=rep(1,n));
density.ppp(PPP,cutoff=Maxdist,kernel=dispersfun,at="points",leaveoneout=FALSE) # convert to probabilities?
Thank you!!
I think there is a misunderstanding about fundamentals. The spatstat package is designed mainly for analysing "mapped point patterns", datasets which record the locations where events occurred or things were located. It is designed for "presence-only" data, not "presence/absence" data (with some exceptions).
The relrisk function expects input data about the presence of two different types of events, such as the mapped locations of trees belonging to two different species, and then estimates the spatially-varying probability that a tree will belong to each species.
If you have 'presence-only' data stored in a point pattern object X of class "ppp", then density(X, ....) will produce a pixel image of the spatially-varying intensity (expected number of points per unit area). For example, if the spatial coordinates were expressed in metres, then the intensity values are "points per square metre". If you want to calculate the probability of presence in each pixel (i.e. for each pixel, the probability that there is at least one presence point in the pixel), you just need to multiply the intensity value by the area of one pixel, which gives the expected number of points in the pixel. If pixels are small (the usual case) then the presence probability is approximately equal to this value. For physically larger pixels the probability is 1 - exp(-m), where m is the expected number of points in the pixel (this assumes the points follow a Poisson process).
Example:
X <- redwood
D <- density(X, 0.2)
pixarea <- with(D, xstep * ystep)
M <- pixarea * D
p <- 1 - exp(-M)
Then M and p are images which should be almost equal, and both can be interpreted as the probability of presence.
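To see why the two images agree for small pixels, here is a small numerical sketch in Python (not spatstat code). The function name presence_probability and the example numbers are made up for illustration; the formula is the 1 - exp(-m) expression given above.

```python
import math

def presence_probability(intensity, pixel_area):
    """Probability of at least one point in a pixel, given intensity (points per unit area)."""
    expected = intensity * pixel_area  # m: expected number of points in the pixel
    return -math.expm1(-expected)      # equals 1 - exp(-m), computed accurately for small m

if __name__ == "__main__":
    intensity = 2.0  # hypothetical intensity value
    for area in (0.001, 0.01, 0.1, 1.0):
        m = intensity * area
        print(f"area={area}: m={m:.4f}, p={presence_probability(intensity, area):.4f}")
```

For small m the two columns are nearly identical; for large pixels m can exceed 1 while p stays below 1, which is why the exponential form is the safer one to report as a probability.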
For more information see Chapter 6 of the spatstat book.
If, instead, you had a pixel image of presence/absence data, with pixel values equal to 1 or 0 for presence or absence respectively, then you can just use the function blur in the spatstat package to perform kernel smoothing of the image, and the resulting pixel values are presence probabilities.

Quasi-Monte-Carlo vs. variable dimensionality?

I've been looking through the Matlab documentation on using quasi-random sampling of N-dimensional unit cubes. This represents a problem with N stochastic parameters. Based on the fact that it is a unit cube, I presume that I need to use the inverse CDF of each parameter to map from the [0,1] domain to the value range of each parameter.
I would like to try this on a problem for which I now use Monte Carlo. Unfortunately, the problem I'm analyzing does not have a fixed number of dimensions. For each instantiation of the problem, I generate a variable number of widgets (say) using a Poisson distribution. Only after that do I randomly generate the parameters for each widget. That whole process yields one instance of the problem to be analyzed, so the number of parameters varies from one instance to the next.
Is this kind of problem still amenable to Quasi-Monte-Carlo?
What I did once was take the highest possible dimension d of the problem, generate a Sobol sequence in d dimensions, and use however many points were necessary for a particular sampling. I would say it helped somewhat...
From talking to a much smarter colleague, we need to consider the various combinations of widget counts for each widget type. For example, if we have 2 of widget type #1, 4 of widget type #2, 1 of widget type #3, etc., that constitutes one combination. QMC can be applied to that one combination. We are assuming that the number of widget #i is independent of the number of widget #j for i ≠ j, so the probability of each combination is just the product of p(2 widgets of type #1), p(4 widgets of type #2), p(1 widget of type #3), etc. The individual probabilities are easy to get from their Poisson distributions (or their flat distributions, or whatever distribution is being used). If there are N widget types, this is just a joint PMF in N-space. This probability is then used to weight the QMC result for that particular combination. Note that even when the exact combination is nailed down, QMC is still needed because each widget is associated with 3 stochastic parameters.
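A minimal Python sketch of this weighting scheme follows, under several assumptions that are not in the original posts: widget counts are Poisson and truncated at a hypothetical max_count, a hand-rolled Halton sequence stands in for whatever QMC generator is actually used (Sobol in Matlab, for instance), and model is a placeholder for the user's own function of the counts and the uniform sample point.

```python
import math
from itertools import product

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]  # one base per Halton dimension

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

def van_der_corput(i, base):
    """i-th term (i >= 1) of the van der Corput sequence in the given base."""
    x, f = 0.0, 1.0 / base
    while i > 0:
        x += (i % base) * f
        i //= base
        f /= base
    return x

def halton(n_points, dim):
    if dim > len(PRIMES):
        raise ValueError("this sketch only supports up to %d dimensions" % len(PRIMES))
    return [[van_der_corput(i, PRIMES[d]) for d in range(dim)]
            for i in range(1, n_points + 1)]

def qmc_mean(model, counts, params_per_widget=3, n_points=256):
    """QMC average of model(counts, u) for one fixed combination of widget counts."""
    dim = params_per_widget * sum(counts)
    if dim == 0:
        return model(counts, [])
    # Each u lies in the unit cube; model() is expected to apply the inverse CDFs.
    return sum(model(counts, u) for u in halton(n_points, dim)) / n_points

def weighted_estimate(model, rates, max_count):
    """Sum of per-combination QMC results, weighted by the joint Poisson PMF.

    Truncating each count at max_count is an assumption of this sketch.
    """
    total = 0.0
    for counts in product(range(max_count + 1), repeat=len(rates)):
        weight = math.prod(poisson_pmf(k, lam) for k, lam in zip(counts, rates))
        total += weight * qmc_mean(model, counts)
    return total

if __name__ == "__main__":
    # Toy model: sum of all (uniform) widget parameters.
    print(weighted_estimate(lambda counts, u: sum(u), rates=[0.5, 0.3], max_count=2))
```

The dimension of each QMC run is fixed once the combination is fixed, which is what makes the variable-dimension problem tractable; the cost is that the number of combinations grows quickly with the number of widget types and the truncation point.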

approximate histogram for streaming string values (card catalog algorithm?)

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal values for the counts, varying the bin width as necessary to keep the counts even. In the literature, these are sometimes called equi-height, or equi-depth histograms.
I'm not looking to do the usual word-count bar chart. I'm looking for something more like an old-fashioned library card catalog where you have a set of drawers (bins), and one might hold SAM-SOLD, and the next bin SOLE-STE, while all of Y-ZZZ fits in a single bin. I want to calculate where to put the cutoffs for each bin.
Is there (A) a known algorithm for this, similar to approximate histograms for numeric values? or (B) suggestions on how to encode the strings in a way that a standard numeric histogram algorithm would work. The algorithm should not require prior knowledge of string population.
The best way I can think to do it so far is to simply wait until I have some reasonable amount of data, then form logical bins by:
number_of_strings / bin_count = number_of_strings_in_each_bin
Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.
This has two weaknesses for my use-case. First, it requires two iterations over a potentially very large number of strings, one for the count, one to find the endpoints. More importantly, a good histogram implementation can give an estimate of where in a bin a value falls, and this would be really useful.
Thanks.
If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.
This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
Of course, if you can make some assumptions about the data, here are some that might help:
For example, you might not know the exact size, but you might know that it will fall in some interval [a, b]. If you want at most n bins, make the bin size b/n (with a bin size of a/n you would get at least n bins).
Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element and storing it in an array, where m is something reasonable based on context.
Then, to find the bin endpoints, you'd take the elements at multiples of size/n/m in your array.
The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size, from a population of an unknown size. See Wikipedia for more details. Reservoir sampling provides a random sample regardless of whether the stream is ordered or not.
We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.
For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges to provide an even number of results in each range (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall within two values: If there are 100 equal sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population falls between those values.
This approach has the advantage of working for any Comparable data type.
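The approach can be sketched roughly as follows. This is a Python illustration rather than the Java/Guava code described above: a sorted list of boundaries with the standard bisect module stands in for the RangeMap, and all function names are made up for the example.

```python
import bisect
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of up to k items from a stream of unknown length (Algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

def bin_boundaries(sample, n_bins):
    """Lower endpoints of bins 1..n_bins-1, chosen so each bin holds about the same share."""
    ordered = sorted(sample)
    step = len(ordered) / n_bins
    return [ordered[int(i * step)] for i in range(1, n_bins)]

def bin_index(boundaries, value):
    """0-based index of the bin that value falls into."""
    return bisect.bisect_right(boundaries, value)

def estimate_fraction(boundaries, low, high):
    """Rough share of the population lying between low and high."""
    n_bins = len(boundaries) + 1
    return (bin_index(boundaries, high) - bin_index(boundaries, low)) / n_bins

if __name__ == "__main__":
    words = (f"{a}{b}{c}" for a in "abcdefgh" for b in "aeiou" for c in "xyz")
    sample = reservoir_sample(words, 40, random.Random(0))
    cuts = bin_boundaries(sample, 4)
    print(cuts, estimate_fraction(cuts, "b", "g"))
```

Because it only relies on ordering, the same code works for any values that can be compared and sorted, which mirrors the Comparable point made above.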

How to properly clamp beckmann distribution

I am trying to implement a Microfacet BRDF shading model (similar to the Cook-Torrance model) and I am having some trouble with the Beckmann Distribution defined in this paper: https://www.cs.cornell.edu/~srm/publications/EGSR07-btdf.pdf
Where M is the microfacet normal, N is the macrofacet normal and ab is a "hardness" parameter in the range [0, 1].
My issue is that this distribution often returns obscenely large values, especially when ab is very small.
For instance, the Beckmann distribution is used to calculate the probability of generating a microfacet normal M per this equation:
A probability has to be in the range [0, 1], so how is it possible to get a value within this range using the function above if the Beckmann distribution gives me values that are 1000000000+ in size?
So is there a proper way to clamp the distribution? Or am I misunderstanding it or the probability function? I tried simply clamping it to 1 if the value exceeded 1, but this didn't really give me the results I was looking for.
I had the same question you did.
If you read
http://blog.selfshadow.com/publications/s2012-shading-course/hoffman/s2012_pbs_physics_math_notes.pdf
and
http://blog.selfshadow.com/publications/s2012-shading-course/hoffman/s2012_pbs_physics_math_notebook.pdf
You'll notice it's perfectly normal. To quote from the links:
"The Beckmann Αb parameter is equal to the RMS (root mean square) microfacet slope. Therefore its valid range is from 0 (non-inclusive –0 corresponds to a perfect mirror or Dirac delta and causes divide by 0 errors in the Beckmann formulation) and up to arbitrarily high values. There is no special significance to a value of 1 –this just means that the RMS slope is 1/1 or 45°.(...)"
Also another quote:
"The statistical distribution of microfacet orientations is defined via the microfacet normal distribution function D(m). Unlike F (), the value of D() is not restricted to lie between 0 and 1—although values must be non-negative, they can be arbitrarily large (indicating a very high concentration of microfacets with normals pointing in a particular direction). (...)"
You should search for Self Shadow's Physically Based Shading course notes, which are full of useful material (there is one blog post for each year: 2010, 2011, 2012 and 2013).
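A quick numerical check makes the point concrete: D is a density over solid angle, so it may be far larger than 1 at its peak as long as it integrates correctly. The Python sketch below assumes the usual Beckmann form, D = exp(-tan^2(theta) / ab^2) / (pi * ab^2 * cos^4(theta)), which is how I read the cited paper, and checks that D weighted by cos(theta) integrates to 1 over the hemisphere. The function names and the test value ab = 0.05 are chosen purely for illustration.

```python
import math

def beckmann_d(cos_theta, alpha_b):
    """Beckmann normal distribution, as a function of cos(theta) between M and N."""
    if cos_theta <= 0.0:
        return 0.0
    c2 = cos_theta * cos_theta
    tan2 = (1.0 - c2) / c2
    a2 = alpha_b * alpha_b
    return math.exp(-tan2 / a2) / (math.pi * a2 * c2 * c2)

def projected_integral(alpha_b, steps=100_000):
    """Midpoint-rule integral of D * cos(theta) over the hemisphere of directions."""
    h = (math.pi / 2.0) / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * h
        c = math.cos(theta)
        total += beckmann_d(c, alpha_b) * c * math.sin(theta)
    return 2.0 * math.pi * total * h  # the azimuth contributes a factor of 2 * pi

if __name__ == "__main__":
    alpha_b = 0.05
    print("peak value D(N):", beckmann_d(1.0, alpha_b))
    print("integral of D * cos(theta):", projected_integral(alpha_b))
```

The peak grows like 1 / (pi * ab^2) as ab shrinks, while the integral stays at 1, so clamping D to 1 would break the normalization rather than fix anything.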

skew normal distribution

We have a skew normal distribution with location = 0, scale = 1 and shape = 0; then it is the same as the standard normal distribution with mean 0 and variance 1. But if we change the shape parameter, say to shape = 5, then the mean and variance also change. How can we fix the mean and variance for different values of the shape parameter?
Just look at how the mean and variance of a skew normal distribution are computed and you have the answer. The mean is:
mean = xi + omega * delta * sqrt(2 / pi)
where
delta = alpha / sqrt(1 + alpha^2)
You can see that with xi = 0 (location), omega = 1 (scale) and alpha = 0 (shape) you really do get a standard normal distribution (with mean = 0, standard deviation = 1), because delta is then zero.
If you only change alpha (shape) to 5, you can expect the mean to differ a lot, and to be positive. If you want to hold the mean at zero with a higher alpha (shape), you will have to change other parameters, e.g. omega (scale). The most obvious solution might seem to be setting it to zero instead of 1, which does give a mean of zero.
The mean is then set, but with omega set to zero the variance formula, variance = omega^2 * (1 - 2 * delta^2 / pi), gives a variance of zero for any shape, including 5.
Which is insane :) That cannot be done this way. Instead, alter the value of xi rather than omega to get a mean equal to zero. But first compute the only possible value of omega from the variance formula: for a variance of 1 with alpha = 5, omega^2 = 1 / (1 - 2 * delta^2 / pi).
Then omega should be around 1.605681 (the scale has to be positive).
Getting back to the mean: setting xi + omega * delta * sqrt(2 / pi) = 0, with delta = alpha / sqrt(1 + alpha^2), gives xi = -omega * delta * sqrt(2 / pi), which is about -1.256269.
So, with the following parameters you should get the distribution you intended:
location = -1.256269, scale = 1.605681 and shape = 5.
Please, someone test it, as I might have miscalculated somewhere in the given example.
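As a check on the arithmetic, here is a short Python sketch that solves for the location and scale given a shape and a target mean and variance, using the standard skew normal moment formulas (mean = xi + omega * delta * sqrt(2 / pi), variance = omega^2 * (1 - 2 * delta^2 / pi), delta = alpha / sqrt(1 + alpha^2)). The function name is made up for illustration.

```python
import math

def skew_normal_params(alpha, target_mean=0.0, target_var=1.0):
    """Return (xi, omega) so a skew normal with shape alpha has the target mean and variance."""
    delta = alpha / math.sqrt(1.0 + alpha * alpha)
    # Variance depends only on omega and delta, so solve for omega first.
    omega = math.sqrt(target_var / (1.0 - 2.0 * delta * delta / math.pi))
    # Then shift the location to cancel the skew-induced offset in the mean.
    xi = target_mean - omega * delta * math.sqrt(2.0 / math.pi)
    return xi, omega

if __name__ == "__main__":
    xi, omega = skew_normal_params(5.0)
    print(f"location = {xi:.6f}, scale = {omega:.6f}, shape = 5")
```

For a negative shape the sign of the location flips, while the scale stays positive.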
