Spatstat: Cluster processes for interactions in marked point patterns?

I am trying to identify clustering patterns between different marks using point process models.
The multitype Strauss process is the only model I have found so far that can compare interactions between different marks. Is there a clustering model that can be used on multitype point patterns?

No, unfortunately. The spatstat model-fitting function kppm does not currently support fitting multitype cluster processes to multitype point pattern data. This is on the "to do" list.
Other packages such as binspp and lgcp may do what you want.
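For reference, here is what kppm does support at present: fitting a cluster model to an unmarked pattern. A minimal sketch, using the redwood dataset shipped with spatstat:

library(spatstat)
# Fit a Thomas cluster process to the (unmarked) redwood seedlings data.
fit <- kppm(redwood ~ 1, clusters = "Thomas")
fit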

Related

grade separation and shortest path on networks in spatstat

I have a question, not about spatstat code as such, but about a use limitation of spatstat.
When metrics like the pair correlation function and the K-function equivalents are calculated on linear networks, a shortest-path distance is used instead of the Euclidean distance. I have the spatstat book from 2015, and I remember reading somewhere in the text that the shortest-path calculation on networks is not sensitive to grade separations such as flyovers, bridges and underpasses, and that caution should therefore be exercised in selecting the study area, or one should at least be aware of this limitation when interpreting results.
Is there any publication that discusses this grade-separation limitation in detail, and perhaps suggests some workarounds? Or the limitations of the network equivalents in general?
Thank you
The code for linear networks in spatstat can handle networks which contain flyovers, bridges, underpasses and so on.
Indeed the dataset dendrite, supplied with spatstat, includes some of these features.
The shortest-path calculation takes account of these features correctly.
The only challenge is that you can't build the network structure using the data conversion function as.linnet.psp, because it takes a list of line segments and tries to guess which segments are connected at a vertex. In this context it will guess wrongly.
The connectivity information has to be specified somehow! You can use the constructor function linnet to build the network object when you have this information. The connectivity can be edited interactively using clickjoin.
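As a sketch of how that looks (the coordinates and edge list here are made up for illustration):

library(spatstat)
# Four vertices at the corners of the unit square.
V <- ppp(x = c(0, 1, 0, 1), y = c(0, 0, 1, 1), window = owin(c(0, 1), c(0, 1)))
# Each row of the edge matrix joins two vertex indices. The two diagonal
# edges cross geometrically at (0.5, 0.5) but are NOT connected there,
# because no vertex is placed at the crossing: a flyover.
E <- matrix(c(1, 4,
              2, 3), ncol = 2, byrow = TRUE)
L <- linnet(vertices = V, edges = E)
plot(L)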
This is explained briefly on page 713 of the book (which also mentions dendrite).
The networks that can be handled by spatstat are slightly more general than the simple model described on page 711. Lines can cross over without intersecting.
I'm sorry the documentation is terse, but much of this information has been kept confidential until recently (while our PhD students were finishing).

What check should be used to confirm that a point pattern is random?

I have a very basic question. I am not a student of spatial statistics, but for an application I feel that a point pattern on a network is a good approximation for my case. I like the spatstat approach, and to limit myself to this package I would like to ask:
Based on some observations, I have the rate (λ = points per km) of occurrence of a point event on a network. Which check (function/test) in spatstat should I perform to verify that a point pattern generated by rpoislpp is indeed random in nature?
I would be happy if someone could help me with this or direct me to some relevant literature at a beginner level.
Thank you
A standard procedure would be to calculate the (network version of the) K-function of the point pattern dataset, and compare this with the envelopes of K-functions for simulated patterns which are completely random.
If X is your point pattern on a linear network (class lpp) then
plot(envelope(X, nsim=19))
will give a simple instance of these envelopes.
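A slightly fuller, self-contained sketch (using the simplenet network shipped with spatstat and an arbitrary intensity, in place of your own data):

library(spatstat)
# Example data: a homogeneous Poisson pattern on the simplenet network.
X <- rpoislpp(lambda = 5, L = simplenet)
# Envelopes of the network K-function from 19 completely random simulations;
# under complete randomness the empirical curve should stay inside the band.
E <- envelope(X, linearK, nsim = 19)
plot(E)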
For more information see chapter 17 of the book cited.
Since you are using spatstat I would once again recommend our book Spatial Point Patterns: Methodology and Applications with R. Chapter 17 is about point patterns on a linear network. I can assure you that rpoislpp indeed generates points that are random in nature. You can just generate a bunch of samples and look at a plot of the patterns to see that they appear very random.
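For example (intensity and network arbitrary):

# Plot several independent realisations side by side for an eyeball check.
plot(rpoislpp(lambda = 5, L = simplenet, nsim = 4))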

Test multiple algorithms in one experiment

Is there any way to test multiple algorithms at once, rather than running the experiment once for each algorithm and then checking the result? There are a lot of times when I don't really know which one to use, so I would like to test several and get the results (error rates) fairly quickly in Azure Machine Learning Studio.
You could connect the score outputs of multiple algorithms to an 'Evaluate Model' module to evaluate the algorithms against each other.
Hope this helps.
The module you are looking for is the one called "Cross-Validate Model". It splits whatever comes in from the input port (the dataset) into 10 folds, holds out one fold as the "answer", trains a model on the other nine, and repeats this for each fold, returning a set of accuracy statistics measured against the held-out data. The column to look at is "Mean absolute error", which is the average error of the trained models. You can connect whatever algorithm you want to one of the ports, and you will then see the result for that particular algorithm by right-clicking the port that gives the scores.
After that you can assess which algorithm did best. And as a pro tip: you can use Filter Based Feature Selection to see which columns had a significant impact on the result.
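If it helps to see the idea behind the module outside the Studio GUI, here is a minimal k-fold cross-validation sketch in R, with toy data and a plain linear model standing in for whichever algorithm you connect:

set.seed(1)
# Toy regression data standing in for the dataset on the input port.
d <- data.frame(x = runif(100))
d$y <- 2 * d$x + rnorm(100, sd = 0.1)
k <- 10
fold <- sample(rep(1:k, length.out = nrow(d)))
# For each fold: train on the other nine, score the held-out fold.
mae <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = d[fold != i, ])
  pred <- predict(fit, newdata = d[fold == i, ])
  mean(abs(pred - d$y[fold == i]))   # mean absolute error, as in the module
})
mean(mae)   # average error across folds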
You can check section 6.2.4 of the hands-on lab on GitHub, https://github.com/Azure-Readiness/hol-azure-machine-learning/blob/master/006-lab-model-evaluation.md, which focuses on the evaluation of multiple algorithms.

How do I classify this value using a decision tree

Basically my decision tree can't classify a value using the normal algorithm.
I get to a node, and there are two options (say, sunny and windy), but at this node my value is different (for example, rainy).
Are there any methods to deal with this, e.g. change the tree or just estimate based on other data?
I was thinking of assigning the most common value at that node but this is just a guess.
Have you considered fuzzy logic for the rich/poor continuum? As for things that can't be expressed as a continuum, I can't think of a way it can be done. Rainy weather, for example, is so fundamentally different from sunny and windy weather in how we experience and react to it, I'm not sure how you expect a computer (or whatever it is you're writing your decision tree for) to figure out what to do. (Aside from simply having an "I don't know what to do" output state, but I'm assuming you wanted something more meaningful than that.)
The whole point of decision trees is that the options are complete and (hopefully) mutually exclusive.
If they are not, you'll get into trouble. Redefine poor and rich to cover everything (all incomes, all states of mind, ...).
But honestly, interpret such weather examples as what they are: just examples for a concept, not the holy grail of meteorology.
The issue here is that you've learned a decision tree from data that is different from the data you are classifying. More specifically, your decision tree knows only two values (i.e., sunny and windy) for the attribute Weather, but your classification data also allows the value rainy.
Since your decision tree has no observation where the weather was rainy, this value is useless to it. In other words, you have to eliminate this value from your classification.
Short of relearning the tree, the solution is to clean the data before using the decision tree as a classifier.
You have several options (a minimal R sketch of option 1 follows the list):
1. Remove all observations/instances with Weather="rainy" from your data set, because you can't classify them. The disadvantage is that all instances with Weather="rainy" remain unclassified.
2. For all observations/instances with Weather="rainy", remove the value, or rather set it to unknown/null. If your decision tree can handle null values, it can then classify your whole data set; if not, you still have a problem, and you should go for option 3.
3. Relearn your decision tree with Weather = {sunny, windy, rainy}.
4. Replace "rainy" with either "sunny" or "windy"; there are different heuristics for that. In your case, however, this is not an option.
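Here is the promised sketch of option 1, using rpart; all names and values are hypothetical:

library(rpart)
# Toy training data: the tree never sees Weather == "rainy".
train <- data.frame(
  Weather = c("sunny", "sunny", "windy", "windy"),
  Play    = c("yes", "yes", "no", "no"),
  stringsAsFactors = TRUE
)
fit <- rpart(Play ~ Weather, data = train,
             control = rpart.control(minsplit = 2, cp = 0))
newdata <- data.frame(Weather = c("sunny", "rainy", "windy"))
# Option 1: drop observations whose Weather value the tree has never seen,
# then classify only the remaining rows.
known <- newdata$Weather %in% levels(train$Weather)
predict(fit, newdata[known, , drop = FALSE], type = "class")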
You are talking about the "normal algorithm", which is a rather vague statement. I assume you are using a strictly binary rooted decision tree, where each internal node makes a binary split of the data. The condition evaluated at each internal node thus outputs a Boolean value, which splits the data into a left node (true) and a right node (false). In your case, you have a categorical variable Weather with two possible values in the training data, which yields only two possible tests: Weather==sunny or Weather==windy. Hence, the rainy samples will always end up in the right node, since they are neither sunny nor windy.
(The original answer included a picture of such a tree, in which the rainy samples end up classified as not sunny, not windy.)

Understanding Lucene Queries

I am interested in knowing a little more specifically how Lucene queries are scored. In their documentation, they mention the VSM. I am familiar with the VSM, but it seems inconsistent with the types of queries they allow.
I tried stepping through the source code for BooleanScorer2 and BooleanWeight, to no real avail.
My question is: can somebody step through the execution of a BooleanScorer and explain how it combines queries?
Also, is there a way to simply send in several terms and just get the raw tf.idf score for those terms, the way it is described in the documentation?
The place to start is http://lucene.apache.org/java/3_3_0/api/core/org/apache/lucene/search/Similarity.html
I think it clears up your inconsistency: Lucene combines the Boolean model (BM) of information retrieval with the vector space model (VSM) of information retrieval, in that documents "approved" by the BM are scored by the VSM.
The next thing to look at is Searcher.explain, which can give you a string explaining how the score for a (query, document) pair is calculated.
Tracing through the execution of BooleanScorer can be challenging, I think; it's probably easiest to understand BooleanScorer2 first, which uses subscorers such as ConjunctionScorer and DisjunctionSumScorer, and then to think of BooleanScorer as an optimization.
If this is confusing, start even simpler, at TermScorer. Personally, I look at it bottom-up anyway:
A Query creates a Weight that is valid across the whole index: this incorporates the boost, idf, and queryNorm, and even, confusingly, the boosts of any 'outer'/'parent' queries (such as a BooleanQuery) holding the term. This weight is computed a single time.
A Weight creates a Scorer (e.g. TermScorer) for each index segment. For a single term, this scorer has everything it needs in the formula except the document-dependent parts: the within-document term frequency (tf), which it must read from the postings, and the document's length normalization value (norm). This is why TermScorer scores a document as weight * sqrt(tf) * norm. In practice the product is cached for tf values < 32, so that scoring most documents costs a single multiply.
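As a toy back-of-envelope calculation of that formula (in R purely for the arithmetic; all corpus numbers are invented, and the formulas follow the classic DefaultSimilarity javadoc for a one-term query):

numDocs <- 1000; docFreq <- 50                # invented corpus statistics
idf       <- 1 + log(numDocs / (docFreq + 1)) # Lucene's classic idf
boost     <- 1.0
queryNorm <- 1 / sqrt((idf * boost)^2)        # normalises the query weight
weight    <- idf^2 * boost * queryNorm        # computed once per query (idf enters twice)
freq <- 3                 # the term occurs 3 times in this document (from the postings)
tf   <- sqrt(freq)        # classic tf
norm <- 1 / sqrt(20)      # length norm for a 20-term field (stored per document)
weight * tf * norm        # the per-document score TermScorer produces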
BooleanQuery really doesn't do much, except that its scorers are responsible for nextDoc()'ing and advance()'ing the subscorers; when the Boolean model is satisfied, it combines the scores of the subscorers, applying the coordination factor (coord()) based on how many subscorers matched.
In general, it is definitely difficult to trace through how Lucene scores documents, because in all released versions the Scorers are responsible for two things: matching and calculating scores. In Lucene's trunk (http://svn.apache.org/repos/asf/lucene/dev/trunk/) these are now separated, in such a way that a Similarity is basically responsible for all aspects of scoring, and this is kept separate from matching. The API there might be easier to understand, or maybe harder, but at least you can refer to implementations of many other scoring models (BM25, language models, divergence from randomness, information-based models) if you get confused: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/similarities/
