How to find the probability distribution of a user's check-ins? - foursquare

I read a paper that mentioned that users' check-in behavior follows a power-law distribution. I want to know how I can calculate this for a user's check-ins.
This is the figure of the probability distribution, and they said:
To obtain this measurement, we calculate the distances between all pairs of POIs that a user has checked in at and plot a histogram (actually a probability density function) over the distance between POIs checked in by the same user. As shown in Figure 2, a significant percentage of POI pairs checked in by the same user appear to be within a short distance, indicating a geographical clustering phenomenon in user check-in activities.
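To make the quoted procedure concrete, here is a minimal Python sketch, assuming each check-in is just a latitude/longitude pair for a POI (the coordinates below are made up). It computes the great-circle distance for every pair of POIs the user has checked in at and plots the empirical density; on real data with many check-ins, switching both axes to log scale makes a power-law tail show up as a roughly straight line.

    import numpy as np
    import matplotlib.pyplot as plt
    from itertools import combinations

    def haversine_km(p1, p2):
        """Great-circle distance in km between two (lat, lon) points in degrees."""
        lat1, lon1, lat2, lon2 = map(np.radians, (*p1, *p2))
        dlat, dlon = lat2 - lat1, lon2 - lon1
        a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
        return 2 * 6371.0 * np.arcsin(np.sqrt(a))

    # Made-up POIs one user has checked in at, as (lat, lon) pairs.
    user_pois = [(40.7128, -74.0060), (40.7306, -73.9866), (40.7580, -73.9855),
                 (40.6892, -74.0445), (40.7484, -73.9857)]

    # Distances between all pairs of POIs checked in at by the same user.
    dists = [haversine_km(a, b) for a, b in combinations(user_pois, 2)]

    # Empirical probability density of the pairwise distances (the histogram
    # described in the quoted text).
    plt.hist(dists, bins=20, density=True)
    plt.xlabel("distance between checked-in POIs (km)")
    plt.ylabel("probability density")
    plt.show()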

Related

Envelope of Lines

It's been a while that I've been stuck on an apparently "simple" problem. My goal is to build the envelope of a set of lines that are "attached" to a curve. Let's say a curve like this:
For the above example I would expect the envelope of lines (whose directions are depicted by arrows and are orthogonal to the edges of the red curve) to be an arc of a circle.
I thought of doing this in two computationally separate ways:
Intersection of consecutive lines: In an ideal smooth world, the envelope of the attached lines is a curve to which the red lines are all tangent. Coming back to the discrete world, I try to obtain the envelope curve by intersecting consecutive lines (for example, the first line with the second line would give the first vertex of the envelope).
Evolute of the red curve: Again, in an ideal smooth world, one can think of such an envelope as the evolute of the red curve (see Evolute - wikipedia). Therefore, all I had to do in addition to the information I already have was to compute the curvature and then build the evolute (naturally I had to use a discrete version of curvature, whose definition you can find here: Discrete Curvature - wikipedia).
Using either of the above approaches I get the following result:
However, finding the "correct arc" depends heavily on the accuracy of the initial data, i.e. the red curve. As soon as the red curve has some noise in its vertices, the envelope is heavily distorted. Here I add a picture where the red curve looks intact (but is not), yet the envelope is distorted:
My Question: How can I rectify this? I believe there should be a numerical approach to solve this issue as I badly need this envelope to be correctly built. I'm a mathematician and am not fully aware of the numerical tricks that might exist in dealing with cases like this.
However, I believe this should be a standard question in the computer graphics community, though I could not find anything properly relevant after months of searching.
It would be great if the solutions were in MATLAB. Please let me know if you would like me to be more precise about any part of this description.
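For concreteness, below is a minimal NumPy sketch of approach 1 (intersecting consecutive normal lines); the post asks for MATLAB, but the idea translates directly. The choice to anchor one normal at each edge midpoint, and the test arc at the end, are illustrative assumptions, not taken from the original post.

    import numpy as np

    def envelope_by_intersection(points):
        """Approximate the envelope of the normal lines of a polyline.

        points : (N, 2) array of ordered vertices of the (red) curve.
        Returns an (N-2, 2) array: the intersection of each pair of
        consecutive normal lines (one normal per edge, anchored at the
        edge midpoint).
        """
        points = np.asarray(points, dtype=float)
        edges = np.diff(points, axis=0)                                 # edge vectors
        tangents = edges / np.linalg.norm(edges, axis=1, keepdims=True)
        normals = np.column_stack([-tangents[:, 1], tangents[:, 0]])    # rotate 90 degrees
        midpoints = 0.5 * (points[:-1] + points[1:])

        vertices = []
        for i in range(len(normals) - 1):
            # Solve midpoints[i] + t * normals[i] == midpoints[i+1] + s * normals[i+1]
            A = np.column_stack([normals[i], -normals[i + 1]])
            b = midpoints[i + 1] - midpoints[i]
            try:
                t, s = np.linalg.solve(A, b)
            except np.linalg.LinAlgError:        # normals exactly parallel
                continue
            vertices.append(midpoints[i] + t * normals[i])
        return np.array(vertices)

    # Quick check on a noiseless circular arc: each chord normal through its
    # midpoint passes through the circle centre, so the computed envelope
    # vertices should all be numerically close to (0, 0).
    theta = np.linspace(0.0, np.pi / 2, 50)
    arc = np.column_stack([np.cos(theta), np.sin(theta)])
    print(envelope_by_intersection(arc)[:3])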
For the line intersection method: yes, because the lines are nearly parallel, any small error in the data defining a line will produce a dramatic error in the intersection points.
I suggest the following:
1. Calculate all the lines.
2. Calculate all intersection points of adjacent lines.
3. Calculate the distances between all adjacent intersection points.
4. Plot the distances in sequence and identify all distances that are more than, say, two standard deviations from the trend line of the distances.
5. If the data is not "too bad", the identified distances should mostly come in pairs, i.e. one "bad" intersecting line causes two "bad" distances.
6. Exclude the "bad" lines and reprocess the remaining intersection points.
The above assumes the granularity of the data is greater where the base curve is curvier.
If the intersection-point distances seem to form two trend lines, especially if they look like two diverging or two converging trend lines, then group the intersecting lines accordingly, plot two envelopes, and take the average of the two as "the envelope". (Or perhaps even more trend lines, if there is a regular error in the data.)
But, if there are signs of regular data errors, then a contextual assessment and analysis of the data source and how it was generated/gathered/measured might be required to correctly determine which data should be excluded.
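As a rough sketch of steps 3-4 above, here is one way to flag gaps between consecutive envelope vertices that sit far from a fitted trend line, assuming the envelope vertices have already been computed (for example with a routine like the one sketched earlier); the two-standard-deviation threshold and the toy data are illustrative.

    import numpy as np

    def flag_bad_gaps(envelope_pts, n_std=2.0):
        """Flag gaps between consecutive envelope vertices whose residual from
        a linear trend fitted to the sequence of gaps exceeds n_std standard
        deviations (steps 3-4 above)."""
        gaps = np.linalg.norm(np.diff(envelope_pts, axis=0), axis=1)
        idx = np.arange(len(gaps))
        trend = np.polyval(np.polyfit(idx, gaps, 1), idx)   # trend line of the gaps
        resid = gaps - trend
        return gaps, np.abs(resid) > n_std * resid.std()

    # Toy example: evenly spaced envelope vertices with one badly placed vertex.
    pts = np.column_stack([np.linspace(0.0, 1.0, 20), np.zeros(20)])
    pts[10, 1] = 0.5                         # the distorted vertex
    gaps, bad = flag_bad_gaps(pts)
    print(np.where(bad)[0])                  # the two "bad" gaps around vertex 10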

what type of model should I use?

I am trying to assess the influence of sex (nominal), altitude (nominal) and latitude (nominal) on corrected wing size (continuous; the residual of wing size regressed on body mass) in an animal species. I treated altitude as a nominal factor because this particular species is mainly distributed at the extremes (low and high) of steep elevational gradients in my study area. I also treated latitude as a nominal fixed factor because I sampled individuals at only three main latitudinal levels (north, center and south).
It has been suggested that I use a linear mixed model for this analysis, specifically with sex, altitude, latitude, sex:latitude, sex:altitude, and altitude:latitude as fixed factors, and collection site (nominal) as a random effect, the latter because of the clustered distribution of the collection sites.
However, I noticed that although the corrected wing size follows a normal distribution, it violates the assumption of homoscedasticity between some altitudinal/latitudinal groups. I tried to use a non-parametric equivalent of factorial ANOVA (ARTool), but I cannot make it run because it does not allow missing data and it requires assessing all possible fixed factors and their interactions. I would appreciate any advice on what type of model I can use given the design of my data, and what software/package I can use to perform the analysis.
Thanks in advance for your kind attention.
Regards,
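For what it's worth, here is a minimal sketch of the suggested design (all main effects and two-way interactions as fixed factors, collection site as a random intercept) using Python's statsmodels. The data frame and column names are made up, and note that MixedLM assumes a single residual variance, so this only illustrates the suggested design rather than addressing the heteroscedasticity.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-in data; in practice, replace this with the real measurements.
    rng = np.random.default_rng(0)
    n = 240
    df = pd.DataFrame({
        "sex":      rng.choice(["F", "M"], n),
        "altitude": rng.choice(["low", "high"], n),
        "latitude": rng.choice(["north", "center", "south"], n),
        "site":     rng.choice([f"site{i}" for i in range(8)], n),
    })
    df["wing_resid"] = rng.normal(0.0, 1.0, n) + 0.3 * (df["sex"] == "M")

    # The suggested design: all main effects and two-way interactions as
    # fixed effects, and a random intercept for collection site.
    model = smf.mixedlm(
        "wing_resid ~ sex * altitude + sex * latitude + altitude * latitude",
        data=df,
        groups=df["site"],
    )
    print(model.fit().summary())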

Google Elevation API resolution range and data sources

What is the minimum and maximum resolution of the elevation data from the Google Elevation API?
"A resolution value, indicating the maximum distance between data points from which the elevation was interpolated, in meters. This property will be missing if the resolution is not known. Note that elevation data becomes more coarse (larger resolution values) when multiple points are passed. To obtain the most accurate elevation value for a point, it should be queried independently." - https://developers.google.com/maps/documentation/elevation/intro
What data sources is it built from? SRTM1 (30 m, US)? SRTM3 (90 m, global)? LiDAR? The National Elevation Dataset (NED)?
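For reference, the resolution value described in the quoted documentation comes back per result in the JSON response. Here is a minimal sketch of querying a single point (as the documentation recommends for the most accurate value); the API key is a placeholder and the coordinates are arbitrary.

    import requests

    # One point per request, as the quoted documentation recommends.
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/elevation/json",
        params={"locations": "36.578581,-118.291994", "key": "YOUR_API_KEY"},
        timeout=10,
    )
    result = resp.json()["results"][0]       # assumes status is OK and a valid key
    print(result["elevation"])               # elevation in metres
    print(result.get("resolution"))          # interpolation distance in metres; may be absent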

Closest Landmark to a Coordinate

I have an app that displays maps. I would like to add a feature where the user can tap a location and have it display information about the nearest landmark.
By landmark, I mean I have a set of predefined objects related to the app with latitude/longitude coordinates.
The app is able to convert X/Y screen coordinates to latitude/longitude.
The app already calculates the distance between two latitude/longitude coordinates.
Therefore, I could, through brute force, run through the list of landmarks and find the closest.
However, knowing this is a problem many applications have to face, I ask whether there is a better technique than brute force for finding the closest "landmark" to a latitude/longitude. Some kind of transform?
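One common alternative to the brute-force scan is a spatial index such as a k-d tree. Since a k-d tree works with Euclidean distance, a convenient trick is to map latitude/longitude to 3-D points on the unit sphere: chord distance is monotonic in great-circle distance, so the chord-nearest landmark is also the great-circle-nearest. A sketch using SciPy, with made-up landmark coordinates:

    import numpy as np
    from scipy.spatial import cKDTree

    def to_unit_xyz(lat_deg, lon_deg):
        """Convert latitude/longitude arrays in degrees to points on the unit sphere."""
        lat, lon = np.radians(lat_deg), np.radians(lon_deg)
        return np.column_stack([np.cos(lat) * np.cos(lon),
                                np.cos(lat) * np.sin(lon),
                                np.sin(lat)])

    # Hypothetical landmark list as (lat, lon) pairs.
    landmarks = np.array([[48.8584, 2.2945], [51.5007, -0.1246], [40.6892, -74.0445]])
    tree = cKDTree(to_unit_xyz(landmarks[:, 0], landmarks[:, 1]))   # build once

    # Each tapped location becomes one O(log n) query instead of a full scan.
    tap = to_unit_xyz(np.array([48.86]), np.array([2.35]))
    dist_chord, idx = tree.query(tap, k=1)
    print(idx[0], landmarks[idx[0]])         # index and (lat, lon) of nearest landmark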

'Probability' of a K-nearest neighbor like classification

I have a small set of data points (around 10) in a 2D space, and each of them has a category label. I wish to classify a new data point based on the existing data point labels and also associate a 'probability' of belonging to any particular label class.
Is it appropriate to label the new point based on the label of its nearest neighbor (like a K-nearest neighbor with K=1)? To get the probability, I want to permute all the labels, in each permutation calculate the minimum distance between the unknown point and the rest, and find the fraction of cases where that minimum distance is less than or equal to the distance that was used to label it.
Thanks
The nearest-neighbour method already uses Bayes' theorem to estimate the probability from the points in a ball containing your chosen K points. There is no need to transform anything: the number of points in that ball belonging to each label, divided by the total number of points in the ball, is already an approximation of the posterior probability of that label. In other words:
P(label|z) = P(z|label)P(label) / P(z) = K(label)/K
This is obtained by applying Bayes' rule to probabilities estimated from a subset of the data. In particular, using:
P(z)·V = K/N (the probability mass of a ball of volume V around z),
so P(z) = K/(N·V),
P(z|label) = K(label)/(N(label)·V) (where K(label) and N(label) are the number of points of that class inside the ball and in the whole sample, respectively),
and
P(label) = N(label)/N.
Therefore, just pick a K, calculate the distances, find the K nearest points, and count their labels: the label counts divided by K give you your probability.
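With uniform weights, this K(label)/K estimate is exactly what a standard KNN implementation reports as its class probabilities. A small illustrative sketch (the data points, labels, and K are made up):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Toy 2-D points and labels standing in for the ~10 labelled points.
    X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [1.0, 1.0],
                  [1.1, 0.9], [0.9, 1.2], [1.2, 1.1], [0.1, 0.2],
                  [1.0, 0.8], [0.3, 0.0]])
    y = np.array(["A", "A", "A", "B", "B", "B", "B", "A", "B", "A"])

    k = 3
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)

    new_point = np.array([[0.2, 0.2]])
    # With uniform weights, predict_proba is exactly K(label)/K: the fraction
    # of the k nearest training points carrying each label.
    print(clf.classes_)                   # ['A' 'B']
    print(clf.predict_proba(new_point))   # [[1. 0.]] here, since all 3 neighbours are 'A'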
Roweis uses a probabilistic framework with KNN in the publication Neighbourhood Components Analysis. The idea is to use a "soft" nearest-neighbour classification, where the probability that a point i uses another point j as its neighbour is defined by
p_ij = exp(-d_ij^2) / sum_{k != i} exp(-d_ik^2),
where d_ij is the Euclidean distance between points i and j.
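A tiny sketch of this "soft" neighbour weighting, assuming the softmax-of-negative-squared-distances form written above, with class probabilities obtained by summing the weights per label; the data and function names are illustrative:

    import numpy as np

    def soft_neighbour_probs(x, X):
        """Weights p_j proportional to exp(-||x - X[j]||^2), as defined above."""
        d2 = np.sum((X - x) ** 2, axis=1)       # squared Euclidean distances
        w = np.exp(-(d2 - d2.min()))            # shift for numerical stability
        return w / w.sum()

    def soft_label_probs(x, X, y):
        """Class probability = total soft-neighbour weight carried by each label."""
        p = soft_neighbour_probs(x, X)
        labels = np.unique(y)
        return labels, np.array([p[y == lab].sum() for lab in labels])

    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
    y = np.array(["A", "A", "B", "B"])
    print(soft_label_probs(np.array([0.3, 0.2]), X, y))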
There are no probabilities built into such a K-nearest-neighbour classification, because it is a discriminative classifier, just like the SVM. Probabilities have to be learned in a post-processing step on held-out data, with a model such as logistic regression:
1. Learn the K-nearest-neighbour classifier.
2. Train a logistic regression on the distance and the average distance to the K nearest neighbours, using validation data.
See the LibSVM article for details.
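The standard form of this post-processing is Platt scaling, i.e. fitting a sigmoid/logistic model to the classifier's scores on held-out data, which is what LIBSVM uses for its probability estimates. Below is a hedged scikit-learn sketch; it calibrates on cross-validation folds rather than on exactly the distance features listed above, and the data and parameters are made up.

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.neighbors import KNeighborsClassifier

    # Toy two-class data standing in for the labelled points (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(1, 0.3, (20, 2))])
    y = np.array(["A"] * 20 + ["B"] * 20)

    # Step 1: the K-nearest-neighbour classifier.
    knn = KNeighborsClassifier(n_neighbors=3)

    # Step 2: calibrate its scores with a logistic (sigmoid) model fitted on
    # held-out folds (Platt scaling).
    calibrated = CalibratedClassifierCV(knn, method="sigmoid", cv=5).fit(X, y)
    print(calibrated.predict_proba([[0.5, 0.5]]))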
Sort the distances to the 10 centres; they could be
1 5 6 ... — one near, others far
1 1 1 5 6 ... — 3 near, others far
... lots of possibilities.
You could combine the 10 distances into a single number, e.g. 1 - (nearest / average) ** p,
but that throws away information.
(Different powers p make the hills around the centres steeper or flatter.)
If your centres are really Gaussian hills though, take a look at
Multivariate kernel density estimation.
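A minimal sketch of that KDE route: fit one Gaussian KDE per class, weight by class frequency, and normalise. The points and labels below are made up.

    import numpy as np
    from scipy.stats import gaussian_kde

    def kde_class_probs(x, X, y):
        """Posterior-style class scores from one Gaussian KDE per class,
        weighted by class frequency and normalised (the multivariate KDE
        route mentioned above). X is (n, 2); gaussian_kde wants (dims, n)."""
        labels = np.unique(y)
        scores = []
        for lab in labels:
            pts = X[y == lab]
            density = gaussian_kde(pts.T)(x.reshape(2, 1))[0]
            scores.append(density * len(pts) / len(X))
        scores = np.array(scores)
        return labels, scores / scores.sum()

    X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],
                  [1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.2, 1.1]])
    y = np.array(["A"] * 4 + ["B"] * 4)
    print(kde_class_probs(np.array([0.4, 0.3]), X, y))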
Added:
There are zillions of functions that go smoothly between 0 and 1,
but that doesn't make them probabilities of something.
"Probability" means either that chance, likelihood, is involved,
as in probability of rain;
or that you're trying to impress somebody.
Added again: scholar.google.com "(single|1) nearest neighbor classifier" gets > 300 hits;
"k nearest neighbor classifier" gets almost 3000.
It seems to me (non-expert) that, out of 10 different ways of mapping k-NN distances to labels,
each one might be better than the 9 others — for some data, with some error measure.
Anyway, you could try asking on stats.stackexchange.com.
The answer is: it depends.
Imagine your labels are people's surnames, and the X, Y coordinates represent some essential characteristics of each person's DNA sequence. Clearly, a closer DNA description increases the probability of sharing the same surname.
Now suppose X, Y are the lat/long of each person's work office. Working close together is not related to sharing a label (surname).
So, it depends on the semantics of your labels and axes.
HTH!
