kd-tree for clustered data - search

I am looking for a kd-tree for clustered data.
I have a large data set and in some areas the data is highly dense.
So I need some "balanced" search.
When I search for the n nearest neighbors of a point next to a dense area, I want
a result that is spatially "balanced": the search result must also contain
data points from the sparse area.
Is there any implementation which can do this job?
Thank you
Jan

A weighted Voronoi diagram is spatially balanced.

Related

Is there a metric that can determine spatial and temporal proximity together?

Given a dataset which consists of geographic coordinates and the corresponding timestamps for each record, I want to know if there's any suitable measure that can determine the closeness between two points by taking the spatial and temporal distance into consideration.
The approaches I've tried so far include computing a distance between the two coordinate values and calculating the time difference separately. But in this case, I need two threshold values, one for the spatial and one for the temporal distance, to determine their overall proximity.
I wanted to know whether there's a single function that can take these values as input together and give a single measure of their proximity. Ultimately, I want to be able to use this measure to cluster similar records together.

How to use excel data to find period

I have three Excel columns of data from an experiment with a pendulum: time, angular displacement, and angular velocity. I was wondering if there is a way in Excel to calculate and then graph the period (and, if possible, display the function for the graph)... I realize it's kinda a dumb question. I'm still new at Excel.
Thanks for any pointers you can give!
In case the Analysis ToolPak is installed, one can use Tools->Data Analysis->Fourier Analysis. If the data is a superposition of harmonic functions (sin,cos), the corresponding frequencies (or inverse periods) will appear as peaks in the Fourier analysis.
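Outside Excel, the same idea (look for the peak in the Fourier spectrum and invert its frequency) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the samples are evenly spaced, the function name `dominant_period` is made up for this example, the pendulum data is synthetic, and a naive O(n^2) DFT is used instead of a real FFT routine.

```python
import cmath
import math

def dominant_period(samples, dt):
    """Estimate the dominant period of evenly spaced samples via a naive DFT."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # remove the constant (zero-frequency) part
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2 + 1):  # skip k = 0, stop at the Nyquist bin
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    frequency = best_k / (n * dt)  # cycles per time unit
    return 1.0 / frequency

# Synthetic pendulum-like angle data (assumed): period 2.0 s, sampled every 0.05 s.
dt = 0.05
angles = [0.3 * math.cos(2 * math.pi * (i * dt) / 2.0) for i in range(400)]
period = dominant_period(angles, dt)
print(round(period, 3))
```

The frequency resolution is 1 / (n * dt), so the estimate is only as fine as the recording is long; a period that does not fit a whole number of times into the record will land between two bins.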

Efficient algorithm for nearest point search in a grid

I am looking for an algorithm that can do efficient search in a grid.
I have a large array which includes all the centroid points (x,y,z).
Now for a given location p = (xp,yp,zp) I want to find the centroid closest to p.
Currently I am doing a brute-force search: for each point p I go through all centroids, calculate the distance to p, and pick the closest one.
I know that an octree or a kd-tree might help, but I am not sure how to tackle it or which one would be better.
I would use a spatial index, such as the kd-tree or quadtree/octree (which you suggested), or maybe an R-tree based solution.
Put all your centroids into the index. Usually you can associate any point in the index with some additional data, so if you need that, you could provide a back-reference into the grid (for example, the grid coordinates).
Finding the nearest point in the index should be very fast. The returned data then allows you to go back into the grid.
In a way, a quadtree/octree is in itself nothing but a discretizing grid that gets finer as the point density increases. The difference from a plain grid is that it is hierarchical and that empty areas are not stored at all.
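To make the "index plus back-reference" idea concrete, here is a minimal kd-tree sketch in plain Python. It is not a tuned implementation: the function names `build_kdtree` and `nearest`, the dictionary node layout, and the 4x4x4 example grid are all assumptions made for illustration. Each stored point carries a payload, here the grid index of the cell it belongs to.

```python
import math

def build_kdtree(items, depth=0):
    """items: list of (point, payload) pairs; point is an (x, y, z) tuple."""
    if not items:
        return None
    axis = depth % 3  # cycle through x, y, z
    items = sorted(items, key=lambda it: it[0][axis])
    mid = len(items) // 2
    return {
        "item": items[mid],
        "axis": axis,
        "left": build_kdtree(items[:mid], depth + 1),
        "right": build_kdtree(items[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return (distance, point, payload) of the stored point closest to target."""
    if node is None:
        return best
    point, payload = node["item"]
    d = math.dist(point, target)
    if best is None or d < best[0]:
        best = (d, point, payload)
    diff = target[node["axis"]] - point[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # The far side can only help if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best

# Hypothetical 4x4x4 grid of unit cells; each centroid carries its grid index as payload.
centroids = [((i + 0.5, j + 0.5, k + 0.5), (i, j, k))
             for i in range(4) for j in range(4) for k in range(4)]
tree = build_kdtree(centroids)
dist, point, cell = nearest(tree, (2.2, 0.9, 3.4))
print(point, cell)
```

The tree is built once and then queried for every location p, which is where the saving over the brute-force loop comes from. For large arrays a library index (for example `scipy.spatial.cKDTree` and its `query` method) is likely the more practical choice.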

Monte Carlo Simulation in Excel for Non-normal Distributions

I would like to simulate the performance of a baseball player. I know his expected performance for every future year and the standard deviations of those performances (based on regression analysis). At first, I was thinking of using the NORMINV(RAND(),REF,REF) function in Excel, but the underlying distribution of baseball players' performances is dramatically right skewed. Is there a way that I can perform this sort of analysis in Excel or some other free or low-cost software? The end goal here is for the simulation to use the right-skewed distribution. Thanks very much.
R has lots of tools for this sort of analysis, though you'd have to look through the docs to figure out how to use them. R is free and open source, including for commercial use.
If you have a cumulative distribution table (that is evenly spaced and sufficiently detailed) then you can easily generate random values from this distribution in Excel by looking up a uniform random number generated by RAND() in your distribution table and taking the corresponding "x-axis" value.
=OFFSET($A$1,MATCH(RAND(),$B$2:$B$102),0)
A1 is the cell just above the table of "x-axis" values.
B2:B102 is the cumulative distribution table.
This is a simplified example. Some small modifications may be needed to handle edge-cases and adjust for biases.
If you have enough empirical data you should be able to create the cumulative distribution table.
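The same table-lookup idea (inverse transform sampling from an empirical cumulative distribution) can be sketched in Python for anyone not tied to Excel. The function names and the observation list below are invented for illustration, and the lookup convention (first row whose cumulative probability is at least the random number) is an assumption that differs slightly from what MATCH does in the formula above.

```python
import bisect
import random

def build_cdf_table(observations):
    """Return the sorted values and their empirical cumulative probabilities."""
    xs = sorted(observations)
    n = len(xs)
    cdf = [(i + 1) / n for i in range(n)]
    return xs, cdf

def sample_from_table(xs, cdf, rng=random):
    """Draw one value: find the first row whose cumulative probability is >= u."""
    u = rng.random()  # uniform in [0, 1), the counterpart of RAND()
    idx = bisect.bisect_left(cdf, u)
    return xs[min(idx, len(xs) - 1)]  # guard against running off the table

# Hypothetical right-skewed observations standing in for real performance data.
observations = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 13]
xs, cdf = build_cdf_table(observations)
rng = random.Random(42)  # fixed seed so runs are repeatable
draws = [sample_from_table(xs, cdf, rng) for _ in range(10000)]
print(sum(draws) / len(draws))
```

This version only ever returns values that were actually observed. Interpolating between neighbouring table rows would smooth the result, at the cost of a little more code.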

K-means clustering of text documents. How to calculate intra and inner similarity?

I am clustering thousands of documents whose vector components are calculated with tf-idf. I use cosine similarity. I did a frequency analysis of the words in each cluster to check how the top words differ. But I'm not sure how to calculate the similarity numerically for this sort of document.
I compute the internal similarity of a cluster as the average similarity of each document to the cluster centroid. If I averaged over document pairs instead, the result would be based on small numbers.
External similarity is calculated as the average similarity over all pairs of cluster centroids.
Am I computing this correctly? My inner similarity averages range from 0.2 (5 clusters, 2000 documents) to 0.35 (20 clusters, 2000 documents), which is probably caused by the documents covering a wide range of computer science topics. Intra ranges from 0.3 to 0.7. Are results like these plausible? I found various ways of measuring this on the Internet and do not know which one to use other than my own idea. I am quite desperate.
Thank you so much for your advice!
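For reference, the two measures described in the question can be written down as follows. This is only a sketch of that description, not an endorsement of it as the right evaluation measure; the function names and the tiny example vectors are made up, and real tf-idf vectors would be much longer and sparse.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Component-wise mean of a cluster's document vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def internal_similarity(cluster):
    """Average similarity of each document to its own cluster centroid."""
    c = centroid(cluster)
    return sum(cosine(v, c) for v in cluster) / len(cluster)

def external_similarity(clusters):
    """Average similarity over all pairs of cluster centroids."""
    cents = [centroid(c) for c in clusters]
    pairs = list(combinations(cents, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Two tiny hypothetical clusters of 3-term document vectors.
clusters = [
    [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],
]
internal = [internal_similarity(c) for c in clusters]
external = external_similarity(clusters)
print(internal, external)
```

A good clustering under these definitions has high internal values and a low external value.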
Using k-means with anything but squared Euclidean distance is risky. It may stop converging, as the convergence proof relies on both the mean update and the distance-based assignment optimizing the same criterion. K-means minimizes squared deviations, not distances!
For a k-means variant that can handle arbitrary distance functions (and has guaranteed convergence), you will need to look at k-medoids.
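As a rough illustration of what that looks like, here is a small alternating k-medoids sketch using cosine distance. It is one simple variant (assign to the nearest medoid, then re-pick each medoid), not the full PAM algorithm; the function names, the naive "first k points" initialisation, and the example vectors are assumptions for this sketch, and it assumes no all-zero vectors.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def k_medoids(points, k, dist, max_iter=100):
    """Alternate between assigning points to medoids and re-picking the medoids."""
    medoids = list(range(k))  # naive initialisation: the first k points (an assumption)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its closest medoid.
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):
            j = min(range(k), key=lambda c: dist(p, points[medoids[c]]))
            clusters[j].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = []
        for c, members in enumerate(clusters):
            if not members:  # keep the old medoid if a cluster ends up empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(
                members,
                key=lambda m: sum(dist(points[m], points[o]) for o in members)))
        if new_medoids == medoids:  # nothing changed, so we are done
            break
        medoids = new_medoids
    return medoids, clusters

# Hypothetical 3-term document vectors forming two obvious groups.
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
        [0.1, 0.9, 0.0], [0.8, 0.2, 0.0], [0.0, 0.8, 0.2]]
medoids, clusters = k_medoids(docs, 2, cosine_distance)
print(medoids, clusters)
```

Because a medoid is always one of the actual documents, any distance function can be plugged in as `dist`. The price is the quadratic cost of the update step within each cluster.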
