Way to reduce geopoints? - geospatial

Does anyone have any handy algorithms that could be used to reduce the number of geo-points?
I am using a list of 2,000,000 postcodes, each of which comes with its own geo-point. I am using them to collect data from an API to be used offline. The program is written in C++.
I have to go through each postcode, calculate a bounding box based on the postcode's location, and then send it to the API, which gives me some data near that postcode.
However, 2,000,000 is a lot to process, and some of the postcodes are next to each other or close enough that they would share some of the same data.
So far I've come up with two ways I could reduce them, but I am not sure if they would work:
1 - The program uses a data structure to record which postcodes overlap, then runs a routine a few times to remove the overlapping ones one by one until we are left with no overlapping postcodes.
2 - Start at the top-left geo-point of the UK and slowly increment it by the rough size of a postcode area until we have covered the entire UK.
Is there an easy way to reduce the number of postcodes so that as few of them overlap as possible, whilst still making sure I get data covering as much of the UK as possible? I was thinking there may be a handy algorithm for this that people use elsewhere.

You can use a quadtree, specifically a quadkey. A quadkey maps the points along a space-filling curve, which is similar to sorting the points into a grid. You can then traverse the grid to search deeper in the tree, or search around a centre point. You could also use a database with a spatial index. It depends how much the data overlaps, but with a quadtree you can choose the size of the grid.
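To make the grid idea concrete, here is a minimal C++ sketch (the question mentions C++): snap every postcode to a grid cell and keep one postcode per cell. The Postcode struct and the cellSizeDeg parameter are illustrative assumptions; a true quadkey would interleave the row/column bits to get the curve ordering, but at a fixed depth the deduplication effect is the same.

    // Reduce points by keeping one postcode per grid cell.
    #include <cmath>
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Postcode {            // illustrative; adapt to your own record
        std::string code;
        double lat, lng;
    };

    std::vector<Postcode> reduceToGrid(const std::vector<Postcode>& points,
                                       double cellSizeDeg) {
        std::unordered_map<std::uint64_t, Postcode> cells;
        for (const Postcode& p : points) {
            // Shift into positive ranges, then derive (row, col) cell indices.
            std::uint64_t row = static_cast<std::uint64_t>(std::floor((p.lat + 90.0) / cellSizeDeg));
            std::uint64_t col = static_cast<std::uint64_t>(std::floor((p.lng + 180.0) / cellSizeDeg));
            std::uint64_t key = (row << 32) | (col & 0xffffffffULL);
            cells.emplace(key, p);   // first postcode seen in a cell wins
        }
        std::vector<Postcode> reduced;
        reduced.reserve(cells.size());
        for (const auto& kv : cells) reduced.push_back(kv.second);
        return reduced;
    }

If cellSizeDeg is chosen close to the size of the bounding box you send to the API, two postcodes falling into the same cell would have fetched largely the same data, so dropping one of them loses very little coverage.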

Related

Custom markers in Excel depending on data point value

I am kind of a newbie with Excel, so I'm asking this question here. I have already checked some videos, but the examples do not fit my needs, so I am wondering if this is even possible.
I have a line chart with 10 data points, and for each data point I am currently using a built-in dot marker. The data values can go from 1 to 5, which corresponds to a net promoter score where 1 is "extremely dissatisfied" and 5 is "extremely satisfied". I would like to have an emoticon represent each value, e.g. a sad face for 1 and a happy face for 5, and have this in a systematic way, so that every chart I generate will display these custom markers depending on the value.
I would go with:
having a separate sheet with a table containing the values and the related images (not sure if this is even possible)
creating some sort of magic formula to link the data value to the image
applying the magic formula to the charts
Is this approach correct? Is this even possible?
Thank you.

How to use excel data to find period

I have three Excel columns of data from an experiment with a pendulum: time, angular displacement, and angular velocity. I was wondering if there is a way in Excel to calculate and then graph the period (and, if possible, display the function for the graph)... I realize it's kind of a dumb question. I'm still new at Excel.
Thanks for any pointers you can give!
In case the Analysis ToolPak is installed, one can use Tools->Data Analysis->Fourier Analysis. If the data is a superposition of harmonic functions (sin,cos), the corresponding frequencies (or inverse periods) will appear as peaks in the Fourier analysis.
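If you want to do the same thing outside Excel, a naive DFT shows the mechanics. Here is a hedged C++ sketch (function and parameter names are my own): it scans the non-DC frequency bins, takes the one with the largest magnitude, and converts it back to a period. It assumes uniform sampling at interval dt; for long series you would use an FFT instead of this O(n^2) loop.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Period (in the units of dt) of the strongest non-DC component.
    double dominantPeriod(const std::vector<double>& samples, double dt) {
        const std::size_t n = samples.size();
        const double pi = 3.14159265358979323846;
        double bestMag = -1.0;
        std::size_t bestK = 1;
        for (std::size_t k = 1; k <= n / 2; ++k) {   // skip DC, stop at Nyquist
            double re = 0.0, im = 0.0;
            for (std::size_t t = 0; t < n; ++t) {
                double angle = 2.0 * pi * k * t / n;
                re += samples[t] * std::cos(angle);
                im -= samples[t] * std::sin(angle);
            }
            double mag = re * re + im * im;
            if (mag > bestMag) { bestMag = mag; bestK = k; }
        }
        // Bin k corresponds to frequency k / (n * dt), hence this period:
        return (n * dt) / static_cast<double>(bestK);
    }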

Excel, Determine where data takes a dive

I'm trying to:
- determine where, in a set of measurement data, the data takes a dive,
- so I can plot a vertical line, and
- plot a horizontal line in the graph.
I have no problem doing the 2nd and 3rd bullet points above on my own, so that's taken care of.
The problem I need help with is the first bullet point: determining WHERE the data takes a dive, i.e. WHERE the data crosses a threshold that basically says, "whatever it is you're measuring is no longer performing as expected."
Here's what I'm doing:
I am taking measurements using a measuring device; the device logs the measurements in its internal memory and lets me download the measurement data to my computer as a CSV when the test session is complete.
I pull that CSV into an XLS and plot the data on a graph. (see attached image)
Here's what I want to do:
If you look at the attached image I would like to find the value where the data DEFINITELY crosses BELOW the horizontal line so I can say, "Here is where the device being tested 'gave up the ghost' and was no longer able to perform as desired."
What the data roughly looks like:
Each measurement set will have the rough look and feel of the attached image but slightly different each time. (because each object I am testing will have roughly the same performance characteristics but they all have their own manufacturing defects and variations.)
The data set for the attached image is a data set of 7000 measurements.
I never really know where the horizontal line will be.
Examples of the data sets I have gotten in the past several tests look like this:
(394 to 0)
(390000 to 0)
(3.88 to 0)
(375000 to 0)
(39.55 to 0)
(59200 to 0)
and each data set will have about 1,000 to 7,000 measurements each.
Here's how I was trying to solve this issue:
I was using SLOPE() and trying to latch onto where the slope of the line took a dive / started working its way toward a vertical drop, the idea being that once the slope becomes really steep (strongly negative), the data MUST be taking a dive. That didn't really work.
I was looking at using STDEV.P() in Excel and feeding it the entire data set. Then I was looking at doing the same thing but feeding it only the first 10, 30, or 60 measurements; but then I thought: we never really know just how many measurements will come through. Then I thought I would use the first 10% of the measurements and feed that to STDEV.P().
Please let me know what you think of this and please let me know of any ideas you may have.
Thanks.
H
Something like this should work to flag when the decay rate increases.
To find what 'direction' your data is going in you need the derivative.
Excel doesn't have a derivative formula but you can set it up pretty easily by using the (change in y)/(change in x) as demonstrated here:
http://faculty.educ.ubc.ca/sanderson/lab/CLFbiom/demo/diff.htm
I would then add a formula which counts how many data rows you have (=COUNTA(A:A) or similar),
then use that to get a step of 10% of your data,
then check the value of the derivative in a cell against the cell 10% further down. If it's still negative (to account for the slight downhill at first) then you'll know the data really is taking a dive.
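As a sketch of that recipe outside Excel, here it is in C++ (vector names and the 10% step come straight from the suggestion above; for noisy data you may want a small negative threshold instead of a strict zero):

    #include <cstddef>
    #include <vector>

    // Returns the index where the dive starts, or y.size() if none found.
    // Assumes x is strictly increasing and x, y have the same length.
    std::size_t findDive(const std::vector<double>& x, const std::vector<double>& y) {
        const std::size_t n = y.size();
        if (n < 2) return n;
        std::vector<double> deriv(n - 1);
        for (std::size_t i = 0; i + 1 < n; ++i)
            deriv[i] = (y[i + 1] - y[i]) / (x[i + 1] - x[i]);   // rise over run

        const std::size_t step = n / 10;   // a step of 10% of the data
        for (std::size_t i = 0; i + step < deriv.size(); ++i) {
            // Negative here and still negative 10% further down: a real
            // dive, not just the slight downhill at the start.
            if (deriv[i] < 0.0 && deriv[i + step] < 0.0)
                return i;
        }
        return n;
    }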
The right way to go about this is to model the data with an unknown discontinuity, something like "if time < break_time then (some constant plus noise) else (decaying exponential)". A maximum likelihood estimation for that model might require iteration or other operations which are clumsy in Excel -- maybe you should consider VB or Python or some other programming language. I.e. choose the tool to fit the problem and not the other way around.
See Seber and Wild, "Nonlinear Regression", for an extensive discussion of models with discontinuities.
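To sketch what fitting such a model looks like, here is a brute-force least-squares version in C++, using "constant before the break, straight line after" as a simple stand-in for the decaying exponential (a full maximum-likelihood fit would need iteration, as noted above). All names are illustrative, and the O(n^2) scan is fine for a few thousand points:

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Try every candidate break b, fit both segments, keep the best fit.
    std::size_t fitBreakPoint(const std::vector<double>& y) {
        const std::size_t n = y.size();
        double bestSse = std::numeric_limits<double>::max();
        std::size_t bestBreak = 0;
        for (std::size_t b = 2; b + 2 < n; ++b) {
            // Segment 1: constant = mean of y[0..b).
            double mean = 0.0;
            for (std::size_t i = 0; i < b; ++i) mean += y[i];
            mean /= static_cast<double>(b);
            double sse = 0.0;
            for (std::size_t i = 0; i < b; ++i)
                sse += (y[i] - mean) * (y[i] - mean);
            // Segment 2: least-squares line over y[b..n), x = index.
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            const double m = static_cast<double>(n - b);
            for (std::size_t i = b; i < n; ++i) {
                sx += i; sy += y[i]; sxx += double(i) * i; sxy += i * y[i];
            }
            double slope = (m * sxy - sx * sy) / (m * sxx - sx * sx);
            double intercept = (sy - slope * sx) / m;
            for (std::size_t i = b; i < n; ++i) {
                double r = y[i] - (slope * i + intercept);
                sse += r * r;
            }
            if (sse < bestSse) { bestSse = sse; bestBreak = b; }
        }
        return bestBreak;   // index where the plateau most likely ends
    }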
If your data can be generally characterized as having:
(A) a more or less flat plateau region, followed by
(B) a downward trending region
then a basic strategy could be to start at the end of the data and march towards the beginning one point at a time, checking to see that the values are increasing. Once they stop increasing, you've found the break point.
The strategy assumes (unwisely?) that the downward trending region is smooth/noiseless. To make the solution more robust to noise, you could compare values that are 5 apart, or 10 apart, or whatever interval works to filter out the noise. Or you could use a moving average.
This strategy could potentially be made more efficient by starting the search somewhere in the middle of the data but still in downward trending portion. If you know (based on experience) that any value that is (say) 0.5X the maximum is in the downward trending portion, you could start the search there.
Hope that helps.
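A C++ sketch of that backwards march, with the comparison gap exposed as a parameter (5, 10, or whatever interval filters the noise; must be at least 1):

    #include <cstddef>
    #include <vector>

    // Assumes y is non-empty and gap >= 1.
    std::size_t findBreakFromEnd(const std::vector<double>& y, std::size_t gap) {
        // March from the end toward the beginning. While we are on the
        // downward-trending tail, earlier values y[i - gap] sit above
        // later values y[i]; the first time that stops holding, we have
        // climbed back onto the plateau.
        for (std::size_t i = y.size() - 1; i >= gap; --i) {
            if (y[i - gap] <= y[i])
                return i;   // values stopped increasing toward the past
        }
        return 0;           // the whole series is downward trending
    }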
It appears as though you want to detect when the slope changes from something near zero to something negative. One way to detect this is to calculate the 2nd derivative of the values (calculate the slope of the slope). The 2nd derivative should be near zero in the flat portion of the data AND in the downward trending portion of the data. It should go negative at the break point. So finding the minimum (most negative) value of the 2nd derivative should locate the break point.
To implement this, you probably will need to filter noise. So calculate the first derivative (slope) over some suitable window of data, e.g. a moving window of say 25 raw values (assuming, for illustration, times in column A and raw values in column B, filled down):
=SLOPE(B2:B26, A2:A26)
Then calculate the second derivative (slope of slope) over a similar moving window of the slope values (assuming the slopes land in column C):
=SLOPE(C2:C26, A2:A26)
Then look for the minimum.
Hope that helps.
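And the same windowed slope-of-slope idea as a C++ sketch; window = 25 mirrors the example above, and the returned index is offset from the raw data by roughly one window per smoothing pass:

    #include <cstddef>
    #include <vector>

    // Least-squares slope of y over [start, start + window), x = index.
    static double windowSlope(const std::vector<double>& y,
                              std::size_t start, std::size_t window) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (std::size_t i = start; i < start + window; ++i) {
            sx += i; sy += y[i]; sxx += double(i) * i; sxy += i * y[i];
        }
        double m = static_cast<double>(window);
        return (m * sxy - sx * sy) / (m * sxx - sx * sx);
    }

    // Assumes y.size() >= 2 * window.
    std::size_t breakBySecondDerivative(const std::vector<double>& y,
                                        std::size_t window) {
        std::vector<double> d1;   // windowed first derivative (slope)
        for (std::size_t i = 0; i + window <= y.size(); ++i)
            d1.push_back(windowSlope(y, i, window));
        std::vector<double> d2;   // windowed second derivative (slope of slope)
        for (std::size_t i = 0; i + window <= d1.size(); ++i)
            d2.push_back(windowSlope(d1, i, window));
        std::size_t best = 0;
        for (std::size_t i = 1; i < d2.size(); ++i)
            if (d2[i] < d2[best]) best = i;   // most negative = break point
        return best;
    }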

tableau waterfall chart with mixed colors

I started working with Tableau and found out how to do waterfall charts (Gantt charts, a rolling sum for the y-axis, and negative values for the length of the bars; see here: tableau tutorial waterfall charts).
Now my question is whether it is possible to split the color by some category. To show what I'm talking about, I set up some waterfalls in Excel/PowerPoint: the first one without a split by category and the second one with a split by category.
I would appreciate any help
It is as simple as dragging the right field to Color, and adjusting the table calculation.
If it's a simple running sum (for the positioning of the bar), you need to edit the table calculation, select Compute using > Advanced..., then drag all fields to addressing. Then you need to sort your data by the maximum of the master field (in your case the one that has Coffee, Coke, ...); ascending or descending doesn't matter.
This way you guarantee that the running sum is being applied to one category at a time (and not one color at a time or something like that).
It's really important to understand the concept of table calculations so you can understand what's going on, how Tableau is calculating stuff. Read this http://onlinehelp.tableausoftware.com/current/pro/online/en-us/help.htm#calculations_tablecalculations_understanding_addressing.html
And if you actually understand what you're doing, it's easier to find a solution. For instance, take this hack of a Gantt chart to make a waterfall: what goes on Rows the chart will understand as the starting point of the bar, and what goes on Size is, well, the size of the bar. You put in negative values because you want the bar to go down.
That being said, dragging a field to Color won't mess with the size, but it will mess with the running sum used to determine the starting points. How to solve it? Make it calculate all the colors together. How to do that? Understand table calculations and you'll reach the solution above.
This is the simple approach; if your database or fields have some peculiarity, this might not work perfectly, and you'll need to explain further so I can try to understand how to solve it.

Optimizing search on map

For one of our clients we are providing a system for retrieving the closest N landmarks from the user's zipcode location.
We have a database of all the available zipcodes (650,000+) with the corresponding coordinates (latitude and longitude), and also all 400+ landmarks in the country.
For now we are using the following process for finding the closest N landmarks:
Retrieve the lat and lng of the selected zipcode
Get the coordinates of all the landmarks
Order them by using a geographic distance formula
Take the closest N+2 landmarks and get the real distance to them using the following process:
check if the distance between the coordinates is stored in the distance cache table
if not, go to a map engine, retrieve the distance, and store it in the cache
Reorder the list and return the first N closest landmarks
The problem is we need to optimize this, both from a database access point of view and in terms of 3rd-party access.
We have tried caching, for every zipcode, the distance to the closest M landmarks, but the table would gain an additional 6 GB of data and would take around 250 days to fill, since a request takes approx. 30 sec.
We were thinking of partitioning the data and grouping close postcodes together, but that would lose the exact distances.
What optimising solutions do you see in this situation?
Thank you.
You could try an iterative approach.
Pick a value to use as your "radius"
Go through all results and pick only the ones within +- radius horizontally and vertically (according to geolocation)
If not enough rows are returned, increase the "radius" and start again
Now perform the distance calculation, using a PriorityQueue to minimise the number of calculations needed to sort and select the required items (a sketch follows below)
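A C++ sketch of that approach, under a few stated assumptions: a flat in-memory list of the 400+ landmarks (so the box filter is a cheap linear scan), an equirectangular approximation for the ranking distance, and std::priority_queue standing in for the PriorityQueue:

    #include <cmath>
    #include <cstddef>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Landmark { double lat, lng; /* id, name, ... */ };

    // Equirectangular approximation; fine for ranking nearby candidates.
    static double approxDistSq(double lat1, double lng1, double lat2, double lng2) {
        double x = (lng2 - lng1) * std::cos((lat1 + lat2) * 0.00872664626); // mean lat, deg to rad
        double y = lat2 - lat1;
        return x * x + y * y;
    }

    std::vector<Landmark> closestN(const std::vector<Landmark>& all,
                                   double userLat, double userLng, std::size_t n) {
        // Steps 1-3: widen the square "radius" until enough rows come back.
        double radius = 0.5;   // degrees; initial guess
        std::vector<std::size_t> candidates;
        while (true) {
            candidates.clear();
            for (std::size_t i = 0; i < all.size(); ++i)
                if (std::fabs(all[i].lat - userLat) <= radius &&
                    std::fabs(all[i].lng - userLng) <= radius)
                    candidates.push_back(i);
            if (candidates.size() >= n || radius > 360.0) break;
            radius *= 2.0;   // not enough rows returned: increase and retry
        }
        // Step 4: max-heap of size n keyed on distance; O(c log n), no full sort.
        std::priority_queue<std::pair<double, std::size_t>> heap;
        for (std::size_t i : candidates) {
            double d = approxDistSq(userLat, userLng, all[i].lat, all[i].lng);
            if (heap.size() < n) heap.push({d, i});
            else if (d < heap.top().first) { heap.pop(); heap.push({d, i}); }
        }
        std::vector<Landmark> result;
        while (!heap.empty()) { result.push_back(all[heap.top().second]); heap.pop(); }
        return result;   // farthest first; reverse for nearest first
    }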
This should be done at the database level. You should use a database with a geographic extension, such as SQL Server 2008 R2, or the excellent open-source choice PostgreSQL with the PostGIS extension. With those you store geographical BLOBs instead of raw coordinates, and there are many built-in functions for geographic calculations that will take care of steps 2 to 5 for you.
I suggest you start here:
http://postgis.refractions.net/
Regards
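To show how much of the pipeline the database can absorb, here is a hedged C++ sketch querying PostGIS through libpq. The landmarks table, geom column, connection string, and coordinates are all hypothetical; the <-> operator is PostGIS's index-assisted nearest-neighbour ordering, which replaces steps 2 to 5 of the original process in a single query:

    #include <cstdio>
    #include <libpq-fe.h>

    int main() {
        PGconn* conn = PQconnectdb("dbname=geo user=app");   // hypothetical
        if (PQstatus(conn) != CONNECTION_OK) {
            std::fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }
        // Ten nearest landmarks to a user location (PostGIS takes lng, lat).
        PGresult* res = PQexec(conn,
            "SELECT name, ST_Distance(geom::geography, "
            "       ST_SetSRID(ST_MakePoint(-0.1278, 51.5074), 4326)::geography) "
            "FROM landmarks "
            "ORDER BY geom <-> ST_SetSRID(ST_MakePoint(-0.1278, 51.5074), 4326) "
            "LIMIT 10;");
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
            for (int i = 0; i < PQntuples(res); ++i)
                std::printf("%s  %s m\n", PQgetvalue(res, i, 0), PQgetvalue(res, i, 1));
        PQclear(res);
        PQfinish(conn);
        return 0;
    }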
