On calculating temperature differences between zipcodes

Prop A. I wrote a zipcode server that gives me 32,000 US zip codes.
Each zipcode has an associated lat-long.
Given 2 zipcodes, I can find the distance between them using their lat-longs.
Prop B. I also wrote a weather server where you can input at most 200 zip codes and it spits out the temperature at each of those zip codes.
Person tells me his zipcode is Z, temperature is T.
He asks me: what's the nearest place to Z where it's at least 10 degrees cooler?
So I get a list of the 200 zip codes nearest to Z, sorted by distance (using Prop A).
I feed that to B and get 200 temperatures.
If none are 10 degrees cooler, I get the next 200 zipcodes and repeat until done.
Problem: This seems quite inefficient and brute-force. I feel there's some physics insight I'm missing. It's not always true that temperatures cool down as you go north and heat up as you go south, so direction doesn't help. Altitude probably does (mountains are cooler than valleys), but zip code data keyed to altitude is hard to find.
Can you guys think of some smarter way to go about this ? Any suggestions appreciated.
Note: The weather data is expensive. You can only hit the weather server a few times, and you can only get 200 temperatures each time. (On the other hand, the distances between any 2 zip codes are precomputed constants, and there is no cost to get them.)
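For reference, here is a minimal sketch of the brute-force loop described above, with hypothetical helpers nearest_zipcodes() standing in for Prop A and get_temperatures() for Prop B (both names are made up):

```python
# Sketch only: nearest_zipcodes() stands in for Prop A (free, precomputed
# distances) and get_temperatures() for Prop B (expensive, at most 200 zip
# codes per call). Both helpers are hypothetical.

def find_cooler_zip(z, t, nearest_zipcodes, get_temperatures, batch=200):
    offset = 0
    while True:
        # Next batch of zip codes, sorted by distance from z (Prop A, free).
        candidates = nearest_zipcodes(z, batch, offset)
        if not candidates:
            return None  # exhausted all zip codes without finding a match
        # One expensive weather call for up to 200 zip codes (Prop B).
        temps = get_temperatures(candidates)
        for zip_code, temp in zip(candidates, temps):
            if temp <= t - 10:
                # Candidates are distance-ordered, so the first hit is the nearest.
                return zip_code, temp
        offset += batch
```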

You could do it by sorting all of the zip codes by temperature and grabbing everything lower than the user's zip code in the sorted list (a binary search finds that cut-off), then doing the same on that subset for distance. This should be reasonably fast: sorting is O(n log n) and the binary search is O(log n), so you won't kill yourself on the sorts.

I agree with the comment from the physics forum that this is not a physics problem, but some insights from physics (or mathematics at least) might indeed be in order. If you are able to cheaply obtain the weather data, you might be able to set up a dataset and perform analysis once to guide your search.
Specifically, record the temperature for each location concurrently. Then, for each location, calculate the change in temperature to each neighboring zip code, associate it with a relative coordinate (i.e. the direction to the neighboring zip), and store this list ordered by temperature. When someone enters a query zip, your algorithm would start with the zip at the top of the list and work its way down. Each non-satisfactory answer is added to a stack. If none of the neighboring zips meets the criteria (in this case 10 degrees cooler), the algorithm starts working through the new stack, repeating the procedure.
I am not a wonderful programmer so I won't give any code, but it seems to me that this would "follow" the natural contours of the temperature map better than a brute-force search, while still giving priority to nearby results. If you set up your initial dataset with several concurrent temperature measurements, you could time-average those for better performance.
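A rough Python sketch of this gradient-following idea, using a priority queue in place of the plain stack described above (the precomputed neighbors structure and the get_temperature() lookup are assumptions):

```python
import heapq

# neighbors[z] is assumed to be a precomputed list of (delta_temp, neighbor_zip)
# pairs for zip z, built once from a concurrent temperature snapshot, so the
# historically coolest directions come first. get_temperature() is the expensive
# live lookup (Prop B). Both are hypothetical stand-ins.

def gradient_search(start_zip, start_temp, neighbors, get_temperature, drop=10):
    frontier = list(neighbors[start_zip])
    heapq.heapify(frontier)           # most promising (coolest delta) neighbors first
    visited = {start_zip}
    while frontier:
        delta, z = heapq.heappop(frontier)
        if z in visited:
            continue
        visited.add(z)
        temp = get_temperature(z)     # expensive live lookup
        if temp <= start_temp - drop:
            return z, temp
        for nbr in neighbors[z]:      # otherwise keep walking down the gradient
            if nbr[1] not in visited:
                heapq.heappush(frontier, nbr)
    return None
```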

This is best suited for Stack Overflow.
Combine the databases.
Write a query: abs(lat - lat_0) + abs(long - long_0) < 2.00 AND temp < temp_0 - 10. That query will take advantage of indexing on your server.
If no results, increase 2.00 by a multiple and repeat.
If there are results, find which is closest. If the closest one is farther than the nearest edge of your bounding box, save that entry, increase 2.00 to that distance, and see if one of those is closer.
This scales and uses the database efficiently.
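A minimal sketch of that expanding-box loop, assuming a SQLite-style table named zips with zip, lat, lon and temp columns (all names invented for illustration):

```python
import sqlite3

# Assumed schema: zips(zip TEXT, lat REAL, lon REAL, temp REAL), indexed on
# lat, lon and temp. The Manhattan-style box test mirrors the query above.

def find_cooler(conn, lat0, lon0, temp0, drop=10):
    box = 2.0
    while box < 90.0:
        rows = conn.execute(
            "SELECT zip, lat, lon, temp FROM zips "
            "WHERE ABS(lat - ?) + ABS(lon - ?) < ? AND temp < ?",
            (lat0, lon0, box, temp0 - drop),
        ).fetchall()
        if rows:
            # Pick the closest candidate; per the note above, if it lies farther than
            # the box edge you would widen the box to that distance and re-check.
            return min(rows, key=lambda r: (r[1] - lat0) ** 2 + (r[2] - lon0) ** 2)
        box *= 2  # no results: grow the box by a multiple and repeat
    return None
```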

Related

How to best store time series data in Elasticsearch?

I regularly have to conduct chemical experiments which result in a huge set of time series data, for example 100 lists of measured fluid concentrations, with an assigned timestamp in microseconds for each measurement.
I would like to track and model each experiment and assign to it multiple lists of (measurement, timestamp) pairs. The measurement lists do not have to be of equal length and can vary greatly; for example, one measurement list could be of length 100 and the next one 4000, depending on the conducted experiment. While at the university lab, I also take notes at different timestamps, which I would also like to attach to those timestamps in the DB (tagged timestamps).
Later on, the full analysis text of the experiment should also be stored.
Is Elasticsearch capable of storing such time series data or measurement lists? Because this is mostly numbers rather than text, I'm a bit hesitant.
Even though I have searched the net for a while, I have not yet found a proper way to set up measurement lists like this.
Any help, ideas and maybe helpful links are highly appreciated!
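To make the data model concrete, one experiment as described above might map to a single document along these lines (all field names are invented for illustration, not a recommendation):

```python
# Illustrative document structure only; field names are made up.
experiment_doc = {
    "experiment_id": "exp-042",
    "analysis_text": "full write-up of the experiment, added later",
    "series": [
        {
            "name": "concentration_fluid_A",
            "points": [
                {"t_us": 0, "value": 0.012},
                {"t_us": 250, "value": 0.015, "note": "tagged timestamp, e.g. lab note"},
            ],
        },
        # further (measurement, timestamp) lists, each of any length
    ],
}
```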

Small data anomaly detection algo

I have the following 3 cases of a numeric metric on a time series (t, t1, t2, etc. denote different hourly comparisons across periods).
If you look at the 3 graphs, t (the period of interest) clearly has a drop-off in image 1 but not so much in images 2 and 3. Assume this is some sort of numeric metric (raw or derived), and I want to create a system/algo which specifically catches case 1 but not cases 2 or 3, with t being the point of interest. While visually this makes sense and is very intuitive, I am trying to design a way to do this in Python using the dataframes shown in the picture.
Generally the problem is how do I detect when the time series is behaving very differently from any of the prior weeks.
Edit: When I say different, what I really mean is: my metric trends together across periods t1 to t4, but if one period doesn't and tries to separate out of the envelope, that to me is an anomaly. If you look at chart 1 you can see t tries to split away from the rest of the tn; this is an anomaly for me. In the other cases t is within the bounds of the other time periods. Hope this helps.
With small data, the best approach is to come up with a good transformation into a simpler representation.
In this case I would try the following:
Distance to the median along the time axis, then a summary of that (could be the median, mean squared error, etc.)
Median of the cross-correlation of the signals
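A rough sketch of both transformations, assuming a DataFrame with one column per period, e.g. columns ["t", "t1", "t2", "t3", "t4"] (column names are assumptions, and a plain Pearson correlation stands in for the cross-correlation):

```python
import pandas as pd

def anomaly_scores(df: pd.DataFrame, col: str = "t") -> dict:
    baseline = df.drop(columns=[col])        # the prior periods t1..tn
    median_curve = baseline.median(axis=1)   # centre of the envelope at each hour
    # 1) distance to the median along the time axis, summarised as an MSE
    mse = ((df[col] - median_curve) ** 2).mean()
    # 2) median correlation of the period of interest against each prior period
    corr = baseline.corrwith(df[col]).median()
    return {"mse_to_median": mse, "median_correlation": corr}

# A large mse_to_median combined with a low median_correlation would flag case 1.
```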

How can I obtain hourly readings from 24 hour moving average data?

I have an excel dataset of 24-hour moving averages for PM10 air pollution concentration levels, and need to obtain the individual hourly readings from them. The moving average data is updated every hour, so at hour t, the reading is the average of the 24 readings from t-23 to t hours, and at hour t+1, the reading is the average of t-22 to t+1, etc. I do not have any known data points to extrapolate from, just the 24-hour moving averages.
Is there any way I can obtain the individual hourly readings for time t, t+1, etc, from the moving average?
The dataset contains data over 3 years, so with 24 readings a day (at every hour), the dataset has thousands of readings.
I have tried searching for a possible way to implement a simple Excel VBA code to do this, but have come up empty. Most of the posts I have seen on Stack Overflow and Stack Exchange, or other forums, involve calculating moving averages from discrete data, which is the reverse of what I need to do here.
The few I have seen involve using matrices, which I am not very sure how to implement.
(https://stats.stackexchange.com/questions/67907/extract-data-points-from-moving-average)
(https://stats.stackexchange.com/questions/112502/estimating-original-series-from-their-moving-average)
Any suggestions would be greatly appreciated!
Short answer: you can't.
Consider a moving average on 3 points, and to keep things simple multiply each MA term by 3, so we really have sums of 3 consecutive values:
Data: a b c d e f g
MA: a+b+c, b+c+d, c+d+e, d+e+f, e+f+g
With initial values, you can do something. To find the value of d, you would need to know b+c, hence to know a (since a+b+c is known). Then to find e, you know c+d+e and d, so you must find c, and since a is already needed, you will also need b.
More generally, for an MA of length n, if you know the first n-1 values (hence also the nth, since you know the sum), then you can find all subsequent values. You can also start from the end. But basically, if you don't have enough original data, you are lost: there is a 1-1 relation between the first n-1 values of your data and the possible MA series. If you don't have enough information, there are infinitely many possibilities, and you can't decide which one is right.
Here I consider the simplest MA where the coefficient of each variable is 1/n (hence you compute the sum and divide by n). But this would apply to any MA, with slightly more complexity to account for different coefficients for each term in the sum.
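To illustrate the argument, here is a small sketch that recovers the full series when the first n-1 original values happen to be known (they are not in the asker's case, which is exactly the problem):

```python
def recover_from_ma(ma, first_values, n):
    """ma[i] is the average of x[i] .. x[i+n-1]; first_values holds x[0..n-2]."""
    x = list(first_values)
    for i, avg in enumerate(ma):
        window_sum = avg * n
        x.append(window_sum - sum(x[i:i + n - 1]))  # the single unknown term in this window
    return x

# Example with n = 3: the averages of (1,2,3), (2,3,4), (3,4,5) are 2, 3, 4.
print(recover_from_ma([2, 3, 4], [1, 2], 3))        # -> [1, 2, 3, 4, 5]
```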

View Collation with Couchbase

We are using couchbase as our nosql store and loving it for its capabilities.
There is however an issue that we are running in with creating associations
via view collation. This can be thought of akin to a join operation.
While our data sets are confidential I am illustrating the problem with this model.
The volume of data is considerable, so it cannot be processed in memory. Let's say we have data on ice creams, zip codes, and the average temperature of the day.
One type of document contains a zipcode to icecream mapping
and the other one has transaction data of an ice-cream being sold in a particular zip.
The problem is to be able to determine a set of top ice-creams sold by the temperature of a given day.
We crunch this corpus with a view that emits two outputs: one is a zip code to temperature mapping, while the other represents an ice cream sale in a zip code:
Key                 Value
[zip1]              temp1
[zip1, ice_cream1]  1
[zip2, ice_cream2]  1
The view collation here is a mechanism to create an association between the ice cream sale, the zip, and the average temperature, i.e. a join.
We have a constraint that the temperature lookup happens only once in 24 hours, when the zip is first seen, and that is the valid average temperature to use for that day. E.g. if a lookup happened at 12:00 pm on Jan 1st, the next lookup does not happen until 12:00 pm on Jan 2nd. However, the average temperature accepted in the 1st lookup is valid only for Jan 1st, and that of the 2nd lookup only for Jan 2nd, including the first half of that day.
Now things get complicated when I want to do the same query with a time component involved, concretely associating the average temperature of a day with the ice creams that were sold on that day in that zip, e.g. x vanilla ice creams were sold when the average temperature for that day was 70 F:
Key                           Value
[y, m, d, zip1]               temp1
[y, m, d, zip2, ice_cream2]   1
[y, m, d2, zip1, ice_cream1]  1
This has an interesting impact on the queries: say I query for the last day, I cannot make any associations between the ice cream and the temperature before the first lookup happens, since that is when the two keys align. The net effect is that I lose the ice cream counts for that day from before that temperature lookup happens. I was wondering if any of you have faced similar issues, and whether you are aware of a pattern or solution so as not to lose those counts.
First, welcome to StackOverflow, and thank you for the great question.
I understand the specific issue that you are having, but what I don't understand is the scale of your data - so please forgive me if I appear to be leading down the wrong path with what I am about to suggest. We can work back and forth on this answer depending on how it might suit your specific needs.
First, you have discovered that CB does not support joins in its queries. I am going to suggest that this is not really an issue when CB is used properly. The conceptual model for how Couchbase should be used to filter out data is as follows:
Create a CB view that is as precise as possible
Select records as precisely as possible from CB using the view
Fine-filter records as necessary in the data-access layer (also perform any joins) before sending them on to the rest of the application.
From your description, it sounds to me as though you are trying to be too clever with your CB view query. I would suggest one of two courses of action:
Manually look up the value that you want, when this happens, with a second view query.
Look up more records than you need, then fine-filter afterward (step 3 above).
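As a hedged illustration of step 3, joining the two view outputs in the data-access layer might look roughly like this (the row shapes mirror the key/value listings above and are otherwise assumptions):

```python
# temp_rows: rows from the temperature view, e.g. ([y, m, d, zip], temp)
# sale_rows: rows from the sales view,       e.g. ([y, m, d, zip, ice_cream], 1)
# Both shapes are assumptions based on the keys shown in the question.

def join_sales_with_temps(temp_rows, sale_rows):
    temps = {tuple(key): temp for key, temp in temp_rows}
    joined = []
    for key, count in sale_rows:
        day_zip = tuple(key[:4])
        temp = temps.get(day_zip)  # None until that day's first temperature lookup exists
        # Keep the count even when the temperature is still missing, so it can be
        # associated later instead of being lost.
        joined.append({"ice_cream": key[4], "temp": temp, "count": count})
    return joined
```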

Statistically removing erroneous values

We have an application where users enter prices all day. These prices are recorded in a table with a timestamp and then used to produce charts of how the price has moved... Every now and then a user enters a price wrongly (e.g. puts in a zero too many or too few), which somewhat ruins the chart (you get big spikes). We've even put in an extra confirmation dialogue if the price moves by more than 20%, but this doesn't stop them entering wrong values...
What statistical method can I use to analyse the values before I chart them to exclude any values that are way different from the rest?
EDIT: To add some meat to the bone. Say the prices are share prices (they are not but they behave in the same way). You could see prices moving significantly up or down during the day. On an average day we record about 150 prices and sometimes one or two are way wrong. Other times they are all good...
Calculate and track the standard deviation for a while. After you have a decent backlog, you can disregard the outliers by seeing how many standard deviations away they are from the mean. Even better, if you've got the time, you could use the info to do some naive Bayesian classification.
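A minimal sketch of that idea, assuming a plain list of prices arriving in order (the threshold k and the window size are arbitrary choices):

```python
import statistics

def filter_outliers(prices, k=3, window=150):
    """Drop entries more than k standard deviations from the mean of recent history."""
    kept = []
    for price in prices:
        history = kept[-window:]
        if len(history) >= 10:                      # wait for a decent backlog first
            mu = statistics.mean(history)
            sd = statistics.stdev(history)
            if sd > 0 and abs(price - mu) > k * sd:
                continue                            # disregard the outlier
        kept.append(price)
    return kept
```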
That's a great question but may lead to quite a bit of discussion as the answers could be very varied. It depends on
how much effort are you willing to put into this?
could some answers genuinely differ by +/-20% (or whatever test you invent)? So will there always be a need for some human intervention?
and to invent a relevant test I'd need to know far more about the subject matter.
That being said the following are possible alternatives.
A simple test against the previous value (or the mean/mode of the previous 10 or 20 values) would be straightforward to implement
The next level of complexity would involve some statistical measurement of all values (or the previous x values, or the values of the last 3 months); a normal or Gaussian distribution would enable you to give each value a degree of certainty as to whether it is a mistake or accurate. This degree of certainty would typically be expressed as a percentage.
See http://en.wikipedia.org/wiki/Normal_distribution and http://en.wikipedia.org/wiki/Gaussian_function there are adequate links from these pages to help in programming these, also depending on the language you're using there are likely to be functions and/or plugins available to help with this
A more advanced method could be to have some sort of learning algorithm that takes other parameters into account (on top of the last x values); a learning algorithm could take the product type or manufacturer into account, for instance, or even monitor the time of day or the user that has entered the figure. This option seems way over the top for what you need, however; it would require a lot of work to code it and also to train the learning algorithm.
I think the second option is the correct one for you. Using standard deviation (a lot of languages have a function for this) may be a simpler alternative: it is simply a measure of how far a value has deviated from the mean of the x previous values. I'd put the standard deviation option somewhere between options 1 and 2.
You could measure the standard deviation in your existing population and exclude those that are greater than 1 or 2 standard deviations from the mean?
It's going to depend on what your data looks like to give a more precise answer...
Or graph a moving average of prices instead of the actual prices.
Quoting from here:
Statisticians have devised several methods for detecting outliers. All the methods first quantify how far the outlier is from the other values. This can be the difference between the outlier and the mean of all points, the difference between the outlier and the mean of the remaining values, or the difference between the outlier and the next closest value. Next, standardize this value by dividing by some measure of scatter, such as the SD of all values, the SD of the remaining values, or the range of the data. Finally, compute a P value answering this question: If all the values were really sampled from a Gaussian population, what is the chance of randomly obtaining an outlier so far from the other values? If the P value is small, you conclude that the deviation of the outlier from the other values is statistically significant.
Google is your friend, you know. ;)
For your specific question of plotting, and your specific scenario of an average of 1-2 errors per day out of 150, the simplest thing might be to plot trimmed means, or the range of the middle 95% of values, or something like that. It really depends on what value you want out of the plot.
If you are really concerned with the true max and true min of a day's prices, then you have to deal with the outliers as outliers, and properly exclude them, probably using one of the outlier tests previously proposed (a data point is x% more than the next point, or than the last n points, or more than 5 standard deviations away from the daily mean). Another approach is to look at what happens after the suspect point: if it is an outlier, it will show a sharp upturn followed by a sharp downturn.
If, however, you care about the overall trend, plotting the daily trimmed mean, median, and 5% and 95% percentiles will portray the history well.
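For example, a daily summary along those lines might be computed like this (assuming a DataFrame with a timestamp index and a "price" column; the column names are assumptions):

```python
import pandas as pd
from scipy import stats

def daily_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Daily trimmed mean, median and 5%/95% percentiles of a 'price' column."""
    daily = df["price"].groupby(df.index.date)
    return pd.DataFrame({
        "trimmed_mean": daily.apply(lambda s: stats.trim_mean(s, 0.05)),  # drop 5% from each tail
        "median": daily.median(),
        "p05": daily.quantile(0.05),
        "p95": daily.quantile(0.95),
    })
```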
Choose your display methods and how much outlier detection you need based on the analysis question. If you care about medians or percentiles, the outliers are probably irrelevant.