I am looking at a Pythonic implementation of this top-rated accepted answer on GIS SE - Using geohash for proximity searches? - but I am unable to retrieve any matches for my geohash query. Here is the approach I have tried so far.
To run this Minimal, Verifiable, Complete Example (MVCE) you need to download the following files - geohash int
and sortedlist python - and install the Python sortedcontainers package via pip. You also need to have the latest version of Cython installed on your machine in order to wrap the C functionality of geohash-int (note: I am only wrapping what is necessary for this MVCE).
geohash_test.py
# GeoHash is my Cython wrapper of geohash-int C package
from geo import GeoHash
from sortedcontainers import SortedList
import numpy as np
def main():
# Bounding coordinates of my grid.
minLat = 27.401436
maxLat = 62.54858
minLo = -180.0
maxLo = 179.95000000000002
latGrid = np.arange(minLat,maxLat,0.05)
lonGrid = np.arange(minLo,maxLo,0.05)
geoHash = GeoHash()
# Create my own data set of points with a resolution of
# 0.05 in the latitude and longitude direction.
gridLon,gridLat = np.meshgrid(lonGrid,latGrid)
grid_points = np.c_[gridLon.ravel(),gridLat.ravel()]
sl = SortedList()
#Store my grid points in the best resolution possible i.e. 52(First step in accepted answer)
for grid_point in grid_points:
lon = grid_point[0]
lat = grid_point[1]
geohash = geoHash.encode(lat,lon,52)
bitsOriginal = geohash["bits"]
sl.add(bitsOriginal)
#Derive the minimum and maximum value for the range query from method below
minValue,maxValue = getMinMaxForQueryGeoHash(geoHash)
# Do the actual range query with a sorted list
it = sl.irange(minValue,maxValue,inclusive=(False,False))
print(len(list(it)))
def getMinMaxForQueryGeoHash(geoHash):
lonTest = 172.76843
latTest = 61.560745
#Query geohash encoded at resolution 26 because my search area
# is around 10 kms.(Step 2 and 3 in accepted answer)
queryGeoHash = geoHash.encode(latTest,lonTest,26)
# Step 4 is getting the neighbors for query geohash
neighbors = geoHash.get_neighbors(queryGeoHash)
bitsList = []
for key,value in neighbors.items():
bitsList.append(value["bits"])
#Step 5 from accepted answer
bitsList.append(queryGeoHash["bits"])
# Step 6: add 1 to the query geohash and to each of the neighbors
newList = [x+1 for x in bitsList]
joinedList = bitsList + newList
#Step 7: left bit shift these by 26 to reach bit depth 52
newList2 = [x <<26 for x in joinedList]
#Return min and max value to main method
minValue = min(newList2)
maxValue = max(newList2)
return minValue,maxValue
main()
If one were to write this out as pseudocode, here is what I am doing:
Given my bounding box, which is a grid, I store every grid point at the highest resolution possible by computing the geohash for each latitude/longitude pair (this happens to be bit depth 52)
I add each geohash to a sorted list
Then I would like to do a range query by specifying a search radius of 10 km around a specific query coordinate
From the accepted answer, to do this you need the min and max values for the query geohash
I calculate the min and max values in the method getMinMaxForQueryGeoHash
Calculate the query geohash at bit depth 26 (this corresponds to the 10 km search radius)
Calculate the neighbors of the query geohash and create the 18-member array
The 18 members are the 8 neighbors returned from the C method plus the original query geohash; the remaining 9 are obtained by adding 1 to each of those values
Then left bit shift this array by 26 and pass the min and max values to the irange query of the sorted list (see the sketch below)
Bit shift = 52 (maximum resolution) - query geohash precision (26) = 26
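As a minimal sketch of that arithmetic (names are hypothetical; this only restates steps 6-7 above and is not taken from the accepted answer): each 26-bit prefix p covers the 52-bit range from p << 26 up to, but not including, (p + 1) << 26.
# neighbor_prefixes: the query geohash plus its 8 neighbors, each at bit depth 26 (hypothetical variable)
shift = 52 - 26
ranges = [(p << shift, (p + 1) << shift) for p in neighbor_prefixes]
minValue = min(lo for lo, hi in ranges)
maxValue = max(hi for lo, hi in ranges)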
But that query returns nothing. Could somebody explain where I am going wrong?
Using your jargon: for an MVCE you do not need a complex two-language implementation. There are plenty of good, simple implementations of Geohash, some in 100% Python (example). All of them use the Morton curve (example).
Conclusion: try to plug in a pure-Python implementation; first test encode/decode, then test the use of the neighbors(geohash) function.
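For reference, here is a minimal pure-Python sketch of the integer-geohash idea (Morton-style bit interleaving), useful for sanity-checking encode results independently of the Cython wrapper. This is only an illustration and may differ in bit ordering from geohash-int.
def encode_int(lat, lon, bit_depth=52):
    # Alternately halve the longitude and latitude intervals, emitting one bit per step.
    lat_min, lat_max = -90.0, 90.0
    lon_min, lon_max = -180.0, 180.0
    bits = 0
    for i in range(bit_depth):
        if i % 2 == 0:  # even steps refine longitude
            mid = (lon_min + lon_max) / 2
            bit = lon >= mid
            lon_min, lon_max = (mid, lon_max) if bit else (lon_min, mid)
        else:  # odd steps refine latitude
            mid = (lat_min + lat_max) / 2
            bit = lat >= mid
            lat_min, lat_max = (mid, lat_max) if bit else (lat_min, mid)
        bits = (bits << 1) | int(bit)
    return bits
# Nearby points should share a long common prefix of bits:
# print(bin(encode_int(61.560745, 172.76843)), bin(encode_int(61.5608, 172.769)))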
Related
I have a nested loop that has to iterate over a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, each row having an X,Y location in 2D space. A window of length 10 goes through all 1M data rows, one by one, until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D (X-Y) plane.
For each window of 10 points/rows and for every single diameter in r_test, we compare the distance between every point and the remaining 9 points, and if the distance is less than the current r_test value we add 2 to H. Then we calculate H/(N**2) and store it in c_10 at the index corresponding to that diameter of investigation.
For these first 10 points, once the loop has gone through all the diameters in r_test, we read the slope of the fitted line and save it to S_wind[ii]. The first 9 data points therefore have no value calculated for them, so they are set to np.inf to be distinguished later.
Then the window moves one point down the rows and the process repeats until S_wind is complete.
What's a potentially better algorithm to solve this than the one I'm using, in Python 3.x?
Many thanks in advance!
import numpy as np
import pandas as pd
####generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']
####====creating upper and lower bound for the diameter of the investigation circles
x_range =max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range,y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2= 800
r_test = np.arange(r1, r2, 5)
S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])): #### maybe the code runs slower because of using len() instead of a fixed number
c_10 = np.zeros(len(r_test)) +np.inf
H = 0
C = 0
N = 10 ##### maybe I should also remove this
for ind in range(len(r_test)):
for i in range (ii-10,ii):
for j in range(ii-10,ii):
dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2+ (df['Y'][i] - df['Y'][j])**2)
if dd > 0:
H += 1
c_10[ind] = (H/(N**2))
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops are comparing ten 2D points against each other in pairs. That's just dying for using a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single two-dimensional numpy array. You generated the random data in the format I'd find most useful, then converted it into something less immediately useful for numeric operations!
Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or, equivalently, move the H = 0 line so it sits between the for ind and the for i loops in your original code).
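As a small aside (a sketch, not part of the original answer): the two-column extraction can also be done in one step with pandas, pulling the whole array out once before the window loop.
xy = df[['X', 'Y']].to_numpy()  # shape (1000000, 2); do this once, outside the ii loop
points = xy[ii - 10:ii]  # shape (10, 2); replaces the np.hstack line above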
I am trying to run a K-means algorithm to obtain the lowest cost, which results in a KxN matrix. The value of K is determined by the number of clusters the algorithm creates with optimal cost. For example, K=2 would imply 2 clusters (or 2 centroids), while N is the number of features. The K-means is run in a loop for K=1 to 10, and the loop stops when the best optimal cost is obtained for a particular value of K. For example, if the optimal cost is obtained for K=2, the centroid matrix returned would be 2xN. I want to store all the centroids returned by the loop in a list. Please note that on every iteration of the loop the value of K increments by K=K+1, so the centroid matrices returned are of size 1xN, 2xN, 3xN, and so on.
How do I store this in a list such that I can get something like this:
List= [[10,12,13], [[10,20,30],[1,2,3]], [[5,6,9],[4,12,20],[40,50,60]],...
With every loop iteration I return a KxN matrix, which I want to store in a list. I want to access the list later by an index, say List[i], to retrieve the corresponding KxN matrix.
I am mostly working with numpy.
Any suggestions would be a big help.
import numpy as np

N = 5
lst = []
for K in range(1, 11):
    lst.append(np.empty((K, N)))
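A hedged sketch of the same pattern with real centroids, assuming scikit-learn's KMeans and a hypothetical feature matrix X of shape (n_samples, N); the list can then be indexed exactly as asked:
import numpy as np
from sklearn.cluster import KMeans

N = 5
X = np.random.rand(100, N)  # hypothetical data, for illustration only
centroid_list = []
for K in range(1, 11):
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    centroid_list.append(km.cluster_centers_)  # a KxN array for this value of K

print(centroid_list[2].shape)  # (3, 5): the centroids returned by the K=3 run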
Background
I have a NETCDF4 file with grid size 0.125x0.125. The latitudes go from 90 to -90 and longitudes go from 0 to 360. The full table size is therefore 1441 x 2880 (latitudes x longitudes).
I am taking my location coordinates (lat, lon in degrees) and trying to locate the cell I am in...
To work out which cell I am in, I do this:
import math

def GetNearestIndex(arrayIn, value):
    '''
    This function takes an array and a value.
    It finds the item in the array that most closely matches the value, and returns its index.
    '''
    distance = math.fabs(arrayIn[0] - value)
    nearest = 0
    index = 0
    for idx, val in enumerate(arrayIn):
        delta_distance = math.fabs(val - value)
        if delta_distance < distance:
            nearest = val
            index = idx
            distance = delta_distance
    return index
#Lats and Longs arrays from the NETCDF4 dataset
lats = dataset.variables['latitude'][:]
longs = dataset.variables['longitude'][:]
#GetNearestIndex finds the item in the array for which the value most closely matches, and returns the index.
nearestLatitudeIndex = Utilities.GetNearestIndex(lats, myLat)
nearestLongitudeIndex = Utilities.GetNearestIndex(longs, myLon%360)
So given my NETCDF4 dataset, if my location is [31.351621, -113.305864] (lat, lon), I find that I am matched with cell [31.375, 246.75] (lat, lon). Plugging the calculated lat and lon into GetNearestIndex, I then have the "address" (x, y) of the cell in which I am located.
Now that I know in which cell I am closest to, I take the value from the NETCDF file and I can then say something like "The temperature at your location is X".
The problem is, I do not know whether I am doing this correctly, hence my questions:
Questions
How do I correctly determine which cell I am located in and get the x and y indexes?
How can I verify that my calculation is correct?
Is myLon%360 the correct way to convert from myLon to a grid that goes from 0 to 360 degrees? Does the grid cell size not matter?
I don't have time to check/test your approach, but I would use numpy, the library on which netCDF4 is based. geo_idx() below is what I use for a regular grid of lat/lon degrees, with lons between -180 and 180. This approach avoids looping through the whole lat/lon arrays.
import numpy as np
import netCDF4
def geo_idx(dd, dd_array):
    """
    - dd - the decimal degree (latitude or longitude)
    - dd_array - the array of decimal degrees to search.
    Search for the nearest decimal degree in an array of decimal degrees and return the index.
    np.argmin returns the indices of the minimum value along an axis,
    so subtract dd from all values in dd_array, take the absolute value, and find the index of the minimum.
    """
    geo_idx = (np.abs(dd_array - dd)).argmin()
    return geo_idx
To use and test
# to test
in_lat = 31.351621
in_lon = -113.305864
nci = netCDF4.Dataset(infile)
lats = nci.variables['lat'][:]
lons = nci.variables['lon'][:]
# since lons are 0 thru 360, convert to -180 thru 180
converted_lons = lons - (lons.astype(np.int32) // 180) * 360  # integer division, so this also works under Python 3
lat_idx = geo_idx(in_lat, lats)
lon_idx = geo_idx(in_lon, converted_lons)
print(lats[lat_idx])
print(converted_lons[lon_idx])
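Regarding question 3 (a small sketch, not part of the answer above): instead of shifting the dataset's longitudes into -180..180, you could convert the query longitude into the 0..360 frame with the modulo and search the original lons array directly; both approaches should pick the same cell.
lon_idx = geo_idx(in_lon % 360, lons)  # -113.305864 % 360 == 246.694136
print(lons[lon_idx])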
Given 1 billion records containing the following information:
ID x1 x2 x3 ... x100
1 0.1 0.12 1.3 ... -2.00
2 -1 1.2 2 ... 3
...
For each ID above, I want to find the top 10 closest IDs, based on Euclidean distance of their vectors (x1, x2, ..., x100).
What's the best way to compute this?
As it happens, I have a solution to this, involving combining sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/
The gist of it is:
Use sklearn’s k-NN fit() method centrally
But then use sklearn’s k-NN kneighbors() method distributedly
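A rough sketch of that split (hedged: this is not the code from the linked post; sc is an existing SparkContext and vectors is a hypothetical RDD of (id, numpy_vector) pairs). The model is fitted once on the driver, broadcast, and then kneighbors() runs inside mapPartitions:
import numpy as np
from sklearn.neighbors import NearestNeighbors

data = vectors.collect()  # centralize the data for fit()
ids = [i for i, _ in data]
X = np.array([v for _, v in data])
bc_model = sc.broadcast(NearestNeighbors(n_neighbors=11).fit(X))  # 10 neighbors + the point itself
bc_ids = sc.broadcast(ids)

def query(partition):
    rows = list(partition)
    if not rows:
        return
    dist, idx = bc_model.value.kneighbors(np.array([v for _, v in rows]))
    for (rid, _), d, ix in zip(rows, dist, idx):
        # drop the self-match and keep the 10 nearest ids with their distances
        yield rid, [(bc_ids.value[j], float(dj)) for j, dj in zip(ix, d) if bc_ids.value[j] != rid][:10]

top10 = vectors.mapPartitions(query)  # RDD of (id, [(neighbor_id, distance), ...])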
Performing a brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of the k-Nearest Neighbor algorithm, such as the one provided by scikit-learn, then broadcast the resulting arrays of indices and distances and go further.
Steps in this case would be:
1- vectorize the features as Bryce suggested and let your vectorizing method return a list (or numpy array) of floats with as many elements as your features
2- fit your scikit-learn nn to your data:
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)
3- run the trained algorithm on your vectorized data (training and query data are the same in your case)
distances, indices = nbrs.kneighbors(qpa)
Steps 2 and 3 will run on your pyspark node and are not parallelizable in this case. You will need to have enough memory on this node. In my case with 1.5 Million records and 4 features, it took a second or two.
Until we get a good implementation of NN for Spark, I guess we will have to stick to these workarounds. If you'd rather try something new, then go for http://spark-packages.org/package/saurfang/spark-knn
You haven't provided a lot of detail, but the general approach I would take to this problem would be to:
Convert the records to a data structure like a LabeledPoint with (ID, x1..x100) as label and features
Map over each record and compare that record to all the other records (lots of room for optimization here)
Create some cutoff logic so that once you start comparing ID = 5 with ID = 1 you interrupt the computation because you have already compared ID = 1 with ID = 5
Some reduce step to get a data structure like {id_pair: [1,5], distance: 123}
Another map step to find the 10 closest neighbors of each record
You've specified pyspark, and I generally do this type of work using Scala, but some pseudocode for each step might look like this:
# 1. vectorize the features
def vectorize_raw_data(record)
arr_of_features = record[1..99]
LabeledPoint( record[0] , arr_of_features)
# 2,3 + 4 map over each record for comparison
broadcast_var = []
def calc_distance(record, comparison)
# here you want to keep a broadcast variable with a list or dictionary of
# already compared IDs and break if the key pair already exists
# then, calc the euclidean distance by mapping over the features of
# the record and subtracting the values then squaring the result, keeping
# a running sum of those squares and square rooting that sum
return {"id_pair" : [1,5], "distance" : 123}
for record in allRecords:
for comparison in allRecords:
broadcast_var.append( calc_distance(record, comparison) )
# 5. map for 10 closest neighbors
def closest_neighbors(record, n=10)
broadcast_var.filter(x => x.id_pair.include?(record.id) ).takeOrdered(n, distance)
The pseudocode is terrible, but I think it communicates the intent. There will be a lot of shuffling and sorting here, as you are comparing all records with all other records. IMHO, you want to store the key pair/distance in a central place (like a broadcast variable that gets updated, though this is dangerous) to reduce the total number of Euclidean distance calculations you perform.
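For what it's worth, here is a minimal runnable rendering of steps 2-5 in plain PySpark (a sketch only: it omits the broadcast-variable cutoff logic and assumes records is an RDD of (id, numpy_vector) pairs built by the vectorize step):
import numpy as np

def top_k_neighbors(records, k=10):
    # steps 2/3: compare every record with every other record (no pair-level cutoff here)
    pairs = records.cartesian(records).filter(lambda rc: rc[0][0] != rc[1][0])
    # step 4: emit (id, (other_id, distance)) for each ordered pair
    dists = pairs.map(lambda rc: (rc[0][0],
                                  (rc[1][0], float(np.linalg.norm(rc[0][1] - rc[1][1])))))
    # step 5: keep the k smallest distances per id
    return dists.groupByKey().mapValues(lambda ds: sorted(ds, key=lambda t: t[1])[:k])

# neighbors = top_k_neighbors(records)  # RDD of (id, [(other_id, distance), ...])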
For narrow-band processing I want the complex pressure at the peak frequency bin. To find the peak frequency bin I use the frequency with the highest absolute value, within a small range of frequencies.
I have come up with the following code, borrowing heavily from
Use idxmax for indexing in pandas
This seems bulky to me, and hard to generalize. Ideally I hope to be able to make fBins into an array and return many frequencies at once. It's OK to make absMaxIndex into a list, but I can't see the next step.
import numpy as np
import pandas as pd
# Construct fake frequency data on multiple channels
np.random.seed(0)
numF = 1000
f = np.arange(numF) / (numF * 2)
y = np.random.randn(numF, 2) + 1j * np.random.randn(numF, 2)
# Put time series into a DataFrame, indexed by frequency
yFrame = pd.DataFrame(y, index = f)
fBins = 0.1
tol = 0.01
# Find the index of the maximum absolute value within a given frequency window
absMaxIndex = yFrame[(fBins - tol) : (fBins + tol)].abs().idxmax()
# Return the value at this index
value = [yFrame.loc[items[1], items[0]] for items in absMaxIndex.items()]  # .ix/.iteritems() are removed in current pandas
print(value)
value should contain the complex values
[(-2.0946030712061448-1.0585718976053677j), (-2.7396771671895563+0.79204149842297422j)]
These are the values with the largest absolute value in yFrame between 0.09 and 0.11 Hz for each channel.
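A possible way to generalize this to several frequency bins at once (purely a sketch under the same setup; fBinsArray is a hypothetical array of center frequencies):
fBinsArray = np.array([0.1, 0.2, 0.3])  # hypothetical center frequencies
peaks = []
for fc in fBinsArray:
    idx = yFrame[(fc - tol):(fc + tol)].abs().idxmax()
    peaks.append([yFrame.loc[i, c] for c, i in idx.items()])
# one row of complex peak values per frequency bin, one column per channel
peakFrame = pd.DataFrame(peaks, index=fBinsArray, columns=yFrame.columns)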