Use latitude & longitude to determine which "cell" I am located in (NETCDF4) - python-3.x

Background
I have a NETCDF4 file with grid size 0.125x0.125. The latitudes go from 90 to -90 and longitudes go from 0 to 360. The full table size is therefore 1441 x 2880 (latitudes x longitudes).
I am taking my location coordinates (lat lon in degrees) and trying to locate which cell I am in...
To calculate which cell I am in, I do this:
import math

def GetNearestIndex(arrayIn, value):
    '''
    This function takes an array and a value.
    It finds the item in the array that most closely matches the value,
    and returns the index.
    '''
    distance = math.fabs(arrayIn[0] - value)
    index = 0
    for idx, val in enumerate(arrayIn):
        delta_distance = math.fabs(val - value)
        if delta_distance < distance:
            index = idx
            distance = delta_distance
    return index

# Lats and Longs arrays from the NETCDF4 dataset
lats = dataset.variables['latitude'][:]
longs = dataset.variables['longitude'][:]
# GetNearestIndex finds the item in the array that most closely matches the value, and returns the index.
nearestLatitudeIndex = Utilities.GetNearestIndex(lats, myLat)
nearestLongitudeIndex = Utilities.GetNearestIndex(longs, myLon % 360)
So given my NETCDF4 dataset, if my location is [31.351621, -113.305864] (lat, lon), I find that I am matched with cell [31.375, 246.75] (lat, lon). The two indexes returned by GetNearestIndex are then the "address" (x, y) of the cell in which I am located.
Now that I know in which cell I am closest to, I take the value from the NETCDF file and I can then say something like "The temperature at your location is X".
The problem is that I do not know whether I am doing this correctly, hence my questions:
Questions

1. How do I correctly determine which cell I am located in and get the x and y indexes?
2. How can I verify that my calculation is correct?
3. Is myLon % 360 the correct way to convert from myLon to a grid that goes from 0 to 360 degrees? Does the grid cell size not matter?

I don't have time to check / test your approach, but I would use numpy, the library on which netCDF4 is based. geo_idx() below is what I use for a regular grid of lat/lon degrees, with lons between -180 and 180. This approach avoids looping through the whole lat/lon arrays.
import numpy as np
import netCDF4

def geo_idx(dd, dd_array):
    """
    - dd - the decimal degree (latitude or longitude)
    - dd_array - the array of decimal degrees to search.
    Search for the nearest decimal degree in an array of decimal degrees
    and return the index.
    np.argmin returns the indices of the minimum value along an axis,
    so: subtract dd from all values in dd_array, take the absolute value,
    and find the index of the minimum.
    """
    geo_idx = (np.abs(dd_array - dd)).argmin()
    return geo_idx
To use and test:
# to test
in_lat = 31.351621
in_lon = -113.305864
nci = netCDF4.Dataset(infile)
lats = nci.variables['lat'][:]
lons = nci.variables['lon'][:]
# since lons are 0 through 360, convert to -180 through 180
# (integer division: values below 180 are kept, values of 180 and above shift by -360)
converted_lons = lons - (lons.astype(np.int32) // 180) * 360
lat_idx = geo_idx(in_lat, lats)
lon_idx = geo_idx(in_lon, converted_lons)
print(lats[lat_idx])
print(converted_lons[lon_idx])
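As a cross-check (question 2 in the post above): because the grid is perfectly regular, the indexes can also be computed arithmetically instead of searched for. A minimal sketch, assuming the questioner's layout of latitudes descending from 90 and longitudes ascending from 0 in 0.125-degree steps:
def grid_idx(lat, lon, step=0.125):
    # latitudes run 90 .. -90 (descending); longitudes 0 .. 360 - step (ascending)
    n_lons = int(round(360.0 / step))             # 2880 columns
    lat_idx = int(round((90.0 - lat) / step))
    lon_idx = int(round((lon % 360.0) / step)) % n_lons
    return lat_idx, lon_idx

lat_idx, lon_idx = grid_idx(31.351621, -113.305864)
# these should agree with the indexes from geo_idx() / GetNearestIndex()
Any disagreement between the two methods points at an indexing bug (or a tie at a cell boundary).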

Related

What's a potentially better algorithm to solve this python nested for loop than the one I'm using?

I have a nested loop that has to work through a huge amount of data.
Assume a data frame of random values with 1,000,000 rows, each row holding an X, Y location in 2D space. A window of length 10 moves through all 1M rows, one row at a time, until all the calculations are done.
Explaining what the code is supposed to do:
Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different circles of investigation in our 2D (X-Y) plane.
For each 10 points/rows, and for every single diameter in r_test, we compare the distance between every point and the remaining 9 points; if that distance is less than the current r_test value, we add to H (each qualifying pair is counted twice, once in each order). Then we calculate H/(N**2) and store it in c_10 at the index corresponding to that diameter of investigation.
For these first 10 points, once the loop has gone through all the diameters in r_test, we read the slope of the fitted line and save it to S_wind[ii]. The first 10 data points therefore have no value calculated for them and keep np.inf, so they can be distinguished later.
Then the window moves one point down the rows and repeats this process till S_wind is completed.
What's a potentially better algorithm than the one I'm using, in Python 3.x?
Many thanks in advance!
Many thanks in advance!
import numpy as np
import pandas as pd

#### generating input data frame
df = pd.DataFrame(data=np.random.randint(2000, 6000, (1000000, 2)))
df.columns = ['X', 'Y']

#### ====creating upper and lower bound for the diameter of the investigation circles
x_range = max(df['X']) - min(df['X'])
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range, y_range) / 20
d = 2
N = 10  #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2 = 800
r_test = np.arange(r1, r2, 5)

S_wind = np.zeros(len(df['X'])) + np.inf
for ii in range(10, len(df['X'])):  #### maybe the code runs slower because of using len() instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10  ##### maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range(ii - 10, ii):
            for j in range(ii - 10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 1
        c_10[ind] = H / (N**2)
    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]
You can use numpy broadcasting to eliminate all of the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.
The inner loops are comparing ten 2D points against each other in pairs. That's just dying for using a 10x10x2 numpy array:
# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :], points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None] # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)
S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
I'm not very pandas savvy, so there's probably a better way to get the X and Y values into a single 2-dimensional numpy array. You generated the random data in the format that I'd find most useful, then converted into something less immediately useful for numeric operations!
Note that this code matches the output of your loop code. I'm not sure that's actually doing what you want it to do, as there are several slightly strange things in your current code. For example, you may not want the cumsum in my code, which corresponds to only re-initializing H to zero in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or equivalently, move the H = 0 line to in between the for ind and the for i loops in your original code).
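To build confidence in the broadcasting block, it's worth checking it against the plain loops on a small input first. A minimal sketch (hypothetical sizes, using the per-radius variant where H is reset for every radius, as discussed above):
import numpy as np
import pandas as pd

np.random.seed(1)
small = pd.DataFrame(np.random.randint(0, 100, (30, 2)), columns=['X', 'Y'])
r_test = np.arange(5.0, 50.0, 5.0)
N = 10
ii = 15  # any window position >= 10

# vectorised: count pairs within each radius for the 10-point window
pts = small[['X', 'Y']].to_numpy(dtype=float)[ii - 10:ii]
diff = pts[None, :, :] - pts[:, None, :]
sq = (diff * diff).sum(axis=2)                      # 10x10 squared distances
within = sq[None, :, :] < (r_test * r_test)[:, None, None]
c_fast = within.sum(axis=(1, 2)) / N**2

# plain loops computing the same quantity
c_slow = np.empty(len(r_test))
for ind, r in enumerate(r_test):
    H = 0
    for i in range(10):
        for j in range(10):
            if r - np.sqrt(sq[i, j]) > 0:
                H += 1
    c_slow[ind] = H / N**2

assert np.allclose(c_fast, c_slow)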

Fuzzy matching string in python pyspark for dataframe

I am doing fuzzy similarity matching between all rows in the 'name' column using Python pyspark in a Jupyter notebook. The expected output is a new column containing the similar strings together with a score for each of them. My question is quite similar to this question, except that that question is in R and uses two datasets (mine is only one). As I'm quite new to Python, I'm quite confused about how to do it.
I have also used some simple code with a similar function; however, I am not sure how to run it on a dataframe.
Here is the code:
import numpy as np

def levenshtein_ratio_and_distance(s, t, ratio_calc=False):
    """ levenshtein_ratio_and_distance:
        Calculates the Levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        Levenshtein distance ratio of similarity between two strings.
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t.
    """
    # Initialize matrix of zeros
    rows = len(s) + 1
    cols = len(t) + 1
    distance = np.zeros((rows, cols), dtype=int)
    # Populate matrix of zeros with the indices of each character of both strings
    for i in range(1, rows):
        distance[i][0] = i
    for k in range(1, cols):
        distance[0][k] = k
    # Iterate over the matrix to compute the cost of deletions, insertions and/or substitutions
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0  # If the characters are the same in the two strings at a given position, the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package,
                # the cost of a substitution is 2 when we calculate the ratio,
                # and 1 when we calculate just the distance.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,       # Cost of deletions
                                     distance[row][col-1] + 1,       # Cost of insertions
                                     distance[row-1][col-1] + cost)  # Cost of substitutions
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s) + len(t)) - distance[row][col]) / (len(s) + len(t))
        return Ratio
    else:
        # print(distance) # Uncomment to see the matrix showing how the algorithm
        # computes the cost of deletions, insertions and/or substitutions.
        # This is the minimum number of edits needed to convert string a to string b
        return "The strings are {} edits away".format(distance[row][col])

# example for simple strings
Str1 = "Apple Inc."
Str2 = "Jo Inc"
Distance = levenshtein_ratio_and_distance(Str1, Str2)
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1, Str2, ratio_calc=True)
print(Ratio)
However, the code above is only applicable to strings. What if I want to use a dataframe as the input instead of strings? For example, the input data is (say the dataset name is customer):
name
1 Ace Co
2 Ace Co.
11 Baes
4 Bayes Inc.
8 Bayes
12 Bays
10 Bcy
15 asd
13 asd
The expected outcome is:
name     b_name                         dist
Ace Co   Ace Co.                        0.64762
Baes     Bayes Inc., Bayes, Bays, Bcy   0.80000, 0.86667, 0.70000, 0.97778
asd      asdf                           0.08333
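Since no answer is included here, the following is only a hedged sketch of one way to apply the function above across a dataframe with pyspark: cross-join the name column with itself and score each pair through a UDF. The customer dataframe below is built inline for illustration, and numpy must be available on the workers.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
customer = spark.createDataFrame(
    [("Ace Co",), ("Ace Co.",), ("Baes",), ("Bayes Inc.",), ("Bays",)],
    ["name"])

# wrap the pure-Python ratio function from above as a UDF
ratio_udf = F.udf(
    lambda s, t: float(levenshtein_ratio_and_distance(s, t, ratio_calc=True)),
    DoubleType())

pairs = (customer
         .crossJoin(customer.withColumnRenamed("name", "b_name"))
         .filter(F.col("name") != F.col("b_name"))
         .withColumn("dist", ratio_udf("name", "b_name")))
pairs.show()
To get the grouped layout of the expected outcome, the pairs could then be aggregated per name with F.collect_list. Note that a crossJoin is quadratic in the row count, so for large tables some blocking or pre-filtering is advisable.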

Retreiving neighbors with geohash algorithm?

I am looking at a pythonic implementation of this top-rated accepted answer on GIS SE - Using geohash for proximity searches? - and I am unable to retrieve any matches for my geohash query. Here is the approach I have tried so far.
To run this Minimum Verifiable Complete Example (MVCE) you need to download the following files - geohash int
and sortedlist python - and install the Python sortedlist via pip. You also need the latest version of Cython installed on your machine, so as to wrap the C functionality of geohash-int (note I am only wrapping what is necessary for this MVCE).
geohash_test.py
# GeoHash is my Cython wrapper of the geohash-int C package
from geo import GeoHash
from sortedcontainers import SortedList
import numpy as np

def main():
    # Bounding coordinates of my grid.
    minLat = 27.401436
    maxLat = 62.54858
    minLo = -180.0
    maxLo = 179.95000000000002
    latGrid = np.arange(minLat, maxLat, 0.05)
    lonGrid = np.arange(minLo, maxLo, 0.05)
    geoHash = GeoHash()
    # Create my own data set of points with a resolution of
    # 0.05 in the latitude and longitude direction.
    gridLon, gridLat = np.meshgrid(lonGrid, latGrid)
    grid_points = np.c_[gridLon.ravel(), gridLat.ravel()]
    sl = SortedList()
    # Store my grid points at the best resolution possible, i.e. 52 (first step in the accepted answer)
    for grid_point in grid_points:
        lon = grid_point[0]
        lat = grid_point[1]
        geohash = geoHash.encode(lat, lon, 52)
        bitsOriginal = geohash["bits"]
        sl.add(bitsOriginal)
    # Derive the minimum and maximum value for the range query from the method below
    minValue, maxValue = getMinMaxForQueryGeoHash(geoHash)
    # Do the actual range query with a sorted list
    it = sl.irange(minValue, maxValue, inclusive=(False, False))
    print(len(list(it)))

def getMinMaxForQueryGeoHash(geoHash):
    lonTest = 172.76843
    latTest = 61.560745
    # Query geohash encoded at resolution 26 because my search area
    # is around 10 km (steps 2 and 3 in the accepted answer)
    queryGeoHash = geoHash.encode(latTest, lonTest, 26)
    # Step 4 is getting the neighbors of the query geohash
    neighbors = geoHash.get_neighbors(queryGeoHash)
    bitsList = []
    for key, value in neighbors.items():
        bitsList.append(value["bits"])
    # Step 5 from the accepted answer
    bitsList.append(queryGeoHash["bits"])
    # Step 6: add 1 to all the neighbors
    newList = [x + 1 for x in bitsList]
    joinedList = bitsList + newList
    # Step 7: left bit-shift this to 52
    newList2 = [x << 26 for x in joinedList]
    # Return min and max value to the main method
    minValue = min(newList2)
    maxValue = max(newList2)
    return minValue, maxValue

main()
If one were to write this out as pseudocode, here is what I am doing:

1. Given my bounding box, which is a grid, I store it at the highest resolution possible by computing the geohash for each latitude and longitude (this happens to be bit depth 52).
2. I add each geohash to a sorted list.
3. I then want to do a range query by specifying a search radius of 10 km for a specific query coordinate.
4. Per the accepted answer, this requires the min and max value for a query geohash.
5. I calculate the min and max value in the method getMinMaxForQueryGeoHash:
   - Calculate the query geohash at bit depth 26 (this corresponds to the 10 km radius).
   - Calculate the neighbors of the query geohash and create the 18-member array. The 18 members are the 8 neighbors returned from the C method plus the original query geohash, and the remaining 9 are obtained by adding 1 to each of these.
   - Left bit-shift this array by 26 and pass the min and max value to the irange query of the sorted list. Bit shift = 52 (maximum resolution) - query geohash precision (26) = 26.
But that query returns NULL. Could somebody explain where I am going wrong?
Using your jargon: for an MVCE you do not need a complex two-language implementation. There are a lot of good, simple Geohash implementations, some in 100% Python (example). All of them use the Morton curve (example).
Conclusion: try to plug in a pure-Python implementation; first test encode/decode, then test the use of the neighbors(geohash) function.
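As a concrete starting point, here is a minimal sketch of that advice, assuming the pure-Python python-geohash package (pip install python-geohash); the encode/decode/neighbors calls below are that package's API, not the Cython wrapper from the question:
import geohash

# the query point from the question
lat, lon = 61.560745, 172.76843
code = geohash.encode(lat, lon, precision=6)  # 6 base32 characters, a cell a few km across
print(code)                     # the base32 geohash string
print(geohash.decode(code))     # round-trips to approximately (lat, lon)
print(geohash.neighbors(code))  # the 8 surrounding cells at the same precision
Once encode/decode and neighbors behave as expected, the same range-query logic from the accepted answer can be rebuilt on top of them.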

Numpy arrays and comparisons with different values

For reproducibility reasons, I am sharing the data here.
From column 2, I want to read the current row and compare it with the value of the previous row. If it is greater, I keep comparing; if the current value is smaller than the previous row's value, I want to divide the current value (smaller) by the previous value (larger). Accordingly, I have the following code:
import numpy as np
import matplotlib.pyplot as plt

protocols = {}
types = {"data_c": "data_c.csv", "data_r": "data_r.csv", "data_v": "data_v.csv"}
for protname, fname in types.items():
    col_time, col_window = np.loadtxt(fname, delimiter=',').T
    trailing_window = col_window[:-1]  # "past" values at a given index
    leading_window = col_window[1:]    # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]
    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }
data_c is a numpy.array that has only one unique quotient value 0.7, as does data_r with a unique quotient value of 0.5. However, data_v has two unique quotient values (either 0.5 or 0.8).
I wanted to loop through the quotient values of these CSV files and categorize them using a simple if-else statement. I got help from a StackOverflow contributor, using numpy.array_equal, as follows.
import numpy as np

unique_quotient = np.unique(quotient)
unique_data_c_quotient = np.r_[0.7]
unique_data_r_quotient = np.r_[0.5]

if np.array_equal(unique_quotient, unique_data_c_quotient):
    print('data_c')
elif np.array_equal(unique_quotient, unique_data_r_quotient):
    print('data_r')
This works perfectly for data_c and data_r, whose values are 0.7 and 0.5 respectively. In other words, it works only when the quotient value is unique (or fixed). However, it doesn't work when there is more than one quotient value. For example, data_m has quotient values between 0.65 and 0.7 (i.e. 0.65 <= quotient <= 0.7) and data_v has two quotient values (0.5 and 0.8).
How can we solve this issue using numpy arrays?
If you consistently have unique quotients, and consistently have unique quotient bounds, then I would recommend the following:
# reference values: data_m quotients lie in [0.65, 0.7]; data_v has the two
# values 0.5 and 0.8 (per the question); data_c and data_r as defined above
unique_data_m_bounds = np.r_[0.65, 0.7]
unique_data_v_quotient = np.r_[0.5, 0.8]

uq = unique_quotient
uq_min, uq_max = uq.min(), uq.max()

def is_uq_bounded_by(unique_data_bounds):
    ud_min, ud_max = unique_data_bounds.min(), unique_data_bounds.max()
    left_bounded = ud_min <= uq_min <= ud_max
    right_bounded = ud_min <= uq_max <= ud_max
    bounded = left_bounded & right_bounded
    return bounded

label = 'ERROR -- DATA UNCLASSIFIED'
if len(uq) > 2:
    if is_uq_bounded_by(unique_data_m_bounds):
        label = 'data_m'
elif 0 < len(uq) <= 2:
    if np.array_equal(uq, unique_data_v_quotient):
        label = 'data_v'
    if np.array_equal(uq, unique_data_c_quotient):
        label = 'data_c'
    elif np.array_equal(uq, unique_data_r_quotient):
        label = 'data_r'
print(label)
Note that the method becomes dubious when the data begin to overlap.
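On that note, one way to make the exact equality tests less brittle for floating-point quotients (my suggestion, not part of the original answer) is to compare sorted unique values within a tolerance:
import numpy as np

def matches(uq, reference, tol=1e-8):
    # True when both arrays have the same length and their sorted values
    # agree element-wise within the given absolute tolerance
    uq, reference = np.sort(uq), np.sort(reference)
    return uq.shape == reference.shape and np.allclose(uq, reference, atol=tol)

# hypothetical sample: unique quotients 0.5 and 0.8 should classify as data_v
print(matches(np.unique(np.r_[0.5, 0.8, 0.8]), np.r_[0.5, 0.8]))  # True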

Return value from DataFrame at maximum of a function

For narrow-banded processing, I want the complex pressure at the peak frequency bin. To find the peak frequency bin, I use the frequency with the highest absolute value within a small range of frequencies.
I have come up with the following code, borrowing heavily from
Use idxmax for indexing in pandas
This seems bulky to me, and hard to generalize. Ideally I hope to be able to make fBins into an array and return many frequencies at once. It's OK to make absMaxIndex into a list, but I can't see the next step.
import numpy as np
import pandas as pd

# Construct fake frequency data on multiple channels
np.random.seed(0)
numF = 1000
f = np.arange(numF) / (numF * 2)
y = np.random.randn(numF, 2) + 1j * np.random.randn(numF, 2)
# Put the series into a DataFrame, indexed by frequency
yFrame = pd.DataFrame(y, index=f)
fBins = 0.1
tol = 0.01
# Find the index of the maximum absolute value within a given frequency window
absMaxIndex = yFrame[(fBins - tol):(fBins + tol)].abs().idxmax()
# Return the value at this index
value = [yFrame.loc[idx, ch] for ch, idx in absMaxIndex.items()]
print(value)
value should contain the complex values
[(-2.0946030712061448-1.0585718976053677j), (-2.7396771671895563+0.79204149842297422j)]
which have the largest absolute value in yFrame between 0.09 and 0.11 Hz for each channel.
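One possible generalization, sketched under the setup above (fBinsArray is a hypothetical list of bin centres, and this is untested): loop over the bins, slice the frame once per bin, and collect the peak complex value for every channel.
# continuing from the yFrame / tol definitions above
fBinsArray = [0.1, 0.2, 0.3]  # hypothetical bin centres
peaks = {}
for fBin in fBinsArray:
    window = yFrame.loc[(fBin - tol):(fBin + tol)]
    absMaxIndex = window.abs().idxmax()  # peak frequency per channel
    peaks[fBin] = [window.loc[idx, ch] for ch, idx in absMaxIndex.items()]
print(peaks)  # {bin centre: [complex value per channel]}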
