Related
I have geoDataFrame:
df = gpd.GeoDataFrame([[0, 'A', Point(10,12)],
[1, 'B', Point(14,8)],
[2, 'C', Point(100,2)],
[3, 'D' ,Point(20,10)]],
columns=['ID','Value','geometry'])
Is it possible to find points in a range of radius for example 10 for each point and add their "Value" and 'geometry' to GeoDataFrame so output would look like:
['ID','Value','geometry','value_of_point_in_range_1','geometry_of_point_in_range_1','value_of_point_in_range_2','geometry_of_point_in_range_2' etc.]
Before i was finding nearest neighbor for each and after that was checking if is it in range but i must find all of the points in radius and don't know what tool should i use.
Although in your example the output will have a predictable amount of columns in the resulting dataframe, this not true in general. Therefore I would instead create a column in the dataframe that consists of a lists denoting the index/value/geometry of the nearby points.
In a small dataset like you provided, simple arithmetics in python will suffice. But for large datasets you will want to use a spatial tree to query the nearby points. I suggest to use scipy's KDTree like this:
import geopandas as gpd
import numpy as np
from shapely.geometry import Point
from scipy.spatial import KDTree
df = gpd.GeoDataFrame([[0, 'A', Point(10,12)],
[1, 'B', Point(14,8)],
[2, 'C', Point(100,2)],
[3, 'D' ,Point(20,10)]],
columns=['ID','Value','geometry'])
tree = KDTree(list(zip(df.geometry.x, df.geometry.y)))
pairs = tree.query_pairs(10)
df['ValueOfNearbyPoints'] = np.empty((len(df), 0)).tolist()
n = df.columns.get_loc("ValueOfNearbyPoints")
m = df.columns.get_loc("Value")
for (i, j) in pairs:
df.iloc[i, n].append(df.iloc[j, m])
df.iloc[j, n].append(df.iloc[i, m])
This yields the following dataframe:
ID Value geometry ValueOfNearbyPoints
0 0 A POINT (10.00000 12.00000) [B]
1 1 B POINT (14.00000 8.00000) [A, D]
2 2 C POINT (100.00000 2.00000) []
3 3 D POINT (20.00000 10.00000) [B]
To verify the results, you may find plotting the result usefull:
import matplotlib.pyplot as plt
ax = plt.subplot()
df.plot(ax=ax)
for (i, j) in pairs:
plt.plot([df.iloc[i].geometry.x, df.iloc[j].geometry.x],
[df.iloc[i].geometry.y, df.iloc[j].geometry.y], "-r")
plt.show()
I can create a numpy array from a python list as follows:
>>> a = [1,2,3]
>>> b = np.array(a).reshape(3,1)
>>> print(b)
[[1]
[2]
[3]]
However, I don't know what causes error in the following code:
Code :
>>> a = [1,2,3]
>>> b = np.full((3,1), a)
Error :
ValueError Traceback (most recent call last)
<ipython-input-275-1ab6c109dda4> in <module>()
1 a = [1,2,3]
----> 2 b = np.full((3,1), a)
3 print(b)
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in full(shape, fill_value, dtype, order)
324 dtype = array(fill_value).dtype
325 a = empty(shape, dtype, order)
--> 326 multiarray.copyto(a, fill_value, casting='unsafe')
327 return a
328
<__array_function__ internals> in copyto(*args, **kwargs)
ValueError: could not broadcast input array from shape (3) into shape (3,1)
Even though the list a has 3 elements inside it and I expect a 3x1 numpy array, the full() method fails to deliver it.
I referred the broadcasting article of numpy too. However, they are much more focused towards the arithmetic operation perspective, hence I couldn't obtain anything useful from there.
So it would be great if you can help me to understand the difference in b/w. the above mentioned array creation methods and the cause of the error too.
Numpy is unable to broadcast the two shapes together because your list is interpreted as a 'row vector' (np.array(a).shape = (3,)) while you are asking for a 'column vector' (shape = (3, 1)). If you are set on using np.full, then you can shape your list as a column vector initially:
>>> import numpy as np
>>>
>>> a = [[1],[2],[3]]
>>> b = np.full((3,1), a)
Another option is to convert a into a numpy array ahead of time and add a new axis to match the desired output shape.
>>> a = [1,2,3]
>>> a = np.array(a)[:, np.newaxis]
>>> b = np.full((3,1), a)
I have 3 big CSV files. I try to randomly extract some samples from the files without loading them into the memory. I am doing this:
SITS = dd.read_csv("sits_train_0.csv", blocksize="512MB",
usecols=band_blue + ["samplefid"]).set_index("samplefid")
MASK = dd.read_csv("mask_train_0.csv", blocksize="512MB",
usecols=band_mask + ["samplefid"]).set_index("samplefid")
GP = dd.read_csv("sits_gp_train_0.csv", blocksize="512MB",
usecols=band_blue_gp + ["samplefid"]).set_index("samplefid")
# SITS = pd.read_csv("sits_train_0.csv",
# usecols=band_blue + ["samplefid"]).set_index("samplefid")
# MASK = pd.read_csv("mask_train_0.csv",
# usecols=band_mask + ["samplefid"]).set_index("samplefid")
# GP = pd.read_csv("sits_gp_train_0.csv",
# usecols=band_blue_gp + ["samplefid"]).set_index("samplefid")
np.random.seed(0)
NSAMPLES=100
samples = np.random.choice(MASK.index, size=NSAMPLES, replace=False)
s = SITS.loc[samples][band_blue].compute().values
m = MASK.loc[samples][band_mask].compute().values
sg = GP.loc[samples][band_blue_gp].compute().values
# s = SITS.loc[samples][band_blue].values
# m = MASK.loc[samples][band_mask].values
# sg = GP.loc[samples][band_blue_gp].values
I had strange results, so I compare to pandas with smaller files (see commented code above) for which I have correct results.
If I set blocksize to None, the results are fine, but it loads everything in memory, so using dask is not useful in that case and my CSV are to big to fits in memory. My CSV are written randomly so I need to use index to recover the same samples from the 3 CSV.
I feel I miss something from dask, but I don't see what.
I'd recommend using sample
In [16]: import pandas as pd
In [17]: import dask.dataframe as dd
In [18]: df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...: 'num_wings': [2, 0, 0, 0],
...: 'num_specimen_seen': [10, 2, 1, 8]},
...: index=['falcon', 'dog', 'spider', 'fish'])
In [19]: ddf = dd.from_pandas(df, npartitions=2)
In [20]: ddf.sample??
In [21]: df.sample(frac=0.5, replace=True, random_state=1)
Out[21]:
num_legs num_wings num_specimen_seen
dog 4 0 2
fish 0 0 8
In [22]: ddf.sample(frac=0.5, replace=True, random_state=1)
Out[22]:
Dask DataFrame Structure:
num_legs num_wings num_specimen_seen
npartitions=2
dog int64 int64 int64
fish ... ... ...
spider ... ... ...
Dask Name: sample, 4 tasks
In [23]: ddf.sample(frac=0.5, replace=True, random_state=1).compute()
Out[23]:
num_legs num_wings num_specimen_seen
falcon 2 2 10
fish 0 0 8
I have a text file that contains 3 columns of useful data that I would like to be able to extract in python using numpy. The file type is a *.nc and is NOT a netCDF4 filetype. It is a standard file output type for CNC machines. In my case it is sort of a CMM (coordinate measurement machine). The format goes something like this:
X0.8523542Y0.0000000Z0.5312869
The X,Y, and Z are the coordinate axes on the machine. My question is, can I delimit an array with multiple delimiters? In this case: "X","Y", and "Z".
You can use Pandas
import pandas as pd
from io import StringIO
#Create a mock file
ncfile = StringIO("""X0.8523542Y0.0000000Z0.5312869
X0.7523542Y1.0000000Z0.5312869
X0.6523542Y2.0000000Z0.5312869
X0.5523542Y3.0000000Z0.5312869""")
df = pd.read_csv(ncfile,header=None)
#Use regex with split to define delimiters as X, Y, Z.
df_out = df[0].str.split(r'X|Y|Z', expand=True)
df_out.set_axis(['index','X','Y','Z'], axis=1, inplace=False)
Output:
index X Y Z
0 0.8523542 0.0000000 0.5312869
1 0.7523542 1.0000000 0.5312869
2 0.6523542 2.0000000 0.5312869
3 0.5523542 3.0000000 0.5312869
I ended up using the Pandas solution provided by Scott. For some reason I am not 100% clear on, I cannot simply convert the array from string to float with float(array). I created an array of equal size and iterated over the size of the array, converting each individual element to a float and saving it to the other array.
Thanks all
Using the filter function that I suggested in a comment:
String sample (standin for file):
In [1]: txt = '''X0.8523542Y0.0000000Z0.5312869
...: X0.8523542Y0.0000000Z0.5312869
...: X0.8523542Y0.0000000Z0.5312869
...: X0.8523542Y0.0000000Z0.5312869'''
Basic genfromtxt use - getting strings:
In [3]: np.genfromtxt(txt.splitlines(), dtype=None,encoding=None)
Out[3]:
array(['X0.8523542Y0.0000000Z0.5312869', 'X0.8523542Y0.0000000Z0.5312869',
'X0.8523542Y0.0000000Z0.5312869', 'X0.8523542Y0.0000000Z0.5312869'],
dtype='<U30')
This array of strings could be split in the same spirit as the pandas answer.
Define a function to replace the delimiter characters in a line:
In [6]: def foo(aline):
...: return aline.replace('X','').replace('Y',',').replace('Z',',')
re could be used for a prettier split.
Test it:
In [7]: foo('X0.8523542Y0.0000000Z0.5312869')
Out[7]: '0.8523542,0.0000000,0.5312869'
Use it in genfromtxt:
In [9]: np.genfromtxt((foo(aline) for aline in txt.splitlines()), dtype=float,delimiter=',')
Out[9]:
array([[0.8523542, 0. , 0.5312869],
[0.8523542, 0. , 0.5312869],
[0.8523542, 0. , 0.5312869],
[0.8523542, 0. , 0.5312869]])
With a file instead, the generator would something like:
(foo(aline) for aline in open(afile))
What is the fastest way of converting a list of elements of type numpy.float64 to type float? I am currently using the straightforward for loop iteration in conjunction with float().
I came across this post: Converting numpy dtypes to native python types, however my question isn't one of how to convert types in python but rather more specifically how to best convert an entire list of one type to another in the quickest manner possible in python (i.e. in this specific case numpy.float64 to float). I was hoping for some secret python machinery that I hadn't come across that could do it all at once :)
The tolist() method should do what you want. If you have a numpy array, just call tolist():
In [17]: a
Out[17]:
array([ 0. , 0.14285714, 0.28571429, 0.42857143, 0.57142857,
0.71428571, 0.85714286, 1. , 1.14285714, 1.28571429,
1.42857143, 1.57142857, 1.71428571, 1.85714286, 2. ])
In [18]: a.dtype
Out[18]: dtype('float64')
In [19]: b = a.tolist()
In [20]: b
Out[20]:
[0.0,
0.14285714285714285,
0.2857142857142857,
0.42857142857142855,
0.5714285714285714,
0.7142857142857142,
0.8571428571428571,
1.0,
1.1428571428571428,
1.2857142857142856,
1.4285714285714284,
1.5714285714285714,
1.7142857142857142,
1.857142857142857,
2.0]
In [21]: type(b)
Out[21]: list
In [22]: type(b[0])
Out[22]: float
If, in fact, you really have python list of numpy.float64 objects, then #Alexander's answer is great, or you could convert the list to an array and then use the tolist() method. E.g.
In [46]: c
Out[46]:
[0.0,
0.33333333333333331,
0.66666666666666663,
1.0,
1.3333333333333333,
1.6666666666666665,
2.0]
In [47]: type(c)
Out[47]: list
In [48]: type(c[0])
Out[48]: numpy.float64
#Alexander's suggestion, a list comprehension:
In [49]: [float(v) for v in c]
Out[49]:
[0.0,
0.3333333333333333,
0.6666666666666666,
1.0,
1.3333333333333333,
1.6666666666666665,
2.0]
Or, convert to an array and then use the tolist() method.
In [50]: np.array(c).tolist()
Out[50]:
[0.0,
0.3333333333333333,
0.6666666666666666,
1.0,
1.3333333333333333,
1.6666666666666665,
2.0]
If you are concerned with the speed, here's a comparison. The input, x, is a python list of numpy.float64 objects:
In [8]: type(x)
Out[8]: list
In [9]: len(x)
Out[9]: 1000
In [10]: type(x[0])
Out[10]: numpy.float64
Timing for the list comprehension:
In [11]: %timeit list1 = [float(v) for v in x]
10000 loops, best of 3: 109 µs per loop
Timing for conversion to numpy array and then tolist():
In [12]: %timeit list2 = np.array(x).tolist()
10000 loops, best of 3: 70.5 µs per loop
So it is faster to convert the list to an array and then call tolist().
You could use a list comprehension:
floats = [float(np_float) for np_float in np_float_list]
So out of the possible solutions I've come across (big thanks to Warren Weckesser and Alexander for pointing out all of the best possible approaches) I ran my current method and that presented by Alexander to give a simple comparison for runtimes (the two choices come as a result of the fact that I have a true list of elements of numpy.float64 and wish to convert them to float speedily):
2 approaches covered: list comprehension and basic for loop iteration
First here's the code:
import datetime
import numpy
list1 = []
for i in range(0,1000):
list1.append(numpy.float64(i))
list2 = []
t_init = time.time()
for num in list1:
list2.append(float(num))
t_1 = time.time()
list2 = [float(np_float) for np_float in list1]
t_2 = time.time()
print("t1 run time: {}".format(t_1-t_init))
print("t2 run time: {}".format(t_2-t_1))
I ran four times to give a quick set of results:
>>> run 1
t1 run time: 0.000179290771484375
t2 run time: 0.0001533031463623047
Python 3.4.0
>>> run 2
t1 run time: 0.00018739700317382812
t2 run time: 0.0001518726348876953
Python 3.4.0
>>> run 3
t1 run time: 0.00017976760864257812
t2 run time: 0.0001513957977294922
Python 3.4.0
>>> run 4
t1 run time: 0.0002455711364746094
t2 run time: 0.00015997886657714844
Python 3.4.0
Clearly to convert a true list of numpy.float64 to float, the optimal approach is to use python's list comprehension.