I have a very large dataset of polygons and points with buffers around them. I would like to create a new column in the points data that holds the number of polygons each point's buffer intersects.
Here's a simplified example:
import pandas as pd
import geopandas as gp
from shapely.geometry import Polygon
from shapely.geometry import Point
import matplotlib.pyplot as plt
## Create polygons and points ##
df = gp.GeoDataFrame([['a',Polygon([(1, 0), (1, 1), (2,2), (1,2)])],
['b',Polygon([(1, 0.25), (2,1.25), (3,0.25)])]],
columns = ['name','geometry'])
df = gp.GeoDataFrame(df, geometry = 'geometry')
points = gp.GeoDataFrame( [['box', Point(1.5, 1.115), 4],
['triangle', Point(2.5,1.25), 8]],
columns=['name', 'geometry', 'value'],
geometry='geometry')
##Set a buffer around the points##
buf = points.buffer(0.5)
points['buffer'] = buf
points = points.drop(['geometry'], axis = 1)
points = points.rename(columns = {'buffer': 'geometry'})
This data looks like this:
What I'd like to do is create another column in the points dataframe that includes the number of polygons that point intersects.
I've tried utilising a for loop as such:
points['intersect'] = []
for geo1 in points['geometry']:
    for geo2 in df['geometry']:
        if geo1.intersects(geo2):
            points['intersect'].append('1')
Which I would then sum to get the total number of intersects.
However, I get the error: 'Length of values does not match length of index'. I know this is because it is attempting to assign three rows of data to a frame with only two rows.
How can I aggregate the counts so that the first point is assigned a value of 2 and the second a value of 1?
If you have a large dataset, I would go for a solution using an R-tree spatial index, something like this:
import pandas as pd
import geopandas as gp
from shapely.geometry import Polygon
from shapely.geometry import Point
import matplotlib.pyplot as plt
## Create polygons and points ##
df = gp.GeoDataFrame([['a',Polygon([(1, 0), (1, 1), (2,2), (1,2)])],
['b',Polygon([(1, 0.25), (2,1.25), (3,0.25)])]],
columns = ['name','geometry'])
df = gp.GeoDataFrame(df, geometry = 'geometry')
points = gp.GeoDataFrame( [['box', Point(1.5, 1.115), 4],
['triangle', Point(2.5,1.25), 8]],
columns=['name', 'geometry', 'value'],
geometry='geometry')
# generate spatial index
sindex = df.sindex
# define empty list for results
results_list = []
# iterate over the points
for index, row in points.iterrows():
    buffer = row['geometry'].buffer(0.5) # buffer
    # find approximate matches with r-tree, then precise matches from those approximate ones
    possible_matches_index = list(sindex.intersection(buffer.bounds))
    possible_matches = df.iloc[possible_matches_index]
    precise_matches = possible_matches[possible_matches.intersects(buffer)]
    results_list.append(len(precise_matches))
# add list of results as a new column
points['polygons'] = pd.Series(results_list)
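For a small dataset, the counting step itself can be sketched without the spatial index, using only shapely. This is a minimal illustration rather than a replacement for the indexed version above. The names `polygons`, `buffers`, and `counts` are hypothetical, and polygon "a" is written as a plain triangle, an assumption made here to keep the geometry valid.

```python
from shapely.geometry import Point, Polygon

# Polygon "a" is simplified to a triangle (assumption); "b" is the triangle
# from the example above.
polygons = [
    Polygon([(1, 1), (2, 2), (1, 2)]),
    Polygon([(1, 0.25), (2, 1.25), (3, 0.25)]),
]
buffers = [Point(1.5, 1.115).buffer(0.5), Point(2.5, 1.25).buffer(0.5)]

# One count per buffer: add up the True/False results of intersects().
counts = [
    sum(bool(buf.intersects(poly)) for poly in polygons)
    for buf in buffers
]
print(counts)
```

Summing inside the loop over points yields exactly one number per point, which avoids the length mismatch the original loop ran into. Every buffer is compared against every polygon, so this only makes sense for small inputs.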
I would really appreciate your help developing my code, since I am not an expert in Python. I am trying to write code that can:
Read all the points (longitude, latitude, cumulative forecasted rainfall for 24, 48, and 72 hours) from a csv file (Mean_PCP_REPS_12_20220809_Gridded.csv).
Read the polygon representing the watershed boundary (NelsonRiverBasin.shp).
Mask/remove the points outside of the watershed polygon.
Create a rainfall colormap image or raster for the points inside the watershed polygon.
Color boundaries should be based on rainfall value. I defined the rainfall range for each color in my code.
I tried many ways but did not succeed in creating an image or raster with the desired color map (please click here as an example of the intended image). My Python code is as follows. It creates and saves "new_ras.tiff", but it cannot remap the colors of this image based on the rainfall ranges after the image has been created.
from __future__ import division
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon, MultiPolygon
import operator
#extending the code
import os
from matplotlib.patches import Patch
from matplotlib.colors import ListedColormap
import matplotlib.colors as colors
import seaborn as sns
import numpy as np
import rioxarray as rxr
import earthpy as et
import earthpy.plot as ep
from scipy.interpolate import griddata #added code up to here
import rasterio
# load the data that should be cropped by the polygon
# this assumes that the csv file already includes
# a geometry column with point data as performed below
dat_gpd = pd.read_csv(r'Mean_PCP_REPS_12_20220809_Gridded.csv')
# make shapely points out of the X and Y coordinates
point_data = [Point(xy) for xy in zip(dat_gpd.iloc[:,0], dat_gpd.iloc[:,1])]
all_pts = list(zip(dat_gpd.iloc[:,0], dat_gpd.iloc[:,1]))
# assign shapely points as geometry to a geodataframe
# Like this you can also inspect the individual points if needed
arr_gpd = gpd.GeoDataFrame(dat_gpd, crs=4269, geometry=point_data)
# assign defined polygon to a new dataframe
nlpoly = gpd.read_file('NelsonRiverBasin.shp')
nlpoly = nlpoly.to_crs('epsg:4269')
mask = [nlpoly.contains(Point(p)).any() for p in all_pts]
# define a new dataframe from the spatial join of the dataframe with the data to be cropped
# and the dataframe with the polygon data, using the within function.
#dat_fin = gpd.sjoin(arr_gpd, nlpoly[['OCEAN_EN', 'COUNT', 'geometry']], predicate = 'within')
#dat_fin = dat_fin.to_crs('epsg:4326')
#dat_fin.plot(column= 'Hr72')
#plt.savefig('Raster2.tiff')
data = dat_gpd[['Long', 'Lat', 'Hr72']]
pts = list(zip(data.Long, data.Lat))
print (pts)
print(type(pts))
pts2 = [pts[i] for i in range(len(pts)) if mask[i]]
print(pts2)
print(type(pts2))
pts_val = data.Hr72.values
pts_val2 = [pts_val[i] for i in range(len(pts_val)) if mask[i]]
new_pts = [Point(xy) for xy in pts2]
print(type(pts_val2[1]))
pts3=[]
for tup, j in zip(pts2,range(len(pts_val2))):
    pts3.append(list(tup)+[pts_val2[j]])
print(type(pts3))
masked_pts = pd.DataFrame(pts3)
print(masked_pts)
masked_pts.columns = pd.Series(['Long', 'Lat', 'Hr72'])
new_arr_gpd = gpd.GeoDataFrame(masked_pts, crs = 4269, geometry = new_pts)
new_arr_gpd.plot(column = 'Hr72')
plt.savefig('new_ras.tiff')
rRes = 0.01
#xRange = np.arange(data.Long.min(), data.Long.max(), rRes)
#yRange = np.arange(data.Lat.min(), data.Lat.max(), rRes)
#print(xRange[:5],yRange[:5])
#gridX, gridY = np.meshgrid(xRange, yRange)
#grid_pcp = griddata(pts2, pts_val2, (gridX, gridY), method = 'linear')
#Extending the code
sns.set(font_scale = 1, style = "white")
lidar_chm = rxr.open_rasterio(r'new_ras.tiff', masked=True).squeeze()
# Define the colors you want
cmap = ListedColormap(["white", "lightskyblue","dodgerblue","mediumblue","lawngreen","limegreen", "forestgreen","darkgreen", "yellow", "orange","darkorange", "chocolate", "red", "maroon", "indianred","lightpink", "pink", "lightgray", "whitesmoke" ])
# Define a normalization from values -> colors
norm = colors.BoundaryNorm([0, 1, 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 150, 200, 250], 19)
fig, ax = plt.subplots(figsize=(9, 5))
chm_plot = ax.imshow(np.squeeze(r'new_ras.tiff'),cmap=cmap,norm=norm)
#print(chm_plot)
map_title = input ("Enter a title for this map (for ex. 72-hr accumulated forecast map):")
ax.set_title("Hydrologic Forecast Centre (MTI)\n" + map_title)
# Add a legend for labels
legend_labels = {"white": "0-1", "lightskyblue": "1-5","dodgerblue": "5-10","mediumblue": "10-15","lawngreen": "15-20","limegreen": "20-25", "forestgreen": "25-30","darkgreen": "30-40", "yellow": "40-50", "orange": "50-60","darkorange": "60-70", "chocolate": "70-80", "red": "80-90", "maroon": "90-100","indianred": "100-110", "lightpink": "110-120", "pink": "120-150", "lightgray": "150-200", "whitesmoke": "200-250"}
patches = [Patch(color=color, label=label) for color, label in legend_labels.items()]
ax.legend(handles=patches,bbox_to_anchor=(1.2, 1),facecolor="white")
ax.set_axis_off()
plt.show()
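One thing worth noting about the code above: plt.savefig writes a rendered picture, so the rainfall values are no longer available in "new_ras.tiff" for a colormap to act on. A minimal sketch of the alternative, assuming the values are kept in a NumPy array and the colors are applied at draw time, could look like this. The `rain` grid, the shortened color list, and the `bounds` are hypothetical stand-ins for the real data and the 19 classes.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm, ListedColormap

# Hypothetical rainfall grid in mm; NaN stands for cells outside the watershed.
rain = np.array([[0.5, 3.0, 12.0],
                 [45.0, np.nan, 7.0]])

# Shortened, illustrative classes: one colour per interval between bounds.
bounds = [0, 1, 5, 10, 50]
cmap = ListedColormap(["white", "lightskyblue", "dodgerblue", "mediumblue"])
norm = BoundaryNorm(bounds, cmap.N)

# Class index (and therefore colour index) picked for a few sample values.
bin_index = np.asarray(norm(np.array([0.5, 3.0, 12.0, 45.0])))
print(bin_index)

# Draw the values themselves; masked (NaN) cells are left uncoloured.
fig, ax = plt.subplots()
image = ax.imshow(np.ma.masked_invalid(rain), cmap=cmap, norm=norm)
fig.colorbar(image, ax=ax, ticks=bounds, label="rainfall (mm)")
```

BoundaryNorm needs one more boundary than there are colors, which is why the 19-color list in the question is paired with 20 boundaries.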
I have a pandas dataframe with two columns, A and B, named df in the following bits of code.
And I try to plot a kde for each value of B like so:
import seaborn as sbn, numpy as np, pandas as pd, matplotlib.pyplot as plt
fig = plt.figure(figsize=(15, 7.5))
sbn.kdeplot(data=df, x="A", hue="B", fill=True)
fig.savefig("test.png")
I read the following suggestions, but only those where I compute the kde from scratch using statsmodels or some other module get me anywhere:
Seaborn/Matplotlib: how to access line values in FacetGrid?
Get data points from Seaborn distplot
For curiosity's sake, I would like to know why I am unable to get something from the following code:
kde = sbn.kdeplot(data=df, x="A", hue="B", fill=True)
line = kde.lines[0]
x, y = line.get_data()
print(x, y)
The error I get is IndexError: list index out of range. kde.lines has a length of 0.
Accessing the lines through fig.axes[0].lines[0] also raises an IndexError.
All in all, I think I tried everything proposed in the previous threads. I also tried switching from kdeplot to displot (displot, not distplot, which is deprecated), but it is the same story, except that the axes have to be accessed differently. Every time I get to .get_lines(), ax.lines, and so on, what is returned is an empty list, so I can't get any values out of it.
EDIT : Reproducible example
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sbn
# 1. Generate random data
rows = []
for i in [1, 2, 3, 5, 7, 8, 10, 12, 15, 17, 20, 40, 50]:
    for _ in range(10):
        rows.append({"A": np.random.random() * i, "B": i})
# build the frame in one go (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(rows, columns=["A", "B"])
# 2. Plot data
fig = plt.figure(figsize=(15, 7.5))
sbn.kdeplot(data=df, x="A", hue="B", fill=True)
# 3. Read data (error)
ax = fig.axes[0]
x, y = ax.lines[0].get_data()
print(x, y)
This happens because using fill=True changes the object that matplotlib draws.
When no fill is used, lines are plotted:
fig = plt.figure(figsize=(15, 7.5))
ax = sbn.kdeplot(data=df, x="A", hue="B")
print(ax.lines)
# [<matplotlib.lines.Line2D object at 0x000001F365EF7848>, etc.]
When you use fill=True, they become PolyCollection objects instead:
fig = plt.figure(figsize=(15, 7.5))
ax = sbn.kdeplot(data=df, x="A", hue="B", fill=True)
print(ax.collections)
# [<matplotlib.collections.PolyCollection object at 0x0000016EE13F39C8>, etc.]
You could draw the kdeplot a second time, with fill=False, so that you have access to the line objects.
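The difference between the two kinds of artist can be seen with matplotlib alone. In this minimal sketch, `ax.fill_between` stands in for the filled KDE and the density curve is a hypothetical Gaussian, so it illustrates the mechanism rather than seaborn's own output.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical density curve standing in for a KDE.
x = np.linspace(-3, 3, 50)
y = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

fig, ax = plt.subplots()
ax.fill_between(x, y)  # a filled area is stored as a collection, not as a Line2D

n_lines = len(ax.lines)  # nothing to call get_data() on
# The outline of the filled region is available as an (n, 2) array of vertices.
vertices = ax.collections[0].get_paths()[0].vertices
print(n_lines, vertices.shape)
```

The vertices trace the whole outline of the filled region, including the baseline, so they are not a drop-in replacement for the (x, y) pairs that `get_data()` returns on a line.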
I have a dictionary called "topic_word"
topic_word = {0: [[-0.669712, 0.6868, 0.9821409999999999], [-0.925967, 0.6138399999999999, 1.247525], [-1.09941, 1.0252620000000001, 1.327866]],
1: [[-0.862131, 0.890915, 1.07759], [-0.437658, 0.279271, 0.627497], [-0.437658, 0.279271, 0.627497]],
2: [[-0.671647, 0.670583, 0.937155], [-0.675347, 0.466983, 0.8505440000000001], [-0.706244, 0.612532, 0.762877]],
3: [[-0.8414590000000001, 0.797826, 1.124295], [-0.567535, 0.40820300000000004, 0.811368], [-0.800963, 0.699767, 0.9237989999999999]],
4: [[-0.8560549999999999, 1.0617020000000001, 1.579302], [-0.576105, 0.5029239999999999, 0.9392], [-0.743683, 0.69884, 0.9794930000000001]]
}
where each key represents a topic (here 0 to 4, so 5 topics) and each value holds the embeddings of the words under that topic (here every topic has 3 words). I want to visualize the data in a 2-D scatter plot. If the data needs to be normalized, how can I normalize "topic_word" so that it is represented correctly in Python 3.x?
How can I visualize it with a scatter plot that shows the cluster of words (dots) under each topic?
Something like the following:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for key, value in topic_word.items():
    ax.scatter(value[0],value[1],label=key)
plt.legend()
I gather from your post that you want normalized values for each list corresponding to a key, with each normalized list represented as scatter data points. Here's one way to do it:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
topic_word = {0: [[-0.669712, 0.6868, 0.9821409999999999], [-0.925967, 0.6138399999999999, 1.247525], [-1.09941, 1.0252620000000001, 1.327866]],
1: [[-0.862131, 0.890915, 1.07759], [-0.437658, 0.279271, 0.627497], [-0.437658, 0.279271, 0.627497]],
2: [[-0.671647, 0.670583, 0.937155], [-0.675347, 0.466983, 0.8505440000000001], [-0.706244, 0.612532, 0.762877]],
3: [[-0.8414590000000001, 0.797826, 1.124295], [-0.567535, 0.40820300000000004, 0.811368], [-0.800963, 0.699767, 0.9237989999999999]],
4: [[-0.8560549999999999, 1.0617020000000001, 1.579302], [-0.576105, 0.5029239999999999, 0.9392], [-0.743683, 0.69884, 0.9794930000000001]]
}
colorkey={0:'red',1:'blue',2:'green',3:'black',4:'magenta'} # creating a color map for keys
for key, value in topic_word.items():
    valno=0 # keeping a count of number of lists under each topic_word (key)
    for val in value:
        meanval=np.mean(val)
        stdval=np.std(val)
        val = (val-meanval)/(stdval) # normalized list
        ax.scatter(key*np.ones(len(val)),val,color=colorkey[key],label="Topic "+str(key) if valno == 0 else "") # label is done such that duplication of legend elements is avoided
        handles, labels = ax.get_legend_handles_labels()
        valno=valno+1
fig.legend(handles, labels, loc='best')
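The normalization used in the loop above is a z-score applied to each word vector on its own. A small sketch of that single step, using the first word of topic 0 with rounded values (the names `vector` and `normalized` are only for illustration):

```python
import numpy as np

# First word vector of topic 0, rounded for readability.
vector = np.array([-0.669712, 0.6868, 0.982141])

# Z-score: subtract the vector's own mean, divide by its own standard deviation.
normalized = (vector - vector.mean()) / vector.std()
print(normalized.mean(), normalized.std())
```

After this step every vector has mean 0 and standard deviation 1, so differences in overall scale between words are removed. Whether that is desirable depends on what the embeddings are meant to show.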
I can create a simple bar chart in matplotlib from a 'simple' dictionary:
import matplotlib.pyplot as plt
D = {u'Label1':26, u'Label2': 17, u'Label3':30}
plt.bar(range(len(D)), D.values(), align='center')
plt.xticks(range(len(D)), D.keys())
plt.show()
But I do not know how to draw a line (curve) from the text and numeric data of this dictionary:
T_OLD = {'10': 'need1', '11': 'need2', '12': 'need1', '13': 'need2', '14': 'need1'}
Like the picture below
You may use numpy to convert the dictionary to an array with two columns, which can be plotted.
import matplotlib.pyplot as plt
import numpy as np
T_OLD = {'10' : 'need1', '11':'need2', '12':'need1', '13':'need2','14':'need1'}
x = list(zip(*T_OLD.items()))
# sort array, since dictionary is unsorted
x = np.array(x)[:,np.argsort(x[0])].T
# set the second column to 1 if "need2", else 0
x[:,1] = (x[:,1] == "need2").astype(int)
# plot the two columns of the array
plt.plot(x[:,0], x[:,1])
# set the labels accordingly
plt.gca().set_yticks([0,1])
plt.gca().set_yticklabels(['need1', 'need2'])
plt.show()
The following version is independent of the actual content of the dictionary; the only assumption is that the keys can be converted to floats.
import matplotlib.pyplot as plt
import numpy as np
T_OLD = {'10': 'run', '11': 'tea', '12': 'mathematics', '13': 'run', '14' :'chemistry'}
x = np.array(list(zip(*T_OLD.items())))
u, ind = np.unique(x[1,:], return_inverse=True)
x[1,:] = ind
x = x.astype(float)[:,np.argsort(x[0])].T
# plot the two columns of the array
plt.plot(x[:,0], x[:,1])
# set the labels accordingly
plt.gca().set_yticks(range(len(u)))
plt.gca().set_yticklabels(u)
plt.show()
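The key step in the version above is `np.unique(..., return_inverse=True)`, which turns the string categories into integer codes. A minimal sketch of just that step, with hypothetical names `values`, `labels`, and `codes`:

```python
import numpy as np

values = np.array(['run', 'tea', 'mathematics', 'run', 'chemistry'])

# labels: the sorted unique strings; codes: each value's position within labels.
labels, codes = np.unique(values, return_inverse=True)
print(labels)
print(codes)
```

The codes serve as numeric y positions, and `labels` supplies the matching tick labels in the same order. Because `np.unique` sorts the strings, the categories end up in alphabetical order on the axis.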
Use numeric values for your y-axis ticks, and then map them to desired strings with plt.yticks():
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# example data
times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice([0,1], size=len(times))
data_labels = ['need1','need2']
fig, ax = plt.subplots()
ax.plot(times, data, marker='o', linestyle="None")
plt.yticks([0, 1], data_labels)
plt.xlabel("time")
Note: It's generally not a good idea to use a line graph to represent categorical changes in time (e.g. from need1 to need2). Doing that gives the visual impression of a continuum between time points, which may not be accurate. Here, I changed the plotting style to points instead of lines. If for some reason you need the lines, just remove linestyle="None" from the call to ax.plot().
UPDATE
(per comments)
To make this work with a y-axis category set of arbitrary length, use ax.set_yticks() and ax.set_yticklabels() to map to y-axis values.
For example, given a set of potential y-axis values labels, let N be the size of a subset of labels (here we'll set it to 4, but it could be any size).
Then draw a random sample of y values and plot it against time, labeling the y-axis ticks based on the full set of labels. Note that we still use set_yticks() first with numerical markers, and then replace them with our category labels using set_yticklabels().
labels = np.array(['A','B','C','D','E','F','G'])
N = 4
# example data
times = pd.date_range(start='2017-10-17 00:00', end='2017-10-17 5:00', freq='H')
data = np.random.choice(np.arange(len(labels)), size=len(times))
fig, ax = plt.subplots(figsize=(15,10))
ax.plot(times, data, marker='o', linestyle="None")
ax.set_yticks(np.arange(len(labels)))
ax.set_yticklabels(labels)
plt.xlabel("time")
This gives the exact desired plot:
import matplotlib.pyplot as plt
from collections import OrderedDict
T_OLD = {'10' : 'need1', '11':'need2', '12':'need1', '13':'need2','14':'need1'}
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: t[0]))
plt.plot(map(int, T_SRT.keys()), map(lambda x: int(x[-1]), T_SRT.values()),'r')
plt.ylim([0.9,2.1])
ax = plt.gca()
ax.set_yticks([1,2])
ax.set_yticklabels(['need1', 'need2'])
plt.title('T_OLD')
plt.xlabel('time')
plt.ylabel('need')
plt.show()
For Python 3.X the plotting line needs to explicitly convert the map() output to lists:
plt.plot(list(map(int, T_SRT.keys())), list(map(lambda x: int(x[-1]), T_SRT.values())),'r')
as in Python 3.X map() returns an iterator as opposed to a list in Python 2.7.
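The difference is easy to check in isolation. A short sketch for Python 3 follows; the variable names are only for illustration.

```python
numbers = map(int, ['10', '11', '12'])

is_list = isinstance(numbers, list)  # False on Python 3: map() returns a lazy iterator
as_list = list(numbers)              # materialise the values once
leftover = list(numbers)             # the iterator is exhausted after the first pass

print(is_list, as_list, leftover)
```

Converting to a list up front also makes it safe to reuse the values more than once, for example for both plotting and computing axis limits.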
The plot uses the dictionary keys converted to ints and the last characters of need1 or need2, also converted to ints. This relies on the particular structure of your data; if the values were need1 and need3, it would need a couple more operations.
After plotting and changing the axes limits, the program simply modifies the tick labels at y positions 1 and 2. It then also adds the title and the x and y axis labels.
The important part is that the dictionary/input data has to be sorted. One way to do that is to use OrderedDict. Here T_SRT is an OrderedDict object sorted by the keys in T_OLD.
The output is:
This is a more general case for more values/labels in T_OLD. It assumes that the label is always 'needX' where X is any number. This can readily be extended to any string preceding the number, though it would require more processing:
import matplotlib.pyplot as plt
from collections import OrderedDict
import re
T_OLD = {'10' : 'need1', '11':'need8', '12':'need11', '13':'need1','14':'need3'}
T_SRT = OrderedDict(sorted(T_OLD.items(), key=lambda t: t[0]))
x_val = list(map(int, T_SRT.keys()))
y_val = list(map(lambda x: int(re.findall(r'\d+', x)[-1]), T_SRT.values()))
plt.plot(x_val, y_val,'r')
plt.ylim([0.9*min(y_val),1.1*max(y_val)])
ax = plt.gca()
y_axis = list(set(y_val))
ax.set_yticks(y_axis)
ax.set_yticklabels(['need' + str(i) for i in y_axis])
plt.title('T_OLD')
plt.xlabel('time')
plt.ylabel('need')
plt.show()
This solution finds the number at the end of the label using re.findall, to allow for multi-digit numbers. The previous solution just took the last character of the string because the numbers were single digits. It still assumes that the number used for the plotting position is the last number in the string, hence the [-1]. Again, for Python 3.X the map output is explicitly converted to a list, a step that is not necessary in Python 2.7.
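To see why the regular expression matters, the two extraction approaches can be compared on a few labels. This is a small sketch; `sample_labels` and the two result lists are hypothetical names.

```python
import re

sample_labels = ['need1', 'need11', 'need3']

# Last run of digits in each label: handles multi-digit numbers.
numbers = [int(re.findall(r'\d+', label)[-1]) for label in sample_labels]

# Last character only: silently wrong for 'need11'.
last_chars = [int(label[-1]) for label in sample_labels]

print(numbers, last_chars)
```

With the last-character approach 'need11' collapses onto the same y position as 'need1', which is the failure the regex version avoids.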
The labels are now generated by first selecting unique y-values using set and then renaming their labels through concatenation of the strings 'need' with its corresponding integer.
The limits of y-axis are set as 0.9 of the minimum value and 1.1 of the maximum value. Rest of the formatting is as before.
The result for this test case is:
I have data in Cartesian coordinates. Each Cartesian coordinate also has a binary variable attached. I want to make a heatmap where, in each polygon (hexagon, rectangle, etc.), the color strength is the ratio of the number of occurrences where the boolean is True to the total number of occurrences in that polygon.
The data can for example look like this:
df = pd.DataFrame([[1,2,False],[-1,5,True], [51,52,False]])
I know that seaborn can generate heatmaps via seaborn.heatmap, but the color strength is based by default on the total occurrences in each polygon, not the above ratio. Is there perhaps another plotting tool that would be more suitable?
You could also use the pandas groupby functionality to compute the ratios and then pass the result to seaborn.heatmap. With the example data borrowed from @ImportanceOfBeingErnest it would look like this:
import numpy as np
import pandas as pd
import seaborn as sns
np.random.seed(0)
x = np.random.poisson(5, size=200)
y = np.random.poisson(7, size=200)
z = np.random.choice([True, False], size=200, p=[0.3, 0.7])
df = pd.DataFrame({"x" : x, "y" : y, "z":z})
res = df.groupby(['y','x'])['z'].mean().unstack()
ax = sns.heatmap(res)
ax.axis('equal')
ax.invert_yaxis()
the resulting plot
If your x and y values aren't integers you can cut them into the desired number of categories for grouping:
bins = 10
res = df.groupby([pd.cut(df.y, bins),pd.cut(df.x,bins)])['z'].mean().unstack()
Another option would be to calculate two histograms, one for the complete dataframe and one for the dataframe filtered for the True values. Dividing the latter by the former then gives the ratio you're after.
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.random.poisson(5, size=200)
y = np.random.poisson(7, size=200)
z = np.random.choice([True, False], size=200, p=[0.3, 0.7])
df = pd.DataFrame({"x" : x, "y" : y, "z":z})
dftrue = df[df["z"] == True]
bins = np.arange(0,22)
hist, xbins, ybins = np.histogram2d(df.x, df.y, bins=bins)
histtrue, _ ,__ = np.histogram2d(dftrue.x, dftrue.y, bins=bins)
plt.imshow(histtrue/hist, cmap=plt.cm.Reds)
plt.colorbar()
plt.show()
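Bins that contain no points at all make the plain division `histtrue/hist` evaluate 0/0. A hedged variant of the ratio step, assuming empty bins should simply stay NaN, is sketched below. The names `total`, `true_counts`, and `ratio` are hypothetical, and the random data is only there to make the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(5, size=200)
y = rng.poisson(7, size=200)
z = rng.choice([True, False], size=200, p=[0.3, 0.7])

bins = np.arange(0, 22)
total, _, _ = np.histogram2d(x, y, bins=bins)
true_counts, _, _ = np.histogram2d(x[z], y[z], bins=bins)

# Divide only where a bin has data; empty bins keep their NaN placeholder.
ratio = np.full(total.shape, np.nan)
np.divide(true_counts, total, out=ratio, where=total > 0)
```

The resulting `ratio` array can be passed to `plt.imshow` in the same way as `histtrue/hist`, with NaN cells left uncolored.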