Making rows of points in dataframe into POLYGON using geopandas

I am trying to find out whether points lie inside closed polygons (see this question: Finding if a point in a dataframe is in a polygon and assigning polygon name to point), but I realized that there might be another way to do this:
I have this dataframe
df=
id x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190
and I would like to transform this into
id Polygon
0 A1 POLYGON((65.422080, 48.147850), (46.635708, 51.165745), (46.597984, 47.657444), (68.477700, 44.073700))
1 A3 POLYGON((46.635708, 54.108190), (46.635708, 51.844770), (63.309560, 48.826878), (62.215572, 54.108190))
and do the same for points:
df1=
item x y
0 1 50 49
1 2 60 53
2 3 70 30
to
item point
0 1 POINT(50,49)
1 2 POINT(60,53)
2 3 POINT(70,30)
I have never used geopandas and am a little at a loss here.
My question is thus: How do I get from a pandas dataframe to a dataframe with geopandas attributes?
Thankful for any insight!

You can achieve this as follows, but you would have to set the right dtype. I know in ArcGIS you have to set the dtype as geometry:
df.groupby('id').apply(lambda x: 'POLYGON(' + str(tuple(zip(x['x_zone'],x['y_zone'])))+')')
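Note that this yields plain Python strings, and str(tuple(zip(...))) is not valid WKT: WKT separates x and y with a space, not a comma, and expects the ring to be closed. A sketch that emits parseable WKT instead (to_wkt is a hypothetical helper name):
def to_wkt(g):
    pts = list(zip(g['x_zone'], g['y_zone']))
    pts.append(pts[0])  # WKT rings must be closed: repeat the first vertex
    return 'POLYGON((' + ', '.join(f'{x} {y}' for x, y in pts) + '))'
wkt_series = df.groupby('id').apply(to_wkt)  # strings that shapely.wkt.loads can parse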

I'd suggest the following to directly get a GeoDataFrame from your df:
from shapely.geometry import Polygon
import geopandas as gpd
gdf = gpd.GeoDataFrame(geometry=df.groupby('id').apply(
    lambda g: Polygon(gpd.points_from_xy(g['x_zone'], g['y_zone']))))
It first creates a list of points using geopandas' points_from_xy, then creates a Polygon object from that list.
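The points dataframe from the question can be handled the same way; a minimal sketch, assuming the column names item, x and y shown above:
# POINT geometries straight from the x/y columns
gdf1 = gpd.GeoDataFrame(df1[['item']],
                        geometry=gpd.points_from_xy(df1['x'], df1['y']))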

Related

Calculate length of all segments in polygon using geopandas

I have this little issue I am trying to solve and I have looked everywhere for an answer. It seems odd that I cannot find it, but it might just be me.
So, I have this dataframe
df=
id x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190
that I convert into a geopandas dataframe:
df_geometry = gpd.GeoDataFrame(geometry=df.groupby('id').apply(
    lambda g: Polygon(gpd.points_from_xy(g['x_zone'], g['y_zone']))))
df_geometry = df_geometry.reset_index()
print(df_geometry)
which returns:
id geometry
A1 POLYGON ((65.42208 48.14785, 46.63571 51.16575...
A3 POLYGON ((46.63571 54.10819, 46.63571 51.84477...
and for which I can compute the area and the perimeter:
df_geometry["area"] = df_geometry['geometry'].area
df_geometry["perimeter"] = df_geometry['geometry'].length
which gives:
id geometry area perimeter
0 A1 POLYGON ((65.42208 48.14785, 46.63571 51.16575... 72.106390 49.799695
1 A3 POLYGON ((46.63571 54.10819, 46.63571 51.84477... 60.011026 40.181476
Now, to the core of my problem: if one can calculate the total length, surely the length of each segment of the polygon is being calculated along the way. How can I retrieve this?
I understand that for very complicated polygons (e.g., country maps) this might be problematic to store. Anyone with an idea?
Here is runnable code that shows all the steps to create a dataframe containing the required segment lengths.
from io import StringIO
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point, Polygon
import numpy as np
dats_str = """index id x_zone y_zone
0 A1 65.422080 48.147850
1 A1 46.635708 51.165745
2 A1 46.597984 47.657444
3 A1 68.477700 44.073700
4 A3 46.635708 54.108190
5 A3 46.635708 51.844770
6 A3 63.309560 48.826878
7 A3 62.215572 54.108190"""
# read the string, convert to dataframe
df1 = pd.read_csv(StringIO(dats_str), sep=r'\s+', index_col='index')  # raw-string separator handles one or more spaces
gdf = gpd.GeoDataFrame(geometry=df1.groupby('id')
                       .apply(lambda g: Polygon(gpd.points_from_xy(g['x_zone'], g['y_zone']))))
gdf = gdf.reset_index() #bring `id` to `column` status
# Facts about polygon outer vertices
# - first vertex is the same as the last
# - to get segments, ignore zero-th point (use it as from_point in next row)
# create basic lists for creation of new dataframe
indx = [] # for A1, A3
sequ = [] # for seg order
pxy0 = [] # from-point
pxy1 = [] # to-point
for ix, geom in zip(gdf.id, gdf.geometry):
    num_pts = len(geom.exterior.xy[0])
    #print(ix, "Num points:", num_pts)
    old_xy = []
    for inx, (x, y) in enumerate(zip(geom.exterior.xy[0], geom.exterior.xy[1])):
        if inx == 0:
            # the zero-th vertex has no preceding point; just record it below
            pass
        else:
            indx.append(ix)
            sequ.append(inx)
            pxy0.append(Point(old_xy))
            pxy1.append(Point(x, y))
        old_xy = (x, y)
# Create new geodataframe
pgon_segs = gpd.GeoDataFrame({"poly_id": indx,
                              "vertex_id": sequ,
                              "fr_point": pxy0,
                              "to_point": pxy1}, geometry="to_point")
# Compute segment lengths
# Note: seg length is Euclidean distance, ***not geographic***
pgon_segs["seg_length"] = pgon_segs.apply(lambda row: row.fr_point.distance(row.to_point), axis=1)
The content of pgon_segs:
poly_id vertex_id fr_point to_point seg_length
0 A1 1 POINT (65.42207999999999 48.14785) POINT (46.63571 51.16575) 19.027230
1 A1 2 POINT (46.635708 51.165745) POINT (46.59798 47.65744) 3.508504
2 A1 3 POINT (46.597984 47.657444) POINT (68.47770 44.07370) 22.171270
3 A1 4 POINT (68.4777 44.0737) POINT (65.42208 48.14785) 5.092692
4 A3 1 POINT (46.635708 54.10819) POINT (46.63571 51.84477) 2.263420
5 A3 2 POINT (46.635708 51.84477) POINT (63.30956 48.82688) 16.944764
6 A3 3 POINT (63.30956 48.826878) POINT (62.21557 54.10819) 5.393428
7 A3 4 POINT (62.215572 54.10819) POINT (46.63571 54.10819) 15.579864
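As an aside, a shorter route to the same lengths is to pair consecutive exterior coordinates and measure each pair as a two-point LineString; a sketch reusing the gdf built above:
from shapely.geometry import LineString
rows = []
for ix, geom in zip(gdf.id, gdf.geometry):
    coords = list(geom.exterior.coords)  # closed ring: first vertex repeated at the end
    for seq, (p0, p1) in enumerate(zip(coords[:-1], coords[1:]), start=1):
        rows.append({"poly_id": ix, "vertex_id": seq,
                     "seg_length": LineString([p0, p1]).length})
seg_df = pd.DataFrame(rows)  # same seg_length values as pgon_segs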

Plot from csv with pandas grouping

If I have a csv with 4 columns, how can I plot the average of one column (y) for each value of the first column (x) using pandas? Do I have to loop over every value of the first column? I am not sure about the implementation.
For example, if I have a csv file:
a,1,2,4
a,2,2,5
a,3,2,6
a,4,2,5
b,1,3,2
b,2,3,3
b,3,3,4
and I want a plot with a,average(3rd column) and b,average(3rd column)
I have to do something like:
df = pd.read_csv(...)
x = df.groupby('1st column').mean()
I would also try to plot a kde over the 2nd column, which has ten rows for every group of the first column.
In particular, I don't understand how to group data from a csv file without a header.
Thank you for the help.
Assume your dataframe looks like
print(df)
0 1 2 3
0 a 1 2 4
1 a 2 2 5
2 a 3 2 6
3 a 4 2 5
4 b 1 3 2
5 b 2 3 3
6 b 3 3 4
If you want to plot a's average of the 3rd column and b's average of the 3rd column, you can do:
import pandas as pd
import matplotlib.pyplot as plt
df.groupby(0).mean()[3].plot.bar(rot=0)  # group by the 1st column, take means, select column 3
plt.show()
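Regarding the missing header: passing header=None makes pandas assign the integer column labels 0..3 used above. A sketch, assuming the file is called data.csv; the per-group kde needs scipy installed:
df = pd.read_csv('data.csv', header=None)
# one kde curve per group of the 1st column, over the 2nd column (label 1)
for name, group in df.groupby(0):
    group[1].plot.kde(label=name)
plt.legend()
plt.show()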

How do I transpose a Dataframe and how to scatter plot the transposed df

I have this dataframe with 20 countries and 20 years of data
Country 2000 2001 2002 ...
USA 1 2 3
CANADA 4 5 6
SWEDEN 7 8 9
...
and I want to get a new df to create a scatter plot with y = value for each column (country) and x = year:
Country USA CANADA SWEDEN ...
2000 1 4 7
2001 2 5 8
2002 3 6 9
...
My Code :
data = pd.read_csv("data.csv")
data.set_index("Country Name", inplace = True)
data_transposed = data.T
I'm struggling to create this kind of scatter plot.
Any idea ?
Thanks
Scatter is a plot which receives x and y only, so you cannot scatter the whole dataframe directly. However, here is a small workaround:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(data={"Country":["USA", "Canada", "Brasil"], 2000:[1,4,7], 2001:[3,7,9], 2002: [2,8,5]})
for column in df.columns:
    if column != "Country":
        plt.scatter(x=df["Country"], y=df[column])
plt.show()
The result (plot image omitted): each column is plotted separately, and eventually you get what you want.
As you can see, each year is represented by a different color - you can do the opposite (plotting years and having countries as different colors). A scatter plot is 1-to-1: you have Country, Year, and Value, but you can present only two of them at once (unless you use colors, for example).
You need to transpose your dataframe for that (since you specify yourself what x and y are), which you can do with df.transpose(): see the documentation.
Notice in my df, country column is not an index. You can use set_index or reset_index to control it.
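For the opposite orientation (years on x, one color per country), a sketch using the same df:
df_t = df.set_index('Country').T  # rows become years, columns become countries
for country in df_t.columns:
    plt.scatter(x=df_t.index, y=df_t[country], label=country)
plt.legend()
plt.show()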

labeling data points with dataframe including empty cells

I have an Excel sheet like this:
A  B  C  D
3  1  2  8
4  2  2  8
5  3  2  9
      2  9
6  4  2  7
Now I am trying to plot 'B' over 'C' and label the data points with the entries of 'A'. It should show me the points 1/2, 2/2, 3/2 and 4/2 with the corresponding labels.
import matplotlib.pyplot as plt
import pandas as pd
import os
df = pd.read_excel(os.path.join(os.path.dirname(__file__), "./Datenbank/Test.xlsx"))
fig, ax = plt.subplots()
df.plot('B', 'C', kind='scatter', ax=ax)
df[['B','C','A']].apply(lambda x: ax.text(*x),axis=1);
plt.show()
Unfortunately I am getting this plot (image omitted), along with the error:
ValueError: posx and posy should be finite values
As you can see, it did not label the last data point. I know it is because of the empty cells in the sheet, but I cannot avoid them. There is simply no measurement data at those positions.
I already searched for a solution here:
Annotate data points while plotting from Pandas DataFrame
but it did not solve my problem.
So, is there a way to still label the last data point?
P.S.: The Excel sheet is just an example, so keep in mind that in reality there are many empty cells at different positions.
You can simply drop the invalid data rows from df before plotting:
df = df[df['B'].notnull()]
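More generally, since the empty cells can sit in any of the plotted columns, you can drop every row with a missing value in the columns you actually use; a sketch:
df = df.dropna(subset=['A', 'B', 'C'])  # keep rows where all three plotting columns are present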

df.mean() / jupyter / pandas alternating axis for output

I haven't posted many questions, but I have found a very strange behavior causing alternating output. I'm hoping someone can help shed some light on this.
I am using jupyter and I am creating some data like this:
# Use the following data for this assignment:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(12345)
df = pd.DataFrame([np.random.normal(32000,200000,3650),
                   np.random.normal(43000,100000,3650),
                   np.random.normal(43500,140000,3650),
                   np.random.normal(48000,70000,3650)],
                  index=[1992,1993,1994,1995])
df
Now in the next cell I have a couple of lines that take the transpose of the df and then compute the mean and standard deviation. However, when I run this cell multiple times, I seem to get different output from .mean():
df = df.T
values = df.mean(axis=0)
std = df.std(axis=0)
values
I am using shift enter to run this second cell and this is what I will get:
1992 33312.107476
1993 41861.859541
1994 39493.304941
1995 47743.550969
dtype: float64
And when I run the cell again using shift + enter (Output truncated but you should get the idea)
0 5447.716574
1 126449.084350
2 41091.469083
3 -61754.197831
4 223744.364842
5 94746.779056
6 57607.078825
7 109812.089923
8 28283.060354
9 69768.157194
10 32952.030326
11 40222.026635
12 64786.632304
13 17025.266684
14 111334.168830
15 96067.788206
16 -68157.985363
I have tried changing the axis parameter and removing the axis parameter, but the output remains the same.
Here is a screenshot in case anyone is interested in duplicating what I have done (image: Jupyter window on my end).
Thanks for reading.
Your problem is that in your second cell you re-assign df to df.T, so every run transposes your dataframe again. Don't use df = df.T; just say this instead:
values = df.T.mean(axis=0)
std = df.T.std(axis=0)
Or even better, use axis=1 (apply it to columns instead of rows) without transposing:
values = df.mean(axis=1)
std = df.std(axis=1)
You can also use describe:
df.T.describe()
Out[267]:
1992 1993 1994 1995
count 3650.000000 3650.000000 3650.000000 3650.000000
mean 34922.760627 41574.363827 43186.197526 49355.777683
std 200618.445749 98495.601455 140639.407130 70408.448642
min -632057.636640 -292484.131067 -435217.159232 -181304.694667
25% -98715.272565 -24771.835741 -49460.639563 -973.422386
50% 34446.219184 41474.621854 43323.557410 49281.270881
75% 170722.706967 107502.446843 136286.933017 97422.070284
max 714855.084396 453834.306915 516751.566696 295427.273677
