I have 2 shapefiles for the UK:
In [3]: # SHAPEFILE 1:
...: # WESTMINSTER PARLIAMENTARY CONSTITUENCY UK SHAPEFILE
...: shapefile1 = "../Westminster_Parliamentary_Constituencies_December_2017_UK_BSC_SUPER_SMALL/Westminster_Parliamentary_Constituencies_December_2017_UK_BSC.shp"
In [4]: # SHAPEFILE 2:
...: # LAD19 UK SHAPEFILE
...: shapefile2 = "../03_Maps_March_2020/level3_LAD19_CONTAINS_4_LEVELS_OF_DETAIL/Local_Authority_Districts_December_2019_Boundaries_UK_BUC/Local_Authority_Districts_December_2019_Boundaries_UK_BUC.shp"
In [6]: # LOAD SHAPEFILE 1 INTO GEOPANDAS
...: parl_con = gpd.read_file(shapefile1)
...: parl_con.head()
Out[6]:
FID PCON17CD PCON17NM BNG_E BNG_N LONG LAT Shape__Are Shape__Len geometry
0 11 E14000540 Barking 546099 184533 0.105346 51.5408 5.225347e+07 44697.210277 MULTIPOLYGON (((0.07106 51.53715, 0.07551 51.5...
1 12 E14000541 Barnsley Central 433719 408537 -1.492280 53.5724 1.377661e+08 72932.918783 POLYGON ((-1.42490 53.60448, -1.43298 53.59652...
2 13 E14000542 Barnsley East 439730 404883 -1.401980 53.5391 2.460912e+08 87932.525762 POLYGON ((-1.34873 53.58335, -1.33215 53.56286...
3 14 E14000543 Barrow and Furness 325384 484663 -3.146730 54.2522 8.203002e+08 283121.334647 MULTIPOLYGON (((-3.20064 54.06488, -3.20111 54...
4 15 E14000544 Basildon and Billericay 569070 192467 0.440099 51.6057 1.567962e+08 57385.722178 POLYGON ((0.49457 51.62362, 0.50044 51.61807, ...
In [7]: # SHAPEFILE 1 PROJECTION:
...: parl_con.crs
Out[7]: {'init': 'epsg:4326'}
In [12]: # LOAD SHAPEFILE 2 INTO GEOPANDAS
...: lad19 = gpd.read_file(shapefile2)
...: lad19.head()
Out[12]:
objectid lad19cd lad19nm lad19nmw bng_e bng_n long lat st_areasha st_lengths geometry
0 1 E06000001 Hartlepool None 447160 531474 -1.27018 54.676140 9.684551e+07 50305.325058 POLYGON ((448986.025 536729.674, 453194.600 53...
1 2 E06000002 Middlesbrough None 451141 516887 -1.21099 54.544670 5.290846e+07 34964.406313 POLYGON ((451752.698 520561.900, 452424.399 52...
2 3 E06000003 Redcar and Cleveland None 464361 519597 -1.00608 54.567520 2.486791e+08 83939.752513 POLYGON ((451965.636 521061.756, 454348.400 52...
3 4 E06000004 Stockton-on-Tees None 444940 518183 -1.30664 54.556911 2.071591e+08 87075.860824 POLYGON ((451965.636 521061.756, 451752.698 52...
4 5 E06000005 Darlington None 428029 515648 -1.56835 54.535339 1.988128e+08 91926.839545 POLYGON ((419709.299 515678.298, 419162.998 51...
In [13]: # SHAPEFILE 2 PROJECTION:
...: lad19.crs
Out[13]: {'init': 'epsg:27700'}
With the shapefile using WGS 84 projection, I can successfully plot my choropleth using gv.Polygons:
In [14]: # USE GEOPANDAS DATAFRAME WITH gv.Polygons TO PRODUCE INTERACTIVE CHOROPLETH:
...: gv.Polygons(parl_con, vdims='PCON17NM'
...: ).opts(tools=['hover','tap'],
...: width=450, height=600
...: )
Out[14]: :Polygons [Longitude,Latitude] (PCON17NM)
However, if I use the shapefile in the OSGB projection, I get an error:
In [15]: # USE GEOPANDAS DATAFRAME WITH gv.Polygons TO PRODUCE INTERACTIVE CHOROPLETH:
...: gv.Polygons(lad19, vdims='lad19_name',
...: ).opts(tools=['hover','tap'],
...: width=450, height=600
...: )
DataError: Expected Polygons instance to declare two key dimensions corresponding to the geometry coordinates but 3 dimensions were found which did not refer to any columns.
GeoPandasInterface expects a list of tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html
I tried converting the projection, but I got the same error when I ran gv.Polygons again:
In [16]: lad19.crs
Out[16]: {'init': 'epsg:27700'}
In [17]: lad19.crs = {'init': 'epsg:4326'}
...: lad19.crs
Out[17]: {'init': 'epsg:4326'}
In [19]: # USE GEOPANDAS DATAFRAME WITH gv.Polygons TO PRODUCE INTERACTIVE CHOROPLETH:
...: gv.Polygons(lad19, vdims='lad19_name',
...: ).opts(tools=['hover','tap'],
...: width=450, height=600
...: )
DataError: Expected Polygons instance to declare two key dimensions corresponding to the geometry coordinates but 3 dimensions were found which did not refer to any columns.
GeoPandasInterface expects a list of tabular data, for more information on supported datatypes see http://holoviews.org/user_guide/Tabular_Datasets.html
Note that I can successfully plot choropleths for both of these shapefiles using gv.Shape. The only difference using gv.Shape is that with shapefile 1 I don’t need to specify the projection used whereas with shapefile 2 I have to specify crs=ccrs.OSGB().
Does anyone know what’s going on here?
Thanks
Shapefile download links:
Shapefile 1:
https://geoportal.statistics.gov.uk/datasets/westminster-parliamentary-constituencies-december-2017-uk-bsc
Shapefile 2:
https://geoportal.statistics.gov.uk/datasets/local-authority-districts-december-2019-boundaries-uk-buc
My issue turned out to be caused by my reprojection step from OSGB to WGS 84.
# THE ORIGINAL PROJECTION ON THE SHAPEFILE
In [16]: lad19.crs
Out[16]: {'init': 'epsg:27700'}
While the result of the following command would suggest that the reprojection step worked
In [17]: lad19.crs = {'init': 'epsg:4326'}
...: lad19.crs
Out[17]: {'init': 'epsg:4326'}
if you look at the geometry attribute you can see that it is still made up of eastings and northings and not longitudes and latitudes as you would expect after reprojecting:
In [8]: lad19["geometry"].head()
Out[8]:
0 POLYGON ((448986.025 536729.674, 453194.600 53...
1 POLYGON ((451752.698 520561.900, 452424.399 52...
2 POLYGON ((451965.636 521061.756, 454348.400 52...
3 POLYGON ((451965.636 521061.756, 451752.698 52...
4 POLYGON ((419709.299 515678.298, 419162.998 51...
Name: geometry, dtype: geometry
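Assigning to .crs only relabels the metadata; it does not transform the coordinates. The solution was to reproject from the original to the desired projection with to_crs instead. Including inplace=True matters here, because otherwise to_crs returns a new GeoDataFrame and leaves lad19 unchanged: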
In [11]: lad19.to_crs({'init': 'epsg:4326'},inplace=True)
...: lad19.crs
Out[11]: {'init': 'epsg:4326'}
The eastings and northings contained in the geometry column have now been converted to longitudes and latitudes
In [12]: lad19["geometry"].head()
Out[12]:
0 POLYGON ((-1.24098 54.72318, -1.17615 54.69768...
1 POLYGON ((-1.20088 54.57763, -1.19055 54.57496...
2 POLYGON ((-1.19750 54.58210, -1.16017 54.60449...
3 POLYGON ((-1.19750 54.58210, -1.20088 54.57763...
4 POLYGON ((-1.69692 54.53600, -1.70526 54.54916...
Name: geometry, dtype: geometry
and now gv.Polygons can use this shapefile to successfully produce a choropleth map:
In [13]: gv.Polygons(lad19, vdims='lad19nm',
...: ).opts(tools=['hover','tap'],
...: width=450, height=600
...: )
Out[13]: :Polygons [Longitude,Latitude] (lad19nm)
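A quick way to catch this kind of mismatch is to check whether the coordinates are plausible for the declared CRS. The sketch below uses a hypothetical helper, looks_like_lonlat (it is not part of geopandas or geoviews), and tests a few vertices copied from the outputs above against the valid longitude/latitude ranges.

```python
def looks_like_lonlat(coords):
    """Return True if every (x, y) pair falls inside valid lon/lat ranges."""
    return all(-180.0 <= x <= 180.0 and -90.0 <= y <= 90.0 for x, y in coords)


# First vertices of lad19 before and after to_crs, copied from the outputs above
before = [(448986.025, 536729.674), (451752.698, 520561.900)]
after = [(-1.24098, 54.72318), (-1.20088, 54.57763)]

print(looks_like_lonlat(before))  # False: still eastings and northings
print(looks_like_lonlat(after))   # True: plausible longitudes and latitudes
```

A check like this only flags obviously wrong values; it cannot confirm that a reprojection was done correctly.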
I have a dataframe of XY coordinates which I'm plotting as Markers in a Scatter plot. I'd like to add_trace lines between specific XY pairs, not between every pair. For example, I'd like a line between Index 0 and Index 3 and another between Index 1 and Index 2. This means that just using a line plot won't work as I don't want to show all the connections. Is it possible to do it with a version of iloc or do I need to make my DataFrame in 'Wide-format' and have each XY pair as separate column pairs?
I've read through this question, but I'm not sure it helps in my case: Adding specific lines to a Plotly Scatter3d() plot
import pandas as pd
import plotly.graph_objects as go
# sample data
d={'MeanE': {0: 22.448461538460553, 1: 34.78435897435799, 2: 25.94307692307667, 3: 51.688974358974164},
'MeanN': {0: 110.71128205129256, 1: 107.71666666666428, 2: 140.6384615384711, 3: 134.58615384616363}}
# pandas dataframe
df=pd.DataFrame(d)
# set up plotly figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['MeanE'],y=df['MeanN'],mode='markers'))
fig.show()
UPDATE:
Adding the accepted answer below to what I had already, I now get the following finished plot.
The approach taken here is to tag the DataFrame rows that make up each pair of coordinates with a group label, then add one line trace per group to the figure using a list comprehension.
import pandas as pd
import plotly.graph_objects as go
# sample data
d={'MeanE': {0: 22.448461538460553, 1: 34.78435897435799, 2: 25.94307692307667, 3: 51.688974358974164},
'MeanN': {0: 110.71128205129256, 1: 107.71666666666428, 2: 140.6384615384711, 3: 134.58615384616363}}
# pandas dataframe
df=pd.DataFrame(d)
# set up plotly figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['MeanE'],y=df['MeanN'],mode='markers'))
# mark the pairs of rows that will be joined by lines
df.loc[[0, 3], "group"] = 1
df.loc[[1, 2], "group"] = 2
# add the lines to the figure
fig.add_traces(
[
go.Scatter(
x=df.loc[df["group"].eq(g), "MeanE"],
y=df.loc[df["group"].eq(g), "MeanN"],
mode="lines",
)
for g in df["group"].unique()
]
)
fig.show()
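An alternative solution for the extended requirement raised in the comments, where the pairs are given directly as lists of index labels and a point can belong to more than one line:
# list the pairs of rows that will be joined by lines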
lines = [[0, 3], [1, 2], [0,2],[1,3]]
# add the lines to the figure
fig.add_traces(
[
go.Scatter(
x=df.loc[pair, "MeanE"],
y=df.loc[pair, "MeanN"],
mode="lines",
)
for pair in lines
]
)
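Another option, sketched here as an assumption rather than as part of the accepted answer, is to draw all segments in a single trace. Plotly breaks a line wherever it meets a None value (with the default connectgaps=False), so the pairs can be flattened into one x list and one y list separated by None. The helper name build_segments is hypothetical, and the coordinates are the sample values rounded to two decimals.

```python
def build_segments(xs, ys, pairs):
    """Flatten index pairs into x/y lists with None between segments."""
    seg_x, seg_y = [], []
    for start, end in pairs:
        seg_x += [xs[start], xs[end], None]
        seg_y += [ys[start], ys[end], None]
    return seg_x, seg_y


mean_e = [22.45, 34.78, 25.94, 51.69]      # rounded MeanE values
mean_n = [110.71, 107.72, 140.64, 134.59]  # rounded MeanN values
lines = [[0, 3], [1, 2]]

seg_x, seg_y = build_segments(mean_e, mean_n, lines)
print(seg_x)  # [22.45, 51.69, None, 34.78, 25.94, None]

# The lists could then be passed to one trace, for example:
# fig.add_trace(go.Scatter(x=seg_x, y=seg_y, mode="lines"))
```

The trade-off is that a single trace gives every segment the same colour and one legend entry, whereas one trace per pair lets each line be styled separately.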
data = b'0.01,71.5\r\n' #from PySerial to RaspPi USB
a,b = [float(x) for x in data]
ValueError: too many values to unpack (expected 2)
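Iterating over a bytes object yields one integer per byte, so the list comprehension produces a float for every byte rather than for each of the two numbers. Decode the bytes to a string and split it on the comma first: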
In [20]: data = b'0.01,71.5\r\n' #from PySerial to RaspPi USB
...: a,b = [float(x) for x in data.decode('utf-8')
...: .split(',')]
In [21]: a,b
Out[21]: (0.01, 71.5)
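If readings arrive continuously, it may help to wrap the parsing in a small function that also rejects malformed lines. This is a minimal sketch; the name parse_reading and the n_fields parameter are hypothetical, and it assumes the device always sends ASCII text.

```python
def parse_reading(raw, n_fields=2):
    """Decode one serial line such as b'0.01,71.5\\r\\n' into a tuple of floats."""
    fields = raw.decode("ascii").strip().split(",")
    if len(fields) != n_fields:
        raise ValueError(f"expected {n_fields} fields, got {len(fields)}: {raw!r}")
    return tuple(float(field) for field in fields)


a, b = parse_reading(b"0.01,71.5\r\n")
print(a, b)  # 0.01 71.5

# Iterating over bytes yields integers, which is why the original unpacking failed
print(list(b"0.0"))  # [48, 46, 48]
```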
I am reading a CSV file:
Notation Level RFResult PRIResult PDResult Total Result
AAA 1 1.23 0 2 3.23
AAA 1 3.4 1 0 4.4
BBB 2 0.26 1 1.42 2.68
BBB 2 0.73 1 1.3 3.03
CCC 3 0.30 0 2.73 3.03
DDD 4 0.25 1 1.50 2.75
AAA 5 0.25 1 1.50 2.75
FFF 6 0.26 1 1.42 2.68
...
...
Here is the code
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('home\NewFiles\Files.csv')
Notation = df['Notation']
Level = df['Level']
RFResult = df['RFResult']
PRIResult = df['PRIResult']
PDResult = df['PDResult']
fig, axes = plt.subplots(nrows=7, ncols=1)
ax1, ax2, ax3, ax4, ax5, ax6, ax7 = axes.flatten()
n_bins = 13
ax1.hist(df['Total'], n_bins, histtype='bar') # Currently this shows all Total Results in one plot
plt.show()
I want to show the Total Result for each Level on its own axes, as follows:
ax1 will show Level 1 Total Result
ax2 will show Level 2 Total Result
ax3 will show Level 3 Total Result
ax4 will show Level 4 Total Result
ax5 will show Level 5 Total Result
ax6 will show Level 6 Total Result
ax7 will show Level 7 Total Result
You can select a filtered part of a dataframe by boolean indexing: df[df['Level'] == level]['Total']. You can loop through the axes using for ax in axes.flatten(). To also get the index, use for ind, ax in enumerate(axes.flatten()). Note that Python starts counting from 0, so add 1 to the index to get the level.
Note that backslashes in a string literal are treated as escape characters. Using a raw string avoids this: r'home\NewFiles\Files.csv'.
The default ylim is from 0 to the maximum bar height, plus some padding. This can be changed for each ax separately. In the example below a list of ymax values is used to show the principle.
ax.grid(True, axis='both') turns the grid on for that ax. Instead of 'both', 'x' or 'y' can be used to show grid lines for that axis only. A grid line is drawn for each tick value. (The example below tries to use little space, so only a few gridlines are visible.)
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
N = 1000
df = pd.DataFrame({'Level': np.random.randint(1, 6, N), 'Total': np.random.uniform(1, 5, N)})
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True)
ymax_per_level = [27, 29, 28, 26, 27]
for ind, (ax, lev_ymax) in enumerate(zip(axes.flatten(), ymax_per_level)):
    level = ind + 1
    n_bins = 13
    ax.hist(df[df['Level'] == level]['Total'], bins=n_bins, histtype='bar')
    ax.set_ylabel(f'TL={level}') # to add the level in the ylabel
    ax.set_ylim(0, lev_ymax)
    ax.grid(True, axis='both')
plt.show()
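The filtering step can be checked on its own before any plotting. The sketch below uses the rows from the question's sample table (Level and Total Result only, with the column named Total as in the code above) and compares boolean-mask selection with groupby, which yields the same subsets without repeating the mask for every level.

```python
import pandas as pd

# Rows copied from the question's sample table (Level and Total Result only)
df = pd.DataFrame({
    "Level": [1, 1, 2, 2, 3, 4, 5, 6],
    "Total": [3.23, 4.4, 2.68, 3.03, 3.03, 2.75, 2.75, 2.68],
})

# Boolean-mask selection, as used in the plotting loop
level_1 = df[df["Level"] == 1]["Total"]
print(level_1.tolist())  # [3.23, 4.4]

# groupby produces one subset per level in a single pass
totals_by_level = {
    level: group["Total"].tolist() for level, group in df.groupby("Level")
}
print(totals_by_level[2])  # [2.68, 3.03]
```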
PS: A stacked histogram with custom legend and custom vertical lines could be created as:
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import pandas as pd
import numpy as np
N = 1000
df = pd.DataFrame({'Level': np.random.randint(1, 6, N),
'RFResult': np.random.uniform(1, 5, N),
'PRIResult': np.random.uniform(1, 5, N),
'PDResult': np.random.uniform(1, 5, N)})
df['Total'] = df['RFResult'] + df['PRIResult'] + df['PDResult']
fig, axes = plt.subplots(nrows=5, ncols=1, sharex=True)
colors = ['crimson', 'limegreen', 'dodgerblue']
column_names = ['RFResult', 'PRIResult', 'PDResult']
level_vertical_line = [1, 2, 3, 4, 5]
for level, (ax, vertical_line) in enumerate(zip(axes.flatten(), level_vertical_line), start=1):
    n_bins = 13
    level_data = df[df['Level'] == level][column_names].to_numpy()
    # vertical_line = level_data.mean()
    ax.hist(level_data, bins=n_bins,
            histtype='bar', stacked=True, color=colors)
    ax.axvline(vertical_line, color='gold', ls=':', lw=2)
    ax.set_ylabel(f'TL={level}') # to add the level in the ylabel
    ax.margins(x=0.01)
    ax.grid(True, axis='both')
legend_handles = [Patch(color=color) for color in colors]
axes[0].legend(legend_handles, column_names, ncol=len(column_names), loc='lower center', bbox_to_anchor=(0.5, 1.02))
plt.show()
I have 3 big CSV files. I am trying to randomly extract some samples from the files without loading them into memory. I am doing this:
SITS = dd.read_csv("sits_train_0.csv", blocksize="512MB",
usecols=band_blue + ["samplefid"]).set_index("samplefid")
MASK = dd.read_csv("mask_train_0.csv", blocksize="512MB",
usecols=band_mask + ["samplefid"]).set_index("samplefid")
GP = dd.read_csv("sits_gp_train_0.csv", blocksize="512MB",
usecols=band_blue_gp + ["samplefid"]).set_index("samplefid")
# SITS = pd.read_csv("sits_train_0.csv",
# usecols=band_blue + ["samplefid"]).set_index("samplefid")
# MASK = pd.read_csv("mask_train_0.csv",
# usecols=band_mask + ["samplefid"]).set_index("samplefid")
# GP = pd.read_csv("sits_gp_train_0.csv",
# usecols=band_blue_gp + ["samplefid"]).set_index("samplefid")
np.random.seed(0)
NSAMPLES=100
samples = np.random.choice(MASK.index, size=NSAMPLES, replace=False)
s = SITS.loc[samples][band_blue].compute().values
m = MASK.loc[samples][band_mask].compute().values
sg = GP.loc[samples][band_blue_gp].compute().values
# s = SITS.loc[samples][band_blue].values
# m = MASK.loc[samples][band_mask].values
# sg = GP.loc[samples][band_blue_gp].values
I got strange results, so I compared against pandas on smaller files (see the commented-out code above), which gives correct results.
If I set blocksize to None, the results are fine, but everything is loaded into memory, so dask is no longer useful, and my CSV files are too big to fit in memory. The rows in my CSV files are written in random order, so I need to use the index to recover the same samples from all 3 files.
I feel I am missing something about dask, but I don't see what.
I'd recommend using sample
In [16]: import pandas as pd
In [17]: import dask.dataframe as dd
In [18]: df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...: 'num_wings': [2, 0, 0, 0],
...: 'num_specimen_seen': [10, 2, 1, 8]},
...: index=['falcon', 'dog', 'spider', 'fish'])
In [19]: ddf = dd.from_pandas(df, npartitions=2)
In [20]: ddf.sample??
In [21]: df.sample(frac=0.5, replace=True, random_state=1)
Out[21]:
num_legs num_wings num_specimen_seen
dog 4 0 2
fish 0 0 8
In [22]: ddf.sample(frac=0.5, replace=True, random_state=1)
Out[22]:
Dask DataFrame Structure:
num_legs num_wings num_specimen_seen
npartitions=2
dog int64 int64 int64
fish ... ... ...
spider ... ... ...
Dask Name: sample, 4 tasks
In [23]: ddf.sample(frac=0.5, replace=True, random_state=1).compute()
Out[23]:
num_legs num_wings num_specimen_seen
falcon 2 2 10
fish 0 0 8
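Since the question needs the same rows from three files, one possible pattern is to draw the ids once and filter each table on the key column, then sort by the key so the rows line up. The sketch below shows this on toy data with plain pandas. The column names blue, mask and blue_gp and the helper select are hypothetical. Dask DataFrames also provide isin, so a similar filter followed by compute() should be expressible there, but this has not been tested against the files in the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ids = np.arange(10)

# Three toy tables sharing a samplefid key, each stored in a different row order
sits = pd.DataFrame({"samplefid": rng.permutation(ids)})
sits["blue"] = sits["samplefid"] * 1.0
mask = pd.DataFrame({"samplefid": rng.permutation(ids)})
mask["mask"] = mask["samplefid"] % 2
gp = pd.DataFrame({"samplefid": rng.permutation(ids)})
gp["blue_gp"] = gp["samplefid"] * 10.0

# Draw the ids once, then reuse them for every table
samples = rng.choice(ids, size=4, replace=False)


def select(frame, wanted):
    """Keep only rows whose samplefid is in wanted, ordered by samplefid."""
    picked = frame[frame["samplefid"].isin(wanted)]
    return picked.set_index("samplefid").sort_index()


s, m, sg = (select(frame, samples) for frame in (sits, mask, gp))
print(s.index.equals(m.index) and m.index.equals(sg.index))  # True
```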
I'm trying to annotate a chart to include the plotted values of the x-axis as well as additional information from the DataFrame. I am able to annotate the values from the x-axis but not sure how I can add additional information from the data frame. In my example below I am annotating the x-axis which are the values from the Completion column but also want to add the Completed and Participants values from the DataFrame.
For example the Running Completion is 20% but I want my annotation to show the Completed and Participants values in the format - 20% (2/10). Below is sample code that can reproduce my scenario as well as current and desired results. Any help is appreciated.
Code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydict = {
'Event': ['Running', 'Swimming', 'Biking', 'Hiking'],
'Completed': [2, 4, 3, 7],
'Participants': [10, 20, 35, 10]}
df = pd.DataFrame(mydict).set_index('Event')
df = df.assign(Completion=(df.Completed/df.Participants) * 100)
print(df)
plt.subplots(figsize=(5, 3))
ax = sns.barplot(x=df.Completion, y=df.index, color="cyan", orient='h')
for i in ax.patches:
    ax.text(i.get_width() + .4,
            i.get_y() + .67,
            str(round((i.get_width()), 2)) + '%', fontsize=10)
plt.tight_layout()
plt.show()
DataFrame:
Completed Participants Completion
Event
Running 2 10 20.000000
Swimming 4 20 20.000000
Biking 3 35 8.571429
Hiking 7 10 70.000000
Current Output:
Desired Output:
Loop through the columns Completed and Participants as well when you annotate:
for (c,p), i in zip(df[["Completed","Participants"]].values, ax.patches):
    ax.text(i.get_width() + .4,
            i.get_y() + .67,
            str(round((i.get_width()), 2)) + '%' + f" ({c}/{p})", fontsize=10)
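The label text can also be built and inspected on its own before it is passed to ax.text. This is a minimal sketch using the DataFrame from the question. It assumes the bars are drawn in the same order as the rows, which is what zipping the rows with ax.patches relies on.

```python
import pandas as pd

df = pd.DataFrame({
    "Event": ["Running", "Swimming", "Biking", "Hiking"],
    "Completed": [2, 4, 3, 7],
    "Participants": [10, 20, 35, 10],
}).set_index("Event")
df = df.assign(Completion=df.Completed / df.Participants * 100)

# One label per row, in the same order as the rows of the DataFrame
labels = [
    f"{round(pct, 2)}% ({done}/{total})"
    for done, total, pct in zip(df["Completed"], df["Participants"], df["Completion"])
]
print(labels)
# ['20.0% (2/10)', '20.0% (4/20)', '8.57% (3/35)', '70.0% (7/10)']
```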