CRS error while clipping rioxarray to shapefile

I'm trying to clip a rioxarray dataset to a shapefile, but get the following error:
> data_clipped = data.rio.clip(shape.geometry.apply(mapping))
MissingCRS: CRS not found. Please set the CRS with 'set_crs()' or 'write_crs()'. Data variable: precip
This error seems straightforward, but I can't figure out which CRS needs to be set. Both the dataset and the shapefile have CRS values that rio can find:
> print(data.rio.crs)
EPSG:4326
> print(shape.crs)
epsg:4326
The DataArray within the dataset, called 'precip', does not have a CRS, and it also doesn't seem to respond to the set_crs() command:
> print(data.precip.rio.crs)
None
> data.precip.rio.set_crs(data.rio.crs)
> print(data.precip.rio.crs)
None
What am I missing here?
For reference, the rioxarray set_crs() documentation shows set_crs() working on DataArrays, unlike my experience with data.precip.
My data, in case I have something unusual:
> print(data)
<xarray.Dataset>
Dimensions: (x: 541, y: 411)
Coordinates:
* y (y) float64 75.0 74.9 74.8 74.7 74.6 ... 34.3 34.2 34.1 34.0
* x (x) float64 -12.0 -11.9 -11.8 -11.7 ... 41.7 41.8 41.9 42.0
time object 2020-01-01 00:00:00
spatial_ref int64 0
Data variables:
precip (y, x) float64 nan nan nan ... 1.388e-17 1.388e-17 1.388e-17
Attributes:
Conventions: CF-1.6
history: 2021-01-05 01:36:52 GMT by grib_to_netcdf-2.16.0: /opt/ecmw...
> print(shape)
ID name orgn_name geometry
0 Albania Shqipëria MULTIPOLYGON (((19.50115 40.96230, 19.50563 40...
1 Andorra Andorra POLYGON ((1.43992 42.60649, 1.45041 42.60596, ...
2 Austria Österreich POLYGON ((16.00000 48.77775, 16.00000 48.78252...

This issue is resolved if set_crs() is chained into the same command as the clip operation:
data_clipped = data.precip.rio.set_crs('WGS84').rio.clip(shape.geometry.apply(mapping))
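This works because set_crs() returns a new object rather than modifying the DataArray in place, so a bare data.precip.rio.set_crs(...) call discards its result. A minimal two-step equivalent, sketched here with write_crs() (which additionally persists the CRS in the spatial_ref coordinate):
precip = data.precip.rio.write_crs(data.rio.crs)  # assign the returned object; the original is unchanged
data_clipped = precip.rio.clip(shape.geometry.apply(mapping))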


Overlay of two plots from two different data sources using Python / hvplot

I would like to plot a line plot (source: pandas DataFrame) over an hvplot (source: xarray/NetCDF).
The xarray looks like this:
dataDIR = 'ceilodata.nc'
DS = xr.open_dataset(dataDIR)
DS = DS.transpose()
print(DS)
<xarray.Dataset>
Dimensions: (range_hr: 32, range: 1024, layer: 3, time: 5760)
Coordinates:
* range_hr (range_hr) float32 0.001 4.995 9.99 ... 144.9 149.9 154.8
* range (range) float32 14.98 29.97 44.96 ... 1.533e+04 1.534e+04
* layer (layer) int32 1 2 3
* time (time) datetime64[ns] 2022-03-18 ... 2022-03-18T23:59:46
Data variables: (12/41)
zenith float32 ...
wavelength float32 ...
scaling float32 ...
range_gate_hr float32 ...
range_gate float32 ...
longitude float32 ...
... ...
cbe (layer, time) int16 ...
beta_raw_hr (range_hr, time) float32 ...
beta_raw (range, time) float32 ...
bcc (time) int8 ...
base (time) float32 ...
average_time (time) int32 ...
Attributes: (12/13)
comment:
software_version: 15.06.1 2.13 1.040 1
title: CHM15k Nimbus
wmo_id: 10865
month: 3
source: CHM160138
... ...
serlom: TUB160038
location: muenchen
year: 2022
device_name: CHM160138
institution: DWD
day: 18
The pandas dataframe source looks like this:
df = pd.read_csv('PTU.csv')
print(df)
Unnamed: 0 PTU
0 2022-03-18 07:38:56 451.839
1 2022-03-18 07:38:57 468.826
2 2022-03-18 07:38:58 469.093
3 2022-03-18 07:38:59 469.356
4 2022-03-18 07:39:00 469.623
... ... ...
6140 2022-03-18 09:21:16 31690.600
6141 2022-03-18 09:21:17 31694.700
6142 2022-03-18 09:21:18 31692.900
6143 2022-03-18 09:21:19 31712.000
6144 2022-03-18 09:21:20 31711.500
[6145 rows x 2 columns]
Both are time-dependent datasets but have different time stamps and frequencies; time is the index in each dataset.
I tried to plot them together with additional holoviews imports. While each plot works fine on its own, plotting them together did not work the way I tried it:
import hvplot.pandas
import holoviews as hv
# cmap plot of the xarray:
ceilo = DS.b_r.hvplot(cmap="viridis_r", width=850, height=600, title='title', clim=(5, 80))
# line plot of the data frame
p = df.hvplot.line()
# add pressure line plot to the pcolormesh plot using *, which overlays the line on the plot
ceilo * p
but this ended in an error message with the following complete traceback:
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-10-2b1c6baca339> in <module>
24 p = df.hvplot.line()
25 # add pressure line plot to pcolormeshplot using * which overlays the line on the plot
---> 26 ceilo * df
c:\python38\lib\site-packages\pandas\core\ops\common.py in new_method(self, other)
68 other = item_from_zerodim(other)
69
---> 70 return method(self, other)
71
72 return new_method
c:\python38\lib\site-packages\pandas\core\arraylike.py in __rmul__(self, other)
118 @unpack_zerodim_and_defer("__rmul__")
119 def __rmul__(self, other):
--> 120 return self._arith_method(other, roperator.rmul)
121
122 @unpack_zerodim_and_defer("__truediv__")
c:\python38\lib\site-packages\pandas\core\frame.py in _arith_method(self, other, op)
6936 other = ops.maybe_prepare_scalar_for_op(other, (self.shape[axis],))
6937
-> 6938 self, other = ops.align_method_FRAME(self, other, axis, flex=True, level=None)
6939
6940 new_data = self._dispatch_frame_op(other, op, axis=axis)
c:\python38\lib\site-packages\pandas\core\ops\__init__.py in align_method_FRAME(left, right, axis, flex, level)
275 elif is_list_like(right) and not isinstance(right, (ABCSeries, ABCDataFrame)):
276 # GH 36702. Raise when attempting arithmetic with list of array-like.
--> 277 if any(is_array_like(el) for el in right):
278 raise ValueError(
279 f"Unable to coerce list of {type(right[0])} to Series/DataFrame"
c:\python38\lib\site-packages\holoviews\core\element.py in __iter__(self)
94 def __iter__(self):
95 "Disable iterator interface."
---> 96 raise NotImplementedError('Iteration on Elements is not supported.')
97
98
NotImplementedError: Iteration on Elements is not supported.
Is the different time frequency a problem here? The line plot should be oriented along the x- and y-axes, matching the time stamps and altitudes of the underlying cmap (matplotlib) plot.
To illustrate what I am aiming for, here is a picture of my goal (image in the original post).
Thanks for reading / helping.
I found a solution for this case:
Both datasets' time columns have to have the same format; in my case that is datetime64[ns] (to match the NetCDF xarray). That is why I converted the dataframe's time column to datetime64[ns]:
df.Datetime = df.Datetime.astype('datetime64[ns]')
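An equivalent conversion, sketched here as an alternative, uses pd.to_datetime, which infers the format and likewise returns datetime64[ns]:
import pandas as pd
df['Datetime'] = pd.to_datetime(df['Datetime'])  # same result as the astype() call above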
I also found the data to be of type "object", so I converted it to float:
df.PTU = df.PTU.astype(float)  # convert to correct data type
The last step was choosing hvplot, as this helps in plotting xarray data, and using its quadmesh method:
import hvplot.xarray
And here is my final solution:
title = ('Ceilo data' + '\ndate: ' + str(DS.year) + '-' + str(DS.month) + '-' + str(DS.day))
ceilo = DS.br.hvplot.quadmesh(cmap="viridis_r", width=850, height=600, title=title,
                              clim=(1000, 10000),  # set colorbar limits
                              cnorm='log',         # choose log scale
                              clabel='colorbar title',
                              rot=0                # degree rotation of ticks
                              )
# from: https://justinbois.github.io/bootcamp/2020/lessons/l27_holoviews.html
# take care! may take 2-3 minutes to be plotted:
p = hv.Points(data=df,
              kdims=['Datetime', 'PTU'],
              ).opts(  # alpha=0.7,
                  color='red',
                  size=1,
                  ylim=(0, 5000))
# add PTU line plot to quadmesh plot using * which overlays the line on the plot
ceilo * p

How do I use .assign for values in a column

I have a dataframe which looks like this:
date symbol numerator denominator
4522 2021-10-06 PAG.SG 1.0 18
1016 2020-11-23 IPA.V 1.0 5
412 2020-04-17 LRK.AX 1.0 30
1884 2021-06-03 BOUVETO.ST 1.0 1
2504 2021-04-28 VKGYO.IS 1.0 100
3523 2021-07-08 603355.SS 1.0 1
3195 2021-08-23 IDAI 1.0 1
3238 2021-08-19 6690.TWO 1.0 1000
3430 2021-07-19 CAXPD 1.0 10
2642 2021-04-15 035720.KS 1.0 1
dtypes:
date: object
symbol: object
numerator: float64
denominator: int64
I am trying to use pd.assign to assign a classifier to this df in the form of
df = df.assign(category = ['forward' if numerator > denominator else 'reverse' for numerator in df[['numerator', 'denominator']]])
But I'm receiving a TypeError stating:
TypeError: Invalid comparison between dtype=int64 and str
I have tried casting them explicitly, with:
df = df.assign(category = ['forward' if df['numerator'] > df['denominator'] else 'reverse' for df['numerator'] in df])
But receive another TypeError stating:
TypeError: '>' not supported between instances of 'str' and 'int'
Which is confusing because I'm not comparing strings, I'm comparing int and float.
Any help would be greatly appreciated.
You can still do this with np.where. (Your list comprehensions fail because iterating over a DataFrame yields its column labels, which are strings; hence the string-vs-int comparison errors.)
import numpy as np
df = df.assign(category=np.where(df['numerator'] > df['denominator'],
                                 'forward',
                                 'reverse'))
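For reference, an equivalent pure-pandas sketch that maps the boolean comparison to labels:
df = df.assign(category=df['numerator'].gt(df['denominator']).map({True: 'forward', False: 'reverse'}))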

Adding band description to rioxarray to_raster()

I've seen that one can add band descriptions to a geotiff image using rasterio [1]. How would I do the same thing when saving an array to a raster with rioxarray?
I tried adding the names as coords, but when I save and re-open the raster, the bands are named [1, 2, 3, 4] instead of ['R', 'G', 'B', 'NIR'].
import numpy as np
import xarray as xa
import rioxarray as rioxa
bands = ['R', 'G', 'B', 'NIR']
im_arr = np.random.randint(0, 255, size=(4, 400, 400))
im_save = xa.DataArray(im_arr, dims=('band', 'y', 'x'),
                       coords={'x': np.arange(0, 400), 'y': np.arange(0, 400),
                               'band': bands})
path = 'test.tiff'
im_save.rio.to_raster(path)
im_load = rioxa.open_rasterio(path)
print(im_load)
<xarray.DataArray (band: 4, y: 400, x: 400)> [640000 values with dtype=int32]
Coordinates:
band (band) int32 1 2 3 4
y (y) float64 0.0 1.0 2.0 3.0 4.0 ... 396.0 397.0 398.0 399.0
x (x) float64 0.0 1.0 2.0 3.0 4.0 ... 396.0 397.0 398.0 399.0
spatial_ref int32 0
Attributes:
scale_factor: 1.0
add_offset: 0.0
grid_mapping: spatial_ref
You should consider switching from a 3D DataArray to a Dataset with 4 variables, each representing a separate band.
If you name the variables correctly, it should get written to the tiff:
import numpy as np
import xarray as xa
import rioxarray as rioxa
bands = ['R', 'G', 'B', 'NIR']
xa_dataset = xa.Dataset()
for band in bands:
    xa_dataset[band] = xa.DataArray(np.random.randint(0, 255, (400, 400), dtype="uint8"), dims=('y', 'x'),
                                    coords={'x': np.arange(0, 400), 'y': np.arange(0, 400)})
# see the structure
print(xa_dataset)
# <xarray.Dataset>
# Dimensions: (x: 400, y: 400)
# Coordinates:
# * x (x) int64 0 1 2 3 4 5 6 7 8 ... 391 392 393 394 395 396 397 398 399
# * y (y) int64 0 1 2 3 4 5 6 7 8 ... 391 392 393 394 395 396 397 398 399
# Data variables:
# R (y, x) uint8 18 41 126 79 64 215 105 ... 29 137 243 23 150 23 224
# G (y, x) uint8 1 18 90 195 45 8 150 68 ... 96 194 22 58 118 210 198
# B (y, x) uint8 125 90 165 226 153 253 212 ... 162 217 221 162 18 17
# NIR (y, x) uint8 161 195 149 168 40 182 146 ... 18 114 38 119 23 110 26
# write to disk
xa_dataset.rio.to_raster("test.tiff")
# load
im_load = rioxa.open_rasterio('test.tiff')
print(im_load)
# <xarray.DataArray (band: 4, y: 400, x: 400)>
# [640000 values with dtype=uint8]
# Coordinates:
# * band (band) int64 1 2 3 4
# * y (y) float64 0.0 1.0 2.0 3.0 4.0 ... 396.0 397.0 398.0 399.0
# * x (x) float64 0.0 1.0 2.0 3.0 4.0 ... 396.0 397.0 398.0 399.0
# spatial_ref int64 0
# Attributes:
# scale_factor: 1.0
# add_offset: 0.0
# long_name: ('R', 'G', 'B', 'NIR')
# grid_mapping: spatial_ref
You can see the band names are now included in the attributes as long_name.
Running gdalinfo, you can see the band description has been set:
Driver: GTiff/GeoTIFF
Files: test.tiff
Size is 400, 400
Origin = (-0.500000000000000,-0.500000000000000)
Pixel Size = (1.000000000000000,1.000000000000000)
Image Structure Metadata:
INTERLEAVE=PIXEL
Corner Coordinates:
Upper Left ( -0.5000000, -0.5000000)
Lower Left ( -0.500, 399.500)
Upper Right ( 399.500, -0.500)
Lower Right ( 399.500, 399.500)
Center ( 199.500, 199.500)
Band 1 Block=400x5 Type=Byte, ColorInterp=Red
Description = R
Mask Flags: PER_DATASET ALPHA
Band 2 Block=400x5 Type=Byte, ColorInterp=Green
Description = G
Mask Flags: PER_DATASET ALPHA
Band 3 Block=400x5 Type=Byte, ColorInterp=Blue
Description = B
Mask Flags: PER_DATASET ALPHA
Band 4 Block=400x5 Type=Byte, ColorInterp=Alpha
Description = NIR
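If you would rather keep a single 3D DataArray, note that rioxarray surfaces band descriptions through the long_name attribute (as in the loaded output above), so setting that attribute before writing may achieve the same result; a minimal sketch, under the assumption that to_raster() writes long_name as per-band descriptions:
im_save = im_save.assign_attrs(long_name=bands)  # assumption: one description per band, in band order
im_save.rio.to_raster(path)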

How to fill missing value based on other columns in Pandas based on an interval in another column?

Suppose I have this df_atm:
borough Longitude Latitude
0 bronx 40.79 -73.78
1 manhattan 40.78 -73.90
2 staten island 40.84 -73.95
3 NaN 40.57 -74.11
Every row represents an ATM withdrawal.
I hope to fill in the missing value based on the coordinates in the Longitude and Latitude columns.
borough Longitude Latitude
0 bronx 40.79 -73.78
1 manhattan 40.78 -73.90
2 staten island 40.84 -73.95
3 staten island 40.57 -74.11
Since the coordinates [40.57, -74.11] are inside Staten Island's borough.
I have generated a dict with boroughs' coordinates:
borough_dict = {"Bronx" : [40.837048, -73.865433], "Brooklyn" : [40.650002, -73.949997], "Manhattan" : [40.758896, -73.985130], "Queens" : [40.742054,-73.769417], "Staten Island" : [40.579021,-74.151535]}
And this is what I have tried so far (code/pseudocode):
df_atm['borough'] = df_atm.apply(
lambda row: **idk what do to here** if np.isnan(row['borough']) else row['borough'],
axis=1
)
Many thanks in advance!
Try this:
from math import cos, asin, sqrt
import pandas as pd

def distance(lat1, lon1, lat2, lon2):
    # haversine formula; p converts degrees to radians, 12742 km is Earth's diameter
    p = 0.017453292519943295
    a = 0.5 - cos((lat2 - lat1) * p) / 2 + cos(lat1 * p) * cos(lat2 * p) * (1 - cos((lon2 - lon1) * p)) / 2
    return 12742 * asin(sqrt(a))

def closest(data, v):
    # nearest reference point to coordinate v by great-circle distance
    return min(data, key=lambda p: distance(v[0], v[1], p[0], p[1]))

df = pd.DataFrame(
    [
        {'borough': 'bronx', 'lat': 40.79, 'long': -73.78},
        {'borough': 'manhattan', 'lat': 40.78, 'long': -73.90},
        {'borough': None, 'lat': 40.57, 'long': -74.11}
    ],
)
borough_dict = {"Bronx": [40.837048, -73.865433], "Brooklyn": [40.650002, -73.949997], "Manhattan": [40.758896, -73.985130], "Queens": [40.742054, -73.769417], "Staten Island": [40.579021, -74.151535]}
# unpack to (lat, lon, name) tuples so closest() can index lat/lon at 0/1 and the name at 2
boroughs = [(*value, key) for key, value in borough_dict.items()]
df['borough'] = df.apply(
    lambda row: closest(boroughs, [row['lat'], row['long']])[2] if row['borough'] is None else row['borough'],
    axis=1
)
print(df)
Output:
borough lat long
0 bronx 40.79 -73.78
1 manhattan 40.78 -73.90
2 Staten Island 40.57 -74.11
Credit to trincot's answer.
You want a spatial join, so use the very closely related GeoPandas library. We'll convert your original DataFrame to a GeoDataFrame so that we can merge. Also note in your example your Latitude and Longitude columns are incorrectly labeled. I fixed that here.
import pandas as pd
import geopandas as gpd
dfg = gpd.GeoDataFrame(df.copy(), geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
# borough Latitude Longitude geometry
#0 bronx 40.79 -73.78 POINT (-73.78000 40.79000)
#1 manhattan 40.78 -73.90 POINT (-73.90000 40.78000)
#2 staten island 40.84 -73.95 POINT (-73.95000 40.84000)
#3 NaN 40.57 -74.11 POINT (-74.11000 40.57000)
# Shapefile from https://geo.nyu.edu/catalog/nyu-2451-34154
# I downloaded the geojson
df_nys = gpd.read_file('nyu-2451-34154-geojson.json')
dfg.crs = df_nys.crs # Set coordinate reference system to be the same
dfg = gpd.sjoin(dfg, df_nys[['geometry', 'boroname']], how='left', op='within')
borough Latitude Longitude geometry index_right boroname
0 bronx 40.79 -73.78 POINT (-73.78000 40.79000) 4.0 Queens
1 manhattan 40.78 -73.90 POINT (-73.90000 40.78000) 4.0 Queens
2 staten island 40.84 -73.95 POINT (-73.95000 40.84000) NaN NaN
3 NaN 40.57 -74.11 POINT (-74.11000 40.57000) 2.0 Staten Island
So now you can fill the missing 'borough' values with 'boroname', as sketched below. But it does seem like a few of the other points are misclassified. This is mostly because you don't have enough precision in your stored Latitude and Longitude. Though this would probably be the more accurate solution with better precision on Lat/Lon, I might favor the distance calculation by adnanmuttaleb given the level of precision you have in your data.
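A minimal sketch of that fill step, using the column names from the sjoin output above:
dfg['borough'] = dfg['borough'].fillna(dfg['boroname'])  # fill only the rows where borough is missing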

Metpy HRRR Cross Section

I am working on creating cross sections of HRRR model output. I have read in the grib files using xarray with pynio as the engine and then converted the files to netCDF so I can work with them on my Windows machine, so I am wondering if this conversion is causing these issues.
Here is what my dataset looks like after reading in the netCDF with xarray: (image on Imgur)
After reading in the data, I tried to follow the MetPy cross-section / xarray tutorials by parsing the data:
data = ds.metpy.parse_cf()
Which yields this new dataset: (image on Imgur)
It created the crs coordinate so I assumed it worked somewhat correctly.
Following this, I created a contour map of 700 mb RH, winds, and elevation (a different dataset), where I parsed the RH from the data dataset and also pulled out the x and y coordinates:
RH = data.metpy.parse_cf('RH_P0_L100_GLC0')
x, y = RH.metpy.coordinates('x', 'y')
This all worked and I could produce a nice-looking plot, no problem. So next I wanted to make a cross section, following the example in the documentation:
start = (40.3847, -120.5676)
end = (39.2692, -122.3784)
cross = cross_section(data, start, end)
which gave these errors: (image on Imgur)
So then I instead tried using the RH variable from above since
RH.metpy.x
gave the x-dimension. But running
cross = cross_section(RH, start, end)
gave this error instead: (image on Imgur)
So I'm just wondering if I missed a step in parsing the original dataset, or if the grib-to-netCDF conversion messed something up, or if this is even possible using MetPy?
In general I am just working towards creating a cross section like the one in the example: https://unidata.github.io/MetPy/latest/examples/cross_section.html#sphx-glr-examples-cross-section-py
As a bonus question: would it be possible to fill terrain under the plots?
Currently, MetPy's cross section interpolation relies on the x and y dimensions being present in the Dataset/DataArray as dimension coordinates (see the description in xarray's documentation here). In your dataset, the x and y dimensions of ygrid_0 and xgrid_0 are listed as dimensions without coordinates, hence the problem.
However, since this situation is commonly encountered in meteorological data files, MetPy's current implementation may be too stringent. I would suggest opening an issue on MetPy's issue tracker.
In regards to your bonus question, so long as you have terrain level data in the same vertical coordinate as your data, you can use the fill_between() method in matplotlib to fill in terrain under the plots.
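A possible workaround, sketched here under the assumption that your MetPy version provides the assign_y_x() accessor method (it computes y/x dimension coordinates from 2D latitude/longitude coordinates, where those are present in the dataset):
from metpy.interpolate import cross_section
# assumption: the dataset retains 2D latitude/longitude coordinates for the grid
data = ds.metpy.parse_cf().metpy.assign_y_x()
cross = cross_section(data, (40.3847, -120.5676), (39.2692, -122.3784))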
I have nearly the same problem.
ValueError: Data missing required coordinate information. Verify that your data have been parsed by MetPy with proper x and y dimension coordinates and added crs coordinate of the correct projection for each variable.
if I try this:
cross = cross_section(data, start, end)
the xarray looks like this:
<xarray.Dataset>
Dimensions: (bnds: 2, height: 61, height_2: 1, height_3: 60, height_4: 61, height_5: 1, lat: 101, lev: 1, lev_2: 1, lev_3: 1, lon: 121, time: 24)
Coordinates:
* height (height) float64 1.0 2.0 3.0 4.0 ... 58.0 59.0 60.0 61.0
* height_3 (height_3) float64 1.0 2.0 3.0 4.0 ... 57.0 58.0 59.0 60.0
* lev (lev) float64 0.0
* lev_2 (lev_2) float64 400.0
* lev_3 (lev_3) float64 800.0
* lon (lon) float64 -30.0 -29.5 -29.0 -28.5 ... 29.0 29.5 30.0
* lat (lat) float64 -10.0 -9.5 -9.0 -8.5 ... 38.5 39.0 39.5 40.0
crs object Projection: latitude_longitude
* height_2 (height_2) float64 10.0
* time (time) float64 2.017e+07 2.017e+07 ... 2.017e+07 2.017e+07
* height_4 (height_4) float64 1.0 2.0 3.0 4.0 ... 58.0 59.0 60.0 61.0
* height_5 (height_5) float64 2.0
Dimensions without coordinates: bnds
Data variables:
height_bnds (height, bnds) float64 ...
height_3_bnds (height_3, bnds) float64 ...
lev_bnds (lev, bnds) float64 ...
lev_2_bnds (lev_2, bnds) float64 ...
lev_3_bnds (lev_3, bnds) float64 ...
z_ifc (height, lat, lon) float32 ...
topography_c (lat, lon) float32 ...
fis (lat, lon) float32 ...
con_gust (time, height_2, lat, lon) float32 ...
gust10 (time, height_2, lat, lon) float32 ...
u (time, height_3, lat, lon) float32 ...
I mean, there is a lat/lon grid... Is there a workaround to use cross_section on a lat/lon grid?
Or can I rename the lat/lon coordinates to x and y?
Best
