Fastest way to slice and download hundreds of NetCDF files from a THREDDS/OPeNDAP server

I am working with NASA NEX-GDDP-CMIP6 data. I currently have working code that individually opens and slices each file; however, it takes days to download one variable for all model outputs and scenarios. My goal is to have all temperature and precipitation data for all model outputs and scenarios, then apply climate indicators and build an ensemble with xclim.
import xarray as xr

url = 'https://ds.nccs.nasa.gov/thredds2/dodsC/AMES/NEX/GDDP-CMIP6/UKESM1-0-LL/ssp585/r1i1p1f2/tasmax/tasmax_day_UKESM1-0-LL_ssp585_r1i1p1f2_gn_2098.nc'
lat = 53
lon = 0

try:
    with xr.open_dataset(url) as ds:
        # interpolate to the point of interest and save the sliced result locally
        ds.interp(lat=lat, lon=lon).to_netcdf(url.split('/')[-1])
except Exception as e:
    print(e)
This code works but is very slow (days for one variable at a single location). Is there a better, faster way? I'd rather not download the whole files, as each one is 240 MB!
Update:
I have also tried the following to take advantage of dask parallel tasks. It is slightly faster, but still takes on the order of days to complete for a full variable:
import xarray as xr

def interp_one_url(path, lat, lon):
    # open one remote file and interpolate it to the point of interest
    with xr.open_dataset(path) as ds:
        ds = ds.interp(lat=lat, lon=lon)
        return ds

urls = ['https://ds.nccs.nasa.gov/thredds2/dodsC/AMES/NEX/GDDP-CMIP6/UKESM1-0-LL/ssp585/r1i1p1f2/tasmax/tasmax_day_UKESM1-0-LL_ssp585_r1i1p1f2_gn_2100.nc',
        'https://ds.nccs.nasa.gov/thredds2/dodsC/AMES/NEX/GDDP-CMIP6/UKESM1-0-LL/ssp585/r1i1p1f2/tasmax/tasmax_day_UKESM1-0-LL_ssp585_r1i1p1f2_gn_2099.nc']

lat = 53
lon = 0

paths = [url.split('/')[-1] for url in urls]
datasets = [interp_one_url(url, lat, lon) for url in urls]
xr.save_mfdataset(datasets, paths=paths)
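For comparison, here is a minimal sketch of how the per-URL interpolation could be dispatched concurrently with dask.delayed; the decorator-based helper, the threaded scheduler, and the .load() call are assumptions, not part of the original code:

# Hedged sketch: run each URL's interpolation as a dask.delayed task so the
# OPeNDAP requests can overlap instead of running one after another.
import dask
import xarray as xr

@dask.delayed
def interp_one_url_delayed(path, lat, lon):
    with xr.open_dataset(path) as ds:
        # interpolate to the point and pull the small result into memory
        return ds.interp(lat=lat, lon=lon).load()

tasks = [interp_one_url_delayed(url, lat, lon) for url in urls]
datasets = dask.compute(*tasks, scheduler="threads")
xr.save_mfdataset(list(datasets), paths=paths)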

One option is to download via the NCSS (NetCDF Subset Service) portal that NASA also provides, instead of OPeNDAP. The URL pattern is different, but it can be built iteratively in the same way.
For example:
import wget

lat = 53
lon = 0
URL = "https://ds.nccs.nasa.gov/thredds/ncss/AMES/NEX/GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/pr/pr_day_ACCESS-CM2_historical_r1i1p1f1_gn_2014.nc?var=pr&north={}&west={}&east={}&south={}&disableProjSubset=on&horizStride=1&time_start=2014-01-01T12%3A00%3A00Z&time_end=2014-12-31T12%3A00%3A00Z&timeStride=1&addLatLon=true"

# north, west, east, south bounding box
wget.download(URL.format(lat, lon, lon + 1, lat - 1))
This accomplishes the slicing and the download in one step. Once you have the URLs, you can fetch them with something like wget and run the downloads in parallel, which is much faster than selecting and saving one file at a time.
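For instance, a minimal sketch of running several NCSS requests concurrently with a thread pool; the contents of the URL list and the worker count are assumptions, with each entry following the same pattern as the URL template above:

# Hedged sketch: fetch several pre-built NCSS subset URLs in parallel.
from concurrent.futures import ThreadPoolExecutor
import wget

lat = 53
lon = 0
urls = [
    URL.format(lat, lon, lon + 1, lat - 1),
    # ... further years/models/scenarios built the same way
]

def fetch(url):
    # NCSS slices on the server, so only the small subset file is transferred
    return wget.download(url)

with ThreadPoolExecutor(max_workers=4) as pool:
    files = list(pool.map(fetch, urls))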

Related

How to lower RAM usage using xarray open_mfdataset and the quantile function

I am trying to load multiple years of daily data stored in nc files (one nc file per year). A single nc file has dimensions of 365 (days) × 720 (lat) × 1440 (lon). All the nc files are in the "data" folder.
import xarray as xr

ds = xr.open_mfdataset('data/*.nc',
                       chunks={'latitude': 10, 'longitude': 10})

# I need the following line (time: -1) in order to do quantile, or it throws a ValueError:
# ValueError: dimension time on 0th function argument to apply_ufunc with dask='parallelized'
# consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single
# dask array chunk along this dimension, i.e., ``.chunk(time: -1)``, or pass ``allow_rechunk=True``
# in ``dask_gufunc_kwargs`` but beware that this may significantly increase memory usage.
ds = ds.chunk({'time': -1})

# Perform the quantile "computation" (it looks more like a reference to the computation, as it's fast)
ds_qt = ds.quantile(0.975, dim="time")

# Verify the shape of the loaded ds
print(ds)
# This shows the expected "concatenation" of the nc files.

# Get a sample for a given location to test the algorithm
print(len(ds.sel(lon=35.86, lat=14.375, method='nearest')['variable'].values))
print(ds_qt.sel(lon=35.86, lat=14.375, method='nearest')['variable'].values)
The result is correct. My issue comes from memory usage. I thought that by using open_mfdataset, which relies on Dask under the hood, this would be solved. However, loading "just" 2 years of nc files uses around 8 GB of virtual RAM, and using 10 years of data uses my entire virtual RAM (around 32 GB).
Am I missing something that would let me take a given percentile across a dask array (I would need 30 nc files)? I apparently have to apply chunk({'time': -1}) to the dataset to be able to use the quantile function; is this what defeats the RAM savings?
This may help somebody in the future: here is the solution I am implementing, even though it is not optimized. I basically break the data into slices by latitude, compute the quantile on each slice, and paste the results back together to create the output file.
import xarray as xr

ds = xr.open_mfdataset('data/*.nc')

step = 10
min_lat = -90
max_lat = min_lat + step
output_ds = None

while max_lat <= 90:
    # take a latitude band, rechunk time into a single chunk, and compute its quantile
    cropped_ds = ds.sel(lat=slice(min_lat, max_lat))
    cropped_ds = cropped_ds.chunk({'time': -1})
    cropped_ds_quantile = cropped_ds.quantile(0.975, dim="time")

    if output_ds is None:
        output_ds = cropped_ds_quantile
    else:
        output_ds = xr.merge([output_ds, cropped_ds_quantile])

    min_lat += step
    max_lat += step

output_ds.to_netcdf('output.nc')
It's not great, but it limits RAM usage to manageable levels. I am still open to a cleaner/faster solution if it exists (likely).
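For what it's worth, one possible cleaner alternative is to let dask tile the domain instead of slicing it by hand: chunk spatially and keep the full time axis in one chunk per tile, so the quantile streams tile by tile. This is only a sketch, and it assumes the dimensions are named 'lat' and 'lon', as in the sel() calls above:

# Hedged sketch: spatial chunks plus a single time chunk per tile.
import xarray as xr

ds = xr.open_mfdataset('data/*.nc', chunks={'lat': 10, 'lon': 10})
ds = ds.chunk({'time': -1})              # one chunk along time, spatial chunks kept
ds_qt = ds.quantile(0.975, dim="time")   # still lazy at this point
ds_qt.to_netcdf('output.nc')             # computation happens here, tile by tile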

Expand netcdf to the whole globe with xarray

I have a dataset that only covers latitudes between -55.75 and 83.25. I would like to expand it so that it covers the whole globe (-89.75 to 89.75 in my case), with the new cells filled with an arbitrary NA value.
Ideally I would want to do this with xarray. I have looked at .pad(), .expand_dims() and .assign_coords(), but did not really get a handle on the workings of either of them.
If someone can provide an alternative solution with CDO, I would also be grateful for that.
You could do this with nctoolkit (https://nctoolkit.readthedocs.io/en/latest/), which uses CDO as a backend.
The example below shows how you could do it. It starts by cropping a global temperature dataset to latitudes between -50 and 50, then regrids it back to a global grid at whatever resolution you need. The regridding uses CDO, which will extrapolate at the edges, so you probably want to set everything outside the original dataset's extent to NA; my code does this by calling CDO's masklonlatbox.
import nctoolkit as nc
ds = nc.open_thredds("https://psl.noaa.gov/thredds/dodsC/Datasets/COBE2/sst.mon.ltm.1981-2010.nc")
ds.subset(time = 0)
ds.crop(lat = [-50, 50])
ds.to_latlon(lon = [-179.5, 179.5], lat = [-89.5, 89.5], res = 1)
ds.mask_box(lon = [-179.5, 179.5], lat = [-50, 50])
ds.plot()
# convert to xarray dataset
ds_xr = ds.to_xarray()
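Since the question asks about xarray specifically, here is a hedged xarray-only sketch using reindex, which fills the new latitudes with NaN by default; the 0.5° spacing, the coordinate name 'lat', and the file names are assumptions based on the target range in the question:

# Hedged sketch: extend the latitude axis to the full globe with reindex.
import numpy as np
import xarray as xr

ds = xr.open_dataset("input.nc")               # hypothetical input file
full_lat = np.linspace(-89.75, 89.75, 360)     # -89.75, -89.25, ..., 89.75
ds_global = ds.reindex(lat=full_lat)           # new latitudes are filled with NaN
ds_global.to_netcdf("output_global.nc")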

How to process the data returned from a function (Python 3.7)

Background:
My question should be relatively easy; however, I am not able to figure it out.
I have written a function regarding queueing theory and it will be used for ambulance service planning. For example, how many calls for service can I expect in a given time frame.
The function takes two parameters. The first is a starting value for the number of ambulances in my system, starting at 0 and ending at 100; this will show the probability of zero calls for service, one call for service, and so on up to 100 calls for service. The second parameter is an arrival rate, which is the past historical arrival rate in my system.
The function runs and prints out the result to my screen. I have checked the math and it appears to be correct.
This is Python 3.7 with the Anaconda distribution.
My question is this:
I would like to process this data even further, but I don't know how to capture it and do more math. For example, I would like to take this list and accumulate the probability values. With an arrival rate of five, there is a cumulative probability of 61.56% of at least five calls for service, etc.
A second example of how I would like to process this data is to format it as percentages and write it out to a text file.
A third example would be to process the cumulative probabilities and exclude any values higher than the 99% cumulative value (because these vanish into extremely small numbers).
A fourth example would be to create a bar chart showing the probability of n calls for service.
These are some of the things I want to do with the queueing-theory calculations, and there are a lot more; I am planning on writing a larger application, but I am stuck at this point. The function writes its output to my Python 3.7 console. How do I "capture" that output as an object and perform further processing on the data?
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import math
import csv

def probability_x(start_value=0, arrival_rate=0):
    probability_arrivals = []
    while start_value <= 100:
        probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
        print(probability_arrivals)
        start_value = start_value + 1
    return probability_arrivals

# probability_x(arrival_rate=5, x=5)
# The code above prints to the console, but my goal is to take the returned values and make other calculations.
# How do I 'capture' this data for further processing (for example, bar plots, cumulative frequency, etc.)?

# Failure: TypeError: writerows() argument must be iterable.
with open('ExpectedProbability.csv', 'w') as writeFile:
    writer = csv.writer(writeFile)
    for value in probability_x(arrival_rate=5):
        writer.writerows(value)
writeFile.close()

# Failure: Why does it return 2? Yes, there are two columns, but I was expecting a length of 101,
# because that is the end of my loop.
print(len(probability_x(arrival_rate=5)))
The problem is, when you write
probability_arrivals = [start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)]
you're overwriting the previous contents of probability_arrivals, and everything it held previously is lost.
Instead of using = to reassign probability_arrivals, you want to append another entry to the list:
probability_arrivals.append([start_value, math.pow(arrival_rate, start_value) * math.pow(math.e, -arrival_rate) / math.factorial(start_value)])
I'll also note that your while loop can be improved. You're basically just looping over start_value until it reaches a certain value; a for loop would be more appropriate here:
for s in range(start_value, 101):  # the end value is exclusive, so it's 101, not 100
    probability_arrivals = [s, math.pow(arrival_rate, s) * math.pow(math.e, -arrival_rate) / math.factorial(s)]
    print(probability_arrivals)
Now you don't need to manually worry about incrementing the counter.
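Building on that, here is a short sketch of the kind of post-processing the question asks for, once the function returns the accumulated list instead of printing it; the 99% cutoff and the CSV layout are assumptions taken from the examples in the question:

# Hedged sketch: return the accumulated list, then post-process it.
import csv
import math

def probability_x(arrival_rate, max_calls=100):
    # Poisson probabilities for 0..max_calls arrivals
    return [(k, arrival_rate ** k * math.exp(-arrival_rate) / math.factorial(k))
            for k in range(max_calls + 1)]

probs = probability_x(arrival_rate=5)

# cumulative probability of at most k calls for service
cumulative = []
running = 0.0
for k, p in probs:
    running += p
    cumulative.append((k, p, running))

# write percentages to a file, stopping once the cumulative value passes 99%
with open('ExpectedProbability.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['calls', 'probability_%', 'cumulative_%'])
    for k, p, c in cumulative:
        writer.writerow([k, f'{p:.4%}', f'{c:.4%}'])
        if c > 0.99:
            break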

xarray dataset selection method is very slow

I have 37 years of NetCDF files with a daily time step, and I am computing a function for each cell over all years (13,513 days). The computation is repeated for every cell. For this I am using xarray with the da.sel approach, but it is very slow and does not make use of the multiple cores of my laptop. I am struggling to figure out how to use dask in this scenario. Any suggestions to improve or speed up the code?
import numpy as np
import xarray as xr

# df holds the point coordinates (columns X and Y); function() is defined elsewhere
for c in range(len(df)):
    arr = np.array([])
    lon = df.X[c]
    lat = df.Y[c]
    for yr in range(1979, 2016, 1):
        ds = xr.open_dataset('D:/pr_' + str(yr) + '.nc')
        da = ds.var.sel(lon=lon, lat=lat, method='nearest')
        arr = np.concatenate([arr, da])
    fun = function(arr)
It seems like you're looking for xarray.open_mfdataset:
ds = xr.open_mfdataset('D:/pr_*.nc')
Your code is particularly slow because you repeatedly call np.concatenate: every time you call this function, you have to copy all of the data you have loaded so far, which makes the cost quadratic.
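A hedged sketch of the loop rewritten around open_mfdataset; the variable name 'var', the df columns, and function() are taken from the question, while the chunk size is an assumption:

# Hedged sketch: open all years at once and pull each point series without
# repeated concatenation.
import xarray as xr

ds = xr.open_mfdataset('D:/pr_*.nc', chunks={'time': 365})

for c in range(len(df)):
    # nearest grid cell for this point over the full 37-year record
    series = ds['var'].sel(lon=df.X[c], lat=df.Y[c], method='nearest').values
    fun = function(series)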

Transform a third image w.r.t. the outcome of a registration

The Execute method of a registration in SimpleElastix returns the registered (transformed) version of the moving image, but doesn't allow one to transform another image in the same way. I have two CT images and want to register them based on the bones, so I soft-thresholded the input images using a logistic sigmoid between approximately 600 and 1500 Hounsfield units, such that the contrast is focused on the bones. For simplicity you can assume the threshold puts everything below 600 to 0, scales everything linearly from 0 to 1 in between, and puts everything above 1500 to 1.
The registration, using SimpleElastix:
fixed = sitk.GetImageFromArray(threshold(...))
moving = sitk.GetImageFromArray(threshold(...))
elastixImageFilter = sitk.ElastixImageFilter()
elastixImageFilter.SetFixedImage(fixed)
elastixImageFilter.SetMovingImage(moving)
parameterMapVector = sitk.VectorOfParameterMap()
# ...
elastixImageFilter.SetParameterMap(parameterMapVector)
registered = elastixImageFilter.Execute()
However, I want to operate on the original images, without the soft threshold, afterwards.
Is there a way to apply the transformation the registration found to the original image as well? Either by getting the transformation itself, or by passing the non-thresholded moving image along 'passenger side', so that it is transformed in the same way but not used in the optimization cost function.
I think you can do it using a transformixImageFilter after the elastixImageFilter like this:
elastixImageFilter = sitk.ElastixImageFilter()
elastixImageFilter.SetFixedImage(fixed)
elastixImageFilter.SetMovingImage(moving)
parameterMapVector = sitk.VectorOfParameterMap()
# ...
elastixImageFilter.SetParameterMap(parameterMapVector)
elastixImageFilter.Execute()
transformParameterMap = elastixImageFilter.GetTransformParameterMap()
transformix = sitk.TransformixImageFilter()
transformix.SetTransformParameterMap(transformParameterMap)
transformix.SetMovingImage(sitk.GetImageFromArray(thirdImage))
transformix.Execute()
transformedThirdImg = sitk.GetArrayFromImage(transformix.GetResultImage())
