Convert a numpy dataset to netCDF - python-3.x

I have a numpy array in Python with shape (16, 250, 186), representing time, latitude and longitude.
I want to convert it to a netCDF file so that I can easily read the data with coordinates in the future.
My numpy array looks like this:
RZS = np.load("/home/chandra/Data/rootzone_CHIRPS_era5_2003-2015_daily-analysis_annual-result.npy")
RZS.shape
Output: (16, 250, 186)
As you can see, my numpy array above represents annual values for 16 years.
chirps_precip = xarray.open_mfdataset("/home/chandra/Data/CHIRPS/chirps-v2.0.2000.days_p25.nc")
precip = chirps_precip.precip.sel(latitude = slice(-50,12.5), longitude = slice(-81.25,-34.75))
precip[0,:,:]
Output:
<xarray.DataArray 'precip' (latitude: 250, longitude: 186)>
dask.array<shape=(250, 186), dtype=float32, chunksize=(250, 186)>
Coordinates:
* latitude (latitude) float32 -49.875 -49.625 -49.375 ... 12.125 12.375
* longitude (longitude) float32 -81.125 -80.875 -80.625 ... -35.125 -34.875
time datetime64[ns] 2000-01-01
Attributes:
units: mm/day
standard_name: convective precipitation rate
long_name: Climate Hazards group InfraRed Precipitation with St...
time_step: day
geostatial_lat_min: -50.0
geostatial_lat_max: 50.0
geostatial_lon_min: -180.0
geostatial_lon_max: 180.0
These are the coordinates of the chirps_precip dataset that I want my numpy array RZS to have, with years (2000, 2001, ..., 2015) as the time steps.
I have tried some methods, like:
# from xarray
array = xarray.DataArray(RZS, latitude = 'precip.latitude')
#from netCDF
Dataset.createVariable('rootzone storage cap', np.float32, ('time','lat','lon'))
But I have not been able to get anywhere with these. I also tried to copy the attrs and coords, but that didn't work either.
It seems like I am doing this the wrong way. Can anyone suggest what I am missing?
I want my numpy array to have the same coordinates as the netCDF file, but with the time coordinate changed to years.

I would suggest using the netCDF4 module, with code like the one below. It assumes your latitude and longitude values are in variables lat and lon, and the data to write is in dataout.
#!/usr/bin/env python
# ---------------------
import datetime
import numpy as np
from netCDF4 import Dataset, date2num
# -----------------------
nyears = 16
unout = 'days since 2000-01-01 00:00:00'
# -----------------------
ny, nx = (250, 186)
lon = np.linspace(9, 30, nx)
lat = np.linspace(50, 60, ny)
dataout = np.random.random((nyears, ny, nx))  # create some random data
datesout = [datetime.datetime(2000 + iyear, 1, 1) for iyear in range(nyears)]  # one date per year
# =========================
# note: the format must be passed as a keyword argument; the third
# positional argument of Dataset is clobber, not the file format
ncout = Dataset('myfile.nc', 'w', format='NETCDF3_CLASSIC')  # netCDF-3 output format
ncout.createDimension('lon', nx)
ncout.createDimension('lat', ny)
ncout.createDimension('time', nyears)
lonvar = ncout.createVariable('lon', 'float32', ('lon',))
lonvar[:] = lon
latvar = ncout.createVariable('lat', 'float32', ('lat',))
latvar[:] = lat
timevar = ncout.createVariable('time', 'float64', ('time',))
timevar.setncattr('units', unout)
timevar[:] = date2num(datesout, unout)
myvar = ncout.createVariable('myvar', 'float32', ('time', 'lat', 'lon'))
myvar.setncattr('units', 'mm')
myvar[:] = dataout
ncout.close()
Compared to xarray, you have to write more code, but it is still very easy to create the netCDF files using that module.
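Since the answer mentions xarray: for completeness, here is a minimal xarray sketch of the same task (my addition, not part of the original answer). It assumes RZS and precip are defined as in the question; the variable name is made up.
import numpy as np
import xarray as xr

years = np.arange(2000, 2016)  # 16 annual values, as in the question
da = xr.DataArray(
    RZS,  # the (16, 250, 186) numpy array from the question
    dims=('time', 'latitude', 'longitude'),
    coords={'time': years,
            'latitude': precip.latitude,    # reuse the CHIRPS coordinates
            'longitude': precip.longitude},
    name='rootzone_storage_capacity',  # hypothetical name
)
da.to_netcdf('rootzone.nc')  # write straight to a netCDF file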

Related

Python scipy interpolation meshgrid data

I want to interpolate experimental data to make it look higher resolution, but apparently it does not work. I followed the example in this link for mgrid data. The CSV data can be found below.
Csv data
My code
import pandas as pd
import numpy as np
import scipy.interpolate
import matplotlib.pyplot as plt

x = np.linspace(0, 2.8, 15)
y = np.array([2.1, 2, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2, 1.1, 0.9, 0.7, 0.5, 0.3, 0.13])
[X, Y] = np.meshgrid(x, y)
Vx_df = pd.read_csv("Vx.csv", header=None)
Vx = Vx_df.to_numpy()
tck = scipy.interpolate.bisplrep(X, Y, Vx)
plt.pcolor(X, Y, Vx, shading='nearest')
plt.show()
xi = np.linspace(0.1, 2.5, 30)
yi = np.linspace(0.15, 2.0, 50)
[X1, Y1] = np.meshgrid(xi, yi)
VxNew = scipy.interpolate.bisplev(X1[:, 0], Y1[0, :], tck, dx=1, dy=1)
plt.pcolor(X1, Y1, VxNew, shading='nearest')
plt.show()
CSV DATA:
0.73,,,-0.08,-0.19,-0.06,0.02,0.27,0.35,0.47,0.64,0.77,0.86,0.90,0.93
0.84,,,0.13,0.03,0.12,0.23,0.32,0.52,0.61,0.72,0.83,0.91,0.96,0.95
1.01,1.47,,0.46,0.46,0.48,0.51,0.65,0.74,0.80,0.89,0.99,0.99,1.07,1.06
1.17,1.39,1.51,1.19,1.02,0.96,0.95,1.01,1.01,1.05,1.06,1.05,1.11,1.13,1.19
1.22,1.36,1.42,1.44,1.36,1.23,1.24,1.17,1.18,1.14,1.14,1.09,1.08,1.14,1.19
1.21,1.30,1.35,1.37,1.43,1.36,1.33,1.23,1.14,1.11,1.05,0.98,1.01,1.09,1.15
1.14,1.17,1.22,1.25,1.23,1.16,1.23,1.00,1.00,0.93,0.93,0.80,0.82,1.05,1.09
,0.89,0.95,0.98,1.03,0.97,0.94,0.84,0.77,0.68,0.66,0.61,0.48,,
,0.06,0.25,0.42,0.55,0.55,0.61,0.49,0.46,0.56,0.51,0.40,0.28,,
,0.01,0.05,0.13,0.23,0.32,0.33,0.37,0.29,0.30,0.32,0.27,0.25,,
,-0.02,0.01,0.07,0.15,0.21,0.23,0.22,0.20,0.19,0.17,0.20,0.21,0.13,
,-0.07,-0.05,-0.02,0.06,0.07,0.07,0.16,0.11,0.08,0.12,0.08,0.13,0.16,
,-0.13,-0.14,-0.09,-0.07,0.01,-0.03,0.06,0.02,-0.01,0.00,0.01,0.02,0.04,
,-0.16,-0.23,-0.21,-0.16,-0.10,-0.08,-0.05,-0.11,-0.14,-0.17,-0.16,-0.11,-0.05,
,-0.14,-0.25,-0.29,-0.32,-0.31,-0.33,-0.31,-0.34,-0.36,-0.35,-0.31,-0.26,-0.14,
,-0.02,-0.07,-0.24,-0.36,-0.39,-0.45,-0.45,-0.52,-0.48,-0.41,-0.43,-0.37,-0.22,
The image of the low-resolution data (without interpolation) is Low resolution and the image I get after interpolation is High resolution.
Can you please give me some advice? Why does it not interpolate properly?
OK, so to interpolate we need to set up an input and an output grid, and possibly remove values from the grid that are missing. We do that like so:
from io import StringIO

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import interpolate

# csv_string holds the CSV data from the question
array = pd.read_csv(StringIO(csv_string), header=None).to_numpy()

def interp(array, scale=1, method='cubic'):
    x = np.arange(array.shape[1] * scale)[::scale]
    y = np.arange(array.shape[0] * scale)[::scale]
    x_in_grid, y_in_grid = np.meshgrid(x, y)
    x_out, y_out = np.meshgrid(np.arange(max(x) + 1), np.arange(max(y) + 1))
    array = np.ma.masked_invalid(array)
    x_in = x_in_grid[~array.mask]
    y_in = y_in_grid[~array.mask]
    return interpolate.griddata((x_in, y_in), array[~array.mask].reshape(-1),
                                (x_out, y_out), method=method)
Now we need to call this function 3 times. First we fill the missing values in the middle with spline interpolation. Then we fill the boundary values with nearest neighbor interpolation. And finally we size it up by interpreting the pixels as being a few pixels apart and filling in gaps with spline interpolation.
array = interp(array)
array = interp(array, method='nearest')
array = interp(array, 50)
plt.imshow(array)
And we get the following result

Extract area from high resolution netcdf file python

I am trying to extract an area from a netcdf file by longitude and latitude.
However, the resolution is much higher than 1x1 degree.
How would you extract an area then, e.g. lon: 30-80 and lat: 30-40?
The file can be found here: https://drive.google.com/open?id=1zX-qYBdXT_GuktC81NoQz9xSxSzM-CTJ
Keys and shapes are as follows:
odict_keys(['crs', 'lat', 'lon', 'Band1'])
crs ()
lat (25827,)
lon (35178,)
Band1 (25827, 35178)
I have tried this, but with the high resolution it doesn't refer to the actual longitude/latitude.
from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt
file = path + '20180801-ESACCI-L3S_FIRE-BA-MODIS-AREA_3-fv5.1-JD.nc'
fh = Dataset(file)
longitude = fh.variables['lon'][:]
latitude = fh.variables['lat'][:]
band1 = fh.variables['Band1'][:30:80,30:40]
Since you have variables(dimensions): ..., int16 Band1(lat,lon), you can apply np.where to the variables lat and lon to find the appropriate indices and then select the corresponding Band1 data as sel_band1:
import numpy as np
from netCDF4 import Dataset
file = '20180801-ESACCI-L3S_FIRE-BA-MODIS-AREA_3-fv5.1-JD.nc'
with Dataset(file) as nc_obj:
    lat = nc_obj.variables['lat'][:]
    lon = nc_obj.variables['lon'][:]
    sel_lat, sel_lon = [30, 40], [30, 80]
    sel_lat_idx = np.where((lat >= sel_lat[0]) & (lat <= sel_lat[1]))
    sel_lon_idx = np.where((lon >= sel_lon[0]) & (lon <= sel_lon[1]))
    sel_band1 = nc_obj.variables['Band1'][:][np.ix_(sel_lat_idx[0], sel_lon_idx[0])]
Note that np.where applied to lat and lon returns 1D index arrays. Use np.ix_ to apply them to the 2D data in Band1. See here for more info.
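For comparison (my addition, not part of the original answer), the same subset can be taken with xarray, which does the coordinate lookup for you; this assumes lat and lon are stored in ascending order:
import xarray as xr

ds = xr.open_dataset('20180801-ESACCI-L3S_FIRE-BA-MODIS-AREA_3-fv5.1-JD.nc')
# label-based slicing on coordinate values rather than array indices
sel_band1 = ds['Band1'].sel(lat=slice(30, 40), lon=slice(30, 80))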

How can I achieve faster access to this netcdf?

I am selecting spatial and temporal subsets from this kind of NetCDF, opened by
ds = xr.open_mfdataset(file_list):
<xarray.Dataset>
Dimensions: (lat: 576, lon: 1152, time: 1464)
Coordinates:
* lon (lon) float32 0.0 0.3125 0.625 0.9375 ... 359.0625 359.375 359.6875
* lat (lat) float32 89.761 89.4514 89.1399 ... -89.1399 -89.4514 -89.761
* time (time) datetime64[ns] 1980-04-01T01:00:00 ... 1980-06-01
Data variables:
uasmean (lat, lon, time) float32 dask.array<shape=(576, 1152, 1464), chunksize=(576, 1152, 720)>
vasmean (lat, lon, time) float32 dask.array<shape=(576, 1152, 1464), chunksize=(576, 1152, 720)>
Attributes:
Creator: NCAR - CISL RDA (dattore)
history: Mon Aug 11 12:24:36 2014: ncatted -a history,global,d,, -O Wind...
I managed to get the correct subset in time and lon/lat using:
ds = ds.where((ds.time >= np.datetime64(date_ini)) & (ds.time <= np.datetime64(date_end)), drop=True)
ds = ds.where((ds.lon >= lonlat[0]) & (ds.lon <= lonlat[1]) & (ds.lat >= lonlat[2]) & (ds.lat <= lonlat[3]), drop=True)
And finally, to extract this information in my target format, I loop over time, converting each step to a dataframe that I then export to CSV:
# for t in ds['time']:
t = ds['time'][0]
# Select time and convert to dataframe
df = ds.sel(time=t).to_dataframe()
My problem is that the conversion to a dataframe is slow, and I know the original netCDF files are written to optimize the extraction of time series rather than maps, which is what I am trying to extract. I know it is possible to reorder the coordinates and write a new netCDF to speed this up, but the dataset is too big... so that is not an option. Do you know any other way to speed up this extraction?
Thank you all in advance!
P.S.: I attached the complete script of this block of code I am using to check the performance:
import os
import random
import shutil
from datetime import datetime, timedelta
from glob import glob
import pandas as pd
import xarray as xr
import numpy as np
import scipy.io
import matplotlib.pyplot as plt
import time
start_time = time.time()
files = glob('*.nc')
lonlat = [-5, 10, 50, 64]
date_ini = datetime(1980, 4, 28)
date_end = datetime(1980, 5, 3)
ds = xr.open_mfdataset(files)
print('[Processing 2D winds]')
# create date list to loop over folders
dates = pd.date_range(start=date_ini - timedelta(days=1), end=date_end + timedelta(days=1), freq='D')
# Create date list of files to open
file_list = []
for date in dates:
    file_list.append('Wind_CFS_Global_' + date.strftime('%Y.%m') + '.nc')
# Delete repeated elements
file_list = list(dict.fromkeys(file_list))
print(file_list)
# load data
ds = xr.open_mfdataset(file_list)
# Select temporal subset
ds = ds.get(['uasmean','vasmean'])
ds = ds.where((ds.time >= np.datetime64(date_ini)) & (ds.time <= np.datetime64(date_end)), drop=True)
# from 0º,360º to -180º,180º
ds['lon'] = (ds.lon + 180) % 360 - 180
ds = ds.sortby(['lon', 'lat'])
ds = ds.where((ds.lon >= lonlat[0]) & (ds.lon <= lonlat[1]) & (ds.lat >= lonlat[2]) & (ds.lat <= lonlat[3]), drop=True)
print(ds)
currents_list = []
# for t in ds['time']:
t = ds['time'][0]
# Select time and convert to dataframe
df = ds.sel(time=t).to_dataframe()
# reset index because longitude/latitude are a multi-index and I want them as columns
df = df.reset_index()
# sort data rows for TESEO: longitude, latitude (ascending)
df = df.sort_values(['lon', 'lat'])
# generate full file path
outfile = 'winds_' + df['time'][0].strftime('%Y%m%dT%H%M') + '.txt'
# export to ASCII: space-separated, no header or index column, NaN replaced by 0, 3 decimal places
df.to_csv(path_or_buf=outfile,
          sep=' ',
          columns=['lon', 'lat', 'uasmean', 'vasmean'],
          header=False,
          index=False,
          na_rep=0,
          float_format='%.3f')
elapsed_time = (time.time() - start_time)
print('Elapsed time: {} sec.'.format(elapsed_time))
I found a big improvement in performance by doing this:
convert the whole xarray dataset to a dataframe
loop over time directly in the dataframe
That makes a really big difference! I was looping over time and converting each shorter dataset to a dataframe, which is really inefficient! A sketch of the reordered loop is below.
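A minimal sketch of that reordering (my wording and names, reusing ds and the pandas import from the script above):
# convert the whole spatial/temporal subset to a dataframe once...
df_all = ds.to_dataframe().reset_index()
# ...then loop over time inside the dataframe instead of calling
# ds.sel(time=t).to_dataframe() once per time step
for t, df in df_all.groupby('time'):
    df = df.sort_values(['lon', 'lat'])
    outfile = 'winds_' + pd.Timestamp(t).strftime('%Y%m%dT%H%M') + '.txt'
    df.to_csv(path_or_buf=outfile, sep=' ',
              columns=['lon', 'lat', 'uasmean', 'vasmean'],
              header=False, index=False, na_rep=0, float_format='%.3f')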
Best regards!

Using xarray to change coordinate system in order to Slice operation

I am new here.
First of all, I am very thankful for your time and consideration.
I have 2 questions regarding managing 2 different netCDF files in Python.
I searched a lot, but unfortunately I couldn't find a solution.
1- I have a netcdf file which has coordinates like below:
time datetime64[ns] 2016-08-16T22:00:00
* y (y) int32 220000 ... 620000
* x (x) int32 20000 ... 720000
lat (y, x) float64 dask.array<shape=(401, 701),
lon (y, x) float64 dask.array<shape=(401, 701),
I need to change the coords to lon/lat so that I can slice an area based on specific lon/lat coordinates (using xarray). But I don't know how to change x and y to lon/lat.
Here is my code:
import xarray as xr
import matplotlib.pyplot as plt
p = "R_201608.nc"
ds = xr.open_mfdataset(p)
q=ds.RR.sel(time='2016-08-16T21:00:00')
2- Similar to 1, I have another netcdf file which has coordinates like below:
* X (X) float32 557600.0 .. 579400.0
* Y (Y) float32 5190600 ... 5205400.0
* time (time) datetime64[ns] 2007-01I
How can I convert X and Y to the lon/lat system so that I can plot the data in lon/lat coordinates?
Edit related to @Ryan:
1- Yes, this file shows rainfall over a large area. I want to cut it down to a smaller area (similar to the area of the file in question 2) and compare them using bias, RMSE, etc. Here is the full information related to this file:
<xarray.Dataset>
Dimensions: (time: 2976, x: 701, y: 401)
Coordinates:
* time (time) datetime64[ns] 2016-08-31T23:45:00
* y (y) int32 220000 221000 ... 619000 620000
* x (x) int32 20000 21000 ... 719000 720000
lat (y, x) float64 dask.array<shape=(401, 701),chunksize=(401, 701)>
lon (y, x) float64 dask.array<shape=(401, 701), chunksize=(401, 701)>
Data variables:
RR (time, y, x) float32 dask.array<shape=(2976, 401, 701), chunksize=(2976, 401, 701)>
lambert_conformal_conic int32 ...
Conventions: CF-1.5
Edit related to @Ryan:
2- And here is the full information about the second file (smaller area):
<xarray.DataArray 'Precip' (time: 8928, Y: 75, X: 110)>
dask.array<shape=(8928, 75, 110), dtype=float32, chunksize=(288, 75, 110)>
Coordinates:
sensor_height_precip float32 1.5
sensor_height_P float32 1.5
* X (X) float32 557600.0 557800.0 ... 579200.0 579400.0
* Y (Y) float32 5190600.0 5190800.0 ... 5205400.0
* time (time) datetime64[ns] 2007-01-31T23:55:00
Attributes:
grid_mapping: UTM33N
ancillary_variables: QFlag_Precip QGrid_Precip
long_name: Precipitation Amount
standard_name: precipitation_amount
cell_methods: time:sum
units: mm
In problem 1), it is not possible to convert lon and lat to dimension coordinates, because they are two-dimensional (both have dimension x, y). Dimension coordinates, used for slicing, can only be one-dimensional. If you can be more specific about what you want to do after slicing, we can provide more suggestions about how to proceed. Do you want to select a particular latitude / longitude range and then calculate some statistics (e.g. mean / variance)?
In problem 2) it looks like you have a map projection. Without more information about the projection, it is impossible to convert to lat / lon coordinates or plot on a map. Is there more information contained in your dataset about the map projection used? Can you post the full output of print(ds)?
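To make the mask-based selection concrete, here is a minimal sketch (my addition, not part of the original exchange; the lon/lat bounds are made up). With two-dimensional coordinates you cannot slice with sel, but you can build a boolean mask and drop everything outside it:
# mask built from the 2D lat/lon coordinates (both have dims y, x)
box = ((ds.lon >= 10) & (ds.lon <= 15) &
       (ds.lat >= 46) & (ds.lat <= 49))
subset = ds.RR.where(box, drop=True)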
I have solved my problem with your help. Thanks a lot.
I could change the coords of both datasets to lon/lat using pyproj, as @Bart mentioned. Creating a meshgrid from the original, projected coordinates was the key point.
from pyproj import Proj
import numpy as np

# nx, ny hold the 1D projected x/y coordinate values of the dataset
nxv, nyv = np.meshgrid(nx, ny)
unausp = Proj('+proj=lcc +lat_1=49 +lat_2=46 +lat_0=47.5 +lon_0=13.33333333333333 +x_0=400000 +y_0=400000 +ellps=bessel +towgs84=577.326,90.129,463.919,5.137,1.474,5.297,2.4232 +units=m +no_defs ')
# inverse projection: projected x/y in metres -> lon/lat in degrees;
# nlons and nlats are already 2D grids, so no further meshgrid is needed
nlons, nlats = unausp(nxv, nyv, inverse=True)
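To use these grids with xarray, e.g. for masking, they can be attached as non-dimension coordinates (a sketch of my own, not in the original post):
# attach the computed 2D lon/lat grids as coordinates on dims (y, x)
ds = ds.assign_coords(lon=(('y', 'x'), nlons),
                      lat=(('y', 'x'), nlats))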
Since I want to compare two rainfall data sets with different spatial resolution (different grid size), I have to upscale one of them by using xarray interpolation:
upnew_lon = np.linspace(w.X[0], w.X[-1], w.dims['X'] // 5)
upnew_lat = np.linspace(w.Y[0], w.Y[-1], w.dims['Y'] //5)
uppds = w.interp(Y=upnew_lat, X=upnew_lon)
As far as I know, this interpolation is linear. I compared the upscaled dataset with the original one: the mean rainfall decreases by about 0.03 mm/day after upscaling. I just want to know whether you think this upscaling method is reliable for sub-hourly rainfall.

How to create an accurate buffer of 5 miles around a coordinate in python?

I would like to create an accurate buffer of 5 miles around a coordinate, my current code is:
cpr_gdf['p_buffer']=cpr_gdf['coordinates'].buffer(5*(1/60))
The coordinates column was created with this code:
cpr_df['coordinates']=list(zip(cpr_df.sample_longitude_decimal,cpr_df.sample_latitude_decimal))
cpr_df['coordinates']=cpr_df['coordinates'].apply(Point)
cpr_gdf=gpd.GeoDataFrame(cpr_df,geometry='coordinates',crs={'init' :'epsg:4326'})
Thanks for any help!
You need to convert to an equal-area projection that is accurate for the region where your buffer will be (a good resource is https://epsg.io/).
For example, I'm making maps in Michigan, so I'm using EPSG:3174 (which I believe is in meters, correct me if wrong). Given you've already converted your dataframe to a GeoPandas dataframe, you can convert your current projection to 3174 and then create your buffer (converting miles to meters):
cpr_gdf = cpr_gdf.to_crs({'init': 'epsg:3174'})
buffer_length_in_meters = (5 * 1000) * 1.60934  # 5 miles in meters
cpr_gdf['geometry'] = cpr_gdf.geometry.buffer(buffer_length_in_meters)
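If you need lon/lat again afterwards, e.g. for plotting (my addition, not part of the original answer), reproject the buffered geometries back:
# buffering must happen in the projected CRS; convert back to WGS84 after
cpr_gdf = cpr_gdf.to_crs('epsg:4326')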
You can calculate a buffer over points without converting to any other CRS by using the function below. It works in meters, so if you want miles, just multiply the distance by 1609.34.
Here is an example:
from geographiclib.geodesic import Geodesic
import numpy as np
from shapely.geometry import Polygon
import pandas as pd
import geopandas as gpd
def geod_buffer(gdf, distance, resolution=16, geod=Geodesic.WGS84):
    """
    gdf - GeoDataFrame with geometry column
    distance - the radius of the buffer in meters
    resolution - the resolution of the buffer around each vertex
    geod - define an ellipsoid
    """
    buffer = list()
    for index, row in gdf.iterrows():
        lon1, lat1 = row['geometry'].x, row['geometry'].y
        buffer_ = list()
        for azi1 in np.arange(0, 360, 90 / resolution):
            properties = geod.Direct(lat1, lon1, azi1, distance)
            buffer_.append([properties['lon2'], properties['lat2']])
        buffer.append(Polygon(buffer_))
    return buffer
locations = pd.DataFrame([
    {'longitude': 54.604972,
     'latitude': 18.346815},
    {'longitude': 54.605917,
     'latitude': 18.347249}
])
locations_gpd = gpd.GeoDataFrame(locations,
                                 geometry=gpd.points_from_xy(locations.longitude, locations.latitude),
                                 crs='epsg:4326').drop(columns=['longitude', 'latitude'])
locations_gpd['geometry'] = geod_buffer(locations_gpd, 1000)
At the equator, one minute of latitude or longitude is ~ 1.84 km or 1.15 mi (ref).
So if you define your point as P = [y, x], you can create a buffer around it of, let's say, 4 minutes, which is approximately 5 miles: buffer = 4/60 ≈ 0.067 degrees. The bounding box is then easily obtained with
minlat = P[0] - buffer
maxlat = P[0] + buffer
minlon = P[1] - buffer
maxlon = P[1] + buffer
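As a small usage sketch (my addition): with shapely, the four numbers can be turned into a geometry for intersection tests:
from shapely.geometry import box

# shapely's box takes (minx, miny, maxx, maxy), i.e. lon before lat
bbox = box(minlon, minlat, maxlon, maxlat)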
