How can I achieve faster access to this netcdf? - python-3.x

I am selecting spatial and temporal subsets from this kind of NetCDF, opened by
ds = xr.open_mfdataset(file_list):
<xarray.Dataset>
Dimensions: (lat: 576, lon: 1152, time: 1464)
Coordinates:
* lon (lon) float32 0.0 0.3125 0.625 0.9375 ... 359.0625 359.375 359.6875
* lat (lat) float32 89.761 89.4514 89.1399 ... -89.1399 -89.4514 -89.761
* time (time) datetime64[ns] 1980-04-01T01:00:00 ... 1980-06-01
Data variables:
uasmean (lat, lon, time) float32 dask.array<shape=(576, 1152, 1464), chunksize=(576, 1152, 720)>
vasmean (lat, lon, time) float32 dask.array<shape=(576, 1152, 1464), chunksize=(576, 1152, 720)>
Attributes:
Creator: NCAR - CISL RDA (dattore)
history: Mon Aug 11 12:24:36 2014: ncatted -a history,global,d,, -O Wind...
I managed to get the correct subset in time and lon/lat using:
ds = ds.where((ds.time >= np.datetime64(date_ini)) & (ds.time <= np.datetime64(date_end)), drop=True)
ds = ds.where((ds.lon >= lonlat[0]) & (ds.lon <= lonlat[1]) & (ds.lat >= lonlat[2]) & (ds.lat <= lonlat[3]), drop=True)
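(As an aside, since both selections are plain ranges, label-based slicing with .sel is usually faster than .where(..., drop=True), because it never builds a boolean mask over the full grid. A sketch with hypothetical bound names, noting that lat in this file is stored in descending order:)
# sketch: label-based range selection; lat_min/lat_max/lon_min/lon_max are
# hypothetical bounds, and lat is descending here, hence the reversed slice
ds = ds.sel(time=slice(np.datetime64(date_ini), np.datetime64(date_end)),
            lat=slice(lat_max, lat_min),  # descending lat: slice high -> low
            lon=slice(lon_min, lon_max))  # bounds must match the 0-360 lon convention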
And finally, to extract this information in my target format, I loop over time, convert each time step to a dataframe, and export it to CSV afterwards:
# for t in ds['time']:
t = ds['time'][0]
# Select time and convert to dataframe
df = ds.sel(time=t).to_dataframe()
My problem is that the conversion to a dataframe is slow. I know the original netCDF files are written to optimize the extraction of time series rather than maps, which is what I am trying to do. I also know it is possible to reorder the coordinates and write a new netCDF to speed this up, but the dataset is too big, so that is not an option. Do you know of any other way to speed up this extraction?
Thank you all in advance!!!
P.S.: I attached the complete script of this block of code that I am using to check the performance...
import time
from datetime import datetime, timedelta

import pandas as pd
import xarray as xr
import numpy as np

start_time = time.time()
lonlat = [-5, 10, 50, 64]
date_ini = datetime(1980, 4, 28)
date_end = datetime(1980, 5, 3)
print('[Processing 2D winds]')
# create date list to loop over folders
dates = pd.date_range(start=date_ini - timedelta(days=1), end=date_end + timedelta(days=1), freq='D')
# Create date list of files to open
file_list = []
for date in dates:
    file_list.append('Wind_CFS_Global_' + date.strftime('%Y.%m') + '.nc')
# Delete repeated elements
file_list = list(dict.fromkeys(file_list))
print(file_list)
# load data
ds = xr.open_mfdataset(file_list)
# Select variables of interest
ds = ds.get(['uasmean', 'vasmean'])
# Select temporal subset
ds = ds.where((ds.time >= np.datetime64(date_ini)) & (ds.time <= np.datetime64(date_end)), drop=True)
# from 0º,360º to -180º,180º
ds['lon'] = (ds.lon + 180) % 360 - 180
ds = ds.sortby(['lon', 'lat'])  # sortby takes a list; a second positional argument would be read as the ascending flag
ds = ds.where((ds.lon >= lonlat[0]) & (ds.lon <= lonlat[1]) & (ds.lat >= lonlat[2]) & (ds.lat <= lonlat[3]), drop=True)
print(ds)
currents_list = []
# for t in ds['time']:
t = ds['time'][0]
# Select a single time step and convert to a dataframe
df = ds.sel(time=t).to_dataframe()
# reset the index because lon/lat form a multi-index and I want them as columns
df = df.reset_index()
# sort data-rows for TESEO: longitude, latitude (ascending)
df = df.sort_values(['lon', 'lat'])
# generate full file path
outfile = 'winds_' + df['time'][0].strftime('%Y%m%dT%H%M') + '.txt'
# export to ascii: space-separated, no header or index column, NaN replaced by 0, three decimal places
df.to_csv(path_or_buf=outfile,
          sep=' ',
          columns=['lon', 'lat', 'uasmean', 'vasmean'],
          header=False,
          index=False,
          na_rep=0,
          float_format='%.3f')
elapsed_time = (time.time() - start_time)
print('Elapsed time: {} sec.'.format(elapsed_time))

I found a big improvement in performance by doing this:
convert the whole xarray dataset to a dataframe first
loop over time directly in the dataframe
That makes a really big difference! I was looping over time and converting each short time slice to a dataframe, which is really inefficient! A sketch of the faster pattern is below.
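For reference, here is a minimal sketch of that pattern, reusing the variable names from the script above (it assumes the subset ds fits in memory once converted):
import pandas as pd

# convert the whole subset to a single dataframe up front...
df_all = ds.to_dataframe().reset_index()
# ...then loop over the timestamps inside pandas instead of xarray
for t, df_t in df_all.groupby('time'):
    df_t = df_t.sort_values(['lon', 'lat'])
    outfile = 'winds_' + pd.Timestamp(t).strftime('%Y%m%dT%H%M') + '.txt'
    df_t.to_csv(path_or_buf=outfile, sep=' ',
                columns=['lon', 'lat', 'uasmean', 'vasmean'],
                header=False, index=False, na_rep=0, float_format='%.3f')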
Best regards!

Related

How to extract 4-dimensional data from a list of pandas dataframes?

I have a list of 500 dataframes (in the form of .csv files); 500 = 20 (time) x 25 (energy) bins. In other words, each dataframe is a measurement of flux at a single time and energy, represented as a 150x150 mesh grid corresponding to x and y spatial coordinates. However, I would like to transform these data into 4-d coordinates of the form Flux(x, y, t, E), such that I have a new set of dataframes with columns E and rows t for any given (x, y) position.
I am not sure how to approach this problem. I would appreciate your help in giving me some sort of roadmap for this procedure.
Note:
The time and energy of each dataframe are in the name of the corresponding .csv file, in the form time-5e+35-energy0.00023-position.csv, where t = -5×10^35 and E = 0.00023.
What I know:
The 500 dataframes of 20 t × 25 E must be converted into 22,500 dataframes, one for each of the 150x150 coordinates. However, this is very time consuming, and I am not sure whether there is another package in python3 that can do the job more easily.
Here is code that combines your files into one big Pandas dataframe with 11,250,000 rows, i.e. 25 × 20 × 150 × 150:
import pandas as pd
from glob import glob
import re
from datetime import datetime
pattern_file_name = re.compile(r'time-(.*)-energy(.*)-position.csv')
start_time = datetime.now()
result_df = None
for file_name in glob('time-*.csv'):
    # extract time and energy values from the file name
    if not pattern_file_name.match(file_name):
        raise ValueError(f'file name {file_name} failed pattern match.')
    time_s, energy_s = pattern_file_name.findall(file_name)[0]
    time, energy = float(time_s), float(energy_s)
    print(f'Processing | {time_s} | {energy_s} |...')
    df = pd.read_csv(file_name, header=None)
    # assuming the CSV (i) has no headers, (ii) is an array of 150x150
    # floats with no missing or problematic values, and (iii) each row
    # represents a fixed y-coordinate; adjust to your needs
    df.index.name = 'y'
    df = df.stack()
    df.index.rename('x', level=-1, inplace=True)
    df = df.swaplevel().sort_index().reset_index().rename(columns={0: 'flux'})
    # df is now (x, y, flux); x and y each vary from 0 to 149
    df.insert(0, 't', time)
    df.insert(0, 'E', energy)
    result_df = df if result_df is None else pd.concat([result_df, df])
result_df = result_df.set_index(['E', 't', 'x', 'y']).sort_index()
# result_df is now (E, t, x, y) -> flux
result_df.to_csv('output.csv', index=True)
final_time = datetime.now()
delta_time = final_time - start_time
print(f'Completed in {delta_time}')
The main steps are as follows:
Loop over file names
Extract t and E values from file name
Read square matrix of flux values from file
Transform 150 × 150 square matrix to Pandas dataframe of length 22,500
Add columns to keep track of E and t
Append the local result to a global, ever-growing result dataframe (a faster variant is sketched below)
Finally, leave the loop and save results to disk as CSV
The resulting CSV file will have 5 columns. The first four would represent (E,t,x,y) and the last column would be the value of the flux field at those co-ordinates.
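One performance note on the append step: concatenating into result_df inside the loop copies the accumulated frame on every iteration. A variant that is usually faster (a sketch, keeping the names above) collects the per-file frames in a list and concatenates once after the loop:
import pandas as pd
from glob import glob

frames = []
for file_name in glob('time-*.csv'):
    df = pd.read_csv(file_name, header=None)
    # ... same per-file reshaping as in the loop above ...
    frames.append(df)
result_df = pd.concat(frames)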

How to delete the outliers

I managed to apply the interquartile range rule correctly, but when I display the box-and-whisker plot of the dataset without outliers, I see that there are still outliers. What is wrong?
Here is the code:
# Load libraries
import pandas as pd
from matplotlib import pyplot as plt

# Load dataset
filename = "/home/fogang/dataset/Regression/Housing Boston/housing.csv"
df = pd.read_csv(filename, header=0)
df = df.drop('Unnamed: 0', axis=1)  # delete the column 'Unnamed: 0'
one_dim = pd.DataFrame()
one_dim['rm'] = df['rm']

# Shape of dataset
print(one_dim.shape)
# Peek at dataset
print(one_dim.head(10))
# Check whether there are NaN values
print(one_dim.isnull().sum())

# Box-and-whisker plot
one_dim.plot(kind='box', subplots=True, layout=(1, 1), sharex=False, sharey=False, fontsize=12)
plt.show()

# Describe dataset
print(one_dim.describe())

# Inter-quartile range
unidim = one_dim['rm']
unidim_Q1 = unidim.quantile(0.25)
unidim_Q3 = unidim.quantile(0.75)
unidim_IQR = unidim_Q3 - unidim_Q1
unidim_lower = unidim_Q1 - (1.5 * unidim_IQR)
unidim_upper = unidim_Q3 + (1.5 * unidim_IQR)

# Outliers
unidim_outliers = pd.DataFrame()
unidim_outliers['outliers'] = unidim[(unidim < unidim_lower) | (unidim > unidim_upper)]
unidim_outliers.info()

# Good data
unidim_good = pd.DataFrame()
unidim_good['good'] = unidim[(unidim >= unidim_lower) & (unidim <= unidim_upper)]
unidim_good.info()
unidim_good.plot(kind='box', subplots=True, layout=(1, 2), sharex=False, sharey=False, fontsize=12)
plt.show()
What should I do?
Your data has widely spread outliers in both tails, up and down. So when you cut out some of the outliers and check again, the trimmed data shows new outliers, because the quartiles and whiskers are recomputed on whatever data remains.
If you want to get rid of the outliers in a single cut, you can use a stricter rule, for example:
unidim_lower = unidim_Q1 - (1.3 * unidim_IQR)
unidim_upper = unidim_Q3 + (1.3 * unidim_IQR)
But I should warn you: not all 'outliers' are bad for the model. You should choose wisely what to treat as outliers and what is useful data anyway.
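If the goal is a box plot that flags no points at all, one option (a sketch, not part of the answer above) is to repeat the IQR cut until it converges, since the bounds are recomputed after every removal:
import pandas as pd

def drop_outliers_iqr(s: pd.Series, k: float = 1.5, max_iter: int = 10) -> pd.Series:
    # repeatedly drop IQR outliers until none remain
    for _ in range(max_iter):
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        trimmed = s[(s >= lower) & (s <= upper)]
        if len(trimmed) == len(s):  # nothing removed: converged
            return s
        s = trimmed
    return s

# usage with the variables above:
# unidim_good['good'] = drop_outliers_iqr(unidim)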

Extract area from high resolution netcdf file python

I am trying to extract an area from a netcdf file by longitude and latitude.
However the resolution is much higher than 1x1 degree.
How would you extract an area then, e.g. lon: 30-80 and lat: 30-40?
The file can be found here: https://drive.google.com/open?id=1zX-qYBdXT_GuktC81NoQz9xSxSzM-CTJ
Keys and shapes are as follows:
odict_keys(['crs', 'lat', 'lon', 'Band1'])
crs ()
lat (25827,)
lon (35178,)
Band1 (25827, 35178)
I have tried this, but with the high resolution, plain index slicing doesn't refer to the actual longitude/latitude values:
from netCDF4 import Dataset
import numpy as np
import matplotlib.pyplot as plt
file = path + '20180801-ESACCI-L3S_FIRE-BA-MODIS-AREA_3-fv5.1-JD.nc'
fh = Dataset(file)
longitude = fh.variables['lon'][:]
latitude = fh.variables['lat'][:]
band1 = fh.variables['Band1'][:30:80,30:40]
Since you have variables(dimensions): ..., int16 Band1(lat,lon), you can apply np.where to the variables lat and lon to find the appropriate indices, and then select the corresponding Band1 data as sel_band1:
import numpy as np
from netCDF4 import Dataset
file = '20180801-ESACCI-L3S_FIRE-BA-MODIS-AREA_3-fv5.1-JD.nc'
with Dataset(file) as nc_obj:
    lat = nc_obj.variables['lat'][:]
    lon = nc_obj.variables['lon'][:]
    sel_lat, sel_lon = [30, 40], [30, 80]
    sel_lat_idx = np.where((lat >= sel_lat[0]) & (lat <= sel_lat[1]))
    sel_lon_idx = np.where((lon >= sel_lon[0]) & (lon <= sel_lon[1]))
    sel_band1 = nc_obj.variables['Band1'][:][np.ix_(sel_lat_idx[0], sel_lon_idx[0])]
Note that np.where applied to lat and lon returns 1-D index arrays; use np.ix_ to apply them to the 2-D data in Band1.
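For comparison, the same selection can be written with xarray's label-based slicing (a sketch; it assumes lat is stored in ascending order, and if it is descending, flip the bounds to slice(40, 30)):
import xarray as xr

# select the lat/lon box by coordinate value rather than by index
ds = xr.open_dataset('20180801-ESACCI-L3S_FIRE-BA-MODIS-AREA_3-fv5.1-JD.nc')
sel_band1 = ds['Band1'].sel(lat=slice(30, 40), lon=slice(30, 80))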

How to color the line graph according to conditions in a plot?

I am trying to find a solution for a plot of my data.
I have a graph of a trajectory against time (x) and kilometers (y), and I need to mark with different colours where the availability parameter from the dataframe is 0 or 100.
I tried this, but I got a completely different result than I expected:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
# Read file, using ; as delimiter
filename = "H:\\run_linux\\river_km_calculations\\route2_8_07_23_07\\true_route2_8_07_23_07_test.csv"
df = pd.read_csv(filename, delimiter=';', parse_dates=['datetime']) #dtype={'lon_deg':'float', 'lat_deg':'float'})
df = df[189940:]
df.set_index('datetime', inplace=False)
plt.plot( df['datetime'], df['river_km'])
plt.show()
connection = 100
noconection = 0
def conditions(s):
    if (s['age_gps_data'] <= 1.5) or (s['age_gps_data'] >= 0.5):
        return 100
    else:
        return 0
df['availability'] = df.apply(conditions, axis=1)
internet = np.ma.masked_where(df.availability == connection, df.availability)
nointernet = np.ma.masked_where((df.availability == noconection) , df.availability)
fig, ax = plt.subplots()
ax.plot(df.river_km, internet, df.river_km, nointernet)
plt.show()
How can I mark on the plot with different colours where availability is 0, where it is 100, and where there is no value for this parameter?
What I want to achieve should look like this:
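Two things stand out. First, the condition (s['age_gps_data'] <= 1.5) or (s['age_gps_data'] >= 0.5) is true for every non-negative value, so availability is always 100; 'and' was probably intended. Second, for the colouring itself, here is a minimal sketch (assuming the df built above) that draws each availability state as its own line, with NaN gaps where the state does not apply:
import numpy as np
from matplotlib import pyplot as plt

avail = df['availability'].to_numpy(dtype=float)
t = df['datetime'].to_numpy()
km = df['river_km'].to_numpy()

fig, ax = plt.subplots()
# NaN values create gaps, so each line shows only its own condition
ax.plot(t, np.where(avail == 100, km, np.nan), color='tab:green', label='availability = 100')
ax.plot(t, np.where(avail == 0, km, np.nan), color='tab:red', label='availability = 0')
ax.plot(t, np.where(np.isnan(avail), km, np.nan), color='tab:gray', label='no value')
ax.legend()
plt.show()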

Convert a numpy dataset to netCDF

I have a numpy array in python with size (16,250,186) representing time, latitude and longitude.
I want to convert it to a netCDF file so that I can easily read the data with its coordinates in the future.
My numpy array looks like this
RZS = np.load("/home/chandra/Data/rootzone_CHIRPS_era5_2003-2015_daily-analysis_annual-result.npy")
RZS.shape
Output: (16, 250, 186)
As you can see my above numpy array represents annual values for 16 years.
chirps_precip =xarray.open_mfdataset("/home/chandra/Data/CHIRPS/chirps-v2.0.2000.days_p25.nc")
precip = chirps_precip.precip.sel(latitude = slice(-50,12.5), longitude = slice(-81.25,-34.75))
precip[0,:,:]
Output:
<xarray.DataArray 'precip' (latitude: 250, longitude: 186)>
dask.array<shape=(250, 186), dtype=float32, chunksize=(250, 186)>
Coordinates:
* latitude (latitude) float32 -49.875 -49.625 -49.375 ... 12.125 12.375
* longitude (longitude) float32 -81.125 -80.875 -80.625 ... -35.125 -34.875
time datetime64[ns] 2000-01-01
Attributes:
units: mm/day
standard_name: convective precipitation rate
long_name: Climate Hazards group InfraRed Precipitation with St...
time_step: day
geostatial_lat_min: -50.0
geostatial_lat_max: 50.0
geostatial_lon_min: -180.0
geostatial_lon_max: 180.0
These are the coordinates of the chirps_precip dataset that I want my numpy array RZS to have, with the years (2000, 2001, ..., 2015) as the time steps.
I have tried some methods like
# from xarray
array = xarray.DataArray(RZS, latitude = 'precip.latitude')
#from netCDF
Dataset.createVariable('rootzone storage cap', np.float32, ('time','lat','lon'))
But I am not able to get it working. I also tried copying attrs and coords, but that didn't work either.
It seems like I am doing this the wrong way. Can anyone suggest what I am missing?
I want my numpy array to have the same coordinates as the netCDF file, but with the time coordinate changed to years.
I would suggest something like the following, using the netCDF4 module, assuming you have your latitudes and longitudes in variables lat and lon and your data in dataout.
#!/usr/bin/env python
# ---------------------
import numpy as np
import datetime
from netCDF4 import Dataset, date2num
# -----------------------
nyears = 16
unout = 'days since 2000-01-01 00:00:00'
# -----------------------
ny, nx = (250, 186)
lon = np.linspace(9, 30, nx)
lat = np.linspace(50, 60, ny)
dataout = np.random.random((nyears, ny, nx))  # create some random data
datesout = [datetime.datetime(2000 + iyear, 1, 1) for iyear in range(nyears)]  # create date values
# =========================
ncout = Dataset('myfile.nc', 'w', format='NETCDF3_CLASSIC')  # format must be passed as a keyword
ncout.createDimension('lon', nx)
ncout.createDimension('lat', ny)
ncout.createDimension('time', nyears)
lonvar = ncout.createVariable('lon', 'float32', ('lon',))
lonvar[:] = lon
latvar = ncout.createVariable('lat', 'float32', ('lat',))
latvar[:] = lat
timevar = ncout.createVariable('time', 'float64', ('time',))
timevar.setncattr('units', unout)
timevar[:] = date2num(datesout, unout)
myvar = ncout.createVariable('myvar', 'float32', ('time', 'lat', 'lon'))
myvar.setncattr('units', 'mm')
myvar[:] = dataout
ncout.close()
Compared to xarray, you have to write more code, but it is still very easy to create the netCDF files using that module.
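Since the question started from xarray, here is also a hedged sketch of the same thing with a DataArray, reusing the CHIRPS coordinates from the question (it assumes precip is loaded as above and that year-start timestamps are acceptable for the yearly axis):
import pandas as pd
import xarray as xr

# wrap RZS in a DataArray that borrows the CHIRPS lat/lon coordinates
years = pd.date_range('2000-01-01', periods=16, freq='YS')  # 2000..2015, one per year
da = xr.DataArray(RZS,
                  dims=('time', 'latitude', 'longitude'),
                  coords={'time': years,
                          'latitude': precip.latitude,
                          'longitude': precip.longitude},
                  name='rootzone_storage_cap')
da.to_netcdf('rootzone_storage.nc')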
