How to bin NetCDF data using xarray - python-3.x

I have some spatiotemporal data derived from the CHIRPS database. It is a NetCDF file that contains daily precipitation for the whole world at a spatial resolution of 1x1 km². The Dataset has 3 dimensions ('time', 'longitude', 'latitude').
I would like to bin this precipitation data according to the temporal distribution at each pixel ('latitude' & 'longitude'). In other words, the dimension along which I want to apply the binning is 'time'.
A similar question has already been discussed on StackOverflow (see here). The difference between that issue and mine is that, in my case, I need to bin the data according to each specific pixel's temporal distribution, instead of applying a single set of bin edges to all my coordinates (pixels). As a consequence, I expect to get different binning thresholds ('n' sets of thresholds), one for each of the 'n' pixels in my dataset.
As far as I understand, the simplest and fastest way to apply a function over each of the coordinates (pixels) of an xarray DataArray/Dataset is to use xarray.apply_ufunc.
For the binning, I am using the pandas qcut function, which only requires an array of values and a set of quantiles (e.g. [0.1%, 0.5%, 25%, 99%]) in order to work.
Since the pandas binning function takes an array of data and returns another array of binned data, I understand that I have to pass the argument vectorize=True to apply_ufunc (as described here).
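To make the qcut behaviour concrete, here is a minimal, hypothetical sketch for a single pixel's time series (the 'rain' array and the quantile edges below are made up for illustration): qcut bins the values by their own quantiles, and .codes gives the integer bin of each time step.
import numpy as np
import pandas as pd

# hypothetical daily precipitation series for one pixel
rain = np.random.gamma(shape=2.0, scale=5.0, size=365)

# bin each day by this pixel's own quantiles; .codes holds the bin index per day
binned = pd.qcut(rain, q=np.linspace(0, 1, 10))
print(binned.codes[:10])   # integers in 0..8, one per day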
Finally, when I run the analysis, the resulting xarray Dataset ends up losing the 'time' dimension after processing. I am also unsure whether the processing actually returned a Dataset with the data properly classified.
Here is a reproducible code snippet. Notice that the 'time' dimension of "ds_binned" is lost, so I have to insert the binned data back into the original xarray Dataset (ds) afterwards. Also notice that the dimensions are not in the proper order, which is also causing problems for my analysis.
import pandas as pd
pd.set_option('display.width', 50000)
pd.set_option('display.max_rows', 50000)
pd.set_option('display.max_columns', 5000)

import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar

ds = xr.tutorial.open_dataset('rasm').load()

def parse_datetime(time):
    return pd.to_datetime([str(x) for x in time])

ds.coords['time'] = parse_datetime(ds.coords['time'].values)

def binning_function(x, distribution_type='Positive', b=False):
    y = np.where(np.abs(x) == np.inf, 0, x)
    y = np.where(np.isnan(y), 0, y)
    if np.all(y) == 0:
        return x
    else:
        Classified = pd.qcut(y, np.linspace(0.01, 1, 10))
        return Classified.codes

def xarray_parse_extremes(ds, dim=['time'], dask='allowed',
                          new_dim_name=['classes'],
                          kwargs={'b': False, 'distribution_type': 'Positive'}):
    filtered = xr.apply_ufunc(binning_function,
                              ds,
                              dask=dask,
                              vectorize=True,
                              input_core_dims=[dim],
                              # exclude_dims=[dim],
                              output_core_dims=[new_dim_name],
                              kwargs=kwargs,
                              output_dtypes=[float],
                              join='outer',
                              dataset_fill_value=np.nan,
                              ).compute()
    return filtered

with ProgressBar():
    da_binned = xarray_parse_extremes(ds['Tair'],
                                      ['time'],
                                      dask='allowed')

da_binned.name = 'classes'
ds_binned = da_binned.to_dataset()
ds['classes'] = (('y', 'x', 'time'), ds_binned['classes'].values)
mask = (ds['classes'] >= 5) & (ds['classes'] != 0)
ds.where(mask, drop=True).resample({'time': 'Y'}).count('time')['Tair'].isel({'time': -1}).plot()

print(ds)

(ds.where(mask, drop=True).resample({'time': 'Y'}).count('time')['Tair']
 .to_dataframe().dropna().sort_values('Tair', ascending=False)
)

delayed_to_netcdf = ds.to_netcdf(r'F:\Philipe\temp\teste_tutorial.nc',
                                 engine='netcdf4',
                                 compute=False)

print('saving data classified')

with ProgressBar():
    delayed_to_netcdf.compute()
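For reference, here is a rough alternative sketch (not the apply_ufunc route above, and not necessarily identical to qcut's handling of ties and NaNs) that keeps the 'time' dimension: compute the per-pixel quantile thresholds with DataArray.quantile, then count how many of its own pixel's thresholds each value exceeds.
import numpy as np
import xarray as xr

ds = xr.tutorial.open_dataset('rasm').load()

# per-pixel thresholds along 'time' -> dims ('quantile', 'y', 'x')
edges = np.linspace(0.01, 1, 10)
thresholds = ds['Tair'].quantile(edges, dim='time')

# class of each value = number of its pixel's thresholds it exceeds;
# the result still carries the 'time', 'y' and 'x' dims
classes = (ds['Tair'] > thresholds).sum('quantile')
ds['classes'] = classes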

Related

Python scipy interpolation meshgrid data

Dear all, I want to interpolate some experimental data in order to make it look higher resolution, but apparently it does not work. I followed the example in this link for mgrid data. The CSV data is given below (see "CSV DATA").
My code:
import pandas as pd
import numpy as np
import scipy.interpolate
import matplotlib.pyplot as plt

x = np.linspace(0, 2.8, 15)
y = np.array([2.1, 2, 1.9, 1.8, 1.7, 1.6, 1.5, 1.4, 1.3, 1.2, 1.1, 0.9, 0.7, 0.5, 0.3, 0.13])
[X, Y] = np.meshgrid(x, y)

Vx_df = pd.read_csv("Vx.csv", header=None)
Vx = Vx_df.to_numpy()

tck = scipy.interpolate.bisplrep(X, Y, Vx)
plt.pcolor(X, Y, Vx, shading='nearest')
plt.show()

xi = np.linspace(0.1, 2.5, 30)
yi = np.linspace(0.15, 2.0, 50)
[X1, Y1] = np.meshgrid(xi, yi)
VxNew = scipy.interpolate.bisplev(X1[:, 0], Y1[0, :], tck, dx=1, dy=1)
plt.pcolor(X1, Y1, VxNew, shading='nearest')
plt.show()
CSV DATA:
0.73,,,-0.08,-0.19,-0.06,0.02,0.27,0.35,0.47,0.64,0.77,0.86,0.90,0.93
0.84,,,0.13,0.03,0.12,0.23,0.32,0.52,0.61,0.72,0.83,0.91,0.96,0.95
1.01,1.47,,0.46,0.46,0.48,0.51,0.65,0.74,0.80,0.89,0.99,0.99,1.07,1.06
1.17,1.39,1.51,1.19,1.02,0.96,0.95,1.01,1.01,1.05,1.06,1.05,1.11,1.13,1.19
1.22,1.36,1.42,1.44,1.36,1.23,1.24,1.17,1.18,1.14,1.14,1.09,1.08,1.14,1.19
1.21,1.30,1.35,1.37,1.43,1.36,1.33,1.23,1.14,1.11,1.05,0.98,1.01,1.09,1.15
1.14,1.17,1.22,1.25,1.23,1.16,1.23,1.00,1.00,0.93,0.93,0.80,0.82,1.05,1.09
,0.89,0.95,0.98,1.03,0.97,0.94,0.84,0.77,0.68,0.66,0.61,0.48,,
,0.06,0.25,0.42,0.55,0.55,0.61,0.49,0.46,0.56,0.51,0.40,0.28,,
,0.01,0.05,0.13,0.23,0.32,0.33,0.37,0.29,0.30,0.32,0.27,0.25,,
,-0.02,0.01,0.07,0.15,0.21,0.23,0.22,0.20,0.19,0.17,0.20,0.21,0.13,
,-0.07,-0.05,-0.02,0.06,0.07,0.07,0.16,0.11,0.08,0.12,0.08,0.13,0.16,
,-0.13,-0.14,-0.09,-0.07,0.01,-0.03,0.06,0.02,-0.01,0.00,0.01,0.02,0.04,
,-0.16,-0.23,-0.21,-0.16,-0.10,-0.08,-0.05,-0.11,-0.14,-0.17,-0.16,-0.11,-0.05,
,-0.14,-0.25,-0.29,-0.32,-0.31,-0.33,-0.31,-0.34,-0.36,-0.35,-0.31,-0.26,-0.14,
,-0.02,-0.07,-0.24,-0.36,-0.39,-0.45,-0.45,-0.52,-0.48,-0.41,-0.43,-0.37,-0.22,
The low-resolution image (without interpolation) and the faulty image I get after interpolation were attached to the original post as "Low resolution" and "High resolution".
Can you please give me some advice? Why does it not interpolate properly?
OK, so to interpolate we need to set up an input and an output grid, and possibly remove values that are missing from the grid. We do that like so:
from io import StringIO
import numpy as np
import pandas as pd
from scipy import interpolate

# csv_string holds the CSV data shown in the question
array = pd.read_csv(StringIO(csv_string), header=None).to_numpy()

def interp(array, scale=1, method='cubic'):
    x = np.arange(array.shape[1]*scale)[::scale]
    y = np.arange(array.shape[0]*scale)[::scale]
    x_in_grid, y_in_grid = np.meshgrid(x, y)
    x_out, y_out = np.meshgrid(np.arange(max(x)+1), np.arange(max(y)+1))
    array = np.ma.masked_invalid(array)
    x_in = x_in_grid[~array.mask]
    y_in = y_in_grid[~array.mask]
    return interpolate.griddata((x_in, y_in), array[~array.mask].reshape(-1),
                                (x_out, y_out), method=method)
Now we need to call this function 3 times. First we fill the missing values in the middle with spline interpolation. Then we fill the boundary values with nearest neighbor interpolation. And finally we size it up by interpreting the pixels as being a few pixels apart and filling in gaps with spline interpolation.
import matplotlib.pyplot as plt

array = interp(array)                    # fill interior gaps with cubic (spline) interpolation
array = interp(array, method='nearest')  # fill the remaining boundary gaps with nearest neighbour
array = interp(array, 50)                # upscale by treating the pixels as 50 apart
plt.imshow(array)
plt.show()
And we get the upscaled result (shown as an image in the original answer).

Vectorizing a for loop with a pandas dataframe

I am trying to do a project for my physics class where we are supposed to simulate motion of charged particles. We are supposed to randomly generate their positions and charges but we have to have positively charged particles in one region and negatively charged ones anywhere else. Right now, as a proof of concept, I am trying to do only 10 particles but the final project will have at least 1000.
My thought process is to create a dataframe with the first column containing the randomly generated charges and run a loop to see what value I get and place in the same dataframe as the next three columns their generated positions.
I have tried a simple for loop going over the rows and filling in the data as I go, but I run into an IndexingError: too many indexers. I also want this to run as efficiently as possible, so that it doesn't slow down too much when I scale up the number of particles.
I also want to vectorize the calculation of each particle's motion, since it depends on the position of every other particle, which would take a lot of computational time with ordinary loops.
Any vectorization optimization or offloading to GPU would be very helpful, thanks.
# In[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

# In[2]:
num_points = 10
df_position = pd.DataFrame(np.empty((num_points, 4)), columns=['Charge', 'X', 'Y', 'Z'])

# In[3]:
charge = np.array([np.random.choice(2, num_points)])
df_position.iloc[:, 0] = np.where(df_position["Charge"] == 0, -1, 1)

# In[4]:
def positive():
    return np.random.uniform(low=0, high=5)

def negative():
    return np.random.uniform(low=5, high=10)

# In[5]:
for row in df_position.itertuples(index=True, name='Charge'):
    if getattr(row, "Charge") == -1:
        df_position.iloc[row, 1] = positive()
        df_position.iloc[row, 2] = positive()
        df_position.iloc[row, 3] = positive()
    else:
        df_position.iloc[row, 1] = negative()
        # this is where I would get the IndexingError and would like to optimize this portion
        df_position.iloc[row, 2] = negative()
        df_position.iloc[row, 3] = negative()
df_position.iloc[:, 0] = np.where(df_position["Charge"] == 0, -1, 1)

# In[6]:
ax = plt.axes(projection='3d')
ax.set_xlim(0, 10); ax.set_ylim(0, 10); ax.set_zlim(0, 10)
xdata = df_position.iloc[:, 1]
ydata = df_position.iloc[:, 2]
zdata = df_position.iloc[:, 3]
chargedata = df_position.iloc[:11, 0]
colors = np.where(df_position["Charge"] == 1, 'r', 'b')
ax.scatter3D(xdata, ydata, zdata, c=colors, alpha=1)
EDIT:
The dataframe that I want the results in would look something like this:
Charge    X    Y    Z
  -1
   1
  -1
  -1
   1
with the initial coordinates of each charge listed in their respective columns. It will be a 3D dataframe, as I will need to keep track of all the new positions after each time step so that I can animate the motion. Each layer will have exactly the same format.
Some code for creating your dataframe:
import numpy as np
import pandas as pd

num_points = 1_000

# uniform distribution of int, not sure it is the best one for your problem
# positive_point = np.random.randint(0, num_points)
positive_point = int(num_points / 100 * np.random.randn() + num_points / 2)
negative_point = num_points - positive_point

positive_df = pd.DataFrame(
    np.random.uniform(0.0, 5.0, size=[positive_point, 3]),
    index=[1] * positive_point, columns=['X', 'Y', 'Z']
)
negative_df = pd.DataFrame(
    np.random.uniform(5.0, 10.0, size=[negative_point, 3]),
    index=[-1] * negative_point, columns=['X', 'Y', 'Z']
)
df = pd.concat([positive_df, negative_df])
It is quite fast for 1,000 or 1,000,000.
Edit: with my first answer, I totally missed a big part of the question. This new one should fit better.
Second edit: I now use a better distribution for the number of positive points than a uniform distribution of ints.
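As a rough follow-up sketch (not part of the answer above): if you prefer to keep Charge as a column, as in the question's dataframe, the coordinate generation can also be fully vectorized with np.where, with no Python-level loop at all.
import numpy as np
import pandas as pd

num_points = 1_000

# random charges of -1 or +1
charge = np.where(np.random.randint(0, 2, num_points) == 0, -1, 1)

# positive charges get coordinates in [0, 5), negative ones in [5, 10)
low = np.where(charge == 1, 0.0, 5.0)
coords = low[:, None] + 5.0 * np.random.uniform(size=(num_points, 3))

df = pd.DataFrame(coords, columns=['X', 'Y', 'Z'])
df.insert(0, 'Charge', charge)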

How do I fix KeyError bug in my code while implementing K-Nearest Neighbours from scratch?

I am trying to implement K-Nearest Neighbours algorithm from scratch in Python. The code I wrote worked well for the Breast-Cancer-Wisconsin.csv dataset.
However, the same code when I try to run for Iris.csv dataset, my implementation fails and gives KeyError.
The only difference between the 2 datasets is that in Breast-Cancer-Wisconsin.csv there are only 2 classes ('2' for malignant and '4' for benign) and both labels are integers, whereas in Iris.csv there are 3 classes ('setosa', 'versicolor', 'virginica') and all 3 labels are strings.
Here is the code I wrote (for Iris.csv) :
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random

style.use('fivethirtyeight')

dataset = {'k': [[1,2],[2,3],[3,1]], 'r': [[6,5],[7,7],[8,6]]}
new_features = [5,7]

#[[plt.scatter(j[0],j[1], s=100, color=i) for j in dataset[i]] for i in dataset]
#plt.scatter(new_features[0], new_features[1], s=100)
#plt.show()

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
#full_data = df.astype(float).values.tolist()
#random.shuffle(full_data)

test_size = 0.2
train_set = {'setosa': [], 'versicolor': [], 'virginica': []}
test_set = {'setosa': [], 'versicolor': [], 'virginica': []}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0

for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy : ', correct/total)
When I run the above code, I get a KeyError message at line number 49.
Could anyone please explain to me where I am going wrong? Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
Also, how do I handle if the classes are in string type instead of integer?
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
REFERENCES
Iris.csv
Breast-Cancer-Wisconsin.csv
Let's start from your last question:
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
Yes, that would work. You shouldn't have to hardcode the names of all the classes of every problem in your code. Instead, you can just write a function that reads all the different values for the class attribute, and assigns a numeric value to each different one.
Could anyone please explain to me where I am going wrong?
Most likely, the problem is that you are reading an instance whose class attribute is not 'setosa', 'versicolor', or 'virginica' (something like Iris-setosa, perhaps?). The idea above should fix this problem.
Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
As discussed before, you just need to avoid hard-coding the names of the classes in your code.
Also, how do I handle if the classes are in string type instead of integer?
def get_class_values(data):
    classes_seen = {}
    for i in data:
        _class = i[-1]
        if _class not in classes_seen:
            classes_seen[_class] = len(classes_seen)
    return classes_seen
A function like this one would return a mapping between all your classes (no matter the type) and numeric codes (from 0 to N-1). Using this mapping would also solve all the problems mentioned before.
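For instance, a hypothetical usage of that mapping (assuming full_data is the list of rows read from the CSV, with the class label as the last element of each row) could look like this:
# replace each row's string label with its numeric code
class_map = get_class_values(full_data)    # e.g. {'setosa': 0, 'versicolor': 1, 'virginica': 2}
full_data = [row[:-1] + [class_map[row[-1]]] for row in full_data]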
Convert String Labels In CSV Files To Integer Labels
After going through some GitHub repos I came across a very simple yet elegant piece of code that solves the above problem. Hope it helps those who have faced this problem before (beginners especially!)
# read the csv file
df = pd.read_csv('iris.csv')

# clean the data file
df.replace('?', -99999, inplace=True)

# convert the string classes into integer types.
# integers are assigned from 0 to N-1.
# 'species' is the name of the column which has the class labels.
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)

# convert the data frame to a list
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Post Debugging
It turns out that we don't actually need the above piece of code, i.e. I can get the answer without explicitly converting the string labels into integer labels (using the code above).
I have posted the original code with some minor changes (below), and the KeyError is now fixed. I am also now getting an accuracy of 97% to 100% (on the Iris dataset only).
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
That is the only change you need to make to the original code I posted in order to make it work!! Simple!
However, please note that the dictionary keys have to be given as numbers (integers) and not strings, otherwise it will still lead to a KeyError!
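A tiny illustration of that key-type pitfall (with made-up feature values):
train_set = {0: [], 1: [], 2: []}
train_set[0.0].append([5.1, 3.5, 1.4, 0.2])   # works: 0.0 == 0, so it finds the integer key
train_set['0'].append([5.1, 3.5, 1.4, 0.2])   # KeyError: the string '0' is not the integer 0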
Wrap-Up
There are some commented-out lines in the original code which I thought would be good to explain, in case somebody runs into issues. Here's one snippet with those lines uncommented (compare with the original code in the question).
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Here's the output you get:
ValueError: could not convert string to float: 'virginica'
What went wrong?
Note that here we did not convert the string labels into integer labels. Therefore, when we tried to convert the data in the CSV to float values, the kernel threw an error because a string cannot be converted to float!
So one way to go about it is to not convert the data into floating-point values, and then you won't get this error. However, in many cases you do need to convert all the data to floating point (e.g. for normalisation, accuracy calculations, long mathematical computations, avoiding loss of precision, and so on).
Hence after heavy debugging and going through a lot of articles I finally came up with a simple version of the original code (below):
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random

def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)

test_size = 0.2
train_set = {0: [], 1: [], 2: []}
test_set = {0: [], 1: [], 2: []}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0

for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1

print('Accuracy : ', (correct/total)*100, '%')
Hope this helps!

DBSCAN sklearn memory issues

I am trying to use DBSCAN sklearn implementation for anomaly detection. It works fine for small datasets (500 x 6). However, it runs into memory issues when I try to use a large dataset (180000 x 24). Is there something I can do to overcome this issue?
from sklearn.cluster import DBSCAN
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

data = pd.read_csv("dataset.csv")

# Drop non-continuous variables
data.drop(["x1", "x2"], axis=1, inplace=True)
df = data

data = df.to_numpy().astype("float32", copy=False)
stscaler = StandardScaler().fit(data)
data = stscaler.transform(data)

print("Dataset size:", df.shape)

dbsc = DBSCAN(eps=3, min_samples=30).fit(data)

labels = dbsc.labels_
core_samples = np.zeros_like(labels, dtype=bool)
core_samples[dbsc.core_sample_indices_] = True

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)

df['Labels'] = labels.tolist()
# print(df.head(10))
print("Number of anomalies:", -1 * (df[df.Labels < 0]['Labels'].sum()))
Depending on the type of problem you are tackling, you could play around with this parameter in the DBSCAN constructor:
leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
If that does not suit your needs, this question has already been addressed here; you could also try ELKI's DBSCAN implementation.
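For illustration, here is a rough sketch of two options (reusing the scaled data array and the eps/min_samples values from the question; treat it as a sketch rather than a guaranteed fix): tune the tree-based neighbour search, or precompute a sparse radius-neighbours graph so DBSCAN never has to build a dense distance matrix.
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# option 1: tune the tree used for the neighbourhood queries
dbsc = DBSCAN(eps=3, min_samples=30, algorithm='ball_tree', leaf_size=15).fit(data)

# option 2: precompute a sparse radius-neighbours graph and cluster on it
nn = NearestNeighbors(radius=3).fit(data)
graph = nn.radius_neighbors_graph(data, mode='distance')
dbsc = DBSCAN(eps=3, min_samples=30, metric='precomputed').fit(graph)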

Apply power fit to data by using levenberg-marquardt algorithm in python

Hi everybody!
I am a beginner in Python and data analysis, and I have run into a problem while fitting a power function to my data.
I plotted my dataset as a scatterplot (image attached to the original post).
I want to fit a power function with an exponent around -1, but after I apply the Levenberg-Marquardt method using the lmfit library in Python, I get the faulty fit shown in the second attached image. I tried to modify the initial parameters, but it didn't help.
Here is my code:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from lmfit import minimize, Parameters, Parameter, report_fit
be = pd.read_table('...',
skipinitialspace=True,
names = ["CoM", "slope", "slope2"])
x=be["CoM"]
data=be["slope"]
def fcn2min(params, x, data):
    n2 = params['n2'].value
    n1 = params['n1'].value
    model = n1 * x ** n2
    return model - data  # that's what you want to minimize
# create a set of Parameters
# 'value' is the initial condition
params = Parameters()
params.add('n2', value= -1.00)
params.add('n1',value= 23.0)
# do fit, here with leastsq model
result = minimize(fcn2min, params, args=(be["CoM"],be["slope"]))
#calculate final result
final = data + result.residual
resid = result.residual
# write error report
report_fit(result)
#plot results
xplot = x
yplot = result.params['n1'].value * x ** result.params['n2'].value
plt.figure(figsize=(15,6))
plt.ylabel('OD-slope',fontsize=18, color='blue')
plt.xlabel('CoM height_Sz [m]',fontsize=18, color='blue')
plt.plot(be["CoM"],be["slope"],"o", label="slope_flat")
plt.plot(be["CoM"],be["slope2"],"+",color='r', label="slope_curv")
plt.plot(xplot,yplot)
plt.legend()
plt.savefig('plot2')
plt.show()
I don't quite understand what the problem is with this, so if you have any observations, thank you very much.
It's a little hard to tell what the question is. It looks to me like the fit completed and gave a reasonably good result, but you don't provide the fit statistics or the report of the parameters.
If you're asking about all the green lines for the "COM" array (the best fit?), this is almost certainly because the starting x axis "height_Sz" data was not sorted to be strictly increasing. That's OK for the fit, but plotting an X-Y trace with a line expects the data to be in order.
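A minimal sketch of that plotting fix (reusing result and be from the question; it only changes how the best-fit line is drawn, not the fit itself):
import numpy as np
import matplotlib.pyplot as plt

# sort by the x values before drawing the best-fit curve as a line
order = np.argsort(be["CoM"].values)
xplot = be["CoM"].values[order]
yplot = result.params['n1'].value * xplot ** result.params['n2'].value
plt.plot(xplot, yplot, color='g', label='power-law fit')
plt.legend()
plt.show()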
