How can I compute spatial statistics in grid space of two NetCDF files? - geospatial

I have two files with the same spatial and temporal dimensions: ERA5_w1 is the observation file and CCMP_w1 is the forecast file.
How can I calculate the RMSE over the 28 timesteps so that I get the spatial distribution of the RMSE for this 3-dimensional field?
I would like to generate an RMSE plot like the image below:
Link to download the files: Files

One option for doing this is my package nctoolkit. The following code will calculate the RMSE for your data: for each grid cell it squares the forecast-minus-observation differences, averages them over the 28 timesteps, and takes the square root.
import nctoolkit as nc
# load the two files as datasets
ds1 = nc.open_data("CCMP_w1")
ds2 = nc.open_data("ERA5_w1")
# subtract the observation from the forecast
ds1.subtract(ds2)
# square the differences
ds1.power(2)
# sum over all time steps
ds1.tsum()
# divide by the number of time steps to get the mean squared error
ds1.divide(28)
# take the square root to get the RMSE
ds1.sqrt()
# plot the resulting spatial field
ds1.plot("WS10")
At present there isn't an explicit rmse method in nctoolkit, but I plan to add one in an upcoming release.
More details about the package are here.
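If you would rather stay in the xarray ecosystem, the same per-cell RMSE can be computed directly. This is a minimal sketch, assuming the wind-speed variable is named WS10 (as in the plot call above) and the time dimension is named time in both files:
import xarray as xr
# open the forecast and observation files (assumed to be NetCDF)
fc = xr.open_dataset("CCMP_w1")
obs = xr.open_dataset("ERA5_w1")
# per-grid-cell RMSE over the time dimension
rmse = (((fc["WS10"] - obs["WS10"]) ** 2).mean(dim="time")) ** 0.5
rmse.plot()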

Related

Generating multiple sets of n samples from data set where standard deviation of each set is minimized

I prepared a dataset and later learned that it is skewed.
Assume a plot of user_count vs. score, where user_count is the number of users at that particular score.
I have to split the total users into multiple samples of size [100 <= n <= 1000] in such a way that the standard deviation of each created sample is minimized.
How do I do that?
I have tried binning methods like custom binning, quantiles, etc., but they have not helped me, as with manual binning some of my bins have a high SD.
Example:
I created 19 custom bins with intervals .05-.10, .10-.15, ..., .90-.95, >.95
this gives me something like this:
The problem here is that q19 has a high SD.
So I am trying to figure out a way to automatically create an optimal number of bins with minimal standard deviations.
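A minimal sketch of one possible approach, using placeholder data: for one-dimensional values, sorting the scores and splitting them into contiguous chunks keeps each bin's spread as small as a fixed-size split allows (a 1-D optimal method such as Jenks natural breaks or ckmeans would reduce it further, at the cost of uneven bin sizes):
import numpy as np
# placeholder for the real, skewed score distribution
scores = np.random.beta(2, 8, size=10_000)
# sorting first guarantees each bin covers a contiguous score range,
# which keeps the within-bin standard deviation low
sorted_scores = np.sort(scores)
n = 500  # chosen sample size, must satisfy 100 <= n <= 1000
# the last bin may be smaller if the total is not a multiple of n
bins = [sorted_scores[i:i + n] for i in range(0, len(sorted_scores), n)]
for k, b in enumerate(bins):
    print(f"bin {k}: size={len(b)}, sd={b.std():.4f}")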

python - How do I extract the id from an unsupervised text classification

So I have the following dataframe:
id text
342 text sample
341 another text sample
343 ...
And the following code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
n_clusters = 5  # example value
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(df['text']).toarray()  # toarray() avoids the deprecated np.matrix
pca = PCA(n_components=2)
data2D = pca.fit_transform(X)
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(data2D)
silhouette_avg = silhouette_score(data2D, cluster_labels)
print(silhouette_avg)
y_lower = 10
for i in range(n_clusters):
    # here I would like to get the ids of each item per cluster
    # so that I know which list of ids falls into which cluster
    pass
Now, how can I see which id falls into which cluster? Is this something that can be done? Also, is my approach correct for clustering these text documents?
Please note that I might have skipped some code in order to keep the question short.
There are many ways to perform document classification, and K-Means is one of them. To say whether what you are doing is the best choice would be impossible without looking at the data and the use case and exploring other methods.
If you'd like to stick with KMeans, I suggest you read the documentation on the scikit-learn website one more time. You'll notice in the example that the predicted cluster label for each point is available from the labels_ attribute of the fitted estimator, which holds the same labels that your fit_predict call already returns.
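A minimal sketch of how the ids can be recovered, assuming df, data2D and n_clusters from the question: fit_predict returns one label per row of data2D, in the same row order as df, so the labels can be attached to the dataframe directly.
from sklearn.cluster import KMeans
clusterer = KMeans(n_clusters=n_clusters, random_state=10)
cluster_labels = clusterer.fit_predict(data2D)
# one label per row, in the same row order as df
df['cluster'] = cluster_labels
for i in range(n_clusters):
    ids_in_cluster = df.loc[df['cluster'] == i, 'id'].tolist()
    print(i, ids_in_cluster)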

sklearn decision tree classifier: How to control max number of branches of each split

I am trying to code a two-class classification decision tree problem that I previously built in SAS EM, but now in sklearn. The target variable is a two-class categorical variable, and there are a few continuous independent variables. In SAS I could specify the "Maximum Number of Branches" for each split, so when it is set to 4, some nodes split into 2 and some into 4 (especially for continuous variables). I could not find an equivalent parameter in sklearn. I looked at max_leaf_nodes, but that controls the total number of leaf nodes of the entire tree. I am sure some of you have probably faced the same situation and already found a solution. Please help/share. I will really appreciate it.
I don't think this option is available in sklearn. You will find this post very useful for your classification DT, as it lists all the options you have available.
I would recommend creating bins for your continuous variables; this way you force the number of branches to match the number of bins you have, as in the sketch below.
Example: a continuous variable Col1 has values between 1 and 100; you can create 4 bins, 1-25, 26-50, 51-75 and 76-100, or you can create the bins based on quantiles such as the median.
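A minimal pandas sketch of that binning, assuming a hypothetical dataframe df with a continuous column Col1 in the 1-100 range:
import pandas as pd
df = pd.DataFrame({'Col1': [3, 27, 55, 91, 42, 78]})  # placeholder data
# fixed-width bins, as in the example above
df['Col1_bin'] = pd.cut(df['Col1'], bins=[0, 25, 50, 75, 100],
                        labels=['1-25', '26-50', '51-75', '76-100'])
# or quantile-based bins (four groups of roughly equal size)
df['Col1_qbin'] = pd.qcut(df['Col1'], q=4)
The binned column can then be one-hot encoded (e.g. with pd.get_dummies) before fitting the sklearn tree.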

What is the best way to permanently store an array with 512 floats and 1 million records to facilitate fast search?

I have millions of images, and I have converted each one into 512 numbers that represent what is in that image at a higher level of abstraction than pixels. The dataset is like a table with 512 fields and a million rows, filled with floats.
When given a new image, I would like to be able to query the 1 million records and return them in order of "similarity". Similarity can be defined as the lowest sum of differences between the two arrays of 512 elements.
What is the best way of permanently storing this data and performing the numerical calculations so that the "image search" is fast?
Just for background info: the 512 elements are the intermediate output features of a convolutional neural network used for image classification. I'm trying to return the most similar images when given a new image.
I'm pretty new to this - I hope the question makes sense.
I can store the database in many different ways... serialized in a SQL database, a CSV file... but what I'm not sure of is which format is best for fast search later on.
My suggestion would be vectorization, possible in Python's NumPy, MATLAB, Octave, etc. Basically, this means you can take the difference between two matrices in a single array operation instead of looping over pixels.
For instance, in Python 3:
import numpy as np
pic1 = np.array([[1, 2], [3, 4]])
pic2 = np.array([[4, 3], [2, 1]])
diff = pic1 - pic2
dist = diff * diff  # element-wise squared differences
similarity = 1 / dist.sum()  # note: division by zero if the two arrays are identical
print(similarity)
This is fast because the element-wise work runs inside optimized array routines: the Python-level cost is O(number of pictures) rather than O(n * d^2), where d is the edge length of an image.
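The same idea extends to the whole dataset: the full feature matrix can be compared against a query vector in one broadcasted operation. A minimal sketch with placeholder random data (100,000 rows here to keep memory modest; the real table would have a million):
import numpy as np
features = np.random.rand(100_000, 512).astype(np.float32)  # placeholder for the stored features
query = np.random.rand(512).astype(np.float32)  # features of the new image
# squared L2 distance from the query to every stored record, via broadcasting
dists = ((features - query) ** 2).sum(axis=1)
# indices of the 10 most similar images
nearest = np.argsort(dists)[:10]
print(nearest)
For permanent storage, a flat binary file of float32 values read back with np.memmap keeps this approach fast without loading everything into RAM; if exact search becomes too slow, approximate nearest-neighbour libraries such as Faiss or Annoy are built for exactly this workload.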

Importing Transient Data into Paraview

I have a 3D triangulated surface. The Nodes and Conn variables store the coordinates and the connectivity of the triangles. At each vertex, a scalar quantity, S, and a vector with three components, V, are stored. These data are time-dependent. Also, my geometry does not change over time, so I have one surface for all the timesteps.
How should I approach writing a VTK file that holds the transient data on this surface? In other words, I want to write the values of S and V at each timestep on this 3D surface into a single VTK file. I ultimately want to import this VTK file into ParaView for visualization. vtkTemporalDataSet seems to be the solution for me, but I could not find an example of how to write an ASCII or binary file for this VTK class. Could vtkPolyData somehow be used to define time so that ParaView knows the transient nature of my dataset? I would appreciate any help or comments.
The VTK file format does not support transient data. However, you can write a series of files that ParaView will interpret as a time sequence. This will work fine with poly data in the VTK file. The file series is defined as files of the same name with a number identifier in them. For example, if you have a series of files named:
MyFile_000.vtk
MyFile_001.vtk
MyFile_002.vtk
ParaView will group these files together in its file browser and when you read them together, it will treat them as a file sequence with 3 time steps.
The bad part of this representation is that you will have to replicate the Nodes and Conn in each file. If that is a problem, you will have to use a different file format that supports multiple time steps using the same connection information (such as the Exodus II file format).
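A minimal sketch of writing such a series with the VTK Python bindings, reusing the Nodes, Conn, S and V names from the question; the assumed shapes are Nodes (n_points, 3), Conn (n_tris, 3), S (n_steps, n_points) and V (n_steps, n_points, 3):
import numpy as np
import vtk
from vtk.util import numpy_support
# the geometry is the same at every timestep, so build it once
points = vtk.vtkPoints()
for p in Nodes:
    points.InsertNextPoint(float(p[0]), float(p[1]), float(p[2]))
triangles = vtk.vtkCellArray()
for tri in Conn:
    triangles.InsertNextCell(3)
    for idx in tri:
        triangles.InsertCellPoint(int(idx))
# one legacy VTK file per timestep, numbered so ParaView groups them as a sequence
for t in range(S.shape[0]):
    poly = vtk.vtkPolyData()
    poly.SetPoints(points)
    poly.SetPolys(triangles)
    s_arr = numpy_support.numpy_to_vtk(np.ascontiguousarray(S[t]), deep=True)
    s_arr.SetName('S')
    poly.GetPointData().AddArray(s_arr)
    v_arr = numpy_support.numpy_to_vtk(np.ascontiguousarray(V[t]), deep=True)
    v_arr.SetName('V')
    poly.GetPointData().AddArray(v_arr)
    writer = vtk.vtkPolyDataWriter()  # writes ASCII by default; call writer.SetFileTypeToBinary() for binary
    writer.SetFileName('MyFile_%03d.vtk' % t)
    writer.SetInputData(poly)
    writer.Write()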
