How can I save and load a MetaGraph object (from LightGraphs and MetaGraphs) so that the metadata is still there when I load it back?
Right now I have a metagraph mg that I save using:
LightGraphs.savegraph("net.lg", mg)
But trying to reload it:
reloaded = LightGraphs.loadgraph("net.lg")
gives me the following:
BoundsError: attempt to access 2-element Array{SubString{String},1} at index [3]
Is there any way to read the metagraph back in with the MetaGraphs package?
We support MetaGraphs persistence using a JLD format provided by JLD2.jl:
using LightGraphs, MetaGraphs
julia> g = Graph(10,20)
{10, 20} undirected simple Int64 graph
julia> mg = MetaGraph(g)
{10, 20} undirected Int64 metagraph with Float64 weights defined by :weight (default weight 1.0)
julia> savegraph("foo.mg", mg)
1
julia> mg2 = loadgraph("foo.mg", MGFormat())
{10, 20} undirected Int64 metagraph with Float64 weights defined by :weight (default weight 1.0)
julia> mg2 == mg
true
Note that you need to specify MGFormat() in the loadgraph call; otherwise LightGraphs won't know what type of graph you're trying to load.
In Distributions.jl we can specify the priors of a mixture model. But we cannot specify the weights. For example, if I want to make a mixture like this:
pdf.(Normal(2, 3), x) .* w1 .+ pdf.(Normal(5, 10), x) .* w2
I cannot really specify the weights. And the priors are required to add up to 1, for obvious reasons.
So, is there a way to specify the weights in MixtureModel?
Something like:
MixtureModel(Normal[
    Normal(2, 3),
    Normal(5, 10)
], weights=[w1, w2])
Thanks
This is covered in the Distributions.jl documentation on mixture model constructors — you want the prior argument. See
https://juliastats.org/Distributions.jl/v0.14/mixture.html#Constructors-1
Here's a quick plot of their first example. The [0.2, 0.5, 0.3] are the weights:
julia> using Distributions, Plots
julia> d = MixtureModel(Normal[
Normal(-2.0, 1.2),
Normal(0.0, 1.0),
Normal(3.0, 2.5)], [0.2, 0.5, 0.3])
MixtureModel{Normal}(K = 3)
components[1] (prior = 0.2000): Normal{Float64}(μ=-2.0, σ=1.2)
components[2] (prior = 0.5000): Normal{Float64}(μ=0.0, σ=1.0)
components[3] (prior = 0.3000): Normal{Float64}(μ=3.0, σ=2.5)
julia> x = -10:0.1:10
-10.0:0.1:10.0
julia> plot(x, pdf.(d, x), legend=nothing, xlabel="x", ylabel="pdf")
Which produces a plot of the mixture density.
Hello, I'm running a GridSearchCV and printing the results with scikit-learn's cv_results_ attribute.
My problem is that when I evaluate the mean over all the test-score splits by hand, I obtain a number different from what is written in 'mean_test_score'. Is it computed with something other than the standard np.mean()?
I attach here the code with the result:
from sklearn.model_selection import GridSearchCV, GroupKFold

# model, score_auc, X, Y and patients are defined earlier in the script
n_estimators = [100]
max_depth = [3]
learning_rate = [0.1]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate)
gkf = GroupKFold(n_splits=7)
grid_search = GridSearchCV(model, param_grid, scoring=score_auc, cv=gkf)
grid_result = grid_search.fit(X, Y, groups=patients)
grid_result.cv_results_
The result of this operation is:
{'mean_fit_time': array([ 8.92773601]),
'mean_score_time': array([ 0.04288721]),
'mean_test_score': array([ 0.83490629]),
'mean_train_score': array([ 0.95167036]),
'param_learning_rate': masked_array(data = [0.1],
mask = [False],
fill_value = ?),
'param_max_depth': masked_array(data = [3],
mask = [False],
fill_value = ?),
'param_n_estimators': masked_array(data = [100],
mask = [False],
fill_value = ?),
'params': ({'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100},),
'rank_test_score': array([1]),
'split0_test_score': array([ 0.74821666]),
'split0_train_score': array([ 0.97564995]),
'split1_test_score': array([ 0.80089016]),
'split1_train_score': array([ 0.95361201]),
'split2_test_score': array([ 0.92876979]),
'split2_train_score': array([ 0.93935856]),
'split3_test_score': array([ 0.95540287]),
'split3_train_score': array([ 0.94718634]),
'split4_test_score': array([ 0.89083901]),
'split4_train_score': array([ 0.94787374]),
'split5_test_score': array([ 0.90926355]),
'split5_train_score': array([ 0.94829775]),
'split6_test_score': array([ 0.82520379]),
'split6_train_score': array([ 0.94971417]),
'std_fit_time': array([ 1.79167576]),
'std_score_time': array([ 0.02970254]),
'std_test_score': array([ 0.0809713]),
'std_train_score': array([ 0.0105566])}
As you can see, taking np.mean of all the test scores gives a value of approximately 0.8655122606479532, while 'mean_test_score' is 0.83490629.
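For reference, here is a minimal reconstruction of that by-hand computation (the scores are copied from the cv_results_ output above):

import numpy as np

# Per-split test scores from cv_results_ above
test_scores = np.array([0.74821666, 0.80089016, 0.92876979, 0.95540287,
                        0.89083901, 0.90926355, 0.82520379])
print(np.mean(test_scores))  # ~0.8655, not the reported 0.83490629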
Thanks for your help,
Leonardo.
I will post this as a new answer since it's so much code:
The test and train scores of the folds are: (taken from the results you posted in your question)
test_scores = [0.74821666,0.80089016,0.92876979,0.95540287,0.89083901,0.90926355,0.82520379]
train_scores = [0.97564995,0.95361201,0.93935856,0.94718634,0.94787374,0.94829775,0.94971417]
The numbers of training and test samples in those folds are: (taken from the output of print([(len(train), len(test)) for train, test in gkf.split(X, groups=patients)]))
train_len = [41835, 56229, 56581, 58759, 60893, 60919, 62056]
test_len = [24377, 9983, 9631, 7453, 5319, 5293, 4156]
Then the test and train means, weighted by the number of samples per fold, are:
import numpy as np

train_avg = np.average(train_scores, weights=train_len)
# -> 0.95064898361714389
test_avg = np.average(test_scores, weights=test_len)
# -> 0.83490628649308296
So this is exactly the value sklearn gives you, and it is the correct mean accuracy of your classification. The plain mean over the folds is misleading in that it depends on the somewhat arbitrary sizes of the splits/folds you chose.
So in conclusion, both explanations were indeed equivalent and correct.
If you look at the original code of GridSearchCV in the scikit-learn GitHub repository, you'll see that they don't use np.mean(); instead they use np.average() with weights. Hence the difference. Here's their code:
n_splits = 3
test_sample_counts = np.array(test_sample_counts[:n_splits],
                              dtype=np.int)
weights = test_sample_counts if self.iid else None
means = np.average(test_scores, axis=1, weights=weights)
stds = np.sqrt(np.average((test_scores - means[:, np.newaxis]) ** 2,
                          axis=1, weights=weights))

cv_results = dict()
for split_i in range(n_splits):
    cv_results["split%d_test_score" % split_i] = test_scores[:, split_i]
cv_results["mean_test_score"] = means
cv_results["std_test_score"] = stds
In case you want to know more about the difference between them, take a look at: Difference between np.mean() and np.average()
I suppose the reason for the different means is different weighting factors in the mean calculation.
The mean_test_score that sklearn returns is the mean calculated on all samples where each sample has the same weight.
If you instead take the mean of the folds (splits), you only get the same result if all folds are of equal size. If they are not, then each sample from a larger fold automatically has a smaller impact on the mean-of-folds than a sample from a smaller fold, and vice versa.
Small numeric example:
mean([2,3,5,8,9]) = 5.4    # mean over all samples ('mean_test_score')
mean([2,3,5])     = 3.333  # mean of fold 1
mean([8,9])       = 8.5    # mean of fold 2
mean([3.333,8.5]) = 5.917  # mean of the fold means
5.4 != 5.917
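A minimal numpy sketch of the same point (values copied from the example above): np.average with the fold sizes as weights recovers the overall mean, while the plain mean of the fold means does not.

import numpy as np

samples = [2, 3, 5, 8, 9]
folds = [[2, 3, 5], [8, 9]]

print(np.mean(samples))             # 5.4, mean over all samples
fold_means = [np.mean(f) for f in folds]
print(np.mean(fold_means))          # ~5.917, unweighted mean of fold means
print(np.average(fold_means, weights=[len(f) for f in folds]))  # 5.4 again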
I'm aware that subsets of ImageNet exist, however they don't fulfill my requirement. I want 50 classes at their native ImageNet resolutions.
To this end, I used torch.utils.data.dataset.Subset to select specific classes from ImageNet. However, it turns out that class labels/indices must be non-negative and less than num_classes.
Since ImageNet contains 1000 classes, the indices of my selected classes quickly go over 50. How can I reassign the class indices, and do so in a way that allows for evaluation later down the road as well?
Is there a more elegant way to select a subset?
I am not sure I understood your conclusion about labels being greater than zero and less than num_classes. The torch.utils.data.Subset helper takes in a torch.utils.data.Dataset and a sequence of indices; these correspond to the indices of the data points from the Dataset that you would like to keep in the subset. They have nothing to do with the classes those points belong to.
Here's how I would approach this:
Load your dataset through torchvision.datasets (custom datasets would work the same way). Here I will demonstrate it with FashionMNIST since ImageNet's data is not made available directly through torchvision's API.
>>> import torchvision
>>> ds = torchvision.datasets.FashionMNIST('.', download=True)
>>> len(ds)
60000
Define the classes you want to select for the subset dataset, and retrieve all indices from the main dataset which correspond to these classes:
>>> targets = [1, 3, 5, 9]
>>> indices = [i for i, label in enumerate(ds.targets) if label in targets]
You have your subset:
>>> from torch.utils.data import Subset
>>> ds_subset = Subset(ds, indices)
>>> len(ds_subset)
24000
At this point, you can use a dictionary to remap the original labels to contiguous ones using targets:
>>> remap = {x: i for i, x in enumerate(targets)}
>>> remap
{1: 0, 3: 1, 5: 2, 9: 3}
For example:
>>> x, y = ds_subset[10]
>>> y, remap[y] # old_label, new_label
(1, 0)
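If you want the remapping applied on the fly, so that a DataLoader yields the new labels directly, a small wrapper dataset is one option. This is a hypothetical helper, not part of torchvision; it assumes the wrapped dataset exposes a targets attribute, as the torchvision datasets do:

from torch.utils.data import Dataset

class RemappedSubset(Dataset):
    # Hypothetical helper: keep only samples whose label is in `targets`,
    # and relabel them with contiguous indices 0..len(targets)-1.
    def __init__(self, dataset, targets):
        self.dataset = dataset
        self.remap = {label: i for i, label in enumerate(targets)}
        self.indices = [i for i, label in enumerate(dataset.targets)
                        if int(label) in self.remap]

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        x, y = self.dataset[self.indices[i]]
        return x, self.remap[int(y)]

ds_subset = RemappedSubset(ds, [1, 3, 5, 9])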
I am trying to create a transformation using pyproj's CRS. I want to transform map data (of the Netherlands) in stereographic projection to a latitude longitude representation. I have found the necessary transformation info in the meta-data of the map, but I get the following error:
pyproj.exceptions.ProjError: Error creating Transformer from CRS.: (Internal Proj Error: proj_create_operations: Source and target ellipsoid do not belong to the same celestial body)
I use the following Python 3 code to create the Transformer:
import h5py
from pyproj import CRS, Transformer

with h5py.File(folder + filename, 'r') as f:
    proj4 = str(list(f['geographic/map_projection'].attrs.items())[2][1])
proj4 = proj4[2:len(proj4)-1]  # strip the b'...' wrapper from the bytes repr
from_proj = CRS.from_proj4(proj4)
to_proj = CRS.from_proj4("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
print(from_proj)
print(to_proj)
transform = Transformer.from_crs(from_proj, to_proj)
print(from_proj)
outputs:
+proj=stere +lat_0=90 +lon_0=0.0 +lat_ts=60.0 +a=6378.137 +b=6356.752 +x_0=0 +y_0=0 +type=crs
print(to_proj)
outputs:
+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 +type=crs
transform = Transformer.from_crs(from_proj, to_proj)
generates an error:
Traceback (most recent call last):
File "load_random_h5.py", line 100, in <module>
load_random_h5()
File "load_random_h5.py", line 50, in load_random_h5
transform = Transformer.from_crs(from_proj, to_proj)
File "/Users/user/miniconda2/envs/gdal_test/lib/python3.7/site-packages/pyproj/transformer.py", line 323, in from_crs
area_of_interest=area_of_interest,
File "pyproj/_transformer.pyx", line 311, in pyproj._transformer._Transformer.from_crs
pyproj.exceptions.ProjError: Error creating Transformer from CRS.: (Internal Proj Error: proj_create_operations: Source and target ellipsoid do not belong to the same celestial body)
The problem might stem from the fact that the map data represents precipitation (water falling in the atmosphere) about 1 km up in the air, thereby not matching the radius of the earth, but this is just speculation. The map data is a composite of two radar stations in the Netherlands. But I assume the projection information in the metadata should be enough to apply a transform.
I have tried replacing the from_proj projection with several projections from the EPSG standard, but they all return nonsensical longitude/latitude coordinates (sensible values would be a latitude between 50 and 53 and a longitude between 3 and 8 in the case of the Netherlands). Appending the flags +ellps=WGS84 +towgs84=0,0,0 to proj4 removes the error, but again returns nonsensical longitude/latitude coordinates.
Does anybody know a way around the error, or a way to fix it?
I believe that the a and b parameters defined in your projection are in the incorrect units. They appear to be in kilometers when they need to be in meters.
When looking at the ellipsoid parameters of the +proj=latlon, you can see the magnitude of the ellipsoid is ~1000x that of the other projection:
>>> from pyproj import CRS
>>> cc = CRS("+proj=longlat")
>>> cc.datum.ellipsoid.semi_major_metre
6378137.0
>>> cc.datum.ellipsoid.semi_minor_metre
6356752.314245179
>>> cp = CRS("+proj=stere +lat_0=90 +lon_0=0.0 +lat_ts=60.0 +a=6378.137 +b=6356.752 +x_0=0 +y_0=0 +type=crs")
>>> cp.datum.ellipsoid.semi_major_metre
6378.137
>>> cp.datum.ellipsoid.semi_minor_metre
6356.752
Based on this, multiplying the a and b parameters by 1000 should fix your ellipsoid:
>>> cp = CRS("+proj=stere +lat_0=90 +lon_0=0.0 +lat_ts=60.0 +a=6378137 +b=6356752 +x_0=0 +y_0=0 +type=crs")
>>> cp.datum.ellipsoid.semi_major_metre
6378137.0
>>> cp.datum.ellipsoid.semi_minor_metre
6356752.0
And it does not have an error when creating the transformer:
>>> from pyproj import Transformer
>>> trans = Transformer.from_crs("+proj=stere +lat_0=90 +lon_0=0.0 +lat_ts=60.0 +a=6378137 +b=6356752 +x_0=0 +y_0=0 +type=crs", "+proj=latlon")
>>> trans
<Concatenated Operation Transformer: pipeline>
Description: Inverse of unknown + Ballpark geographic offset from unknown to unknown
Area of Use:
- name: World
- bounds: (-180.0, -90.0, 180.0, 90.0)
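As a usage sketch (not from the original answer; the grid coordinates below are made-up placeholders, not values from the OP's file), you can then transform a grid point to lon/lat. always_xy=True keeps the (x, y) -> (lon, lat) axis order:

>>> from pyproj import Transformer
>>> trans = Transformer.from_crs(
...     "+proj=stere +lat_0=90 +lon_0=0.0 +lat_ts=60.0 "
...     "+a=6378137 +b=6356752 +x_0=0 +y_0=0 +type=crs",
...     "+proj=longlat +datum=WGS84",
...     always_xy=True)
>>> lon, lat = trans.transform(20000.0, -4400000.0)  # placeholder metres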
I am new here. First of all, I am very thankful for your time and consideration.
I have 2 questions regarding managing 2 different netCDF files in Python. I searched a lot but unfortunately I couldn't find a solution.
1- I have a netCDF file which has coordinates like below:
time datetime64[ns] 2016-08-16T22:00:00
* y (y) int32 220000 ... 620000
* x (x) int32 20000 ... 720000
lat (y, x) float64 dask.array<shape=(401, 701),
lon (y, x) float64 dask.array<shape=(401, 701),
I need to change the coords to lon/lat so that I can slice an area based on specific lon/lat coordinates (using xarray). But I don't know how to change x and y to lon/lat.
Here is my code:
import xarray as xr
import matplotlib.pyplot as plt
p = "R_201608.nc"
ds = xr.open_mfdataset(p)
q=ds.RR.sel(time='2016-08-16T21:00:00')
2- Similar to 1, I have another netcdf file which has coordinates like below:
* X (X) float32 557600.0 .. 579400.0
* Y (Y) float32 5190600 ... 5205400.0
* time (time) datetime64[ns] 2007-01-31T23:55:00
How can I convert X and Y to the lon/lat system so that I can plot the data in lon/lat coordinates?
Edit (reply to @Ryan):
1- Yes, this file shows rainfall over a large area. I want to cut it down to a smaller area (similar to the area of the file related to question 2) and compare them using bias, RMSE, etc. Here is the full information related to this file:
<xarray.Dataset>
Dimensions: (time: 2976, x: 701, y: 401)
Coordinates:
* time (time) datetime64[ns] 2016-08-31T23:45:00
* y (y) int32 220000 221000 ... 619000 620000
* x (x) int32 20000 21000 ... 719000 720000
lat (y, x) float64 dask.array<shape=(401, 701), chunksize=(401, 701)>
lon (y, x) float64 dask.array<shape=(401, 701), chunksize=(401, 701)>
Data variables:
RR (time, y, x) float32 dask.array<shape=(2976, 401, 701), chunksize=(2976, 401, 701)>
lambert_conformal_conic int32 ...
Conventions: CF-1.5
Edit (reply to @Ryan): 2- And here is the full information about the second file (the smaller area):
<xarray.DataArray 'Precip' (time: 8928, Y: 75, X: 110)>
dask.array<shape=(8928, 75, 110), dtype=float32, chunksize=(288, 75, 110)>
Coordinates:
sensor_height_precip float32 1.5
sensor_height_P float32 1.5
* X (X) float32 557600.0 557800.0 ... 579200.0 579400.0
* Y (Y) float32 5190600.0 5190800.0 ... 5205400.0
* time (time) datetime64[ns] 2007-01-31T23:55:00
Attributes:
grid_mapping: UTM33N
ancillary_variables: QFlag_Precip QGrid_Precip
long_name: Precipitation Amount
standard_name: precipitation_amount
cell_methods: time:sum
units: mm
In problem 1), it is not possible to convert lon and lat to dimension coordinates, because they are two-dimensional (both have dimension x, y). Dimension coordinates, used for slicing, can only be one-dimensional. If you can be more specific about what you want to do after slicing, we can provide more suggestions about how to proceed. Do you want to select a particular latitude / longitude range and then calculate some statistics (e.g. mean / variance)?
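That said, a lat/lon box can still be selected from 2-D coordinates with a boolean mask; here is a minimal sketch using the dataset and variable names from the listing above (the bounding-box values are placeholders, not recommendations):

import xarray as xr

ds = xr.open_mfdataset("R_201608.nc")
# Boolean mask over the 2-D lat/lon coordinates; drop=True trims the
# x/y dimensions down to the smallest box containing the selection.
box = ds.RR.where((ds.lat > 50.0) & (ds.lat < 53.0) &
                  (ds.lon > 3.0) & (ds.lon < 8.0), drop=True)
print(float(box.mean()))  # e.g. a statistic over the selected region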
In problem 2) it looks like you have a map projection. Without more information about the projection, it is impossible to convert to lat / lon coordinates or plot on a map. Is there more information contained in your dataset about the map projection used? Can you post the full output of print(ds)?
I have solved my problem with your help. Thanks a lot.
I could change the coords of both datasets to lon/lat using pyproj, as @Bart mentioned. Creating a meshgrid from the original projected coordinates was the key point.
import numpy as np
from pyproj import Proj

# nx, ny are the original 1-D projected coordinates from the dataset
nxv, nyv = np.meshgrid(nx, ny)
unausp = Proj('+proj=lcc +lat_1=49 +lat_2=46 +lat_0=47.5 +lon_0=13.33333333333333 +x_0=400000 +y_0=400000 +ellps=bessel +towgs84=577.326,90.129,463.919,5.137,1.474,5.297,2.4232 +units=m +no_defs ')
# the inverse projection already returns 2-D lon/lat grids,
# so no second meshgrid is needed
upLon, upLat = unausp(nxv, nyv, inverse=True)
Since I want to compare two rainfall datasets with different spatial resolution (different grid size), I have to upscale one of them using xarray interpolation:

upnew_lon = np.linspace(w.X[0], w.X[-1], w.sizes['X'] // 5)
upnew_lat = np.linspace(w.Y[0], w.Y[-1], w.sizes['Y'] // 5)
uppds = w.interp(Y=upnew_lat, X=upnew_lon)

As far as I know, this interpolation is linear. I compared the upscaled dataset with the original one: the mean rainfall decreases by about 0.03 mm/day after upscaling. Do you think this upscaling method is reliable for sub-hourly rainfall?
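A hedged side note on that last question: pointwise linear interpolation is not mean-preserving, so a small shift like the observed 0.03 mm/day is expected. If preserving the areal mean matters, block-averaging with xarray's coarsen is one alternative; a minimal sketch, reusing the DataArray w from above and an assumed 5x5 block size:

# Block-average 5x5 cells; boundary="trim" drops incomplete edge blocks.
coarse = w.coarsen(X=5, Y=5, boundary="trim").mean()
print(float(coarse.mean()), float(w.mean()))  # areal means should agree closely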