Save NetCDF through Xarray with dimensions given by two coordinates - python-3.x

I have geographical information in a GeoPandas GeoDataFrame. I have a function that saves that information to a NetCDF file through xarray as follows:
import numpy as np
import xarray as xr
from os import makedirs

def write_ncfile(name, new_model, variables):
    ## Distinct latitudes
    lats = new_model.drop_duplicates(["ilat"], keep="first").geometry.y.values
    ## Distinct longitudes
    lons = new_model.drop_duplicates(["ilon"], keep="first").geometry.x.values
    ## Temporary store for DataArrays
    temporal_dataset = {}
    ## Dimensions and coordinates
    dims = ('lat', 'lon')
    coords = dict(lat=lats, lon=lons)
    ## Variables to save
    for variable in variables:
        bcvar = new_model[variable].values
        ## Reshape data to the shape of lats and lons
        bcvar = np.reshape(bcvar, (-1, len(lons)))
        ds = xr.DataArray(bcvar, dims=dims, coords=coords)
        ds.attrs['long_name'] = descriptions[variable]
        ds.attrs['_FillValue'] = 0
        temporal_dataset[variable] = ds
    ## Create Dataset
    DT = xr.Dataset(temporal_dataset)
    ## Save to file
    makedirs(name, exist_ok=True)
    filename = "%s/%s.nc" % (name, name)
    DT.to_netcdf(filename, format="NETCDF4_CLASSIC")
    return None
This code works wonderfully if the underlying geographical grid is rectangular (a lat/lon projection), but if the projection is not rectangular (e.g. Lambert), then I would need the coordinates defined as 2D arrays, not 1D. I am stumped as to how to do that.
I am trying to achieve something like this in the header of ncdump
dimensions:
lat:dim_lat
lon:dim_lon
variables:
double lat(lat, lon)
double lon(lat, lon)
double var1(lat, lon)
double var2(lat, lon)
Currently the code is saving it as
dimensions:
lat:dim_lat
lon:dim_lon
variables:
double lat(lat)
double lon(lon)
double var1(lat, lon)
double var2(lat, lon)
How can I change this?
Example gdf:
ilat ilon geometry d_p T_P d_v T_V
22 0 0 POINT (-70.95000 -33.30000) 0.000000 0 0.000000 0
0 0 1 POINT (-70.85000 -33.30000) 383.862700 39674 120.439438 12448
1 0 2 POINT (-70.75000 -33.30000) 327.639330 33863 112.502638 11628
2 0 3 POINT (-70.65000 -33.30000) 320.808104 33157 96.602750 9984
3 0 4 POINT (-70.55000 -33.30000) 415.217240 42915 99.144774 10247
23 1 0 POINT (-70.95000 -33.40000) 0.000000 0 0.000000 0
4 1 1 POINT (-70.85000 -33.40000) 56.055971 5787 16.853605 17310
5 1 2 POINT (-70.75000 -33.40000) 6686.807845 690341 1992.373592 205691
6 1 3 POINT (-70.65000 -33.40000) 8812.040534 909749 3512.456618 362623
7 1 4 POINT (-70.55000 -33.40000) 5203.112762 537166 2015.376536 208066
24 2 0 POINT (-70.95000 -33.50000) 0.000000 0 0.000000 0
8 2 1 POINT (-70.85000 -33.50000) 133.485233 13765 40.937021 4222
9 2 2 POINT (-70.75000 -33.50000) 7358.668562 758846 2309.069300 238118
10 2 3 POINT (-70.65000 -33.50000) 10420.377036 1074578 3668.947758 378352
11 2 4 POINT (-70.55000 -33.50000) 6166.780423 635935 2047.500621 211144
12 3 0 POINT (-70.95000 -33.60000) 71.933395 74010 21.287101 2193
13 3 1 POINT (-70.85000 -33.60000) 1154.803477 118952 373.474444 38470
14 3 2 POINT (-70.75000 -33.60000) 1512.189352 155764 466.310819 48033
15 3 3 POINT (-70.65000 -33.60000) 7160.093545 737532 2095.296251 215828
16 3 4 POINT (-70.55000 -33.60000) 4870.217943 501661 1494.220152 153914
17 4 0 POINT (-70.95000 -33.70000) 767.033734 78919 241.884877 24887
18 4 1 POINT (-70.85000 -33.70000) 163.023696 16773 48.526857 4993
19 4 2 POINT (-70.75000 -33.70000) 632.011798 65027 207.326845 21332
20 4 3 POINT (-70.65000 -33.70000) 93.053338 9574 27.787137 2859
And the function usage would be
write_ncfile("Trial", gdf, ["d_p", "d_v"])
The example above saves the information just fine with the code above, but I need to generalize it so that it works when the grid is not a regular lat/lon grid.

I had the same problem, and xarray is not very intuitive on this point. The solution is to create a DataArray with the right dimensions (note this is a 3D example):
data_array = xarray.DataArray(
    data,
    coords={
        "time": timestamps,
        "latitude": (["y", "x"], latitude_grid),
        "longitude": (["y", "x"], longitude_grid),
    },
    dims=["time", "y", "x"],
)
So the loop in your code should look like this:
coords = {
    "latitude": (["lon", "lat"], latitude_grid),
    "longitude": (["lon", "lat"], longitude_grid),
}
dims = ['lon', 'lat']
for variable in variables:
    bcvar = new_model[variable].values
    ## Reshape data to the shape of lats and lons
    bcvar = np.reshape(bcvar, (-1, len(lons)))
    ds = xr.DataArray(bcvar, dims=dims, coords=coords)
    ds.attrs['long_name'] = descriptions[variable]
    ds.attrs['_FillValue'] = 0
    temporal_dataset[variable] = ds
Feel free to adapt this to your final running solution.
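Putting the pieces together, here is a minimal self-contained sketch of a Dataset with 2D coordinate variables, using made-up grid values and the question's variable name d_p as a stand-in (the grid shape and values are assumptions for illustration):

```python
import numpy as np
import xarray as xr

# Hypothetical curvilinear grid: 3 rows ("y") by 4 columns ("x").
ny, nx = 3, 4
lat2d = np.linspace(-34.0, -33.0, ny * nx).reshape(ny, nx)
lon2d = np.linspace(-71.0, -70.0, ny * nx).reshape(ny, nx)
var = np.arange(ny * nx, dtype=float).reshape(ny, nx)

# lat and lon become 2D auxiliary coordinates over the (y, x) dimensions,
# so ncdump reports lat(y, x), lon(y, x), d_p(y, x).
ds = xr.Dataset(
    {"d_p": (("y", "x"), var)},
    coords={
        "lat": (("y", "x"), lat2d),
        "lon": (("y", "x"), lon2d),
    },
)
# ds.to_netcdf("trial.nc", format="NETCDF4_CLASSIC")
```

The key point is that lat/lon are declared over the dimensions ("y", "x") rather than being dimensions themselves.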

Related

Assigning ID to column name in merging distance matrix to dataframe

I have this issue I haven't been able to solve and I was hoping to get some insights here.
I have this geopandas dataframe:
GEO =
id geometry_zone \
0 A001DFD POLYGON ((48.08793 50.93755, 48.08793 49.18650...
1 A001DG POLYGON ((60.96434 49.05222, 59.86796 49.29929...
2 A001DS007 POLYGON ((53.16200 50.20131, 52.84363 48.45026...
3 A001DS01 POLYGON ((59.04953 49.34561, 58.77158 47.52346...
4 A001DS02 POLYGON ((58.12301 49.46915, 57.79873 47.67788...
5 A001DS03 POLYGON ((57.07498 49.66937, 56.79702 47.84722...
6 A001DS04 POLYGON ((56.13302 49.80835, 55.83962 48.00164...
7 A001DS05 POLYGON ((55.16017 49.93189, 54.89766 48.18694...
8 A001DS06 POLYGON ((54.14099 50.05542, 53.86304 48.27959...
9 A001DS08 POLYGON ((52.22678 50.36050, 51.94821 48.52985...
10 A001DS09 POLYGON ((50.93339 48.70894, 51.96811 48.52985...
11 A001DS10 POLYGON ((50.23695 50.67887, 49.91857 48.84823...
12 A001DS11 POLYGON ((50.23695 50.67887, 49.60020 50.75847...
13 A001FS01 POLYGON ((46.47617 48.94772, 46.47617 47.63443...
14 A001FS02 POLYGON ((46.49606 50.04213, 46.47617 48.94772...
centroid
0 POINT (48.75295 49.98494)
1 POINT (60.27696 48.21993)
2 POINT (53.49869 49.22928)
3 POINT (59.29040 48.38586)
4 POINT (58.42620 48.49535)
5 POINT (57.43469 48.68996)
6 POINT (56.46528 48.82210)
7 POINT (55.50608 48.98701)
8 POINT (54.51093 49.10232)
9 POINT (52.52668 49.40021)
10 POINT (51.59314 49.51614)
11 POINT (50.57522 49.68396)
12 POINT (49.74105 49.81923)
13 POINT (47.00679 48.58955)
14 POINT (47.23437 49.55921)
where the points are the geometry_zone centroids. Now, I know how to calculate the distance between every point, i.e. compute the distance matrix:
GEO_distances
0 1 2 3 4 5 6 \
0 0.000000 11.063874 4.299228 10.275246 9.312075 8.274448 7.312941
1 10.983097 0.000000 6.348082 0.616036 1.399226 2.373198 3.374784
2 4.132203 6.259105 0.000000 5.469828 4.507633 3.469029 2.507443
3 9.982697 0.409114 5.348195 0.000000 0.399280 1.373252 2.374671
4 9.112541 1.279148 4.477119 0.487986 0.000000 0.504366 1.503677
5 8.102334 2.289412 3.468492 1.497509 0.538514 0.000000 0.494605
6 7.124643 3.266993 2.490125 2.475753 1.515950 0.474954 0.000000
7 6.151367 4.240258 1.517485 3.448859 2.489192 1.448060 0.487174
8 5.151208 5.240246 0.515855 4.450013 3.488962 2.449214 1.487936
9 3.145284 7.246023 0.481768 6.456493 5.494540 4.455695 3.494278
10 2.205711 8.185458 1.420986 7.396838 6.433798 5.396039 4.434327
11 1.174092 9.217045 2.452510 8.428427 7.465334 6.427628 5.465988
12 0.329081 10.062023 3.297427 9.273461 8.310263 7.272662 6.311059
13 1.235000 12.579303 5.838504 11.812993 10.830385 9.818336 8.852372
14 0.853558 12.484730 5.717153 11.712257 10.730567 9.711458 8.743639
7 8 9 10 11 12 13 \
0 6.343811 5.312333 3.377798 2.368462 1.343153 0.675055 1.051959
1 4.353762 5.318769 7.388784 8.269175 9.305375 10.325337 12.247130
2 1.538467 0.506829 0.544190 1.416284 2.454398 3.479383 5.430826
3 3.353424 4.318400 6.388838 7.269062 8.304972 9.325272 11.250890
4 2.482659 3.447704 5.519952 6.398068 7.434796 8.456133 10.381205
5 1.473030 2.437971 4.509526 5.388997 6.424600 7.445701 9.379809
6 0.494829 1.459821 3.533033 4.410650 5.446892 6.468964 8.405156
7 0.000000 0.486633 2.560113 3.437762 4.473614 5.495941 7.440721
8 0.518599 0.000000 1.561677 2.436310 3.473427 4.497171 6.443451
9 2.525085 1.493574 0.000000 0.429875 1.467480 2.492644 4.463771
10 3.465481 2.433809 0.499402 0.000000 0.527884 1.554218 3.540493
11 4.497042 3.465439 1.530986 0.521601 0.000000 0.523013 2.556065
12 5.342058 4.310497 2.376017 1.366597 0.341276 0.000000 1.788666
13 7.901132 6.863255 4.941781 3.928417 2.923256 2.273971 0.000000
14 7.782154 6.746808 4.815043 3.790372 2.766326 2.077512 0.492253
14
0 0.703212
1 12.250335
2 5.430658
3 11.253792
4 10.383930
5 9.382000
6 8.406976
7 7.441895
8 6.444094
9 4.461567
10 3.531133
11 2.517604
12 1.686975
13 0.444277
14 0.000000
(So, first row contains the distance to all points in the centroid column, including the first point).
What I actually want is to merge this matrix into the dataframe AND have the column names be the ids from GEO.
Now, I know how to merge:
new = GEO.merge(GEO_distances, on=['index'])
which returns:
index id geometry_zone \
0 0 A001DFD POLYGON ((48.08793 50.93755, 48.08793 49.18650...
1 1 A001DG POLYGON ((60.96434 49.05222, 59.86796 49.29929...
2 2 A001DS007 POLYGON ((53.16200 50.20131, 52.84363 48.45026...
3 3 A001DS01 POLYGON ((59.04953 49.34561, 58.77158 47.52346...
4 4 A001DS02 POLYGON ((58.12301 49.46915, 57.79873 47.67788...
5 5 A001DS03 POLYGON ((57.07498 49.66937, 56.79702 47.84722...
6 6 A001DS04 POLYGON ((56.13302 49.80835, 55.83962 48.00164...
7 7 A001DS05 POLYGON ((55.16017 49.93189, 54.89766 48.18694...
8 8 A001DS06 POLYGON ((54.14099 50.05542, 53.86304 48.27959...
9 9 A001DS08 POLYGON ((52.22678 50.36050, 51.94821 48.52985...
10 10 A001DS09 POLYGON ((50.93339 48.70894, 51.96811 48.52985...
11 11 A001DS10 POLYGON ((50.23695 50.67887, 49.91857 48.84823...
12 12 A001DS11 POLYGON ((50.23695 50.67887, 49.60020 50.75847...
13 13 A001FS01 POLYGON ((46.47617 48.94772, 46.47617 47.63443...
14 14 A001FS02 POLYGON ((46.49606 50.04213, 46.47617 48.94772...
centroid 0 1 2 3 \
0 POINT (48.75295 49.98494) 0.000000 11.063874 4.299228 10.275246
1 POINT (60.27696 48.21993) 10.983097 0.000000 6.348082 0.616036
2 POINT (53.49869 49.22928) 4.132203 6.259105 0.000000 5.469828
3 POINT (59.29040 48.38586) 9.982697 0.409114 5.348195 0.000000
4 POINT (58.42620 48.49535) 9.112541 1.279148 4.477119 0.487986
5 POINT (57.43469 48.68996) 8.102334 2.289412 3.468492 1.497509
6 POINT (56.46528 48.82210) 7.124643 3.266993 2.490125 2.475753
7 POINT (55.50608 48.98701) 6.151367 4.240258 1.517485 3.448859
8 POINT (54.51093 49.10232) 5.151208 5.240246 0.515855 4.450013
9 POINT (52.52668 49.40021) 3.145284 7.246023 0.481768 6.456493
10 POINT (51.59314 49.51614) 2.205711 8.185458 1.420986 7.396838
11 POINT (50.57522 49.68396) 1.174092 9.217045 2.452510 8.428427
12 POINT (49.74105 49.81923) 0.329081 10.062023 3.297427 9.273461
13 POINT (47.00679 48.58955) 1.235000 12.579303 5.838504 11.812993
14 POINT (47.23437 49.55921) 0.853558 12.484730 5.717153 11.712257
4 5 6 7 8 9 10 \
0 9.312075 8.274448 7.312941 6.343811 5.312333 3.377798 2.368462
1 1.399226 2.373198 3.374784 4.353762 5.318769 7.388784 8.269175
2 4.507633 3.469029 2.507443 1.538467 0.506829 0.544190 1.416284
3 0.399280 1.373252 2.374671 3.353424 4.318400 6.388838 7.269062
4 0.000000 0.504366 1.503677 2.482659 3.447704 5.519952 6.398068
5 0.538514 0.000000 0.494605 1.473030 2.437971 4.509526 5.388997
6 1.515950 0.474954 0.000000 0.494829 1.459821 3.533033 4.410650
7 2.489192 1.448060 0.487174 0.000000 0.486633 2.560113 3.437762
8 3.488962 2.449214 1.487936 0.518599 0.000000 1.561677 2.436310
9 5.494540 4.455695 3.494278 2.525085 1.493574 0.000000 0.429875
10 6.433798 5.396039 4.434327 3.465481 2.433809 0.499402 0.000000
11 7.465334 6.427628 5.465988 4.497042 3.465439 1.530986 0.521601
12 8.310263 7.272662 6.311059 5.342058 4.310497 2.376017 1.366597
13 10.830385 9.818336 8.852372 7.901132 6.863255 4.941781 3.928417
14 10.730567 9.711458 8.743639 7.782154 6.746808 4.815043 3.790372
11 12 13 14
0 1.343153 0.675055 1.051959 0.703212
1 9.305375 10.325337 12.247130 12.250335
2 2.454398 3.479383 5.430826 5.430658
3 8.304972 9.325272 11.250890 11.253792
4 7.434796 8.456133 10.381205 10.383930
5 6.424600 7.445701 9.379809 9.382000
6 5.446892 6.468964 8.405156 8.406976
7 4.473614 5.495941 7.440721 7.441895
8 3.473427 4.497171 6.443451 6.444094
9 1.467480 2.492644 4.463771 4.461567
10 0.527884 1.554218 3.540493 3.531133
11 0.000000 0.523013 2.556065 2.517604
12 0.341276 0.000000 1.788666 1.686975
13 2.923256 2.273971 0.000000 0.444277
14 2.766326 2.077512 0.492253 0.000000
But, how do I give the column the id names in a simple way? Manually renaming 18 000 columns is not my idea of a fun afternoon.
I have found an answer to my question, but I still wonder if there is a better and more elegant way to do this (e.g. on the fly). What I did was this:
new_column_name = GEO.id.to_list()
columnlist = GEO_distances.columns.to_list()
cols_remove = ['index','id','geometry_zone','centroid']
old_column_names = [x for x in columnlist if (x not in cols_remove)]
col_rename_dict = {i:j for i,j in zip(old_column_names,new_column_name)}
GEO_distances.rename(columns=col_rename_dict, inplace=True)
which gives:
index id geometry_zone \
0 0 A001DFD POLYGON ((48.08793 50.93755, 48.08793 49.18650...
1 1 A001DG POLYGON ((60.96434 49.05222, 59.86796 49.29929...
2 2 A001DS007 POLYGON ((53.16200 50.20131, 52.84363 48.45026...
3 3 A001DS01 POLYGON ((59.04953 49.34561, 58.77158 47.52346...
4 4 A001DS02 POLYGON ((58.12301 49.46915, 57.79873 47.67788...
... ... ... ...
1790 1790 R13C1G POLYGON ((63.72846 54.07087, 61.04155 54.02454...
1791 1791 R13D1A POLYGON ((63.03727 60.43190, 65.27641 57.78312...
1792 1792 R13D1D POLYGON ((68.90781 67.16844, 68.95414 60.51294...
1793 1793 R13D1F POLYGON ((61.42043 67.16403, 75.48019 67.22166...
1794 1794 R13D1G POLYGON ((61.40300 67.15300, 61.43388 63.43148...
centroid A001DFD A001DG A001DS007 A001DS01 \
0 POINT (48.75295 49.98494) 0.000000 11.063874 4.299228 10.275246
1 POINT (60.27696 48.21993) 10.983097 0.000000 6.348082 0.616036
2 POINT (53.49869 49.22928) 4.132203 6.259105 0.000000 5.469828
3 POINT (59.29040 48.38586) 9.982697 0.409114 5.348195 0.000000
4 POINT (58.42620 48.49535) 9.112541 1.279148 4.477119 0.487986
... ... ... ... ... ...
1790 POINT (62.36165 51.28081) 12.814471 2.630419 8.337061 3.267367
1791 POINT (69.85889 59.16021) 21.991462 13.464194 18.191827 14.124815
1792 POINT (72.22137 63.86261) 26.206982 18.602918 22.776510 19.187716
1793 POINT (68.46954 68.61039) 26.045757 20.948750 23.468535 21.237352
1794 POINT (65.33358 63.93216) 20.589210 15.508162 17.853344 15.717912
A001DS02 A001DS03 A001DS04 A001DS05 A001DS06 A001DS08 \
0 9.312075 8.274448 7.312941 6.343811 5.312333 3.377798
1 1.399226 2.373198 3.374784 4.353762 5.318769 7.388784
2 4.507633 3.469029 2.507443 1.538467 0.506829 0.544190
3 0.399280 1.373252 2.374671 3.353424 4.318400 6.388838
4 0.000000 0.504366 1.503677 2.482659 3.447704 5.519952
... ... ... ... ... ... ...
1790 3.862726 4.616091 5.526811 6.415345 7.329588 9.280236
1791 14.623171 15.220831 15.921820 16.621702 17.363742 18.956701
1792 19.622822 20.146582 20.757196 21.374153 22.035882 23.454339
1793 21.458123 21.751888 22.104256 22.496392 22.947829 23.939341
1794 15.894837 16.153433 16.481252 16.864660 17.318735 18.347265
Any other more efficient solution is welcome.
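Since the distance-matrix rows and columns are in the same order as GEO, one more on-the-fly option is to relabel both axes directly with DataFrame.set_axis, skipping the rename dict entirely. A sketch with toy ids and distances (the values are made up for illustration):

```python
import pandas as pd

# Toy stand-ins for GEO.id and the distance matrix.
ids = ["A001DFD", "A001DG", "A001DS007"]
GEO_distances = pd.DataFrame([[0.0, 1.0, 2.0],
                              [1.0, 0.0, 1.5],
                              [2.0, 1.5, 0.0]])

# Relabel columns and rows in one shot; rows/columns are assumed to be
# in the same order as the ids, as they are when the matrix is computed
# directly from GEO's centroids.
GEO_distances = (GEO_distances
                 .set_axis(ids, axis=1)
                 .set_axis(ids, axis=0))
```

After this, `GEO.merge(GEO_distances, left_on="id", right_index=True)` attaches the named distance columns without any manual renaming.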

Merge distance matrix results and original indices with Python Pandas

I have a panda df with list of bus stops and their geolocations:
stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125
stop_id isn't necessarily incremental.
Using sklearn.metrics.pairwise.manhattan_distances I calculate distances and get a symmetric distance matrix:
array([[0. , 1.412176, 2.33437 , 3.422297, 5.24705 ],
[1.412176, 0. , 1.151232, 2.047153, 4.165126],
[2.33437 , 1.151232, 0. , 1.104079, 3.143274],
[3.422297, 2.047153, 1.104079, 0. , 2.175247],
[5.24705 , 4.165126, 3.143274, 2.175247, 0. ]])
But I can't manage to easily connect between the two now. I want to have a df that contains a tuple for each pair of stops and their distance, something like:
stop_id_1 stop_id_2 distance
1 2 3.33
I tried working with the lower triangle, convert to vector and all sorts but I feel I just over-complicate things with no success.
Hope this helps!
d= ''' stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125 '''
from io import StringIO  # pd.compat.StringIO was removed in recent pandas
df = pd.read_csv(StringIO(d), sep='\s+')
from sklearn.metrics.pairwise import manhattan_distances
distance_df = pd.DataFrame(manhattan_distances(df))
distance_df.index = df.stop_id.values
distance_df.columns = df.stop_id.values
print(distance_df)
output:
1 2 3 4 6
1 0.000000 1.412176 2.334370 3.422297 5.247050
2 1.412176 0.000000 1.151232 2.047153 4.165126
3 2.334370 1.151232 0.000000 1.104079 3.143274
4 3.422297 2.047153 1.104079 0.000000 2.175247
6 5.247050 4.165126 3.143274 2.175247 0.000000
Now, to create the long format of the same df, use the following.
long_frmt_dist=distance_df.unstack().reset_index()
long_frmt_dist.columns = ['stop_id_1', 'stop_id_2', 'distance']
print(long_frmt_dist.head())
output:
stop_id_1 stop_id_2 distance
0 1 1 0.000000
1 1 2 1.412176
2 1 3 2.334370
3 1 4 3.422297
4 1 6 5.247050
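The long format above keeps both (i, j) and (j, i) plus the zero diagonal. Since the asker mentions wanting only one row per pair (the lower/upper triangle), here is a sketch using a triangular index mask over toy ids and distances (values made up for illustration):

```python
import numpy as np
import pandas as pd

stop_ids = np.array([1, 2, 6])
dist = np.array([[0.0, 1.4, 2.3],
                 [1.4, 0.0, 3.1],
                 [2.3, 3.1, 0.0]])

# Indices of the strict upper triangle: each unordered pair appears once,
# and the zero diagonal is dropped.
iu = np.triu_indices(len(stop_ids), k=1)
pairs = pd.DataFrame({
    "stop_id_1": stop_ids[iu[0]],
    "stop_id_2": stop_ids[iu[1]],
    "distance": dist[iu],
})
```

For n stops this yields n*(n-1)/2 rows, one per pair.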
Alternatively, wrap the distance matrix in a DataFrame and merge it back onto the original frame on the index:
df_dist = pd.DataFrame.from_dict(dist_matrix)
pd.merge(df, df_dist, how='left', left_index=True, right_index=True)

How to expand Python Pandas Dataframe in linearly spaced increments

Beginner question:
I have a pandas dataframe that looks like this:
x1 y1 x2 y2
0 0 2 2
10 10 12 12
and I want to expand that dataframe by half units along the x and y coordinates to look like this:
x1 y1 x2 y2 Interpolated_X Interpolated_Y
0 0 2 2 0 0
0 0 2 2 0.5 0.5
0 0 2 2 1 1
0 0 2 2 1.5 1.5
0 0 2 2 2 2
10 10 12 12 10 10
10 10 12 12 10.5 10.5
10 10 12 12 11 11
10 10 12 12 11.5 11.5
10 10 12 12 12 12
Any help would be much appreciated.
The cleanest way I know to expand rows like this is through groupby.apply. It may be faster to use something like itertuples, but the code will be a little more complicated (keep that in mind if your data set is larger).
I groupby the index, which sends each row to my apply function (your index has to be unique for each row; if it's not, just run reset_index). Since apply can return a DataFrame, we can expand one row into multiple rows.
Caveat: your x2-x1 and y2-y1 distances must be the same, or this won't work.
import pandas as pd
import numpy as np

def expand(row):
    row = row.iloc[0]  # apply passes a dataframe, so get a reference to its first (and only) row
    xdistance = (row.x2 - row.x1)
    ydistance = (row.y2 - row.y1)
    xsteps = np.arange(row.x1, row.x2 + .5, .5)  # create steps arrays
    ysteps = np.arange(row.y1, row.y2 + .5, .5)
    return (pd.DataFrame([row] * len(xsteps))  # you can expand lists in python by multiplying: [val] * 3 = [val, val, val]
            .assign(int_x=xsteps, int_y=ysteps))

(df.groupby(df.index)                   # "group" on each row
   .apply(expand)                       # send row to expand function
   .reset_index(level=1, drop=True))    # groupby gives us an extra index we don't want
starting df
x1 y1 x2 y2
0 0 2 2
10 10 12 12
ending df
x1 y1 x2 y2 int_x int_y
0 0 0 2 2 0.0 0.0
0 0 0 2 2 0.5 0.5
0 0 0 2 2 1.0 1.0
0 0 0 2 2 1.5 1.5
0 0 0 2 2 2.0 2.0
1 10 10 12 12 10.0 10.0
1 10 10 12 12 10.5 10.5
1 10 10 12 12 11.0 11.0
1 10 10 12 12 11.5 11.5
1 10 10 12 12 12.0 12.0
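As the answer notes, an itertuples/repeat approach avoids groupby.apply; a sketch under the same equal-span assumption, with int_x/int_y renamed to the question's Interpolated_X/Interpolated_Y:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [0, 10], "y1": [0, 10], "x2": [2, 12], "y2": [2, 12]})

# One half-unit step array per row; x and y spans are assumed equal,
# matching the caveat in the answer above.
steps = [np.arange(r.x1, r.x2 + 0.5, 0.5) for r in df.itertuples()]
counts = [len(s) for s in steps]

# Repeat each source row once per step, then attach the interpolated columns.
out = df.loc[df.index.repeat(counts)].reset_index(drop=True)
out["Interpolated_X"] = np.concatenate(steps)
out["Interpolated_Y"] = out["y1"] + (out["Interpolated_X"] - out["x1"])
```

This stays vectorized apart from the per-row arange, so it scales better than apply for larger frames.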

How to get Equation of a decision boundary in matlab svm plot?

My data:
y n Rh y2
1 1 1.166666667 1
-1 2 0.5 1
-1 3 0.333333333 1
-1 4 0.166666667 1
1 5 1.666666667 2
1 6 1.333333333 1
-1 7 0.333333333 1
-1 8 0.333333333 1
1 9 0.833333333 1
1 10 2.333333333 2
1 11 1 1
-1 12 0.166666667 1
1 13 0.666666667 1
1 14 0.833333333 1
1 15 0.833333333 1
-1 16 0.333333333 1
-1 17 0.166666667 1
1 18 2 2
1 19 0.833333333 1
1 20 1.333333333 1
1 21 1.333333333 1
-1 22 0.166666667 1
-1 23 0.166666667 1
-1 24 0.333333333 1
-1 25 0.166666667 1
-1 26 0.166666667 1
-1 27 0.333333333 1
-1 28 0.166666667 1
-1 29 0.166666667 1
-1 30 0.5 1
1 31 0.833333333 1
-1 32 0.166666667 1
-1 33 0.333333333 1
-1 34 0.166666667 1
-1 35 0.166666667 1
My code is:
data = xlsread('btpdata.xlsx', 1)
A = data(1:end, 2:3)
B = data(1:end, 1)
svmStruct = svmtrain(A, B, 'showplot', true)
hold on
C = data(1:end, 2:3)
D = data(1:end, 4)
svmStruct = svmtrain(C, D, 'showplot', true)
hold off
How can I get the approximate equations of these black lines in the given MATLAB plot?
It depends which package you used, but as it is a linear Support Vector Machine there are more or less two options:
Your trained svm contains the equation of the line in a property coefs (sometimes called w or weights) and b (or intercept), so your line is <coefs, X> + b = 0.
Your svm contains alphas (dual coefficients, Lagrange multipliers), and then coefs = SUM_i alphas_i * y_i * SV_i, where SV_i is the i'th support vector (the ones in circles on your plot) and y_i is its label (-1 or +1). Sometimes the alphas are already multiplied by y_i, in which case coefs = SUM_i alphas_i * SV_i.
If you are trying to get the equation from the actual plot (image), then you can only read it off (it is more or less y = 0.6, meaning that coefs = [0 1] and b = -0.6). An image-analysis-based approach (for an arbitrary such plot) would require:
detecting the plot area (object detection)
reading the ticks/scale (OCR + object detection) <- this would actually be the hardest part
filtering out everything non-black and performing linear regression on the points left, then transforming through the scale detected earlier.
I have had the same problem. To build the linear equation (y = mx + b) of the decision boundary you need the gradient (m) and the y-intercept (b). The gradient is determined by the SVM beta weights, which SVMStruct does not contain, so you need to calculate them from the alphas (which are included in SVMStruct):
alphas = SVMStruct.Alpha;
SV = SVMStruct.SupportVectors;
betas = sum(alphas.*SV);
The boundary is betas(1)*x + betas(2)*y + SVMStruct.Bias = 0; solving for y gives
m = -betas(1)/betas(2)
b = -SVMStruct.Bias/betas(2)
By the way, if your SVM has scaled the data, then I think you will need to unscale it.
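The alphas-to-weights reconstruction above can be sanity-checked in Python with scikit-learn (a cross-language analogue on my part; the question itself uses MATLAB's svmtrain). In sklearn, dual_coef_ already stores alphas_i * y_i, so the weighted sum over the support vectors should reproduce coef_:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly-separable-ish 2D data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="linear").fit(X, y)

# dual_coef_ holds alphas_i * y_i for the support vectors, so summing
# dual_coef_ @ support_vectors_ reconstructs the primal weight vector.
w = clf.dual_coef_ @ clf.support_vectors_

# Decision boundary w1*x + w2*y + intercept = 0, rewritten as y = m*x + b:
m = -w[0, 0] / w[0, 1]
b = -clf.intercept_[0] / w[0, 1]
```

Here `w` should match `clf.coef_`, which is the sklearn equivalent of the betas computed from SVMStruct.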

Empty square for legend for stackplot

I'm trying to generate a stack plot of version data using matplotlib. I have that portion working and displaying properly, but I'm unable to get the legend to display anything other than an empty square in the corner.
ra_ys = np.asarray(ra_ys)
# Going to generate a stack plot of the version stats
fig = plt.figure()
ra_plot = fig.add_subplot(111)
# Our x axis is going to be the dates, but we need them as numbers
x = [date2num(date) for date in dates]
# Plot the data
ra_plot.stackplot(x, ra_ys)
# Setup our legends
ra_plot.legend(ra_versions) #Also tried converting to a tuple
ra_plot.set_title("blah blah words")
print(ra_versions)
# Only want x ticks on the dates we supplied, and want them to display AS dates
ra_plot.set_xticks(x)
ra_plot.set_xticklabels([date.strftime("%m-%d") for date in dates])
plt.show()
ra_ys is a multidimensional array:
[[ 2 2 2 2 2 2 2 2 2 2 1]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 1 1 1 1 1 1 1 1 1 1 1]
[53 52 51 50 50 49 48 48 48 48 47]
[18 19 20 20 20 20 21 21 21 21 21]
[ 0 0 12 15 17 18 19 19 19 19 22]
[ 5 5 3 3 3 3 3 3 3 3 3]
[ 4 4 3 3 2 2 2 2 2 2 2]
[14 14 6 4 3 3 2 2 2 2 2]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 2 2 2 2 2 2 2 2 2 2 2]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 1 1 1 1 1 1 1 1 1 1 1]
[ 3 3 2 2 2 2 2 2 2 2 2]]
x is some dates: [734969.0, 734970.0, 734973.0, 734974.0, 734975.0, 734976.0, 734977.0, 734978.0, 734979.0, 734980.0, 734981.0]
ra_versions is a list: ['4.5.2', '4.5.7', '4.5.8', '5.0.0', '5.0.1', '5.0.10', '5.0.7', '5.0.8', '5.0.9', '5.9.105', '5.9.26', '5.9.27', '5.9.29', '5.9.31', '5.9.32', '5.9.34']
Am I doing something wrong? Can stack plots not have legends?
EDIT: I tried to print the handles and labels for the plot and got two empty lists ([] []):
handles, labels = theplot.get_legend_handles_labels()
print(handles,labels)
I then tested the same figure using the follow code for a proxy handle and it worked. So it looks like the lack of handles is the problem.
p = plt.Rectangle((0, 0), 1, 1, fc="r")
theplot.legend([p], ['test'])
So now the question is, how can I generate a variable number of proxy handles that match the colors of my stack plot?
This is the final (cleaner) approach to getting the legend. Since there are no handles, I generate proxy artists for each line. It's theoretically capable of handling cases where colors are reused, but it'll be confusing.
def plot_version_data(title, dates, versions, version_ys, savename=None):
    print("Prepping plot for \"{0}\"".format(title))
    fig = plt.figure()
    theplot = fig.add_subplot(111)
    # Our x axis is going to be the dates, but we need them as numbers
    x = [date2num(date) for date in dates]
    # Use these colors
    colormap = "bgrcmy"
    theplot.stackplot(x, version_ys, colors=colormap)
    # Make some proxy artists for the legend
    p = []
    i = 0
    for _ in versions:
        p.append(plt.Rectangle((0, 0), 1, 1, fc=colormap[i]))
        i = (i + 1) % len(colormap)
    theplot.legend(p, versions)
    theplot.set_ylabel(versions)  # Cheating way to handle the legend
    theplot.set_title(title)
    # Setup the X axis - rotate to keep from overlapping, display like Oct-16,
    # make sure there's no random whitespace on either end
    plt.xticks(rotation=315)
    theplot.set_xticks(x)
    theplot.set_xticklabels([date.strftime("%b-%d") for date in dates])
    plt.xlim(x[0], x[-1])
    if savename:
        print("Saving output as \"{0}\"".format(savename))
        fig.savefig(os.path.join(sys.path[0], savename))
    else:
        plt.show()
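For what it's worth, stackplot also accepts a labels= keyword in modern matplotlib, which attaches a label to each stacked PolyCollection and gives legend() real handles, so the proxy rectangles become unnecessary. A minimal sketch with toy data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(5)
ys = np.array([[1, 2, 3, 2, 1],
               [2, 2, 2, 2, 2]])

fig, ax = plt.subplots()
# labels= labels each stacked layer, so get_legend_handles_labels()
# is no longer empty and legend() just works.
ax.stackplot(x, ys, labels=["v1.0", "v2.0"])
ax.legend(loc="upper left")
```

With this, the colors in the legend automatically match the stack layers, sidestepping the color-reuse confusion noted above.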