I would like myval to show the name of each car for each aggregated year, e.g. "chevrolet chevelle malibu".
The [object Object] thing appears to be JavaScript related.
import altair as alt
from vega_datasets import data
import pandas as pd
import pdb
df = data.cars()
alt.renderers.enable("altair_viewer")
mychart = (
    alt.Chart(df)
    .transform_joinaggregate(count="count(*)", myval="values(Name)", groupby=["Year"])
    .mark_bar()
    .encode(
        x=alt.X(
            "Year",
            timeUnit=alt.TimeUnitParams(unit="year", step=1),
            type="quantitative",
        ),
        y=alt.Y("count", type="quantitative"),
        tooltip=alt.Tooltip(["myval:N"]),
    )
)
mychart.show()
This is a great question, and I'm not sure there's a satisfactory answer. This is displayed as [object Object], [object Object], etc. because the values aggregate returns a list containing the entire row for each value. So the full representation would be something like this:
[{'Name': 'chevrolet chevelle malibu', 'Miles_per_Gallon': 18.0, 'Cylinders': 8, 'Displacement': 307.0, 'Horsepower': 130.0, 'Weight_in_lbs': 3504, 'Acceleration': 12.0, 'Year': 1970, 'Origin': 'USA'}, {'Name': 'buick skylark 320', 'Miles_per_Gallon': 15.0, 'Cylinders': 8, 'Displacement': 350.0, 'Horsepower': 165.0, 'Weight_in_lbs': 3693, 'Acceleration': 11.5, 'Year': 1970, 'Origin': 'USA'}, ...]
and those are just the first two entries! So clearly it won't really fit in a tooltip. For what it's worth, newer versions of Vega improve on this (which you can see by viewing the equivalent chart in the vega editor) but it's still not what you're looking for.
What you need is a way to extract just the name from each value in the list... and I'm not sure that Vega-Lite transforms provide any good way to do that (the Vega expression language does not have anything that resembles list comprehensions or function mapping).
The best I can think of is something like this, to display, say, the first 4 values:
mychart = (
    alt.Chart(df)
    .transform_joinaggregate(count="count(*)", myval="values(Name)", groupby=["Year"])
    .transform_calculate(
        first_val="datum.myval[0].Name",
        second_val="datum.myval[1].Name",
        third_val="datum.myval[2].Name",
        fourth_val="datum.myval[3].Name",
    )
    .mark_bar()
    .encode(
        x=alt.X(
            "Year",
            timeUnit=alt.TimeUnitParams(unit="year", step=1),
            type="quantitative",
        ),
        y=alt.Y("count", type="quantitative"),
        tooltip=alt.Tooltip(["first_val:N", "second_val:N", "third_val:N", "fourth_val:N"]),
    )
)
Another option would be, instead of using a tooltip, to use a second chart that updates on mouseover:
base = (
    alt.Chart(df)
    .transform_joinaggregate(count="count(*)", values="values(Name)", groupby=["Year"])
)
selection = alt.selection_single(fields=['Year'], on='mouseover', empty='none')
bars = (
    base
    .mark_bar()
    .encode(
        x=alt.X(
            "Year:N",
            timeUnit=alt.TimeUnitParams(unit="year", step=1),
            type="quantitative",
        ),
        y=alt.Y("count", type="quantitative"),
    )
).add_selection(selection)
text = (
    base
    .transform_filter(selection)
    .transform_flatten(['values'])
    .transform_calculate(Name="datum.values.Name")
    .mark_text()
    .encode(
        y=alt.Y('Name:N', axis=None),
        text='Name:N'
    )
).properties(width=300)
chart2 = bars | text
I'd be interested to see if anyone knows of a more complete solution.
I'm trying to use the warm start annotation in MiniZinc to give a known suboptimal solution to a model.
I started by trying to execute this warm start example from the MiniZinc documentation (the only one they provide):
array[1..3] of var 0..10: x;
array[1..3] of var 0.0..10.5: xf;
var bool: b;
array[1..3] of var set of 5..9: xs;
constraint b+sum(x)==1;
constraint b+sum(xf)==2.4;
constraint 5==sum( [ card(xs[i]) | i in index_set(xs) ] );
solve
  :: warm_start_array( [                     %%% Can be on the upper level
    warm_start( x, [<>,8,4] ),               %%% Use <> for missing values
    warm_start( xf, array1d(-5..-3, [5.6,<>,4.7] ) ),
    warm_start( xs, array1d( -3..-2, [ 6..8, 5..7 ] ) )
  ] )
  :: seq_search( [
    warm_start_array( [                      %%% Now included in seq_search to keep order
      warm_start( x, [<>,5,2] ),             %%% Repeated warm_starts allowed but not specified
      warm_start( xf, array1d(-5..-3, [5.6,<>,4.7] ) ),
      warm_start( xs, array1d( -3..-2, [ 6..8, 5..7 ] ) )
    ] ),
    warm_start( [b], [true] ),
    int_search(x, first_fail, indomain_min)
  ] )
  minimize x[1] + b + xf[2] + card( xs[1] intersect xs[3] );
The example runs, and it finds the optimal solution. However, the output displays warnings stating that all the warm start annotations were ignored.
Warning, ignored search annotation: warm_start_array([warm_start([[xi(1), xi(2)], [i(5), i(2)]]), warm_start([[xf(0), xf(2)], [f(5.6), f(4.7)]]), warm_start([[xs(0), xs(1), xs(2)], [s(), s()]])])
Warning, ignored search annotation: warm_start([[xb(0)], [b(true)]])
Warning, ignored search annotation: warm_start_array([warm_start([[xi(1), xi(2)], [i(8), i(4)]]), warm_start([[xf(0), xf(2)], [f(5.6), f(4.7)]]), warm_start([[xs(0), xs(1), xs(2)], [s(), s()]])])
I didn't modify anything in the example; I just copy-pasted it and ran it in the MiniZinc IDE with the default Gecode solver. In case it is relevant, I'm using Windows. I have run other models and used other search annotations without problems.
The example has two blocks of warm starts (one after solve and one inside seq_search). I'm not sure whether both are necessary. I tried removing one, then the other, but the warnings still appear for all the remaining warm start annotations. Also, I don't get why 'b' isn't referenced in the first block.
There is a similar example on GitHub at https://github.com/google/or-tools/issues/539 but it also produces the warnings.
If someone could point me to a working example of warm_start, it would be great.
Your usage of the warm_start annotations is correct, but warm start annotations are currently not supported by most solvers. At the time of writing, I believe they are only supported by the Mixed Integer Programming interfaces (COIN-BC, Gurobi, CPLEX, Xpress, and SCIP). Although we've been working on adding support for the annotation in Gecode and Chuffed, that support has not been included in any of the released versions.
To solve a 5 parameter model, I need at least 5 data points to get a unique solution. For x and y data below:
import numpy as np
x = np.array([[-0.24155831, 0.37083184, -1.69002708, 1.4578805 , 0.91790011,
0.31648635, -0.15957368],
[-0.37541846, -0.14572825, -2.19695883, 1.01136142, 0.57288752,
0.32080956, -0.82986857],
[ 0.33815532, 3.1123936 , -0.29317028, 3.01493602, 1.64978158,
0.56301755, 1.3958912 ],
[ 0.84486735, 4.74567324, 0.7982888 , 3.56604097, 1.47633894,
1.38743513, 3.0679506 ],
[-0.2752026 , 2.9110031 , 0.19218081, 2.0691105 , 0.49240373,
1.63213241, 2.4235483 ],
[ 0.89942508, 5.09052174, 1.26048572, 3.73477373, 1.4302902 ,
1.91907482, 3.70126468]])
y = np.array([-0.81388378, -1.59719762, -0.08256274, 0.61297275, 0.99359647,
1.11315445])
I used only 6 data points to fit an 8-parameter model (7 slopes and 1 intercept).
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(x, y)
print(lr.coef_)
array([-0.83916772, -0.57249998, 0.73025938, -0.02065629, 0.47637768,
-0.36962192, 0.99128474])
print(lr.intercept_)
0.2978781587718828
Clearly, it's using some kind of assignment to reduce the degrees of freedom. I tried to look into the source code but couldn't find anything about that. What method does it use to find the parameters of an underspecified model?
It doesn't need to reduce the degrees of freedom; it simply finds a solution to the least squares problem min sum_i (dot(beta, x_i) + beta_0 - y_i)**2. In the non-sparse case it uses the linalg.lstsq function from scipy, whose default solver for this optimization problem is the gelsd LAPACK driver. If
A = np.concatenate((ones_v, X), axis=1)
is the augmented array with a column of ones (ones_v) as its first column, then your solution is given by
beta = np.linalg.pinv(A.T @ A) @ A.T @ y
where we use the pseudoinverse precisely because the matrix may not be of full rank. Note that @ is matrix multiplication; with NumPy arrays, * would multiply elementwise. Of course, the solver doesn't actually evaluate this formula directly but uses the singular value decomposition of A to compute the same result.
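As a rough illustration of the point above, the sketch below builds a small underdetermined problem from made-up random data (not the arrays in the question; rng, X, y and A are illustrative names) and checks that the pseudoinverse and np.linalg.lstsq agree on the same minimum-norm solution, which reproduces y exactly. As far as I know, scikit-learn centers the data before solving, so its coefficients need not coincide with this augmented-matrix solution.

```python
import numpy as np

# Made-up data: 6 observations, 7 features, so 8 unknowns with the intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 7))
y = rng.normal(size=6)

# Augment with a leading column of ones for the intercept term.
A = np.concatenate((np.ones((6, 1)), X), axis=1)

# Minimum-norm least squares solution via the pseudoinverse of A.
beta_pinv = np.linalg.pinv(A) @ y

# Same problem through the SVD-based least squares solver.
beta_lstsq, residuals, rank, sing_vals = np.linalg.lstsq(A, y, rcond=None)

print("rank of A:", rank)
print("solutions agree:", np.allclose(beta_pinv, beta_lstsq))
print("fit is exact:", np.allclose(A @ beta_pinv, y))
```

With fewer equations than unknowns there are infinitely many exact fits; both routines pick the one with the smallest Euclidean norm, which is why a unique answer comes back.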
I have a data set that spans a certain length of time, with a data point for each of these time points. I would like to create a much more detailed timescale and set the empty data points to zero. I wrote a piece of code to do this, but it isn't doing what I want it to. I tried a sample case, though, and it seems to work. Both pieces of code are below.
This piece of code does not do what I want it to.
import numpy as np
TD_t = np.array([36000, 36500, 37000, 37500, 38000, 38500, 39000, 39500, 40000, 40500, 41000, 41500, 42000, 42500,
43000, 43500, 44000, 44500, 45000, 45500, 46000, 46500, 47000, 47500, 48000, 48500, 49000, 49500,
50000, 50500, 51000, 51500, 52000, 52500, 53000, 53500, 54000, 54500, 55000, 55500, 56000, 56500,
57000, 57500, 58000, 58500, 59000, 59500, 60000, 60500, 61000, 61500, 62000, 62500, 63000, 63500,
64000, 64500, 65000, 65500, 66000])
TD_d = np.array([-0.05466527, -0.04238242, -0.04477601, -0.02453717, -0.01662798, -0.02548617, -0.02339215,
-0.01186576, -0.0029057 , -0.01094671, -0.0095005 , -0.0190277 , -0.01215644, -0.01997112,
-0.01384497, -0.01610656, -0.01927564, -0.02119056, -0.011634 , -0.00544096, -0.00046568,
-0.0017769 , -0.0007341, 0.00193066, 0.01359107, 0.02054919, 0.01420335, 0.01550565,
0.0132394 , 0.01371563, 0.01959774, 0.0165316 , 0.01881992, 0.01554435, 0.01409003,
0.01898334, 0.02300266, 0.03045158, 0.02869013, 0.0238423 , 0.02902356, 0.02568908,
0.02954539, 0.02537967, 0.02927247, 0.02138605, 0.02815635, 0.02733237, 0.03321588,
0.03063803, 0.03783137, 0.04110955, 0.0451221 , 0.04646263, 0.04472884, 0.04935833,
0.03372911, 0.04031406, 0.04165237, 0.03940343, 0.03805504])
time = np.arange(0, 100001,1)
data = np.zeros_like(time)
for i in range(0, len(TD_t)):
    t = TD_t[i]
    data[t] = TD_d[i]
    print(i, t, TD_d[i], data[t])
But for some reason this code works.
import numpy
nums = numpy.array([0,1,2,3])
data = numpy.zeros_like(nums)
data[0] = nums[2]
data[0], nums[2]
Any help will be much appreciated!!
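It's because np.zeros_like(time) gives data the same integer dtype as time (int64 here), so when you assign a float to one of its elements, the value is truncated to an integer, which is 0 for all of your values. Your second example only appears to work because nums holds integers.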
Try changing the line to:
data = np.zeros_like(time, dtype=float)
and it should work (or pass whatever dtype the TD_d array has, e.g. dtype=TD_d.dtype).
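A minimal sketch of the difference, using a short made-up time axis instead of the full arrays from the question (int_data and float_data are illustrative names):

```python
import numpy as np

time = np.arange(0, 5)  # integer array, like the time axis in the question

# zeros_like copies the dtype of its argument unless told otherwise.
int_data = np.zeros_like(time)
float_data = np.zeros_like(time, dtype=float)

# Assigning a small float: the integer array truncates it, the float array keeps it.
int_data[2] = -0.05466527
float_data[2] = -0.05466527

print(int_data.dtype, int_data[2])
print(float_data.dtype, float_data[2])
```

The exact integer dtype printed depends on the platform, but in either case the stored value in int_data is 0.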
I use RF twice in a row.
First, I fit it using max_features='auto' and the whole dataset (109 features), in order to perform feature selection.
The following is RandomForestClassifier.feature_importances_; it correctly gives me 109 scores, one per feature:
[0.00118087, 0.01268531, 0.0017589 , 0.01614814, 0.01105567,
0.0146838 , 0.0187875 , 0.0190427 , 0.01429976, 0.01311706,
0.01702717, 0.00901344, 0.01044047, 0.00932331, 0.01211333,
0.01271825, 0.0095337 , 0.00985686, 0.00952823, 0.01165877,
0.00193286, 0.0012602 , 0.00208145, 0.00203459, 0.00229907,
0.00242616, 0.00051358, 0.00071606, 0.00975515, 0.00171034,
0.01134927, 0.00687018, 0.00987706, 0.01507474, 0.01223525,
0.01170495, 0.00928417, 0.01083082, 0.01302036, 0.01002457,
0.00894818, 0.00833564, 0.00930602, 0.01100774, 0.00818604,
0.00675784, 0.00740617, 0.00185461, 0.00119627, 0.00159034,
0.00154336, 0.00478926, 0.00200773, 0.00063574, 0.00065675,
0.01104192, 0.00246746, 0.01663812, 0.01041134, 0.01401842,
0.02038318, 0.0202834 , 0.01290935, 0.01476593, 0.0108275 ,
0.0118773 , 0.01050919, 0.0111477 , 0.00684507, 0.01170021,
0.01291888, 0.00963295, 0.01161876, 0.00756015, 0.00178329,
0.00065709, 0. , 0.00246064, 0.00217982, 0.00305187,
0.00061284, 0.00063431, 0.01963523, 0.00265208, 0.01543552,
0.0176546 , 0.01443356, 0.01834896, 0.01385694, 0.01320648,
0.00966011, 0.0148321 , 0.01574166, 0.0167107 , 0.00791634,
0.01121442, 0.02171706, 0.01855552, 0.0257449 , 0.02925843,
0.01789742, 0. , 0. , 0.00379275, 0.0024365 ,
0.00333905, 0.00238971, 0.00068355, 0.00075399]
Then, I transform the dataset using the previous fit, which should reduce its dimensionality, and then I re-fit RF on it.
Given max_features='auto' and the 109 features, I would expect to end up with about 10 features in total; instead, calling rf.feature_importances_ returns more (62):
[ 0.01261971, 0.02003921, 0.00961297, 0.02505467, 0.02038449,
0.02353745, 0.01893777, 0.01932577, 0.01681398, 0.01464485,
0.01672119, 0.00748981, 0.01109461, 0.01116948, 0.0087081 ,
0.01056344, 0.00971319, 0.01532258, 0.0167348 , 0.01601214,
0.01522208, 0.01625487, 0.01653784, 0.01483562, 0.01602748,
0.01522369, 0.01581573, 0.01406688, 0.01269036, 0.00884105,
0.02538574, 0.00637611, 0.01928382, 0.02061512, 0.02566056,
0.02180902, 0.01537295, 0.01796305, 0.01171095, 0.01179759,
0.01371328, 0.00811729, 0.01060708, 0.015717 , 0.01067911,
0.01773623, 0.0169396 , 0.0226369 , 0.01547827, 0.01499467,
0.01356075, 0.01040735, 0.01360752, 0.01754145, 0.01446933,
0.01845195, 0.0190799 , 0.02608652, 0.02095663, 0.02939744,
0.01870901, 0.02512201]
Why? Shouldn't it return just ~10 feature importances?
You misunderstood the meaning of max_features, which is
The number of features to consider when looking for the best split
It is not the number of features kept when transforming the data.
It is the threshold parameter of the transform method that determines which features are kept as the most important ones.
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if available, the object attribute threshold is used. Otherwise, “mean” is used by default.
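To make the quoted rule concrete, here is a small sketch in plain NumPy, using made-up importances rather than the ones from the question. It mirrors the documented rule (keep features whose importance is greater than or equal to the threshold) and is not scikit-learn's own code; importances, keep_mean and top_k are illustrative names. It also shows one way to keep a fixed number of features, which seems to be what the question expected.

```python
import numpy as np

# Made-up importances for 7 features (they sum to 1, as forest importances do).
importances = np.array([0.30, 0.05, 0.20, 0.02, 0.25, 0.08, 0.10])

# "mean" threshold: keep every feature at or above the average importance.
threshold = importances.mean()
keep_mean = np.flatnonzero(importances >= threshold)
print("kept with mean threshold:", keep_mean)

# Keeping a fixed number of features instead: take the k largest importances.
k = 2
top_k = np.sort(np.argsort(importances)[::-1][:k])
print("top", k, "features:", top_k)
```

With a mean threshold the number of surviving features depends on how the importances are distributed, so getting 62 out of 109 rather than about 10 is not surprising.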