For the Friedman test, what are the pros and cons of different post hoc comparison methods?

scikit_posthocs lists several post hoc pairwise comparison methods to use when the Friedman test gives a significant result. How do I decide which one to use? In my case, the toy dataset df that I have is:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.array([[ 581.125,  366.25 ,  144.08 ,  296.575],
                   [1743.4  ,  900.85 ,  338.925,  481.6  ],
                   [4335.75 , 2000.1  ,  816.75 , 1236.75 ],
                   [ 241.925,  540.35 ,  249.9  ,  167.5  ],
                   [2822.4  , 1261.35 ,  547.025,  626.05 ],
                   [ 568.32 ,  652.55 ,  277.34 ,  265.68 ]]),
    columns=['Both', 'Media Advertisement', 'Neither', 'Store Events'],
    index=['Hardware Accessories', 'Pants', 'Shoes', 'Shorts', 'Sweatshirts',
           'T-Shirts']
)
I first ran:
from scipy.stats import friedmanchisquare
print(friedmanchisquare(*df.T.values))
This gives me
FriedmanchisquareResult(statistic=12.200000000000003, pvalue=0.006728522930461357)
Now I need to do post hoc analysis, I first ran:
import scikit_posthocs as sp
sp.posthoc_conover_friedman(df).style.applymap(lambda x: "background-color: yellow" if x < 0.05 else "")
It gives me a pairwise p-value matrix with the significant (p < 0.05) cells highlighted in yellow.
I am confused about which post hoc analysis approach I should choose to meet my analytic need. In the scikit_posthocs API, there are posthoc_nemenyi_friedman, posthoc_conover_friedman, posthoc_siegel_friedman and posthoc_miller_friedman to choose from. I tried them all, and they returned conclusions that differ slightly from one another. Statistically speaking, which approach is appropriate for my case?
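For reference, here is a minimal sketch (reusing the df defined above) that runs all four candidate functions so their p-value matrices can be compared side by side:
import scikit_posthocs as sp

methods = {
    "nemenyi": sp.posthoc_nemenyi_friedman,
    "conover": sp.posthoc_conover_friedman,
    "siegel": sp.posthoc_siegel_friedman,
    "miller": sp.posthoc_miller_friedman,
}
for name, func in methods.items():
    print(name)
    print(func(df).round(4))  # pairwise p-value matrix for this method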

How do I show all values in an aggregated tooltip?

I would like myval to show the name of each car for each aggregated year, e.g. "chevrolet chevelle malibu".
The [object Object] thing appears to be JavaScript related.
import altair as alt
from vega_datasets import data
import pandas as pd
import pdb

df = data.cars()
alt.renderers.enable("altair_viewer")
mychart = (
    alt.Chart(df)
    .transform_joinaggregate(count="count(*)", myval="values(Name)", groupby=["Year"])
    .mark_bar()
    .encode(
        x=alt.X(
            "Year",
            timeUnit=alt.TimeUnitParams(unit="year", step=1),
            type="quantitative",
        ),
        y=alt.Y("count", type="quantitative"),
        tooltip=alt.Tooltip(["myval:N"]),
    )
)
mychart.show()
This is a great question, and I'm not sure there's a satisfactory answer. The reason this is displayed as [object Object], [object Object], etc. is because the values aggregate returns a list of the entire row for each value. So the full representation would be something like this:
[{'Name': 'chevrolet chevelle malibu', 'Miles_per_Gallon': 18.0, 'Cylinders': 8, 'Displacement': 307.0, 'Horsepower': 130.0, 'Weight_in_lbs': 3504, 'Acceleration': 12.0, 'Year': 1970, 'Origin': 'USA'}, {'Name': 'buick skylark 320', 'Miles_per_Gallon': 15.0, 'Cylinders': 8, 'Displacement': 350.0, 'Horsepower': 165.0, 'Weight_in_lbs': 3693, 'Acceleration': 11.5, 'Year': 1970, 'Origin': 'USA'}, ...]
and those are just the first two entries! So clearly it won't really fit in a tooltip. For what it's worth, newer versions of Vega improve on this (which you can see by viewing the equivalent chart in the vega editor) but it's still not what you're looking for.
What you need is a way to extract just the name from each value in the list... and I'm not sure that Vega-Lite transforms provide any good way to do that (the Vega expression language does not have anything that resembles list comprehensions or function mapping).
The best I can think of is something like this, to display, say, the first 4 values:
mychart = (
    alt.Chart(df)
    .transform_joinaggregate(count="count(*)", myval="values(Name)", groupby=["Year"])
    .transform_calculate(
        first_val="datum.myval[0].Name",
        second_val="datum.myval[1].Name",
        third_val="datum.myval[2].Name",
        fourth_val="datum.myval[3].Name",
    )
    .mark_bar()
    .encode(
        x=alt.X(
            "Year",
            timeUnit=alt.TimeUnitParams(unit="year", step=1),
            type="quantitative",
        ),
        y=alt.Y("count", type="quantitative"),
        tooltip=alt.Tooltip(["first_val:N", "second_val:N", "third_val:N", "fourth_val:N"]),
    )
)
Another option would be, instead of using a tooltip, to use a second chart that updates on mouseover:
base = (
    alt.Chart(df)
    .transform_joinaggregate(count="count(*)", values="values(Name)", groupby=["Year"])
)
selection = alt.selection_single(fields=['Year'], on='mouseover', empty='none')
bars = (
    base
    .mark_bar()
    .encode(
        x=alt.X(
            "Year:N",
            timeUnit=alt.TimeUnitParams(unit="year", step=1),
            type="quantitative",
        ),
        y=alt.Y("count", type="quantitative"),
    )
).add_selection(selection)
text = (
    base
    .transform_filter(selection)
    .transform_flatten(['values'])
    .transform_calculate(Name="datum.values.Name")
    .mark_text()
    .encode(
        y=alt.Y('Name:N', axis=None),
        text='Name:N'
    )
).properties(width=300)
chart2 = bars | text
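As with the first chart, calling show() on the concatenated chart (using the same altair_viewer renderer enabled above) should display the linked view:
chart2.show()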
I'd be interested to see if anyone knows of a more complete solution.

How to use warm starts in Minizinc?

I'm trying to use the warm start annotation in Minizinc to give a known suboptimal solution to a model.
I started by trying to execute this warm start example from the Minizinc documentation (the only one they provide):
array[1..3] of var 0..10: x;
array[1..3] of var 0.0..10.5: xf;
var bool: b;
array[1..3] of var set of 5..9: xs;
constraint b+sum(x)==1;
constraint b+sum(xf)==2.4;
constraint 5==sum( [ card(xs[i]) | i in index_set(xs) ] );
solve
:: warm_start_array( [ %%% Can be on the upper level
warm_start( x, [<>,8,4] ), %%% Use <> for missing values
warm_start( xf, array1d(-5..-3, [5.6,<>,4.7] ) ),
warm_start( xs, array1d( -3..-2, [ 6..8, 5..7 ] ) )
] )
:: seq_search( [
warm_start_array( [ %%% Now included in seq_search to keep order
warm_start( x, [<>,5,2] ), %%% Repeated warm_starts allowed but not specified
warm_start( xf, array1d(-5..-3, [5.6,<>,4.7] ) ),
warm_start( xs, array1d( -3..-2, [ 6..8, 5..7 ] ) )
] ),
warm_start( [b], [true] ),
int_search(x, first_fail, indomain_min)
] )
minimize x[1] + b + xf[2] + card( xs[1] intersect xs[3] );
The example runs, and it gets the optimal solution. However, the output displays warnings stating that all the warm start annotations were ignored.
Warning, ignored search annotation: warm_start_array([warm_start([[xi(1), xi(2)], [i(5), i(2)]]), warm_start([[xf(0), xf(2)], [f(5.6), f(4.7)]]), warm_start([[xs(0), xs(1), xs(2)], [s(), s()]])])
Warning, ignored search annotation: warm_start([[xb(0)], [b(true)]])
Warning, ignored search annotation: warm_start_array([warm_start([[xi(1), xi(2)], [i(8), i(4)]]), warm_start([[xf(0), xf(2)], [f(5.6), f(4.7)]]), warm_start([[xs(0), xs(1), xs(2)], [s(), s()]])])
I didn't modify anything in the example; I just copy-pasted it and ran it in the Minizinc IDE with the default Gecode solver. In case it is relevant, I'm using Windows. I have run other models and used other search annotations without problems.
In the example there are two blocks of warm starts (one after solve and one inside seq_search). I'm not sure if both are necessary. I tried removing one, then the other, but the warnings still appear for all the remaining warm start annotations. Also, I don't get why 'b' isn't referred to in the first block.
There is a similar example on GitHub (https://github.com/google/or-tools/issues/539), but it also produces the warnings.
If someone could point me to a working example of warm_start, it would be great.
Your usage of the warm_start annotations is correct, but warm start annotations are currently not supported by most solvers. At the time of writing, I believe the warm start annotations are only supported by the Mixed Integer Programming interfaces (CoinBC, Gurobi, CPLEX, Xpress, and SCIP). Although we've been working on adding support for the annotation in Gecode and Chuffed, that support has not been included in any released version.

How does sklearn.linear_model.LinearRegression work with insufficient data?

To solve a 5 parameter model, I need at least 5 data points to get a unique solution. For x and y data below:
import numpy as np
x = np.array([[-0.24155831, 0.37083184, -1.69002708, 1.4578805 , 0.91790011,
0.31648635, -0.15957368],
[-0.37541846, -0.14572825, -2.19695883, 1.01136142, 0.57288752,
0.32080956, -0.82986857],
[ 0.33815532, 3.1123936 , -0.29317028, 3.01493602, 1.64978158,
0.56301755, 1.3958912 ],
[ 0.84486735, 4.74567324, 0.7982888 , 3.56604097, 1.47633894,
1.38743513, 3.0679506 ],
[-0.2752026 , 2.9110031 , 0.19218081, 2.0691105 , 0.49240373,
1.63213241, 2.4235483 ],
[ 0.89942508, 5.09052174, 1.26048572, 3.73477373, 1.4302902 ,
1.91907482, 3.70126468]])
y = np.array([-0.81388378, -1.59719762, -0.08256274, 0.61297275, 0.99359647,
1.11315445])
I used only 6 data points to fit an 8 parameter model (7 slopes and 1 intercept).
from sklearn.linear_model import LinearRegression
lr = LinearRegression().fit(x, y)
print(lr.coef_)
array([-0.83916772, -0.57249998, 0.73025938, -0.02065629, 0.47637768,
-0.36962192, 0.99128474])
print(lr.intercept_)
0.2978781587718828
Clearly, it's using some kind of assumption to reduce the degrees of freedom. I tried to look into the source code but couldn't find anything about that. What method does it use to find the parameters of an underspecified model?
You don't need to reduce the degrees of freedom: it simply finds a solution to the least squares problem min over (beta, beta_0) of sum_i (dot(beta, x_i) + beta_0 - y_i)**2. For example, in the non-sparse case it uses scipy.linalg.lstsq, and the default solver for this optimization problem is the gelsd LAPACK driver. If
A = np.concatenate((np.ones((x.shape[0], 1)), x), axis=1)
is the augmented array with ones as its first column, then your solution is given by
beta = np.linalg.pinv(A.T @ A) @ A.T @ y
where we use the pseudoinverse precisely because the matrix may not be of full rank. Of course, the solver doesn't actually evaluate this formula directly; it uses a singular value decomposition of A instead.
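To make this concrete, here is a minimal sketch (reusing x and y from the question) showing that both scikit-learn's fit and the pseudoinverse solution reproduce y essentially exactly; the coefficient vectors themselves may differ, since many parameter vectors solve this underdetermined problem:
import numpy as np
from sklearn.linear_model import LinearRegression

lr = LinearRegression().fit(x, y)

A = np.concatenate((np.ones((x.shape[0], 1)), x), axis=1)  # prepend a column of ones
beta = np.linalg.pinv(A.T @ A) @ A.T @ y                   # pseudoinverse solution

print(np.allclose(lr.predict(x), y))  # expected True: sklearn fits the 6 points exactly
print(np.allclose(A @ beta, y))       # expected True: so does the pseudoinverse solution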

Numpy Array value setting issues

I have a data set that spans a certain length of time, with a data point for each of these time points. I would like to create a much more detailed timescale and fill the empty data points with zeros. I wrote a piece of code to do this, but it isn't doing what I want it to. I tried a sample case, though, and it seems to work. Below are the two pieces of code.
This piece of code does not do what I want it to.
import numpy as np
TD_t = np.array([36000, 36500, 37000, 37500, 38000, 38500, 39000, 39500, 40000, 40500, 41000, 41500, 42000, 42500,
43000, 43500, 44000, 44500, 45000, 45500, 46000, 46500, 47000, 47500, 48000, 48500, 49000, 49500,
50000, 50500, 51000, 51500, 52000, 52500, 53000, 53500, 54000, 54500, 55000, 55500, 56000, 56500,
57000, 57500, 58000, 58500, 59000, 59500, 60000, 60500, 61000, 61500, 62000, 62500, 63000, 63500,
64000, 64500, 65000, 65500, 66000])
TD_d = np.array([-0.05466527, -0.04238242, -0.04477601, -0.02453717, -0.01662798, -0.02548617, -0.02339215,
-0.01186576, -0.0029057 , -0.01094671, -0.0095005 , -0.0190277 , -0.01215644, -0.01997112,
-0.01384497, -0.01610656, -0.01927564, -0.02119056, -0.011634 , -0.00544096, -0.00046568,
-0.0017769 , -0.0007341, 0.00193066, 0.01359107, 0.02054919, 0.01420335, 0.01550565,
0.0132394 , 0.01371563, 0.01959774, 0.0165316 , 0.01881992, 0.01554435, 0.01409003,
0.01898334, 0.02300266, 0.03045158, 0.02869013, 0.0238423 , 0.02902356, 0.02568908,
0.02954539, 0.02537967, 0.02927247, 0.02138605, 0.02815635, 0.02733237, 0.03321588,
0.03063803, 0.03783137, 0.04110955, 0.0451221 , 0.04646263, 0.04472884, 0.04935833,
0.03372911, 0.04031406, 0.04165237, 0.03940343, 0.03805504])
time = np.arange(0, 100001, 1)
data = np.zeros_like(time)
for i in range(0, len(TD_t)):
    t = TD_t[i]
    data[t] = TD_d[i]
    print(i, t, TD_d[i], data[t])
But for some reason this code works.
import numpy
nums = numpy.array([0,1,2,3])
data = numpy.zeros_like(nums)
data[0] = nums[2]
data[0], nums[2]
Any help will be much appreciated!!
It's because the dtype of data is being set to int64, so when you try to assign one of the data elements, the float value gets cast to an integer and truncated to zero.
Try changing the line to:
data = np.zeros_like(time, dtype=float)
and it should work (or use whatever dtype the TD_d array is)
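For what it's worth, here is a quick sketch of the behaviour (with hypothetical toy values):
import numpy as np

time = np.arange(5)
data_int = np.zeros_like(time)    # inherits the integer dtype of time
data_int[0] = 0.7
print(data_int[0])                # 0 -- the float was truncated

data_float = np.zeros_like(time, dtype=float)
data_float[0] = 0.7
print(data_float[0])              # 0.7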

Random Forest feature importance: how many are actually used?

I use RF twice in a row.
First, I fit it using max_features='auto' and the whole dataset (109 features), in order to perform feature selection.
The following is RandomForestClassifier.feature_importances_, which correctly gives me one score for each of the 109 features:
[0.00118087, 0.01268531, 0.0017589 , 0.01614814, 0.01105567,
0.0146838 , 0.0187875 , 0.0190427 , 0.01429976, 0.01311706,
0.01702717, 0.00901344, 0.01044047, 0.00932331, 0.01211333,
0.01271825, 0.0095337 , 0.00985686, 0.00952823, 0.01165877,
0.00193286, 0.0012602 , 0.00208145, 0.00203459, 0.00229907,
0.00242616, 0.00051358, 0.00071606, 0.00975515, 0.00171034,
0.01134927, 0.00687018, 0.00987706, 0.01507474, 0.01223525,
0.01170495, 0.00928417, 0.01083082, 0.01302036, 0.01002457,
0.00894818, 0.00833564, 0.00930602, 0.01100774, 0.00818604,
0.00675784, 0.00740617, 0.00185461, 0.00119627, 0.00159034,
0.00154336, 0.00478926, 0.00200773, 0.00063574, 0.00065675,
0.01104192, 0.00246746, 0.01663812, 0.01041134, 0.01401842,
0.02038318, 0.0202834 , 0.01290935, 0.01476593, 0.0108275 ,
0.0118773 , 0.01050919, 0.0111477 , 0.00684507, 0.01170021,
0.01291888, 0.00963295, 0.01161876, 0.00756015, 0.00178329,
0.00065709, 0. , 0.00246064, 0.00217982, 0.00305187,
0.00061284, 0.00063431, 0.01963523, 0.00265208, 0.01543552,
0.0176546 , 0.01443356, 0.01834896, 0.01385694, 0.01320648,
0.00966011, 0.0148321 , 0.01574166, 0.0167107 , 0.00791634,
0.01121442, 0.02171706, 0.01855552, 0.0257449 , 0.02925843,
0.01789742, 0. , 0. , 0.00379275, 0.0024365 ,
0.00333905, 0.00238971, 0.00068355, 0.00075399]
Then I transform the dataset using the previous fit, which should reduce its dimensionality, and I re-fit the RF on it.
Given max_features='auto' and the 109 features, I would expect to end up with ~10 features in total; instead, calling rf.feature_importances_ returns more (62):
[ 0.01261971, 0.02003921, 0.00961297, 0.02505467, 0.02038449,
0.02353745, 0.01893777, 0.01932577, 0.01681398, 0.01464485,
0.01672119, 0.00748981, 0.01109461, 0.01116948, 0.0087081 ,
0.01056344, 0.00971319, 0.01532258, 0.0167348 , 0.01601214,
0.01522208, 0.01625487, 0.01653784, 0.01483562, 0.01602748,
0.01522369, 0.01581573, 0.01406688, 0.01269036, 0.00884105,
0.02538574, 0.00637611, 0.01928382, 0.02061512, 0.02566056,
0.02180902, 0.01537295, 0.01796305, 0.01171095, 0.01179759,
0.01371328, 0.00811729, 0.01060708, 0.015717 , 0.01067911,
0.01773623, 0.0169396 , 0.0226369 , 0.01547827, 0.01499467,
0.01356075, 0.01040735, 0.01360752, 0.01754145, 0.01446933,
0.01845195, 0.0190799 , 0.02608652, 0.02095663, 0.02939744,
0.01870901, 0.02512201]
Why? Shouldn't it return just ~10 feature importances?
You misunderstood the meaning of max_features, which is
The number of features to consider when looking for the best split
It is not the number of features kept when transforming the data.
It is the threshold argument of the transform method that determines which features are kept:
threshold : string, float or None, optional (default=None)
The threshold value to use for feature selection. Features whose importance is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if available, the object attribute threshold is used. Otherwise, “mean” is used by default.
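In current scikit-learn versions, the same threshold-based selection is exposed through SelectFromModel rather than a transform method on the forest itself. A minimal sketch, assuming a fitted RandomForestClassifier rf and the original 109-column feature matrix X:
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(rf, threshold="mean", prefit=True)  # "mean" mirrors the default behaviour
X_reduced = selector.transform(X)
print(X_reduced.shape)  # keeps only features whose importance is >= the mean importance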
