Defining metrics when evaluating multiple values per sample - statistics

I have an application that executes a
function foo() {...}
several times during each user session. There are two alternative algorithms that I can implement as the "foo" function, and my goal is to evaluate them based on execution delay using A/B testing.
The number of times foo() is called per user session is variable but will not exceed 10000.
Each value lies in the range [1, 400] milliseconds.
Say the delay values are:
Algo1: [ [12, 30, 20, 40, 280] , [13, 14, 15, 100, 10], [20, 40] , ... ]
Algo2: [ [1, 10, 5, 4, 150] , [14, 10, 20], [21, 33, 41, 79], ... ]
My question is: what's the best metric to pick the winner?
Possible options:
average from each session, then evaluate the CDF
median from each session, then evaluate the CDF
anything else?

One possibility which captures both mean performance and volatility (variability) is quadratic loss: ℓ = (Y - τ)², where the Y's are the individual outcomes and τ is a desired target value (in your case zero). Calculate the average loss across all observations for each of your algorithms, which estimates the expected loss E[ℓ], then pick the algorithm with the smallest average loss.
It's straightforward to show that in expectation E[ℓ] = (E[Y] - τ)² + σ²_Y, where σ²_Y is the variance of Y. In other words, quadratic loss has two components:
how far the expected value of the Y's is from your target τ; and
how variable the Y's are.
Low loss is achieved by being consistently close to the target. With a target of zero, this means you're getting values that on average are close to zero and aren't subject to large discrepancies. Either large means or large variances will inflate the loss, so minimum loss requires both aspects to perform well simultaneously.
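Here is a minimal sketch of that metric in Python, using the sample sessions from the question (the helper name mean_quadratic_loss is mine, not from any library):

import numpy as np

# Per-session delay samples from the question (truncated to the values shown).
algo1 = [[12, 30, 20, 40, 280], [13, 14, 15, 100, 10], [20, 40]]
algo2 = [[1, 10, 5, 4, 150], [14, 10, 20], [21, 33, 41, 79]]

def mean_quadratic_loss(sessions, tau=0.0):
    """Estimate E[(Y - tau)^2] by averaging over all observations, pooled across sessions."""
    values = np.concatenate([np.asarray(s, dtype=float) for s in sessions])
    return np.mean((values - tau) ** 2)

# The algorithm with the smaller average loss wins.
print(mean_quadratic_loss(algo1))
print(mean_quadratic_loss(algo2))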

Related

Scatter plot linear trend does not match data analysis toolpak

When I create a scatterplot of my data, and go to Add Trendline..., the trendline that I get is y = 0.5425x + 12.205
When I run the same data set through the Data Analysis Toolpak (Regression), I get a trendline of y = 1.65333x - 17.26667
Aren't these two things supposed to be the same, except perhaps for rounding? What are some common causes of this issue? I've already checked to make sure all of my data values are included in both.
Edit: here is the data set (y is the first column, x is the second; can't get this to format properly in stackoverflow)
y: 3, 4, 8, 7, 15, 25, 35, 45, 60, 80
x: 5, 10, 15, 20, 25, 30, 35, 40, 45, 50
Edit (update): I verified by hand, and the results of the Data Analysis Toolpak are correct; the trendline on the scatter plot is incorrect.
I found the source of the error: the Data Analysis Toolpak understood that I had the columns ordered (y, x) (i.e., y was in column A and x was in column B); however, the scatter plot did not. So the scatter plot was doing x vs y instead of y vs x.
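A quick way to verify this explanation (a sketch using numpy, not part of the original post): regressing y on x reproduces the Toolpak line, and regressing x on y (axes swapped) reproduces the chart trendline.

import numpy as np

y = np.array([3, 4, 8, 7, 15, 25, 35, 45, 60, 80])
x = np.array([5, 10, 15, 20, 25, 30, 35, 40, 45, 50])

print(np.polyfit(x, y, 1))  # ~[1.65333, -17.26667], the Toolpak result
print(np.polyfit(y, x, 1))  # ~[0.5425, 12.20], the chart's swapped-axes trendline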

Value Error when trying to create a dictionary with lists as values

I am having issues creating a dictionary that assigns a list of multiple values to each key. Currently the data is in a list of lists of two items, category and value, for example:
sample_data = [["January", 9],["Februrary", 10], ["June", 12], ["March", 15], ["January", 10],["June", 14], ["March", 16]]
It has to be transformed into a dictionary like this:
d = {"January" : [9,10], "February":[10], "June":[12,14], "March": [15,16]}
This is my current code:
from collections import defaultdict

d = defaultdict(list)
for category, value in sample_data:
    d[category].append(value)
This works for small samples, but with very large samples of data it raises a ValueError saying too many values to unpack. Is there any way I could improve on this code, or is there another way of doing this?
The setdefault method creates an empty list as the value for a missing key, so a plain dict works (combining setdefault with defaultdict, as in your code, would be redundant):
d = {}
for category, value in sample_data:
    d.setdefault(category, []).append(value)
Output:
{'January': [9, 10], 'Februrary': [10], 'June': [12, 14], 'March': [15, 16]}
Note: I do not have a larger sample set to work with, but the setdefault() method could possibly help with that.
One way to solve this is probably to change the code to accept more than one value per row. This is just a guess, but it could be a problem in your data (e.g. one particular month has 2+ data points showing up in a single row).
Note: *value means the loop variable can take multiple values (more than one).
Without the * before value, it can only take one value at a time. That is why you got the error "too many values to unpack...".
Because the sample data is not complete enough to show the exact error point, there is probably another issue with the data, but this could help you eliminate the earlier error, or narrow it down.
from collections import defaultdict

data = [["January", 9], ["Februrary", 10], ["June", 12],
        ["March", 15], ["January", 10], ["June", 14], ["March", 16],
        ['April', 20, 21, 22]]  # <--- add April & 3 values (to reproduce the earlier error)

# Desired: d = {"January": [9, 10], "February": [10], "June": [12, 14],
#               "March": [15, 16]}

dc = defaultdict(list)
for category, *value in data:  # *value to accept multiple values
    dc[category].append(value)
print(dc)
output:
defaultdict(<class 'list'>, {'January': [[9], [10]], 'Februrary': [[10]], 'June': [[12], [14]], 'March': [[15], [16]], 'April': [[20, 21, 22]]})
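One caveat with the output above (my note, not in the original answer): append(value) stores each row's values as a nested list, while the question's desired result has flat lists. Using extend instead gives the flat structure:

from collections import defaultdict

dc = defaultdict(list)
for category, *values in data:  # data as defined in the snippet above
    dc[category].extend(values)
print(dc)  # values are now flat lists: 'January': [9, 10], ..., 'April': [20, 21, 22]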

Confidence thresholds on mean average precision calculation

Are there any rules for PR-curve thresholds? sklearn.metrics.average_precision automatically derives the thresholds from the probabilities/confidences, which can give weird results if I have inputs like this:
y_true = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
y_scores = np.array([ 0.7088982, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
It will output mAP = 0.93333. The sklearn implementation gets that number because it automatically uses [0.7088982, 0] as the thresholds. When the probability threshold is 0, all of the zero scores are counted as positive, resulting in a high mAP. Is this correct behavior?
A couple of considerations on your example:
the peculiarity of your y_scores, having only two distinct values, determines the number of thresholds. As you can see from the source code, and as you might logically infer, the thresholds are the distinct values in y_scores.
your argument is then correct, and implicit in what a threshold represents: if the score is greater than or equal to the threshold, the instance is assigned to the positive class. Therefore, in the case score = threshold = 0 you'll have true positives only, based on your y_true (and in turn the average precision is a weighted mean of the precisions achieved at each threshold).
Have a look also here to observe that
Precision values such that element i is the precision of predictions with score >= thresholds[i]
and
Decreasing recall values such that element i is the recall of predictions with score >= thresholds[i]
I'd also suggest having a look here to get a glimpse of how precision, recall, and the thresholds are computed within precision_recall_curve().
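To see this concretely, here's a small sketch reproducing the numbers from the question with scikit-learn (exact array formatting may vary by version):

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0] + [1] * 14)
y_scores = np.array([0.7088982] + [0] * 14)

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
print(thresholds)  # [0.        0.7088982] -- one threshold per distinct score
print(precision)   # [0.93333... 0. 1.]
print(recall)      # [1. 0. 0.]

# AP = sum over thresholds of (R_n - R_{n-1}) * P_n = 1 * 14/15 = 0.93333...
print(average_precision_score(y_true, y_scores))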

What n_estimators and max_features means in RandomForestRegressor

I was reading about fine-tuning the model using GridSearchCV and I came across the parameter grid shown below:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
Here I am not getting the concept of n_estimators and max_features. Does n_estimators mean the number of records taken from the data, and max_features the number of attributes to be selected from the data?
After going further I got this result:
>> grid_search.best_params_
{'max_features': 8, 'n_estimators': 30}
So the thing is, I am not getting what this result actually wants to say.
After reading the documentation for RandomForestRegressor, you can see that n_estimators is the number of trees to be used in the forest. Since random forest is an ensemble method that builds multiple decision trees, this parameter controls the number of trees used in the process.
max_features, on the other hand, determines the maximum number of features to consider while looking for a split. For more information on max_features read this answer.
n_estimators: This is the number of trees you want to build before taking the maximum vote or the average of the predictions (each tree is fit on a sample of the data, and their predictions are aggregated to give the final answer). A higher number of trees gives better performance but makes your code slower.
max_features: The number of features to consider when looking for the best split.
>> grid_search.best_params_ gives {'max_features': 8, 'n_estimators': 30}
This means these are the best hyperparameter values the search found among n_estimators in {3, 10, 30} and max_features in {2, 4, 6, 8}; you should run your model with them.
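For reference, here's a self-contained sketch of the same search (make_regression is a synthetic stand-in for the housing data, which isn't shown in the question):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for housing_prepared / housing_labels.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
grid_search = GridSearchCV(RandomForestRegressor(random_state=42),
                           param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X, y)
print(grid_search.best_params_)  # keys match the grid: 'max_features', 'n_estimators'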

Which Support Vectors returned in Multiclass SVM SKLearn

By default, SKLearn uses a One-vs-One classification scheme when training SVMs in the multiclass case.
I'm a bit confused as to which support vectors you're getting when you call attributes such as svm.n_support_ or svm.support_vectors_. For instance, in the case of the iris dataset, there are 3 classes, so there should be a total of 3*(3-1)/2 = 3 different SVM classifiers built. Which classifier's support vectors are you getting back?
Update: dual_coef_ is the key, giving you the coefficients of the support vectors in the decision function. "Each of the support vectors is used in n_class - 1 classifiers. The n_class - 1 entries in each row correspond to the dual coefficients for these classifiers."
Take a look at the very end of section 1.4.1.1; the table explains it clearly: http://scikit-learn.org/stable/modules/svm.html#multi-class-classification
The implementation details are very confusing to me as well; the coefficients of the support vectors in the multiclass decision function are non-trivial.
But here is the rule of thumb I use whenever I want to look into the specific properties of the chosen support vectors:
y[svm.support_]
outputs:
array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
This way you get to know (maybe for debugging purposes) which support vector corresponds to which class. And of course you can check support vectors:
X[svm.support_]
My intuition here is that, as its name indicates, each pairwise classifier takes support vectors from the two categories involved. Say we have 3 categories A, B and C:
A vs. B --> it gives you several support vectors from A and B (a, a, a, b, b, ...)
A vs. C --> same: a, a, a, c, c, c, c (maybe some 'a' are repeated from before)
B vs. C --> likewise
So svm.support_vectors_ returns all the support vectors, but how they are used in decision_function is still tricky to me, as I'm not sure whether it could use, for example, support vectors from A vs. B when evaluating the pair A vs. C; I couldn't find implementation details (http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html#sklearn.multiclass.OneVsOneClassifier.decision_function)
All support vectors of all 3 classifiers.
Look at svm.support_.shape: it is (45,).
19 + 19 + 7 = 45. It all adds up.
Also, if you look at svm.support_vectors_.shape, it will be (45, 4), i.e. [n_SV, n_features]. Again this makes sense, because we have 45 support vectors and 4 features in the iris data set.
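A short sketch for inspecting these attributes yourself (the exact support-vector counts depend on the kernel and C you choose):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
svm = SVC().fit(X, y)  # one-vs-one under the hood for 3 classes

print(svm.n_support_)              # support-vector count per class
print(svm.support_vectors_.shape)  # (n_SV, 4): all SVs pooled across classifiers
print(svm.dual_coef_.shape)        # (n_classes - 1, n_SV)
print(y[svm.support_])             # class label of each support vector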
