Underscore Aggregate Hashes by Keys - node.js

I work on a machine learning application, and I use underscore.js when I need to operate on arrays and hashes.
In ML there is a cross-validation approach, where you need to calculate performance over several folds.
For each fold, I have a hash of performance metrics, like the following:
{ 'F1': 0.8,
  'Precision': 0.7,
  'Recall': 0.9
}
I push all the hashes into an array, so at the end I have an array of hashes, like the following:
[ { 'F1': 0.8,
    'Precision': 0.7,
    'Recall': 0.9
  },
  { 'F1': 0.5,
    'Precision': 0.6,
    'Recall': 0.4
  },
  { 'F1': 0.4,
    'Precision': 0.3,
    'Recall': 0.4
  }
]
At the end I want to calculate the average for each parameter of the hash, i.e. sum all the hashes parameter by parameter and then divide each sum by the number of folds, in my case 3.
Is there an elegant way to do this with underscore and JavaScript?
One important point: sometimes I need to do this aggregation when the hash for each fold looks like the following:
{
  label1: { 'F1': 0.8,
    'Precision': 0.7,
    'Recall': 0.9
  },
  label2: { 'F1': 0.8,
    'Precision': 0.7,
    'Recall': 0.9
  },
  ...
}
The task is the same: the average of F1, Precision, and Recall for every label across all folds.
Currently I have an ugly solution that runs over the whole hash several times. I would appreciate any help, thank you.

If it is an array, just use the array. If it is not an array, use _.values to turn it into one and use that. Then, we can do a fold (or reduce) over the data:
_.reduce(data, function(memo, obj) {
  return {
    F1: memo.F1 + obj.F1,
    Precision: memo.Precision + obj.Precision,
    Recall: memo.Recall + obj.Recall,
    count: memo.count + 1
  };
}, {F1: 0, Precision: 0, Recall: 0, count: 0})
This returns a hash containing F1, Precision, and Recall, which are sums, and count, which is the number of objects. It should be pretty easy to get an average from those.
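For example, a minimal sketch of that last step, assuming the reduce result above is stored in a variable named sums (the variable names here are just for illustration):
var sums = /* result of the _.reduce above */;
var averages = {
  F1: sums.F1 / sums.count,
  Precision: sums.Precision / sums.count,
  Recall: sums.Recall / sums.count
};
For the label-keyed variant, one way (again only a sketch) is to run the same reduce once per label, e.g. with _.pluck:
// folds is the array of label-keyed hashes, one per fold
var labels = _.keys(folds[0]);
var perLabelAverages = {};
_.each(labels, function(label) {
  var sums = _.reduce(_.pluck(folds, label), function(memo, obj) {
    return {
      F1: memo.F1 + obj.F1,
      Precision: memo.Precision + obj.Precision,
      Recall: memo.Recall + obj.Recall,
      count: memo.count + 1
    };
  }, {F1: 0, Precision: 0, Recall: 0, count: 0});
  perLabelAverages[label] = {
    F1: sums.F1 / sums.count,
    Precision: sums.Precision / sums.count,
    Recall: sums.Recall / sums.count
  };
});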

Related

How to exclude starting point from linspace function under numpy in python?

I want to exclude the starting point from an array in Python using numpy. How can I do this? For example, I want to exclude 0 but continue from the very next value (i.e. run from greater than 0) in the following code: x = np.linspace(0, 2, 10)
Kind of an old question, but I thought I'd share my solution to the problem.
Assuming you want to get an array
[0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.]
you can make use of the endpoint option in np.linspace() and reverse the direction:
x = np.linspace(2, 0, 10, endpoint=False)[::-1]
The [::-1] reverses the array so that it ends up in the desired order.
x = np.linspace(0, 2, 10)[1:]  # remove the first element by indexing
print(x)
[0.22222222 0.44444444 0.66666667 0.88888889 1.11111111 1.33333333
1.55555556 1.77777778 2. ]

How to interpret the return result of precisionByThreshold in Spark?

I am aware of the concept behind the precisionByThreshold method, but when I use Spark ML to implement binary classification with linear regression and print out the analysis result of precisionByThreshold, I get results like this:
Threshold: 1.0, Precision: 0.7158351409978309
Threshold: 0.0, Precision: 0.22472464244616144
Why are there only two thresholds? And when the threshold is 1.0, no sample should be classified as positive, so the precision should be 0. Can anybody explain this result to me and tell me how to add more thresholds? Thanks in advance.

Storing numbers up to 2 decimal places in Mongodb?

I am working on a financial project where I am trying to store the amount for my transactions; the price can have up to two decimal places. I am trying to choose a schema type for my amount field. I first thought of Number with roundTo: 2; another option is to store them as the NumberDecimal type.
Now, since my prices only go up to 2 decimal places, should I stick with the default Number with roundTo: 2, or can there be issues where the decimals get rounded off?
Also, is there any difference in the number of bytes needed to store the values as Number versus NumberDecimal?
Thanks
Use NumberDecimal, of course. One should never use regular floating-point numbers for money (they can't represent most decimal values exactly).
Demonstration:
db.numbers.insert({fp: 0.1, dec: NumberDecimal('0.1')})
db.numbers.insert({fp: 0.2, dec: NumberDecimal('0.2')})
db.numbers.aggregate([
  {
    $group: {
      _id: 1,
      total_fp: { $sum: "$fp" },
      total_dec: { $sum: "$dec" }
    }
  }
])
// { "_id" : 1, "total_fp" : 0.30000000000000004, "total_dec" : NumberDecimal("0.3") }

Stacked bar chart overwhelmed by one value

I'm using JQuery Flot with the stacking plugin to create a stacked bar chart of revenue over time from a number of different sources. The problem I'm running into is that there is one timepoint where one source gained revenue a couple of orders of magnitude larger than any other in the entire chart.
The end result is that this value dominates the graph, shrinking all the other bars to an unusable height. I can set a max height on the graph, but then you lose the ability to visualize the outlying value.
Is there any best practice in data visualization to address a situation like this? Some flot option/plugin that could help? Or a library that would handle the situation in a way better than flot?
I'm not certain about financial plots, but in scientific plotting, a logarithmically scaled axis is frequently used to emphasize data closer to 0. Unlike a linear axis, where each equally spaced tick represents an increase of +N, on a log-scaled axis each equally spaced tick represents an order-of-magnitude increase. The simplest case is where the axis goes 0.1, 1, 10, 100, 1000, etc.
For instance here's the same bar graph with a linear and log scaled axis:
Here's the flot code I used to generate this (fiddle here):
$(function() {
  var series = {
    data: [[0, 0.1], [1, 1], [2, 10], [3, 100000]],
    lines: {show: false},
    bars: {show: true}
  };
  $.plot("#linear", [ series ]);
  $.plot("#log", [ series ], {
    yaxis: {
      min: 0.1,
      max: 150000,
      ticks: [[0.1, "0.1"], 1, 10, 100, 1000, 10000],
      transform: function (v) {
        return (v == 0) ? Math.log(0.0001) : Math.log(v);
      },
      inverseTransform: function (v) {
        return Math.exp(v);
      }
    }
  });
});

Mean absolute error of each tree in Random Forest

I am using the evaluation class of Weka for the mean absolute error of each generated tree in the random forest. The explanation says that it "Refers to the error of the predicted values for numeric classes, and the error of the predicted probability distribution for nominal classes."
Can someone explain it in simple words, or perhaps with an example?
The mean absolute error is an indication of how close your predictions are, on average, to the actual values of the test data.
For numerical classes this is easy to think about.
Example:
True values: {0, 1, 4}
Predicted values: {1, 3, 1}
Differences: {-1, -2, 3} (subtract predicted from true)
Absolute differences: {1, 2, 3}
Mean Absolute Difference: (1+2+3)/3 = 2
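As a quick sanity check, here's a small JavaScript sketch of the numeric calculation (my own illustration, not Weka's code):
function meanAbsoluteError(trueValues, predicted) {
  var total = 0;
  for (var i = 0; i < trueValues.length; i++) {
    total += Math.abs(trueValues[i] - predicted[i]);
  }
  return total / trueValues.length;
}

meanAbsoluteError([0, 1, 4], [1, 3, 1]); // 2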
For nominal classes a prediction is no longer a single value, but rather the probability distribution of the instance belonging to the different possible classes. The provided example will have two classes.
Example:
Notation: [0.5, 0.5] indicates an instance with 50% chance of belonging to class Y, 50% chance of belonging to class X.
True distributions: { [0,1] , [1,0] }
Predicted distributions: { [0.25, 0.75], [1, 0] }
Differences: { [-0.25, 0.25], [0, 0] }
Absolute differences: { (0.25 + 0.25)/2, (0 + 0)/2 } = {0.25, 0}
Mean absolute difference: (0.25 + 0)/2 = 0.125
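The nominal case can be sketched the same way, with each prediction being an array of class probabilities (again only an illustration, not Weka's implementation):
function meanAbsoluteErrorNominal(trueDists, predictedDists) {
  var total = 0;
  for (var i = 0; i < trueDists.length; i++) {
    var perInstance = 0;
    for (var j = 0; j < trueDists[i].length; j++) {
      perInstance += Math.abs(trueDists[i][j] - predictedDists[i][j]);
    }
    total += perInstance / trueDists[i].length; // average over classes
  }
  return total / trueDists.length; // average over instances
}

meanAbsoluteErrorNominal([[0, 1], [1, 0]], [[0.25, 0.75], [1, 0]]); // 0.125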
You can double check my explanation by visiting the source code for Weka's evaluation class.
Also as a side note, I believe the mean absolute difference reported by Weka for random forest is for the forest as a whole, not the individual trees.
