In Weka 3.8.3, I get different results when probabilityEstimates is set to true or false - SVM

I use the same training data set and testing data set, and I choose Weka classifiers -> functions -> LibSVM with the default parameters.
With the defaults I get this result:
https://imgur.com/aIq90wP
When I set the parameter probabilityEstimates to true, I get this result:
https://imgur.com/NGVY5No
The default parameter settings are these:
https://imgur.com/GOfLnVd
Why am I getting different results?
Maybe it's a silly question but I'll be grateful if someone can answer this.
Thanks!

This seems to be related to the random number process.
I used the same libSVM, all defaults, with diabetes.arff (comes with the software).
Run 1: no probabilityEstimates, 500 correct
Run 2: same, 500 correct
Run 3: probabilityEstimates, 498 correct
Run 4: same, 498 correct (so, with identical parameters, the process replicates)
Run 5: probabilityEstimates, but change seed from 1 to 55, 500 correct.
Run 6: probabilityEstimates, but change seed from 55 to 666, 498 correct.
Run 7: probabilityEstimates, but change seed from 666 to 1492, 499 correct.
For whatever reason, when probabilityEstimates is requested the algorithm needs a different number of random values, or uses them in a different order, resulting in slight perturbations in the number correct. We get the same effect if we change the random number seed (which tells the random number generator where to start).
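Outside the Weka GUI, the same underlying libsvm behaviour can be illustrated with scikit-learn (a rough sketch on synthetic data, not the Weka LibSVM wrapper itself): with probability=True, libsvm fits Platt scaling via an internal cross-validation, which is where the extra randomness comes in.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
for seed in (1, 55, 666):
    # probability=True triggers the internal cross-validation used for Platt scaling,
    # so the fitted model (and its accuracy) can shift slightly with the seed
    clf = SVC(probability=True, random_state=seed).fit(X, y)
    print(seed, clf.score(X, y))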


In HPCC ECL, when running a LOCAL, LOOKUP JOIN, does the RHS dataset get copied to all nodes, or kept distributed due to LOCAL?

Say I have a cluster of 400 machines, and 2 datasets. some_dataset_1 has 100M records, some_dataset_2 has 1M. I then run:
ds1:=DISTRIBUTE(some_dataset_1,hash(field_a));
ds2:=DISTRIBUTE(some_dataset_2,hash(field_b));
Then, I run the join:
j1:=JOIN(ds1,ds2,LEFT.field_a=LEFT.field_b,LOOKUP,LOCAL);
Will the distribution of ds2 "mess up" the join, meaning parts of ds2 will be incorrectly scattered across the cluster leading to low match rate?
Or will the LOOKUP keyword take precedence, so the distributed ds2 gets copied in full to each node, rendering the distribution irrelevant and allowing the join to find all possible matches (as each node will have a full copy of ds2)?
I know I can test this myself and come to my own conclusion, but I am looking for a definitive answer based on the way the language is written to make sure I understand and can use these options correctly.
For reference (from the Language Reference document v 7.0.0):
LOOKUP: Specifies the rightrecset is a relatively small file of lookup records that can be fully copied to every node.
LOCAL: Specifies the operation is performed on each supercomputer node independently, without requiring interaction with all other nodes to acquire data; the operation maintains the distribution of any previous DISTRIBUTE
It seems that with the LOCAL, the join completes more quickly. There does not seem to be a loss of matches on initial trials. I am working with others to run a more thorough test and will post the results here.
First, your code:
ds1:=DISTRIBUTE(some_dataset_1,hash(field_a));
ds2:=DISTRIBUTE(some_dataset_2,hash(field_b));
Since you're intending these results to be used in a JOIN, it is imperative that both datasets are distributed on the "same" data, so that the matching values end up on the same nodes and your JOIN can be done with the LOCAL option. So this will only work correctly if ds1.field_a and ds2.field_b contain the "same" data.
Then, your join code. I assume you've made a typo in this post, because your join code needs to be (to work at all):
j1:=JOIN(ds1,ds2,LEFT.field_a=RIGHT.field_b,LOOKUP,LOCAL);
Using both the LOOKUP and LOCAL options is redundant, because a LOOKUP JOIN is implicitly a LOCAL operation. That means your LOOKUP option does "override" the LOCAL in this instance.
So, all that means that you should either do it this way:
ds1:=DISTRIBUTE(some_dataset_1,hash(field_a));
ds2:=DISTRIBUTE(some_dataset_2,hash(field_b));
j1:=JOIN(ds1,ds2,LEFT.field_a=RIGHT.field_b,LOCAL);
Or this way:
j1:=JOIN(some_dataset_1,some_dataset_2,LEFT.field_a=RIGHT.field_b,LOOKUP);
Because the LOOKUP option does copy the entire right-hand dataset (in memory) to every node, it makes the JOIN implicitly a LOCAL operation and you do not need to do the DISTRIBUTEs. Which way you choose to do it is up to you.
However, I see from your Language Reference version that you may be unaware of the SMART option on JOIN, which in my current Language Reference (8.10.10) says:
SMART -- Specifies to use an in-memory lookup when possible, but use a
distributed join if the right dataset is large.
So you could just do it this way:
j1:=JOIN(some_dataset_1,some_dataset_2,LEFT.field_a=RIGHT.field_b,SMART);
and let the platform figure out which is best.
HTH,
Richard
Thank you, Richard. Yes, I am notorious for typos. I apologize. As I use a lot of legacy code, I have not had a chance to work with the SMART option, but I will certainly keep that in mind for me and the team, so thank you for that!
However, I did run a test to evaluate how the compiler and the platform would handle this scenario. I ran the following code:
sd1:=DATASET(100000,TRANSFORM({unsigned8 num1},SELF.num1 := COUNTER ));
sd2:=DATASET(1000,TRANSFORM({unsigned8 num1, unsigned8 num2},SELF.num1 := COUNTER , SELF.num2 := COUNTER % 10 ));
ds1:=DISTRIBUTE(sd1,hash(num1));
ds4:=DISTRIBUTE(sd1,random());
ds2:=DISTRIBUTE(sd2,hash(num1));
ds3:=DISTRIBUTE(sd2,hash(num2));
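// ds1/ds2 are distributed on the join field (num1), ds3 on the wrong field (num2),
// and ds4 randomly. Join groups: j1x = undistributed inputs, j2x = both sides
// distributed on num1, j3x = RHS distributed on the wrong field, j4x = randomly
// distributed LHS, j5x = as j4x but with HASH added to the LOCAL variants.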
j11:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1 ):independent;
j12:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j13:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j14:=JOIN(sd1,sd2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j21:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1 ):independent;
j22:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j23:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j24:=JOIN(ds1,ds2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j31:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1 ):independent;
j32:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j33:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1, LOCAL):independent;
j34:=JOIN(ds1,ds3,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j41:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1 ):independent;
j42:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j43:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1, LOCAL):independent;
j44:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL):independent;
j51:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1 ):independent;
j52:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP ):independent;
j53:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1, LOCAL,HASH):independent;
j54:=JOIN(ds4,ds2,LEFT.num1=RIGHT.num1,LOOKUP,LOCAL,HASH):independent;
dataset([{count(j11),'11'},{count(j12),'12'},{count(j13),'13'},{count(j14),'14'},
{count(j21),'21'},{count(j22),'22'},{count(j23),'23'},{count(j24),'24'},
{count(j31),'31'},{count(j32),'32'},{count(j33),'33'},{count(j34),'34'},
{count(j31),'41'},{count(j32),'42'},{count(j33),'43'},{count(j44),'44'},
{count(j51),'51'},{count(j52),'52'},{count(j53),'53'},{count(j54),'54'}
] , {unsigned8 num, string lbl});
On a 400 node cluster, the results come back as:
##   num    lbl
1    1000   11
2    1000   12
3    1000   13
4    1000   14
5    1000   21
6    1000   22
7    1000   23
8    1000   24
9    1000   31
10   1000   32
11   12     33
12   12     34
13   1000   41
14   1000   42
15   12     43
16   6      44
17   1000   51
18   1000   52
19   1      53
20   1      54
If you look at row 12 in the result (lbl 34), you will notice the match rate drops substantially, suggesting the compiler does indeed distribute the file (on the wrong hashed field) and disregards the LOOKUP option.
My conclusion is therefore that, as always, it remains the developer's responsibility to ensure the distribution is right ahead of the join, REGARDLESS of which join options are being used.
The manual page could be better. LOOKUP by itself is properly documented, and LOCAL by itself is properly documented. However, they represent two different concepts and can be combined without issue, so that JOIN(,,, LOOKUP, LOCAL) makes sense and can be useful.
It is probably best to consider LOOKUP as a specific kind of JOIN matching algorithm and to consider LOCAL as a way to tell the compiler that you are not a novice and that you are absolutely sure the data is already where it needs to be to accomplish what you intend.
For a normal LOOKUP join the LEFT-hand side doesn't need to be sorted or distributed in any particular way, and the whole RIGHT-hand side is copied to every slave. No matter what join value appears on the LEFT, if there is a matching value on the RIGHT then it will be found, because the whole RIGHT dataset is present.
In a 400-way system with well-distributed join values, IF the LEFT side is distributed on the join value, then the LEFT dataset in each worker only contains 1/400th of the join values and only 1/400th of the values in the RIGHT dataset will ever be matched. Effectively, within each worker, 399/400th of the RIGHT data will be unused.
However, if both the LEFT and RIGHT datasets are distributed on the join value ... and you are not a novice and know that using LOCAL is what you want ... then you can specify a LOOKUP, LOCAL join. The RIGHT data is already where it needs to be. Any join value that appears in the LEFT data will, if the value exists, find a match locally in the RIGHT dataset. As a bonus, the RIGHT data only contains join values that could match ... it is only 1/400th of the LOOKUP-only size.
This enables larger LOOKUP joins. Imagine your 400-way system and a 100GB RIGHT dataset that you would like to use in a LOOKUP join. Copying a 100GB dataset to each slave seems unlikely to work. However, if evenly distributed, a LOOKUP, LOCAL join only requires 250MB of RIGHT data per worker ... which seems quite reasonable.
HTH

PromQL: increase over counter

Here is my current counter value:
method_timed_seconds_count{method="getByUId"} ---> 68
After sending an HTTP request to my service, this counter is incremented:
method_timed_seconds_count{method="getByUId"} ---> 69
I want to get how this counter has increased inside a 30s window, using this:
increase(method_timed_seconds_count{method="getByUId"}[30s]) ---> 2
However, I'm getting value 2!
Why? I was expecting to get 1!
Scrape interval is 15s.
Any ideas?
Prometheus has the following issues with increase() calculations:
It extrapolates increase() results - see this issue.
It doesn't take into account the increase between the last raw sample before the specified lookbehind window in square brackets and the first raw sample inside the lookbehind window. See this design doc for details.
It misses the increase for the first raw sample in a time series. For example, if a time series starts from 5 and has the following samples: 5 6 9 12, then increase over these samples would return something around 12-5=7 instead of the expected 12.
That's why it isn't recommended to use increase() in Prometheus for calculating the exact counter increases.
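To see roughly where the 2 comes from in the question above, here is a sketch of the extrapolation arithmetic (a simplification of what Prometheus actually does, assuming a 15s scrape interval and only the two raw samples 68 and 69 inside the 30s window):

raw_increase = 69 - 68                       # = 1 between the two raw samples
sample_span = 15                             # seconds covered by those samples
window = 30                                  # the [30s] lookbehind window
print(raw_increase * window / sample_span)   # 2.0 - the extrapolated result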
P.S. If you need to calculate the exact counter increases over the specified lookbehind window, then try VictoriaMetrics - a Prometheus-like monitoring system I work on. It provides an increase() function which is free from the issues mentioned above.

How to export-import MLFlow experiments with original experiment ID using mlflow-export-import?

I am using https://github.com/mlflow/mlflow-export-import to export experiments from local mlruns/ to an RDS Postgres. However, for each newly imported experiment, the experiment ID is not sequential; it is the sum of all runs and experiments imported before it.
For example:
ID 0: Default, 0 runs
ID 1: Experiment 1, 88 runs
ID 90: Experiment 2, 86 runs
ID 177: Experiment 3, 1 run
ID 179: Experiment 4, 10 runs
Since experiment IDs cannot be set manually, is there a way to change the mlflow-export-import code to use the original experiment ID? Or, at least, to use incremental experiment IDs for each newly imported experiment?
Experiment IDs are auto-generated by MLflow. There is no API to set or change them. In open source MLflow they are monotonically increasing integers, in Databricks MLflow they are UUIDs.
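As a rough illustration with the open source MLflow Python client (the tracking URI below is just a placeholder), the ID is generated by the backend and returned, not supplied by the caller, so downstream code is better off keying on experiment names:

import mlflow

mlflow.set_tracking_uri("postgresql://user:pass@host:5432/mlflow")  # placeholder URI

# The experiment ID is assigned by the tracking backend and returned;
# there is no argument for choosing it yourself.
exp_id = mlflow.create_experiment("Experiment 1")
print(exp_id)

# Look experiments up by name rather than relying on a specific ID:
exp = mlflow.get_experiment_by_name("Experiment 1")
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])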

In the Azure Anomaly Detector API, why does changing the sensitivity parameter not change the detected anomalies in the response?

With reference to the notebook available on the Azure site, I have created an experiment where I am pushing some 5000 records of the parameter. I tried changing the sensitivity from 90 to 25 but I cannot see any changes in the output bokeh plot.
sensitivity = 95
sensitivity = 25
I even checked the JSON that is loaded before calling the Anomaly Detector API, and the sensitivity value is being updated in the required format.
Can you suggest what the reason could be? Where should I look to resolve the issue?
Thanks!
What other parameters did you set? From the figure you attached, it seems that your data is periodic. Can you try setting the period value, and setting maxAnomalyRatio to a small value?
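For reference, a minimal sketch of the kind of request body the v1.0 batch ("entire" series) detection endpoint accepts; the endpoint, key, granularity, period and maxAnomalyRatio values here are placeholder assumptions, so take the real ones from your notebook:

import json
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com/anomalydetector/v1.0/timeseries/entire/detect"
# Stand-in for your ~5000 records; each point is {"timestamp": ..., "value": ...}
points = [{"timestamp": "2021-01-%02dT00:00:00Z" % d, "value": float(d % 7)} for d in range(1, 29)]
body = {
    "series": points,
    "granularity": "daily",     # match your data's actual granularity
    "period": 7,                # set the period explicitly when the data is periodic
    "maxAnomalyRatio": 0.01,    # keep the anomaly ratio small
    "sensitivity": 25,
}
resp = requests.post(endpoint,
                     headers={"Ocp-Apim-Subscription-Key": "<key>",
                              "Content-Type": "application/json"},
                     data=json.dumps(body))
print(resp.json().get("isAnomaly"))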

What factors influence the "Avoids Enormous Network Payloads" message?

According to the official documentation, the target is 1600KB:
https://developers.google.com/web/tools/lighthouse/audits/network-payloads
However, even when the payload is larger than the target, it sometimes still shows as a passed audit:
payload of 2405KB still passes
What are the other conditions that allow large payloads to pass the audit?
Lighthouse scores are 0–100, based on some log-normal distribution math.
1600KB is not a pass threshold; it is approximately the point of a maximum possible score of 100.
As of right now, the values used for the distribution calculation are a 2500KB point of diminishing returns and a 4000KB median, which correspond to scores of about 93 and 50 respectively.
That puts the 2405KB result at a score of ~94, which is sufficient to pass.
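For the curious, here is a small sketch of that log-normal scoring math (not the actual Lighthouse implementation; the 0.93 score at the point of diminishing returns is taken from the figures above):

from math import erf, log, sqrt

def lognormal_score(value, median, podr, podr_score=0.93):
    # score = 1 - CDF of a log-normal whose median is `median` and whose
    # value at the point of diminishing returns `podr` scores `podr_score`
    phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF
    lo, hi = -10.0, 0.0                            # bisect for z0 with phi(z0) = 1 - podr_score
    for _ in range(60):
        mid = (lo + hi) / 2
        if phi(mid) < 1 - podr_score:
            lo = mid
        else:
            hi = mid
    sigma = log(podr / median) / lo                # shape parameter from the podr constraint
    return 1 - phi(log(value / median) / sigma)

print(round(100 * lognormal_score(2405, median=4000, podr=2500)))   # ~94
print(round(100 * lognormal_score(4000, median=4000, podr=2500)))   # 50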
