Weka, SVM technique output - svm

I am working with SVM in Weka.
I have some data and I am using SVM (with different values of C) to analyze it. The output has totally confused me, so I am hoping for some help.
This is the output for a polynomial kernel of degree 1:
Scheme:weka.classifiers.meta.CVParameterSelection -P "C 0.001 10.0 5.0" -X 10 -S 1 -W weka.classifiers.functions.SMO -- -C 0.7 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
Classifier Options: -C 7.5003 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
=== Summary ===
Correctly Classified Instances 83 79.0476 %
Incorrectly Classified Instances 22 20.9524 %
Kappa statistic 0.6555
Mean absolute error 0.0487
Root mean squared error 0.1549
Relative absolute error 91.5633 %
Root relative squared error 100.2828 %
Total Number of Instances 105
This is the output for a polynomial kernel of degree 2:
Scheme:weka.classifiers.meta.CVParameterSelection -P "C 0.001 10.0 5.0" -X 10 -S 1 -W weka.classifiers.functions.SMO -- -C 0.7 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 2.0"
Classifier Options: -C 2.5008 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 2.0"
=== Summary ===
Correctly Classified Instances 87 82.8571 %
Incorrectly Classified Instances 18 17.1429 %
Kappa statistic 0.7236
Mean absolute error 0.0486
Root mean squared error 0.1547
Relative absolute error 91.4748 %
Root relative squared error 100.1845 %
Total Number of Instances 105
This is the output for a gaussian kernel and gamma value of 1.0:
Scheme:weka.classifiers.meta.CVParameterSelection -P "C 0.001 10.0 5.0" -X 10 -S 1 -W weka.classifiers.functions.SMO -- -C 0.7 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 1.0"
Classifier Options: -C 2.5008 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 1.0"
=== Summary ===
Correctly Classified Instances 87 82.8571 %
Incorrectly Classified Instances 18 17.1429 %
Kappa statistic 0.721
Mean absolute error 0.0486
Root mean squared error 0.1547
Relative absolute error 91.4571 %
Root relative squared error 100.1651 %
Total Number of Instances 105
These are my questions:
Why do the classifiers at “Classifier Options: -C xxxxx” have different values of C? I think it is related to the values of C that I am checking with “C 0.001 10.0 5.0”, but I may be drawing the wrong conclusions. I think that with “C 0.001 10.0 5.0” it tries the values 0.001, 0.01, 0.1, 1.0 and 10.0. If that is true, why are the values of C 7.5003 and 2.5008?
The percentage of Correctly Classified Instances is very similar across all of the runs, and I don't understand why. What does it mean? If it were 12% I would think SVM was not a suitable technique (the data was not linearly separable), but with about 80% (with little or no variation after changing the parameters) I don't know what to think.
If I check the data in Weka's “Visualize” tab, it doesn't seem linearly separable, but because SVM maps the data into a higher number of dimensions, I don't think I can draw conclusions from the visualization.

1) Yes, you use CVParameterSelection, which tries different parameter values, in your case from 0.001 to 10 in 5 steps. The steps are 0.001 + k * (10 - 0.001)/4 for k = 0, 1, 2, 3, 4, because your first step is already defined as 0.001. If you round the following values to four decimal places, you see that they fit the reported 2.5008 and 7.5003.
Step 1) 0.001
Step 2) 2.50075
Step 3) 5.0005
Step 4) 7.50025
Step 5) 10.0
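The step arithmetic above can be checked with a short sketch. It assumes, as the answer describes, that CVParameterSelection spaces the candidates linearly between the lower and upper bound. The function name cv_parameter_grid is made up for this illustration and is not part of Weka.

```python
def cv_parameter_grid(lower, upper, steps):
    """Return `steps` linearly spaced candidates from lower to upper (inclusive)."""
    width = (upper - lower) / (steps - 1)
    return [lower + k * width for k in range(steps)]


# "C 0.001 10.0 5.0" -> lower bound 0.001, upper bound 10.0, 5 steps
candidates = cv_parameter_grid(0.001, 10.0, 5)
for k, c in enumerate(candidates, start=1):
    print(f"Step {k}) {c:.5f}")
```

The values Weka reports, 2.5008 and 7.5003, are the second and fourth candidates rounded to four decimal places.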
2) If you had 12% accuracy, there would be something really strange going on. Random classification (assuming balanced data) would result in about 50% for two classes, or 1/k for k classes. I cannot tell you whether 82% is a good result because I don't know your data.
3) About linear separability: that is exactly why the SVM is so good. An SVM with the right kernel implicitly maps your data into a higher-dimensional feature space to get rid of the non-separability. An RBF kernel even corresponds to an infinite-dimensional feature space. That is why a perfect linear separation of the training data is always possible in that space (as long as no two identical points carry different labels). At this point overfitting can occur. To avoid that and reach good generalisation, you have the complexity parameter C.
You can read more about the last point in this good Stack Exchange post:
https://stats.stackexchange.com/questions/80398/how-can-svm-find-an-infinite-feature-space-where-linear-separation-is-always-p
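As a small illustration of the point about feature spaces, here is a sketch using the XOR pattern, which no straight line separates in two dimensions. Adding the product x1*x2 as a third feature (one of the terms a degree-2 polynomial kernel provides implicitly) makes it linearly separable. The weights below were picked by hand for this example; they are not what SMO would learn.

```python
# XOR-style data: label +1 when exactly one coordinate is 1, else -1.
points = [((0, 0), -1), ((1, 0), +1), ((0, 1), +1), ((1, 1), -1)]


def feature_map(x1, x2):
    """Explicit map into 3 dimensions by adding the product term x1*x2."""
    return (x1, x2, x1 * x2)


def decision(x1, x2, w=(1.0, 1.0, -2.0), b=-0.5):
    """Linear decision function in the mapped space (hand-picked w and b)."""
    phi = feature_map(x1, x2)
    return sum(wi * pi for wi, pi in zip(w, phi)) + b


predictions = [1 if decision(*x) > 0 else -1 for x, _ in points]
print(predictions)
```

Every point lands on the correct side of the plane in the mapped space, even though the original two features alone cannot be split by a line.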


OpenCV Training Custom Haar Cascade

I am trying to train a Haar cascade on my face. I have everything set up, including the positives, the negatives, the vec file, etc., but when I run opencv_traincascade, it fails with terminate called after throwing an instance of 'std::bad_alloc'. Then I added -nonsym -mem 512 to my arguments and it gave me this error: terminate called after throwing an instance of 'std::logic_error'.
Here is the command I am running:
opencv_traincascade -data classifier -vec samples.vec -bg negatives.txt\
> -numStages 20 -minHitRate 0.999 -maxFalseAlarmRate 0.5 -numPos 1000\
> -numNeg 600 -w 80 -h 40 -mode ALL -precalcValBufSize 1024\
> -precalcIdxBufSize 1024\
> -nonsym\
> -mem 512\
Any help would be greatly appreciated!
You have to get rid of -nonsym -mem 512 and instead use -mode ALL. So the new command looks like this:
opencv_traincascade -data classifier -vec samples.vec -bg negatives.txt\
> -numStages 20 -minHitRate 0.999 -maxFalseAlarmRate 0.5 -numPos 1000\
> -numNeg 600 -w 80 -h 40 -mode ALL -precalcValBufSize 1024\
> -precalcIdxBufSize 1024\
> -mode ALL
The -nonsym and -mem options don't actually exist in opencv_traincascade.

Best hyperparameters from GridSearch give inverted results compared to the actual SMOreg prediction results

I am a beginner in machine learning, and I have a question regarding the results of a grid search to find the best hyperparameters C and gamma for SMOreg.
I am running a grid search to find the C and gamma that give me the highest correlation. I am using GridSearch as shown below:
weka.classifiers.meta.GridSearch -E CC -y-property kernel.gamma -y-min -3.0 -y-max 3.0 -y-step 1.0 -y-base 10.0 -y-expression pow(BASE,I) -x-property C -x-min -3.0 -x-max 3.0 -x-step 1.0 -x-base 10.0 -x-expression pow(BASE,I) -sample-size 100.0 -traversal ROW-WISE -log-file "D:\Program Files\Weka-3-9-4" -num-slots 1 -S 1 -W weka.classifiers.functions.SMOreg -- -C 1.0 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1" -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01"
As shown above, I set the evaluation to CC (correlation coefficient), the X property to C and the Y property to kernel.gamma, with both C and gamma ranging over 10^-3, 10^-2, 10^-1, 1, 10, 100, and 1000. Note that I use SMOreg as the base classifier and RBFKernel as the kernel function. Furthermore, I set the evaluation metrics to Correlation, MAE, RMSE, and MAPE.
The hyperparameter output of the GridSearch is shown below:
weka.classifiers.meta.GridSearch:
Classifier: weka.classifiers.functions.SMOreg -C 1.0 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1" -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 10.0"
X property: C
Y property: kernel.gamma
Evaluation: Correlation coefficient
Coordinates: [0.0, 1.0]
Values: 1.0 (X coordinate), 10.0 (Y coordinate)
This result means that the best value for C is 1 and the best gamma is 10.
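To make the search grid concrete, the following sketch enumerates the values generated by the expression pow(BASE,I) with the min, max, step and base settings from the command above, and maps the reported coordinates back to values. It only reproduces the arithmetic of the options and does not call Weka; the helper name axis_values is invented here.

```python
def axis_values(minimum, maximum, step, base):
    """Map each exponent I in [minimum, maximum] to pow(base, I)."""
    count = int(round((maximum - minimum) / step)) + 1
    exponents = [minimum + i * step for i in range(count)]
    return {e: base ** e for e in exponents}


c_axis = axis_values(-3.0, 3.0, 1.0, 10.0)      # X property: C
gamma_axis = axis_values(-3.0, 3.0, 1.0, 10.0)  # Y property: kernel.gamma

print(sorted(c_axis.values()))
print(len(c_axis) * len(gamma_axis), "combinations")

# Read the reported coordinates [0.0, 1.0] as exponents (X, Y);
# this matches the "Values" line in the GridSearch output.
x_coord, y_coord = 0.0, 1.0
print("C =", c_axis[x_coord], "gamma =", gamma_axis[y_coord])
```

Each axis has 7 values, so the search evaluates 49 combinations of C and gamma.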
At the end of the GridSearch result, the following summary data is shown:
=== Summary ===
Correlation coefficient 0.945
Mean absolute percentage error 0.1434
Mean absolute error 3.8465
Root mean squared error 5.5417
Total Number of Instances 309
This implies that, with C=1 and gamma=10, the prediction result should be the one shown in the summary above.
However, when I run SMOreg on the same dataset with the best parameters from the GridSearch (C=1 and gamma=10), I get different results, as shown below:
The classifier:
weka.classifiers.functions.SMOreg -C 1.0 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1" -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 10.0"
The output:
=== Summary ===
Correlation coefficient 0.9417
Mean absolute percentage error 0.1381
Mean absolute error 3.7672
Root mean squared error 5.7114
Total Number of Instances 309
The funny thing is, if I swap the best hyperparameters (C=10, gamma=1), I get exactly the same result as the GridSearch prediction:
The classifier:
weka.classifiers.functions.SMOreg -C 10.0 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1" -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 1.0"
The output:
=== Summary ===
Correlation coefficient 0.945
Mean absolute percentage error 0.1434
Mean absolute error 3.8465
Root mean squared error 5.5417
Total Number of Instances 309
Does anyone know why this happens?
One more thing: in the GridSearch, the upper bound of the search space for both C and gamma is 1000. However, I tried C=100 and gamma=1, and it gave a higher correlation coefficient than the best hyperparameters according to the GridSearch (C=1 and gamma=10). Why didn't the GridSearch give the best result?
Thank you.

Weka command line attributes arguments

On the command line, I'm able to get this rolling with no problem:
java weka.Run weka.classifiers.timeseries.WekaForecaster -W
"weka.classifiers.functions.MultilayerPerceptron -L 0.01 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 20 " -t "C:\MyFile.arff" -F DirectionNumeric -L 1 -M 3 -prime 3 -horizon 6 -holdout 100 -G TradeDay -dayofweek -weekend -future
But once I try to add the skip list, I start to get errors saying that a date is missing and is not in the skip list, even though the date is in fact on it:
java weka.Run weka.classifiers.timeseries.WekaForecaster -W "weka.classifiers.functions.MultilayerPerceptron -L 0.01 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 20 " -t "C:\MyFile.arff" -F DirectionNumeric -L 1 -M 3 -prime 3 -horizon 6 -holdout 100 -G TradeDay -dayofweek -weekend -future -skip ""2014-06-07@yyyy-MM-dd, 2014-06-12"
Does anybody know how to get this working? Weka is low on documentation as far as I know.
Thanks in advance!
Forget it, I got it. The problem was that the 's' must be capitalized:
-Skip
instead of
-skip.

perf: How to check processes running on a particular CPU

Is there any option in perf to look into the processes running on a particular CPU/core, and what percentage of that core is taken by each process?
Reference links would be helpful.
perf is intended for profiling, which is not a good fit for your case. You may try sampling /proc/sched_debug (if it is compiled into your kernel). For example, you can check which process is currently running on each CPU:
egrep '^R|cpu#' /proc/sched_debug
cpu#0, 917.276 MHz
R egrep 2614 37730.177313 ...
cpu#1, 917.276 MHz
R bash 2023 218715.010833 ...
Using its PID as a key, you can check how much CPU time in milliseconds it has consumed:
grep se.sum_exec_runtime /proc/2023/sched
se.sum_exec_runtime : 279346.058986
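Because se.sum_exec_runtime is a running total, a percentage needs two samples taken some interval apart. A minimal sketch of that arithmetic follows. The helper names are invented for this illustration, and the second sample value is made up to show the calculation.

```python
import re


def parse_sum_exec_runtime(sched_text):
    """Extract se.sum_exec_runtime (milliseconds) from /proc/<pid>/sched text."""
    match = re.search(r"^se\.sum_exec_runtime\s*:\s*([0-9.]+)", sched_text, re.M)
    if match is None:
        raise ValueError("se.sum_exec_runtime not found")
    return float(match.group(1))


def cpu_percent(runtime_before_ms, runtime_after_ms, interval_s):
    """CPU share over the interval, as a percentage of one core."""
    return (runtime_after_ms - runtime_before_ms) / (interval_s * 1000.0) * 100.0


before = parse_sum_exec_runtime("se.sum_exec_runtime : 279346.058986")
after = parse_sum_exec_runtime("se.sum_exec_runtime : 279846.058986")  # hypothetical
print(f"{cpu_percent(before, after, 1.0):.1f}%")
```

On a live system the two texts would come from reading /proc/<pid>/sched twice with a sleep in between. Note that this total covers the process on all CPUs it ran on, not a single core.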
However, as @BrenoLeitão mentioned, SystemTap is quite useful for your task. Here is a script for it.
global cputimes;
global cmdline;
global oncpu;
global NS_PER_SEC = 1000000000;
probe scheduler.cpu_on {
    oncpu[pid()] = local_clock_ns();
}
probe scheduler.cpu_off {
    if (oncpu[pid()] == 0)
        next;
    cmdline[pid()] = cmdline_str();
    cputimes[pid(), cpu()] <<< local_clock_ns() - oncpu[pid()];
    delete oncpu[pid()];
}
probe timer.s(1) {
    printf("%6s %3s %6s %s\n", "PID", "CPU", "PCT", "CMDLINE");
    foreach([pid+, cpu] in cputimes) {
        cpupct = @sum(cputimes[pid, cpu]) * 10000 / NS_PER_SEC;
        printf("%6d %3d %3d.%02d %s\n", pid, cpu,
               cpupct / 100, cpupct % 100, cmdline[pid]);
    }
    delete cputimes;
}
It traces the moments when a process starts running on a CPU and when it stops executing there (due to migration or sleeping) by attaching to the scheduler.cpu_on and scheduler.cpu_off probes. The second probe calculates the time difference between these events and saves it to the cputimes aggregation, along with the process command line.
timer.s(1) fires once per second; it walks over the aggregation and calculates the percentages. Here is sample output on CentOS 7 with bash running an infinite loop:
0 0 100.16
30 1 0.00
51 0 0.00
380 0 0.02 /usr/bin/python -Es /usr/sbin/tuned -l -P
2016 0 0.08 sshd: root@pts/0 "" "" "" ""
2023 1 100.11 -bash
2630 0 0.04 /usr/libexec/systemtap/stapio -R stap_3020c9e7ba76838179be68cd2390a10c_2630 -F3
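The script keeps the percentage in integer arithmetic: it scales the summed nanoseconds to hundredths of a percent and then splits the result into whole and fractional parts when printing. The same arithmetic in a short sketch (the function name is made up, and the first input value is chosen to reproduce the 100.16 shown above):

```python
NS_PER_SEC = 1_000_000_000


def format_cpu_pct(total_ns):
    """Format on-CPU nanoseconds within one second as 'PPP.FF' percent."""
    cpupct = total_ns * 10000 // NS_PER_SEC  # hundredths of a percent
    return "%3d.%02d" % (cpupct // 100, cpupct % 100)


print(format_cpu_pct(1_001_600_000))  # slightly over one second on CPU
print(format_cpu_pct(200_000))        # 0.2 ms on CPU
```

Working in hundredths of a percent avoids floating point, which suits SystemTap's integer-only arithmetic.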
I understand that perf is not the proper way to do this, although you can restrict perf to specific CPUs using perf record -C <cpulist> or perf stat -C <cpulist>.
The closest thing you are going to see is the context-switch event, but this is not going to give you the application names at all.
I think you are going to need something more powerful, such as SystemTap.

Segmentation fault while using bash script to generate mobility file

I am using a bash script to generate mobility files (setdest) in ns2 for various seeds, but I am running into a troublesome segmentation fault. Any help would be appreciated. setdest.cc has been modified, so it's not the standard ns2 file.
I will walk you through the problem.
This code in a shell script returns the segmentation fault.
#! /bin/sh
setdest="/root/ns-allinone-2.1b9a/ns-2.1b9a/indep-utils/cmu-scen-gen/setdest/setdest_mesh_seed_mod"
let nn="70" #Number of nodes in the simulation
let time="900" #Simulation time
let x="1000" #Horizontal dimensions
let y="1000" #Vertical dimensions
for speed in 5
do
    for pause in 10
    do
        for seed in 1 5
        do
            echo -e "\n"
            echo Seed = $seed Speed = $speed Pause Time = $pause
            chmod 700 $setdest
            $setdest -n $nn -p $pause -s $speed -t $time -x $x -y $y -l 1 -m 50 > scen-mesh-n$nn-seed$seed-p$pause-s$speed-t$time-x$x-y$y
        done
    done
done
The error is:
scengen_mesh: line 21: 14144 Segmentation fault $setdest -n $nn -p $pause -s $speed -t $time -x $x -y $y -l 1 -m 50 >scen-mesh-n$nn-seed$seed-p$pause-s$speed-t$time-x$x-y$y
Line 21 is the last line of the shell script (done).
The strange thing is that if I run the same setdest command in the terminal, there is no problem, for example:
$setdest -n 70 -p 10 -s 5 -t 900 -x 1000 -y 1000 -l 1 -m 50
I have worked out where exactly the problem is: it's the -l argument. If I remove that argument in the shell script, there is no problem. Now I will walk you through the modified setdest.cc, where this argument comes from.
This modified setdest file uses a text file, initpos, to read the XY coordinates of static nodes for a wireless mesh topology. The relevant lines of code are:
FILE *fp_loc;
int locinit;
fp_loc = fopen("initpos","r");
while ((ch = getopt(argc, argv, "r:m:l:n:p:s:t:x:y:i:o")) != EOF) {
switch (ch) {
case 'l':
locinit = atoi(optarg);
break;
default:
usage(argv);
exit(1);
if(locinit)
fscanf(fp_loc,"%lf %lf",&position.X, &position.Y);
if (position.X == -1 && position.Y == -1){
position.X = uniform() * MAXX;
position.Y = uniform() * MAXY;
}
What I don't get is this.
In the shell script:
- option -l, if given 0, returns no error,
- but if given any other value (I mostly used 1), it returns this segmentation fault.
In the terminal:
- no segmentation fault with any value, 0 or 1.
It surely has something to do with the shell script. I am baffled as to what is going wrong where!
Your help will be highly appreciated.
Cheers
