Weka command line attributes arguments - attributes

On the command line, I'm able to get this rolling with no problem:
java weka.Run weka.classifiers.timeseries.WekaForecaster -W
"weka.classifiers.functions.MultilayerPerceptron -L 0.01 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 20 " -t "C:\MyFile.arff" -F DirectionNumeric -L 1 -M 3 -prime 3 -horizon 6 -holdout 100 -G TradeDay -dayofweek -weekend -future
But once I try to put the skip list, I start to get errors saying that it's missing a date that is not in the skip list even though the date is in fact on it:
java weka.Run weka.classifiers.timeseries.WekaForecaster -W "weka.classifiers.functions.MultilayerPerceptron -L 0.01 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H 20 " -t "C:\MyFile.arff" -F DirectionNumeric -L 1 -M 3 -prime 3 -horizon 6 -holdout 100 -G TradeDay -dayofweek -weekend -future -skip ""2014-06-07#yyyy-MM-dd, 2014-06-12"
Does anybody knows how to get this working? Weka is low on documentation as far as I know.
Thank's in advance!

Forget it. I got it, the problem was the 's' must be in capital letters:
-Skip
instead of
-skip.

Related

Best hyperparameters from GridSearch give inverted results compared to the actual SMOreg prediction results

I am a beginner in Machine Learning, and I have a question regarding the results of Grid Search to find the best hyperparameters of C and gamma for SMOreg.
I am doing Grid Search to find the best C and gamma that will give me the highest correlation. I am using GridSearch as shown below:
weka.classifiers.meta.GridSearch -E CC -y-property kernel.gamma -y-min -3.0 -y-max 3.0 -y-step 1.0 -y-base 10.0 -y-expression pow(BASE,I) -x-property C -x-min -3.0 -x-max 3.0 -x-step 1.0 -x-base 10.0 -x-expression pow(BASE,I) -sample-size 100.0 -traversal ROW-WISE -log-file "D:\Program Files\Weka-3-9-4" -num-slots 1 -S 1 -W weka.classifiers.functions.SMOreg -- -C 1.0 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1" -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 0.01"
As shown above, I set the evaluation into CC (Correlation Coefficient), the XProperty into C and YProperty into kernel.gamma, with both the C and the gamma parameters ranging from 10^-3,10^-2,10^-1,1,10,100, and 1000. It should be noted that I use SMOreg as the base classifier and RBFKernel as the kernel function. Furthermore, I set the evaluation metrics into Correlation, MAE, RMSE, and MAPE.
The hyperparameter output of the GridSearch is shown below:
weka.classifiers.meta.GridSearch:
Classifier: weka.classifiers.functions.SMOreg -C 1.0 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1" -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 10.0"
X property: C
Y property: kernel.gamma
Evaluation: Correlation coefficient
Coordinates: [0.0, 1.0]
Values: 1.0 (X coordinate), 10.0 (Y coordinate)
The meaning of this result is that the best value for C is 1 and the best gamma is 10
At the end of the GridSearch result, the following summary data is shown:
=== Summary ===
Correlation coefficient 0.945
Mean absolute percentage error 0.1434
Mean absolute error 3.8465
Root mean squared error 5.5417
Total Number of Instances 309
This implies that, by using the C=1 and gamma=10 respectively, the prediction result should be the one showed in the summary above.
However, when I am doing an SMOreg with the same dataset using the best parameters acquired from the GridSearch (C=1 and gamma=10), I get different results as shown below:
The classifier:
weka.classifiers.functions.SMOreg -C 1.0 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1" -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 10.0"
The output:
=== Summary ===
Correlation coefficient 0.9417
Mean absolute percentage error 0.1381
Mean absolute error 3.7672
Root mean squared error 5.7114
Total Number of Instances 309
Funny thing is, if I invert the best hyperparameters result (C=10, gamma=1), I get the perfectly similar result to the GridSearch prediction result:
The classifier:
weka.classifiers.functions.SMOreg -C 10.0 -N 0 -I "weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1" -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 1.0"
The output:
=== Summary ===
Correlation coefficient 0.945
Mean absolute percentage error 0.1434
Mean absolute error 3.8465
Root mean squared error 5.5417
Total Number of Instances 309
Does anyone know why this happens?
One more thing: In the GridSearch, the maximum search space for both C and gamma is 1000. However, I tried using C=100, and gamma=1 and it gave better/higher Correlation Coefficient number compared to the best hyperparameters according to the GridSearch which is C=1 and gamma=10. Why the GridSearch didn't give the best results?
Thank you.

Weka, SVM technique output

I am working with SVM at Weka
I have some data and I try SVM (I tried different values of C) technique to analize the data. But the output has totally confused me, that is why I hope some help.
This is the output for a polynomial kernel of degree 1:
Scheme:weka.classifiers.meta.CVParameterSelection -P "C 0.001 10.0 5.0" -X 10 -S 1 -W weka.classifiers.functions.SMO -- -C 0.7 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
Classifier Options: -C 7.5003 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
=== Summary ===
Correctly Classified Instances 83 79.0476 %
Incorrectly Classified Instances 22 20.9524 %
Kappa statistic 0.6555
Mean absolute error 0.0487
Root mean squared error 0.1549
Relative absolute error 91.5633 %
Root relative squared error 100.2828 %
Total Number of Instances 105
This is the output for a polynomial kernel of degree 2:
Scheme:weka.classifiers.meta.CVParameterSelection -P "C 0.001 10.0 5.0" -X 10 -S 1 -W weka.classifiers.functions.SMO -- -C 0.7 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 2.0"
Classifier Options: -C 2.5008 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 2.0"
=== Summary ===
Correctly Classified Instances 87 82.8571 %
Incorrectly Classified Instances 18 17.1429 %
Kappa statistic 0.7236
Mean absolute error 0.0486
Root mean squared error 0.1547
Relative absolute error 91.4748 %
Root relative squared error 100.1845 %
Total Number of Instances 105
This is the output for a gaussian kernel and gamma value of 1.0:
Scheme:weka.classifiers.meta.CVParameterSelection -P "C 0.001 10.0 5.0" -X 10 -S 1 -W weka.classifiers.functions.SMO -- -C 0.7 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 1.0"
Classifier Options: -C 2.5008 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.RBFKernel -C 250007 -G 1.0"
=== Summary ===
Correctly Classified Instances 87 82.8571 %
Incorrectly Classified Instances 18 17.1429 %
Kappa statistic 0.721
Mean absolute error 0.0486
Root mean squared error 0.1547
Relative absolute error 91.4571 %
Root relative squared error 100.1651 %
Total Number of Instances 105
These are my questions:
Why the classifiers at “Classifier Options: -Cxxxxx” have different values of C? I think it is related to the values of C that I am checking with “C 0.001 10.0 5.0”, but I may be wrong and I may be getting wrong conclusions. I think that with “C 0.001 10.0 5.0” it tries the values 0.001, 0.01, 0.1, 1.0 and 10.0. If that is true why the values of C are: 7.5003 and 2.5008??
The percentage of the Correctly Classified Instances is very similar in all of the different runs, I don’t understand why… What does it mean? If it were 12% I would think SVM was not a proper technique (it wasn't linearly separable) but with 80% (no or little variations after changing the parameters) I don’t know what to think…
If I check the data in the “Visualize” of Weka the data doesn’t seem linearly separable but due to the conversion of a bigger number of dimensions when I use SVM I don’t think it is possible to get conclusions from the visualize.
1) Yes you use CVParameterSelection which trys different parameters in your case from 0.001 to 10 in 5 steps. The steps would be 0.001 + k * (10-0.001)/4) because your first step is allready defined as 0.001. If you round the following values (for k= 0/1/2/3/4) you see that they fit.
Step 1) 0.001
Step 2) 2.50075
Step 3) 5.005
Step 4) 7.50025
Step 5) 10.0
2) If you would have 12% accuracy there would be something really strange. Random classification (assuming you have balanced data) would result in 50%. I cannot tell you if 82% is a good result because i don't know about your data.
3) To your question about linar separability. Thats exactly why the SVM is so good. SVM with the right Kernel transforms your data into a higher featurespace to get rid of the non-separability. An RBF kernel can transform your data even into an infinite-dimensional feature space. Thats why a perfect linar separation is allways possible. At this point overfitting can occur. To avoid that and reach a good generalisation you have your complexity parameter C.
Here you can read more in a good stackexchange post about the last point
https://stats.stackexchange.com/questions/80398/how-can-svm-find-an-infinite-feature-space-where-linear-separation-is-always-p

Segmentation fault while using bash script to generate mobility file

I am using a bash script to generate mobility files (setdest) in ns2 for various seeds. But I am running into this troublesome segmentation fault. Any help would be appreciated. The setdest.cc has been modified, so its not the standard ns2 file.
I will walk you through the problem.
This code in a shell script returns the segmentation fault.
#! /bin/sh
setdest="/root/ns-allinone-2.1b9a/ns-2.1b9a/indep-utils/cmu-scen-gen/setdest/setdest_mesh_seed_mod"
let nn="70" #Number of nodes in the simulation
let time="900" #Simulation time
let x="1000" #Horizontal dimensions
let y="1000" #Vertical dimensions
for speed in 5
do
for pause in 10
do
for seed in 1 5
do
echo -e "\n"
echo Seed = $seed Speed = $speed Pause Time = $pause
chmod 700 $setdest
setdest -n $nn -p $pause -s $speed -t $time -x $x -y $y -l 1 -m 50 > scen-mesh-n$nn-seed$seed-p$pause-s$speed-t$time-x$x-y$y
done
done
done
error is
scengen_mesh: line 21: 14144 Segmentation fault $setdest -n $nn -p $pause -s $speed -t $time -x $x -y $y -l 1 -m 50 >scen-mesh-n$nn-seed$seed-p$pause-s$speed-t$time-x$x-y$y
line 21 is the last line of the shell script (done)
The strange thing is If i run the same setdest command on the terminal, there is no problem! like
$setdest -n 70 -p 10 -s 5 -t 900 -x 1000 -y 1000 -l 1 -m 50
I have made out where the problem is exactly. Its with the argument -l. If i remove the argument in the shell script, there is no problem. Now i will walk you through the modified setdest.cc where this argument is coming from.
This modified setdest file uses a text file initpos to read XY coordinates of static nodes for a wireless mesh topology. the relevant lines of code are
FILE *fp_loc;
int locinit;
fp_loc = fopen("initpos","r");
while ((ch = getopt(argc, argv, "r:m:l:n:p:s:t:x:y:i:o")) != EOF) {
switch (ch) {
case 'l':
locinit = atoi(optarg);
break;
default:
usage(argv);
exit(1);
if(locinit)
fscanf(fp_loc,"%lf %lf",&position.X, &position.Y);
if (position.X == -1 && position.Y == -1){
position.X = uniform() * MAXX;
position.Y = uniform() * MAXY;
}
What i dont get is...
In Shell script..
-option -l if supplied by 0 returns no error,
-but if supplied by any other value (i used 1 mostly) returns this segmentation fault.
In Terminal..
-no segmentation fault with any value. 0 or 1
something to do with the shell script surely. I am amazed what is going wrong where!
Your help will be highly appreciated.
Cheers

find all users who has over N process and echo them in shell

I'm writing script is ksh. Need to find all users who has over N process and echo them in shell.
N reads from ksh.
I know what I should use ps -elf but how parse it, find users with >N process and create array with them. Little troubles with array in ksh. Please help. Maybe simple solutions can help me instead of array creating.
s162103#helios:/home/s162103$ ps -elf
0 S s153308 4804 1 0 40 20 ? 17666 ? 11:03:08 ? 0:00 /usr/lib/gnome-settings-daemon --oa
0 S root 6546 1327 0 40 20 ? 3584 ? 11:14:06 ? 0:00 /usr/dt/bin/dtlogin -daemon -udpPor
0 S webservd 15646 485 0 40 20 ? 2823 ? п╪п╟я─я ? 0:23 /opt/csw/sbin/nginx
0 S s153246 6746 6741 0 40 20 ? 18103 ? 11:14:21 ? 0:00 iiim-panel --disable-crash-dialog
0 S s153246 23512 1 0 40 20 ? 17903 ? 09:34:08 ? 0:00 /usr/bin/metacity --sm-client-id=de
0 S root 933 861 0 40 20 ? 5234 ? 10:26:59 ? 0:00 dtgreet -display :14
...
when i type
ps -elf | awk '{a[$3]++;}END{for(i in a)if (a[i]>N)print i, a[i];}' N=1
s162103#helios:/home/s162103$ ps -elf | awk '{a[$3]++;}END{for(i in a)if (a[i]>N)print i, a[i];}' N=1
root 118
/usr/sadm/lib/smc/bin/smcboot 3
/usr/lib/autofs/automountd 2
/opt/SUNWut/lib/utsessiond 2
nasty 31
dima 22
/opt/oracle/product/Oracle_WT1/ohs/ 7
/usr/lib/ssh/sshd 5
/usr/bin/bash 11
that is not user /usr/sadm/lib/smc/bin/smcboot
there is last field in ps -elf ,not user
Something like this(assuming 3rd field of your ps command gives the user id):
ps -elf |
awk '{a[$3]++;}
END {
for(i in a)
if (a[i]>N)
print i, a[i];
}' N=3
The minimal ps command you want to use here is ps -eo user=. This will just print the username for each process and nothing more. The rest can be done with awk:
ps -eo user= |
awk -v max=3 '{ n[$1]++ }
END {
for (user in n)
if (n[user]>max)
print n[user], user
}'
I recommend to put the count in the first column for readability.
read number
ps -elfo user= | sort | uniq -c | while read count user
do
if (( $count > $number ))
then
echo $user
fi
done
That is best solution and it works!

How to find user memory usage in linux

How i can see memory usage by user in linux centos 6
For example:
USER USAGE
root 40370
admin 247372
user2 30570
user3 967373
This one-liner worked for me on at least four different Linux systems with different distros and versions. It also worked on FreeBSD 10.
ps hax -o rss,user | awk '{a[$2]+=$1;}END{for(i in a)print i" "int(a[i]/1024+0.5);}' | sort -rnk2
About the implementation, there are no shell loop constructs here; this uses an associative array in awk to do the grouping & summation.
Here's sample output from one of my servers that is running a decent sized MySQL, Tomcat, and Apache. Figures are in MB.
mysql 1566
joshua 1186
tomcat 353
root 28
wwwrun 12
vbox 1
messagebus 1
avahi 1
statd 0
nagios 0
Caveat: like most similar solutions, this is only considering the resident set (RSS), so it doesn't count any shared memory segments.
EDIT: A more human-readable version.
echo "USER RSS PROCS" ; echo "-------------------- -------- -----" ; ps hax -o rss,user | awk '{rss[$2]+=$1;procs[$2]+=1;}END{for(user in rss) printf "%-20s %8.0f %5.0f\n", user, rss[user]/1024, procs[user];}' | sort -rnk2
And the output:
USER RSS PROCS
-------------------- -------- -----
mysql 1521 1
joshua 1120 28
tomcat 379 1
root 19 107
wwwrun 10 10
vbox 1 3
statd 1 1
nagios 1 1
messagebus 1 1
avahi 1 1
Per-user memory usage in percent using standard tools:
for _user in $(ps haux | awk '{print $1}' | sort -u)
do
ps haux | awk -v user=${_user} '$1 ~ user { sum += $4} END { print user, sum; }'
done
or for more precision:
TOTAL=$(free | awk '/Mem:/ { print $2 }')
for _user in $(ps haux | awk '{print $1}' | sort -u)
do
ps hux -U ${_user} | awk -v user=${_user} -v total=$TOTAL '{ sum += $6 } END { printf "%s %.2f\n", user, sum / total * 100; }'
done
The first version just sums up the memory percentage for each process as reported by ps. The second version sums up the memory in bytes instead and calculates the total percentage afterwards, thus leading to a higher precision.
If your system supports, try to install and use smem:
smem -u
User Count Swap USS PSS RSS
gdm 1 0 308 323 820
nobody 1 0 912 932 2240
root 76 0 969016 1010829 1347768
or
smem -u -t -k
User Count Swap USS PSS RSS
gdm 1 0 308.0K 323.0K 820.0K
nobody 1 0 892.0K 912.0K 2.2M
root 76 0 937.6M 978.5M 1.3G
ameskaas 46 0 1.2G 1.2G 1.5G
124 0 2.1G 2.2G 2.8G
In Ubuntu, smem can be installed by typing
sudo apt install smem
This will return the total ram usage by users in GBs, reverse sorted
sudo ps --no-headers -eo user,rss | awk '{arr[$1]+=$2}; END {for (i in arr) {print i,arr[i]/1024/1024}}' | sort -nk2 -r
You can use the following Python script to find per-user memory usage using only sys and os module.
import sys
import os
# Get list of all users present in the system
allUsers = os.popen('cut -d: -f1 /etc/passwd').read().split('\n')[:-1]
for users in allUsers:
# Check if the home directory exists for the user
if os.path.exists('/home/' + str(users)):
# Print the current usage of the user
print(os.system('du -sh /home/' + str(users)))

Resources