I'm trying to extract some association rules from this dataset:
49
70
27,66
6
27
66,8,64
32
82
66
71
44
1
33
17
31,83
50,29
22
72
8
8,16
56
83,61
85,63,37
50,57
2
50
96,6
73
57
12
62
96
3
47,50,73
35
85,45
25,96,22,17
85
24
17,57
34,4
60,96,45
25
85,66,73
30
14
73,85
64
48
5
37
13,55
37,17
I have this code:
val transactions = sc.textFile("/user/cloudera/dataset1")
import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset
val freqItemsets = transactions.flatMap(xs =>
(xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).map(x => (x.toList, 1L))
).reduceByKey(_ + _).map{case (xs, cnt) => new FreqItemset(xs.toArray, cnt)}
val ar = new AssociationRules().setMinConfidence(0.4)
val results = ar.run(freqItemsets)
results.collect().foreach { rule =>
println("[" + rule.antecedent.mkString(",")
+ "=>"
+ rule.consequent.mkString(",") + "]," + rule.confidence)
}
But I'm getting some unexpected lines in my output:
[2,9=>5],0.5
[8,5,,,3=>6],1.0
[8,5,,,3=>7],0.5
[8,5,,,3=>7],0.5
[,,,=>6],0.5
[,,,=>7],0.5
[,,,=>5],0.5
[,,,=>3],0.5
[4,3=>7],1.0
[4,3=>,,,],1.0
[4,3=>,,,],1.0
[4,3=>5],1.0
[4,3=>7,7],1.0
[4,3=>7,7],1.0
[4,3=>0],1.0
Why am I getting output like this:
[,,,=>3],0.5
I don't understand the issue... Does anyone know how to solve this problem?
Many thanks!
All of these results should be unexpected, because you have a bug in your code!
You need to create combinations of the items. As it stands, your code is creating combinations of the characters in each line (a string like "25,96,22,17"), which of course won't give the right result (and that's why you see the "," as an element).
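To see the pitfall concretely: calling combinations on a String iterates its characters, in Scala just as in plain Python, so here is a tiny illustration (sketched in Python for brevity, not part of the Spark job):
from itertools import combinations

line = "27,66"
# Combinations over the raw string pair up characters, commas included:
print(list(combinations(line, 2)))             # [('2', '7'), ('2', ','), ...]
# Combinations over the split items pair up whole items:
print(list(combinations(line.split(","), 2)))  # [('27', '66')]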
To fix it, split each line into items first: transactions.map(_.split(",")).
So instead of
val freqItemsets = transactions.flatMap(xs =>
(xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).map(x => (x.toList, 1L))
).reduceByKey(_ + _).map{case (xs, cnt) => new FreqItemset(xs.toArray, cnt)}
you should have:
val freqItemsets = transactions.map(_.split(",")).flatMap(xs =>
(xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).filter(_.nonEmpty).map(x => (x.toList, 1L))
).reduceByKey(_ + _).map{case (xs, cnt) => new FreqItemset(xs.toArray, cnt)}
which will give the expected output:
[96,17=>22],1.0
[96,17=>25],1.0
[85,37=>63],1.0
[47,73=>50],1.0
[31=>83],1.0
[60,45=>96],1.0
[60=>45],1.0
[60=>96],1.0
[96,45=>60],1.0
[22,17=>25],1.0
[22,17=>96],1.0
[66,8=>64],1.0
[63,37=>85],1.0
[66,64=>8],1.0
[25,22,17=>96],1.0
[27=>66],0.5
[96,22,17=>25],1.0
[61=>83],1.0
[64=>66],0.5
[64=>8],0.5
[45=>60],0.5
[45=>96],0.5
[45=>85],0.5
[6=>96],0.5
[47=>73],1.0
[47=>50],1.0
[50,73=>47],1.0
[96,22=>17],1.0
[96,22=>25],1.0
[66,73=>85],1.0
[8,64=>66],1.0
[29=>50],1.0
[83=>31],0.5
[83=>61],0.5
[25,96,17=>22],1.0
[85,66=>73],1.0
[25,96,22=>17],1.0
[25,96=>17],1.0
[25,96=>22],1.0
[22=>17],0.5
[22=>96],0.5
[22=>25],0.5
[85,73=>66],1.0
[55=>13],1.0
[60,96=>45],1.0
[63=>37],1.0
[63=>85],1.0
[25,22=>17],1.0
[25,22=>96],1.0
[16=>8],1.0
[25=>96],0.5
[25=>22],0.5
[25=>17],0.5
[34=>4],1.0
[85,63=>37],1.0
[47,50=>73],1.0
[13=>55],1.0
[4=>34],1.0
[25,17=>22],1.0
[25,17=>96],1.0
I have the following message:
message Message {
int64 id = 1;
google.protobuf.FloatValue weight = 2;
google.protobuf.FloatValue override_weight = 3;
}
and I wish to change the type of weight and override_weight (optional fields) to google.protobuf.DoubleValue, so what I did was the following:
message Message {
int64 id = 1;
oneof weight_oneof {
google.protobuf.FloatValue weight = 2 [deprecated=true];
google.protobuf.DoubleValue double_weight = 4;
}
oneof override_weight_oneof {
google.protobuf.FloatValue override_weight = 3 [deprecated=true];
google.protobuf.DoubleValue double_override_weight = 5;
}
}
My question is: let's assume I have old messages that were serialized with the previous version of the message definition. Would I be able to parse them as the new message?
The documentation is very vague about this:
"Move optional fields into or out of a oneof: You may lose some of your information (some fields will be cleared) after the message is serialized and parsed. However, you can safely move a single field into a new oneof and may be able to move multiple fields if it is known that only one is ever set."
Has anyone tried this before? What is the best practice for this situation?
As far as I know, fields in a oneof are just serialized using their tag number. The serialized data does not indicate whether a field is part of a oneof; this is all handled by the serializer and deserializer. So as long as the tag numbers do not conflict, it can be assumed that it will work in both directions: old messages to a new parser and new messages to an old parser.
You could test this using an online protobuf deserializer.
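For intuition on why only tag numbers matter: every key byte on the wire encodes (field_number << 3) | wire_type, with no trace of oneof membership. A few values to compare against the hex dumps in the verification below:
# key byte = (field_number << 3) | wire_type
print(hex((1 << 3) | 0))  # 0x08: field 1, varint (id)
print(hex((2 << 3) | 5))  # 0x15: field 2, 32-bit (float weight)
print(hex((4 << 3) | 1))  # 0x21: field 4, 64-bit (double double_weight)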
Verification:
The code does indeed produce the same byte strings. Below you will find the message definitions and Python code I used. The Python code will output a byte string you can copy and use in Marc Gravell's decoder.
syntax = "proto3";
message MessageA {
int64 id = 1;
float weight = 2;
float override_weight = 3;
}
message MessageB {
int64 id = 1;
oneof weight_oneof {
float weight = 2 [deprecated=true];
double double_weight = 4;
}
oneof override_weight_oneof {
float override_weight = 3 [deprecated=true];
double double_override_weight = 5;
}
}
import Example_pb2  # module generated from the .proto definitions above

def dump(label, msg):
    # Output the serialized bytes in a pretty format
    print(label + ' = ' + ' '.join('{:02x}'.format(x) for x in msg.SerializeToString()))

# Set some data in the original message
msgA = Example_pb2.MessageA()
msgA.id = 1234
msgA.weight = 3.21
msgA.override_weight = 5.43
dump('msgA', msgA)

# Next set the original fields in the new message
msgB = Example_pb2.MessageB()
msgB.id = 1234
msgB.weight = 3.21
msgB.override_weight = 5.43
dump('msgB 1', msgB)

# And finally set the new fields in msgB
msgB.double_weight = 3.21
msgB.double_override_weight = 5.43
dump('msgB 2', msgB)
The output of the Python script was:
msgA = 08 d2 09 15 a4 70 4d 40 1d 8f c2 ad 40
msgB 1 = 08 d2 09 15 a4 70 4d 40 1d 8f c2 ad 40
msgB 2 = 08 d2 09 21 ae 47 e1 7a 14 ae 09 40 29 b8 1e 85 eb 51 b8 15 40
As you can see, message A and message B yield the same byte string when setting the original fields. Only when you set the new fields do you get a different string.
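As a final sanity check (a sketch reusing the same generated Example_pb2 module), you can parse bytes produced by the old definition with the new one and confirm the deprecated field still comes through:
import Example_pb2

# Serialize with the old definition...
msgA = Example_pb2.MessageA()
msgA.id = 1234
msgA.weight = 3.21

# ...then parse those bytes with the new definition:
msgB = Example_pb2.MessageB()
msgB.ParseFromString(msgA.SerializeToString())
print(msgB.WhichOneof("weight_oneof"))  # -> weight
print(round(msgB.weight, 2))            # -> 3.21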
I have a list like this:
["Dhoni 35 WC 785623", "Sachin 40 Batsman 4500", "Dravid 45 Batsman 50000", "Kumble 41 Bowler 456431", "Srinath 41 Bowler 65465"]
After applying a filter I want it like this:
["Dhoni WC", "Sachin Batsman", "Dravid Batsman", "Kumble Bowler", "Srinath Bowler"]
I tried it out this way:
m = sc.parallelize(["Dhoni 35 WC 785623","Sachin 40 Batsman 4500","Dravid 45 Batsman 50000","Kumble 41 Bowler 456431","Srinath 41 Bowler 65465"])
n = m.map(lambda k:k.split(' '))
o = n.map(lambda s:(s[0]))
o.collect()
['Dhoni', 'Sachin', 'Dravid', 'Kumble', 'Srinath']
q = n.map(lambda s:s[2])
q.collect()
['WC', 'Batsman', 'Batsman', 'Bowler', 'Bowler']
Provided all your list items are of the same format, one way to achieve this is with map.
rdd = sc.parallelize(["Dhoni 35 WC 785623","Sachin 40 Batsman 4500","Dravid 45 Batsman 50000","Kumble 41 Bowler 456431","Srinath 41 Bowler 65465"])
rdd.map(lambda x:(x.split(' ')[0]+' '+x.split(' ')[2])).collect()
Output:
['Dhoni WC', 'Sachin Batsman', 'Dravid Batsman', 'Kumble Bowler', 'Srinath Bowler']
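Since both output fields come from the same split, a variant that splits each line only once does the same job (same assumption that the items are space-separated):
rdd.map(lambda x: x.split(' ')).map(lambda p: p[0] + ' ' + p[2]).collect()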
I have the following text file:
079082084072079032084069067072000000000,0
082078032049050032067072065082071069000,1
076065066032065083083084000000000000000,0
082078032049050072082000000000000000000,1
082078032049050072082000000000000000000,1
082078032049050072082000000000000000000,1
070083087032073073032080068000000000000,0
080067065032049050032072082000000000000,0
082078032056072082000000000000000000000,1
070083087032073073073000000000000000000,0
082078032087069069075069078068000000000,1
082078032049050072082000000000000000000,1
077065073078084032077069067072032073073,0
082078032049050072082000000000000000000,1
080067065032049050032072082000000000000,0
082078032049050072082000000000000000000,1
I need two matrices:
X size 16x13
Y size 16x1
I want to separate each row of the file into 13 values, for example:
079 082 084 072 079 032 084 069 067 072 000 000 000
Is it possible to import it into Octave using the textread function?
If not, can it be done using a Linux bash command?
Yes, you can do this with textscan (see the bottom if you really want to use textread):
octave> txt = "079082084072079032084069067072000000000,0\n082078032049050032067072065082071069000,1";
octave> textscan (txt, repmat ("%3d", 1, 13))
ans =
{
[1,1] =
79
82
[1,2] =
82
78
[1,3] =
84
32
[1,4] =
72
49
[...]
Note that you are reading them as numeric values, so you do not get the leading zeros. If you want them, you can read the values as strings by using "%3s" in the format, but that is extra trouble to handle and reduces performance, since you will then be dealing with cell arrays of strings.
Since you are reading from a file:
[fid, msg] = fopen ("data.txt", "r");
if (fid < 0)  # fopen returns -1 on failure
  error ("failed to fopen 'data.txt': %s", msg);
endif
data = textscan (fid, repmat ("%3d", 1, 13));
fclose (fid);
If you really want to use textread:
octave> [d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13] = textread ("data.txt", repmat ("%3d", 1, 13))
d1 =
79
82
76
[...]
d2 =
82
78
65
[...]
or:
octave> data = cell (1, 13);
octave> [data{:}] = textread ("data.txt", repmat ("%3d", 1, 13))
data =
{
[1,1] =
79
82
76
[...]
[1,2] =
82
78
65
[...]
If you need to capture the value after the comma (not really part of your original question), you can use:
octave> textscan (txt, [repmat("%3d", 1, 13) ",%1d"])
ans =
{
[1,1] =
79
82
[1,2] =
82
78
[1,3] =
84
32
[...]
[1,14] =
0
1
}
You can do this pretty easily by reading three characters at a time using read in the shell:
while IFS="${IFS}," read -rn3 val tail; do
  [[ $tail ]] && echo || printf '%s ' "$val"
done < file
This implementation assumes that if we encounter a value after the comma, we should go to the next line.
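If a non-Octave route is acceptable anyway, the whole transformation the question asks for (X as 16x13, Y as 16x1) is also easy to sketch in Python; the file name data.txt and the fixed 3-digit layout are assumed from the question:
import numpy as np

X_rows, Y_vals = [], []
with open("data.txt") as fh:
    for line in fh:
        digits, label = line.strip().split(",")
        # Slice the 39-digit field into 13 three-digit values
        X_rows.append([int(digits[i:i + 3]) for i in range(0, len(digits), 3)])
        Y_vals.append(int(label))

X = np.array(X_rows)                 # 16x13
Y = np.array(Y_vals).reshape(-1, 1)  # 16x1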
I am a bit out of my comfort zone here, so I'm not even sure I'm approaching the problem appropriately. Anyhow, here goes:
So I have a problem where I am to hash some info with SHA-1, and that hash will work as the info's ID.
When a client wants to signal which info is currently being used, it sends a percent-encoded SHA-1 string.
So one example is, my server hashes some info and gets a hex representation like so:
44 c1 b1 0d 6a de ce 01 09 fd 27 bc 81 7f 0e 90 e3 b7 93 08
and the client sends me
D%c1%b1%0dj%de%ce%01%09%fd%27%bc%81%7f%0e%90%e3%b7%93%08
Removing the % we get
D c1 b1 0dj de ce 01 09 fd 27 bc 81 7f 0e 90 e3 b7 93 08
which matches my hash except for the leading D and the j after the 0d; but after replacing those with their ASCII hex values, the hashes are identical.
So, as I have read and understood URL encoding, the standard allows a client to send the D as either D or %44? So different clients would be able to send different representations of the same hash, and I would not be able to just compare them for equality?
I would prefer to compare the URL-encoded strings as they are sent, but one way to do it would be to decode them, removing all '%' and taking the ASCII hex value of whatever mismatch I get, much like the D and the j in my example above.
This all seems like a very annoying way to do things; am I missing something? Please tell me I am. :)
I am doing this in node.js but I suppose the solution would be language/platform agnostic.
I made this crude solution for now:
// RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
var unreserved = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.~';
function hexToPercent(hex){
  var index = 0,
      end = hex.length,
      delimiter = '%',
      step = 2,
      result = '',
      tmp = '';
  if(end % step !== 0){
    console.log('\'' + hex + '\' must be divisible by ' + step + '.');
    return result;
  }
  while(index < end){
    tmp = hex.slice(index, index + step);
    // Unreserved bytes stay literal; everything else gets percent-encoded
    if(unreserved.indexOf(String.fromCharCode(parseInt(tmp, 16))) !== -1){
      result = result + String.fromCharCode(parseInt(tmp, 16));
    }
    else{
      result = result + delimiter + tmp;
    }
    index = index + step;
  }
  return result;
}
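An alternative that sidesteps the whole problem: instead of re-encoding your hash, decode whatever the client sends into raw bytes and compare bytes with bytes. Then it no longer matters which unreserved characters the client left literal. A minimal sketch in Python (the same idea ports directly to Node):
from urllib.parse import unquote_to_bytes

received = "D%c1%b1%0dj%de%ce%01%09%fd%27%bc%81%7f%0e%90%e3%b7%93%08"
expected = bytes.fromhex("44c1b10d6adece0109fd27bc817f0e90e3b79308")

# Decode the percent-encoding to raw bytes and compare bytes, not strings:
print(unquote_to_bytes(received) == expected)  # True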
I have this data in my file
65 ---
66 FieldType: Text
67 FieldName: STATE
68 FieldNameAlt: STATE
69 FieldFlags: 4194304
70 FieldJustification: Left
71 FieldMaxLength: 2
72 ---
73 FieldType: Text
74 FieldName: ZIP
75 FieldNameAlt: ZIP
76 FieldFlags: 0
77 FieldJustification: Left
78 ---
79 FieldType: Signature
80 FieldName: EMPLOYEE SIGNATURE
81 FieldNameAlt: EMPLOYEE SIGNATURE
82 FieldFlags: 0
83 FieldJustification: Left
84 ---
85 FieldType: Text
86 FieldName: Name_Last
87 FieldNameAlt: LAST
88 FieldFlags: 0
89 FieldValue: Billa
90 FieldJustification: Left
91 ---
How can I make that an array and store it as key-value pairs, like
array['fieldtype']
array['fieldName']
for all the objects?
I think the separator is only "---", but I don't know how to do that.
Here's one way with GNU awk. It splits the input into records which can then be worked on.
parse.awk
BEGIN {
RS = " +[0-9]+ +---\n"
FS = "\n"
}
{
for(i=1; i<=NF; i++) { # for each line
sf = split($i, a, ":")
if(sf > 1) { # only accept successfully split lines
sub("^ +[0-9]+ +", "", a[1]) # trim key
sub("^ +", "", a[2]) # trim value
array[a[1]] = a[2] # save into array hash
}
}
}
{
print "Record: " NR
for(k in array) {
print k " -> " array[k]
}
print ""
delete array # clear the hash so fields do not leak into the next record
}
Save the above into parse.awk and run it like this:
awk -f parse.awk infile
Where infile contains the data you want to parse. Output:
Record: 1
Record: 2
FieldFlags -> 4194304
FieldNameAlt -> STATE
FieldJustification -> Left
FieldType -> Text
FieldMaxLength -> 2
FieldName -> STATE
Record: 3
FieldFlags -> 0
FieldNameAlt -> ZIP
FieldJustification -> Left
FieldType -> Text
FieldName -> ZIP
Record: 4
FieldFlags -> 0
FieldNameAlt -> EMPLOYEE SIGNATURE
FieldJustification -> Left
FieldType -> Signature
FieldName -> EMPLOYEE SIGNATURE
Record: 5
FieldFlags -> 0
FieldNameAlt -> LAST
FieldJustification -> Left
FieldType -> Text
FieldValue -> Billa
FieldName -> Name_Last
You can use something like this:
sed -n '/FieldType/,/FieldName/{N};s/FieldType: \([^\n]*\)\nFieldName: \([^\n]*\)/a["\2"]=\1/gp' input >> tmp.sh
and do:
source tmp.sh
or use eval instead of the redirection and source; however, the space in the EMPLOYEE SIGNATURE field will cause problems.
Using Perl makes more sense though.
In any type of awk:
#!awk -F':[[:blank:]]*' -f
BEGIN {
counter = 0
}
/:/ {
sub(/^[[:blank:]]*[0-9]+[[:blank:]]+/, "", $1) # strip the leading line number from the key
array[counter,$1] = $2
}
/---/ {
counter++;
}
END {
# Deal with the array.
}
This creates an array where each record, counted off by 'counter', stores its fields as described above with array[record, key] = value.
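If awk is not a hard requirement, the same record splitting is easy to sketch in Python as well (assuming the "---" separators and "Key: Value" lines with leading line numbers, exactly as shown above):
import re

records, current = [], {}
with open("infile") as fh:
    for line in fh:
        line = re.sub(r"^\s*\d+\s+", "", line).rstrip()  # drop the leading line number
        if line == "---":                                # record separator
            if current:
                records.append(current)
            current = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
if current:
    records.append(current)  # keep the last record if the file lacks a trailing ---

for rec in records:
    print(rec.get("FieldType"), "->", rec.get("FieldName"))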