Apache Spark: GraphFrame and random walk implementation - apache-spark

Suppose a directed graph where each node has a boolean flag.
The main goal is to "jump" from node to node until we reach a node with flag=true, within a limited (configurable) number of steps for each vertex.
P.S. it's okay to return to the initial node.
For example:
Vertex A (flag=false) -> [B, C, D, G]
Vertex B (flag=false) -> [A, E, F, G]
Vertex C (flag=false) -> [A, E, G]
Vertex D (flag=false) -> [A, F, G]
Vertex E (flag=true) -> [B, C, G]
Vertex F (flag=false) -> [B, D]
Vertex G (flag=false) -> [A, B, C, D, E]
Vertex H (flag=true) -> [G]
stepsCount = 10
Algorithm:
Vertex A has flag=false, so a neighbour is selected at random, say node B, where flag=false as well.
From B we again select a random neighbour, say node F, where flag=false.
Keep "jumping" to the next node while stepsCount < 10.
If no flag=true node was reached, mark vertex A with 0, else with 1.
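Before wiring this into GraphFrames, the walk itself can be sketched in plain Python over the adjacency lists above (a minimal simulation sketch, independent of any Spark API; the function name walk_finds_flag is illustrative):

```python
import random

# Adjacency lists and flags taken from the example above
neighbors = {
    "A": ["B", "C", "D", "G"], "B": ["A", "E", "F", "G"],
    "C": ["A", "E", "G"],      "D": ["A", "F", "G"],
    "E": ["B", "C", "G"],      "F": ["B", "D"],
    "G": ["A", "B", "C", "D", "E"], "H": ["G"],
}
flag = {v: v in ("E", "H") for v in neighbors}

def walk_finds_flag(start, steps_count=10):
    # Jump to a random neighbour until a flag=true node is reached
    # or steps_count jumps have been made; return 1 on success, else 0.
    node = start
    for _ in range(steps_count):
        if flag[node]:
            return 1
        node = random.choice(neighbors[node])
    return 1 if flag[node] else 0

result = {v: walk_finds_flag(v) for v in neighbors}
```

A distributed version would run one such walk per vertex (e.g. via GraphFrames' Pregel-style message passing), but the per-vertex logic is the loop above.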
// Vertex DataFrame
val v = spark.createDataFrame(List(
  ("A", false),
  ("B", false),
  ("C", false),
  ("D", false),
  ("E", true),
  ("F", false),
  ("G", false),
  ("H", true)
)).toDF("id", "flag")
// Edge DataFrame
val e = spark.createDataFrame(List(
  ("A", "B", 0),
  ("B", "A", 0),
  ("A", "C", 0),
  ("C", "A", 0),
  ("A", "D", 0),
  ("D", "A", 0),
  ("F", "B", 0),
  ("E", "B", 0),
  ("F", "D", 0),
  ("D", "F", 0),
  ("A", "G", 0),
  ("B", "G", 0),
  ("C", "G", 0),
  ("D", "G", 0),
  ("E", "G", 0),
  ("G", "H", 0)
)).toDF("src", "dst", "mark")
val g = GraphFrame(v, e)

Related

Calculate the number of times a combination is used in a log

I have a log of events in the below form:
A B C D
A B C D
A B C D
A B C D
D E F G
D E F G
D E F G
D E F G
D E F G
D E F G
D E F G
A D E F G
D E F G
A D E G
I am trying to calculate, for example, how many times the transition A -> B occurs.
With the code below I calculate the frequency of each trace.
from collections import Counter
flog = []
input_file = "test.txt"
with open(input_file, "r") as f:
    for line in f.readlines():
        line = line.split()
        flog.append(line)
trace_frequency = map(tuple, flog)
flog = list(Counter(trace_frequency).items())
That gives me :
(('A', 'B', 'C', 'D'), 4)
(('D', 'E', 'F', 'G'), 8)
(('A', 'D', 'E', 'F', 'G'), 1)
(('A', 'D', 'E', 'G'), 1)
So my question is: how can I go from the above to a format where I count all pairs across the log, in the below format:
A B 4
B C 4
C D 4
A D 2
D E 10...etc
Thanks to all for your time.
Instead of counting each line as a whole, split each line into pairs, then count the appearances of each pair.
For example, instead of counting ('A', 'B', 'C', 'D'), count ('A', 'B'), ('B', 'C'), ('C', 'D') individually.
from collections import Counter
flog = []
input_file = "test.txt"
with open(input_file, "r") as f:
    for line in f.readlines():
        line = line.split()
        flog.extend(line[i: i + 2] for i in range(len(line) - 1))
        # ^ note extend instead of append
trace_frequency = map(tuple, flog)
flog = list(Counter(trace_frequency).items())
flog is now
[(('A', 'B'), 4), (('B', 'C'), 4), (('C', 'D'), 4), (('D', 'E'), 10),
(('E', 'F'), 9), (('F', 'G'), 9), (('A', 'D'), 2), (('E', 'G'), 1)]
To get your desired format (with the bonus of order) you can use:
flog = Counter(trace_frequency)
for entry, count in flog.most_common():
    print(' '.join(entry), count)
Outputs
D E 10
E F 9
F G 9
A B 4
B C 4
C D 4
A D 2
E G 1
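Assuming Python 3, the same pair counting can also be written with zip instead of slicing; a self-contained sketch with the log inlined instead of read from test.txt:

```python
from collections import Counter

# The event log from the question, inlined so the example runs standalone
log_lines = [
    "A B C D", "A B C D", "A B C D", "A B C D",
    "D E F G", "D E F G", "D E F G", "D E F G",
    "D E F G", "D E F G", "D E F G",
    "A D E F G", "D E F G", "A D E G",
]

counts = Counter(
    pair
    for tokens in (line.split() for line in log_lines)
    for pair in zip(tokens, tokens[1:])  # consecutive pairs per trace
)

for (a, b), n in counts.most_common():
    print(a, b, n)
```

zip(tokens, tokens[1:]) yields the same consecutive pairs as the slicing approach, with no index arithmetic.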
Not sure if it's the best, but one possibility is to use Pandas. Given a file log.txt that looks like this:
0 1 2 3 4
A B C D
A B C D
A B C D
A B C D
D E F G
D E F G
D E F G
D E F G
D E F G
D E F G
D E F G
A D E F G
D E F G
A D E G
This code will work:
import pandas as pd
import numpy as np
from collections import Counter
df = pd.read_csv('log.txt', sep=r'\s+')  # raw string avoids an invalid-escape warning
combos = [[(y[1][x], y[1][x + 1]) for x in range(len(df.loc[0]) - 1)] for y in df.iterrows()]
combos = [item for sublist in combos for item in sublist if np.nan not in item]
for pair, count in Counter(combos).items():
    print(pair, count)
Giving you:
('A', 'B') 4
('B', 'C') 4
('C', 'D') 4
('D', 'E') 10
('E', 'F') 9
('F', 'G') 9
('A', 'D') 2
('E', 'G') 1

Haskell filtering data

I want to be able to filter the data below to find specific entries. For example, if I wanted to find only the apple items, the output would look like this: [("apple","crate",6),("apple","box",3)]
fruit :: [(String, String, Int)]
fruit = [("apple", "crate", 6), ("pear", "crate", 5), ("mango", "box", 4),
("apple", "box", 3), ("banana", "box", 5), ("pear", "box", 10), ("apricot",
"box", 4), ("peach", "box", 5), ("walnut", "box", 4), ("blueberry", "tray", 10),
("blackberry", "tray", 4), ("watermelon", "piece", 8), ("marrow", "piece", 7),
("hazelnut", "sack", 2), ("walnut", "sack", 4)]
first :: (a, b, c) -> a
first (x, _, _) = x
second :: (a, b, c) -> b
second (_, y, _) = y
third :: (a, b, c) -> c
third (_, _, z) = z
A couple of alternatives:
filter ((=="apple") . first) fruit
[ f | f@("apple",_,_) <- fruit ]
The first one exploits your first projection, checking whether its result is equal to "apple".
The second one instead exploits list comprehensions, where elements that fail to pattern match are discarded.
Perhaps an even more basic approach is using a lambda abstraction and equality.
filter (\(s,_,_) -> s == "apple") fruit

Recursive function

Let S be the subset of the set of ordered pairs of integers defined recursively by
Basis step: (0, 0) ∈ S.
Recursive step: If (a,b) ∈ S,
then (a,b + 1) ∈ S, (a + 1, b + 1) ∈ S, and (a + 2, b + 1) ∈ S.
List the elements of S produced by the first four applications of the recursive step.
def subset(a, b):
    base = []
    if base == []:
        base.append((a, b))
        return base
    elif (a, b) in base:
        base.append(subset(a, b + 1))
        base.append(subset(a + 1, b + 1))
        base.append(subset(a + 2, b + 1))
        return base
for number in range(0, 5):
    for number2 in range(0, 5):
        print(*subset(number, number2))
The output is
(0, 0)
(0, 1)
(0, 2)
(0, 3)
(0, 4)
(1, 0)
(1, 1)
(1, 2)
(1, 3)
(1, 4)
(2, 0)
(2, 1)
(2, 2)
(2, 3)
(2, 4)
(3, 0)
(3, 1)
(3, 2)
(3, 3)
(3, 4)
(4, 0)
(4, 1)
(4, 2)
(4, 3)
(4, 4)
But the correct answer is more than what I got.
(0, 1), (1, 1), and (2, 1) are all in S. If we apply the recursive step to these we add (0, 2), (1, 2), (2, 2), (3, 2), and (4, 2). The next round gives us (0, 3), (1, 3), (2, 3), (3, 3), (4, 3), (5, 3), and (6, 3). And a fourth set of applications adds (0,4), (1,4), (2,4), (3,4), (4,4), (5,4), (6,4), (7,4), and (8,4).
What did I do wrong with my code?
Is this what you want? Based on the result you wanted:
def subset(a):
    # Returns a list that contains all (i, a + 1) from i = 0 to i = (a + 1) * 2
    return [(i, a + 1) for i in range((a + 1) * 2 + 1)]
setList = []
for i in range(4):
    setList += subset(i)  # Fill the list with subset results for i = 0 -> 3
print(setList)
If you want to use a recursive function, you can do that too :
def subset(a, b):
    if a < b * 2:
        # Returns a list containing (a, b) plus the unpacked result of subset(a + 1, b)
        # (otherwise it would nest a list inside the list)
        return [(a, b), *subset(a + 1, b)]
    else:
        return [(a, b)]  # Deepest element of the recursion: just return [(a, b)]
setList = []
for i in range(5):
    setList += subset(0, i)
print(setList)
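The four rounds from the expected answer can also be reproduced by applying the recursive step to the whole set at once; a minimal sketch (the name generate is illustrative):

```python
def generate(applications):
    # Basis step: (0, 0) is in S. Each round adds (a, b + 1),
    # (a + 1, b + 1) and (a + 2, b + 1) for every (a, b) already in S.
    s = {(0, 0)}
    for _ in range(applications):
        s |= {(a + da, b + 1) for (a, b) in s for da in (0, 1, 2)}
    return s

print(sorted(generate(4)))
```

After four applications this yields exactly the pairs listed in the correct answer: for each level b there are 2b + 1 pairs, (0, b) through (2b, b).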

How to join two RDDs in spark with python?

Suppose
rdd1 = ( (a, 1), (a, 2), (b, 1) ),
rdd2 = ( (a, ?), (a, *), (c, .) ).
Want to generate
( (a, (1, ?)), (a, (1, *)), (a, (2, ?)), (a, (2, *)) ).
Any easy methods?
I think it is different from the cross join but can't find a good solution.
My solution is
(rdd1
 .cartesian(rdd2)
 .filter(lambda kv: kv[0][0] == kv[1][0])  # tuple-parameter lambdas are Python 2 only
 .map(lambda kv: (kv[0][0], (kv[0][1], kv[1][1]))))
You are just looking for a simple join, e.g.
rdd = sc.parallelize([("red",20),("red",30),("blue", 100)])
rdd2 = sc.parallelize([("red",40),("red",50),("yellow", 10000)])
rdd.join(rdd2).collect()
# Gives [('red', (20, 40)), ('red', (20, 50)), ('red', (30, 40)), ('red', (30, 50))]
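For intuition, join's per-key inner-join semantics can be emulated in plain Python (a sketch only, not the Spark API; inner_join is an illustrative name):

```python
from collections import defaultdict

def inner_join(left, right):
    # Group the right side by key, then pair every left value with
    # every matching right value -- what RDD.join does per key.
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    return [(k, (v, w)) for k, v in left for w in by_key[k]]

pairs = inner_join(
    [("red", 20), ("red", 30), ("blue", 100)],
    [("red", 40), ("red", 50), ("yellow", 10000)],
)
print(pairs)
# [('red', (20, 40)), ('red', (20, 50)), ('red', (30, 40)), ('red', (30, 50))]
```

Keys present on only one side ('blue', 'yellow') are dropped, matching the inner-join behaviour; Spark does the grouping per partition after a shuffle, avoiding the full cartesian product.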

Find the score of a scrabble word in haskell

I have each letter mapped to its respective score:
dict = fromList([("A",1), ("B",3), ("C", 3), ("E", 1), ("D", 2), ("G", 2), ("F", 4), ("I", 1), ("H", 4), ("K", 5), ("J", 8), ("M", 3), ("L", 1), ("O", 1), ("N", 1), ("Q", 10), ("P", 3), ("S", 1), ("R", 1), ("U", 1), ("T", 1), ("W", 4), ("V", 4), ("Y", 4), ("X", 8), ("Z", 10)])
If the main send a word to the function the function should return the score with respect to the dict and the letters in the word.
E.g. if main sends the word APPLE, the function should return 9:
(A score) 1 + (P score) 3 + (P score) 3 + (L score) 1 + (E score) 1 = 9
You could use lookup to create a function that maps keys to values:
mapper :: Eq k => [(k, v)] -> k -> v
mapper dict k = case lookup k dict of
                  Nothing -> undefined
                  Just v  -> v
scrabble :: Char -> Int
scrabble = mapper [ ('A', 1)
, ('B', 3)
, ('C', 3)
, ('E', 1)
, ('D', 2)
, ('G', 2)
, ('F', 4)
, ('I', 1)
, ('H', 4)
, ('K', 5)
, ('J', 8)
, ('M', 3)
, ('L', 1)
, ('O', 1)
, ('N', 1)
, ('Q', 10)
, ('P', 3)
, ('S', 1)
, ('R', 1)
, ('U', 1)
, ('T', 1)
, ('W', 4)
, ('V', 4)
, ('Y', 4)
, ('X', 8)
, ('Z', 10)
]
Now all you need to do is create a function which takes a string and returns its score:
score :: String -> Int
score = sum . map scrabble
main = print $ score "APPLE"
That's all.
Edit: There's nothing wrong with returning undefined in mapper when a lookup fails. If you need error handling you could simply define mapper as flip lookup and hey presto scrabble is now of type Char -> Maybe Int.
Consider how you would write scrabble using pattern matching:
scrabble :: Char -> Int
scrabble 'A' = 1
scrabble 'B' = 3
scrabble 'C' = 3
scrabble 'D' = 2
scrabble 'E' = 1
scrabble 'F' = 4
scrabble 'G' = 2
scrabble 'H' = 4
scrabble 'I' = 1
scrabble 'J' = 8
scrabble 'K' = 5
scrabble 'L' = 1
scrabble 'M' = 3
scrabble 'N' = 1
scrabble 'O' = 1
scrabble 'P' = 3
scrabble 'Q' = 10
scrabble 'R' = 1
scrabble 'S' = 1
scrabble 'T' = 1
scrabble 'U' = 1
scrabble 'V' = 4
scrabble 'W' = 4
scrabble 'X' = 8
scrabble 'Y' = 4
scrabble 'Z' = 10
If the pattern match fails then you end up with a bottom value anyway. This is not a problem if you know that the pattern match will never fail. If you do need to handle failures then simply use flip lookup as I mentioned above.
main = print $ calcWordScore "APPLE"
calcWordScore :: String -> Int
calcWordScore word = sum $ map calcLetterScore word
calcLetterScore :: Char -> Int
calcLetterScore ch = Map.fromList([('A',1), ('B',3), ('C', 3), ('E', 1), ('D', 2), ('G', 2), ('F', 4), ('I', 1), ('H', 4), ('K', 5), ('J', 8), ('M', 3), ('L', 1), ('O', 1), ('N', 1), ('Q', 10), ('P', 3), ('S', 1), ('R', 1), ('U', 1), ('T', 1), ('W', 4), ('V', 4), ('Y', 4), ('X', 8), ('Z', 10)]) Map.! ch
And you need to
import qualified Data.Map.Lazy as Map
