Why is PySpark slow when using a for loop? - apache-spark

I am learning PySpark by writing a PageRank program.
But when I use a for loop to compute the iterations, every iteration gets slower.
I tried to use cache(), but it doesn't seem to help.
I have no idea how to fix this problem.
Here is my loop code:
from time import time
from operator import add
from tqdm import tqdm

for idx, i in tqdm(enumerate(range(10))):
    start_time = time()  # <-- start timing
    new_values = stochastic_matrix.flatMap(lambda x: get_new_value(x, beta, N))
    new_values = new_values.reduceByKey(add).map(lambda x: [x[0], x[1] + ((1 - beta) / N)])
    S = new_values.values().reduce(add)
    new_stochastic_matrix = stochastic_matrix.fullOuterJoin(new_values)
    stochastic_matrix = new_stochastic_matrix.map(lambda x: sum_new_value(x, S, N))
    new_stochastic_matrix.cache()
    stochastic_matrix.cache()  # <--- cache here
    end_time = time()
    print(idx, end_time - start_time)

sorted(stochastic_matrix.collect())[:10]
Update
After I comment out this line:
stochastic_matrix = new_stochastic_matrix.map(lambda x: sum_new_value(x, S, N))
it works normally!
But I still don't know why, or how to fix it.
Update 2
If I set S to a constant, the speed is normal.
But I still don't know why, or how to fix it.
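For context, a common cause of per-iteration slowdown in iterative RDD code like this is the growing lineage: each stochastic_matrix is defined in terms of all previous iterations, so job planning and closure handling get more expensive every round, and cache() alone does not truncate that history (it only marks the RDD for reuse once an action materializes it). Below is a minimal sketch of the usual mitigation, assuming a SparkContext named sc; the checkpoint directory and the count() call are my additions, not from the question:

sc.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical directory

for idx in range(10):
    # ... same per-iteration body as above ...
    stochastic_matrix = new_stochastic_matrix.map(lambda x: sum_new_value(x, S, N))
    stochastic_matrix.cache()
    if idx % 5 == 0:
        stochastic_matrix.checkpoint()  # cut the lineage every few iterations
    stochastic_matrix.count()  # run an action now, so the cache/checkpoint actually materializes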
All Flow
After Input Data
variable: stochastic_matrix - its data structure looks like this:
# (key, [value, list_of_nodes_this_node_connects_to])
[
    (1, [0.2, [2, 3]]),
    (2, [0.2, [4]]),
    (3, [0.2, [1, 4, 5]]),
    (4, [0.2, []]),
    (5, [0.2, [1, 4]])
]
Map
def get_new_value(item, beta, N):
    key, tmp = item
    value, dest = tmp
    N_dest = len(dest)
    new_values = []
    for i in dest:
        new_values.append([i, beta * (value / N_dest)])
    return new_values
new_values = stochastic_matrix.flatMap(lambda x: get_new_value(x, beta, N))
new_values.collect()
########### output
[node, each_node_new_value]
[[2, 0.08000000000000002],
[3, 0.08000000000000002],
[4, 0.16000000000000003],
[1, 0.05333333333333334],
[4, 0.05333333333333334],
[5, 0.05333333333333334],
[1, 0.08000000000000002],
[4, 0.08000000000000002]]
Reduce by key
beta and N are just float numbers.
new_values = new_values.reduceByKey(add).map(lambda x: [x[0], x[1] + ((1-beta)/N)] )
new_values.collect()
###### Output
[[2, 0.12000000000000001],
[3, 0.12000000000000001],
[4, 0.33333333333333337],
[1, 0.17333333333333334],
[5, 0.09333333333333332]]
Combine new_values and stochastic_matrix
new_stochastic_matrix = stochastic_matrix.fullOuterJoin(new_values)
new_stochastic_matrix.collect()
#### Output
# (key, ([value, this_node_connect_to_which_node], new_value))
[(2, ([0.2, [4]], 0.12000000000000001)),
(4, ([0.2, []], 0.33333333333333337)),
(1, ([0.2, [2, 3]], 0.17333333333333334)),
(3, ([0.2, [1, 4, 5]], 0.12000000000000001)),
(5, ([0.2, [1, 4]], 0.09333333333333332))]
Update new_value to value
S and N are just numbers.
def sum_new_value(item, S, N):
    key, value = item
    if value[1] is None:
        new_value = 0 + (1 - S) / N
    else:
        new_value = value[1] + (1 - S) / N
    # new_value = value[1]
    return [key, [new_value, value[0][1]]]

stochastic_matrix = new_stochastic_matrix.map(lambda x: sum_new_value(x, S, N))
sorted(stochastic_matrix.collect())[:10]
######## Output
[[1, [0.2053333333333333, [2, 3]]],
[2, [0.152, [4]]],
[3, [0.152, [1, 4, 5]]],
[4, [0.36533333333333334, []]],
[5, [0.1253333333333333, [1, 4]]]]
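For reference (these constants are inferred from the numbers above, not stated in the question): with beta = 0.8 and N = 5, each iteration computes, per node i,

value_i = beta * sum(value_j / outdeg_j for j in in_neighbors(i)) + (1 - beta) / N + (1 - S) / N

where S is the total rank after the first two terms (0.84 here), and the final (1 - S) / N term redistributes the mass lost through the dangling node 4, which has no outgoing links. For node 1: 0.8 * 0.2 / 3 + 0.8 * 0.2 / 2 + 0.04 + 0.032 = 0.20533..., matching the output above.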

Related

How do I un-shuffle a list back to its original form

How would I undo the shuffle I have done on alist and bring it back to its original sequence:
[1, 2, 3, 4]
import random
alist = [1, 2, 3, 4]
random.shuffle(alist) # alist is randomly shuffled
I just took this from the accepted answer to the question "A good way to shuffle and then unshuffle a python list" and made a small change to it. It works perfectly; please refer to trincot's and canton7's answers for more information, they are very instructive. (The trick works because getperm seeds random with sum(l), which shuffling does not change, so the exact same permutation is derived before and after.)
import random

def getperm(l):
    seed = sum(l)
    random.seed(seed)
    perm = list(range(len(l)))
    random.shuffle(perm)
    random.seed()  # optional, in order to not impact other code based on random
    return perm

def shuffle(l):                   # [1, 2, 3, 4]
    perm = getperm(l)             # [3, 2, 1, 0]
    l[:] = [l[j] for j in perm]   # [4, 3, 2, 1]

def unshuffle(l):                 # [4, 3, 2, 1]
    perm = getperm(l)             # [3, 2, 1, 0]
    res = [None] * len(l)         # [None, None, None, None]
    for i, j in enumerate(perm):
        res[j] = l[i]
    l[:] = res                    # [1, 2, 3, 4]

alist = [1, 2, 3, 4]
print(alist)     # [1, 2, 3, 4]
shuffle(alist)
print(alist)     # shuffled, [4, 3, 2, 1]
unshuffle(alist)
print(alist)     # the original, [1, 2, 3, 4]
random.shuffle shuffles the input sequence in-place. To get back the original list, you need to keep a copy of it.
# make a copy
temp = alist[:]
# shuffle
random.shuffle(alist)
# alist is now shuffled in-place
# restore from the copy
alist = temp
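One caveat (my note, not from the original answer): alist = temp only rebinds the local name, which is enough here, but if other references to the shuffled list should also see the restoration, copy back in place instead:

alist[:] = temp  # restore the original contents inside the same list object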

How to generate permutations by decreasing cycles?

Here are two related SO questions 1 2 that helped me formulate my preliminary solution.
The reason for wanting to do this is to feed permutations by edit distance into a Damerau-Levenshtein NFA; the number of permutations grows fast, so it's a good idea to delay (N-C) cycle N permutations candidates until (N-C) iterations of the NFA.
I've only studied engineering math up to Differential Equations and Discrete Mathematics, so I lack the foundation to approach this task from a formal perspective. If anyone can provide reference materials to help me understand this problem properly, I would appreciate that!
Through brief empirical analysis, I've noticed that I can generate the swaps for all C-cycle permutations of N elements with this procedure:
Generate all 2-combinations of N elements (combs)
Subdivide combs into arrays where the smallest element of each 2-combination is the same (ncombs)
Generate the cartesian products of the (N-C)-combinations of ncombs (pcombs)
Sum pcombs to get a list of the swaps that will generate all C-cycle permutations of N elements (swaps)
The code is here.
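The link may not render here, so below is my attempted reconstruction of that four-step procedure (the function and variable names are mine, not taken from the linked code). For N = 4, C = 2 it produces 11 swap lists, which matches the count of 2-cycle permutations of 4 elements:

from itertools import combinations, product

def swaps_for_cycle_count(N, C):
    # Hypothetical reconstruction of the procedure described above.
    pairs = list(combinations(range(N), 2))  # step 1: all 2-combinations
    # step 2: group the pairs by their smallest element
    ncombs = [[p for p in pairs if p[0] == i] for i in range(N - 1)]
    swaps = []
    # steps 3-4: cartesian product over each (N-C)-combination of groups,
    # concatenated into one list of swap sequences
    for groups in combinations(ncombs, N - C):
        swaps.extend(product(*groups))
    return swaps

print(len(swaps_for_cycle_count(4, 2)))  # 11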
My Python is a bit rusty, so helpful advice about the code is appreciated (I have the feeling that lines 17, 20, and 21 should be combined. I'm not sure if I should be making lists of the results of itertools.(combinations|product). I don't know why line 10 can't be ncombs += ... instead of ncombs.append(...)).
My primary question is how to solve this problem properly. I did my due diligence and found a working solution, but I am sure there's a better way. I've also only verified my solution for N=3 and N=4; is it really correct?
The ideal solution would be functionally identical to Heap's algorithm, except it would generate the permutations in decreasing cycle order (ordered by the minimum number of swaps needed to generate each permutation, increasing).
This is far from Heap's efficiency, but it does produce only the necessary cycle combinations restricted by the desired number of cycles, k, in the permutation. We use the partitions of k to create all combinations of cycles for each partition. Enumerating the actual permutations is just a cartesian product of applying each cycle n-1 times, where n is the cycle length.
Recursive Python 3 code:
from math import ceil

def partitions(N, K, high=float('inf')):
    if K == 1:
        return [[N]]
    result = []
    low = ceil(N / K)
    high = min(high, N - K + 1)
    for k in range(high, low - 1, -1):
        for sfx in partitions(N - k, K - 1, k):
            result.append([k] + sfx)
    return result

print("partitions(10, 3):\n%s\n" % partitions(10, 3))

def combs(ns, subs):
    def g(i, _subs):
        if i == len(ns):
            return [tuple(tuple(x) for x in _subs)]
        res = []
        cardinalities = set()
        def h(j):
            temp = [x[:] for x in _subs]
            temp[j].append(ns[i])
            res.extend(g(i + 1, temp))
        for j in range(len(subs)):
            if not _subs[j] and not subs[j] in cardinalities:
                h(j)
                cardinalities.add(subs[j])
            elif _subs[j] and len(_subs[j]) < subs[j]:
                h(j)
        return res
    _subs = [[] for x in subs]
    return g(0, _subs)

A = [1, 2, 3, 4]
ns = [2, 2]
print("combs(%s, %s):\n%s\n" % (A, ns, combs(A, ns)))

A = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
ns = [3, 3, 3, 3]
print("num combs(%s, %s):\n%s\n" % (A, ns, len(combs(A, ns))))

def apply_cycle(A, cycle):
    n = len(cycle)
    last = A[cycle[n - 1]]
    for i in range(n - 1, 0, -1):
        A[cycle[i]] = A[cycle[i - 1]]
    A[cycle[0]] = last

def permutations_by_cycle_count(n, num_cycles):
    arr = [x for x in range(n)]
    cycle_combs = []
    for partition in partitions(n, num_cycles):
        cycle_combs.extend(combs(arr, partition))
    result = {}
    def f(A, cycle_comb, i):
        if i == len(cycle_comb):
            result[cycle_comb].append(A)
            return
        if len(cycle_comb[i]) == 1:
            f(A[:], cycle_comb, i + 1)
        for k in range(1, len(cycle_comb[i])):
            apply_cycle(A, cycle_comb[i])
            f(A[:], cycle_comb, i + 1)
        apply_cycle(A, cycle_comb[i])
    for cycle_comb in cycle_combs:
        result[cycle_comb] = []
        f(arr, cycle_comb, 0)
    return result

result = permutations_by_cycle_count(4, 2)
print("permutations_by_cycle_count(4, 2):\n")
for e in result:
    print("%s: %s\n" % (e, result[e]))
Output:
partitions(10, 3):
[[8, 1, 1], [7, 2, 1], [6, 3, 1], [6, 2, 2], [5, 4, 1], [5, 3, 2], [4, 4, 2], [4, 3, 3]]
# These are the cycle combinations
combs([1, 2, 3, 4], [2, 2]):
[((1, 2), (3, 4)), ((1, 3), (2, 4)), ((1, 4), (2, 3))]
num combs([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], [3, 3, 3, 3]):
15400
permutations_by_cycle_count(4, 2):
((0, 1, 2), (3,)): [[2, 0, 1, 3], [1, 2, 0, 3]]
((0, 1, 3), (2,)): [[3, 0, 2, 1], [1, 3, 2, 0]]
((0, 2, 3), (1,)): [[3, 1, 0, 2], [2, 1, 3, 0]]
((1, 2, 3), (0,)): [[0, 3, 1, 2], [0, 2, 3, 1]]
((0, 1), (2, 3)): [[1, 0, 3, 2]]
((0, 2), (1, 3)): [[2, 3, 0, 1]]
((0, 3), (1, 2)): [[3, 2, 1, 0]]
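As a sanity check (my addition, not part of the original answer): the totals above agree with the unsigned Stirling numbers of the first kind, which count permutations of n elements with exactly k cycles. Here c(4, 2) = 11, and the output lists 4 * 2 + 3 * 1 = 11 permutations:

assert sum(len(v) for v in permutations_by_cycle_count(4, 2).values()) == 11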

How to fix tribonacci series function when using generators

Following is my approach to returning the n-th element of the Tribonacci series:
def tri(n, seq=[1, 1, 1]):
    for i in range(n - 2):
        seq = seq[1:] + [sum(seq)]
    return seq[-1]
I get the correct result when calling it directly and printing:
print(tri(10))
Output : 193
However, when using a generator (on repl.it), I get the error: can only concatenate tuple (not "list") to tuple.
I am using the generator below:
def tri_generator():
    for i in range(1000):
        yield (i, (1, 1, 1))
        yield (i, (1, 0, 1))
        yield (i, (1, 2, 3))
Not sure what I am missing. Any help is appreciated.
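For context, the error message is consistent with one of the tuples above being passed into tri as seq; a minimal reproduction (my illustration, not from the original post):

seq = (1, 1, 1)              # a tuple, as yielded by tri_generator
# seq[1:] + [sum(seq)]       # TypeError: can only concatenate tuple (not "list") to tuple
seq = list(seq)              # converting to a list first avoids the error
print(seq[1:] + [sum(seq)])  # [1, 1, 3]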
Here's a simple generator (you can clean up the code as you may like):
def tri_generator():
    i = 0
    seq = [1, 1, 1]
    while True:
        seq = [seq[1], seq[2], seq[0] + seq[1] + seq[2]]
        yield i, seq
        i += 1

n = 10
xx = tri_generator()
for i in range(n - 2):
    print(next(xx))
## Output:
## (0, [1, 1, 3])
## (1, [1, 3, 5])
## (2, [3, 5, 9])
## (3, [5, 9, 17])
## (4, [9, 17, 31])
## (5, [17, 31, 57])
## (6, [31, 57, 105])
## (7, [57, 105, 193])

How do you update a local class variable in Python in a backtracking loop?

I am trying to append permutations of a list of integers to a local variable in Python, but I end up appending a single permutation several times. The code prints all the results correctly, but I am not sure why my class variable is not updated correctly.
Any help would be extremely appreciated!!
I am working on a LeetCode problem but can't output the right results. I am trying to update my array, myVals, within a backtracking loop, but it stores just one permutation repeatedly, even though I can print all the permutations as they are generated.
class Solution:
    def __init__(self):
        self.myVals = []

    def add(self, x):
        self.myVals.append(x)

    def permuteArray(self, nums, l, r):
        if l == r:
            print(nums)
            self.add(nums)
        else:
            for i in range(l, r + 1):
                nums[l], nums[i] = nums[i], nums[l]
                self.permuteArray(nums, 1 + l, r)
                nums[l], nums[i] = nums[i], nums[l]

    def permute(self, nums):
        """
        :type nums: List[int]
        :rtype: List[List[int]]
        """
        numCount = len(nums)
        start = 0
        self.permuteArray(nums, start, numCount - 1)
        print(self.myVals)

nums = [1, 2, 3]
driver = Solution()
result = driver.permute(nums)
Expected Results:
[[1, 2, 3], [1, 3, 2], [2, 1, 3], [2, 3, 1], [3, 2, 1], [3, 1, 2]]
Actual Results:
[[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
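For context (my note; the question itself quotes no fix): self.add(nums) appends a reference to the one list object that the backtracking keeps mutating, and the final swap-backs return it to [1, 2, 3], so myVals ends up holding six references to that same list. Appending a snapshot instead records each permutation; a one-line sketch of that change inside permuteArray:

self.add(nums[:])  # append a copy; nums[:] freezes the current ordering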

How to process different row in tensor based on the first column value in tensorflow

Let's say I have a 4-by-3 tensor:
sample = [[10, 15, 25], [1, 2, 3], [4, 4, 10], [5, 9, 8]]
I would like to return another tensor of shape 4, [r1, r2, r3, r4], where r is equal to tf.reduce_sum(row) if row[0] is less than 5, or to tf.reduce_mean(row) if row[0] is greater than or equal to 5.
output:
output = [16.67, 6, 18, 7.33]
I'm not adept at TensorFlow; please assist me with how to achieve the above in Python 3 without a for loop.
Thank you.
UPDATES:
So I've tried to adapt the answer given by Onyambu to include two samples in the function, but it gave me an error in all instances.
Here is the answer for the first case:
def f(x):
    c = tf.constant(5, tf.float32)
    def fun1():
        return tf.reduce_sum(x)
    def fun2():
        return tf.reduce_mean(x)
    return tf.cond(tf.less(x[0], c), fun1, fun2)

a = tf.map_fn(f, tf.constant(sample, tf.float32))
The above works well.
For two samples:
sample1 = [[10, 15, 25], [1, 2, 3], [4, 4, 10], [5, 9, 8]]
sample2 = [[0, 15, 25], [1, 2, 3], [0, 4, 10], [1, 9, 8]]

def f2(x1, x2):
    c = tf.constant(1, tf.float32)
    def fun1():
        return tf.reduce_sum(x1[:, 0] - x2[:, 0])
    def fun2():
        return tf.reduce_mean(x1 - x2)
    return tf.cond(tf.less(x2[0], c), fun1, fun2)

a = tf.map_fn(f2, tf.constant(sample1, tf.float32), tf.constant(sample2, tf.float32))
The adaptation gives errors, but the principle is simple:
calculate the tf.reduce_sum of sample1[:,0] - sample2[:,0] if row[0] is less than 1
calculate the tf.reduce_mean of sample1 - sample2 if row[0] is greater than or equal to 1
Thank you for your assistance in advance!
import tensorflow as tf

def f(x):
    y = tf.constant(5, tf.float32)
    def fun1():
        return tf.reduce_sum(x)
    def fun2():
        return tf.reduce_mean(x)
    return tf.cond(tf.less(x[0], y), fun1, fun2)

a = tf.map_fn(f, tf.constant(sample, tf.float32))
with tf.Session() as sess:
    print(sess.run(a))
[16.666666 6. 18. 7.3333335]
If you want to shorten it:
y = tf.constant(5, tf.float32)
f = lambda x: tf.cond(tf.less(x[0], y), lambda: tf.reduce_sum(x), lambda: tf.reduce_mean(x))
a = tf.map_fn(f, tf.constant(sample, tf.float32))
with tf.Session() as sess:
    print(sess.run(a))
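As for the failing two-sample adaptation in the update, the error most likely comes from the call itself: tf.map_fn takes a single elems argument, so two tensors have to be packed into a tuple, and dtype must be given because the per-row output (a scalar) has a different structure than the input pair. A sketch under those assumptions, using the same TF1-style session as above; the per-row branch bodies are my reading of the intent, not the original answerer's code:

import tensorflow as tf

sample1 = [[10, 15, 25], [1, 2, 3], [4, 4, 10], [5, 9, 8]]
sample2 = [[0, 15, 25], [1, 2, 3], [0, 4, 10], [1, 9, 8]]

def f2(rows):
    x1, x2 = rows  # one row from each sample
    c = tf.constant(1, tf.float32)
    def fun1():
        return tf.reduce_sum(x1 - x2)   # sum branch, applied per row (my assumption)
    def fun2():
        return tf.reduce_mean(x1 - x2)  # mean branch, applied per row
    return tf.cond(tf.less(x2[0], c), fun1, fun2)

a = tf.map_fn(f2,
              (tf.constant(sample1, tf.float32), tf.constant(sample2, tf.float32)),
              dtype=tf.float32)
with tf.Session() as sess:
    print(sess.run(a))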
