Combining two different lines in Python Spark RDD

I am having a small problem dealing with a Python Spark RDD. My RDD looks like:
old_rdd = [(A1, Vector(V1)), (A2, Vector(V2)), (A3, Vector(V3)), ...]
I want to use flatMap to get a new RDD like:
new_rdd = [((A1, A2), (V1, V2)), ((A1, A3), (V1, V3))] and so on.
The problem is that flatMap flattens the tuples away, producing something like [(A1, V1, A2, V2), ...]. Do you have any alternative suggestions, with or without flatMap()? Thank you in advance.

This is related to Explicit sort in Cartesian transformation in Scala Spark. However, I will suppose that you have already cleaned up the RDD for duplicates, I will assume that the ids have some simple pattern to parse and identify, and for simplicity I will work with lists instead of Vectors.
old_rdd = sc.parallelize([(1, [1, -2]), (2, [5, 7]), (3, [8, 23]), (4, [-1, 90])])
# It will provide all the permutations, but combinations are a subset of the permutations, so we need to filter.
combined_rdd = old_rdd.cartesian(old_rdd)
combinations = combined_rdd.filter(lambda pair: pair[0][0] < pair[1][0])
combinations.collect()
# The output will be...
# -----------------------------
# [((1, [1, -2]), (2, [5, 7])),
# ((1, [1, -2]), (3, [8, 23])),
# ((1, [1, -2]), (4, [-1, 90])),
# ((2, [5, 7]), (3, [8, 23])),
# ((2, [5, 7]), (4, [-1, 90])),
# ((3, [8, 23]), (4, [-1, 90]))]
# Now we need to set the tuple as you want
combinations = combinations.map(lambda pair: ((pair[0][0], pair[1][0]), (pair[0][1], pair[1][1]))).collect()
# The output will be...
# ----------------------
# [((1, 2), ([1, -2], [5, 7])),
# ((1, 3), ([1, -2], [8, 23])),
# ((1, 4), ([1, -2], [-1, 90])),
# ((2, 3), ([5, 7], [8, 23])),
# ((2, 4), ([5, 7], [-1, 90])),
# ((3, 4), ([8, 23], [-1, 90]))]
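Since the question asks about flatMap: one alternative (a sketch, assuming the RDD is small enough to collect to the driver) is to broadcast the collected (id, vector) pairs and let flatMap emit one combined record per partner with a larger id, which avoids materializing and filtering the full cartesian product:
id_vectors = sc.broadcast(old_rdd.collect())

def pair_with_larger_ids(item):
    key, vec = item
    # pair this record with every record whose id sorts after it
    return [((key, other_key), (vec, other_vec))
            for other_key, other_vec in id_vectors.value
            if other_key > key]

combinations = old_rdd.flatMap(pair_with_larger_ids)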

Related

Find integer consecutive run starting indices in a list

Example:
nums = [1,2,3,5,10,9,8,9,10,11,7,8,7]
I am trying to find the first index of each consecutive run, in either the -1 or +1 direction, where the run length is >= 3.
So the desired output from the above nums would be:
[0,4,6,7]
I have tried
import more_itertools
grplist = [list(group) for group in more_itertools.consecutive_groups(nums)]
output: [[1, 2, 3], [5], [10], [9], [8, 9, 10, 11], [7, 8], [7]]
It returns nested lists, but it only seems to go in the +1 direction, and it does not return the starting indices.
from itertools import groupby
from operator import itemgetter
listindx = [list(j) for i, j in groupby(enumerate(nums), key=itemgetter(1))]
output: [[(0, 1)], [(1, 2)], [(2, 3)], [(3, 5)], [(4, 10)], [(5, 9)], [(6, 8)], [(7, 9)], [(8, 10)], [(9, 11)], [(10, 7)], [(11, 8)], [(12, 7)]]
This does not check for consecutive runs but it does return indices.
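One approach that reproduces the desired output (a sketch, assuming overlapping windows count separately: note that index 7 starts a run inside the longer run beginning at index 6) is to test every length-3 window for a constant step of +1 or -1:
nums = [1, 2, 3, 5, 10, 9, 8, 9, 10, 11, 7, 8, 7]
starts = [
    i
    for i in range(len(nums) - 2)
    # two equal steps of +1 or -1 mean a monotone run of length >= 3 at i
    if nums[i + 1] - nums[i] == nums[i + 2] - nums[i + 1]
    and abs(nums[i + 1] - nums[i]) == 1
]
print(starts)  # [0, 4, 6, 7]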

Transforming an adjacency matrix into an adjacency list

For the given adjacency matrix below:
[[0, 1, 0, 4], [1, 0, 3, 1], [0, 3, 0, 2], [4, 1, 2, 0]]
How would I go about converting it into an adjacency list like the one below, where each tuple is (node, weight)?
[[(1, 1), (3, 4)], [(0, 1), (2, 3), (3, 1)], [(1, 3), (3, 2)], [(0, 4), (1, 1), (2, 2)]]
thanks a lot!
Use a double comprehension:
a = [[0, 1, 0, 4], [1, 0, 3, 1], [0, 3, 0, 2], [4, 1, 2, 0]]
print([[(i, x) for i, x in enumerate(y) if x != 0] for y in a])
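Here y is one row of the matrix (one node), and enumerate(y) yields each column index i together with its weight x; entries with x == 0 are skipped because a zero weight means no edge, which produces exactly the requested list of (node, weight) tuples.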

Creating a column of edges

I need to combine the uid in the uid column with each of the uids in the list in the friends column, as shown in the following example:
Given a pandas.DataFrame object A:
uid friends
0 1 [10, 2, 1, 5]
1 2 [1, 2]
2 3 [5, 4]
3 4 [10, 5]
4 5 [1, 2, 5]
the desired output is:
uid friends in_edges
0 1 [10, 2, 1, 5] [(1, 10), (1, 2), (1, 1), (1, 5)]
1 2 [1, 2] [(2, 1), (2, 2)]
2 3 [5, 4] [(3, 5), (3, 4)]
3 4 [10, 5] [(4, 10), (4, 5)]
4 5 [1, 2, 5] [(5, 1), (5, 2), (5, 5)]
I use the following code to achieve this outcome:
import numpy as np
import pandas as pd
A = pd.DataFrame(dict(uid=[1, 2, 3, 4, 5], friends=[[10, 2, 1, 5], [1, 2], [5, 4], [10, 5], [1, 2, 5]]))
A.loc[:, 'in_edges'] = A.loc[:, 'uid'].apply(lambda uid: [(uid, f) for f in A.loc[A.loc[:, 'uid']==uid, 'friends'].values[0]])
but the A.loc[A.loc[:, 'uid']==uid, 'friends'] part looks kind of cumbersome to me, so I wondered if there is an easier way to accomplish this task?
Thanks in advance.
You can use .apply() with the axis=1 parameter:
A["in_edges"] = A[["uid", "friends"]].apply(
    lambda x: [(x["uid"], f) for f in x["friends"]], axis=1
)
print(A)
Prints:
uid friends in_edges
0 1 [10, 2, 1, 5] [(1, 10), (1, 2), (1, 1), (1, 5)]
1 2 [1, 2] [(2, 1), (2, 2)]
2 3 [5, 4] [(3, 5), (3, 4)]
3 4 [10, 5] [(4, 10), (4, 5)]
4 5 [1, 2, 5] [(5, 1), (5, 2), (5, 5)]
Why not try itertools.product?
import itertools
A['in_edges'] = A.apply(lambda x: [*itertools.product([x['uid']], x['friends'])], axis=1)
A
Out[50]:
uid friends in_edges
0 1 [10, 2, 1, 5] [(1, 10), (1, 2), (1, 1), (1, 5)]
1 2 [1, 2] [(2, 1), (2, 2)]
2 3 [5, 4] [(3, 5), (3, 4)]
3 4 [10, 5] [(4, 10), (4, 5)]
4 5 [1, 2, 5] [(5, 1), (5, 2), (5, 5)]
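Another option (a sketch on my part, not from the original answers): because the two columns are already row-aligned, a plain list comprehension over zip builds the same column without the per-row apply overhead:
A["in_edges"] = [[(uid, f) for f in friends]
                 for uid, friends in zip(A["uid"], A["friends"])]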

How to iterate python windowed() to last element?

According to the more_itertools.windowed specification, you can do:
list(windowed(seq=[1, 2, 3, 4], n=2, step=1))
>>> [(1, 2), (2, 3), (3, 4)]
But what if I want to run it all to the end? Is it possible to get:
>>> [(1, 2), (2, 3), (3, 4), (4, None)]
A workaround, though not the best solution, is to append None to the sequence:
list(windowed(seq=[1, 2, 3, 4, None], n=2, step=1))
I believe you can do this programmatically based on the step= value, which I refer to as win_step in the following code. I also removed hardcoding where possible to make it easier to test various sequence_list, win_width, and win_step data sets:
from more_itertools import windowed

sequence_list = [1, 2, 3, 4]
win_width = 2
win_step = 1

# pad the sequence with one None per step so the final window can complete
none_list = []
for i in range(win_step):
    none_list.append(None)
sequence_list.extend(none_list)

tuple_list = list(windowed(seq=sequence_list, n=win_width, step=win_step))
print('tuple_list:', tuple_list)
Here are my results based on your original question's data set, and on the current data set:
For original, where:
sequence_list = [1, 2, 3, 4, 5, 6]
win_width = 3
win_step = 2
The result is:
tuple_list: [(1, 2, 3), (3, 4, 5), (5, 6, None), (None, None, None)]
And for the present data set, where:
sequence_list = [1, 2, 3, 4]
win_width = 2
win_step = 1
The result is:
tuple_list: [(1, 2), (2, 3), (3, 4), (4, None)]
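For the n=2, step=1 case in the question specifically, itertools.zip_longest gives the same result without modifying the input list (a sketch of an alternative, not from the original answer):
from itertools import zip_longest

seq = [1, 2, 3, 4]
# zip the sequence against itself shifted by one; zip_longest pads the
# shorter side with None, producing the trailing (4, None) window
print(list(zip_longest(seq, seq[1:])))  # [(1, 2), (2, 3), (3, 4), (4, None)]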

Networkx: How to find all unique paths of max given length in an unweighted, undirected, unlabeled, connected graph?

Let's say I have the following unweighted (all edge weights = 1), undirected, unlabeled, connected graph, and I want to find all unique paths up to a given maximum length, where nodes cannot appear twice in a path. I cannot find a routine that does this in networkx at the moment.
Does anyone know if any such thing exists?
Or what could be a good solution for this problem?
import networkx as nx
G = nx.Graph()
G.add_nodes_from([1, 2, 3, 4, 5, 6, 7, 8, 9])
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (2, 4), (6, 9), (8, 9), (9, 6)])
The example graph looks like this (figure not included here).
Let's say I require max length = 2; I would like this output:
[1 2]
[2 3]
[2 4]
[3 4]
[4 5]
[5 6]
[6 7]
[7 8]
[8 9]
[6 9]
[1 2 3]
[1 2 4]
[2 3 4]
[2 4 5]
[3 4 5]
[4 5 6]
[5 6 7]
[5 6 9]
[6 7 9]
[6 7 8]
[7 8 9]
[6 9 8]
EDIT: I'm looking for a better solution than using itertools to generate all combinations of required_max_path_length-1 nodes and then checking for connectivity with G.has_edge(node_1, node_2) within each combination group, or something similar, which seems like a very bad solution.
So now I'm doing this, thanks to @user3483203, and it yields the expected output. The itertools usage could be avoided, but I don't mind it in my specific case.
I still feel like this would scale a bit worse than some other approach for larger graphs, though; I will change the accepted answer if someone finds a better solution.
import networkx as nx
import itertools
required_max_path_length = 2 # (inferior or equal to)
G = nx.Graph()
G.add_nodes_from([1, 2, 3, 4, 5, 6, 7, 8, 9])
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (2, 4), (6, 9), (8, 9), (9, 6)])
all_paths = []
nodes_combs = itertools.combinations(G.nodes, 2)
for source, target in nodes_combs:
    paths = nx.all_simple_paths(G, source=source, target=target, cutoff=required_max_path_length)
    for path in paths:
        # keep each undirected path only once, whichever direction it was found in
        if path not in all_paths and path[::-1] not in all_paths:
            all_paths.append(path)
for path in all_paths:
    print(path)
In case you want the paths as lists of edges you can do:
for path in map(nx.utils.pairwise, all_paths):
    print(list(path))
And you will get:
[(1, 2)]
[(1, 2), (2, 3)]
[(1, 2), (2, 4)]
[(2, 3)]
[(2, 3), (3, 4)]
[(2, 4)]
[(2, 4), (4, 5)]
[(3, 4)]
[(3, 4), (4, 5)]
[(4, 5)]
[(4, 5), (5, 6)]
[(5, 6)]
[(5, 6), (6, 7)]
[(5, 6), (6, 9)]
[(6, 7)]
[(6, 7), (7, 8)]
[(6, 9), (9, 8)]
[(6, 9)]
[(7, 8)]
[(6, 7), (7, 9)]
[(7, 8), (8, 9)]
[(8, 9)]
The following code should solve your task, but it outputs more paths than you have given (e.g. both [1, 2] and [2, 1]):
def find_all_simple_paths(graph, cutoff):
    if cutoff == 0:
        return [[node] for node in graph]
    else:
        all_paths = []
        current_paths = [[node] for node in graph]
        # If you want to include paths of length 0
        # all_paths.extend(current_paths)
        for _ in range(min(cutoff, len(graph))):
            next_paths = []
            for path in current_paths:
                # extend each frontier path by every neighbor not already on it
                for neighbor in graph.neighbors(path[-1]):
                    if neighbor not in path:
                        new_path = path[:] + [neighbor]
                        next_paths.append(new_path)
                        all_paths.append(new_path)
            current_paths = next_paths
        return all_paths

find_all_simple_paths(G, 2)
Output
[[1, 2],
[2, 1],
[2, 3],
[2, 4],
[3, 2],
[3, 4],
[4, 3],
[4, 5],
[4, 2],
[5, 4],
[5, 6],
[6, 5],
[6, 7],
[6, 9],
[7, 6],
[7, 8],
[8, 7],
[8, 9],
[9, 6],
[9, 8],
[1, 2, 3],
[1, 2, 4],
[2, 3, 4],
[2, 4, 3],
[2, 4, 5],
[3, 2, 1],
[3, 2, 4],
[3, 4, 5],
[3, 4, 2],
[4, 3, 2],
[4, 5, 6],
[4, 2, 1],
[4, 2, 3],
[5, 4, 3],
[5, 4, 2],
[5, 6, 7],
[5, 6, 9],
[6, 5, 4],
[6, 7, 8],
[6, 9, 8],
[7, 6, 5],
[7, 6, 9],
[7, 8, 9],
[8, 7, 6],
[8, 9, 6],
[9, 6, 5],
[9, 6, 7],
[9, 8, 7]]
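Since every path and its reverse both appear in this output, one way to keep a single copy of each undirected path (a sketch on my part, not part of the original answer) is to keep only the direction whose first node is smaller than its last:
# every returned path has two distinct endpoints, so exactly one of the
# two directions of each path passes this test
unique_paths = [p for p in find_all_simple_paths(G, 2) if p[0] < p[-1]]
print(unique_paths)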
