GraphFrames: Merge edge nodes with similar column values - apache-spark

tl;dr: How do you simplify a graph, merging leaf ("edge") nodes that share the same name value?
I have a graph defined as follows:
import graphframes
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
vertices = spark.createDataFrame([
('1', 'foo', '1'),
('2', 'bar', '2'),
('3', 'bar', '3'),
('4', 'bar', '5'),
('5', 'baz', '9'),
('6', 'blah', '1'),
('7', 'blah', '2'),
('8', 'blah', '3')
], ['id', 'name', 'value'])
edges = spark.createDataFrame([
('1', '2'),
('1', '3'),
('1', '4'),
('1', '5'),
('5', '6'),
('5', '7'),
('5', '8')
], ['src', 'dst'])
f = graphframes.GraphFrame(vertices, edges)
Which produces a graph that looks like this (where the numbers represent the vertex ID):
[graph image: vertex 1 (foo) points to 2, 3 and 4 (bar) and to 5 (baz); vertex 5 points to 6, 7 and 8 (blah)]
Starting from vertex ID 1, I'd like to simplify the graph so that nodes with the same name value are coalesced into a single node. The resulting graph would look something like this:
[graph image: foo points to bar and baz; baz points to blah]
Notice how we only have one foo (ID 1), one bar (ID 2), one baz (ID 5) and one blah (ID 6). The value of each vertex is irrelevant; it is only there to show that each vertex is unique.
I attempted to implement a solution, but it is hacky and extremely inefficient, and I'm certain there is a better way (I also don't think it works):
f = graphframes.GraphFrame(vertices, edges)
# Get the out degrees for our nodes. Nodes that do not appear in
# this dataframe have zero out degrees.
outs = f.outDegrees
# Merge this with our nodes.
vertices = f.vertices
vertices = f.vertices.join(outs, outs.id == vertices.id, 'left').select(vertices.id, 'name', 'value', 'outDegree')
vertices.show()
# Create a new graph with our out degree nodes.
f = graphframes.GraphFrame(vertices, edges)
# Find paths to all edge vertices from our vertex ID = 1
# Can we make this one operation instead of two??? What if we have more than two hops?
one_hop = f.find('(a)-[e]->(b)').filter('b.outDegree is null').filter('a.id == "1"')
one_hop.show()
two_hop = f.find('(a)-[e1]->(b); (b)-[e2]->(c)').filter('c.outDegree is null').filter('a.id == "1"')
two_hop.show()
# Super ugly, but union the vertices from the `one_hop` and `two_hop` above, and unique
# on the name.
vertices = one_hop.select('a.*').union(one_hop.select('b.*'))
vertices = vertices.union(two_hop.select('a.*').union(two_hop.select('b.*').union(two_hop.select('c.*'))))
vertices = vertices.dropDuplicates(['name'])
vertices.show()
# Do the same for the edges
edges = two_hop.select('e1.*').union(two_hop.select('e2.*')).union(one_hop.select('e.*')).distinct()
# We need to ensure that we have the respective nodes from our edges. We do this by
# ensuring the referenced vertex ID appears in our `vertices` in both the `src` and the
# `dst` columns - this does NOT seem to work as I'd expect!
edges = edges.join(vertices, vertices.id == edges.src, "left").select("src", "dst")
edges = edges.join(vertices, vertices.id == edges.dst, "left").select("src", "dst")
edges.show()
Is there an easier way to remove nodes (and their corresponding edges) so that edge nodes are uniqued on their name?

Why don't you simply treat the name column as the new id?
import graphframes
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
vertices = spark.createDataFrame([
('1', 'foo', '1'),
('2', 'bar', '2'),
('3', 'bar', '3'),
('4', 'bar', '5'),
('5', 'baz', '9'),
('6', 'blah', '1'),
('7', 'blah', '2'),
('8', 'blah', '3')
], ['id', 'name', 'value'])
edges = spark.createDataFrame([
('1', '2'),
('1', '3'),
('1', '4'),
('1', '5'),
('5', '6'),
('5', '7'),
('5', '8')
], ['src', 'dst'])
# create a vertex dataframe with a single column: the distinct names as ids
new_vertices = vertices.select(vertices.name.alias('id')).distinct()
# replace the src ids with the name column
new_edges = edges.join(vertices, edges.src == vertices.id, 'left')
new_edges = new_edges.select(new_edges.dst, new_edges.name.alias('src'))
# replace the dst ids with the name column
new_edges = new_edges.join(vertices, new_edges.dst == vertices.id, 'left')
new_edges = new_edges.select(new_edges.src, new_edges.name.alias('dst'))
# drop duplicate edges
new_edges = new_edges.dropDuplicates(['src', 'dst'])
new_edges.show()
new_vertices.show()
f = graphframes.GraphFrame(new_vertices, new_edges)
Output:
+---+----+
|src| dst|
+---+----+
|foo| baz|
|foo| bar|
|baz|blah|
+---+----+
+----+
| id|
+----+
|blah|
| bar|
| foo|
| baz|
+----+
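If you also need to carry vertex attributes onto the merged vertices, one option is to aggregate the original vertices by name before building the GraphFrame. A minimal sketch, not part of the original answer, assuming any representative value per name is acceptable (the question says the value is irrelevant):
from pyspark.sql import functions as F
# keep one arbitrary representative value per merged vertex
new_vertices = (vertices
                .groupBy('name')
                .agg(F.first('value').alias('value'))
                .withColumnRenamed('name', 'id'))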

Related

creating a list of lists from text file where the new list is based on a condition [duplicate]

I have a text file that has lines in following order:
1 id:0 e1:"a" e2:"b"
0 id:0 e1:"4" e2:"c"
0 id:1 e1:"6" e2:"d"
2 id:2 e1:"8" e2:"f"
2 id:2 e1:"9" e2:"f"
2 id:2 e1:"d" e2:"k"
and I have to extract a list of lists containing elements (e1,e2) with id determining the index of the outer list and inner list following the order of the lines. So in the above case my output will be
[[("a","b"),("4","c")],[("6","d")],[("8","f"),("9","f"),("d","k")]]
The problem for me is that, to know where a new inner list begins, I need to check whether the id value has changed. Each id does not have a fixed number of elements: for example, id:0 has 2, id:1 has 1 and id:2 has 3. Is there an efficient way to check this condition on the next line while building the list?
You can use itertools.groupby() for the job:
import itertools

def split_by(items, key=None, processing=None, container=list):
    for key_value, grouping in itertools.groupby(items, key):
        if processing:
            grouping = (processing(group) for group in grouping)
        if container:
            grouping = container(grouping)
        yield grouping
to be called as:
from operator import itemgetter
list(split_by(items, itemgetter(0), itemgetter(slice(1, None))))
The items can be easily generated from text above (assuming it is contained in the file data.txt):
def get_items():
    # with io.StringIO(text) as file_obj:  # to read from `text`
    with open(filename, 'r') as file_obj:  # to read from `filename`
        for line in file_obj:
            if line.strip():
                vals = line.replace('"', '').split()
                yield tuple(val.split(':')[1] for val in vals[1:])
Finally, to test all the pieces (where open(filename, 'r') in get_items() is replaced by io.StringIO(text)):
import io
import itertools
from operator import itemgetter
text = """
1 id:0 e1:"a" e2:"b"
0 id:0 e1:"4" e2:"c"
0 id:1 e1:"6" e2:"d"
2 id:2 e1:"8" e2:"f"
2 id:2 e1:"9" e2:"f"
2 id:2 e1:"d" e2:"k"
""".strip()
print(list(split_by(get_items(), itemgetter(0), itemgetter(slice(1, None)))))
# [[('a', 'b'), ('4', 'c')], [('6', 'd')], [('8', 'f'), ('9', 'f'), ('d', 'k')]]
This efficiently iterates through the input without unnecessary memory allocation. Note that itertools.groupby() only groups consecutive items, so this works because the input lines are already ordered by id.
No other packages are required.
Load and parse the file:
Beginning with a text file formatted as shown in the question:
# parse the text file into a list of dicts
with open('test.txt', 'r') as f:
    text = [line[2:].replace('"', '').strip().split() for line in f.readlines()]  # clean each line and split it into a list
text = [[v.split(':') for v in t] for t in text]  # split each value in the list into a list
d = [{v[0]: v[1] for v in t} for t in text]  # convert lists to dicts
# text will appear as:
[[['id', '0'], ['e1', 'a'], ['e2', 'b']],
[['id', '0'], ['e1', '4'], ['e2', 'c']],
[['id', '1'], ['e1', '6'], ['e2', 'd']],
[['id', '2'], ['e1', '8'], ['e2', 'f']],
[['id', '2'], ['e1', '9'], ['e2', 'f']],
[['id', '2'], ['e1', 'd'], ['e2', 'k']]]
# d appears as:
[{'id': '0', 'e1': 'a', 'e2': 'b'},
{'id': '0', 'e1': '4', 'e2': 'c'},
{'id': '1', 'e1': '6', 'e2': 'd'},
{'id': '2', 'e1': '8', 'e2': 'f'},
{'id': '2', 'e1': '9', 'e2': 'f'},
{'id': '2', 'e1': 'd', 'e2': 'k'}]
Parse the list of dicts to expected output
Use .get to determine if a key exists, and return some specified value, None in this case, if the key is nonexistent.
dict.get defaults to None, so this method never raises a KeyError.
If None is a value in the dictionary, then change the default value returned by .get.
test.get(v[0], 'something here')
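As a quick toy illustration of those defaults (demo is just a throwaway dict, not part of the original answer):
demo = {'0': [('a', 'b')]}
print(demo.get('0'))             # [('a', 'b')]
print(demo.get('1'))             # None
print(demo.get('1', 'missing'))  # 'missing'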
test = dict()
for r in d:
    v = list(r.values())  # dicts preserve insertion order, so v[0] is the id
    if test.get(v[0]) is None:
        test[v[0]] = [tuple(v[1:])]
    else:
        test[v[0]].append(tuple(v[1:]))
# test dict appears as:
{'0': [('a', 'b'), ('4', 'c')],
'1': [('6', 'd')],
'2': [('8', 'f'), ('9', 'f'), ('d', 'k')]}
# final output
final = list(test.values())
[[('a', 'b'), ('4', 'c')], [('6', 'd')], [('8', 'f'), ('9', 'f'), ('d', 'k')]]
Code updated and reduced:
In this case, text is a list of lists, and there's no need to convert it to the list of dicts d, as above.
For each list t in text, index [0] is always the key, and indices [1:] are the values.
with open('test.txt', 'r') as f:
    text = [line[2:].replace('"', '').strip().split() for line in f.readlines()]  # clean each line and split it into a list
text = [[v.split(':')[1] for v in t] for t in text]  # keep only the value at index 1 of each key:value pair
# text appears as:
[['0', 'a', 'b'],
['0', '4', 'c'],
['1', '6', 'd'],
['2', '8', 'f'],
['2', '9', 'f'],
['2', 'd', 'k']]
test = dict()
for t in text:
    if test.get(t[0]) is None:
        test[t[0]] = [tuple(t[1:])]
    else:
        test[t[0]].append(tuple(t[1:]))
final = list(test.values())
Using defaultdict
This saves a few lines of code. Using text as the list of lists from above:
from collections import defaultdict as dd

test = dd(list)
for t in text:
    test[t[0]].append(tuple(t[1:]))
final = list(test.values())
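For reference, dict.setdefault offers a middle ground that needs no import. A minimal sketch, not from the original answers, using the same text list of lists:
test = {}
for t in text:
    test.setdefault(t[0], []).append(tuple(t[1:]))
final = list(test.values())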

How to normalize the distribution in the tuples?

I tried to do some normalization in my code, and I have a list of inner lists:
a = [[('1', 0.03),
      ('2', 0.03),
      ('3', 0.06)],
     [('4', 0.03),
      ('5', 0.06),
      ('6', 0.06)],
     [('7', 0.07),
      ('8', 0.14),
      ('9', 0.07)]
    ]
I tried to normalize the distribution in each inner list to get list b:
b = [[('1', 0.25),
      ('2', 0.25),
      ('3', 0.50)],
     [('4', 0.20),
      ('5', 0.40),
      ('6', 0.40)],
     [('7', 0.25),
      ('8', 0.50),
      ('9', 0.25)]
    ]
And I tried:
for i in a:
    for n, (ee, ww) in enumerate(i):
        i[n] = (ee, ww / sum(ww))
But it fails with a TypeError, because ww is a single float and sum(ww) is not valid.
How do I get b in Python?
a = [[('1', 0.03), ('2', 0.03), ('3', 0.06)],
     [('4', 0.03), ('5', 0.06), ('6', 0.06)],
     [('7', 0.07), ('8', 0.14), ('9', 0.07)]]

for i in a:
    s = sum(v[1] for v in i)
    i[:] = [(v[0], v[1] / s) for v in i]

from pprint import pprint
pprint(a)
Prints:
[[('1', 0.25), ('2', 0.25), ('3', 0.5)],
[('4', 0.2), ('5', 0.4), ('6', 0.4)],
[('7', 0.25), ('8', 0.5), ('9', 0.25)]]
Note:
i[:] = [(v[0], v[1] / s) for v in i] replaces all values in list i with new values from the list comprehension.
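If you would rather leave a untouched, here is a minimal sketch (not from the original answer) of the same normalization that builds a new list b instead of mutating in place:
b = []
for group in a:
    s = sum(w for _, w in group)  # total weight of this inner list
    b.append([(k, w / s) for k, w in group])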

Creating custom combination from two lists

I am looking to use two lists:
L1 = ['a', 's', 'd']
L2 = [str(1), str(2)]
I need to create a third list:
L3 = [(a1, s1, d1), (a1, s1, d2), ... ]
L3 has tuples of size 3, where each tuple uses each element of L1 only once but can repeat elements of L2.
I.e. a tuple such as (a1, s2, d2) is allowed, but (a1, a2, d1) is not.
I am working with large L1 and L2, so the above example is only for illustration. I am not sure how to approach this problem. I have thought about using the itertools permutations and combinations functions, but I am not getting the list L3 above. One brute-force solution is to do something like:
L3 = list(itertools.combinations(list(itertools.product(L1, L2)), 3))
and then filter out tuples that reuse an element of L1, such as (('a', '1'), ('a', '2'), ('d', '2')), but for large inputs that filtering loop is not efficient.
Since it sounds like you want L1 to be the first elements of the tuples, I think we simply need to zip them, not iter-anything them. We only need to take the product of L2.
In [327]: [list(zip(L1, p)) for p in itertools.product(L2, repeat=len(L1))]
Out[327]:
[[('a', '1'), ('s', '1'), ('d', '1')],
[('a', '1'), ('s', '1'), ('d', '2')],
[('a', '1'), ('s', '2'), ('d', '1')],
[('a', '1'), ('s', '2'), ('d', '2')],
[('a', '2'), ('s', '1'), ('d', '1')],
[('a', '2'), ('s', '1'), ('d', '2')],
[('a', '2'), ('s', '2'), ('d', '1')],
[('a', '2'), ('s', '2'), ('d', '2')]]
where you can replace [ and ] with ( and ) to turn the listcomp into a genexp if you don't want to materialize the whole object at once.
If you want to merge your tuples' elements into one string, you could do that too:
In [338]: gen = (tuple(''.join(pair) for pair in zip(L1, p))
     ...:        for p in itertools.product(L2, repeat=len(L1)))
In [339]: for elem in gen:
     ...:     print(elem)
('a1', 's1', 'd1')
('a1', 's1', 'd2')
('a1', 's2', 'd1')
('a1', 's2', 'd2')
('a2', 's1', 'd1')
('a2', 's1', 'd2')
('a2', 's2', 'd1')
('a2', 's2', 'd2')
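For completeness, the same idea as a self-contained script; a sketch assuming the L1 and L2 from the question:
import itertools

L1 = ['a', 's', 'd']
L2 = ['1', '2']

# one tuple per way of assigning an element of L2 to each element of L1
L3 = [tuple(a + b for a, b in zip(L1, p))
      for p in itertools.product(L2, repeat=len(L1))]
print(L3)
# [('a1', 's1', 'd1'), ('a1', 's1', 'd2'), ..., ('a2', 's2', 'd2')]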

remove multiple rows from a array in python

array([
['192', '895'],
['14', '269'],
['1', '23'],
['1', '23'],
['50', '322'],
['19', '121'],
['17', '112'],
['12', '72'],
['2', '17'],
['5,250', '36,410'],
['2,546', '17,610'],
['882', '6,085'],
['571', '3,659'],
['500', '3,818'],
['458', '3,103'],
['151', '1,150'],
['45', '319'],
['44', '335'],
['30', '184']
])
How can I remove some of the rows and leave the array like:
Table3=array([
['192', '895'],
['14', '269'],
['1', '23'],
['50', '322'],
['17', '112'],
['12', '72'],
['2', '17'],
['5,250', '36,410'],
['882', '6,085'],
['571', '3,659'],
['500', '3,818'],
['458', '3,103'],
['45', '319'],
['44', '335'],
['30', '184']
])
I removed indexes 2, 4 and 6. I am not sure how I should do it. I have tried a few ways, but still can't get it to work.
It seems like you actually deleted indices 2 (or 3, those rows are identical), 5, 10 and 15, not 2, 4 and 6. To do this you can use np.delete: pass it a list of the indices you want to delete, and apply it along axis=0 (assuming the array is bound to arr):
import numpy as np
Table3 = np.delete(arr, [2, 5, 10, 15], axis=0)
>>> Table3
array([['192', '895'],
       ['14', '269'],
       ['1', '23'],
       ['50', '322'],
       ['17', '112'],
       ['12', '72'],
       ['2', '17'],
       ['5,250', '36,410'],
       ['882', '6,085'],
       ['571', '3,659'],
       ['500', '3,818'],
       ['458', '3,103'],
       ['45', '319'],
       ['44', '335'],
       ['30', '184']],
      dtype='<U6')
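If you prefer to avoid np.delete, an equivalent approach keeps rows through a boolean mask; a minimal sketch, again assuming the array is bound to arr:
import numpy as np

mask = np.ones(len(arr), dtype=bool)  # start with every row kept
mask[[2, 5, 10, 15]] = False          # mark the rows to drop
Table3 = arr[mask]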

Referencing a number inside a tuple

I want to arrange a list of tuples similar to the one below in descending order using the numbers:
data = [('ralph picked', ['nose', '4', 'apple', '30', 'winner', '3']),
('aaron popped', ['soda', '1', 'popcorn', '6', 'pill', '4',
'question', '29'])]
I would like to sort the nested list so that the outcome would look somewhat like:
data2 = [('ralph picked', ['apple', '30', 'nose', '4', 'winner', '3']),
('aaron popped', ['question', '29', 'popcorn', '6', 'pill', '4', 'soda', '1'])]
I am trying to use this code for this:
data2 = []
for k, v in data:
    data2 = ((k, sorted(zip(data[::2], data[1::2]), key=lambda x: int(x[1]), reverse=True)))
    [value for pair in data2 for value in pair]
print(data2)
But I keep getting the error message:
TypeError: int() argument must be a string or a number, not 'tuple'
I tried changing the int in key=lambda x: int(x[1]) to different things, but I kept getting the same message. I am very new to Python and the syntax often gets me. Any ideas on how to solve this? Thank you very much!
Rather than trying to do everything at once, let's give things names:
data = [('ralph picked', ['nose', '4', 'apple', '30', 'winner', '3']),
('aaron popped', ['soda', '1', 'popcorn', '6', 'pill', '4', 'question', '29'])]
data2 = []
for k, v in data:
    new_list = sorted(zip(v[::2], v[1::2]), key=lambda x: int(x[1]), reverse=True)
    flattened = [value for pair in new_list for value in pair]
    new_tuple = (k, flattened)
    data2.append(new_tuple)
produces
>>> print(data2)
[('ralph picked', ['apple', '30', 'nose', '4', 'winner', '3']),
('aaron popped', ['question', '29', 'popcorn', '6', 'pill', '4', 'soda', '1'])]
You need to distinguish between data and v -- you only want to sort v, and you need to store the result of the list comprehension, otherwise you're just building it and throwing it away.
When you're having trouble with the syntax, break everything apart into its pieces and print them to see what's going on. For example, you could decompose new_list into
words = v[::2]
numbers = v[1::2]
pairs = zip(words, numbers)
sorted_pairs = sorted(pairs, key=lambda x: int(x[1]), reverse=True)
and sorted_pairs is really what new_list is.
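The same pairing step can also be written with a small helper; a sketch where chunk is a hypothetical name for splitting a flat list into consecutive pairs:
def chunk(seq, size=2):
    # zip one iterator with itself to take consecutive fixed-size groups
    return list(zip(*[iter(seq)] * size))

data2 = [(k, [x for pair in sorted(chunk(v), key=lambda p: int(p[1]), reverse=True)
              for x in pair])
         for k, v in data]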
