Using reduceByKey method in Pyspark to update a dictionary - apache-spark

I have the following RDD data:
[(13, 'Munich#en'), (13, 'Munchen#de'), (14, 'Vienna#en'), (14, 'Wien#de'),(15, 'Paris#en')]
I want to combine the above RDD using the reduceByKey method so that it produces the following output, i.e. join the entries into a dictionary keyed by each entry's language:
[
(13, {'en':'Munich','de':'Munchen'}),
(14, {'en':'Vienna', 'de': 'Wien'}),
(15, {'en':'Paris', 'de':''})
]
The examples I found for reduceByKey were all numerical operations such as addition, so I am not sure how to go about updating a dictionary in each reduce step.
This is my code:
rd0 = sc.parallelize(
    [(13, 'munich#en'), (13, 'munchen#de'), (14, 'Vienna#en'), (14, 'Wien#de'), (15, 'Paris#en')]
)

def updateDict(x, xDict):
    xDict[x[:-3]] = x[-2:]

rd0.map(lambda x: (x[0], (x[1], {'en': '', 'de': ''}))).reduceByKey(updateDict).collect()
I am getting the following error message, but I am not sure what I am doing wrong.
return f(*args, **kwargs)
File "<ipython-input-209-16cfa907be76>", line 2, in ff
TypeError: 'tuple' object does not support item assignment
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)

There are a few problems with your code; for instance, updateDict does not return a value, and the function passed to reduceByKey must return the merged value. Here is a different approach:
First, map the values into dictionaries. One way is to split on "#", reverse, and pass the result into the dict constructor.
rd1 = rd0.mapValues(lambda x: dict([reversed(x.split("#"))]))
print(rd1.collect())
#[(13, {'en': 'munich'}),
# (13, {'de': 'munchen'}),
# (14, {'en': 'Vienna'}),
# (14, {'de': 'Wien'}),
# (15, {'en': 'Paris'})]
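As a plain-Python illustration of the dict([reversed(...)]) trick above (the sample value here is just for demonstration), a single "city#language" string becomes a one-entry dictionary keyed by the language code:
parts = "Munich#en".split("#")    # ['Munich', 'en']
d = dict([reversed(parts)])       # reversed yields 'en', 'Munich', i.e. one key/value pair
print(d)
#{'en': 'Munich'}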
Now you can call reduceByKey and merge the two dictionaries. Finally add in the missing keys with a dictionary comprehension over the required keys, defaulting to empty string if the key is missing.
def merge_two_dicts(x, y):
    # from https://stackoverflow.com/a/26853961/5858851
    # works for python 2 and 3
    z = x.copy()    # start with x's keys and values
    z.update(y)     # modifies z with y's keys and values & returns None
    return z

rd2 = rd1.reduceByKey(merge_two_dicts)\
    .mapValues(lambda x: {k: x.get(k, '') for k in ['en', 'de']})
print(rd2.collect())
#[(14, {'de': 'Wien', 'en': 'Vienna'}),
# (13, {'de': 'munchen', 'en': 'munich'}),
# (15, {'de': '', 'en': 'Paris'})]
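As an aside, not part of the original answer: on Python 3.5+ the merge step can also be written inline with dictionary unpacking, so the helper function becomes optional:
rd2 = rd1.reduceByKey(lambda x, y: {**x, **y})\
    .mapValues(lambda x: {k: x.get(k, '') for k in ['en', 'de']})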

Related

Mapping of string to index value in networkx graph

Using Python NetworkX, I am creating a graph of cities with their populations.
# Graph creation
my_Graph = nx.DiGraph()
my_Graph.add_nodes_from([
    ("S1", {"Population": 100}),
    ("S2", {"Population": 200}),
    ("S3", {"Population": 300})])
my_Graph.add_edge("S1", "S2", capacity=50)
my_Graph.add_edge("S2", "S1", capacity=30)
my_Graph.add_edge("S2", "S3", capacity=70)
my_Graph.add_edge("S3", "S2", capacity=50)
my_Graph.add_edge("S1", "S3", capacity=55)
When I print the list of cities using:
print("The cities are :", my_Graph.nodes)
my output is
The cities are : ['S1', 'S2', 'S3']
I want a mapping function that maps the cities (nodes) to their indices in the network.
Based on your comments, I think this might be what you're after.
Using the built-in convert_node_labels_to_integers, you can do the following (assuming you want to start the labeling at 1 as indicated in your comments):
my_Graph = nx.convert_node_labels_to_integers(my_Graph, first_label = 1)
my_Graph.nodes()
>>>[1, 2, 3]
my_Graph.edges()
>>>[(1, 2), (1, 3), (2, 1), (2, 3), (3, 2)]
Nodes are converted to integers, and the edges are kept between the correct nodes under their updated labels. The node and edge attributes are preserved as well, in case you need them.
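If you also need to know which integer each city received, convert_node_labels_to_integers accepts a label_attribute argument that stores the old label on every node. A small sketch, replacing the call above (the attribute name "city" and the dict name index_of are just illustrative):
my_Graph = nx.convert_node_labels_to_integers(my_Graph, first_label=1, label_attribute="city")
index_of = {data["city"]: n for n, data in my_Graph.nodes(data=True)}
index_of
>>>{'S1': 1, 'S2': 2, 'S3': 3}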

Can I chain groupByKey calls on pair_rdd in Pyspark?

Is it possible in Pyspark to chain a groupByKey() call on a pair_rdd twice?
I have two levels of keys I want to group by before I aggregate by creating a special list of all values.
Here's my code. The first groupByKey() call groups by the outer key; its result is then passed to a map function in which I hope to turn the ResultIterable object back into a pair RDD, so I can apply the second groupByKey() and map my function over it.
(Since I'm reducing, I guess I could also use reduceByKey() there?)
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("test")\
    .master("local")\
    .config('spark.sql.shuffle.partitions', '4')\
    .getOrCreate()
sc = spark.sparkContext

def group_by(ws):
    L = ws[0]
    E = ...ws[1]...  # <-- Do something here to turn this from a ResultIterable into a pair RDD
    rr = E.groupByKey().map(output_lists)
    return (L, rr)

def output_lists(ws):
    el = [e[0] for e in ws[1]]
    res = [ws[0]] + el
    return (ws[0], res)

input_data = (('A', ('G', ('xyz',))),
              ('A', ('G', ('xys',))),
              ('A', ('H', ('asd',))),
              ('B', ('K', ('qwe',))),
              ('B', ('K', ('wer',))))
data = sc.parallelize(input_data)
data = data.groupByKey().map(group_by)
print(data.take(5))
Now, is this even doable, or do I need a different approach?
I know of two other ways around it:
Concatenate both keys into one.
Use a SparkSQL dataframe.
But I'm curious if there is a way with the above approach as I'm still learning Spark.
I found out I can use tuples as keys in pair RDDs. Remapping my input data like this means only one groupByKey() is needed and the problem can be solved:
input_data = ((('A', 'G'), 'xyz'),
              (('A', 'G'), 'xys'),
              (('A', 'H'), 'asd'),
              (('B', 'K'), 'qwe'),
              (('B', 'K'), 'wer'))
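A minimal sketch of what the single-groupByKey() version could then look like; the mapValues(list) step is my addition, just to make the grouped values printable (output ordering may vary):
data = sc.parallelize(input_data)
grouped = data.groupByKey().mapValues(list)
print(grouped.collect())
# e.g. [(('A', 'G'), ['xyz', 'xys']), (('A', 'H'), ['asd']), (('B', 'K'), ['qwe', 'wer'])]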

Can't order a dict using OrderedDict and sorted

dict = {'word1':8, 'word2':5, 'word3' : 15, 'word4' : 1}
sorted(dict.items(), key=lambda x: x[1])
[('word4', 1), ('word2', 5), ('word1', 8), ('word3', 15)]
Ok that's what I want.
OrderedDict(sorted(dict.items(), key=lambda x: x[1]))
OrderedDict([('word1', 8), ('word2', 5), ('word3', 15), ('word4', 1)])
That's not what I want.
I don't understand. What am I doing wrong? Why is the OrderedDict not ordered as desired?
I checked the code that you wrote, and the same case is working for me. Maybe since you are using it in a shell-based environment, the values are not getting saved or something.
Here is a screenshot of the code along with the output that you expect.
Let me know if you have any more questions.
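For reference, this is the behaviour that answer describes, as a transcript sketch from a fresh interpreter (the name d is used instead of dict to avoid shadowing the built-in):
from collections import OrderedDict
d = {'word1': 8, 'word2': 5, 'word3': 15, 'word4': 1}
OrderedDict(sorted(d.items(), key=lambda x: x[1]))
# OrderedDict([('word4', 1), ('word2', 5), ('word1', 8), ('word3', 15)])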

PySpark: how to aggregate over column arrays with variable width?

I am attempting to aggregate and create an array of means thus (this is a Minimal Working Example):
n = len(allele_freq_total.select("alleleFrequencies").first()[0])
allele_freq_by_site = allele_freq_total.groupBy("contigName", "start", "end", "referenceAllele").agg(
    array(*[mean(col("alleleFrequencies")[i]) for i in range(n)]).alias("mean_alleleFrequencies")
)
using a solution that I got from Aggregate over column arrays in DataFrame in PySpark?, but the problem is that n is variable. How do I alter
array(*[mean(col("alleleFrequencies")[i]) for i in range(n)])
so that it takes the variable length into consideration?
With arrays of unequal size in the different groups (for you, a group is ("contigName", "start", "end", "referenceAllele"), which I'll simply rename to group), you could consider exploding the array column (the alleleFrequencies) while introducing the position each value had within its array. That gives you an additional column you can use in the grouping to compute the average you had in mind. At that point you may actually have enough for further computations (see df3.show() below).
If you really must have it back as an array, that's harder; the best I can offer is the following. One must keep track of the order, and I believe that's easiest with a map (a dictionary, if you like). To do so, I use the aggregation function collect_list on two columns. While collect_list isn't deterministic (you don't know the order in which values will be returned in the list, because rows are shuffled), the aggregation over both columns keeps them aligned, because the rows get shuffled in their entirety (see df4.show() below). From there, you can create a mapping from position to average with map_from_arrays.
Example:
>>> from pyspark.sql.functions import mean, col, posexplode, collect_list, map_from_arrays
>>>
>>> df = spark.createDataFrame([
... ("A", [0, 1, 2]),
... ("A", [0, 3, 6]),
... ("B", [1, 2, 4, 5]),
... ("B", [1, 2, 6, 1])],
... schema=("group", "values"))
>>> df2 = df.select(df.group, posexplode(df.values)) # adds the "pos" and "col" columns
>>> df3 = (df2
... .groupBy("group", "pos")
... .agg(mean(col("col")).alias("avg_of_positions"))
... )
>>> df4 = (df3
... .groupBy("group")
... .agg(
... collect_list("pos").alias("pos"),
... collect_list("avg_of_positions").alias("avgs")
... )
... )
>>> df5 = df4.select(
... "group",
... map_from_arrays(col("pos"), col("avgs")).alias("positional_averages")
... )
>>> df5.show(truncate=False)
+-----+----------------------------------------+
|group|positional_averages |
+-----+----------------------------------------+
|B |[0 -> 1.0, 1 -> 2.0, 3 -> 3.0, 2 -> 5.0]|
|A |[0 -> 0.0, 1 -> 2.0, 2 -> 4.0] |
+-----+----------------------------------------+
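If an ordered array (rather than a map) were truly needed, one possible follow-up, not part of the original answer and assuming Spark 2.4+ for the transform higher-order function, is to collect (pos, average) structs, sort them by position, and extract the averages:
from pyspark.sql.functions import expr

df6 = df3.groupBy("group").agg(
    expr("transform(sort_array(collect_list(struct(pos, avg_of_positions))), "
         "s -> s.avg_of_positions)").alias("avgs_in_order")
)
df6.show(truncate=False)
# group A -> [0.0, 2.0, 4.0]; group B -> [1.0, 2.0, 5.0, 3.0]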

CombineByKey does not allow work with flatMapValues to derive (Key, Value) pairs

I am working on a MapReduce problem where I want to filter the output of every map partition: I want to keep only those keys which occur more than a threshold number of times within their map partition.
So I have an RDD like (key, value<tuple>).
For every value in the tuple I want to get the count of its occurrences across the RDD, broken down by map partition. Then I'll filter on this count.
Eg: RDD: {(key1, ("a","b","c")),
          (key2, ("a","d")),
          (key3, ("b","c"))}
Using flatMapValues I have been able to flatten this to
{(key1, a), (key1, b), (key1, c), (key2, a), (key2, d), (key3, b), (key3, c)}
Now, using a combineByKey step, I have been able to get the count of each value in its respective partition.
Supposing there were two partitions, it will return something like
("a", [1, 1]), ("b", [1, 1]), ("c", [1, 1]), ("d", 1)
Now I want to transform this (key, value) output so that the tuple of counts is split into individual key-value pairs, i.e. what flatMapValues did for me earlier, but I am unable to use flatMapValues here.
from itertools import count
import pyspark
import sys
import json
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.rdd import RDD
sc = SparkContext('local[*]', 'assignment2_task1')
RDD = sc.textFile(sys.argv[1])
rdd1_original = RDD.map(lambda x: x.split(",")) \
    .map(lambda x: (x[0], [x[1]])) \
    .reduceByKey(lambda x, y: x + y)
rdd3_candidate = rdd1_original.flatMapValues(lambda a: a) \
    .map(lambda x: (x[1], 1)) \
    .combineByKey(lambda value: (value),
                  lambda x, y: (x + y),
                  lambda x, y: (x, y))
new_rdd = rdd3_candidate.flatMapValues(lambda a: a)
print(new_rdd.collect())
Expected answer:
[("a",1),("a", 1), ("b", 1), ("b", 1), ("c", 1), ("c", 1), ("d", 1)
Current error:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/yashphogat/Downloads/spark-2.33-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
process()
File "/Users/yashphogat/Downloads/spark-2.33-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/yashphogat/Downloads/spark-2.33-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 379, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/Users/yashphogat/Downloads/spark-2.33-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 55, in wrapper
return f(*args, **kwargs)
File "/Users/yashphogat/Python_Programs/lib/python3.6/site-packages/pyspark/rdd.py", line 1967, in <lambda>
flat_map_fn = lambda kv: ((kv[0], x) for x in f(kv[1]))
TypeError: 'int' object is not iterable
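As a hedged aside, not taken from a posted answer: judging by the traceback, the combineByKey above returns a bare int for any key that appears in only one partition (its createCombiner just passes the value through), and flatMapValues then tries to iterate over that int. One possible fix is a combiner that always builds a list holding one count per partition, for example:
rdd3_candidate = rdd1_original.flatMapValues(lambda a: a) \
    .map(lambda x: (x[1], 1)) \
    .combineByKey(lambda v: [v],                            # first value seen in a partition
                  lambda acc, v: acc[:-1] + [acc[-1] + v],  # keep summing within the partition
                  lambda a, b: a + b)                       # concatenate per-partition counts
new_rdd = rdd3_candidate.flatMapValues(lambda a: a)
# e.g. [('a', 1), ('a', 1), ('b', 1), ('b', 1), ('c', 1), ('c', 1), ('d', 1)]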
