PySpark Reduce on RDD with only single element - apache-spark

Is there any way to deal with RDDs that have only a single element (this can sometimes happen for what I am doing)? When that's the case, reduce stops working, since the operation requires 2 inputs.
I am working with key-value pairs such as:
(key1, 10),
(key2, 20),
And I want to aggregate their values, so the result should be:
30
But there are cases where the RDD contains only a single key-value pair, so reduce does not work here. For example:
(key1, 10)
This will return nothing.

If you do a .values() before doing reduce, it should work even if there is only 1 element in the RDD:
from operator import add
rdd = sc.parallelize([('key1', 10),])
rdd.values().reduce(add)
# 10
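If the RDD can also end up empty, fold with a zero value should work too, since reduce raises a ValueError on an empty RDD:
from operator import add
rdd = sc.parallelize([('key1', 10)])
rdd.values().fold(0, add)
# 10
# on an empty RDD, fold returns the zero value (0 here) instead of raising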

Related

How to efficiently select distinct rows on an RDD based on a subset of its columns

Consider a case class:
case class Prod(productId: String, date: String, qty: Int, many other attributes ..)
And an
val rdd: RDD[Prod]
containing many instances of that class.
The unique key is intended to be the (productId,date) tuple. However we do have some duplicates.
Is there any efficient means to remove the duplicates?
The operation
rdd.distinct
would look for entire rows that are duplicated.
A fallback would involve joining the unique (productId, date) combinations back to the full rows; I am working through exactly how to do this, but even so it takes several operations. A simpler (and perhaps faster) approach would be useful if it exists.
I'd use dropDuplicates on Dataset:
val rdd = sc.parallelize(Seq(
  Prod("foo", "2010-01-02", 1), Prod("foo", "2010-01-02", 2)
))
rdd.toDS.dropDuplicates("productId", "date")
but reduceByKey should work as well:
rdd.keyBy(prod => (prod.productId, prod.date)).reduceByKey((x, _) => x).values
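A rough PySpark equivalent of both approaches, assuming the rows are plain (productId, date, qty) tuples rather than a case class, might look like this:
rdd = sc.parallelize([("foo", "2010-01-02", 1), ("foo", "2010-01-02", 2)])
# DataFrame route: drop duplicates on a subset of columns
rdd.toDF(["productId", "date", "qty"]).dropDuplicates(["productId", "date"])
# RDD route: key by (productId, date) and keep one value per key
rdd.keyBy(lambda p: (p[0], p[1])).reduceByKey(lambda x, _: x).values()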

Difference between RDD.foreach() and RDD.map()

I am learning Spark in Python. Can anyone explain the difference between the action foreach() and the transformation map()?
rdd.map() returns a new RDD, like the built-in map function in Python. However, I would like to see what rdd.foreach() does and understand how the two differ. Thanks!
A very simple example would be rdd.foreach(print) which would print the value of each row in the RDD but not modify the RDD in any way.
For example, this produces an RDD with the numbers 1 - 10:
>>> rdd = sc.parallelize(xrange(0, 10)).map(lambda x: x + 1)
>>> rdd.take(10)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The map call computed a new value for each row and returned it, so I get a new RDD. However, using foreach here would be useless, because foreach doesn't modify the RDD in any way:
>>> rdd = sc.parallelize(range(0, 10)).foreach(lambda x: x + 1)
>>> type(rdd)
<class 'NoneType'>
Conversely, calling map with a function that returns None, like print, isn't very useful:
>>> rdd = sc.parallelize(range(0, 10)).map(print)
>>> rdd.take(10)
0
1
2
3
4
5
6
7
8
9
[None, None, None, None, None, None, None, None, None, None]
The print call returns None, so mapping it just gives you a bunch of None values; you didn't want those values and don't need to save them, so returning them is a waste. (Note that the lines with 0, 1, 2, etc. are the print calls being executed; they don't show up until you call take, since the RDD is evaluated lazily. The contents of the RDD itself are just a bunch of None values.)
More simply, call map if you care about the return value of the function. Call foreach if you don't.
Map is a transformation: when you perform a map, you apply a function to each element in the RDD and get back a new RDD, on which additional transformations or actions can be called.
Foreach is an action: it applies a function to each element but does not return a value. This is particularly useful when you have to perform some calculation on an RDD and log the result somewhere else, for example to a database, or call a REST API with each element in the RDD.
For example let's say that you have an RDD with many queries that you wish to log in another system. The queries are stored in an RDD.
queries = <code to load queries or a transformation that was applied on other RDDs>
Then you want to save those queries in another system via a call to another API
import urllib2
def log_search(q):
    response = urllib2.urlopen('http://www.bigdatainc.org/save_query/' + q)
queries.foreach(log_search)
Now you have executed log_search on each element of the RDD. If you had done a map instead, nothing would have happened yet until you called an action.

pyspark How can I collect together values without using combineByKey or any reduce?

I am trying to implement my own function, simpler than combineByKey, which will basically just take in a function and an iterator and return key-value pairs with the function applied.
For example:
If I have an RDD that looks like this: [("x", 2), ("y", 1), ("x", 3)] and a function that multiplies values together, I want to plug both of these into my newly created function, called collector, and get this in return: [("x", 6), ("y", 1)].
I want to make it as simple as possible but this is my first time coding in pyspark so I am not too sure how to start this.
Use partitionBy on the pair RDD, then call mapPartitions and supply your function.
partitionBy ensures that all pairs with the same key end up in the same partition.
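A minimal sketch of that idea, assuming the input is a pair RDD and func is an associative two-argument function (collector and combine_partition are names made up for this example):
from operator import mul

def collector(rdd, func):
    def combine_partition(pairs):
        # combine the values for each key seen in this partition
        acc = {}
        for k, v in pairs:
            acc[k] = func(acc[k], v) if k in acc else v
        return iter(acc.items())
    # partitionBy puts all pairs with the same key in the same partition,
    # so the per-partition combine yields exactly one pair per key overall
    return rdd.partitionBy(rdd.getNumPartitions()).mapPartitions(combine_partition)

rdd = sc.parallelize([("x", 2), ("y", 1), ("x", 3)])
collector(rdd, mul).collect()
# [('x', 6), ('y', 1)]   (order may vary)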

Ordered union on spark RDDs

I am trying to sort key-record pairs by key using Apache Spark. The key is 10 bytes long and the value is about 90 bytes long. In other words, I am trying to replicate the sort benchmark Databricks used to break the sorting record. One of the things I noticed from the documentation is that they sorted on key-line-number pairs as opposed to key-record pairs, probably to be cache/TLB friendly. I tried to replicate this approach but have not found a suitable solution. Here is what I have tried:
var keyValueRDD_1 = input.map(x => (x.substring(0, 10), x.substring(12, 13)))
var keyValueRDD_2 = input.map(x => (x.substring(0, 10), x.substring(14, 98)))
var result = keyValueRDD_1.sortByKey(true, 1) // assume partitions = 1
var unionResult = result.union(keyValueRDD_2)
var finalResult = unionResult.foldByKey("")(_+_)
When I do a union on the result RDD and the keyValueRDD_2 RDD and print the output of the unionResult RDD, the result and keyValueRDD_2 contents are not interleaved. In other words, it looks like the unionResult RDD has the keyValueRDD_2 contents followed by the result RDD contents. However, when I do a foldByKey operation, which combines the values of the same key into a single key-value pair, the sorted order is destroyed. I need to do a fold-by-key operation in order to save the result as the original key-record pairs. Is there an alternate RDD function that could be used to achieve this?
Any tips or suggestions would be quite useful.
Thanks
The union method just puts two RDDs one after the other, except when they have the same partitioner, in which case it merges the corresponding partitions.
What you want to do is impossible.
When you have one RDD sorted (keyValueRDD_1) and another unsorted RDD with the same keys (keyValueRDD_2) then the only way to get the second RDD sorted is to sort it.
The existence of the sorted RDD does not help us sort the second RDD.
The Databricks article talks about an optimization that happens locally on the executors. After the shuffle step, the records are roughly sorted: each partition covers a range of keys, but the records within each partition are unsorted.
Now you have to sort each partition locally, and this is where the prefix optimization helps with cache locality.
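To see the union behaviour concretely, a small PySpark sketch (the data here is made up purely for illustration):
a = sc.parallelize([(1, 'a'), (3, 'c')]).sortByKey()
b = sc.parallelize([(2, 'B'), (1, 'A')])
a.union(b).collect()
# [(1, 'a'), (3, 'c'), (2, 'B'), (1, 'A')]
# one RDD's elements simply follow the other's; union does not interleave or re-sort them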

dot product of a combination of elements of an RDD using pySpark

I have an RDD where each element is a tuple of the form
[ (index1,SparseVector({idx1:1,idx2:1,idx3:1,...})) , (index2,SparseVector() ),... ]
I would like to take the dot product of each pair of values in this RDD using the SparseVector1.dot(SparseVector2) method provided by the mllib.linalg.SparseVector class. I am aware that Python has an itertools.combinations module that can be used to generate the combinations of dot products to be calculated. Could someone provide a code snippet to achieve this? I can only think of doing an RDD.collect() so that I receive a list of all elements in the RDD and then running itertools.combinations on this list, but as per my understanding this would perform all the calculations on the driver and wouldn't be distributed per se. Could someone please suggest a more distributed way of achieving this?
def computeDot(sparseVectorA, sparseVectorB):
    """
    Compute the dot product of two SparseVectors
    """
    return sparseVectorA.dot(sparseVectorB)

# Use the cartesian function on the RDD to create tuples containing
# 2-combinations of all the rows in the original RDD
combinationRDD = originalRDD.cartesian(originalRDD)

# The records in combinationRDD will be of the form
# ((Index, SV1), (Index, SV1)), so you need to filter out
# all the records where the two indices are equal, giving an
# RDD of the form ((Index1, SV1), (Index2, SV2)) and so on,
# then use the map function to apply the SparseVector's dot method
dottedRDD = (combinationRDD
             .filter(lambda x: x[0][0] != x[1][0])
             .map(lambda x: computeDot(x[0][1], x[1][1]))
             .cache())
The solution to this question should be along these lines.
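One possible refinement: cartesian produces both orderings of every pair, so each dot product above is computed twice. If you only want each unordered combination once (matching itertools.combinations), and the indices are comparable, you could filter on an ordering of the indices instead:
dottedRDD = (combinationRDD
             .filter(lambda x: x[0][0] < x[1][0])
             .map(lambda x: computeDot(x[0][1], x[1][1]))
             .cache())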
