Collect results from RDDs in a dstream driver program - apache-spark

I have this function in the driver program which collects the results from the RDDs in a DStream into an array and sends it back. However, even though the RDDs (in the DStream) have data, the function returns an empty array... What am I doing wrong?
def runTopFunction(): Array[(String, Int)] = {
    val topSearches = ... // some function returning a DStream
    var summary = new ArrayBuffer[(String, Int)]()
    topSearches.foreachRDD(rdd => {
        summary = summary.++(rdd.collect())
    })
    return summary.toArray
}

So while foreachRDD will do what you are looking to do, it is also non-blocking, which means it won't wait until all of the stream is processed. Since you call toArray on your buffer right after the call to foreachRDD, no elements will have been processed yet.

DStream.foreachRDD is an action on the given DStream and will be scheduled for execution on each streaming batch interval. It is a declarative construction of a job to be executed later on.
Accumulating the values in this way is not supported, because while DStream.foreachRDD is just saying "do this on each batch", the surrounding accumulation code is executed immediately, resulting in an empty array.
Depending on what happens to the summary data after it is calculated, there are a few options on how to implement this:
If the data needs to be retrieved by another process, use a shared thread-safe structure; a priority queue works well for top-k use cases (see the sketch below).
If the data will be stored (filesystem, database), you can just write to the storage after applying the topSearches function to the DStream.
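A minimal sketch of the first option, assuming topSearches is a DStream[(String, Int)] and ssc is the surrounding StreamingContext; the queue type and the read side are illustrative (the priority-queue suggestion above would work similarly):

import java.util.concurrent.ConcurrentLinkedQueue

// Shared, thread-safe buffer that another thread (e.g. a request handler) can read from.
val latestTopK = new ConcurrentLinkedQueue[Array[(String, Int)]]()

topSearches.foreachRDD { rdd =>
  // Runs once per batch interval; only the small top-k result is collected to the driver.
  latestTopK.add(rdd.collect())
}

ssc.start()
// Elsewhere in the application, call latestTopK.poll() (or peek()) whenever the newest result is needed.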

Related

Spark forEach vs Map functions

I was going through a review and it stated:
forEach forces all of the data to be sent to a single process (the Driver), which will cause issues (such as OutOfMemory issues) at scale. Instead, the map() function serves the same purpose and distributes the processing across the different Worker nodes in the cluster.
Is this correct? I could not find any document which says forEach is not distributed while map is distributed.
I believe you are talking about applying these functions to an RDD or Dataset.
Nothing comes to the driver in either case; all the code is executed on the executors. foreach is an action that returns nothing, whereas map() acts as a transformation from one value to another. From the Spark documentation:
def foreach(f: (T) ⇒ Unit): Unit
Applies a function f to all elements of this RDD.
foreach(func) Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
Return a new RDD by applying a function to all elements of this RDD.
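To make the distinction concrete, here is a minimal sketch; the accumulator name and the sample data are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("foreach-vs-map").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 5)

// map is a transformation: it returns a new RDD and is evaluated lazily on the executors.
val doubled = numbers.map(_ * 2)

// foreach is an action: it runs on the executors for its side effects and returns Unit.
val counter = sc.longAccumulator("processed")
doubled.foreach(_ => counter.add(1))

// Only the accumulator value comes back to the driver, not the data itself.
println(s"processed ${counter.value} elements")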

Parallel method invocation in spark and using of spark session in the passed method

Let me first inform all of you that I am very new to Spark.
I need to process a huge number of records in a table, and when grouped by email it comes to around 1 million groups. I need to perform multiple logical calculations on the data set for each individual email and update the database based on those calculations.
Roughly, my code structure looks like this:
// Initial data load
import sparkSession.implicits._
var tableData = sparkSession.read.jdbc(<JDBC_URL>, <TABLE NAME>, connectionProperties)
  .select("email").where(<CUSTOM CONDITION>)

// DataFrame with records grouped by email, keeping only emails with count greater than one
var recordsGroupedBy = tableData.groupBy("email").count()
  .withColumnRenamed("count", "recordcount")
  .filter("recordcount > 1").toDF()

// Now comes the processing after grouping by email, using the processDataAgainstEmail() method
recordsGroupedBy.collect().foreach(x => processDataAgainstEmail(x.getAs("email"), sparkSession))
Here I see that foreach is not executed in parallel. I need to invoke the method processDataAgainstEmail(,) in parallel.
But if I try to parallelize, I can get a list by invoking
val emailList = dataFrameWithGroupedByMultipleRecords.select("email").rdd.map(r => r(0).asInstanceOf[String]).collect().toList
var rdd = sc.parallelize(emailList)
rdd.foreach(email => processDataAgainstEmail(email, sparkSession))
However, this is not supported, as I cannot pass sparkSession when using parallelize.
Can anybody help me with this? In processDataAgainstEmail(,), multiple operations related to database inserts and updates would be performed, and Spark DataFrame and Spark SQL operations also need to be performed.
To summarize, I need to invoke processDataAgainstEmail(,) in parallel with the sparkSession.
In case it is not at all possible to pass the Spark session, the method won't be able to do anything against the database. I am not sure what the alternative would be, as parallelism on email is a must for my scenario.
foreach on the list operates on each element of the list sequentially, so you are acting on the emails one at a time and passing each one to the processDataAgainstEmail method.
Once you have gotten the resulting list, you then invoke sc.parallelize to parallelize the creation of the DataFrame from the list of records you created/manipulated in the previous step. The parallelization, as I see it in PySpark, is a property of creating the DataFrame, not of acting on the result of any operation (see the sketch below).
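A minimal sketch of that suggestion; it assumes processDataAgainstEmail is a driver-side function that returns one result record per email, which is an assumption about the question's code rather than something stated in it:

// Drive the per-email processing sequentially on the driver.
val emailList: List[String] =
  recordsGroupedBy.select("email").rdd.map(_.getString(0)).collect().toList

val resultRecords = emailList.map(email => processDataAgainstEmail(email, sparkSession))

// Only the resulting records are parallelized back into an RDD (or DataFrame) afterwards.
val resultsRDD = sparkSession.sparkContext.parallelize(resultRecords)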

How do I run RDD operations after a groupby in Spark?

I have a large set of data that I want to perform clustering on. The catch is, I don't want one clustering for the whole set, but a clustering for each user. Essentially I would do a groupBy on userid first, then run KMeans.
The problem is, once you do a groupBy, any mapping would be outside the Spark controller context, so any attempt to create RDDs would fail. Spark's KMeans lib in MLlib requires an RDD (so it can parallelize).
I see two workarounds, but I was hoping there was a better solution.
1) Manually loop through all the thousands of users in the controller (maybe millions when things get big), and run KMeans for each of them.
2) Do the groupBy in the controller, then in a map run a non-parallel KMeans provided by an external library.
Please tell me there is another way; I'd rather have everything as parallel as possible.
Edit: I didn't know it was PySpark at the time I wrote this response. However, I will leave it as an idea that may be adapted.
I had a similar problem and was able to improve the performance, but it was still not the ideal solution for me. Maybe it could work for you.
The idea is to break the RDD into many smaller RDDs (a new one for each user ID), save them to an array, and then call the processing function (clustering in your case) on each "sub-RDD". The suggested code is given below (explanation in the comments):
// A case class just to use as an example
case class MyClass(userId: Long, value: Long, ...)

// A local Scala collection with the user IDs (could be another iterable, such as a List or Array):
val userList: Seq[Long] = rdd.map { _.userId }.distinct.collect.toSeq // Just a suggestion!

// Now we can create the new RDDs:
val rddsList: Seq[RDD[MyClass]] = userList.map { userId =>
  rdd.filter { item: MyClass => item.userId == userId }
}

// Finally, we call the function we want for each RDD, saving the results in a new list.
// Note the ".par" call, which is used to start the expensive execution for multiple RDDs at the same time.
val results = rddsList.par.map { r =>
  myFunction(r)
}
I know this is roughly the same as your first option, but by using the .par call, I was able to improve the performance.
This call transforms rddsList into a ParSeq. This Scala collection type allows parallel computation, so, ideally, the map will call myFunction(r) for multiple RDDs at once, which can improve performance.
For more details about parallel collections, please check the Scala Documentation.
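As a side note, the same mechanism works on any plain Scala collection, independent of Spark; a tiny illustration (assuming a Scala version where .par is available without the separate parallel-collections module):

// The map closures may run on several threads at once.
val squares = (1 to 10).par.map(n => n * n)
println(squares.toList)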

Why Spark streaming creates batches with 0 events?

Spark Streaming keeps creating batches with 0 events and queues them to be processed in the next job iteration. But is it really necessary to queue batches which have nothing to be processed, or is there something hidden going on?
This is working as intended, because your job could still produce output even in the absence of data (which can also happen after filtering your data).
For example, you might write a record to the database indicating that there is no data available at a given timestamp.
stream.foreachRDD { rdd =>
  if (rdd.isEmpty()) {
    // write an "empty" marker record to the db
  } else {
    // write the data to the db
  }
}

collect RDD output as a stream

I have a job that ends like this:
val iteratorRDD: RDD[Iterator[SomeClass]] = ....
val results = iteratorRDD
  .map(iterator => iterator.toSeq)
  .collect()
The iterators are lazy, i.e. they compute the data when their items are accessed; here the toSeq basically calls .next() iteratively.
Now, this computation is slow and I want to get the output of the iterators as soon as it is generated, basically at each iterator.next(). The reason is that the later steps (run locally) process the items in order: f(all the first items), then f(all the second items), etc., and I need to get these as soon as possible, thus before the end of the job.
Does Spark provide some means to retrieve intermediate results as some kind of stream? Or maybe there exists a distributed data structure to which the iterators could send the intermediate data?
What I could do is set up a web service that would act as such a buffer: it would listen for data sent by each call to iterator.next(). Then my main program would call that web service to get what it has stored. But I don't like the idea of having all the workers communicate with an external service.
It would not make any sense to do so. An iterator is fine for traversal when you don't want to create copies in local memory, but Spark works differently. Your iterators are distributed across multiple executors (on separate nodes, with separate memory), so when you call collect you force them to be iterated and sent to the master anyway, where they are loaded into memory. There is simply no way to do lazy evaluation from the master over data that lives in the executors.
You should strive to send the computation to the data and not the other way around, especially if you are going to run the same code on each sequence anyway! For instance:
val results = iteratorRDD
  .map(iter => f(iter)) // Whatever f() returns.
  .collect()
Your iterators are then evaluated lazily and in parallel, on the executors, and only the actual results are brought to the master.
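As an illustration, a hypothetical f could apply the per-item work on the executor so that only the processed results travel to the master; process() and the String result type here are placeholders, not part of the original question:

// Hypothetical per-iterator function, run on the executors.
def f(iterator: Iterator[SomeClass]): Seq[String] =
  iterator.map(item => process(item)).toSeq // process() stands in for the real per-item work

val results: Array[Seq[String]] = iteratorRDD.map(f).collect()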

Resources