collect RDD output as a stream - apache-spark

I have a job that ends like this:
val iteratorRDD: RDD[Iterator[SomeClass]] = ....
val results = iteratorRDD.map( iterator => iterator.toSeq)
.collect
The iterators are lazy, i.e. they compute the data only when their items are accessed; here that happens in the toSeq, which basically calls .next() repeatedly.
Now, this computation is slow and I want to get the outputs of the iterators as soon as they are generated, basically at each iterator.next(). The reason is that the later steps (run locally) process the items in order: f(all the first items), then f(all the second items), etc., and I need to get these as soon as possible, i.e. before the end of the job.
Does Spark provide some means to retrieve intermediate results as some kind of stream? Or maybe there exists a distributed data structure to which the iterators could send the intermediate data?
What I could do is set up a web service that would act as such a buffer: it would listen for the data sent by each call to iterator.next(). My main program would then call that web service to get what it stores. But I don't like the idea of having all the workers communicate with an external service.

It would not make any sense to do so. An iterator is fine for traversal when you don't want to create copies in local memory, but Spark works differently. Your iterators are distributed across multiple executors (on separate nodes, with separate memory), so when you call collect you force them to be iterated and sent to the master anyway, where they are loaded into memory. There is simply no way to do lazy evaluation from the master over data that lives in the executors.
You should strive to send the computation to the data and not the other way around, especially if you are going to be running the same code on each sequence anyway! For instance:
val results = iteratorRDD
.map(iter => f(iter)) // Whatever f() returns.
.collect()
You then evaluate your iterators lazily and in parallel, on the executors, bringing only the actual results back to the master.
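As a minimal sketch of that pattern (the per-item function process below is a made-up stand-in for whatever the expensive work is), the iterators are consumed on the executors and only the finished results travel to the driver:

// Sketch only: `process` is a hypothetical stand-in for the expensive per-item work.
def process(x: SomeClass): String = x.toString

val results: Array[Seq[String]] = iteratorRDD       // the RDD[Iterator[SomeClass]] from the question
  .map(iterator => iterator.map(process).toSeq)     // iterators are consumed lazily, on the executors
  .collect()                                        // only the processed results reach the driver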

Related

Does MapReduce fetch the entire map-combiner output before starting the reducer, or does it make incremental progress?

I am confused by the following two conflicting notions about MapReduce, both arising from the same source.
Is it:
the reducer side fetches the entire output of the (map-combine) phase, sorts it and then applies the reduce function in one shot. I get this notion from:
However in MapReduce the reducer input data needs to be sorted, so the reduce() logic is applied after the shuffle-sort process. Since Spark does not require a sorted order for the reducer input data, we don't need to wait until all the data gets fetched to start processing.
or, is it,
the reducer side fetches a pre-specified amount of map-combiner output and applies the combiner, then receives the next batch and applies the combiner to it, and so on. The results of all these combiner runs are then put together, sorted and fed to the reduce function for final aggregation. I get this notion from:
reduce side: Shuffle process in Hadoop will fetch the data until a certain amount, then applies combine() logic, then merge sort the data to feed the reduce() function.
Can you help me understand which one is the correct notion? I have never read anywhere that the combiner runs on the reduce side as well. However, I am no longer sure of that after reading the blog I hyperlinked earlier.
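For what it's worth, here is a tiny toy model (plain Scala collections, not Hadoop code) of what the second notion describes: combine each fetched batch as it arrives, then merge-sort the combined batches and apply the reduce function once at the end. The key/value types and the sum aggregation are arbitrary choices for illustration.

type KV = (String, Int)

// Combiner: partial aggregation applied to one fetched batch at a time.
def combine(batch: Seq[KV]): Seq[KV] =
  batch.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }.toSeq

// Reduce side (toy model): combine per batch, merge-sort everything, then reduce per key.
def reduceSide(fetchedBatches: Seq[Seq[KV]]): Seq[KV] = {
  val combined = fetchedBatches.map(combine)           // applied as each batch is fetched
  val sorted   = combined.flatten.sortBy(_._1)         // merge-sort of all combined output
  sorted.groupBy(_._1)
        .map { case (k, vs) => (k, vs.map(_._2).sum) } // reduce(): final aggregation per key
        .toSeq.sortBy(_._1)
}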

Synchronization between Spark RDD partitions

Say that I have an RDD with 3 partitions and I want to run each executor/worker in sequence, such that partition 2 can be computed only after partition 1 has been computed, and partition 3 only after partition 2. The reason I need this synchronization is that each partition depends on some computation from the previous partition. Correct me if I'm wrong, but this type of synchronization does not appear to be well suited to the Spark framework.
I have considered opening a JDBC connection in each worker task, as illustrated below:
rdd.foreachPartition( partition => {
// 1. open jdbc connection
// 2. poll database for the completion of dependent partition
// 3. read dependent edge case value from computed dependent partition
// 4. compute this partition
// 5. write this edge case result to database
// 6. close connection
})
I have also considered using accumulators, picking the accumulator value up in the driver and then re-broadcasting a value so the appropriate worker can start its computation, but apparently broadcasting doesn't work like this: once you have shipped the broadcast variable through foreachPartition, you cannot re-broadcast a different value.
Synchronization is not really the issue. The problem is that you want to use a concurrency layer to achieve this, and as a result you get completely sequential execution. Not to mention that pushing changes to the database just to fetch them back on another worker means you get none of the benefits of in-memory processing. In this form it doesn't make sense to use Spark at all.
Generally speaking, if you want to achieve synchronization in Spark you should think in terms of transformations. Your question is rather sketchy, but you can try something like this:
1. Create a first RDD with data from the first partition. Process it in parallel and optionally push the results outside.
2. Compute a differential buffer.
3. Create a second RDD with data from the second partition. Merge it with the differential buffer from 2, process it, and optionally push the results to the database.
4. Go back to 2. and repeat.
What do you gain here? First of all, you can utilize your whole cluster. Moreover, partial results are kept in memory and don't have to be transferred back and forth between the workers and the database.
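A hedged sketch of that loop (the data source, the per-step processing and the differential-buffer computation below are all made-up stand-ins, just to show the shape of the approach):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Stand-in for "create an RDD with data from step i"; replace with your real source.
def loadStep(sc: SparkContext, i: Int): RDD[Double] =
  sc.parallelize(Seq.fill(1000)(i.toDouble))

def runInSteps(sc: SparkContext, steps: Int): Unit = {
  var buffer = 0.0                                              // differential buffer carried between steps
  for (i <- 0 until steps) {
    val bc = sc.broadcast(buffer)                               // ship the previous step's result to the executors
    val processed = loadStep(sc, i).map(_ + bc.value).cache()   // process this step in parallel, merged with the buffer
    processed.saveAsTextFile(s"/tmp/out/step-$i")               // optionally push results outside
    buffer = processed.sum()                                    // next differential buffer, computed via an action
    bc.unpersist()
  }
}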

Concurrent operations in spark streaming

I wanted to understand something about the internals of Spark Streaming execution.
If I have a stream X, and in my program I send stream X to function A and function B:
In function A, I do a few transform/filter operations, etc., on X -> Y -> Z to create stream Z. Then I do a foreach operation on Z and print the output to a file.
Then in function B, I reduce stream X -> X2 (say, the min value of each RDD) and print the output to a file.
Are both functions being executed for each RDD in parallel? How does it work?
Thanks
---- Comments from the Spark community ----
I am adding comments from the Spark community:
If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the driver then both of them will be executed in parallel. Whichever gets submitted to Spark first gets executed first - you can use a semaphore if you need to ensure the ordering of execution, though I would assume that the ordering wouldn't matter.
@Eswara's answer seems right, but it does not apply to your use case, as your separate transformation DAGs (X->Y->Z and X->X2) have a common DStream ancestor in X. This means that when the actions are run to trigger each of these flows, the transformation X->Y and the transformation X->X2 cannot happen at the same time. What will happen is that the partitions of RDD X will be either computed or loaded from memory (if cached) for each of these transformations separately, in a non-parallel manner.
Ideally, the transformation X->Y would resolve first and then the transformations Y->Z and X->X2 would finish in parallel, as there is no shared state between them. I believe Spark's pipelining architecture would optimize for this. You can ensure faster computation of X->X2 by persisting DStream X so that it can be loaded from memory rather than recomputed or loaded from disk. See here for more information on persistence.
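A minimal sketch of that setup, assuming X is built from a socket text stream and using made-up transformations for Y, Z and X2 (the port, paths and batch interval are arbitrary):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("two-branches")
val ssc  = new StreamingContext(conf, Seconds(10))

val x = ssc.socketTextStream("localhost", 9999).map(_.length)   // stream X (Ints, for illustration)
x.persist()                                                      // both branches read X from memory

// Branch A: X -> Y -> Z, then write each batch out.
val z = x.filter(_ > 0).map(_ * 2)
z.foreachRDD(rdd => rdd.saveAsTextFile(s"/tmp/z-${System.currentTimeMillis}"))

// Branch B: X -> X2 (min of each RDD), printed per batch.
x.foreachRDD(rdd => if (!rdd.isEmpty()) println(rdd.min()))

ssc.start()
ssc.awaitTermination()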
What would be interesting is if you could use the replicated storage levels *_2 (e.g. MEMORY_ONLY_2 or MEMORY_AND_DISK_2) to run transformations concurrently on the same source. I think those storage levels are currently only useful against lost partitions, as the duplicate partition is processed in place of the lost one.
Yes.
It's similar to Spark's batch execution model, which uses DAGs and lazy evaluation, except that streaming runs the DAG repeatedly on each fresh batch of data.
In your case, since the DAGs (or sub-DAGs of a larger DAG, if one prefers to call them that) required to finish each action (each of the 2 foreachs you have) do not share any links all the way back to the source, they run completely in parallel. The streaming application as a whole gets X executors (JVMs) and Y cores (threads) per executor, allotted at the time of application submission to the resource manager. At any time, a given task (i.e., thread) among the X*Y tasks will be executing a part or the whole of one of these DAGs. Note that any 2 given threads of an application, whether in the same executor or not, can execute different actions of the same application at the same time.

reducer concept in Spark

I'm coming from a Hadoop background and have limited knowledge about Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to the mappers and the driver is (somehow) similar to the reducer. As there is only one driver program, there would be only one reducer. If so, how can simple programs like word count on very big data sets get done in Spark? The driver could simply run out of memory.
The driver is more of a controller of the work, only pulling data back if the operator calls for it. If the operator you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether or not the operation requires a data shuffle. You can see the shuffle points by the stage splits, either in the UI or via toDebugString (where each indentation level is a shuffle).
All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Lastly, to relate this to your word count example:
sc.textFile(path)
.flatMap(_.split(" "))
.map((_, 1))
.reduceByKey(_+_)
The above will be done in one stage, as the data loading (textFile), splitting (flatMap), and mapping can all be done independently of the rest of the data. No shuffle is needed until reduceByKey is called, as it needs to combine all of the data to perform the operation... HOWEVER, this operation has to be associative for a reason: each node performs the operation defined in reduceByKey locally, only merging the final data sets afterwards. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. The data is NOT pulled back to the driver; it merely moves to the nodes that hold the same keys so that its final values can be merged.
Now, if you use an action such as reduce or, worse yet, collect, then you will NOT get an RDD back, which means the data is pulled back to the driver and you will need room for it.
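To make that distinction concrete, here is a small hedged sketch (the input path, output path and sample size are arbitrary):

val counts = sc.textFile("/tmp/input.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)                       // transformation: the result stays distributed

counts.saveAsTextFile("/tmp/word-counts")   // written out by the executors; nothing lands on the driver
val sample = counts.take(10)                // action: only a small, bounded amount reaches the driver
val all    = counts.collect()               // action: the whole data set must fit in driver memory
println(counts.toDebugString)               // each indentation level marks a shuffle boundary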
Here is my fuller explanation of reduceByKey if you want more. Or how this breaks down in something like combineByKey

Collect results from RDDs in a dstream driver program

I have this function in the driver program, which collects the results from RDDs into an array and sends them back. However, even though the RDDs (in the DStream) have data, the function is returning an empty array... What am I doing wrong?
def runTopFunction(): Array[(String, Int)] = {
  val topSearches = some function....
  val summary = new ArrayBuffer[(String, Int)]()
  topSearches.foreachRDD(rdd => {
    summary ++= rdd.collect()
  })
  summary.toArray
}
So while foreachRDD will do what you are looking to do, it is also non-blocking, which means it won't wait until all of the stream is processed. Since you call toArray on your buffer right after the call to foreachRDD, no elements will have been processed yet.
DStream.foreachRDD is an action on a given DStream and will be scheduled for execution on each streaming batch interval. It's a declarative construction of a job to be executed later on.
Accumulating the values in this way is not supported, because while DStream.foreachRDD is just saying "do this on each iteration", the surrounding accumulation code executes immediately, resulting in an empty array.
Depending on what happens to the summary data after it's calculated, there are a few options for how to implement this:
If the data needs to be retrieved by another process, use a shared thread-safe structure. A priority queue is great for top-k uses.
If the data will be stored (fs, db), you can just write to the storage after applying the topSearches function to the dstream.
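As a hedged sketch of the first option, continuing the question's code, one could push each batch into a thread-safe queue in the driver and let another thread read snapshots from it (the queue and the snapshot helper here are assumptions, not an established API):

import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._

// Thread-safe buffer living in the driver; filled as each batch completes.
val summary = new ConcurrentLinkedQueue[(String, Int)]()

topSearches.foreachRDD { rdd =>
  rdd.collect().foreach(summary.add)        // runs on each batch interval, after the RDD is computed
}

// Called later from another driver thread (e.g. a reporting loop).
def snapshot(): Array[(String, Int)] = summary.asScala.toArray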
