As per the Spark documentation at the link here:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#other-points-to-remember
"DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions. Specifically, RDD actions inside the DStream output operations force the processing of the received data. Hence, if your application does not have any output operation, or has output operations like dstream.foreachRDD() without any RDD action inside them, then nothing will get executed. The system will simply receive the data and discard it."
We have a Spark Streaming application with a map operation followed by a DStream output operation. As per the documentation, we have an RDD action inside foreachRDD, which is rdd.first(), but still nothing happens.
tempRequestsWithState is a DStream:
tempRequestsWithState.foreachRDD { rdd =>
  rdd.first()
}
but interestingly, if we do rdd.foreach inside foreachRDD, the application runs perfectly:
tempRequestsWithState.foreachRDD { rdd =>
  rdd.foreach { record =>
    () // no-op
  }
}
In our case rdd.foreach is a very slow operation and we would like to avoid it, as we are dealing with a huge data load of 10,000 events/sec. We also need the foreachRDD.
Please let us know if we are missing anything and whether we can try any other RDD action inside foreachRDD.
Related
I'm aware that the typical way of writing RDD or DataFrame rows to HDFS or S3 is by using saveAsTextFile or df.write. However, I would like to figure out how to write individual records from inside a map transformation, like this:
myRDD.map(row => {
  if (row.contains("something")) {
    // write record to HDFS or S3
  }
  row
})
I know that this can be accomplished with the following code,
val newRDD = myRDD.filter(row => row.contains("something"))
newRDD.saveAsTextFile("myFile")
but I want to continue processing the original myRDD after writing to HDFS; that would require caching myRDD, and I am low on memory resources.
"I want to continue processing the original myRDD after writing to HDFS; that would require caching myRDD, and I am low on memory resources."
The above statement is not correct. You can operate on an RDD further without caching if you have low memory.
You can write inside a map() function using the Hadoop API, but it's not a good idea to perform side-effecting writes inside map(): map() operations should be side-effect free. Instead, you can use the mapPartitions() function, which lets you open one writer per partition.
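A minimal sketch of that approach, assuming the rows are plain strings and a hypothetical output path (the Hadoop configuration is picked up from each executor's classpath):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.TaskContext

val outputBase = "hdfs:///tmp/matched-rows"   // hypothetical location, adjust to your cluster

val passedThrough = myRDD.mapPartitions { rows =>
  val conf = new Configuration()
  val fs = new Path(outputBase).getFileSystem(conf)
  // One file per partition, so concurrent tasks never write to the same path.
  val out = fs.create(new Path(s"$outputBase/part-${TaskContext.getPartitionId()}"))
  // Materialise the partition before closing the stream (the iterator is lazy);
  // note that a task retry would rewrite this partition's file.
  val buffered = rows.map { row =>
    if (row.contains("something")) out.writeBytes(row + "\n")
    row // pass every row through unchanged
  }.toVector
  out.close()
  buffered.iterator
}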
You don't need to cache an RDD to do subsequent operations on it. Caching helps avoid recomputation, but RDDs are immutable; a new RDD (preserving the lineage) is created by each and every transformation.
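For illustration, a small sketch (the path and the filter condition are just placeholders) that keeps using myRDD after a save without any cache() call; the second action simply recomputes the lineage:

val matched = myRDD.filter(_.contains("something"))
matched.saveAsTextFile("hdfs:///tmp/matched")   // first action

val rowCount = myRDD.map(_.length).count()      // second action, recomputes myRDD's lineage
println(s"processed $rowCount rows")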
What is the functionality of the queueStream function in Spark StreamingContext? According to my understanding, it is a queue which queues the incoming DStream. If that is the case, how is it handled in a cluster with many nodes? Does each node have this queueStream, and is the DStream partitioned among all the nodes in the cluster? How does this queueStream work in a cluster setup?
I have read the explanation below in the Spark Streaming documentation (https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources), but I didn't understand it completely. Please help me to understand it.
Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
val myQueueRDD = scala.collection.mutable.Queue[RDD[MyObject]]()
val myStream = ssc.queueStream(myQueueRDD)

for (count <- 1 to 100) {
  val randomData = generateData()                    // generates random data
  val rdd = ssc.sparkContext.parallelize(randomData) // creates an RDD from the random data
  myQueueRDD += rdd                                  // adds the RDD to the queue
}

myStream.foreachRDD(rdd => rdd.mapPartitions(data => evaluate(data)))
How will the above code be executed in the Spark Streaming context with respect to partitions on different nodes?
QueueInputDStream is intended for testing. It uses standard scala.collection.mutable.Queue to store RDDs which imitate incoming batches.
"Does each node have this queueStream, and is the DStream partitioned among all the nodes in the cluster?"
No. There is only one copy of the queue, and all data distribution is handled by RDDs. The compute logic is very simple: at each tick it either dequeues a single RDD (oneAtATime set to true) or takes the union of the current queue (oneAtATime set to false). This applies to DStreams in general: each stream is just a sequence of RDDs, and the RDDs provide the data distribution mechanism.
While it still follows the InputDStream API, conceptually it is just a local collection from which you take elements every batchDuration.
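For reference, a minimal sketch of the two dequeue behaviours, assuming ssc is an existing StreamingContext:

import scala.collection.mutable.Queue
import org.apache.spark.rdd.RDD

val queue = Queue[RDD[Int]]()

// One queued RDD becomes the batch at each tick.
val perTick = ssc.queueStream(queue, oneAtATime = true)

// Alternatively, the union of everything currently in the queue becomes the batch:
// val unioned = ssc.queueStream(queue, oneAtATime = false)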
I am consuming from a Kafka topic. This topic has 3 partitions.
I am using foreachRDD to process each batch RDD (using a processData method to process each RDD, and ultimately create a DataSet from that).
Now, you can see that I have a count variable, and I am incrementing this count variable in the processData method to check how many actual records I have processed. (I understand each RDD is a collection of Kafka topic records, and the number depends on the batch interval size.)
Now, the output is something like this:
1 1 1 2 3 2 4 3 5 ....
This makes me think that it's because I might have 3 consumers (as I have 3 partitions), and each of these will call the foreachRDD method separately, so the same count is being printed more than once, as each consumer might have cached its copy of count.
But the final output DataSet that I get has all the records.
So, does Spark internally union all the data? How does it make out what to union?
I am trying to understand the behaviour, so that I can form my logic.
int count = 0;

messages.foreachRDD(new VoidFunction<JavaRDD<ConsumerRecord<K, V>>>() {
  public void call(JavaRDD<ConsumerRecord<K, V>> rdd) {
    System.out.println("Number of elements in RDD: " + rdd.count());

    List<Row> rows = rdd.map(record -> processData(record))
                        .reduce((rows1, rows2) -> {
                          rows1.addAll(rows2);
                          return rows1;
                        });

    StructType schema = DataTypes.createStructType(fields);
    Dataset<Row> ds = ss.createDataFrame(rows, schema);
    ds.createOrReplaceTempView("trades");
    ds.show();
  }
});
The assumptions are not completely accurate.
foreachRDD is one of the so-called output operations in Spark Streaming. The role of output operations is to schedule the provided closure at the interval dictated by the batch interval. The code in that closure executes once per batch interval on the Spark driver; it is not distributed across the cluster.
In particular, foreachRDD is a general purpose output operation that provides access to the underlying RDD within the DStream. Operations applied on that RDD will execute on the Spark cluster.
So, coming back to the code of the original question, code in the foreachRDD closure such as System.out.println("Number of elements in RDD: " + rdd.count()); executes on the driver. That's also the reason why we can see the output in the console. Note that the rdd.count() in this print will trigger a count of the RDD on the cluster, so count is a distributed operation that returns a value to the driver; then, on the driver, the print operation takes place.
Now comes a transformation of the RDD:
rdd.map(record -> processData(record))
As we mentioned, operations applied to the RDD will execute on the cluster. That execution takes place following the Spark execution model; that is, transformations are assembled into stages and applied to each partition of the underlying dataset. Given that we are dealing with a Kafka topic that has 3 partitions, we will have 3 corresponding partitions in Spark. Hence, processData will be applied once to each partition.
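To make that split concrete, here is a schematic sketch in Scala (the names are illustrative, not taken from the question) marking where each piece runs:

dstream.foreachRDD { rdd =>
  // Runs on the driver, once per batch interval. count() itself is a
  // distributed action whose result is returned to the driver.
  println(s"Number of elements in RDD: ${rdd.count()}")

  // The function passed to map runs on the executors, one task per partition,
  // but nothing executes until an action is called on `mapped`.
  val mapped = rdd.map(record => record.toString.toUpperCase)

  // collect() is an action: the job runs on the cluster and the results
  // are brought back to the driver.
  val results = mapped.collect()
  println(s"Collected ${results.length} records on the driver")
}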
So, does Spark internally union all the data? How does it make out what to union?
Just as we have output operations in Spark Streaming, we have actions in Spark. Actions apply an operation to the data and bring the results back to the driver. The simplest one is collect, which brings the complete dataset to the driver, with the risk that it might not fit in memory. Another common action, count, summarizes the number of records in the dataset and returns a single number to the driver.
In the code above, we are using reduce, which is also an action that applies the provided function and brings the resulting data to the driver. It's the use of that action that "internally unions all the data", as expressed in the question. In the reduce expression, we are actually collecting all the data that was distributed into a single local collection. It would be equivalent to doing: rdd.map(record -> processData(record)).collect()
If the intention is to create a Dataset, we should avoid "moving" all the data to the driver first.
A better approach would be:
// assuming processData returns a List<Row>, flatten the lists into a JavaRDD<Row>
JavaRDD<Row> rows = rdd.flatMap(record -> processData(record).iterator());
Dataset<Row> df = ss.createDataFrame(rows, schema);
...
In this case, the data of each partition will remain local to the executor where it is located.
Note that moving data to the driver should be avoided. It is slow and in cases of large datasets will probably crash the job as the driver cannot typically hold all data available in a cluster.
Below is my code snippet. I have a DStream which I am trying to save to HDFS. I just wanted to know an efficient way to do this with compression.
pairedDStream.foreachRDD { rdd =>
  val time = Calendar.getInstance.getTimeInMillis
  val textOutputFolder = outputDir + "/output-" + time
  if (args.length == 4) {
    val compressionCodec = args(3)
    rdd.saveAsTextFile(textOutputFolder, CommonUtils.getCompressionCodec(compressionCodec))
  } else {
    rdd.saveAsTextFile(textOutputFolder, CommonUtils.getCompressionCodec(null))
  }
}
rdd.saveAsTextFile is triggered from the driver, but the actual writing happens on the worker nodes; in general, the RDD operations invoked inside dstream.foreachRDD run in parallel across the cluster. The Spark documentation mentions that we should use this DStream output operation to push the data in each RDD to an external system:
foreachRDD(func): The most generic output operator that applies a
function, func, to each RDD generated from the stream. This function
should push the data in each RDD to an external system, such as saving
the RDD to files, or writing it over the network to a database. Note
that the function func is executed in the driver process running the
streaming application, and will usually have RDD actions in it that
will force the computation of the streaming RDDs.
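For example, here is a minimal sketch of a compressed write; Gzip is just an assumed codec, and the asker's CommonUtils.getCompressionCodec helper is not reproduced here:

import java.util.Calendar
import org.apache.hadoop.io.compress.GzipCodec

pairedDStream.foreachRDD { rdd =>
  val out = outputDir + "/output-" + Calendar.getInstance.getTimeInMillis
  // The write happens on the executors, producing one compressed part file per partition.
  rdd.saveAsTextFile(out, classOf[GzipCodec])
}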
The "Design Patterns for using foreachRDD" section also clearly states that dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. You can read that section further to learn how to optimize operations on the RDDs in a DStream; its main example, reusing connection objects per partition, is sketched below.
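A sketch of that pattern, assuming a hypothetical ConnectionPool helper and a connection object with a send method (as in the documentation's own example):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a hypothetical, lazily initialized static pool of connections.
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}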
Hope this helps!
I wrote this program in the Spark shell:
val array = sc.parallelize(List(1, 2, 3, 4))
array.foreach(x => println(x))
This prints some debug statements but not the actual numbers.
The code below works fine:
for (num <- array.take(4)) {
  println(num)
}
I do understand that take is an action and therefore will cause spark to trigger the lazy computation.
But foreach should have worked the same way... why did foreach not bring anything back from Spark and start doing the actual processing (get out of lazy mode)?
How can I make the foreach on the rdd work?
The RDD.foreach method in Spark runs on the cluster, so each worker which contains these records is running the operations in foreach. I.e. your code is running, but the numbers are being printed to the Spark workers' stdout, not to the driver / your shell session. If you look at the stdout output of your Spark workers, you will see them printed to the console.
You can view the stdout on the workers by going to the web UI running for each executor. An example URL is http://workerIp:workerPort/logPage/?appId=app-20150303023103-0043&executorId=1&logType=stdout
In this example Spark chooses to put all the records of the RDD in the same partition.
This makes sense if you think about it - look at the function signature for foreach - it doesn't return anything.
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit
This is really the purpose of foreach in Scala - it's used for side effects.
When you collect records, you bring them back to the driver, so logically collect/take operations are just running on a Scala collection within the Spark driver - you can see the output because the Spark driver / Spark shell is what's printing to stdout in your session.
A use case for foreach may not seem immediately apparent. An example: if for each record in the RDD you wanted to do some external behaviour, like calling a REST API, you could do this in the foreach; each Spark worker would then submit a call to the API server with the value. If foreach did bring records back, you could easily blow out the memory in the driver/shell process. This way you avoid those issues and can apply side effects to all the items in an RDD across the cluster.
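A rough sketch of that idea, assuming a hypothetical endpoint at http://api.example.com/events; the HTTP calls run on whichever executors hold the partitions, and nothing is shipped back to the shell:

import java.net.{HttpURLConnection, URL}

array.foreachPartition { values =>
  values.foreach { value =>
    // Hypothetical endpoint; each worker issues its own requests.
    val conn = new URL(s"http://api.example.com/events?value=$value")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    conn.getResponseCode // force the request; the response body is ignored in this sketch
    conn.disconnect()
  }
}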
If you want to see what's in an RDD, I use:
array.collect.foreach(println)
//Instead of collect, use take(...) or takeSample(...) if the RDD is large
You can use RDD.toLocalIterator() to bring the data to the driver (one RDD partition at a time):
val array = sc.parallelize(List(1, 2, 3, 4))
for(rec <- array.toLocalIterator) { println(rec) }
See also
Spark: Best practice for retrieving big data from RDD to local machine
this blog post about toLocalIterator