Below is my code snippet. I have a DStream that I am trying to save to HDFS. What is an efficient way to do this with compression?
pairedDStream.foreachRDD { rdd =>
  val time = Calendar.getInstance.getTimeInMillis
  val textOutputFolder = outputDir + "/output-" + time
  if (args.length == 4) {
    val compressionCodec = args(3)
    rdd.saveAsTextFile(textOutputFolder, CommonUtils.getCompressionCodec(compressionCodec))
  } else {
    rdd.saveAsTextFile(textOutputFolder, CommonUtils.getCompressionCodec(null))
  }
}
rdd.saveAsTextFile is executed on the worker nodes; in fact, all RDD operations inside dstream.foreachRDD are executed in parallel. The Spark documentation says to use this DStream operation for pushing the data in each RDD to an external system:
foreachRDD(func): The most generic output operator that applies a
function, func, to each RDD generated from the stream. This function
should push the data in each RDD to an external system, such as saving
the RDD to files, or writing it over the network to a database. Note
that the function func is executed in the driver process running the
streaming application, and will usually have RDD actions in it that
will force the computation of the streaming RDDs.
The "Design Patterns for using foreachRDD" section also states that dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. Read that section to learn how to optimize operations on the RDDs in a DStream.
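As a rough sketch of this pattern (not the asker's exact code): the function passed to foreachRDD runs on the driver once per batch, while the saveAsTextFile action inside it is carried out by the executors. The two-argument form of foreachRDD supplies the batch time, so the wall clock does not need to be read. GzipCodec is an assumed codec choice, and CompressedSink, batchOutputPath and saveCompressed are hypothetical names introduced for illustration.

```scala
import org.apache.hadoop.io.compress.{CompressionCodec, GzipCodec}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

object CompressedSink {
  // Hypothetical helper: one output folder per batch, named after the batch time.
  def batchOutputPath(outputDir: String, batchTimeMs: Long): String =
    s"$outputDir/output-$batchTimeMs"

  // Hypothetical helper: saves every non-empty batch as compressed text files.
  def saveCompressed[T](
      stream: DStream[T],
      outputDir: String,
      codec: Class[_ <: CompressionCodec] = classOf[GzipCodec]): Unit =
    stream.foreachRDD { (rdd: RDD[T], time: Time) =>
      // This body runs on the driver once per batch interval.
      // isEmpty() itself launches a small job to look for a first element.
      if (!rdd.isEmpty()) {
        // The save is executed in parallel by the executors,
        // producing one part file per partition.
        rdd.saveAsTextFile(batchOutputPath(outputDir, time.milliseconds), codec)
      }
    }
}
```

Calling CompressedSink.saveCompressed(pairedDStream, outputDir) would replace the foreachRDD block in the question. Gzip files are not splittable, so a splittable codec may suit downstream jobs better, and reducing the number of partitions before saving is one way to avoid many small files per batch.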
Hope this helps!
Related
Even with a .cache()d RDD, Spark still seems to serialize the data for each task run. Consider this code:
import java.io.{Externalizable, ObjectInput, ObjectOutput}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

class LoggingSerializable() extends Externalizable {
  override def writeExternal(out: ObjectOutput): Unit = {
    println("xxx serializing")
  }
  override def readExternal(in: ObjectInput): Unit = {
    println("xxx deserializing")
  }
}
object SparkSer {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("SparkSer").setMaster("local")
    val spark = new SparkContext(conf)
    val rdd: RDD[LoggingSerializable] = spark.parallelize(Seq(new LoggingSerializable())).cache()
    println("xxx done loading")
    rdd.foreach(ConstantClosure)
    println("xxx done 1")
    rdd.foreach(ConstantClosure)
    println("xxx done 2")
    spark.stop()
  }
}
object ConstantClosure extends (LoggingSerializable => Unit) with Serializable {
  def apply(t: LoggingSerializable): Unit = {
    println("xxx closure ran")
  }
}
It prints
xxx done loading
xxx serializing
xxx deserializing
xxx closure ran
xxx done 1
xxx serializing
xxx deserializing
xxx closure ran
xxx done 2
Even though I called .cache() on rdd, Spark still serializes the data for each call to .foreach. The official docs say
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
and that MEMORY_ONLY means
Store RDD as deserialized Java objects in the JVM.
Note that Spark tries to serialize the data when it serializes the task, but ConstantClosure does not close over anything, so I don't understand why it would need to serialize any data.
I am asking because I would like to be able to run Spark in local mode without any performance loss, but having to serialize large elements in an RDD for each RDD action can be very costly. I am not sure if this problem is unique to local mode. It seems like Spark can't possibly send the data in an RDD over the wire to workers for every action, even when the RDD is cached.
I'm using spark-core 3.0.0.
This happens because you are using parallelize. parallelize creates a special RDD, ParallelCollectionRDD, which stores the data inside its Partition objects. A partition defines a Spark task, and it is sent to the executors as part of that task (a ShuffleMapTask or ResultTask). If you print the stack trace in readExternal and writeExternal, you should see that they are called while a Spark task is being serialized and deserialized.
In other words, for a ParallelCollectionRDD the data is part of the task metadata, and Spark has to send tasks to the executors to run them; that is where the serialization happens.
Most other RDDs read their data from external systems (such as files), so they do not behave this way.
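One way to sidestep this, sketched under the assumption that the elements can be built on the executors: parallelize only small seed values and create the heavy objects inside a map before caching. Each task then carries just the seeds, and later actions read the cached, deserialized objects. Heavy, SerializationCounter and SeedThenMap are hypothetical names, and the counter is only meaningful in local mode, where the driver and the executor share one JVM.

```scala
import java.io.{Externalizable, ObjectInput, ObjectOutput}
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical counter; tasks can see it only because local mode uses a single JVM.
object SerializationCounter {
  val writes = new AtomicInteger(0)
}

// Stand-in for an element that is expensive to serialize.
class Heavy extends Externalizable {
  override def writeExternal(out: ObjectOutput): Unit = {
    SerializationCounter.writes.incrementAndGet()
    ()
  }
  override def readExternal(in: ObjectInput): Unit = ()
}

object SeedThenMap {
  // Runs two actions and returns how many times a Heavy instance was serialized.
  def run(): Int = {
    val conf = new SparkConf().setAppName("SeedThenMap").setMaster("local")
    val sc = new SparkContext(conf)
    try {
      // Only the seed (an Int) travels inside the task; Heavy is created in the map.
      val rdd = sc.parallelize(Seq(1)).map(_ => new Heavy).cache()
      rdd.foreach(_ => ()) // first action computes and caches the partition
      rdd.foreach(_ => ()) // second action reads the cached objects
      SerializationCounter.writes.get()
    } finally {
      sc.stop()
    }
  }
}
```

With the default MEMORY_ONLY level the cached objects are kept deserialized, so neither action needs to call writeExternal on Heavy. Without cache(), the map would simply rebuild the objects for every action.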
I agree that behavior looks surprising. Off the top of my head, I might guess that it's because caching the blocks is asynchronous, and all of this happens very fast. It's possible it simply does not wait around for the cached partition to become available and recomputes it the second time.
To test that hypothesis, introduce a lengthy wait before the second foreach just to see if that changes things.
I'm aware that the typical way of writing RDD or DataFrame rows to HDFS or S3 is to use saveAsTextFile or df.write. However, I would like to figure out how to write individual records from inside a map transformation, like this:
myRDD.map(row => {
  if (row.contains("something")) {
    // write record to HDFS or S3
  }
  row
})
I know that this can be accomplished with the following code,
val newRDD = myRDD.filter(row => row.contains("something"))
newRDD.saveAsTextFile("myFile")
but I want to continue processing the original myRDD after writing to HDFS and that would require caching myRDD and I am low on memory resources.
I want to continue processing the original myRDD after writing to HDFS and that would require caching myRDD and I am low on memory resources.
The above statement is not correct. You can operate on an RDD further without caching if you have low memory.
You can write from inside a map() function using the Hadoop API, but it is not a good idea to perform output actions inside map(); map() operations should be free of side effects. However, you can use the mapPartitions() function.
You don't need to cache an RDD for doing subsequent operations on it. Caching helps in avoiding recomputation, but RDDs are immutable. A new RDD will be created (preserving the lineage) on each and every transformation.
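A minimal sketch of the mapPartitions approach, assuming the rows are strings and that one side file per partition is acceptable. tee and writeMatching are hypothetical helpers, and the part-<index> file layout is an assumption; the Hadoop calls used are Path.getFileSystem and FileSystem.create.

```scala
import java.io.OutputStream
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.rdd.RDD

object TeeToHdfs {
  // Passes every row through unchanged. Rows accepted by `keep` are also
  // written to `out`, which is closed once the input iterator is exhausted.
  def tee(rows: Iterator[String], out: OutputStream, keep: String => Boolean): Iterator[String] =
    new Iterator[String] {
      private var closed = false
      override def hasNext: Boolean = {
        val more = rows.hasNext
        if (!more && !closed) { out.close(); closed = true }
        more
      }
      override def next(): String = {
        val row = rows.next()
        if (keep(row)) out.write((row + "\n").getBytes(StandardCharsets.UTF_8))
        row
      }
    }

  // Opens one file per partition under `dir` (an assumed layout) and returns
  // an RDD with the same rows, so processing can continue without caching.
  def writeMatching(rdd: RDD[String], dir: String): RDD[String] =
    rdd.mapPartitionsWithIndex { (index, rows) =>
      val path = new Path(s"$dir/part-$index")
      val out = path.getFileSystem(new Configuration()).create(path)
      tee(rows, out, _.contains("something"))
    }
}
```

Nothing is written until an action consumes the returned RDD. Because the write is a side effect, a retried task or a recomputed partition rewrites its file, and an action that stops early (take, for example) can leave a file unclosed, so this fits best when a single downstream action reads every partition in full.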
What does the queueStream function in Spark's StreamingContext do? As I understand it, it is a queue that holds the incoming DStream. If that is the case, how is it handled in a cluster with many nodes? Will each node have this queueStream, with the DStream partitioned among all the nodes in the cluster? How does queueStream work in a cluster setup?
I have read the explanation below in the Spark Streaming documentation (https://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources), but I didn't understand it completely. Please help me to understand it.
Queue of RDDs as a Stream: For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using streamingContext.queueStream(queueOfRDDs). Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
val myQueueRDD = scala.collection.mutable.Queue[RDD[MyObject]]()
val myStream = ssc.queueStream(myQueueRDD)
for (count <- 1 to 100) {
  val randomData = generateData() // Generates random data.
  val rdd = ssc.sparkContext.parallelize(randomData) // Creates an RDD from the random data.
  myQueueRDD += rdd // Adds the RDD to the queue.
}
myStream.foreachRDD(rdd => rdd.mapPartitions(data => evaluate(data)))
How will the above code be executed in the Spark Streaming context with respect to partitions on different nodes?
QueueInputDStream is intended for testing. It uses a standard scala.collection.mutable.Queue to store RDDs, which imitate incoming batches.
Will each node have this queueStream, with the DStream partitioned among all the nodes in the cluster?
No. There is only one copy of the queue, and all data distribution is handled by the RDDs. The compute logic is very simple: at each tick it either dequeues one RDD (oneAtATime set to true) or takes the union of everything currently in the queue (oneAtATime set to false). This applies to DStreams in general - each stream is just a sequence of RDDs, and the RDDs provide the data distribution mechanism.
While it still follows InputDStream API, conceptually it is just a local collection from which you take elements every batchDuration.
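A small local sketch of that behaviour, assuming a 1-second batch interval and integer test data (QueueStreamDemo is a hypothetical name). The queue and the function passed to foreachRDD both live on the driver; only the RDD action is distributed, across the partitions chosen when each RDD was created.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamDemo {
  // Returns the record counts of the non-empty batches, in arrival order.
  def run(): List[Long] = {
    val conf = new SparkConf().setAppName("QueueStreamDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // The queue is an ordinary collection that exists only on the driver.
    val queue = mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(queue, oneAtATime = true)

    // Driver-side collection of results; foreachRDD's function runs on the driver.
    val counts = new ConcurrentLinkedQueue[Long]()
    stream.foreachRDD { rdd =>
      // count() is the distributed part: one task per partition of the batch RDD.
      counts.add(rdd.count())
      ()
    }

    // Three test batches of 10, 20 and 30 records, four partitions each.
    for (i <- 1 to 3) queue += ssc.sparkContext.parallelize(1 to i * 10, numSlices = 4)

    ssc.start()
    ssc.awaitTerminationOrTimeout(6000)
    ssc.stop(stopSparkContext = true, stopGracefully = false)

    // Batches produced after the queue has been drained are empty.
    counts.asScala.toList.filter(_ > 0)
  }
}
```

With oneAtATime = true each batch interval consumes one queued RDD; with oneAtATime = false the first interval would instead process the union of all three.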
According to the Spark documentation at this link:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#other-points-to-remember
"DStreams are executed lazily by the output operations, just like RDDs are lazily executed by RDD actions. Specifically, RDD actions inside the DStream output operations force the processing of the received data. Hence, if your application does not have any output operation, or has output operations like dstream.foreachRDD() without any RDD action inside them, then nothing will get executed. The system will simply receive the data and discard it."
We have a Spark Streaming application with a map operation followed by a DStream output operation. As per the documentation, we have an RDD action inside foreachRDD, namely rdd.first(), but still nothing happens.
tempRequestsWithState is a DStream:
tempRequestsWithState.foreachRDD { rdd =>
  rdd.first()
}
Interestingly, if we call rdd.foreach inside foreachRDD, the application runs perfectly:
tempRequestsWithState.foreachRDD { rdd =>
  rdd.foreach {
  }
}
In our case rdd.foreach is a very slow operation, and we would like to avoid it because we are dealing with a heavy load of 10,000 events/sec. We also need the foreachRDD.
Please let us know if we are missing anything, and whether there is another RDD action we can try inside foreachRDD.
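A possible explanation, offered as an assumption because the upstream code is not shown: rdd.first() is implemented in terms of take(1), so it pulls only as many elements as it needs (and it throws on an empty RDD). Any work done in the upstream map therefore runs for roughly one record per batch rather than for the whole batch. An action that consumes every partition, such as count() or foreachPartition, forces the full computation. The sketch below uses a plain RDD and accumulators to make the difference visible; FirstVersusCount is a hypothetical name.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FirstVersusCount {
  // Returns (records touched when calling first(), records touched when calling count()).
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("FirstVersusCount").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val touchedByFirst = sc.longAccumulator("touchedByFirst")
      val touchedByCount = sc.longAccumulator("touchedByCount")
      val data = sc.parallelize(1 to 1000, numSlices = 4)

      // first() stops as soon as it has one element, so the map function
      // is applied to very few records.
      data.map { x => touchedByFirst.add(1L); x }.first()

      // count() consumes every partition, so the map function is applied
      // to every record.
      data.map { x => touchedByCount.add(1L); x }.count()

      (touchedByFirst.sum, touchedByCount.sum)
    } finally {
      sc.stop()
    }
  }
}
```

If the per-record cost of rdd.foreach is the concern, rdd.foreachPartition is one alternative to try inside foreachRDD: it still evaluates the whole batch but lets any setup work be done once per partition instead of once per record.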
I wrote this program in the Spark shell:
val array = sc.parallelize(List(1, 2, 3, 4))
array.foreach(x => println(x))
This prints some debug statements but not the actual numbers.
The code below works fine:
for (num <- array.take(4)) {
  println(num)
}
I understand that take is an action and will therefore cause Spark to trigger the lazy computation.
But foreach should have worked the same way. Why did foreach not bring anything back from Spark and start the actual processing (that is, leave lazy mode)?
How can I make foreach on the RDD work?
The RDD.foreach method in Spark runs on the cluster, so each worker that holds some of these records runs the function passed to foreach. In other words, your code is running, but the output is printed to the stdout of the Spark workers, not to the driver or your shell session. If you look at the stdout of your Spark workers, you will see the numbers printed there.
You can view the workers' stdout through the web UI of each running executor. An example URL is http://workerIp:workerPort/logPage/?appId=app-20150303023103-0043&executorId=1&logType=stdout
In this example Spark chooses to put all the records of the RDD in the same partition.
This makes sense if you think about it - look at the function signature for foreach - it doesn't return anything.
/**
 * Applies a function f to all elements of this RDD.
 */
def foreach(f: T => Unit): Unit
This is really the purpose of foreach in Scala: it is used for side effects.
When you collect records, you bring them back to the driver, so collect/take results are just a Scala collection inside the Spark driver. You can see the output because the Spark driver (the Spark shell) is what prints to stdout in your session.
The use case for foreach may not be immediately apparent. As an example, if you wanted to perform some external action for each record in the RDD, such as calling a REST API, you could do it in foreach; each Spark worker would then send a request to the API server with the value. If foreach brought the records back, you could easily exhaust the memory of the driver or shell process. This way you avoid that problem and can perform side effects on all the items in an RDD across the cluster.
To see what is in an RDD, I use:
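As a small illustration of a side effect that runs on the executors but is still observable from the driver, here is a sketch that uses an accumulator in place of the REST call (ForeachSideEffects is a hypothetical name). The function passed to foreach returns nothing; only the accumulator updates travel back.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ForeachSideEffects {
  // Sums the elements on the executors and reads the total on the driver.
  def run(): Long = {
    val conf = new SparkConf().setAppName("ForeachSideEffects").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val array = sc.parallelize(List(1, 2, 3, 4))
      val sum = sc.longAccumulator("sum")

      // Runs inside the tasks; no records are sent back to the driver,
      // only the accumulator updates.
      array.foreach(x => sum.add(x.toLong))

      // Read on the driver after the action has finished.
      sum.sum
    } finally {
      sc.stop()
    }
  }
}
```

A println placed in the same spot would run just as often, but its output would go to the executors' stdout, which is why it does not appear in the shell on a real cluster.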
array.collect.foreach(println)
// Instead of collect, use take(...) or takeSample(...) if the RDD is large.
You can use RDD.toLocalIterator() to bring the data to the driver (one RDD partition at a time):
val array = sc.parallelize(List(1, 2, 3, 4))
for (rec <- array.toLocalIterator) { println(rec) }
See also
Spark: Best practice for retrieving big data from RDD to local machine
this blog post about toLocalIterator