Spark in Foreach - apache-spark

val a = sc.textFile("/user/cts367689/datagen.txt")
val b = a.map(x => (x.split(",")(0), x.split(",")(2), x.split(",")(4)))
val c = b.filter(x => (x._3.toInt > 500))
c.foreach(x => println(x))
or
c.foreach {x => {println(x)}}
I am not getting the expected output when I use the foreach statement. I want the output to be printed one element per line, but I am not sure what is wrong in my code.

I think this has been answered a couple of times before, but here we go again, quoting from the official Programming Guide:
Printing elements of an RDD
One common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the expected output and print all the RDD’s elements.
scala> val rdd = sc.parallelize(Seq((1,2,3),(2,3,4)))
// rdd: org.apache.spark.rdd.RDD[(Int, Int, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> rdd.foreach(println)
// (1,2,3)
// (2,3,4)
However, in cluster mode, the stdout being written to by the executors is the executors' own stdout, not the driver's, so stdout on the driver won't show these elements!
To print all elements on the driver, one needs to collect() the data back to the driver node first:
scala> rdd.collect().foreach(println)
// (1,2,3)
// (2,3,4)
And here is the limitation: if your data doesn't fit on the driver, this can cause the driver to run out of memory, because collect() fetches the entire RDD to a single machine.
If you only need to print a few elements of the RDD, a safer approach is to use take():
scala> val rdd = sc.parallelize(Range(1, 1000000000))
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:27
scala> rdd.take(100).foreach(println)
// 1
// 2
// 3
// 4
// 5
// 6
// 7
// 8
// 9
// 10
// [...]
PS: A small note concerning the foreach method. foreach runs a function on each element of the dataset. This method is usually used for side effects such as updating an Accumulator or interacting with external storage systems.
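For example, here is a minimal sketch of such a side effect (the accumulator name and sample values are made up), counting matches with a LongAccumulator instead of printing:
val acc = sc.longAccumulator("over500")
val nums = sc.parallelize(Seq(100, 600, 700))
nums.foreach(x => if (x > 500) acc.add(1))
println(acc.value) // 2, read back on the driver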
I hope that this answers your question.

Related

How can I get N elements from Dataset at a time, without collecting them in memory?

I have a use case where I need to collect the contents of a Dataset, but since this requires memory on my driver, I want to do it 'N' at a time, sequentially.
I would like to query only once, to create the Dataset.
You could do something like:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._ // for toDF and the $ column syntax (already in scope in spark-shell)

val dfSrc = (1 to 100).map(i => (i, scala.util.Random.nextDouble())).toDF("id", "x")

// define an index, e.g. with row_number (numbering starts at 1)
val dsWithRnb = dfSrc
  .withColumn("rnb", row_number().over(Window.orderBy($"id")))
  .cache

// get rows in chunks of 10
val N = 10
(1L to dsWithRnb.count).grouped(N).foreach { batch =>
  // collect only this batch of rows
  val batchData = dsWithRnb.where($"rnb".isin(batch: _*)).collect()
}
Alternatively, you can use toLocalIterator. It gives you an iterator on the driver over the elements of the Dataset.
For example:
import scala.collection.JavaConverters._

val ds = ...
ds
  .cache
  .toLocalIterator // Dataset.toLocalIterator returns a java.util.Iterator
  .asScala         // convert it to a Scala Iterator so grouped is available
  .grouped(N)
  .map(nRows => ...)

Is foreachRDD executed on the Driver?

I am trying to process some XML data received on a JMS queue (QPID) using Spark Streaming. After getting the XML as a DStream, I convert it to DataFrames so I can join it with some of my static data that is already loaded as DataFrames.
But as per the API documentation for the foreachRDD method on DStream, it gets executed on the driver. So does that mean all the processing logic will only run on the driver and not get distributed to workers/executors?
API Documentation
foreachRDD(func)
The most generic output operator that applies a
function, func, to each RDD generated from the stream. This function
should push the data in each RDD to an external system, such as saving
the RDD to files, or writing it over the network to a database. Note
that the function func is executed in the driver process running the
streaming application, and will usually have RDD actions in it that
will force the computation of the streaming RDDs.
So does that mean all processing logic will only run on the driver and not get distributed to workers/executors?
No, the function itself runs on the driver, but don't forget that it operates on an RDD. The inner functions that you'll use on the RDD, such as foreachPartition, map, filter, etc., will still run on the worker nodes. This won't cause all the data to be sent back over the network to the driver, unless you call methods like collect, which do.
To make this clear, if you run the following, you will see "monkey" on the driver's stdout:
myDStream.foreachRDD { rdd =>
  println("monkey")
}
If you run the following, you will see "monkey" on the driver's stdout, and the filter work will be done on whatever executors the rdd is distributed across:
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
}
Let's add the simplification that myDStream only ever receives one RDD, and that this RDD is spread across a set of partitions that we'll call PartitionSetA that exist on MachineSetB where ExecutorSetC are running. If you run the following, you will see "monkey" on the driver's stdout, you will see "turtle" on the stdouts of all executors in ExecutorSetC ("turtle" will appear once for each partition -- many partitions could be on the machine where an executor is running), and the work of both the filter and addition operations will be done across ExecutorSetC:
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
  }
}
One more thing to note is that in the following code, y would end up being sent across the network from the driver to all of ExecutorSetC for each rdd:
val y = 2
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
    val z = x + y
  }
}
To avoid this overhead, you can use broadcast variables, which send the value from the driver to the executors just once. For example:
val y = 2
val broadcastY = sc.broadcast(y)
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
    val z = x + broadcastY.value
  }
}
For sending more complex things over as broadcast variables, such as objects that aren't easily serializable once instantiated, see the following blog post: https://allegro.tech/2015/08/spark-kafka-integration.html
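A hypothetical sketch of that idea (the names below are illustrative, not taken from the post): wrap the expensive or non-serializable object in a small serializable holder that builds it lazily on each executor, and broadcast the holder rather than the object itself.
class LazyHolder[T](build: () => T) extends Serializable {
  lazy val get: T = build() // built at most once per executor JVM, on first use
}

val heavyClient = sc.broadcast(new LazyHolder(() => new java.util.Random()))

myDStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val client = heavyClient.value.get // the same instance is reused within this executor
    partition.foreach(element => client.nextInt())
  }
}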

Filtering records for all values of an array in Spark

I am very new to Spark.
I have a very basic question. I have an array of values:
listofECtokens: Array[String] = Array(EC-17A5206955089011B, EC-17A5206955089011A)
I want to filter an RDD for all of these token values. I tried the following way:
val ECtokens = for (token <- listofECtokens) rddAll.filter(line => line.contains(token))
Output:
ECtokens: Unit = ()
I got an empty Unit even when there are records with these tokens. What am I doing wrong?
The for comprehension has no yield, so it returns Unit, and each filter inside it only defines a new RDD that is immediately discarded (filter is lazy and nothing is ever collected); that is why ECtokens ends up as (). You can get that result in a more efficient way, and the result will be a filtered RDD:
val filteredRDD = rddAll.filter(line => listofECtokens.exists(line.contains))
And then to get the result as an array you should call collect or take on the filteredRDD:
// collect brings the whole RDD to the driver, so be careful because that can result in an OutOfMemoryError on that machine
val ECtokens = filteredRDD.collect()
// if you only need a few elements of the RDD, a safer approach is to use take()
val ECtokens = filteredRDD.take(5)
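As a quick, self-contained check of that approach (the sample lines below are made up):
val listofECtokens = Array("EC-17A5206955089011B", "EC-17A5206955089011A")
val rddAll = sc.parallelize(Seq(
  "foo,EC-17A5206955089011A,bar",  // contains one of the tokens
  "foo,EC-0000000000000000X,bar")) // does not
val filteredRDD = rddAll.filter(line => listofECtokens.exists(line.contains))
filteredRDD.collect().foreach(println) // prints only the first line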

How does lineage get passed down in RDDs in Apache Spark

Does each RDD point to the same lineage graph, or, when a parent RDD gives its lineage to a new RDD, is the lineage graph copied by the child as well, so that parent and child have different graphs? In that case, isn't it memory intensive?
Each RDD maintains a pointer to one or more parents, along with metadata about what type of relationship it has with each parent. For example, when we call val b = a.map() on an RDD, the RDD b just keeps a reference to (and never copies) its parent a; that is the lineage.
When the driver submits the job, the RDD graph is serialized to the worker nodes so that each worker node can apply the series of transformations (map, filter, etc.) on its own partitions. This RDD lineage is also used to recompute the data if a failure occurs.
To display the lineage of an RDD, Spark provides the debug method toDebugString().
Consider the following example:
val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
  .map(words => (words(0), 1))
  .reduceByKey { (a, b) => a + b }
Executing toDebugString() on the splitedLines RDD will output the following:
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
+-(2) MapPartitionsRDD[5] at map at <console>:24 []
| MapPartitionsRDD[4] at map at <console>:23 []
| log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
| log.txt HadoopRDD[0] at textFile at <console>:21 []
For more information about how Spark works internally, please read my other post.
When a transformation (map, filter, etc.) is called, it is not executed by Spark immediately; instead, a lineage is created for each transformation.
A lineage keeps track of all the transformations that have to be applied on that RDD, including the location from which it has to read the data.
For example, consider the following:
val myRdd = sc.textFile("spam.txt")
val filteredRdd = myRdd.filter(line => line.contains("wonder"))
filteredRdd.count()
sc.textFile() and myRdd.filter() do not get executed immediately; they are executed only when an action is called on the RDD, here filteredRdd.count().
An action is used either to save a result to some location or to display it.
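For instance (the output path below is just an illustration), an action that saves rather than displays would also trigger execution:
filteredRdd.saveAsTextFile("hdfs:///tmp/wonder-lines") // runs the whole lineage and writes the result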
RDD lineage information can also be printed with the command filteredRdd.toDebugString (filteredRdd is the RDD here).
Also, the DAG visualization in the Spark web UI shows the complete graph in a very intuitive manner.

Performing operations only on subset of a RDD

I would like to perform some transformations only on a subset of a RDD (to make experimenting in REPL faster).
Is it possible?
RDD has a take(num: Int): Array[T] method; I think I need something similar, but returning an RDD[T].
You can use RDD.sample to get an RDD out, not an Array. For example, to sample ~1% without replacement:
val data = ...
data.count
// ...
// res1: Long = 18066983
val sample = data.sample(false, 0.01, System.currentTimeMillis().toInt)
sample.count
// ...
// res3: Long = 180190
The third parameter is a seed, and is thankfully optional in the next Spark version.
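On a version where the seed is optional, the same sample (reusing the data RDD above) reduces to:
// the seed defaults to a random value when omitted
val sample = data.sample(withReplacement = false, fraction = 0.01)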
RDDs are distributed collections which are materialized on actions only. It is not possible to truncate your RDD to a fixed size and still get an RDD back (hence RDD.take(n) returns an Array[T], just like collect).
If you want to get similarly sized RDDs regardless of the input size, you can truncate the items in each of your partitions; this way you can better control the absolute number of items in the resulting RDD. The size of the resulting RDD will depend on the Spark parallelism.
An example from spark-shell:
import org.apache.spark.rdd.RDD

val numberOfPartitions = 1000
val millionRdd: RDD[Int] = sc.parallelize(1 to 1000000, numberOfPartitions)
val millionRddTruncated: RDD[Int] = millionRdd.mapPartitions(_.take(10))
val billionRddTruncated: RDD[Int] = sc.parallelize(1 to 1000000000, numberOfPartitions).mapPartitions(_.take(10))

millionRdd.count          // 1000000
millionRddTruncated.count // 10000 = 10 items * 1000 partitions
billionRddTruncated.count // 10000 = 10 items * 1000 partitions
Apparently it's also possible to create an RDD subset by first using its take method and then passing the returned array to SparkContext's makeRDD[T](seq: Seq[T], numSlices: Int = defaultParallelism), which returns a new RDD.
This approach seems dodgy to me though. Is there a nicer way?
I always use the parallelize function of SparkContext to distribute from an Array[T], but it seems makeRDD does the same thing. Both ways are correct.
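A minimal sketch of that take-then-redistribute approach, reusing the data RDD from above (the subset size of 1000 is arbitrary):
val subsetAsArray = data.take(1000)           // Array[T] on the driver
val subsetRdd = sc.parallelize(subsetAsArray) // back to an RDD[T]
// sc.makeRDD(subsetAsArray) would be equivalent here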
