Cartesian of DStream - apache-spark

I use the Spark cartesian function to generate a list of N pairs of values.
I then map over these pairs to compute a distance metric between each pair of users:
val cartesianUsers: org.apache.spark.rdd.RDD[(distance.classes.User, distance.classes.User)] = users.cartesian(users)
cartesianUsers.map(m => manDistance(m._1, m._2))
This works as expected.
Using the Spark Streaming library I create a DStream and then map over it:
val customReceiverStream: ReceiverInputDStream[String] = ssc.receiverStream....
customReceiverStream.foreachRDD(m => {
  println("size is " + m)
})
I could use the cartesian function within customReceiverStream.foreachRDD, but according to the docs (http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html) this is not its intended use:
foreachRDD(func) The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, like saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
How can I compute the cartesian of a DStream? Or am I misunderstanding the use of DStreams?

I wasn't aware of the transform method:
customReceiverStream.transform(rdd => rdd.cartesian(rdd))
A nice talk that also mentions the transform function at around 17:00: https://www.youtube.com/watch?v=g171ndOHgJ0
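Putting the pieces together, a rough per-batch sketch might look like the following (parseUser is a hypothetical helper that turns each incoming string into a User, and manDistance is assumed to return a Double, as in the question):
val userStream: DStream[distance.classes.User] = customReceiverStream.map(parseUser)
// transform exposes each batch as a plain RDD, so the cartesian/map combination
// from the non-streaming code carries over unchanged.
val distances: DStream[Double] = userStream.transform { rdd =>
  rdd.cartesian(rdd).map { case (a, b) => manDistance(a, b) }
}
distances.print()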

Related

spark collect as Array[T] and not as Array[Row] from data frame

I can collect a column like this using the RDD API.
df.map(r => r.getAs[String]("column")).collect
However, as I am initially using a Dataset, I would rather not switch API levels. A simple df.select("column").collect returns an Array[Row], on which the .flatten operator no longer works.
How can I collect to an Array[T] (e.g. Array[String]) directly?
With Datasets (Spark version >= 2.0.0), you just need to convert the DataFrame to a Dataset and then collect it:
df.select("column").as[String].collect()
returns an Array[String].
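For completeness, a minimal self-contained sketch (the SparkSession name and the column name are illustrative); the implicits import provides the Encoder[String] that .as[String] relies on:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("collect-column").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c").toDF("column")
// Selecting the column as a typed Dataset[String] lets collect() return Array[String] directly.
val values: Array[String] = df.select("column").as[String].collect()
// values: Array(a, b, c)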

Is foreachRDD executed on the Driver?

I am trying to process some XML data received on a JMS queue (QPID) using Spark Streaming. After getting the XML as a DStream I convert it to DataFrames so I can join it with some static data I have already loaded as DataFrames.
But as per the API documentation for the foreachRDD method on DStream:
it gets executed on the driver, so does that mean all processing logic will only run on the driver and not get distributed to workers/executors?
API Documentation
foreachRDD(func): The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
So does that mean all processing logic will only run on the driver and not get distributed to workers/executors?
No, the function itself runs on the driver, but don't forget that it operates on an RDD. The inner functions that you use on the RDD, such as foreachPartition, map, filter, etc., will still run on the worker nodes. This won't cause all the data to be sent back over the network to the driver, unless you call methods like collect, which do.
To make this clear, if you run the following, you will see "monkey" on the driver's stdout:
myDStream.foreachRDD { rdd =>
  println("monkey")
}
If you run the following, you will see "monkey" on the driver's stdout, and the filter work will be done on whatever executors the rdd is distributed across:
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
}
Let's add the simplification that myDStream only ever receives one RDD, and that this RDD is spread across a set of partitions that we'll call PartitionSetA that exist on MachineSetB where ExecutorSetC are running. If you run the following, you will see "monkey" on the driver's stdout, you will see "turtle" on the stdouts of all executors in ExecutorSetC ("turtle" will appear once for each partition -- many partitions could be on the machine where an executor is running), and the work of both the filter and addition operations will be done across ExecutorSetC:
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
  }
}
One more thing to note is that in the following code, y would end up being sent across the network from the driver to all of ExecutorSetC for each rdd:
val y = 2
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
    val z = x + y
  }
}
To avoid this overhead, you can use broadcast variables, which send the value from the driver to the executors just once. For example:
val y = 2
val broadcastY = sc.broadcast(y)
myDStream.foreachRDD { rdd =>
  println("monkey")
  rdd.filter(element => element == "Save me!")
  rdd.foreachPartition { partition =>
    println("turtle")
    val x = 1 + 1
    val z = x + broadcastY.value
  }
}
For sending more complex things over as broadcast variables, such as objects that aren't easily serializable once instantiated, you can see the following blog post: https://allegro.tech/2015/08/spark-kafka-integration.html
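As a rough sketch of the idea behind that post, a lazily initialized singleton lets each executor JVM build its own instance of a non-serializable resource on first use instead of shipping it from the driver (ExpensiveClient here is a hypothetical placeholder for a connection pool, producer, or similar):
// Hypothetical non-serializable resource.
class ExpensiveClient(host: String) {
  def send(record: String): Unit = println(s"sending $record to $host")
}

// Scala objects are one-per-JVM, so each executor builds its own client lazily.
object ClientHolder {
  lazy val client = new ExpensiveClient("example-host")
}

myDStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val client = ClientHolder.client   // created on the executor, never serialized
    partition.foreach(record => client.send(record.toString))
  }
}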

Apache Spark: stepwise execution

For a performance measurement I want to execute my Scala program written for Spark stepwise, i.e.
execute first operator; materialize result;
execute second operator; materialize result;
...
and so on. The original code:
var filename = new String("<filename>")
var text_file = sc.textFile(filename)
var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
counts.saveAsTextFile("file://result")
So I want the execution of var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) to be stepwise.
Is calling counts.foreachPartition(x => {}) after every operator the right way to do it?
Or is writing to /dev/null with saveAsTextFile() a better alternative? Does Spark actually have something like a NullSink for that purpose? I wasn't able to write to /dev/null with saveAsTextFile() because /dev/null already exists. Is there a way to overwrite the Spark result folder?
And should the temporary result after each operation be cached with cache()?
What is the best way to separate the execution?
Spark supports two types of operations: actions and transformations. Transformations, as the name implies, turn datasets into new ones through the combination of the transformation operator and (in some cases, optionally) a function provided to the transformation. Actions, on the other hand, run through a dataset with some computation to provide a value to the driver.
There are two things that Spark does that make your desired task a little difficult: it bundles non-shuffling transformations into execution blocks called stages, and stages in the scheduling graph must be triggered through actions.
For your case, provided your input isn't massive, I think it would be easiest to trigger your transformations with a dummy action (e.g. count(), collect()) as the RDD will be materialized. During RDD computation, you can check the Spark UI to gather any performance statistics about the steps/stages/jobs used to create it.
This would look something like:
val text_file = sc.textFile(filename)
val words = text_file.flatMap(line => line.split(" "))
words.count()
val wordCount = words.map(word => (word, 1))
wordCount.count()
val wordCounts = wordCount.reduceByKey(_ + _)
wordCounts.count()
Some notes:
Since RDDs are, for all intents and purposes, immutable, they should be stored in vals
You can shorten your reduceByKey() syntax with underscore notation
Your approach with foreachPartition() could work since it is an action, but it would require a change in your functions since you are operating over an iterator on your partition
Caching only makes sense if you either create multiple RDDs from a parent RDD (branching out) or run iterated computation over the same RDD (perhaps in a loop)
You can also simply invoke RDD.persist() or RDD.cache() after every transformation, but make sure you have the right StorageLevel defined.
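As an illustration, a minimal sketch of that approach on the word-count pipeline above, persisting each intermediate RDD explicitly (MEMORY_ONLY is just one possible StorageLevel):
import org.apache.spark.storage.StorageLevel

val words = text_file.flatMap(_.split(" ")).persist(StorageLevel.MEMORY_ONLY)
words.count()   // action that forces this step to be materialized and cached

val wordCounts = words.map((_, 1)).reduceByKey(_ + _).persist(StorageLevel.MEMORY_ONLY)
wordCounts.count()

// Free the cached data once the measurement is done.
words.unpersist()
wordCounts.unpersist()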

Does the DStream returned by updateStateByKey contain only one RDD?

Does the DStream returned by the updateStateByKey function contain only one RDD? If not, under what circumstances will the DStream contain more than one RDD?
It contains an RDD every batch. The DStream returned by updateStateByKey is a "state" DStream. You can still view this DStream as a normal DStream, though. For every batch, the RDD represents the latest state (key-value pairs) according to the update function that you pass to updateStateByKey.
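For reference, a minimal sketch of such an update function for a running count per key (the pairs DStream name is illustrative):
// Adds this batch's values for a key to the previously stored running total.
def updateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))

val stateDStream = pairs.updateStateByKey[Int](updateFunc _)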
That doesn't seem to match what you said: the code below (part of my application) only prints once per batch, so I think every stateful DStream has just one RDD.
@transient val statefulDStream = lines.transform(...).map(x => (x, 1)).updateStateByKey(updateFuncs)
statefulDStream.foreachRDD { rdd =>
  println(rdd.first())
}
Yes, the DStream returned by updateStateByKey only has one RDD per batch.

Apache spark applying map transformation on RDDs

I have a HadoopRDD from which I'm creating a first RDD with a simple map function, then a second RDD from the first RDD with another simple map function. Something like:
HadoopRDD -> RDD1 -> RDD2.
My question is whether Spark will iterate over the HadoopRDD record by record to generate RDD1 and then iterate over RDD1 record by record to generate RDD2, or whether it iterates over the HadoopRDD and generates RDD1 and then RDD2 in one go.
Short answer: rdd.map(f).map(g) will be executed in one pass.
tl;dr
Spark splits a job into stages. A stage applied to a partition of data is a task.
In a stage, Spark will try to pipeline as many operations as possible. "Possible" is determined by the need to rearrange data: an operation that requires a shuffle will typically break the pipeline and create a new stage.
In practical terms:
Given `rdd.map(...).map(...).filter(...).sort(...).map(...)`, Spark will create two stages:
.map(...).map(...).filter(...)
.sort(...).map(...)
This can be retrieved from an rdd using rdd.toDebugString
The same job example above will produce this output:
val mapped = rdd.map(identity).map(identity).filter(_>0).sortBy(x=>x).map(identity)
scala> mapped.toDebugString
res0: String =
(6) MappedRDD[9] at map at <console>:14 []
| MappedRDD[8] at sortBy at <console>:14 []
| ShuffledRDD[7] at sortBy at <console>:14 []
+-(8) MappedRDD[4] at sortBy at <console>:14 []
| FilteredRDD[3] at filter at <console>:14 []
| MappedRDD[2] at map at <console>:14 []
| MappedRDD[1] at map at <console>:14 []
| ParallelCollectionRDD[0] at parallelize at <console>:12 []
Now, coming to the key point of your question: pipelining is very efficient. The complete pipeline is applied to each element of each partition once. This means that rdd.map(f).map(g) will perform as fast as rdd.map(f andThen g) (with some negligible overhead).
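To make that concrete, a small sketch comparing the two forms; both run as a single pipelined stage and traverse each partition exactly once (f and g are placeholder functions):
val f = (x: Int) => x + 1
val g = (x: Int) => x * 2

val nums = sc.parallelize(1 to 10)
val chained  = nums.map(f).map(g)      // two map calls, still one stage
val composed = nums.map(f andThen g)   // functionally equivalent, single map call
// chained.toDebugString and composed.toDebugString both show no shuffle boundary.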
Apache Spark will iterate over the HadoopRDD record by record, in no specific order (the data will be split and sent to the workers), and "apply" the first transformation to compute RDD1. After that, the second transformation is applied to each element of RDD1 to get RDD2, again in no specific order, and so on for successive transformations. You can see this in the map method's signature:
// Return a new RDD by applying a function to all elements of this RDD.
def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
Apache Spark follows a DAG (Directed Acyclic Graph) execution engine. It won't actually trigger any computation until a value is required, so you have to distinguish between transformations and actions.
EDIT:
In terms of performance, I am not completely aware of the underlying implementation of Spark, but I understand there shouldn't be a significant performance loss other than adding extra (unnecessary) tasks in the related stage. In my experience, you don't normally use transformations of the same "nature" successively (in this case, two successive maps). You should be more concerned about performance when shuffle operations take place, because you are moving data around, and that has a clear impact on your job's performance. Here you can find a common issue regarding that.
