Apache Spark applying map transformation on RDDs - apache-spark

I have a HadoopRDD from which I'm creating a first RDD with a simple map function, then a second RDD from the first RDD with another simple map function. Something like:
HadoopRDD -> RDD1 -> RDD2.
My question is whether Spark will iterate over the HadoopRDD record by record to generate RDD1 and then iterate over RDD1 record by record to generate RDD2, or whether it iterates over the HadoopRDD once and generates RDD1 and then RDD2 in one go.

Short answer: rdd.map(f).map(g) will be executed in one pass.
tl;dr
Spark splits a job into stages. A stage applied to a partition of data is a task.
In a stage, Spark will try to pipeline as many operations as possible. "Possible" is determined by the need to rearrange data: an operation that requires a shuffle will typically break the pipeline and create a new stage.
In practical terms:
The job `rdd.map(...).map(...).filter(...).sortBy(...).map(...)`
will result in two stages:
.map(...).map(...).filter(...)
.sortBy(...).map(...)
This can be retrieved from an rdd using rdd.toDebugString
The same job example above will produce this output:
val mapped = rdd.map(identity).map(identity).filter(_>0).sortBy(x=>x).map(identity)
scala> mapped.toDebugString
res0: String =
(6) MappedRDD[9] at map at <console>:14 []
| MappedRDD[8] at sortBy at <console>:14 []
| ShuffledRDD[7] at sortBy at <console>:14 []
+-(8) MappedRDD[4] at sortBy at <console>:14 []
| FilteredRDD[3] at filter at <console>:14 []
| MappedRDD[2] at map at <console>:14 []
| MappedRDD[1] at map at <console>:14 []
| ParallelCollectionRDD[0] at parallelize at <console>:12 []
Now, coming to the key point of your question: pipelining is very efficient. The complete pipeline will be applied to each element of each partition once. This means that rdd.map(f).map(g) will perform as fast as rdd.map(f andThen g) (with some negligible overhead).
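To make the one-pass behaviour concrete, here is a minimal plain-Python sketch. It does not use Spark: lazy_map is a hypothetical helper standing in for a narrow transformation, and the list stands in for one partition. It only illustrates the idea that chained lazy maps hand each record through f and then g before the next record is read.

```python
from typing import Callable, Iterable, Iterator, List


def lazy_map(func: Callable, records: Iterable) -> Iterator:
    """Yield func(record) one record at a time (hypothetical stand-in for map)."""
    for record in records:
        yield func(record)


trace: List[str] = []  # records the order in which f and g are called


def f(x: int) -> int:
    trace.append(f"f({x})")
    return x + 1


def g(x: int) -> int:
    trace.append(f"g({x})")
    return x * 2


partition = [1, 2, 3]  # assumed contents of a single partition

# Chaining only builds the pipeline; nothing has run yet.
pipeline = lazy_map(g, lazy_map(f, partition))
calls_before_action = len(trace)

# Consuming the iterator plays the role of an action.
result = list(pipeline)
```

After the last line, trace interleaves the calls (f, g, f, g, ...) instead of listing all f calls first, which is what "one pass" means here.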

Apache Spark will iterate over the HadoopRDD record by record in no specific order (data will be split and sent to the workers) and "apply" the first transformation to compute RDD1. After that, the second transformation is applied to each element of RDD1 to get RDD2, again in no specific order, and so on for successive transformations. You can notice it from the map method signature:
// Return a new RDD by applying a function to all elements of this RDD.
def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
Apache Spark uses a DAG (Directed Acyclic Graph) execution engine. It won't actually trigger any computation until a value is required, so you have to distinguish between transformations and actions.
EDIT:
In terms of performance, I am not completely aware of the underlying implementation of Spark, but I understand there shouldn't be a significant performance loss other than adding extra (unnecessary) tasks in the related stage. From my experience, you don't normally use transformations of the same "nature" successively (in this case two successive maps). You should be more concerned about performance when shuffling operations take place, because you are moving data around and this has a clear impact on your job performance. Here you can find a common issue regarding that.

Related

Avoid repartition costs when filtering and then coalescing

I am implementing a range query on an RDD of (x,y) points in pyspark. I partitioned the xy space into a 16*16 grid (256 cells) and assigned each point in my RDD to one of these cells.
The gridMappedRDD is a PairRDD: (cell_id, Point object)
I partitioned this RDD to 256 partitions, using:
gridMappedRDD.partitionBy(256)
The range query is a rectangular box. I have a method for my Grid object which can return the list of cell ids which overlap with the query range. So, I used this as a filter to prune the unrelated cells:
filteredRDD = gridMappedRDD.filter(lambda x: x[0] in candidateCells)
But the problem is that when running the query and then collecting the results, all 256 partitions are evaluated: a task is created for each partition.
To avoid this problem, I tried coalescing the filteredRDD to the length of candidateCell list and I hoped this could solve the problem.
filteredRDD.coalesce(len(candidateCells))
In fact the resulting RDD has len(candidateCells) partitions but the partitions are not the same as gridMappedRDD.
As stated in the coalesce documentation, the shuffle parameter is False and no shuffle should be performed among partitions but I can see (with the help of glom()) that this is not the case.
For example after a coalesce(4) with candidateCells=[62, 63, 78, 79] the partitions are like this:
[[(62, P), (62, P) .... , (63, P)],
[(78, P), (78, P) .... , (79, P)],
[], []
]
Actually, by coalescing, I get a shuffle read equal to the size of my whole dataset for every task, which takes a significant amount of time. What I need is an RDD with only the partitions related to cells in candidateCells, without any shuffles.
So, my question is: is it possible to filter only some partitions without reshuffling? For the above example, my filteredRDD would have 4 partitions with exactly the same data as originalRDD's 62nd, 63rd, 78th and 79th partitions. That way, the query could be directed to the affected partitions only.
You made a few incorrect assumptions here:
The shuffle is not related to coalesce (nor is coalesce useful here). It is caused by partitionBy. Partitioning by definition requires a shuffle.
Partitioning cannot be used to optimize filter. Spark knows nothing about the function you use (it is a black box).
Partitioning doesn't uniquely map keys to partitions. Multiple keys can be placed on the same partition - How does HashPartitioner work?
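The last point can be illustrated with a small plain-Python sketch. It assumes a simple hash-modulo placement and Python's built-in hash, which is an approximation for illustration and not a claim about the exact function a given Spark version uses. partition_for and group_keys are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, List


def partition_for(key: int, num_partitions: int) -> int:
    """Hash-modulo placement (assumption: built-in hash, for illustration only)."""
    return hash(key) % num_partitions


def group_keys(keys: List[int], num_partitions: int) -> Dict[int, List[int]]:
    """Return a mapping partition index -> keys that land on that partition."""
    placement: Dict[int, List[int]] = defaultdict(list)
    for key in keys:
        placement[partition_for(key, num_partitions)].append(key)
    return dict(placement)


candidate_cells = [62, 63, 78, 79]

# With 256 partitions these four cell ids happen to land on distinct partitions.
with_256 = group_keys(candidate_cells, 256)

# With only 4 partitions several keys share a partition.
with_4 = group_keys(candidate_cells, 4)
```

With 4 partitions, cells 62 and 78 share one partition and 63 and 79 share another, so a partition index alone does not identify a single key.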
What can you do:
If the resulting subset is small, repartition and apply lookup for each key:
from itertools import chain
partitionedRDD = gridMappedRDD.partitionBy(256)
chain.from_iterable(
((c, x) for x in partitionedRDD.lookup(c))
for c in candidateCells
)
If the data is large, you can try to skip scanning partitions (the number of tasks won't change, but some tasks can be short-circuited):
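candidatePartitions = [
# partitionFunc returns the raw hash; take it modulo the partition count to get the index
partitionedRDD.partitioner.partitionFunc(c) % partitionedRDD.getNumPartitions()
for c in candidateCells
]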
partitionedRDD.mapPartitionsWithIndex(
lambda i, xs: (x for x in xs if x[0] in candidateCells) if i in candidatePartitions else []
)
These two methods make sense only if you perform multiple "lookups". If it is a one-off operation, it is better to perform a linear filter:
It is cheaper than shuffle and repartitioning.
If the initial data is uniformly distributed, downstream processing will be able to better utilize available resources.
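The partition-skipping idea can be sketched in plain Python without Spark. Here a list of lists stands in for the partitions, placement is assumed to be cell_id modulo the partition count, and scan_partition is a hypothetical helper that mirrors the role of the function passed to mapPartitionsWithIndex. It is a sketch of the control flow, not of Spark's scheduling.

```python
from typing import List, Set, Tuple

Record = Tuple[int, str]  # (cell_id, point); a string stands in for a Point object

num_partitions = 8  # assumed small value to keep the example readable
points: List[Record] = [(cell, f"P{cell}") for cell in range(16)]

# Assumed placement: cell_id modulo the number of partitions.
partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
for cell, point in points:
    partitions[cell % num_partitions].append((cell, point))

candidate_cells: Set[int] = {2, 3}
candidate_partitions: Set[int] = {cell % num_partitions for cell in candidate_cells}

scanned: List[int] = []  # indices of partitions whose records were actually read


def scan_partition(index: int, records: List[Record]) -> List[Record]:
    """Return matching records, short-circuiting partitions that cannot match."""
    if index not in candidate_partitions:
        return []  # still one call per partition, but no records are read
    scanned.append(index)
    return [r for r in records if r[0] in candidate_cells]


matches = [r for i, part in enumerate(partitions) for r in scan_partition(i, part)]
```

Every partition is still visited once, matching the remark that the number of tasks does not change. The inner filter is still needed because partition 2 also holds cell 10.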

Is Dataset.rdd an action or transformation?

One of the ways to evaluate whether a DataFrame is empty is to do df.rdd.isEmpty(). However, I see rdd at mycode.scala:123 in the Spark UI executions, which makes me wonder whether this rdd function is actually an action instead of a transformation.
I know that isEmpty() is an action, but I do see a separate stage for isEmpty() at mycode.scala:234, so I think they are different actions?
rdd is generated to represent a structured query in "RDD terms" so Spark can execute it. It is an RDD of JVM objects of your type T. If used incorrectly, it can cause memory problems since it:
Transfers internally-managed optimized rows that live outside JVM to the memory space in JVM
Transforms the binary rows to your business objects (the JVM "true" representation)
The first will increase the JVM memory required for the computation while the latter is an extra transformation step.
For such a simple calculation where you count the number of rows, you'd rather stick to count as the optimized and fairly cheap computation (that can avoid copying objects and applying schema).
Internally, Dataset keeps rows in the InternalRow format. That decreases the JVM memory requirement for your Spark application. The RDD (from rdd) is computed to represent the Spark transformations that are going to be executed once a Spark action is executed.
Please note that executing rdd creates an RDD and does require some calculations too.
So, yes, rdd might be considered an action as it "executes" the query (i.e. the physical plan of the Dataset that sits behind), but in the end it just gives an RDD (so it can't be an action by definition since Spark actions return a non-RDD value).
As you can see in the code:
lazy val rdd: RDD[T] = {
val objectType = exprEnc.deserializer.dataType
val deserialized = CatalystSerde.deserialize[T](logicalPlan) // <-- HERE see explanation below
sparkSession.sessionState.executePlan(deserialized).toRdd.mapPartitions { rows =>
rows.map(_.get(0, objectType).asInstanceOf[T])
}
}
rdd is computed lazily and only once.
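As a rough analogy for "lazily and only once", Python's functools.cached_property behaves much like a Scala lazy val. The class below is a hypothetical stand-in and not Spark's Dataset. The conversions counter exists only to show how often the conversion body runs.

```python
from functools import cached_property
from typing import Dict, List, Tuple


class DatasetSketch:
    """Hypothetical stand-in for a Dataset; rows are lists of (column, value) pairs."""

    def __init__(self, rows: List[List[Tuple[str, int]]]) -> None:
        self._rows = rows
        self.conversions = 0  # how many times the conversion body has run

    @cached_property
    def rdd(self) -> List[Dict[str, int]]:
        # Runs on first access only; the result is stored on the instance.
        self.conversions += 1
        return [dict(row) for row in self._rows]


ds = DatasetSketch([[("id", 0)], [("id", 1)]])
conversions_before = ds.conversions  # nothing has been computed yet

first = ds.rdd   # triggers the conversion
second = ds.rdd  # reuses the stored result
```

Accessing ds.rdd twice returns the same object and runs the conversion once, which is the behaviour the lazy val declaration above describes.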
One of the ways to evaluate whether a DataFrame is empty is to do df.rdd.isEmpty()
I wonder where you found that. I'd just count:
count(): Long Returns the number of rows in the Dataset.
toRdd Lazy Value
If you insist on going fairly low-level to check whether your Dataset is empty or not, I'd rather use Dataset.queryExecution.toRdd instead. That's almost like rdd without this extra copying and applying schema.
df.queryExecution.toRdd.isEmpty
Compare the following RDD lineages and consider which one looks better.
val dataset = spark.range(5).withColumn("group", 'id % 2)
scala> dataset.rdd.toDebugString
res1: String =
(8) MapPartitionsRDD[8] at rdd at <console>:26 [] // <-- extra deserialization step
| MapPartitionsRDD[7] at rdd at <console>:26 []
| MapPartitionsRDD[6] at rdd at <console>:26 []
| MapPartitionsRDD[5] at rdd at <console>:26 []
| ParallelCollectionRDD[4] at rdd at <console>:26 []
// Compare with a more memory-optimized alternative
// Avoids copies and has no schema
scala> dataset.queryExecution.toRdd.toDebugString
res2: String =
(8) MapPartitionsRDD[11] at toRdd at <console>:26 []
| MapPartitionsRDD[10] at toRdd at <console>:26 []
| ParallelCollectionRDD[9] at toRdd at <console>:26 []
From Spark's perspective, the transformations are fairly cheap since they don't cause any shuffles, but given that the memory requirements differ between the two computations, I'd use the latter (with toRdd).
rdd Lazy Value
rdd represents the content of the Dataset as a (lazily-created) RDD with rows of the JVM type T.
rdd: RDD[T]
As you can see in the source code (pasted above), requesting rdd in the end will trigger one extra computation just to get the RDD.
It creates a new logical plan to deserialize the Dataset’s logical plan, i.e. you get an extra deserialization from the internal binary row format that is managed outside the JVM to its corresponding representation as JVM objects living inside the JVM (think of GC that you should avoid at all cost).

What happens to previous RDD when the next RDD is materialized?

In Spark, I would like to know what happens to the previous RDD when the next RDD is materialized.
Let's say I have the Scala code below:
val lines = sc.textFile("/user/cloudera/data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
lines is a base RDD,
and similarly I have the lineLengths RDD.
I know that these two RDDs get materialized when the reduce action is invoked.
My question is: while the data is flowing through these two RDDs, what happens to lines when lineLengths gets materialized?
Once lineLengths is materialized, does the data inside lines get removed?
Let's say a production Spark job has 100 RDDs, and a single action is called against the 100th RDD.
What happens to the data in the 1st RDD when the 99th RDD gets materialized?
Is the data in all RDDs deleted only once the final action has returned its output?
Or
Is the data in each RDD removed automatically once that RDD passes its data to the next RDD as per the DAG?
Actually, both lines and lineLengths will still hold their RDDs after the reduce. You can think of an RDD as a DAG of transformations, as you mentioned. So if you later want to perform other transformations on lines or lineLengths, you can. Even though they materialize during the reduce, unless you cache them directly they will run through their transformations again when another action is invoked on a DAG they belong to.
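The recomputation behaviour can be mimicked in plain Python. LazyDataset below is a hypothetical toy class, not Spark's RDD: it stores only a recipe (a function) plus an optional cached result, and it materializes whole lists for simplicity. The stats counter shows how often the source is read.

```python
from functools import reduce as fold
from typing import Callable, Dict, List, Optional

stats: Dict[str, int] = {"reads": 0}  # counts how often the source is re-read


def read_lines() -> List[str]:
    """Stand-in for reading a text file; the contents are made up."""
    stats["reads"] += 1
    return ["spark", "rdd", "lineage"]


class LazyDataset:
    """Toy dataset that keeps a recipe instead of data (hypothetical)."""

    def __init__(self, compute: Callable[[], list]) -> None:
        self._compute = compute
        self._use_cache = False
        self._cached: Optional[list] = None

    def map(self, func: Callable) -> "LazyDataset":
        # The child only keeps a reference to its parent's recipe.
        return LazyDataset(lambda: [func(x) for x in self.collect()])

    def cache(self) -> "LazyDataset":
        self._use_cache = True
        return self

    def collect(self) -> list:
        if not self._use_cache:
            return self._compute()  # recompute from the source every time
        if self._cached is None:
            self._cached = self._compute()
        return self._cached

    def reduce(self, func: Callable):
        return fold(func, self.collect())


lines = LazyDataset(read_lines)
line_lengths = lines.map(len)

total_first = line_lengths.reduce(lambda a, b: a + b)
total_second = line_lengths.reduce(lambda a, b: a + b)
reads_without_cache = stats["reads"]  # the source was read once per action

line_lengths.cache()
line_lengths.reduce(lambda a, b: a + b)  # computes and stores the result
line_lengths.reduce(lambda a, b: a + b)  # served from the stored result
reads_with_cache = stats["reads"]
```

Without cache(), each reduce re-runs the whole recipe from the source. After cache(), only the first reduce does.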

Apache Spark: stepwise execution

For a performance measurement I want to execute my Scala program, written for Spark, stepwise, i.e.
execute first operator; materialize result;
execute second operator; materialize result;
...
and so on. The original code:
var filename = new String("<filename>")
var text_file = sc.textFile(filename)
var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
counts.saveAsTextFile("file://result")
So I want the execution of var counts = text_file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) to be stepwise.
Is calling counts.foreachPartition(x => {}) after every operator the right way to do it?
Or is writing to /dev/null with saveAsTextFile() a better alternative? And does Spark actually have something like a NullSink for that purpose? I wasn't able to write to /dev/null with saveAsTextFile() because /dev/null already exists. Is there a way to overwrite the Spark result folder?
And should the temporary result after each operation be cached with cache()?
What is the best way to separate the execution?
Spark supports two types of operations: actions and transformations. Transformations, as the name implies, turn datasets into new ones through the combination of the transformation operator and (in some cases, optionally) a function provided to the transformation. Actions, on the other hand, run through a dataset with some computation to provide a value to the driver.
There are two things that Spark does that make your desired task a little difficult: it bundles non-shuffling transformations into execution blocks called stages, and stages in the scheduling graph must be triggered through actions.
For your case, provided your input isn't massive, I think it would be easiest to trigger your transformations with a dummy action (e.g. count(), collect()) as the RDD will be materialized. During RDD computation, you can check the Spark UI to gather any performance statistics about the steps/stages/jobs used to create it.
This would look something like:
val text_file = sc.textFile(filename)
val words = text_file.flatMap(line => line.split(" "))
words.count()
val wordCount = words.map(word => (word, 1))
wordCount.count()
val wordCounts = wordCount.reduceByKey(_ + _)
wordCounts.count()
Some notes:
Since RDDs are for all intents and purposes immutable, they should be stored in vals
You can shorten your reduceByKey() syntax with underscore notation
Your approach with foreachPartition() could work since it is an action, but it would require a change in your functions since you are operating over an iterator on your partition
Caching only makes sense if you either create multiple RDDs from a parent RDD (branching out) or run an iterated computation over the same RDD (perhaps in a loop)
You can also simply invoke RDD.persist() or RDD.cache() after every transformation, but ensure that you have the right StorageLevel defined.
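The "force each step, then measure" approach can be sketched in plain Python. This is not Spark code: timed and reduce_by_key are hypothetical helpers, the input lines are made up, and each step is materialized as a list so that its cost is paid inside the timed call.

```python
import time
from typing import Callable, Dict, List, Tuple

timings: Dict[str, float] = {}  # step label -> elapsed seconds


def timed(label: str, step: Callable[[], list]) -> list:
    """Run one step to completion and record how long it took."""
    start = time.perf_counter()
    result = step()
    timings[label] = time.perf_counter() - start
    return result


def reduce_by_key(pairs: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sum the values per key, keeping keys in first-seen order."""
    totals: Dict[str, int] = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


lines = ["a b a", "b c"]  # assumed sample input

words = timed("flatMap", lambda: [w for line in lines for w in line.split(" ")])
pairs = timed("map", lambda: [(w, 1) for w in words])
counts = timed("reduceByKey", lambda: reduce_by_key(pairs))
```

In this sketch every step reuses the previous step's materialized list. In Spark, each action re-runs the uncached lineage, so a per-step timing includes upstream work unless the intermediate RDD is persisted.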

How does lineage get passed down in RDDs in Apache Spark

Does each RDD point to the same lineage graph
or
when a parent RDD gives its lineage to a new RDD, is the lineage graph copied by the child as well, so that the parent and child have different graphs? In that case, isn't it memory intensive?
Each RDD maintains a pointer to one or more parents along with metadata about what type of relationship it has with each parent. For example, when we call val b = a.map() on an RDD, the RDD b just keeps a reference to its parent a (and never copies it); that chain of references is the lineage.
And when the driver submits the job, the RDD graph is serialized to the worker nodes so that each of the worker nodes applies the series of transformations (map, filter, etc.) on different partitions. Also, this RDD lineage will be used to recompute the data if some failure occurs.
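The parent-pointer idea can be sketched in a few lines of plain Python. Node is a hypothetical class with a single parent for simplicity and is not Spark's RDD. It only shows that a child holds a reference to its parent, so the lineage is shared and not copied.

```python
from typing import List, Optional


class Node:
    """Toy lineage node: a name plus a reference to its parent (hypothetical)."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent  # a reference, never a copy

    def derive(self, name: str) -> "Node":
        """Create a child node, as a transformation would."""
        return Node(name, parent=self)

    def lineage(self) -> List[str]:
        """Walk the parent pointers from this node back to the source."""
        names: List[str] = []
        node: Optional["Node"] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


a = Node("textFile")
b = a.derive("map")
c = b.derive("filter")
d = b.derive("reduceByKey")  # a second child of the same parent
```

Both c and d point at the very same b object, so adding children costs one reference each, not a copy of the graph.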
To display the lineage of an RDD, Spark provides the debug method toDebugString().
Consider the following example:
val input = sc.textFile("log.txt")
val splitedLines = input.map(line => line.split(" "))
.map(words => (words(0), 1))
.reduceByKey{(a,b) => a + b}
Executing toDebugString() on splitedLines RDD, will output the following,
(2) ShuffledRDD[6] at reduceByKey at <console>:25 []
+-(2) MapPartitionsRDD[5] at map at <console>:24 []
| MapPartitionsRDD[4] at map at <console>:23 []
| log.txt MapPartitionsRDD[1] at textFile at <console>:21 []
| log.txt HadoopRDD[0] at textFile at <console>:21 []
For more information about how Spark works internally, please read another post of mine
When a transformation (map, filter, etc.) is called, it is not executed by Spark immediately; instead, a lineage is created for each transformation.
A lineage keeps track of all the transformations that have to be applied to that RDD,
including the location from which it has to read the data.
For example, consider the following:
val myRdd = sc.textFile("spam.txt")
val filteredRdd = myRdd.filter(line => line.contains("wonder"))
filteredRdd.count()
sc.textFile() and myRdd.filter() do not get executed immediately;
they are executed only when an action is called on the RDD, here filteredRdd.count().
An action is used to either save the result to some location or to display it.
RDD lineage information can also be printed by using the command filteredRdd.toDebugString (filteredRdd is the RDD here).
Also, the DAG Visualization shows the complete graph in a very intuitive manner.

Resources