Is Dataset#persist() a terminal operation? - apache-spark

Does Spark actually cache the Dataset when org.apache.spark.sql.Dataset#persist() is called? Or is it cached lazily, when some terminal operation (like count) is called on the Dataset?

Like all caching operations in Spark, Dataset.persist is lazy: it only marks the given object for caching, which happens if and when the object is evaluated.
The main difference compared to RDDs is that the evaluation is much harder to reason about. See the related discussion on the developers list: Will .count() always trigger an evaluation of each row?
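The laziness described above can be observed with an accumulator. This is only a minimal sketch: the object name PersistIsLazy, the accumulator name, and the local master setting are made up for illustration, and the exact number of rows evaluated by count() may depend on the Spark version and plan.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark.
object PersistIsLazy {
  /** Returns (rows evaluated right after persist(), rows evaluated after count()). */
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val evaluated = spark.sparkContext.longAccumulator("rows evaluated")
    // The map function bumps the accumulator every time a row is really computed.
    val ds = (1 to 100).toDS().map { v => evaluated.add(1L); v }

    ds.persist()                      // only marks the Dataset for caching
    val afterPersist = evaluated.sum  // no job has run yet

    ds.count()                        // an action: now rows are computed (and cached)
    val afterCount = evaluated.sum

    ds.unpersist()
    (afterPersist, afterCount)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for this experiment.
    val spark = SparkSession.builder().master("local[2]").appName("persist-is-lazy").getOrCreate()
    println(run(spark))
    spark.stop()
  }
}
```

The first number should be 0, because persist() alone does not start a job. The second number is only non-zero after the action has run.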

Related

In Apache Spark, can I incrementally cache an RDD partition?

I was under the impression that both RDD execution and caching are lazy: Namely, if an RDD is cached, and only part of it was used, then the caching mechanism will only cache that part, and the other part will be computed on-demand.
Unfortunately, the following experiment seems to indicate otherwise:
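import org.apache.spark.util.LongAccumulator

val acc = new LongAccumulator()
TestSC.register(acc)
// Count how many elements are actually computed.
val rdd = TestSC.parallelize(1 to 100, 16).map { v =>
  acc add 1
  v
}
rdd.persist()
// Keep only the first 2 elements of each of the 16 partitions.
val sliced = rdd.mapPartitions { itr =>
  itr.slice(0, 2)
}
sliced.count()
assert(acc.value == 32)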
Running it yields the following exception:
100 did not equal 32
ScalaTestFailureLocation:
Expected :32
Actual :100
It turns out the entire RDD was computed instead of only the first 2 items in each partition. This is very inefficient in some cases (e.g. when you need to determine quickly whether the RDD is empty). Ideally, the caching manager should allow the caching buffer to be written incrementally and accessed randomly. Does this feature exist? If not, what should I do to make it happen (preferably using the existing memory and disk caching mechanism)?
Thanks a lot for your opinion
UPDATE 1: It appears that Spark already has 2 classes:
ExternalAppendOnlyMap
ExternalAppendOnlyUnsafeRowArray
that support more granular caching of many values. Even better, they don't rely on StorageLevel and instead make their own decision about which storage device to use. I'm surprised, however, that they are not options for RDD/Dataset caching directly, and are only used for co-group/join/streamOps or accumulators.
An interesting question in hindsight; here is my take:
You cannot cache incrementally. So the answer to your question is no.
persist applies to all partitions of the RDD. It is meant for reuse across multiple actions, or for a single action that processes the same common RDD several times from that point onwards.
Spark does not look at how this could be optimized in the way you describe when you use persist. You issued that call, so Spark executes it.
But if you do not use persist, lazy evaluation and the fusing of code within a stage tie the slice cardinality and the accumulator together: only the sliced elements are computed. That is clear. Is it logical? Yes, as there is no further reference to the RDD elsewhere as part of another action. Others may see it as odd or erroneous. But in my opinion it does not imply incremental persistence or caching.
So it is an interesting observation that I would not have come up with, but I am not convinced it proves anything about partial caching.
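The contrast discussed in this answer can be reproduced by running the question's experiment twice, once with persist and once without. This is a sketch under assumptions: the object and method names are made up, and it relies on a local run without task retries (retries could add accumulator updates).

```scala
import org.apache.spark.SparkContext

// Hypothetical helper that mirrors the experiment from the question.
object SliceWithAndWithoutPersist {
  /** Returns how many elements of the parent RDD were actually computed. */
  def evaluatedRows(sc: SparkContext, usePersist: Boolean): Long = {
    val acc = sc.longAccumulator("rows evaluated")
    val rdd = sc.parallelize(1 to 100, 16).map { v => acc.add(1L); v }
    if (usePersist) rdd.persist()

    // Take at most the first 2 elements of each of the 16 partitions.
    rdd.mapPartitions(_.slice(0, 2)).count()

    if (usePersist) rdd.unpersist()
    acc.sum
  }
}
```

Without persist the partition iterator is pulled lazily, so only 2 elements per partition are computed (32 in total). With persist each touched partition is computed in full so that it can be stored, which gives the 100 reported in the question.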

What is the best way to decompose Spark transformations

I have a Spark application that applies many transformations to many files.
At first I wrote it as one chain of transformations (many DataFrames that carry out those transformations) and a single action (persisting the result, about 1M rows). However, this version doesn't work: it always throws GC or heap exceptions. I therefore decomposed it into intermediate actions, and I persist every intermediate result. At first I thought that having many read/write operations would cause performance issues, but it works. So my question is:
What is the best way to decompose Spark transformations (I think that read/write operations are not optimal)?
IO is slower than simple computation, but extremely complex computation may be slower than IO. Cache space is limited and should be used where it reduces compute time.
I would cache the results of extremely complex computations so that they are not reevaluated multiple times. If the data is used more than twice, the saved computation pays back the IO time.
If the computation is not that complex, then you needn't cache and can just recompute. But check how many times it is being reused: if reuse is high, caching yields better performance.
There are various storage options (memory, disk, or both) for caching intermediate data. You can use those instead of writing intermediate results to disk explicitly.
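As a rough sketch of that last point, an intermediate DataFrame can be persisted with an explicit storage level and reused by several actions. The object name, the toy data, and the "bucket" column are made up for illustration; the real pipeline would put its expensive transformations in their place.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Hypothetical example, not the asker's pipeline.
object CacheIntermediate {
  /** Returns (row count, number of distinct buckets), both computed from one cached result. */
  def run(spark: SparkSession): (Long, Long) = {
    // Stand-in for an expensive chain of transformations.
    val intermediate = spark.range(0, 1000)
      .withColumn("bucket", col("id") % 10)
      .persist(StorageLevel.MEMORY_AND_DISK) // spills to local disk if memory is short

    val rows = intermediate.count()                                // first action fills the cache
    val buckets = intermediate.select("bucket").distinct().count() // second action reuses it

    intermediate.unpersist() // free the cache once the result is no longer needed
    (rows, buckets)
  }
}
```

Calling unpersist as soon as an intermediate result is no longer needed keeps the limited cache space available for later stages.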

Can Spark automatically detect nondeterministic results and adjust failure recovery accordingly?

If nondeterministic code runs on Spark, this can cause a problem when recovery from failure of a node is necessary, because the new output may not be exactly the same as the old output. My interpretation is that the entire job might need to be rerun in this case, because otherwise the output data could be inconsistent with itself (as different data was produced at different times). At the very least any nodes that are downstream from the recovered node would probably need to be restarted from scratch, because they have processed data that may now change. That's my understanding of the situation anyway, please correct me if I am wrong.
My question is whether Spark can somehow automatically detect that code is nondeterministic (for example by comparing the old output to the new output) and adjust the failure recovery accordingly. If this were possible, it would relieve application developers of the requirement to write deterministic code, which can sometimes be challenging, and which is in any case easy to forget.
No. Spark will not be able to handle nondeterministic code in case of failures. The fundamental data structure of Spark, the RDD, is not only immutable but should also be a deterministic function of its input. This is necessary because otherwise the Spark framework cannot recompute part of an RDD (a partition) in case of failure. If the recomputed partition is not deterministic, Spark would have to re-run the transformation on the full RDDs in the lineage. I don't think that Spark is the right framework for nondeterministic code.
If Spark has to be used for such a use case, the application developer has to take care of keeping the output consistent by writing the code carefully. This can be done by using only RDDs (no DataFrame or Dataset) and persisting the output after every transformation that executes nondeterministic code. If performance is a concern, the intermediate RDDs can be persisted on Alluxio.
A long-term approach would be to open a feature request in the Apache Spark JIRA, but I am not too optimistic that the feature would be accepted. The idea would be a small syntax hint that tells the framework whether code is deterministic, so that it can switch between recovering an RDD partially or fully.
Nondeterministic results are not detected and accounted for in failure recovery (at least in Spark 2.4.1, which I'm using).
I have encountered issues with this a few times in Spark. For example, let's say I use a window function:
first_value(field_1) over (partition by field_2 order by field_3)
If field_3 is not unique, the result is non-deterministic and can differ each time that function is run. If a spark executor dies and restarts while calculating this window function, you can actually end up with two different first_value results output for the same field_2 partition.
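One common way to avoid this particular problem is to add a tie-breaker to the ordering so that it becomes total. The sketch below assumes that the combination of field_3 and field_1 is unique within each field_2 partition; the object name and the toy rows are made up, and this does nothing for other sources of nondeterminism.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first}

// Hypothetical example with made-up data.
object DeterministicFirstValue {
  /** Returns field_2 -> first field_1 under a total ordering. */
  def run(spark: SparkSession): Map[String, String] = {
    import spark.implicits._
    // Partition "a" has two rows with the same field_3, so ordering by field_3 alone is ambiguous.
    val df = Seq(("x", "a", 1), ("y", "a", 1), ("z", "b", 2))
      .toDF("field_1", "field_2", "field_3")

    // Assumption: (field_3, field_1) is unique per partition, so field_1 breaks the tie.
    val w = Window.partitionBy("field_2").orderBy(col("field_3"), col("field_1"))

    df.withColumn("first_field_1", first("field_1").over(w))
      .select("field_2", "first_field_1")
      .distinct()
      .as[(String, String)]
      .collect()
      .toMap
  }
}
```

With the tie-breaker in place, a recomputed partition yields the same first value as the original run, so a restarted executor can no longer produce a second, different result for the same field_2.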

Best procedure to modify Immutable Spark RDDs

In the past, I worked with low level parallelization (openmpi, openmp,...)
I am currently working on a Spark project and I don't know the best procedure for working with RDDs, because they are immutable.
I will explain my problem with a simple example, imagine that in my RDD I have an object and I need to update one attribute.
The most practical and memory efficient way to solve this is implementing a method called setAttribute(new_value).
Spark RDDs are immutable, so I need to create a function (for example myModifiedCopy(new_value)) that returns a copy of this object with new_value in its attribute, and update the RDD like this:
myRDD = myRDD.map(x -> x.myModifiedCopy(new_value)).cache()
My objects are very complex and use a lot of RAM (they are really huge). This procedure is slow: you have to create a complete copy of every element of the RDD just to modify a small value.
Is there a better procedure to deal with this kind of problems?
Do you recommend a different technology?
I would kill for a mutable RDD.
Thank you very much in advance.
I believe you have some misconceptions about Apache Spark. When you apply a transformation, you aren't actually creating a whole copy of that RDD in memory. You are just "designing" the series of small conversions that will be executed on each record when you run an action.
For instance, map, filter and flatMap are all transformations and thus lazy, so when you call them you only define the plan and don't execute it. On the other hand, collect and count behave differently: they are actions, and they trigger all previous transformations (doing everything that was defined in the intermediate stages) until they get the result.
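A small sketch of this behaviour follows. BigRecord is a hypothetical stand-in for the asker's large object, and the accumulator is only there to show when the map function really runs. Note that a Scala case class copy is shallow: within one JVM the large payload is shared by reference rather than duplicated.

```scala
import org.apache.spark.SparkContext

// Hypothetical stand-in for a large object with one small attribute to update.
final case class BigRecord(id: Int, attribute: Double, payload: Vector[Double])

object LazyModifiedCopy {
  /** Returns (map calls seen before any action, attribute values after the action). */
  def run(sc: SparkContext, newValue: Double): (Long, Seq[Double]) = {
    val calls = sc.longAccumulator("map calls")
    val records = sc.parallelize(Seq(
      BigRecord(1, 0.0, Vector.fill(1000)(1.0)),
      BigRecord(2, 0.0, Vector.fill(1000)(2.0))
    ))

    // Transformation only: nothing is copied or computed here.
    val updated = records.map { r =>
      calls.add(1L)
      r.copy(attribute = newValue) // shallow copy, payload reference is reused
    }
    val callsBeforeAction = calls.sum

    // Action: the map function now runs once per record.
    val attributes = updated.map(_.attribute).collect().toSeq
    (callsBeforeAction, attributes)
  }
}
```

Before the collect call the counter is still 0, which shows that defining the map did not touch any record.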

reducer concept in Spark

I'm coming from a Hadoop background and have limited knowledge about Spark. Based on what I have learned so far, Spark doesn't have mapper/reducer nodes; instead it has driver/worker nodes. The workers are similar to the mappers, and the driver is (somehow) similar to the reducer. As there is only one driver program, there will be only one reducer. If so, how can simple programs like word count be run on very big data sets in Spark? The driver could simply run out of memory.
The driver is more of a controller of the work, only pulling data back if the operator calls for it. If the operator you're working on returns an RDD/DataFrame/Unit, then the data remains distributed. If it returns a native type then it will indeed pull all of the data back.
Otherwise, the concepts of map and reduce are a bit obsolete here (from a type-of-work perspective). The only thing that really matters is whether the operation requires a data shuffle or not. You can see the points of shuffle by the stage splits, either in the UI or via toDebugString (where each indentation level is a shuffle).
All that being said, for a vague understanding, you can equate anything that requires a shuffle to a reducer. Otherwise it's a mapper.
Last, to equate to your word count example:
sc.textFile(path)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
In the above, the loading (textFile), splitting (flatMap), and mapping can all be done in one stage, because each of them can work on a record independently of the rest of the data. No shuffle is needed until reduceByKey is called, as it needs to combine all of the data to perform the operation. HOWEVER, this operation has to be associative for a reason: each node performs the operation defined in reduceByKey locally first, and only the partial results are merged afterwards. This reduces both memory and network overhead.
NOTE that reduceByKey returns an RDD and is thus a transformation, so the data is shuffled via a HashPartitioner. All of the data does NOT pull back to the driver, it merely moves to nodes that have the same keys so that it can have its final value merged.
Now, if you use an action such as reduce or worse yet, collect, then you will NOT get an RDD back which means the data pulls back to the driver and you will need room for it.
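The points above can be checked on a tiny in-memory input. This is a sketch with made-up names (WordCountStages, wordCounts); it uses parallelize on a few lines instead of textFile so that it does not depend on any file path.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper wrapping the word count from the answer.
object WordCountStages {
  /** Builds the word count; the result is still a distributed RDD, not driver data. */
  def wordCounts(sc: SparkContext, lines: Seq[String]): RDD[(String, Int)] =
    sc.parallelize(lines)
      .flatMap(_.split(" "))  // narrow: no shuffle
      .map((_, 1))            // narrow: no shuffle
      .reduceByKey(_ + _)     // combines locally, then shuffles by key

  def main(args: Array[String]): Unit = {
    // Assumption: a local context is enough for this demonstration.
    val sc = new SparkContext("local[2]", "word-count-stages")
    val counts = wordCounts(sc, Seq("a b a", "b c"))
    println(counts.toDebugString)  // the lineage shows where the shuffle boundary is
    println(counts.collectAsMap()) // action: only the small final result reaches the driver
    sc.stop()
  }
}
```

Only the collectAsMap call moves data to the driver, and by then the data has already been reduced to one pair per distinct word.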
Here is my fuller explanation of reduceByKey if you want more. Or how this breaks down in something like combineByKey
