Why is this Spark SQL UDF slower than an RDD?

I have some expensive analysis I need to perform on a DataFrame of pairs of objects. The setup looks something like this.
# This does the expensive work and holds some reference data
# Expensive to initialize so done only once
analyze = Analyze()
def analyze_row(row):
    # Turn the row into objects and pass them to the function above
    foo = Foo.from_dict(row.foo.asDict(recursive=True))
    bar = Bar.from_dict(row.bar.asDict(recursive=True))
    return analyze(foo, bar)
When I apply analyze_row as a UDF like so
analyze_row_udf = udf(analyze_row, result_schema)
# struct() comes from pyspark.sql.functions; the UDF has to be applied to columns
results_df = input_df.withColumn("result", analyze_row_udf(struct("foo", "bar"))).select("result.*")
it is empirically slower than applying it to an RDD like so
results = input_df.rdd.map(analyze_row)
results_df = spark.createDataFrame(results, schema=result_schema)
All other things being equal, the UDF version didn't seem to make progress in an hour, while the RDD version finished completely in 30 minutes. The cluster CPU was maxed out in both cases, and the same behavior was reproduced on multiple tries.
I thought DataFrames were meant to supersede RDDs, partly because of better performance. So why does the RDD seem to be much faster in this case?

DataFrames can supersede RDDs where:
There are execution plan optimizations (here none can be applied).
There are low-level optimizations in use: off-heap memory, code generation (once again, none of these apply when you execute black-box code outside the JVM).
Optimized columnar storage is used (ditto).
Additionally, passing data between contexts is expensive, and merging partial results requires additional operations. It also more than doubles the memory requirements.
It is hard to say why RDDs are strictly faster in your case (there have been significant improvements over time, and you didn't provide a version), but I'd guess you hit some border case.
Overall, for arbitrary Python code, DataFrames are not a better option at all. This might change in the future, once vectorized operations backed by Arrow are available.
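As a rough illustration of that last point, here is the shape of an Arrow-backed vectorized UDF (a minimal sketch assuming Spark 3.x type-hinted pandas_udf; the columns x and y and the arithmetic are placeholders, not the asker's analysis, and struct-typed inputs like foo/bar would need extra handling):
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Hypothetical numeric columns; the function receives whole Arrow batches as
# pandas Series instead of one pickled Python object per row.
@pandas_udf("double")
def analyze_batch(x: pd.Series, y: pd.Series) -> pd.Series:
    return x * y  # placeholder for the real per-batch analysis

results_df = input_df.withColumn("result", analyze_batch("x", "y"))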

Related

In Apache Spark, can I incrementally cache an RDD partition?

I was under the impression that both RDD execution and caching are lazy: namely, if an RDD is cached and only part of it is used, then the caching mechanism will cache only that part, and the rest will be computed on demand.
Unfortunately, the following experiment seems to indicate otherwise:
val acc = new LongAccumulator()
TestSC.register(acc)
val rdd = TestSC.parallelize(1 to 100, 16).map { v =>
  acc add 1
  v
}
rdd.persist()

val sliced = rdd
  .mapPartitions { itr =>
    itr.slice(0, 2)
  }
sliced.count()
assert(acc.value == 32)
Running it yields the following exception:
100 did not equal 32
ScalaTestFailureLocation:
Expected :32
Actual :100
Turns out the entire RDD was computed instead of only the first 2 items in each partition. This is very inefficient in some cases (e.g. when you need to determine quickly whether the RDD is empty). Ideally, the caching manager should allow the caching buffer to be written incrementally and accessed randomly. Does this feature exist? If not, what should I do to make it happen? (Preferably using the existing memory & disk caching mechanism.)
Thanks a lot for your opinion
UPDATE 1: It appears that Spark already has two classes:
ExternalAppendOnlyMap
ExternalAppendOnlyUnsafeRowArray
that support more granular caching of many values. Even better, they don't rely on StorageLevel, but instead make their own decisions about which storage device to use. However, I'm surprised that they are not options for RDD/Dataset caching directly, rather than only for co-group/join/streamOps or accumulators.
An interesting observation in hindsight; here is my take:
You cannot cache incrementally. So the answer to your question is no.
persist applies to all partitions of the RDD; it is intended for reuse across multiple Actions, or within a single Action that processes the same common RDD more than once from that phase onwards.
The RDD optimizer does not look at how that could be optimized in the way you describe when you use persist. You issued that call, so it executes it.
But if you do not use persist, lazy evaluation and the fusing of code within a Stage appear to tie the slice cardinality and the accumulator together. Is that logical? Yes, since there is no further reference elsewhere as part of another Action. Others may see it as odd or erroneous, but in my opinion it does not imply incremental persistence/caching.
So, imho, an interesting observation I would not have come up with, and I'm not convinced it proves anything about partial caching.
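For illustration only, a rough PySpark analogue of the experiment (Python instead of Scala; the expected counts are my reading of the iterator fusing described above, not something verified in the question): without persist() the map stays fused with the slice, so only about two elements per partition are ever computed; adding persist() back forces each partition to be fully materialized.
from itertools import islice
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

acc = sc.accumulator(0)

def tag(v):
    acc.add(1)   # counts how many elements actually get computed
    return v

rdd = sc.parallelize(range(1, 101), 16).map(tag)
# rdd.persist()  # uncommenting this forces every partition to be fully materialized

sliced = rdd.mapPartitions(lambda it: islice(it, 2))
sliced.count()
print(acc.value)  # roughly 32 (2 per partition) without persist, 100 with it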

Spark with divide and conquer

I'm learning Spark and trying to process a huge dataset. I don't understand why I don't see a decrease in stage completion times with the following strategy (pseudocode):
data = sc.textFile(dataset).cache()

while True:
    data.count()
    y = data.map(...).reduce(...)
    data = data.filter(lambda x: x < y).persist()
So the idea is to pick y so that it roughly halves the data most of the time. But for some reason it looks like all the data is processed again on each count().
Is this some kind of anti-pattern? How am I supposed to do this with Spark?
Yes, that is an anti-pattern.
map, like most (but not all) of the distributed primitives in Spark, is pretty much by definition a divide-and-conquer approach. You take the data, compute splits, and transparently distribute the computation of the individual splits over the cluster.
Trying to divide this process further using the high-level API makes no sense at all. At best it will provide no benefit; at worst it will incur the cost of multiple data scans, caching, and spills.
Spark is lazily evaluated, so in the while loop above each call to data.filter does not actually return the data; it returns a description of Spark work to be executed later. All these calls get aggregated and then executed together when you later trigger an action.
In particular, results remain unevaluated and merely represented as a plan until a Spark Action gets called. Past a certain point the application can't handle that many parallel tasks.
In a way we're running into a conflict between two different representations: conventional structured coding, with its implicit (or at least implied) execution patterns, and independent, distributed, lazily evaluated Spark representations.
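A minimal sketch of that laziness (the data and lambdas are illustrative, assuming an existing SparkContext sc, not taken from the question):
nums = sc.parallelize(range(100))        # illustrative data

# Nothing runs here: filter and map only describe work to be done later.
small = nums.filter(lambda x: x < 10)
doubled = small.map(lambda x: x * 2)

# Each action walks the whole lineage again unless an intermediate
# result has been persisted and actually materialized beforehand.
doubled.count()
doubled.count()  # a second full evaluation, since nothing was cached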

PySpark: Best practice to add more columns to a DataFrame

Spark DataFrames have a withColumn method that adds one new column at a time. To add multiple columns, a chain of withColumn calls is required. Is this the best practice?
I feel that using mapPartitions has more advantages. Let's say I have a chain of three withColumn calls and then one filter to remove rows based on certain conditions. These are four different operations (I am not sure if any of them are wide transformations, though). But I can do it all in one go with mapPartitions. It also helps if I have a database connection that I would prefer to open once per RDD partition.
My question has two parts.
The first part: this is my implementation using mapPartitions. Are there any unforeseen issues with this approach? And is there a more elegant way to do this?
df2 = df.rdd.mapPartitions(add_new_cols).toDF()

def add_new_cols(rows):
    db = open_db_connection()
    new_rows = []
    new_row_1 = Row("existing_col_1", "existing_col_2", "new_col_1", "new_col_2")
    i = 0
    for each_row in rows:
        i += 1
        # conditionally omit rows
        if i % 3 == 0:
            continue
        db_result = db.get_some_result(each_row.existing_col_2)
        new_col_1 = ''.join([db_result, "_NEW"])
        new_col_2 = db_result
        new_f_row = new_row_1(each_row.existing_col_1, each_row.existing_col_2, new_col_1, new_col_2)
        new_rows.append(new_f_row)
    db.close()
    return iter(new_rows)
The second part: what are the trade-offs of using mapPartitions over a chain of withColumn and filter?
I read somewhere that using the available methods on Spark DataFrames is always better than rolling out your own implementation. Please let me know if my argument is wrong. Thank you! All thoughts are welcome.
Are there any unforeseen issues with this approach?
Multiple. The most severe implications are:
A memory footprint a few times higher compared to plain DataFrame code, and significant garbage-collection overhead.
The high cost of serialization and deserialization required to move data between execution contexts.
Introducing a breaking point in the query planner.
As written, the cost of schema inference on the toDF call (which can be avoided if a proper schema is provided) and possible re-execution of all preceding steps.
And so on...
Some of these can be avoided with udf and select / withColumn; others cannot.
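For comparison, here is a rough sketch of that udf-plus-withColumn route. get_some_result stands in for the asker's database lookup, and both the per-partition connection handling and the positional "skip every third row" filter are deliberately left out, so this is a shape illustration rather than a drop-in replacement:
from pyspark.sql.functions import udf, col, concat, lit
from pyspark.sql.types import StringType

@udf(StringType())
def lookup(existing_col_2):
    # Placeholder: a real version still needs its own connection strategy,
    # e.g. a lazily initialized module-level connection per executor.
    return get_some_result(existing_col_2)

df2 = (df
       .withColumn("new_col_2", lookup("existing_col_2"))
       .withColumn("new_col_1", concat(col("new_col_2"), lit("_NEW"))))
If you do keep the mapPartitions version, passing an explicit schema to toDF, i.e. df.rdd.mapPartitions(add_new_cols).toDF(schema), at least avoids the schema-inference cost mentioned above.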
Let's say I have a chain of three withColumn calls and then one filter to remove rows based on certain conditions. These are four different operations (I am not sure if any of them are wide transformations, though). But I can do it all in one go with mapPartitions.
Your mapPartitions doesn't remove any operations, and doesn't provide any optimizations that the Spark planner couldn't apply itself. Its only advantage is that it provides a nice scope for expensive connection objects.
I read somewhere that using the available methods on Spark DataFrames is always better than rolling out your own implementation
When you start using executor-side Python logic, you already diverge from Spark SQL. It doesn't matter whether you use udf, RDD, or the newly added vectorized udf. At the end of the day, you should make the decision based on the overall structure of your code: if it is predominantly Python logic executed directly on the data, it might be better to stick with RDDs or skip Spark completely.
If it is just a fraction of the logic and doesn't cause a severe performance issue, don't sweat it.
Using df.withColumn() is the best way to add columns; they're all added lazily.
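As a small illustration of that (the column expressions are hypothetical), chained withColumn calls and a single select are both lazy, and Catalyst typically collapses them into the same projection:
from pyspark.sql.functions import col

# Chained withColumn calls...
df2 = (df
       .withColumn("a_plus_one", col("a") + 1)
       .withColumn("a_doubled", col("a") * 2))

# ...or the equivalent single select; nothing executes until an action runs.
df2 = df.select("*",
                (col("a") + 1).alias("a_plus_one"),
                (col("a") * 2).alias("a_doubled"))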

How to deal with strongly varying data sizes in spark

I'm wondering about best practices for designing Spark jobs where the volume of data is not known in advance (or varies strongly). In my case, the application should handle both initial loads and, later on, incremental data.
I wonder how I should set the number of partitions in my data (e.g. using repartition or setting parameters like spark.sql.shuffle.partitions) in order to avoid OOM exceptions in the executors (given a fixed amount of memory allocated per executor). I could:
Define a very high number of partitions to make sure that even on very heavy workloads the job does not fail.
Set the number of partitions at runtime depending on the size of the source data.
Introduce an iteration over independent chunks of data (i.e. looping).
With all of these options, I see issues:
1: I imagine this to be inefficient for small data sizes, as tasks get very small.
2: Needs additional queries (e.g. count), and for setting spark.sql.shuffle.partitions the SparkContext needs to be restarted, which I would like to avoid.
3: Seems to contradict the spirit of Spark.
So I wonder what the most efficient strategy is for strongly varying data volumes.
EDIT:
I was wrong about setting spark.sql.shuffle.partitions; it can be set at runtime without restarting the Spark context.
Do not set a high number of partitions without knowing this is needed. You will absolutely kill the performance of your job.
Yes
As you said, don't loop!
As you mention, you introduce an extra step, which is to count your data, and at first glance that seems wrong. However, you shouldn't think of this as mis-spent computation. Usually, the time it takes to count your data is significantly less than the time it would take to do further processing if you partition the data badly. Think of the count operation as an investment; it's certainly worth it.
You do not need to set partitions through the config and restart Spark. Instead, do the following (a rough code sketch follows below):
Note current number of partitions for RDD / Dataframe / Dataset
Count number of entries / rows in your data
Based on an estimate of average row size, compute the target number of partitions
If #targetPartitions << #actualPartitions Then coalesce
Else If #targetPartitions >> #actualPartitions Then repartition
Else #targetPartitions ~= #actualPartitions Then do nothing
The coalesce operation will reduce the number of partitions without a shuffle, and so is much more efficient when it is applicable.
Ideally you can estimate the number of rows you will generate, rather than counting them. You will also need to think carefully about when it is appropriate to perform this operation: with a long RDD lineage you can kill performance, because coalescing may inadvertently reduce the number of cores available to execute complex upstream code, due to lazy execution. Look into checkpointing to mitigate this problem.
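A rough PySpark sketch of that recipe, assuming df is the DataFrame in question; the average row size and the 128 MB target per partition are assumptions you would tune for your own data:
target_partition_bytes = 128 * 1024 * 1024   # assumed target size per partition
approx_row_bytes = 512                       # assumed average row size

actual_partitions = df.rdd.getNumPartitions()
row_count = df.count()                       # the "investment" count
target_partitions = max(1, (row_count * approx_row_bytes) // target_partition_bytes)

if target_partitions < actual_partitions // 2:
    df = df.coalesce(int(target_partitions))      # narrow: no full shuffle
elif target_partitions > actual_partitions * 2:
    df = df.repartition(int(target_partitions))   # full shuffle
# otherwise the current partitioning is close enough; do nothing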

In spark, how to estimate the number of elements in a dataframe quickly

In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, faster than Dataset.count().
Maybe we could calculate this information from the number of partitions of the Dataset, could we?
You could try using countApprox on the RDD API. Although this also launches a Spark job, it should be faster, since it just gives you an estimate of the true count for a given amount of time you are willing to spend (in milliseconds) and a confidence level (i.e. the probability that the true value lies within the returned range):
Example usage:
val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
val (lowCnt, highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)
You have to play a bit with the timeout and confidence parameters. The higher the timeout, the more accurate the estimated count.
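For what it's worth, PySpark exposes the same idea, although RDD.countApprox there returns a single approximate integer rather than an interval object:
# Spend at most ~1 second (timeout is in milliseconds); the result may be
# based only on the tasks that finished in time.
approx_count = df.rdd.countApprox(timeout=1000, confidence=0.90)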
If you have a truly enormous number of records, you can get an approximate count using something like HyperLogLog, and that might be faster than count(). However, you won't be able to get any result without kicking off a job.
When using Spark there are two kinds of RDD operations: transformations and actions. Roughly speaking, transformations modify an RDD and return a new RDD. Actions calculate or generate some result. Transformations are lazily evaluated, so they don't kick off a job until an action is called at the end of a sequence of transformations.
Because Spark is a distributed batch-processing framework, there is a lot of overhead for running jobs. If you need something that feels more like "real time", whatever that means, either use plain Scala (or Python) if your data is small enough, or move to a streaming approach and do something like updating a counter as new records flow through.

Resources