When does Spark do a "Scan ExistingRDD"? - apache-spark

I have a job that takes in a huge dataset and joins it with another dataset. The first time it ran, it took a really long time and Spark executed a FileScan parquet when reading the dataset, but in future jobs the query plan shows Scan ExistingRDD and the build takes minutes.
Why and how is Spark able to scan an existing RDD? Would it ever fall back to scanning the parquet files that back a dataset (and hence revert to worse performance)?

There are two common situations in Foundry in which you'll see this:
You're using a DataFrame you defined manually through createDataFrame
You're running an incremental transform with an input that doesn't have any changes, so you're using an empty synthetic DataFrame that Transforms has created for you (a special case of 1.)
If we follow the Spark code, we find the definition of that plan node, Scan ExistingRDD; it is produced by RDDScanExec, the physical operator that scans an RDD of InternalRows (a representation of literal values held by the driver and synthesized into a DataFrame).

Related

How does spark structured streaming job handle stream - static DataFrame join?

I have a Spark Structured Streaming job which reads a mapping table from Cassandra and Delta Lake and joins it with the streaming df. I would like to understand the exact mechanism here. Does Spark hit these data sources (Cassandra and Delta Lake) for every cycle of a micro-batch? If that is the case, why do I see in the Spark web UI that these tables are read only once?
Please help me understand this.
Thanks in advance
"Does spark hit these data sources(cassandra and deltalake) for every cycle of microbatch?"
According to the book "Learning Spark, 2nd edition" (O'Reilly), the static DataFrame is read in every micro-batch of a stream-static join.
To be more precise, I find the following section in the book quite helpful:
Stream-static joins are stateless operations, and therefore do not require any kind of watermarking
The static DataFrame is read repeatedly while joining with the streaming data of every micro-batch, so you can cache the static DataFrame to speed up reads.
If the underlying data in the data source on which the static DataFrame was defined changes, whether those changes are seen by the streaming query depends on the specific behavior of the data source. For example, if the static DataFrame was defined on files, then changes to those files (e.g. appends) will not be picked up until the streaming query is restarted.
When applying a stream-static join, it is assumed that the static part changes not at all or only slowly. If you plan to join two rapidly changing data sources, you need to switch to a stream-stream join.
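To make the per-micro-batch re-read concrete, here is a plain-Python sketch of the semantics (a toy model, not the Spark API; `read_static_table` stands in for whatever backs the static DataFrame):

```python
read_count = {"n": 0}

def read_static_table():
    """Stands in for reading the static side from its source (e.g. Cassandra)."""
    read_count["n"] += 1
    return {1: "alice", 2: "bob"}

def process_micro_batch(batch, static_loader):
    # The static side is (re-)read for every micro-batch unless it is cached.
    static = static_loader()
    return [(user_id, static.get(user_id)) for user_id in batch]

for batch in [[1, 2], [2], [1]]:
    process_micro_batch(batch, read_static_table)

print(read_count["n"])  # 3: one read of the static source per micro-batch
```

Caching the static DataFrame corresponds to memoizing `read_static_table`, which is why the book recommends it to speed up reads.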

what if I use an action like createdataframe() to break a very long lineage rather than checkpoint? [duplicate]

I have a recursive spark algorithm that applies a sliding window of 10 days to a Dataset.
The original dataset is loaded from a Hive table partitioned by date.
At each iteration, a complex set of operations is applied to the Dataset containing the ten-day window.
The last date is then inserted back into the original Hive table and the next date loaded from Hive and unioned to the remaining nine days.
I realise that I need to break the spark lineage to prevent the DAG from growing unmanageable.
I believe I have two options:
Checkpointing - involves a costly write to HDFS.
Convert to rdd and back again
spark.createDataset(myDS.rdd)
Are there any disadvantages to using the second option? I am assuming it is an in-memory operation and is therefore cheaper.
Checkpointing and converting back to RDD are indeed the best/only ways to truncate lineage.
Many (all?) of the Spark ML Dataset/DataFrame algorithms are actually implemented using RDDs; the exposed APIs are DS/DF, but they drop to RDDs internally because the optimizer is not parallelized and because of the lineage sizes that iterative/recursive implementations produce.
There is a cost to converting to and from RDD, but it is smaller than the cost of the file-system checkpointing option.

Issues with long lineages (DAG) in Spark

We usually use Spark as the processing engine for data stored on S3 or HDFS, on the Databricks and EMR platforms.
One of the issues I frequently face is that as the job size grows, performance degrades severely. For example, say I read data from five tables, apply different levels of transformation (filtering, exploding, joins, etc.), union subsets of the data from these transformations, then do further processing (e.g. remove some rows based on criteria that require window functions), run some other processing stages, and finally save the final output to a destination S3 path. Run end to end, this job takes a very long time. However, if I save (stage) intermediate DataFrames to S3 and use those saved DataFrames for the next steps, the job finishes faster. Does anyone have a similar experience? Is there a better way to handle this kind of long lineage other than checkpointing?
What is even stranger is that for longer lineages Spark throws unexpected errors, like "column not found", while the same code works if intermediate results are temporarily staged.
You're probably running into an issue where the optimizer takes a really long time to generate the plan. Writing the intermediate data by saving the DataFrame, or using a checkpoint, is the only way to fix it. The quickest/most efficient fix is to use localCheckpoint, which materializes the checkpoint on the executors' local storage instead of writing it to the distributed file system.
val checkpointedDf = df.localCheckpoint()

How to do Incremental MapReduce in Apache Spark

In CouchDB and system designs like Incoop, there's a concept called "Incremental MapReduce" where results from previous executions of a MapReduce algorithm are saved and used to skip over sections of input data that haven't been changed.
Say I have 1 million rows divided into 20 partitions. If I run a simple MapReduce over this data, I could cache/store the result of reducing each separate partition before they're combined and reduced again to produce the final result. If I only change data in the 19th partition, then I only need to run the map and reduce steps on the changed section of the data, and then combine the new result with the saved reduce results from the unchanged partitions to get an updated result. Using this sort of caching, I'd be able to skip almost 95% of the work when re-running a MapReduce job on this hypothetical dataset.
Is there any good way to apply this pattern to Spark? I know I could write my own tool for splitting up input data into partitions, checking if I've already processed those partitions before, loading them from a cache if I have, and then running the final reduce to join all the partitions together. However, I suspect that there's an easier way to approach this.
I've experimented with checkpointing in Spark Streaming, and that is able to store results between restarts, which is almost what I'm looking for, but I want to do this outside of a streaming job.
RDD caching/persisting/checkpointing almost looks like something I could build off of - it makes it easy to keep intermediate computations around and reference them later, but I think cached RDDs are always removed once the SparkContext is stopped, even if they're persisted to disk. So caching wouldn't work for storing results between restarts. Also, I'm not sure if/how checkpointed RDDs are supposed to be loaded when a new SparkContext is started... They seem to be stored under a UUID in the checkpoint directory that's specific to a single instance of the SparkContext.
Both use cases suggested by the article (incremental logs processing and incremental query processing) can be generally solved by Spark Streaming.
The idea is that you have incremental updates coming in via the DStreams abstraction. You can then process the new data and join it with previous results, either using time-window-based processing or arbitrary stateful operations in Structured Streaming. The results of the calculation can later be dumped to some external sink such as a database or file system, or exposed as an SQL table.
If you're not building an online data processing system, regular Spark can be used as well. It's just a matter of how incremental updates get into the process, and how intermediate state is saved. For example, incremental updates can appear under some path on a distributed file system, while intermediate state containing previous computation joined with new data computation can be dumped, again, to the same file system.
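The partition-level reuse described in the question can be sketched in plain Python (a toy model of the idea, not Spark code): fingerprint each partition's contents, and recompute the per-partition reduction only when the fingerprint is new.

```python
import hashlib

def partition_key(rows):
    """Fingerprint a partition's contents so unchanged partitions hit the cache."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode())
    return digest.hexdigest()

cache = {}                 # fingerprint -> reduced value; persist this between runs
stats = {"computed": 0}    # counts how many partitions were actually reduced

def reduce_partition(rows):
    stats["computed"] += 1
    return sum(rows)

def incremental_sum(partitions):
    per_partition = []
    for rows in partitions:
        key = partition_key(rows)
        if key not in cache:
            cache[key] = reduce_partition(rows)
        per_partition.append(cache[key])
    return sum(per_partition)  # final combine step over the partition results

data = [[1, 2], [3, 4], [5, 6]]
print(incremental_sum(data), stats["computed"])  # 21 3: every partition reduced
data[1] = [3, 40]
print(incremental_sum(data), stats["computed"])  # 57 4: only one partition redone
```

In a real job the cache would be keyed on durable partition identities (e.g. file paths plus modification times) and stored on S3/HDFS so it survives SparkContext restarts.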

Fast Parquet row count in Spark

The Parquet files contain a per-block row count field. Spark seems to read it at some point (SpecificParquetRecordReaderBase.java#L151).
I tried this in spark-shell:
sqlContext.read.load("x.parquet").count
And Spark ran two stages, showing various aggregation steps in the DAG. I figure this means it reads through the file normally instead of using the row counts. (I could be wrong.)
The question is: Is Spark already using the row count fields when I run count? Is there another API to use those fields? Is relying on those fields a bad idea for some reason?
That is correct: Spark already uses the row count field when you run count.
Diving into the details a bit, SpecificParquetRecordReaderBase.java references the "Improve Parquet scan performance when using flat schemas" commit from [SPARK-11787] Speed up parquet reader for flat schemas. Note that this commit was included in the Spark 1.6 branch.
If the query is a row count, it pretty much works the way you described it (i.e. reading the metadata). If the predicates are fully satisfied by the min/max values, that should work as well though that is not as fully verified. It's not a bad idea to use those Parquet fields but as implied in the previous statement, the key issue is to ensure that the predicate filtering matches the metadata so you are doing an accurate count.
To help understand why there are two stages, look at the DAG created when running the count() statement: the first stage (Stage 25) runs the file scan, while the second (Stage 26) runs the shuffle for the count.
Thanks to Nong Li (the author of the SpecificParquetRecordReaderBase.java commit) for validating!
Updated
To provide additional context on the bridge between Dataset.count and Parquet, the internal flow is:
Spark does not read any Parquet columns to calculate the count.
The Parquet schema passed to the VectorizedParquetRecordReader is actually an empty Parquet message.
The count is computed using the metadata stored in the Parquet file footers.
To work with the Parquet file format, Apache Spark internally wraps this logic in an iterator that returns an InternalRow; more information can be found in InternalRow.scala. Ultimately, the count() aggregate function interacts with the underlying Parquet data source using this iterator. This is true for both the vectorized and non-vectorized Parquet readers.
Therefore, to bridge the Dataset.count() with the Parquet reader, the path is:
The Dataset.count() call is planned into an aggregate operator with a single count() aggregate function.
Java code is generated at planning time for the aggregate operator as well as the count() aggregate function.
The generated Java code interacts with the underlying data source, ParquetFileFormat, via a RecordReaderIterator, which is used internally by the Spark data source API.
For more information, please refer to Parquet Count Metadata Explanation.
We can also use the following to print a nicely formatted row count:
java.text.NumberFormat.getIntegerInstance.format(sparkdf.count)
