Lazy loading of partitioned parquet in Apache Spark - apache-spark

As I understand it, Apache Spark uses lazy evaluation. So, for example, code like the following, which consists only of transformations, will do no actual processing:
val transformed_df = df.filter("some_field = 10").select("some_other_field", "yet_another_field")
Only when we do an "action" will any processing actually occur:
transformed_df.show()
I had been under the impression that load operations are also lazy in Spark. (See How spark loads the data into memory.)
However, my experience with Spark has not borne this out. When I do something like the following,
val df = spark.read.parquet("/path/to/parquet/")
execution seems to depend greatly on the size of the data in the path. In other words, it's not strictly lazy. This is inconvenient if the data is partitioned and I only need to look at a fraction of the partitions.
For example:
df.filter("partitioned_field = 10").show()
If the data is partitioned in storage on "partitioned_field", I would have expected Spark to wait until show() is called, and then read only data under "/path/to/parquet/partitioned_field=10/". But again, this doesn't seem to be the case. Spark appears to perform at least some operations on all of the data as soon as read or load is called.
I could get around this by only loading /path/to/parquet/partitioned_field=10/ in the first place, but this is much less elegant than just calling "read" and filtering on the partitioned field, and it's harder to generalize.
Is there a more elegant preferred way to lazily load partitions of parquet data?
(To clarify, I am using Spark 2.4.3)

I think I've stumbled on an answer to my question while learning about a key distinction that is often overlooked when talking about lazy evaluation in Spark.
Data is lazily evaluated, but schemas are not. So if we are reading Parquet, which is a structured format, Spark does have to at least determine the schema of any files it's reading as soon as read() or load() is called. So calling read() on a large number of files will take longer than on a small number of files.
Given that partition columns are part of the schema, it's less surprising to me now that Spark has to look at all of the files in the path to determine the schema before filtering on a partition field.
It would be convenient for my purposes if Spark waited until schema evaluation was strictly necessary and could filter on partition fields prior to determining the rest of the schema, but it sounds like this is not the case. I believe Dataset objects must always have a schema, so I'm not sure there's a way around this problem without significant changes to Spark.
In conclusion, it seems like my only option currently is to pass in a list of paths for the partitions that I need rather than the base path if I want to avoid evaluating the schema over the entire data repository.
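Concretely, a minimal sketch of that workaround (the paths and the partition column are the ones from the example above): the basePath option keeps partitioned_field available as a column even though only selected partition directories are passed to read:

// Read only the partitions that are needed; Spark lists files and infers
// the schema for just these directories, not the whole repository.
val df10 = spark.read
  .option("basePath", "/path/to/parquet/")            // keep partitioned_field as a column
  .parquet("/path/to/parquet/partitioned_field=10/")  // one or more partition directories

// For comparison: reading the base path and then filtering still prunes
// partitions at scan time (see PartitionFilters in the plan), but the
// schema / partition discovery up front walks the full directory tree.
val dfAll = spark.read.parquet("/path/to/parquet/")
dfAll.filter("partitioned_field = 10").explain()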

Related

How to spark partitionBy/bucketBy correctly?

Q1. Will an ad hoc (dynamic) repartition of the data on the line before a join help to avoid shuffling, or will the shuffling happen at the repartition anyway, with no way to escape it?
Q2. Should I repartition / partitionBy / bucketBy? What is the right approach if I will join on the columns day and user_id in the future? (I am saving the results as Hive tables with .write.saveAsTable.) I guess I should partition by day and bucket by user_id, but that seems to create thousands of files (see Why is Spark saveAsTable with bucketBy creating thousands of files?).
Some 'guidance' off the top of my head, noting that title and body of text differ to a degree:
Question 1:
A JOIN will do any (hash) partitioning / repartitioning required automatically, if needed and if a broadcast JOIN is not being used. You may set the number of shuffle partitions or use the default of 200. Bear in mind that there may be more parties (DFs) to consider than just the two being joined.
repartition is a transformation, so an up-front repartition may not be executed at all due to Catalyst optimization; check the physical plan generated by .explain. That's the deal with lazy evaluation: whether something is necessary is only determined upon Action invocation.
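As a rough illustration of checking this (the two DataFrames below are made up for the example), the physical plan shows whether an up-front repartition actually changes anything before the join:

import org.apache.spark.sql.functions.col
import spark.implicits._

// Hypothetical inputs: two DataFrames to be joined on user_id.
val ordersDf = Seq((1L, 10.0), (2L, 20.0)).toDF("user_id", "amount")
val usersDf  = Seq((1L, "alice"), (2L, "bob")).toDF("user_id", "name")

// The up-front repartition is only a transformation; whether it survives
// shows up in the physical plan, not at the point where it is written.
val joined = ordersDf
  .repartition(200, col("user_id"))
  .join(usersDf, Seq("user_id"))

// Look for Exchange hashpartitioning(user_id, 200) nodes, or for a
// BroadcastHashJoin if one side is small enough to be broadcast.
joined.explain()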
Question 2:
If you have a use case where you JOIN certain input / output regularly, then using Spark's bucketBy is a good approach: it obviates shuffling. The Databricks docs show this clearly.
A Spark table written with bucketBy is NOT compatible with Hive, so such tables remain Spark-only, unless this has changed recently.
Using Hive partitioning as you state depends on push-down logic, partition pruning, etc. It should work as well, but you may end up with a different number of partitions inside Spark after the read. It's a bit more complicated than saying "I have N partitions, so I will get N partitions on the initial read".
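A hedged sketch of the day / user_id case (the table name and the bucket count of 64 are arbitrary examples), combining Hive-style partitioning on day with bucketing on user_id:

import spark.implicits._

// Hypothetical result to persist: (day, user_id, value).
val resultDf = Seq(("2019-01-01", 1L, 10.0), ("2019-01-01", 2L, 5.0))
  .toDF("day", "user_id", "value")

resultDf.write
  .partitionBy("day")          // Hive-style directory partitioning on day
  .bucketBy(64, "user_id")     // buckets within each day partition
  .sortBy("user_id")
  .format("parquet")
  .saveAsTable("events_by_day_bucketed")

Because each writing task can emit one file per bucket per partition, the file explosion the linked question describes is common; repartitioning by the bucket column before the write is one way to keep the file count down.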

Creating Spark RDD or Dataframe from an External Source

I am building a substantial application in Java that uses Spark and JSON. I anticipate that the application will process large tables, and I want to use Spark SQL to execute queries against those tables. I am trying to use a streaming architecture so that data flows directly from an external source into Spark RDDs and dataframes. I'm having two difficulties in building my application.
First, I want to use either JavaSparkContext or SparkSession to parallelize the data. Both have a method that accepts a Java List as input. But, for streaming, I don't want to create a list in memory. I'd rather supply either a Java Stream or an Iterator. I figured out how to wrap those two objects so that they look like a List, but the wrapper cannot compute the size of the list until after the data has been read. Sometimes this works, but sometimes Spark calls the size method before the entire input has been read, which causes an unsupported operation exception.
Is there a way to create an RDD or a dataframe directly from a Java Stream or Iterator?
For my second issue, Spark can create a dataframe directly from JSON, which would be my preferred method. But the DataFrameReader class has methods for this operation that require a string specifying a path. The nature of the path is not documented, but I assume that it represents a path in the file system or possibly a URL or URI (the documentation doesn't say how Spark resolves the path). For testing, I'd prefer to supply the JSON as a string, and in production, I'd like the user to specify where the data resides. As a result of this limitation, I'm having to roll my own JSON deserialization, and it's not working because of issues related to parallelization of Spark tasks.
Can Spark read JSON from an InputStream or some similar object?
These two issues seem to really limit the adaptability of Spark. I sometimes feel that I'm trying to fill an oil tanker with a garden hose.
Any advice would be welcome.
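On the second issue, one sketch that may help (shown in Scala; an equivalent Dataset<String> overload exists on the Java API, assuming Spark 2.2+): DataFrameReader.json can take a Dataset of JSON strings instead of a path, so JSON read from any source can be handed to Spark once it is in memory as strings:

import spark.implicits._   // for .toDS on Scala collections

// Hypothetical JSON payloads obtained from anywhere (a test fixture,
// an InputStream drained to strings, etc.).
val jsonLines = Seq(
  """{"id": 1, "name": "alpha"}""",
  """{"id": 2, "name": "beta"}"""
)

// No file path involved: the reader infers the schema from the strings.
val df = spark.read.json(jsonLines.toDS)
df.show()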
Thanks for the suggestion. After a lot of work, I was able to adapt the example at github.com/spirom/spark-data-sources. It is not straightforward, and because the DataSourceV2 API is still evolving, my solution may break in a future iteration. The details are too intricate to post here, so if you are interested, please contact me directly.

Consistent view of data in SPARK Data Frame

Considering lazy evaluation, actions, etc., my understanding from others is that:
if I make repeated accesses to a dataframe
that was built from, say, a Hive table,
and that Hive table is subject to mutation,
then the changed data will show up in every dataframe operation issued subsequently.
How can I then get a consistent dataframe, à la Oracle's read-consistency model, other than by copying to a separate non-mutable Hive table?
I am assuming that a TempView would solve the problem, or is that not so? Actually, I think not, due to performance issues.
Ideally I would like the dataframe with all records persisted, but maybe that is not how it works with the lazy protocol.
How can I then get a consistent dataframe, à la Oracle's read-consistency model, other than by copying to a separate non-mutable Hive table?
There is simply no such option.
Naively one could suggest cache and forced evaluation:
val df: DataFrame = ???   // the dataframe built on the mutable Hive table
df.cache()                // default StorageLevel for Datasets is MEMORY_AND_DISK
df.foreach(_ => ())       // force evaluation so the cache is actually populated
but it just doesn't provide required guarantees, especially in case of node failures. You could increase reliability by setting StorageLevel to MEMORY_AND_DISK_2, but it still can result in silent correctness errors.
So, to be blunt: Spark is not a database; don't try to treat it like one. If you already use Hive, and mutable state, then skip Spark and use Hive's ACID and transaction options.
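If the copy-to-an-immutable-table route the question hoped to avoid ends up being the only acceptable batch option, here is a minimal sketch (the table names are hypothetical):

// Materialize a point-in-time snapshot of the mutable source table.
// Every dataframe built on the snapshot afterwards sees the same data.
val snapshotName = s"events_snapshot_${System.currentTimeMillis}"
spark.table("mutable_events")
  .write
  .mode("overwrite")
  .saveAsTable(snapshotName)

val stableDf = spark.table(snapshotName)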

How to do Incremental MapReduce in Apache Spark

In CouchDB and system designs like Incoop, there's a concept called "Incremental MapReduce" where results from previous executions of a MapReduce algorithm are saved and used to skip over sections of input data that haven't been changed.
Say I have 1 million rows divided into 20 partitions. If I run a simple MapReduce over this data, I could cache/store the result of reducing each separate partition before they're combined and reduced again to produce the final result. If I only change data in the 19th partition, then I only need to run the map and reduce steps on the changed section of the data, and then combine the new result with the saved reduce results from the unchanged partitions to get an updated result. Using this sort of caching, I'd be able to skip almost 95% of the work when re-running a MapReduce job on this hypothetical dataset.
Is there any good way to apply this pattern to Spark? I know I could write my own tool for splitting up input data into partitions, checking if I've already processed those partitions before, loading them from a cache if I have, and then running the final reduce to join all the partitions together. However, I suspect that there's an easier way to approach this.
I've experimented with checkpointing in Spark Streaming, and that is able to store results between restarts, which is almost what I'm looking for, but I want to do this outside of a streaming job.
RDD caching/persisting/checkpointing almost looks like something I could build off of - it makes it easy to keep intermediate computations around and reference them later, but I think cached RDDs are always removed once the SparkContext is stopped, even if they're persisted to disk. So caching wouldn't work for storing results between restarts. Also, I'm not sure if/how checkpointed RDDs are supposed to be loaded when a new SparkContext is started... They seem to be stored under a UUID in the checkpoint directory that's specific to a single instance of the SparkContext.
Both use cases suggested by the article (incremental logs processing and incremental query processing) can be generally solved by Spark Streaming.
The idea is that you have incremental updates coming in as a stream (the DStream abstraction, or a streaming DataFrame). Then you can process the new data and join it with previous calculations, using either time-window-based processing or arbitrary stateful operations in Structured Streaming. The results of the calculation can later be dumped to some sort of external sink, like a database or file system, or they can be exposed as an SQL table.
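A rough Structured Streaming sketch of that idea (the paths, schema, and column names are placeholders): new files landing under an input directory are treated as incremental updates, and the running aggregate's state lives in the checkpoint rather than being recomputed from scratch:

import org.apache.spark.sql.types._

// Placeholder schema for the incoming records.
val schema = new StructType()
  .add("key", StringType)
  .add("value", LongType)

// Every new file that lands under /data/incoming is picked up as an
// incremental update.
val updates = spark.readStream.schema(schema).parquet("/data/incoming")

// Running aggregate over all updates seen so far.
val totals = updates.groupBy("key").sum("value")

val query = totals.writeStream
  .outputMode("complete")
  .format("memory")            // in-memory table for inspection; use a real sink in practice
  .queryName("running_totals")
  .option("checkpointLocation", "/data/checkpoints/running_totals")
  .start()

// spark.sql("select * from running_totals").show()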
If you're not building an online data processing system, regular Spark can be used as well. It's just a matter of how incremental updates get into the process, and how intermediate state is saved. For example, incremental updates can appear under some path on a distributed file system, while intermediate state containing previous computation joined with new data computation can be dumped, again, to the same file system.
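And a batch-mode sketch of the same pattern (all paths are hypothetical, and this only composes correctly for associative, commutative aggregates such as sums and counts): partial aggregates for already-processed partitions live under a state path, only the changed partition is recomputed, and the final result is a re-reduce over the saved partials:

import org.apache.spark.sql.functions.sum

// Recompute the partial aggregate for just the changed partition
// and overwrite its saved state.
spark.read.parquet("/data/input/part=19")
  .groupBy("key")
  .agg(sum("value").as("partial_sum"))
  .write.mode("overwrite").parquet("/data/state/part=19")

// Combine all saved partials (the unchanged ones are reused as-is)
// into the final result; only the changed partition was re-scanned.
val finalResult = spark.read.parquet("/data/state")
  .groupBy("key")
  .agg(sum("partial_sum").as("total"))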

Fast Parquet row count in Spark

The Parquet files contain a per-block row count field. Spark seems to read it at some point (SpecificParquetRecordReaderBase.java#L151).
I tried this in spark-shell:
sqlContext.read.load("x.parquet").count
And Spark ran two stages, showing various aggregation steps in the DAG. I figure this means it reads through the file normally instead of using the row counts. (I could be wrong.)
The question is: Is Spark already using the row count fields when I run count? Is there another API to use those fields? Is relying on those fields a bad idea for some reason?
That is correct; Spark already uses the row count field when you run count.
Diving into the details a bit, the SpecificParquetRecordReaderBase.java references the Improve Parquet scan performance when using flat schemas commit as part of [SPARK-11787] Speed up parquet reader for flat schemas. Note, this commit was included as part of the Spark 1.6 branch.
If the query is a row count, it pretty much works the way you described it (i.e. reading the metadata). If the predicates are fully satisfied by the min/max values, that should work as well though that is not as fully verified. It's not a bad idea to use those Parquet fields but as implied in the previous statement, the key issue is to ensure that the predicate filtering matches the metadata so you are doing an accurate count.
To help understand why there are two stages, look at the DAG created when running the count() statement: the first stage (Stage 25 in this run) performs the file scan, while the second stage (Stage 26) runs the shuffle for the count.
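For reference, the two-step aggregation is also visible from spark-shell (the stage numbers above are just from that particular run); something along these lines:

val df = spark.read.parquet("x.parquet")

// df.count() itself is an action; the equivalent query below can be
// explained to see the partial + final aggregation that produces two stages.
df.groupBy().count().explain()
// Expect something like:
//   HashAggregate(keys=[], functions=[count(1)])
//   +- Exchange SinglePartition
//      +- HashAggregate(keys=[], functions=[partial_count(1)])
//         +- FileScan parquet ...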
Thanks to Nong Li (the author of the SpecificParquetRecordReaderBase.java commit) for validating!
Updated
To provide additional context on the bridge between Dataset.count and Parquet, the flow of the internal logic is roughly:
Spark does not read any Parquet columns to calculate the count.
The Parquet schema passed to the VectorizedParquetRecordReader is actually an empty Parquet message.
The count is computed using the metadata stored in the Parquet file footers.
To work with the Parquet file format, Apache Spark internally wraps this logic in an iterator that returns an InternalRow; more information can be found in InternalRow.scala. Ultimately, the count() aggregate function interacts with the underlying Parquet data source using this iterator. BTW, this is true for both the vectorized and non-vectorized Parquet readers.
Therefore, to bridge the Dataset.count() with the Parquet reader, the path is:
The Dataset.count() call is planned into an aggregate operator with a single count() aggregate function.
Java code is generated at planning time for the aggregate operator as well as the count() aggregate function.
The generated Java code interacts with the underlying data source ParquetFileFormat through a RecordReaderIterator, which is used internally by the Spark data source API.
For more information, please refer to Parquet Count Metadata Explanation.
We can also format the resulting count for readability:
java.text.NumberFormat.getIntegerInstance.format(sparkdf.count)
(This still runs the same count job; only the displayed number changes.)
