I went through https://docs.databricks.com/user-guide/databricks-io-cache.html, but there is not a single line of example code showing how to use the DBIO cache (instead of the standard Spark RDD cache), apart from setting a configuration option to enable the DBIO cache.
Am I to assume that if I enable that setting, spark.conf.set("spark.databricks.io.cache.enabled", "true"), then whatever RDD I create in my Spark job will basically be treated as a DBIO cache? What if I want to distinguish between the two and have both in my code?
DBIO caching works with Parquet datasets only at this time, so as long as you're loading a DataFrame from Parquet, you will get use of the cache. You can confirm this by looking at the Storage tab in the Spark UI, which shows how much you've cached so far. Also, to make it easier, just use i3 instance types, so that the DBIO cache is enabled by default.
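For example (a minimal sketch; the S3 path is hypothetical, and a SparkSession named spark is assumed), enabling the cache is a single configuration call, and any Parquet read after that goes through it transparently:

// Enable the DBIO cache (already on by default on some instance types, e.g. i3)
spark.conf.set("spark.databricks.io.cache.enabled", "true")

// Nothing else changes in your code: a plain Parquet read populates the cache
val df = spark.read.parquet("s3a://my-bucket/events")  // hypothetical path
df.count()  // first read fetches from S3 and fills the local DBIO cache
df.count()  // subsequent reads are served from the nodes' local storage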
Related
I am reading from AWS (S3) and writing into a database (Exasol), and it is taking too much time; even setting the batch size is not affecting performance.
Writing 6.18M rows (around 3.5 GB) takes 17 minutes.
I am running in cluster mode on a 20-node cluster.
How can I make it faster?
Dataset<Row> ds = session.read().parquet(s3Path);
ds.write()
  .format("jdbc")
  .option("user", username)
  .option("password", password)
  .option("driver", Conf.DRIVER)
  .option("url", dbURL)
  .option("dbtable", exasolTableName)
  .option("batchsize", 50000)
  .mode(SaveMode.Append)
  .save();
Ok, it's an interesting question.
I did not check the implementation details of the recently released Spark connector, but you may go with some previously existing methods.
Save the Spark job results as CSV files in Hadoop, then run a standard parallel IMPORT from all the created files via WebHDFS HTTP calls.
The official UDF script is capable of importing directly from Parquet, as far as I know.
You may implement your own Java UDF script to read Parquet in the way you want. For example, this is how it works for ORC files.
Generally speaking, the best way to achieve real performance is to bypass Spark altogether.
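As a rough sketch of the CSV approach (the paths are made up, and the exact Exasol IMPORT syntax is an assumption - check the Exasol manual for the precise statement):

// 1) Dump the Spark result as many CSV part-files in HDFS,
//    one per partition, so Exasol can import them in parallel
val ds = spark.read.parquet(s3Path)
ds.write.option("header", "false").csv("hdfs:///tmp/exasol_export")  // hypothetical path

// 2) Then run a parallel IMPORT on the Exasol side via any SQL client,
//    along the lines of (syntax is an assumption):
//    IMPORT INTO my_schema.my_table
//    FROM CSV AT 'http://namenode:50070/webhdfs/v1/tmp/exasol_export' FILE 'part-00000.csv' ...;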
We have a fact table (30 columns) stored in Parquet files on S3. We also created a table on top of these files and cache it afterwards. The table is created using this code snippet:
val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
%sql CACHE TABLE f_traffic
We run many different calculations on this table (files) and are looking for the best way to cache the data for faster access in subsequent calculations. The problem is that, for some reason, it's faster to read the data from Parquet and do the calculation than to access it from memory. One important note: we do not utilize every column - usually around 6-7 columns per calculation, and different columns each time.
Is there a way to cache this table in memory so we can access it faster than reading from Parquet?
It sounds like you're running on Databricks, so your query might be automatically benefitting from the Databricks IO Cache. From the Databricks docs:
The Databricks IO cache accelerates data reads by creating copies of remote files in nodes’ local storage using fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then executed locally, which results in significantly improved reading speed.
The Databricks IO cache supports reading Parquet files from DBFS, Amazon S3, HDFS, Azure Blob Storage, and Azure Data Lake. It does not support other storage formats such as CSV, JSON, and ORC.
The Databricks IO Cache is supported on Databricks Runtime 3.3 or newer. Whether it is enabled by default depends on the instance type that you choose for the workers on your cluster: currently it is enabled automatically for Azure Ls instances and AWS i3 instances (see the AWS and Azure versions of the Databricks documentation for full details).
If this Databricks IO cache is taking effect then explicitly using Spark's RDD cache with an untransformed base table may harm query performance because it will be storing a second redundant copy of the data (and paying a roundtrip decoding and encoding in order to do so).
Explicit caching can still can make sense if you're caching a transformed dataset, e.g. after filtering to significantly reduce the data volume, but if you only want to cache a large and untransformed base relation then I'd personally recommend relying on the Databricks IO cache and avoiding Spark's built-in RDD cache.
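For instance, the distinction might look like this (a minimal sketch; the path, column, and filter value are made up, and a SparkSession named spark is assumed):

import spark.implicits._

// Base relation: rely on the Databricks IO cache; do NOT .cache() this
val base = spark.read.parquet("s3a://bucket/fact")  // hypothetical path

// Transformed, much smaller dataset: explicit Spark caching can pay off here
val recent = base.filter($"event_date" >= "2018-01-01")  // hypothetical filter
recent.cache()
recent.count()  // materialize the cached subset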
See the full Databricks IO cache documentation for more details, including information on cache warming, monitoring, and a comparison of RDD and Databricks IO caching.
To materialize the dataframe in the cache, you should do:
val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
val df_factTraffic = spark.table("f_traffic").cache
df_factTraffic.rdd.count // force an action that evaluates every row
// now df_factTraffic is materialized in memory
See also https://stackoverflow.com/a/42719358/1138523
But it's questionable whether this makes sense at all, because Parquet is a columnar file format (meaning that projection is very efficient), and if you need different columns for each query the caching will not help you.
Recently I saw some strange behaviour of Spark.
I have a pipeline in my application in which I'm manipulating one big Dataset - pseudocode:
var data = spark.read (...)
data = data.join(df1, "key") // etc, more transformations
data.cache() // used to not recalculate data after save
data.write.parquet(...) // some save
val extension = data.join(...) // more transformations - joins, selects, etc.
extension.cache() // again, cache to not double calculations
extension.count()
// (1)
extension.write.csv(...) // some other save
val aggregated = extension.groupBy("key").agg(/* some aggregations */)
aggregated.write.parquet(...) // other save; without cache it would trigger recomputation of the whole dataset
However, when I call data.unpersist(), i.e. at place (1), Spark removes all datasets from Storage, including the extension Dataset, which is not the one I tried to unpersist.
Is that expected behaviour? How can I free some memory by unpersisting an old Dataset without also unpersisting every Dataset that was "next in the chain"?
My setup:
Spark version: current master, RC for 2.3
Scala: 2.11
Java: OpenJDK 1.8
This question looks similar to Understanding Spark's caching, but here I'm performing some actions before the unpersist. First I count everything and then save into storage - I don't know if caching works the same for RDDs as for Datasets.
This is expected behavior from Spark caching. Spark doesn't want to keep invalid cached data, so it completely removes all the cached plans that refer to the dataset.
This is to make sure the query is correct. In the example, you are creating the extension dataset from the cached dataset data. Now if the dataset data is unpersisted, the extension dataset can essentially no longer rely on the cached dataset data.
Here is the pull request for the fix they made. You can also see the similar JIRA ticket.
Answer for Spark 2.4:
There was a ticket about correctness in Datasets and caching behaviour, see https://issues.apache.org/jira/browse/SPARK-24596
From Maryann Xue's description, caching now works in the following manner:
Drop tables and regular (persistent) views: regular mode
Drop temporary views: non-cascading mode
Modify table contents (INSERT/UPDATE/MERGE/DELETE): regular mode
Call DataSet.unpersist(): non-cascading mode
Call Catalog.uncacheTable(): follow the same convention as drop tables/view, which is, use non-cascading mode for temporary views and regular mode for the rest
Where "regular mode" means mdoe from the questions and #Avishek's answer and non-cascading mode means, that extension won't be unpersisted
When Spark loads source data from a file into a DataFrame, what factors govern whether the data are loaded fully into memory on a single node (most likely the driver/master node) or in the minimal, parallel subsets needed for computation (presumably on the worker/executor nodes)?
In particular, if using Parquet as the input format and loading via the Spark DataFrame API, what considerations are necessary in order to ensure that loading from the Parquet file is parallelized and deferred to the executors, and limited in scope to the columns needed by the computation on the executor node in question?
(I am looking to understand the mechanism Spark uses to schedule loading of source data in the distributed execution plan, in order to avoid exhausting memory on any one node by loading the full data set.)
As long as you use Spark operations, all data transformations and aggregations are performed only on the executors. There is therefore no need for the driver to load the data; its job is to manage the processing flow. The driver gets the data only if you use a terminal operation, like collect(), first(), show(), toPandas(), toLocalIterator() and similar. Additionally, the executors do not load the whole file content into memory, but fetch the smallest possible chunks (which are called partitions).
If you use a column-store format such as Parquet, only the columns required for the execution plan are loaded - this is the default behaviour in Spark.
Edit: I just saw that there might be a bug in Spark: if you use nested columns inside your schema, then unnecessary columns may be loaded. See: Why does Apache Spark read unnecessary Parquet columns within nested structures?
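A quick way to confirm the pruning (a sketch; the path and column names are made up, and a SparkSession named spark is assumed) is to check ReadSchema in the physical plan:

val df = spark.read.parquet("s3a://bucket/wide_table")  // hypothetical path
df.select("user_id", "ts").explain()
// the FileScan parquet node should show ReadSchema: struct<user_id:...,ts:...>,
// i.e. only the selected columns are read, not the whole rows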
One of the great benefits of the Parquet data storage format is that it's columnar. If I've got a 'wide' dataset with hundreds of columns, but my query only touches a few of those, then it's possible to read only the data that stores those few columns, and skip the rest.
Presumably this feature works by reading a bit of metadata in the footer of a Parquet file that indicates the locations on the filesystem for each column. The reader can then seek on disk to read in only the necessary columns.
Does anyone know whether spark's default parquet reader correctly implements this kind of selective seeking on S3? I think it's supported by S3, but there's a big difference between theoretical support and an implementation that properly exploits that support.
This needs to be broken down:
Does the Parquet code get the predicates from Spark? Yes.
Does Parquet then attempt to selectively read only those columns, using the Hadoop FileSystem seek() + read() or readFully(position, buffer, length) calls? Yes.
Does the S3 connector translate these file operations into efficient HTTP GET requests? In Amazon EMR: yes. In Apache Hadoop, you need Hadoop 2.8 on the classpath and must set the property spark.hadoop.fs.s3a.experimental.fadvise=random to trigger random access (see the sketch below).
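For example, on an Apache Hadoop 2.8+ classpath, that property can be set when building the session (a sketch; the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-on-s3a")
  // switch S3A from sequential to random IO, which suits Parquet/ORC column seeks
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()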
Hadoop 2.7 and earlier handle the aggressive seek() around the file badly, because they always initiate a GET from the offset to the end of the file, get surprised by the next seek, have to abort that connection, and reopen a new TCP/HTTPS 1.1 connection (slow, CPU heavy), again and again. The random IO operation hurts bulk loading of things like .csv.gz, but is critical to getting ORC/Parquet performance.
You don't get the speedup with Hadoop 2.7's hadoop-aws JAR. If you need it, you have to update hadoop*.jar and its dependencies, or build Spark from scratch against Hadoop 2.8.
Note that Hadoop 2.8+ also has a nice little feature: if you call toString() on an S3A filesystem client in a log statement, it prints out all the filesystem IO stats, including how much data was discarded in seeks, aborted TCP connections, etc. This helps you work out what's going on.
2018-04-13 warning: Do not try to drop the Hadoop 2.8+ hadoop-aws JAR onto the classpath along with the rest of the Hadoop 2.7 JAR set and expect to see any speedup. All you will see are stack traces. You need to update all the Hadoop JARs and their transitive dependencies.
DISCLAIMER: I don't have a definitive answer and don't want to act as an authoritative source either, but have spent some time on parquet support in Spark 2.2+ and am hoping that my answer can help us all to get closer to the right answer.
Does Parquet on S3 avoid pulling the data for unused columns from S3 and only retrieve the file chunks it needs, or does it pull the whole file?
I use Spark 2.3.0-SNAPSHOT that I built today right from the master.
The parquet data source format is handled by ParquetFileFormat, which is a FileFormat.
If I'm correct, the reading part is handled by the buildReaderWithPartitionValues method (which overrides FileFormat's).
buildReaderWithPartitionValues is used exclusively when the FileSourceScanExec physical operator is requested for the so-called input RDDs (which are actually a single RDD) that generate internal rows when WholeStageCodegenExec is executed.
With that said, I think that reviewing what buildReaderWithPartitionValues does may get us closer to the final answer.
When you look at the following line, you can be assured that we're on the right track.
// Try to push down filters when filter push-down is enabled.
That code path depends on spark.sql.parquet.filterPushdown Spark property that is turned on by default.
spark.sql.parquet.filterPushdown: Enables Parquet filter push-down optimization when set to true.
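As a quick sanity check from user code (a sketch; the path and predicate are made up, and a SparkSession named spark is assumed):

import spark.implicits._

spark.conf.get("spark.sql.parquet.filterPushdown")  // "true" unless you disabled it
spark.read.parquet("s3a://bucket/t").filter($"id" > 100).explain()
// the FileScan node in the plan should list
// PushedFilters: [IsNotNull(id), GreaterThan(id,100)]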
With pushdown enabled, that leads us to parquet-hadoop's ParquetInputFormat.setFilterPredicate, iff the filters are defined:
if (pushed.isDefined) {
ParquetInputFormat.setFilterPredicate(hadoopAttemptContext.getConfiguration, pushed.get)
}
The code gets more interesting a bit later when the filters are used when the code falls back to parquet-mr (rather than using the so-called vectorized parquet decoding reader). That's the part I don't really understand (except what I can see in the code).
Please note that the vectorized parquet decoding reader is controlled by spark.sql.parquet.enableVectorizedReader Spark property that is turned on by default.
TIP: To know what part of the if expression is used, enable DEBUG logging level for org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat logger.
In order to see all the pushed-down filters, you could turn on the INFO logging level for the org.apache.spark.sql.execution.FileSourceScanExec logger. You should see the following in the logs:
INFO Pushed Filters: [pushedDownFilters]
I do hope that, even if it's not close to being a definitive answer, it has helped a little, and that someone picks it up where I left off to make it one soon. Hope dies last :)
Spark's parquet reader is just like any other InputFormat.
None of the input formats has anything special for S3. The input formats can read from LocalFileSystem, HDFS and S3; no special optimization is done for any of them.
Parquet's InputFormat, depending on the columns you ask for, will selectively read those columns for you.
If you want to be dead sure (although predicate pushdown works in the latest Spark versions), manually select the columns and write the transformations and actions, instead of depending on SQL.
No, predicate pushdown is not fully supported. This, of course, depends on:
Specific use case
Spark version
S3 connector type and version
In order to check your specific use case, you can enable the DEBUG log level in Spark and run your query. Then you can see whether there are "seeks" during the S3 (HTTP) requests, as well as how many requests were actually sent. Something like this:
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 >> "GET /test/part-00000-b8a8a1b7-0581-401f-b520-27fa9600f35e.snappy.parquet HTTP/1.1[\r][\n]"
....
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Range: bytes 0-7472093/7472094[\r][\n]"
....
17/06/13 05:46:50 DEBUG wire: http-outgoing-1 << "Content-Length: 7472094[\r][\n]"
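To get wire-level output like the above, one option (a sketch, assuming the log4j 1.x setup that ships with Spark 2.x and the Apache HttpClient used by the S3 connector) is to raise the wire logger's level programmatically:

import org.apache.log4j.{Level, Logger}

// "wire" in the log lines above is HttpClient's wire-logging logger
Logger.getLogger("org.apache.http.wire").setLevel(Level.DEBUG)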
Here's an example of an issue report that was opened recently due to the inability of Spark 2.1 to calculate COUNT(*) of all the rows in a dataset based on the metadata stored in a Parquet file: https://issues.apache.org/jira/browse/SPARK-21074