Databricks result cache - apache-spark

Does Databricks have the concept of a results cache? When I run a SQL query, does it cache the resultset somewhere for sub-second access or do we only have the Delta lake cache? I could not find anything in the documentation and at this stage am assuming it does not exist as a feature. Can someone clarify?

Cache the data accessed by the specified simple SELECT query in the Delta cache. You can choose a subset of columns to be cached by providing a list of column names, and a subset of rows by providing a predicate. This enables subsequent queries to avoid scanning the original files as much as possible. This construct is applicable only to Parquet tables. Views are also supported, but the expanded queries are restricted to simple queries of this form.
Examples:
CACHE SELECT * FROM boxes
CACHE SELECT width, length FROM boxes WHERE height=3
Reference: See Delta and Apache Spark caching comparison for the differences between the RDD cache and the Databricks IO cache.
The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed.
There are two types of caching available in Databricks:
Delta caching
Apache Spark caching
You can use Delta caching and Spark caching at the same time. This section outlines the key differences between them so that you can choose the best tool for your workflow.
Type of stored data: The Delta cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries. The Spark cache can store the result of any subquery, as well as data stored in formats other than Parquet (such as CSV, JSON, and ORC).
Performance: The data stored in the Delta cache can be read and operated on faster than the data in the Spark cache. This is because the Delta cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.
Automatic vs manual control: When the Delta cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the CACHE command (see Cache a subset of the data). When you use the Spark cache, you must manually specify the tables and queries to cache.
Disk vs memory-based: The Delta cache is stored entirely on the local disk, so that memory is not taken away from other operations within Spark. Due to the high read speeds of modern SSDs, the Delta cache can be fully disk-resident without a negative impact on its performance. In contrast, the Spark cache uses memory.
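As a hedged sketch, here is how the two might be used side by side (the config key comes from the Databricks docs; the path and column name are hypothetical):
// Delta cache: enable it explicitly (on some worker instance types it is on by default)
spark.conf.set("spark.databricks.io.cache.enabled", "true")
// Remote Parquet reads now go through the Delta cache transparently
val sales = spark.read.parquet("/mnt/data/sales")

// Spark cache: requested manually, and able to hold arbitrary subquery results
import org.apache.spark.sql.functions.col
val bigSales = sales.filter(col("amount") > 1000).cache()
bigSales.count()  // an action materializes the Spark cache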
Hope this helps.

Related

How do I replicate a local Cassandra node to a remote Cassandra node?

I need to replicate a local node that uses SimpleStrategy to a remote node in another Cassandra database. Does anyone have any idea where I should begin?
The main complexity here, if you're writing data into both clusters, is how to avoid overwriting data that has changed in the cloud later than in your local setup. There are several possibilities:
If the structure of the tables is the same (including the names of the keyspaces, if user-defined types are used), then you can just copy the SSTables from your local machine to the cloud and use sstableloader to replay them. In this case Cassandra will obey the actual writetime and won't overwrite the data that changed later. Also, if you're doing deletes from tables, you need to copy the SSTables before the tombstones expire. You don't have to copy all SSTables every time, just the files that have changed since the last data upload, but you always need to copy the SSTables from all nodes from which you're doing the upload.
If the structure isn't the same, then you can look at using either DSBulk or the Spark Cassandra Connector. In both cases you'll need to export the data with its writetime as well, and then load it with that timestamp. Please note that in both cases, if different columns have different writetimes, you will need to load that data separately, because Cassandra allows you to specify only one timestamp when updating/inserting data.
In the case of DSBulk you can follow example 19.4 for exporting data from this blog post, and example 11.3 for loading (from another blog post). So this may require some shell scripting. You'll also need disk space to keep the exported data (but you can use compression).
In the case of the Spark Cassandra Connector you can export data without intermediate storage if both nodes are accessible from Spark, but you'll need to write some Spark code that reads the data using the RDD or DataFrame APIs, along the lines of the sketch below.
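A rough sketch with the connector's RDD API, in case it helps (keyspace, table, and column names are hypothetical; writeTime and per-row timestamps are part of the connector's API, but check the version you use):
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{TimestampOption, WriteConf}

// Read each row together with the write time of one of its columns
val rows = sc.cassandraTable("src_ks", "my_table")
  .select("id", "value", "value".writeTime as "value_wt")

// Write to the target table, replaying the original timestamp per row so that
// data changed later in the destination cluster is not overwritten
// (pointing at a second cluster requires a separate CassandraConnector config)
rows.saveToCassandra("dst_ks", "my_table",
  SomeColumns("id", "value"),
  writeConf = WriteConf(timestamp = TimestampOption.perRow("value_wt")))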

Where does Spark save retrieved data on Azure Databricks?

I would like to understand the difference between RAM and storage in Azure Databricks.
Suppose I am reading csv data from the Azure data lake (ADLS Gen 2) as follows:
df = spark.read.csv("path to the csv file").collect()
I am aware that the read method is a transformation in Spark, and that it is not going to run immediately. However, if I now perform an action using the collect() method, I would assume that the data is actually read from the data lake by Spark and loaded into RAM or onto disk. First, I would like to know where the data is stored: is it in RAM or on disk? If the data is stored in RAM, then what is cache used for? And if the data is retrieved and stored on disk, then what does persist do? I am aware that cache stores the data in memory for later use, and that if I have a very large amount of data, I can use persist to store the data onto disk.
I would like to know how much Databricks can scale if we have petabytes of data.
How much do RAM and disk differ in size?
How can I know where the data is stored at any point in time?
What is the underlying operating system running Azure Databricks?
Please note that I am a newbie to Azure Databricks and Spark.
I would like to get some recommendations on best practices when using Spark.
Your help is much appreciated!!
First, I would like to know where the data is stored.
When you run any action (i.e. collect or others), data is collected from the executor nodes onto the driver node and stored in RAM (memory).
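A tiny sketch of what that means (the path is hypothetical):
val df = spark.read.parquet("/mnt/data/events")  // lazy: nothing is read yet
val rows = df.collect()  // action: all rows are pulled back into the driver's RAM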
And, if the data is stored in RAM, then what is cache used for?
Spark has lazy evaluation. What that means is that Spark doesn't do anything until you call an action; once you do, it builds a DAG and then executes that DAG.
Let's understand it with an example. Consider that you have three tables: Table A, Table B, and Table C. You join these tables and apply some business logic (maps and filters); let's call the resulting dataframe filtered_data. Now you use this dataframe in, say, 5 different places (other dataframes), for lookups, joins, or other business reasons.
If you don't persist (cache) your filtered_data dataframe, then every time it is referenced it will go through the joins and the other business logic again. So it's advisable to persist (cache) a dataframe if you are going to use it in multiple places, as sketched below.
By default, cache stores data in memory (RAM), but you can set the storage level to disk.
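A minimal sketch of both points, reusing the filtered_data example above (table and column names are hypothetical):
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

val tableA = spark.table("table_a")  // hypothetical tables
val tableB = spark.table("table_b")
val filteredData = tableA.join(tableB, "id").filter(col("is_active"))
filteredData.cache()  // Dataset default level: MEMORY_AND_DISK
filteredData.count()  // the first action materializes the cache

// or, on an uncached dataframe, keep the blocks on disk only:
// filteredData.persist(StorageLevel.DISK_ONLY)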
I would like to know how much Databricks can scale if we have petabytes of data.
It's a distributed environment, so what you need to do is add more executors, and you may need to increase the memory and CPU configuration.
How can I know where the data is stored at any point in time?
If you haven't created a table or view, it's stored in memory.
What is the underlying operating system running Azure Databricks?
It uses a Linux operating system, specifically Linux-4.15.0-1050-azure-x86_64-with-Ubuntu-16.04-xenial.
You can run the following commands to find out:
import platform
print(platform.platform())

Spark's amnesia of parquet partitions when cached in-memory (native spark cache)

I am working on some batch processing with Spark, reading data from a partitioned parquet file that is around 2 TB. Right now, I am caching the whole file in memory, since I need to avoid reading the same parquet file multiple times (given the way we are analyzing the data).
Until some time back, the code was working fine. Recently, we added use-cases that need to work on selective partitions (like the average of a metric for the last 2 years, where the complete data spans 6+ years).
When we started taking metrics of the execution times, we observed that the use-case which works on a subset of the partitioned data takes about as long as the use-case which works on the complete data.
So, my question is: does Spark's in-memory caching honor the partitions of a Parquet file, i.e., will Spark hold on to the partition information even after caching the data in memory?
Note: Since this is really a general question about Spark's processing style, I didn't add any metrics or code.

Different default persist for RDD and Dataset

I was trying to find a good answer for why the default persist for RDD is MEMORY_ONLY whereas for Dataset it is MEMORY_AND_DISK. But I couldn't find it.
Does anyone know why the default persistence levels are different?
Simply because MEMORY_ONLY is rarely useful: in practice it is not that common to have enough memory to store all the required data, so you often have to evict some of the blocks or cache the data only partially.
Compared to that, MEMORY_AND_DISK evicts data to disk, so no cached block is lost.
The exact reason behind choosing MEMORY_AND_DISK as the default caching mode is explained in SPARK-3824 (Spark SQL should cache in MEMORY_AND_DISK by default):
Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core. Especially since now we are more conservative about caching blocks and sometimes won't cache blocks we think might exceed memory, it seems good to keep persisted blocks on disk by default.
For RDDs the default storage level for the persist API is MEMORY_ONLY, and for Datasets it is MEMORY_AND_DISK.
Please check the below:
[SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK
As mentioned by @user6910411, "Spark SQL currently uses MEMORY_ONLY as the default format. Due to the use of column buffers however, there is a huge cost to having to recompute blocks, much more so than Spark core." That is, the Dataset/DataFrame APIs use column buffers to store the column data types and column details about the raw data, so if the cached data does not fit into memory, Spark will not cache the rest of the partition and will recompute it whenever needed. In the case of a Dataset/DataFrame the recomputation cost is higher than for an RDD because of this columnar structure, so the default persist option was changed to MEMORY_AND_DISK: blocks that do not fit into memory spill to disk and are retrieved from disk when needed, rather than being recomputed.
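You can check the two defaults directly; a minimal sketch (the printed forms are approximate and vary by Spark version):
val rdd = spark.sparkContext.parallelize(1 to 100)
rdd.persist()
println(rdd.getStorageLevel)  // MEMORY_ONLY: StorageLevel(memory, deserialized, 1 replicas)

val ds = spark.range(100)
ds.persist()
println(ds.storageLevel)  // MEMORY_AND_DISK: StorageLevel(disk, memory, deserialized, 1 replicas)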

Spark on Databricks - Caching Hive table

We have a fact table (30 columns) stored in parquet files on S3. We also created a table over these files and cache it afterwards. The table is created using this code snippet:
val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
%sql CACHE TABLE f_traffic
We run many different calculations on this table (files) and are looking for the best way to cache the data for faster access in subsequent calculations. The problem is that, for some reason, it's faster to read the data from parquet and do the calculation than to access it from memory. One important note is that we do not utilize every column: usually around 6-7 columns per calculation, and different columns each time.
Is there a way to cache this table in memory so we can access it faster than reading from parquet?
It sounds like you're running on Databricks, so your query might be automatically benefitting from the Databricks IO Cache. From the Databricks docs:
The Databricks IO cache accelerates data reads by creating copies of remote files in nodes’ local storage using fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then executed locally, which results in significantly improved reading speed.
The Databricks IO cache supports reading Parquet files from DBFS, Amazon S3, HDFS, Azure Blob Storage, and Azure Data Lake. It does not support other storage formats such as CSV, JSON, and ORC.
The Databricks IO Cache is supported on Databricks Runtime 3.3 or newer. Whether it is enabled by default depends on the instance type that you choose for the workers on your cluster: currently it is enabled automatically for Azure Ls instances and AWS i3 instances (see the AWS and Azure versions of the Databricks documentation for full details).
If this Databricks IO cache is taking effect then explicitly using Spark's RDD cache with an untransformed base table may harm query performance because it will be storing a second redundant copy of the data (and paying a roundtrip decoding and encoding in order to do so).
Explicit caching can still make sense if you're caching a transformed dataset, e.g. after filtering to significantly reduce the data volume, but if you only want to cache a large and untransformed base relation then I'd personally recommend relying on the Databricks IO cache and avoiding Spark's built-in RDD cache.
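For example, a sketch of that pattern against the f_traffic table from the question (the filter and column names are hypothetical):
import org.apache.spark.sql.functions.col

// Cache only a filtered, projected subset instead of the untransformed base table
val recentTraffic = spark.table("f_traffic")
  .filter(col("event_date") >= "2017-01-01")
  .select("col_a", "col_b")
  .cache()
recentTraffic.count()  // materialize once; subsequent calculations reuse the cached subset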
See the full Databricks IO cache documentation for more details, including information on cache warming, monitoring, and a comparison of RDD and Databricks IO caching.
To materialize the dataframe in the cache, you should do:
val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
val df_factTraffic = spark.table("f_traffic").cache
df_factTraffic.rdd.count
// now df_factTraffic is materialized in memory
See also https://stackoverflow.com/a/42719358/1138523
But it's questionable whether this makes sense at all because parquet is a columnar file format (meaning that projection is very efficient), and if you need different columns for each query the caching will not help you.
