Spark on Databricks - Caching Hive table - apache-spark

We have fact table(30 columns) stored in parquet files on S3 and also created table on this files and cache it afterwards. Table is created using this code snippet:
val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
%sql CACHE TABLE f_traffic
We run many different calculations on this table(files) and are looking the best way to cache data for faster access in subsequent calculations. Problem is, that for some reason it's faster to read the data from parquet and do the calculation then access it from memory. One important note is that we do not utilize every column. Usually, around 6-7 columns per calculation and different columns each time.
Is there a way to cache this table in memory so we can access it faster then reading from parquet?

It sounds like you're running on Databricks, so your query might be automatically benefitting from the Databricks IO Cache. From the Databricks docs:
The Databricks IO cache accelerates data reads by creating copies of remote files in nodes’ local storage using fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then executed locally, which results in significantly improved reading speed.
The Databricks IO cache supports reading Parquet files from DBFS, Amazon S3, HDFS, Azure Blob Storage, and Azure Data Lake. It does not support other storage formats such as CSV, JSON, and ORC.
The Databricks IO Cache is supported on Databricks Runtime 3.3 or newer. Whether it is enabled by default depends on the instance type that you choose for the workers on your cluster: currently it is enabled automatically for Azure Ls instances and AWS i3 instances (see the AWS and Azure versions of the Databricks documentation for full details).
If this Databricks IO cache is taking effect then explicitly using Spark's RDD cache with an untransformed base table may harm query performance because it will be storing a second redundant copy of the data (and paying a roundtrip decoding and encoding in order to do so).
Explicit caching can still can make sense if you're caching a transformed dataset, e.g. after filtering to significantly reduce the data volume, but if you only want to cache a large and untransformed base relation then I'd personally recommend relying on the Databricks IO cache and avoiding Spark's built-in RDD cache.
See the full Databricks IO cache documentation for more details, including information on cache warming, monitoring, and a comparision of RDD and Databricks IO caching.

The materalize dataframe in cache, you should do:
val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
val df_factTraffic = spark.table("f_traffic").cache
df_factTraffic.rdd.count
// now df_factTraffic is materalized in memory
See also https://stackoverflow.com/a/42719358/1138523
But it's questionable whether this makes sense at all because parquet is a columnar file format (meaning that projection is very efficient), and if you need different columns for each query the caching will not help you.

Related

How do I replicate a Cassandra's local node for other Cassandra's remote node?

I need to replicate a local node with a SimpleStrategy to a remote node in other Cassandra's DB. Does anyone have any idea where I begin?
The main complexity here, if you're writing data into both clusters is how to avoid overwriting the data that has changed in the cloud later than your local setup. There are several possibilities to do that:
If structure of the tables is the same (including the names of the keyspaces if user-defined types are used), then you can just copy SSTables from your local machine to the cloud, and use sstableloader to replay them - in this case, Cassandra will obey the actual writetime, and won't overwrite changed data. Also, if you're doing deletes from tables, then you need to copy SSTables before tombstones are expired. You may not copy all SSTables every time, just the files that has changed since last data upload. But you always need to copy SSTables from all nodes from which you're doing upload.
If structure isn't the same, then you can either look to using DSBulk or Spark Cassandra Connector. In both cases you'll need to export data with writetime as well, and then load it also with timestamp. Please note that in both cases if different columns have different writetime, then you will need to load that data separately because Cassandra allows to specify only one timestamp when updating/inserting data.
In case of DSBulk you can follow the example 19.4 for exporting of data from this blog post, and example 11.3 for loading (from another blog post). So this may require some shell scripting. Plus you'll need to have disk space to keep exported data (but you can use compression).
In case of Spark Cassandra Connector you can export data without intermediate storage if both nodes are accessible from Spark. But you'll need to write some Spark code for reading data using RDD or DataFrame APIs.

Parquet vs Delta format in Azure Data Lake Gen 2 store

I am importing fact and dimension tables from SQL Server to Azure Data Lake Gen 2.
Should I save the data as "Parquet" or "Delta" if I am going to wrangle the tables to create a dataset useful for running ML models on Azure Databricks ?
What is the difference between storing as parquet and delta ?
Delta is storing the data as parquet, just has an additional layer over it with advanced features, providing history of events, (transaction log) and more flexibility on changing the content like, update, delete and merge capabilities. This link delta explains quite good how the files organized.
One drawback that it can get very fragmented on lots of updates, which could be harmful for performance. AS the AZ Data Lake Store Gen2 is anyway not optimized for large IO this is not really a big problem. Some optimization on the parquet format though will not be very effective this way.
I would use delta, just for the advanced features. It is very handy if there is a scenario where the data is updating over time, not just appending. Specially nice feature that you can read the delta tables as of a given point in time they existed.
SQL as of syntax
This is useful for having consistent training sets (to always have the same training dataset without separating to individual parquet files). In case for the ML models handling delta format as input may could be problematic, as likely only few frameworks will be able to read it in directly, so you will need to convert it during some pre-processing step.
Delta Lake uses versioned Parquet files to store your data in your cloud storage. Apart from the versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory to provide ACID transactions.
Reference : https://learn.microsoft.com/en-us/azure/databricks/delta/delta-faq
As per the other answers Delta Lake is a feature layer over Parquet.
Consider - do you need Delta features? if you are just reading the data & wrangling elsewhere Delta is just extra complexity for little additional benefit.
Also Parquet is compatible with almost every data system out there, Delta is widely adopted but not everything can work with Delta.
Consider using parquet if you don't need a transaction log.
We extract data daily and replace it with the Delta file. However, it re-creates the same number of parquet files every time though there is a minor change to data.

Databricks result cache

Does Databricks have the concept of a results cache? When I run a SQL query, does it cache the resultset somewhere for sub-second access or do we only have the Delta lake cache? I could not find anything in the documentation and at this stage am assuming it does not exist as a feature. Can someone clarify?
Cache the data accessed by the specified simple SELECT query in the Delta cache. You can choose a subset of columns to be cached by providing a list of column names and choose a subset of rows by providing a predicate. This enables subsequent queries to avoid scanning the original files as much as possible. This construct is applicable only to Parquet tables. Views are also supported, but the expanded queries are restricted to the simple queries.
Examples:
CACHE SELECT * FROM boxes
CACHE SELECT width, length FROM boxes WHERE height=3
Reference: See Delta and Apache Spark caching comparison for the differences between the RDD cache and the Databricks IO cache.
The Delta cache accelerates data reads by creating copies of remote files in nodes’ local storage using a fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then performed locally, which results in significantly improved reading speed.
There are two types of caching available in Databricks:
Delta caching
Apache Spark caching
You can use Delta caching and Spark caching at the same time. This section outlines the key differences between them so that you can choose the best tool for your workflow.
Type of stored data: The Delta cache contains local copies of remote data. It can improve the performance of a wide range of queries, but cannot be used to store results of arbitrary subqueries. The Spark cache can store the result of any subquery data and data stored in formats other than Parquet (such as CSV, JSON, and ORC).
Performance: The data stored in the Delta cache can be read and operated on faster than the data in the Spark cache. This is because the Delta cache uses efficient decompression algorithms and outputs data in the optimal format for further processing using whole-stage code generation.
Automatic vs manual control: When the Delta cache is enabled, data that has to be fetched from a remote source is automatically added to the cache. This process is fully transparent and does not require any action. However, to preload data into the cache beforehand, you can use the CACHE command (see Cache a subset of the data). When you use the Spark cache, you must manually specify the tables and queries to cache.
Disk vs memory-based: The Delta cache is stored entirely on the local disk, so that memory is not taken away from other operations within Spark. Due to the high read speeds of modern SSDs, the Delta cache can be fully disk-resident without a negative impact on its performance. In contrast, the Spark cache uses memory.
Hope this helps.

Where does Spark saves retrieved data on Azure Databricks?

I would like to understand the difference between the RAM and storage in Azure databricks.
Suppose I am reading csv data from the Azure data lake (ADLS Gen 2) as follows:
df = spark.read.csv("path to the csv file").collect()
I am aware that the read method in spark is a Transformation method in spark. And this is not going to be run immediately. However, now if I perform an Action using the collect() method, I would assume that the data is now actually been read from the data lake by Spark and loaded into RAM or Disk. First, I would like to know, where is the data stored. Is it in RAM or in Disk. And, if the data is stored in RAM, then what is cache used for?; and if the data is retrieved and stored on disk, then what does persist do? I am aware that cache stores the data in memory for late use, and that if I have very large amount of data, I can use persist to store the data into a disk.
I would like to know, how much can databricks scale if we have peta bytes of data?
How much does the RAM and Disk differ in size?
how can I know where the data is stored at any point in time?
What is the underlying operating system running Azure Databricks?
Please note that I am newbie to Azure Databricks and Spark.
I would like to get some recommendation on the best practices when using Spark.
Your help is much appreciated!!
First, I would like to know, where is the data stored.
When you run any action (i.e. collect or others) Data is collected from executors nodes to driver node and stored in ram (memory)
And, if the data is stored in RAM, then what is cache used for
Spark has lazy evaluation what does that mean is until you call an action it doesn't do anything, and once you call it, it creates a DAG and then executed that DAF.
Let's understand it by an example. let's consider you have three tables Table A, Table B and Table C. You have joined this table and apply some business logic (maps and filters), let's call this dataframe filtered_data. and now you are using this DataFrame in let's say 5 different places (another dataframes) for either lookup or join and other business reason.
if you won't persist(cache) your filterd_data dataframe, everytime it will be referenced, it will again go through joins and other business logic. So it's advisable to persist(cache) dataframe if you are going to use that into multiple places.
By Default Cache stored data in memory (RAM) but you can set the storage level to disk
would like to know, how much can databricks scale if we have petabytes of data?
It's a distributed environment, so what you need to do is add more executors. and may be need to increase the memory and CPU configuration,
how can I know where the data is stored at any point in time?
if you haven't created a table or view, it's stored in memory.
What is the underlying operating system running Azure Databricks?
it uses linux operation system.
specifically Linux-4.15.0-1050-azure-x86_64-with-Ubuntu-16.04-xenial
you can run the following command to know.
import platform
println(platform.platform())

Parquet VS Database

I am trying to understand which of the below two would be better option especially in case of Spark environment :
Loading the parquet file directly into a dataframe and access the data (1TB of data table)
Using any database to store and access the data.
I am working on data pipeline design and trying to understand which of the above two options will result in more optimized solution.
Loading the parquet file directly into a dataframe and access the data is more scalable comparing to reading RDBMS like Oracle through JDBC connector. I handle the data more the 10TB but I prefer ORC format for better performance. I suggest you have to directly read data from files the reason for that is data locality - if your run your Spark executors on the same hosts, where HDFS data nodes located and can effectively read data into memory without network overhead. See https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-data-locality.html and How does Apache Spark know about HDFS data nodes? for more details.

Resources