I have a scenario where I'm reading data from remote storage:
df = spark.read.load("abfss://mycontainer#mystorageacct.dfs.core.windows.net/mydata.csv")
It's a small dataset of a few GBs and it takes around 1 to 2 mins to load.
The same notebook is manually run every day and I'm looking to make some small optimisations.
It appears cache() and persist() will not help because the data will be uncached/unpersisted at the end of the session?
Is it an ok pattern to write the data to local storage on the cluster and read it from there, e.g.
import os

localfile = '/X/myfile.parquet'
if os.path.exists(localfile):
    df = spark.read.parquet(localfile)
else:
    df = spark.read.csv("abfss://mycontainer@mystorageacct.dfs.core.windows.net/mydata.csv")
    # do some basic munging
    df.write.parquet(localfile)
How can I determine where the local disks (i.e. the disks attached to the driver and worker nodes) are mounted, and whether users are permitted to write to them?
Update:
The cluster will occasionally get restarted, but not often.
Since your cluster is restarted periodically, I would not write to local disk but instead write the data out as a Delta table to cloud storage (S3, Azure Blob Storage) if possible.
This should speed up your query immensely.
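For example, a minimal sketch of that pattern (the mydata_delta path and the header/inferSchema options are placeholders/assumptions of mine, not from your setup):

delta_path = "abfss://mycontainer@mystorageacct.dfs.core.windows.net/mydata_delta"  # placeholder path

# Occasional prep job: read the CSV, do the basic munging, write it out as a Delta table
raw = spark.read.csv(
    "abfss://mycontainer@mystorageacct.dfs.core.windows.net/mydata.csv",
    header=True, inferSchema=True,
)
# ... basic munging here ...
raw.write.format("delta").mode("overwrite").save(delta_path)

# Daily notebook: read the already-munged Delta table directly
df = spark.read.format("delta").load(delta_path)

Since the Delta table lives in cloud storage, it survives cluster restarts, and the daily run skips both the CSV parsing and the munging.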
Related
I am trying to understand how data is stored and managed in the Databricks environment. I have a fairly decent understanding of what is going on under the hood, but have seen some conflicting information online, so I would like to get a detailed explanation to solidify my understanding. To ask my questions, I'd like to summarize what I have done as part of one of the exercises in the Apache Spark Developer course.
As part of the exercise, I followed these steps on the Databricks platform (a rough code sketch follows the list):
Started my cluster
Read a parquet file as a DataFrame
Stored the DataFrame as a Delta Table in my user directory in DBFS
Made some changes to the Delta Table created in the previous step
Partitioned the same Delta table based on a specific column e.g. State and saved in the same user directory in DBFS using the overwrite mode
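Roughly, in PySpark the steps above might look like this (the paths, dataset and column values here are placeholders of mine, not the course's actual ones):

# 2. Read a parquet file as a DataFrame (placeholder path)
df = spark.read.parquet("dbfs:/databricks-datasets/some/source.parquet")

# 3. Store the DataFrame as a Delta table in my user directory in DBFS (placeholder path)
delta_path = "dbfs:/user/me@example.com/delta/my_table"
df.write.format("delta").save(delta_path)

# 4. Make some changes to the Delta table, e.g. an update via the Delta API
from delta.tables import DeltaTable
DeltaTable.forPath(spark, delta_path).update(
    condition="State = 'CA'",           # placeholder condition
    set={"SomeColumn": "'new_value'"}   # placeholder column/value
)

# 5. Re-write the same data partitioned by State, overwriting the previous layout
spark.read.format("delta").load(delta_path) \
    .write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("State") \
    .save(delta_path)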
After following the above steps, here's how my DBFS directory looks:
[Screenshot: DBFS Delta Log directory]
In the root folder where I stored the Delta table (picture above) I have the following types of folders/files:
Delta log folder
Folders named after each 'State' value (step 5 in the previous section); each state folder also contains 4 parquet files which I suspect are partitions of the dataset
Four separate parquet files which I suspect are from when I first created this Delta table (step 3 of the previous section)
Based on the above exercise following are my questions:
Is the data that I see in the above directory (the State-named folders containing the partitions, the parquet files, the delta log, etc.) distributed across my nodes? (The answer, I presume, is yes.)
The four parquet files in the root folder (from when I created the Delta table, before partitioning) - assuming they are distributed across my nodes - are they stored in my nodes' RAM?
Where is the data from the delta_log folder stored? If it's across my nodes, is it stored in RAM or on disk?
Where is the data (the parquet files/partitions under each State-named folder in the screenshot above) stored? If this is also distributed across my nodes, is it in memory (RAM) or on disk?
Some of the answers I looked at online say that all the partitions are stored in memory (RAM). By that logic, once I turn off my cluster they should be removed from memory, right?
However, even when I turn off my cluster I am able to view all the data in DBFS (exactly as in the picture I have included above). I would expect that once the cluster is turned off the RAM is cleared, so I should not be able to see any data that was in RAM. Is my understanding incorrect?
Would appreciate if you can answer my questions in order with as much detail as possible.
When you write out the data to DBFS it is stored in some form of permanent object storage separate from your cluster. This is why it is still there after the cluster shuts down. Which storage this is depends on the cloud you are running your Databricks workspace in.
This is the main idea of separating compute and storage: your clusters are the compute and the storage lives elsewhere. Only when you read in and process the data is it distributed across your nodes for processing. Once your cluster shuts down, all data in the nodes' RAM or on their local disks is gone unless you've written it out to some form of permanent storage.
I would like to understand the difference between RAM and storage in Azure Databricks.
Suppose I am reading csv data from the Azure data lake (ADLS Gen 2) as follows:
df = spark.read.csv("path to the csv file").collect()
I am aware that the read method in Spark is a transformation and is not going to run immediately. However, if I now perform an action using the collect() method, I would assume that the data is actually read from the data lake by Spark and loaded into RAM or onto disk. First, I would like to know where the data is stored: is it in RAM or on disk? If the data is stored in RAM, then what is cache used for? And if the data is retrieved and stored on disk, then what does persist do? I am aware that cache stores the data in memory for later use, and that if I have a very large amount of data I can use persist to store it to disk.
I would like to know how much Databricks can scale if we have petabytes of data.
How much do the RAM and disk differ in size?
How can I know where the data is stored at any point in time?
What is the underlying operating system running Azure Databricks?
Please note that I am a newbie to Azure Databricks and Spark.
I would like to get some recommendation on the best practices when using Spark.
Your help is much appreciated!!
First, I would like to know, where is the data stored.
When you run collect (or another action that returns results to the driver), the data is gathered from the executor nodes to the driver node and stored in the driver's RAM (memory).
And, if the data is stored in RAM, then what is cache used for
Spark uses lazy evaluation, which means it doesn't do anything until you call an action; once you call one, it builds a DAG and then executes that DAG.
Let's understand it with an example. Suppose you have three tables: Table A, Table B and Table C. You join these tables and apply some business logic (maps and filters); let's call the resulting DataFrame filtered_data. Now you use this DataFrame in, say, 5 different places (other DataFrames) for lookups, joins and other business reasons.
If you don't persist (cache) your filtered_data DataFrame, every time it is referenced it will go through the joins and the other business logic again. So it's advisable to persist (cache) a DataFrame if you are going to use it in multiple places.
By default, cache stores data in memory (RAM), but with persist you can set the storage level to disk as well, as sketched below.
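A minimal sketch of that pattern in PySpark (the table names and the business logic are placeholders, not from the question):

from pyspark.storagelevel import StorageLevel

# Hypothetical source tables
a = spark.table("table_a")
b = spark.table("table_b")
c = spark.table("table_c")

# Join and apply some business logic (maps and filters)
filtered_data = (
    a.join(b, "id")
     .join(c, "id")
     .filter("amount > 0")
)

# filtered_data.cache() would use the default storage level;
# persist() lets you pick one explicitly, e.g. keep it in RAM and spill to disk:
filtered_data.persist(StorageLevel.MEMORY_AND_DISK)

filtered_data.count()  # an action, so the cached data is actually materialized

After this, the downstream DataFrames that reference filtered_data reuse the cached result instead of re-running the joins and filters.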
I would like to know how much Databricks can scale if we have petabytes of data?
It's a distributed environment, so what you need to do is add more executors, and maybe increase the memory and CPU configuration.
how can I know where the data is stored at any point in time?
If you haven't created a table or view, it's stored in memory.
What is the underlying operating system running Azure Databricks?
It uses a Linux operating system,
specifically Linux-4.15.0-1050-azure-x86_64-with-Ubuntu-16.04-xenial
You can run the following to check:
import platform
print(platform.platform())
I am trying to understand how Google Dataproc works with GCS. I am using PySpark on Dataproc, and data is read from and written to GCS, but I am unable to figure out the best machine types for my use case. Questions:
1) Does Spark on Dataproc copy data to local disk? E.g. if I am processing 2 TB of data, is it OK to use a 4-node cluster with 200 GB HDD each, or should I at least provide enough disk to hold the input data?
2) If the local disk is not used at all, is it OK to use high-memory, low-disk instances?
3) If local disk is used, which instance type is good for processing 2 TB of data with the minimum possible number of nodes? I.e., is it good to use SSD?
Spark will read data directly into memory and/or onto disk depending on whether you use RDDs or DataFrames. You should have at least enough disk to hold all the data. If you are performing joins, the amount of disk necessary grows to handle shuffle spill.
This equation changes if you discard significant amount of data through filtering.
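As a rough back-of-the-envelope sketch for the 2 TB example in the question (the shuffle and headroom factors below are my own assumptions, not measured values):

input_tb = 2.0          # total input data, from the question
workers = 4             # proposed number of worker nodes
shuffle_factor = 1.5    # assumed extra space for shuffle spill during joins
headroom = 1.2          # assumed margin for OS, logs and temp files

per_worker_gb = input_tb * 1024 / workers * shuffle_factor * headroom
print(f"~{per_worker_gb:.0f} GB of disk per worker")  # ~922 GB, far more than 200 GB

So a 4-node cluster with 200 GB HDD each would be undersized unless you filter out most of the data early.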
Whether you use pd-standard, pd-ssd, or local-ssd comes down to cost and if your application is CPU or IO bound.
Disk IOPS is proportional to disk size, so very small disks are inadvisable. Keep in mind that disk (relative to CPU) is cheap.
Same advice goes for network IO: more CPUs = more bandwidth.
Finally, default Dataproc settings are a reasonable place to start experimenting and tweaking your settings.
Source: https://cloud.google.com/compute/docs/disks/performance
We have a fact table (30 columns) stored in parquet files on S3; we also created a table on these files and cache it afterwards. The table is created using this code snippet:
val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
%sql CACHE TABLE f_traffic
We run many different calculations on this table (files) and are looking for the best way to cache the data for faster access in subsequent calculations. The problem is that, for some reason, it's faster to read the data from parquet and do the calculation than to access it from memory. One important note is that we do not utilize every column: usually around 6-7 columns per calculation, and different columns each time.
Is there a way to cache this table in memory so we can access it faster than reading it from parquet?
It sounds like you're running on Databricks, so your query might be automatically benefitting from the Databricks IO Cache. From the Databricks docs:
The Databricks IO cache accelerates data reads by creating copies of remote files in nodes’ local storage using fast intermediate data format. The data is cached automatically whenever a file has to be fetched from a remote location. Successive reads of the same data are then executed locally, which results in significantly improved reading speed.
The Databricks IO cache supports reading Parquet files from DBFS, Amazon S3, HDFS, Azure Blob Storage, and Azure Data Lake. It does not support other storage formats such as CSV, JSON, and ORC.
The Databricks IO Cache is supported on Databricks Runtime 3.3 or newer. Whether it is enabled by default depends on the instance type that you choose for the workers on your cluster: currently it is enabled automatically for Azure Ls instances and AWS i3 instances (see the AWS and Azure versions of the Databricks documentation for full details).
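If it is not enabled by default for your instance type, you can, as far as I know, turn it on for the current session with a Spark conf setting (check the Databricks docs for your runtime before relying on this):

# Enable the Databricks IO cache for the current Spark session
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Verify the current setting
print(spark.conf.get("spark.databricks.io.cache.enabled"))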
If this Databricks IO cache is taking effect then explicitly using Spark's RDD cache with an untransformed base table may harm query performance because it will be storing a second redundant copy of the data (and paying a roundtrip decoding and encoding in order to do so).
Explicit caching can still make sense if you're caching a transformed dataset, e.g. after filtering to significantly reduce the data volume, but if you only want to cache a large and untransformed base relation then I'd personally recommend relying on the Databricks IO cache and avoiding Spark's built-in RDD cache.
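For instance, a hedged sketch of caching only a filtered, narrow projection of the f_traffic table from the question (the column names and filter are hypothetical):

# Cache only a filtered, narrow slice rather than the whole base table
slim = (
    spark.table("f_traffic")
         .select("site_id", "visit_date", "page_views")   # hypothetical columns
         .where("visit_date >= '2018-01-01'")             # hypothetical filter
)
slim.cache()
slim.count()   # action that materializes the cache

# Subsequent calculations reuse the in-memory data
slim.groupBy("site_id").sum("page_views").show()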
See the full Databricks IO cache documentation for more details, including information on cache warming, monitoring, and a comparison of RDD and Databricks IO caching.
To materialize the dataframe in the cache, you should do:
val factTraffic = spark.read.parquet(factTrafficData)
factTraffic.write.mode(SaveMode.Overwrite).saveAsTable("f_traffic")
val df_factTraffic = spark.table("f_traffic").cache
df_factTraffic.rdd.count
// now df_factTraffic is materialized in memory
See also https://stackoverflow.com/a/42719358/1138523
But it's questionable whether this makes sense at all because parquet is a columnar file format (meaning that projection is very efficient), and if you need different columns for each query the caching will not help you.
I currently have a spark cluster set up with 4 worker nodes and 2 head nodes. I have a 1.5 GB CSV file in blob storage that I can access from one of the head nodes. I find that it takes quite a while to load this data and cache it using PySpark. Is there a way to load the data faster?
One thought I had was loading the data, then partitioning it into k (number of nodes) segments and saving them back to blob as parquet files. This way, I can load different parts of the data set in parallel and then union them... However, I am unsure whether all the data is just loaded on the head node and only distributed to the other machines when computation occurs. If the latter is true, then the partitioning would be useless.
Help would be much appreciated. Thank you.
Generally, you will want to have smaller file sizes on blob storage so that way you can transfer data between blob storage to compute in parallel so you have faster transfer rates. A good rule of thumb is to have a file size between 64MB - 256MB; a good reference is Vida Ha's Data Storage Tips for Optimal Spark Performance.
Your call-out for reading the file and then saving it back as Parquet (with the default snappy codec compression) is a good idea. Parquet is natively used by Spark and is often faster to query against. The only tweak would be to partition based on target file size rather than on the number of worker nodes. The data is loaded onto the worker nodes, and partitioning is helpful because more tasks are created to read more files.
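A hedged sketch of that one-off conversion (the container/path names are placeholders, and the partition count of 24 just targets roughly 64 MB files for a 1.5 GB dataset):

csv_path = "wasbs://mycontainer@mystorageacct.blob.core.windows.net/mydata.csv"          # placeholder
parquet_path = "wasbs://mycontainer@mystorageacct.blob.core.windows.net/mydata_parquet"  # placeholder

df = spark.read.csv(csv_path, header=True, inferSchema=True)

# Aim for ~64 MB files: roughly 1.5 GB / 64 MB ≈ 24 partitions
df.repartition(24).write.mode("overwrite").parquet(parquet_path)

# Subsequent runs read the Parquet files in parallel across the worker nodes
df = spark.read.parquet(parquet_path)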