How to tell if spark session will be able to hold data size in dataframe? - apache-spark

Intend to read data from an Oracle DB with pyspark (running in local mode) and store locally as parquet. Is there a way to tell whether a spark session dataframe will be able to hold the amount of data from the query (which will be the whole table, ie. select * from mytable)? Are there common solutions for if the data would not be able to fit in a dataframe?
* Saw a similar question here, but was a little confused by the discussion in the comments

As you are running on local, So I assume it is not on a cluster. You can not say exactly how much memory would require? However, you can go close to it. You check your respective table size that how much disk space it's using. Suppose you mytable has occupied 1GB of Hard disk then spark would be required RAM more than that, because Spark's engine required some memory for its own processing. Try to have 2GB extra, for safer side than actual table size.
To check you table size in Oracle, You can use below query:
select segment_name,segment_type,bytes/1024/1024 MB
from dba_segments
where segment_type='TABLE' and segment_name='<yourtablename>';
It will give you a result in MB.
To configure JVM related parameter in Apache-Spark you can check this.

It doesn't matter how big the table is if you are running spark in a distributed manner. You would need to worry about the memory if:-
You are reading the data in the driver and then doing a broadcast.
Caching the dataframe for some computation.
Usually for your spark application a DAG gets generated and if you are using JDBC source then the workers will read the data directly and use the shuffle space and off-heap to disk for memory intensive computation.

Related

Spark SQL data storage life cycle

I recently had a issue with with one of my spark jobs, where I was reading a hive table having several billion records, that resulted in job failure due to high disk utilization, But after adding AWS EBS volume, the job ran without any issues. Although it resolved the issue, I have few doubts, I tried doing some research but couldn't find any clear answers. So my question is?
when a spark SQL reads a hive table, where the data is stored for processing initially and what is the entire life cycle of data in terms of its storage , if I didn't explicitly specify anything? And How adding EBS volumes solves the issue?
Spark will read the data, if it does not fit in memory, it will spill it out on disk.
A few things to note:
Data in memory is compressed, from what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
Ingestion will start as soon as you read(), it is not part of the DAG, you can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In latest versions of Spark, you can push down the filter (for example if you filter right after the ingestion, Spark will know and optimize the ingestion), I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be seen/accessible from the executors, so if you have EBS volumes, make sure they are seen/accessible from the cluster where the executors/workers are running, vs. the node where the driver is running.
Initially the data is in table location in HDFS/S3/etc. Spark spills data on local storage if it does not fit in memory.
Read Apache Spark FAQ
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Whenever spark reads data from hive tables, it stores it in RDD. One point i want to make clear here is hive is just a warehouse so it is like a layer which is above HDFS, when spark interacts with hive , hive provides the spark the location where the hdfs loaction exists.
Thus, Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop (whatever the InputFormat used to read this file. ex: if you use textFile() it would be TextInputFormat in Hadoop, which would return you a single partition for a single block of HDFS (note:the split between partitions would be done on line split, not the exact block split), unless you have a compressed file format like Avro/parquet.
If you manually add rdd.repartition(x) it would perform a shuffle of the data from N partititons you have in rdd to x partitions you want to have, partitioning would be done on round robin basis.
If you have a 10GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (256MB) it would be stored in 40blocks, which means that the RDD you read from this file would have 40partitions. When you call repartition(1000) your RDD would be marked as to be repartitioned, but in fact it would be shuffled to 1000 partitions only when you will execute an action on top of this RDD (lazy execution concept)
Now its all up to spark that how it will process the data as Spark is doing lazy evaluation , before doing the processing, spark prepare a DAG for optimal processing. One more point spark need configuration for driver memory, no of cores , no of executors etc and if the configuration is inappropriate the job will fail.
Once it prepare the DAG , then it start processing the data. So it divide your job into stages and stages into tasks. Each task will further use specific executors, shuffle , partitioning. So in your case when you do processing of bilions of records may be your configuration is not adequate for the processing. One more point when we say spark load the data in RDD/Dataframe , its managed by spark, there are option to keep the data in memory/disk/memory only etc ref -storage_spark.
Briefly,
Hive-->HDFS--->SPARK>>RDD(Storage depends as its a lazy evaluation).
you may refer the following link : Spark RDD - is partition(s) always in RAM?

Memory Management Pyspark

1.) I understand that "Spark's operators spills data to disk if it does not fit memory allowing it to run well on any sized data".
If this is true, why do we ever get OOM (Out of Memory) errors?
2.) Increasing the no. of executor cores increases parallelism. Would that also increase the chances of OOM, because the same memory is now divided into smaller parts for each core?
3.) Spark is much more susceptible to OOM because it performs operations in memory as compared to Hive, which repeatedly reads, writes into disk. Is that correct?
There is one angle that you need to consider there. You may get memory leaks if the data is not properly distributed. That means that you need to distribute your data evenly (if possible) on the Tasks so that you reduce shuffling as much as possible and make those Tasks to manage their own data. So if you need to perform a join, if data is distributed randomly, every Task (and therefore executor) will have to:
See what data they have
Send data to other executors (and tasks) to provide the same keys they need
Request the data that is needed by that task to the others
All that data exchange may cause network bottlenecks if you have a large dataset and also will make every Task to hold their data in memory plus whatever has been sent and temporary objects. All of those will blow up memory.
So to prevent that situation you can:
Load the data already repartitioned. By that I mean, if you are loading from a DB, try Spark stride as defined here. Please refer to the partitionColumn, lowerBound, upperBound attributes. That way you will create a number of partitions on the dataframe that will set the data on different tasks based on the criteria you need. If you are going to use a join of two dataframes, try similar approach on them so that partitions are similar (for not to say same) and that will prevent shuffling over network.
When you define partitions, try to make those values as evenly distributed among tasks as possible
The size of each partition should fit on memory. Although there could be spill to disk, that would slow down performance
If you don't have a column that make the data evenly distributed, try to create one that would have n number of different values, depending on the n number of tasks that you have
If you are reading from a csv, that would make it harder to create partitions, but still it's possible. You can either split the data (csv) on multiple files and create multiple dataframes (performing a union after they are loaded) or you can read that big csv and apply a repartition on the column you need. That will create shuffling as well, but it will be done once if you cache the dataframe already repartitioned
Reading from parquet it's possible that you may have multiple files but if they are not evenly distributed (because the previous process that generated didn't do it well) you may end up on OOM errors. To prevent that situation, you can load and apply repartition on the dataframe too
Or another trick valid for csv, parquet files, orc, etc. is to create a Hive table on top of that and run a query from Spark running a distribute by clause on the data, so that you can make Hive to redistribute, instead of Spark
To your question about Hive and Spark, I think you are right up to some point. Depending on the execute engine that Hive uses in your case (map/reduce, Tez, Hive on Spark, LLAP) you can have different behaviours. With map/reduce, as they are mostly disk operations, the chance to have a OOM is much lower than on Spark. Actually from Memory point of view, map/reduce is not that affected because of a skewed data distribution. But (IMHO) your goal should be to find always the best data distribution for the Spark job you are running and that will prevent that problem
Another consideration is if you are testing in a dev environment that doesn't have same data as in a prod environment. I suppose the data distribution should be similar although volumes may differ a lot (I am talking from experience ;)). In that case, when you assign Spark tuning parameters on the spark-submit command, they may be different in prod. So you need to invest some time on finding the best approach on dev and fine tune in prod
Huge majority of OOM in Spark are on the driver, not executors. This is usually a result of running .collect or similar actions on a dataset that won't fit in the driver memory.
Spark does a lot of work under the hood to parallelize the work, when using structured APIs (in contrast to RDDs) the chances of causing OOM on executor are really slim. Some combinations of cluster configuration and jobs can cause memory pressure that will impact performance and cause lots of garbage collection to happen so you need to address it, however spark should be able to handle low memory without explicit exception.
Not really - as above, Spark should be able to recover from memory issues when using structured APIs, however it may need intervention if you see garbage collection and performance impact.

Where does Spark saves retrieved data on Azure Databricks?

I would like to understand the difference between the RAM and storage in Azure databricks.
Suppose I am reading csv data from the Azure data lake (ADLS Gen 2) as follows:
df = spark.read.csv("path to the csv file").collect()
I am aware that the read method in spark is a Transformation method in spark. And this is not going to be run immediately. However, now if I perform an Action using the collect() method, I would assume that the data is now actually been read from the data lake by Spark and loaded into RAM or Disk. First, I would like to know, where is the data stored. Is it in RAM or in Disk. And, if the data is stored in RAM, then what is cache used for?; and if the data is retrieved and stored on disk, then what does persist do? I am aware that cache stores the data in memory for late use, and that if I have very large amount of data, I can use persist to store the data into a disk.
I would like to know, how much can databricks scale if we have peta bytes of data?
How much does the RAM and Disk differ in size?
how can I know where the data is stored at any point in time?
What is the underlying operating system running Azure Databricks?
Please note that I am newbie to Azure Databricks and Spark.
I would like to get some recommendation on the best practices when using Spark.
Your help is much appreciated!!
First, I would like to know, where is the data stored.
When you run any action (i.e. collect or others) Data is collected from executors nodes to driver node and stored in ram (memory)
And, if the data is stored in RAM, then what is cache used for
Spark has lazy evaluation what does that mean is until you call an action it doesn't do anything, and once you call it, it creates a DAG and then executed that DAF.
Let's understand it by an example. let's consider you have three tables Table A, Table B and Table C. You have joined this table and apply some business logic (maps and filters), let's call this dataframe filtered_data. and now you are using this DataFrame in let's say 5 different places (another dataframes) for either lookup or join and other business reason.
if you won't persist(cache) your filterd_data dataframe, everytime it will be referenced, it will again go through joins and other business logic. So it's advisable to persist(cache) dataframe if you are going to use that into multiple places.
By Default Cache stored data in memory (RAM) but you can set the storage level to disk
would like to know, how much can databricks scale if we have petabytes of data?
It's a distributed environment, so what you need to do is add more executors. and may be need to increase the memory and CPU configuration,
how can I know where the data is stored at any point in time?
if you haven't created a table or view, it's stored in memory.
What is the underlying operating system running Azure Databricks?
it uses linux operation system.
specifically Linux-4.15.0-1050-azure-x86_64-with-Ubuntu-16.04-xenial
you can run the following command to know.
import platform
println(platform.platform())

Spark partitionBy on write.save brings all data to driver?

So basically I have a python spark job that reads some simple json files, and then tries to write them as orc files partitioned by one field. The partition is not very balanced, as some keys are really big, and other really small.
I had memory issues when doing something like this:
events.write.mode('append').partitionBy("type").save("s3n://mybucket/tofolder"), format="orc")
Adding memory to the executors didn't seem to have any effect, but I solved it increasing the driver memory. Does this mean that all the data is being send to the driver for it to write? Can't each executor write its own partition? Im using Spark 2.0.1
Even if you partition dataset and then write it on storage there is no possibility that records are sent to the driver. You should look at logs of memory issues (if they occur on driver on or executors) to figure out exact reason of failing.
Probably your driver has too low memory to handle this write because of previous computations. Try decreasing spark.ui.retainedJobs and spark.ui.retainedStages to save memory on old jobs and stages metadata. If this won't help, connect to driver with jvisualvm to find job/stage than consumes large heap fragments and try to optimize.

Spark Poor Query performance: How to improve query performance on Spark?

There is a lots of hype over how good and fast spark is for processing large amount of data.
So, we wanted to investigate the query performance of spark.
Machine configuration:
4 worker nodes, r3.2xlarge instances
Data
Our input data is stored in 12 splitted gzip files in S3.
What we did
We created a table using Spark SQL for the aforementioned input data set.
Then we cached the table. We found from Spark UI that Spark did not load all data into memory, rather it loaded some data into memory and some in disk.
UPDATE: We also tested with parquet files. In this case, all data was loaded in memory. Then we execute the same queries as below. Performance is still not good enough.
Query Performance
Let's assume the table name is Fact_data. We executed the following query on that cached table:
select date_key,sum(value) from Fact_data where date_key between 201401 and 201412 group by date_key order by 1
The query takes 1268.93sec to complete. This is huge compared to the execution time in Redshift (dc1.large cluster) which takes only 9.23 sec.
I also tested some other queries e.g, count, join etc. Spark is giving me really poor performance for each of the queries
Questions
Could you suggest anything that might improve the performance of the query? May be I am missing some optimization techniques. Any suggestion will be highly appreciated.
How to compel Spark to load all data in memory? Currently it stored some data in memory and some in disk.
Is there any performance difference in using Dataframe and SQL table? I think, no. Because under the hood they are using the same optimizer.
I suggest you use Parquet as your file format instead of gzipped files.
you can try increasing your --num-executors, --executor-memory and --executor-cores
if you're using YARN and your instance type is r3.2xlarge, make sure you container size yarn.nodemanager.resource.memory-mb is larger than your --executor-memory (maybe around 55G) you also need to set yarn.nodemanager.resource.cpu-vcores to 15.

Resources