What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL? - apache-spark

I am using Spark SQL actually hiveContext.sql() which uses group by queries and I am running into OOM issues. So thinking of increasing value of spark.sql.shuffle.partitions from 200 default to 1000 but it is not helping.
I believe this partition will share data shuffle load so more the partitions less data to hold. I am new to Spark. I am using Spark 1.4.0 and I have around 1TB of uncompressed data to process using hiveContext.sql() group by queries.

If you're running out of memory on the shuffle, try setting spark.sql.shuffle.partitions to 2001.
Spark uses a different data structure for shuffle book-keeping when the number of partitions is greater than 2000:
private[spark] object MapStatus {
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
if (uncompressedSizes.length > 2000) {
HighlyCompressedMapStatus(loc, uncompressedSizes)
} else {
new CompressedMapStatus(loc, uncompressedSizes)
}
}
...
I really wish they would let you configure this independently.
By the way, I found this information in a Cloudera slide deck.

OK so I think your issue is more general. It's not specific to Spark SQL, it's a general problem with Spark where it ignores the number of partitions you tell it when the files are few. Spark seems to have the same number of partitions as the number of files on HDFS, unless you call repartition. So calling repartition ought to work, but has the caveat of causing a shuffle somewhat unnecessarily.
I raised this question a while ago and have still yet to get a good answer :(
Spark: increase number of partitions without causing a shuffle?

It's actually depends on your data and your query, if Spark must load 1Tb, there is something wrong on your design.
Use the superbe web UI to see the DAG, mean how Spark is translating your SQL query to jobs/stages and tasks.
Useful metrics are "Input" and "Shuffle".
Partition your data (Hive / directory layout like /year=X/month=X)
Use spark CLUSTER BY feature, to work per data partition
Use ORC / Parquet file format because they provide "Push-down filter", useless data is not loaded to Spark
Analyze Spark History to see how Spark is reading data
Also, OOM could happen on your driver?
-> this is another issue, the driver will collect at the end the data you want. If you ask too much data, the driver will OOM, try limiting your query, or write another table (Spark syntax CREATE TABLE ...AS).

I came across this post from Cloudera about Hive Partitioning. Check out the "Pointers" section talking about number of partitions and number of files in each partition resulting in overloading the name node, which might cause OOM.

Related

PySpark Parallelism

I am new to spark and am trying to implement reading data from a parquet file and then after some transformation returning it to web ui as a paginated way. Everything works no issue there.
So now I want to improve the performance of my application, after some google and stack search I found out about pyspark parallelism.
What I know is that :
pyspark parallelism works by default and It creates a parallel process based on the number of cores the system has.
Also for this to work data should be partitioned.
Please correct me if my understanding is not right.
Questions/doubt:
I am reading data from one parquet file, so my data is not partitioned and if I use the .repartition() method on my dataframe that is expensive. so how should I use PySpark Parallelism here ?
Also I could not find any simple implementation of pyspark parallelism, which could explain how to use it.
In spark cluster 1 core reads one partition so if you are on multinode spark cluster
then you need to leave some meory for existing system manager like Yarn etc.
https://spoddutur.github.io/spark-notes/distribution_of_executors_cores_and_memory_for_spark_application.html
you can use reparation and specify number of partitions
df.repartition(n)
where n is the number of partition. Repartition is for parlelleism, it will be ess expensive then process your single file without any partition.

Spark SQL data storage life cycle

I recently had a issue with with one of my spark jobs, where I was reading a hive table having several billion records, that resulted in job failure due to high disk utilization, But after adding AWS EBS volume, the job ran without any issues. Although it resolved the issue, I have few doubts, I tried doing some research but couldn't find any clear answers. So my question is?
when a spark SQL reads a hive table, where the data is stored for processing initially and what is the entire life cycle of data in terms of its storage , if I didn't explicitly specify anything? And How adding EBS volumes solves the issue?
Spark will read the data, if it does not fit in memory, it will spill it out on disk.
A few things to note:
Data in memory is compressed, from what I read, you gain about 20% (e.g. a 100MB file will take only 80MB of memory).
Ingestion will start as soon as you read(), it is not part of the DAG, you can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In latest versions of Spark, you can push down the filter (for example if you filter right after the ingestion, Spark will know and optimize the ingestion), I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be seen/accessible from the executors, so if you have EBS volumes, make sure they are seen/accessible from the cluster where the executors/workers are running, vs. the node where the driver is running.
Initially the data is in table location in HDFS/S3/etc. Spark spills data on local storage if it does not fit in memory.
Read Apache Spark FAQ
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
Whenever spark reads data from hive tables, it stores it in RDD. One point i want to make clear here is hive is just a warehouse so it is like a layer which is above HDFS, when spark interacts with hive , hive provides the spark the location where the hdfs loaction exists.
Thus, Spark reads a file from HDFS, it creates a single partition for a single input split. Input split is set by the Hadoop (whatever the InputFormat used to read this file. ex: if you use textFile() it would be TextInputFormat in Hadoop, which would return you a single partition for a single block of HDFS (note:the split between partitions would be done on line split, not the exact block split), unless you have a compressed file format like Avro/parquet.
If you manually add rdd.repartition(x) it would perform a shuffle of the data from N partititons you have in rdd to x partitions you want to have, partitioning would be done on round robin basis.
If you have a 10GB uncompressed text file stored on HDFS, then with the default HDFS block size setting (256MB) it would be stored in 40blocks, which means that the RDD you read from this file would have 40partitions. When you call repartition(1000) your RDD would be marked as to be repartitioned, but in fact it would be shuffled to 1000 partitions only when you will execute an action on top of this RDD (lazy execution concept)
Now its all up to spark that how it will process the data as Spark is doing lazy evaluation , before doing the processing, spark prepare a DAG for optimal processing. One more point spark need configuration for driver memory, no of cores , no of executors etc and if the configuration is inappropriate the job will fail.
Once it prepare the DAG , then it start processing the data. So it divide your job into stages and stages into tasks. Each task will further use specific executors, shuffle , partitioning. So in your case when you do processing of bilions of records may be your configuration is not adequate for the processing. One more point when we say spark load the data in RDD/Dataframe , its managed by spark, there are option to keep the data in memory/disk/memory only etc ref -storage_spark.
Briefly,
Hive-->HDFS--->SPARK>>RDD(Storage depends as its a lazy evaluation).
you may refer the following link : Spark RDD - is partition(s) always in RAM?

Write to databricks table from spark worker node

Can someone let me know if I can write to a databricks table directly from a worker node in Spark ? Please provide the code snippets. I am partitioning big data around 100 million records and hence it is failing due to memory issues when I issue a collect statement to get the data back into driver node.
In general you are always writing from a Worker Node to a Databricks table. The collect should be avoided at all costs as you see - Driver OOM.
To avoid OOM issues you should do like most do, repartition your records so they fit inside the allowable partition sizes limit - 2GB or now 4GB with newer Spark releases, on your Worker Nodes and all well be fine. E.g.:
val repartitionedWikiDF = wikiDF.repartition(16)
val targetPath = f"{workingDir}/wiki.parquet"
repartitionedwikiDF.write.mode("OVERWRITE").parquet(targetPath)
display(dbutils.fs.ls(targetPath))
You can also perform df.repartition(col, N). There is also range partitioning.
Best approach is like this imo:
import org.apache.spark.sql.functions._
df.repartition(col("country"))
.write.partitionBy("country")
.parquet("repartitionedPartitionedBy.parquet")

spark partitionBy out of memory failures

I have a Spark 2.2 job written in pyspark that's trying to read in 300BT of Parquet data in a hive table, run it through a python udf, and then write it out.
The input is partitioned on about five keys and results in about 250k partitions.
I then want to write it out using the same partition scheme using the .partitionBy clause for the dataframe.
When I don't use a partitionBy clause the data writes out and the job does finish eventually. However with the partitionBy clause I continuously see out of memory failures on the spark UI.
Upon further investigation the source parquet data is about 800MB on disk (compressed using snappy), and each node has about 50G of memory available to it.
Examining the spark UI I see that the last step before writing out is doing a sort. I believe this sort is the cause of all my issues.
When reading in a dataframe of partitioned data, is there any way to preserve knowledge of this partitioning so spark doesn't run an unnecessary sort before writing it out?
I'm trying to avoid a shuffle step here by repartitioning that could equally result in further delays of this.
Ultimately I can rewrite to read one partition at a time, but I think that's not a good solution and that spark should already be able to handle this use case.
I'm running with about 1500 executors across 150 nodes on ec2 r3.8xlarge.
I've tried smaller executor configs and larger ones and always hit the same out of memory issues.

spark datasax cassandra connector slow to read from heavy cassandra table

I am new to Spark/ Spark Cassandra Connector. We are trying spark for the first time in our team and we are using spark cassandra connector to connect to cassandra Database.
I wrote a query which is using a heavy table of the database and I saw that Spark Task didn't start until the query to the table fetched all the records.
It is taking more than 3 hours just to fetch all the records from the database.
To get the data from the DB we use.
CassandraJavaUtil.javaFunctions(sparkContextManager.getJavaSparkContext(SOURCE).sc())
.cassandraTable(keyspaceName, tableName);
Is there a way to tell spark to start working even if all the data didn't finish to download ?
Is there an option to tell spark-cassandra-connector to use more threads for the fetch ?
thanks,
kokou.
If you look at the Spark UI, how many partitions is your table scan creating? I just did something like this and I found that Spark was creating too many partitions for the scan and it was taking much longer as a result. The way I decreased the time on my job was by setting the configuration parameter spark.cassandra.input.split.size_in_mb to a value higher than the default. In my case it took a 20 minute job down to about four minutes. There are also a couple more Cassandra read specific Spark variables that you can set found here.
These stackoverflow questions are what I referenced originally, I hope they help you out as well.
Iterate large Cassandra table in small chunks
Set number of tasks on Cassandra table scan
EDIT:
After doing some performance testing with regards to fiddling with some Spark configuration parameters, I found that Spark was creating far too many table partitions when I wasn't giving the Spark executors enough memory. In my case, upping the memory by a gigabyte was enough to render the input split size parameter unnecessary. If you can't give the executors more memory, you may still need to set spark.cassandra.input.split.size_in_mbhigher as a form of workaround.

Resources