Apache Spark SQL results not spilling to disk, exhausting Java heap space

According to the Spark FAQ:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
I'm querying a big table with 50 million rows. The initial data download won't fit in RAM, so Spark should spill to disk, right? I then filter out a small number of entries, and those will fit in RAM.
SPARK_CLASSPATH=postgresql-9.4.1208.jre6-2.jar ./bin/pyspark --num-executors 4
url = "jdbc:postgresql://localhost:5432/mydatabase?user=postgres"
df = sqlContext \
    .read \
    .format("jdbc") \
    .option("url", url) \
    .option("dbtable", "accounts") \
    .option("partitionColumn", "id") \
    .option("numPartitions", 10) \
    .option("lowerBound", 1) \
    .option("upperBound", 50000000) \
    .option("password", "password") \
    .load()

# get the small number of accounts whose names contain "taco"
results = df.map(lambda row: row["name"]).filter(lambda name: "taco" in name).collect()
I see some queries run on the Postgres server and finish, and then pyspark crashes because the Java backend crashes with java.lang.OutOfMemoryError: Java heap space.
Is there something else I need to do?
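For comparison, the same filter can be written purely with DataFrame operations, which keeps the predicate on the executors and, depending on the Spark version, can even be pushed down to Postgres as a LIKE; only the matching names then reach the driver. A minimal sketch, assuming a modern SparkSession rather than the sqlContext above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Same JDBC source as in the question (connection details carried over, not verified here).
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydatabase?user=postgres")
      .option("dbtable", "accounts")
      .option("partitionColumn", "id")
      .option("numPartitions", 10)
      .option("lowerBound", 1)
      .option("upperBound", 50000000)
      .option("password", "password")
      .load())

# Filter on the executors; only the matching names are collected to the driver.
taco_names = df.filter(col("name").contains("taco")).select("name")
results = [row["name"] for row in taco_names.collect()]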

Related

PySpark + Dataproc - Can't get more than X executors and X GB RAM per executor

I use a Dataproc cluster to lemmatize strings using Spark NLP.
My cluster has 5 worker nodes + 1 master; each worker node has 16 CPUs and 64GB RAM.
Doing some maths, my ideal Spark config is:
spark.executor.instances = 14
spark.executor.cores = 5
spark.executor.memory = 19G
With that config, I maximize the usage of the machines and leave enough room for the ApplicationMaster and off-heap memory.
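For context, the arithmetic behind those numbers presumably goes roughly like this (a reconstruction following the usual YARN sizing rules of thumb, not taken from the original post):

cores:  5 workers x 16 cores = 80, minus 1 core per node for OS/daemons = 75
        75 / 5 cores per executor = 15 executors, minus 1 for the ApplicationMaster = 14
memory: ~63GB usable per node / 3 executors per node = ~21GB per executor
        minus ~7% memoryOverhead (the 1361m below) => ~19G of heap per executor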
However when creating the SparkSession with
spark = SparkSession \
    .builder \
    .appName('perf-test-extract-skills') \
    .config("spark.default.parallelism", "140") \
    .config("spark.driver.maxResultSize", "19G") \
    .config("spark.executor.memoryOverhead", "1361m") \
    .config("spark.driver.memoryOverhead", "1361m") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.dynamicAllocation.enabled", "false") \
    .config("spark.executor.instances", "14") \
    .config("spark.executor.cores", "5") \
    .config("spark.executor.memory", "19G") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.26.0,com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0') \
    .getOrCreate()
I can only get 10 workers with 10GiB RAM each, as shown in the screenshot below:
I tried setting yarn.nodemanager.resource.memory-mb to 64000 so that YARN can manage nodes with up to 64GB RAM, but it's still the same: I can't go beyond 10 workers and 10GB RAM.
Also, when I check the values in the "Environment" tab, everything looks OK and the values are set according to my SparkSession config, meaning that the master made the request but it cannot be fulfilled?
Is there something I forgot, or are my maths wrong?
EDIT: I managed to increase the number of executors with the new SparkSession I shared above. I can now get 14 executors, but each executor is still using 10GB RAM when it should use 19.
Here is one of my executors. Is it using 19GB of RAM? I don't really understand the meaning of the different "memory" columns.
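One likely explanation (my reading, based on how Spark's unified memory manager reports memory, not something stated in the question): the memory figure in the executors tab is not the full 19G heap but only the unified storage + execution region, roughly (heap - 300MB reserved) * spark.memory.fraction. With the default fraction of 0.6 that works out to:

(19 * 1024 MB - 300 MB) * 0.6 ≈ 11,494 MB ≈ 11GB

which is in the same ballpark as the ~10GB shown in the UI, even though the JVM heap really is 19G.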

Spark OutOfMemory Error is remedied by repartition

I have a highly compressed, non-splittable gzip archive, ~100MB in size with ~10 million records. I'm trying to read it into a Spark dataframe and then write it to parquet. I have one driver and one executor (16GB RAM, 8 vCPU; in fact, it's a Glue job with 2 G1.X nodes).
Reading the gzipped CSV and writing parquet directly leads to OOM:
df = spark.read.option("sep", "|") \
.option("header", "true") \
.option("quote", "\"") \
.option("mode", "FAILFAST") \
.csv("path.gz")
df.write
.format("parquet") \
.mode("Overwrite") \
.save("path")
And I can understand this: the whole DataFrame is loaded into a single executor's memory, it doesn't fit, and OOM appears. But if I call .repartition(8) (same hardware setup) before the write, then everything is fine and no OOM occurs. I don't understand why this happens; don't we still have to load the whole DataFrame into executor memory anyway?
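For reference, the variant that avoids the OOM is just the read above followed by a repartition before the write (a sketch of the change described in the question):

df = spark.read.option("sep", "|") \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("mode", "FAILFAST") \
    .csv("path.gz")

# The gzip file arrives as a single partition; spreading it over 8
# partitions means each write task handles only ~1/8 of the rows.
df.repartition(8) \
    .write \
    .format("parquet") \
    .mode("Overwrite") \
    .save("path")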

Spark process data more than available memory

I'm working with Apache Spark version 3.1.2 deployed on a cluster of 4 nodes, each having 24GB of memory and 8 cores, i.e. ~96GB of distributed memory. I want to read in and process about ~120GB of compressed (gzip) JSON data.
The following is the generic code flow of my processing:
data = spark.read.option('multiline', True).json(data_path, schema=schema)
result = (
    data.filter(data['col_1']['col_1_1'].isNotNull() | data['col2'].isNotNull())
    .rdd
    .map(parse_json_and_select_columns_of_interest)
    .toDF(schema_of_interest)
    .filter(data['col_x'].isin(broadcast_filter_list))
    .rdd
    .map(lambda x: (x['col_key'], x.asDict()))
    .groupByKey()
    .mapValues(compute_and_add_extra_columns)
    .flatMap(...)
    .reduceByKey(lambda a, b: a + b)  # <--- OOM
    .sortByKey()
    .map(append_columns_based_on_key)
    .saveAsTextFile(...)
)
I have tried the following executor settings:
# Tiny executors
--num-executors 32
--executor-cores 1
--executor-memory 2g
# Fat executors
--num-executors 4
--executor-cores 8
--executor-memory 20g
However, for all of these settings, I keep getting out-of-memory errors, especially on .reduceByKey(lambda a, b: a + b). My questions are: (1) regardless of performance, can I change my code flow to avoid getting OOM? Or (2) should I add more memory to my cluster? (I'd rather avoid this, since it may not be a sustainable solution in the long run.)
Thanks
I would actually guess it's the sortByKey causing the OOM, and I would suggest increasing the number of partitions you are using by passing an argument: sortByKey(numPartitions=X).
Also, I would suggest using the DataFrame API where possible.
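Concretely, the first suggestion is just a matter of passing numPartitions to the wide transformations. A toy, self-contained sketch (the tiny RDD and the partition count of 8 are placeholders; in the real job you would use a much larger value tuned to the data volume and core count):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Stand-in for the real key/value RDD produced earlier in the pipeline.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])

num_partitions = 8  # example value only; tune for your data and cluster
result = (
    pairs
    .reduceByKey(lambda a, b: a + b, numPartitions=num_partitions)
    .sortByKey(numPartitions=num_partitions)
)
print(result.collect())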

Performance issues in loading data from Databricks to Azure SQL

I am trying to load 1 million records from a Delta table in Databricks to an Azure SQL database using the recently released connector by Microsoft that supports the Python API and Spark 3.0.
The performance does not really look awesome to me. It takes 19 minutes to load 1 million records. Below is the code I am using. Do you think I am missing something here?
Configurations:
8 worker nodes with 28GB memory and 8 cores each.
Azure SQL database is a 4 vCore Gen5.
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", url) \
        .option("dbtable", "lending_club_acc_loans") \
        .option("user", username) \
        .option("password", password) \
        .option("tableLock", "true") \
        .option("batchsize", "200000") \
        .option("reliabilityLevel", "BEST_EFFORT") \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Is there something I can do to boost the performance?
Repartition the data frame. Earlier I had a single partition on my source data frame; repartitioning it to 8 helped improve the performance.
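In code, the change is roughly this (a sketch reusing the write options from the question; whether 8 concurrent bulk inserts is optimal depends on the target database):

# 8 partitions -> up to 8 write tasks can run concurrently
df.repartition(8) \
    .write \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .mode("overwrite") \
    .option("url", url) \
    .option("dbtable", "lending_club_acc_loans") \
    .option("user", username) \
    .option("password", password) \
    .option("tableLock", "true") \
    .option("batchsize", "200000") \
    .option("reliabilityLevel", "BEST_EFFORT") \
    .save()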

save spark dataframe to multiple targets in parallel

I need to write my final dataframe to HDFS and to an Oracle database.
Currently, once saving to HDFS is done, it starts writing to the RDBMS. Is there any way to use Java threads to save the same dataframe to HDFS and to the RDBMS in parallel?
finalDF.write().option("numPartitions", "10").jdbc(url, exatable, jdbcProp);
finalDF.write().mode("OverWrite").insertInto(hiveDBWithTable);
Thanks.
Cache finalDF before writing to HDFS and the RDBMS. Then make sure that enough executors are available to write simultaneously. If the number of partitions in finalDF is p and the number of cores per executor is c, then you need a minimum of ceil(p/c) + ceil(10/c) executors.
df.show and df.write are actions, and actions occur sequentially in Spark. So the answer is no, it is not possible in the standard way unless threads are used.
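For illustration, here is a rough PySpark sketch of the thread-based approach (the question's code is Java, but the idea is the same: the driver submits both write jobs from separate threads and Spark schedules them over the shared executors; finalDF, url, exatable, jdbc_props and hive_db_with_table stand in for the question's variables):

from concurrent.futures import ThreadPoolExecutor

finalDF.cache()    # reuse the same data for both writes
finalDF.count()    # materialize the cache before kicking off the writes

def write_jdbc():
    finalDF.write.option("numPartitions", "10").jdbc(url, exatable, properties=jdbc_props)

def write_hive():
    finalDF.write.mode("overwrite").insertInto(hive_db_with_table)

# Both actions are submitted concurrently; .result() re-raises any failure.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(write_jdbc), pool.submit(write_hive)]
    for f in futures:
        f.result()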
We can use the code below to append the dataframe values to a table:
DF.write
  .mode("append")
  .format("jdbc")
  .option("driver", driverProp)
  .option("url", urlDbRawdata)
  .option("dbtable", TABLE_NAME)
  .option("user", userName)
  .option("password", password)
  .option("numPartitions", maxNumberDBPartitions)
  .option("batchsize", batchSize)
  .save()
