Spark OutOfMemoryError is remedied by repartition - apache-spark

I have a highly compressed, non-splittable gzip archive, ~100 MB in size with ~10 million records. I'm trying to read it into a Spark DataFrame and then write it out as Parquet. I have one driver and one executor (16 GB RAM, 8 vCPU; in fact, it's a Glue job with 2 G.1X nodes).
Reading the gzipped CSV and writing Parquet directly leads to an OOM:
df = spark.read.option("sep", "|") \
.option("header", "true") \
.option("quote", "\"") \
.option("mode", "FAILFAST") \
.csv("path.gz")
df.write \
.format("parquet") \
.mode("Overwrite") \
.save("path")
And I can understand this: the DataFrame is loaded into a single executor's memory, it doesn't fit, and an OOM occurs. But if I call .repartition(8) before the write (same hardware setup), then everything is OK and no OOM occurs. I don't understand why this happens; don't we still have to load the whole DataFrame into executor memory either way?
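For reference, this is a sketch of the variant that succeeds, i.e. the same read with .repartition(8) inserted before the write (same paths and options as above):
df = spark.read.option("sep", "|") \
    .option("header", "true") \
    .option("quote", "\"") \
    .option("mode", "FAILFAST") \
    .csv("path.gz")
# Redistribute into 8 partitions before writing; this is the version that
# completes without an OOM on the same hardware.
df.repartition(8) \
    .write \
    .format("parquet") \
    .mode("overwrite") \
    .save("path")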

Related

Spark load from Elasticsearch: number of executors and partitions

I'm trying to load data from an Elasticsearch index into a dataframe in Spark. My machine has 12 CPUs with 1 core each. I'm using PySpark in a Jupyter Notebook with the following Spark config:
pathElkJar = currentUserFolder+"/elasticsearch-hadoop-"+connectorVersion+"/dist/elasticsearch-spark-20_2.11-"+connectorVersion+".jar"
spark = SparkSession.builder \
.appName("elastic") \
.config("spark.jars",pathElkJar) \
.enableHiveSupport() \
.getOrCreate()
Now whether I do:
df = es_reader.load()
or:
df = es_reader.load(numPartitions=12)
I get the same output from the following prints:
print('Master: {}'.format(spark.sparkContext.master))
print('Number of partitions: {}'.format(df.rdd.getNumPartitions()))
print('Number of executors:{}'.format(spark.sparkContext._conf.get('spark.executor.instances')))
print('Partitioner: {}'.format(df.rdd.partitioner))
print('Partitions structure: {}'.format(df.rdd.glom().collect()))
Master: local[*]
Number of partitions: 1
Number of executors: None
Partitioner: None
I was expecting 12 partitions, which I can only obtain by calling repartition() on the dataframe. Furthermore, I thought that the number of executors by default equals the number of CPUs. But even by doing the following:
spark.conf.set("spark.executor.instances", "12")
I can't manually set the number of executors. It is true I have 1 core for each of the 12 CPUs, but how should I go about it?
I modified the configuration after creating the Spark session (without restarting, this obviously leads to no changes), by specifying the number of executors as follows:
spark = SparkSession.builder \
.appName("elastic") \
.config("spark.jars",pathElkJar) \
.config("spark.executor.instances", "12") \
.enableHiveSupport() \
.getOrCreate()
I now correctly get 12 executors. Still, I don't understand why this doesn't happen automatically, and why the number of partitions when loading the dataframe is still 1. I would expect it to be 12, matching the number of executors; am I right?
The problem regarding the executors and partitioning arose from the fact that I was using Spark in local mode, which allows for one executor at most. Using YARN or another resource manager such as Mesos solved the problem.
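A minimal sketch of that setup (YARN used as an example master; pathElkJar and es_reader are assumed to be defined as in the question):
from pyspark.sql import SparkSession

# Build the session against a real resource manager instead of local[*].
spark = SparkSession.builder \
    .appName("elastic") \
    .master("yarn") \
    .config("spark.jars", pathElkJar) \
    .config("spark.executor.instances", "12") \
    .enableHiveSupport() \
    .getOrCreate()

# If the connector still returns a single partition, spread the data manually.
df = es_reader.load().repartition(12)
print('Number of partitions: {}'.format(df.rdd.getNumPartitions()))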

Pyspark crashing on Dataproc cluster for small dataset

I am running a Jupyter notebook on a GCP Dataproc cluster consisting of 3 worker nodes and 1 master node of type n1-standard-2 (2 cores, 7.5 GB RAM) for my data science project. The dataset consists of ~0.4 million rows. I have called a groupBy function with a groupBy column that has only 10 unique values, so the output dataframe should consist of just 10 rows!
It's surprising that it crashes every time I call grouped_df.show() or grouped_df.toPandas(), where grouped_df is obtained after calling the groupBy() and sum() functions.
This should be a cakewalk for Spark, which was built for processing large datasets. I am attaching the Spark config that I am using, which I have defined in a function.
builder = SparkSession.builder \
.appName("Spark NLP Licensed") \
.master("local[*]") \
.config("spark.driver.memory", "40G") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.kryoserializer.buffer.max", "2000M") \
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
.config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
.config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
return builder.getOrCreate()
This is the error I am getting (screenshot not reproduced here). Please help.
Setting the master URL in setMaster() helped. Now I can load data as large as 20 GB and perform groupBy() operations on the cluster as well.
Thanks @mazaneicha.
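For illustration, a sketch of the corrected builder, assuming the fix was to point the session at the cluster's YARN resource manager instead of local[*] (the exact master URL isn't shown in the thread); the 40 G driver-memory setting is dropped here since it exceeds what a 7.5 GB node can provide:
from pyspark.sql import SparkSession

builder = SparkSession.builder \
    .appName("Spark NLP Licensed") \
    .master("yarn") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
    .config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark = builder.getOrCreate()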

Structured Streaming OOM

I deploy a Structured Streaming job on the Kubernetes operator. It simply reads from Kafka, deserializes, adds 2 columns and stores the results in the data lake (I tried both Delta and Parquet), and after a few days the executor's memory usage grows until I eventually get an OOM. The input records, in terms of KBs, are really low.
P.S. I use exactly the same code with Cassandra as a sink, and it has been running for almost a month now without any issues. Any ideas, please?
My code:
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", MetisStreamsConfig.bootstrapServers)
.option("subscribe", MetisStreamsConfig.topics.head)
.option("startingOffsets", startingOffsets)
.option("maxOffsetsPerTrigger", MetisStreamsConfig.maxOffsetsPerTrigger)
.load()
.selectExpr("CAST(value AS STRING)")
.as[String]
.withColumn("payload", from_json($"value", schema))
// selection + filtering
.select("payload.*")
.select($"vesselQuantity.qid" as "qid", $"vesselQuantity.vesselId" as "vessel_id", explode($"measurements"))
.select($"qid", $"vessel_id", $"col.*")
.filter($"timestamp".isNotNull)
.filter($"qid".isNotNull and !($"qid"===""))
.withColumn("ingestion_time", current_timestamp())
.withColumn("mapping", MappingUDF($"qid"))
.writeStream
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
log.info(s"Storing batch with id: `$batchId`")
val calendarInstance = Calendar.getInstance()
val year = calendarInstance.get(Calendar.YEAR)
val month = calendarInstance.get(Calendar.MONTH) + 1
val day = calendarInstance.get(Calendar.DAY_OF_MONTH)
batchDF.write
.mode("append")
.parquet(streamOutputDir + s"/$year/$month/$day")
}
.option("checkpointLocation", checkpointDir)
.start()
I changed to foreachBatch because using Delta or Parquet with partitionBy caused the issue to appear faster.
There is a bug that is resolved in Spark 3.1.0.
See https://github.com/apache/spark/pull/28904
Other ways of overcoming the issue, and credit for the debugging:
https://www.waitingforcode.com/apache-spark-structured-streaming/file-sink-out-of-memory-risk/read
You may find this helpful even though you are using foreachBatch ...
I had the same issue with some Structured Streaming Spark 2.4.4 applications writing Delta Lake (or Parquet) output with partitionBy.
It seems to be related to JVM memory allocation within a container, as thoroughly explained here:
https://merikan.com/2019/04/jvm-in-a-container/
My solution (which depends on your JVM version) was to add some options to the YAML definition of my Spark application:
spec:
  javaOptions: >-
    -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap
This way my streaming app is functioning properly, with a normal amount of memory (1 GB for the driver, 2 GB for the executors).
EDIT: while it seems that the first issue is solved (the controller killing pods for memory consumption), there is still an issue with slowly growing non-heap memory size; after a few hours, the driver/executors are killed...
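A related note, as an assumption about newer JVMs rather than something stated in the thread: +UseCGroupMemoryLimitForHeap is an experimental JDK 8 flag that was removed in later JDK releases; on JDK 10+ (or 8u191+) a percentage-based cap in the same YAML would look roughly like this:
spec:
  javaOptions: >-
    -XX:MaxRAMPercentage=75.0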

save spark dataframe to multiple targets parallel

I need to write my final dataframe to HDFS and an Oracle database.
Currently, once saving to HDFS is done, it starts writing to the RDBMS. Is there any way to use Java threads to save the same dataframe to HDFS and the RDBMS in parallel?
finalDF.write().option("numPartitions", "10").jdbc(url, exatable, jdbcProp);
finalDF.write().mode("OverWrite").insertInto(hiveDBWithTable);
Thanks.
Cache finalDF before writing to HDFS and the RDBMS. Then make sure that enough executors are available for writing simultaneously. If the number of partitions in finalDF is p and the number of cores per executor is c, then you need a minimum of ceil(p/c) + ceil(10/c) executors.
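For example, with hypothetical numbers: if finalDF has p = 20 partitions and each executor has c = 4 cores, you would need at least ceil(20/4) + ceil(10/4) = 5 + 3 = 8 executors for both writes to run fully in parallel.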
df.show and df.write are actions, and actions submitted from the same driver thread run sequentially in Spark. So the answer is no, it is not possible out of the box unless threads are used.
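A minimal PySpark sketch of that threaded approach (url, exatable, jdbcProp and hiveDBWithTable are reused from the question; this is an illustration, not the poster's code):
from concurrent.futures import ThreadPoolExecutor

# Cache and materialize once so both writes reuse the same data.
finalDF.cache()
finalDF.count()

def write_jdbc():
    finalDF.write.option("numPartitions", "10").jdbc(url, exatable, properties=jdbcProp)

def write_hive():
    finalDF.write.mode("overwrite").insertInto(hiveDBWithTable)

# Submitting the two writes from separate driver threads lets Spark schedule
# both jobs concurrently, provided enough executors are available.
with ThreadPoolExecutor(max_workers=2) as pool:
    jobs = [pool.submit(write_jdbc), pool.submit(write_hive)]
    for job in jobs:
        job.result()  # re-raise any failure from either write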
We can use the code below to append dataframe values to a table:
DF.write
.mode("append")
.format("jdbc")
.option("driver", driverProp)
.option("url", urlDbRawdata)
.option("dbtable", TABLE_NAME)
.option("user", userName)
.option("password", password)
.option("numPartitions", maxNumberDBPartitions)
.option("batchsize",batchSize)
.save()

Apache Spark SQL results not spilling to disk, exhausting Java heap space

According to the Spark FAQ:
Does my data need to fit in memory to use Spark?
No. Spark's operators spill data to disk if it does not fit in memory,
allowing it to run well on any sized data. Likewise, cached datasets
that do not fit in memory are either spilled to disk or recomputed on
the fly when needed, as determined by the RDD's storage level.
I'm querying a big table of 50 million entries. The initial data download won't fit in RAM, so Spark should spill to disk, right? I then filter out a small number of entries from those, which will fit in RAM.
SPARK_CLASSPATH=postgresql-9.4.1208.jre6-2.jar ./bin/pyspark --num-executors 4
url = \
"jdbc:postgresql://localhost:5432/mydatabase?user=postgres"
df = sqlContext \
.read \
.format("jdbc") \
.option("url", url) \
.option("dbtable", "accounts") \
.option("partitionColumn", "id") \
.option("numPartitions", 10) \
.option("lowerBound", 1) \
.option("upperBound", 50000000) \
.option("password", "password") \
.load()
# get the small number of accounts whose names contain "taco"
results = df.map(lambda row: row["name"]).filter(lambda name: "taco" in name).collect()
I see some queries run on the Postgres server, then they finish, and PySpark crashes because the Java backend crashes with java.lang.OutOfMemoryError: Java heap space.
Is there something else I need to do?
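One thing worth trying, as a sketch rather than a confirmed fix: express the filter in the DataFrame API instead of mapping over every row in Python, so only the name column is read and only the matching values are collected to the driver:
from pyspark.sql.functions import col

# Project and filter before collecting, instead of mapping over every row.
matching = df.select("name") \
    .filter(col("name").like("%taco%")) \
    .collect()
results = [row["name"] for row in matching]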
