Cannot Allocate Memory in Delta Lake - apache-spark

Problem
The goal is to have a Spark Streaming application that reads data from Kafka and uses Delta Lake to store the data. The delta table is partitioned at a fine granularity: the first partition column is organization_id (there are more than 5000 organizations) and the second is date.
The application runs with the expected latency, but it never stays up for more than a day. The failure is always memory-related, as shown below.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f8000000, 671088640, 0) failed; error='Cannot allocate memory' (errno=12)
There is no caching/persistence in the job, and the memory allocated to the whole application is already high.
What I've tried
Increasing memory and the number of workers were the first things I tried; the number of partitions was also changed, from 4 to 16.
Script of Execution
spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--driver-memory 2G \
--executor-memory 4G \
--executor-cores 2 \
--num-executors 4 \
--files s3://my-bucket/log4j-driver.properties,s3://my-bucket/log4j-executor.properties \
--jars /home/hadoop/delta-core_2.12-0.8.0.jar,/usr/lib/spark/external/lib/spark-sql-kafka-0-10.jar \
--class my.package.app \
--conf spark.driver.memoryOverhead=512 \
--conf spark.executor.memoryOverhead=1024 \
--conf spark.memory.fraction=0.8 \
--conf spark.memory.storageFraction=0.3 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.rdd.compress=true \
--conf spark.yarn.max.executor.failures=100 \
--conf spark.yarn.maxAppAttempts=100 \
--conf spark.task.maxFailures=100 \
--conf spark.executor.heartbeatInterval=20s \
--conf spark.network.timeout=300s \
--conf spark.driver.maxResultSize=0 \
--conf spark.driver.extraJavaOptions="-XX:-PrintGCDetails -XX:-PrintGCDateStamps -XX:-UseParallelGC -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump-driver.hprof -Dlog4j.configuration=log4j-driver.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageFact -Duser.timezone=UTC" \
--conf spark.executor.extraJavaOptions="-XX:-PrintGCDetails -XX:-PrintGCDateStamps -XX:-UseParallelGC -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump-executor.hprof -Dlog4j.configuration=log4j-executor.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageFact -Duser.timezone=UTC" \
--conf spark.sql.session.timeZone=UTC \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
--conf spark.databricks.delta.retentionDurationCheck.enabled=false \
--conf spark.databricks.delta.vacuum.parallelDelete.enabled=true \
--conf spark.sql.shuffle.partitions=16 \
--name "UsageFactProcessor" \
application.jar
Code
val source = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", broker)
.option("subscribe", topic)
.option("startingOffsets", "latest")
.option("failOnDataLoss", value = false)
.option("fetchOffset.numRetries", 10)
.option("fetchOffset.retryIntervalMs", 1000)
.option("maxOffsetsPerTrigger", 50000L)
.option("kafkaConsumer.pollTimeoutMs", 300000L)
.load()
val transformed = source
.transform(applySchema)
val query = transformed
.coalesce(16)
.writeStream
.trigger(Trigger.ProcessingTime("1 minute"))
.outputMode(OutputMode.Append)
.format("delta")
.partitionBy("organization_id", "date")
.option("path", table)
.option("checkpointLocation", checkpoint)
.option("mergeSchema", "true")
.start()
spark.catalog.clearCache()
query.awaitTermination()
Versions
Spark: 3.0.1
Delta: 0.8.0
Question
What do you think may be causing this problem?

Just upgraded to Delta Lake (delta.io) 1.0.0 and it stopped happening.
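For reference, a minimal sketch of what the upgrade amounts to on the application side, assuming the only change is swapping delta-core_2.12-0.8.0.jar for delta-core_2.12-1.0.0.jar on the --jars list; the Delta-specific session settings from the submit script stay the same. Note that Delta Lake 1.0.0 is built against Spark 3.1.x, so the Spark version may need to be bumped as well.
import org.apache.spark.sql.SparkSession

// Sketch only: same Delta session settings as in the submit script above,
// now resolved against the delta-core_2.12-1.0.0 artifact instead of 0.8.0.
val spark = SparkSession.builder()
  .appName("UsageFactProcessor")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()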

Related

Can we have multiple executors in Spark with master local[*] and deploy mode client?

I have a 1-node Hadoop cluster, and I am submitting a Spark job like this:
spark-submit \
--class com.compq.scriptRunning \
--master local[*] \
--deploy-mode client \
--num-executors 3 \
--executor-cores 4 \
--executor-memory 21g \
--driver-cores 2 \
--driver-memory 5g \
--conf "spark.local.dir=/data/spark_tmp" \
--conf "spark.sql.shuffle.partitions=2000" \
--conf "spark.sql.inMemoryColumnarStorage.compressed=true" \
--conf "spark.sql.autoBroadcastJoinThreshold=200000" \
--conf "spark.speculation=false" \
--conf "spark.hadoop.mapreduce.map.speculative=false" \
--conf "spark.hadoop.mapreduce.reduce.speculative=false" \
--conf "spark.ui.port=8099" \
.....
Though I define 3 executors, I see only 1 executor running in the Spark UI all the time. Can we have multiple executors running in parallel with
--master local[*] \
--deploy-mode client \
It's an on-prem, plain open-source Hadoop flavor installed on the cluster.
I tried changing the master from local to local[*] and playing around with deployment modes; still, I could see only 1 executor running in the Spark UI.

Kubernetes spark-submit in cluster mode: --packages not working as expected

I am trying to submit a Spark job to a Kubernetes cluster in cluster mode, from a client inside the cluster, with the --packages attribute so that dependencies are downloaded by the driver and executors, but it is not working: it refers to paths on the submitting client. (kubectl proxy is on.)
Here are the submit options:
/usr/local/bin/spark-submit \
--verbose \
--master=k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image= <...> \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=datazone-s3-secret:AWS_ACCESS_KEY_ID \
--conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=datazone-s3-secret:AWS_SECRET_ACCESS_KEY \
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
s3.py 10
In the logs I can see that the packages refer to my local file system.
Spark config:
(spark.kubernetes.namespace,spark)
(spark.jars,file:///Users/<my username>/.ivy2/jars/com.amazonaws_aws-java-sdk-1.7.4.jar,file:///Users/<my username>/.ivy2/jars/org.apache.hadoop_hadoop-aws-2.7.3.jar,file:///Users/<my username>/.ivy2/jars/joda-time_joda-time-2.10.5.jar, ....
Has anyone faced this problem?

Spark Thrift server queuing up queries

When parallel queries hit the Spark Thrift Server, the Spark UI --> JDBC/ODBC Server page shows all of them as started, but they get executed sequentially.
Here's the Thrift Server startup script:
start_thriftserver (){
sudo /usr/lib/spark/sbin/start-thriftserver.sh \
--master yarn \
--deploy-mode client \
--executor-memory 3200m \
--executor-cores 2 \
--driver-memory 4g \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.shuffle.service.enabled=true \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.dynamicAllocation.schedulerBacklogTimeout=1s \
--conf spark.dynamicAllocation.minExecutors=50 \
--conf spark.executor.memoryOverhead=684
This is indeed a confusing topic.
Try setting spark.sql.hive.thriftServer.singleSession=false.
That said, I am a little sceptical about all this.
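If you just want to try that setting, the usual place is another --conf line on the start-thriftserver.sh call above. For completeness, here is a minimal Scala sketch (not from the original question; it assumes you run an embedded Thrift server from application code with the spark-hive-thriftserver module on the classpath) of applying the same setting programmatically:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Sketch only: start an embedded Thrift server with the suggested setting applied.
val spark = SparkSession.builder()
  .appName("embedded-thriftserver")
  .config("spark.sql.hive.thriftServer.singleSession", "false")
  .enableHiveSupport()
  .getOrCreate()

// Exposes the JDBC/ODBC endpoint backed by this SparkSession.
HiveThriftServer2.startWithContext(spark.sqlContext)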

Spark Structured Stream Executors weird behavior

Using Spark Structured Streaming with a Cloudera distribution.
I'm using 3 executors, but when I launch the application only one of them is actually used.
How can I use multiple executors?
Let me give you more info.
These are my parameters:
Command Launch:
spark2-submit --master yarn \
--deploy-mode cluster \
--conf spark.ui.port=4042 \
--conf spark.eventLog.enabled=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.streaming.backpressure.enabled=true \
--conf spark.streaming.kafka.consumer.poll.ms=512 \
--num-executors 3 \
--executor-cores 3 \
--executor-memory 2g \
--jars /data/test/spark-avro_2.11-3.2.0.jar,/data/test/spark-streaming-kafka-0-10_2.11-2.1.0.cloudera1.jar,/data/test/spark-sql-kafka-0-10_2.11-2.1.0.cloudera1.jar \
--class com.test.Hello /data/test/Hello.jar
The Code:
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", <topic_list:9092>)
.option("subscribe", <topic_name>)
.option("group.id", <consumer_group_id>)
.load()
.select($"value".as[Array[Byte]], $"timestamp")
.map((c) => { .... })
val query = lines
.writeStream
.format("csv")
.option("path", <outputPath>)
.option("checkpointLocation", <checkpointLocationPath>)
.start()
query.awaitTermination()
Result in SparkUI:
SparkUI Image
What I expected was that all executors would be working.
Any suggestions?
Thank you
Paolo
It looks like there is nothing wrong with your configuration; the issue is probably that the topic you are reading has only one partition. You need to increase the number of partitions on your Kafka topic; usually the partition count is around 3-4 times the number of executors.
If you don't want to touch the producer side, you can get around this by calling repartition(3) before you apply the map, so every executor works on its own logical partition.
If you still want to explicitly control the work each executor gets, you could use the mapPartitions method.
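For illustration, a minimal sketch of the repartition suggestion; the broker, topic, and paths are placeholders, not values from the question:
import org.apache.spark.sql.SparkSession

// Sketch only: repartition the Kafka stream before the per-record work
// so that all 3 executors receive partitions to process.
val spark = SparkSession.builder().appName("RepartitionExample").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker-1:9092")  // placeholder
  .option("subscribe", "input_topic")                  // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")
  .repartition(3)                                      // roughly one partition per executor
  .map(row => row.getString(0).toUpperCase)            // stand-in for the real map logic

val query = lines.writeStream
  .format("csv")
  .option("path", "/tmp/output")                       // placeholder
  .option("checkpointLocation", "/tmp/checkpoint")     // placeholder
  .start()

query.awaitTermination()
Doing the repartition before the map is what spreads the single Kafka partition's records across executors; increasing the topic's partition count avoids the extra shuffle entirely.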

java.lang.ClassNotFoundException: org.apache.spark.deploy.kubernetes.submit.Client

I am running a sample Spark job in a Kubernetes cluster with the following command:
bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://XXXXX \
--kubernetes-namespace sidartha-spark-cluster \
--conf spark.executor.instances=2 \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-kubernetes-0.1.0-rc1 \
--conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-kubernetes-0.1.0-rc1 \
examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar 1000
I am building Spark from apache-spark-on-k8s.
I am not able to find the jar for the org.apache.spark.deploy.kubernetes.submit.Client class.
This issue is resolved. We need to build the spark/resource-manager/kubernetes module from source.
