PySpark Streaming: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer

I'm submitting a PySpark Streaming job in a Kubernetes environment. The job consumes data from Kafka and processes it using PySpark.
Spark version: 3.2.1
Apache Kafka version: 2.4
I submit it using the spark-submit command below:
/opt/spark/bin/spark-submit \
--master k8s://https://test.containers.cloud.ibm.com:35000 \
--deploy-mode cluster \
--name spark-streaming-su \
--conf spark.kubernetes.driver.pod.name=spark-streaming-driver-su \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kubernetes.namespace=test-spark \
--conf spark.kubernetes.file.upload.path=/code-volume/upload_path \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.authenticate.submission.caCertFile=/etc/spark-secret/ca.crt \
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/etc/spark-secret/token \
--conf spark.kubernetes.driver.limit.cores=2 \
--conf spark.driver.memory=2g \
--conf spark.sql.shuffle.partitions=4 \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=2g \
--conf spark.kubernetes.executor.limit.cores=1 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.container.image=us.ic.io/test/spark-test:v5 \
--conf spark.kubernetes.container.image.pullSecrets=test-us-icr-io \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.options.claimName=pvc-code \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.mount.path=/code-volume \
/code-volume/test/test_streaming.py
Error:
Error occured due to : An error occurred while calling o60.load.
: java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.(KafkaSourceProvider.scala:599)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.(KafkaSourceProvider.scala)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:338)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:71)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:236)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:118)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:118)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:34)
at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:167)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:143)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 22 more
I tried each of the following additions to the spark-submit command, but none of them worked:
Trial 1)
--jars "/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,/opt/spark/jars/kafka-clients-2.4.0.jar,/opt/spark/jars/spark-streaming-kafka-0-10_2.12-3.2.1.jar"
Trial 2)
--conf "spark.driver.extraClassPath=/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.2.1.jar:/opt/spark/jars/kafka-clients-2.4.0.jar:/opt/spark/jars/spark-streaming-kafka-0-10_2.12-3.2.1.jar" \
--conf "spark.executor.extraClassPath=/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.2.1.jar:/opt/spark/jars/kafka-clients-2.4.0.jar:/opt/spark/jars/spark-streaming-kafka-0-10_2.12-3.2.1.jar" \
Trial 3)
--packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.2.1
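A hedged observation on Trial 3: the failing frames in the stack trace are in org.apache.spark.sql.kafka010, which ships in the spark-sql-kafka-0-10 artifact (Structured Streaming), whereas Trial 3 requested spark-streaming-kafka-0-10 (the DStream connector). The sketch below only builds the matching Maven coordinate. The helper name kafka_sql_package is made up for illustration, and Scala 2.12 is assumed from the jar names in Trial 1.

```python
def kafka_sql_package(spark_version: str, scala_version: str = "2.12") -> str:
    """Return the Maven coordinate of the Structured Streaming Kafka connector.

    Hypothetical helper: it only formats a string and does not check that
    the artifact exists or that it can be resolved from inside the cluster.
    """
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


if __name__ == "__main__":
    # Spark 3.2.1 is the version stated in the question.
    print("--packages", kafka_sql_package("3.2.1"))
```

Whether --packages can resolve dependencies in Kubernetes cluster mode depends on the setup; the solution further down avoids the question by placing the jars on a shared volume.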

I downloaded elastic-jars, spark-sql-kafka-0-10_2.12-3.2.1.jar and its dependencies, and made them available in "/code-volume/extrajars/".
Note: as per OneCricketeer's comment, the options below should be used together.
--jars "/code-volume/extrajars/*" \
--conf spark.driver.extraClassPath=/code-volume/extrajars/* \
--conf spark.executor.extraClassPath=/code-volume/extrajars/* \
Working spark-submit command:
/opt/spark/bin/spark-submit \
--master k8s://https://test.containers.cloud.ibm.com:35000 \
--deploy-mode cluster \
--name spark-streaming-su \
--conf spark.kubernetes.driver.pod.name=spark-streaming-driver-su \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kubernetes.namespace=test-spark \
--conf spark.kubernetes.file.upload.path=/code-volume/upload_path \
--jars "/code-volume/extrajars/*" \
--conf spark.driver.extraClassPath=/code-volume/extrajars/* \
--conf spark.executor.extraClassPath=/code-volume/extrajars/* \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.authenticate.submission.caCertFile=/etc/spark-secret/ca.crt \
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/etc/spark-secret/token \
--conf spark.kubernetes.driver.limit.cores=2 \
--conf spark.driver.memory=2g \
--conf spark.sql.shuffle.partitions=4 \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=2g \
--conf spark.kubernetes.executor.limit.cores=1 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.container.image=us.ic.io/test/spark-test:v5 \
--conf spark.kubernetes.container.image.pullSecrets=test-us-icr-io \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.options.claimName=pvc-code \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.mount.path=/code-volume \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.options.claimName=pvc-code \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.mount.path=/code-volume \
/code-volume/test/test_streaming.py
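Note that the working command also mounts the same PVC on the executors, so /code-volume/extrajars/ is visible to every pod. To confirm that the directory really contains the class named in the exception, something like the following sketch could be run in a pod that mounts the volume. The function name jars_containing is hypothetical; the directory path is the one used above.

```python
import zipfile
from pathlib import Path

# Class file path inside a jar for the class named in the exception.
CLASS_ENTRY = "org/apache/kafka/common/serialization/ByteArraySerializer.class"


def jars_containing(jar_dir, class_entry=CLASS_ENTRY):
    """Return the names of the jars in jar_dir that contain class_entry."""
    hits = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if class_entry in zf.namelist():
                    hits.append(jar.name)
        except zipfile.BadZipFile:
            # Skip truncated or corrupt downloads instead of aborting the scan.
            continue
    return hits


if __name__ == "__main__":
    # An empty list means no jar in the directory provides the class.
    print(jars_containing("/code-volume/extrajars"))
```

An empty result would suggest that kafka-clients (or an equivalent jar) is missing from the directory, which would reproduce the ClassNotFoundException regardless of the classpath options.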

Related

Unable to write data using spark-submit

When I run spark-submit with this command on Cloudera:
time spark-submit \
--deploy-mode client \
--conf spark.app.name='XXXxxxxxx' \
--conf spark.master=local[*] \
--conf spark.driver.memory=20g \
--conf spark.driver.cores=2 \
--conf spark.executor.instances=4 \
--conf spark.executor.memory=20g \
--conf spark.executor.cores=7 \
--conf spark.dynamicAllocation.enabled=True \
--conf spark.dynamicAllocation.minExecutors=0 \
--conf spark.dynamicAllocation.maxExecutors=5 \
--conf spark.dynamicAllocation.executorAllocationRatio=0.5 \
--conf spark.local.dir=/data/xxx/ssss/spark_local/ \
--py-files test.py \
--files test_ed.ini \
test_py.py
I am getting this in my output log file:
Caused by: java.io.FileNotFoundException: /xxx/xxxxx/xxx/xxx/spark_local (Too many open files)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:105)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I tried changing the number of shuffle partitions, after which this job succeeds, but my downstream job then fails.
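The trace passes through BypassMergeSortShuffleWriter, which keeps one output file open per reduce partition for every running map task, so the number of open files grows with both the task parallelism and the shuffle partition count. A rough sketch of that arithmetic, compared against the process's open-file limit, is below. The function names and the example numbers (16 concurrent tasks, 200 partitions) are assumptions for illustration, not values from the question; the resource module is available on Unix-like systems only.

```python
import resource


def estimated_open_shuffle_files(concurrent_tasks: int, shuffle_partitions: int) -> int:
    """Rough upper bound: one open file per reduce partition per running map task."""
    return concurrent_tasks * shuffle_partitions


def exceeds_soft_limit(needed: int) -> bool:
    """Compare an estimate with this process's soft limit on open files."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft != resource.RLIM_INFINITY and needed > soft


if __name__ == "__main__":
    # Hypothetical values: 16 tasks running at once, 200 shuffle partitions.
    needed = estimated_open_shuffle_files(concurrent_tasks=16, shuffle_partitions=200)
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"estimated open shuffle files: {needed}")
    print(f"soft limit: {soft}, hard limit: {hard}")
    print(f"estimate exceeds soft limit: {exceeds_soft_limit(needed)}")
```

If the estimate is far above the soft limit, the usual levers are fewer shuffle partitions, less task parallelism, or a higher open-file limit for the user running the job; which of these is appropriate here is not something the question settles.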

Cannot Allocate Memory in Delta Lake

Problem
The goal is to have a Spark Streaming application that reads data from Kafka and uses Delta Lake to store it. The Delta table is partitioned very finely: the first partition column is organization_id (there are more than 5000 organizations) and the second is the date.
The application meets the expected latency, but it does not stay up for more than one day. The error is always about memory, as shown below.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006f8000000, 671088640, 0) failed; error='Cannot allocate memory' (errno=12)
There is no persistence and the memory is already high for the whole application.
What I've tried
Increasing memory and the number of workers were the first things I tried. I also changed the number of partitions, from 4 to 16.
Script of Execution
spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--driver-memory 2G \
--executor-memory 4G \
--executor-cores 2 \
--num-executors 4 \
--files s3://my-bucket/log4j-driver.properties,s3://my-bucket/log4j-executor.properties \
--jars /home/hadoop/delta-core_2.12-0.8.0.jar,/usr/lib/spark/external/lib/spark-sql-kafka-0-10.jar \
--class my.package.app \
--conf spark.driver.memoryOverhead=512 \
--conf spark.executor.memoryOverhead=1024 \
--conf spark.memory.fraction=0.8 \
--conf spark.memory.storageFraction=0.3 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.rdd.compress=true \
--conf spark.yarn.max.executor.failures=100 \
--conf spark.yarn.maxAppAttempts=100 \
--conf spark.task.maxFailures=100 \
--conf spark.executor.heartbeatInterval=20s \
--conf spark.network.timeout=300s \
--conf spark.driver.maxResultSize=0 \
--conf spark.driver.extraJavaOptions="-XX:-PrintGCDetails -XX:-PrintGCDateStamps -XX:-UseParallelGC -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump-driver.hprof -Dlog4j.configuration=log4j-driver.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageFact -Duser.timezone=UTC" \
--conf spark.executor.extraJavaOptions="-XX:-PrintGCDetails -XX:-PrintGCDateStamps -XX:-UseParallelGC -XX:+UseG1GC -XX:-UseConcMarkSweepGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dump-executor.hprof -Dlog4j.configuration=log4j-executor.properties -Dvm.logging.level=ERROR -Dvm.logging.name=UsageFact -Duser.timezone=UTC" \
--conf spark.sql.session.timeZone=UTC \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
--conf spark.databricks.delta.retentionDurationCheck.enabled=false \
--conf spark.databricks.delta.vacuum.parallelDelete.enabled=true \
--conf spark.sql.shuffle.partitions=16 \
--name "UsageFactProcessor" \
application.jar
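For orientation, the os::commit_memory warning means the JVM asked the operating system for more memory and the request was refused (errno 12 is ENOMEM), which is different from a Java heap OutOfMemoryError. The sketch below just converts the numbers from the warning and from the script above into MiB. It assumes each YARN container is sized as heap plus memoryOverhead and ignores any rounding YARN applies; the helper name container_mib is made up.

```python
MIB = 1024 * 1024


def container_mib(heap_gib: int, overhead_mib: int) -> int:
    """Approximate container request: JVM heap plus spark.*.memoryOverhead, in MiB."""
    return heap_gib * 1024 + overhead_mib


# Values taken from the spark-submit script above.
executor_mib = container_mib(heap_gib=4, overhead_mib=1024)
driver_mib = container_mib(heap_gib=2, overhead_mib=512)
total_mib = 4 * executor_mib + driver_mib

# Size of the allocation the JVM failed to commit, from the warning message.
failed_commit_mib = 671088640 // MIB

if __name__ == "__main__":
    print(f"executor container: {executor_mib} MiB")
    print(f"driver container:   {driver_mib} MiB")
    print(f"total requested:    {total_mib} MiB")
    print(f"failed commit:      {failed_commit_mib} MiB")
```

Comparing the total with the memory actually available on the nodes may help show whether the hosts are oversubscribed, but it does not by itself explain why the problem appears only after many hours.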
Code
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

// Read from Kafka as a streaming source.
val source = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", broker)
  .option("subscribe", topic)
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", value = false)
  .option("fetchOffset.numRetries", 10)
  .option("fetchOffset.retryIntervalMs", 1000)
  .option("maxOffsetsPerTrigger", 50000L)
  .option("kafkaConsumer.pollTimeoutMs", 300000L)
  .load()

val transformed = source
  .transform(applySchema)

// Write to the Delta table once per minute, partitioned by organization and date.
val query = transformed
  .coalesce(16)
  .writeStream
  .trigger(Trigger.ProcessingTime("1 minute"))
  .outputMode(OutputMode.Append)
  .format("delta")
  .partitionBy("organization_id", "date")
  .option("path", table)
  .option("checkpointLocation", checkpoint)
  .option("mergeSchema", "true")
  .start()

spark.catalog.clearCache()
query.awaitTermination()
Versions
Spark: 3.0.1
Delta: 0.8.0
Question
What do you think may be causing this problem?
I just upgraded Delta Lake (delta.io) to version 1.0.0 and the problem stopped happening.

MountVolume.Setup failed for volume "spark-conf-volume"

We are running a Spark cluster on Kubernetes. When we submitted jobs as shown below, the driver pod and executor pods were all up and running. However, the application failed to work as expected. The root cause we suspect is that it failed to find the source path specified by the "py-files" parameter. We also noticed that the driver pod has the warning MountVolume.Setup failed for volume "spark-conf-volume".
Would you please advise?
bin/spark-submit \
--master k8s://https://k8s-master-ip:6443 \
--deploy-mode cluster \
--name algo-vm \
--py-files hdfs://{our_ip}:9000/testdata/src.zip \
--conf spark.executor.instances=2 \
--conf spark.driver.port=10000 \
--conf spark.port.maxRetries=1 \
--conf spark.blockManager.port=20000 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.container.image={our_ip}/sutpc/k8s-spark-242-entry/spark-py:1.0 \
--jars hdfs://hdfs-master-ip:9000/jar/spark-sql-kafka-0-10_2.11-2.4.5.jar,hdfs://{our_ip}:9000/jar/kafka-clients-0.11.0.2.jar \
hdfs://{our_ip}:9000/testdata/spark_main.py

Kubernetes spark-submit in cluster mode: --packages not working as expected

I am trying to submit a Spark job to a Kubernetes cluster in cluster mode from a client in the cluster, using the --packages option so that dependencies are downloaded by the driver and executors, but it is not working. It refers to paths on the submitting client. (kubectl proxy is on.)
Here are the submit options:
/usr/local/bin/spark-submit \
--verbose \
--master=k8s://http://127.0.0.1:8001 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.namespace=spark \
--conf spark.kubernetes.container.image= <...> \
--conf spark.executor.instances=2 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=datazone-s3-secret:AWS_ACCESS_KEY_ID \
--conf spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=datazone-s3-secret:AWS_SECRET_ACCESS_KEY \
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 \
s3.py 10
In the logs I can see that the packages refer to my local file system.
Spark config:
(spark.kubernetes.namespace,spark)
(spark.jars,file:///Users/<my username>/.ivy2/jars/com.amazonaws_aws-java-sdk-1.7.4.jar,file:///Users/<my username>/.ivy2/jars/org.apache.hadoop_hadoop-aws-2.7.3.jar,file:///Users/<my username>/.ivy2/jars/joda-time_joda-time-2.10.5.jar, ....
Did someone face this problem?
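The spark.jars value in that log can be checked mechanically for entries that point at the submitting machine. In Spark on Kubernetes, a file:// URI (or a bare path) refers to the submitter's filesystem, while local:// refers to a path inside the container image. The sketch below separates the two; the function name submitter_local_jars and the sample value are hypothetical.

```python
from urllib.parse import urlparse


def submitter_local_jars(spark_jars: str):
    """Return the entries of a spark.jars value that live on the submitting client."""
    local = []
    for uri in spark_jars.split(","):
        uri = uri.strip()
        if not uri:
            continue
        # No scheme or file:// means a path on the machine running spark-submit.
        if urlparse(uri).scheme in ("", "file"):
            local.append(uri)
    return local


if __name__ == "__main__":
    # Hypothetical spark.jars value mixing three kinds of location.
    sample = (
        "file:///Users/me/.ivy2/jars/org.apache.hadoop_hadoop-aws-2.7.3.jar,"
        "local:///opt/spark/jars/extra.jar,"
        "s3a://bucket/jars/other.jar"
    )
    for jar in submitter_local_jars(sample):
        print("only on the submitting client:", jar)
```

Entries flagged this way have to reach the driver and executor pods some other way, for example through an upload path or by baking them into the image; which option fits depends on the Spark version and cluster setup.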

java.lang.ClassNotFoundException: org.apache.spark.deploy.kubernetes.submit.Client

I am running a sample Spark job in a Kubernetes cluster with the following command:
bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://XXXXX \
--kubernetes-namespace sidartha-spark-cluster \
--conf spark.executor.instances=2 \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-kubernetes-0.1.0-rc1 \
--conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-kubernetes-0.1.0-rc1 \
examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar 1000
I am building Spark from apache-spark-on-k8s.
I am not able to find the jar for the org.apache.spark.deploy.kubernetes.submit.Client class.
This issue is resolved: the spark/resource-manager/kubernetes module needs to be built from source.
