How to fix "BlockManagerMasterEndpoint - No more replicas available for rdd" issue? - apache-spark

I am using Spark 2.4.1 and Java 8 to copy data into Cassandra 3.0.
My spark-submit script is:
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name MyDriver \
--jars "/local/jars/*.jar" \
--files hdfs://files/application-cloud-dev.properties,hdfs://files/column_family_condition.yml \
--class com.sp.MyDriver \
--executor-cores 3 \
--executor-memory 9g \
--num-executors 5 \
--driver-cores 2 \
--driver-memory 4g \
--driver-java-options -Dconfig.file=./application-cloud-dev.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application-cloud-dev.properties \
--conf spark.driver.extraClassPath=. \
--driver-class-path . \
ca-datamigration-0.0.1.jar application-cloud-dev.properties
Though the job succeeds, my log file is filled with the following WARN messages:
WARN org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_558_5026 !
2019-09-20 00:02:37,882 [dispatcher-event-loop-1] WARN org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_558_5367 !
2019-09-20 00:02:37,882 [dispatcher-event-loop-1] WARN org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_571_1745 !
org.apache.spark.network.server.TransportChannelHandler - Exception in connection from /10.24.96.88:58602
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
What is wrong here? How should I fix it?
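This warning usually means the last in-memory copy of a cached RDD block was lost because the executor holding it went away, and the "Connection reset by peer" that follows points at executors dying, commonly because YARN killed them for exceeding their memory limits. Below is a minimal sketch of the same submit with the two settings most often tuned for that symptom; the overhead and timeout values are illustrative assumptions, not a verified fix.
# Sketch only: same job as above, plus extra executor memory overhead and a
# longer network timeout. Both added values are illustrative assumptions.
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name MyDriver \
--jars "/local/jars/*.jar" \
--files hdfs://files/application-cloud-dev.properties,hdfs://files/column_family_condition.yml \
--class com.sp.MyDriver \
--executor-cores 3 \
--executor-memory 9g \
--num-executors 5 \
--driver-cores 2 \
--driver-memory 4g \
--conf spark.executor.memoryOverhead=2g \
--conf spark.network.timeout=300s \
--driver-java-options -Dconfig.file=./application-cloud-dev.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application-cloud-dev.properties \
--driver-class-path . \
ca-datamigration-0.0.1.jar application-cloud-dev.properties
If the warnings persist, the YARN NodeManager logs should show whether containers are in fact being killed for running beyond their memory limits.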

Related

Unable to write data using spark-submit

When I run spark-submit with this command on Cloudera:
time spark-submit \
--deploy-mode client \
--conf spark.app.name='XXXxxxxxx' \
--conf spark.master=local[*] \
--conf spark.driver.memory=20g \
--conf spark.driver.cores=2 \
--conf spark.executor.instances=4 \
--conf spark.executor.memory=20g \
--conf spark.executor.cores=7 \
--conf spark.dynamicAllocation.enabled=True \
--conf spark.dynamicAllocation.minExecutors=0 \
--conf spark.dynamicAllocation.maxExecutors=5 \
--conf spark.dynamicAllocation.executorAllocationRatio=0.5 \
--conf spark.local.dir=/data/xxx/ssss/spark_local/ \
--py-files test.py \
--files test_ed.ini \
test_py.py
I am getting this in my output log file:
Caused by: java.io.FileNotFoundException: /xxx/xxxxx/xxx/xxx/spark_local (Too many open files)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:105)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I have tried adjusting the shuffle partitions and this job then succeeds, but my downstream job also fails after that.
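"Too many open files" is the operating system's file-descriptor limit being hit while the shuffle writes its many per-partition files under spark.local.dir. Below is a rough sketch of the usual checks and workarounds; the limit value, the "sparkuser" account, and the partition count are illustrative assumptions, not a confirmed fix for this job.
# Sketch only: values and the "sparkuser" account name are illustrative.
ulimit -n                 # current open-file limit for this shell/user
ulimit -n 65536           # raise it for processes launched from this shell

# To make the change permanent, the limit is usually raised in
# /etc/security/limits.conf, e.g.:
#   sparkuser  soft  nofile  65536
#   sparkuser  hard  nofile  65536

# Fewer shuffle partitions (or fewer concurrent executor cores) also reduces
# how many shuffle files are open at once, e.g.:
#   --conf spark.sql.shuffle.partitions=200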

PySpark -Streaming- java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer

I'm submitting a PySpark Streaming job in a Kubernetes environment. The job consumes data from Kafka and processes it using PySpark.
Spark version : 3.2.1,
Apache Kafka version : 2.4
I submit using the below spark-submit command:
/opt/spark/bin/spark-submit \
--master k8s://https://test.containers.cloud.ibm.com:35000 \
--deploy-mode cluster \
--name spark-streaming-su \
--conf spark.kubernetes.driver.pod.name=spark-streaming-driver-su \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kubernetes.namespace=test-spark \
--conf spark.kubernetes.file.upload.path=/code-volume/upload_path \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.authenticate.submission.caCertFile=/etc/spark-secret/ca.crt \
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/etc/spark-secret/token \
--conf spark.kubernetes.driver.limit.cores=2 \
--conf spark.driver.memory=2g \
--conf spark.sql.shuffle.partitions=4 \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=2g \
--conf spark.kubernetes.executor.limit.cores=1 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.container.image=us.ic.io/test/spark-test:v5 \
--conf spark.kubernetes.container.image.pullSecrets=test-us-icr-io \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.options.claimName=pvc-code \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.mount.path=/code-volume \
/code-volume/test/test_streaming.py
error:
Error occured due to : An error occurred while calling o60.load. :
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:599)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:338)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:71)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:236)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:118)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:118)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:34)
at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:167)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:143)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 22 more
I tried the following additions to the spark-submit, but none of them worked:
Trial 1)
--jars "/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,/opt/spark/jars/kafka-clients-2.4.0.jar,/opt/spark/jars/spark-streaming-kafka-0-10_2.12-3.2.1.jar"
Trial 2)
--conf "spark.driver.extraClassPath=/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.2.1.jar:/opt/spark/jars/kafka-clients-2.4.0.jar:/opt/spark/jars/spark-streaming-kafka-0-10_2.12-3.2.1.jar" \
--conf "spark.executor.extraClassPath=/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.2.1.jar:/opt/spark/jars/kafka-clients-2.4.0.jar:/opt/spark/jars/spark-streaming-kafka-0-10_2.12-3.2.1.jar" \
Trial 3)
--packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.2.1
I downloaded the elastic jars, spark-sql-kafka-0-10_2.12-3.2.1.jar, and its dependencies and made them available in "/code-volume/extrajars/".
Note: as per OneCricketeer's comment, the options below should be used together.
--jars "/code-volume/extrajars/*" \
--conf spark.driver.extraClassPath=/code-volume/extrajars/* \
--conf spark.executor.extraClassPath=/code-volume/extrajars/* \
Working spark-submit command:
/opt/spark/bin/spark-submit \
--master k8s://https://test.containers.cloud.ibm.com:35000 \
--deploy-mode cluster \
--name spark-streaming-su \
--conf spark.kubernetes.driver.pod.name=spark-streaming-driver-su \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kubernetes.namespace=test-spark \
--conf spark.kubernetes.file.upload.path=/code-volume/upload_path \
--jars "/code-volume/extrajars/*" \
--conf spark.driver.extraClassPath=/code-volume/extrajars/* \
--conf spark.executor.extraClassPath=/code-volume/extrajars/* \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.authenticate.submission.caCertFile=/etc/spark-secret/ca.crt \
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/etc/spark-secret/token \
--conf spark.kubernetes.driver.limit.cores=2 \
--conf spark.driver.memory=2g \
--conf spark.sql.shuffle.partitions=4 \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=2g \
--conf spark.kubernetes.executor.limit.cores=1 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.container.image=us.ic.io/test/spark-test:v5 \
--conf spark.kubernetes.container.image.pullSecrets=test-us-icr-io \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.options.claimName=pvc-code \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.mount.path=/code-volume \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.options.claimName=pvc-code \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.mount.path=/code-volume \
/code-volume/test/test_streaming.py

MountVolume.Setup failed for volume "spark-conf-volume"

We are running a Spark cluster on Kubernetes. When we submitted jobs as below, the driver pod and executor pods were all up and running. However, the application failed to work as expected; the root cause we suspect is that it failed to find the source path specified by the "py-files" parameter. As we observed, the driver pod has the warning MountVolume.Setup failed for volume "spark-conf-volume".
Would you please advise?
bin/spark-submit \
--master k8s://https://k8s-master-ip:6443 \
--deploy-mode cluster \
--name algo-vm \
--py-files hdfs://{our_ip}:9000/testdata/src.zip \
--conf spark.executor.instances=2 \
--conf spark.driver.port=10000 \
--conf spark.port.maxRetries=1 \
--conf spark.blockManager.port=20000 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image.pullPolicy=Always \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.container.image={our_ip}/sutpc/k8s-spark-242-entry/spark-py:1.0 \
--jars hdfs://hdfs-master-ip:9000/jar/spark-sql-kafka-0-10_2.11-2.4.5.jar,hdfs://{our_ip}:9000/jar/kafka-clients-0.11.0.2.jar \
hdfs://{our_ip}:9000/testdata/spark_main.py
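As a first diagnostic step (a sketch assuming kubectl access to the cluster; the namespace and pod name are placeholders, not values from the post), the driver pod's events and logs normally show both why the spark-conf-volume mount failed and whether the src.zip passed via --py-files was actually fetched from HDFS.
# Sketch only: <namespace> and <driver-pod> are placeholders for this cluster.
kubectl -n <namespace> get pods                     # locate the driver pod
kubectl -n <namespace> describe pod <driver-pod>    # Events section lists MountVolume failures
kubectl -n <namespace> logs <driver-pod>            # may show whether src.zip was resolved from HDFS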

How to fix "Connection refused error" when running a cluster mode spark job

I am running the terasort benchmark with Spark on a university cluster that uses the SLURM job management system. It works fine when I use --master local[8]; however, when I set the master to my current node I get a connection refused error.
I run this command to launch the app locally without a problem:
> spark-submit \
--class com.github.ehiggs.spark.terasort.TeraGen \
--master local[8] \
target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 1g \
data/terasort_in
When I use cluster mode I get the following error:
> spark-submit \
--class com.github.ehiggs.spark.terasort.TeraGen \
--master spark://iris-055:7077 \ #name of the cluster-node in use
--deploy-mode cluster \
--executor-memory 20G \
--total-executor-cores 24 \
target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 5g \
data/terasort_in
Output:
WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: Exception thrown in awaitResult:
at
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at
.
.
./*many lines of timeout logs etc.*/
.
.
.
Caused by: java.net.ConnectException: Connection refused
... 11 more
I expect the command to run smoothly and terminate, but I cannot get past this connection error.
The problem could be that the --conf variables are not defined. This could work:
spark-submit \
--class com.github.ehiggs.spark.terasort.TeraGen \
--master spark://iris-055:7077 \
--conf spark.driver.memory=4g \
--conf spark.executor.memory=20g \
--executor-memory 20g \
--total-executor-cores 24 \
target/spark-terasort-1.1-SNAPSHOT-jar-with-dependencies.jar 5g \
data/terasort_in

spark throws java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition

When I use the spark-submit command in a Cloudera YARN environment, I get this exception:
java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethods(Class.java:1975)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.com$fasterxml$jackson$module$scala$introspect$BeanIntrospector$$listMethods$1(BeanIntrospector.scala:93)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findMethod$1(BeanIntrospector.scala:99)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.com$fasterxml$jackson$module$scala$introspect$BeanIntrospector$$findGetter$1(BeanIntrospector.scala:124)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3$$anonfun$apply$5.apply(BeanIntrospector.scala:177)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3$$anonfun$apply$5.apply(BeanIntrospector.scala:173)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3.apply(BeanIntrospector.scala:173)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3.apply(BeanIntrospector.scala:172)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
...
The spark-submit command looks like this:
spark-submit --master yarn-cluster \
--num-executors $2 \
--executor-cores $3 \
--class "APP" \
--deploy-mode cluster \
--properties-file $1 \
--files $HDFS_PATH/log4j.properties,$HDFS_PATH/metrics.properties \
--conf spark.metrics.conf=metrics.properties \
APP.jar
Note that TopicAndPartition.class is in the shaded APP.jar.
Please try adding the Kafka jar using the --jars option as shown in the example below:
spark-submit --master yarn-cluster \
--num-executors $2 \
--executor-cores $3 \
--class "APP" \
--deploy-mode cluster \
--properties-file $1 \
--jars /path/to/kafka.jar \
--files $HDFS_PATH/log4j.properties,$HDFS_PATH/metrics.properties \
--conf spark.metrics.conf=metrics.properties \
APP.jar
After trying a few approaches, it turns out that the issue is caused by a version incompatibility. As #user1050619 said, make sure the versions of Kafka, Spark, ZooKeeper and Scala are compatible with each other.
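One quick way to sanity-check this (a sketch, assuming the shaded APP.jar is available on the local filesystem) is to inspect the jar and confirm both that TopicAndPartition is really bundled and which Kafka classes ended up inside it:
# Sketch only: adjust the path to wherever your build writes APP.jar.
jar tf APP.jar | grep 'kafka/common/TopicAndPartition'   # is the class really shaded in?
unzip -l APP.jar 'org/apache/kafka/*' | head             # which kafka-clients packages are bundled
unzip -p APP.jar META-INF/MANIFEST.MF                    # the manifest may record the bundled versions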
