Unable to write data using spark-submit - apache-spark

When I run spark-submit with the following command on Cloudera:
time spark-submit \
--deploy-mode client \
--conf spark.app.name='XXXxxxxxx' \
--conf spark.master=local[*] \
--conf spark.driver.memory=20g \
--conf spark.driver.cores=2 \
--conf spark.executor.instances=4 \
--conf spark.executor.memory=20g \
--conf spark.executor.cores=7 \
--conf spark.dynamicAllocation.enabled=True \
--conf spark.dynamicAllocation.minExecutors=0 \
--conf spark.dynamicAllocation.maxExecutors=5 \
--conf spark.dynamicAllocation.executorAllocationRatio=0.5 \
--conf spark.local.dir=/data/xxx/ssss/spark_local/ \
--py-files test.py \
--files test_ed.ini \
test_py.py
I am getting this in my output log file:
Caused by: java.io.FileNotFoundException: /xxx/xxxxx/xxx/xxx/spark_local (Too many open files)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:105)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:118)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:245)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:158)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I have tried tuning the shuffle partitions, and this job then succeeds, but my downstream job fails after that.
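A minimal sketch of why the shuffle partition setting matters here, assuming the failure comes from the shuffle write path in the stack trace above: BypassMergeSortShuffleWriter opens one file per shuffle partition per task under spark.local.dir, so lowering spark.sql.shuffle.partitions (or raising the OS open-file limit) reduces how many files are open at once. The values below are illustrative only, not the original job's settings.

from pyspark.sql import SparkSession

# Sketch only: illustrative values, not the original job's configuration.
spark = (
    SparkSession.builder
    .appName("XXXxxxxxx")
    # Fewer shuffle partitions -> fewer shuffle files opened simultaneously
    # per task by BypassMergeSortShuffleWriter.
    .config("spark.sql.shuffle.partitions", "100")  # hypothetical value, tune for your data
    .config("spark.local.dir", "/data/xxx/ssss/spark_local/")
    .getOrCreate()
)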

Related

PySpark -Streaming- java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer

I'm submitting a PySpark Streaming job in a Kubernetes environment. The job consumes data from Kafka and processes it using PySpark.
Spark version: 3.2.1
Apache Kafka version: 2.4
I submit using the below spark-submit command:
/opt/spark/bin/spark-submit \
--master k8s://https://test.containers.cloud.ibm.com:35000 \
--deploy-mode cluster \
--name spark-streaming-su \
--conf spark.kubernetes.driver.pod.name=spark-streaming-driver-su \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kubernetes.namespace=test-spark \
--conf spark.kubernetes.file.upload.path=/code-volume/upload_path \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.authenticate.submission.caCertFile=/etc/spark-secret/ca.crt \
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/etc/spark-secret/token \
--conf spark.kubernetes.driver.limit.cores=2 \
--conf spark.driver.memory=2g \
--conf spark.sql.shuffle.partitions=4 \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=2g \
--conf spark.kubernetes.executor.limit.cores=1 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.container.image=us.ic.io/test/spark-test:v5 \
--conf spark.kubernetes.container.image.pullSecrets=test-us-icr-io \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.options.claimName=pvc-code \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.mount.path=/code-volume \
/code-volume/test/test_streaming.py
error:
Error occured due to : An error occurred while calling o60.load. :
java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:599)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.org$apache$spark$sql$kafka010$KafkaSourceProvider$$validateStreamOptions(KafkaSourceProvider.scala:338)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:71)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:236)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:118)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:118)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:34)
at org.apache.spark.sql.streaming.DataStreamReader.loadInternal(DataStreamReader.scala:167)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:143)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArraySerializer
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 22 more
I tried the following additions to the spark-submit command, but none of them worked.
Trial 1)
--jars "/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.2.1.jar,/opt/spark/jars/kafka-clients-2.4.0.jar,/opt/spark/jars/spark-streaming-kafka-0-10_2.12-3.2.1.jar"
Trial 2)
--conf "spark.driver.extraClassPath=/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.2.1.jar:/opt/spark/jars/kafka-clients-2.4.0.jar:/opt/spark/jars/spark-streaming-kafka-0-10_2.12-3.2.1.jar" \
--conf "spark.executor.extraClassPath=/opt/spark/jars/spark-sql-kafka-0-10_2.12-3.2.1.jar:/opt/spark/jars/kafka-clients-2.4.0.jar:/opt/spark/jars/spark-streaming-kafka-0-10_2.12-3.2.1.jar" \
Trial 3)
--packages org.apache.spark:spark-streaming-kafka-0-10_2.12:3.2.1
I downloaded the elastic jars, spark-sql-kafka-0-10_2.12-3.2.1.jar, and its dependencies, and made them available in "/code-volume/extrajars/".
Note: as per OneCricketeer's comment, the options below should be used together.
--jars "/code-volume/extrajars/*" \
--conf spark.driver.extraClassPath=/code-volume/extrajars/* \
--conf spark.executor.extraClassPath=/code-volume/extrajars/* \
Working spark-submit command:
/opt/spark/bin/spark-submit \
--master k8s://https://test.containers.cloud.ibm.com:35000 \
--deploy-mode cluster \
--name spark-streaming-su \
--conf spark.kubernetes.driver.pod.name=spark-streaming-driver-su \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kubernetes.namespace=test-spark \
--conf spark.kubernetes.file.upload.path=/code-volume/upload_path \
--jars "/code-volume/extrajars/*" \
--conf spark.driver.extraClassPath=/code-volume/extrajars/* \
--conf spark.executor.extraClassPath=/code-volume/extrajars/* \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.authenticate.submission.caCertFile=/etc/spark-secret/ca.crt \
--conf spark.kubernetes.authenticate.submission.oauthTokenFile=/etc/spark-secret/token \
--conf spark.kubernetes.driver.limit.cores=2 \
--conf spark.driver.memory=2g \
--conf spark.sql.shuffle.partitions=4 \
--conf spark.executor.instances=2 \
--conf spark.executor.memory=2g \
--conf spark.kubernetes.executor.limit.cores=1 \
--conf spark.kubernetes.pyspark.pythonVersion=3 \
--conf spark.kubernetes.container.image=us.ic.io/test/spark-test:v5 \
--conf spark.kubernetes.container.image.pullSecrets=test-us-icr-io \
--conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.options.claimName=pvc-code \
--conf spark.kubernetes.driver.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.mount.path=/code-volume \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.options.claimName=pvc-code \
--conf spark.kubernetes.executor.volumes.persistentVolumeClaim.pvc-567yun-4b67-389u-9cfg1-gtabd234567.mount.path=/code-volume \
/code-volume/test/test_streaming.py
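For reference, a minimal sketch of the kind of streaming read that triggers the ByteArraySerializer lookup; the broker address and topic name are placeholders, not taken from the original test_streaming.py. The .format("kafka") ... .load() call is what requires kafka-clients and spark-sql-kafka on both the driver and executor classpaths.

from pyspark.sql import SparkSession

# Sketch only: placeholder broker and topic.
spark = SparkSession.builder.appName("spark-streaming-su").getOrCreate()

df = (
    spark.readStream
    .format("kafka")                                   # resolved by spark-sql-kafka-0-10
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker address
    .option("subscribe", "test-topic")                 # hypothetical topic name
    .load()                                            # the o60.load call from the error above
)

query = (
    df.selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("console")
    .start()
)
query.awaitTermination()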

How to fix "BlockManagerMasterEndpoint - No more replicas available for rdd" issue?

I am using Spark 2.4.1 and Java 8 to copy data into Cassandra 3.0.
My spark-submit script is:
$SPARK_HOME/bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--name MyDriver \
--jars "/local/jars/*.jar" \
--files hdfs://files/application-cloud-dev.properties,hdfs://files/column_family_condition.yml \
--class com.sp.MyDriver \
--executor-cores 3 \
--executor-memory 9g \
--num-executors 5 \
--driver-cores 2 \
--driver-memory 4g \
--driver-java-options -Dconfig.file=./application-cloud-dev.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./application-cloud-dev.properties \
--conf spark.driver.extraClassPath=. \
--driver-class-path . \
ca-datamigration-0.0.1.jar application-cloud-dev.properties
Though the job succeeds, my whole log file is filled with the WARN messages below:
WARN org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_558_5026 !
2019-09-20 00:02:37,882 [dispatcher-event-loop-1] WARN org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_558_5367 !
2019-09-20 00:02:37,882 [dispatcher-event-loop-1] WARN org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_571_1745 !
org.apache.spark.network.server.TransportChannelHandler - Exception in connection from /10.24.96.88:58602
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
What is wrong here? How should I fix it?

Spark 2.1.1 with typesafeconfig

I'm trying to support an external configuration file for my Spark application using Typesafe Config.
I'm loading the application.conf file in my application code like this (driver):
val config = ConfigFactory.load()
val myProp = config.getString("app.property")
val df = spark.read.avro(myProp)
application.conf looks like this:
app.property="some value"
spark-submit execution looks like this:
spark-submit
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
--driver-class-path $HOME/conf/*.conf \
--files $HOME/conf/application.conf \
my-app-0.0.1-SNAPSHOT.jar
It doesn't seem to work, and I'm getting:
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'app'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:147)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
at com.typesafe.config.impl.SimpleConfig.getString(SimpleConfig.java:206)
at com.paypal.cfs.fpti.Main$.main(Main.scala:42)
at com.paypal.cfs.fpti.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:750)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Looking at the logs I do see that "--files" works, so it seems like a classpath issue...
18/03/13 01:08:30 INFO SparkContext: Added file file:/home/user/conf/application.conf at file:/home/user/conf/application.conf with timestamp 1520928510820
18/03/13 01:08:30 INFO Utils: Copying /home/user/conf/application.conf to /tmp/spark-2938fde1-fa4a-47af-8dc6-1c54b5e89d48/userFiles-c2cec57f-18c8-491d-8679-df7e7da45e05/application.conf
Turns out I was pretty close to the answer to begin with... here is how it worked for me:
spark-submit \
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
--driver-class-path $APP_HOME/conf \
--files $APP_HOME/conf/application.conf \
$APP_HOME/my-app-0.0.1-SNAPSHOT.jar
then $APP_HOME will contain the below:
conf/application.conf
my-app-0.0.1-SNAPSHOT.jar
I guess you need to make sure the application.conf is placed inside a folder, that is the trick.
In order to specify the config file path, you may pass it as an application argument, and then read it from the args variable of the main class.
This is how you would execute the spark-submit command. Note that I've specified the config file after the application jar.
spark-submit
--class com.myapp.Main \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=56 \
--conf spark.dynamicAllocation.maxExecutors=1000 \
my-app-0.0.1-SNAPSHOT.jar $HOME/conf/application.conf
And then, load the config file from the path specified in args(0):
import java.io.File
import com.typesafe.config.ConfigFactory
[...]
val dbconfig = ConfigFactory.parseFile(new File(args(0)))
Now you have access to the properties of your application.conf file.
val myProp = dbconfig.getString("app.property")
Hope it helps.

java.lang.ClassNotFoundException: org.apache.spark.deploy.kubernetes.submit.Client

I am running a sample Spark job in a Kubernetes cluster with the following command:
bin/spark-submit \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
--master k8s://https://XXXXX \
--kubernetes-namespace sidartha-spark-cluster \
--conf spark.executor.instances=2 \
--conf spark.app.name=spark-pi \
--conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.1.0-kubernetes-0.1.0-rc1 \
--conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.1.0-kubernetes-0.1.0-rc1 \
examples/jars/spark-examples_2.11-2.1.0-k8s-0.1.0-SNAPSHOT.jar 1000
I am building Spark from apache-spark-on-k8s.
I am not able to find the jar containing the org.apache.spark.deploy.kubernetes.submit.Client class.
This issue is resolved. We need to build spark/resource-managers/kubernetes from source.

spark throws java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition

When I use the spark-submit command in a Cloudera YARN environment, I get this exception:
java.lang.NoClassDefFoundError: kafka/common/TopicAndPartition
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.getDeclaredMethods(Class.java:1975)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.com$fasterxml$jackson$module$scala$introspect$BeanIntrospector$$listMethods$1(BeanIntrospector.scala:93)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.findMethod$1(BeanIntrospector.scala:99)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.com$fasterxml$jackson$module$scala$introspect$BeanIntrospector$$findGetter$1(BeanIntrospector.scala:124)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3$$anonfun$apply$5.apply(BeanIntrospector.scala:177)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3$$anonfun$apply$5.apply(BeanIntrospector.scala:173)
at scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3.apply(BeanIntrospector.scala:173)
at com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$$anonfun$3.apply(BeanIntrospector.scala:172)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
...
The spark-submit command looks like this:
spark-submit --master yarn-cluster \
--num-executors $2 \
--executor-cores $3 \
--class "APP" \
--deploy-mode cluster \
--properties-file $1 \
--files $HDFS_PATH/log4j.properties,$HDFS_PATH/metrics.properties \
--conf spark.metrics.conf=metrics.properties \
APP.jar
Note that TopicAndPartition.class is included in the shaded APP.jar.
Please try adding the Kafka jar using the --jars option as shown in the example below:
spark-submit --master yarn-cluster \
--num-executors $2 \
--executor-cores $3 \
--class "APP" \
--deploy-mode cluster \
--properties-file $1 \
--jars /path/to/kafka.jar \
--files $HDFS_PATH/log4j.properties,$HDFS_PATH/metrics.properties \
--conf spark.metrics.conf=metrics.properties \
APP.jar
After trying several approaches, it turns out that the issue is caused by a version incompatibility. As #user1050619 said, make sure the versions of Kafka, Spark, ZooKeeper, and Scala are compatible with each other.
