Apache Beam Issue with Spark Runner while using Kafka IO - apache-spark

I am trying to test KafkaIO in my Apache Beam code with the Spark runner.
The code works fine with the Direct runner.
However, if I add the line below, it throws an error:
options.setRunner(SparkRunner.class);
Error:
ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 2.0 (TID 0)
java.lang.StackOverflowError
at java.base/java.io.ObjectInputStream$BlockDataInputStream.readByte(ObjectInputStream.java:3307)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2135)
at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1668)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:482)
at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:440)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:488)
at jdk.internal.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
Versions that I am trying to use:
<beam.version>2.33.0</beam.version>
<spark.version>3.1.2</spark.version>
<kafka.version>3.0.0</kafka.version>

This issue is resolved by adding the VM argument -Xss2M, which increases the JVM thread stack size so the deeply nested deserialization no longer overflows the stack.
This link helped me solve the issue:
https://github.com/eclipse-openj9/openj9/issues/10370
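For reference, if the same pipeline is later submitted to a real Spark cluster instead of being run with an embedded local master, the equivalent stack-size setting can be passed through spark-submit. A minimal sketch, assuming a standard spark-submit launch (the main class and jar name are placeholders for your own pipeline):
spark-submit --driver-java-options "-Xss2M" --conf spark.executor.extraJavaOptions=-Xss2M --class com.example.MyBeamPipeline my-beam-pipeline.jar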

Related

Why don't Spark jobs work on Zeppelin while they work when using the pyspark shell

I am trying to execute the following code on Zeppelin:
df = spark.read.csv('/path/to/csv')
df.show(3)
but I get the following error:
Py4JJavaError: An error occurred while calling o786.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 39.0 failed 4 times, most recent failure: Lost task 5.3 in stage 39.0 (TID 326, 172.16.23.92, executor 0): java.io.InvalidClassException: org.apache.commons.lang3.time.FastDateParser; local class incompatible: stream classdesc serialVersionUID = 2, local class serialVersionUID = 3
I have Hadoop 2.7.3 running on a 2-node cluster, Spark 2.3.2 running in standalone mode, and Zeppelin 0.8.1. This problem only occurs when using Zeppelin,
and I have SPARK_HOME set in the Zeppelin configuration.
I solved it. The problem was that Zeppelin was using commons-lang3-3.5.jar while Spark was using commons-lang-2.6.jar, so all I did was add the jar path to the Zeppelin configuration in the interpreter menu:
1-Click the 'Interpreter' menu in the navigation bar.
2-Click the 'edit' button of the interpreter to which you want to load dependencies.
3-Fill in the artifact and exclude fields as needed. Add the path to the respective jar file.
4-Press 'Save' to restart the interpreter with the loaded libraries.
Zeppelin is using its commons-lang2 jar to stream to the Spark executors while Spark locally is using commons-lang3. As Achref mentioned, just fill in the artifact location of commons-lang3 and restart the interpreter; then you should be good.
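For example, the artifact field takes a Maven coordinate, so for the case above you would enter something like org.apache.commons:commons-lang3:3.5 (the version matching the jar mentioned above; adjust it to whatever your Spark installation actually ships).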

Spark on Yarn Container Failure

For reference: I solved this issue by adding Netty 4.1.17 to hadoop/share/hadoop/common.
No matter what jar I try to run (including the example from https://spark.apache.org/docs/latest/running-on-yarn.html), I keep getting an error about container failure when running Spark on YARN. I get this error in the command prompt:
Diagnostics: Exception from container-launch.
Container id: container_1530118456145_0001_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I look at the logs, I then find this error:
Exception in thread "main" java.lang.NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.metric()Lio/netty/buffer/PooledByteBufAllocatorMetric;
at org.apache.spark.network.util.NettyMemoryMetrics.registerMetrics(NettyMemoryMetrics.java:80)
at org.apache.spark.network.util.NettyMemoryMetrics.<init>(NettyMemoryMetrics.java:76)
at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:109)
at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:71)
at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:461)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:57)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:530)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:347)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1758)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:869)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Any idea why this is happening? This is running on a pseudo-distributed cluster set up according to this tutorial: https://wiki.apache.org/hadoop/Hadoop2OnWindows. Spark runs fine locally, and seeing as this jar was provided with Spark, I doubt it's a problem within the jar. (Regardless, I added a Netty dependency inside another jar and I'm still getting the same error).
The only thing set in my spark-defaults.conf is spark.yarn.jars, which points to a hdfs directory where I uploaded all of Spark's jars. io.netty.buffer.PooledByteBufAllocator is contained within these jars.
Spark 2.3.1, Hadoop 2.7.6
I had exactly the same issue. Previously I used Hadoop 2.6.5 and the compatible Spark version, and things worked fine. When I switched to Hadoop 2.7.6, the problem occurred. I am not sure of the cause, but I copied the netty-4.1.17.Final jar file to the Hadoop library folder and the problem went away.
It seems like you have multiple Netty versions on your classpath. Run:
mvn clean compile
Remove all the old versions and add the latest one.
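One quick way to confirm which Netty jar is actually being picked up at runtime is to print where the conflicting class was loaded from; a small diagnostic sketch you can drop into driver-side code or a test:
// Diagnostic: print which jar PooledByteBufAllocator is loaded from
System.out.println(io.netty.buffer.PooledByteBufAllocator.class
        .getProtectionDomain().getCodeSource().getLocation());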
This may be a version-compatibility problem between your YARN and Spark. Check that the installed versions are compatible.
I strongly suggest reading more about NoSuchMethodError and similar exceptions like NoClassDefFoundError and ClassNotFoundException. The reason for this suggestion is that when you start using Spark in different situations, these are the most confusing errors and exceptions for people who are not so experienced. NoSuchMethodError
Of course, being careful is the best-practice strategy for any programmer, especially those working on distributed systems like Spark. Well done. ;)

spark error: in state DEFINE instead of RUNNING

I'm using spark-shell to run a Spark HBase script.
When I run this command:
val job = Job.getInstance(conf)
I got this error
java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
The error is due to running this in spark-shell: the REPL tries to print the returned Job by calling toString(), and Hadoop's Job.toString() requires the job to be in the RUNNING state, so it throws while the job is still in DEFINE state. Please use spark-submit instead; this should solve your problem.
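A minimal invocation would look something like this (the class and jar names are placeholders for your own job):
spark-submit --class com.example.HBaseJob my-hbase-job.jar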

AWS EMR no host: hdfs:///var/log/spark/apps

I am trying to use AWS EMR (emr-4.3.0) with Spark 1.6.0 and Hadoop 2.7.0.
I created an EMR cluster and added a step (in the AWS EMR web console) with my sample jar.
It's a Spring Boot application written in Java 8 (I installed JDK 8 on the box).
It runs with the following command:
hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class org.springframework.boot.loader.JarLauncher s3://my-test/SparkForSpring-S1.2014.jar
I created the SparkContext with the following code:
SparkConf conf = new SparkConf().setAppName("SparkForSpring");
return new JavaSparkContext(conf);
But it fails with the following error. I feel like it's not related to my application, though I am new to Spark and YARN.
Caused by: org.springframework.beans.factory.BeanDefinitionStoreException: Factory method [public org.apache.spark.api.java.JavaSparkContext com.pivotal.demo.spark.rocket.rdd.SparkConfig.javaSparkContext()] threw exception; nested exception is java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:188)
at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:586)
... 49 more
Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///var/log/spark/apps
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1650)
at org.apache.spark.scheduler.EventLoggingListener.<init>(EventLoggingListener.scala:66)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:547)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig.javaSparkContext(SparkConfig.java:35)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b.CGLIB$javaSparkContext$0(<generated>)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b$$FastClassBySpringCGLIB$$10b15a77.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228)
at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:312)
at com.pivotal.demo.spark.rocket.rdd.SparkConfig$$EnhancerBySpringCGLIB$$82429e1b.javaSparkContext(<generated>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:166)
... 50 more
I read some documentation but am not quite sure what I should do to fix this error. A hint would be greatly helpful.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-file-systems.html
I solved this problem by not using Spring Boot's executable jar; instead, I used the Maven Shade plugin to package only the Spring-related jar files into one jar and used the system classloader. Here is the full pom.xml.
I got a hint from this question's answer:
apache-spark 1.3.0 and yarn integration and spring-boot as a container
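If you run into the same "Incomplete HDFS URI, no host" error, a small diagnostic sketch like the one below can help confirm whether the Hadoop configuration (in particular fs.defaultFS) is visible to your application; my assumption, in line with the fix above, is that the Spring Boot executable-jar packaging was hiding it. The class name is arbitrary:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsCheck {
    public static void main(String[] args) throws Exception {
        // On EMR this should print a fully qualified URI such as
        // hdfs://<namenode-host>:8020, not a bare hdfs:/// with no host.
        Configuration hadoopConf = new Configuration();
        System.out.println("fs.defaultFS = " + hadoopConf.get("fs.defaultFS"));
        System.out.println("default FS   = " + FileSystem.get(hadoopConf).getUri());
    }
}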

Zeppelin Pyspark on HDP 2.3 giving error

I am trying to configure Zeppelin to work with HDP 2.3 (Spark 1.3). I have successfully installed Zeppelin via Ambari, and the Zeppelin service is running.
But when I try to run any %pyspark command, I get the error below.
I read a few blogs, but it seems like there is some issue with jars compiled on Java 6 versus Java 7 being shared between Python and Spark.
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, sandbox.hortonworks.com): org.apache.spark.SparkException:
Error from python worker:
/usr/bin/python: No module named pyspark
PYTHONPATH was:
/opt/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.\n', JavaObject id=o68), <traceback object at 0x2618bd8>)
Can you check in your zeppelin-env.sh whether you have the line below?
export PYTHONPATH=${SPARK_HOME}/python
If missing, this can be added via Ambari under Zeppelin > Configs > Advanced zeppelin-env > zeppelin-env template.
Although, if you installed using the latest version of the Ambari service for Zeppelin, it should have done this for you:
https://github.com/hortonworks-gallery/ambari-zeppelin-service/blob/master/configuration/zeppelin-env.xml#L63
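Note that the working PYTHONPATH shown further down also includes the bundled pyspark.zip and py4j zip, so if ${SPARK_HOME}/python alone is not enough on your install, an export along these lines (paths illustrative, mirroring that output) may be what you need:
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/lib/pyspark.zip:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip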
I just set up a fresh HDP 2.3 install (2.3.0.0-2557) on CentOS 6.5 using Ambari 2.1 and installed Zeppelin using the Ambari Zeppelin service (with default configs). Pyspark seems to work fine for me.
Based on your error, it sounds like PYTHONPATH is not getting set to the correct value:
PYTHONPATH was:
/opt/incubator-zeppelin/interpreter/spark/zeppelin-spark-0.6.0-incubating-SNAPSHOT.jar
In Zeppelin, can you enter the following in a cell, run it, and provide the output?
System.getenv().get("MASTER")
System.getenv().get("SPARK_YARN_JAR")
System.getenv().get("HADOOP_CONF_DIR")
System.getenv().get("JAVA_HOME")
System.getenv().get("SPARK_HOME")
System.getenv().get("PYSPARK_PYTHON")
System.getenv().get("PYTHONPATH")
System.getenv().get("ZEPPELIN_JAVA_OPTS")
Here is the output on my setup:
res41: String = yarn-client
res42: String = hdfs:///apps/zeppelin/zeppelin-spark-0.6.0-SNAPSHOT.jar
res43: String = /etc/hadoop/conf
res44: String = /usr/java/default
res45: String = /usr/hdp/current/spark-client/
res46: String = null
res47: String = /usr/hdp/current/spark-client//python:/usr/hdp/current/spark-client//python/lib/pyspark.zip:/usr/hdp/current/spark-client//python/lib/py4j-0.8.2.1-src.zip
res48: String = -Dhdp.version=2.3.0.0-2557 -Dspark.executor.memory=512m -Dspark.yarn.queue=default
