Zeppelin+Spark+Cassandra: Spark doesn't work - apache-spark

I watched a nice YouTube video about Zeppelin + Spark + Cassandra and I'm trying to reproduce it. OS: Windows 10.
I ran Zeppelin as a Docker image and set up the options for the Cassandra interpreter; that works.
Now I'm trying to set up Spark, and I can't. I installed spark-3.0.1-bin-hadoop2.7 (the folder is named spark-3.0.1-bin-hadoop2.7, that part is fine), and spark-shell works from cmd. What do I have to do with the spark-cassandra-connector, and which options do I have to set for the Spark interpreter? Thanks.
org.apache.zeppelin.interpreter.InterpreterException: java.io.IOException: Fail to detect scala version, the reason is:Cannot run program "C:/bin/spark-3.3.1-bin-hadoop3/bin/spark-submit": error=2, No such file or directory
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:129)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:271)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:438)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:69)
at org.apache.zeppelin.scheduler.Job.run(Job.java:172)
at org.apache.zeppelin.scheduler.AbstractScheduler.runJob(AbstractScheduler.java:132)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:182)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Fail to detect scala version, the reason is:Cannot run program "C:/bin/spark-3.3.1-bin-hadoop3/bin/spark-submit": error=2, No such file or directory
at org.apache.zeppelin.interpreter.launcher.SparkInterpreterLauncher.buildEnvFromProperties(SparkInterpreterLauncher.java:127)
at org.apache.zeppelin.interpreter.launcher.StandardInterpreterLauncher.launchDirectly(StandardInterpreterLauncher.java:77)
at org.apache.zeppelin.interpreter.launcher.InterpreterLauncher.launch(InterpreterLauncher.java:110)
at org.apache.zeppelin.interpreter.InterpreterSetting.createInterpreterProcess(InterpreterSetting.java:856)
at org.apache.zeppelin.interpreter.ManagedInterpreterGroup.getOrCreateInterpreterProcess(ManagedInterpreterGroup.java:66)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getOrCreateInterpreterProcess(RemoteInterpreter.java:104)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.internal_create(RemoteInterpreter.java:154)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:126)
... 13 more

OK guys, here we go:
1. Install Spark on Windows 10; there are many tutorials on the internet. My version is 3.0.1.
2. Download the Docker image with Zeppelin.
3. In the image settings, set the path to the folder with Spark and map port 8080, then launch it at http://localhost:8080/.
4. In the Spark interpreter settings, set SPARK_HOME as in step 3 and set spark.jars.packages = com.datastax.spark:spark-cassandra-connector_2.12:3.0.1. Add the Cassandra settings: spark.cassandra.connection.host, spark.cassandra.auth.username, spark.cassandra.auth.password.
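For reference, once the interpreter restarts with those properties, a paragraph like this minimal sketch should be able to read a table through the connector (the keyspace and table names are hypothetical, not from the original post):
%spark
// Read a Cassandra table via the spark-cassandra-connector data source
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "my_keyspace") // hypothetical keyspace
  .option("table", "my_table")       // hypothetical table
  .load()
df.show()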
You're welcome.

Related

Cannot run Livy in SparkR mode on Zeppelin 0.8

We've installed the Zeppelin that ships as one of the HDP 3.1.5 services and installed a suitable R version (from the RHEL 7 repos) that is compatible with that Zeppelin version (0.8).
At first, when we write and run Spark code using the Spark interpreter, especially the SparkR one, it runs well, but when we change the interpreter to Livy we always get an error with the message "Fail to start interpreter".
When we check the YARN UI, the application is created and running, but when we check the Spark UI we see this stderr:
22/07/28 10:21:17 WARN SparkRInterpreter$: Fail to init Spark RBackend, using different method signature
java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.livy.repl.SparkRInterpreter$$anon$1.run(SparkRInterpreter.scala:88)
22/07/28 10:21:17 WARN Session: Fail to start interpreter sparkr
java.io.IOException: Cannot run program "R": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.livy.repl.SparkRInterpreter$.apply(SparkRInterpreter.scala:143)
at org.apache.livy.repl.Session.liftedTree1$1(Session.scala:107)
at org.apache.livy.repl.Session.interpreter(Session.scala:98)
at org.apache.livy.repl.Session$$anonfun$execute$1.apply$mcV$sp(Session.scala:168)
at org.apache.livy.repl.Session$$anonfun$execute$1.apply(Session.scala:163)
at org.apache.livy.repl.Session$$anonfun$execute$1.apply(Session.scala:163)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 11 more
Is there any step/configuration that we skipped?
-Thanks-

PySpark job failing on YARN

I am trying to submit a PySpark job from the YARN client and am getting the error below from the RM, without any further logs.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
ENOENT: No such file or directory
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmodImpl(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$POSIX.chmod(NativeIO.java:231)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:773)
at org.apache.hadoop.fs.DelegateToFileSystem.setPermission(DelegateToFileSystem.java:218)
at org.apache.hadoop.fs.FilterFs.setPermission(FilterFs.java:266)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1008)
at org.apache.hadoop.fs.FileContext$11.next(FileContext.java:1004)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.setPermission(FileContext.java:1011)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:483)
at org.apache.hadoop.yarn.util.FSDownload$3.run(FSDownload.java:481)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.yarn.util.FSDownload.changePermissions(FSDownload.java:481)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:419)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:242)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:235)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:223)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
For more detailed output, check the application tracking page: https://.com:8090/cluster/app/application_1638972290118_64750
Then click on links to logs of each attempt. Failing the application.
The cluster is fine and other PySpark jobs run fine.
Please help. Thanks in advance.
What do you mean by "cluster is fine and other pyspark jobs running fine"?
Did you run them on YARN, or just in standalone mode?
Either way, I think it's better to check your YARN cluster first to see whether it works (without Spark).
You can do that using the Hadoop MapReduce examples:
yarn jar $HadoopDir/share/hadoop/mapreduce/hadoop-mapreduce-examples-$version.jar wordcount inputFilePath OutputDir
Check link 1 and link 2 too. They may help.
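If that MapReduce job succeeds, a minimal Spark-on-YARN smoke test is a sketch like the following; it assumes the stock pi example that ships in the Spark distribution and a SPARK_HOME variable pointing at it:
# Submit the bundled pi example to YARN in cluster mode
spark-submit --master yarn --deploy-mode cluster $SPARK_HOME/examples/src/main/python/pi.py 10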

Spark on Yarn Container Failure

For reference: I solved this issue by adding Netty 4.1.17 to hadoop/share/hadoop/common.
No matter what jar I try and run (including the example from https://spark.apache.org/docs/latest/running-on-yarn.html), I keep getting an error regarding container failure when running Spark on Yarn. I get this error in the command prompt:
Diagnostics: Exception from container-launch.
Container id: container_1530118456145_0001_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
When I look at the logs, I then find this error:
Exception in thread "main" java.lang.NoSuchMethodError: io.netty.buffer.PooledByteBufAllocator.metric()Lio/netty/buffer/PooledByteBufAllocatorMetric;
at org.apache.spark.network.util.NettyMemoryMetrics.registerMetrics(NettyMemoryMetrics.java:80)
at org.apache.spark.network.util.NettyMemoryMetrics.<init>(NettyMemoryMetrics.java:76)
at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:109)
at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:99)
at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:71)
at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:461)
at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:57)
at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:530)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:347)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1758)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:814)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:839)
at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:869)
at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
Any idea why this is happening? This is running on a pseudo-distributed cluster set up according to this tutorial: https://wiki.apache.org/hadoop/Hadoop2OnWindows. Spark runs fine locally, and seeing as this jar was provided with Spark, I doubt it's a problem within the jar. (Regardless, I added a Netty dependency inside another jar and I'm still getting the same error).
The only thing set in my spark-defaults.conf is spark.yarn.jars, which points to a hdfs directory where I uploaded all of Spark's jars. io.netty.buffer.PooledByteBufAllocator is contained within these jars.
Spark 2.3.1, Hadoop 2.7.6
I had exactly the same issue. Previously I used Hadoop 2.6.5 and the compatible Spark version, and things worked out fine. When I switched to Hadoop 2.7.6, the problem occurred. I'm not sure of the cause, but I copied the netty-4.1.17.Final jar file to the Hadoop library folder and the problem went away.
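A minimal sketch of that workaround, assuming the jar is named netty-all-4.1.17.Final.jar and HADOOP_HOME points at the Hadoop install (both names are illustrative, not from the original answer):
# Drop the newer Netty jar where Hadoop's common classpath picks it up
cp netty-all-4.1.17.Final.jar $HADOOP_HOME/share/hadoop/common/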
It seems like you have multiple Netty versions on your classpath.
mvn clean compile
Remove them all and add the latest one; one way to find them is sketched below.
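For a Maven build, a quick way to list which Netty artifacts the project pulls in (this uses the standard Maven dependency plugin; nothing project-specific is assumed):
# List every io.netty artifact in the dependency tree
mvn dependency:tree -Dincludes=io.netty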
This may be a version problem between your YARN and Spark; check that the installed versions are compatible.
I strongly suggest reading more about NoSuchMethodError and similar exceptions like NoClassDefFoundError and ClassNotFoundException. The reason for this suggestion is that when you start using Spark in different situations, these are among the most confusing errors and exceptions for people who are not so experienced.
Of course, caring about this is a best-practice strategy for any programmer, especially those working on distributed systems like Spark. Well done. ;)

Apache Zeppelin 0.7.2 and spark-2.1.1-bin-hadoop2.7 on Windows 10 machine

I am new to Zeppelin and am trying to configure Apache Zeppelin to connect with standalone Spark on my local machine. I set the settings below in the zeppelin-env.cmd file and changed the port to 8082 in the zeppelin-site.xml file.
set JAVA_HOME=C:\PROGRA~1\Java\jdk1.8.0_101
set SPARK_HOME=C:\tmp\spark\spark-2.1.1-bin-hadoop2.7
I started the Spark master and a worker as below:
spark-class2.cmd org.apache.spark.deploy.master.Master
spark-class2.cmd org.apache.spark.deploy.worker.Worker spark://192.168.99.1:7077
Then I started Zeppelin with zeppelin.cmd and changed the Spark interpreter settings as follows:
I pointed master at the local Spark master, spark://192.168.99.1:7077, and changed useHiveContext to false; the resulting properties are sketched below.
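A sketch of those two interpreter properties; the master value comes from the question, while the property name zeppelin.spark.useHiveContext is an assumption based on the Spark interpreter's settings page:
master = spark://192.168.99.1:7077
zeppelin.spark.useHiveContext = false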
When I try to run the default out-of-the-box notebook, I get the error below.
org.apache.zeppelin.interpreter.InterpreterException: The filename, directory name, or volume label syntax is incorrect.
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterManagedProcess.start(RemoteInterpreterManagedProcess.java:143)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterProcess.reference(RemoteInterpreterProcess.java:73)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.open(RemoteInterpreter.java:265)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getFormType(RemoteInterpreter.java:430)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getFormType(LazyOpenInterpreter.java:111)
at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:387)
at org.apache.zeppelin.scheduler.Job.run(Job.java:175)
at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:329)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Apache Zeppelin configuration with Spark

I have been trying to configure Apache Zeppelin with Spark 2.0. I managed to install both on a Linux OS, and I set Spark to port 8080 and the Zeppelin server to port 8082.
In Zeppelin's zeppelin-env.sh file I set the SPARK_HOME variable to the location of the Spark folder.
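A sketch of that line (the path is hypothetical, not from the original question):
export SPARK_HOME=/opt/spark-2.0.0-bin-hadoop2.7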
However, when I try to create a new note, nothing compiles properly. It seems I haven't configured the interpreters, as the interpreter tab is missing from the home tab.
Any help would be much appreciated.
EDIT: i.e., when I try to run the Zeppelin tutorial's 'Load data into table' step, I receive the following error:
java.lang.ClassNotFoundException: org.apache.spark.repl.SparkCommandLine
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:400)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:69)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:93)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:341)
at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I don't think it's possible to use Spark 2.0 without building Zeppelin from source, since some relatively big changes happened with this release.
You can clone the Zeppelin git repo and build it with the Spark 2.0 profile, as mentioned in the README on GitHub: https://github.com/apache/zeppelin.
I've tried it and it works.
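A sketch of that build; the -Pspark-2.0 and -Phadoop-2.7 profile names are assumptions based on the README of that era, so check the repo for the exact flags:
# Clone Zeppelin and build it against the Spark 2.0 profile
git clone https://github.com/apache/zeppelin.git
cd zeppelin
mvn clean package -DskipTests -Pspark-2.0 -Phadoop-2.7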
