Apache/Cloudera HUE / Livy Spark Server - InterpreterError: Fail to start interpreter - apache-spark

I'm at a loss at this point. I'm trying to run PySpark/SparkR on Apache HUE 4.3, using Spark 2.4 + Livy Server 0.5.0. I've followed every guide I can find, but I keep running into this issue. Basically, I can run PySpark/SparkR through the command line, but HUE, for some reason, does the following:
Ignores all Spark configuration (executor memory, cores, etc.) that I have set in multiple places (spark-defaults.conf, livy.conf and livy-client.conf)
Successfully creates sessions for both PySpark and SparkR, yet when I try to do anything (even just print(1+1)) I get InterpreterError: Fail to start interpreter
Actually works with Scala on HUE: Scala works, but PySpark and SparkR do not (presumably because Scala runs directly on the JVM)
I can provide any configuration needed. This is driving me absolutely insane.
I also cannot interact with PySpark through the REST API; I get the same InterpreterError. This leads me to believe the problem is with the Livy server rather than HUE.

Figured it out. I was trying to run Spark on YARN in cluster mode; switching to client mode fixed it. There must have been a missing reference/file on the cluster machines.
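For anyone hitting the same wall, the same path can be exercised directly against Livy's REST API from Python. This is a minimal sketch, assuming Livy on its default port 8998 and placeholder conf values, that creates a PySpark session with explicit Spark settings and runs a trivial statement:

import json
import time
import requests

LIVY = "http://localhost:8998"  # assumption: Livy on its default port
headers = {"Content-Type": "application/json"}

# Pass the Spark conf explicitly with the session request, since the values in
# spark-defaults.conf / livy.conf were being ignored in my setup.
payload = {
    "kind": "pyspark",
    "conf": {
        "spark.submit.deployMode": "client",
        "spark.executor.memory": "2g",   # example values, not my real settings
        "spark.executor.cores": "2",
    },
}
r = requests.post(LIVY + "/sessions", data=json.dumps(payload), headers=headers)
session_url = LIVY + "/sessions/" + str(r.json()["id"])

# Wait for the interpreter to come up before submitting a statement.
while requests.get(session_url, headers=headers).json()["state"] != "idle":
    time.sleep(2)

stmt = {"code": "print(1 + 1)"}
resp = requests.post(session_url + "/statements", data=json.dumps(stmt), headers=headers)
print(resp.json())

If the session never reaches the idle state, the session log (GET /sessions/{id}/log) usually shows the same interpreter error that HUE reports.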

Related

Getting "AssertionError("Unknown application type")" when Connecting to DSE 5.1.0 Spark

I am connecting to DSE (Spark) using this:
new SparkConf()
.setAppName(name)
.setMaster("spark://localhost:7077")
With DSE 5.0.8 (Spark 1.6.3) this works fine, but it now fails with DSE 5.1.0 with this error:
java.lang.AssertionError: Unknown application type
at org.apache.spark.deploy.master.DseSparkMaster.registerApplication(DseSparkMaster.scala:88) ~[dse-spark-5.1.0.jar:2.0.2.6]
After checking the use-spark jar, I've come up with this:
if(rpcendpointref instanceof DseAppProxy)
And within Spark, it seems to be an RpcEndpointRef (NettyRpcEndpointRef).
How can I fix this problem?
I had a similar issue, and fixed it by following this:
https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/spark/sparkRemoteCommands.html
Then you need to run your job using dse spark-submit, without specifying any master.
Resource Manager Changes
The DSE Spark Resource Manager is different from the OSS Spark Standalone Resource Manager. The DSE method uses a different URI, "dse://", because under the hood it is actually performing a CQL-based request. This has a number of benefits over the Spark RPC but, as noted, does not match some of the submission mechanisms possible in OSS Spark.
There are several articles on this on the DataStax blog, as well as documentation notes:
Network Security with DSE 5.1 Spark Resource Manager
Process Security with DSE 5.1 Spark Resource Manager
Instructions on the URL Change
Programmatic Spark Jobs
While it is still possible to launch an application using "setJars", you must also add the DSE-specific jars and config options to talk to the resource manager. In DSE 5.1.3+ there is a provided class
DseConfiguration
which can be applied to your SparkConf via DseConfiguration.enableDseSupport(conf) (or invoked via an implicit) and will set these options for you.
Example
Docs
This is of course for advanced users only and we strongly recommend using dse spark-submit if at all possible.
I found a solution.
First of all, I think it is impossible to run a Spark job from within an application on DSE 5.1. It has to be sent with dse spark-submit.
Once sent, it works perfectly. To communicate with the job I used Apache Kafka.
If you don't want to use a job, you can always go back to plain Apache Spark.
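Since the answer only names Apache Kafka for talking to the submitted job, here is a minimal sketch of the sending side; the kafka-python client, the localhost:9092 broker address, and the job-commands topic are all assumptions for illustration:

import json
from kafka import KafkaProducer  # assumption: the kafka-python package

# Broker address and topic name are placeholders; the Spark job would consume
# from the same topic to receive commands.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("job-commands", json.dumps({"action": "recompute"}).encode("utf-8"))
producer.flush()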

How to configure Hive to use Spark execution engine on Google Dataproc?

I'm trying to configure Hive, running on Google Dataproc image v1.1 (so Hive 2.1.0 and Spark 2.0.2), to use Spark as an execution engine instead of the default MapReduce one.
Following the instructions here https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started doesn't really help; I keep getting Error running query: java.lang.NoClassDefFoundError: scala/collection/Iterable errors when I set hive.execution.engine=spark.
Does anyone know the specific steps to get this running on Dataproc? From what I can tell it should just be a question of making Hive see the right JARs, since both Hive and Spark are already installed and configured on the cluster, and using Hive from Spark (so the other way around) works fine.
This will probably not work with the jars in a Dataproc cluster. In Dataproc, Spark is compiled with Hive bundled (-Phive), which is not suggested / supported by Hive on Spark.
If you really want to run Hive on Spark, you might want to try to bring your own Spark in an initialization action compiled as described in the wiki.
If you just want to move Hive off MapReduce on Dataproc, running Tez with this initialization action would probably be easier.
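Whichever engine you end up on, the execution engine can also be switched per session instead of cluster-wide. A minimal sketch, assuming the PyHive package and HiveServer2 reachable on the Dataproc master node (host name, credentials, and table are placeholders):

from pyhive import hive  # assumption: the PyHive package is installed

# Connect to HiveServer2 on the Dataproc master (placeholder host/credentials).
conn = hive.connect(host="my-cluster-m", port=10000, username="hive")
cursor = conn.cursor()

# Override the execution engine for this session only.
cursor.execute("SET hive.execution.engine=tez")
cursor.execute("SELECT count(*) FROM my_table")  # my_table is a placeholder
print(cursor.fetchall())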

Spark pyspark vs spark-submit

The documentation on spark-submit says the following:
The spark-submit script in Spark’s bin directory is used to launch
applications on a cluster.
Regarding the pyspark it says the following:
You can also use bin/pyspark to launch an interactive Python shell.
This question may sound stupid, but when I am running the commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right?
There is no practical difference between the two. If not configured otherwise, both will execute code in local mode. If a master is configured (either with the --master command-line parameter or the spark.master configuration), the corresponding cluster will be used to execute the program.
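For example, a minimal sketch of pointing either entry point at an explicit master from code (the master URL is a placeholder):

from pyspark.sql import SparkSession

# Equivalent to `pyspark --master spark://host:7077` or setting spark.master in
# spark-defaults.conf; with none of these, the shell falls back to local mode.
spark = (SparkSession.builder
         .master("spark://host:7077")   # placeholder standalone master URL
         .appName("where-does-it-run")
         .getOrCreate())

print(spark.sparkContext.master)  # shows which master was actually picked up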
If you are using EMR, there are three ways to run:
1. using pyspark (or spark-shell)
2. using spark-submit without --master and --deploy-mode
3. using spark-submit with --master and --deploy-mode
Although all three run the application on the Spark cluster, there is a difference in how the driver program works: in the 1st and 2nd the driver runs in client mode, whereas in the 3rd the driver also runs in the cluster. In the 1st and 2nd you have to wait until one application completes before running another, but in the 3rd you can run multiple applications in parallel.
Just adding a clarification that others have not addressed (you may already know this, but it was unclear from the wording of your question):
...when I am running the commands through pyspark, they also run on the "cluster", right? They do not run on the master node only, right?
As with spark-submit, standard Python code will run only on the driver. When you call operations through the various pyspark APIs, you will trigger transformations or actions that will be registered/executed on the cluster.
As others have pointed out, spark-submit can also launch jobs in cluster mode. In this case, the driver still executes standard Python code, but it is a different machine from the one you call spark-submit from.
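A minimal sketch of that split: plain Python statements run on the driver, while the function passed to a transformation runs on the executors (the hostnames only differ when you actually run against a cluster):

import socket
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-vs-executors").getOrCreate()
sc = spark.sparkContext

# Plain Python: executed on the driver only.
print("driver host:", socket.gethostname())

# The lambda is shipped to the executors and runs there.
hosts = (sc.parallelize(range(8), 4)
           .map(lambda _: socket.gethostname())
           .distinct()
           .collect())
print("executor hosts:", hosts)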
Compared with Scala Spark and Java Spark, PySpark has some differences; for example, cluster deploy mode for Python applications is only supported on YARN.
If you are running Python Spark on a local machine, you can use pyspark. On a cluster, use spark-submit.
If you have any dependencies in your Python Spark job, you need to bundle them in a zip file for submission (see the sketch below).
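A minimal sketch of shipping those zipped dependencies, assuming a placeholder archive deps.zip built from your package (the spark-submit equivalent would be --py-files deps.zip):

from pyspark import SparkContext

sc = SparkContext(appName="job-with-deps")
# Make the zipped package available on the driver and on every executor.
sc.addPyFile("deps.zip")   # placeholder archive name

def transform(x):
    import mypackage       # hypothetical module bundled inside deps.zip
    return mypackage.process(x)

print(sc.parallelize(range(4)).map(transform).collect())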

Submit spark application from laptop

I want to submit spark python applications from my laptop. I have a standalone spark cluster, and the master is running at some visible IP (MASTER_IP). After downloading and unzipping Spark on my laptop, I got this to work
./bin/spark-submit --master spark://MASTER_IP:7077 ~/PATHTO/pi.py
From what I understand, it is defaulting to client mode (vs cluster mode). According to Spark (http://spark.apache.org/docs/latest/submitting-applications.html) -
"only YARN supports cluster mode for Python applications." Since I'm not using YARN, I must use client mode.
My question is - do I need to download all of Spark on my laptop? Or just a few libraries?
I want to allow the rest of my team to use my Spark cluster, but I want them to do as little work as possible. They don't need to set up a cluster; they only need to submit jobs to it. Having them download all of Spark seems like overkill.
So, what exactly is the minimum that they need?
The spark-1.5.0-bin-hadoop2.6 package I have here is 304 MB unpacked. More than half of that, 175 MB, is spark-assembly-1.5.0-hadoop2.6.0.jar, the main Spark stuff; you can't get rid of this unless you want to compile your own package. A large part of the rest is spark-examples-1.5.0-hadoop2.6.0.jar at 113 MB; removing this and zipping back up is harmless and already saves you a lot.
However, as suggested by Reactormonk, using tools that spare them from working with the Spark package directly, like spark-jobserver (I have never used it, and have never heard anyone very positive about its current state) or spark-kernel (it still needs your own code to interface with it, or, when used with a notebook (see below), is limited compared to alternatives), makes it even easier for them.
A popular thing to do in that sense is to set up access to a notebook. Since you're using Python, IPython with a PySpark profile would be the most straightforward to set up. Other alternatives are Zeppelin and spark-notebook (my favourite) for using Scala.
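Whichever route you choose, the bare minimum a teammate needs locally for client mode against the standalone cluster is a Spark/PySpark install matching the cluster's version plus the master URL. A minimal sketch (MASTER_IP is the placeholder from the question):

from pyspark import SparkConf, SparkContext

# Client mode against the standalone master; the local Spark version should
# match the cluster's version.
conf = (SparkConf()
        .setMaster("spark://MASTER_IP:7077")
        .setAppName("team-job"))
sc = SparkContext(conf=conf)

print(sc.parallelize(range(1000)).sum())
sc.stop()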

Is spark or spark with mesos the easiest to start with?

If I want a simple setup that gives me a quick start: would a combination of Apache Spark and Mesos be the easiest? Or would Apache Spark alone be better, because Mesos would add complexity to the process given what it does, or because Mesos does so many things that it would be harder to deal with than Spark alone, etc.?
All I want is to be able to submit jobs and manage the cluster and jobs easily, nothing fancy for now. Is Spark alone, Spark with Mesos, or something else better?
The easiest way to start using Spark is to start a standalone Spark cluster on EC2.
It is as easy as running a single script, spark-ec2, which will do the rest for you.
The only case where a standalone cluster may not suit you is if you want to run more than a single Spark job at a time (at least that was the case with Spark 1.1).
For me personally, a standalone Spark cluster was good enough for a long time while I was running ad-hoc jobs, analyzing the company's logs on S3, and learning Spark, and then destroying the cluster.
If you want to run more than one Spark job at a time, I would go with Mesos.
An alternative would be to install CDH from Cloudera, which is relatively easy (they provide install scripts and instructions) and is available for free.
CDH would provide you with powerful tools to manage the cluster.
Using CDH for running Spark: it uses YARN, and we have one issue or another from time to time with running Spark on YARN.
The main disadvantage to me is that CDH provides its own build of Spark, so it is usually one minor version behind, which is a lot for a project progressing as rapidly as Spark.
So I would try Mesos for running Spark if I needed to run more than one job at a time.
Just for completeness, Hortonworks provides a downloadable HDP sandbox VM and also supports Spark on HDP. It is a good starting point as well.
Additionally, you can spin up your own cluster. I do this on my laptop, not for real big-data use cases but for learning with a moderate amount of data.
import subprocess as s
from time import sleep

# Path to spark-class.cmd inside the unpacked Spark distribution (Windows here).
cmd = "D:\\spark\\spark-1.3.1-bin-hadoop2.6\\spark-1.3.1-bin-hadoop2.6\\spark-1.3.1-bin-hadoop2.6\\bin\\spark-class.cmd"

# Standalone master and worker daemon classes, and the master URL workers attach to.
master = "org.apache.spark.deploy.master.Master"
worker = "org.apache.spark.deploy.worker.Worker"
masterUrl = "spark://BigData:7077"

masterProcess = [cmd, master]
workerProcess = [cmd, worker, masterUrl]
noWorker = 3

# Start the master, give it a moment to come up, then attach the workers.
pMaster = s.Popen(masterProcess)
sleep(3)

pWorkers = []
for i in range(noWorker):
    pw = s.Popen(workerProcess)
    pWorkers.append(pw)
The code above starts a master and 3 workers, which I can monitor using the UI. This is just to get going if you need a quick local setup.
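One thing the snippet above leaves out is cleanup; the same Popen handles can be used to stop everything when you are done:

# Tear down the local cluster using the handles created above:
# stop the workers first, then the master.
for pw in pWorkers:
    pw.terminate()
pMaster.terminate()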
