Standalone Spark application in IntelliJ - apache-spark

I am trying to run a Spark application (written in Scala) on a local server for debugging. It seems that YARN is the default in the version of Spark (2.2.1) that I have in my sbt build definition, and according to an error I'm consistently getting, there is no Spark/YARN server listening:
Client:920 - Failed to connect to server: 0.0.0.0/0.0.0.0:8032: retries get failed due to exceeded maximum allowed retries number
According to netstat, there is indeed no process listening on port 8032 on my local server.
How would I typically go about running my Spark application locally, in a way that bypasses this problem? I only need the application to process a small amount of data for debugging, and hence would like to be able to run it locally, without relying on specific Spark/YARN installations and setups on the local server ― that would be an ideal debug setup.
Is that possible?
My sbt definitions already bring in all the necessary Spark and spark-yarn jars. The problem also reproduces when running the same project with sbt, outside of IntelliJ.

You can add this property to the VM options of your debug configuration instead of hardcoding the master inside the code:
-Dspark.master=local[2]

You can run your Spark application in local mode with .master("local[*]") if you only need to test the pipeline with a small amount of data.
Full code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("myapp")
  .master("local[*]")
  .getOrCreate()
For spark-submit, use --master local[*] as one of the arguments. Refer to this: https://spark.apache.org/docs/latest/submitting-applications.html
Note: Do not hard-code the master in your codebase; always supply it from the command line. This keeps the application reusable for local/test/Mesos/Kubernetes/YARN/whatever.
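As an illustration of that advice, here is a minimal sketch (the object and app names are placeholders) of an entry point that leaves the master unset, so it can be supplied via the -Dspark.master VM option above or via spark-submit --master:

import org.apache.spark.sql.SparkSession

object DebugApp {
  def main(args: Array[String]): Unit = {
    // No .master(...) here: the master comes from -Dspark.master=local[2],
    // spark-submit --master, or spark-defaults.conf.
    val spark = SparkSession
      .builder
      .appName("debug-app") // placeholder name
      .getOrCreate()

    spark.range(10).show() // tiny sanity check for local debugging
    spark.stop()
  }
}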

Related

Spark on YARN via JDBC Thrift?

When executing queries via the Thrift interface, how do I tell it to run the queries over YARN?
I'm trying to get Spark's JDBC/ODBC Thrift interface to run Spark SQL calls on YARN. This combination seems to be absent from the documentation. The Spark on YARN docs give a bunch of options, but don't describe which configuration file to put them in so that the Thrift server will pick them up.
I see a few of the settings mentioned in spark-env.sh (cores, executor memory, etc.), but I can't figure out where to tell it to use YARN in the first place.
To make the Thrift server use YARN for execution, start it with the --master yarn parameter. This parameter can be appended to the sbin/start-thriftserver.sh invocation; it is passed through to the spark-submit script, which launches the server against that master.
There is no equivalent in a config file.
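For example, from the Spark installation directory the invocation could look roughly like this (the resource options are placeholders; any spark-submit option can be passed the same way):

./sbin/start-thriftserver.sh --master yarn --executor-memory 2g --num-executors 4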

What setup is needed to use the Spark Cassandra Connector with Spark Job Server

I am working with Spark and Cassandra and in general things are straightforward and working as intended; in particular using the spark-shell and running .scala processes to get results.
I'm now looking at using the Spark Job Server; I have the Job Server up and running and working as expected for both the test items, as well as some initial, simple .scala jobs I developed.
However, I now want to take one of the .scala programs that works in spark-shell and get it onto the Spark Job Server so it can be accessed via that mechanism. The issue I have is that the Job Server doesn't seem to recognise the import statements around Cassandra and fails to build (sbt compile; sbt package) a jar for upload to the Job Server.
At some level it just looks like I need the Job Server equivalent of the spark-shell --packages switch (--packages datastax:spark-cassandra-connector:2.0.1-s_2.11) so that import com.datastax.spark.connector._ and similar code in the .scala files will work.
Currently when I attempt to build (sbt compile) I get messages such as:
[error] /home/SparkCassandraTest.scala:10: object datastax is not a member of package com
[error] import com.datastax.spark.connector._
I have added various items to the build.sbt file based on searches and message-board advice, but with no real change; if that is the answer I'm after, what should be added to the base Job Server to enable use of the Cassandra connector?
I think that you need spark-submit to do this. I am also working with Spark and Cassandra, but only for a month, so I've had to read a lot of information. I have compiled this info in a repository; maybe it could help you, though it is an alpha version, sorry about that.
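For what it's worth, the sbt change the question is circling around would typically be a library dependency for the connector, along these lines (a sketch; the Spark and connector versions are assumptions based on the --packages coordinate quoted above and should be matched to your build):

// build.sbt (sketch): makes import com.datastax.spark.connector._ resolve at compile time
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                % "2.1.1" % "provided",
  "org.apache.spark"   %% "spark-sql"                 % "2.1.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.1"
)

The Job Server itself may additionally need the connector classes on its runtime classpath, for example by packaging the job as an assembly jar.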

How to get access to HDFS files in Spark standalone cluster mode?

I am trying to get access to HDFS files in Spark. Everything works fine when I run Spark in local mode, i.e.
SparkSession.master("local")
and get access to HDFS files by
hdfs://localhost:9000/$FILE_PATH
But when I am trying to run Spark in standalone cluster mode, i.e.
SparkSession.master("spark://$SPARK_MASTER_HOST:7077")
the following error is thrown:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
So far I have only
start-dfs.sh
in Hadoop and have not really configured anything in Spark. Do I need to run Spark using the YARN cluster manager instead, so that Spark and Hadoop use the same cluster manager and can therefore access HDFS files?
I have tried configuring yarn-site.xml in Hadoop following the tutorialspoint guide https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm, and specified HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error is thrown. Am I missing some other configuration?
Thanks!
EDIT
The initial Hadoop version is 2.8.0 and the Spark version is 2.1.1 with Hadoop 2.7. I tried downloading hadoop-2.7.4 but the same error still occurs.
The question here suggests this is a Java syntax issue rather than a Spark/HDFS issue. I will try that approach and see if it solves the error here.
Inspired by the post here, I solved the problem myself.
The map-reduce job depends on a Serializable class; when running in Spark local mode, this class can be found on the classpath and the job executes fine.
When running in Spark standalone cluster mode, it is best to submit the application through spark-submit rather than running it from an IDE. I packaged everything into a jar and submitted it with spark-submit, and it works like a charm!
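A minimal sketch of that workflow (host, path, and class names are placeholders): keep the master out of the code, read the HDFS path directly, and supply the master and the packaged jar to spark-submit.

import org.apache.spark.sql.SparkSession

object HdfsLineCount {
  def main(args: Array[String]): Unit = {
    // Master is supplied at submit time, e.g.:
    //   spark-submit --master spark://$SPARK_MASTER_HOST:7077 --class HdfsLineCount app.jar
    val spark = SparkSession.builder.appName("hdfs-line-count").getOrCreate()

    // Placeholder path, following the hdfs://localhost:9000/$FILE_PATH form used above
    val lines = spark.read.textFile("hdfs://localhost:9000/user/demo/input.txt")
    println(s"Line count: ${lines.count()}")

    spark.stop()
  }
}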

Spark job with explicit setMaster("local"), passed to spark-submit with YARN

If I have a Spark job (2.2.0) compiled with setMaster("local") what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster ?
I tried this and it looked like the job did get packaged up and executed on the YARN cluster rather than locally.
What I'm not clear on:
why does this work? According to the docs, things that you set in SparkConf explicitly have precedence over things passed in from the command line or via spark-submit (see: https://spark.apache.org/docs/latest/configuration.html). Is this different because I'm using SparkSession.builder?
is there any less obvious impact of leaving setMaster("local") in code vs. removing it? I'm wondering if what I'm seeing is something like the job running in local mode, within the cluster, rather than properly using cluster resources.
It's because submitting your application to Yarn happens before SparkConf.setMaster is ever executed.
When you use --master yarn --deploy-mode cluster, Spark runs the submission client on your local machine and uploads the jar to run on Yarn. Yarn allocates a container as the application master to run the Spark driver, a.k.a. your code. SparkConf.setMaster("local") then runs inside that Yarn container: it creates a SparkContext in local mode and doesn't use the Yarn cluster resources.
I recommend not setting the master in your code. Just use the command-line --master option or the MASTER environment variable to specify the Spark master.
If I have a Spark job (2.2.0) compiled with setMaster("local") what will happen if I send that job with spark-submit --master yarn --deploy-mode cluster
setMaster has the highest priority and as such excludes other options.
My recommendation: Don't use this (unless you convince me I'm wrong - feel challenged :))
That's why I'm a strong advocate of using spark-submit early and often. It defaults to local[*] and does its job very well. It has even been improved in recent versions of Spark, where it adds a nice-looking application name (aka appName) so you don't have to set it (or even...please don't...hardcode it).
Given we are in Spark 2.2 days with Spark SQL being the entry point to all the goodies in Spark, you should always start with SparkSession (and forget about SparkConf or SparkContext as too low-level).
The only reason I'm aware of when you could have setMaster in a Spark application is when you want to run the application inside your IDE (e.g. IntelliJ IDEA). Without setMaster you won't be able to run the application.
A workaround is to use src/test/scala for the sources (in sbt) and use a launcher with setMaster that will execute the main application.
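A rough sketch of that workaround (all names are illustrative): the production main stays free of setMaster, while a small launcher under src/test/scala forces local mode for IDE runs.

import org.apache.spark.sql.SparkSession

// src/main/scala — production entry point; no master is set here
object MyApp {
  def run(spark: SparkSession): Unit =
    spark.range(5).show() // placeholder job logic

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("myapp").getOrCreate()
    run(spark)
    spark.stop()
  }
}

// src/test/scala — IDE-only launcher that sets a local master
object MyAppLocalLauncher {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("myapp-local")
      .master("local[*]")
      .getOrCreate()
    MyApp.run(spark)
    spark.stop()
  }
}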

Spark on Mesos RpcEndpointNotFoundException

I cannot launch a Spark job on Mesos; when it starts, it automatically gives this error:
"Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find
endpoint: spark://CoarseGrainedScheduler#10.32.8.178:59737"
Could it be because of a mismatch between versions? If I launch one of the examples bundled with the distribution, it works perfectly.
Thanks
It works now. It was the application's fault: I had not specified the data input path correctly.
In Mesos you have two deployment options, cluster mode or client mode. I chose cluster mode, and I have a Spark daemon (MesosClusterDispatcher) that is always listening for Spark jobs; this is why I use mesos://spark-mesos-dispatcher.marathon.mesos:7077.
Thanks Jacek!
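For reference, submitting through that dispatcher in cluster mode looks roughly like this (a sketch; the class name and jar location are placeholders, and the jar must be reachable from the cluster):

spark-submit \
  --master mesos://spark-mesos-dispatcher.marathon.mesos:7077 \
  --deploy-mode cluster \
  --class com.example.MyJob \
  hdfs:///jobs/my-job.jar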
