spark2-submit different from spark-submit - apache-spark

How is spark2-submit different from spark-submit? I need to migrate my code from Spark 1.6 to Spark 2.4. Can I still use spark-submit to launch my application, or is it compulsory to move to spark2-submit?

I think you are using Cloudera Hadoop. Spark 2.x introduced major changes compared to 1.x, and some of them are not backward compatible, so existing production jobs written for 1.x are likely to fail when run on 2.x.
To preserve backward compatibility, Cloudera added "spark2-submit" and asked users to use it for all "go-forward" jobs. "spark-submit" still uses the 1.x version, so you need not touch any existing production jobs.
So the separate command exists purely for compatibility reasons.
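A quick way to confirm which Spark version sits behind each launcher on a given host is sketched below. It assumes only that the launchers, if installed, are on the PATH and accept the standard --version flag; a missing launcher is reported rather than treated as an error.

```shell
# Print the Spark version behind each launcher that exists on this host.
# A launcher that is not installed is reported instead of failing the script.
checked=0
for launcher in spark-submit spark2-submit; do
  if command -v "$launcher" >/dev/null 2>&1; then
    echo "== $launcher =="
    # The version banner is written to stderr, so merge it into stdout.
    "$launcher" --version 2>&1 | head -n 20
  else
    echo "$launcher: not found on PATH"
  fi
  checked=$((checked + 1))
done
```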

You can use spark-submit for Spark 2.x after setting the following environment variables:
1) SPARK_HOME to the path of spark2-client (e.g. /usr/hdp/current/spark2-client)
2) SPARK_MAJOR_VERSION=2
With these two settings, even if both Spark 1.x and Spark 2.x are installed on the cluster, you can run jobs on Spark 2.x with the same commands, such as spark-shell and spark-submit.
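As a minimal sketch, the two settings can be exported in the shell session (or a wrapper script) before submitting. The SPARK_HOME path is the HDP example from the answer and may differ on your cluster; the class name and jar path in the commented-out submit line are placeholders, not real artifacts.

```shell
# Path taken from the answer's HDP example; adjust to your cluster layout.
export SPARK_HOME=/usr/hdp/current/spark2-client
export SPARK_MAJOR_VERSION=2

# Hypothetical job submission; class name and jar path are placeholders.
# spark-submit --master yarn --class com.example.MyJob /path/to/my-job.jar

# Show the values that spark-submit and spark-shell will see in this session.
echo "SPARK_HOME=$SPARK_HOME SPARK_MAJOR_VERSION=$SPARK_MAJOR_VERSION"
```

The exports only affect the current shell and its child processes, so jobs launched elsewhere without them keep using whatever version the cluster resolves by default.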

Related

CDH(Cloudera Distributed Hadoop) to CDP(Cloudera Data Platform) migration Spark 1x-3x query

We are currently doing a feasibility study on migrating from CDH (Cloudera Distributed Hadoop) to CDP (Cloudera Data Platform) with respect to Spark (currently at version 1.6).
From the documentation, we understood that 1.6 is not supported and that we need to refactor to 2.4; the manual steps are given here:
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-cdh/topics/cdp-one-workload-migra...
But we are planning to migrate to Spark 3.x in CDP. One of the Cloudera blogs covers the same upgrade (link below):
https://blog.cloudera.com/upgrade-journey-the-path-from-cdh-to-cdp-private-cloud/
As part of the pre-upgrade step, it says that we need to convert Spark 1.x jobs to 2.4.5:
Phase 2: Pre-upgrade
Backup existing cluster using the backup steps list here
Confirm if all the prerequisites are addressed. Ensure all outstanding dependencies are met.
Convert Spark 1.x jobs to Spark 2.4.5. Test and validate the jobs to ensure all the required code changes are performed and tested.
My question is:
If the migration is from Spark 1.x to 3.x when moving from CDH to CDP, is it mandatory to have an intermediate step, converting Spark 1.x to 2.x and then 2.x to 3.x? If yes, is the 1.x to 2.x refactoring automated, or must it be done manually following the steps given by Cloudera?
https://docs.cloudera.com/cdp-private-cloud-upgrade/latest/upgrade-cdh/topics/cdp-one-workload-migration-spark16-to-spark24.html
If not, can we refactor directly from Spark 1.x to 3.x when moving from CDH to CDP? Kindly help.
Thanks in advance.
I tried looking for a solution in the existing Cloudera documentation but couldn't find anything. In terms of migrating Spark workloads to CDP, there are only 2 options:
Spark 1.6 to Spark 2.4 Refactoring
Because Spark 1.6 is not supported on CDP, you need to refactor Spark workloads from Spark 1.6 on CDH or HDP to Spark 2.4 on CDP.
Spark 2.3 to Spark 2.4 Refactoring
Because Spark 2.3 is not supported on CDP, you need to refactor Spark workloads from Spark 2.3 on CDH or HDP to Spark 2.4 on CDP.
Spark 2.4 to 3.x
But if we have Spark 1.6, moving it to 2.4 and then to 3.x will be double the effort.

Read data from Cassandra in spark-shell

I want to read data from a Cassandra node on my client node.
This is what I tried:
spark-shell --jars /my-dir/spark-cassandra-connector_2.11-2.3.2.jar
// Trailing dots keep the chained call valid when pasted line by line into spark-shell
val df = spark.read.format("org.apache.spark.sql.cassandra").
  option("keyspace", "my_keyspace").
  option("table", "my_table").
  option("spark.cassandra.connection.host", "Hostname of my Cassandra node").
  option("spark.cassandra.connection.port", "9042").
  option("spark.cassandra.auth.password", "mypassword").
  option("spark.cassandra.auth.username", "myusername").
  load()
I'm getting this error: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.cassandra.DefaultSource$
and
java.lang.NoClassDefFoundError: org/apache/commons/configuration/ConfigurationException
Am I missing any properties? What does this error mean, and how do I resolve it?
Spark version: 2.3.2, DSE version: 6.7.8
The Spark Cassandra Connector itself depends on a number of other libraries, which are missing here because you're providing only the connector jar and not its required dependencies.
Basically, in your case you have the following choices:
1) If you're running this on a DSE node and the cluster has Analytics enabled, you can use the built-in Spark. In this case all jars and properties are already provided, and you only need to supply the username and password when starting the Spark shell via dse -u user -p password spark
2) If you're using an external Spark, it's better to use the so-called BYOS (Bring Your Own Spark), a special version of the Spark Cassandra Connector with all dependencies bundled inside. You can download the jar from DataStax's Maven repo and use it with --jars
3) You can still use the open source Spark Cassandra Connector, but in this case it's better to use --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 so that Spark is able to fetch all dependencies automatically.
P.S. For the open source Spark Cassandra Connector I would recommend version 2.5.1 or higher. It requires Spark 2.4.x (although 2.3.x may work), but it has improved support for DSE plus a lot of new functionality not available in earlier versions. That release also comes in a variant that includes all required dependencies (the so-called assembly), which you can use with --jars if your machine doesn't have access to the internet.
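For the open source connector route, a launch command might be assembled as in the sketch below. The connector coordinates come from the answer; the hostname is a placeholder, and passing the connection settings through --conf instead of per-read options is an assumption about how you want to configure the session. The script only prints the command so it can be inspected first.

```shell
# Connector coordinates are taken from the answer; the hostname is a placeholder.
CONNECTOR="com.datastax.spark:spark-cassandra-connector_2.11:2.3.2"
CASSANDRA_HOST="cassandra-host.example.com"   # hypothetical hostname

# Assemble the launch command; --packages lets Spark resolve the connector's
# transitive dependencies instead of shipping a single jar with --jars.
CMD="spark-shell --packages $CONNECTOR --conf spark.cassandra.connection.host=$CASSANDRA_HOST --conf spark.cassandra.connection.port=9042"

# Dry run: print the command. Run it yourself once the values look right.
echo "$CMD"
```

Credentials are deliberately left out of the command line here; spark.cassandra.auth.username and spark.cassandra.auth.password can be supplied the same way, keeping in mind that command-line arguments are visible to other users on the host.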

How to run different Spark versions on each node in a cluster?

Can I have an Apache Spark cluster where different nodes run different versions of Spark? For example, could I have a master which is Spark 2.2.0, one node that is 2.0.1, another that is 2.2.0 and another that is 1.6.3, or should all nodes have the same version of Spark?
Usually, when we want several versions of Spark on a cluster, every version is installed on every node, and which one runs depends on which spark-submit (Spark 1.6, 2.0 or 2.2) is used to launch the script.
Let's say we have installed Spark 1.6 on the master node only. When we submit a job to the cluster and the master node is fully utilized, the YARN ResourceManager looks for a node that is free to run the job. YARN will not wait until the master node gets some resources back; it will send the job to a node that has free resources. For this reason, all versions of Spark have to be installed on all nodes in the cluster.
Can I have an apache Spark cluster where different nodes run different versions of Spark?
No. This is not possible.
The reason is that there is no notion of a Spark installation. Spark is a library and as such is a dependency of an application that, once submitted for execution, is deployed and executed on cluster nodes (at least one, i.e. the driver).
That said, the version of the Spark dependency of your application is exactly the version of Spark in use. To be precise, it is the version of the spark-submit in use (unless you use a so-called uber-jar with the Spark dependency bundled).

Can I run spark 2.0.* artifact on a spark 2.2.* stand-alone cluster?

I am aware that a change of Spark major version (i.e. from 1.* to 2.*) will cause compile-time failures due to changes in existing APIs.
As far as I know, Spark guarantees that changes in a minor version update (i.e. 2.0.* to 2.2.*) are backward compatible.
Although this eliminates the possibility of compile-time failures on upgrade, would it be safe to assume that there won't be any runtime failure either if I submit a job to a Spark 2.2.* standalone cluster using an artifact (jar) built against 2.0.* dependencies?
would it be safe to assume that there won't be any run time failure too if submit a job on 2.2.* cluster using an artifact(jar) created using 2.0.* dependencies?
Yes.
I'd even say that there's no concept of a Spark cluster unless we talk about the built-in Spark Standalone cluster.
In other words, you deploy a Spark application to a cluster, e.g. Hadoop YARN or Apache Mesos, as an application jar that may or may not contain the Spark jars, and so it can disregard what's already available in the environment.
If, however, you do mean Spark Standalone, things may have been broken between releases, even between 2.0 and 2.2, as the jars in your Spark application have to be compatible with the ones on the JVM of the Spark workers (they are already pre-loaded).
I would not claim full compatibility between releases of Spark Standalone.

Does any Cloudera Hadoop distribution support Apache Spark SQL?

I am new to Apache Spark. I heard that none of the versions of CDH support Apache Spark SQL as of now, and that the same is the case with the Hortonworks distribution. Is that true?
Also, I have CDH 5.0.0 installed on my PC. Which version of Apache Spark does my CDH support?
Could someone also please provide the steps to execute my Spark program on my CDH distribution? I have written some basic programs using Apache Spark 1.2 and I am not able to run them in the CDH environment. I am facing a very basic problem when I run a Spark program using the spark-submit command:
spark-submit: Command not found
Do I need to configure anything before running my Spark program?
Thanks in advance
All of the distributions of CDH include the whole Spark distribution, including Spark SQL.
EDIT: It is supported as of CDH 5.5.x.
CDH 5.0.x includes Spark 0.9.x. CDH 5.3.x includes Spark 1.2.x and 5.4.x should ship 1.3.x since it is about to be released upstream.
spark-submit is already on your path if you are using CDH. If you're running from somewhere else, you have to put this file on your path or give the full path to it, the same as for any program. So something is wrong with what you set up.
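A minimal sketch of that check follows. The directory /opt/spark/bin is a hypothetical location used for illustration only; substitute the directory that actually contains spark-submit on your machine.

```shell
# Hypothetical install location; replace with the directory holding spark-submit.
SPARK_BIN_DIR="/opt/spark/bin"

# Only extend PATH when the shell cannot already resolve the command.
if ! command -v spark-submit >/dev/null 2>&1; then
  export PATH="$SPARK_BIN_DIR:$PATH"
fi

# Report where spark-submit resolves from, or that it is still missing.
command -v spark-submit || echo "spark-submit still not found; check SPARK_BIN_DIR" >&2
```

Putting the export in a shell profile makes it persist across sessions; giving the full path to spark-submit on each invocation works just as well for one-off runs.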

Resources