According to the official documentation, when you want to run Hive on the Spark engine, you need to build Spark without Hive. But when I run spark-sql in the same environment, I get an error saying that I must rebuild Spark with -Phive.
As a workaround, I downloaded spark-hive-thriftserver_2.11-2.2.1.jar and spark-hive_2.11-2.0.0.jar from the Maven repository and put them in ${SPARK_HOME}/jars/, but this approach does not seem to work.
thanks
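For reference, a sketch of the rebuild the spark-sql error points at, assuming a Spark source checkout (the profiles follow the Spark build documentation; adjust versions to your environment):
# build Spark with Hive and Thrift server support, skipping tests
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package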
I have a Spark job that writes data to Cassandra (Cassandra is on GCP). When I run this from IntelliJ IDEA (my IDE), it works perfectly fine: the data is sent and written to Cassandra. However, this fails when I package my project into a fat jar and run it.
Here is an example of how I run it.
spark-submit --class com.testing.Job --master local out/artifacts/SparkJob_jar/SparkJob.jar 1 0
However, this fails for me and gives me the following errors
Caused by: java.io.IOException: Failed to open native connection to Cassandra at {X.X.X:9042} :: 'com.datastax.oss.driver.api.core.config.ProgrammaticDriverConfigLoaderBuilder com.datastax.oss.driver.api.core.config.DriverConfigLoader.programmaticBuilder()'
Caused by: java.lang.NoSuchMethodError: 'com.datastax.oss.driver.api.core.config.ProgrammaticDriverConfigLoaderBuilder com.datastax.oss.driver.api.core.config.DriverConfigLoader.programmaticBuilder()'
My artifact configuration does include the Spark Cassandra connector jars:
spark-cassandra-connector-driver_2.12-3.0.0-beta.jar
spark-cassandra-connector_2.12-3.0.0-beta.jar
I'm wondering why this is happening and how I can fix it?
The problem is that besides those two jars, you need more jars: the full Java driver and its dependencies. You have the following possibilities to fix that:
Make sure that these artifacts are packaged into the resulting jar (a so-called "fat jar" or "assembly") using Maven, SBT, or anything else
You can specify the Maven coordinates com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta with --packages, like this: --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-beta
You can download the spark-cassandra-connector-assembly artifact to the node from which you're doing spark-submit, and then use that file name with --jars (see the sketch after this list)
See the documentation for Spark Cassandra Connector for more details.
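For the --jars option, a minimal sketch, assuming you have already downloaded the matching spark-cassandra-connector-assembly jar from Maven Central (the file name and version here are assumptions; match them to the connector version you actually use):
spark-submit --class com.testing.Job --master local \
  --jars spark-cassandra-connector-assembly_2.12-3.0.0-beta.jar \
  out/artifacts/SparkJob_jar/SparkJob.jar 1 0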
We want to use Apache Livy so that we can invoke Spark jobs from a REST API. Do we need to install the Livy server on the name node or on an edge node? What is the best practice?
Our Spark fat jar will reside on an NFS path.
Livy can be installed anywhere. You just need to configure it correctly to use the resource manager. It is easier to configure if you install it on the edge node from which you run spark-submit.
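A minimal sketch of submitting a batch through Livy's REST API, assuming Livy runs on the edge node with its default port 8998 (the host, jar path, and class name are placeholders; the jar must sit at a path the cluster can read):
curl -X POST -H "Content-Type: application/json" \
  -d '{"file": "/nfs/path/to/your-fat.jar", "className": "com.example.YourJob"}' \
  http://edge-node:8998/batches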
I have a Mesos Cluster on which I want to run Spark jobs.
I have downloaded the spark precompiled package and I can use the spark-shell by simply decompressing the archive.
So far, I haven't managed to run spark jobs on the Mesos Cluster.
First question: Do I need to build and install Spark from source to get it working on Mesos? And is this precompiled package only usable for Spark on YARN and Hadoop?
Second question: Can anyone provide the best way to build Spark? I have found many ways, like:
sbt clean assembly
./build/mvn -Pmesos -DskipTests clean package
./build/sbt package
I don't know which one to use, and whether they are all correct or not.
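For context, a sketch of what submitting against Mesos with the prebuilt package looks like per the Spark-on-Mesos documentation (the master host/port, class, and jar path are placeholders):
spark-submit --class com.example.MyJob \
  --master mesos://mesos-master:5050 \
  /path/to/my-job.jar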
I have a Hortonworks yarn cluster with Spark 2.1.
However, I want to run my application with Spark 2.3+ (because an essential third-party ML library in use needs it).
Do we have to use spark-submit from the Spark 2.1 version, or do we have to submit the job to YARN using Java or Scala with a fat jar? Is this even possible? What about Hadoop libraries?
On a Hortonworks cluster, running a custom Spark version in YARN client/cluster mode needs the following steps:
Download the Spark prebuilt package for the appropriate Hadoop version
Extract and unpack it into a Spark folder, e.g. /home/centos/spark/spark-2.3.1-bin-hadoop2.7/
Copy the jersey-bundle 1.19.1 jar into the Spark jars folder (it can be downloaded from the Maven repository)
Create a zip file containing all the jars in the Spark jars folder, e.g. spark-jars.zip (see the sketch after these steps)
Put this spark-jars.zip file in a world-accessible HDFS location, e.g. hdfs dfs -put spark-jars.zip /user/centos/data/spark/
Get the HDP version with hdp-select status hadoop-client; example output: hadoop-client - 3.0.1.0-187
Use the above HDP version in the export commands below:
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/usr/hdp/3.0.1.0-187/hadoop/conf}
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/3.0.1.0-187/hadoop}
export SPARK_HOME=/home/centos/spark/spark-2.3.1-bin-hadoop2.7/
Edit the spark-defaults.conf file in the $SPARK_HOME/conf directory and add the following entries:
spark.driver.extraJavaOptions -Dhdp.version=3.0.1.0-187
spark.yarn.am.extraJavaOptions -Dhdp.version=3.0.1.0-187
Create a java-opts file in the $SPARK_HOME/conf directory and add the entry below, using the above-mentioned HDP version:
-Dhdp.version=3.0.1.0-187
export LD_LIBRARY_PATH=/usr/hdp/3.0.1.0-187/hadoop/lib/native:/usr/hdp/3.0.1.0-187/hadoop/lib/native/Linux-amd64-64
spark-shell --master yarn --deploy-mode client --conf spark.yarn.archive=hdfs:///user/centos/data/spark/spark-jars.zip
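A consolidated sketch of the extract, zip, and upload steps above; the tgz name and the temporary path are assumptions matching the example paths in this answer:
# unpack the prebuilt Spark (tgz name assumed from the folder used above)
tar -xzf spark-2.3.1-bin-hadoop2.7.tgz -C /home/centos/spark/
# zip the jars at the archive root, which is what spark.yarn.archive expects
cd /home/centos/spark/spark-2.3.1-bin-hadoop2.7/jars
zip /tmp/spark-jars.zip *.jar
# upload the archive to the world-accessible HDFS location
hdfs dfs -put /tmp/spark-jars.zip /user/centos/data/spark/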
I assume you use sbt as the build tool in your project. The project itself could use Java or Scala. I also think that the answer in general would be similar if you used gradle or maven, but the plugins would simply be different. The idea is the same.
You have to use an assembly plugin (e.g. sbt-assembly) that is going to bundle all non-Provided dependencies together, including Apache Spark, in order to create a so-called fat jar or uber-jar.
If the custom Apache Spark version is part of the application jar, that version is going to be used regardless of which spark-submit you use for deployment. The trick is to convince the classloader to load the jars and classes of your choice, not spark-submit's (and hence not whatever is used on the cluster).
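A minimal sketch of that flow, assuming sbt with sbt-assembly; the class and jar names are placeholders, and the userClassPathFirst settings are an assumption on my part (Spark's experimental switches for preferring user-supplied classes), not something this answer prescribes:
# build the uber-jar with the custom Spark bundled (i.e. Spark not marked Provided)
sbt clean assembly
# submit it, asking Spark to prefer classes from the application jar
spark-submit \
  --class com.example.MyApp \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  target/scala-2.12/myapp-assembly-0.1.jar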
I have a 3-node cluster managed by Cloudera Manager. I have edited the BlockMatrix.scala file in the Spark MLlib source code and packaged it using:
mvn -DskipTests package
command. It has created a new jar file. Now I want to run this newly created MLlib jar on my cluster. What is the way to do that?
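A sketch of one common way a rebuilt jar like this is picked up at submit time; the paths and class name are placeholders, and whether the classpath-first settings are needed depends on the cluster setup:
spark-submit --class com.example.MyBlockMatrixApp \
  --jars /path/to/rebuilt-spark-mllib_2.11.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  /path/to/my-application.jar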