Spark 2.1 on MapR 5.0 - apache-spark

I am running Spark 2.1 on MapR 5.0.
I am getting the following exception while launching Spark in local mode.
My spark-defaults.conf (important configuration):
spark.sql.hive.metastore.version 0.13.1
spark.sql.hive.metastore.jars
/opt/mapr/lib/maprfs-5.0.0-mapr.jar:
/opt/mapr/hadoop/hadoop-0.20.2/conf:
/opt/mapr/hadoop/hadoop-0.20.2/lib/protobuf-java-2.5.0.jar:
/opt/hadoopgpl/lib/hadoop-lzo.jar:
/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.7.0-mapr-1506.jar:
/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-1.1.1.jar:
/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-auth-2.7.0-mapr-1506.jar:
/opt/mapr/lib/libprotodefs-5.0.0-mapr.jar:
/opt/mapr/lib/baseutils-5.0.0-mapr.jar:
/opt/mapr/hadoop/hadoop-0.20.2/lib/guava-13.0.1.jar:
/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-common-2.7.0-mapr-1506.jar:
/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-configuration-1.6.jar
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni,com.mapr.fs.shim
java.lang.LinkageError: loader (instance of org/apache/spark/sql/hive/client/IsolatedClientLoader$$anon$1): attempted duplicate class definition for name: "com/mapr/fs/jni/MapRConstants"
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
... 104 more
java.lang.IllegalArgumentException: Error while instantiating
'org.apache.spark.sql.hive.HiveSessionState':
Please help me with this.

I have experienced the same issue.
When running my Spark job using spark-submit everything worked, but running it locally reproduced the same problem.
Digging into the MapR community yielded this post: https://community.mapr.com/thread/21262-spark-todf-returns-linkage-error-duplicate-class-definition#comments
In addition, you will notice that your cluster's spark-defaults.conf file contains the following config key: spark.sql.hive.metastore.sharedPrefixes
So, adding this key, spark.sql.hive.metastore.sharedPrefixes, to my SparkConf fixed the problem.
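A minimal sketch of what that looks like in code (the prefix list below is copied from the spark-defaults.conf shown in the question; adjust it to your own MapR installation):
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: set the shared prefixes before the session (and its Hive client) is created.
val conf = new SparkConf()
  .set("spark.sql.hive.metastore.sharedPrefixes",
    "com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc," +
    "com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity," +
    "com.mapr.fs.jni,com.mapr.fs.shim")

val spark = SparkSession.builder()
  .config(conf)
  .enableHiveSupport()
  .getOrCreate()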
Here is an explanation of this key:
a comma separated list of class prefixes that should be loaded using
the classloader that is shared between Spark SQL and a specific
version of Hive. An example of classes that should be shared is JDBC
drivers that are needed to talk to the metastore. Other classes that
need to be shared are those that interact with classes that are
already shared. For example, custom appenders that are used by log4j.
You can read more about it here: https://spark.apache.org/docs/2.1.0/sql-programming-guide.html

Related

Read data from Cassandra in spark-shell

I want to read data from a Cassandra node from my client node.
This is what I tried:
spark-shell --jars /my-dir/spark-cassandra-connector_2.11-2.3.2.jar
val df = spark.read.format("org.apache.spark.sql.cassandra")
  .option("keyspace","my_keyspace")
  .option("table","my_table")
  .option("spark.cassandra.connection.host","Hostname of my Cassandra node")
  .option("spark.cassandra.connection.port","9042")
  .option("spark.cassandra.auth.password","mypassword")
  .option("spark.cassandra.auth.username","myusername")
  .load()
I'm getting this error: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.cassandra.DefaultSource$
and
java.lang.NoClassDefFoundError: org/apache/commons/configuration/ConfigurationException.
Am I missing any properties? What is this error about? How would I resolve this?
Spark version: 2.3.2, DSE version: 6.7.8
The Spark Cassandra Connector itself depends on a number of other dependencies that could be missing here - this happens because you're providing only one jar, and not all required dependencies.
Basically, in your case you have the following choices:
If you're running this on a DSE node, then you can use the built-in Spark if the cluster has Analytics enabled - in this case, all jars and properties are already provided, and you only need to provide a username and password when starting the spark shell via dse -u user -p password spark
If you're using external Spark, then it's better to use the so-called BYOS (bring your own spark) - a special version of the Spark Cassandra Connector with all dependencies bundled inside; you can download the jar from DataStax's Maven repo and use it with --jars
You can still use the open source Spark Cassandra Connector, but in this case it's better to use --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 so Spark will be able to fetch all dependencies automatically.
P.S. For the open source Spark Cassandra Connector I would recommend using version 2.5.1 or higher, although it requires Spark 2.4.x (2.3.x may work) - this version has improved support for DSE, plus a lot of new functionality not available in earlier versions. There is also an artifact of that version that includes all required dependencies (the so-called assembly) that you can use with --jars if your machine doesn't have access to the internet.
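As an illustration only (not part of the answer above): if you write a standalone application and submit it with spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2, you can set the connection properties once on the SparkSession instead of on every read. The host name below is a placeholder.
import org.apache.spark.sql.SparkSession

// Sketch: assumes the connector and its dependencies are already on the classpath
// (e.g. pulled in via --packages). "cassandra-host" is a placeholder.
val spark = SparkSession.builder()
  .config("spark.cassandra.connection.host", "cassandra-host")
  .config("spark.cassandra.connection.port", "9042")
  .config("spark.cassandra.auth.username", "myusername")
  .config("spark.cassandra.auth.password", "mypassword")
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "my_keyspace")
  .option("table", "my_table")
  .load()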

Hive on Spark ERROR java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS

Running Hive on Spark with a simple select * from table query runs smoothly, but on joins and sums, the ApplicationMaster returns this stack trace for the associated spark container:
2019-03-29 17:23:43 ERROR ApplicationMaster:91 - User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
2019-03-29 17:23:43 INFO ApplicationMaster:54 - Final app status: FAILED, exitCode: 13, (reason: User class threw exception: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
)
2019-03-29 17:23:43 ERROR ApplicationMaster:91 - Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:486)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:800)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:799)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:824)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.ExecutionException: Boxed Error
at scala.concurrent.impl.Promise$.resolver(Promise.scala:55)
at scala.concurrent.impl.Promise$.scala$concurrent$impl$Promise$$resolveTry(Promise.scala:47)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:244)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:724)
Caused by: java.lang.NoSuchFieldError: SPARK_RPC_SERVER_ADDRESS
at org.apache.hive.spark.client.rpc.RpcConfiguration.<clinit>(RpcConfiguration.java:47)
at org.apache.hive.spark.client.RemoteDriver.<init>(RemoteDriver.java:134)
at org.apache.hive.spark.client.RemoteDriver.main(RemoteDriver.java:516)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:706)
2019-03-29 17:23:43 INFO ApplicationMaster:54 - Deleting staging directory hdfs://LOSLDAP01:9000/user/hdfs/.sparkStaging/application_1553880018684_0001
2019-03-29 17:23:43 INFO ShutdownHookManager:54 - Shutdown hook called
I have already tried increasing the YARN container memory allocation (and decreasing Spark memory) with no success.
Using:
Hadoop 2.9.2
Spark 2.3.0
Hive 2.3.4
Thank you for your help.
This was asked 6 months ago. Hope this helps others.
The reason for this error is that SPARK_RPC_SERVER_ADDRESS was added in Hive 2.x, while Spark by default supports Hive 1.2.1.
I was able to enable Hive-on-Spark using this manual on an EMR 5.25 cluster (Hadoop 2.8.5, Hive 2.3.5, Spark 2.4.3) running on YARN. However, the manual needs to be updated; it is missing some key items.
To run in YARN mode (either yarn-client or yarn-cluster), link the following jars into HIVE_HOME/lib. The manual didn't mention linking the last library, spark-unsafe.jar:
ln -s /usr/lib/spark/jars/scala-library-2.11.12.jar /usr/lib/hive/lib/scala-library.jar
ln -s /usr/lib/spark/jars/spark-core_2.11-2.4.3.jar /usr/lib/hive/lib/spark-core.jar
ln -s /usr/lib/spark/jars/spark-network-common_2.11-2.4.3.jar /usr/lib/hive/lib/spark-network-common.jar
ln -s /usr/lib/spark/jars/spark-unsafe_2.11-2.4.3.jar /usr/lib/hive/lib/spark-unsafe.jar
Allow YARN to cache the necessary Spark dependency jars on the nodes so that they do not need to be distributed each time an application runs.
For Hive 2.2.0, upload all jars in $SPARK_HOME/jars to an HDFS folder and add the following to hive-site.xml:
<property>
<name>spark.yarn.jars</name>
<value>hdfs://xxxx:8020/spark-jars/*</value>
</property>
The manual is missing the key information that you need to exclude the default Hive 1.2.1 jars. This is what I did:
hadoop fs -mkdir /spark-jars
hadoop fs -put /usr/lib/spark/jars/*.jar /spark-jars/
hadoop fs -rm /spark-jars/*hive*1.2.1*
Also, you need to add the following to the spark-defaults.conf file:
spark.sql.hive.metastore.version 2.3.0
spark.sql.hive.metastore.jars /usr/lib/hive/lib/*:/usr/lib/hadoop/client/*
For more information on interacting with different versions of hive metastore please check this link.
It turned out that Hive-on-Spark has a lot of implementation problems and essentially does not work at all unless you write your own custom Hive connector. In a nutshell, Spark devs are struggling to keep up with Hive releases, and they have not yet decided how to deal with backward compatibility for loading Hive versions ~< 2 while focusing on the newest branch.
Solutions
1) Go back to Hive 1.x
Not ideal, especially if you want more modern integration with file formats such as ORC.
2) Use Hive-on-Tez
This is the one we decided to adopt. *This solution does not break the open source stack* and works perfectly alongside Spark-on-Yarn. 3rd-party Hadoop ecosystems, like those for Azure, AWS and Hortonworks, all add proprietary code just for running Hive-on-Spark because of the mess it became.
By installing Tez, your Hadoop queries will work like this:
A direct Hive query (e.g. a JDBC connection from DBeaver) will run in a Tez container on the cluster
A Spark job will be able to access the Hive metastore as normal and will use a Spark container on the cluster when calling SparkSession.builder.enableHiveSupport().getOrCreate() (this is pyspark code; see the sketch below)
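For reference, a rough Scala equivalent of that pyspark snippet (the table name is hypothetical):
import org.apache.spark.sql.SparkSession

// The session talks to the Hive metastore directly; query execution runs in Spark
// containers on YARN, independently of Tez.
val spark = SparkSession.builder()
  .appName("spark-with-hive-metastore")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("select count(*) from myDb.myTable").show()  // hypothetical table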
Installing Hive-on-Tez with Spark-on-Yarn
Note: I'll keep it short since I do not see much interest on these boards. Ask for details and I'll be happy to help and expand.
Version matrix
Hadoop 2.9.2
Tez 0.9.2
Hive 2.3.4
Spark 2.4.2
Hadoop is installed in cluster mode.
This is what worked for us. I would not expect it to work seamlessly when switching to Hadoop 3.x, which we will be doing at some point in the future, but it should work fine if you do not change the main release version for each component.
Basic guide
Compile Tez from source as written in the official install guide, with Mode A for sharing Hadoop jars. Do not use any pre-compiled Tez distro. Test it from the hive shell with a simple query that is not just simple data access (i.e. not just a select). For example, use: select count(*) from myDb.myTable. You should see the Tez bars in the hive console.
Compile Spark from source. To do so, follow the official guide (important: download the archive labeled without-hadoop!), but before compiling it edit the source code at ./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala and comment out the following line: ConfVars.HIVE_STATS_JDBC_TIMEOUT -> TimeUnit.SECONDS,
Share $HIVE_HOME/conf/hive-site.xml into your $SPARK_HOME/conf/ dir. You must make a hard copy of this config file, not a symlink. The reason is that you must remove all Tez-related Hive config values from it to guarantee that Spark co-exists independently with Tez, as explained above. This includes the hive.execution.engine=tez property: remove it completely from Spark's hive-site.xml, while leaving it in Hive's hive-site.xml.
In $HADOOP_HOME/etc/hadoop/mapred-site.xml, set the property mapreduce.framework.name=yarn (the corresponding XML entry is shown below). This will be picked up correctly by both environments even if it is not set to yarn-tez. It just means that raw MapReduce jobs will not run on Tez, while Hive jobs will indeed use it. This is a problem only for legacy jobs, since raw mapred is obsolete.
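For reference, the mapred-site.xml entry for that property, in the same form as the spark.yarn.jars example above:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>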
Good luck!

How to get access to HDFS files in Spark standalone cluster mode?

I am trying to get access to HDFS files in Spark. Everything works fine when I run Spark in local mode, i.e.
SparkSession.master("local")
and get access to HDFS files by
hdfs://localhost:9000/$FILE_PATH
But when I try to run Spark in standalone cluster mode, i.e.
SparkSession.master("spark://$SPARK_MASTER_HOST:7077")
this error is thrown:
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
So far I have only
start-dfs.sh
in Hadoop and have not really configured anything in Spark. Do I need to run Spark using the YARN cluster manager instead, so that Spark and Hadoop use the same cluster manager and can thus access HDFS files?
I have tried configuring yarn-site.xml in Hadoop following tutorialspoint https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm, and specified HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error is thrown. Am I missing some other configuration?
Thanks!
EDIT
The initial Hadoop version is 2.8.0 and the Spark version is 2.1.1 built with Hadoop 2.7. I tried downloading hadoop-2.7.4, but the same error still exists.
The question here suggests this is a Java syntax issue rather than a Spark/HDFS issue. I will try that approach and see if it solves the error here.
Inspired by the post here, I solved the problem myself.
This map-reduce job depends on a Serializable class, so when running in Spark local mode this serializable class can be found and the map-reduce job can execute correctly.
When running in Spark standalone cluster mode, it is best to submit the application through spark-submit rather than running it in an IDE. I packaged everything into a jar and ran it with spark-submit, and it works like a charm!
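A minimal sketch of that approach (host names and paths are placeholders): build this as a jar and run it with spark-submit --master spark://$SPARK_MASTER_HOST:7077 instead of launching it from the IDE.
import org.apache.spark.sql.SparkSession

object HdfsReadJob {
  def main(args: Array[String]): Unit = {
    // The master URL is supplied by spark-submit, so it is not hard-coded here.
    val spark = SparkSession.builder()
      .appName("hdfs-read-job")
      .getOrCreate()

    // Placeholder namenode host/port and file path.
    val lines = spark.read.textFile("hdfs://namenode-host:9000/path/to/file")
    println(lines.count())

    spark.stop()
  }
}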

Spark-submit Executers are not getting the properties

I am trying to deploy a Spark application to a 4-node DSE Spark cluster. I have created a fat jar with all dependent jars, and I have created a property file under src/main/resources which has properties like batch interval, master URL, etc.
I have copied this fat jar to the master and I am submitting the application with spark-submit; below is my submit command:
dse spark-submit --class com.Processor.utils.jobLauncher --supervise application-1.0.0-develop-SNAPSHOT.jar qa
Everything works properly when I run on a single-node cluster, but when run on the DSE Spark standalone cluster, the properties mentioned above, like batch interval, become unavailable to the executors. I have googled and found that this is a common issue that many have solved, so I followed one of the solutions and created a fat jar and tried to run it, but still my properties are unavailable to the executors.
Can someone please give any pointers on how to solve the issue?
I am using DSE 4.8.5 and Spark 1.4.2.
This is how I am loading the properties:
System.setProperty("env",args(0))
val conf = com.typesafe.config.ConfigFactory.load(System.getProperty("env") + "_application")
Figured out the solution:
I was referring to the property file name via a system property (I was setting it in the main method with a command-line parameter), and when the code gets shipped and executed on a worker node the system property is not available (obviously!), so instead of using Typesafe ConfigFactory to load the property file I am using simple Scala file reading.
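The answer does not show the replacement code, but one possible sketch (assuming a properties file bundled in the fat jar, with a hypothetical file name and key) looks like this:
import java.util.Properties

// Sketch: load "<env>_application.properties" from the jar's resources instead of
// deriving the name from a JVM system property that the executors never see.
def loadProps(env: String): Properties = {
  val props = new Properties()
  val in = getClass.getResourceAsStream(s"/${env}_application.properties")  // assumed file name
  require(in != null, s"${env}_application.properties not found on the classpath")
  try props.load(in) finally in.close()
  props
}

val props = loadProps("qa")                             // "qa" comes from the submit argument
val batchInterval = props.getProperty("batchInterval")  // hypothetical key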

Spark+cassandra java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.CassandraRDD

I am trying to read data from Cassandra 2.0.6 using Spark. I use the DataStax drivers. While reading I got an error like "Loss was due to java.lang.ClassNotFoundException:
java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.CassandraRDD". But I included spark-cassandra-connector_2.10 in my pom.xml, which has the com.datastax.spark.connector.rdd.CassandraRDD class. Am I missing any other settings or environment variables?
You need to make sure that the connector is on the classpath for the executors, either by using the -cp option or by bundling the jar in the Spark context (for example with SparkConf.setJars(); see the sketch below).
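A minimal sketch of the second option (the jar path is a placeholder for wherever the connector jar lives on the driver machine):
import org.apache.spark.{SparkConf, SparkContext}

// setJars ships the listed jars to the executors along with the application.
val conf = new SparkConf()
  .setAppName("cassandra-read")
  .setJars(Seq("/path/to/spark-cassandra-connector_2.10.jar"))  // placeholder path

val sc = new SparkContext(conf)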
Edit for Modern Spark
In Spark > 1.x it's usually recommended that you use the spark-submit command to place your dependencies on the executor classpath. See
http://spark.apache.org/docs/latest/submitting-applications.html
