I need to use Hive-specific features in Spark SQL; however, I have to work with an already deployed Apache Spark instance that, unfortunately, doesn't have Hive support compiled in.
What would I have to do to include Hive support for my job?
I tried using the spark.sql.hive.metastore.jars setting, but then I always get these exceptions:
DataNucleus.Persistence: Error creating validator of type org.datanucleus.properties.CorePropertyValidator
ClassLoaderResolver for class "" gave error on creation : {1}
and
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
In that setting I am providing a fat-jar of spark-hive (with spark-core and spark-sql excluded) together with all of its optional Hadoop dependencies (CDH-specific versions of hadoop-archives, hadoop-common, hadoop-hdfs, hadoop-mapreduce-client-core, hadoop-yarn-api, hadoop-yarn-client and hadoop-yarn-common).
I am also specifying spark.sql.hive.metastore.version with the value 1.2.1.
I am using CDH 5.3.1 (with Hadoop 2.5.0) and Spark 1.5.2 on Scala 2.10.
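For reference, the relevant part of the job submission looks roughly like this (the fat-jar path is illustrative):

spark-submit \
  --conf spark.sql.hive.metastore.version=1.2.1 \
  --conf spark.sql.hive.metastore.jars=/path/to/spark-hive-fatjar-with-hadoop-deps.jar \
  ...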
In our project, we are running Spark 2.3 on 7 nodes.
Recently, as part of a security scan, a Log4j vulnerability was reported by the security team.
We can see a log4j 1.x jar in the Spark folder (/opt/spark/jars/log4j-1.2.17.jar).
We tried to replace the jar with the Log4j 2.17.1 version and run Spark again, but Spark fails with a NoClassDefFoundError for the class org/apache/log4j/or/RendererMap.
Please help me resolve this issue.
Try using log4j-1.2-api version 2.17.1:
https://mvnrepository.com/artifact/org.apache.logging.log4j/log4j-1.2-api
You need to copy 3 jars (core, api, and the 1.2 bridge) from https://archive.apache.org/dist/logging/log4j/ and put them in the spark/jars folder.
Refer to this page for details:
https://logging.apache.org/log4j/2.x/manual/migration.html
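For example, assuming the 2.17.1 artifacts have been downloaded, the swap in the Spark jars folder from the question would look something like this (standard Log4j artifact file names):

rm /opt/spark/jars/log4j-1.2.17.jar
cp log4j-api-2.17.1.jar log4j-core-2.17.1.jar log4j-1.2-api-2.17.1.jar /opt/spark/jars/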
I have created a Spark application to integrate with Kafka and get a stream of data from it.
But when I try to import org.apache.spark.streaming.kafka._, I get the error "Cannot resolve symbol kafka". What should I do to import this library?
Depending on your Spark and Scala versions, you need to add the Spark-Kafka integration library to your dependencies.
Spark Structured Streaming
If you plan to use Spark Structured Streaming, you need to add the following to your dependencies, as described here:
For Scala/Java applications using SBT/Maven project definitions, link your application with the following artifact:
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.0.1
Please note that to use the headers functionality, your Kafka client version should be version 0.11.0.0 or up. For Python applications, you need to add this above library and its dependencies when deploying your application. See the Deploying subsection below. For experimenting on spark-shell, you need to add this above library and its dependencies too when invoking spark-shell. Also, see the Deploying subsection below.
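In an sbt build, for example, that coordinate translates to the line below (assuming Spark 3.0.1 on Scala 2.12, as in the artifact above):

libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.1"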
Spark Streaming
If you plan to work with Spark Streaming (Direct API), you can follow the guidance given here:
For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifact (see Linking section in the main programming guide for further information).
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.12
version = 3.0.1
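Again as an sbt example (same Spark 3.0.1 / Scala 2.12 assumption), and note that with the 0-10 integration the package you import is kafka010, not kafka:

libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.0.1"

// in application code, the import then becomes
import org.apache.spark.streaming.kafka010._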
When I load data into Cassandra using Databricks, I am getting the following issue:
Caused by: java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
It is a simple saveToCassandra to a table.
I looked up this Twitter jsr166e jar in Maven; it is very old, added in 2013.
I don't know why this jar is not available in the Spark Cassandra connector.
That error indicates you are missing dependencies and/or the Spark Cassandra connector is not on the runtime classpath of the Spark application. I'm not sure how you installed the connector, but you should have used the packages method to ensure that the dependencies are met and the connector is correctly configured.
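For example, launching with --packages lets Spark resolve the connector and its transitive dependencies (such as the missing jsr166e jar) for you; the version below is only an illustration, so pick the one matching your Spark and Scala versions:

spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.3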
Read more HERE
Hope that helps,
Pat
I have a Hadoop cluster deployed using Hortonworks' HDP 2.2 (Spark 1.2.1 & Hive 0.14).
I have developed a simple Spark app that is supposed to retrieve the content of a Hive table, perform some actions and output to a file. The Hive table was imported using Hive's built-in SerDe.
When I run the app on the cluster I get the following exception:
ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1982)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:337)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017)
...
Basically, Spark doesn't find Hive's SerDe (org.apache.hadoop.hive.serde2.OpenCSVSerde).
I couldn't find any jar to include at the app's execution, nor any mention of a similar problem anywhere. I have no idea how to tell Spark where to find it.
Make a shaded JAR of your application which includes the hive-serde JAR. Refer to this.
Add the jar file to the Spark config spark.driver.extraClassPath.
Any external jar must be added here; the Spark environment will then load it automatically.
Or use the spark-shell --jars option.
Example:
spark.executor.extraClassPath /usr/lib/hadoop/lib/csv-serde-0.9.1.jar
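Equivalently, the same classpath can be supplied per job at submit time, e.g. (using the same illustrative path as above):

spark-submit --conf spark.driver.extraClassPath=/usr/lib/hadoop/lib/csv-serde-0.9.1.jar --conf spark.executor.extraClassPath=/usr/lib/hadoop/lib/csv-serde-0.9.1.jar ...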
The .jar was in Hive's lib folder; I just had to add it on launch with --jars and know where to look!
--jars /usr/hdp/XXX/hive/lib/hive-serde-XXX.jar
What is the precedence in class loading when both the uber jar of my Spark application and the contents of the --jars option to my spark-submit shell command contain similar dependencies?
I ask this from a third-party library integration standpoint. If I set --jars to use a third-party library at version 2.0 and the uber jar coming into this spark-submit script was assembled using version 2.1, which class is loaded at runtime?
At present, I am thinking of keeping my dependencies on HDFS and adding them to the --jars option on spark-submit, while hoping, via some end-user documentation, to ask users to set the scope of this third-party library to 'provided' in their Spark application's Maven POM file.
This is somewhat controlled with params:
spark.driver.userClassPathFirst &
spark.executor.userClassPathFirst
If set to true (the default is false), then from the docs:
(Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only.
I wrote some of the code that controls this, and there were a few bugs in the early releases, but if you're using a recent Spark release it should work (although it is still an experimental feature).
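For example, a submit command that opts into user-classpath-first behaviour might look like this (jar names and paths are illustrative):

spark-submit \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars hdfs:///libs/third-party-lib-2.0.jar \
  my-application-uber.jar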