I have a Hadoop cluster deployed using Hortonworks' HDP 2.2 (Spark 1.2.1 & Hive 0.14).
I have developed a simple Spark app that is supposed to retrieve the content of a Hive table, perform some actions and output the result to a file. The Hive table was imported using Hive's built-in SerDe.
When I run the app on the cluster I get the following exception:
ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1982)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:337)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017)
...
Basically, Spark doesn't find Hive's SerDe (org.apache.hadoop.hive.serde2.OpenCSVSerde).
I couldn't find any jar to include at the app's execution, nor any mention of a similar problem anywhere. I have no idea how to tell Spark where to find it.
Make a shaded JAR of your application that includes the hive-serde JAR.
Add the jar file to the Spark config property spark.driver.extraClassPath.
Any external jar must be added there; the Spark environment will then load it automatically.
Or use the --jars option of spark-shell.
Example:
spark.executor.extraClassPath /usr/lib/hadoop/lib/csv-serde-0.9.1.jar
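For example, a submit command along these lines should work (the class name and jar path are placeholders, not from the original post; point them at wherever the SerDe jar lives on your cluster):
spark-submit --class com.example.MyApp \
  --jars /usr/hdp/current/hive-client/lib/hive-serde.jar \
  --conf spark.driver.extraClassPath=/usr/hdp/current/hive-client/lib/hive-serde.jar \
  --conf spark.executor.extraClassPath=/usr/hdp/current/hive-client/lib/hive-serde.jar \
  my-app.jar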
The .jar was in Hive's lib folder; I just had to add it on launch with --jars and know where to look!
--jars /usr/hdp/XXX/hive/lib/hive-serde-XXX.jar
I am using an AWS Glue developer endpoint, Spark version 2.4, Python version 3.
Code:
df=spark.read.format("avro").load("s3://dataexport/users/prod-users.avro")
I am getting the following error message while trying to read the Avro file:
Failed to find data source: avro. Avro is built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of "Apache Avro Data Source Guide".;
I found the following links, but they did not help resolve my issue:
Apache Avro Data Source Guide: https://spark.apache.org/docs/latest/sql-data-sources-avro.html
Apache Avro as a Built-in Data Source in Apache Spark 2.4
You just need to include the spark-avro package, e.g.:
org.apache.spark:spark-avro_2.11:2.4.0
Check which version you need for your Spark release; the artifact version tracks the Spark version.
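For instance, with the built-in module the shell could be started like this (the exact artifact version is an assumption; match it to your Spark 2.4.x build):
pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0
After that, spark.read.format("avro") resolves without any further code changes.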
Did you include the package when starting the shell? If not, you need to start the shell as shown below. The package below is applicable for Spark 2.4+:
pyspark --packages com.databricks:spark-avro_2.11:4.0.0
Also, write the format as below inside read.format:
df=spark.read.format("com.databricks.spark.avro").load("s3://dataexport/users/prod-users.avro")
Note: with the Databricks package you need to write 'com.databricks.spark.avro' instead of 'avro'.
I am following this example to get data from Solr to my Scala Spark program. Below is my code:
val solrURL = "someurl"
val collectionName = "somecollection"
val solrRDD = new SelectSolrRDD(solrURL, collectionName, sc)
val solrQuery = new SolrQuery("somequery")
solrQuery.setTimeAllowed(0)
val solrDataRDD = solrRDD.query(solrQuery)
When I run this code on my local Spark cluster, I get the following exception at the new SelectSolrRDD line:
java.lang.NoSuchMethodError: org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.withHttpClient(Lorg/apache/http/client/HttpClient;)Lorg/apache/solr/client/solrj/impl/SolrClientBuilder;
I looked at some other answers on StackOverflow but nothing worked.
The problem is with your packaging and deployment (your pom.xml, assuming you are using Maven). The issue is that the Solr client libraries are not being loaded when you run your Spark app. You need to package your app and its dependencies into an "uber jar" for deployment to a cluster.
Take a look at how spark-solr has it set up; it uses the maven-shade-plugin to generate the uber jar.
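A minimal sketch of that setup, assuming a standard Maven build (the plugin version is an assumption; any recent release should do):
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.4</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
With this in the <build><plugins> section of the pom, mvn package produces an uber jar that bundles the Solr client classes, which you can then pass to spark-submit.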
My cluster had jars of spark-solr already present which were conflicting with the jars I was using. After removing those jars, my code worked correctly.
I am building a small test app for Spark 2.1.0, running as a two-worker cluster on my computer, and packaging the dependent libraries inside my application's jar file. How can I tell Spark during spark-submit that the libraries are inside the application's jar file? Otherwise I get Exception in thread "main" java.lang.NoClassDefFoundError.
Or should the dependent libraries be copied to Spark?
Thanks in advance.
To add external libraries, add the jar paths to the following Spark configuration properties:
spark.driver.extraLibraryPath
spark.driver.extraClassPath
spark.executor.extraClassPath
spark.executor.extraLibraryPath
You can find (and set) these properties in the /etc/spark/conf.dist/spark-defaults.conf file.
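For example, the entries in spark-defaults.conf could look like this (the jar path is a placeholder and must exist on every node of the cluster):
spark.driver.extraClassPath /opt/myapp/libs/mylib.jar
spark.executor.extraClassPath /opt/myapp/libs/mylib.jar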
An easier way is to build an uber jar, where all the dependencies in your pom are bundled into your jar.
The other (and best) way is:
Make the Spark-specific jars available on the cluster classpath (in the pom, mark them with provided scope).
Any third-party libraries can be added using "--jars /fullpath/your.jar" on spark-submit, or via spark.driver.extraLibraryPath, spark.driver.extraClassPath, spark.executor.extraClassPath and spark.executor.extraLibraryPath, as mentioned above.
I need to use Hive-specific features in Spark SQL, however I have to work with an already deployed Apache Spark instance that, unfortunately, doesn't have Hive support compiled in.
What would I have to do to include Hive support for my job?
I tried using the spark.sql.hive.metastore.jars setting, but then I always get these exceptions:
DataNucleus.Persistence: Error creating validator of type org.datanucleus.properties.CorePropertyValidator
ClassLoaderResolver for class "" gave error on creation : {1}
and
org.datanucleus.exceptions.NucleusUserException: Persistence process has been specified to use a ClassLoaderResolver of name "datanucleus" yet this has not been found by the DataNucleus plugin mechanism. Please check your CLASSPATH and plugin specification.
In the setting I am providing a fat-jar of spark-hive (excluded spark-core and spark-sql) with all its optional Hadoop dependencies (CDH-specific versions of hadoop-archives, hadoop-common, hadoop-hdfs, hadoop-mapreduce-client-core, hadoop-yarn-api, hadoop-yarn-client and hadoop-yarn-common).
I am also specifying spark.sql.hive.metastore.version with the value 1.2.1.
I am using CDH 5.3.1 (with Hadoop 2.5.0) and Spark 1.5.2 on Scala 2.10.
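For concreteness, these settings are typically supplied via spark-submit like this (the paths and class name are placeholders, not my actual setup):
spark-submit \
  --conf spark.sql.hive.metastore.version=1.2.1 \
  --conf spark.sql.hive.metastore.jars=/path/to/spark-hive-fatjar.jar \
  --class com.example.MyJob my-job.jar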
What is the precedence in class loading when both the uber jar of my Spark application and the contents of the --jars option of my spark-submit command contain similar dependencies?
I ask this from a third-party library integration standpoint. If I set --jars to use a third-party library at version 2.0, and the uber jar passed to this spark-submit script was assembled using version 2.1, which class is loaded at runtime?
At present, I am thinking of keeping my dependencies on HDFS and adding them to the --jars option on spark-submit, while asking users, via some end-user documentation, to set the scope of this third-party library to 'provided' in their Spark application's Maven pom file.
This is somewhat controlled with params:
spark.driver.userClassPathFirst &
spark.executor.userClassPathFirst
If set to true (the default is false), from the docs:
(Experimental) Whether to give user-added jars precedence over Spark's own jars when loading classes in the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only.
I wrote some of the code that controls this, and there were a few bugs in the early releases, but if you're using a recent Spark release it should work (although it is still an experimental feature).
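For example, enabling this at submit time looks roughly like the following (the class name, jar locations and file names are placeholders):
spark-submit --class com.example.Main \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  --jars hdfs:///libs/thirdparty-2.0.jar \
  my-uber.jar
With both flags set, user-supplied classes (from the application jar and --jars) take precedence over Spark's own copies when class names collide.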