Spark application: how to use a jar's methods/functions in Spark - apache-spark

How do I use functions and methods from an external jar library in a Spark application (Jupyter notebook)?
For example, I was trying to use ua-parser as an external library to parse user agent strings.
So in my Jupyter notebook I added that jar file to the Spark config using:
spark = SparkSession.builder.appName(sparkAppName).master(sparkMaster) \
    .config("spark.jars", "my_hdfs_jar_path/ua_parser1.3.0.jar") \
    .getOrCreate()
How can someone use a class/method from such an external jar in PySpark/Spark SQL code?
In a Python or Java app, one can easily use methods and classes from an external jar by just adding it to the application classpath and importing the class names:
from ua_parser import user_agent_parser
In Java we can add the jar as an external jar and use its methods/functions by just doing:
import ua_parser.Parser
Note: I have used ua_parser just as an example.
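For illustration only, here is a hedged Scala sketch of one common pattern: wrap a class from the external jar in a UDF registered on the JVM side, which then makes it callable from Spark SQL (including spark.sql(...) calls issued from a PySpark notebook). The ua_parser.Parser API shown (a parse method returning an object with a userAgent.family field) is assumed from the uap-java library and may differ in your jar.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ua-parser-example")
  .config("spark.jars", "my_hdfs_jar_path/ua_parser1.3.0.jar")
  .getOrCreate()

// Wrap the external jar's class in a UDF; a new Parser is created per call
// here for simplicity (cache it in a lazy val for real workloads).
spark.udf.register("parse_browser", (ua: String) => {
  val parser = new ua_parser.Parser()   // class provided by the external jar (assumed name)
  parser.parse(ua).userAgent.family     // assumed uap-java API
})

// Once registered, the function is available to Spark SQL queries, e.g.
// spark.sql("SELECT parse_browser(user_agent) AS browser FROM access_logs").show()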

Related

Unable to build Spark application with multiple main classes for Databricks job

I have a Spark application that contains multiple Spark jobs to be run on Azure Databricks. I want to build and package the application into a fat jar. The application compiles successfully. While I am trying to package the application (command: sbt package), it gives an error "[warn] multiple main classes detected: run 'show discoveredMainClasses' to see the list".
How do I build the application jar (without specifying any main class) so that I can upload it to a Databricks job and specify the main class there?
This message is just a warning (note the [warn] prefix); it doesn't prevent generation of the jar files (normal or fat). You can then upload the resulting jar to DBFS (or ADLS for newer Databricks Runtime versions) and create the Databricks job either as a Jar task or a Spark Submit task.
If sbt fails and doesn't produce jars, then you have some plugin that turns warnings into errors.
Also note that sbt package doesn't produce a fat jar - it produces a jar only for the classes in your project. You will need to use sbt assembly (install the sbt-assembly plugin for that) to generate a fat jar, but make sure that you mark the Spark & Delta dependencies as provided, as in the sketch below.
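A minimal build.sbt sketch of that setup (version numbers are illustrative; pick the ones matching your cluster):
// build.sbt: fat jar via sbt-assembly, with Spark & Delta marked "provided"
// so they come from the Databricks runtime instead of being bundled.
ThisBuild / scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "3.3.0" % "provided",
  "io.delta"         %% "delta-core" % "2.1.0" % "provided"
)

// project/plugins.sbt:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.1")
Running sbt assembly then produces the fat jar under target/scala-2.12/.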

import org.apache.spark.streaming.kafka._ Cannot resolve symbol kafka

I have created a Spark application to integrate with Kafka and get a stream of data from it.
But when I try to import org.apache.spark.streaming.kafka._, an error occurs: Cannot resolve symbol kafka. What should I do to import this library?
Depending on your Spark and Scala version, you need to add the corresponding Spark-Kafka integration library to your dependencies.
Spark Structured Streaming
If you plan to use Spark Structured Streaming you need to add the following to your dependencies as described here:
For Scala/Java applications using SBT/Maven project definitions, link your application with the following artifact:
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.0.1
Please note that to use the headers functionality, your Kafka client version should be version 0.11.0.0 or up. For Python applications, you need to add this above library and its dependencies when deploying your application. See the Deploying subsection below. For experimenting on spark-shell, you need to add this above library and its dependencies too when invoking spark-shell. Also, see the Deploying subsection below.
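In sbt form that is libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.1". Once the artifact is on the classpath, a minimal (hedged) read sketch looks like this; broker and topic names are placeholders:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-structured-streaming").getOrCreate()

// Subscribe to one topic; Kafka delivers key/value as binary, so cast to strings.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Print the stream to the console for a quick sanity check.
df.writeStream.format("console").start().awaitTermination()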
Spark Streaming
If you plan to work with Spark Streaming (Direct API), you can follow the guidance given here:
For Scala/Java applications using SBT/Maven project definitions, link your streaming application with the following artifact (see Linking section in the main programming guide for further information).
groupId = org.apache.spark
artifactId = spark-streaming-kafka-0-10_2.12
version = 3.0.1
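Note that with the 0-10 integration the package name differs from the question's import: it is org.apache.spark.streaming.kafka010, not org.apache.spark.streaming.kafka (which belonged to the old 0-8 integration). A hedged sketch with placeholder broker/topic/group names:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("kafka-dstream-example")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "host1:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",
  "auto.offset.reset"  -> "latest"
)

// Direct stream against the Kafka topic; values are printed per micro-batch.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("topic1"), kafkaParams)
)
stream.map(_.value).print()

ssc.start()
ssc.awaitTermination()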

Load external jars to Zeppelin from s3

Pretty simple objective: load my custom/local jars from S3 into a Zeppelin notebook (using Zeppelin on AWS EMR).
Location of the Jar
s3://my-config-bucket/process_dataloader.jar
Following the Zeppelin documentation, I opened the interpreter settings and added spark.jars as a property name with s3://my-config-bucket/process_dataloader.jar as its value.
I restarted the interpreter and then in the notebook I tried to import the class using the following:
import com.org.dataloader.DataLoader
but it throws the following
<console>:23: error: object org is not a member of package com
import com.org.dataloader.DataLoader
Any suggestions for solving this problem?
A bit late, but for anyone else who might need this in the future, try the option below.
https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar is basically your S3 object URL.
%spark.dep
z.reset()
z.load("https://bucket/dev/jars/RedshiftJDBC41-1.2.12.1017.jar")

SBT console vs Spark-Shell for interactive development

I'm wondering if there are any important differences between using the SBT console and spark-shell for interactively developing new code for a Spark project (notebooks are not really an option with the server firewalls).
Both can import project dependencies, but for me SBT is a little more convenient. SBT automatically brings in all the dependencies in build.sbt, while spark-shell can use the --jars, --packages, and --repositories command-line arguments.
SBT has the handy initialCommands setting to automatically run lines at startup. I use this for initializing the SparkContext (sketched below).
Are there others?
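For reference, a hedged build.sbt sketch of the initialCommands approach mentioned above (the Spark version is illustrative):
// build.sbt: pull in Spark for the console and bootstrap a session
// automatically whenever `sbt console` starts.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0"

console / initialCommands := """
  |import org.apache.spark.sql.SparkSession
  |val spark = SparkSession.builder()
  |  .master("local[*]")
  |  .appName("sbt-console")
  |  .getOrCreate()
  |import spark.implicits._
  |""".stripMargin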
With SBT, theoretically you don't need to install Spark itself.
I use Databricks.
From my experience, sbt pulls in external jars natively, while spark-shell sets up a series of imports and contexts natively. I prefer spark-shell because it follows the standard you need to adhere to when building the spark-submit session.
For running the code in production you need to build the code into jars and call them via spark-submit. To do that you need to package it via sbt (compilation check) and run the spark-submit call (logic check).
You can develop using either tool, but you should code as if you did not have the advantages of sbt (pulling in the jars) or spark-shell (setting up the imports and contexts), because spark-submit doesn't do either.

Query Hive table created with built-in Serde from Spark app

I have a Hadoop cluster deployed using Hortonworks' HDP 2.2 (Spark 1.2.1 & Hive 0.14).
I have developed a simple Spark app that is supposed to retrieve the content of a Hive table, perform some actions, and output to a file. The Hive table was imported using Hive's built-in SerDe.
When I run the app on the cluster I get the following exception:
ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.serde2.OpenCSVSerde not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1982)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:337)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:288)
at org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:281)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:631)
at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:189)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1017)
...
Basically, Spark doesn't find Hive's SerDe (org.apache.hadoop.hive.serde2.OpenCSVSerde).
I couldn't find any jar to include for the app's execution, nor any mention of a similar problem anywhere. I have no idea how to tell Spark where to find it.
Make a shaded JAR of your application which includes the hive-serde JAR. Refer to this.
Add the jar file to the Spark config property spark.driver.extraClassPath.
Any external jar must be added there, and the Spark environment will automatically load it.
Or use the spark-shell --jars option.
Example:
spark.executor.extraClassPath /usr/lib/hadoop/lib/csv-serde-0.9.1.jar
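For clarity, these properties are typically set in conf/spark-defaults.conf (or passed with --conf at launch); a sketch reusing the path from the example above:
spark.driver.extraClassPath    /usr/lib/hadoop/lib/csv-serde-0.9.1.jar
spark.executor.extraClassPath  /usr/lib/hadoop/lib/csv-serde-0.9.1.jar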
The .jar was in Hive's lib folder; I just had to add it on launch with --jars and know where to look!
--jars /usr/hdp/XXX/hive/lib/hive-serde-XXX.jar
