How to add jar using HiveContext in the spark job - apache-spark

I am trying to add a JSON SerDe jar in order to load JSON data into a Hive table from my Spark job. My code is shown below:
SparkConf sparkConf = new SparkConf().setAppName("KafkaStreamToHbase");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(10));
final SQLContext sqlContext = new SQLContext(sc);
final HiveContext hiveContext = new HiveContext(sc);
hiveContext.sql("ADD JAR hdfs://localhost:8020/tmp/hive-serdes-1.0-SNAPSHOT.jar");
hiveContext.sql("LOAD DATA INPATH '/tmp/mar08/part-00000' OVERWRITE INTO TABLE testjson");
But I end up with the following error:
java.net.MalformedURLException: unknown protocol: hdfs
at java.net.URL.<init>(URL.java:592)
at java.net.URL.<init>(URL.java:482)
at java.net.URL.<init>(URL.java:431)
at java.net.URI.toURL(URI.java:1096)
at org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at com.macys.apm.kafka.spark.parquet.KafkaStreamToHbase$2.call(KafkaStreamToHbase.java:148)
at com.macys.apm.kafka.spark.parquet.KafkaStreamToHbase$2.call(KafkaStreamToHbase.java:141)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:327)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:327)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
I was able to add the jar through the Hive shell, but it throws an error when I try to add it using hiveContext.sql() in the Spark job (Java code). Quick help would be greatly appreciated.
Thanks.

One workaround is to pass the UDF jars at run time with the --jars option of spark-submit, or to copy the required jars into Spark's lib directory; see the sketch below.
Basically it supports the file, hdfs and ivy schemes.
Which version of Spark are you using? I am not able to see an addJar method in ClientWrapper.scala in the latest version.
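For example, the --jars workaround might look like this (the application jar name is a placeholder; the main class and SerDe jar name come from the question):
spark-submit --class com.macys.apm.kafka.spark.parquet.KafkaStreamToHbase \
  --jars /path/to/hive-serdes-1.0-SNAPSHOT.jar \
  your-spark-app.jar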

I just looked into the Spark code. It seems like an issue on the Spark side: the jar URI is turned into a plain java.net.URL, and the JVM's URL class does not understand the hdfs scheme out of the box. Ideally they should register FsUrlStreamHandlerFactory (https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FsUrlStreamHandlerFactory.java) so that hdfs:// URLs can be parsed.
You can add jars from the local file system, pass them with --jars at job submission time, or copy them into Spark's lib folder.
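For illustration, a minimal Java sketch of the two ideas above, assuming the SerDe jar has also been copied to the local /tmp (whether registering the handler alone is enough for ADD JAR over hdfs:// depends on the Spark version):
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

// Register Hadoop's stream handler so java.net.URL can parse hdfs:// URLs.
// Note: setURLStreamHandlerFactory may be called at most once per JVM.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());

// Or simply reference the jar from the local filesystem instead of HDFS:
hiveContext.sql("ADD JAR file:///tmp/hive-serdes-1.0-SNAPSHOT.jar");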

Related

DataFrameReader throwing "Unsupported type NULL" while reading avro file

I am trying to read an avro file with DataFrame, but keep getting:
org.apache.spark.sql.avro.IncompatibleSchemaException: Unsupported type NULL
Since I am going to deploy it on Dataproc I am using Spark 2.4.0, but the same happened when I tried other versions.
Following are my dependencies:
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
My main class:
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf()
.setAppName("Example");
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.getOrCreate();
Dataset<Row> rowDataset = spark.read().format("avro").load("avro_file");
}
Running command:
spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 --master local[*] --class MainClass my-spak-app.jar
After running a lot of tests I concluded that it happens because my Avro schema has a field defined with "type": "null". I am not creating the files I am working with, so I can't change the schema. I am able to read the files when I use RDDs and read the file with the newAPIHadoopFile method.
Is there a way to read Avro files with "type": "null" using a DataFrame, or will I have to work with RDDs?
You can specify a schema when you read the file. Create a schema for your file:
import org.apache.spark.sql.types._

val ACCOUNT_schema = StructType(List(
  StructField("XXX", DateType, true),
  StructField("YYY", StringType, true)))
val rowDataset = spark.read.format("avro").schema(ACCOUNT_schema).load("avro_file")
I am not very familiar with Java syntax, but I think you can manage it.
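For reference, a rough Java equivalent of the Scala sketch above (field names are the placeholders from that snippet):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.*;

// Build the same StructType and pass it to the reader before loading the Avro file.
StructType schema = new StructType(new StructField[]{
    new StructField("XXX", DataTypes.DateType, true, Metadata.empty()),
    new StructField("YYY", DataTypes.StringType, true, Metadata.empty())
});
Dataset<Row> rowDataset = spark.read().format("avro").schema(schema).load("avro_file");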

How to connect to Phoenix From Apache Spark 2.X Using Java

Surprisingly, I couldn't find any up-to-date Java documentation on the web for this. The one or two examples in the entire World Wide Web are too old. I came up with the following, which fails with the error 'Module not Found org.apache.phoenix.spark', but that module is definitely part of the jar. I don't think the following approach is right, because it is copy-pasted from different examples, and loading a module like this is a bit of an anti-pattern, as we already have the package as part of the jar. Please show me the right way.
Note: please no Scala or Python examples; they are easily available on the net.
public class ECLoad {
public static void main(String[] args){
//Create a SparkContext to initialize
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
.builder()
.appName("ECLoad")
.master("local")
.config("spark.sql.warehouse.dir", warehouseLocation)
.getOrCreate();
spark.conf().set("spark.testing.memory", "2147480000"); // if you face any memory issue
Dataset<Row> df = spark.sqlContext().read().format("org.apache.phoenix.spark.*").option("table",
"CLINICAL.ENCOUNTER_CASES").option("zkUrl", "localhost:2181").load();
df.show();
}
}
I'm trying to run it as
spark-submit --class "encountercases.ECLoad" --jars phoenix-spark-5.0.0-HBase-2.0.jar,phoenix-core-5.0.0-HBase-2.0.jar --master local ./PASpark-1.0-SNAPSHOT.jar
and I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
I see the required jars are already at the suggested path and the hbase-site.xml symlink exists.
Before getting Phoenix working with Spark, you will need to set up the environment for Spark so that it knows how to access Phoenix/HBase.
First, create a symbolic link to hbase-site.xml:
ln -s /etc/hbase/conf/hbase-site.xml /etc/spark2/conf/hbase-site.xml
Alternatively, you can add this file while creating the Spark session or via spark-defaults.
You will need to add the jars under /usr/hdp/current/phoenix-client/ to both the driver and executor classpaths. The parameters to set are spark.driver.extraClassPath and spark.executor.extraClassPath.
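For example (a sketch only; the exact client jar name under /usr/hdp/current/phoenix-client/ depends on your distribution, and the class and application jar are taken from the question):
spark-submit --class "encountercases.ECLoad" \
  --conf spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar \
  --conf spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/phoenix-client.jar \
  --master local ./PASpark-1.0-SNAPSHOT.jar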
This step is trivial and can easily be translated into Java/Scala/Python/R; the two steps above are critical for it to work, as they set up the environment:
val df = sqlContext.load("org.apache.phoenix.spark",Map("table" -> "CLINICAL.ENCOUNTER_CASES", "zkUrl" -> "localhost:2181"))
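Since the question asks for Java, a rough Java equivalent of the snippet above (table and zkUrl taken from the question) would be:
// Read the Phoenix table through the phoenix-spark data source.
Dataset<Row> df = spark.read()
    .format("org.apache.phoenix.spark")
    .option("table", "CLINICAL.ENCOUNTER_CASES")
    .option("zkUrl", "localhost:2181")
    .load();
df.show();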
Refer to: https://community.hortonworks.com/articles/179762/how-to-connect-to-phoenix-tables-using-spark2.html

SparkContext.setLogLevel("DEBUG") doesn't work in a cluster

I'm trying to control my Spark logs using
sc.setLogLevel("ERROR");
but it seems like it doesn't work in the cluster environment. Can anyone help?
public static JavaSparkContext getSparkContext(String appName, SparkConf conf) {
SparkSession spark = getSparkSession(appName, conf);
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());
sc.setLogLevel("WARN");
return sc;
}
To configure log levels, add the following options to your spark-submit command:
'--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties"'
'--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=custom-log4j.properties"'
This assumes you have a file called custom-log4j.properties on the classpath. This log4j configuration then controls the verbosity of Spark's logging.
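A minimal sketch of what custom-log4j.properties might contain (logger names and levels are just an example to adapt):
# Root logger at WARN, written to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Quiet down Spark's own packages
log4j.logger.org.apache.spark=WARN
You may also need to ship the file to the executors (for example with --files custom-log4j.properties) so the relative path resolves on their side.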

Got UnsatisfiedLinkError when start Spark program from Java code

I am using SparkLauncher to launch my Spark app from Java. The code looks like:
Map<String, String> envMap = new HashMap<>();
envMap.put("HADOOP_CONF_DIR","/etc/hadoop/conf");
envMap.put("JAVA_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native");
envMap.put("LD_LIBRARY_PATH", "/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native");
envMap.put("SPARK_HOME","/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark");
envMap.put("DEFAULT_HADOOP_HOME","/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop");
envMap.put("SPARK_DIST_CLASSPATH","all jars under /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars");
envMap.put("HADOOP_HOME","/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop");
SparkLauncher sparklauncher = new SparkLauncher(envMap)
.setAppResource("myapp.jar")
.setSparkHome("/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/spark/")
.setMainClass("spark.HelloSpark")
.setMaster("yarn-cluster")
.setConf(SparkLauncher.DRIVER_MEMORY, "2g")
.setConf("spark.driver.userClassPathFirst", "true")
.setConf("spark.executor.userClassPathFirst", "true").launch();
Every time, I got
User class threw exception: java.lang.UnsatisfiedLinkError:
org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
It looks like your jar includes Spark/Hadoop libraries that conflict with other libraries in the cluster. Check that your Spark and Hadoop dependencies are marked as provided.
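For example, in a Maven build the Spark dependency would carry the provided scope so it is not bundled into your application jar (artifact id and version are placeholders for whatever your project actually uses):
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>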

Spark Streaming standalone app and dependencies

I've got a Scala Spark Streaming application that I'm running from inside IntelliJ. When I run against local[2], it runs fine. If I set the master to spark://masterip:port, then I get the following exception:
java.lang.ClassNotFoundException: RmqReceiver
I should add that I've got a custom receiver implemented in the same project called RmqReceiver. This is my app's code:
import akka.actor.{Props, ActorSystem}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkContext, SparkConf}
object Streamer {
def main(args:Array[String]): Unit ={
val conf = new SparkConf(true).setMaster("spark://192.168.40.2:7077").setAppName("Streamer")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2))
val messages = ssc.receiverStream(new RmqReceiver(...))
messages.print()
ssc.start()
ssc.awaitTermination()
}
}
The RmqReceiver class is in the same scala folder as Streamer. I understand that using spark-submit with --jars for dependencies will likely make this work. Is there any way to get this working from inside the application?
To run a job on a standalone Spark cluster, the cluster needs to know about all the classes used in your application. You could add them to the Spark classpath at startup, but that is awkward and I don't suggest doing it.
You need to package your application as an uber-jar (bundle all dependencies into a single jar file) and then add it to the SparkConf jars.
We use the sbt-assembly plugin. If you're using Maven, the maven-assembly plugin provides the same functionality.
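For instance, a minimal sbt-assembly setup (plugin version is just an example) adds this line to project/plugins.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
Running sbt assembly then produces the fat jar that the cluster can fetch; the SparkConf below shows one way to point Spark at the application's jars: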
val sparkConf = new SparkConf().
setMaster(config.getString("spark.master")).
setJars(SparkContext.jarOfClass(this.getClass).toSeq)
I don't think you can do it from IntelliJ IDEA, but you can definitely do it as part of the sbt test phase.