"Unable to instantiate SparkSession with Hive support" error when trying to process hive table with spark - apache-spark

I want to process a Hive table using Spark, but when I run my program, I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
My application code
import org.apache.spark.sql.SparkSession

object spark_on_hive_table extends App {
  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .config("spark.sql.warehouse.dir", "hdfs://localhost:54310/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._

  spark.sql("select * from pbSales").show()
}
build.sbt
version := "0.1"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.2",
  "org.apache.spark" %% "spark-sql" % "2.3.2",
  "org.apache.spark" %% "spark-streaming" % "2.3.2",
  "org.apache.spark" %% "spark-hive" % "2.3.2" % "provided"
)

You should remove the "provided" scope from your spark-hive dependency. With "provided", sbt keeps the jar off the runtime classpath (it is expected to be supplied by the environment), so the Hive classes are not found when you run the application directly.
Change
"org.apache.spark" %% "spark-hive" % "2.3.2" % "provided"
to
"org.apache.spark" %% "spark-hive" % "2.3.2"

Related

Spark DynamoDB Connectivity Issue

Requirement: read data from DynamoDB (on AWS, not a local instance) via Spark using Scala, from my local machine.
Understanding: data can be read using emr-dynamodb-hadoop.jar when we are running on an EMR cluster.
Question:
Can we read data from DynamoDB (on AWS, not local) using emr-dynamodb-hadoop.jar without an EMR cluster? I want to access DynamoDB directly from Spark, using Scala code on my local machine.
build.sbt
version := "0.1"
scalaVersion := "2.11.12"
scalacOptions := Seq("-target:jvm-1.8")

libraryDependencies ++= Seq(
  "software.amazon.awssdk" % "dynamodb" % "2.15.1",
  "org.apache.spark" %% "spark-core" % "2.4.1",
  "com.amazon.emr" % "emr-dynamodb-hadoop" % "4.2.0",
  "org.apache.httpcomponents" % "httpclient" % "4.5"
)

dependencyOverrides ++= {
  Seq(
    "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.6.7.1",
    "com.fasterxml.jackson.core" % "jackson-databind" % "2.6.7",
    "com.fasterxml.jackson.core" % "jackson-core" % "2.6.7"
  )
}
readDataFromDDB.scala
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SparkConf, SparkContext}

object readDataFromDDB {

  def main(args: Array[String]): Unit = {
    var sc: SparkContext = null
    try {
      val conf = new SparkConf().setAppName("DynamoDBApplication").setMaster("local")
      sc = new SparkContext(conf)
      val jobConf = getDynamoDbJobConf(sc, "Music", "TableNameForWrite")
      val tableData = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
      println(tableData.count())
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (sc != null) sc.stop()
    }
  }

  private def getDynamoDbJobConf(sc: SparkContext, tableNameForRead: String, tableNameForWrite: String) = {
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.servicename", "dynamodb")
    jobConf.set("dynamodb.input.tableName", tableNameForRead)
    jobConf.set("dynamodb.output.tableName", tableNameForWrite)
    jobConf.set("dynamodb.awsAccessKeyId", "*****************")
    jobConf.set("dynamodb.awsSecretAccessKey", "*********************")
    jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
    jobConf.set("dynamodb.regionid", "us-east-1")
    jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
    jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
    jobConf
  }
}
ERROR
java.lang.RuntimeException: Could not lookup table Music in DynamoDB.
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:116)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.getThroughput(ReadIopsCalculator.java:67)
at org.apache.hadoop.dynamodb.read.ReadIopsCalculator.calculateTargetIops(ReadIopsCalculator.java:57)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.initReadManager(AbstractDynamoDBRecordReader.java:153)
at org.apache.hadoop.dynamodb.read.AbstractDynamoDBRecordReader.<init>(AbstractDynamoDBRecordReader.java:84)
at org.apache.hadoop.dynamodb.read.DefaultDynamoDBRecordReader.<init>(DefaultDynamoDBRecordReader.java:24)
at org.apache.hadoop.dynamodb.read.DynamoDBInputFormat.getRecordReader(DynamoDBInputFormat.java:32)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:403)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:409)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.IllegalStateException: Socket not created by this factory
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.handleException(DynamoDBFibonacciRetryer.java:120)
at org.apache.hadoop.dynamodb.DynamoDBFibonacciRetryer.runWithRetry(DynamoDBFibonacciRetryer.java:83)
at org.apache.hadoop.dynamodb.DynamoDBClient.describeTable(DynamoDBClient.java:105)
... 20 more
Links already reviewed
https://aws.amazon.com/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/
read/write dynamo db from apache spark
Spark 2.2.0 - How to write/read DataFrame to DynamoDB
https://github.com/awslabs/emr-dynamodb-connector
This was solved when the following dependency versions were updated:
"software.amazon.awssdk" % "dynamodb" % "2.15.31",
"com.amazon.emr" % "emr-dynamodb-hadoop" % "4.14.0"

Dependency for org.apache.spark.streaming.kafka.KafkaUtils

I am trying to integrate Spark Streaming with Kafka, but I am unable to resolve the dependency for org.apache.spark.streaming.kafka.KafkaUtils. Below is my build.sbt:
name := "StreamingTest"
version := "1.0"
organization := "com.sundogsoftware"
scalaVersion := "2.12.10"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0-preview2" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.0.0-preview2",
  "org.apache.spark" %% "spark-sql" % "3.0.0-preview2",
  "org.apache.spark" %% "spark-streaming" % "3.0.0-preview2" % "provided",
  "org.apache.kafka" %% "kafka" % "2.0.0"
)
I am using following imports in my project:
import org.apache.spark.{SparkConf, SparkContext, sql}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
All the dependencies are resolved except org.apache.spark.streaming.kafka.KafkaUtils. I am using Spark version 3.0.0-preview2 and Scala version 2.12.10.
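A note that may explain the unresolved class: org.apache.spark.streaming.kafka.KafkaUtils belongs to the old spark-streaming-kafka-0-8 integration, which is not published for Scala 2.12 or Spark 3.x; with the spark-streaming-kafka-0-10 artifact declared above, the class lives under the kafka010 package instead. Below is a minimal sketch of the corresponding imports and direct-stream setup, assuming a local broker at localhost:9092 and a topic named "test" (both placeholders):
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaStreamSketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Consumer settings; broker address and group id are placeholders
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "streaming-test"
    )

    // In the 0-10 integration, KafkaUtils lives under org.apache.spark.streaming.kafka010
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("test"), kafkaParams)
    )

    stream.map(_.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
The kafka.serializer.StringDecoder import is likewise part of the old 0.8 API and is not needed with the 0-10 integration, which uses the Kafka client deserializers shown above.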

Dependencies for Spark-Streaming and Twiter-Streaming in SBT

I was trying to use the following dependencies in my build.sbt, but it keeps giving an "unresolved dependency" error.
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter_2.11" % "2.2.0.1.0.0-SNAPSHOT"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.0"
I'm using Spark 2.2.0. What are the correct dependencies?
The question was posted a while ago, but I ran into the same problem this week. Here is the solution for those who still have it:
As you can see here, the correct artifact name for importing the library with SBT is "spark-streaming-twitter", while with Maven it is "spark-streaming-twitter_2.11". This is because SBT's %% operator appends the Scala version suffix automatically, so you leave it off the artifact name.
The catch is that the only artifact that works is "spark-streaming-twitter_2.11". With Scala 2.12, for example, you will get the error:
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.bahir#spark-streaming-twitter_2.12;2.3.2: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
But if you use Scala 2.11, it should work fine. Here is a working sbt file:
name := "twitter-read"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.2"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.2" % "provided"
libraryDependencies += "org.twitter4j" % "twitter4j-core" % "3.0.3"
libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.3.2"
Below are the dependencies you need to add for Spark Twitter streaming (Maven):
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>spark-streaming-twitter_2.11</artifactId>
<version>2.0.0</version>
</dependency>
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-core</artifactId>
<version>4.0.4</version>
</dependency>
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-stream</artifactId>
<version>4.0.4</version>
</dependency>
<dependency>
<groupId>com.twitter</groupId>
<artifactId>jsr166e</artifactId>
<version>1.1.0</version>
</dependency>

How to run sqlContext in the spark-jobserver

I'm trying to execute a job locally in the spark-jobserver. My application has the dependencies below:
name := "spark-test"
version := "1.0"
scalaVersion := "2.10.6"
resolvers += Resolver.jcenterRepo
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "spark.jobserver" %% "job-server-api" % "0.6.2" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.2"
libraryDependencies += "com.holdenkarau" % "spark-testing-base_2.10" % "1.6.2_0.4.7" % "test"
I've generated the application package using:
sbt assembly
After that, I've submitted the package like this:
curl --data-binary @spark-test-assembly-1.0.jar localhost:8090/jars/myApp
When I triggered the job, I got the following error:
{
"duration": "0.101 secs",
"classPath": "jobs.TransformationJob",
"startTime": "2017-02-17T13:01:55.549Z",
"context": "42f857ba-jobs.TransformationJob",
"result": {
"message": "java.lang.Exception: Could not find resource path for Web UI: org/apache/spark/sql/execution/ui/static",
"errorClass": "java.lang.RuntimeException",
"stack": ["org.apache.spark.ui.JettyUtils$.createStaticHandler(JettyUtils.scala:180)", "org.apache.spark.ui.WebUI.addStaticHandler(WebUI.scala:117)", "org.apache.spark.sql.execution.ui.SQLTab.<init>(SQLTab.scala:34)", "org.apache.spark.sql.SQLContext$$anonfun$createListenerAndUI$1.apply(SQLContext.scala:1369)", "org.apache.spark.sql.SQLContext$$anonfun$createListenerAndUI$1.apply(SQLContext.scala:1369)", "scala.Option.foreach(Option.scala:236)", "org.apache.spark.sql.SQLContext$.createListenerAndUI(SQLContext.scala:1369)", "org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:77)", "jobs.TransformationJob$.runJob(TransformationJob.scala:64)", "jobs.TransformationJob$.runJob(TransformationJob.scala:14)", "spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)", "scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)", "scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)", "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)", "java.lang.Thread.run(Thread.java:745)"]
},
"status": "ERROR",
"jobId": "a6bd6f23-cc82-44f3-8179-3b68168a2aa7"
}
Here is the part of the application that is failing:
override def runJob(sparkCtx: SparkContext, config: Config): Any = {
  val sqlContext = new SQLContext(sparkCtx)
  ...
}
I have some questions:
1) I've noticed that to run spark-jobserver locally I don't need to have Spark installed. Does spark-jobserver already come with Spark embedded?
2) How do I know which version of Spark is being used by spark-jobserver? Where is that defined?
3) I'm using version 1.6.2 of spark-sql. Should I change it or keep it?
If anyone can answer my questions, I will be very grateful.
1) Yes, spark-jobserver has Spark as a dependency. Instead of job-server/reStart you should use job-server-extras/reStart, which will bring in the SQL-related dependencies.
2) Look at project/Versions.scala.
3) I don't think you need spark-sql, because it is included when you run job-server-extras/reStart.
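For illustration only, a rough sketch of what the job might look like against the extras API, assuming a SparkSqlJob trait from job-server-extras with the shape shown below (the trait name and signatures are an assumption and may differ between job-server versions); the query is a placeholder:
import com.typesafe.config.Config
import org.apache.spark.sql.SQLContext
import spark.jobserver.{SparkJobValid, SparkJobValidation, SparkSqlJob}

object TransformationSqlJob extends SparkSqlJob {

  // Assumed shape: the job server hands the job an SQLContext it manages itself, so the job
  // does not call new SQLContext(...) (the line where the stack trace above originates).
  def validate(sql: SQLContext, config: Config): SparkJobValidation = SparkJobValid

  def runJob(sql: SQLContext, config: Config): Any = {
    sql.sql("select 1").collect() // placeholder query
  }
}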

Not getting the proper base url using requestUri on server machine using spark-submit

My requirement is to build the REST JSON response from the request URI using Spray. I am using the requestUri directive to get the base URL. When I run the application through the IDE, or through spark-submit locally on my machine, I get the proper output. But when I do spark-submit on the cluster, the requestUri directive does not give me the full base URL; the URL I get is partial, so the expected output is not correct either.
The code to get the url is
requestUri { uri =>
  val reqUri = uri.toString
  complete {
    println("URI " + reqUri)
    reqUri
  }
}
build.sbt looks like this
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"
resolvers ++= Seq(
"Akka Repository" at "http://repo.akka.io/releases/")
resolvers ++= Seq("Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/",
"Spray Repository" at "http://repo.spray.io")
libraryDependencies +=
"com.typesafe.akka" %% "akka-actor" % "2.3.0"
libraryDependencies ++= {
val sprayVersion = "1.3.1"
Seq(
"io.spray" %% "spray-can" % sprayVersion,
"io.spray" %% "spray-routing" % sprayVersion,
"io.spray" %% "spray-json" % sprayVersion
)
}
Please let me know how I can fix this issue. All your suggestions are valuable. Thanks in advance.
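One thing that may be worth checking, offered as a guess rather than a confirmed fix: when the service is reached through a different host name or a proxy on the cluster, the authority part of the URI seen by requestUri may not be what the client used. A minimal sketch that rebuilds the base URL from the Host header instead (the route path and port are illustrative; an X-Forwarded-Host header, if your proxy sets one, could be read the same way):
import akka.actor.ActorSystem
import spray.routing.SimpleRoutingApp

object BaseUrlCheck extends App with SimpleRoutingApp {
  implicit val system = ActorSystem("base-url-check")

  startServer(interface = "0.0.0.0", port = 8085) {
    path("baseurl") {
      requestUri { uri =>
        headerValueByName("Host") { host =>
          // Take the scheme from the incoming URI, but the authority from the Host header
          val baseUrl = s"${uri.scheme}://$host"
          println("URI " + uri + ", base url " + baseUrl)
          complete(baseUrl)
        }
      }
    }
  }
}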
