Dependencies for Spark-Streaming and Twitter-Streaming in SBT - apache-spark

I was trying to use the following dependencies in my build.sbt, but it keeps giving an "unresolved dependency" error.
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter_2.11" % "2.2.0.1.0.0-SNAPSHOT"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.0"
I'm using Spark 2.2.0. What are the correct dependencies?

The question was posted a while ago, but I ran into the same problem this week. Here is the solution for those who still have the problem:
As you can see here, the correct artifact name when importing the library with sbt is "spark-streaming-twitter", while with Maven it is "spark-streaming-twitter_2.11". This is because the %% operator in sbt appends the Scala binary version to the artifact name automatically, so you must not spell out the "_2.11" suffix yourself.
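To make that concrete, with scalaVersion := "2.11.12" the two declarations below resolve to the same artifact (a short sketch illustrating the %% expansion):
// %% appends the Scala binary version to the artifact name automatically
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.3.2"
// equivalent form: spell out the suffix yourself and use a single %
libraryDependencies += "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.3.2"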
But the catch is that the only artifact that actually exists is "spark-streaming-twitter_2.11", i.e. the library is only published for Scala 2.11. For example, with Scala 2.12 you will get the error
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.bahir#spark-streaming-twitter_2.12;2.3.2: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
But if you use Scala 2.11, it should work fine. Here is a working sbt file:
name := "twitter-read"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.2"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.2" % "provided"
libraryDependencies += "org.twitter4j" % "twitter4j-core" % "3.0.3"
libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.3.2"

If you are using Maven, below are the dependencies you need to add for Spark Twitter streaming.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>spark-streaming-twitter_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-core</artifactId>
    <version>4.0.4</version>
</dependency>
<dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-stream</artifactId>
    <version>4.0.4</version>
</dependency>
<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>jsr166e</artifactId>
    <version>1.1.0</version>
</dependency>
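If you are on sbt rather than Maven, the equivalent declarations would be along these lines (a sketch that simply mirrors the coordinates above; with %% the _2.11 suffix is appended for you, so scalaVersion must be a 2.11.x release):
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "2.0.0",
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.0",
  "org.twitter4j" % "twitter4j-core" % "4.0.4",
  "org.twitter4j" % "twitter4j-stream" % "4.0.4",
  "com.twitter" % "jsr166e" % "1.1.0"
)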

Related

"Unable to instantiate SparkSession with Hive support" error when trying to process hive table with spark

I want to process a Hive table using Spark, but when I run my program, I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
My application code:
import org.apache.spark.sql.SparkSession

object spark_on_hive_table extends App {
  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .config("spark.sql.warehouse.dir", "hdfs://localhost:54310/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._

  spark.sql("select * from pbSales").show()
}
build.sbt
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.2",
"org.apache.spark" %% "spark-sql" % "2.3.2",
"org.apache.spark" %% "spark-streaming" % "2.3.2",
"org.apache.spark" %% "spark-hive" % "2.3.2" % "provided"
)
Your spark-hive dependency is marked as "provided", which keeps it on the compile classpath but off the runtime classpath when you run the application directly, and that is exactly why the Hive classes cannot be found. Remove provided from your spark-hive dependency:
"org.apache.spark" %% "spark-hive" % "2.3.2" % "provided"
change to
"org.apache.spark" %% "spark-hive" % "2.3.2"

Dependency for org.apache.spark.streaming.kafka.KafkaUtils

I am trying to integrate Spark Streaming with Kafka, but I am unable to resolve the dependency for org.apache.spark.streaming.kafka.KafkaUtils. Below is my build.sbt:
name := "StreamingTest"
version := "1.0"
organization := "com.sundogsoftware"
scalaVersion := "2.12.10"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "3.0.0-preview2" % "provided",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.0.0-preview2",
"org.apache.spark" %% "spark-sql" % "3.0.0-preview2",
"org.apache.spark" %% "spark-streaming" % "3.0.0-preview2" % "provided",
"org.apache.kafka" %% "kafka" % "2.0.0"
)
I am using the following imports in my project:
import org.apache.spark.{SparkConf, SparkContext, sql}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
All the dependencies are resolved except org.apache.spark.streaming.kafka.KafkaUtils. I am using Spark version 3.0.0-preview2 and Scala version 2.12.10.
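For reference, the old org.apache.spark.streaming.kafka package belongs to the 0-8 Kafka integration, which was removed in Spark 3.0; the spark-streaming-kafka-0-10 artifact declared above ships KafkaUtils under org.apache.spark.streaming.kafka010 instead (and kafka.serializer.StringDecoder is part of the old API and is no longer needed). A minimal setup with the 0-10 API looks roughly like this (a sketch, not from the original post; the broker address and topic name are placeholders):
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("StreamingTest")
    val ssc = new StreamingContext(conf, Seconds(5))

    // settings passed straight through to the underlying Kafka consumer
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "streaming-test",
      "auto.offset.reset" -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("someTopic"), kafkaParams))

    // print the message values of each batch
    stream.map(_.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}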

SBT build error: [error] (*:update) sbt.ResolveException: unresolved dependency:

I am trying to execute a Spark program using an sbt build but am getting the errors below.
[error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.hadoop#hadoop-mapreduce-client-app;2.2.0: not found
[error] unresolved dependency: org.apache.hadoop#hadoop-yarn-api;2.2.0: not found
[error] unresolved dependency: org.apache.hadoop#hadoop-mapreduce-client-core;2.2.0: not found
[error] unresolved dependency: org.apache.hadoop#hadoop-mapreduce-client-jobclient;2.2.0: not found
[error] unresolved dependency: asm#asm;3.1: not found
[error] unresolved dependency: org.apache.spark#hadoop-core_2.10;2.2.0: not found
[error] unresolved dependency: org.apache.hadoop#hadoop-client_2.11;2.2.0: not found
[error] download failed: org.apache.avro#avro;1.7.7!avro.jar
[error] download failed: commons-codec#commons-codec;1.4!commons-codec.jar
Attaching the code as well as the sbt build file. I have set up the folder structure correctly though.
helloSpark.scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object HelloSpark {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local").setAppName("Hello Spark")
    val sc = new SparkContext(conf)
    // count the lines in data.txt that mention "spark"
    val rddFile = sc.textFile("data.txt").filter(line => line.contains("spark")).count()
    println("lines with spark: %s".format(rddFile))
  }
}
simple.sbt:
name := "Hello Spark"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
libraryDependencies += "org.apache.spark" %% "hadoop-core" % "2.2.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client_2.11" % "2.2.0"

How to run sqlContext in the spark-jobserver

I'm trying to execute a job locally in the spark-jobserver. My application has the dependencies below:
name := "spark-test"
version := "1.0"
scalaVersion := "2.10.6"
resolvers += Resolver.jcenterRepo
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "spark.jobserver" %% "job-server-api" % "0.6.2" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.2"
libraryDependencies += "com.holdenkarau" % "spark-testing-base_2.10" % "1.6.2_0.4.7" % "test"
I've generated the application package using:
sbt assembly
After that, I've submitted the package like this:
curl --data-binary @spark-test-assembly-1.0.jar localhost:8090/jars/myApp
When I triggered the job, I got the following error:
{
"duration": "0.101 secs",
"classPath": "jobs.TransformationJob",
"startTime": "2017-02-17T13:01:55.549Z",
"context": "42f857ba-jobs.TransformationJob",
"result": {
"message": "java.lang.Exception: Could not find resource path for Web UI: org/apache/spark/sql/execution/ui/static",
"errorClass": "java.lang.RuntimeException",
"stack": ["org.apache.spark.ui.JettyUtils$.createStaticHandler(JettyUtils.scala:180)", "org.apache.spark.ui.WebUI.addStaticHandler(WebUI.scala:117)", "org.apache.spark.sql.execution.ui.SQLTab.<init>(SQLTab.scala:34)", "org.apache.spark.sql.SQLContext$$anonfun$createListenerAndUI$1.apply(SQLContext.scala:1369)", "org.apache.spark.sql.SQLContext$$anonfun$createListenerAndUI$1.apply(SQLContext.scala:1369)", "scala.Option.foreach(Option.scala:236)", "org.apache.spark.sql.SQLContext$.createListenerAndUI(SQLContext.scala:1369)", "org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:77)", "jobs.TransformationJob$.runJob(TransformationJob.scala:64)", "jobs.TransformationJob$.runJob(TransformationJob.scala:14)", "spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)", "scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)", "scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)", "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)", "java.lang.Thread.run(Thread.java:745)"]
},
"status": "ERROR",
"jobId": "a6bd6f23-cc82-44f3-8179-3b68168a2aa7"
}
Here is the part of the application that is failing:
override def runJob(sparkCtx: SparkContext, config: Config): Any = {
val sqlContext = new SQLContext(sparkCtx)
...
}
I have some questions:
1) I've noticed that to run spark-jobserver locally I don't need to have Spark installed. Does spark-jobserver already come with Spark embedded?
2) How do I know which version of Spark is being used by spark-jobserver? Where is that defined?
3) I'm using version 1.6.2 of spark-sql. Should I change it or keep it?
If anyone can answer my questions, I will be very grateful.
Yes, spark-jobserver ships with its own Spark dependencies. Instead of job-server/reStart you should use job-server-extras/reStart, which brings in the SQL-related dependencies.
The Spark version being used is defined in project/Versions.scala.
You probably don't need to declare spark-sql yourself, because it is already included when you run job-server-extras/reStart.

Not getting the proper base url using requestUri on server machine using spark-submit

My requirement is building the REST JSON response from the request URI using spray. I am using the requestUri directive to get the base URL. When I run the application through the IDE, or through spark-submit locally on my machine, I get the proper output. But when I run spark-submit on the cluster, the requestUri directive does not give me the base URL; the URL I get is partial, and because of this the expected output is not correct either.
The code to get the url is
requestUri { uri =>
  val reqUri = s"$uri" // same as uri.toString
  complete {
    println("URI " + reqUri)
    reqUri // return the URI so complete has a value to marshal into the response
  }
}
build.sbt looks like this
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"
resolvers ++= Seq(
"Akka Repository" at "http://repo.akka.io/releases/")
resolvers ++= Seq("Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/",
"Spray Repository" at "http://repo.spray.io")
libraryDependencies +=
"com.typesafe.akka" %% "akka-actor" % "2.3.0"
libraryDependencies ++= {
val sprayVersion = "1.3.1"
Seq(
"io.spray" %% "spray-can" % sprayVersion,
"io.spray" %% "spray-routing" % sprayVersion,
"io.spray" %% "spray-json" % sprayVersion
)
}
Please let me know how I can fix this issue. All your suggestions are valuable. Thanks in advance.
