How to run sqlContext in the spark-jobserver - apache-spark

I'm trying to run a job locally on the spark-jobserver. My application has the dependencies below:
name := "spark-test"
version := "1.0"
scalaVersion := "2.10.6"
resolvers += Resolver.jcenterRepo
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
libraryDependencies += "spark.jobserver" %% "job-server-api" % "0.6.2" % "provided"
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.6.2"
libraryDependencies += "com.holdenkarau" % "spark-testing-base_2.10" % "1.6.2_0.4.7" % "test"
I've generated the application package using:
sbt assembly
After that, I've submitted the package like this:
curl --data-binary @spark-test-assembly-1.0.jar localhost:8090/jars/myApp
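For reference, a job against that jar is then triggered through the jobserver REST API with a POST to the /jobs route, something along these lines (an illustrative sketch; the exact command isn't shown in the question, and any context/sync parameters are omitted):
curl -d "" 'localhost:8090/jobs?appName=myApp&classPath=jobs.TransformationJob'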
When I triggered the job, I got the following error:
{
"duration": "0.101 secs",
"classPath": "jobs.TransformationJob",
"startTime": "2017-02-17T13:01:55.549Z",
"context": "42f857ba-jobs.TransformationJob",
"result": {
"message": "java.lang.Exception: Could not find resource path for Web UI: org/apache/spark/sql/execution/ui/static",
"errorClass": "java.lang.RuntimeException",
"stack": ["org.apache.spark.ui.JettyUtils$.createStaticHandler(JettyUtils.scala:180)", "org.apache.spark.ui.WebUI.addStaticHandler(WebUI.scala:117)", "org.apache.spark.sql.execution.ui.SQLTab.<init>(SQLTab.scala:34)", "org.apache.spark.sql.SQLContext$$anonfun$createListenerAndUI$1.apply(SQLContext.scala:1369)", "org.apache.spark.sql.SQLContext$$anonfun$createListenerAndUI$1.apply(SQLContext.scala:1369)", "scala.Option.foreach(Option.scala:236)", "org.apache.spark.sql.SQLContext$.createListenerAndUI(SQLContext.scala:1369)", "org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:77)", "jobs.TransformationJob$.runJob(TransformationJob.scala:64)", "jobs.TransformationJob$.runJob(TransformationJob.scala:14)", "spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)", "scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)", "scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)", "java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)", "java.lang.Thread.run(Thread.java:745)"]
},
"status": "ERROR",
"jobId": "a6bd6f23-cc82-44f3-8179-3b68168a2aa7"
}
Here is the part of the application that is failing:
override def runJob(sparkCtx: SparkContext, config: Config): Any = {
  val sqlContext = new SQLContext(sparkCtx)
  ...
}
I have some questions:
1) I've noticed that to run spark-jobserver locally I don't need to have Spark installed. Does spark-jobserver already come with Spark embedded?
2) How do I find out which version of Spark is being used by spark-jobserver? Where is that specified?
3) I'm using version 1.6.2 of spark-sql. Should I change it or keep it?
If anyone can answer my questions, I will be very grateful.

Yes, spark-jobserver ships with its own Spark dependencies. Instead of job-server/reStart you should use job-server-extras/reStart, which brings in the SQL-related dependencies as well.
Look at project/Versions.scala to see which Spark version is used.
I don't think you need the spark-sql dependency, because it is already included when you run job-server-extras/reStart.
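To avoid bundling a second copy of Spark inside the fat jar, one option is to mark the Spark artifacts as "provided" so the job uses whatever the jobserver itself ships with. A minimal sketch of such a build.sbt (the versions are assumptions; align them with what project/Versions.scala in your jobserver checkout declares):
name := "spark-test"
version := "1.0"
scalaVersion := "2.10.6"
resolvers += Resolver.jcenterRepo
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                % "1.6.1" % "provided", // supplied by the jobserver at runtime
  "org.apache.spark"   %% "spark-sql"                 % "1.6.1" % "provided", // pulled in by job-server-extras
  "spark.jobserver"    %% "job-server-api"            % "0.6.2" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.2"               // not part of Spark, so keep it in the fat jar
)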

Related

"Unable to instantiate SparkSession with Hive support" error when trying to process hive table with spark

I want to process a Hive table using Spark, but when I run my program I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to instantiate SparkSession with Hive support because Hive classes are not found.
My application code
object spark_on_hive_table extends App {
  val spark = SparkSession
    .builder()
    .appName("Spark Hive Example")
    .config("spark.sql.warehouse.dir", "hdfs://localhost:54310/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()

  import spark.implicits._

  spark.sql("select * from pbSales").show()
}
build.sbt
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.2",
  "org.apache.spark" %% "spark-sql" % "2.3.2",
  "org.apache.spark" %% "spark-streaming" % "2.3.2",
  "org.apache.spark" %% "spark-hive" % "2.3.2" % "provided"
)
You should remove the "provided" scope from your spark-hive dependency:
"org.apache.spark" %% "spark-hive" % "2.3.2" % "provided"
and change it to
"org.apache.spark" %% "spark-hive" % "2.3.2"

Dependency for org.apache.spark.streaming.kafka.KafkaUtils

I am trying to integrate Spark Streaming with Kafka, but I am unable to resolve the dependency for org.apache.spark.streaming.kafka.KafkaUtils. Below is my build.sbt:
name := "StreamingTest"
version := "1.0"
organization := "com.sundogsoftware"
scalaVersion := "2.12.10"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.0-preview2" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.0.0-preview2",
  "org.apache.spark" %% "spark-sql" % "3.0.0-preview2",
  "org.apache.spark" %% "spark-streaming" % "3.0.0-preview2" % "provided",
  "org.apache.kafka" %% "kafka" % "2.0.0"
)
I am using following imports in my project:
import org.apache.spark.{SparkConf, SparkContext, sql}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder
All the dependencies are resolved except org.apache.spark.streaming.kafka.KafkaUtils. I am using Spark version 3.0.0-preview2 and Scala version 2.12.10.
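The import itself is the likely culprit: org.apache.spark.streaming.kafka.KafkaUtils comes from the old spark-streaming-kafka-0-8 module, which no longer exists in Spark 3.x. With the spark-streaming-kafka-0-10 artifact declared above, the class lives in the org.apache.spark.streaming.kafka010 package instead. A minimal sketch against that API (broker address, topic and group id are placeholders):
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object StreamingTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingTest").setMaster("local[*]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Placeholder connection settings; the consumer properties come from kafka-clients,
    // which spark-streaming-kafka-0-10 already pulls in transitively.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "streaming-test"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("myTopic"), kafkaParams))

    stream.map(_.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
With that, the explicit "org.apache.kafka" %% "kafka" dependency is usually unnecessary for a consumer, since kafka-clients arrives transitively.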

Dependencies for Spark-Streaming and Twitter-Streaming in SBT

I was trying to use the following dependencies in my build.sbt, but they keep giving an "unresolved dependency" error.
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter_2.11" % "2.2.0.1.0.0-SNAPSHOT"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.0"
I'm using Spark 2.2.0. What are the correct dependencies?
The question was posted a while ago, but I ran into the same problem this week. Here is the solution for those who still have it:
As you can see here, the correct artifact name for importing the library with SBT is "spark-streaming-twitter", while with Maven it is "spark-streaming-twitter_2.11". This is because SBT's %% operator appends the Scala binary version to the artifact name automatically, so you leave the "_2.11" suffix off.
The catch is that the only artifact that works is the one built for Scala 2.11. With Scala 2.12, for example, you will get this error:
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.bahir#spark-streaming-twitter_2.12;2.3.2: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
But if you use Scala 2.11, it should work fine. Here is a working sbt file:
name := "twitter-read"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.2"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.2" % "provided"
libraryDependencies += "org.twitter4j" % "twitter4j-core" % "3.0.3"
libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "3.0.3"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-twitter" % "2.3.2"
Below are the dependencies you need to add for Spark-Twitter Streaming.
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.bahir</groupId>
  <artifactId>spark-streaming-twitter_2.11</artifactId>
  <version>2.0.0</version>
</dependency>
<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-core</artifactId>
  <version>4.0.4</version>
</dependency>
<dependency>
  <groupId>org.twitter4j</groupId>
  <artifactId>twitter4j-stream</artifactId>
  <version>4.0.4</version>
</dependency>
<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>jsr166e</artifactId>
  <version>1.1.0</version>
</dependency>

Spark Driver Heap Memory Issues

I am seeing issues where I slowly run out of Java heap on the master node. Below is a simple example I've created, which just repeats itself 200 times. With the settings below, the master runs out of memory in about an hour with the following error:
16/12/15 17:55:46 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 97578 on executor id: 9 hostname: ip-xxx-xxx-xx-xx
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 20160"...
The Code:
import org.apache.spark.sql.functions._
import org.apache.spark._

object MemTest {
  case class X(colval: Long, colname: Long, ID: Long)

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MemTest")
    val spark = new SparkContext(conf)
    val sc = org.apache.spark.sql.SQLContext.getOrCreate(spark)
    import sc.implicits._

    for (a <- 1 to 200) {
      var df = spark.parallelize((1 to 5000000).map(x => X(x.toLong, x.toLong % 10, x.toLong / 10))).toDF()
      df = df.groupBy("ID").pivot("colname").agg(max("colval"))
      df.count
    }

    spark.stop()
  }
}
I'm running on AWS emr-5.1.0 using m4.xlarge instances (4 nodes + 1 master). Here are my Spark settings:
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.dynamicAllocation.enabled": "false",
    "spark.executor.instances": "16",
    "spark.executor.memory": "2560m",
    "spark.driver.memory": "768m",
    "spark.executor.cores": "1"
  }
},
{
  "Classification": "spark",
  "Properties": {
    "maximizeResourceAllocation": "false"
  }
},
I compile with sbt using
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.2")
and then run it using
spark-submit --class MemTest target/scala-2.11/simple-project_2.11-1.0.jar
Looking at memory with jmap -histo, I see java.lang.Long and scala.Tuple2 keep growing.
Are you sure the spark version installed on the cluster is 2.0.2?
Or if there are several Spark installations on your cluster, are you sure you're calling the correct (2.0.2) spark-submit?
(I unfortunately can't comment yet, which is why I posted this as an answer.)
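One quick way to rule this out is to print the version the job actually runs against and compare it with what the launcher reports; a small sketch:
// Inside main(), right after the SparkContext is created ('spark' is the SparkContext from the code above):
println(s"Runtime Spark version: ${spark.version}")

// On the cluster, compare with what the launcher ships:
//   spark-submit --version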

Not getting the proper base url using requestUri on server machine using spark-submit

My requirement is to build a REST JSON response from the request URI using Spray. I am using the requestUri directive to get the base URL. When I run the application through the IDE, or through spark-submit locally on my machine, I get the proper output. But when I do the spark-submit on the cluster, I do not get the base URL from the requestUri directive; the URL I get is partial, and because of that the expected output is also not correct.
The code to get the url is
requestUri { uri =>
  val reqUri = s"$uri" // uri.toString()
  complete {
    println("URI " + reqUri)
  }
}
build.sbt looks like this
scalaVersion := "2.10.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"
resolvers ++= Seq("Akka Repository" at "http://repo.akka.io/releases/")
resolvers ++= Seq(
  "Typesafe Repository" at "http://repo.typesafe.com/typesafe/releases/",
  "Spray Repository" at "http://repo.spray.io")
libraryDependencies += "com.typesafe.akka" %% "akka-actor" % "2.3.0"
libraryDependencies ++= {
  val sprayVersion = "1.3.1"
  Seq(
    "io.spray" %% "spray-can" % sprayVersion,
    "io.spray" %% "spray-routing" % sprayVersion,
    "io.spray" %% "spray-json" % sprayVersion
  )
}
Please let me know how I can fix this issue. All suggestions are appreciated. Thanks in advance.
