What is use of method addJar() in Spark? - apache-spark

In spark job, I don't know how to import and use the jars that is shared by method SparkContext.addJar(). It seems that this method is able to move jars into some place that are accessible by other nodes in the cluster, but I do not know how to import them.
This is an example:
package utils;
public class addNumber {
public int addOne(int i){
return i + 1;
}
public int addTwo(int i){
return i + 2;
}
}
I create a class called addNumber and make it into a jar file utils.jar.
Then I create a spark job and codes are shown below:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
object TestDependencies {
def main(args:Array[String]): Unit = {
val sparkConf = new SparkConf
val sc = new SparkContext(sparkConf)
sc.addJar("/path/to//utils.jar")
val data = 1 to 100 toList
val rdd = sc.makeRDD(data)
val rdd_1 = rdd.map ( x => {
val handler = new utils.addNumber
handler.addOne(x)
} )
rdd_1.collect().foreach { x => print(x + "||") }
}
}
The error "java.lang.NoClassDefFoundError: utils/addNumber" raised after submission of the job through command "spark-submit".
I know that method addJar() does not guarantee jars included into class path of the spark job. If I want to use the jar files I have move all of dependencies to the same path in each node of cluster. But if I can move and include all of the jars, what is the use of method addJar()?
I am wondering if there is a way using jars imported by method addJar(). Thanks in advance.

Did you try set the path of jar with prefix "local"? From documentation:
public void addJar(String path)
Adds a JAR dependency for all tasks to be executed on this
SparkContext in the future. The path passed can be either a local
file, a file in HDFS (or other Hadoop-supported filesystems), an HTTP,
HTTPS or FTP URI, or local:/path for a file on every worker node.
You can try this way as well:
val conf = new SparkConf()
.setMaster('local[*]')
.setAppName('tmp')
.setJars(Array('/path1/one.jar', '/path2/two.jar'))
val sc = new SparkContext(conf)
and take a look here, check spark.jars option
and set "--jars" param in spark-submit:
--jars /path/1.jar,/path/2.jar
or edit conf/spark-defaults.conf:
spark.driver.extraClassPath /path/1.jar:/fullpath/2.jar
spark.executor.extraClassPath /path/1.jar:/fullpath/2.jar

Related

Spark Structured Streaming program that reads from non-empty Kafka topic (starting from earliest) triggers batches locally, but not on EMR cluster

We are running into a problem where -- for one of our applications --
we don't see any evidences of batches being processed in the Structured
Streaming tab of the Spark UI.
I have written a small program (below) to reproduce the issue.
A self-contained project that allows you to build the app, along with scripts that facilitate upload to AWS, and details on how to run and reproduce the issue can be found here: https://github.com/buildlackey/spark-struct-streaming-metrics-missing-on-aws (The github version of the app is a slightly evolved version of what is presented below, but it illustrates the problem of Spark streaming metrics not showing up.)
The program can be run 'locally' -- on someones' laptop in local[*] mode (say with a dockerized Kafka instance),
or on an EMR cluster. For local mode operation you invoke the main method with 'localTest' as the first
argument.
In our case, when we run on the EMR cluster, pointing to a topic
where we know there are many data records (we read from 'earliest'), we
see that THERE ARE INDEED NO BATCHES PROCESSED -- on the cluster for some reason...
In the local[*] case we CAN see batches processed.
To capture evidence of this i wrote a forEachBatch handler that simply does a
toLocalIterator.asScala.toList.mkString("\n") on the Dataset of each batch, and then dumps the
resultant string to a file. Running locally.. i see evidence of the
captured records in the temporary file. HOWEVER, when I run on
the cluster and i ssh into one of the executors i see NO SUCH
files. I also checked the master node.... no files matching the pattern 'Missing'
So... batches are not triggering on the cluster. Our kakfa has plenty of data and
when running on the cluster the logs show we are churning through messages at increasing offsets:
21/12/16 05:15:21 DEBUG KafkaDataConsumer: Get spark-kafka-source-blah topic.foo.event-18 nextOffset 4596542913 requested 4596542913
21/12/16 05:15:21 DEBUG KafkaDataConsumer: Get spark-kafka-source-blah topic.foo.event-18 nextOffset 4596542914 requested 4596542914
Note to get the logs we are using:
yarn yarn logs --applicationId <appId>
which should get both driver and executor logs for the entire run (when app terminates)
Now, in the local[*] case we CAN see batches processed. The evidence is that we see a file whose name
is matching the pattern 'Missing' in our tmp folder.
I am including my simple demo program below. If you can spot the issue and clue us in, I'd be very grateful !
// Please forgive the busy code.. i stripped this down from a much larger system....
import com.typesafe.scalalogging.StrictLogging
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.{Dataset, SparkSession}
import java.io.File
import java.util
import scala.collection.JavaConverters.asScalaIteratorConverter
import scala.concurrent.duration.Duration
object AwsSupportCaseFailsToYieldLogs extends StrictLogging {
case class KafkaEvent(fooMsgKey: Array[Byte],
fooMsg: Array[Byte],
topic: String,
partition: String,
offset: String) extends Serializable
case class SparkSessionConfig(appName: String, master: String) {
def sessionBuilder(): SparkSession.Builder = {
val builder = SparkSession.builder
builder.master(master)
builder
}
}
case class KafkaConfig(kafkaBootstrapServers: String, kafkaTopic: String, kafkaStartingOffsets: String)
def sessionFactory: (SparkSessionConfig) => SparkSession = {
(sparkSessionConfig) => {
sparkSessionConfig.sessionBuilder().getOrCreate()
}
}
def main(args: Array[String]): Unit = {
val (sparkSessionConfig, kafkaConfig) =
if (args.length >= 1 && args(0) == "localTest") {
getLocalTestConfiguration
} else {
getRunOnClusterConfiguration
}
val spark: SparkSession = sessionFactory(sparkSessionConfig)
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val dataSetOfKafkaEvent: Dataset[KafkaEvent] = spark.readStream.
format("kafka").
option("subscribe", kafkaConfig.kafkaTopic).
option("kafka.bootstrap.servers", kafkaConfig.kafkaBootstrapServers).
option("startingOffsets", kafkaConfig.kafkaStartingOffsets).
load.
select(
$"key" cast "binary",
$"value" cast "binary",
$"topic",
$"partition" cast "string",
$"offset" cast "string").map { row =>
KafkaEvent(
row.getAs[Array[Byte]](0),
row.getAs[Array[Byte]](1),
row.getAs[String](2),
row.getAs[String](3),
row.getAs[String](4))
}
val initDF = dataSetOfKafkaEvent.map { item: KafkaEvent => item.toString }
val function: (Dataset[String], Long) => Unit =
(dataSetOfString, batchId) => {
val iter: util.Iterator[String] = dataSetOfString.toLocalIterator()
val lines = iter.asScala.toList.mkString("\n")
val outfile = writeStringToTmpFile(lines)
println(s"writing to file: ${outfile.getAbsolutePath}")
logger.error(s"writing to file: ${outfile.getAbsolutePath} / $lines")
}
val trigger = Trigger.ProcessingTime(Duration("1 second"))
initDF.writeStream
.foreachBatch(function)
.trigger(trigger)
.outputMode("append")
.start
.awaitTermination()
}
private def getLocalTestConfiguration: (SparkSessionConfig, KafkaConfig) = {
val sparkSessionConfig: SparkSessionConfig =
SparkSessionConfig(master = "local[*]", appName = "dummy2")
val kafkaConfig: KafkaConfig =
KafkaConfig(
kafkaBootstrapServers = "localhost:9092",
kafkaTopic = "test-topic",
kafkaStartingOffsets = "earliest")
(sparkSessionConfig, kafkaConfig)
}
private def getRunOnClusterConfiguration = {
val sparkSessionConfig: SparkSessionConfig = SparkSessionConfig(master = "yarn", appName = "AwsSupportCase")
val kafkaConfig: KafkaConfig =
KafkaConfig(
kafkaBootstrapServers= "kafka.foo.bar.broker:9092", // TODO - change this for kafka on your EMR cluster.
kafkaTopic= "mongo.bongo.event", // TODO - change this for kafka on your EMR cluster.
kafkaStartingOffsets = "earliest")
(sparkSessionConfig, kafkaConfig)
}
def writeStringFile(string: String, file: File): File = {
java.nio.file.Files.write(java.nio.file.Paths.get(file.getAbsolutePath), string.getBytes).toFile
}
def writeStringToTmpFile(string: String, deleteOnExit: Boolean = false): File = {
val file: File = File.createTempFile("streamingConsoleMissing", "sad")
if (deleteOnExit) {
file.delete()
}
writeStringFile(string, file)
}
}
I have encountered similar issue, maxOffsetsPerTrigger would fix the issue. Actually, it's not issue.
All logs and metrics per batch are only printed or showing after
finish of this batch. That's the reason why you can't see the job make
progress.
If maxOffsetsPerTrigger can't solve the issue, you could try to consume from latest offset to confirm the procssing logic is correct.
This is a provisional answer. One of our team members has a theory that looks pretty likely. Here it is: Batches ARE getting processed (this is demonstrated better by the version of the program I linked to on github), but we are thinking that since there is so much backed up in the topic on our cluster that the processing (from earliest) of the first batch takes a very long time, hence when looking at the cluster we see zero batches processed... even though there is clearly work being done. It might be that the solution is to use maxOffsetsPerTrigger to gate the amount of incoming traffic (when starting from earliest and working w/ a topic that has huge volumes of data). We are working on confirming this.

SparkConf as Kubernetes ConfigMap

I am playing with the spark-on-k8s-operator. I wondered upfront if anyone has good examples/manifests for providing spark conf via Kubernetes ConfigMaps?
Appreciate some pointers.
For now, I am using:import com.typesafe.config.{Config, ConfigFactory} and an explicit application.conf file in src/main/resources
#transient lazy val logger: Logger = LoggerFactory.getLogger(getClass)
val config: Config = ConfigFactory.parseResources("application.conf")
config.checkValid(ConfigFactory.defaultReference(), topicName)
private val source: String = config.getString(s"${topicName}.source")
private val topic: String =
config.getString(s"${topicName}.topic")
private val brokers: String =
config.getString(s"${topicName}.kafka_bootstrap_servers")
private val offsets: String =
config.getString(s"${topicName}.auto_offset_reset")
private val failOnLoss: String =
config.getString(s"${topicName}.fail_on_data_loss")
I dont have any examples, but i can suggest an offer.
Instead use application.conf, you can use any YAML library(or other types of config files extention), and take the path of the config.yaml from environment variable (for example: app.config.dir=/etc/app/config.yaml),
Than configure kubernetes config map file (config.yaml) to /etc/app,
and when your app is started it`ll read from there the configuration (offcourse you should setup the environment variable of app.config.dir).

NullPointer in broadcast value being passed as parameter

:)
I'd like to say that I'm new in Spark, as many of these posts start..but the truth is I'm not that new.
Still, I'm facing this issue with broadcast variables.
When a variable is broadcast, each executor receives a copy of it. Later on, when this variable is referenced in the part of the code that is executed in the executors (let's say map or foreach), if the variable reference that was set in the driver is not passed to it, the executor does not know what are we talking about. Which I think is perfectly explain here
My problem is I am getting a nullPointerException even tough I passed the broadcast reference to the executors.
class A {
var broadcastVal: Broadcast[Dataframe] = _
...
def method1 {
broadcastVal = otherMethodWhichSendBroadcast
doSomething(broadcastVal, others)
}
}
class B {
def doSomething(...) {
forEachPartition {x => doSomethingElse(x, broadcasVal)}
}
}
object C {
def doSomethingElse(...) {
broadcastVal.value.show --> Exception
}
}
What am I missing?
Thanks in advance!
RDD and DataFrames are already distributed structures, no need to broadcast them as local variable .(org.apache.spark.sql.functions.broadcast() function (which is used while doing joins) is not local variable broadcast )
Even if you try the code syntax wise it wont show any compilation error, rather it will throw RuntimeException like NullPointerException which is 100% valid.
Example to Explain the behavior :
package examples
import org.apache.log4j.Level
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.{DataFrame, SparkSession}
object BroadCastCheck extends App {
org.apache.log4j.Logger.getLogger("org").setLevel(Level.OFF)
val spark = SparkSession.builder().appName(getClass.getName).master("local").getOrCreate()
val sc = spark.sparkContext
val df = spark.range(100).toDF()
var broadcastVal: Broadcast[DataFrame] = sc.broadcast(df)
val t1 = sc.parallelize(0 until 10)
val t2 = sc.broadcast(2) // this is right since its local variable can be primitive or map or any scala collection
val t3 = t1.filter(_ % t2.value == 0).persist() //this is the way of ha
t3.foreach {
x =>
println(x)
// broadcastVal.value.toDF().show // null pointer wrong way
// spark.range(100).toDF().show // null pointer wrong way
}
}
Result : (if you un comment broadcastVal.value.toDF().show or spark.range(100).toDF().show in above code)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics$lzycompute(WholeStageCodegenExec.scala:528)
at org.apache.spark.sql.execution.WholeStageCodegenExec.metrics(WholeStageCodegenExec.scala:527)
Further read the difference between broadcast variable and broadcast function here...

How to load tar.gz files in streaming datasets?

I would like to do streaming from tar-gzip files (tgz) which include my actual CSV stored data.
I already managed to do structured streaming with spark 2.2 when my data comes in as CSV files, but actually, the data comes in as gzipped csv files.
Is there a way that the trigger done by structured streaming does an decompress before handling the CSV stream?
The code I use to process the files is this:
val schema = Encoders.product[RawData].schema
val trackerData = spark
.readStream
.option("delimiter", "\t")
.schema(schema)
.csv(path)
val exceptions = rawCientData
.as[String]
.flatMap(extractExceptions)
.as[ExceptionData]
produced output as expected when path points to csv files.
But I would like to use tar gzip files.
When I try to place those files at the given path, I do not get any exceptions and batch output tells me
"sources" : [ {
"description" : "FileStreamSource[file:/Users/matthias/spark/simple_spark/src/main/resources/zsessionlog*]",
"startOffset" : null,
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 1095,
"processedRowsPerSecond" : 211.0233185584891
} ],
But I do not get any actual data processed.
Console sink looks like this:
+------+---+-----+
|window|id |count|
+------+---+-----+
+------+---+-----+
I solved the part of reading .tar.gz (.tgz) files this way:
Inspired by this site I created my own TGZ codec
final class DecompressTgzCodec extends CompressionCodec {
override def getDefaultExtension: String = ".tgz"
override def createOutputStream(out: OutputStream): CompressionOutputStream = ???
override def createOutputStream(out: OutputStream, compressor: Compressor): CompressionOutputStream = ???
override def createCompressor(): Compressor = ???
override def getCompressorType: Class[_ <: Compressor] = ???
override def createInputStream(in: InputStream): CompressionInputStream = {
new TarDecompressorStream(new TarArchiveInputStream(new GzipCompressorInputStream(in)))
}
override def createInputStream(in: InputStream, decompressor: Decompressor): CompressionInputStream = createInputStream(in)
override def createDecompressor(): Decompressor = null
override def getDecompressorType: Class[_ <: Decompressor] = null
final class TarDecompressorStream(in: TarArchiveInputStream) extends DecompressorStream(in) {
def updateStream(): Unit = {
// still have data in stream -> done
if (in.available() <= 0) {
// create stream content from following tar elements one by one
in.getNextTarEntry()
}
}
override def read: Int = {
checkStream()
updateStream()
in.read()
}
override def read(b: Array[Byte], off: Int, len: Int): Int = {
checkStream()
updateStream()
in.read(b, off, len)
}
override def resetState(): Unit = {}
}
}
And registered it for use by spark.
val conf = new SparkConf()
conf.set("spark.hadoop.io.compression.codecs", classOf[DecompressTgzCodec].getName)
val spark = SparkSession
.builder()
.master("local[*]")
.config(conf)
.appName("Streaming Example")
.getOrCreate()
Works exactly like I wanted it to do.
I do not think reading tar.gz'ed files is possible in Spark (see Read whole text files from a compression in Spark or gzip support in Spark for some ideas).
Spark does support gzip files, but they are not recommended as not splittable and result in a single partition (that in turn makes Spark of little to no help).
In order to have gzipped files loaded in Spark Structured Streaming you have to specify the path pattern so the files are included in loading, say zsessionlog*.csv.gz or alike. Else, csv alone loads CSV files only.
If you insist on Spark Structured Streaming to handle tar.gz'ed files, you could write a custom streaming data Source to do the un-tar.gz.
Given gzip files are not recommended as data format in Spark, the whole idea of using Spark Structured Streaming does not make much sense.

How to know executor IPs for a Spark application?

I want to get all the IPs of executors at runtime, which API in Spark should I use? Or any other method to get IPs at runtime?
You should use SparkListener abstract class and intercept two executor-specific events - SparkListenerExecutorAdded and SparkListenerExecutorRemoved.
override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit = {
val execId = executorAdded.executorId
val host = executorAdded.executorInfo.executorHost
executors += (execId -> host)
println(s">>> executor id=$execId added on host=$host")
}
override def onExecutorRemoved(executorRemoved: SparkListenerExecutorRemoved): Unit = {
val execId = executorRemoved.executorId
val host = executors remove execId getOrElse "Host unknown"
println(s">>> executor id=$execId removed from host=$host")
}
The entire working project is in my Spark Executor Monitor Project.
There is a class in Apache Spark namely ExecutorInfo which has a method executorHost() which returns the Executor Host IP.

Resources