Spark streaming: batch interval vs window - apache-spark

I have a Spark Streaming application which consumes Kafka messages, and I want to process all messages from the last 10 minutes together.
It looks like there are two approaches to get the job done:
val ssc = new StreamingContext(new SparkConf(), Minutes(10))
val dstream = ....
and
val ssc = new StreamingContext(new SparkConf(), Seconds(1))
val dstream = ....
dstream.window(Minutes(10), Minutes(10))
and I just want to clarify whether there is any performance difference between them.
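For concreteness, here is a minimal runnable sketch of the second (windowed) approach, using a socket source as a stand-in for my Kafka stream and a simple per-window count:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
val conf = new SparkConf().setAppName("window-example").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(1))
// stand-in source; the real job uses a Kafka direct stream
val lines = ssc.socketTextStream("localhost", 9999)
// non-overlapping 10-minute windows: window length == slide interval
lines.window(Minutes(10), Minutes(10)).count().print()
ssc.start()
ssc.awaitTermination()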

Related

foreachRDD sometimes takes too long between batches

I have a problem; we are using Kafka and Spark.
val ssc = new StreamingContext(conf, Seconds(10))
val messages = KafkaUtils.createDirectStream(ssc, LocationStrategies.PreferConsistent, ConsumerStrategies.Subscribe[K, V](config.topics, scala.collection.Map[String, Object](kafkaParams.toSeq: _*), offsetRange))
messages.foreachRDD {(rdd, time) => ...}
It works well, but sometimes a new batch only starts about 10 minutes after the previous one. The times are measured from log messages.
Why is that happening?
I've found the reason; it was due to issues.apache.org/jira/browse/KAFKA-12890

How to broadcast data from MySQL and use it in streaming batches?

// How do I get attributes from MYSQL DB during each streaming batch and broadcast it.
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext (sc, Seconds(streamingBatchSizeinSeconds))
val eventDStream=getDataFromKafka(ssc)
val eventDtreamFiltered=eventFilter(eventDStream,eventType)
Whatever you do in getDataFromKafka and eventFilter, I think you end up with a DStream to work with. That's how your future computations are described, and every batch interval you have an RDD to work with.
The answer to your question greatly depends on what exactly you want to do, but let's assume that you're done with this stream processing of the Kafka records and you want to do something with them.
If foreachRDD were acceptable, you could do the following:
// I use Spark 2.x here
// Read attributes from MySQL
val myAttrs = spark.read.jdbc([mysql-url-here]).collect
// Broadcast the attributes so they're available on executors
val attrs = sc.broadcast(myAttrs) // do it once OR move it as part of foreach below
eventDtreamFiltered.foreachRDD { rdd =>
// for each RDD reach out to the attrs broadcast
val _attrs = attrs.value
// do something here with the rdd and _attrs
}
And that's it!
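If the attributes really have to be re-read on every batch (as the question title suggests), a minimal sketch of that variant could look like the following; the JDBC URL, credentials and table name are placeholders, not taken from your code:
import java.util.Properties
// placeholders for the MySQL connection
val jdbcUrl = "jdbc:mysql://mysql-host:3306/mydb"
val connProps = new Properties()
connProps.setProperty("user", "user")
connProps.setProperty("password", "password")
eventDtreamFiltered.foreachRDD { rdd =>
  // runs on the driver once per batch interval
  val myAttrs = spark.read.jdbc(jdbcUrl, "attrs_table", connProps).collect()
  val attrs = rdd.sparkContext.broadcast(myAttrs)
  rdd.foreachPartition { partition =>
    // attrs.value is available on the executors
    partition.foreach { event =>
      // combine event with attrs.value here
    }
  }
  // release the broadcast once the batch is done
  attrs.unpersist()
}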

Spark: Poor performance on a distributed system. How to improve?

I wrote a simple Spark program and want to deploy it to distributed servers. It is pretty simple:
obtain data -> arrange data -> train on the data -> reapply the model to see the training result.
The input data is just 10K rows, with 3 features.
I first ran it on my local machine using "local[*]"; it takes only about 3 minutes.
Now when I deploy it to a cluster, it runs extremely slowly: half an hour and still not finished. It becomes very slow at the training stage.
I am curious whether I did something wrong. Please help me check. I use Spark 1.6.1.
I submit:
spark-submit --packages com.databricks:spark-csv_2.11:1.5.0 orderprediction_2.11-1.0.jar --driver-cores 1 --driver-memory 4g --executor-cores 8 --executor-memory 4g
The code is here:
def main(args: Array[String]) {
// Set the log level to only print errors
Logger.getLogger("org").setLevel(Level.ERROR)
val conf = new SparkConf()
.setAppName("My Prediction")
//.setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val data = sqlContext.read
.option("header","true")
.option("delimiter", "\t")
.format("com.databricks.spark.csv")
.option("inferSchema","true")
.load("mydata.txt")
data.printSchema()
data.show()
val dataDF = data.toDF().filter("clicks >=10")
dataDF.show()
val assembler = new VectorAssembler()
.setInputCols(Array("feature1", "feature2", "feature3"))
.setOutputCol("features")
val trainset = assembler.transform(dataDF).select("target", "features")
trainset.printSchema()
val trainset2 = trainset.withColumnRenamed("target", "label")
trainset2.printSchema()
val trainset3 = trainset2.withColumn("label", trainset2.col("label").cast(DataTypes.DoubleType))
trainset3.cache() // cache data into memory
trainset3.printSchema()
trainset3.show()
// Train a RandomForest model.
println("training Random Forest")
val rf = new RandomForestRegressor()
.setLabelCol("label")
.setFeaturesCol("features")
.setNumTrees(1000)
val rfmodel = rf.fit(trainset3)
println("prediction")
val result = rfmodel.transform(trainset3)
result.show()
}
Update: After investigation, I found it jammed at
collectAsMap at RandomForest.scala:525
It has already spent 1.1 hours on this line and is still not finished. The data, I believe, is only a few megabytes.
You are building a RandomForest made out of 1000 trees, which means training 1000 tree instances.
In the code, collectAsMap is the first action, while all the rest are transformations (which are lazily evaluated). So when you see it hanging at that line, it is because all the maps, flatMaps, filters, groupBys, etc. are being evaluated at that point.
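To see where the time actually goes, a rough sketch (not a drop-in fix): force the CSV-read / filter / vector-assembly pipeline with a cheap action before training, and start with a much smaller forest; the tree count below is an arbitrary starting point.
// trainset3 is already cached above; count() materialises it, so the data-loading
// cost shows up on this line rather than inside rf.fit
println(s"training rows: ${trainset3.count()}")
// a much smaller forest to get a baseline timing before scaling up to 1000 trees
val rfSmall = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(50)
val rfModelSmall = rfSmall.fit(trainset3)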

Saving multiple hadoop datasets concurrently in Spark

I have a Spark app that looks like this:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val rdd1 = ...
rdd1.saveAsNewAPIHadoopDataset(output1)
val rdd2 = ...
rdd2.saveAsNewAPIHadoopDataset(output2)
val rdd3 = ...
rdd3.saveAsNewAPIHadoopDataset(output3)
The call to saveAsNewAPIHadoopDataset blocks, and while some of my workers are doing IO it would be nice if the job continued to run the next stages.
I tried to wrap each computation in a Future {} and await on all of them at the end (sketched below), but ran into this issue: https://issues.apache.org/jira/browse/SPARK-13631
Is there a way in Spark to save to a Hadoop dataset so that other stages can keep running? FWIW, the Hadoop output configuration is the BigQuery connector (https://cloud.google.com/hadoop/bigquery-connector).
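For reference, the Future-based attempt looked roughly like this (a sketch; the thread-pool size is arbitrary):
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
// dedicated thread pool so the three saves can be submitted concurrently
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(3))
val saves = Seq(
  Future { rdd1.saveAsNewAPIHadoopDataset(output1) },
  Future { rdd2.saveAsNewAPIHadoopDataset(output2) },
  Future { rdd3.saveAsNewAPIHadoopDataset(output3) }
)
// block until all three saves have finished
Await.result(Future.sequence(saves), Duration.Inf)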

Joining Kafka and Cassandra DataFrames in Spark Streaming ignores C* predicate pushdown

Intent
I'm receiving data from Kafka via direct stream and would like to enrich the messages with data from Cassandra. The Kafka messages (Protobufs) are decoded into DataFrames and then joined with a (supposedly pre-filtered) DF from Cassandra. The relation of (Kafka) streaming batch size to raw C* data is [several streaming messages to millions of C* rows], BUT the join always yields exactly ONE result [1:1] per message. After the join the resulting DF is eventually stored to another C* table.
Problem
Even though I'm joining the two DFs on the full Cassandra primary key and pushing the corresponding filter to C*, it seems that Spark is loading the whole C* data-set into memory before actually joining (which I'd like to prevent by using the filter/predicate pushdown). This leads to a lot of shuffling and tasks being spawned, hence the "simple" join takes forever...
def main(args: Array[String]) {
val conf = new SparkConf()
.setAppName("test")
.set("spark.cassandra.connection.host", "xxx")
.set("spark.cassandra.connection.keep_alive_ms", "30000")
.setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.sparkContext.setLogLevel("INFO")
// Initialise Kafka
val kafkaTopics = Set[String]("xxx")
val kafkaParams = Map[String, String](
"metadata.broker.list" -> "xxx:32000,xxx:32000,xxx:32000,xxx:32000",
"auto.offset.reset" -> "smallest")
// Kafka stream
val messages = KafkaUtils.createDirectStream[String, MyMsg, StringDecoder, MyMsgDecoder](ssc, kafkaParams, kafkaTopics)
// Executed on the driver
messages.foreachRDD { rdd =>
// Create an instance of SQLContext
val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
import sqlContext.implicits._
// Map MyMsg RDD
val MyMsgRdd = rdd.map{case (key, MyMsg) => (MyMsg)}
// Convert RDD[MyMsg] to DataFrame
val MyMsgDf = MyMsgRdd.toDF()
.select(
$"prim1Id" as 'prim1_id,
$"prim2Id" as 'prim2_id,
$...
)
// Load DataFrame from C* data-source
val base_data = base_data_df.getInstance(sqlContext)
// Left join on prim1Id and prim2Id
val joinedDf = MyMsgDf.join(base_data,
MyMsgDf("prim1_id") === base_data("prim1_id") &&
MyMsgDf("prim2_id") === base_data("prim2_id"), "left")
.filter(base_data("prim1_id").isin(MyMsgDf("prim1_id"))
&& base_data("prim2_id").isin(MyMsgDf("prim2_id")))
joinedDf.show()
joinedDf.printSchema()
// Select relevant fields
// Persist
}
// Start the computation
ssc.start()
ssc.awaitTermination()
}
Environment
Spark 1.6
Cassandra 2.1.12
Cassandra-Spark-Connector 1.5-RC1
Kafka 0.8.2.2
SOLUTION
From discussions on the DataStax Spark Connector for Apache Cassandra mailing list
Joining Kafka and Cassandra DataFrames in Spark Streaming ignores C* predicate pushdown
How to create a DF from CassandraJoinRDD
I've learned the following:
Quoting Russell Spitzer
This wouldn't be a case of predicate pushdown. This is a join on a partition key column. Currently only joinWithCassandraTable supports this direct kind of join although we are working on some methods to try to have this automatically done within Spark.
Dataframes can be created from any RDD which can have a schema applied to it. The easiest thing to do is probably to map your joinedRDD[x,y] to Rdd[JoinedCaseClass] and then call toDF (which will require importing your sqlContext implicits.) See the DataFrames documentation here for more info.
So the actual implementation now resembles something like
// Join myMsg RDD with myCassandraTable
val joinedMsgRdd = myMsgRdd.joinWithCassandraTable(
"keyspace",
"myCassandraTable",
AllColumns,
SomeColumns(
"prim1_id",
"prim2_id"
)
).map{case (myMsg, cassandraRow) =>
JoinedMsg(
foo = myMsg.foo,
bar = cassandraRow.bar
)
}
// Convert RDD[JoinedMsg] to DataFrame
val myJoinedDf = joinedMsgRdd.toDF()
Have you tried joinWithCassandraTable? It should push down to C* all the keys you are looking for...
