Spark QueueStream never exhausted

I am puzzled by a piece of code I borrowed from the internet for research purposes. This is the code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
val spark = ...
val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[Char]]()
val QS = ssc.queueStream(rddQueue)
QS.foreachRDD(q => {
  print("Hello") // Queue never exhausted
  if (!q.isEmpty) {
    ... do something
    ... do something
  }
})
// ssc.checkpoint("/chkpoint/dir") // if uncommented, this causes a Serialization error
ssc.start()
for (c <- 'a' to 'c') {
rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
I was tracing through it just to check and noticed that "Hello" is being printed out forever:
HelloHelloHelloHelloHelloHelloHelloHelloHelloHello and so on
I would have thought the queueStream would exhaust after 3 iterations.
So, what have I missed?

Got it. The queue is in fact exhausted, but the loop keeps running, which is exactly why the statement
if(!q.isEmpty)
is there.
I had thought it would simply stop, or rather not execute, but not so. I remember now: if nothing has been streamed within a batch interval, an empty RDD is produced for that batch, so foreachRDD keeps firing. Leaving this here for others, as there was an upvote.
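To make that behaviour visible, here is a minimal sketch (my own illustration, reusing the same QS as above) that just logs whether each batch's RDD is empty:
// Registered before ssc.start(); the body runs once per batch interval.
QS.foreachRDD { q =>
  if (q.isEmpty) println("empty batch")                 // repeats forever once the queue is drained
  else println(s"batch with ${q.count()} element(s)")   // fires only for the three queued RDDs
}
Every batch interval yields an RDD, so the guard on q.isEmpty is what keeps the body from doing real work on the empty ones.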
However, even though this is legacy code, it is a bad example: adding a checkpoint
causes a Serialization error. Leaving it here for the benefit of others.
ssc.checkpoint("/chkpoint/dir")

Related

Case Class within foreachRDD causes Serialization Error

I can create a DF inside foreachRDD if I do not try to use a Case Class and simply let default column names be generated with toDF(), or if I assign them via toDF("c1", "c2").
As soon as I try to use a Case Class, having looked at the examples, I get:
Task not serializable
If I move the Case Class declaration around, I then get:
toDF() not part of RDD[CaseClass]
It's legacy, but I am curious about the nth serialization error that Spark can produce and whether it carries over into Structured Streaming.
I have an RDD that need not be split; maybe that is the issue? No. Could it be that I am running in Databricks?
Coding is as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
case class Person(name: String, age: Int) //extends Serializable // Some say inherently serializable so not required
val spark = SparkSession.builder
.master("local[4]")
.config("spark.driver.cores", 2)
.appName("forEachRDD")
.getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)
QS.foreachRDD(q => {
  if (!q.isEmpty) {
    import spark.implicits._
    val q_flatMap = q.flatMap{ x => x }
    val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
    val df = q_withPerson.toDF()
    df.show(false)
  }
})
ssc.start()
for (c <- List(List(("Fred",53), ("John",22), ("Mary",76)), List(("Bob",54), ("Johnny",92), ("Margaret",15)), List(("Alfred",21), ("Patsy",34), ("Sylvester",7)) )) {
rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
Not having grown up with Java, I looked around and found out what to do, but I am not expert enough to explain it.
I was running in a Databricks notebook, where I prototype.
The clue is that the
case class Person(name: String, age: Int)
was declared inside the same Databricks notebook. The case class needs to be defined outside the current notebook - in a separate notebook - and thus separate from the class that runs the streaming code.
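For what it's worth, the standalone-application analogue (my own sketch, not part of the original answer; names are illustrative) is to declare the case class at the top level, outside whatever object drives the streaming job, so the generated class does not drag an outer reference along with it:
// Hypothetical file layout for a plain Scala application rather than a notebook.
case class Person(name: String, age: Int)   // top level: no enclosing instance is captured

object StreamingJob {                       // illustrative name
  def main(args: Array[String]): Unit = {
    // ... build the SparkSession, StreamingContext and queueStream as in the question ...
    // Inside foreachRDD, mapping to Person and calling toDF() should then
    // serialize cleanly, because Person no longer references a notebook/REPL wrapper.
  }
}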

Wondering why an empty inner iterator causes a not-serializable exception with mapPartitionsWithIndex

I've been experimenting with Spark's mapPartitionsWithIndex and I ran into problems when
trying to return an Iterator of a tuple that itself contained an empty iterator.
I tried several different ways of constructing the inner iterator [ via Iterator(), and List(...).iterator ], and
all roads led to my getting this error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0 in stage 0.0 (TID 2) had a not serializable result: scala.collection.LinearSeqLike$$anon$1
Serialization stack:
- object not serializable (class: scala.collection.LinearSeqLike$$anon$1, value: empty iterator)
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, (1,empty iterator))
- element of array (index: 0)
- array (class [Lscala.Tuple2;, size 1)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
My code example is given below. Note that as given it runs OK (an empty iterator is returned as the
mapPartitionsWithIndex value). But if you run it with the now commented-out version of
the mapPartitionsWithIndex invocation, you will get the error above.
If anyone has a suggestion on how this can be made to work, I'd be much obliged.
import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object ANonWorkingExample extends App {
  val sparkConf = new SparkConf().setAppName("continuous").setMaster("local[*]")
  val sc = new SparkContext(sparkConf)
  val parallel: RDD[Int] = sc.parallelize(1 to 9)
  val parts: Array[Partition] = parallel.partitions
  val partRDD: RDD[(Int, Iterator[Int])] =
    parallel.coalesce(3).
      mapPartitionsWithIndex {
        (partitionIndex: Int, inputiterator: Iterator[Int]) =>
          val mappedInput: Iterator[Int] = inputiterator.map(_ + 1)
          // Iterator((partitionIndex, mappedInput)) // FAILS
          Iterator() // no exception, but not really what I want
      }
  val data = partRDD.collect
  println("data:" + data.toList)
}
I am not sure what you are trying to achieve, and I am something of a novice compared to some of the expert folks here.
I present something that may give you an idea of how I think things should be done, along with some comments:
You seem to get the partitions explicitly (parallel.partitions) and then call mapPartitionsWithIndex - a first for me.
Spark constructs such as RDDs will not fly inside mapPartitionsWithIndex; within a partition you are dealing with plain Scala iterables, so you need to drop down to the Scala-only level.
The serialization error comes from the inner Iterator you return: as the stack trace shows, a Scala Iterator is not serializable.
Here is an example showing the partition index along with the values that fall into that partition.
import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
// from your stuff, left in
val parallel: RDD[Int] = sc.parallelize(1 to 9, 4)
val mapped = parallel.mapPartitionsWithIndex {
  (index, iterator) => {
    println("Called in Partition -> " + index)
    val myList = iterator.toList
    myList.map(x => (index, x)).groupBy(_._1).mapValues(_.map(_._2)).toList.iterator
  }
}
mapped.collect()
This returns the following, which resembles at least a little of what I think you wanted:
res38: Array[(Int, List[Int])] = Array((0,List(1, 2)), (1,List(3, 4)), (2,List(5, 6)), (3,List(7, 8, 9)))
Final note: the documentation is not so easy to follow; you don't get it all from the word count example!
So, I hope this helps.
I think it might get you on the right path to where you want to go; I could not quite see it myself, but maybe you can now see the forest for the trees.
So, the dumb thing I was doing was trying to return an unserializable data structure: an Iterator, as clearly indicated by the stack trace I got.
The solution is not to use an iterator but a collection such as a Seq or List. The sample program below illustrates the correct way to do what I was trying to do.
import org.apache.spark.{Partition, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object AWorkingExample extends App {
  val sparkConf = new SparkConf().setAppName("batman").setMaster("local[*]")
  val sc = new SparkContext(sparkConf)
  val parallel: RDD[Int] = sc.parallelize(1 to 9)
  val parts: Array[Partition] = parallel.partitions
  val partRDD: RDD[(Int, List[Int])] =
    parallel.coalesce(3).
      mapPartitionsWithIndex {
        (partitionIndex: Int, inputiterator: Iterator[Int]) =>
          val mappedInput: Iterator[Int] = inputiterator.map(_ + 1)
          Iterator((partitionIndex, mappedInput.toList)) // Note the .toList call -- that makes it work
      }
  val data = partRDD.collect
  println("data:" + data.toList)
}
By the way, what I was trying to do originally was to see concretely which chunks of data from my parallelized-to-RDD structure were assigned to which partition. Here is the output you get if you run the program:
data:List((0,List(2, 3)), (1,List(4, 5, 6)), (2,List(7, 8, 9, 10)))
Interesting that the data distribution could have been more evenly balanced, but wasn't. That's not the point of the question, but I thought it was interesting.
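If the goal is only to see which elements end up in which partition, a simpler probe (my own sketch, assuming an existing SparkContext sc, not part of the original post) is glom(), which turns each partition into an array:
// glom() collects every partition into an Array, so the index in the
// collected result is the partition number.
val parallel = sc.parallelize(1 to 9, 3)
parallel.glom().collect().zipWithIndex.foreach { case (elems, idx) =>
  println(s"partition $idx -> ${elems.mkString(", ")}")
}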

Running threads in Spark DataFrame foreachPartition()

I use multiple threads inside foreachPartition(), which works great for me except when the underlying iterator is TungstenAggregationIterator. Here is a minimal code snippet to reproduce:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object Reproduce extends App {
  val sc = new SparkContext("local", "reproduce")
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._
  val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()
  df.foreachPartition { iterator =>
    val f = Future(iterator.toVector)
    Await.result(f, Duration.Inf)
  }
}
When I run this, I get:
java.lang.NullPointerException
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
I believe I actually understand why this happens - TungstenAggregationIterator uses a ThreadLocal variable that returns null when called from a thread other than the original thread that got the iterator from Spark. From examining the code, this does not appear to differ between recent Spark versions.
However, this limitation is specific to TungstenAggregationIterator and, as far as I'm aware, is not documented.
Is there a way to work around this limitation of TungstenAggregationIterator? Any relevant documentation? I have a workaround for this, but it's quite hacky and unnecessarily reduces runtime performance.
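One common way around this kind of thread-affinity problem (a sketch of my own, not a confirmed fix from this thread, reusing the imports and df from the snippet above) is to drain the iterator on the partition's own thread first and only hand the materialized collection to other threads:
df.foreachPartition { iterator =>
  // Consume the iterator on the thread Spark called us from ...
  val rows = iterator.toVector
  // ... then it is safe for other threads to work on the materialized rows.
  val f = Future(rows.map(_.toString))
  Await.result(f, Duration.Inf)
}
This still pays the cost of materializing the whole partition up front, which may be exactly the performance hit the question is trying to avoid.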

Why do Window functions fail with "Window function X does not take a frame specification"?

I'm trying to use Spark 1.4 window functions in pyspark 1.4.1
but getting mostly errors or unexpected results.
Here is a very simple example that I think should work:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
l = [(1,101),(2,202),(3,303),(4,404),(5,505)]
df = sqlContext.createDataFrame(l,["a","b"])
wSpec = Window.orderBy(df.a).rowsBetween(-1,1)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
==> Failure org.apache.spark.sql.AnalysisException: Window function rank does not take a frame specification.
df.select(df.a, func.lag(df.b,1).over(wSpec).alias("prev"), df.b, func.lead(df.b,1).over(wSpec).alias("next"))
===> org.apache.spark.sql.AnalysisException: Window function lag does not take a frame specification.;
wSpec = Window.orderBy(df.a)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
===> org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: One or more arguments are expected.
df.select(df.a, func.lag(df.b,1).over(wSpec).alias("prev"), df.b, func.lead(df.b,1).over(wSpec).alias("next")).collect()
[Row(a=1, prev=None, b=101, next=None), Row(a=2, prev=None, b=202, next=None), Row(a=3, prev=None, b=303, next=None)]
As you can see, if I add the rowsBetween frame specification, neither rank() nor lag()/lead() accepts it: "Window function does not take a frame specification".
If I omit the rowsBetween frame specification, at least lag()/lead() do not throw exceptions, but they return an unexpected (to me) result: always None. And rank() still doesn't work, with a different exception.
Can anybody help me get my window functions right?
UPDATE
All right, this is starting to look like a pyspark bug.
I have prepared the same test in pure Spark (Scala, spark-shell):
import sqlContext.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
val l: List[Tuple2[Int,Int]] = List((1,101),(2,202),(3,303),(4,404),(5,505))
val rdd = sc.parallelize(l).map(i => Row(i._1,i._2))
val schemaString = "a b"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, IntegerType, true)))
val df = sqlContext.createDataFrame(rdd, schema)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val wSpec = Window.orderBy("a").rowsBetween(-1,1)
df.select(df("a"), rank().over(wSpec).alias("rank"))
==> org.apache.spark.sql.AnalysisException: Window function rank does not take a frame specification.;
df.select(df("a"), lag(df("b"),1).over(wSpec).alias("prev"), df("b"), lead(df("b"),1).over(wSpec).alias("next"))
===> org.apache.spark.sql.AnalysisException: Window function lag does not take a frame specification.;
val wSpec = Window.orderBy("a")
df.select(df("a"), rank().over(wSpec).alias("rank")).collect()
====> res10: Array[org.apache.spark.sql.Row] = Array([1,1], [2,2], [3,3], [4,4], [5,5])
df.select(df("a"), lag(df("b"),1).over(wSpec).alias("prev"), df("b"), lead(df("b"),1).over(wSpec).alias("next"))
====> res12: Array[org.apache.spark.sql.Row] = Array([1,null,101,202], [2,101,202,303], [3,202,303,404], [4,303,404,505], [5,404,505,null])
Even in Scala the rowsBetween frame cannot be applied to these functions, but both rank() and lag()/lead() work as I expect when rowsBetween is omitted.
As far as I can tell there are two different problems. Window frame definitions are simply not supported by Hive GenericUDAFRank, GenericUDAFLag and GenericUDAFLead, so the errors you see are expected behavior.
Regarding issue with the following PySpark code
wSpec = Window.orderBy(df.a)
df.select(df.a, func.rank().over(wSpec).alias("rank"))
it looks like it is related to my question https://stackoverflow.com/q/31948194/1560062 and should be addressed by SPARK-9978. For now you can make it work by changing the window definition to this:
wSpec = Window.partitionBy().orderBy(df.a)
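For completeness, a Scala sketch of the same distinction (my own illustration, assuming the same df and Hive-enabled sqlContext as the update above; the moving-average column is just an example): ranking and offset functions go over a frame-less spec, while a rowsBetween frame belongs to aggregates that accept one.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val noFrame = Window.partitionBy().orderBy("a")       // for rank/lag/lead
val sliding = Window.orderBy("a").rowsBetween(-1, 1)  // frames are accepted by aggregates

df.select(df("a"),
  rank().over(noFrame).alias("rank"),
  lag(df("b"), 1).over(noFrame).alias("prev"),
  avg(df("b")).over(sliding).alias("moving_avg")).show()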

Spark: fails to run the terasort when the amount of data gets bigger

I have a Spark benchmark which includes a terasort, and it runs properly when the data is only a few hundred GB, but when I generate more data, such as 1 TB, it goes wrong at some step. The following is my code:
import org.apache.spark.rdd._
import org.apache.spark._
import org.apache.spark.SparkContext._
object ScalaTeraSort {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(
        s"Usage: $ScalaTeraSort <INPUT_HDFS> <OUTPUT_HDFS>"
      )
      System.exit(1)
    }
    val sparkConf = new SparkConf().setAppName("ScalaTeraSort")
    val sc = new SparkContext(sparkConf)
    val file = sc.textFile(args(0))
    val data = file.map(line => (line.substring(0, 10), line.substring(10)))
      .sortByKey().map { case (k, v) => k + v }
    data.saveAsTextFile(args(1))
    sc.stop()
  }
}
This code mainly involves three steps: sortByKey, map and saveAsTextFile. Nothing seems to go wrong in the first two steps, but the third step fails every time and then the second step is retried. The third step fails because of "FetchFailed(BlockManagerId(40, sr232, 44815, 0), shuffleId=0, mapId=11825, reduceId=0)".
I found out the reason; the essential problem is: java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec
That is to say, you have to set the property "spark.core.connection.ack.wait.timeout" to a bigger value; by default it is 60 seconds. Otherwise, the stage will fail because an ack is not received in time.
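A sketch of how the timeout might be raised (the 600-second value is only an illustration, not from the original answer):
// Raise the ack timeout on the SparkConf before the SparkContext is created.
val sparkConf = new SparkConf()
  .setAppName("ScalaTeraSort")
  .set("spark.core.connection.ack.wait.timeout", "600")  // seconds; default is 60
val sc = new SparkContext(sparkConf)
// Alternatively, pass it at submit time:
//   spark-submit --conf spark.core.connection.ack.wait.timeout=600 ...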
