I use multiple threads inside foreachPartition(), which works great for me except when the underlying iterator is a TungstenAggregationIterator. Here is a minimal code snippet to reproduce:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
object Reproduce extends App {
  val sc = new SparkContext("local", "reproduce")
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._

  val df = sc.parallelize(Seq(1)).toDF("number").groupBy("number").count()

  df.foreachPartition { iterator =>
    val f = Future(iterator.toVector) // consuming the iterator from a Future thread triggers the NPE below
    Await.result(f, Duration.Inf)
  }
}
When I run this, I get:
java.lang.NullPointerException
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:751)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:84)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
I believe I actually understand why this happens - TungstenAggregationIterator uses a ThreadLocal variable that returns null when called from a thread other than the original thread that got the iterator from Spark. From examining the code, this does not appear to differ between recent Spark versions.
However, this limitation is specific to TungstenAggregationIterator, and not documented, as far as I'm aware.
Is there a way to work around this limitation of TungstenAggregationIterator? Any relevant documentation? I have a workaround for this, but it's quite hacky and unnecessarily reduces runtime performance.
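For context, one workaround of this kind (a minimal sketch, not necessarily my exact code) is to drain the iterator on the task thread and only hand fully materialized batches to the worker threads; the batch size below is arbitrary:
df.foreachPartition { iterator =>
  // grouped() and map() are lazy, but toVector forces them to run here,
  // on the task thread, so the Tungsten iterator is never touched from a Future.
  val futures = iterator.grouped(1000).map { batch =>
    Future {
      batch.foreach { row =>
        // expensive per-row work goes here
        ()
      }
    }
  }.toVector
  futures.foreach(f => Await.result(f, Duration.Inf))
}
Depending on the Spark version, rows may also need to be copied before buffering, and the buffering itself costs memory and time, which is the kind of performance penalty mentioned above.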
Related
I'm puzzled by a piece of code I borrowed from the internet for research purposes. This is the code:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
val spark = ...
val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[Char]]()
val QS = ssc.queueStream(rddQueue)
QS.foreachRDD(q => {
  print("Hello") // Queue never exhausted
  if (!q.isEmpty) {
    ... do something
    ... do something
  }
})
//ssc.checkpoint("/chkpoint/dir") // uncommenting this causes a Serialization error
ssc.start()
for (c <- 'a' to 'c') {
  rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
I was tracing through it just to check and noticed that "Hello" is printed forever:
HelloHelloHelloHelloHelloHelloHelloHelloHelloHello and so on
I would have thought the queueStream would exhaust after 3 iterations.
So, what have I missed?
Got it. The queue is actually exhausted, but foreachRDD keeps being called for every batch interval, and that's why the statement
if (!q.isEmpty)
is there.
OK, I would have thought it would just stop, or rather not execute, but that's not the case. I remember now: an empty RDD results for each batch interval in which nothing was streamed. Leaving this here for others as there was an upvote.
However, even though DStreams are legacy, this is a bad example, as adding the checkpoint below causes a Serialization error. Leaving it here for the benefit of others.
ssc.checkpoint("/chkpoint/dir")
I can create a DF inside foreachRDD if I do not try to use a case class and simply let default column names be generated with toDF(), or if I assign them via toDF("c1", "c2").
As soon as I try to use a case class, having looked at the examples, I get:
Task not serializable
If I shift the case class statement around, I then get:
toDF() not part of RDD[CaseClass]
This is legacy code, but I am curious about the nth serialization error that Spark can produce and whether it carries over into Structured Streaming. I have an RDD that does not need to be split; maybe that is the issue? No. Could it be because I am running in Databricks?
Coding is as follows:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable
case class Person(name: String, age: Int) //extends Serializable // Some say inherently serializable so not required
val spark = SparkSession.builder
  .master("local[4]")
  .config("spark.driver.cores", 2)
  .appName("forEachRDD")
  .getOrCreate()
val sc = spark.sparkContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(1))
val rddQueue = new mutable.Queue[RDD[List[(String, Int)]]]()
val QS = ssc.queueStream(rddQueue)
QS.foreachRDD(q => {
  if (!q.isEmpty) {
    import spark.implicits._
    val q_flatMap = q.flatMap(x => x)
    val q_withPerson = q_flatMap.map(field => Person(field._1, field._2))
    val df = q_withPerson.toDF()
    df.show(false)
  }
})
ssc.start()
for (c <- List(
       List(("Fred", 53), ("John", 22), ("Mary", 76)),
       List(("Bob", 54), ("Johnny", 92), ("Margaret", 15)),
       List(("Alfred", 21), ("Patsy", 34), ("Sylvester", 7)))) {
  rddQueue += ssc.sparkContext.parallelize(List(c))
}
ssc.awaitTermination()
Not having grown up with Java, I looked around and found out what to do, but I am not expert enough to explain why it works.
I was running in a Databricks notebook, where I prototype.
The clue is that the
case class Person(name: String, age: Int)
was inside the same Databricks notebook. One needs to define the case class external to the current notebook - in a separate notebook - and thus separate from the class that runs the streaming code.
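A minimal sketch of that layout (the notebook name "Definitions" is just an example): the case class lives alone in one notebook and is pulled into the streaming notebook with %run, so it is compiled outside the wrapper class Databricks generates around the streaming cells.
// Notebook "Definitions": contains nothing but the case class
case class Person(name: String, age: Int)

// Streaming notebook, first cell: pull in the definitions
// %run ./Definitions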
I'm struggling with the correct usage of mapPartitions.
I've successfully run my code with map, however since I do not want the resources to be loaded for every row I'd like to switch to mapPartitions.
Here's some simple example code:
import spark.implicits._
val dataDF = spark.read.format("json").load("basefile")
val newDF = dataDF.mapPartitions( iterator => {
  iterator.map(p => Seq(1, "1"))
}).toDF("id", "newContent")
newDF.write.json("newfile")
This causes the exception
Exception in thread "main" java.lang.ClassNotFoundException: scala.Any
I'm guessing this has something to do with typing. What could the problem be?
The problem is that Seq(1, "1") is of type Seq[Any], for which Spark has no encoder, so it can't be returned from mapPartitions. Try Seq(1, 2) instead and see if that works.
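A minimal sketch of that fix, assuming the goal is an Int "id" column and a String "newContent" column: return a tuple instead of Seq[Any], so Spark can find an encoder for the partition output (and any per-partition resource can be created once, before the inner map):
import spark.implicits._

val dataDF = spark.read.format("json").load("basefile")
val newDF = dataDF.mapPartitions { iterator =>
  // a heavyweight resource could be initialised here, once per partition
  iterator.map(_ => (1, "1")) // (Int, String) has a built-in encoder
}.toDF("id", "newContent")
newDF.write.json("newfile")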
I am having trouble understanding how to use the settings for pooling options, and how to tell whether they are working, based on this source: https://docs.datastax.com/en/developer/java-driver/3.4/manual/pooling/
Would the SparkSession val take into account the pooling options from the cluster?
My Scala Code:
package com.zeropoints.processing
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import com.datastax.spark.connector._
import com.datastax.driver.core.Cluster
import com.datastax.driver.core.PoolingOptions
import com.datastax.driver.core.HostDistance

//This object provides the main entry point into spark processing
object main {
  var appName = "Processing"

  lazy val sparkconf: SparkConf = new SparkConf(true).setAppName(appName)
  lazy val poolingOptions: PoolingOptions = new PoolingOptions()
  lazy val cluster: Cluster = Cluster.builder().withPoolingOptions(poolingOptions).build()
  lazy val spark: SparkSession = SparkSession.builder().config(sparkconf).getOrCreate
  lazy val sc: SparkContext = spark.sparkContext

  def main(args: Array[String]) {
    //Set pooling stuff
    poolingOptions.setConnectionsPerHost(HostDistance.LOCAL, 6, 60)

    //DF and RDDs tasks...
    spark.sql("select * from data.raw").groupBy("key1", "key2").agg(sum("views"))
      .write.format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "summary", "keyspace" -> "data"))
      .mode(org.apache.spark.sql.SaveMode.Append).save()

    //..more stuff
  }
}
No, the Spark Connector won't take your pooling configuration into account - it works in a different way, especially when you consider that your code executes in a distributed environment: your setConnectionsPerHost call runs only on the driver and won't affect the executors.
The correct way is to specify the necessary settings via Spark configuration parameters. The documentation has a separate section on connection parameters, and connection.connections_per_executor_max could be what you need. You can also write your own class that implements the trait CassandraConnectionFactory and provides an implementation of the createCluster function, then specify that class name as the connection.factory configuration parameter.
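For example (a rough sketch; the host and the value 6 are placeholders, and the exact property names depend on your connector version), the settings go into the Spark configuration rather than into a driver-side PoolingOptions object:
val spark = SparkSession.builder()
  .appName("Processing")
  .config("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
  .config("spark.cassandra.connection.connections_per_executor_max", "6")
  .getOrCreate()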
But the main question is: do you really need to tune these options? Do you think that processing is slow? The Java driver documentation recommends having one connection per host to avoid putting additional load on Cassandra.
I need to split an RDD by first letter (A-Z) and write the files into a corresponding directory for each letter.
The simple solution is to filter the RDD for each letter, but this requires 26 passes.
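For illustration, the naive version looks roughly like this (a sketch assuming an RDD[String]; writeAvro is a hypothetical helper for writing one Avro output):
('A' to 'Z').foreach { letter =>
  val subset = rdd.filter(_.startsWith(letter.toString)) // one full pass over the RDD per letter
  writeAvro(subset, s"/output/$letter") // hypothetical Avro writer
}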
There is a response to a similar question for writing to text files here, but I cannot figure out how to do this for Avro files.
Has anyone been able to do this?
You can use MultipleOutputFormat to do this.
It is a two-step task.
First you need a multiple output format for Avro. Below is the code for that:
package avro

import org.apache.hadoop.mapred.lib.MultipleOutputFormat
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.util.Progressable
import org.apache.avro.mapred.AvroOutputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.RecordWriter

class MultipleAvroFileOutputFormat[K] extends MultipleOutputFormat[AvroWrapper[K], NullWritable] {

  val outputFormat = new AvroOutputFormat[K]

  // Route each record to an output file named after the first letter of its key,
  // inside a subdirectory of the same name (e.g. <outputDir>/a/a).
  override def generateFileNameForKeyValue(key: AvroWrapper[K], value: NullWritable, name: String) = {
    val letter = key.datum().asInstanceOf[String].substring(0, 1)
    letter + "/" + letter
  }

  override def getBaseRecordWriter(fs: FileSystem,
                                   job: JobConf,
                                   name: String,
                                   arg3: Progressable) = {
    outputFormat.getRecordWriter(fs, job, name, arg3).asInstanceOf[RecordWriter[AvroWrapper[K], NullWritable]]
  }
}
In your driver code you have to specify that you want to use the output format given above. You also need to specify the output schema for the Avro data. Below is sample driver code that stores an RDD of strings in Avro format with the schema {"type":"string"}.
package avro

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.hadoop.io.NullWritable
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.hadoop.mapred.JobConf
import org.apache.avro.mapred.AvroJob
import org.apache.avro.mapred.AvroWrapper

object AvroDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf
    conf.setAppName(args(0))
    conf.setMaster("local[2]")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.registerKryoClasses(Array(classOf[AvroWrapper[String]]))

    val sc = new SparkContext(conf)
    val input = sc.parallelize(Seq("one", "two", "three", "four"), 1)
    val pairRDD = input.map(x => (new AvroWrapper(x), null))

    val job = new JobConf(sc.hadoopConfiguration)
    val schema = "{\"type\":\"string\"}"
    job.set(AvroJob.OUTPUT_SCHEMA, schema) // set schema for avro output

    pairRDD.partitionBy(new HashPartitioner(26))
      .saveAsHadoopFile(args(1), classOf[AvroWrapper[String]], classOf[NullWritable],
        classOf[MultipleAvroFileOutputFormat[String]], job, None)

    sc.stop()
  }
}
I hope you get a better answer than mine...
I've been in a similar situation myself, except with "ORC" instead of Avro. I basically threw up my hands and ended up calling the ORC file classes directly to write the files myself.
In your case, my approach would entail partitioning the data via "partitionBy" into 26 partitions, one for each first letter A-Z. Then call "mapPartitionsWithIndex", passing a function that outputs the i-th partition to an Avro file at the appropriate path. Finally, to convince Spark to actually do something, have mapPartitionsWithIndex return, say, a List containing the single boolean value "true"; and then call "count" on the RDD returned by mapPartitionsWithIndex to get Spark to start the show.
I found an example of writing an Avro file here: http://www.myhadoopexamples.com/2015/06/19/merging-small-files-into-avro-file-2/
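For what it's worth, a rough sketch of that approach, assuming an RDD[String] that has already been partitioned so that partition i holds the words starting with the i-th letter, and a plain string schema; the paths and helper name are illustrative assumptions:
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.GenericDatumWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

def writePartitionsAsAvro(rdd: RDD[String], baseDir: String): Unit = {
  val schemaJson = "{\"type\":\"string\"}"
  val written = rdd.mapPartitionsWithIndex { (idx, rows) =>
    val letter = ('A' + idx).toChar // partition i corresponds to the i-th letter
    val schema = new Schema.Parser().parse(schemaJson)
    val fs = FileSystem.get(new Configuration())
    val out = fs.create(new Path(s"$baseDir/$letter/part-$idx.avro"))
    val writer = new DataFileWriter[AnyRef](new GenericDatumWriter[AnyRef](schema))
    writer.create(schema, out)
    rows.foreach(r => writer.append(r))
    writer.close()
    Iterator(true) // return something so Spark has an RDD to act on
  }
  written.count() // trigger the job
}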