To avoid delays and to speed up processing, I build a thread pool inside my Spark Streaming job. The main program is listed as follows:
stream.foreachRDD { rdd =>
  rdd.foreachPartition { rddPartition =>
    val client: Client = ESClient.getInstance.getClient
    val num = Random.nextInt()
    val threadPool: ExecutorService = Executors.newFixedThreadPool(5)
    val confs = new Configuration()
    rddPartition.foreach { x =>
      threadPool.execute(new esThread(x._2, num, client, confs))
    }
  }
}
The esThread works as follows: first we query Elasticsearch, then we take the query result from ES, and finally we write the result to HDFS. But we find that the result files in HDFS are missing a lot of data; only a little is left. I wonder whether we can build a thread pool inside Spark Streaming at all. Does the thread pool in Spark Streaming cause some data to go missing?
thanks for your help.
Partitions are already processed by separate threads, and the stream won't proceed to the next batch until the previous one has finished. So this is unlikely to buy you anything, and it makes resource usage tracking less transparent.
At the same time, as your code is written right now, you're likely to lose data. Since the threadPool is never awaited (awaitTermination is not called), the parent task can return before all the submitted work has been processed.
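For illustration, here is a minimal sketch of what the partition body could look like with an explicit shutdown and wait, inside the same foreachRDD as in the question. esThread, ESClient, Client and Configuration are the asker's own classes, and the 10-minute timeout is an arbitrary placeholder:

import java.util.concurrent.{ExecutorService, Executors, TimeUnit}

rdd.foreachPartition { rddPartition =>
  val client: Client = ESClient.getInstance.getClient
  val num = Random.nextInt()
  val threadPool: ExecutorService = Executors.newFixedThreadPool(5)
  val confs = new Configuration()
  rddPartition.foreach { x =>
    threadPool.execute(new esThread(x._2, num, client, confs))
  }
  threadPool.shutdown()                             // stop accepting new work
  threadPool.awaitTermination(10, TimeUnit.MINUTES) // block until queued ES queries/HDFS writes finish
}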
Overall it is not a useful approach. If you want to increase throughput, you should tune the number of partitions and the amount of computing resources instead.
I have a mapGroupsWithState function that I would like to add some additional functionality to based on the groupBy key. Roughly it would look like this:
dataFrame
  .as[Log]
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout())(processData)
  .writeStream
  .trigger(Trigger.ProcessingTime(s"$x seconds"))
  .outputMode(OutputMode.Update())
  .foreachBatch(postProcess _)
  .start()
def processData(id: String, logs: Iterator[Log], oldState: GroupState[Checkpoint]): Array[Log] = {
  if (f(id)) {
    // long running operation
  }
  else {
    // ...
  }
}
My dataframe is partitioned by the id field. I realize that since this if() operation is long running, it may delay the processing of other batches of data with an id that maps to the same partition. However, there was some concern as to whether this long-running operation could also delay the processing of data batches on other partitions. I'm not sure how Spark handles batches of data when taking the output of mapGroupsWithState and then passing that to foreachBatch; am I in danger of delaying data output on all partitions with this setup? It seems counterintuitive to me that delays on one partition could affect another, but I'd like to be sure.
mapGroupsWithState runs at Executor level.
This means parallel operations. An Executor Core is assigned to a Task servicing a given Partition for the duration of the processing of that Partition.
Assuming you have enough Executors and thus Cores, there should be no issue; but of course if only 1 Executor with 1 Core is available to your app, then you would get an issue at the Task level.
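As a rough illustration (not from the original answer), this is how one might request enough cores so that tasks for different partitions can run in parallel. The values are placeholders, and spark.executor.instances applies to cluster managers like YARN or Kubernetes; on standalone you would cap the app with spark.cores.max instead:

import org.apache.spark.sql.SparkSession

// Placeholder values; tune to your cluster and workload.
val spark = SparkSession.builder()
  .appName("stateful-stream")
  .config("spark.executor.instances", "4") // how many executors to request
  .config("spark.executor.cores", "4")     // concurrent tasks per executor
  .getOrCreate()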
In my computation, I:
first broadcast some data, say bc,
then compute some big data shared by all executors/partitions: val shared = f(bc),
then run the distributed computation using the shared data.
To avoid computing the shared data for every RDD item, I can use .mapPartitions, but I have many more partitions than executors, so it would still run the computation of the shared data more times than necessary.
I found a simple method to compute the shared data only once per executor (which, as I understand it, is the JVM actually running the Spark tasks): using a lazy val on the broadcast data.
// class to be broadcast
case class BC(input: WhatEver) {
  lazy val shared = f(input)
}

// in the spark code
val sc = ... // init SparkContext
val bc = sc.broadcast(BC(...))
val initRDD = sc.parallelize(1 to 10000, numSlices = 10000)
initRDD.map { i =>
  val shared = bc.value.shared
  ... // run computation using shared data
}
I think this does what I want, but
I am not sure; can someone guarantee it?
I am not sure lazy val is the best way to manage concurrent access, especially with respect to Spark's internal distribution system. Is there a better way?
if computing shared fails, I think it will be recomputed for each RDD item (with possible retries) instead of simply stopping the whole job with a single error.
So, is there a better way?
I've been reading the Spark source code, but I'm still not able to understand how Spark Standalone implements resource isolation and allocation. For example, Mesos uses LXC or Docker containers for resource limitation, so how does Spark Standalone implement this? For example, if I run 10 threads in one executor but Spark only gave the executor one core, how does Spark guarantee these 10 threads run on only one CPU core?
After running the test code below, it turns out that Spark Standalone resource allocation is in some sense fake. I had just one Worker (executor) and gave the executor only one core (the machine has 6 cores in total), yet while the code was running I found 5 cores at 100% usage. (My code kicks off 4 threads.)
import org.apache.spark.{SparkConf, SparkContext}

object CoreTest {
  class MyThread extends Thread {
    override def run() {
      // busy loop to saturate a CPU core
      while (true) {
        val i = 1 + 1
      }
    }
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("core test")
    val sc = new SparkContext(conf)
    val memRDD = sc.parallelize(Seq(1), 1)
    memRDD.foreachPartition { part =>
      part.foreach { x =>
        // spawn 4 busy threads from within the single task
        var hello = new MyThread()
        hello.start()
        hello = new MyThread()
        hello.start()
        hello = new MyThread()
        hello.start()
        hello = new MyThread()
        hello.start()
        // keep the task alive so the executor keeps running
        while (true) {
          val j = 1 + 2
          Thread.sleep(1000)
        }
      }
    }
    sc.stop()
  }
}
Follow-up question: I'm curious what would happen if I ran the above code on Spark + Mesos. Would Mesos limit the 4 threads to run on only one core?
but I'm still not able to understand how Spark Standalone implements resource isolation and allocation.
With Spark, we have the notion of a Master node and Worker nodes. We can think about the latter as a resource pool. Each worker has CPU and RAM which it brings to the pool, and Spark jobs can utilize the resources in that pool to do their computation.
Spark Standalone has the notion of an Executor, which is the process that handles the computation, and to which we give resources from the resource pool. In any given executor, we run different stages of a computation, which are composed of different tasks. Now, we can control the amount of computing power (cores) a given task uses (via the spark.task.cpus configuration parameter), and we can also control the total amount of computing power a given job may have (via spark.cores.max, which tells the cluster manager how many resources in total we want to give to the particular job we're running). Note that Standalone is greedy by default and will schedule an executor on every Worker node in the cluster. We can get finer-grained control over how many actual Executors we have by using Dynamic Allocation.
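As a small illustration of those two settings in a standalone deployment (the values are placeholders, not a recommendation):

import org.apache.spark.{SparkConf, SparkContext}

// With these placeholder values the app may take at most 4 cores from the pool,
// and each task reserves 2 of them, so at most 4 / 2 = 2 tasks run concurrently.
val conf = new SparkConf()
  .setAppName("resource-demo")
  .set("spark.cores.max", "4")
  .set("spark.task.cpus", "2")
val sc = new SparkContext(conf)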
for example, if I run 10 threads in one executor but Spark only gave the executor one core, how does Spark guarantee these 10 threads run on only one CPU core?
Spark doesn't verify that the execution only happens on a single core. Spark doesn't know which CPU cycles it'll get from the underlying operating system. What Spark Standalone does attempt to do is resource management: it tells you "Look, you have X amount of CPUs and Y amount of RAM, I will not let you schedule jobs if you don't partition your resources properly".
Spark Standalone handles only resource allocation, which is a simple task. All that is required is keeping tabs on:
available resources.
assigned resources.
It doesn't take care of resource isolation. YARN and Mesos, which have broader scope, don't implement resource isolation but depend on Linux Control Groups (cgroups).
I would like to know whether reading from a Kafka queue is faster using a batch Kafka RDD instead of KafkaDirectStream when I want to read the whole Kafka queue.
I've observed that reading from different partitions with the batch RDD does not result in concurrent Spark jobs. Are there some Spark properties to configure in order to allow this behaviour?
Thanks.
Try running your spark consumers in different threads or as different processes. That's the approach I take. I've observed that I get the best concurrency by allocating one consumer thread (or process) per topic partition.
Regarding your questions about batch vs KafkaDirectStream, I think even KafkaDirectStream is batch oriented. The batch interval can be specified in the streaming context, like this:
private static final int INTERVAL = 5000; // 5 seconds
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(INTERVAL));
There's a good image that describes how Spark Streaming is batch oriented here:
http://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#discretized-streams-dstreams
Spark is essentially a batch engine, and Spark Streaming takes batching closer to streaming by defining something called micro-batching. Micro-batching is nothing but specifying the batch interval to be very low (it can be as low as 50 ms, per the advice in the official documentation). So all that matters now is how low your micro-batch interval is going to be. If you keep it low, it feels near real-time.
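The Scala equivalent of the batch-interval setup above might look like the sketch below; the 500 ms interval is just a placeholder to tune for your workload:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// A small micro-batch interval makes the pipeline feel closer to real-time.
val conf = new SparkConf().setAppName("micro-batch-demo")
val ssc = new StreamingContext(conf, Milliseconds(500))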
On the Kafka consumer front, the Spark direct receiver runs as a separate task on each executor. So if you have as many executors as partitions, it fetches data from all partitions in parallel and creates an RDD out of it.
If you are talking about reading from multiple queues, then you would create multiple DStreams, which would again need more executors to match the total number of partitions.
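For reference, a minimal sketch of creating a direct stream, assuming the spark-streaming-kafka-0-10 integration and a StreamingContext ssc like the one sketched earlier; the broker address, group id and topic name are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",          // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-consumer-group",                // placeholder group id
  "auto.offset.reset" -> "earliest"
)

// One RDD partition is created per Kafka topic partition,
// so read parallelism follows the topic's partition count.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams)
)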
I am designing a system with the following flow:
Download feed files (line based) over the network
Parse the elements into objects
Filter invalid / unnecessary objects
Execute blocking IO (HTTP Request) on part of the elements
Save to DB
I have been considering implementing the system using Spark Streaming, mainly for its task parallelization, resource management, fault tolerance, etc.
But I am not sure this is the right use-case for spark streaming, as I am not using it only for metrics and data processing.
Also I'm not sure how Spark-streaming handles blocking IO tasks.
Is Spark-streaming suitable for this use-case? Or maybe I should look for another technology/framework?
Spark is, at its heart, a general parallel computing framework. Spark Streaming adds an abstraction to support stream processing using micro-batching.
We can certainly implement such a use case on Spark Streaming.
To 'fan-out' the I/O operations, we need to ensure the right level of parallelism at two levels:
First, distribute the data evenly across partitions:
The initial partitioning of the data will depend on the streaming source used. For this use case, it looks like a custom receiver could be the way to go. After the batch is received, we probably need to use dstream.repartition(n) to repartition into a larger number of partitions, roughly 2-3x the number of executors allocated for the job.
Second, multiplex each core across many concurrent I/O operations: Spark uses 1 core (configurable) for each task executed, and tasks are executed per partition. This assumes that our task is CPU intensive and requires a full CPU. To optimize execution for blocking I/O, we would like to multiplex that core over many operations. We do this by operating directly on the partitions and using classical concurrent programming to parallelize our work.
Given the original stream of feedLinesDstream, we could do something like the following (in Scala; a Java version should be similar, just with a few times more LOC):
import scala.concurrent.{Await, ExecutionContext, Future}

val feedLinesDstream = ??? // the original dstream of feed lines
val parsedElements = feedLinesDstream.map(parseLine)
val validElements = parsedElements.filter(isValid _)
val distributedElements = validElements.repartition(n) // n = 2 to 3 x # of executors

// multiplex execution at the level of each partition
val data = distributedElements.mapPartitions { iter =>
  implicit val executionContext: ExecutionContext = ??? // obtain a thread pool for execution
  // fire off the blocking I/O calls concurrently
  val futures = iter.map(elem => Future(ioOperation(elem))).toList
  // traverse the futures, resulting in a single future collection of results
  val res = Future.sequence(futures)
  Await.result(res, timeout).iterator
}
data.saveToCassandra(keyspace, table)
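One possible way to fill in the ??? for the execution context inside mapPartitions is a dedicated fixed-size pool; the size of 32 is an arbitrary assumption to tune against what the downstream HTTP service can handle:

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// A fixed pool dedicated to the blocking HTTP calls, separate from Spark's task threads.
implicit val executionContext: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(32))

In a real job you would want to reuse or shut down this pool rather than leak one per partition per batch.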
Is Spark-streaming suitable for this use-case? Or maybe I should look for another technology/framework?
When considering using Spark, you should ask yourself a few questions:
What is the scale of my application in its current state, and where will it grow in the future? (Spark is generally meant for Big Data applications processing millions of records per second.)
What is my preferred language? (Spark applications can be written in Java, Scala, Python, and R.)
What database will I be using? (Technologies like Apache Spark are normally paired with large distributed stores like HBase.)
Also I'm not sure how Spark-streaming handles blocking IO tasks.
There is already an answer on Stack Overflow about blocking IO tasks using Spark in Scala. It should give you a start, but to answer that question, yes it is possible.
Lastly, reading documentation is important and you can find Spark's right here.