Spark-streaming for task parallelization - apache-spark

I am designing a system with the following flow:
Download feed files (line based) over the network
Parse the elements into objects
Filter invalid / unnecessary objects
Execute blocking IO (HTTP Request) on part of the elements
Save to DB
I have been considering implementing the system with Spark Streaming, mainly for task parallelization, resource management, fault tolerance, etc.
But I am not sure this is the right use case for Spark Streaming, as I am not using it only for metrics and data processing.
Also, I'm not sure how Spark Streaming handles blocking IO tasks.
Is Spark Streaming suitable for this use case? Or maybe I should look for another technology/framework?

Spark is, at its heart, a general parallel computing framework. Spark Streaming adds an abstraction to support stream processing using micro-batching.
We can certainly implement such a use case on Spark Streaming.
To 'fan out' the I/O operations, we need to ensure the right level of parallelism at two levels:
First, distribute the data evenly across partitions:
The initial partitioning of the data depends on the streaming source used. For this use case, it looks like a custom receiver could be the way to go. After a batch is received, we probably need to use dstream.repartition(n) to spread the data over a larger number of partitions, roughly 2-3x the number of executors allocated for the job.
Spark uses one core (configurable) per task, and tasks are executed per partition. This assumes that each task is CPU-intensive and requires a full core. To optimize execution for blocking I/O, we would rather multiplex that core across many operations. We do this by operating directly on the partitions and using classical concurrent programming to parallelize our work.
Given the original stream of feedLinesDstream, we could do something like:
(in Scala; a Java version should be similar, just several times the lines of code)
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

val feedLinesDstream = ??? // the original dstream of feed lines
val parsedElements = feedLinesDstream.map(parseLine)
val validElements = parsedElements.filter(isValid _)
val distributedElements = validElements.repartition(n) // n = 2 to 3 x #of executors
// multiplex execution at the level of each partition
val data = distributedElements.mapPartitions { iter =>
  implicit val executionContext: ExecutionContext = ??? // obtain a thread pool for execution
  // materialize the partition and start one Future per element
  val futures = iter.map(elem => Future(ioOperation(elem))).toList
  // traverse the futures into a single future collection of results
  val res = Future.sequence(futures)
  Await.result(res, timeout).iterator
}
data.saveToCassandra(keyspace, table)
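The ??? for the execution context is left open on purpose; one way to fill it in (a sketch, with the helper object name and the pool size of 32 being assumptions to tune) is a fixed thread pool created once per executor JVM:

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// hypothetical helper: one fixed pool per executor JVM, sized for the blocking HTTP calls
object IoExecution {
  // 32 threads is an assumption; tune it against the latency and rate limits of the endpoint
  implicit lazy val blockingIoContext: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))
}

Inside mapPartitions we would then import IoExecution.blockingIoContext instead of the ??? placeholder; because the object is initialized lazily on each executor, the pool is created once per JVM rather than once per partition.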

Is Spark Streaming suitable for this use case? Or maybe I should look for another technology/framework?
When considering using Spark, you should ask yourself a few questions:
What is the scale of my application in its current state, and where will it grow to in the future? (Spark is generally meant for Big Data applications that process millions of records per second.)
What is my preferred language? (Spark can be used from Java, Scala, Python, and R.)
What database will I be using? (Technologies like Apache Spark are normally paired with large-scale datastores like HBase.)
Also, I'm not sure how Spark Streaming handles blocking IO tasks.
There is already an answer on Stack Overflow about blocking IO tasks using Spark in Scala. It should give you a start, but to answer that question: yes, it is possible.
Lastly, reading documentation is important and you can find Spark's right here.

Related

How can I parallelize multiple Datasets in Spark?

I have a Spark 2.1 job where I maintain multiple Dataset objects/RDDs that represent different queries over our underlying Hive/HDFS datastore. I've noticed that if I simply iterate over the List of Datasets, they execute one at a time. Each individual query operates in parallel, but I feel that we are not maximizing our resources by not also running the different datasets in parallel.
There doesn't seem to be a lot out there regarding doing this, as most questions appear to be around parallelizing a single RDD or Dataset, not parallelizing multiple within the same job.
Is this inadvisable for some reason? Can I just use an executor service, thread pool, or futures to do this?
Thanks!
Yes, you can use multithreading in the driver code, but normally this does not increase performance, unless your queries operate on very skewed data and/or cannot be parallelized well enough to fully utilize the resources.
You can do something like this:
val datasets: Seq[Dataset[_]] = ???
datasets
  .par // convert to a parallel collection
  .foreach(ds => ds.write.saveAsTable(...))
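To the 'executor service, thread pool, or futures' part of the question: yes, plain Futures over an explicit pool work as well. A minimal sketch, reusing datasets from above and assuming a hypothetical tableNameFor helper for the target table names:

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.sql.Dataset

def tableNameFor(ds: Dataset[_]): String = ???   // hypothetical: derive a target table name

// one thread per dataset so all writes are submitted to Spark concurrently
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(datasets.size))

val writes = datasets.map(ds => Future(ds.write.saveAsTable(tableNameFor(ds))))
Await.result(Future.sequence(writes), Duration.Inf)

Either way, the concurrency only pays off if the cluster has spare capacity that a single job cannot use on its own.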

Running several spark jobs concurrently from driver

Imagine that we have 3 customers and we want to do the same work for each of them in parallel.
def doSparkJob(customerId: String) = {
  spark
    .read.json(s"$customerId/file.json")
    .map(...)
    .reduceByKey(...)
    .write
    .partitionBy("id")
    .parquet("output/")
}
We do it concurrently like this (from the Spark driver):
// create the futures first so the three jobs actually start concurrently
val f1 = Future { doSparkJob("customer1") }
val f2 = Future { doSparkJob("customer2") }
val f3 = Future { doSparkJob("customer3") }
val jobs: Future[(Unit, Unit, Unit)] = for {
  r1 <- f1
  r2 <- f2
  r3 <- f3
} yield (r1, r2, r3)
Await.ready(jobs, 5.hours)
Do I understand correctly that this is a bad approach? Many Spark jobs would push each other's data out of the executors and there would be a lot of spilling to disk. How will Spark manage executing tasks from parallel jobs? How do shuffles behave when we have 3 concurrent jobs from one driver and only 3 executors with one core each?
I guess a good approach should look like this:
We read all the data together for all customers, groupByKey by customer, and do what we want to do.
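A sketch of that single-job alternative (the path layout, column names and aggregation are placeholders for the real per-customer work) could look like:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark: SparkSession = ???                    // same session as above
// read every customer's file in one job instead of one job per customer
spark.read.json("customers/*/file.json")
  .withColumn("customer", input_file_name())     // in practice, parse the customer id out of the path
  .groupBy("customer", "id")
  .count()                                       // stand-in for the real aggregation
  .write
  .partitionBy("customer")
  .parquet("output/")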
Do I understand correctly that this is a bad approach?
Not necessarily. A lot depends on the context, and Spark implements its own set of AsyncRDDActions to address scenarios like this one (though there is no Dataset equivalent).
In the simplest scenario, with static allocation, it is quite likely that Spark will just schedule all jobs sequentially, due to lack of resources. Unless configured otherwise, this is the most probable outcome with the described configuration. Please keep in mind that Spark can use in-application scheduling with FAIR scheduler to share limited resources between multiple concurrent jobs. See Scheduling Within an Application.
If the amount of resources is sufficient to start multiple jobs at the same time, there can be competition between individual jobs, especially for IO- and memory-intensive jobs. If all jobs use the same external resources (especially databases), it is possible that Spark will cause throttling and subsequent failures or timeouts. A less severe effect of running multiple jobs can be increased cache eviction.
Overall, there are multiple factors to consider when choosing between sequential and concurrent execution, including, but not limited to, available resources (Spark cluster and external services), choice of API (RDDs tend to be more greedy than SQL and therefore require some low-level management) and choice of operators. Even if jobs run sequentially, you may still decide to use asynchronous submission to improve driver utilization and reduce latency. This is particularly useful with Spark SQL and complex execution plans (a common bottleneck in Spark SQL). This way Spark can crunch new execution plans while other jobs are executed.
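For reference, a minimal sketch of enabling the in-application FAIR scheduling mentioned above (the application and pool names are placeholders):

import org.apache.spark.sql.SparkSession

// FAIR scheduling lets concurrent jobs share executors instead of queueing FIFO
val spark = SparkSession.builder()
  .appName("concurrent-customer-jobs")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()

// jobs submitted from this thread go to a named pool (configured in fairscheduler.xml if customized)
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "customerJobs")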

Why does SparkContext.parallelize use memory of the driver?

Now I have to create a parallelized collection using sc.parallelize() in pyspark (Spark 2.1.0).
The collection in my driver program is big. When I parallelize it, I found it takes up a lot of memory on the master node.
It seems that the collection is still kept in Spark's memory on the master node even after I parallelize it to the worker nodes.
Here's an example of my code:
# my python code
from pyspark import SparkContext

sc = SparkContext()
a = [1.0] * 1000000000
rdd_a = sc.parallelize(a, 1000000)
sum = rdd_a.reduce(lambda x, y: x + y)
I've tried
del a
to destroy it, but it didn't work. The Spark JVM process is still using a lot of memory.
After I create rdd_a, how can I destroy a to free the master node's memory?
Thanks!
The job of the master is to coordinate the workers and to give a worker a new task once it has completed its current task. In order to do that, the master needs to keep track of all of the tasks that need to be done for a given calculation.
Now, if the input were a file, a task would simply look like "read file F from X to Y". But because the input was in memory to begin with, each task literally contains its 1,000 numbers. And given that the master needs to keep track of all 1,000,000 tasks, that gets quite large.
The collection in my driver program is big. When I parallelize it, I found it takes up a lot of memory on the master node.
That's how it's supposed to be, and that's why SparkContext.parallelize is only meant for demos and learning purposes, i.e. for quite small datasets.
Quoting the scaladoc of parallelize:
parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T]
Distribute a local Scala collection to form an RDD.
Note "a local Scala collection" that means that the collection you want to map to a RDD (or create a RDD from) is already in the memory of the driver.
In your case, a is a local Python variable and Spark knows nothing about it. What happens when you use parallelize is that the local variable (that's already in the memory) is wrapped in this nice data abstraction called RDD. It's simply a wrapper around the data that's already in memory on the driver. Spark can't do much about that. It's simply too late. But Spark plays nicely and pretends the data is as distributed as other datasets you could have processed using Spark.
That's why parallelize is only meant for small datasets to play around (and mainly for demos).
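As a sketch of the alternative hinted at above (in Scala, to match the other answers here; the HDFS path is hypothetical), materializing the data in storage first keeps each task down to an offset range instead of the numbers themselves:

import org.apache.spark.SparkContext

val sc: SparkContext = ???   // the application's SparkContext
// each task now only carries "read this split of the file", not the data itself
val rdd = sc.textFile("hdfs:///tmp/numbers.txt", minPartitions = 1000)
val total = rdd.map(_.toDouble).reduce(_ + _)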
As Jacek's answer says, parallelize is only meant as a demo for small datasets; note that you can access all variables defined in the driver within the parallelized block.

Spark-Streaming Kafka Direct Streaming API & Parallelism

I understand the automated mapping that exists between a Kafka partition, a Spark RDD partition, and ultimately a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.
In Spark Streaming, how exactly do data consumption, data processing and task allocation relate? In other words:
Does the Spark task corresponding to a Kafka partition both read and process the data altogether?
The rationale behind this question is that in the previous, receiver-based API, a task was dedicated to receiving the data, meaning a number of task slots of your executors were reserved for data ingestion while the others were there for processing. This had an impact on how you sized your executors in terms of cores.
Take for example the advice on how to launch spark-streaming with --master local. Everyone would tell you that in the case of Spark Streaming one should use at least local[2], because one of the cores will be dedicated to running the long receiving task that never ends, and the other core will do the data processing.
So if the answer is that in this case the task does both the reading and the processing at once, then the follow-up question is: is that really smart? This sounds like it should be asynchronous. We want to be able to fetch while we process, so that for the next processing step the data is already there. However, if there is only one core to both read the data and process it, how can both be done in parallel, and how does that make things faster in general?
My original understanding was that things would have remained somewhat the same, in the sense that a task would be launched to read, but the processing would be done in another task. That would mean that, if the processing task is not done yet, we can still keep reading, up to a certain memory limit.
Can someone outline with clarity what exactly is going on here?
EDIT1
We don't even need that memory-limit control. The point is just being able to fetch while the processing is going on, and stopping right there. In other words, the two processes should be asynchronous, with the limit simply being to stay one step ahead. If somehow this is not happening, I find it extremely strange that Spark would implement something that breaks performance like that.
Does the Spark task corresponding to a Kafka partition both read and process the data altogether?
The relationship is very close to what you describe, if by 'task' we're referring to the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:
The driver reads offsets from all Kafka topics and partitions.
The driver assigns each executor a topic and partition to be read and processed.
Unless there is a shuffle boundary operation, it is likely that Spark will optimize the entire execution of the partition on the same executor.
This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition it is likely to continue processing it for the entire lifetime of the application.
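For concreteness, a minimal sketch of the direct (kafka-0-10) stream this answer describes, with hypothetical broker, group and topic names; the resulting DStream has exactly one RDD partition per Kafka TopicPartition, each read and processed by the same task until a shuffle:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val ssc: StreamingContext = ???                  // the application's streaming context

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",      // hypothetical broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group"        // hypothetical consumer group
)

// one RDD partition per Kafka TopicPartition, consumed and processed by the same task
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("feed-topic"), kafkaParams)
)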

How does Spark's RDD achieve fault tolerance?

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. But I could not find the internal mechanism by which the RDD achieves fault tolerance. Could somebody describe this mechanism? Thanks.
Let me explain in very simple terms, as I understand it.
Faults in a cluster can happen when one of the nodes processing data crashes. In Spark terms, an RDD is split into partitions, and each node (called an executor) is operating on a partition at any point in time. (Theoretically, each executor can be assigned multiple tasks, depending on the number of cores assigned to the job versus the number of partitions present in the RDD.)
By operation, what is really happening is a series of Scala functions (called transformations and actions in Spark terms, depending on whether the function is pure or side-effecting) executing on a partition of the RDD. These operations are composed together, and the Spark execution engine views them as a Directed Acyclic Graph (DAG) of operations.
Now, suppose a particular node crashes in the middle of an operation Z, which depended on operation Y, which in turn depended on operation X. The cluster manager (YARN/Mesos) finds out the node is dead and tries to assign another node to continue processing. This node will be told to operate on the particular partition of the RDD and on the series of operations X -> Y -> Z (called the lineage) that it has to execute, by passing in the Scala closures created from the application code. Now the new node can happily continue processing and there is effectively no data loss.
Spark also uses this mechanism to guarantee exactly-once processing, with the caveat that any side-effecting operation you do, like calling a database inside a Spark action block, can be invoked multiple times. But if you view your transformations as pure functional mappings from one RDD to another, then you can rest assured that the resulting RDD will have the elements from the source RDD processed only once.
The domain of fault tolerance in Spark is very vast and it needs a much bigger explanation. I am hoping to see others come up with technical details on how this is implemented, etc. Thanks for the great topic though.
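As a small illustration of the lineage idea (the numbers and operations are arbitrary), the chain of transformations that Spark would replay for a lost partition can be inspected directly:

import org.apache.spark.SparkContext

val sc: SparkContext = ???                        // the application's SparkContext
val x = sc.parallelize(1 to 1000, numSlices = 8)  // operation X
val y = x.map(_ * 2)                              // operation Y
val z = y.filter(_ % 4 == 0)                      // operation Z

println(z.toDebugString)  // prints the lineage (DAG) that would be replayed on failure
println(z.count())        // the action that actually triggers execution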
