I'm fairly new to Scala and to working with multiple threads. I would like to test whether I can speed up the filling of Spark DataFrames by running them in parallel. Unfortunately, I couldn't find any good tutorial on how to assign variables in parallel threads.
Initializing the DataFrames:
val first_df = stg_df.as('a).select($"a.attr1", $"a.attr2")
val second_df = stg_df.as('a).select($"a.attr3", $"a.attr4")
Maybe something I can make use of:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

List("one", "two", "three", "four").foreach(name => Future(println("Thread " + name + " says hi")))
Spark is very different from regular Scala code. It already runs in parallel across your cluster and you generally shouldn't be creating threads yourself.
Stick to Spark-specific programming tutorials when working with Spark and parallelism.
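To make that concrete, here is a minimal sketch (the output paths are placeholders): the two select statements above only build lazy plans, and each action on the resulting DataFrames is already split into tasks that run in parallel across the executors, with no driver-side threads involved.

// Each write is a Spark action: the work is broken into tasks and
// executed in parallel on the cluster -- no manual thread management needed.
first_df.write.parquet("/tmp/first_df")   // placeholder output path
second_df.write.parquet("/tmp/second_df") // placeholder output path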
Related
I have a data-processing pipeline consisting of three methods (say A(), B(), C(), run sequentially) for an input text file, but I have to repeat this pipeline for 10,000 different files. So far I have used ad hoc multithreading: create 10,000 threads and add them to a thread pool. Now I am switching to Spark to achieve this parallelism. My questions are:
If Spark can do a better job, please guide me through the basic steps, since I'm new to Spark.
If I stick with ad hoc multithreading and deploy it on a cluster, how can I manage resources so that the threads are allocated evenly among the nodes? I'm new to HPC systems too.
I hope I'm asking the right questions, thanks!
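For the first question, a minimal Spark sketch of such a per-file pipeline could look like the following (it assumes A, B, and C are plain functions that transform a file's text content; the paths are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("file-pipeline").getOrCreate()
val sc = spark.sparkContext

// One record per input file: (path, content). Spark schedules the
// per-file work as tasks across the cluster instead of 10,000 threads.
sc.wholeTextFiles("hdfs:///input/dir")
  .mapValues(content => C(B(A(content))))
  .saveAsTextFile("hdfs:///output/dir")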
I have a Spark 2.1 job where I maintain multiple Dataset objects/RDDs that represent different queries over our underlying Hive/HDFS datastore. I've noticed that if I simply iterate over the List of Datasets, they execute one at a time. Each individual query operates in parallel, but I feel that we are not maximizing our resources by not running the different datasets in parallel as well.
There doesn't seem to be a lot out there regarding doing this, as most questions appear to be around parallelizing a single RDD or Dataset, not parallelizing multiple within the same job.
Is this inadvisable for some reason? Can I just use an executor service, thread pool, or futures to do this?
Thanks!
Yes, you can use multithreading in the driver code, but normally this does not increase performance, unless your queries operate on very skewed data and/or cannot be parallelized well enough to fully utilize the resources.
You can do something like this:
val datasets: Seq[Dataset[_]] = ???

datasets
  .par // convert to a parallel Seq
  .foreach(ds => ds.write.saveAsTable(...))
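If you prefer the executor service / Futures route mentioned in the question, a sketch could look like this (it reuses the same datasets sequence; the pool size and table names are arbitrary placeholders):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Bound how many Spark actions the driver submits concurrently.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

val jobs = datasets.zipWithIndex.map { case (ds, i) =>
  Future(ds.write.saveAsTable(s"results_$i")) // placeholder table names
}
Await.result(Future.sequence(jobs), Duration.Inf)

Either way, the concurrent jobs share the cluster's executors, so this only helps when a single job cannot keep all resources busy on its own.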
Is there any way to run multiple independent aggregation jobs on a single RDD in parallel? First preference is Python then Scala and Java.
The courses of action, in order of preference, are:
Using a thread pool - run different functions doing different aggregations on different threads. I have not seen an example that does this.
Using cluster mode on YARN, submitting different jars. Is this possible, and if so, is it possible in PySpark?
Using Kafka - run different spark-submits on the DataFrame streaming through Kafka.
I am quite new to Spark, and my experience so far is running Spark on YARN for ETL, doing multiple aggregations serially. I was wondering whether it is possible to run these aggregations in parallel, as they are mostly independent.
Considering your broad question, here is a broad answer:
Yes, it is possible to run multiple aggregation jobs on a single DataFrame in parallel.
For the rest, it doesn't seem to be clear what you are asking.
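As a minimal sketch of what that can look like (assuming a SparkSession named spark; the source path, columns, and aggregations are hypothetical), you can cache the DataFrame once and launch each aggregation from its own Future on the driver:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val df = spark.read.parquet("/path/to/data").cache() // hypothetical source

// Two independent aggregations submitted concurrently; Spark's scheduler
// interleaves their tasks across the cluster.
val sumJob   = Future(df.groupBy("key").sum("amount").collect())
val countJob = Future(df.groupBy("key").count().collect())

val sums   = Await.result(sumJob, Duration.Inf)
val counts = Await.result(countJob, Duration.Inf)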
I am designing a system with the following flow:
Download feed files (line based) over the network
Parse the elements into objects
Filter invalid / unnecessary objects
Execute blocking IO (HTTP Request) on part of the elements
Save to DB
I have been considering implementing the system using Spark Streaming, mainly for task parallelization, resource management, fault tolerance, etc.
But I am not sure this is the right use-case for spark streaming, as I am not using it only for metrics and data processing.
Also I'm not sure how Spark-streaming handles blocking IO tasks.
Is Spark-streaming suitable for this use-case? Or maybe I should look for another technology/framework?
Spark is, at its heart, a general parallel computing framework. Spark Streaming adds an abstraction to support stream processing using micro-batching.
We can certainly implement such a use case on Spark Streaming.
To 'fan-out' the I/O operations, we need to ensure the right level of parallelism at two levels:
First, distribute the data evenly across partitions:
The initial partitioning of the data will depend on the streaming source used. For this use case, it looks like a custom receiver could be the way to go. After the batch is received, we probably need to use dstream.repartition(n) to repartition to a larger number of partitions, roughly 2-3x the number of executors allocated for the job.
Spark uses 1 core (configurable) for each task executed, and tasks are executed per partition. This assumes that our task is CPU-intensive and requires a full CPU. To optimize execution for blocking I/O, we would like to multiplex that core across many operations. We do this by operating directly on the partitions and using classical concurrent programming to parallelize our work.
Given the original stream feedLinesDstream, we could do something like the following (in Scala; a Java version should be similar, just with more lines of code):
import scala.concurrent.{Await, ExecutionContext, Future}

val feedLinesDstream = ??? // the original dstream of feed lines
val parsedElements = feedLinesDstream.map(parseLine)
val validElements = parsedElements.filter(isValid _)
val distributedElements = validElements.repartition(n) // n = 2 to 3 x # of executors

// multiplex execution at the level of each partition
val data = distributedElements.mapPartitions { iter =>
  implicit val executionContext: ExecutionContext = ??? // obtain a thread pool for execution
  val futures = iter.map(elem => Future(ioOperation(elem))).toList
  // traverse the futures, resulting in a future collection of results
  val res = Future.sequence(futures)
  // block until all I/O for this partition completes, then return the results
  Await.result(res, timeout).iterator
}
data.saveToCassandra(keyspace, table)
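One way to fill in the thread-pool placeholder above (an assumption; the pool size of 32 is arbitrary) is a small per-JVM singleton, so each executor reuses the same pool across batches instead of creating one per partition:

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Lazily initialized once per executor JVM and shared by all tasks there.
object IoPool {
  lazy val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(32))
}

Inside mapPartitions, the placeholder then becomes implicit val executionContext = IoPool.ec.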
Is Spark-streaming suitable for this use-case? Or maybe I should look for another technology/framework?
When considering using Spark, you should ask yourself a few questions:
What is the scale of my application in its current state, and where will it grow to in the future? (Spark is generally meant for Big Data applications processing millions of records per second.)
What is my preferred language? (Spark can be used from Java, Scala, Python, and R.)
What database will I be using? (Technologies like Apache Spark are normally paired with large datastores like HBase.)
Also I'm not sure how Spark-streaming handles blocking IO tasks.
There is already an answer on Stack Overflow about blocking IO tasks using Spark in Scala. It should give you a start, but to answer that question: yes, it is possible.
Lastly, reading the documentation is important; you can find Spark's at spark.apache.org.
Normally when creating an RDD from a List you can just use the SparkContext.parallelize method, but you cannot use the SparkContext from within a task, as it's not serializable. I need to create an RDD from a list of Strings from within a task. Is there a way to do this?
I've tried creating a new SparkContext in the task, but it gives me an error about not supporting multiple Spark contexts in the same JVM and says that I need to set spark.driver.allowMultipleContexts = true. According to the Apache user group, however, that setting does not yet seem to be supported.
As far as I am concerned, it is not possible, and it is hardly a matter of serialization or of support for multiple Spark contexts. The fundamental limitation is the core Spark architecture: since the Spark context is maintained by the driver and tasks are executed on the workers, creating an RDD from inside a task would require pushing changes from the workers to the driver. I am not saying it is technically impossible, but the whole idea seems rather cumbersome.
Creating a Spark context from inside tasks looks even worse. First of all, it would mean that the context is created on the workers, which for all practical purposes don't communicate with each other. Each worker would get its own context, which could operate only on the data that is accessible on that worker. Finally, preserving worker state is definitely not part of the contract, so any context created inside a task should simply be garbage collected after the task finishes.
If handling the problem using multiple jobs is not an option you can try to use mapPartitions like this:
val rdd = sc.parallelize(1 to 100)

// Group the elements of each partition locally, emitting one map of
// buffers per partition.
val tmp = rdd.mapPartitions(iter => {
  val results = Map(
    "odd" -> scala.collection.mutable.ArrayBuffer.empty[Int],
    "even" -> scala.collection.mutable.ArrayBuffer.empty[Int]
  )
  for (i <- iter) {
    if (i % 2 != 0) results("odd") += i
    else results("even") += i
  }
  Iterator(results)
})

// Flatten the per-partition buffers back into two separate RDDs.
val odd = tmp.flatMap(_("odd"))
val even = tmp.flatMap(_("even"))
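Note that tmp is recomputed for each action that eventually materializes odd or even; calling tmp.cache() before deriving them avoids repeating the partition-level grouping.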