Parallel parameter search on spark - apache-spark

I would like to minimize a cost function in parallel - testing a set of parameters of my algorithm.
From this article I get the impression that this can be done by creating an RDD of parameters and then calling map on it:
val grid = (1 until 10)
val partitions = 10
val rdd = sc.parallelize(grid, partitions)
val costs = rdd.map(costfnc(_))
Is this a reasonable approach? What if the cost function already utilizes operations on an RDD? Can this have a negative impact on the cluster (perhaps competing for resources)?

What if the cost function already utilizes operations on an RDD?
Then it is not valid Spark code and simply won't work. You cannot start an action or a transformation from inside another action or transformation.
Is this a reasonable approach?
It depends on multiple factors. Generally speaking, Spark is a rather heavyweight solution, and using it only to achieve naive parallelization without leveraging its other properties (fault tolerance, data processing capabilities) doesn't make sense.
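To make that concrete, here is a minimal sketch of the pattern the answer implies instead: keep the grid on the driver and launch each RDD-backed cost evaluation from there. costfnc is the question's function; the .par wrapper is an assumption and only parallelizes job submission on the driver:
val grid = (1 until 10)
// each call runs on the driver, so costfnc may freely use RDD operations inside
val costs = grid.par.map(p => (p, costfnc(p)))
val best = costs.minBy(_._2) // pick the parameter with the lowest cost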

Related

Optimal (low-latency) spark settings for small datasets

I'm aware that Spark is designed for large datasets, for which it's great. But under certain circumstances I don't need this scalability, e.g. for unit tests or for data exploration on small datasets. Under these conditions Spark performs relatively badly compared to an implementation in pure Scala/Python/MATLAB/R etc.
Note that I don't want to drop spark entirely, I want to keep the framework for larger workloads without re-implementing everything.
How can I reduce Spark's overhead as much as possible on small datasets (say 10 to 1000s of records)? I've tried using only 1 partition in local mode (setting spark.sql.shuffle.partitions=1 and spark.default.parallelism=1). Even with these settings, simple queries on 100 records take on the order of 1-2 seconds.
Note that I'm not trying to reduce the time for SparkSession instantiation, just the execution time given SparkSession exists.
Operations in Spark have the same signatures as those on the Scala collections.
You could implement something like:
import org.apache.spark.rdd.RDD

val useSpark = false
val rdd: RDD[String] = ??? // an existing RDD of strings
val list: List[String] = Nil
def mapping: String => Int = s => s.length
if (useSpark) {
  rdd.map(mapping)
} else {
  list.map(mapping)
}
I think this code could be abstracted even more.
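For example (a sketch, not part of the original answer): hide both backends behind a minimal interface so callers never branch on useSpark. MiniColl, SparkColl and LocalColl are hypothetical names:
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

sealed trait MiniColl[A] {
  def map[B: ClassTag](f: A => B): MiniColl[B]
}
// Spark-backed implementation: delegates to RDD.map (which needs the ClassTag)
final case class SparkColl[A](rdd: RDD[A]) extends MiniColl[A] {
  def map[B: ClassTag](f: A => B): MiniColl[B] = SparkColl(rdd.map(f))
}
// local implementation: delegates to List.map, no cluster involved
final case class LocalColl[A](list: List[A]) extends MiniColl[A] {
  def map[B: ClassTag](f: A => B): MiniColl[B] = LocalColl(list.map(f))
}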

Stateful udfs in spark sql, or how to obtain mapPartitions performance benefit in spark sql?

Using mapPartitions instead of plain map can give a significant performance boost in cases where the transformation incurs creating or loading an expensive resource (e.g. authenticating to an external service or creating a db connection).
mapPartitions allows us to initialise the expensive resource once per partition versus once per row, as happens with the standard map.
But if I am using dataframes, the way I apply custom transformations is by specifying user defined functions that operate on a row-by-row basis, so I lose the ability I had with mapPartitions to perform the heavy lifting once per chunk.
Is there a workaround for this in spark-sql/dataframe?
To be more specific:
I need to perform feature extraction on a bunch of documents. I have a function that inputs a document and outputs a vector.
The computation itself involves initialising a connection to an external service. I don't want or need to initialise it per document. This has non-trivial overhead at scale.
In general you have three options:
Convert the DataFrame to an RDD and apply mapPartitions directly (a sketch follows this list). Since you use a Python udf you already break certain optimizations and pay the serde cost, and using an RDD won't make it worse on average.
Lazily initialize required resources (see also How to run a function on all Spark workers before processing data in PySpark?).
If the data can be serialized with Arrow, use a vectorized pandas_udf (Spark 2.3 and later). Unfortunately you cannot use it directly with VectorUDT, so you'd have to expand the vectors and collapse them later; the limiting factor here is the size of the vector. You also have to be careful to keep the size of partitions under control.
Note that using UserDefinedFunctions might require promoting objects to non-deterministic variants.
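To make the first option concrete, a minimal sketch (in Scala, matching the other examples in this thread; ExpensiveClient and extractFeatures are hypothetical stand-ins for the external-service connection and the per-document feature extractor):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def withFeatures(df: DataFrame): RDD[Seq[Double]] =
  df.rdd.mapPartitions { rows =>
    val client = new ExpensiveClient() // initialised once per partition
    rows.map(row => extractFeatures(client, row)) // reused for every row in the chunk
  }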

How to choose between join(broadcast) and collect with Spark

I'm using Spark 2.2.1.
I have a small DataFrame (less than 1M) and I have a computation on a big DataFrame that will need this small one to compute a column in a UDF.
What is the best option regarding performance?
Is it better to broadcast this DF (I don't know if Spark will do the Cartesian in memory)?
bigDF.crossJoin(broadcast(smallDF))
  .withColumn("newCol", udf($"colFromSmall", $"colFromBig"))
or to collect it and use the small value directly in the udf
val small = smallDF.collect()
bigDF.withColumn("newCol", udf($"colFromBig"))
Both will collect data first, so in terms of memory footprint there is no difference. So the choice should be dictated by the logic:
If you can do better than the default execution plan and don't want to create your own, a udf might be a better approach.
If it is just a Cartesian, and requires a subsequent explode - perish the thought - just go with the former option.
As suggested in the comments by T.Gawęda, in the second case you can use a broadcast variable:
val small = spark.sparkContext.broadcast(smallDF.collect())
bigDF.withColumn("newCol", udf($"colFromBig"))
It might provide some performance improvement if the udf is reused.
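For illustration, a sketch of that broadcast-plus-udf pattern (the column name colFromBig comes from the question; the output column and the Set payload are assumptions):
import org.apache.spark.sql.functions.{col, udf}

val small = spark.sparkContext.broadcast(
  smallDF.collect().map(_.getString(0)).toSet) // assumes the key is the first column
// the udf closes over the broadcast handle; executors read small.value locally
val inSmall = udf((v: String) => small.value.contains(v))
bigDF.withColumn("inSmall", inSmall(col("colFromBig")))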

How does the Spark shuffle operation work?

I'm learning Spark for my project and I'm stuck on the shuffle process in Spark. I want to know how this operation works internally. I found some keywords involved in this operation: ShuffleMapStage, ShuffleMapTask, ShuffledRDD, Shuffle Write, Shuffle Read...
My questions are:
1) Why do we need ShuffleMapStage? When is this stage created and how does it work?
2) When is ShuffledRDD's compute method called?
3) What are Shuffle Read and Shuffle Write?
The shuffle operation redistributes data across the workers (a repartition), typically using a hash function on the data's key so that records with the same key land on the same node (the data locality problem).
This operation involves transferring data over the network to reorganise it before an action is performed; reducing the number of shuffle operations increases performance.
Shuffle operations are inserted automatically by Spark between two transformations when executing a final action.
Some Spark transformations need a shuffle (like groupBy, join, sort).
Some Spark operations don't need a shuffle (like union, map, reduce, filter, count).
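A quick way to see where a shuffle happens (a small sketch, not from the original answer) is to inspect an RDD's lineage with toDebugString; reduceByKey introduces a ShuffledRDD and a new stage, while map does not:
val words = sc.parallelize(Seq("a", "b", "a", "c"))
val pairs = words.map(w => (w, 1))    // narrow transformation: no shuffle needed
val counts = pairs.reduceByKey(_ + _) // wide transformation: shuffle boundary here
println(counts.toDebugString)         // prints the lineage, including the ShuffledRDD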

Spark-streaming for task parallelization

I am designing a system with the following flow:
Download feed files (line based) over the network
Parse the elements into objects
Filter invalid / unnecessary objects
Execute blocking IO (HTTP Request) on part of the elements
Save to DB
I have been considering implementing the system using Spark Streaming, mainly for task parallelization, resource management, fault tolerance, etc.
But I am not sure this is the right use-case for spark streaming, as I am not using it only for metrics and data processing.
Also I'm not sure how Spark-streaming handles blocking IO tasks.
Is Spark-streaming suitable for this use-case? Or maybe I should look for another technology/framework?
Spark is, at its heart, a general parallel computing framework. Spark Streaming adds an abstraction to support stream processing using micro-batching.
We can certainly implement such a use case on Spark Streaming.
To 'fan out' the I/O operations, we need to ensure the right degree of parallelism at two levels:
First, distribute the data evenly across partitions:
The initial partitioning of the data will depend on the streaming source used. For this use case, it looks like a custom receiver could be the way to go. After the batch is received, we probably need to use dstream.repartition(n) to repartition to a larger number of partitions, where n roughly matches 2-3x the number of executors allocated to the job.
Spark uses 1 core (configurable) for each task executed, and tasks are executed per partition. This assumes that our task is CPU-intensive and requires a full core. To optimize execution for blocking I/O, we would instead like to multiplex that core across many operations. We do this by operating directly on the partitions and using classical concurrent programming to parallelize our work.
Given the original stream of feedLinesDstream, we could do something like this (in Scala; a Java version would be similar, just considerably more verbose):
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}

val feedLinesDstream = ??? // the original dstream of feed lines
val parsedElements = feedLinesDstream.map(parseLine)
val validElements = parsedElements.filter(isValid _)
val distributedElements = validElements.repartition(n) // n = 2 to 3 x # of executors
// multiplex execution at the level of each partition
val data = distributedElements.mapPartitions { iter =>
  // obtain a thread pool for execution, e.g. a fixed-size executor
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))
  val futures = iter.map(elem => Future(ioOperation(elem))).toList
  // traverse the futures, resulting in a future collection of results
  val res = Future.sequence(futures)
  Await.result(res, timeout).iterator // mapPartitions must return an Iterator
}
data.saveToCassandra(keyspace, table) // requires the spark-cassandra-connector
Is Spark-streaming suitable for this use-case? Or maybe I should look for another technology/framework?
When considering using Spark, you should ask yourself a few questions:
What is the scale of my application in its current state, and where will it grow in the future? (Spark is generally meant for Big Data applications where millions of records may be processed per second.)
What is my preferred language? (Spark applications can be written in Java, Scala, Python, and R.)
What database will I be using? (Technologies like Apache Spark are normally implemented with large DB structures like HBase)
Also I'm not sure how Spark-streaming handles blocking IO tasks.
There is already an answer on Stack Overflow about blocking IO tasks using Spark in Scala. It should give you a start, but to answer that question, yes it is possible.
Lastly, reading the documentation is important; you can find Spark's at spark.apache.org.
