How can I parallelize multiple Datasets in Spark? - apache-spark

I have a Spark 2.1 job where I maintain multiple Dataset objects/RDD's that represent different queries over our underlying Hive/HDFS datastore. I've noticed that if I simply iterate over the List of Datasets, they execute one at a time. Each individual query operates in parallel, but I feel that we are not maximizing our resources by not running the different datasets in parallel as well.
There doesn't seem to be a lot out there regarding doing this, as most questions appear to be around parallelizing a single RDD or Dataset, not parallelizing multiple within the same job.
Is this inadvisable for some reason? Can I just use a executor service, thread pool, or futures to do this?
Thanks!

Yes you can use multithreading in the driver code, but normally this does not increase performance, unless your queries operate on very skewed data and/or cannot be parallelized well enough to fully utilize the resources.
You can do something like that:
val datasets : Seq[Dataset[_]] = ???
datasets
.par // transform to parallel Seq
.foreach(ds => ds.write.saveAsTable(...)

Related

How to parallelize work in spark when working with multiple dataframes?

Is parallelism among multiple dataframe supported in spark?
I have a task which net me 100s of dataframe, and I want to perform transformation and write to each of them.
However union them together seems to be not performant?
Are there any other concurrency primitive that work with multiple dataframes so that it scales horizontally with number of servers?
No, there isn't a primitive to operate on multiple data frames. But there is nothing stopping you from launching a batch script to saturate your cluster. If that what you want to do.
Write the script for 1 data frame, then launch 100s of jobs to do the work.

In spark, is it possible to reuse a DataFrame's execution plan to apply it to different data sources

I have a bit complex pipeline - pyspark which takes 20 minutes to come up with execution plan. Since I have to execute the same pipeline multiple times with different data frame (as source) Im wondering is there any option for me to avoid building execution plan every time? Build execution plan once and reuse it with different source data?`
There is a way to do what you ask but it requires advanced understanding of Spark internals. Spark plans are simply trees of objects. These trees are constantly transformed by Spark. They can be "tapped" and transformed "outside" of Spark. There is a lot of devil in the details and thus I do not recommend this approach unless you have a severe need for it.
Before you go there, it important to look at other options, such as:
Understanding what exactly is causing the delay. On some managed planforms, e.g., Databricks, plans are logged in JSON for analysis/debugging purposes. We sometimes seen delays of 30+ mins with CPU pegged at 100% on a single core while a plan produces tens of megabytes of JSON and pushes them on the wire. Make sure something like this is not happening in your case.
Depending on your workflow, if you have to do this with many datasources at the same time, use driver-side parallelism to analyze/optimize plans using many cores at the same time. This will also increase your cluster utilization if your jobs have any skew in the reduce phases of processing.
Investigate the benefit of Spark's analysis/optimization to see if you can introduce analysis barriers to speed up transformations.
This is impossible because the source DataFrame affects the execution of the optimizations applied to the plan.
As #EnzoBnl pointed out, this is not possible as Tungsten applies optimisations specific to the object. What you could do instead (if possible with your data) is to split your large file into smaller chunks that could be shared between the multiple input dataframes and use persist() or checkpoint() on them.
Specifically checkpoint makes the execution plan shorter by storing a mid-point, but there is no way to reuse.
See
Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains.

How to achieve vertical parallelism in spark?

Is it possible to run multiple calculations in parallel using spark?
Example cases that could benefit from that:
running column-wise tasks for large columns. Applying StringIndexer to 10K columns can benefit from having calculation only on a single worker and having as many workers working on single columns as possible.
running numerous atomic tasks for small datasets. For example:
for in_path, out_path in long_ds_list:
spark.read(in_path).select('column').distinct().write(out_path)
The closest equivalents I can think of would be SparkR.lapply() or .Net Parallel.ForEach(), but for a cluster environment rather than simpler multi-threading case.
I'd say that Spark is good at scheduling distributed computing tasks and could handle your cases with ease, but you'd have to develop their solutions yourself. I'm not saying it'd take ages, but would require quite a lot of effort since it's below the developer-facing API in Spark SQL, Spark MLlib, Structured Streaming and such.
You'd have to use Spark Core API and create a custom RDD that would know how to describe such computations.
Let's discuss the first idea.
running column-wise tasks for large columns. Applying StringIndexer to 10K columns can benefit from having calculation only on a single worker and having as many workers working on single columns as possible.
"column-wise tasks for large columns" seems to suggest that you think about Spark SQL's DataFrames and Spark MLlib's StringIndexer Transformer. They are higher-level APIs that don't offer such features. You're not supposed to deal with the problem using them. It's an optimization feature so you have to go deeper into Spark.
I think you'd have to rewrite the higher-level APIs in Spark SQL and Spark MLlib to use your own optimized custom code where you'd have the feature implemented.
Same with the other requirement, but this time you'd have to be concerned with Spark SQL only (leaving Spark MLlib aside).
Wrapping up, I think both are possible with some development (i.e. not available today).

which is faster in spark, collect() or toLocalIterator()

I have a spark application in which I need to get the data from executors to driver and I am using collect(). However, I also came across toLocalIterator(). As far as I have read about toLocalIterator() on Internet, it returns an iterator rather than sending whole RDD instantly, so it has better memory performance, but what about speed? How is the performance between collect() and toLocalIterator() when it comes to execution/computation time?
The answer to this question depends on what would you do after making df.collect() and df.rdd.toLocalIterator(). For example, if you are processing a considerably big file about 7M rows and for each of the records in there, after doing all the required transformations, you needed to iterate over each of the records in the DataFrame and make a service calls in batches of 100.
In the case of df.collect(), it will dumping the entire set of records to the driver, so the driver will need an enormous amount of memory. Where as in the case of toLocalIterator(), it will only return an iterator over a partition of the total records, hence the driver does not need to have enormous amount of memory. So if you are going to load such big files in parallel workflows inside the same cluster, df.collect() will cause you a lot of expense, where as toLocalIterator() will not and it will be faster and reliable as well.
On the other hand if you plan on doing some transformations after df.collect() or df.rdd.toLocalIterator(), then df.collect() will be faster.
Also if your file size is so small that Spark's default partitioning logic does not break it down into partitions at all then df.collect() will be more faster.
To quote from the documentation on toLocalIterator():
This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.
It means that in the worst case scenario (no caching at all) it can be n-partitions times more expensive than collect. Even if data is cached, the overhead of starting multiple Spark jobs can be significant on large datasets. However lower memory footprint can partially compensate that, depending on a particular configuration.
Overall, both methods are inefficient and should be avoided on large datasets.
As for the toLocalIterator, it is used to collect the data from the RDD scattered around your cluster into one only node, the one from which the program is running, and do something with all the data in the same node. It is similar to the collect method, but instead of returning a List it will return an Iterator.
So, after applying a function to an RDD using foreach you can call toLocalIterator to get an iterator to all the contents of the RDD and process it. However, bear in mind that if your RDD is very big, you may have memory issues. If you want to transform it to an RDD again after doing the operations you need, use the SparkContext to parallelize it.

Parallel writing from same DataFrame in Spark

Let's say I have a DataFrame in Spark and I need to write the results of it to two databases, where one stores the original data frame but the other stores a slightly modified version (e.g. drops some columns). Since both operations can take a few moments, is it possible/advisable to run these operations in parallel or will that cause problems because Spark is working on the same object in parallel?
import java.util.concurrent.Executors
import scala.concurrent._
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))
def write1(){
//your save statement for first dataframe
}
def write2(){
//your save statement for second dataframe
}
def writeAllTables() {
Future{ write1()}
Future{ write2()}
}
Let me ask you, do you really need to do it? If you are not sure, then, you most probably don't.
So, lets assume below scenario similar to one you explained:
val df1 = spark.read.csv('someFile.csv') // Original Dataframe
val df2 = df1.withColumn("newColumn", concat(col("oldColumn"), lit(" is blah!"))) // Modified Dataframe, FYI, this df2 is a different object
df1.write('db_loc1') // Write to DB1, already parallelised & uses spark resources optimally
df2.write('db_loc2') // Write to DB2, already parallelised & uses spark resources optimally
Spark scheduler divides the first DataFrame df1 into partitions and writes them in parallel in db_loc1.
It picks up the second DataFrame df2 and again breaks it into partitions and writes these partitions in parallel in db_loc2.
By default, the degree of parallelisation per write is speculated in order to optimally use available cluster resources.
Small writes might not be repartitioned as mostly the write time is low and repartitioning will only increase overhead. In a extraordinary case where you have a lot of small writes, it might make a good case for trying to parallelise these writes. But, the best way to do so is to redesign your code to run one spark job per DataFrame instead of trying to parallelise DataFrame.write() call in same driver program.
Large writes will probably use all available resources in parallel during the single DataFrame write itself. Hence, if spark allowed issuing another write operation for a different DataFrame at the same time, it would only delay both operations as now they are racing with each other for resources. Not to mention, there may be some performance slowdown due to increased overhead because of sheer increase in number of tasks that spark now needs to manage and track.
Also, You can read this answer to and learn more about this

Resources