How can I save time when using a reduce operation? - apache-spark

I use a reduce operation to process data on a cluster, but I find it takes too much time. The type of the RDD being reduced is:
RDD[(Array[Array[Double]], Array[Array[Double]], Array[Double], Array[Double])]
Question 1: if the RDD had a simpler type, such as RDD[Array[Double]], would the reduce take less time?
Question 2: is there any other way for me to save time when using rdd.reduce?

The reduceByKey method takes two arguments: reduceByKey(func, [numTasks]). The first is the function, i.e. whatever operation you want to perform; the second is the number of tasks, which is optional.
Set the optimal number of tasks depending on your machine configuration.
For example:
rdd.reduceByKey(func, 4)
This will run the reduce stage as four parallel tasks, which can make it considerably faster than running it with too few tasks.
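For instance, a minimal Scala sketch of reduceByKey with an explicit task count; the word-count data is an illustrative stand-in, not the asker's RDD:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reduce-sketch").getOrCreate()
val sc = spark.sparkContext

// Hypothetical keyed data, just to show the numTasks argument.
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

// The second argument sets the number of reduce tasks for this stage.
val counts = pairs.reduceByKey((x, y) => x + y, 4)
counts.collect().foreach(println)

If the data cannot be keyed and you must call rdd.reduce directly, treeReduce (not mentioned above) is another option worth trying: it combines partial results in stages on the executors instead of merging every partition's result on the driver at once.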

Related

Spark with divide and conquer

I'm learning Spark and trying to process some huge dataset. I don't understand why I don't see a decrease in stage completion times with the following strategy (pseudocode):
data = sc.textFile(dataset).cache()
while True:
    data.count()
    y = data.map(...).reduce(...)
    data = data.filter(lambda x: x < y).persist()
So the idea is to pick y so that it roughly halves the data most of the time. But for some reason it looks like all the data is processed again on each count().
Is this some kind of anti-pattern? How am I supposed to do this with Spark?
Yes, that is an anti-pattern.
map, like most (but not all) of the distributed primitives in Spark, is pretty much by definition a divide and conquer approach. You take the data, compute splits, and transparently distribute the computation of the individual splits over the cluster.
Trying to divide this process further using the high-level API makes no sense at all. At best it will provide no benefit; at worst it will incur the cost of multiple data scans, caching and spills.
Spark is lazily evaluated, so in the for or while loop above, each call to data.filter does not actually return the data; it returns a Spark computation to be executed later. All these calls are aggregated and then executed together when you finally trigger an action.
In particular, results remain unevaluated and merely represented until a Spark action gets called. Past a certain point the application can't handle that many parallel tasks.
In a way we're running into a conflict between two different representations: conventional structured coding, with its implicit (or at least implied) execution patterns, versus independent, distributed, lazily evaluated Spark representations.
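To make the lazy-evaluation point concrete, a minimal Scala sketch (Scala rather than the asker's pseudocode; the data and threshold are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lazy-eval-sketch").getOrCreate()
val sc = spark.sparkContext

val data = sc.parallelize(1 to 1000000).map(_.toDouble)   // transformation: no job runs yet
val filtered = data.filter(_ < 500000.0)                  // transformation: still no job

// Only the action triggers a job, and it evaluates the whole lineage
// (parallelize -> map -> filter) unless an upstream RDD was persisted and already materialized.
println(filtered.count())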

Spark streaming - Does reduceByKeyAndWindow() use constant memory?

I'm playing with the idea of having long-running aggregations (possibly a one-day window). I realize other answers on this site say that you should use batch processing for this.
I'm specifically interested in understanding this function, though. It sounds like it would use constant space to do an aggregation over the window, one interval at a time. If that is true, a day-long aggregation sounds possible and viable (especially since it uses checkpointing in case of failure).
Does anyone know if this is the case?
This function is documented at https://spark.apache.org/docs/2.1.0/streaming-programming-guide.html:
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
After researching this on the MapR forums, it seems that it would definitely use a constant level of memory, making a daily window possible assuming you can fit one day of data in your allocated resources.
The two downsides are that:
Doing a daily aggregation may only take 20 minutes. Doing a window over a day means that you're using all those cluster resources permanently rather than just for 20 minutes a day. So, stand-alone batch aggregations are far more resource efficient.
It's hard to deal with late data when you're streaming over exactly one day. If your data is tagged with dates, then you need to wait until all of your data has arrived. A one-day window in streaming would only be good if you were literally just analysing the last 24 hours of data regardless of its content.
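For reference, a minimal sketch of such an incremental day-long window, assuming word counts over a socket stream; the host, port, batch interval and checkpoint path are illustrative assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val conf = new SparkConf().setAppName("daily-window-sketch")
val ssc = new StreamingContext(conf, Minutes(5))        // 5-minute batch interval
ssc.checkpoint("hdfs:///tmp/daily-window-checkpoint")   // required for the inverse-reduce path

val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1L))

// Add counts entering the 24-hour window and "inverse reduce" (subtract) counts leaving it,
// so each slide touches only one batch of new data plus one batch of expired data.
val daily = counts.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,   // reduce new data
  (a: Long, b: Long) => a - b,   // inverse reduce expired data
  Minutes(24 * 60),              // window length: one day
  Minutes(5)                     // slide interval: every batch
)

daily.print()
ssc.start()
ssc.awaitTermination()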

How to deal with strongly varying data sizes in spark

I'm wondering about the best practice in designing Spark jobs where the volume of data is not known in advance (or varies strongly). In my case, the application should handle both initial loads and, later on, incremental data.
I wonder how I should set the number of partitions in my data (e.g. using repartition or setting parameters like spark.sql.shuffle.partitions) in order to avoid OOM exceptions in the executors (given a fixed amount of allocated memory per executor). I could:
1. Define a very high number of partitions to make sure that even on very heavy workloads the job does not fail
2. Set the number of partitions at runtime depending on the size of the source data
3. Introduce an iteration over independent chunks of data (i.e. looping)
With all of these options I see issues:
1: I imagine this to be inefficient for small data sizes, as tasks get very small
2: It needs additional queries (e.g. count), and for setting spark.sql.shuffle.partitions the SparkContext would need to be restarted, which I would like to avoid
3: It seems to contradict the spirit of Spark
So I wonder what the most efficient strategy is for strongly varying data volumes.
EDIT:
I was wrong about setting spark.sql.shuffle.partitions; it can be set at runtime without restarting the Spark context.
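For example (the value 400 is an arbitrary illustration):

spark.conf.set("spark.sql.shuffle.partitions", "400")   // takes effect for subsequent shuffles in this session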
Do not set a high number of partitions without knowing this is needed. You will absolutely kill the performance of your job.
Yes
As you said, don't loop!
As you mention, this introduces an extra step, counting your data, which at first glance seems wrong. However, you shouldn't think of it as mis-spent computation. Usually, the time it takes to count your data is significantly less than the time it would take to do the further processing if you partition the data badly. Think of the count operation as an investment; it's certainly worth it.
You do not need to set partitions through the config and restart Spark. Instead, do the following:
Note current number of partitions for RDD / Dataframe / Dataset
Count number of entries / rows in your data
Based on an estimate of average row size, compute the target number of partitions
If #targetPartitions << #actualPartitions, then coalesce
Else if #targetPartitions >> #actualPartitions, then repartition
Otherwise (#targetPartitions ~= #actualPartitions), do nothing
The coalesce operation will re-partition your data without shuffling, and so is much more efficient when it is available.
Ideally you can estimate the number of rows you will generate, rather than counting them. Also, you will need to think carefully about when it is appropriate to perform this operation. With a long RDD lineage you can kill performance, because you may inadvertently reduce the number of cores that can execute complex code, due to lazy execution. Look into checkpointing to mitigate this problem.
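Putting those steps together, a minimal sketch; the 128 MB per-partition target, the 2x tolerance and the helper name resizePartitions are illustrative assumptions:

import org.apache.spark.sql.DataFrame

// Resize partitions based on a counted row total and an estimated average row size.
def resizePartitions(df: DataFrame, estimatedBytesPerRow: Long,
                     targetBytesPerPartition: Long = 128L * 1024 * 1024): DataFrame = {
  val actualPartitions = df.rdd.getNumPartitions   // step 1: current partition count
  val rowCount = df.count()                        // step 2: the "investment" count
  val targetPartitions =                           // step 3: target partitions from estimated row size
    math.max(1L, (rowCount * estimatedBytesPerRow) / targetBytesPerPartition).toInt

  if (targetPartitions * 2 < actualPartitions) df.coalesce(targetPartitions)         // shrink without a shuffle
  else if (targetPartitions > actualPartitions * 2) df.repartition(targetPartitions) // grow with a shuffle
  else df                                                                             // close enough: do nothing
}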

In spark, how to estimate the number of elements in a dataframe quickly

In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, faster than Dataset.count()?
Maybe we could calculate this information from the number of partitions of the DataSet, could we ?
You could try to use countApprox on the RDD API. Although this also launches a Spark job, it should be faster, as it just gives you an estimate of the true count for a given amount of time you want to spend (milliseconds) and a confidence interval (i.e. the probability that the true value is within that range):
Example usage:
val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
val (lowCnt, highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)
You have to play a bit with the parameters timeout and confidence. The higher the timeout, the more accurate the estimated count.
If you have a truly enormous number of records, you can get an approximate count using something like HyperLogLog and this might be faster than count(). However you won't be able to get any result without kicking off a job.
When using Spark there are two kinds of RDD operations: transformations and actions. Roughly speaking, transformations take an RDD and return a new RDD. Actions calculate or generate some result. Transformations are lazily evaluated, so they don't kick off a job until an action is called at the end of a sequence of transformations.
Because Spark is a distributed batch programming framework, there is a lot of overhead for running jobs. If you need something that feels more like "real time" whatever that means, either use basic Scala (or Python) if your data is small enough, or move to a streaming approach and do something like update a counter as new records flow through.
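If an approximate distinct count (rather than a row count) would do, Spark's built-in HyperLogLog-based aggregate is approx_count_distinct; a minimal sketch, assuming the df from above has a roughly unique "id" column (an assumption about your data):

import org.apache.spark.sql.functions.approx_count_distinct

// HyperLogLog++-based estimate; this still launches a job, as noted above.
val approx = df.agg(approx_count_distinct("id", rsd = 0.05)).first().getLong(0)
println(s"Approximately $approx distinct ids")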

Limit max parallelism for a single RDD without decreasing the number of partitions

Is it possible to limit the max number of concurrent tasks at the RDD level without changing the actual number of partitions? The use case is to not overwhelm a database with too many concurrent connections without reducing the number of partitions. Reducing the number of partitions causes each partition to become larger and eventually unmanageable.
I'm re-posting this as an "answer" because I think it may be the least-dirty hack that might get the behavior you want:
Use a mapPartitions(...) call, and at the beginning of the mapping function do some kind of blocking check against a globally visible piece of state (a REST call, maybe?) that only allows some maximum number of checks to succeed at any given time. Since that will delay the full RDD operation, you may need to increase the timeout on RDD completion to prevent an error.
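A very rough sketch of that hack; the throttle-service URLs are hypothetical placeholders for whatever globally visible state you actually run, and writeRow stands in for the real database write:

import org.apache.spark.rdd.RDD

def acquirePermit(): Boolean =
  scala.io.Source.fromURL("http://throttle-service:8080/acquire").mkString.trim == "ok"

def releasePermit(): Unit = {
  scala.io.Source.fromURL("http://throttle-service:8080/release")
  ()
}

def writeRow(row: String): Unit = ()   // placeholder for the real per-row database write

def throttledWrite(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitions { rows =>
    while (!acquirePermit()) Thread.sleep(1000)                 // block until a connection slot is free
    try {
      val written = rows.map { r => writeRow(r); r }.toVector   // force the writes while holding the permit
      written.iterator
    } finally releasePermit()
  }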
The primary significance of partitioning in Spark is to provide parallelism, and your requirement is to reduce parallelism! But the requirement is genuine :)
What is the real problem with a lower number of partitions? Is writing too much data at once creating the problem? If that is the case, you could break down the per-partition writes.
Could you put the data in some intermediate queue and process it in a controlled manner?
One approach might be to enable dynamic allocation, and set the maximum number of executors to your desired maximum parallelism.
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.maxExecutors <maximum>
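If you prefer to set these when building the session, a minimal sketch; the cap of 8 executors and the app name are illustrative assumptions, and the shuffle-service line reflects the usual requirement alongside dynamic allocation:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Cap parallelism by capping executors; illustrative values, not recommendations.
val conf = new SparkConf()
  .setAppName("bounded-parallelism-sketch")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")       // usually required with dynamic allocation
  .set("spark.dynamicAllocation.maxExecutors", "8")   // at most 8 executors' worth of concurrent tasks

val spark = SparkSession.builder().config(conf).getOrCreate()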
Configuring dynamic allocation is described in more detail here:
https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
https://spark.apache.org/docs/latest/configuration.html#scheduling
If you are trying to control one specific computation, you could experiment with programmatically controlling the number of executors:
https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-sparkcontext.adoc#dynamic-allocation
