Why doesn't Spark enable performance configurations by default? - apache-spark

I was reading about some Spark optimization techniques and found some configurations that we need to enable, such as:
spark.conf.set("spark.sql.cbo.enabled", true)
spark.conf.set("spark.sql.adaptive.enabled",true)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled",true)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled",true)
Can I enable these for all my Spark jobs, even if I don't need them? What are the downsides of including them? And why doesn't Spark provide this performance by default? When should I use what?

Spark does not turn on these features by default because they carry a little more risk than leaving them off. To keep the platform as stable as possible, they are not enabled out of the box.
One thing that is called out, including by Databricks, is that the CBO relies heavily on table statistics, so you need to refresh those statistics regularly whenever your data changes significantly. I have hit edge cases where I had to disable the CBO for my queries to complete. (I believe that was related to a badly calculated map-side join.)
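For example, statistics are refreshed with ANALYZE TABLE. A minimal sketch, assuming a hypothetical table named sales with the columns shown:
// Run after the data changes significantly so the CBO sees fresh table-level stats
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
// Column-level stats (distinct counts, min/max, null counts) drive join and filter estimates
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, order_date")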
The same is true of spark.sql.adaptive.skewJoin.enabled: it only helps if you actually have skew and the statistics are up to date. With out-of-date stats it could make your query take longer.
spark.sql.adaptive.coalescePartitions.enabled also looks great, but it is best used for specific kinds of performance tuning. There are knobs and levers here (and for skew joins) that can be used to drive better performance; see the sketch below.
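Roughly, the knobs look like this (the values are illustrative, and defaults vary by Spark version):
// Target partition size AQE aims for when coalescing or when splitting skewed partitions
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
// Number of shuffle partitions before coalescing (if unset, equals spark.sql.shuffle.partitions)
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "400")
// A partition is treated as skewed if it is more than this many times the median size...
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
// ...and also larger than this absolute threshold
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")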
These settings are generally helpful, but they may actually cover up a problem that you want to be aware of. Yes, they are useful; yes, you should use them. Perhaps you should leave them off until you need them. Often you get better performance by tuning the algorithm of your Spark job, by understanding it and what it is doing. If you turn all of this on by default, you may not develop as deep an understanding of the implications of your choices.
(Java and Python do not force you to manage memory. The implications of what you use and its effect on performance are frequently learned the hard way, through a performance issue that sneaks up on new developers.) This is a similar lesson, but slightly more sinister: now that there are switches to auto-fix your bad queries, will you really learn to be an expert without understanding their value?
TLDR: Don't turn these on until you need them, or turn them on when you need to do something quick and dirty.
I hope this helps your understanding.

Related

Are there any downsides to Spark Adaptive Query Execution (AQE)?

I am familiar with what AQE is and a lot of the major benefits of using it. What I can't seem to find is a discussion of the downsides (if any). Since it is disabled by default, I figure there might be some reasons why you would NOT want to enable AQE. Any thoughts on this?

What tools are available to tune chunk_length_kb?

How should I know whether I need to tweak chunk_length_kb for a table? What tools do I have at hand to find a better chunk_length_kb?
That's one of the least frequently tuned variables - the best way to tune it is based on benchmarking with your actual data (as different data will behave in different ways). The defaults are fairly sane, though - you may see a few percent improvement by moving either up or down, but it's unlikely to be major.

LINQ vs PLINQ: When does overhead outweigh benefits?

I am working on projects that are basically e-commerce type. Our architect has instructions from the client to use PLINQ, as it is supposedly much more beneficial than LINQ: it works in parallel and uses all the cores of the processor, resulting in quicker responses. The client's suggestion is PLINQ + Repository if possible.
So I just want to know which one is good to follow in a small or medium app. Is it feasible to use PLINQ + Repository? From my findings, PLINQ has more overhead than LINQ if we are not handling things properly. Please help me.
It is impossible to answer this question without knowing far more details about your application. PLINQ has overhead to fan out the workload to worker threads and then coordinate the work amongst them. If you are processing hundreds of thousands of entities and have a meaningful amount of work to do for each one, then yes it can benefit. In the end, the only way to really know if PLINQ will benefit you is to profile using a realistic data set.
When a for loop has a small body, it might perform more slowly than the equivalent sequential loop. Slower performance is caused by the overhead involved in partitioning the data and the cost of invoking a delegate on each loop iteration. To address such scenarios, the Partitioner class provides the Partitioner.Create method, which enables you to provide a sequential loop for the delegate body, so that the delegate is invoked only once per partition, instead of once per iteration.
The quote above is from the Microsoft parallel-programming documentation; the same guidance applies to PLINQ (see the rough sketch below).
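To make the partitioning point concrete, here is a rough analogue in Scala parallel collections rather than PLINQ (the data and chunk size are made up, purely illustrative): when the per-element work is tiny, chunking the input so that each parallel task runs a plain sequential loop amortizes the per-element dispatch overhead.
// Needs the scala-parallel-collections module on Scala 2.13+; on 2.12 drop this import, .par is built in
import scala.collection.parallel.CollectionConverters._

val data = (1 to 1000000).toVector

// Naive: one tiny closure invocation per element, so coordination overhead dominates
val naive = data.par.map(x => x.toLong * 2).sum

// Chunked: each parallel task processes a whole block sequentially
val chunked = data.grouped(10000).toVector.par
  .map(chunk => chunk.map(x => x.toLong * 2).sum)
  .sum
Whether the chunked version actually wins still depends on the workload and core count, which is why profiling with realistic data is the only real answer.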
LOL.
He who pays the piper calls the tune
Generally, though, this is an engineering issue. Talk to the architects and clients and work out what metrics they will be using to measure the performance of the deliverables.
Then, using those metrics, find the optimal solution (LINQ, PLINQ, or other) and report back your findings.
In the main, all technologies are good for something, and the size of an app is measured in different ways, so your term 'small' is meaningless.

Examples of simple stats calculation with Hadoop

I want to extend an existing clustering algorithm to cope with very large data sets and have redesigned it in such a way that it is now computable with partitions of data, which opens the door to parallel processing. I have been looking at Hadoop and Pig and I figured that a good practical place to start was to compute basic stats on my data, i.e. arithmetic mean and variance.
I've been googling for a while, but maybe I'm not using the right keywords and I haven't really found anything which is a good primer for doing this sort of calculation, so I thought I would ask here.
Can anyone point me to some good samples of how to calculate mean and variance using Hadoop, and/or provide some sample code?
Thanks
Pig Latin has an associated library of reusable code called PiggyBank that has numerous handy functions. Unfortunately, it didn't have variance last time I checked, but maybe that has changed. If nothing else, it might provide examples to get you started on your own implementation.
I should note that variance is difficult to implement in a stable way over huge data sets, so take care!
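For reference, the usual partition-friendly way to do this stably is to keep a (count, mean, M2) summary per partition and merge the partials, which maps directly onto a combiner/reducer. A minimal Scala sketch of the idea (the names are illustrative, not from PiggyBank):
// Per-partition summary: count, mean, and M2 = sum of squared deviations from the mean
case class Moments(n: Long, mean: Double, m2: Double) {
  def variance: Double = if (n > 1) m2 / (n - 1) else 0.0
}

object Moments {
  val empty = Moments(0L, 0.0, 0.0)

  // Fold one value into a running summary (Welford's update) - what a mapper would do
  def add(s: Moments, x: Double): Moments = {
    val n = s.n + 1
    val delta = x - s.mean
    val mean = s.mean + delta / n
    Moments(n, mean, s.m2 + delta * (x - mean))
  }

  // Merge two partial summaries without losing precision - what a combiner/reducer would do
  def merge(a: Moments, b: Moments): Moments =
    if (a.n == 0) b
    else if (b.n == 0) a
    else {
      val n = a.n + b.n
      val delta = b.mean - a.mean
      val mean = a.mean + delta * b.n / n
      Moments(n, mean, a.m2 + b.m2 + delta * delta * a.n * b.n / n)
    }
}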
You might double-check and see if your clustering code can drop into Cascading. It's quite trivial to add new functions, do joins, etc. with your existing Java libraries.
http://www.cascading.org/
And if you are into Clojure, you might watch these GitHub projects:
http://github.com/clj-sys
They are layering new algorithms implemented in Clojure over Cascading (which in turn is layered over Hadoop MapReduce).

Server-side language for a CPU/memory-intensive process

What's a good server-side language for doing some pretty CPU- and memory-intensive things that plays well with PHP and MySQL? Currently, I have a PHP script which runs some calculations based on a large subset of a fairly large database and then updates that database based on those calculations (1.5 million rows). The current implementation is very slow, taking 1-2 hours depending on other activity on the server. I was hoping to improve this and was wondering what people's opinions are on a good language for this type of task.
Where is the bottleneck? Run some real profiling and see what exactly is causing the problem. Is it the DB I/O? Is it the CPU? Is the algorithm inefficient? Are you calling slow library methods in a tight inner loop? Could precalculation be used?
You're pretty much asking what vehicle you need to get from point A to point B, and you've offered a truck, car, bicycle, airplane, jet, and helicopter. The answer won't make sense without more context.
The language isn't the issue; your issue is probably where you are doing these calculations. It sounds like you may be better off writing this in SQL, if possible. Is it? What are you doing?
I suspect your bottleneck is not the computation. Just updating a few million records can easily take several hours on its own.
If that's the case, you can write a custom function in C/C++ for MySQL and execute it from a stored procedure.
We do this in our database to re-encrypt some sensitive fields during key rotation. It shrank key-rotation time from days to hours. However, it's a pain to maintain your own copy of MySQL. We have been looking for alternatives, but nothing comes close to the performance of this approach.
