Are there any downsides to Spark Adaptive Query Execution (AQE)? - apache-spark

I am familiar with what AQE is and a lot of the major benefits to using it. What I can't seem to find is a discussion of the downsides (if any). Since it is disabled by default I figure there might be some reasons why you would NOT want to enable AQE. Any thoughts on this?

Related

Why doesn't spark add performance configurations by default?

I was reading about Spark optimization techniques and found some configurations that we need to enable, such as
spark.conf.set("spark.sql.cbo.enabled", true)
spark.conf.set("spark.sql.adaptive.enabled",true)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled",true)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled",true)
Can I enable these for all my Spark jobs, even if I don't need them? What are the downsides of including them? Why doesn't Spark provide this performance by default? When should I use which?
Spark does not turn these features on by default because they carry a little more risk than leaving them off. Keeping them disabled by default gives the most stable platform.
One thing that is called out, including by Databricks, is that the CBO relies heavily on table statistics, so you need to refresh them regularly whenever your table data changes significantly. I have hit edge cases where I had to turn CBO off for my queries to complete. (I believe this was related to a badly calculated map-side join.)
The same is true of spark.sql.adaptive.skewJoin.enabled. This only helps if the table stats are up to date and you have skew. It could make your query take longer with out-of-date stats.
spark.sql.adaptive.coalescePartitions.enabled also looks great but should be used for specific types of performance tuning. There are knobs and levers here that could be used to drive better performance.
These settings are helpful in general, but they might actually cover up a problem that you would want to be aware of. Yes, they are useful; yes, you should use them. Perhaps you should leave them off until you need them. Often you get better performance by tuning the algorithm of your Spark job, which means understanding it and what it is doing. If you turn all this on by default, you may not develop as in-depth an understanding of the implications of your choices.
(Java/Python do not force you to manage memory. This lack of understanding of the implications of what you use and its effect on performance is frequently learned the hard way, with a performance issue that sneaks up on new developers.) This is a similar lesson but slightly more sinister, as now there are switches to auto-fix your bad queries. Will you really learn to be an expert without understanding their value?
TLDR: Don't turn these on until you need them, or turn them on when you need to do something quick and dirty.
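One way to act on this advice is to leave the cluster defaults alone and pass the flags only to the job that needs them. Below is a minimal sketch of that idea, assuming the job is launched with spark-submit and its `--conf key=value` arguments. The class `PerJobSparkConf` is a hypothetical helper written for this illustration, not part of Spark; only the configuration keys come from the question above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Spark class): collects optimizer flags for ONE job
// and renders them as spark-submit arguments, so shared defaults stay untouched.
class PerJobSparkConf {
    // LinkedHashMap keeps the flags in the order they were enabled.
    private final Map<String, String> settings = new LinkedHashMap<>();

    // Marks a boolean Spark setting as enabled for this job only.
    PerJobSparkConf enable(String key) {
        settings.put(key, "true");
        return this;
    }

    // Each entry becomes the pair "--conf", "key=value".
    List<String> toSubmitArgs() {
        List<String> submitArgs = new ArrayList<>();
        for (Map.Entry<String, String> entry : settings.entrySet()) {
            submitArgs.add("--conf");
            submitArgs.add(entry.getKey() + "=" + entry.getValue());
        }
        return submitArgs;
    }

    public static void main(String[] unused) {
        // Only the job with a known skew problem gets the adaptive flags.
        PerJobSparkConf skewedJob = new PerJobSparkConf()
            .enable("spark.sql.adaptive.enabled")
            .enable("spark.sql.adaptive.skewJoin.enabled");
        System.out.println(String.join(" ", skewedJob.toSubmitArgs()));
    }
}
```

Keeping the flags next to the job that needs them also documents why they were turned on, which is harder to reconstruct from a global default.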
I hope this helps your understanding.

Quantify performance gain when using Java instead of SSJS

When developing XPages applications it seems to have become very popular to mainly use Java methods and beans instead of server-side JavaScript (SSJS). SSJS of course takes longer to execute because the code has to be evaluated at runtime. However, can anyone provide information about the QUANTITATIVE gain in performance when using Java? Are there any benchmarks for how much the execution times differ, for example depending on the length of the SSJS code or the functions used?
You have to run your own benchmarks. The difference in time might not be measurable. The choice is more about capabilities and your development process. If you switch from SSJS to Java expecting an instant increase in performance, it most likely won't happen.
Unless, of course, Java allows you to code things differently. So most of the decisions are based on capabilities, not speed. You are most welcome to run some tests and share the insights. What you can expect, e.g. when opening a document in SSJS vs. Java: the difference should be in the space of a rounding error, since most of the time is needed for the C call below.
SSJS and Java run at almost the same speed after the SSJS has been evaluated, so you have some onramp time and similar speed thereafter.
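For anyone who does want to run their own benchmark, here is a minimal timing sketch in plain Java. It is an illustration under assumptions, not a rigorous harness: `MicroBench` and `averageNanos` are names made up for this example, and the task being timed is a placeholder for whatever Java code (or call into your application) you want to measure. The warm-up runs are there to keep the "onramp" cost out of the measured average.

```java
// Hypothetical micro-benchmark helper; names are invented for this sketch.
class MicroBench {
    // Runs the task warmupRuns times unmeasured, then measuredRuns times measured,
    // and returns the average wall-clock time per measured call in nanoseconds.
    static double averageNanos(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // warm-up: first-run and JIT costs land here
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / measuredRuns;
    }

    public static void main(String[] unused) {
        StringBuilder buffer = new StringBuilder();
        // Placeholder workload; replace with the code you actually care about.
        Runnable task = () -> {
            buffer.setLength(0);
            buffer.append("doc-").append(42);
        };
        double average = averageNanos(task, 1_000, 10_000);
        System.out.printf("average: %.1f ns per call%n", average);
    }
}
```

Numbers from a loop like this are only rough indications. If the difference you measure is within the noise between repeated runs, that in itself supports the "rounding error" expectation above.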
I agree about the performance gain being negligible. I will chime in to say this. Right now I am trying to learn to support an existing XPages application written without using any Java, and entirely in SSJS. There is code here, there, and everywhere. It is very hard to follow.
Depending on your environment, you should consider programmer productivity when considering how to build your applications, especially when you know both. Productivity for you, and those coming after you.
Stephan's answer is right on point: though Java as a language IS faster (you'd probably see performance gains proportional to the complexity of the block of code more than the number of operations running), the primary benefit is program structure. My experience has been that using Java extensively makes my code much cleaner, easier to debug, and MUCH easier to understand after coming back to it months later.
One of the nice side effects of this structural change does happen to be performance, but not because of anything inherent to Java: by focusing on classes and getters/setters, it makes it easier to really pay attention to expensive operations and caching. While you CAN cache your data excellently in SSJS using the various scopes, it's easier for your brain - both now and after you've forgotten what you did next year - to think about that sort of thing in Java.
Personally, even if Java executed more slowly than SSJS but the programming models in XPages were the same as they are now, I would still use Java primarily.
You are asking about pure processing performance: the speed of the computer running the code. As Stephan stated, Java is going to be a "little" faster because it doesn't need the extra step of parsing the code from a string first. OK, in the big picture that's really not a big deal.
I think the real "performance" gain that you get by moving to Java in XPages is cleaner code with more capabilities. Yes, you're putting a lot of code in SSJS libraries, and that can work really well. But I assume those are individual functions that you use over and over rather than true objects that you can put in memory so that they're there when you need them. When you get your core business logic inside Java objects, in my experience development goes significantly faster. It's not even close.
Take the Domino document object. That's a rather handy object. Imagine if it wasn't an "object" but simply a library of 50 or so functions that you need to first paste into each database. Doesn't seem right. And of course in the Domino API it's not just the domino object. There's like 60 or so different objects!
Typical XPages development with Java moves much (not all, but much) of the code away from the .xsp page and into Java classes, which are very similar to custom classes in LotusScript. This not only creates separation from the frontend code, making the .xsp pages easier to work with, but also puts the business logic inside Java, which is similar to working with the Domino backend objects. So the backend gets easier to work with, maintain and add onto.
And that's where a big part of the development speed improvements come from.
Getting back to your original question, which is about computer speed: I would suggest that it's much easier to cache frequently used data via Java objects and managed beans than it is with SSJS. Not having to hit the disk as much would be a real speed advantage.
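To make the caching point concrete, here is a minimal sketch of a cache object in plain Java. Everything in it is an assumption for illustration: `LookupCache` is a made-up class, and the loader function stands in for an expensive backend read such as a document or view lookup. A real XPages managed bean would normally need a public no-argument constructor (and be serializable for some scopes); the loader is passed in here only to keep the sketch runnable outside Domino.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical cache object: keeps values in memory so the expensive
// lookup runs at most once per key.
class LookupCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader;
    private final AtomicInteger loads = new AtomicInteger();

    // The loader is a stand-in for the expensive backend read.
    LookupCache(Function<String, String> loader) {
        this.loader = loader;
    }

    // Returns the cached value, calling the loader only on a cache miss.
    String get(String key) {
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    // Number of times the backend was actually hit.
    int loadCount() {
        return loads.get();
    }

    public static void main(String[] unused) {
        LookupCache customers = new LookupCache(id -> "Customer " + id);
        customers.get("4711");
        customers.get("4711"); // second call is served from memory
        System.out.println("backend loads: " + customers.loadCount()); // prints "backend loads: 1"
    }
}
```

The same idea works in SSJS with the scope variables, but wrapping it in a class gives the cache a single, obvious home.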
I would recommend you to consider performance gain in a wider context.
performance gain in quicker running?
performance gain in typing?
performance gain in not making mistakes because of the editor?
performance gain of using templating in the Java editor?
performance gain in better reusability, eventually to server-wide plugins?
performance gain in being comfortable building your own classes to hold complex objects?
performance gain in easier debugging?
performance gain in being comfortable with Validators, Converters, Phase Listeners, VariableResolvers etc?
performance gain in being comfortable looking at Extension Libraries to investigate or extend?
performance gain of being able to find answers more easily on StackOverflow or Google because you're using a standard language vs a proprietary language?
performance gain in using third party Java code like Apache Commons, Apache POI etc?
To be honest, when you have got that far and understand how much code is run during a page load or partial request, performance gain in runtime of Java vs SSJS is minimal compared to something like using loaded where possible instead of rendered. The gains of Java over SSJS are much wider, and I have not even mentioned the gains in professional development.
My answer is way too long for a Stack Overflow answer, so as promised, here is a link to my blog post about this issue. Basically it has nothing to do with performance, but with maintainability, readability and usability.

lucene 4.6 concurrent flushing

I have read http://blog.trifork.com/2011/04/01/gimme-all-resources-you-have-i-can-use-them/ which mentions concurrent flushing. However, when I looked into the API of versions 4.5.1 and 4.6.1, there is no such function, and I cannot find any sample code either. The class DocumentsWriterPerThread is not in 4.5.1-4.6.1.
Can anyone please provide some info on this issue? It would be great if some sample code were provided as well to get me started.
thanks
DocumentsWriterPerThread certainly is out there in Lucene 4.5, though to the best of my knowledge, it's not really something most users would be expected to monkey with.
As far as how to use Concurrent Flushing, you already are. The change went out with Lucene 4.0, see LUCENE-3023.
If you are not seeing this speed improvement in some way (I am not sure what problem you are observing), note what Michael McCandless says in his article on the topic:
Remember this change only helps you if you have concurrent hardware, you use enough threads for indexing and there's no other bottleneck (for example, in the content source that provides the documents)
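As a rough sketch of what "enough threads for indexing" can look like, the example below has several worker threads feed documents to one shared writer. To keep it runnable without Lucene on the classpath, `DocSink` is a hypothetical stand-in for Lucene's `IndexWriter` (which is thread-safe, so one instance can be shared), and `ParallelIndexer` is a name made up for this illustration. With real Lucene, the sink would wrap a call to `addDocument` on the shared writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a shared, thread-safe index writer.
interface DocSink {
    void add(String doc);
}

class ParallelIndexer {
    // Submits every document to a fixed pool of worker threads that all
    // write to the same sink, then waits for the pool to drain.
    static void indexAll(List<String> docs, DocSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.submit(() -> sink.add(doc));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while indexing", e);
        }
    }

    public static void main(String[] unused) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) {
            docs.add("document " + i);
        }
        AtomicInteger indexed = new AtomicInteger();
        // The sink only counts documents; a real one would add them to the index.
        indexAll(docs, doc -> indexed.incrementAndGet(), 4);
        System.out.println("indexed: " + indexed.get());
    }
}
```

If the documents come from a single slow source, adding threads here will not help, which is the bottleneck caveat in the quote above.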

threadscope functionality

Can programs be monitored while they are running (possibly by piping the event log)? Or is it only possible to view event logs after execution? If the latter is the case, is there a deeper reason with respect to how the Haskell runtime works?
Edit: I don't know much about the runtime tbh, but given dflemstr's response, I was curious about how much and the ways in which performance is degraded by adding the event monitoring runtime option. I recall in RWH they mentioned that the rts has to add cost centres, but I wasn't completely sure about how expensive this sort of thing was.
The direct answer is that, no, it is not possible. And, no, there is no reason for that except that nobody has done the required legwork so far.
I think this would mainly be a matter of:
Modifying ghc-events so that it supports reading event logs chunk-wise and provides partial results. Maybe porting it over to attoparsec would help?
Updating ThreadScope so that it refreshes its internal tree data structures as new data streams in.
Nothing too hard, but somebody would need to do it. I think I heard discussion about adding this feature already... So it might happen eventually.
Edit: And to make it clear, there's no real reason this would have to degrade performance beyond what you get with event log or cost centre profiling already.
If you want to monitor the performance of the application while it is running, you can for instance use the ekg package as described in this blog post. It isn't as detailed as ThreadScope, but it does the job for web services, for example.
To get live information about what the runtime is doing, you can use the dtrace program to capture dynamic events posted by some GHC runtime probes. How this is done is outlined in this wiki page. You can then use this information to put together a more coherent event log.

If you had one wish for SubSonic what would it be?

I know this question seems subjective but it's really pretty simple. As a long term user, and part time contributor to SubSonic I'm interested in what the community thinks would be the single best way to improve it.
So what's your opinion, how would you make SubSonic even better? What one thing would make you more likely to use/recommend/evangelise/stop complaining about it?
As I said I know this is a bit subjective and may get closed but as SO is the main support forum for SubSonic I think this could be a useful way to solicit opinion and/or contributions.
To keep this from turning into a general discussion, here are the rules:
No omnibus wishes
No duplicate wishes
Up-vote those you agree with rather than re-posting them
Ability to run in MediumTrust out of the box
In all honesty, the biggest thing that's lacking is solid documentation and how-tos.
It's gotten better, but I think it needs a lot more.
Ability to automatically map collections of other objects, like Fluent NHibernate does.
When SubSonic throws an exception that isn't clear, I'd like to be able to use Google or some other mechanism to discover more information about how to keep my development effort moving forward. Right now it's too easy to get into a situation where you have to go spelunking into the SubSonic source code since SubSonic doesn't seem to be very proactive when the user goes off the "happy path".
This critique is hardly specific to SubSonic. Many (most?) software products suffer from this same problem. I have not really had this problem with NHibernate though, which is SubSonic's most clear competitor.
Faster and higher quality releases
Binary types for SimpleRepository (Images)
Left Outer Joins
Support more database-independent code generation...
What I mean by this is that it is truly a real pain if your application wants to talk to different databases (e.g. SQL Server and Oracle) and you want to only have one set of generated DAL objects. I would love it if you had the option of specifying that any SQL code that gets sent to the DB would be as compatible with most engines as possible, since right now if you generated your objects targeting SQL Server then all queries will be of the form:
SELECT [schema].[table_name] FROM ....
Sadly, this does not work in Oracle, so basically you're out of luck there.
Perhaps this isn't a huge concern for most of you, but I'm currently writing a commercial app that touts one of its main features as being able to run on various database engines just by changing its configuration and I chose SubSonic because I thought it could handle the job pretty easily, but I'm honestly having second thoughts now because of all the hoops I may have to jump through just to get this to work correctly under different environments.
Support MS Access, Postgres and Firebird databases :)
