Spark UI SQL view shows almost nothing - apache-spark

I'm trying to optimize a program with Spark SQL; the program is basically one huge SQL query (it joins about 10 tables with many CASE expressions, etc.). I'm more used to DataFrame-API-oriented programs, and those showed the different stages much better.
The query is quite well structured and I understand it more or less. However, I have a problem: I always use the Spark UI SQL view to get hints on where to focus the optimizations.
However, for this kind of program the Spark UI SQL view shows almost nothing. Is there a reason for this, or a way to force it to show more?
I'm expecting to see each join/scan with the number of output rows after it and so on, but I only see one big "WholeStageCodegen" node and a "Parsed logical plan" that is about 800 lines long.
I can't share the code, but it has the following characteristics:
1- The action triggering it is show(20).
2- There is a persist before the show/action.
3- It takes about an hour to execute (with few executors so far).
4- It uses Kudu, Hive, and in-memory tables (registered before this query).
5- The logical plan is about 700 lines long.
Is there a way to improve the tracing here? (Maybe by disabling WholeStageCodegen, though that may hurt performance...)
Thanks!
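(For anyone who wants to try the codegen idea mentioned above: whole-stage code generation can be turned off with a Spark SQL configuration flag. This is only a diagnostic sketch and may well slow the job down.)
// Diagnostic only: with whole-stage codegen disabled, operators are no longer
// fused into a single WholeStageCodegen node in the SQL tab.
spark.conf.set("spark.sql.codegen.wholeStage", false)
// or at submit time: --conf spark.sql.codegen.wholeStage=false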

Related

How to speed up spark sql filter queries if the where clause is already fixed?

In my case, the data resides in Spark tables which are created by calling the createOrReplaceTempView API on a DataFrame. Once the table is created, several queries are going to run on top of it. Most of the time, the WHERE clause filters on a particular column, whose name is already known. I would like to know if some sort of optimization can be done to improve the performance of these filter queries.
I tried exploring indexing, but it turns out Spark does not support indexing a particular column.
Have you looked at the Spark UI to see where most of your time is being consumed? Is it really the query where most of the time is spent? Usually reading the data from disk is where most of the time goes. Learn to read the Spark UI and find where the real bottleneck is. The SQL tab is a really great place to start figuring things out.
Here are some tricks to run faster in Spark that apply to most jobs:
Can you reframe the problem? Is the data you are using in a format that helps you solve the query? Can you change how it's written to change the problem? (Could you start "pre-chewing" the data before you even query it, so it's stored in the best format for the questions you want to answer?) Most performance gains come from changing the parameters of the problem to make it easier/faster to solve.
What format are you storing the incoming data in? Are you using Parquet/ORC? They have a great payoff in disk space/compression and are worth using. They can also enable file-level filtering to speed up reads. Is there transformation work that you can push upstream to help make the query do less work? Could you write the data with a partition scheme that would aid lookups?
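For instance, a minimal sketch of writing the data pre-partitioned by the column most filters hit (the column and path names here are purely hypothetical):
// Assumes df is the DataFrame you would otherwise register as a temp view.
// Partitioning by the frequently-filtered column lets Spark prune whole
// directories at read time instead of scanning everything.
df.write
  .partitionBy("country")                    // hypothetical filter column
  .mode("overwrite")
  .parquet("/warehouse/events_by_country")   // hypothetical output path
// A later read such as spark.read.parquet("/warehouse/events_by_country")
//   .filter("country = 'DE'") only touches the matching directories.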
How many files make up your input? Can you consolidate files to maximize read throughput? Reading/listing a lot of small files as input slows down the processing of data.
If the tempView query is of similar size every time, you could look at tweaking the partition count so that the files end up close to your HDFS block size (assuming you are using HDFS). With HDFS you have to read an entire block whether you use all of its data or not. Try to fit the partition count to some multiple of your executor count so that tasks finish together rather than straggling. This is hard to get perfect, but you can make decent strides toward a good ratio.
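A rough sketch of consolidating the output into fewer, block-sized files (the partition count of 64 and the path are purely illustrative):
// Repartition before writing so each output file lands near the HDFS block
// size; tune the number toward a multiple of your executor count.
df.repartition(64)
  .write
  .mode("overwrite")
  .parquet("/warehouse/events_consolidated")   // hypothetical path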
There is no need to optimize filter conditions with Spark; Spark is already smart enough to push the WHERE conditions down and fetch the minimum number of rows first. The best you can do, I guess, is persist your TempView if you are querying the same view again and again.
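A minimal sketch of that suggestion, assuming a hypothetical view name events and column some_column:
// Assumes df is the DataFrame from the question (already built upstream).
// Register the view once, cache it, and let subsequent queries reuse the
// in-memory data instead of re-reading the source.
df.createOrReplaceTempView("events")
spark.catalog.cacheTable("events")   // or df.persist() before registering
spark.sql("SELECT count(*) FROM events WHERE some_column = 'x'").show()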

What determines which SQL queries are displayed in the Spark SQL UI

I'm trying to figure out why activity that I know is occurring isn't showing up in the SQL tab of the Spark UI. I am using Spark 1.6.0.
For example, we have a load of activity occurring today between 11:06 and 13:17, and I know for certain that the code being executed is using the Spark DataFrames API.
Yet if I hop over to the SQL tab, I don't see any activity between those times.
So I'm trying to figure out what influences whether or not activity appears in that SQL tab, because the information presented there is (arguably) the most useful information in the whole UI - and when there's activity occurring that isn't showing up, it becomes kind of annoying. The only distinguishing characteristic seems to be that the jobs showing up in the SQL tab use actions that don't write any data (e.g. count()), while the jobs that do write data don't seem to show up. I'm puzzled as to why.
Any pearls of wisdom?
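(For what it's worth, the contrast being drawn above is between actions like these two; df and the output path are hypothetical:)
// The kind of action the question says does appear in the SQL tab:
df.count()
// The kind of action the question says does not appear (it writes data out):
df.write.parquet("/some/output/path")   // hypothetical path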

Why is Spark still slower than MySQL?

I am trying to work with Apache Spark with MySQL as the data source. I have a cluster with 1 master and 1 slave node, both with 8 GB of RAM and 2 cores. I am submitting my SQL query to Spark using spark-shell, and the table has 6,402,821 rows. I am performing a GROUP BY on that table. MySQL takes 5.2 seconds, while Spark takes 21 seconds for the same query. Why is this happening?
I am also setting some configurations like partitionColumn, upperBound, lowerBound, and numPartitions, but there is still no change.
I have also tried executing the query with 1, 2, and 4 cores, but the time taken by Spark stays the same 21 seconds.
Does this problem occur because my MySQL database is on a single machine and all Spark nodes try to query data from that single machine?
Can anyone help me solve this issue?
The database has a table called demo_call_stats, and the query I am running is:
val jdbcDF = spark.read.format("jdbc").options(Map(
    "url" -> "jdbc:mysql://192.168.0.31:3306/cmanalytics?user=root&password=", "zeroDateTimeBehavior" -> "convertToNull",
    "dbtable" -> "cmanalytics.demo_call_stats", "fetchSize" -> "10000",
    "partitionColumn" -> "newpartition", "lowerBound" -> "0", "upperBound" -> "4", "numPartitions" -> "4")).load()
jdbcDF.createOrReplaceTempView("call_stats")
val sqlDF = sql("select Count(*), classification_id from call_stats where campaign_id = 77 group by classification_id")
sqlDF.show()
Any help will be most appreciated.
Thanks
There are a couple of things you should understand here:
Despite what you might have heard, Spark isn't 'faster than MySQL', simply because this kind of generality doesn't mean anything.
Spark is faster than MySQL for some queries, and MySQL is faster than Spark for others.
Generally speaking, MySQL is a relational database, meaning it has been conceived to serve
as a back-end for an application. It is optimized to access records efficiently as long as they are indexed.
When thinking about databases, I like to think of them as a library with one librarian to help you get the books you want
(I am speaking about a very old school library here, without any computer to help the librarian).
If you ask your librarian:
"I want to know how many books you have that are about Geopolitics",
the librarian can go to the Geopolitics shelf and count the number of books on that shelf.
If you ask your librarian:
"I want to know how many books you have that have at least 500 pages",
the librarian will have to look at every single book in the library to answer your query.
In SQL this is called a full table scan.
Of course you can have several librarians (processors) working on the query to go faster,
but you cannot have more than a few of them (let's say up to 16) inside your library (computer).
Now, Spark has been designed to handle large volumes of data, namely libraries that are so big
that they won't fit into a single building, and even if they do, there are so many books that
even 16 librarians would take days to look through them all to answer your second query.
What makes Spark faster than MySQL is just this: if you put your books in several buildings,
you can have 16 librarians per building working on your answer.
You can also handle a larger number of books.
Also, since Spark is mostly made to answer the second type of query rather than queries like "Please bring me 'The Picture of Dorian Gray', by Oscar Wilde", it means that Spark doesn't bother, at least by default, to sort your books in any particular way.
This means that if you want to find that particular book with spark, your librarians will have
to go through the entire library to find it.
Of course, Spark uses many other types of optimization to perform some queries more efficiently,
but indexation is not one of them (if you are familiar with the notion of a Primary Key in MySQL, there is no such thing in Spark).
Other optimizations include storage formats like Parquet and ORC, which allow you to read only the columns
that are useful for answering your queries, and compression (e.g. Snappy), which is aimed at increasing
the number of books you can fit in your library without having to push back the walls.
I hope this metaphor helped you, but please bear in mind that this is just a metaphor and
doesn't fit reality perfectly.
Now, to get back to the specifics of your question:
Assuming campaign_id is your primary key or you created an index on this column, MySQL will only have
to read the rows for which campaign_id = 77.
On the other hand, Spark will have to ask MySQL to send all the rows in that table to Spark.
If Spark is clever, it will only ask for the rows with campaign_id = 77, and maybe it will send multiple queries to MySQL to fetch ranges in parallel.
But this means that all the data that MySQL could just read and aggregate will have to be serialized, sent to Spark, and be aggregated by Spark.
I hope you see why this should take longer.
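One way to make sure MySQL itself does the filtering (a sketch along the lines of the question's code, not a guaranteed fix) is to push the WHERE clause into the dbtable option as a subquery, so only the matching rows are serialized and sent to Spark:
// Wrap a filtered subquery in "dbtable" so MySQL returns only campaign 77.
val filteredDF = spark.read.format("jdbc").options(Map(
    "url" -> "jdbc:mysql://192.168.0.31:3306/cmanalytics?user=root&password=",
    "dbtable" -> "(select classification_id from cmanalytics.demo_call_stats where campaign_id = 77) as t")).load()
filteredDF.groupBy("classification_id").count().show()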
If you want Spark to answer your queries faster than MySQL, you should try copying your table into another format, like this:
// replace this line :
// jdbcDF.createOrReplaceTempView("call_stats")
// with :
jdbcDF.write.format("orc").saveAsTable("call_stats")
Another thing you could try is caching your data like this:
jdbcDF.cache().createOrReplaceTempView("call_stats")
Caching won't bring any improvement for the first query, as the data is cached while that query runs, but if you keep querying the same view, subsequent queries might be faster.
But as I explained above, this doesn't mean Spark will be faster than MySQL for everything.
For small data and local deployments, you can also get a performance improvement by changing this configuration parameter: spark.sql.shuffle.partitions=4 (the default is 200).
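A quick sketch of setting it from inside the shell (or pass --conf spark.sql.shuffle.partitions=4 at launch):
// Fewer shuffle partitions means less task-scheduling overhead on tiny data.
spark.conf.set("spark.sql.shuffle.partitions", "4")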
Hope this helps.

What would be the proper way to tune Apache Spark for responsive web applications?

I have previously used Apache Spark for streaming applications where it does a wonderful job for ETL pipelines and predictions using Machine Learning.
However, Spark for EDA may not be as fast as one might want. For example, if you want to do basic mathematical operations on data coming from Postgres or ElasticSearch using DataFrames in Spark, the time it takes to fetch the data from the host system and do the analysis is much higher than the time taken to run the equivalent SQL query on Postgres.
Even simple aggregations such as sum, average, and count can be done much faster in SQL than on top of Spark SQL.
From what I understand, this is not because of latency in fetching the data from the host system. If you call the show method on a DataFrame, you quickly get the top rows of the data set. However, if you limit the result in SQL and then call collect, the time taken is huge.
This means that the data is there, but the processing done while calling collect is what takes the time.
Regardless of the data source (CSV file, JSON file, ElasticSearch, Parquet, etc.), the behavior remains the same.
What is the reason for this latency on collect and is there any way to reduce it to the point where it can work with responsive applications to make real-time or near real-time queries?
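For illustration, the two access patterns being compared might look like this (the view name events is hypothetical):
// The pattern the question describes as fast: show the top rows.
val df = spark.table("events")
df.show(20)
// The pattern the question describes as slow: limit in SQL, then collect
// the result back to the driver.
val rows = spark.sql("SELECT * FROM events LIMIT 20").collect()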

Is it bad to use INDEX in Cassandra if performance is not important?

Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc. - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of them will be of the "how often did X occur over the last Y months?" type, so we started storing the data sooner rather than later, figuring we can always migrate or re-shape it when we need to, whereas if we don't store it, it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, its distribution, and how you access it; you were right earlier when you compared this process to an RDBMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query, where it is known exactly which node needs to be queried). So there's not just an impact on writes, but on read performance as well.
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that DataStax has built to quickly generate profile YAMLs:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.
