When I run two same queries in Spark SQL in local mode. The second run query always run faster (I assume cache locality may result this).
But when I look into Spark UI, I find out the two same queries have different number of jobs and this is the part confuses me, for example, like below.
As you could see, the second one only requires one job (20), so does this information imply Spark SQL cache the query result explicitly?
Or it caches some intermediate result of some jobs of the previous run?
Thank you for the explanation.
collect at <console>:26+details 2019/10/09 08:28:34 2 s [20]
collect at <console>:26+details 2019/10/09 08:26:01 2.3 min [16][17][18][19]
Related
If my understanding is correct, spark applications may contain one or more job. A job may be split into stages and stages can be split into tasks. I more or less can follow this in the spark user interface (or I at least I think so). But I am confused about the meaning of the SQL tab.
In particular:
what is the relation of a SQL query to jobs and queries?
in the SQL tab would we also see dataframe operations or only queries submitted e.g. via spark.sql()?
I have been running some examples in order to understand but it is still not very clear. Could you please help me?
The SQL tab shows what you'd consider the execution plan. It shows the stages, run times, memory and operations (Exchanges, Projections, etc). Catalyst builds the job plan based on your query, regardless of whether your query was done with spark.sql or dataset/dataframe operations.
You can find more information here:
If the application executes Spark SQL queries, the SQL tab displays information, such as the duration, jobs, and physical and logical plans for the queries.
https://spark.apache.org/docs/latest/web-ui.html
I am using Spark SQL API. When I see the Spark SQL section on the spark UI which details the query execution plan it says it scans parquet stage multiple times even though I am reading the parquet only once.
Is there any logical explanation?
I would also like to understand the different operations like Hash Aggregate, SortMergeJoin etc and understand the Spark UI better as a whole.
If you are doing unions or joins they may force your plan to be "duplicated" since the beginning.
Since spark doesn't keep intermediate states (unless you cache) automatically, it will have to read the sources multiple times
Something like
1- df = Read ParquetFile1
2- dfFiltered = df.filter('active=1')
3- dfFiltered.union(df)
The plan will probably look like : readParquetFIle1 --> union <-- filter <-- readParquetFIle1
I am currently running some Spark code and I need to query a data frame that is taking a long time (over 1 hour) per query. I need to query multiple times to check if the data frame is in fact correct.
I am relatively new to Spark and I understand that Spark uses lazy evaluation which means that the commands are executed only once I do a call for some action (in my case .show()).
Is there a way to do this process once for the whole DF and then quickly call on the data?
Currently I am saving the DF as a temporary table and then running queries in beeline (HIVE). This seems a little bit overkill as I have to save the table in a database first, which seems like a waste of time.
I have looked into the following functions .persist, .collect but I am confused on how to use them and query from them.
I would really like to learn the correct way of doing this.
Many thanks for the help in advance!!
Yes, you can keep your RDD in memory using rddName.cache() (or persists()) . More information about RDD Persistence can be found here
Using a temporary table ( registerTempTable (spark 1.6) or createOrReplaceTempView (spark2.x)) does not "save" any data. It only creates a view with the lifetime of you spark session. If you wish to save the table, you should use .saveAsTable, but I assume that this is not what you are looking for.
Using .cache is equivalent to .persist(StorageLevel.MEMORY). If your table is large and thus can't fit in memory, you should use .persist(StorageLevel.MEMORY_AND_DISK).
Also it is possible that you simple need more nodes in you cluster. In case you are running locally, make sure you deploy with --master local[*] to use all available cores on your machine. If you are running on a stand alone cluster or with a cluster manager like Yarn or Mesos, you should make sure that all necessary/available resources are assigned to you job.
For testing purposes, while I donĀ“t have production cluster, I am using spark locally:
print('Setting SparkContext...')
sconf = SparkConf()
sconf.setAppName('myLocalApp')
sconf.setMaster('local[*]')
sc = SparkContext(conf=sconf)
print('Setting SparkContext...OK!')
Also, I am using a very very small dataset, consisting of only 20 rows in a postgresql database ( ~2kb)
Also(!), my code is quite simple as well, only grouping 20 rows by a key and applying a trivial map operation
params = [object1, object2]
rdd = df.rdd.keyBy(lambda x: (x.a, x.b, x.c)) \
.groupByKey() \
.mapValues(lambda value: self.__data_interpolation(value, params))
def __data_interpolation(self, data, params):
# TODO: only for testing
return data
What bothers me is that the whole execution takes about 5 minutes!!
Inspecting the Spark UI, I see that most of the time was spent in Stage 6: byKey method. (Stage 7, collect() method was also slow...)
Some info:
These numbers make no sense to me... Why do I need 22 tasks, executing for 54 sec, to process less than 1 kb of data
Can it be a network issue, trying to figure out the ip address of localhost?
I don't know... Any clues?
It appears the main reason for the slower performance in your code snippet is due to the use of groupByKey(). The issue with groupByKey is that it ends up shuffling all of the key-value pairs resulting in a lot of data unnecessarily being transferred. A good reference to explain this issue is Avoid GroupByKey.
To work around this issue, you can:
Try using reduceByKey which should be faster (more info is also included in the above Avoid GroupByKey link).
Use DataFrames (instead of RDDs) as DFs include performance optimizations (and the DF GroupBy statement is faster than the RDD version). As well, as you're using Python, you can avoid the Python-to-JVM issues with PySpark RDDs. More information on this can be seen in PySpark Internals
By the way, reviewing the Spark UI diagram above, the #22 refers to the task # within the DAG (not the number of tasks executed).
HTH!
I suppose the "postgresql" is the key to solve that puzzle.
keyBy is probably the first operation that really uses the data so it's execution time is bigger as it needs to get the data from external database. You can verify it by adding at the beginning:
df.cache()
df.count() # to fill the cache
df.rdd.keyBy....
If I am right, you need to optimize the database. It may be:
Network issue (slow network to DB server)
Complicated (and slow) SQL on this database (try it using postgre shell)
Some authorization difficulties on DB server
Problem with JDBC driver you use
From what I have seen happening in my system while running spark:
When we run a spark job it internally creates map and reduce tasks and runs them. In your case, to run the data you have, it created 22 such tasks. I bigger the size of data the number may be big.
Hope this helps.
I am fairly new to Apache Spark. I have been using it for several months, but this is my first project that uses it.
I use Spark to compute dynamic reports from data, stored in a NoSQL database (Cassandra). So far I have created several reports and they are computed correctly. Inside them I use DataFrame .unionAll(), .join(), .count(), .map(), etc.
I am running a 1.4.1 Spark cluster on my local machine with the following setup:
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=1g
I have also populated the database with test data which is around 10-12k records per table.
By using the driver's web UI (http://localhost:4040/), I have noticed that the jobs are taking 40s-50s to execute, so lately I have been researching ways to tune Apache Spark and the jobs.
I have configured Spark to use the KryoSerializer, I have set the spark.io.compression.codec to lzf, I have optimized the jobs as much as I can and as much as my knowledge allows me to.
This led to the jobs taking 20s-30s to compute (which I think is a good improvement). The problem is that because this is my first Spark project, I have no baseline to compare the jobs times, so I have no idea if the execution is slow or fast and whether there is some problem in the code or with the Spark config.
What is the best way to proceed? Is there a graph or benchmark that shows how much time an action with N data should take?
You have to use hive . On top of hive you can put spark . After doing this create temp table in hive for Cassandra table you can perform all type of aggregation and filtering . After doing this you can use hive jdbc connection to get result . It will give fast result .