Pyspark: Caching approaches in spark sql - apache-spark

I need to understand if there is any difference between the below two approaches of caching while using spark sql and is there any performance benefit of one over the another (considering building the dataframes are costly and I want to reuse it many times/hit many actions) ?
1> Cache the original data frame before registering it as temporary table
df.cache()
df.createOrReplaceTempView("dummy_table")
2> Register the dataframe as temporary table and cache the table
df.createOrReplaceTempView("dummy_table")
sqlContext.cacheTable("dummy_table")
Thanks in advance.

df.cache() is a lazy cache, which means that the cache would only occur when the next action is triggered.
sqlContext.cacheTable("dummy_table") is an eager cache, which mean the table will get cached as the command is called. An equivalent of this would be: spark.sql("CACHE TABLE dummy_table")
To answer your question if there is a performance benefit of one over another, it will be hard to tell without understand your entire workflow and how (and where) your cached dataframes are used. I'd recommend using the eager cache, so you won't have to second guess when (and whether) your dataframe is cached.

Related

Spark: Is this wrong way to cache temp view?

I've seen follow code and i think that it is a wrong way to cache tempview in Spark. What do you think?
spark.sql(
s"""
|...
""".stripMargin).createOrReplaceTempView(s"temp_view")
spark.table(s"temp_view").cache()
For my opinion, this code caches DataFrame that I create by spark.table("temp_view"), but not original temp view.
Am I right?
Imo yes, you are caching what you read from this table, but for example if in next line you are going to read it again you will end up with second scan
I think that maybe you can try to use cache table within your sql
https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html
CACHE TABLE statement caches contents of a table or output of a query
with the given storage level. If a query is cached, then a temp view
will be created for this query. This reduces scanning of the original
files in future queries.
For me its seems promising
I think the caching in your example will actually work. Spark does not cache instances of DataFrame. Instead, it uses logical plans as the cache key, and the view is transparent for that purpose. For example, here's the code I've just tried using some local table I have
val df = spark.table("mart.dim_region")
df.createOrReplaceTempView("dim_region")
spark.table("dim_region").cache()
Even though cache is applied to view, if I repeatedly invoke df.show, the execution plan contains InMemoryTableScan - which is precisely the effect of caching.

Spark dataframe is being re-evaluated after a cache

I am running into some issues using cache on a spark dataframe. My expectation is that after a cache on a dataframe, the dataframe is created and cached the fist time it is needed. Any further calls to the dataframe should be from the cache
here's my code:
val mydf = spark.sql("read about 400 columns from a hive table").
withColumn ("newcol", someudf("existingcol")).
cache()
To test I ran a mydf.count() twice. I would expect the first time to take some time since the data is being cached. But the second time should be instantaneous?
What I am actually seeing is that it takes the same time for both the counts. This first one comes back pretty quickly which I think tells me that the data was not cached. If I remove the withColumn part of the code and just cache the raw data, the second count is instantaneous
Am I doing something wrong? How can I load raw data from hive, add columns and then cache the dataframe for further use? Using spark 2.3
Any help will be great!
the problem with your case is that mydf.count() is not actually materializing the dataframe (i.e. not all columns are read, your udf will no be called). That is because count() is highly optimized.
To make sure the entire dataframe is cached into memory, you should repeat your experiment with mydf.rdd.count() or another query (e.g. using sorting and/or aggregation)
See e.g. this SO question
As you are caching a dataset/dataframe, se the documented default behavior:
def cache(): Dataset.this.type
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
So for your case you can try persist(MEMORY_ONLY)
def persist(newLevel: StorageLevel): Dataset.this.type
Persist this Dataset with the given storage level.
newLevel One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
If its relevant
.cache/persist is lazy evaluation, to force it you can use the spark SQL's API which have the capability change form lazy to eager.
CACHE [ LAZY ] TABLE table_identifier
[ OPTIONS ( 'storageLevel' [ = ] value ) ] [ [ AS ] query ]
Unless LAZY specified it would be eager mode, you need to register a temp table prior to this.
Pseudo code would be:
df.createOrReplaceTempView("dummyTbl")
spark.sql("cache table dummyTbl")
More on the document reference - https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html

Why is execution time of spark sql query different between first time and second time of execution?

I am using spark sql to run some aggregated query on the parquet data source.
My parquet data source includes a table with columns: id int, time timestamp, location int, counter_1 long, counter_2 long, ..., counter_48. The total data size is about 887 MB.
My spark version is 2.4.0. I run one master and one slave on a single machine (4 cores, 16G memory).
Using spark-shell, I ran the spark command:
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is 17s.
The second time I ran a similar command (only change columns):
spark.time(spark.sql("SELECT location, sum(counter_2)+sum(counter_6)+sum(counter_11)+sum(counter_16)+sum(cou
nter_21)+sum(counter_26)+sum(counter_31)+sum(counter_36 )+sum(counter_41)+sum(counter_46) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is about 3s.
My first question is: Why are they different? I know it is not data caching because of the parquet format. Is it about reusing something like query planning?
I did another test: The first command is
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is 17s.
In the second command, I change the aggregate function:
spark.time(spark.sql("SELECT location, avg(counter_1)+avg(counter_5)+avg(counter_10)+avg(counter_15)+avg(cou
nter_20)+avg(counter_25)+avg(counter_30)+avg(counter_35 )+avg(counter_40)+avg(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is about 5s.
My second question is: Why is the second command is faster than the first command but the execution time difference is slightly smaller than the first scenario?
Finally, I have a problem related to above scenarios: The are about 200 formulas like:
formula1 = sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45)
formula2 = avg(counter_2)+avg(counter_5)+avg(counter_11)+avg(counter_15)+avg(cou
nter_21)+avg(counter_25)+avg(counter_31)+avg(counter_35 )+avg(counter_41)+avg(counter_45)
I have to run the following format frequently:
select formulaX,formulaY, ..., formulaZ from table where time > value1 and time < value2 and location in (value1, value 2...) group by location
My third question is: Is there anyway to optimize the performance (the query used once should be faster if it is used again in the future)? Does spark optimize itself or do I have to write some code, change config?
It's called Exchange Reuse. When Spark runs shuffling (i.e. aggregation, join) it stores a copy of the shuffle data on local worker nodes for potential reuse. This is an internally controlled behavior and cannot be directly influenced by end user. If you find you're keep re-using a particular portion of data (or query outcome), you could consider explicitly CACHING it by using the cache(). However, bear in mind although this allows Spark to reuse cached result for potentially faster query performance (if, and only if the Analyzer Plan of your cached query matches your new query), over using CACHE can cause whole lot of different performance problems.
A bad example is when your dataset is very large, it may cause Disk Spill problem. That is, the dataset doesn't fit into your cluster's available memory and needs to be written to slower hard disks.
Another bad example is when your query only needs to access a subset of the cached data. By caching the entire dataset in memory, Spark is forced to perform full in-memory table scan. Not only that's waste of resource but also results in a slower query performance as oppose to not using cache at all.
The best thing to do is try & error with a few of your own example queries, look at the Spark UI and check if there is sign of Disk Spill or large amount of input data scan.
Every query/data combination is unique hence you'll need to experiment it a bit to find the best performance tuning method for your own workload.
When doing an aggregate spark creates what are called shuffle files. If you run the same query twice, it will reuse the shuffle files which are stored locally on the workers fs. Unfortunately you can't rely on them to always be there because eventually the file handler gets gc'd. If your going to run 10 queries on the same dataset, cache it or use databricks.

How to save a same rdd to multiple cassandra table?

I'm trying to do this mainly because I have to save data from a same stream to two cassandra tables, they have almost same schema but different primary key to serve two queries.
Will
rdd.saveToCassandra(keySpace, tableOne, allColumn)
rdd.saveToCassandra(keySpace, tableTwo, allColumn)
do the work?
Is this a normal thing to do? I googled a bit and someone said it may incur performance issue when the rdd is large:
https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/e1nfWWyhZRo
That is OK to do so. To avoid performance issues you need to cache your RDD before first use like this:
rdd.cache()
Also after use it's good practice to unpersist your RDD like this:
rdd.unpersist()

Spark 1.6 Dataframe cache not working correctly

My understanding is that if I have a dataframe if I cache() it and trigger an action like df.take(1) or df.count() it should compute the dataframe and save it in memory, And whenever that cached dataframe is called in the program it uses already computed dataframe from cache.
but that is not how my program is working.
I have a dataframe like below which I am caching it, and then immediately I run a df.count action.
val df = inputDataFrame.select().where().withColumn("newcol" , "").cache()
df.count
When I run the program. In Spark UI I see that first line runs for 4 min and
when it comes to second line it again runs 4 min basically first line is re computed twice?
Shouldn't first line computed and cached when second line triggers?
how to resolve this behavior. I am stuck, please advise.
My understanding is that if I have a dataframe if I cache() it and trigger an action like df.take(1) or df.count() it should compute the dataframe and save it in memory,
It is not correct. Simple cache and count (take wouldn't work on RDD either) is a valid method for RDDs but it is not the case with Datasets, which use much more advanced optimizations. With query:
df.select(...).where(...).withColumn("newcol" , "").count()
any column, which is not used in where clause can be ignored.
There is an important discussion on the developer list and quoting Sean Owen
I think the right answer is "don't do that" but if you really had to you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable because the whole partition has to be computed to make it available in practice. Or, go so far as to loop over every element.
Translated to code:
df.foreach(_ => ())
There is
df.registerAsTempTable("df")
sqlContext.sql("CACHE TABLE df")
which is eager but it is no longer (Spark 2 and forward) documented and should be avoided.
No, if you call cache on a DataFrame it's not cached in this moment, it's only "marked" for potential future caching. The actual caching is only done when an action is followed later. You can also see your cached DataFrame in Spark UI under "Storage"
Another problem in your code is that count on DataFrame does not compute the entire DataFrame because not all columns need to be computed for that. You can use df.rdd.count() to force the entire evualation (see How to force DataFrame evaluation in Spark).
The question is why your first operation takes so long, even if no action is called. I think this is related to the caching logic (e.g. size estimations etc) being computed when calling cache (see eg. Why is rdd.map(identity).cache slow when rdd items are big?)

Resources