Spark: Is this the wrong way to cache a temp view? - apache-spark

I've seen the following code and I think it is the wrong way to cache a temp view in Spark. What do you think?
spark.sql(
s"""
|...
""".stripMargin).createOrReplaceTempView(s"temp_view")
spark.table(s"temp_view").cache()
In my opinion, this code caches the DataFrame that I create with spark.table("temp_view"), but not the original temp view.
Am I right?

IMO yes, you are caching what you read from this table, but if, for example, you read it again on the next line, you will end up with a second scan.
I think that maybe you could try to use CACHE TABLE within your SQL:
https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html
CACHE TABLE statement caches contents of a table or output of a query
with the given storage level. If a query is cached, then a temp view
will be created for this query. This reduces scanning of the original
files in future queries.
To me it seems promising.
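For example, a minimal sketch of that approach applied to the view from the question (the query behind the view is elided, just like in the original snippet):
spark.sql("CACHE TABLE temp_view")  // eager by default: scans the view once and caches it
spark.table("temp_view").count()    // later reads hit the in-memory data instead of rescanning the source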

I think the caching in your example will actually work. Spark does not cache instances of DataFrame; instead, it uses logical plans as the cache key, and the view is transparent for that purpose. For example, here's the code I've just tried using a local table I have:
val df = spark.table("mart.dim_region")
df.createOrReplaceTempView("dim_region")
spark.table("dim_region").cache()
Even though the cache is applied to the view, if I repeatedly invoke df.show, the execution plan contains InMemoryTableScan, which is precisely the effect of caching.
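If you want to check this yourself, a quick way (reusing the df and view from the snippet above) is to look at the physical plan once the cache has been materialized:
spark.table("dim_region").count()   // any action materializes the cache
df.explain()                        // the printed plan now contains InMemoryTableScan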

Related

How to solve the maximum view depth error in Spark?

I have a very long task that creates a bunch of views using Spark SQL and I get the following error at some step: pyspark.sql.utils.AnalysisException: The depth of view 'foobar' exceeds the maximum view resolution depth (100).
I have been searching on Google and SO and couldn't find anyone with a similar error.
I have tried caching the view foobar, but that doesn't help. I'm thinking of creating temporary tables as a workaround, since I would prefer not to change the current Spark configuration if possible, but I'm not sure if I'm missing something.
UPDATE:
I tried creating tables in parquet format to reference tables and not views, but I still get the same error. I applied that to all the input tables to the SQL query that causes the error.
If it makes a difference, I'm using ANSI SQL, not the Python API.
Solution
Using parquet tables worked for me after all. I spotted that I was still missing one table to persist, which is why it wasn't working before.
So I changed my SQL statements from this:
CREATE OR REPLACE TEMPORARY VIEW `VIEW_NAME` AS
SELECT ...
To:
CREATE TABLE `TABLE_NAME` USING PARQUET AS
SELECT ...
This moves all the critical views to parquet tables under spark_warehouse/ (or whatever you have configured).
Note:
This will write the table to the master node's disk. Make sure you have enough disk space, or consider dumping to an external data store like S3 or what have you. Read this as an alternative (and now preferred) solution using checkpoints.
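If you prefer the checkpoint route mentioned in the note, a rough sketch could look like the following; the checkpoint directory and view names here are illustrative, not taken from the original job:
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   // assumed location with enough space
val deep = spark.table("some_deeply_nested_view")               // assumed input view
// checkpoint() materializes the data and cuts the lineage, so a view created on top of it
// no longer references the long chain of views that triggered the depth error
val truncated = deep.checkpoint()
truncated.createOrReplaceTempView("VIEW_NAME")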

Pyspark: Caching approaches in Spark SQL

I need to understand whether there is any difference between the two caching approaches below when using Spark SQL, and whether there is any performance benefit of one over the other (considering that building the DataFrames is costly and I want to reuse them many times / hit many actions).
1> Cache the original DataFrame before registering it as a temporary table
df.cache()
df.createOrReplaceTempView("dummy_table")
2> Register the DataFrame as a temporary table and cache the table
df.createOrReplaceTempView("dummy_table")
sqlContext.cacheTable("dummy_table")
Thanks in advance.
df.cache() is a lazy cache, which means that the cache is only populated when the next action is triggered.
sqlContext.cacheTable("dummy_table") is an eager cache, which means the table gets cached as soon as the command is called. An equivalent of this would be: spark.sql("CACHE TABLE dummy_table")
To answer your question about whether there is a performance benefit of one over the other: it is hard to tell without understanding your entire workflow and how (and where) your cached DataFrames are used. I'd recommend using the eager cache, so you won't have to second-guess when (and whether) your DataFrame is cached.
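As a small sketch of the difference (toy data, not your actual DataFrame):
val df = spark.range(1000).toDF("id")

// 1> lazy: marking the DataFrame as cached does nothing until an action runs
df.cache()
df.createOrReplaceTempView("dummy_table")
df.count()                           // the first action materializes the cache

// 2> eager: CACHE TABLE scans and caches the view right away
//    (SQL also offers CACHE LAZY TABLE if you want to defer it)
df.createOrReplaceTempView("dummy_table")
spark.sql("CACHE TABLE dummy_table")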

Spark HiveContext: Insert overwrite the same table it is read from

I want to apply SCD1 and SCD2 using PySpark with HiveContext. In my approach, I read the incremental data and the target table. After reading, I join them for the upsert approach. I am doing registerTempTable on all the source DataFrames. I am trying to write the final dataset into the target table, and I am facing the issue that insert overwrite is not possible into the table it is read from.
Please suggest some solution for this. I do not want to write intermediate data into a physical table and read it again.
Is there any property or way to store the final dataset without keeping the dependency on the table it is read from? That way, it might be possible to overwrite the table.
Please suggest.
You should never overwrite a table from which you are reading. It can result in anything between data corruption and complete data loss in case of failure.
It is also important to point out that a correctly implemented SCD2 should never overwrite a whole table and can be implemented as a (mostly) append operation. As far as I am aware, SCD1 cannot be efficiently implemented without mutable storage, and is therefore not a good fit for Spark.
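To illustrate the append-style SCD2 idea, here is only a sketch with made-up table and column names, not a full implementation:
import org.apache.spark.sql.functions._

val incoming = spark.table("staging.customers_delta")       // assumed incremental batch
val newVersions = incoming
  .withColumn("valid_from", current_timestamp())
  .withColumn("valid_to", lit(null).cast("timestamp"))      // open-ended current record

// append the new versions; closing out superseded records is handled separately,
// so the hot path never overwrites the table it reads from
newVersions.write.mode("append").insertInto("dw.customers_scd2")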
I was going through the Spark documentation and had an idea when I was checking one property there.
As my table was Parquet, I used the Hive metastore to read the data by setting this property to false:
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
This solution is working fine for me.
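Roughly, the workaround looks like this (table names are illustrative, and the warning above about overwriting a table you read from still applies):
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

val target = hiveContext.table("db.scd_target")         // existing target table
val incremental = hiveContext.table("db.scd_incoming")  // incremental source
// the actual SCD1/SCD2 merge logic goes here; shown as a plain union only for brevity
val merged = target.unionAll(incremental)
merged.write.mode("overwrite").insertInto("db.scd_target")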

Does registerTempTable cause the table to get cached?

I have a SQL query which does a group by on many fields. The table that it uses is also big (4 TB in size). I'm registering the table as a temp table. However, I don't know whether the table gets cached or not when I register it as a temp table. I also don't know whether it would be more performant if I converted my query into Scala functions (e.g. df.groupBy().agg()...) rather than having it as a SQL statement. Any help on that?
SQL is most likely going to be the fastest by far (Databricks blog).
Did you also try to partition/repartition your DataFrame to see whether it improves the performance?
Regarding registerTempTable: it only registers the table within a Spark context. You can check in the UI.
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.createOrReplaceTempView("test")
test.show()
The Storage tab is blank.
vs
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.createOrReplaceTempView("test")
spark.table("test").cache()
test.show()
The Storage tab now shows the cached data.
By the way, registerTempTable is deprecated since Spark 2.0 and has been replaced by createOrReplaceTempView.
I have a SQL query which does a group by on many fields. The table that it uses is also big (4 TB in size). I'm registering the table as a temp table. However, I don't know whether the table gets cached or not when I register it as a temp table?
registerTempTable or createOrReplaceTempView doesn't cache the data into memory or onto disk by itself unless you use the cache() function.
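For example, with a large table you would have to cache it explicitly after registering it; a sketch with a placeholder table name:
val df = spark.table("source_db.big_table")     // placeholder for your 4TB table
df.createOrReplaceTempView("big_table")         // nothing is cached at this point
spark.catalog.cacheTable("big_table")           // marks the view for caching
spark.table("big_table").count()                // the first action actually materializes the cache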
I also don't know whether it would be more performant if I converted my query into Scala functions (e.g. df.groupBy().agg()...) rather than having it as a SQL statement. Any help on that?
Keep in mind that the terms in a SQL query ultimately call the same functions underneath, so whether you use SQL query syntax or the functions available in code doesn't matter; it is the same thing.
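You can see this for yourself with a toy example; both forms produce essentially the same physical plan:
import spark.implicits._
import org.apache.spark.sql.functions._

val sales = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "amount")
sales.createOrReplaceTempView("sales")

val viaSql = spark.sql("SELECT key, SUM(amount) AS total FROM sales GROUP BY key")
val viaApi = sales.groupBy("key").agg(sum("amount").as("total"))

viaSql.explain()   // compare the two plans: the aggregation steps are the same
viaApi.explain()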

What does InMemoryColumnarTableScan do?

When I run a Spark SQL query and head to the Spark UI DAG visualization, the first step that is shown is called InMemoryColumnarTableScan.
Is my data loaded from disk every time I run a query?
If not, what does this step do exactly?
As its name suggests, the InMemoryColumnarTableScan class contains methods which look at a table stored in memory using columnar compression techniques.
It's used to gather, cache, and provide statistics on the data stored within an in-memory table for more efficient querying of the table.
So the engine calls it first to figure out the best way to run your query against the RDD.
It doesn't have anything to do with your actual data load.
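A small sketch of when that node shows up (toy table):
val nums = spark.range(100).toDF("n")
nums.createOrReplaceTempView("nums")
spark.sql("CACHE TABLE nums")                    // builds the in-memory columnar table
// the plan now reads from the in-memory store; the scan node is called
// InMemoryColumnarTableScan or InMemoryTableScan depending on the Spark version
spark.sql("SELECT SUM(n) FROM nums").explain()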
