I know there are two ways to save a DataFrame to a table in PySpark:
1) df.write.saveAsTable("MyDatabase.MyTable")
2) df.createOrReplaceTempView("TempView")
spark.sql("CREATE TABLE MyDatabase.MyTable as select * from TempView")
Is there any difference in performance between using a "CREATE TABLE AS" statement and "saveAsTable" when running on a large distributed dataset?
createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated "view" that can be used as a table in Spark SQL. It is not materialized until you call an action (such as count), and it is not persisted to memory unless you call cache on the dataset that underpins the view. As the name suggests, it is just a temporary view: it is lost when your application/session ends.
saveAsTable, on the other hand, saves the data to external storage such as HDFS, S3, or ADLS. This is permanent storage that outlives the SparkSession/Spark application and is available for later use.
So the main difference is the lifetime of the dataset rather than performance. Obviously, within the same job, working with cached data is faster.
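To illustrate the lifetime difference, here is a minimal PySpark sketch (df, spark, and the table/view names are taken from the question; this is an illustration, not a benchmark):
# Session-scoped: the view disappears when the application/session ends
df.createOrReplaceTempView("TempView")
# Persistent: the data is written out to the warehouse / external store
df.write.saveAsTable("MyDatabase.MyTable")
# In a *new* session:
spark.table("MyDatabase.MyTable")   # works, the table survived
spark.table("TempView")             # fails: table or view not found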
Related
I have the following questions on Spark managed tables.
1. Are Spark tables/views meant for long-term persistent storage? If yes, when we create a table, where does the physical data get stored?
2. What is the exact purpose of Spark tables as opposed to DataFrames?
3. If we create Spark tables for a temporary purpose, are we not filling up disk space that would otherwise be used for Spark compute needs in jobs?
1. Spark views (temp views and global temp views) are not for long-term persistent storage; they are only alive while your Spark session is active. When you write data into a Spark/Hive table with saveAsTable(), the data is written under the spark-warehouse folder, whose location depends on your Spark configuration (spark.sql.warehouse.dir).
2. DataFrames are a programmatic API. For developers who are not comfortable with programming languages such as Scala, Python, or R, Spark exposes a SQL-based API on top of them. Under the hood they have the same performance, because both go through the same Catalyst optimizer.
3. Spark manages all tables in its catalog, and the catalog is connected to its metastore. When we create a temp view, Spark does not copy the data into the catalog or into the spark-warehouse folder; it only keeps a reference to the DataFrame from which the temp view was created. When the Spark session ends, these temporary views are also removed from the catalog.
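A small PySpark sketch of that lifecycle (my_temp_view is a made-up name; listTables and dropTempView are standard Catalog methods):
df.createOrReplaceTempView("my_temp_view")   # only a catalog reference, no data is copied
# The view shows up in the catalog flagged as temporary, with no database attached
print(spark.catalog.listTables())
# It disappears when the session ends, or you can drop it explicitly earlier
spark.catalog.dropTempView("my_temp_view")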
I am trying to understand why I would register a dataframe as a temporary view in pyspark.
Here's a dummy example
# Create spark dataframe
spark_df = spark.createDataFrame([(1, 'foo'),(2, 'bar'),],['id', 'txt'])
# Pull data using the dataframe
spark_df.selectExpr("id + 1")
# Register spark_df as a temporary view to the catalog
spark_df.createOrReplaceTempView("temp")
# Pull data using the view
spark.sql("select id + 1 from temp")
Whether I register the dataframe as a temporary view or not:
The data can only be accessed in this live spark session
I can use sql statements to query the data in both cases
Pulling data takes pretty much the same time (10K simulations, but I don't have a spark cluster yet, only my local machine).
I am failing to see the benefit of registering a DataFrame as a temporary view, but I see it in every introductory class on PySpark. What am I missing? Thanks!
SQL is quite a powerful language and many consider it beneficial in some cases.
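For example (hedged sketch, with hypothetical orders/customers data): once the DataFrames are registered as views, a multi-table transformation can be written as a single SQL statement instead of a chain of DataFrame calls, which some teams find easier to read and review.
orders_df.createOrReplaceTempView("orders")
customers_df.createOrReplaceTempView("customers")
top_customers = spark.sql("""
    SELECT c.name, SUM(o.amount) AS total_spent
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total_spent DESC
    LIMIT 10
""")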
In my code we use createOrReplaceTempView a lot so that we can invoke SQL on the generated view. This is done at multiple stages of the transformation. It also helps us keep the code in modules, each performing a particular operation. Sample code to put my question in context is shown below. So my questions are:
1. What is the performance penalty, if any, of creating the temp view from the Dataset?
2. When I create more than one at each transformation, does this increase memory usage?
3. What is the life cycle of those views, and is there a function call to remove them?
val dfOne = spark.read.option("header",true).csv("/apps/cortex/landing/auth/cof_auth.csv")
dfOne.createOrReplaceTempView("dfOne")
val dfTwo = spark.sql("select * from dfOne where column_one=1234567890")
dfTwo.createOrReplaceTempView("dfTwo")
val dfThree = spark.sql("select column_two, count(*) as count_two from dfTwo group by column_two")
dfThree.createOrReplaceTempView("dfThree")
No.
From the manual, on Running SQL Queries Programmatically:
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
In order to do this you register the DataFrame as a SQL temporary view. This is a "lazy" artefact, and there must already be a DataFrame/Dataset present; it just needs registering to enable the SQL interface.
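On the life-cycle question specifically: temp views live in the session's catalog until the session ends, and they can be dropped explicitly before that. A sketch in PySpark (the Scala Catalog API has the same method names; dfTwo/dfThree are the view names from the sample above):
# Drop individual temporary views once a stage no longer needs them
spark.catalog.dropTempView("dfTwo")
spark.catalog.dropTempView("dfThree")
# Registering a view does not materialize or copy the data, so by itself it adds
# very little memory overhead; only cache()/persist() or CACHE TABLE pins data in memory.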
createOrReplaceTempView takes some processing time.
A UDF also takes time, since it has to be registered with the Spark application, and it can cause performance issues.
From my experience, it is better to look for a built-in function first; after that I would put the other two at roughly the same speed (UDF == SQL temp table/view). Note that the table/view approach does not cover every possibility.
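A small PySpark illustration of that ordering with a toy transformation (the names are made up): the built-in function is optimized by Catalyst, while the Python UDF has to be registered and ships rows out to Python workers.
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = spark.createDataFrame([("foo",), ("bar",)], ["txt"])

# Preferred: built-in function, fully handled by Catalyst
df.select(F.upper("txt")).show()

# Slower: Python UDF, registered in the application, rows serialized to Python
upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df.select(upper_udf("txt")).show()

# SQL over a temp view using the same built-in performs like the DataFrame call
df.createOrReplaceTempView("t")
spark.sql("SELECT upper(txt) FROM t").show()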
I have the following code:
Dataset<Row> rows = sparkSession.sql("select from hive tables with multiple joins");
rows.write().saveAsTable("writing to another external table in hive immediately");
1) In the above case, when saveAsTable() is invoked, will Spark load the whole dataset into memory?
1.1) If yes, how do we handle the scenario where the query returns a huge volume of data that cannot fit into memory?
2) If the server crashes while Spark is executing saveAsTable() to write data to the external Hive table, is there a possibility of partial data being written to the target Hive table?
2.2) If yes, how do we avoid incomplete/partial data being persisted into the target Hive table?
Yes, Spark will place the data in memory, but it uses parallel processes. However, when we write the data out it uses driver memory to hold the data before the write, so try increasing driver memory.
So there are a couple of options. If you have spare memory in the cluster, you can increase num-cores, num-executors, and executor-memory, along with driver-memory, based on the data size.
If you cannot fit all the data in memory, break the data up and process it in a loop programmatically.
Let's say the source data is partitioned by date and you have 10 days to process. Try processing one day at a time and writing it to a staging DataFrame, then create date-based partitions in the final table and overwrite that date's partition on each iteration of the loop, as sketched below.
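A rough PySpark sketch of that loop, assuming the source table is partitioned by a ds date column and the target table already exists with matching schema and partitioning (all names here are placeholders):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()
         # overwrite only the partitions present in the data being written
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

dates_to_process = ["2021-01-01", "2021-01-02"]  # one entry per day

for ds in dates_to_process:
    daily = spark.sql(f"SELECT * FROM source_db.source_table WHERE ds = '{ds}'")
    # insertInto with dynamic overwrite replaces only that day's partition
    daily.write.mode("overwrite").insertInto("target_db.target_table")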
I have recently come to Spark SQL.
I read the Data Source API and am still confused about what role Spark SQL plays.
When I do SQL on whatever I need, will Spark load all the data first and perform the SQL in memory? That would mean Spark SQL is only an in-memory DB that works on data already loaded. Or does it scan the source every time?
I would really appreciate any answers.
Best regards.
I read the Data Source API and am still confused about what role Spark SQL plays.
Spark SQL is not a database. It is just an interface that allows you to execute SQL-like queries over data that you store in Spark-specific row-based structures called DataFrames.
To run a SQL query via Spark, the first requirement is that the table on which you are trying to run a query should be present in either the Hive metastore (i.e. the table should be present in Hive) or it should be a temporary view that is part of the current SQLContext/HiveContext.
So, if you have a dataframe df and you want to run SQL queries over it, you can either use:
df.createOrReplaceTempView("temp_table") // or registerTempTable
and then you can use the SQLContext/HiveContext or the SparkSession to run queries over it.
spark.sql("SELECT * FROM temp_table")
Here's eliasah's answer that explains how createOrReplaceTempView works internally
When I do SQL on whatever I need, will Spark load all the data first and perform the SQL in memory?
The data will be stored in memory or on disk depending on the persistence strategy you use. If you choose to cache the table, the data will be stored in memory and operations will be considerably faster compared to the case where the data is fetched from disk. That part is configurable and up to the user: you can basically tell Spark how you want it to store the data.
Spark SQL will only cache the rows that are pulled by the action, which means it will cache as many partitions as it has to read during that action. Subsequent reads of those partitions are served from the cache, which makes them much faster than the first call.
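A short sketch of that lazy caching behaviour (some_table and some_column are placeholders for an existing table/view and one of its columns):
# Mark the table for caching; nothing is read yet (the catalog API is lazy)
spark.catalog.cacheTable("some_table")

# The first action scans the source and fills the cache for the partitions it reads
spark.table("some_table").count()

# Later actions over the same partitions are served from the in-memory cache
spark.table("some_table").groupBy("some_column").count().show()

# Release the memory when done
spark.catalog.uncacheTable("some_table")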