Spark SQL managed tables: needs & features - apache-spark

I have the below questions on Spark managed tables.
Are Spark tables/views meant for long-term persistent storage? If yes, when we create a table, where does the physical data get stored?
What is the exact purpose of Spark tables as opposed to DataFrames?
If we create Spark tables for a temporary purpose, are we not filling up disk space that could otherwise be used for Spark compute needs in jobs?

Spark views (temp views and global temp views) are not for long-term persistent storage; they are only alive while your Spark session is active. When you write data into one of your Spark/Hive tables using the save()/saveAsTable() methods, the data is written to the spark-warehouse folder, whose location is determined by your Spark configuration (spark.sql.warehouse.dir).
DataFrames are a more programmatic API, but for developers who are not comfortable with programming languages like Scala, Python, and R, Spark exposes a SQL-based API. Under the hood they all have the same performance, because they all pass through the same Catalyst optimizer.
Spark manages all of its tables in its catalog, and the catalog is connected to its metastore. When we create a temp view, it does not copy the data into the catalog or into the spark-warehouse folder; rather, it keeps a reference to the DataFrame from which we created the temp view. When the Spark session ends, these temporary views are also removed from the catalog.
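Here is a minimal PySpark sketch of that difference, contrasting a temp view (just a catalog entry pointing at a DataFrame, gone when the session ends) with a managed table written under the warehouse directory. The table/view names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tables-vs-views").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Temp view: only a catalog entry referencing the DataFrame; dropped when the session ends.
df.createOrReplaceTempView("my_temp_view")

# Managed table: data is physically written under spark.sql.warehouse.dir.
df.write.mode("overwrite").saveAsTable("my_managed_table")

print(spark.conf.get("spark.sql.warehouse.dir"))  # where managed table files land
print(spark.catalog.listTables())                 # lists both the view and the table
```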

Related

Save DataFrame to Table - performance in Pyspark

I know there are two ways to save a DF to a table in Pyspark:
1) df.write.saveAsTable("MyDatabase.MyTable")
2) df.createOrReplaceTempView("TempView")
spark.sql("CREATE TABLE MyDatabase.MyTable as select * from TempView")
Is there any difference in performance using a "CREATE TABLE AS" statement vs "saveAsTable" when running on a large distributed dataset?
createOrReplaceTempView creates (or replaces if that view name already exists) a lazily evaluated "view" that can be used as a table in Spark SQL. It is not materialized until you call an action (like count), nor is it persisted to memory unless you call cache on the Dataset that underpins the view. As the name suggests, this is just a temporary view; it is lost after your application/session ends.
saveAsTable, on the other hand, saves the data to external stores like HDFS, S3, or ADLS. This is permanent storage that outlasts the SparkSession or Spark application and is available for use later.
So the main difference is in the lifetime of the dataset rather than in performance. Obviously, within the same job, working with cached data is faster.
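For reference, a hedged sketch of the two approaches from the question; both end up as a persistent table registered in the metastore. The database and table names are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("saveastable-vs-ctas").getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS MyDatabase")

df = spark.range(1000).withColumnRenamed("id", "key")

# 1) Write straight to a managed table.
df.write.mode("overwrite").saveAsTable("MyDatabase.MyTable")

# 2) Register a temp view, then CREATE TABLE AS SELECT over it.
df.createOrReplaceTempView("TempView")
spark.sql("CREATE TABLE MyDatabase.MyTable2 AS SELECT * FROM TempView")
```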

What is differences between RDD and a traditional Relational Database System

I am new to Spark; I know SQL but would like to know the differences between RDDs (Resilient Distributed Datasets) and relational databases, at the architecture level and the access level. Thank you.
An RDD (Resilient Distributed Dataset) is an in-memory data structure used by Spark. It is an immutable data structure. Think of it like this: Spark has loaded data into memory in a specific structure, and that structure is called an RDD. Once your Spark job stops, the RDD no longer exists.
Databases, on the other hand, are storage systems. You can store your data and query it later.
I hope this clarifies it. One more thing: Spark can load data from a file system or a database and create an RDD. The file system and the database are the two places where the data is stored; once that data is loaded into memory by Spark, Spark uses a data structure named RDD to store and process it.
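A minimal sketch of that idea: data lives in storage, Spark loads it into an immutable in-memory RDD, and the RDD disappears when the application stops. The file path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-db").getOrCreate()
sc = spark.sparkContext

# An RDD built from an in-memory collection; it is immutable, so
# transformations return new RDDs instead of modifying this one.
numbers = sc.parallelize([1, 2, 3, 4])
doubled = numbers.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8]

# The same idea with data loaded from external storage (placeholder path):
# lines = sc.textFile("hdfs:///data/input.txt")
```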

What are Databricks Spark Delta tables? Do they also store data only for a specific session, and how can I view these Delta tables and their structure?

What is the purpose of Spark Delta tables? Are they meant to store data permanently, or do they only hold the processing data while the session lasts? How can I view them in a Spark cluster, and what database do they belong to?
What is the purpose of Spark Delta tables?
The primary goal is to enable single-table transactional writes in multi-cluster setups. This is achieved by keeping a transaction log (an idea very similar to append-only tables in typical database systems).
Are they meant to store data permanently, or do they only hold the processing data while the session lasts?
They are persistent, and by definition scoped across sessions.
How can I view them in a Spark cluster, and what database do they belong to?
The same as any other table in Spark. They are not specific to any database, and are written using the delta format.
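A hedged sketch of writing and reading a Delta table; it assumes the Delta Lake (delta-spark) package is available and the session is configured for it (on Databricks this works out of the box). The table name and data are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Persisted as a catalog table stored in the delta format
# (parquet data files plus a _delta_log transaction log directory).
df.write.format("delta").mode("overwrite").saveAsTable("events_delta")

# Readable in any later session, like any other catalog table.
spark.table("events_delta").show()
spark.sql("DESCRIBE DETAIL events_delta").show(truncate=False)
```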

What role does Spark SQL play? Is it an in-memory DB?

Recently I have come to Spark SQL.
I read the Data Source API and am still confused about what role Spark SQL plays.
When I run SQL on whatever I need, will Spark load all the data first and perform the SQL in memory? That would mean Spark SQL is only an in-memory DB that works on data already loaded. Or does it scan locally every time?
I would really appreciate any answers.
Best regards.
I read the Data Source API and am still confused about what role Spark SQL plays.
Spark SQL is not a database. It is just an interface that allows you to execute SQL-like queries over the data that you store in Spark-specific row-based structures called DataFrames.
To run a SQL query via Spark, the first requirement is that the table on which you are trying to run the query should be present either in the Hive metastore (i.e. the table should be present in Hive) or as a temporary view that is part of the current SQLContext/HiveContext.
So, if you have a dataframe df and you want to run SQL queries over it, you can either use:
df.createOrReplaceTempView("temp_table") // or registerTempTable
and then you can use the SQLContext/HiveContext or the SparkSession to run queries over it.
spark.sql("SELECT * FROM temp_table")
Here's eliasah's answer that explains how createOrReplaceTempView works internally
When I run SQL on whatever I need, will Spark load all the data first and perform the SQL in memory?
The data will be stored in memory or on disk depending on the persistence strategy that you use. If you choose to cache the table, the data will be stored in memory and operations will be considerably faster compared to the case where data is fetched from disk. That part is configurable and up to the user: you can basically tell Spark how you want it to store the data.
Spark SQL will only cache the rows that are pulled by the action, which means it will cache as many partitions as it has to read during that action. This makes subsequent calls much faster than the first one.
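A small sketch of that caching behaviour: cache() is lazy, so the first action reads (and caches) the partitions it needs, and later queries over the same view are served from memory. The file path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-caching").getOrCreate()

df = spark.read.parquet("/data/events.parquet")    # placeholder path
df.cache()                                         # lazy: nothing is cached yet
df.createOrReplaceTempView("events")

spark.sql("SELECT count(*) FROM events").show()    # first action: reads and caches
spark.sql("SELECT * FROM events LIMIT 10").show()  # subsequent reads hit the cache
df.unpersist()                                     # release the memory when done
```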

Any benefit in my case when using Hive as a data warehouse?

Currently, I am trying to adopt big data tooling to replace my current data analysis platform. My current platform is pretty simple: my system gets a lot of structured CSV feed files from various upstream systems, and then we load them as Java objects (i.e. in memory) for aggregation.
I am looking at using Spark to replace my Java object layer for the aggregation process.
I understand that Spark supports loading files from HDFS / a filesystem, so Hive as a data warehouse does not seem to be a must. However, I can still load my CSV files into Hive first, and then use Spark to load the data from Hive.
My question here is: in my situation, what are the pros / benefits of introducing a Hive layer rather than directly loading the CSV files into a Spark DataFrame?
Thanks.
You can always look at and explore the data using the tables.
Ad hoc queries/aggregations can be performed using HiveQL.
When accessing that data through Spark, you need not specify the schema of the data separately.
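A hedged sketch of that trade-off: reading the CSV directly means supplying (or inferring) the schema yourself, while a Hive table carries its schema in the metastore and remains queryable ad hoc with HiveQL. The paths, table name, and column name are placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-vs-hive")
         .enableHiveSupport()
         .getOrCreate())

# Option A: load the CSV feeds directly into a DataFrame.
feed_df = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")   # or pass an explicit schema
           .csv("/feeds/daily/*.csv"))

# Option B: read the same data from a Hive table; the schema comes from the
# metastore, and the table can also be queried with HiveQL outside this job.
hive_df = spark.table("staging.daily_feed")

feed_df.groupBy("customer_id").count().show()
```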
