I'm learning Spark and found that I can create a temp view in Spark by calling one of the following PySpark APIs:
df.createGlobalTempView("people")
df.createTempView("people")
df.createOrReplaceTempView("people")
Can I create a permanent view so that it becomes available to every user of my Spark cluster? I think this would save people's time if the views are already defined for them.
Yes, but you'll have to use SQL:
spark.sql("CREATE VIEW persistent_people AS SELECT * FROM people")
By design, Spark doesn't have persistence capabilities of its own, since it is a data processing engine rather than a data warehouse.
If you want to provide session-independent views, you need to work with an existing Hive deployment or use Spark's own metastore. For more details, please refer to the Spark documentation on Hive interaction.
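For example, here is a minimal sketch of how this could look with Hive support enabled; the DataFrame, path and table names below are hypothetical. Note that a permanent view cannot reference a temporary view, so the data is registered as a table first:

from pyspark.sql import SparkSession

# Build a session whose catalog is backed by a Hive metastore (or Spark's embedded Derby metastore).
spark = (SparkSession.builder
         .appName("persistent-views")   # hypothetical application name
         .enableHiveSupport()
         .getOrCreate())

# Hypothetical source data; register it as a table before defining the view over it,
# because a permanent view cannot be defined on top of a temp view.
df = spark.read.parquet("/data/people")
df.write.mode("overwrite").saveAsTable("people_table")
spark.sql("CREATE VIEW IF NOT EXISTS persistent_people AS SELECT * FROM people_table")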
Related
CONTEXT -
Integration of the Hive metastore with any data analytics platform, say Presto, Spark, etc.
Assuming that the BI tool is integrated with the query engine (Presto/Spark in this scenario), all the information regarding the tables/views will be displayed in the BI tool's dashboard.
However, the information about the tables or views (columns and their respective data types) is actually stored in the metastore DB in the form of tables.
CASE SCENARIO:
Let us assume that we have initially connected the Presto engine with Hive as the metastore and created some views. Now, we should be able to find the views and their columns stored in Hive.
Now, let us assume that we want to replace Presto with Spark and retain the same metastore as before, which has the views in it.
NOTE: Since views are not materialized, the underlying data source will not contain any physical data pertaining to those views.
QUESTION:
How does Spark become aware of the views and their respective CREATE scripts in order to execute any queries against them?
Is there any query such as "create catalog x"? I didn't find it in the documentation here: https://spark.apache.org/docs/latest/sql-ref-syntax.html
The catalog is a Spark abstraction for managing metadata. It has several implementations: hive and in-memory. It is created automatically by Spark.
You can control which implementation is used with spark.sql.catalogImplementation or with SparkSession.builder.enableHiveSupport() (which sets spark.sql.catalogImplementation=hive), as in the sketch below.
If you're using hive as the catalog implementation, there are several ways Hive can work; you can read more about these here.
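A minimal sketch of both ways of selecting the hive implementation, assuming the Hive libraries are on the classpath; the application name is hypothetical:

from pyspark.sql import SparkSession

# Either set the config explicitly or call enableHiveSupport(), which does the same thing.
spark = (SparkSession.builder
         .appName("catalog-demo")                              # hypothetical application name
         .config("spark.sql.catalogImplementation", "hive")    # or: .enableHiveSupport()
         .getOrCreate())

print(spark.conf.get("spark.sql.catalogImplementation"))       # 'hive'
print(spark.catalog.listDatabases())                           # metadata now comes from the Hive metastore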
Which option is better to use: Spark as an execution engine on Hive, or accessing Hive tables using Spark SQL? And why?
A few assumptions here are:
The reason to opt for SQL is to stay user-friendly, e.g. if you have business users trying to access data.
Hive is under consideration because it provides an SQL-like interface and persistence of data.
If that is true, Spark SQL is perhaps the better way forward. It is better integrated within Spark and, as an integral part of Spark, it will provide more features (one example is Structured Streaming). You will still get user-friendliness and an SQL-like interface to Spark, so you will get the full benefits. But you will need to manage your system only from Spark's point of view. Hive installation and management will still be there, but from a single perspective.
Using Hive with Spark as the execution engine will keep you limited by how well Hive's libraries translate your HQL to Spark. They may do a pretty good job, but you will still lose the advanced features of Spark SQL. And new features may take longer to get integrated into Hive compared to Spark SQL.
Also, with Hive exposed to end users, some advanced users or data engineering teams may want access to Spark. This will cause you to manage two tools. System management may get more tedious compared to only using Spark SQL in this scenario, since Spark SQL has the potential to serve both non-technical and advanced users; even if the advanced users use pyspark, spark-shell or more, they will still be working within the same toolset.
I have a question regarding best practices for managing permanent tables in Spark. I have previously worked with Databricks, and in that context, Databricks manages permanent tables so you do not have to 'create' or reference them each time a cluster is launched.
Let's say that in a Spark cluster session, a permanent table is created with the saveAsTable command, using the option to partition the table. The data is stored in S3 as Parquet files.
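For illustration, a minimal sketch of such a write; the DataFrame, bucket path, table name and partition column are hypothetical:

# Write the data to S3 as partitioned Parquet and register the table in the metastore in one step.
(df.write
   .mode("overwrite")
   .format("parquet")
   .partitionBy("event_date")                           # hypothetical partition column
   .option("path", "s3a://my-bucket/warehouse/events")  # external location in S3
   .saveAsTable("events"))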
The next day, a new cluster is created and it needs to access that table for different purposes:
SQL query for exploratory analysis
ETL process for appending a new chunk of data
What is the best way to make the saved table available again as the same table, with the same structure/options/path? Maybe there is a way to store the Hive metastore settings so they can be reused between Spark sessions? Or maybe, each time a Spark cluster is created, I should run CREATE EXTERNAL TABLE with the correct options to tell it the format (Parquet), the partitioning and the path, as in the sketch below?
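For reference, a minimal sketch of that "re-create it on each new cluster" approach, reusing the hypothetical names and S3 location from the sketch above; whether this is needed at all depends on whether the clusters share an external metastore:

# Re-declare the table over the existing S3 data, then recover the partitions already present there.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING, event_date DATE)
    USING PARQUET
    PARTITIONED BY (event_date)
    LOCATION 's3a://my-bucket/warehouse/events'
""")
spark.sql("MSCK REPAIR TABLE events")   # pick up the existing partition directories under the location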
Furthermore, if I want to access those Parquet files from another application, e.g. Apache Impala, is there a way to store and retrieve the Hive metastore information, or does the table have to be created again?
Thanks
What is the best way to share Spark RDD data between two Spark jobs?
I have a case where job 1, a Spark sliding-window streaming app, will be consuming data at regular intervals and creating an RDD. We do not want to persist this to storage.
Job 2: a query job that will access the same RDD created in job 1 and generate reports.
I have seen a few questions suggesting Spark Job Server, but as it is open source I am not sure if it is a viable solution; any pointers will be of great help.
Thank you!
The short answer is that you can't share RDDs between jobs. The only way you can share data is to write that data to HDFS and then pull it in within the other job. If speed is an issue and you want to maintain a constant stream of data, you can use HBase, which will allow for very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
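A minimal sketch of that hand-off via shared storage; the RDD, the SparkContext sc and the HDFS paths are hypothetical:

# Job 1 (streaming): persist each window of the RDD to a shared HDFS location.
windowed_rdd.saveAsTextFile("hdfs:///shared/windows/batch-0001")

# Job 2 (reporting): a separate application reads the same location back in.
reports_rdd = sc.textFile("hdfs:///shared/windows/batch-0001")
print(reports_rdd.count())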
You can share RDDs across different applications using Apache Ignite.
Apache Ignite provides an abstraction to share RDDs, through which applications can access the RDDs corresponding to different applications. In addition, Ignite has support for SQL indexes, whereas native Spark doesn't.
Please refer to https://ignite.apache.org/features/igniterdd.html for more details.
As the official documentation describes:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html
You can save it to a global temporary view. The view will be available to other sessions of the same Spark application until that application is stopped.
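A minimal sketch of that, assuming both sessions belong to the same Spark application; the DataFrame and view name are hypothetical:

# Session 1: register a global temp view; it lives in the reserved global_temp database.
df.createGlobalTempView("shared_people")

# Session 2 (same application): query it through the global_temp database.
other = spark.newSession()
other.sql("SELECT * FROM global_temp.shared_people").show()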