Spark on Hive replacing Presto - apache-spark

CONTEXT -
Integration of the Hive metastore with a data analytics platform such as Presto or Spark.
Assuming that the BI tool is integrated with the query engine (Presto/Spark in this scenario), all the information regarding the tables/views will be displayed in the BI tool dashboard.
Behind the scenes, the information about the tables or views (columns and their respective datatypes) is stored in the metastore DB in the form of tables.
CASE SCENARIO:
Let us assume that we initially connect the Presto engine with Hive as the metastore and we create some views. We should then be able to find the views and their columns stored in the Hive metastore.
Now, let us assume that we want to replace Presto with Spark while retaining the same metastore, which already contains those views.
NOTE: Since views are not materialised, the underlying data source does not contain any physical data pertaining to those views.
QUESTION:
How would Spark become aware of the views and their respective CREATE scripts in order to execute any queries that reference them?
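A rough sketch of how this is usually wired up, assuming Spark is pointed at the same Hive metastore that Presto used (the metastore URI, database and view names below are placeholders):

from pyspark.sql import SparkSession

# Assumption: reuse the metastore Presto was configured against
spark = (SparkSession.builder
         .config("hive.metastore.uris", "thrift://metastore-host:9083")  # placeholder URI
         .enableHiveSupport()
         .getOrCreate())

# Views created through another engine are just metadata entries in the metastore,
# so Spark can list and inspect them like any other catalog object
spark.sql("SHOW TABLES IN my_db").show()
spark.sql("DESCRIBE EXTENDED my_db.my_view").show(truncate=False)

# Querying the view makes Spark expand the stored definition and read the underlying source data
spark.sql("SELECT * FROM my_db.my_view LIMIT 10").show()

Whether the stored view text actually parses in Spark depends on the SQL dialect the original engine wrote into the metastore, so views created by Presto may need to be recreated with Spark-compatible SQL.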

Related

Need use case or example for Spark’s Relationship to Hive

I am reading the Spark Definitive Guide.
In the "Spark’s Relationship to Hive" section, the below lines are given:
"With Spark SQL, you can connect to your Hive metastore (if you already have one) and access table metadata to reduce file listing when accessing information. This is popular for users who are migrating from a legacy Hadoop environment and beginning to run all their workloads using Spark."
I am not able to understand what this means. Can someone please help me with examples for the above use case?
Spark, being the latest tool in the Hadoop ecosystem, has connectivity with the earlier Hadoop tools. Hive was the most popular of these until recent times. Most Hadoop platforms have data stored in Hive tables, which can be accessed using Hive as a SQL engine. However, Spark can do the same things.
So, the quoted statement means that you can connect to the Hive metastore (which contains information about existing tables, databases, their locations, schemas, file types, etc.) and then run Hive-like queries on those tables just as you would with Hive.
Below are two examples of what you can do with Spark once you are connected to the Hive metastore.
spark.sql("show databases")
spark.sql("select * from test_db.test_table")
I hope this answers your question.

SnowFlake Datawarehouse : 'show tables' & create table using spark

I have two questions with respect to Spark and the Snowflake data warehouse.
1) Is there any way to query/create Snowflake tables the way you would with Hive/Spark (in either new or old versions of Spark)? For example:
val hive_tables=hiveContext.sql("show tables").foreach(println)
2) hiveContext.sql("create table....")
The first question is about knowing which tables are present for a particular user with a particular role. The reason I am asking is that via the Snowflake web UI I am able to query the table, but through Spark I am not; I get:
Exception in thread "main" net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
Object 'mytable' does not exist.
You should double-check things like the database/schema/role in your JDBC connection settings. If you don't see a table via JDBC, one of these is likely the culprit.
You can validate the current settings by running e.g. show roles, show schemas and show databases on the established JDBC connection.
In general, I highly recommend using the Spark-Snowflake connector for communicating with Snowflake from Spark. It also provides Utils.runQuery() for running simple queries like DDL.
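A minimal sketch of reading a table through the Spark-Snowflake connector from PySpark, with the database, schema and role spelled out explicitly (all connection values are placeholders, and the connector and JDBC driver JARs are assumed to be on the classpath):

sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",  # placeholder account URL
    "sfUser": "my_user",
    "sfPassword": "my_password",
    "sfDatabase": "MY_DB",      # if these do not match where the table actually lives,
    "sfSchema": "PUBLIC",       # you get "Object 'mytable' does not exist"
    "sfWarehouse": "MY_WH",
    "sfRole": "MY_ROLE",
}

df = (spark.read
      .format("net.snowflake.spark.snowflake")
      .options(**sf_options)
      .option("dbtable", "mytable")
      .load())
df.show()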

Is it possible to create persistent view in Spark?

I'm learning Spark and found that I can create a temp view in Spark by calling one of the following PySpark APIs:
df.createGlobalTempView("people")
df.createTempView("people")
df.createOrReplaceTempView("people")
Can I create a permanent view so that it becomes available to every user of my Spark cluster? I think this would save people time if the views are already defined for them.
Yes, but you'll have to use SQL:
spark.sql("CREATE VIEW persistent_people AS SELECT * FROM people")
By paradigm, Spark doesn't have persistence capabilities of its own, since it's a data processing engine, not a data warehouse.
If you want to provide session-independent views, you need to work with an existing Hive deployment or use the approach with a Spark-owned metastore. For more details, please refer to the Spark documentation about Hive interaction.
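A rough end-to-end sketch under that assumption (a Hive metastore is available, the session is created with Hive support, and df is the DataFrame from the question):

from pyspark.sql import SparkSession

# Session backed by a Hive (or Spark-owned) metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# A view can only outlive the session if its source is also in the metastore,
# so persist the DataFrame as a table first
df.write.saveAsTable("people")

# The view definition is stored in the metastore and is visible
# to any later session that uses the same metastore
spark.sql("CREATE VIEW persistent_people AS SELECT * FROM people")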

Hive tables Not Visible in Tableau

I have created a table, ztest7, in the default database in Hive. I am able to query it using Beeline. In Tableau, I can query it using custom SQL.
However, the table does NOT show up when I search for it.
Am I missing something here?
Tableau Desktop Version = v10.1.1
Hive = v2.0.1
Spark = v2.1.0
Best Regards
I have the same issue with Tableau Desktop 10 (Mac) connecting to Hive (2.1.1) via Spark SQL 2.1 (on a CentOS 7 server).
This is what I got from Tableau Support:
In Tableau Desktop, the ability to connect to Spark SQL without defining a default schema is not currently built into the product. As a preliminary step, to define a default schema, configure the Spark SQL Hive metastore to utilize a SchemaRDD or DataFrame. This must be defined in the Hive metastore for Tableau Desktop to be able to access it. Pure schema-less Spark RDDs cannot be queried by Spark SQL because of the lack of a schema. RDDs can be converted into SchemaRDDs, which have additional schema metadata, as Spark SQL provides access to SchemaRDDs. When a SchemaRDD is created, it is only available in the local namespace or context, and is unavailable to external services accessing Spark through ODBC and the Spark Thrift Server. For Tableau to have access, the SchemaRDD needs to be registered in a catalog that is available outside of just the local context; the Hive metastore is currently the only supported service.
I don't know how to check/implement this.
PS: I'd have posted this as a comment, but I am not allowed to as I am new to Stack Overflow.
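For what it's worth, the registration the support answer describes usually amounts to persisting the DataFrame (the modern replacement for a SchemaRDD) into the Hive metastore rather than only registering a temp view; a sketch, with df and the table name being illustrative:

# A temp view lives only in this session's local catalog and is NOT visible over ODBC/Thrift Server
df.createOrReplaceTempView("ztest7_local")

# Persisting into the Hive metastore makes the table visible to external clients such as Tableau,
# provided the Thrift Server is configured against the same metastore
df.write.saveAsTable("default.ztest7")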
In the field labeled Table on the left side of the screen, try selecting Contains, entering part of your table name, and hitting Enter.
I ran into a similar issue. In my case, I had loaded the tables using Hive, but the Tableau connection to the data source was made using Impala.
To fix the issue of the tables not showing in the Tableau dropdown, try running INVALIDATE METADATA database.table_name in the Impala interface. This fixed the problem for me.
To understand why this fixes the issue, see the Impala documentation on INVALIDATE METADATA.

Spark Permanent Tables Management

I have a question regarding best practices for managing permanent tables in Spark. I have previously been working with Databricks, and in that context Databricks manages permanent tables, so you do not have to 'create' or reference them each time a cluster is launched.
Let's say that in a Spark cluster session a permanent table is created with the saveAsTable command, using the option to partition the table. The data is stored in S3 as Parquet files.
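For concreteness, that first write might look something like this (table name, columns and bucket are made up):

(df.write
   .partitionBy("event_date")
   .format("parquet")
   .option("path", "s3a://my-bucket/warehouse/events/")  # external location in S3
   .saveAsTable("events"))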
Next day, a new cluster is created and it needs to access that table for different purposes:
SQL query for exploratory analysis
ETL process for appending a new chunk of data
What is the best way to make the saved table available again as the same table, with the same structure/options/path? Is there maybe a way to store the Hive metastore settings so they can be reused between Spark sessions? Or should I, each time a Spark cluster is created, run CREATE EXTERNAL TABLE with the correct options to specify the format (Parquet), the partitioning and the path?
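If you go the re-create route, the DDL is roughly this in Spark SQL (columns and path are the made-up ones from the sketch above; with a shared external metastore it only needs to run once):

spark.sql("""
  CREATE TABLE IF NOT EXISTS events (
    user_id STRING,
    value DOUBLE,
    event_date DATE
  )
  USING PARQUET
  PARTITIONED BY (event_date)
  LOCATION 's3a://my-bucket/warehouse/events/'
""")

# Register the partition directories that already exist in S3
spark.sql("MSCK REPAIR TABLE events")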
Furthermore, if I want to access those Parquet files from another application, e.g. Apache Impala, is there a way to store and retrieve the Hive metastore information, or does the table have to be created again?
Thanks
