I have a data lake (based on Hive managed tables) and would like to load it incrementally into a warehouse (also based on Hive managed tables) using Spark v3.2, but I ran into an issue connecting to Hive managed tables with Spark 3.
I would like to know:
How can I connect to Hive ACID tables? Is JDBC an option, and if so, how? Or are there other ways?
Thank you!
I'm writing a Spark-based app and have to drop some tables in Cassandra DB.
I know how to read from tables with spark.read.format("jdbc"), and how to save a DataFrame with df.write.format("jdbc").
But how can I drop a table that I don't need anymore?
To drop a Cassandra table, you can simply use the Spark SQL DROP TABLE command:
spark.sql("DROP TABLE table_name")
Note that the JDBC API is limited, so our general recommendation is to use the Spark Cassandra connector. It is fully open source and free to use.
The Spark Cassandra connector is a library specifically designed for connecting to Cassandra clusters from Spark applications. Cassandra tables are exposed as DataFrames or RDDs and the connector also allows execution of the full CQL API. Cheers!
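As a rough sketch of the connector-based route, the snippet below registers the connector's catalog and issues a fully qualified DROP TABLE. It assumes the Spark Cassandra connector 3.x is on the classpath; the catalog name `cass`, the host, the keyspace, and the table names are hypothetical placeholders, and the helper functions are illustrative rather than part of any library.

```python
import re

# Catalog class shipped with the Spark Cassandra connector 3.x.
CASSANDRA_CATALOG_CLASS = "com.datastax.spark.connector.datasource.CassandraCatalog"
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def cassandra_catalog_conf(catalog, host):
    """Spark settings that register the connector's catalog under `catalog`."""
    return {
        f"spark.sql.catalog.{catalog}": CASSANDRA_CATALOG_CLASS,
        "spark.cassandra.connection.host": host,
    }


def drop_table_statement(catalog, keyspace, table):
    """Build a fully qualified DROP TABLE statement, rejecting odd identifiers."""
    for name in (catalog, keyspace, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"DROP TABLE IF EXISTS {catalog}.{keyspace}.{table}"


def drop_cassandra_table(host, keyspace, table, catalog="cass"):
    """Open a session with the Cassandra catalog and drop one table."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("drop-cassandra-table")
    for key, value in cassandra_catalog_conf(catalog, host).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    spark.sql(drop_table_statement(catalog, keyspace, table))
```

Calling `drop_cassandra_table("10.0.0.1", "shop", "old_orders")` would then run `DROP TABLE IF EXISTS cass.shop.old_orders` against the cluster.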
I am working with Presto on Spark, and I have noticed that with this package inserting data into databases only works with the hive-hadoop connector. All the other connectors return "connector does not support page sink commit" when I try insert queries. I want to use Presto on Spark to execute my queries and insert the results into Cassandra. Any idea how I can achieve this? Thanks in advance.
Our project currently uses HDInsight 3.6, where Spark and Hive integration is enabled by default because both share the same catalog. We now want to migrate to HDInsight 4.0, where Spark and Hive have separate catalogs. According to the Microsoft documentation (https://learn.microsoft.com/en-us/azure/hdinsight/interactive-query/apache-hive-warehouse-connector), integrating them through the Hive Warehouse Connector requires an additional cluster. Is there any other approach that avoids the extra cluster? Any suggestions would be highly appreciated.
Thanks
If you are using external tables, you can point both Spark and Hive at the same metastore. This only applies to external tables.
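A minimal sketch of that setup, assuming a metastore reachable over Thrift: Spark is started with Hive support and `hive.metastore.uris` pointing at the shared metastore, and tables are declared as external with an explicit location. The URI, table name, columns, and storage path are hypothetical placeholders, and whether this is sufficient on a given HDInsight 4.0 cluster is not something the snippet can confirm.

```python
def shared_metastore_conf(metastore_uri):
    """Spark setting that points the Hive client at an existing metastore."""
    return {"spark.hadoop.hive.metastore.uris": metastore_uri}


def external_table_ddl(table, columns, location):
    """Build a Hive-format CREATE EXTERNAL TABLE statement (Parquet storage)."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"STORED AS PARQUET LOCATION '{location}'"
    )


def create_shared_external_table(metastore_uri, table, columns, location):
    """Register an external table in the shared metastore from Spark."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("shared-metastore").enableHiveSupport()
    for key, value in shared_metastore_conf(metastore_uri).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    spark.sql(external_table_ddl(table, columns, location))
```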
Hive keeps its own metadata and stores information about tables, columns, and partitions there.
If I do not want to use Hive, can we create the same kind of metadata for Spark?
I want to query with Spark SQL (not the DataFrame API) the way I would in Hive (SELECT, FROM, and WHERE). Can we do that? If yes, which relational DB can we use for metadata storage?
Can we create the same kind of metadata for Spark?
Spark does this for you and you don't have to use a separate installation of Hive or even just part of it (e.g. a Hive metastore).
Regardless of the installation of Apache Spark you use, Spark SQL uses a Hive metastore internally for the same purpose as Hive does (but the metastore is now part of Spark SQL).
If yes, which relational DB can we use for metadata storage?
Anything that Hive supports, e.g. Oracle, MySQL, PostgreSQL. The configuration is pretty much as you would do with a separate Hive installation (which is usually the case in such enterprisey installations).
You may want to read Hive Metastore.
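To make the configuration point concrete, here is a hedged sketch that passes the standard Hive `javax.jdo.option.*` connection properties to Spark through the `spark.hadoop.` prefix. The JDBC URL, driver class, and credentials are hypothetical placeholders, and the matching JDBC driver jar is assumed to be on Spark's classpath.

```python
def metastore_db_conf(jdbc_url, driver, user, password):
    """Map Hive metastore connection properties onto Spark config keys."""
    prefix = "spark.hadoop.javax.jdo.option."
    return {
        prefix + "ConnectionURL": jdbc_url,
        prefix + "ConnectionDriverName": driver,
        prefix + "ConnectionUserName": user,
        prefix + "ConnectionPassword": password,
    }


def session_with_rdbms_metastore(jdbc_url, driver, user, password):
    """Start a Hive-enabled session whose metastore lives in a relational DB."""
    # Imported lazily so metastore_db_conf works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("rdbms-metastore").enableHiveSupport()
    for key, value in metastore_db_conf(jdbc_url, driver, user, password).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With PostgreSQL, for instance, the call might look like `session_with_rdbms_metastore("jdbc:postgresql://dbhost:5432/metastore", "org.postgresql.Driver", "hive", "secret")`, after which `spark.sql("SELECT ... FROM ... WHERE ...")` works against tables registered in that metastore.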
Spark is essentially a distributed computation system rather than a distributed storage system. We therefore mostly use Spark for the computation work, which needs metadata from the various storage systems.
However, Spark internally provides an InMemoryCatalog to store the metadata if it's not configured with Hive.
You can take a look at this for more information.
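As an illustration of querying with plain SQL when no Hive metastore is configured, the sketch below selects the in-memory catalog explicitly through `spark.sql.catalogImplementation` and runs SELECT/FROM/WHERE against a temporary view. The view name and sample rows are made up for the example, and anything registered this way disappears when the session ends.

```python
# Selects Spark's built-in in-memory catalog instead of a Hive metastore.
IN_MEMORY_CONF = {"spark.sql.catalogImplementation": "in-memory"}

# Hypothetical sample data registered purely through SQL.
CREATE_VIEW = (
    "CREATE OR REPLACE TEMPORARY VIEW people AS "
    "SELECT * FROM VALUES ('Ann', 34), ('Bob', 19) AS t(name, age)"
)
QUERY = "SELECT name FROM people WHERE age > 21"


def run_in_memory_demo():
    """Run a SQL-only query in a local session without Hive."""
    # Imported lazily so the constants above load without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[*]").appName("in-memory-catalog")
    for key, value in IN_MEMORY_CONF.items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()
    spark.sql(CREATE_VIEW)
    return spark.sql(QUERY).collect()
```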
We are trying to generate reports in Tableau over a Spark SQL connection, but I found out that we are ultimately connecting to the Hive metastore.
If this is the case, what are the advantages of this new Spark SQL connection? Is there a way to connect from Tableau, using Spark SQL, to Spark DataFrames that are persisted?
The problem here is a Tableau problem more than a Spark problem. The Spark SQL Connector launches a Spark job each time you connect to a database. Part of that Spark job loads the underlying Hive table into the distributed memory that Spark manages, and each time you make a change or a selection on a graph, the refresh has to go a level deeper, to the Hive metastore, to get the data through Spark. That is how Tableau is designed. The only option here is to swap Tableau for Spotfire (or some other tool) where, by pre-caching the underlying Hive table, the Spark SQL Connector can query it directly from Spark's distributed memory, skipping the load step.
Disclosure: I am in no way associated with Spotfire makers
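For reference, pre-caching a table on the Spark side can be expressed with Spark SQL's `CACHE TABLE` statement. The sketch below only builds and issues that statement; the table name is a hypothetical placeholder, and it assumes the statement runs in the same Spark application that serves the BI tool's SQL connection. Whether a given tool then benefits from the cache depends on how it connects.

```python
def cache_table_statement(table, lazy=False):
    """Build a CACHE TABLE statement; LAZY defers caching until first use."""
    keyword = "CACHE LAZY TABLE" if lazy else "CACHE TABLE"
    return f"{keyword} {table}"


def precache(spark, table, lazy=False):
    """Cache an existing table in the given SparkSession."""
    spark.sql(cache_table_statement(table, lazy=lazy))
```

For example, `precache(spark, "sales_db.orders")` issues `CACHE TABLE sales_db.orders`.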