I am working on a project on Kubernetes where I use Spark SQL to create tables, and I would like to add partitions and schemas to a Hive Metastore. However, I have not found any proper documentation on installing a Hive Metastore on Kubernetes. Is this possible, given that I already have a PostgreSQL database installed? If so, could you please point me to any official documentation?
Thanks in advance.
Hive on MR3 allows the user to run Metastore in a Pod on Kubernetes. The instructions may look complicated, but once the Pod is properly configured, it's easy to start Metastore on Kubernetes. A pre-built Docker image is available on Docker Hub, and a Helm chart is also provided.
https://mr3docs.datamonad.com/docs/k8s/guide/run-metastore/
https://mr3docs.datamonad.com/docs/k8s/helm/run-metastore/
The documentation assumes MySQL, but we have tested it with PostgreSQL as well.
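For reference, the part of hive-site.xml that points the Metastore at an existing PostgreSQL database might look like the sketch below. The host, database name, and credentials are placeholders to replace with your own, and the PostgreSQL JDBC driver jar must be on the Metastore's classpath:

```
<configuration>
  <!-- JDBC connection to the existing PostgreSQL database -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://postgres-service:5432/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```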
The original Spark distribution supports several cluster managers, such as YARN, Mesos, Spark Standalone, and Kubernetes.
I can't find what is under the hood in Databricks Spark: which cluster manager does it use, and is it possible to change it?
What is the Databricks Spark architecture?
Thanks.
You can't check the cluster manager in Databricks, and you really don't need to, because this part is managed for you. You can think of it as a kind of standalone cluster, but there are differences. The general Databricks architecture is shown here.
You can change the cluster configuration by different means - init scripts, configuration parameters, etc. See the documentation for more details.
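As an illustration of the configuration-parameters route, the cluster's Spark config field accepts one key-value pair per line; the specific keys below are just examples, not required settings:

```
spark.databricks.io.cache.enabled true
spark.sql.shuffle.partitions 64
```

Init scripts, by contrast, are shell scripts that run on every node at cluster startup and are better suited for installing libraries or OS packages.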
I am new to Azure HDInsight.
I am trying to install Presto on an HDInsight cluster.
As a test, I want to run the TPC-H queries on it. Here is what I have done so far:
I loaded the TPC-H tables into Hive.
I am able to run queries from the Hive CLI.
I am able to run a show tables query from the Presto CLI.
I am not able to run queries such as select count(*) from region; they fail with the error message Query 20200605_074052_00011_6etih failed: cannot create caching file system.
When I submit the show tables query from the Presto CLI, I get the messages below.
Query 20200605_074050_00010_6etih, FINISHED, 5 nodes
Splits: 70 total, 70 done (100.00%)
0:00 [8 rows, 326B] [27 rows/s, 1.08KB/s]
I have barely touched the Hadoop settings such as hdfs-site.xml or core-site.xml, and Presto's configuration contains nothing but memory settings.
Any help would be appreciated. Thanks for reading it.
You can install Starburst Presto from the HDInsight marketplace.
Read more: https://azure.microsoft.com/pl-pl/blog/azure-hdinsight-and-starburst-brings-presto-to-microsoft-azure-customers/
However, Starburst does not provide an updated version of this solution and recommends a Kubernetes-based solution (e.g. using Azure AKS) instead. See https://docs.starburstdata.com/latest/installation/azure.html
Disclaimer: I am from Starburst.
I have been researching how to configure the Spark JobServer backend (SharedDb) with Cassandra.
I saw in the SJS documentation that Cassandra is cited as one of the shared DBs that can be used.
Here is the documentation part:
Spark Jobserver offers a variety of options for backend storage such as:
H2/PostgreSQL or other SQL databases
Cassandra
Combination of SQL DB or Zookeeper with HDFS
But I didn't find any configuration example for this.
Would anyone have an example? Or can help me to configure it?
Edited:
I want to use Cassandra to store the metadata and jobs from Spark JobServer, so that I can hit any of the servers through a proxy in front of them.
Cassandra was supported in previous versions of Jobserver. You just needed to have Cassandra running, add the correct settings to your Jobserver configuration file (https://github.com/spark-jobserver/spark-jobserver/blob/0.8.0/job-server/src/main/resources/application.conf#L60), and specify spark.jobserver.io.JobCassandraDAO as the DAO.
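For those older versions, the relevant overrides in the Jobserver configuration file might look like the sketch below. The hosts and credentials are placeholders, and the exact key names inside the cassandra block should be checked against the linked application.conf for your Jobserver version:

```
spark.jobserver {
  # Switch the DAO implementation to Cassandra
  jobdao = spark.jobserver.io.JobCassandraDAO

  # Cassandra connection settings (placeholders; verify the key
  # names against the application.conf of your Jobserver release)
  cassandra {
    hosts = ["cassandra-host:9042"]
    user = "jobserver"
    password = "secret"
  }
}
```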
But the Cassandra DAO was recently deprecated and removed from the project, because it was not really used or maintained by the community.
I am trying to get Zeppelin to work, but when I run a notebook twice, the second time it fails due to Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient (full log at the end of the post).
It seems to be caused by the lock in the metastore not being removed. It is also advised to use, for example, Postgres as the backing database, since that allows multiple users to run jobs in Zeppelin.
I created a Postgres DB and a hive-site.xml pointing to this DB. I added this file to the config folder of Zeppelin, and also to the config folder of Spark. In the JDBC interpreter of Zeppelin I also added parameters similar to the ones in hive-site.xml.
The problem persists, though.
Error log: http://pastebin.com/Jqf9cdtU
hive-site.xml: http://pastebin.com/RZdXHPX4
Try using the Thrift server architecture in your Spark setup instead of working on a single-JVM instance of Hive, where you cannot create multiple sessions.
There are mainly three types of connection to Hive:
Single JVM - the metastore is stored locally in the warehouse, which doesn't allow multiple sessions
Multiple JVMs - each worker behaves as a metastore
Thrift server architecture - multiple users can access the SQL engine, and parallelism can be achieved
Another instance of Derby may have already booted the database
By default, Spark uses Derby as the metadata store, which can only serve one user. It seems you start multiple Spark interpreters; that's why you see the above error message. So here are two solutions for you:
Disable Hive in the Spark interpreter by setting zeppelin.spark.useHiveContext to false, if you don't need Hive.
Set up a Hive metastore that supports multiple users. Refer to https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_hive_metastore_configure.html
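Once such a shared metastore service is running, a hive-site.xml fragment like this sketch (the host name is a placeholder) points Spark and Zeppelin at it instead of letting each interpreter boot its own embedded Derby instance:

```
<configuration>
  <!-- Connect to a remote, shared metastore service rather than an
       embedded Derby database, so multiple sessions can coexist -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
```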
Stop Zeppelin. Go to the bin folder of your Apache Zeppelin installation and try deleting metastore_db:
sudo rm -r metastore_db/
Start Zeppelin again and try now.
I just set up a Spark cluster in Google Cloud using Dataproc, and I have a standalone installation of Cassandra running on a separate VM. I would like to install the DataStax spark-cassandra-connector so I can connect to Cassandra from Spark. How can I do this?
The connector can be downloaded here:
https://github.com/datastax/spark-cassandra-connector
The instructions on building are here:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/12_building_and_artifacts.md
sbt is needed to build it.
Where can I find sbt for the DataProc installation?
Would it be under $SPARK_HOME/bin? Where is spark installed for DataProc?
I'm going to follow up on the really helpful comment @angus-davis made not too long ago.
Where can I find sbt for the DataProc installation?
At present, sbt is not included on Cloud Dataproc clusters. The sbt documentation contains information on how to install sbt manually. If you need to install sbt on your clusters repeatedly, I highly recommend creating an init action that installs sbt when you create a cluster. After some research, it looks like sbt is covered under a BSD-3 license, which means we can probably (no promises) include it in Cloud Dataproc clusters.
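A minimal init action along those lines might look like the sketch below. The sbt version and the download URL scheme are assumptions to verify against the sbt release page before use:

```shell
#!/bin/bash
# Hypothetical Dataproc init action: install sbt on every cluster node.
set -euxo pipefail

SBT_VERSION=1.9.9
# Download and unpack the official sbt tarball (check the sbt
# download page for the current URL scheme and version).
curl -fsSL "https://github.com/sbt/sbt/releases/download/v${SBT_VERSION}/sbt-${SBT_VERSION}.tgz" \
  -o /tmp/sbt.tgz
tar -xzf /tmp/sbt.tgz -C /opt
ln -sf /opt/sbt/bin/sbt /usr/local/bin/sbt
```

The script would be uploaded to a GCS bucket and referenced via the --initialization-actions flag of gcloud dataproc clusters create.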
Would it be under $SPARK_HOME/bin? Where is spark installed for DataProc?
The answer to this depends on what you mean:
binaries - /usr/bin
config - /etc/spark/conf
spark_home - /usr/lib/spark
Importantly, this same pattern is used for other major OSS components installed on Cloud Dataproc clusters, like Hadoop and Hive.
I would like to install the Datastax spark-cassandra connector so I can connect to Cassandra from spark. How can I do this ?
The Stack Overflow answer Angus sent is probably the easiest way, if the connector can be used as a Spark package. Based on what I can find, however, this is probably not an option. This means you will need to install sbt and build and install the connector manually.
You can use Cassandra along with the mentioned jar and connector from DataStax. You can simply download the jar and pass it to the Dataproc cluster. You can find the Google-provided template I contributed to at this link [1]. It explains how you can use the template to connect to Cassandra using Dataproc.
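As a sketch of the "download the jar and pass it" step, a job submission might look like the following. The cluster name, region, bucket paths, and application class are placeholders, while spark.cassandra.connection.host is the connector's standard setting for the Cassandra contact point:

```shell
# Hypothetical example: submit a Spark job on Dataproc with the
# Cassandra connector jar attached and pointed at the Cassandra VM.
gcloud dataproc jobs submit spark \
  --cluster=my-cluster \
  --region=us-central1 \
  --class=com.example.MyApp \
  --jars=gs://my-bucket/spark-cassandra-connector-assembly.jar,gs://my-bucket/my-app.jar \
  --properties=spark.cassandra.connection.host=CASSANDRA_VM_IP
```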