I am trying to create a locally persisted spark warehouse database that will be present/loaded/accessible to future spark sessions created by the same application.
I have configured the spark session conf with:
.config("spark.sql.warehouse.dir", "C:/path/to/my/long/lived/mock-hive")
When I create the databases, I see the mock-hive folder get created, and underneath it a folder for each of the two distinct databases that I create: db1.db and db2.db
However, these folders are EMPTY after the session completes, despite the databases being successfully created and subsequently queried in the run that stands them up.
On a subsequent run with the same spark session configuration, if I call baseSparkSession.catalog.listDatabases().collect() I only see the default database. The two I created did not persist into the second spark session.
What is the trick to get these local persisted databases to be available to read in subsequent execution?
I've noticed that the *.db folders under spark.sql.warehouse.dir are empty after creation, which might have something to do with it...
Spark Version: 3.0.1
Turns out spark.sql.warehouse.dir is not where the local database catalog is stored... that lives in the Derby database stored in metastore_db. To relocate it, you need to change a system property:
System.setProperty("derby.system.home", derbyPath)
I didn't even have to set spark.sql.warehouse.dir, just relocate the derbyPath to a common location all spark sessions use.
NOTE - You don't need to specify the "metastore_db" portion of the derbyPath; it will be appended to the location automatically.
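For reference, a minimal sketch of what that looks like when building the session. The paths and app name are just examples, and it assumes Hive support is enabled so Spark actually uses the embedded Derby metastore:

import org.apache.spark.sql.SparkSession

// Example path only - point it wherever you want the metastore to live.
// Derby will create/open a "metastore_db" folder underneath it.
val derbyPath = "C:/path/to/my/long/lived/spark-metastore"
System.setProperty("derby.system.home", derbyPath)

// Build the session AFTER setting the property, so the embedded Derby
// metastore ends up under derbyPath instead of the current working directory.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("persistent-catalog-example")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS db1")
spark.catalog.listDatabases().show()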
Related
We are setting up a spark cluster using the standalone deploy method. Master and all workers are looking at a shared (networked) file system. (This is a cluster that is spun up every now and then to do heavy lifting data wrangling, no need for the (beautiful but intense) HDFS).
The services are running as user spark with group spark. My user is a member of group spark. When I start a session, an application gets created on the cluster. That application can read any file on the shared file system that is readable by the group spark.
But when I write a file to it, in this setup (for instance: orders.write.parquet("file:///srv/spark-data/somefile.parquet")), the various steps are performed by different users - depending on which service in the application is performing it.
It seems that the directory is created by my user. Then the spark user writes files to it (in _temporary). And then my user gets to move these temporary files to their final destination.
And this is where it goes wrong. These temporary files only have read access for the group spark. Therefore my user cannot move them across to the permanent place.
I have not yet found a solution to either a) have all workers run under my user account, or b) give these temporary files group read + write permissions.
My current work-around is to create my session as user spark. That works fine, of course, but is not ideal for obvious reasons.
I work on a Spark 2.1 application that also uses SparkSQL and saves data with dataframe.write.saveAsTable(tbl). My understanding is that an in-memory Derby DB is used for the Hive metastore (right?). This means that a table that I create in the first execution is not available in any subsequent executions. In many cases that might be the intended behaviour - but I would like to persist the metastore across executions (since this is also the behavior I have in my production system).
So, a simple question: How can I change the configuration to persist the metastore on disc?
One remark: I am not starting the Spark job with spark-shell or spark-submit, but as a standalone Scala application.
It is already persisted on disk. As long as both sessions use the same working directory (or the same explicit metastore configuration), the permanent table will be persisted between sessions.
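For illustration, a hedged sketch of what "same configuration" can look like in a standalone Scala app (the warehouse path is a placeholder):

import org.apache.spark.sql.SparkSession

// Sketch only: the embedded Derby metastore ("metastore_db") is created in
// the process working directory unless you relocate it (e.g. via the
// derby.system.home system property), so either launch every run from the
// same directory or pin that location explicitly.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("standalone-app")
  .config("spark.sql.warehouse.dir", "/data/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()

// First execution:
// df.write.saveAsTable("tbl")

// Any later execution that reuses the same metastore and warehouse:
// val tbl = spark.table("tbl")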
I am trying to get Zeppelin to work. But when I run a notebook twice, the second time it fails due to Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient. (full log at the end of the post)
It seems to be due to the fact that the lock on the metastore doesn't get removed. It is also advised to use, for example, Postgres for the metastore instead of the default embedded Derby, as that allows multiple users to run jobs in Zeppelin.
I made a postgres DB and a hive-site.xml pointing to this DB. I added this file into the config folder of Zeppelin but also into the config folder of Spark. Also, in the jdbc interpreter of Zeppelin I added parameters similar to the ones in the hive-site.xml.
The problem persists though.
Error log: http://pastebin.com/Jqf9cdtU
hive-site.xml: http://pastebin.com/RZdXHPX4
Try using the Thrift server architecture in the Spark setup instead of working on a single-instance JVM of Hive, where you cannot run multiple sessions.
There are mainly three types of connection to Hive:
Single JVM - metastore stored locally in the warehouse, which doesn't allow multiple sessions
Multiple JVM - where each worker behaves as a metastore
Thrift Server Architecture - multiple users can access the SQL engine and parallelism can be achieved
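As a rough sketch of the Thrift-server route (the host, port and user here are assumptions, and the Hive JDBC driver needs to be on the classpath): once the Spark Thrift Server is started with sbin/start-thriftserver.sh, multiple clients can connect to the same SQL engine and metastore over JDBC.

import java.sql.DriverManager

// Assumes the Spark Thrift Server is listening on the default port 10000
// and that hive-jdbc is on the classpath.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "zeppelin", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SHOW DATABASES")
while (rs.next()) println(rs.getString(1))
rs.close(); stmt.close(); conn.close()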
Another instance of Derby may have already booted the database
By default, Spark uses Derby as the metadata store, which can only serve one user. It seems you start multiple Spark interpreters; that's why you see the above error message. So here are 2 solutions for you:
Disable Hive in the Spark interpreter by setting zeppelin.spark.useHiveContext to false if you don't need Hive.
Set up a Hive metadata store which supports multiple users. Refer to this: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_hive_metastore_configure.html
Stop Zeppelin. Go to your bin folder in Apache Zeppelin and try deleting metastore_db
sudo rm -r metastore_db/
Start Zeppelin again and try now.
I have a question regarding best practices for managing permanent tables in Spark. I have been working previously with Databricks, and in that context, Databricks manages permanent tables so you do not have to 'create' or reference them each time a cluster is launched.
Let's say in a Spark cluster session, a permanent table is created with the saveAsTable command, using the option to partition the table. Data is stored in S3 as parquet files.
Next day, a new cluster is created and it needs to access that table for different purposes:
SQL query for exploratory analysis
ETL process for appending a new chunk of data
What is the best way to make the saved table available again as the same table, with the same structure/options/path? Maybe there is a way to store hive metastore settings to be reused between spark sessions? Or maybe each time a spark cluster is created, I should do CREATE EXTERNAL TABLE with the correct options to specify the format (parquet), the partitioning and the path?
Furthermore, if I want to access those parquet files from another application, e.g. Apache Impala, is there a way to store and retrieve hive metastore information, or does the table have to be created again?
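For concreteness, this is the kind of per-startup statement I mean, expressed with Spark's datasource syntax. The table name, S3 path and the assumption that a SparkSession called spark already exists are all placeholders:

// Hypothetical names/paths - just to make the "re-create on startup" option concrete.
spark.sql("""
  CREATE TABLE IF NOT EXISTS sales
  USING parquet
  LOCATION 's3a://my-bucket/warehouse/sales/'
""")
// If the data is partitioned on disk, register the partitions as well:
spark.sql("MSCK REPAIR TABLE sales")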
Thanks
I just set up a fresh Windows server with a fresh DataStax installation including Cassandra 1.2 and OpsCenter 2.1.3. I've tried finding solutions to these questions on the Cassandra wikis and the DataStax website, but I can only find Unix-specific information or DataStax API information.
Cassandra is defaulted to using C: drive (I was never asked to select a drive for cassandra during install).
In the same cassandra instance, can I have keyspaces on separate disks?
If not, how do I migrate the existing keyspace to the new drive? (Just reconfiguring cassandra.yaml to use a new directory would lose my opscenter data and may even break opscenter.)
If yes, how can I create a new keyspace on a separate drive? cassandra.yaml seems to only have configuration options for a single store location.
Should I be creating a new cluster to store my data in? If I start adding new nodes to the default cluster, that will mean the datastax opscenter data will be getting replicated - that seems like a bad idea.
If there is good documentation on this somewhere, please point me there.
Thanks,
Adam
You cannot get cassandra to split the keyspaces and store them in different directories. They are all stored under a common data directory that is specified in the cassandra.yaml file.
However, you can approximate this by using NTFS to mount different drives under the data directory on your server, but this will not be simple or expandable.
If you want to move where the data is stored on cassandra, then stop the cassandra daemon/service, change the cassandra.yaml file to store the data at a new location, then copy/move the entirety of the data directory to this new location. THEN start cassandra back up and it will work fine with the data in the new location. I have done this quite a few times now and cassandra comes back up without incident and no lost data (if you do not move the data, then it will lose it all and recreate the directory structure under the new location).
Data getting replicated is not a bad thing - it is what cassandra was designed for. I don't know what replication factor opscenter uses, but it does not store a massive amount of data so replication is not a problem.