Sparklyr not connecting to my hive warehouse - apache-spark

I'm doing a very silly thing and trying to install a Yarn/Hive/Spark/R platform from scratch, not using Hortonworks or Cloudera. I've gotten many pieces figured out but am stuck trying to get my sparklyr to connect to my Hive warehouse.
I am using RStudio on one machine and connecting to a yarn-client located on a separate cluster. I've put hive-site.xml pretty much everywhere: the local $SPARK_HOME/conf and each of the Hadoop nodes' $SPARK_HOME/conf and $HADOOP_CONF_DIR. In hive-site.xml I've included the param:
<property>
<name>spark.sql.warehouse.dir</name>
<value>hdfs://<driver node>/user/hive/warehouse/</value>
<description>The location of the Hive warehouse</description>
</property>
I feel that should make it pretty clear that I'm trying to use Hive, but when I run this code:
DBI::dbGetQuery(sc, "CREATE DATABASE test")
DBI::dbGetQuery(sc, "use test")
iris_spark_table <- copy_to(sc, iris, overwrite = TRUE)
sdf_copy_to(sc, iris_spark_table)
DBI::dbGetQuery(sc, "create table iris_hive as SELECT * FROM iris_spark_table")
I get this error:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.io.IOException:
Mkdirs failed to create file:/<my-r-code's-working-dir>/spark-warehouse/test.db/iris_hive/.hive-staging_hive_2018-08-05_14-18-58_646_6160231583951115949-1/-ext-10000/_temporary/0/_temporary/attempt_20180805141859_0013_m_000000_3
(exists=false, cwd=file:/tmp/hadoop-hadoop/nm-local-dir/usercache/dzafar/appcache/application_1533357216333_0015/container_1533357216333_0015_01_000002)
What am I missing??? Thanks in advance!!!

First of all, Spark-specific properties should be placed in Spark configuration files. That means you should put
spark.sql.warehouse.dir
in $SPARK_HOME/conf/spark-defaults.conf
Additionally, you might have a problem with hdfs-site.xml not being present on the search path.
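Concretely, that means spark-defaults.conf would gain a line like the following (keeping the question's elided <driver node> placeholder; spark-defaults.conf separates key and value with whitespace):

```
spark.sql.warehouse.dir    hdfs://<driver node>/user/hive/warehouse/
```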

Related

How can I show Hive tables using pyspark

Hello, I created a Spark HDInsight cluster on Azure and I'm trying to read Hive tables with pyspark, but the problem is that it shows me only the default database.
Does anyone have an idea?
If you are using HDInsight 4.0, Spark and Hive no longer share metadata.
By default you will not see Hive tables from pyspark; this is a problem I describe in this post: How save/update table in hive, to be readable on spark.
But, anyway, things you can try:
If you only want to test on the head node, you can change hive-site.xml: set the property "metastore.catalog.default" to hive, then open pyspark from the command line.
If you want to apply the change to all cluster nodes, it needs to be made in Ambari:
Log in to Ambari as admin
Go to Spark2 > Configs > hive-site-override
Again, update the property "metastore.catalog.default" to the value hive
Restart all required services from the Ambari panel
These changes set the Hive metastore catalog as the default.
You can now see Hive databases and tables, but depending on the table structure, you may not see the table data properly.
If you have created tables in other databases, try show tables from database_name, replacing database_name with the actual name.
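In hive-site.xml, the change described above would look roughly like this (property name as given in the answer; the file's exact location depends on the installation):

```xml
<property>
  <name>metastore.catalog.default</name>
  <value>hive</value>
</property>
```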
You are missing the details of the Hive server in your SparkSession. If you haven't added any, it will create and use the default database to run Spark SQL.
If you've added the configuration details for spark.sql.warehouse.dir and spark.hadoop.hive.metastore.uris in the Spark default conf file, then add enableHiveSupport() while creating the SparkSession.
Otherwise, add the configuration details while creating the SparkSession:
.config("spark.sql.warehouse.dir","/user/hive/warehouse")
.config("hive.metastore.uris","thrift://localhost:9083")
.enableHiveSupport()
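Put together, the builder chain would look like the sketch below. The warehouse path and metastore URI are placeholders, not values taken from the question. Only the helper that assembles the .config() pairs runs here; the commented section shows where they plug into pyspark's SparkSession.builder:

```python
# Sketch of the SparkSession settings described above. The warehouse path and
# metastore URI are placeholders -- substitute your own cluster's values.

def hive_session_conf(warehouse_dir, metastore_uri):
    """Key/value pairs to pass via .config() before calling enableHiveSupport()."""
    return {
        "spark.sql.warehouse.dir": warehouse_dir,
        "hive.metastore.uris": metastore_uri,
    }

# With pyspark installed, the pairs are applied like this:
#
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("hive-example")
#   for key, value in hive_session_conf("/user/hive/warehouse",
#                                       "thrift://localhost:9083").items():
#       builder = builder.config(key, value)
#   spark = builder.enableHiveSupport().getOrCreate()

conf = hive_session_conf("/user/hive/warehouse", "thrift://localhost:9083")
print(conf["hive.metastore.uris"])
```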

Spark SQL cannot access Spark Thrift Server

I cannot configure Spark SQL so that I can access the Hive table in the Spark Thrift Server (without using JDBC, but natively from Spark).
I use single configuration file conf/hive-site.xml for both Spark Thrift Server and Spark SQL. I have javax.jdo.option.ConnectionURL property set to jdbc:derby:;databaseName=/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db;create=true. I also set spark.sql.warehouse.dir property to absolute path pointing to spark-warehouse directory. I run Thrift server with ./start-thriftserver.sh and I can observe that embedded Derby database is being created with metastore_db directory. I can connect with beeline, create a table and see spark-warehouse directory created with subdirectory for table. So at this stage it's fine.
I launch pyspark shell with Hive support enabled ./bin/pyspark --conf spark.sql.catalogImplementation=hive, and try to access the Hive table with:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql('show tables')
I got errors like:
ERROR XJ040: Failed to start database
'/home/user/spark-2.4.0-bin-hadoop2.7/metastore_db' with class loader
sun.misc.Launcher$AppClassLoader#1b4fb997
ERROR XSDB6: Another instance of Derby may have already booted the
database /home/user/spark-2.4.0-bin-hadoop2.7/metastore_db
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
Apparently Spark is trying to create a new Derby database instead of using the metastore I put in the config file. If I stop the Thrift Server and run only Spark, everything is fine. How can I fix it?
Is an embedded Derby metastore database fine for having both the Thrift Server and Spark access one Hive, or do I need to use e.g. MySQL? I don't have a cluster and do everything locally.
An embedded Derby metastore database is fine for local use, but it only allows a single JVM to open the database at a time, which is why the Thrift Server and the pyspark shell cannot both use it. For a production environment it is recommended to use another metastore database.
Yes, you can definitely use MySQL as the metastore. For this, you have to make an entry in hive-site.xml.
You can follow the configuration guide at Use MySQL for the Hive Metastore for the exact details.
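The hive-site.xml entry for a MySQL-backed metastore would look roughly like this (host, database name, user, and password are placeholders, not values from the question):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```

The MySQL JDBC driver jar also needs to be on the classpath of whatever process talks to the metastore.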

How to specify which hive metastore to connect to?

Going back a few versions of Spark it used to be required to put the
hive-site.xml
in the $SPARK_HOME/conf directory. Is that still the case?
The motivation for this question: we are unable to see hive tables that are defined within the metastore instance for which we did copy the hive-site.xml to the conf dir.
I have verified that hive-site.xml is still used. It is picked up from Spark's classpath. This may be set up via
export SPARK_CLASSPATH=/path/to/conf/dir

How can I access a CFS URL from a remote non-DSE (DataStax) node

I am trying to do the following from my program:
val file = sc.textFile("cfs://ip/.....")
but I get a java.io.IOException: No FileSystem for scheme: cfs exception...
How should I modify core-site.xml, and where? Should it be on the DSE nodes, or should I add it as a resource in my jar?
I use Maven to build my jar and execute the jobs remotely... from a non-DSE node which does not have Cassandra, Spark, or anything similar. Other types of flows without CFS files work OK, so the jar is fine so far...
Thanks!
There is some info in the middle of this page about Spark using Hadoop for some operations, such as CFS access: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkCassProps.html
I heard about a problem using Hive from a non-DSE node that was solved by adding a property to core-site.xml. This is really a long shot since it's Spark, but if you're willing to experiment, try adding the IP address of the remote machine to the core-site.xml file:
<property>
<name>cassandra.host</name>
<value>192.168.2.100</value>
</property>
Find the core-site.xml in /etc/dse/hadoop/conf/ or install_location/resources/hadoop/conf/, depending on the type of installation.
I assume you started the DSE cluster in hadoop and spark mode: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkStart.html
It's been quite some time.
The integration is done as usual for any integration of a Hadoop client with a compatible Hadoop filesystem.
Copy core-site.xml (append dse-core-default.xml there) along with dse.yaml and cassandra.yaml, and then it requires a proper dependency set-up on the classpath, e.g. dse.jar, cassandra-all, etc.
Note: this is not officially supported, so better to use another way.

Has anyone tried to use Shark/Spark on DataStax Enterprise?

I've been trying to achieve this without success. I tried to use the Hive distribution included with DSE together with Shark; however, Shark ships a patched-up, older version of Hive (0.9, I believe), which makes Shark execution impossible due to incompatibilities. I also tried to use the patched-up Hive version from Shark instead of DSE's, recycling the DSE Hive configuration (in order to make CFS available to Shark's Hive distribution), only to discover a long list of dependencies from the full DSE classpath (Hive, Cassandra, Hadoop, etc.).
It is possible to achieve this with C* by following the instructions on this blog.
Am I being stubborn by trying to use CFS? Is there a way with or without CFS on dse?
Thanks!
Here are some shark-env.sh highlights:
export HIVE_HOME="/home/cassserv/hive-0.9.0-bin/" #choosing this when using hive distro.
#export HIVE_HOME="/usr/share/dse/hive/" #choosing this when using dse distro.
export HIVE_CONF_DIR="/home/cassserv/hive-0.9.0-bin/conf" #edited dse hive-site.xml conf file
#export HIVE_CONF_DIR="/etc/dse/hive" #original dse hive-site.xml conf file
Edited hive-site.xml highlights:
<property>
<name>hive.hwi.war.file</name>
<!--<value>lib/hive-hwi.war</value>-->
<value>lib/hive-hwi-0.9.0-shark-0.8.1.war</value><!--edited to use sharks distro-->
<description>This sets the path to the HWI war file, relative to ${HIVE_HOME}</description>
</property>
<property>
<name>hadoop.bin.path</name>
<!--<value>${dse.bin}/dse hadoop</value>-->
<value>/usr/share/dse hadoop</value><!--edited to override variable-->
</property>
Here's Shark's output while trying to use Shark's patched Hive distro with DSE's Hive configuration. The missing class is in the dse.jar file:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStore class not found)
I'm trying to figure out if I can do something like this in the edited hive-site.xml:
<property>
<name>fs.cfs.impl</name>
<value>org.apache.cassandra.hadoop.fs.CassandraFileSystem</value>
</property>
<property>
<name>hive.metastore.rawstore.impl</name>
<!--<value>com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStore</value>--> <value>org.apache.hadoop.hive.metastore.ObjectStore</value>
<description>Use the Apache Cassandra Hive RawStore implementation</description>
</property>
in order to remove any dependency on the DSE libraries. Also, I might not use DSE's Hadoop distro.
DSE 4.5 has Spark and Shark 0.9 integrated. You don't need to set up anything; it works out of the box the same way Pig/Hive worked before.
