How to run Spark SQL JDBC/ODBC server and pyspark at the same time? - apache-spark

I have a one-node deployment of Spark. I am running the JDBC/ODBC server on it, which works fine. However, if at the same time I use pyspark to save a table (df.write.saveAsTable()) I get a very long error message. I think the core part of it is this:
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /root/spark/bin/metastore_db.
Doing some research, I found that this is caused by Spark creating a new session that tries to start another instance of Derby, which causes the error. The solution offered is to shut down all other spark-shell processes; however, if I do that, the ODBC server stops running.
What can I do to have both running at the same time?

You might want to use the Derby network server instead of the default embedded version, so the metastore can be shared by multiple processes. Or you can use another database for the metastore, such as MySQL.
After installing the Derby network server, copy the Derby client JAR (derbyclient.jar) into the Spark jars directory and then edit conf/hive-site.xml with something like:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
</configuration>
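Once both the Thrift server and the pyspark session point at the shared network metastore, the write from the question should no longer collide with the running server. A minimal pyspark sketch, assuming the hive-site.xml above is picked up from the session's conf directory (the DataFrame and table name are made up for illustration):

from pyspark.sql import SparkSession

# Hive support makes the session read conf/hive-site.xml, so it talks to the
# shared Derby network (or MySQL) metastore instead of booting its own
# embedded Derby instance.
spark = (SparkSession.builder
         .appName("save-table-while-thriftserver-runs")
         .enableHiveSupport()
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").saveAsTable("example_table")  # hypothetical table name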

Related

How to disable Namenode web UI?

I want to disable the HDFS web UI at http://localhost:50070.
I tried to disable it with the config below; however, it is still accessible.
<property>
<name>dfs.webhdfs.enabled</name>
<value>false</value>
<description>Enable or disable webhdfs. Defaults to false</description>
</property>
That property controls WebHDFS, not the web UI.
You want dfs.namenode.http-bind-host set to 127.0.0.1 to make the server bind only to the loopback interface, meaning it is not externally available.
You must restart any Hadoop process after editing its configuration files.
If you use Apache Ambari or Cloudera Manager, it will prompt you to do this right away.
I would advise against doing this, though, since you need the UI to stay informed about the cluster's overall health if you are not using one of the management tools mentioned above.
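For example, the binding override suggested above would go in hdfs-site.xml roughly like this (the description text is my own wording):
<property>
  <name>dfs.namenode.http-bind-host</name>
  <value>127.0.0.1</value>
  <description>Bind the NameNode HTTP server to loopback only</description>
</property>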

Sparklyr not connecting to my hive warehouse

I'm doing a very silly thing and trying to install a Yarn/Hive/Spark/R platform from scratch, not using Hortonworks or Cloudera. I've gotten many pieces figured out but am stuck trying to get my sparklyr to connect to my Hive warehouse.
I am using RStudio on one machine and connecting to yarn-client on a separate cluster. I've put hive-site.xml pretty much everywhere: the local $SPARK_HOME/conf and each of the Hadoop nodes' $SPARK_HOME/conf and $HADOOP_CONF_DIR. In hive-site.xml I've included the property:
<property>
<name>spark.sql.warehouse.dir</name>
<value>hdfs://<driver node>/user/hive/warehouse/</value>
<description>The location of the hive warehouse</description>
</property>
I feel that should make it pretty clear that I'm trying to use Hive, but when I run this code:
DBI::dbGetQuery(sc, "CREATE DATABASE test")
DBI::dbGetQuery(sc, "use test")
iris_spark_table <- copy_to(sc, iris, overwrite = TRUE)
sdf_copy_to(sc, iris_spark_table)
DBI::dbGetQuery(sc, "create table iris_hive as SELECT * FROM iris_spark_table")
I get this error:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.io.IOException:
Mkdirs failed to create file:/<my-r-code's-working-dir>/spark-warehouse/test.db/iris_hive/.hive-staging_hive_2018-08-05_14-18-58_646_6160231583951115949-1/-ext-10000/_temporary/0/_temporary/attempt_20180805141859_0013_m_000000_3
(exists=false, cwd=file:/tmp/hadoop-hadoop/nm-local-dir/usercache/dzafar/appcache/application_1533357216333_0015/container_1533357216333_0015_01_000002)
What am I missing??? Thanks in advance!!!
First of all, Spark-specific properties should be placed in Spark configuration files. That means you should put
spark.sql.warehouse.dir
in $SPARK_HOME/conf/spark-defaults.conf.
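For example, reusing the placeholder path from the question, the line in spark-defaults.conf might look like:
spark.sql.warehouse.dir    hdfs://<driver node>/user/hive/warehouse/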
Additionally you might have a problem with hdfs-site.xml not being present on the search path.

Permission issue with spark yarn cluster

I have a Spark 1.5.1 process installed on an HDP 2.1 cluster, which uses Hadoop 2.4.0, and I'm having permission-denied issues when the driver tries to write into a given Hive table.
The user submitting the job is gvp_service, and during the job its workers are able to write with gvp_service permissions, but when interacting with the metastore I receive the following exception:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=yarn, access=WRITE, inode="/apps/hive/warehouse/gbic_video_video.db/gbic_video_video_raw_users/global_op_id=1000/service_type=1/day=2015-09-15":gvp_service:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:265)
Why is it using the yarn user for this task? Is it because I'm using Hadoop client 2.4.0?
Can you check which user the job is being submitted as? To interact with Hive you should have hive-site.xml in $SPARK_INSTALL_DIR/conf with contents like the following:
<configuration>
<property>
<name>hive.metastore.uris</name>
<!-- Ensure that the following statement points to the Hive Metastore URI in your cluster -->
<value>thrift://<METASTORE IP>:9083</value>
<description>URI for client to contact metastore server</description>
</property>
</configuration>
Hope this helps

How can I access a CFS URL from a remote non-DSE (DataStax) node?

I am trying to do the following from my program:
val file = sc.textFile("cfs://ip/.....")
but I get a java.io.IOException: No FileSystem for scheme: cfs exception.
How should I modify core-site.xml, and where? Should it be on the DSE nodes, or should I add it as a resource in my jar?
I use Maven to build my jar and execute the jobs remotely, from a non-DSE node that does not have Cassandra, Spark, or anything similar installed. Other kinds of flows that don't use CFS files work OK, so the jar itself seems fine so far.
Thanks!
There is some info in the middle of this page about Spark using Hadoop for some operations, such as CFS access: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkCassProps.html
I heard about a problem using Hive from a non-DSE node that was solved by adding a property to core-site.xml. This is really a long shot since it's Spark, but if you're willing to experiment, try adding the IP address of the remote machine to the core-site.xml file:
<property>
  <name>cassandra.host</name>
  <value>192.168.2.100</value>
</property>
Find the core-site.xml in /etc/dse/hadoop/conf/ or install_location/resources/hadoop/conf/, depending on the type of installation.
I assume you started the DSE cluster in hadoop and spark mode: http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkStart.html
It's been quite some time.
The integration is done as with any integration of a Hadoop client with a compatible Hadoop filesystem.
Copy core-site.xml (appending the contents of dse-core-default.xml to it) along with dse.yaml and cassandra.yaml, and then set up the proper dependencies on the classpath, e.g. dse.jar, cassandra-all, etc.
Note: this is not officially supported, so it's better to use another approach.

Has anyone tried to use Shark/Spark on DataStax Enterprise?

I've been trying to achieve this without success. I tried to use the Hive distribution included with DSE together with Shark; however, Shark ships with a patched-up, older version of Hive (0.9, I believe), which makes running Shark impossible due to incompatibilities. I also tried to use the patched Hive version from Shark instead of DSE's, recycling the DSE Hive configuration (in order to make CFS available to Shark's Hive distribution), only to discover a long list of dependencies on the full DSE classpath (Hive, Cassandra, Hadoop, etc.).
It is possible to achieve this with C* by following the instructions on this blog.
Am I being stubborn by trying to use CFS? Is there a way with or without CFS on dse?
Thanks!
Here are some shark-env.sh highlights:
export HIVE_HOME="/home/cassserv/hive-0.9.0-bin/" #choosing this when using hive distro.
#export HIVE_HOME="/usr/share/dse/hive/" #choosing this when using dse distro.
export HIVE_CONF_DIR="/home/cassserv/hive-0.9.0-bin/conf" #edited dse hive-site.xml conf file
#export HIVE_CONF_DIR="/etc/dse/hive" #original dse hive-site.xml conf file
Edited hive-site.xml highlights:
<property>
<name>hive.hwi.war.file</name>
<!--<value>lib/hive-hwi.war</value>-->
<value>lib/hive-hwi-0.9.0-shark-0.8.1.war</value><!--edited to use sharks distro-->
<description>This sets the path to the HWI war file, relative to ${HIVE_HOME}</description>
</property>
<property>
<name>hadoop.bin.path</name>
<!--<value>${dse.bin}/dse hadoop</value>-->
<value>/usr/share/dse hadoop</value><!--edited to override variable-->
</property>
Here's Shark's output while trying to use Shark's patched Hive distro with DSE's Hive configuration. The missing class is in the dse.jar file:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStore class not found)
I'm trying to figure out if I can do something like this in the edited hive-site.xml:
<property>
<name>fs.cfs.impl</name>
<value>org.apache.cassandra.hadoop.fs.CassandraFileSystem</value>
</property>
<property>
<name>hive.metastore.rawstore.impl</name>
<!--<value>com.datastax.bdp.hadoop.hive.metastore.CassandraHiveMetaStore</value>-->
  <value>org.apache.hadoop.hive.metastore.ObjectStore</value>
<description>Use the Apache Cassandra Hive RawStore implementation</description>
</property>
in order to remove any dependency on the DSE libraries. I also might not use DSE's Hadoop distro.
DSE 4.5 has Spark and Shark 0.9 integrated. You don't need to set up anything; it works out of the box, the same way Pig/Hive worked before.
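For completeness, the out-of-the-box flow on a DSE 4.5 node looks roughly like the following; this is a sketch from memory of the DSE 4.5 tooling, so check the DataStax start-up docs linked earlier for the exact commands:
dse cassandra -k   # start the node in Spark (Analytics) mode
dse spark          # open the Spark REPL against the cluster
dse shark          # open the Shark shell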
