Unable to write data to Hive using Spark - apache-spark

I am using Spark 1.6 and creating a HiveContext from the SparkContext. When I save the data into Hive it gives an error. I am using the Cloudera VM: Hive is inside the VM and Spark runs on my own system. I can access the VM via its IP. I have started the Thrift server and HiveServer2 on the VM, and I have used the Thrift server URI for hive.metastore.uris.
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.uris", "thrift://IP:9083")
............
............
df.write.mode(SaveMode.Append).insertInto("test")
I get the following error:
FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

hive-site.xml is probably not available inside the Spark conf folder; the details are below.
Add hive-site.xml to the Spark configuration folder by creating a symlink that points to hive-site.xml in the Hive configuration folder:
sudo ln -s /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/hive-site.xml
After the above steps, restarting spark-shell should help.
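Putting it together, a minimal sketch (Spark 1.6, Scala; the metastore URI, the DataFrame df, and the table name "test" are the placeholders from the question) that first checks the metastore connection from spark-shell before writing:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)                          // sc is the spark-shell SparkContext
hiveContext.setConf("hive.metastore.uris", "thrift://IP:9083") // remote metastore running on the VM
hiveContext.sql("SHOW TABLES").show()                          // succeeds only if the metastore is reachable
df.write.mode(SaveMode.Append).insertInto("test")              // df is the DataFrame built in the elided code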

Related

How to add hbase-site.xml config file using spark-shell

I have the following simple code:
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.HBaseConfiguration
val hbaseconfLog = HBaseConfiguration.create()
val connectionLog = ConnectionFactory.createConnection(hbaseconfLog)
I'm running this in spark-shell, and I'm getting the following error:
14:23:42 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected
error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:30)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
I get many of these errors, and every now and then a few of these:
14:23:46 WARN client.ZooKeeperRegistry: Can't retrieve clusterId from
Zookeeper org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
On Cloudera's VM I am able to solve this by simply restarting the hbase-master, regionserver, and thrift services, but here at my company I am not allowed to do that. I also solved it once by copying hbase-site.xml to the Spark conf directory, but I can't do that either. Is there a way to set the path to this specific file in the spark-shell parameters?
1) Make sure that your ZooKeeper is running.
2) Copy hbase-site.xml to the /etc/spark/conf folder, just like we copy hive-site.xml to /etc/spark/conf to access Hive tables.
3) export SPARK_CLASSPATH=/a/b/c/hbase-site.xml:/d/e/f/hive-site.xml
just as described in the Hortonworks forum.
Or, open spark-shell without adding hbase-site.xml and execute these three commands in spark-shell:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/home/spark/development/hbase/conf/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, table_name) // table_name: the HBase table to read
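Since the question asks specifically about spark-shell parameters, another option is sketched below, assuming the directory holding hbase-site.xml is readable from both the driver and the executors (the path is the one from the answer above): put that directory on the classpath when launching the shell, so that HBaseConfiguration.create() picks the file up automatically.

spark-shell --driver-class-path /home/spark/development/hbase/conf --conf spark.executor.extraClassPath=/home/spark/development/hbase/conf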

Spark - yarn master but dataset on different hdfs cluster

I wish to run Spark on one HDFS cluster (the YARN master) but access a dataset on another HDFS cluster.
Both HDFS clusters are Kerberized, and the same ID has access on both.
Steps:
Set up the environment for the first HDFS cluster, then:
spark-shell --master yarn-client
sc.textFile("hdfs://[second hdfs cluster]/[dataset there]")
res0.count() gives:
......
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN,KERBEROS]
.....
Is what I am trying even possible? If so, any suggestions to fix it?
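This should be possible when both clusters accept the same Kerberos credentials. A direction worth trying is sketched below: Spark on YARN exposes spark.yarn.access.namenodes for requesting delegation tokens for additional namenodes, and the cluster URI and dataset path are the placeholders from the question.

spark-shell --master yarn-client --conf spark.yarn.access.namenodes=hdfs://[second hdfs cluster]

Then, inside the shell:

val data = sc.textFile("hdfs://[second hdfs cluster]/[dataset there]") // read from the second cluster
data.count()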

Connect to Spark running on VM

I have a Spark environment running on Ubuntu 16.2 over VirtualBox. It's configured to run locally, and when I start Spark with
./start-all
I can access it on the VM via the web UI using the URL: http://localhost:8080
From the host machine (Windows), I can access it too using the VM IP: http://192.168.x.x:8080.
The problem appears when I try to create a context from my host machine. I have a project in Eclipse that uses Maven, and I try to run the following code:
ConfigLoader.masterEndpoint = "spark://192.168.1.132:8080"
val conf = new SparkConf().setMaster(ConfigLoader.masterEndpoint).setAppName("SimpleApp")
val sc = new SparkContext(conf)
I got this error:
16/12/21 00:52:05 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.1.132:8080...
16/12/21 00:52:06 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 192.168.1.132:8080
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
I've tried changing the URL to:
ConfigLoader.masterEndpoint = "spark://192.168.1.132:7077"
Unsuccessfully.
Also, if I try to access the master URL directly via the web (http://localhost:7077 in the VM), I don't get anything. I don't know if that's normal.
What am I missing?
In your VM, go to the spark-2.0.2-bin-hadoop2.7/conf directory and create a spark-env.sh file using the command below.
cp spark-env.sh.template spark-env.sh
Open the spark-env.sh file in the vi editor and add the line below.
SPARK_MASTER_HOST=192.168.1.132
Stop and start Spark using stop-all.sh and start-all.sh. Now, in your program, you can set the master as below.
val spark = SparkSession.builder()
.appName("SparkSample")
.master("spark://192.168.1.132:7077")
.getOrCreate()
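Note that 8080 is only the standalone master's web UI port; the master itself listens on 7077, so the master URL must use 7077. For Spark 1.x code like the question's, a sketch of the equivalent SparkConf/SparkContext setup (same IP and app name as above) is:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://192.168.1.132:7077") // the master RPC port, not the 8080 web UI port
  .setAppName("SimpleApp")
val sc = new SparkContext(conf)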

New Spark StreamingContext fails with HDFS errors

I'm using DC/OS installed via Azure ACS, and I installed HDFS and Spark via the dcos tool with default options.
Creating a Spark StreamingContext gives:
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn1. Check your hdfs-site.xml file to ensure namenodes are configured properly.
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn2. Check your hdfs-site.xml file to ensure namenodes are configured properly.
Exception in thread "main" java.lang.IllegalArgumentException:
java.net.UnknownHostException: namenode1.hdfs.mesos
I expect I have to redeploy the Spark package with dcos package install --options=, but I can't figure out what hdfs.config-url should be. The docs at https://docs.mesosphere.com/1.7/usage/service-guides/spark/install/#hdfs seem out of date.
Yes, it is out of date. We'll fix that.
DC/OS HDFS now serves its config on http://hdfs.marathon.mesos:[port]/v1/connect
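With that endpoint, redeploying the Spark package might look like the sketch below. The exact JSON layout of the hdfs.config-url option is an assumption here; check the package's config schema for the current key names.

options.json (key nesting is an assumption):
{
  "hdfs": {
    "config-url": "http://hdfs.marathon.mesos:[port]/v1/connect"
  }
}

dcos package install spark --options=options.json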

DSE Spark Shell Authentication

I have a DSE 4.5 installation with Spark running. I need some help passing the username/password of the Cassandra cluster from the Spark shell.
I have added these properties to the conf/spark-defaults.conf file:
spark.cassandra.auth.username=user
spark.cassandra.auth.password=pass
And I start up my Spark shell using
dse spark
But I still see the error when I try sc.cassandraTable:
com.datastax.driver.core.exceptions.AuthenticationException: Authentication error on host /11.111.11.11:9042: Host /11.111.11.11:9042 requires authentication, but no authenticator found in Cluster configuration
at com.datastax.driver.core.AuthProvider$1.newAuthenticator(AuthProvider.java:38)
at com.datastax.driver.core.Connection.initializeTransport(Connection.java:138)
at com.datastax.driver.core.Connection.<init>(Connection.java:111)
at com.datastax.driver.core.Connection$Factory.open(Connection.java:432)
at com.datastax.driver.core.ControlConnection.tryConnect(ControlConnection.java:216)
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:171)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:79)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1104)
It looks like you can execute this command:
dse spark -Dcassandra.username=user -Dcassandra.password=pass
ref:
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/sec/secIntrnlAuth.html?scroll=secItrnlAuth__authentication-for-hadoop-tools
This worked for me:
dse -u cassandra -p cassandra spark
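If editing spark-defaults.conf is not an option, the same connector properties can also be set programmatically in a standalone application. A sketch, where the keyspace and table names are placeholders and the property names are those of the spark-cassandra-connector:

import com.datastax.spark.connector._   // provides sc.cassandraTable
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraAuthSample")
  .set("spark.cassandra.auth.username", "user")  // same properties as in spark-defaults.conf
  .set("spark.cassandra.auth.password", "pass")
val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("my_keyspace", "my_table") // should now authenticate against the cluster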
