How to add hbase-site.xml config file using spark-shell - apache-spark

I have the following simple code:
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.HBaseConfiguration
val hbaseconfLog = HBaseConfiguration.create()
val connectionLog = ConnectionFactory.createConnection(hbaseconfLog)
I'm running this in spark-shell, and I'm getting the following error:
14:23:42 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected
error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:30)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
Many of these errors, actually, and a few of these every now and then:
14:23:46 WARN client.ZooKeeperRegistry: Can't retrieve clusterId from
Zookeeper org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
On Cloudera's VM I'm able to solve this by simply restarting the hbase-master, regionserver, and thrift services, but here at my company I'm not allowed to do that. I also solved it once by copying the file hbase-site.xml to the Spark conf directory, but I can't do that either. Is there a way to set the path for this specific file in the spark-shell parameters?

1) Make sure that your ZooKeeper is running.
2) Copy hbase-site.xml to the /etc/spark/conf folder, just as we copy hive-site.xml to /etc/spark/conf to access Hive tables.
3) export SPARK_CLASSPATH=/a/b/c/hbase-site.xml:/d/e/f/hive-site.xml
just as described on the Hortonworks forum.
Or, open spark-shell without adding hbase-site.xml and execute these three commands in it:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/home/spark/development/hbase/conf/hbase-site.xml"))
conf.set(TableInputFormat.INPUT_TABLE, table_name)
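Alternatively, if copying files into /etc/spark/conf is not allowed, you can put the directory containing hbase-site.xml on the driver and executor classpath when launching the shell. A minimal sketch, assuming the file lives under /etc/hbase/conf (adjust the path for your cluster):
spark-shell \
  --driver-class-path /etc/hbase/conf \
  --conf spark.executor.extraClassPath=/etc/hbase/conf
With that directory on the classpath, HBaseConfiguration.create() should pick up hbase-site.xml automatically.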

Related

Setting port and hostname when using spark to connect to cassandra using datastax driver

I'm currently trying to connect to an Apache Cassandra database using Apache Spark (2.3.0, shell) and the Datastax driver (datastax:spark-cassandra-connector:2.3.0-s_2.11).
I'm using the --conf option at the command line, and when I try to run a database query it errors out saying that it can't open a native connection to 127.0.0.1:9042.
Step 1 (I'm running this command inside the folder where Spark is installed):
# ./bin/spark-shell --conf spark.cassandra-connection.host=localhost spark.cassandra-connection.native.port=32771 --packages datastax:spark-cassandra-connector:2.3.0-s_2.11
Step 2 (I'm running these steps in the scala> shell of Spark):
scala> import com.datastax.spark.connector._
scala> import org.apache.spark.sql.cassandra._
scala> val rdd = sc.cassandraTable("market", "markethistory")
scala> println(rdd.first)
Step 3 (it errors out):
java.io.IOException: Failed to open native connection to Cassandra at {127.0.0.1}:9042 +stacktrace
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [localhost/127.0.0.1:9042] Cannot connect)) +stacktrace
Additional notes:
Notice how it says port 9042 in the error.
I've also tried changing the host in the --conf option and that doesn't change the output of the error.
My main assumption is that I need to specify the host and port in Scala, but I'm unsure how; the Datastax documentation is all about their special Spark distro and doesn't seem to match up.
Things I've tried:
spark.cassandra-connection.port=32771
spark.cassandra.connection.port=32771
spark.cassandra.connection.host=localhost
Thanks in advance.
The answer was twofold:
The property names are indeed spark.cassandra.connection.host and spark.cassandra.connection.port, not spark.cassandra-connection.*
--conf has to come after --packages
Thanks to @user8371915 for the connection-string difference.
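Putting it together, a corrected launch command could look like this (a sketch reusing the host and port from the question; adjust them for your cluster):
./bin/spark-shell \
  --packages datastax:spark-cassandra-connector:2.3.0-s_2.11 \
  --conf spark.cassandra.connection.host=localhost \
  --conf spark.cassandra.connection.port=32771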

Unable to write data on hive using spark

I am using Spark 1.6 and creating a HiveContext from the Spark context. When I save the data into Hive it gives an error. I am using the Cloudera VM: Hive is inside the VM and Spark is on my system, and I can access the VM by IP. I have started the thrift server and hiveserver2 on the VM and used the thrift server URI for hive.metastore.uris.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.uris", "thrift://IP:9083")
............
............
df.write.mode(SaveMode.Append).insertInto("test")
I get the following error:
FAILED: SemanticException java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
Probably hive-site.xml is not available inside the Spark conf folder; I have added the details below.
Add hive-site.xml to the Spark configuration folder by creating a symlink that points to hive-site.xml in the Hive configuration folder:
sudo ln -s /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/hive-site.xml
After the above steps, restarting spark-shell should help.
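Once the shell is restarted, a quick way to confirm the metastore is reachable is to run a trivial query against it (a sketch, assuming Spark 1.6 as in the question):
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
hiveContext.sql("SHOW TABLES").show()
If the table list comes back without the SemanticException, the insertInto("test") call from the question should work as well.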

Connect to Spark running on VM

I have a Spark environment running on Ubuntu 16.2 in VirtualBox. It's configured to run locally, and when I start Spark with
./start-all
I can access it on the VM via the web UI using the URL http://localhost:8080.
From the host machine (Windows), I can also access it using the VM's IP: http://192.168.x.x:8080.
The problem appears when I try to create a context from my host machine. I have a project in Eclipse that uses Maven, and I try to run the following code:
import org.apache.spark.{SparkConf, SparkContext}
ConfigLoader.masterEndpoint = "spark://192.168.1.132:7077"
val conf = new SparkConf().setMaster(ConfigLoader.masterEndpoint).setAppName("SimpleApp")
val sc = new SparkContext(conf)
I got this error:
16/12/21 00:52:05 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.1.132:8080...
16/12/21 00:52:06 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master 192.168.1.132:8080
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
I've tried changing the URL to:
ConfigLoader.masterEndpoint = "spark://192.168.1.132:7077"
Unsuccessfully.
Also, if I try to access the master URL directly via the web (http://localhost:7077 in the VM), I don't get anything. I don't know if that's normal.
What am I missing?
In your VM, go to the spark-2.0.2-bin-hadoop2.7/conf directory and create the spark-env.sh file using the command below.
cp spark-env.sh.template spark-env.sh
Open the spark-env.sh file in the vi editor and add the line below.
SPARK_MASTER_HOST=192.168.1.132
Stop and start Spark using stop-all.sh and start-all.sh. Now in your program you can set the master like below.
val spark = SparkSession.builder()
.appName("SparkSample")
.master("spark://192.168.1.132:7077")
.getOrCreate()
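For completeness, the stop/start step above can be run from the Spark home directory on the VM; a minimal sketch, assuming the default sbin layout:
./sbin/stop-all.sh
./sbin/start-all.sh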

error when starting the spark shell

I just downloaded the latest version of spark and when I started the spark shell I got the following error:
java.net.BindException: Failed to bind to: /192.168.1.254:0: Service 'sparkDriver' failed after 16 retries!
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
...
...
java.lang.NullPointerException
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:193)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:71)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at org.apache.spark.repl.SparkILoop.createSQLContext(SparkILoop.scala:1028)
at $iwC$$iwC.<init>(<console>:9)
...
...
<console>:10: error: not found: value sqlContext
import sqlContext.implicits._
^
<console>:10: error: not found: value sqlContext
import sqlContext.sql
^
Is there something that I missed in setting up spark?
Try setting the Spark env variable SPARK_LOCAL_IP to a local IP address.
In my case, I was running Spark on an Amazon EC2 Linux instance. spark-shell stopped working, with an error message similar to yours. I was able to fix it by adding a setting like the following to the Spark config file spark-env.sh.
export SPARK_LOCAL_IP=172.30.43.105
You could also set it in ~/.profile or ~/.bashrc.
Also check the host settings in /etc/hosts.
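If you only need it for a single session, the variable can also be set inline when launching the shell; a sketch using the same example address (substitute your own local IP):
SPARK_LOCAL_IP=172.30.43.105 ./bin/spark-shell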
See SPARK-8162.
It looks like it only affects 1.4.1 and 1.5.0 - you're probably best off running the latest unaffected release (1.4.0 at time of writing).
I was experiencing the same issue. First, go to .bashrc and put
export SPARK_LOCAL_IP=172.30.43.105
then go to
cd $HADOOP_HOME/bin
and run the following command:
hdfs dfsadmin -safemode leave
This just switches off the namenode's safe mode.
Then delete the metastore_db folder from the Spark home folder or its bin directory; it will generally be in the folder from which you start a Spark session.
Then I ran my spark-shell using
spark-shell --master "spark://localhost:7077"
and voilà, I did not get the sqlContext.implicits._ error.

Why Zeppelin notebook is not able to connect to S3

I have installed Zeppelin on my AWS EC2 machine to connect to my Spark cluster.
Spark Version:
Standalone: spark-1.2.1-bin-hadoop1.tgz
I am able to connect to the Spark cluster, but I get the following error when trying to access a file in S3 for my use case.
Code:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","YOUR_SEC_KEY")
val file = "s3n://<bucket>/<key>"
val data = sc.textFile(file)
data.count
file: String = s3n://<bucket>/<key>
data: org.apache.spark.rdd.RDD[String] = s3n://<bucket>/<key> MappedRDD[1] at textFile at <console>:21
java.lang.NoSuchMethodError: org.jets3t.service.impl.rest.httpclient.RestS3Service.<init>(Lorg/jets3t/service/security/AWSCredentials;)V
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
I have built Zeppelin with the following command:
mvn clean package -Pspark-1.2.1 -Dhadoop.version=1.0.4 -DskipTests
When I try to build with the Hadoop profile "-Phadoop-1.0.4", it warns that the profile doesn't exist.
I have also tried -Phadoop-1, as mentioned on the Spark website (which maps Hadoop versions 1.x to 2.1.x to the hadoop-1 profile), but got the same error.
Please let me know what I am missing here.
The following installation worked for me (I also spent many days on this problem):
Spark 1.3.1 prebuilt for Hadoop 2.3, set up on an EC2 cluster
git clone https://github.com/apache/incubator-zeppelin.git (date: 25.07.2015)
installed Zeppelin via the following command (following the instructions at https://github.com/apache/incubator-zeppelin):
mvn clean package -Pspark-1.3 -Dhadoop.version=2.3.0 -Phadoop-2.3 -DskipTests
changed the port to 8082 via conf/zeppelin-site.xml (Spark uses port 8080)
After these installation steps, my notebook worked with S3 files:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xxx")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","xxx")
val file = "s3n://<<bucket>>/<<file>>"
val data = sc.textFile(file)
data.first
I think the S3 problem is not completely resolved in Zeppelin version 0.5.0, so cloning the current git repo did it for me.
Important information: the job only worked for me with the Zeppelin Spark interpreter set to master=local[*] (instead of using spark://master:7777).
For me it worked in two steps:
1. Creating a sqlContext:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
2. Reading S3 files like this:
val performanceFactor = sqlContext
  .read
  .parquet("s3n://<accessKey>:<secretKey>@mybucket/myfile/")
where you need to supply the access key and secret key.
In step 2 I am using the s3n protocol with the access and secret keys in the path itself.
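Equivalently, you can keep the credentials out of the path by setting them on the Hadoop configuration, as the question does, and then reading from a plain s3n URI; a sketch with placeholder keys:
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<accessKey>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secretKey>")
val performanceFactor = sqlContext.read.parquet("s3n://mybucket/myfile/")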
