Query remote Hive tables from a local Spark 2.x program - apache-spark

When I run a local Spark 2.x program from Eclipse, I get the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: When hive.metastore.uris is setted, please set spark.sql.authorization.enabled and hive.security.authorization.enabled to true to enable authorization.;
Code used:
System.setProperty("hadoop.home.dir", "D:/winutils");
// Kerberos related
val ZKServerPrincipal = "zookeeper/hadoop.hadoop.com";
val ZOOKEEPER_DEFAULT_LOGIN_CONTEXT_NAME = "Client";
val ZOOKEEPER_SERVER_PRINCIPAL_KEY = "zookeeper.server.principal";
val hadoopConf: Configuration = new Configuration();
LoginUtil.setZookeeperServerPrincipal(ZOOKEEPER_SERVER_PRINCIPAL_KEY, ZKServerPrincipal);
LoginUtil.login(userPrincipal, userKeytabPath, krb5ConfPath, hadoopConf);
// Creating the Spark session
val spark = SparkSession.builder()
  .appName("connection")
  .config("spark.master", "local")
  .config("spark.sql.authorization.enabled", "true")
  .enableHiveSupport()
  .getOrCreate()
val df75 = spark.sql("select * from dbname.tablename limit 10")
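The exception names two properties, while the builder above sets only spark.sql.authorization.enabled. A minimal sketch that also sets the second flag (whether these two settings alone are enough depends on the cluster's Hive authorization setup, so treat this as an assumption):
val spark = SparkSession.builder()
  .appName("connection")
  .config("spark.master", "local")
  .config("spark.sql.authorization.enabled", "true")
  // second property named in the AnalysisException
  .config("hive.security.authorization.enabled", "true")
  .enableHiveSupport()
  .getOrCreate()
val df75 = spark.sql("select * from dbname.tablename limit 10")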

Related

NoHostAvailableException spark-cassandra-connector

I am using sparkContext to count the rows of a Cassandra table, but I am getting the error below:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
I get the above error right after CassandraConnector: Connected to Cassandra cluster: cluster_name.
Below is the configuration code. I read that I need to set up the driver protocol, but spark-cassandra-connector does not offer an option to add it.
val sparkConf = new SparkConf(true).setAppName(appName)
sparkConf.set("spark.sql.orc.enabled", "true")
sparkConf.set("spark.sql.hive.convertMetastoreOrc", "false")
sparkConf.set("spark.sql.hive.metastorePartitionPruning true", "true")
sparkConf.set("spark.sql.orc.filterPushdown", "true")
sparkConf.set("spark.cassandra.connection.host", connection.hosts)
sparkConf.set("spark.cassandra.auth.username", connection.user)
sparkConf.set("spark.cassandra.auth.password", connection.password)
sparkConf.set("spark.cassandra.connection.local_dc", connection.local_dc.getOrElse(""))
sparkConf.set("spark.cassandra.input.consistency.level", connection.consistency_level)
sparkConf.set("spark.cassandra.input.reads_per_sec", Parameters.getCassandraInputReadsPerSec)
sparkConf.set("spark.cassandra.input.fetch.size_in_rows", Parameters.getCassandraInputFetchSizeInRows)
sparkConf.set("spark.cassandra.connection.timeout_ms", Parameters.getCassandraConnectionTimeoutMs)
sparkConf.set("spark.cassandra.read.timeout_ms", Parameters.getCassandraReadTimeoutMs)
sparkConf.set("spark.cassandra.connection.keep_alive_ms", Parameters.getCassandraConnectionKeepAliveMs)
val sparkSession: SparkSession = SparkSession.builder().enableHiveSupport().config(sparkConf).getOrCreate()
val sc: SparkContext = sparkSession.sparkContext
val sqlContext = sparkSession.sqlContext
sqlContext.setConf("fs.maprfs.impl", Parameters.getFsMaprfsImpl)
sqlContext.setConf("hive.exec.compress.output", Parameters.getHiveExecCompressOutput)
sqlContext.setConf("hive.merge.mapredfiles", Parameters.getHiveMergeMapredfiles)
sqlContext.setConf("hive.merge.smallfiles.avgsize", Parameters.getHiveMergeSmallfilesAvgsize)
sqlContext.setConf("hive.exec.reducers.bytes.per.reducer", Parameters.getHiveExecReducersBytesPerReducer)
sqlContext.setConf("hive.exec.dynamic.partition", Parameters.getHiveExecDynamicPartition)
sqlContext.setConf("hive.exec.dynamic.partition.mode", Parameters.getHiveExecDynamicPartitionMode)

Spark reading table after connection to HiveServer2 only gives schema not data

I connect to a remote Hive cluster using the following code, and I get the table data as expected:
val spark = SparkSession
.builder()
.appName("adhocattempts")
.config("hive.metastore.uris", "thrift://<remote-host>:9083")
.enableHiveSupport()
.getOrCreate()
val seqdf = spark.sql("select * from anon_seq")
seqdf.show
However, when I try to do this via HiveServer2, I get no data in my DataFrame. This table is based on a SequenceFile. Is that the issue, since I am actually trying to read it via JDBC?
val sparkJdbc = SparkSession.builder.appName("SparkHiveJob").getOrCreate
val sc = sparkJdbc.sparkContext
val sqlContext = sparkJdbc.sqlContext
val driverName = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driverName)
val df = sparkJdbc.read
.format("jdbc")
.option("url", "jdbc:hive2://<remote-host>:10000/default")
.option("dbtable", "anon_seq")
.load()
df.show()
Can someone help me understand the purpose of using HiveServer2 with jdbc and relevant drivers in Spark2?
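One hedged thing to try with the JDBC path (an assumption, not a confirmed fix): pass the Hive JDBC driver class explicitly as the driver option rather than only loading it with Class.forName:
val df = sparkJdbc.read
  .format("jdbc")
  .option("driver", "org.apache.hive.jdbc.HiveDriver") // explicit driver class for the JDBC source
  .option("url", "jdbc:hive2://<remote-host>:10000/default")
  .option("dbtable", "anon_seq")
  .load()
df.show()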

Spark session with hivecontext

I tried to use SparkSession with Hive tables.
I used the following code:
val spark= SparkSession.builder().appName("spark").master("local").enableHiveSupport().getOrCreate()
spark.sql("select * from data").show()
It shows "table not found", but the table exists in Hive. Please help me with this.
spark.sql("select * from databasename.data").show() - will work
Hello you have to provide the path of warehouse, like:
import java.io.File
// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
For more information, see: Hive Tables with Spark

Reading data from HBase using the Get command in Spark

I want to read data from an HBase table using the Get command, given that I also have the row key. I want to do that in my Spark Streaming application. Is there any source code someone can share?
You can use Spark's newAPIHadoopRDD to read an HBase table, which returns an RDD.
For example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val sparkConf = new SparkConf().setAppName("Hbase").setMaster("local")
val sc = new SparkContext(sparkConf)
val conf = HBaseConfiguration.create()
val tableName = "table"
conf.set("hbase.master", "localhost:60000")
conf.set("hbase.zookeeper.quorum", "localhost:2181")
conf.set("zookeeper.znode.parent", "/hbase-unsecure")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
println("Number of Records found : " + rdd.count())
sc.stop()
Or you can use any Spark HBase connector, like the Hortonworks HBase connector (SHC):
https://github.com/hortonworks-spark/shc
You can also use Spark-Phoenix API.
https://phoenix.apache.org/phoenix_spark.html
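Since the question is specifically about fetching a single row by its key, the plain HBase client Get API can also be called from the streaming code. A minimal sketch, where the table name, column family, qualifier, and row key are placeholders:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum", "localhost:2181")
val connection = ConnectionFactory.createConnection(hbaseConf)
val table = connection.getTable(TableName.valueOf("table"))
// Fetch exactly one row by its key
val result = table.get(new Get(Bytes.toBytes("row-key")))
val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
println("Value: " + value)
table.close()
connection.close()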

How to get TimeStamp data in hive when using Spark

val sql = "select time from table"
val data = sql(sql).map(_.getTimeStamp(0).toString)
In the Hive table, the type of time is timestamp. When I run this program, it throws a NullPointerException.
val data = sql(sql).map(_.get(0).toString)
When I change to the code above, the same exception is thrown.
Can anyone tell me how to get timestamp data from Hive using Spark?
Thanks.
If you are trying to read data from a Hive table, you should use HiveContext instead of SQLContext.
If you are using Spark 2.0, you can try the following.
import java.io.File
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession
.builder()
.appName("Spark Hive Example")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
import spark.sql
val df = sql("select time from table")
df.select($"time").show()
