HiveContext in Spark Version 2 - apache-spark

I am working on a spark program that inserts dataframe into Hive Table as below.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
val hiveCont = val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)
val partfile = sc.textFile("partfile")
val partdata = partfile.map(p => p.split(","))
case class partc(id:Int, name:String, salary:Int, dept:String, location:String)
val partRDD = partdata.map(p => partc(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
val partDF = partRDD.toDF()
partDF.registerTempTable("party")
hiveCont.sql("insert into parttab select id, name, salary, dept from party")
I know that Spark V2 has come out and we can use SparkSession object in it.
Can we use SparkSession object to directly insert the dataframe into Hive table or do we have to use the HiveContext in version 2 also ? Can anyone let me know what is the major difference in version with respect to HiveContext ?

You can use your SparkSession (normally called spark or ss) directly to fire a sql query (make sure hive-support is enabled when creating the spark-session):
spark.sql("insert into parttab select id, name, salary, dept from party")
But I would suggest this notation, you don't need to create a temp-table etc:
partDF
.select("id","name","salary","dept")
.write.mode("overwrite")
.insertInto("parttab")

Related

Last Access Time Update in Hive metastore

I am using the following property in my Hive console/ .hiverc file, so that whenever I query the table, it updates the LAST_ACCESS_TIME column in TBLS table of Hive metastore.
set hive.exec.pre.hooks = org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
However, if I use spark-sql or spark-shell, it does not seems to be working and LAST_ACCESS_TIME does not gets updated in hive metastore.
Here's how I am reading the table :
>>> df = spark.sql("select * from db.sometable")
>>> df.show()
I have set up the above hook in hive-site.xml in both /etc/hive/conf and /etc/spark/conf.
Your code may skip past some of the hive integrations. My recollection is that to get more of the Hive-ish integrations you need to bring in the HiveContext, something like this:
from pyspark import SparkContext, SparkConf, HiveContext
if __name__ == "__main__":
# create Spark context with Spark configuration
conf = SparkConf().setAppName("Data Frame Join")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
df_07 = sqlContext.sql("SELECT * from sample_07")
https://docs.cloudera.com/runtime/7.2.7/developing-spark-applications/topics/spark-sql-example.html
Hope this helps

Converting CassandraRow obtained from joinWithCassandraTable to DataFrame

case class SourcePartition(id: String, host:String ,bucket: Int)
joinedRDDs =partitions.joinWithCassandraTable("db_name","table_name")
joinedRDDs.values.foreach(println)
I have to use joinWithCassandraTable , How do i covert the result CassandraRow in to a DataFrame? OR is there any equivalent of joinWithCassandraTable with DataFrame ?
I have to read a lot of partitions in one go, I'm aware of Datastax Cassandra connector Predicate push down, but it allows to pull only one Partition at a time ( It doesnt seems to allow IN operator , Only = seems to be supported)
val spark: SparkSession = SparkSession.builder().master("local[4]").appName("RDD2DF").getOrCreate()
val sc: SparkContext = spark.sparkContext
import spark.implicits._
val internalJoinRDD = spark.sparkContext.cassandraTable("test", "test_table_1").joinWithCassandraTable("test", "table_table_2")
internalJoin.toDebugString
internalJoinRDD.toDF()
Can you try the above code snippet.
If you have a schema for your data, you can use
def createDataFrame(internalJoinRDD: RDD[Row], schema: StructType): DataFrame

Spark sql querying a Hive table from workers

I am trying to querying a Hive table from a map operation in Spark, but when it run a query the execution getting frozen.
This is my test code
val sc = new SparkContext(conf)
val datasetPath = "npiCodesMin.csv"
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
val df = sparkSession.read.option("header", true).option("sep", ",").csv(datasetPath)
df.createOrReplaceTempView("npicodesTmp")
sparkSession.sql("DROP TABLE IF EXISTS npicodes");
sparkSession.sql("CREATE TABLE npicodes AS SELECT * FROM npicodesTmp");
val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '1588667638'") //This works
println(res.first())
val NPIs = sc.parallelize(List("1679576722", "1588667638", "1306849450", "1932102084"))//Some existing NPIs
val rows = NPIs.mapPartitions{ partition =>
val sparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()
partition.map{code =>
val res = sparkSession.sql("SELECT * FROM npicodes WHERE NPI = '"+code+"'")//The program stops here
res.first()
}
}
rows.collect().foreach(println)
It loads the data from a CSV, creates a new Hive table and fills it with the CSV data.
Then, if I query the table from the master it works perfectly, but if I try to do that in a map operation the execution getting frozen.
It do not generate any error, it continue running without do anything.
The Spark UI shows this situation
Actually, I am not sure if I can query a table in a distributed way, I cannot find it in the documentation.
Any suggestion?
Thanks.

how to use a whole hive database in spark and read sql queries from external files?

I am using hortonworks sandbox in Azure with spark 1.6.
I have a Hive database populated with TPC-DS sample data. I want to read some SQL queries from external files and run them on the hive dataset in spark.
I follow this topic Using hive database in spark which is just using a table in my dataset and also it writes SQL query in spark again, but I need to define whole, dataset as my source to query on that, I think i should use dataframes but i am not sure and do not know how!
also I want to import the SQL query from external .sql file and do not write down the query again!
would you please guide me how can I do this?
thank you very much,
bests!
Spark Can read data directly from Hive table. You can create, drop Hive table using Spark and even you can do all Hive hql related operations through the Spark. For this you need to use Spark HiveContext
From the Spark documentation:
Spark HiveContext, provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup.
For more information you can visit Spark Documentation
To Avoid writing sql in code, you can use property file where you can put all your Hive query and then you can use the key in you code.
Please see below the implementation of Spark HiveContext and use of property file in Spark Scala.
package com.spark.hive.poc
import org.apache.spark._
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql._
import org.apache.spark._
import org.apache.spark.sql.DataFrame;
import org.apache.spark.rdd.RDD;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.hive.HiveContext;
//Import Row.
import org.apache.spark.sql.Row;
//Import Spark SQL data types
import org.apache.spark.sql.types.{ StructType, StructField, StringType };
object ReadPropertyFiles extends Serializable {
val conf = new SparkConf().setAppName("read local file");
conf.set("spark.executor.memory", "100M");
conf.setMaster("local");
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
def main(args: Array[String]): Unit = {
var hadoopConf = new org.apache.hadoop.conf.Configuration();
var fileSystem = FileSystem.get(hadoopConf);
var Path = new Path(args(0));
val inputStream = fileSystem.open(Path);
var Properties = new java.util.Properties;
Properties.load(inputStream);
//Create an RDD
val people = sc.textFile("/user/User1/spark_hive_poc/input/");
//The schema is encoded in a string
val schemaString = "name address";
//Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)));
//Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim));
//Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema);
peopleDataFrame.printSchema();
peopleDataFrame.registerTempTable("tbl_temp")
val data = sqlContext.sql(Properties.getProperty("temp_table"));
//Drop Hive table
sqlContext.sql(Properties.getProperty("drop_hive_table"));
//Create Hive table
sqlContext.sql(Properties.getProperty("create_hive_tavle"));
//Insert data into Hive table
sqlContext.sql(Properties.getProperty("insert_into_hive_table"));
//Select Data into Hive table
sqlContext.sql(Properties.getProperty("select_from_hive")).show();
sc.stop
}
}
Entry in Properties File :
temp_table=select * from tbl_temp
drop_hive_table=DROP TABLE IF EXISTS default.test_hive_tbl
create_hive_tavle=CREATE TABLE IF NOT EXISTS default.test_hive_tbl(name string, city string) STORED AS ORC
insert_into_hive_table=insert overwrite table default.test_hive_tbl select * from tbl_temp
select_from_hive=select * from default.test_hive_tbl
Spark submit Command to run this job:
[User1#hadoopdev ~]$ spark-submit --num-executors 1 \
--executor-memory 100M --total-executor-cores 2 --master local \
--class com.spark.hive.poc.ReadPropertyFiles Hive-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
/user/User1/spark_hive_poc/properties/sql.properties
Note: Property File location should be HDFS location.

how to use Cassandra Context in spark 2.0

In previous Version of Spark like 1.6.1, i am using creating Cassandra Context using spark Context,
import org.apache.spark.{ Logging, SparkContext, SparkConf }
//config
val conf: org.apache.spark.SparkConf = new SparkConf(true)
.set("spark.cassandra.connection.host", CassandraHost)
.setAppName(getClass.getSimpleName)
lazy val sc = new SparkContext(conf)
val cassandraSqlCtx: org.apache.spark.sql.cassandra.CassandraSQLContext = new CassandraSQLContext(sc)
//Query using Cassandra context
cassandraSqlCtx.sql("select id from table ")
But In Spark 2.0 , Spark Context is replaced with Spark session, how can i use cassandra context?
Short Answer: You don't. It has been deprecated and removed.
Long Answer: You don't want to. The HiveContext provides everything except for the catalogue and supports a much wider range of SQL(HQL~). In Spark 2.0 this just means you will need to manually register Cassandra tables use createOrReplaceTempView until an ExternalCatalogue is implemented.
In Sql this looks like
spark.sql("""CREATE TEMPORARY TABLE words
|USING org.apache.spark.sql.cassandra
|OPTIONS (
| table "words",
| keyspace "test")""".stripMargin)
In the raw DF api it looks like
spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "test", "table" -> "words"))
.load
.createOrReplaceTempView("words")
Both of these commands will register the table "words" for SQL queries.

Resources