how to use Cassandra Context in spark 2.0 - apache-spark

In previous versions of Spark, such as 1.6.1, I create a Cassandra context from the Spark context:
import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.sql.cassandra.CassandraSQLContext

// Configuration
val conf: SparkConf = new SparkConf(true)
  .set("spark.cassandra.connection.host", CassandraHost)
  .setAppName(getClass.getSimpleName)
lazy val sc = new SparkContext(conf)
val cassandraSqlCtx: CassandraSQLContext = new CassandraSQLContext(sc)

// Query using the Cassandra context
cassandraSqlCtx.sql("select id from table")
But in Spark 2.0, SparkContext is replaced by SparkSession. How can I use the Cassandra context?

Short Answer: You don't. It has been deprecated and removed.
Long Answer: You don't want to. The HiveContext provides everything except for the catalogue and supports a much wider range of SQL (HQL). In Spark 2.0 this just means you will need to manually register Cassandra tables using createOrReplaceTempView until an ExternalCatalog is implemented.
In SQL this looks like:
spark.sql("""CREATE TEMPORARY TABLE words
     |USING org.apache.spark.sql.cassandra
     |OPTIONS (
     |  table "words",
     |  keyspace "test")""".stripMargin)
In the raw DataFrame API it looks like:
spark
.read
.format("org.apache.spark.sql.cassandra")
.options(Map("keyspace" -> "test", "table" -> "words"))
.load
.createOrReplaceTempView("words")
Both of these commands will register the table "words" for SQL queries.
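Once registered, the temporary view can be queried like any other table; a minimal sketch (the column name word is an assumption for illustration, use whatever columns your words table actually has):
// Query the temporary view registered above.
// "word" is a hypothetical column name used only for illustration.
spark.sql("SELECT word FROM words LIMIT 10").show()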

Related

Spark read data from Cassandra error org.apache.spark.unsafe.types.UTF8String is not a valid external type for schema of string

I have a Cassandra table that is created as follows (in cqlsh):
CREATE TABLE blog.session( id int PRIMARY KEY, visited text);
I write data to Cassandra and it looks like this
id | visited
1 | Url1-Url2-Url3
I then try to read it using the Spark Cassandra Connector (2.5.1).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._ // provides cassandraFormat

val sparkSession = SparkSession.builder()
  .master("local")
  .appName("ReadFromCass")
  .config("spark.cassandra.connection.host", "localhost")
  .config("spark.cassandra.connection.port", "9042")
  .getOrCreate()

import sparkSession.implicits._

val readSessions = sparkSession.sqlContext
  .read
  .cassandraFormat("table1", "keyspace1")
  .load()
  .show()
However, it seems to be unable to read the visited column, since it is a text value with dashes between the words. The error is:
org.apache.spark.unsafe.types.UTF8String is not a valid external type for schema of string
Any ideas on why Spark is unable to read this and how to fix it?
The error seemed to be caused by the version of spark-cassandra-connector. Instead of using "2.5.1", use "3.0.0-beta".
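If you build with sbt, the dependency change would look roughly like this (a sketch; adjust it to your build and Scala version):
// build.sbt -- use the 3.0.0-beta connector instead of 2.5.1, as suggested above
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.0-beta"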

Last Access Time Update in Hive metastore

I am using the following property in my Hive console / .hiverc file, so that whenever I query a table it updates the LAST_ACCESS_TIME column in the TBLS table of the Hive metastore.
set hive.exec.pre.hooks = org.apache.hadoop.hive.ql.hooks.UpdateInputAccessTimeHook$PreExec;
However, if I use spark-sql or spark-shell, it does not seem to work and LAST_ACCESS_TIME does not get updated in the Hive metastore.
Here's how I am reading the table:
>>> df = spark.sql("select * from db.sometable")
>>> df.show()
I have set up the above hook in hive-site.xml in both /etc/hive/conf and /etc/spark/conf.
Your code may skip past some of the Hive integrations. My recollection is that to get more of the Hive-ish integrations you need to bring in the HiveContext, something like this:
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Data Frame Join")
    sc = SparkContext(conf=conf)

    # HiveContext enables the Hive-aware code paths (metastore, HiveQL, UDFs)
    sqlContext = HiveContext(sc)
    df_07 = sqlContext.sql("SELECT * from sample_07")
https://docs.cloudera.com/runtime/7.2.7/developing-spark-applications/topics/spark-sql-example.html
Hope this helps

HiveContext in Spark Version 2

I am working on a Spark program that inserts a DataFrame into a Hive table as below.
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._

val hiveCont = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveCont.implicits._ // needed for toDF()

val partfile = sc.textFile("partfile")
val partdata = partfile.map(p => p.split(","))
case class partc(id: Int, name: String, salary: Int, dept: String, location: String)
val partRDD = partdata.map(p => partc(p(0).toInt, p(1), p(2).toInt, p(3), p(4)))
val partDF = partRDD.toDF()
partDF.registerTempTable("party")
hiveCont.sql("insert into parttab select id, name, salary, dept from party")
I know that Spark 2 has come out and that we can use the SparkSession object in it.
Can we use the SparkSession object to directly insert the DataFrame into a Hive table, or do we have to use the HiveContext in version 2 as well? Can anyone let me know what the major difference is in version 2 with respect to HiveContext?
You can use your SparkSession (normally called spark or ss) directly to fire a SQL query (make sure Hive support is enabled when creating the SparkSession):
spark.sql("insert into parttab select id, name, salary, dept from party")
But I would suggest this notation; you don't need to create a temp table, etc.:
partDF
.select("id","name","salary","dept")
.write.mode("overwrite")
.insertInto("parttab")

WriteConf of Spark-Cassandra Connector being used or not

I am using Spark version 1.6.2, Spark-Cassandra Connector 1.6.0, Cassandra-Driver-Core 3.0.3
I am writing a simple Spark job in which I am trying to insert some rows to a table in Cassandra. The code snippet used was:
import scala.collection.mutable.ListBuffer
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.cassandra.CassandraSQLContext
import com.datastax.spark.connector._ // provides saveToCassandra

val sparkConf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "<Cassandra IP>")
  .set("spark.cassandra.auth.username", "test")
  .set("spark.cassandra.auth.password", "test")
  .set("spark.cassandra.output.batch.size.rows", "1")
val sc = new SparkContext(sparkConf)
val cassandraSQLContext = new CassandraSQLContext(sc)
cassandraSQLContext.setKeyspace("test")

val query = "select * from test"
val dataRDD = cassandraSQLContext.cassandraSql(query).rdd

val addRowList = ListBuffer(
  Test(111, 10, 100000, "{'test':'0','test1':'1','others':'2'}"),
  Test(111, 20, 200000, "{'test':'0','test1':'1','others':'2'}")
)
val insertRowRDD = sc.parallelize(addRowList)
insertRowRDD.saveToCassandra("test", "test")
Test() is a case class
Now, I have passed the WriteConf parameter output.batch.size.rows when building the SparkConf object. I expect this code to write one row per batch to Cassandra. I cannot find any way to cross-verify that the batch-write configuration being used is not the default one but the one passed in the code snippet.
I could not find anything in Cassandra's cassandra.log, system.log, or debug.log.
So can anyone help me with a way of cross-verifying the WriteConf being used by the Spark-Cassandra Connector to write batches to Cassandra?
There are two things you can do to verify that your setting was applied.
First, you can call the method which creates the WriteConf:
WriteConf.fromSparkConf(sparkConf)
The resulting object can be inspected to make sure all the values are what you want. This is the default argument to saveToCassandra.
Second, you can explicitly pass a WriteConf to the save method itself:
saveAsCassandraTable(keyspace, table, writeConf = WriteConf(...))
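A minimal sketch of both checks, assuming the 1.6-era connector API where WriteConf and RowsInBatch live in com.datastax.spark.connector.writer:
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{ WriteConf, RowsInBatch }

// Inspect the WriteConf built from the SparkConf; with
// spark.cassandra.output.batch.size.rows = 1 this should report a batch size of RowsInBatch(1)
val writeConf = WriteConf.fromSparkConf(sparkConf)
println(writeConf.batchSize)

// Or pass an explicit WriteConf so there is no ambiguity about what is used
insertRowRDD.saveToCassandra("test", "test", writeConf = WriteConf(batchSize = RowsInBatch(1)))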

how to use a whole hive database in spark and read sql queries from external files?

I am using the Hortonworks sandbox in Azure with Spark 1.6.
I have a Hive database populated with TPC-DS sample data. I want to read some SQL queries from external files and run them on the Hive dataset in Spark.
I followed the topic Using hive database in spark, but it only uses one table from my dataset and it also writes the SQL query in Spark again. I need to define the whole database as my source to query against; I think I should use DataFrames, but I am not sure how.
I also want to import the SQL queries from external .sql files rather than writing the queries down again.
Would you please guide me on how I can do this? Thank you very much!
Spark can read data directly from Hive tables. You can create and drop Hive tables using Spark, and you can perform all Hive HQL operations through Spark. For this you need to use Spark's HiveContext.
From the Spark documentation:
Spark's HiveContext provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup.
For more information you can visit the Spark documentation.
To avoid writing SQL in code, you can use a properties file where you put all your Hive queries, and then reference each query by its key in your code.
Please see below an implementation of Spark HiveContext and the use of a properties file in Spark Scala.
package com.spark.hive.poc

import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{ StructType, StructField, StringType }
import org.apache.hadoop.fs.{ FileSystem, Path }

object ReadPropertyFiles extends Serializable {

  val conf = new SparkConf().setAppName("read local file")
  conf.set("spark.executor.memory", "100M")
  conf.setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new HiveContext(sc)

  def main(args: Array[String]): Unit = {
    // Load the properties file (HDFS path passed as the first argument)
    val hadoopConf = new org.apache.hadoop.conf.Configuration()
    val fileSystem = FileSystem.get(hadoopConf)
    val inputStream = fileSystem.open(new Path(args(0)))
    val properties = new java.util.Properties
    properties.load(inputStream)

    // Create an RDD
    val people = sc.textFile("/user/User1/spark_hive_poc/input/")

    // The schema is encoded in a string
    val schemaString = "name address"

    // Generate the schema based on the string of schema
    val schema = StructType(
      schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

    // Convert records of the RDD (people) to Rows
    val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

    // Apply the schema to the RDD
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    peopleDataFrame.printSchema()
    peopleDataFrame.registerTempTable("tbl_temp")

    val data = sqlContext.sql(properties.getProperty("temp_table"))

    // Drop Hive table
    sqlContext.sql(properties.getProperty("drop_hive_table"))
    // Create Hive table
    sqlContext.sql(properties.getProperty("create_hive_table"))
    // Insert data into Hive table
    sqlContext.sql(properties.getProperty("insert_into_hive_table"))
    // Select data from Hive table
    sqlContext.sql(properties.getProperty("select_from_hive")).show()

    sc.stop()
  }
}
Entries in the properties file:
temp_table=select * from tbl_temp
drop_hive_table=DROP TABLE IF EXISTS default.test_hive_tbl
create_hive_table=CREATE TABLE IF NOT EXISTS default.test_hive_tbl(name string, city string) STORED AS ORC
insert_into_hive_table=insert overwrite table default.test_hive_tbl select * from tbl_temp
select_from_hive=select * from default.test_hive_tbl
Spark-submit command to run this job:
[User1#hadoopdev ~]$ spark-submit --num-executors 1 \
--executor-memory 100M --total-executor-cores 2 --master local \
--class com.spark.hive.poc.ReadPropertyFiles Hive-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
/user/User1/spark_hive_poc/properties/sql.properties
Note: the properties file location should be an HDFS location.
