I am trying to connect to Hive through Spark using the code below, but it fails with NoSuchDatabaseException: Database 'raw' not found. I do have a database named 'raw' in Hive. What am I missing here?
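import org.apache.spark.sql.SparkSession
// Create the session, pointing it at the remote Hive metastore
val spark = SparkSession
  .builder()
  .appName("Connecting to hive")
  .config("hive.metastore.uris", "thrift://myserver.domain.local:9083")
  .getOrCreate()
import spark.implicits._
import spark.sql
val frame = Seq(("one", 1), ("two", 2), ("three", 3)).toDF("word", "count")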
Output for spark.sql("SHOW DATABASES")
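One possibility worth ruling out (an assumption, not a confirmed diagnosis): a session created without Hive support uses Spark's in-memory catalog, which typically lists only `default`, and the usual remedy is to call `enableHiveSupport()` on the builder before `getOrCreate()`. The sketch below is in Python rather than Scala, but the configuration key and catalog call are the same. The helper names are invented for this example, and `spark` is an existing session supplied by the caller.

```python
def uses_hive_catalog(spark) -> bool:
    """Return True if the session's catalog is backed by Hive."""
    # "spark.sql.catalogImplementation" is "hive" when Hive support is
    # enabled and "in-memory" otherwise.
    return spark.conf.get("spark.sql.catalogImplementation") == "hive"


def visible_databases(spark) -> list:
    """List the database names that this session can see."""
    return [db.name for db in spark.catalog.listDatabases()]


# Usage, given an existing session named `spark`:
#   uses_hive_catalog(spark)
#   visible_databases(spark)
```

If the first helper returns False, the session is not talking to the Hive metastore at all, whatever `hive.metastore.uris` is set to.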


Fetch dbfs files as a stream dataframe in databricks

I have a problem where I need to create an external table in Databricks for each CSV file that lands in ADLS Gen2 storage.
I thought of a solution where I would get a streaming DataFrame from the directory listing output and then call a function that creates a table inside foreachBatch().
I have the function ready, but I can't figure out a way to stream directory information into a streaming DataFrame. Does anyone have an idea of how this could be achieved?
Please see the code block below.
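package com.sparkbyexamples.spark.streaming
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
object SparkStreamingFromDirectory {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .getOrCreate()
    // Streaming file sources need an explicit schema; only Zipcode survives in this excerpt
    val schema = StructType(List(
      StructField("Zipcode", IntegerType, true)
    ))
    // Each new JSON file that lands in the directory becomes part of the stream
    val df = spark.readStream
      .schema(schema)
      .json("Your directory")
    // Assumed aggregation and sink; the excerpt is cut off at this point
    val groupDF = df.groupBy("Zipcode").count()
    groupDF.writeStream.format("console").outputMode("complete").start().awaitTermination()
  }
}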
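For the original goal of one table per arriving CSV, a minimal sketch of the foreachBatch idea follows. It is in Python and rests on several assumptions: the `binaryFile` source is used only because it exposes each file's `path` as a column; the table-naming rule, the DDL text, and all function names are hypothetical stand-ins for the function the question says is already written; `spark`, `landing_dir`, and `checkpoint_dir` are supplied by the caller.

```python
import posixpath
import re
from urllib.parse import urlparse


def table_name_for(file_path: str) -> str:
    """Derive a table name from a file path (hypothetical naming rule)."""
    file_name = posixpath.basename(urlparse(file_path).path)
    stem, _ext = posixpath.splitext(file_name)
    # Replace anything that is not a letter, digit, or underscore.
    return re.sub(r"\W+", "_", stem).strip("_").lower()


def create_table_sql(file_path: str) -> str:
    """Build DDL for an unmanaged CSV table over one file (assumed layout)."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name_for(file_path)} "
        f"USING CSV OPTIONS (header 'true') LOCATION '{file_path}'"
    )


def start_table_creator(spark, landing_dir: str, checkpoint_dir: str):
    """Stream file paths from landing_dir and create one table per new file."""

    def create_tables(batch_df, batch_id):
        # Each micro-batch holds only the files discovered since the last one.
        for row in batch_df.select("path").distinct().collect():
            spark.sql(create_table_sql(row["path"]))

    return (
        # binaryFile has a fixed schema: path, modificationTime, length, content
        spark.readStream.format("binaryFile")
        .option("pathGlobFilter", "*.csv")
        .load(landing_dir)
        .writeStream.foreachBatch(create_tables)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

Whether a table LOCATION may point at a single file rather than a directory depends on the environment, so treat the DDL as a placeholder for your own function. The checkpoint directory is what lets the stream remember which files it has already handled across restarts.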

Spark: how to save pair rdd to json files?

My RDD looks like this:
[('f1',1), ('f2',2)]
How can I save it to JSON files?
You can convert the RDD to a DataFrame and write it out as JSON:
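from pyspark.sql import SparkSession
# Start (or reuse) a Spark session
spark = SparkSession.builder \
    .getOrCreate()
sc = spark.sparkContext
# Turn the pair RDD into a two-column DataFrame
df = sc.parallelize(
    [('f1', 1), ('f2', 2)]).toDF(["key", "value"])
# Write one JSON object per row; "output_dir" is a placeholder path
df.write.json("output_dir")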
The output in the JSON file looks like this:
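As a rough illustration of the layout, Spark's DataFrame JSON writer emits JSON Lines: one object per row, one row per line, in part files inside the output directory. The plain-Python sketch below (no Spark involved) serializes the same pairs that way. The helper name `pairs_to_json_lines` is made up for this example.

```python
import json


def pairs_to_json_lines(pairs, columns=("key", "value")):
    """Serialize (key, value) pairs as JSON Lines, one object per pair."""
    return [
        # Compact separators mimic the whitespace-free style of Spark's writer.
        json.dumps(dict(zip(columns, pair)), separators=(",", ":"))
        for pair in pairs
    ]


for line in pairs_to_json_lines([("f1", 1), ("f2", 2)]):
    print(line)
# prints:
# {"key":"f1","value":1}
# {"key":"f2","value":2}
```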

Getting the hivecontext from a dataframe

I am creating a HiveContext instead of a SQLContext to create a DataFrame:
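import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
val conf = new SparkConf().setMaster("yarn-cluster")
val context = new SparkContext(conf)
//val sqlContext=new SQLContext(context)
val hiveContext = new HiveContext(context)
import hiveContext.implicits._ // needed for toDF
val data = Seq(1,2,3,4,5,6,7,8,9,10).map(x => (x.toLong, x + 1, x + 2.toDouble)).toDF("ts", "value", "label")
data.registerTempTable("df") // expose the DataFrame to SQL under the name "df"
//outdta is a dataframe
//val hiveTest=hiveContext.sql("SELECT * from df where ts < percentile(BIGINT ts, 0.5)")
val ratio1 = hiveContext.sql("SELECT percentile_approx(ts, array (0.5,0.7)) from df")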
I need to get the exact HiveContext back from ratio1, rather than creating a new HiveContext from the SQLContext that the DataFrame provides. I don't know why Spark doesn't give me a HiveContext from a DataFrame; it only gives me a SQLContext.
If you use HiveContext, then the runtime type of df.sqlContext is HiveContext (HiveContext is a subtype of SQLContext), so you can do:
val hiveContext = df.sqlContext.asInstanceOf[HiveContext]

Spark 2.0 - Databricks xml reader Input path does not exist

I am trying to use the Databricks XML file reader API.
Sample code:
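val spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.sql.warehouse.dir", "file:///C:/TestData")
  .getOrCreate()
//val sqlContext = new SQLContext(sc)
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("<path to XML file>") // placeholder; the actual path is not shown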
If I give the file path directly, it looks for some warehouse directory, so I set the spark.sql.warehouse.dir option, but now it throws "Input path does not exist".
It is actually looking under the project root directory. Why is it looking there?
It is finally working. We need to specify the warehouse directory as well as pass the absolute file path in the load method. I am not sure what the warehouse directory is used for.
The main point is that we don't need to give C: as mentioned in another Stack Overflow answer.
Working code:
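val spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.sql.warehouse.dir", "file:///TestData/")
  .getOrCreate()
//val sqlContext = new SQLContext(sc)
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("<absolute path to XML file>") // placeholder; the actual path is not shown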

SparkSession: using the SQL API to query Cassandra

In Python, using SparkSession I can load a Cassandra keyspace and table like:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local") \
    .appName("TestApp") \
    .getOrCreate()
# Reader for the Spark Cassandra connector data source
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testdb", table="test")
How can I use the SQL API instead? Something like:
SELECT * FROM testdb.test
Try registering a temp table in Spark and running queries against it, as in the following snippet:
df.createOrReplaceTempView("my_table")
df2 = spark.sql("SELECT * FROM my_table")
