Create table in hive through spark - apache-spark

I am trying to connect to Hive through Spark using the code below, but I am unable to do so. The code fails with NoSuchDatabaseException: Database 'raw' not found. I have a database named 'raw' in Hive. What am I missing here?
val spark = SparkSession
  .builder()
  .appName("Connecting to hive")
  .config("hive.metastore.uris", "thrift://myserver.domain.local:9083")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

val frame = Seq(("one", 1), ("two", 2), ("three", 3)).toDF("word", "count")
frame.show()
frame.write.mode("overwrite").saveAsTable("raw.temp1")
Output of spark.sql("SHOW DATABASES"):
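If 'raw' is not listed in that output, the session is most likely using Spark's built-in in-memory/Derby catalog rather than the remote metastore, which is the usual cause of this NoSuchDatabaseException. A minimal sketch for checking what the session actually resolved, reusing the spark session built above (only standard SparkSession/catalog calls are used; the "<not set>" defaults are just placeholders):

// Should print "hive"; "in-memory" means enableHiveSupport() did not take effect
println(spark.conf.get("spark.sql.catalogImplementation", "<not set>"))

// The metastore URI the session is actually configured with
println(spark.conf.get("hive.metastore.uris", "<not set>"))

// Databases visible through the catalog API; 'raw' should appear here
spark.catalog.listDatabases().show(false)
spark.sql("SHOW DATABASES").show(false)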

Related

Fetch dbfs files as a stream dataframe in databricks

I have a problem where I need to create an external table in Databricks for each CSV file that lands in an ADLS Gen2 storage account.
I thought about a solution where I would get a streaming DataFrame from the dbutils.fs.ls() output and then call a function that creates a table inside foreachBatch().
I have the function ready, but I can't figure out a way to stream directory information into a streaming DataFrame. Does anyone have an idea of how this could be achieved?
Kindly check the code block below; a sketch of the table-per-file pattern follows it.
package com.sparkbyexamples.spark.streaming

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SparkStreamingFromDirectory {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[3]")
      .appName("SparkByExamples")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    val schema = StructType(
      List(
        StructField("Zipcode", IntegerType, true)
      )
    )

    // Stream JSON files as they arrive in the directory
    val df = spark.readStream
      .schema(schema)
      .json("Your directory")

    df.printSchema()

    val groupDF = df.select("Zipcode")
      .groupBy("Zipcode").count()
    groupDF.printSchema()

    // Print the running aggregation to the console on every micro-batch
    groupDF.writeStream
      .format("console")
      .outputMode("complete")
      .start()
      .awaitTermination()
  }
}
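The asker's actual goal is one external table per CSV file arriving in ADLS Gen2. A minimal sketch of that pattern under stated assumptions: the landing path, schema, and table-naming scheme below are placeholders, and the stream uses a plain CSV file source with foreachBatch plus input_file_name() to discover which files arrived in each micro-batch.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().appName("TablePerFile").getOrCreate()

// Placeholder landing zone; replace with the real ADLS Gen2 path
val landingPath = "abfss://container@account.dfs.core.windows.net/landing/"

// Creates an external table over a single file; the naming scheme is an assumption
def createTableForFile(filePath: String): Unit = {
  val tableName = filePath.split("/").last.stripSuffix(".csv").replaceAll("[^A-Za-z0-9_]", "_")
  spark.sql(
    s"CREATE TABLE IF NOT EXISTS $tableName USING CSV " +
    s"OPTIONS (header 'true') LOCATION '$filePath'")
}

def processBatch(batch: DataFrame, batchId: Long): Unit = {
  // A micro-batch can contain rows from several newly arrived files
  val files = batch.select("source_file").distinct().collect().map(_.getString(0))
  files.foreach(createTableForFile)
}

val stream = spark.readStream
  .format("csv")
  .option("header", "true")
  .schema("Zipcode INT")                        // file streams require a schema up front; placeholder columns
  .load(landingPath)
  .withColumn("source_file", input_file_name())

stream.writeStream
  .foreachBatch(processBatch _)
  .start()
  .awaitTermination()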

Spark: how to save pair rdd to json files?

My RDD looks like:
[('f1', 1), ('f2', 2)]
How can I save it to JSON files?
You can convert the RDD to a DataFrame and write it out as JSON:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

sc = spark.sparkContext

df = sc.parallelize([('f1', 1), ('f2', 2)]).toDF(["key", "value"])
df.write.format('json').save('output_path')
The output in the JSON files looks like this:
{"key":"f1","value":1}
{"key":"f2","value":2}

Getting the hivecontext from a dataframe

I am creating a HiveContext instead of a SQLContext to create a DataFrame:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setMaster("yarn-cluster")
val context = new SparkContext(conf)
//val sqlContext = new SQLContext(context)
val hiveContext = new HiveContext(context)
import hiveContext.implicits._

val data = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).map(x => (x.toLong, x + 1, x + 2.toDouble)).toDF("ts", "value", "label")
// data is a DataFrame
data.registerTempTable("df")

//val hiveTest = hiveContext.sql("SELECT * from df where ts < percentile(BIGINT ts, 0.5)")
val ratio1 = hiveContext.sql("SELECT percentile_approx(ts, array(0.5,0.7)) from df")
I need to get the exact HiveContext back from ratio1, not create another HiveContext from the SQLContext provided by the DataFrame. I don't know why Spark doesn't give me a HiveContext from the DataFrame and only gives a SQLContext.
If you use a HiveContext, then the runtime type of df.sqlContext is HiveContext (HiveContext is a subtype of SQLContext), therefore you can do:
val hiveContext = df.sqlContext.asInstanceOf[HiveContext]
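A short usage sketch, reusing ratio1 and the df temp table from the question; the cast only succeeds because the DataFrame really was created through a HiveContext:

import org.apache.spark.sql.hive.HiveContext

val recovered = ratio1.sqlContext.asInstanceOf[HiveContext]
// The recovered context is the same HiveContext that created the DataFrame,
// so Hive-specific functions such as percentile_approx keep working
val ratio2 = recovered.sql("SELECT percentile_approx(ts, array(0.5, 0.7)) FROM df")
ratio2.show()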

Spark 2.0 - Databricks xml reader Input path does not exist

I am trying to use the Databricks XML file reader API.
Sample code:
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Java Spark SQL basic example")
  .config("spark.sql.warehouse.dir", "file:///C:/TestData")
  .getOrCreate()

//val sqlContext = new SQLContext(sc)
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("books.xml")
df.show()
If I give the file path directly, it looks for some warehouse directory, so I set the spark.sql.warehouse.dir option, but now it throws "Input path does not exist".
It is actually looking under the project root directory. Why is it looking in the project root directory?
Finally it's working. We need to specify the warehouse directory as well as pass the absolute file path in the load method. I am not sure what the warehouse directory is used for.
The main part is that we don't need to give C:, as mentioned in another Stack Overflow answer.
Working code:
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Java Spark SQL basic example")
  .config("spark.sql.warehouse.dir", "file:///TestData/")
  .getOrCreate()

//val sqlContext = new SQLContext(sc)
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("file:///TestData/books.xml")
df.show()
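For what it's worth, spark.sql.warehouse.dir only sets the default location for managed tables (e.g. saveAsTable); it is not consulted when reading files, and the "Input path does not exist" error came from the relative path books.xml being resolved against the working directory. A minimal sketch, assuming the same books.xml location and the spark-xml package on the classpath, that reads with an absolute path and no warehouse setting at all:

val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("Spark XML read")
  .getOrCreate()

// No spark.sql.warehouse.dir needed for reads; just give load() a path Spark can resolve
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("file:///TestData/books.xml")   // absolute path taken from the working code above

df.show()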

SparkSession: using the SQL API to query Cassandra

In Python, using SparkSession I can load a Cassandra keyspace and table like:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("TestApp") \
    .getOrCreate()

cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testdb", table="test")
df.collect()
How can I use the SQL API instead? Something like:
SELECT * FROM testdb.test
Try registering a temp view in Spark and running queries against it, as in the following snippet:
df.createOrReplaceTempView("my_table")
df2 = spark.sql("SELECT * FROM my_table")
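If you want to skip the intermediate DataFrame entirely, the Spark Cassandra Connector should also let you register the table straight from SQL with CREATE TEMPORARY VIEW ... USING org.apache.spark.sql.cassandra (the connector must be on the classpath; the keyspace and table names below are taken from the question). Since this is just a SQL string passed to spark.sql on the session from the question, the same call works verbatim from PySpark or Scala:

spark.sql("""
  CREATE TEMPORARY VIEW test
  USING org.apache.spark.sql.cassandra
  OPTIONS (table "test", keyspace "testdb")
""")

spark.sql("SELECT * FROM test").show()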
