How to get the value of the location for a Hive table using a Spark object? - apache-spark

I am interested in being able to retrieve the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the location via the following SQL query:
describe formatted <table name>
I was wondering if there is another way to obtain the location value without having to parse the output. An API would be great in case the output of the above command changes between Hive versions. If an external dependency is needed, which would it be? Is there some sample spark code that can obtain the location value?

Here is the correct answer:
import org.apache.spark.sql.catalyst.TableIdentifier
lazy val tblMetadata = spark.sessionState.catalog.getTableMetadata(new TableIdentifier(tableName,Some(schema)))

You can also use .toDF method on desc formatted table then filter from dataframe.
DataframeAPI:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to dataframe will have 3 columns col_name,data_type,comment
.filter('col_name === "Location") //filter on colname
.collect()(0)(1)
.toString
Result:
String = hdfs://nn:8020/location/part_table
(or)
RDD Api:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on r(0) value
.map(r => r(1)) //get only the location
.mkString //convert as string
.split("8020")(1) //change the split based on your namenode port..etc
Result:
String = /location/part_table

First approach
You can use input_file_name with dataframe.
it will give you absolute file-path for a part file.
spark.read.table("zen.intent_master").select(input_file_name).take(1)
And then extract table path from it.
Second approach
Its more of hack you can say.
package org.apache.spark.sql.hive
import java.net.URI
import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.internal.{SessionState, SharedState}
import org.apache.spark.sql.SparkSession
class TableDetail {
def getTableLocation(table: String, spark: SparkSession): URI = {
val sessionState: SessionState = spark.sessionState
val sharedState: SharedState = spark.sharedState
val catalog: SessionCatalog = sessionState.catalog
val sqlParser: ParserInterface = sessionState.sqlParser
val client = sharedState.externalCatalog match {
case catalog: HiveExternalCatalog => catalog.client
case _: InMemoryCatalog => throw new IllegalArgumentException("In Memory catalog doesn't " +
"support hive client API")
}
val idtfr = sqlParser.parseTableIdentifier(table)
require(catalog.tableExists(idtfr), new IllegalArgumentException(idtfr + " done not exists"))
val rawTable = client.getTable(idtfr.database.getOrElse("default"), idtfr.table)
rawTable.location
}
}

Here is how to do it in PySpark:
(spark.sql("desc formatted mydb.myschema")
.filter("col_name=='Location'")
.collect()[0].data_type)

Use this as re-usable function in your scala project
def getHiveTablePath(tableName: String, spark: SparkSession):String =
{
import org.apache.spark.sql.functions._
val sql: String = String.format("desc formatted %s", tableName)
val result: DataFrame = spark.sql(sql).filter(col("col_name") === "Location")
result.show(false) // just for debug purpose
val info: String = result.collect().mkString(",")
val path: String = info.split(',')(1)
path
}
caller would be
println(getHiveTablePath("src", spark)) // you can prefix schema if you have
Result (I executed in local so file:/ below if its hdfs hdfs:// will come):
+--------+------------------------------------+-------+
|col_name|data_type |comment|
+--------+--------------------------------------------+
|Location|file:/Users/hive/spark-warehouse/src| |
+--------+------------------------------------+-------+
file:/Users/hive/spark-warehouse/src

USE ExternalCatalog
scala> spark
res15: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#4eba6e1f
scala> val metastore = spark.sharedState.externalCatalog
metastore: org.apache.spark.sql.catalyst.catalog.ExternalCatalog = org.apache.spark.sql.hive.HiveExternalCatalog#24b05292
scala> val location = metastore.getTable("meta_data", "mock").location
location: java.net.URI = hdfs://10.1.5.9:4007/usr/hive/warehouse/meta_data.db/mock

Related

To avoid manual files errors how to code dynamic datatype of a column check in spark/scala

We are getting lot of manual files which we need to validate the few datatypes before process the data-frame. Can someone please suggest how can I proceed on this requirement. Basically need to write one spark Generic/common program which should work for many files. if possible please send more detail on this email id as well pathirammi1#gmail.com.
Wondering if your files have records with delimiter seperated (like csv file). If yes, you could very well read it as a text file and split the records based and delimiter and process it.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
object RDDFromCSVFile {
def main(args:Array[String]): Unit ={
def splitString(row:String):Array[String]={
row.split(",")
}
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val sc = spark.sparkContext
val rdd = sc.textFile("randomfile.csv")
val rdd2:RDD = rdd.map(row=>{
val strArray = splitString(row)
val field1 = strArray(0)
val field2 = strArray(1)
val field3 = strArray(3)
val field4 = strArray(4)
// DO custom code here and return to create RDD
})
rdd2.foreach(a=>println(a.toString))
}
}
If you have non-structured data then you should use below code
import org.apache.spark.sql.SparkSession
object RDDFromWholeTextFile {
def main(args:Array[String]): Unit = {
val spark:SparkSession = SparkSession.builder()
.master("local[1]")
.appName("SparkByExample")
.getOrCreate()
val sc = spark.sparkContext
val rdd = sc.wholeTextFiles("alice.txt")
rdd.foreach(a=>println(a._1+"---->"+a._2))
}
}
Hope this helps !!
Thanks,
Naveen

How to set the property name when converting an array column to json in spark? (w/o udf) [duplicate]

Is there a simple way to converting a given Row object to json?
Found this about converting a whole Dataframe to json output:
Spark Row to JSON
But I just want to convert a one Row to json.
Here is pseudo code for what I am trying to do.
More precisely I am reading json as input in a Dataframe.
I am producing a new output that is mainly based on columns, but with one json field for all the info that does not fit into the columns.
My question what is the easiest way to write this function: convertRowToJson()
def convertRowToJson(row: Row): String = ???
def transformVenueTry(row: Row): Try[Venue] = {
Try({
val name = row.getString(row.fieldIndex("name"))
val metadataRow = row.getStruct(row.fieldIndex("meta"))
val score: Double = calcScore(row)
val combinedRow: Row = metadataRow ++ ("score" -> score)
val jsonString: String = convertRowToJson(combinedRow)
Venue(name = name, json = jsonString)
})
}
Psidom's Solutions:
def convertRowToJSON(row: Row): String = {
val m = row.getValuesMap(row.schema.fieldNames)
JSONObject(m).toString()
}
only works if the Row only has one level not with nested Row. This is the schema:
StructType(
StructField(indicator,StringType,true),
StructField(range,
StructType(
StructField(currency_code,StringType,true),
StructField(maxrate,LongType,true),
StructField(minrate,LongType,true)),true))
Also tried Artem suggestion, but that did not compile:
def row2DataFrame(row: Row, sqlContext: SQLContext): DataFrame = {
val sparkContext = sqlContext.sparkContext
import sparkContext._
import sqlContext.implicits._
import sqlContext._
val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
val dataFrame = rowRDD.toDF() //XXX does not compile
dataFrame
}
You can use getValuesMap to convert the row object to a Map and then convert it JSON:
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._
val df = Seq((1,2,3),(2,3,4)).toDF("A", "B", "C")
val row = df.first() // this is an example row object
def convertRowToJSON(row: Row): String = {
val m = row.getValuesMap(row.schema.fieldNames)
JSONObject(m).toString()
}
convertRowToJSON(row)
// res46: String = {"A" : 1, "B" : 2, "C" : 3}
I need to read json input and produce json output.
Most fields are handled individually, but a few json sub objects need to just be preserved.
When Spark reads a dataframe it turns a record into a Row. The Row is a json like structure. That can be transformed and written out to json.
But I need to take some sub json structures out to a string to use as a new field.
This can be done like this:
dataFrameWithJsonField = dataFrame.withColumn("address_json", to_json($"location.address"))
location.address is the path to the sub json object of the incoming json based dataframe. address_json is the column name of that object converted to a string version of the json.
to_json is implemented in Spark 2.1.
If generating it output json using json4s address_json should be parsed to an AST representation otherwise the output json will have the address_json part escaped.
Pay attention scala class scala.util.parsing.json.JSONObject is deprecated and not support null values.
#deprecated("This class will be removed.", "2.11.0")
"JSONFormat.defaultFormat doesn't handle null values"
https://issues.scala-lang.org/browse/SI-5092
JSon has schema but Row doesn't have a schema, so you need to apply schema on Row & convert to JSon. Here is how you can do it.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
def convertRowToJson(row: Row): String = {
val schema = StructType(
StructField("name", StringType, true) ::
StructField("meta", StringType, false) :: Nil)
return sqlContext.applySchema(row, schema).toJSON
}
Essentially, you can have a dataframe which contains just one row. Thus, you can try to filter your initial dataframe and then parse it to json.
I had the same issue, I had parquet files with canonical schema (no arrays), and I only want to get json events. I did as follows, and it seems to work just fine (Spark 2.1):
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.util.parsing.json.JSONFormat.ValueFormatter
import scala.util.parsing.json.{JSONArray, JSONFormat, JSONObject}
def getValuesMap[T](row: Row, schema: StructType): Map[String,Any] = {
schema.fields.map {
field =>
try{
if (field.dataType.typeName.equals("struct")){
field.name -> getValuesMap(row.getAs[Row](field.name), field.dataType.asInstanceOf[StructType])
}else{
field.name -> row.getAs[T](field.name)
}
}catch {case e : Exception =>{field.name -> null.asInstanceOf[T]}}
}.filter(xy => xy._2 != null).toMap
}
def convertRowToJSON(row: Row, schema: StructType): JSONObject = {
val m: Map[String, Any] = getValuesMap(row, schema)
JSONObject(m)
}
//I guess since I am using Any and not nothing the regular ValueFormatter is not working, and I had to add case jmap : Map[String,Any] => JSONObject(jmap).toString(defaultFormatter)
val defaultFormatter : ValueFormatter = (x : Any) => x match {
case s : String => "\"" + JSONFormat.quoteString(s) + "\""
case jo : JSONObject => jo.toString(defaultFormatter)
case jmap : Map[String,Any] => JSONObject(jmap).toString(defaultFormatter)
case ja : JSONArray => ja.toString(defaultFormatter)
case other => other.toString
}
val someFile = "s3a://bucket/file"
val df: DataFrame = sqlContext.read.load(someFile)
val schema: StructType = df.schema
val jsons: Dataset[JSONObject] = df.map(row => convertRowToJSON(row, schema))
if you are iterating through an data frame , you can directly convert the data frame to a new dataframe with json object inside and iterate that
val df_json = df.toJSON
I combining the suggestion from: Artem, KiranM and Psidom. Did a lot of trails and error and came up with this solutions that I tested for nested structures:
def row2Json(row: Row, sqlContext: SQLContext): String = {
import sqlContext.implicits
val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
val dataframe = sqlContext.createDataFrame(rowRDD, row.schema)
dataframe.toJSON.first
}
This solution worked, but only while running in driver mode.

update cassandra from spark

I'm a table in cassandra tfm.foehis that have data.
When i did the first charge of data from spark to cassandra, I used this set of commands:
import org.apache.spark.sql.functions._
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
val wkdir="/home/adminbigdata/tablas/"
val fileIn= "originales/22_FOEHIS2.csv"
val fileOut= "22_FOEHIS_PRE2"
val fileCQL= "22_FOEHISCQL"
val data = sc.textFile(wkdir + fileIn).filter(!_.contains("----")).map(_.trim.replaceAll(" +", "")).map(_.dropRight(1)).map(_.drop(1)).map(_.replaceAll(",", "")).filter(array => array(6) != "MOBIDI").filter(array => array(17) != "").saveAsTextFile(wkdir + fileOut)
val firstDF = spark.read.format("csv").option("header", "true").option("inferSchema", "true").option("mode", "DROPMALFORMED").option("delimiter", "|").load(wkdir + fileOut)
val columns: Array[String] = firstDF.columns
val reorderedColumnNames: Array[String] = Array("hoclic","hodtac","hohrac","hotpac","honrac","hocdan","hocdrs","hocdsl","hocol","hocpny","hodesf","hodtcl","hodtcm","hodtea","hodtra","hodtrc","hodtto","hodtua","hohrcl","hohrcm","hohrea","hohrra","hohrrc","hohrua","holinh","holinr","honumr","hoobs","hooe","hotdsc","hotour","housca","houscl","houscm","housea","houser","housra","housrc")
val secondDF= firstDF.select(reorderedColumnNames.head, reorderedColumnNames.tail: _*)
secondDF.write.cassandraFormat("foehis", "tfm").save()
But when I load new data using the same script, I get errors. I don't know what's wrong?
This is the message:
java.lang.UnsupportedOperationException: 'SaveMode is set to ErrorIfExists and Table
tfm.foehis already exists and contains data.
Perhaps you meant to set the DataFrame write mode to Append?
Example: df.write.format.options.mode(SaveMode.Append).save()" '
The error message clearly says you that you need to use Append mode & shows what you can do with it. In your case it happens because destination table already exists, and writing mode is set to "error if exists". If you still want to write data, the code should be following:
import org.apache.spark.sql.SaveMode
secondDF.write.cassandraFormat("foehis", "tfm").mode(SaveMode.Append).save()

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a csv file into spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the csv file represented as a string and would like to convert this string directly to dataframe. Is this possible?
Update : Starting from Spark 2.2.x
there is finally a proper way to do it using Dataset.
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
import spark.implicits._
val csvData: Dataset[String] = spark.sparkContext.parallelize(
"""
|id, date, timedump
|1, "2014/01/01 23:00:01",1499959917383
|2, "2014/11/31 12:40:32",1198138008843
""".stripMargin.lines.toList).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.show()
frame.printSchema()
Old spark versions
Actually you can, though it's using library internals and not widely advertised. Just create and use your own CsvParser instance.
Example that works for me on spark 1.6.0 and spark-csv_2.10-1.4.0 below
import com.databricks.spark.csv.CsvParser
val csvData = """
|userid,organizationid,userfirstname,usermiddlename,userlastname,usertitle
|1,1,user1,m1,l1,mr
|2,2,user2,m2,l2,mr
|3,3,user3,m3,l3,mr
|""".stripMargin
val rdd = sc.parallelize(csvData.lines.toList)
val csvParser = new CsvParser()
.withUseHeader(true)
.withInferSchema(true)
val csvDataFrame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
You can parse your string into a csv using, e.g. scala-csv:
val myCSVdata : Array[List[String]] =
myCSVString.split('\n').flatMap(CSVParser.parseLine(_))
Here you can do a bit more processing, data cleaning, verifying that every line parses well and has the same number of fields, etc ...
You can then make this an RDD of records:
val myCSVRDD : RDD[List[String]] = sparkContext.parallelize(msCSVdata)
Here you can massage your lists of Strings into a case class, to reflect the fields of your csv data better. You should get some inspiration from the creations of Persons in this example:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
I omit this step.
You can then convert to a DataFrame:
import spark.implicits._
myCSVDataframe = myCSVRDD.toDF()
The accepted answer wasn't working for me in spark 2.2.0 but lead me to what I needed with csvData.lines.toList
val fileUrl = getClass.getResource(s"/file_in_resources.csv")
val stream = fileUrl.getContent.asInstanceOf[InputStream]
val streamString = Source.fromInputStream(stream).mkString
val csvList = streamString.lines.toList
spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(csvList.toDS())
.as[SomeCaseClass]

Xml processing in Spark

Scenario:
My Input will be multiple small XMLs and am Supposed to read these XMLs as RDDs. Perform join with another dataset and form an RDD and send the output as an XML.
Is it possible to read XML using spark, load the data as RDD? If it is possible how will the XML be read.
Sample XML:
<root>
<users>
<user>
<account>1234<\account>
<name>name_1<\name>
<number>34233<\number>
<\user>
<user>
<account>58789<\account>
<name>name_2<\name>
<number>54697<\number>
<\user>
<\users>
<\root>
How will this be loaded into the RDD?
Yes it possible but details will differ depending on an approach you take.
If files are small, as you've mentioned, the simplest solution is to load your data using SparkContext.wholeTextFiles. It loads data as RDD[(String, String)] where the the first element is path and the second file content. Then you parse each file individually like in a local mode.
For larger files you can use Hadoop input formats.
If structure is simple you can split records using textinputformat.record.delimiter. You can find a simple example here. Input is not a XML but it you should give you and idea how to proceed
Otherwise Mahout provides XmlInputFormat
Finally it is possible to read file using SparkContext.textFile and adjust later for record spanning between partitions. Conceptually it means something similar to creating sliding window or partitioning records into groups of fixed size:
use mapPartitionsWithIndex partitions to identify records broken between partitions, collect broken records
use second mapPartitionsWithIndex to repair broken records
Edit:
There is also relatively new spark-xml package which allows you to extract specific records by tag:
val df = sqlContext.read
.format("com.databricks.spark.xml")
.option("rowTag", "foo")
.load("bar.xml")
Here's the way to perform it using HadoopInputFormats to read XML data in spark as explained by #zero323.
Input data:
<root>
<users>
<user>
<account>1234<\account>
<name>name_1<\name>
<number>34233<\number>
<\user>
<user>
<account>58789<\account>
<name>name_2<\name>
<number>54697<\number>
<\user>
<\users>
<\root>
Code for reading XML Input:
You will get some jars at this link
Imports:
//---------------spark_import
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
//----------------xml_loader_import
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{ LongWritable, Text }
import com.cloudera.datascience.common.XmlInputFormat
Code:
object Tester_loader {
case class User(account: String, name: String, number: String)
def main(args: Array[String]): Unit = {
val sparkHome = "/usr/big_data_tools/spark-1.5.0-bin-hadoop2.6/"
val sparkMasterUrl = "spark://SYSTEMX:7077"
var jars = new Array[String](3)
jars(0) = "/home/hduser/Offload_Data_Warehouse_Spark.jar"
jars(1) = "/usr/big_data_tools/JARS/Spark_jar/avro/spark-avro_2.10-2.0.1.jar"
val conf = new SparkConf().setAppName("XML Reading")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.setMaster("local")
.set("spark.cassandra.connection.host", "127.0.0.1")
.setSparkHome(sparkHome)
.set("spark.executor.memory", "512m")
.set("spark.default.deployCores", "12")
.set("spark.cores.max", "12")
.setJars(jars)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// ---- loading user from XML
// calling function 1.1
val pages = readFile("src/input_data", "<user>", "<\\user>", sc)
val xmlUserDF = pages.map { tuple =>
{
val account = extractField(tuple, "account")
val name = extractField(tuple, "name")
val number = extractField(tuple, "number")
User(account, name, number)
}
}.toDF()
println(xmlUserDF.count())
xmlUserDF.show()
}
Functions:
def readFile(path: String, start_tag: String, end_tag: String,
sc: SparkContext) = {
val conf = new Configuration()
conf.set(XmlInputFormat.START_TAG_KEY, start_tag)
conf.set(XmlInputFormat.END_TAG_KEY, end_tag)
val rawXmls = sc.newAPIHadoopFile(
path, classOf[XmlInputFormat], classOf[LongWritable],
classOf[Text], conf)
rawXmls.map(p => p._2.toString)
}
def extractField(tuple: String, tag: String) = {
var value = tuple.replaceAll("\n", " ").replace("<\\", "</")
if (value.contains("<" + tag + ">") &&
value.contains("</" + tag + ">")) {
value = value.split("<" + tag + ">")(1).split("</" + tag + ">")(0)
}
value
}
}
Output:
+-------+------+------+
|account| name|number|
+-------+------+------+
| 1234|name_1| 34233|
| 58789|name_2| 54697|
+-------+------+------+
The result obtained is in dataframes you can convert them to RDD as per your requirement like this->
val xmlUserRDD = xmlUserDF.toJavaRDD.rdd.map { x =>
(x.get(0).toString(),x.get(1).toString(),x.get(2).toString()) }
Please evaluate it, if it could help you some how.
This will help you.
package packagename;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;
import com.databricks.spark.xml.XmlReader;
public class XmlreaderSpark {
public static void main(String arr[]){
String localxml="file path";
String booksFileTag = "user";
String warehouseLocation = "file:" + System.getProperty("user.dir") + "spark-warehouse";
System.out.println("warehouseLocation" + warehouseLocation);
SparkSession spark = SparkSession
.builder()
.master("local")
.appName("Java Spark SQL Example")
.config("spark.some.config.option", "some-value").config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport().config("set spark.sql.crossJoin.enabled", "true")
.getOrCreate();
SQLContext sqlContext = new SQLContext(spark);
Dataset<Row> df = (new XmlReader()).withRowTag(booksFileTag).xmlFile(sqlContext, localxml);
df.show();
}
}
You need to add this dependency in your POM.xml:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.10</artifactId>
<version>0.4.0</version>
</dependency>
and your input file is not in proper format.
Thanks.
There are two good options for simple cases:
wholeTextFiles. Use map method with your XML parser which could be Scala XML pull parser (quicker to code) or the SAX Pull Parser (better performance).
Hadoop streaming XMLInputFormat which you must define the start and end tag <user> </user> to process it, however, it creates one partition per user tag
spark-xml package is a good option too.
With all options you are limited to only process simple XMLs which can be interpreted as dataset with rows and columns.
However, if we make it a little complex, those options won’t be useful.
For example, if you have one more entity there:
<root>
<users>
<user>...</users>
<companies>
<company>...</companies>
</root>
Now you need to generate 2 RDDs and change your parser to recognise the <company> tag.
This is just a simple case, but the XML could be much more complex and you would need to include more and more changes.
To solve this complexity we’ve built Flexter on top of Apache Spark to take the pain out of processing XML files on Spark. I also recommend to read about converting XML on Spark to Parquet. The latter post also includes some code samples that show how the output can be queried with SparkSQL.
Disclaimer: I work for Sonra

Resources