Spark : Update HashMap in Spark Application - apache-spark

I am processing logs in a Spark application. I create a HashMap out of each log line and then process it. As part of the processing, I have to update the value of one of the keys. But instead of updating it, the code seems to add a new key-value pair to the HashMap, and I get the error: Reference 'column1' is ambiguous, could be: column1#31, column1#56
My code looks as follows
logs.foreachRDD(rdd => {
val newRDD = rdd.map(p => createMap(p))
.map(p => updateColumns(p))
val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
import sqlContext.implicits._
val df = sqlContext.createDataFrame(newRDD, reqschema)
df.registerTempTable("myTable")
val df1 = sqlContext.sql("Select column1, column2 from myTable")
})
def createMap(log: String): java.util.HashMap[String, String]={
...
}
def updateColumns(map : java.util.HashMap[String, String])={
map.put("column1", "random") // this statement creates a new key column1 instead of updating the value of the existing key "column1"
}
Does anyone know how we can update the HashMap in this case?
Thanks
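For what it is worth, java.util.HashMap.put replaces the value of an existing key rather than adding a second entry, so the duplicate column more likely comes from somewhere else (for example from reqschema or from how the rows are built, neither of which is shown). Below is a minimal sketch in plain Scala, with made-up sample values, that checks this behaviour. It also returns the map from updateColumns, since put itself returns the previous value and not the map:

```scala
import java.util.{HashMap => JHashMap}

// Hypothetical stand-in for the question's updateColumns; sample values are made up.
def updateColumns(map: JHashMap[String, String]): JHashMap[String, String] = {
  map.put("column1", "random") // overwrites the value if "column1" already exists
  map                          // return the map itself, not the result of put
}

val m = new JHashMap[String, String]()
m.put("column1", "original")
m.put("column2", "other")

val updated = updateColumns(m)
println(updated.size())         // 2
println(updated.get("column1")) // random
```

If the map still has only one "column1" entry at this point, the ambiguity is probably introduced by the schema or by a later join or select.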

Related

Spark Streaming reach dataframe columns and add new column looking up to Redis

In my previous question (Spark Structured Streaming dynamic lookup with Redis), I managed to reach Redis with mapPartitions thanks to https://stackoverflow.com/users/689676/fe2s
I tried to use mapPartitions, but there is one point I could not solve: how can I access the columns of each row while iterating in the code below?
I want to enrich each row with the lookup fields kept in Redis.
I found something like this, but how can I access the DataFrame columns and add a new column by looking it up in Redis?
I would really appreciate any help. Thanks.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
def transformRow(row: Row): Row = {
Row.fromSeq(row.toSeq ++ Array[Any]("val1", "val2"))
}
def transformRows(iter: Iterator[Row]): Iterator[Row] =
{
val redisConn =new RedisClient("xxx.xxx.xx.xxx",6379,1,Option("Secret123"))
println(redisConn.get("ModelValidityPeriodName").getOrElse(""))
//want to reach DataFrame column here
redisConn.close()
iter.map(transformRow)
}
val newSchema = StructType(raw_customer_df.schema.fields ++
Array(
StructField("ModelValidityPeriod", StringType, false),
StructField("ModelValidityPeriod2", StringType, false)
)
)
spark.sqlContext.createDataFrame(raw_customer_df.rdd.mapPartitions(transformRows), newSchema).show
The iterator iter represents an iterator over the DataFrame rows. So if I understood your question correctly, you can access column values by iterating over iter and calling
row.getAs[Column_Type](column_name)
Something like this
def transformRows(iter: Iterator[Row]): Iterator[Row] = {
val redisConn = new RedisClient("xxx.xxx.xx.xxx",6379,1,Option("Secret123"))
println(redisConn.get("ModelValidityPeriodName").getOrElse(""))
//want to reach DataFrame column here
val res = iter.map { row =>
val columnValue = row.getAs[String]("column_name")
// lookup in redis
val valueFromRedis = redisConn.get(...)
Row.fromSeq(row.toSeq ++ Array[Any](valueFromRedis))
}.toList // materialize now: iter.map is lazy, and the lookups must run before the connection is closed
redisConn.close()
res.iterator
}
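The .toList in the answer matters because Iterator.map is lazy: without it, the Redis lookups would only run when Spark consumes the iterator, after the connection has already been closed. Here is a small sketch of that pattern in plain Scala, using a hypothetical FakeConn class in place of the real Redis client:

```scala
// Hypothetical stand-in for a Redis client; only the open/closed state is modelled.
class FakeConn {
  private var open = true
  def get(key: String): String = {
    require(open, "connection already closed")
    s"value-for-$key"
  }
  def close(): Unit = { open = false }
}

def enrich(iter: Iterator[String]): Iterator[String] = {
  val conn = new FakeConn
  // toList forces every lookup to run while the connection is still open
  val res = iter.map(k => conn.get(k)).toList
  conn.close()
  res.iterator
}

val out = enrich(Iterator("a", "b")).toList
println(out) // List(value-for-a, value-for-b)
```

The trade-off is that the whole partition is held in memory, so this suits partitions of moderate size.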

How to get the value of the location for a Hive table using a Spark object?

I am interested in being able to retrieve the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the location via the following SQL query:
describe formatted <table name>
I was wondering if there is another way to obtain the location value without having to parse the output. An API would be great in case the output of the above command changes between Hive versions. If an external dependency is needed, which would it be? Is there some sample spark code that can obtain the location value?
Here is the correct answer:
import org.apache.spark.sql.catalyst.TableIdentifier
lazy val tblMetadata = spark.sessionState.catalog.getTableMetadata(new TableIdentifier(tableName,Some(schema)))
You can also call .toDF on the output of desc formatted and then filter the resulting DataFrame.
DataFrame API:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.toDF //convert to dataframe will have 3 columns col_name,data_type,comment
.filter('col_name === "Location") //filter on colname
.collect()(0)(1)
.toString
Result:
String = hdfs://nn:8020/location/part_table
(or)
RDD API:
scala> :paste
spark.sql("desc formatted data_db.part_table")
.collect()
.filter(r => r(0).equals("Location")) //filter on r(0) value
.map(r => r(1)) //get only the location
.mkString //convert as string
.split("8020")(1) //change the split based on your namenode port..etc
Result:
String = /location/part_table
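Splitting on the namenode port is brittle, since it breaks as soon as the port changes or the digits appear elsewhere in the path. Here is a small sketch, assuming the location is a well-formed URI like the one returned by the DataFrame API variant above, that uses java.net.URI to separate the path from the scheme and authority:

```scala
import java.net.URI

// Sample value copied from the DataFrame API result shown above.
val location = "hdfs://nn:8020/location/part_table"

val uri = new URI(location)
val path = uri.getPath           // path without scheme, host and port
val authority = uri.getAuthority // host and port

println(path)      // /location/part_table
println(authority) // nn:8020
```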
First approach
You can use input_file_name with a DataFrame.
It gives you the absolute file path of a part file.
spark.read.table("zen.intent_master").select(input_file_name()).take(1)
You can then extract the table path from it.
Second approach
It's more of a hack, you could say.
package org.apache.spark.sql.hive
import java.net.URI
import org.apache.spark.sql.catalyst.catalog.{InMemoryCatalog, SessionCatalog}
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.internal.{SessionState, SharedState}
import org.apache.spark.sql.SparkSession
class TableDetail {
def getTableLocation(table: String, spark: SparkSession): URI = {
val sessionState: SessionState = spark.sessionState
val sharedState: SharedState = spark.sharedState
val catalog: SessionCatalog = sessionState.catalog
val sqlParser: ParserInterface = sessionState.sqlParser
val client = sharedState.externalCatalog match {
case catalog: HiveExternalCatalog => catalog.client
case _: InMemoryCatalog => throw new IllegalArgumentException("In Memory catalog doesn't " +
"support hive client API")
}
val idtfr = sqlParser.parseTableIdentifier(table)
require(catalog.tableExists(idtfr), new IllegalArgumentException(idtfr + " does not exist"))
val rawTable = client.getTable(idtfr.database.getOrElse("default"), idtfr.table)
rawTable.location
}
}
Here is how to do it in PySpark:
(spark.sql("desc formatted mydb.myschema")
.filter("col_name=='Location'")
.collect()[0].data_type)
Use this as a reusable function in your Scala project:
def getHiveTablePath(tableName: String, spark: SparkSession):String =
{
import org.apache.spark.sql.functions._
val sql: String = String.format("desc formatted %s", tableName)
val result: DataFrame = spark.sql(sql).filter(col("col_name") === "Location")
result.show(false) // just for debug purpose
val info: String = result.collect().mkString(",")
val path: String = info.split(',')(1)
path
}
The caller would be:
println(getHiveTablePath("src", spark)) // you can prefix the schema if you have one
Result (I ran this locally, hence the file:/ prefix below; on HDFS the path would start with hdfs://):
+--------+------------------------------------+-------+
|col_name|data_type                           |comment|
+--------+------------------------------------+-------+
|Location|file:/Users/hive/spark-warehouse/src|       |
+--------+------------------------------------+-------+
file:/Users/hive/spark-warehouse/src
Use ExternalCatalog:
scala> spark
res15: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@4eba6e1f
scala> val metastore = spark.sharedState.externalCatalog
metastore: org.apache.spark.sql.catalyst.catalog.ExternalCatalog = org.apache.spark.sql.hive.HiveExternalCatalog@24b05292
scala> val location = metastore.getTable("meta_data", "mock").location
location: java.net.URI = hdfs://10.1.5.9:4007/usr/hive/warehouse/meta_data.db/mock

How to set the property name when converting an array column to json in spark? (w/o udf) [duplicate]

Is there a simple way to convert a given Row object to JSON?
I found this about converting a whole DataFrame to JSON output:
Spark Row to JSON
But I just want to convert a single Row to JSON.
Here is pseudocode for what I am trying to do.
More precisely, I am reading JSON as input into a DataFrame.
I am producing a new output that is mainly based on columns, but with one JSON field for all the info that does not fit into the columns.
My question: what is the easiest way to write this function, convertRowToJson()?
def convertRowToJson(row: Row): String = ???
def transformVenueTry(row: Row): Try[Venue] = {
Try({
val name = row.getString(row.fieldIndex("name"))
val metadataRow = row.getStruct(row.fieldIndex("meta"))
val score: Double = calcScore(row)
val combinedRow: Row = metadataRow ++ ("score" -> score)
val jsonString: String = convertRowToJson(combinedRow)
Venue(name = name, json = jsonString)
})
}
Psidom's solution:
def convertRowToJSON(row: Row): String = {
val m = row.getValuesMap(row.schema.fieldNames)
JSONObject(m).toString()
}
only works if the Row has a single level, not with a nested Row. This is the schema:
StructType(
StructField(indicator,StringType,true),
StructField(range,
StructType(
StructField(currency_code,StringType,true),
StructField(maxrate,LongType,true),
StructField(minrate,LongType,true)),true))
I also tried Artem's suggestion, but it did not compile:
def row2DataFrame(row: Row, sqlContext: SQLContext): DataFrame = {
val sparkContext = sqlContext.sparkContext
import sparkContext._
import sqlContext.implicits._
import sqlContext._
val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
val dataFrame = rowRDD.toDF() //XXX does not compile
dataFrame
}
You can use getValuesMap to convert the Row object to a Map and then convert it to JSON:
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._
val df = Seq((1,2,3),(2,3,4)).toDF("A", "B", "C")
val row = df.first() // this is an example row object
def convertRowToJSON(row: Row): String = {
val m = row.getValuesMap(row.schema.fieldNames)
JSONObject(m).toString()
}
convertRowToJSON(row)
// res46: String = {"A" : 1, "B" : 2, "C" : 3}
I need to read JSON input and produce JSON output.
Most fields are handled individually, but a few JSON sub-objects just need to be preserved as they are.
When Spark reads a DataFrame, it turns each record into a Row. The Row is a JSON-like structure that can be transformed and written out as JSON.
But I need to extract some JSON substructures into a string to use as a new field.
This can be done like this:
val dataFrameWithJsonField = dataFrame.withColumn("address_json", to_json($"location.address"))
location.address is the path to the sub json object of the incoming json based dataframe. address_json is the column name of that object converted to a string version of the json.
to_json is available as of Spark 2.1.
If you generate the output JSON with json4s, address_json should be parsed into an AST representation; otherwise the address_json part will be escaped in the output JSON.
Note that the Scala class scala.util.parsing.json.JSONObject is deprecated and does not support null values.
@deprecated("This class will be removed.", "2.11.0")
"JSONFormat.defaultFormat doesn't handle null values"
https://issues.scala-lang.org/browse/SI-5092
JSON needs field names, but a Row does not necessarily carry a schema, so you need to apply a schema to the Row and then convert it to JSON. Here is how you can do it.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
def convertRowToJson(row: Row): String = {
val schema = StructType(
StructField("name", StringType, true) ::
StructField("meta", StringType, false) :: Nil)
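// applySchema expects an RDD[Row] and is deprecated; wrap the single row and use createDataFrame
val rowRDD = sqlContext.sparkContext.makeRDD(row :: Nil)
sqlContext.createDataFrame(rowRDD, schema).toJSON.first()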
}
Essentially, you can have a dataframe which contains just one row. Thus, you can try to filter your initial dataframe and then parse it to json.
I had the same issue, I had parquet files with canonical schema (no arrays), and I only want to get json events. I did as follows, and it seems to work just fine (Spark 2.1):
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.util.parsing.json.JSONFormat.ValueFormatter
import scala.util.parsing.json.{JSONArray, JSONFormat, JSONObject}
def getValuesMap[T](row: Row, schema: StructType): Map[String,Any] = {
schema.fields.map {
field =>
try{
if (field.dataType.typeName.equals("struct")){
field.name -> getValuesMap(row.getAs[Row](field.name), field.dataType.asInstanceOf[StructType])
}else{
field.name -> row.getAs[T](field.name)
}
}catch {case e : Exception =>{field.name -> null.asInstanceOf[T]}}
}.filter(xy => xy._2 != null).toMap
}
def convertRowToJSON(row: Row, schema: StructType): JSONObject = {
val m: Map[String, Any] = getValuesMap(row, schema)
JSONObject(m)
}
//I guess since I am using Any and not nothing the regular ValueFormatter is not working, and I had to add case jmap : Map[String,Any] => JSONObject(jmap).toString(defaultFormatter)
val defaultFormatter : ValueFormatter = (x : Any) => x match {
case s : String => "\"" + JSONFormat.quoteString(s) + "\""
case jo : JSONObject => jo.toString(defaultFormatter)
case jmap : Map[String,Any] => JSONObject(jmap).toString(defaultFormatter)
case ja : JSONArray => ja.toString(defaultFormatter)
case other => other.toString
}
val someFile = "s3a://bucket/file"
val df: DataFrame = sqlContext.read.load(someFile)
val schema: StructType = df.schema
val jsons: Dataset[JSONObject] = df.map(row => convertRowToJSON(row, schema))
If you are iterating through a DataFrame, you can convert it directly to a new Dataset of JSON strings and iterate over that instead:
val df_json = df.toJSON
I combined the suggestions from Artem, KiranM and Psidom. After a lot of trial and error, I came up with this solution, which I tested with nested structures:
def row2Json(row: Row, sqlContext: SQLContext): String = {
import sqlContext.implicits._
val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
val dataframe = sqlContext.createDataFrame(rowRDD, row.schema)
dataframe.toJSON.first
}
This solution worked, but only while running in driver mode.

create a register dynamic dataframe as temptable in spark

I am trying to register temp tables from dynamic DataFrames.
I am getting the output as a string. I am not sure whether there is a way to execute that string or convert it to a DataFrame so that the temp table can be created.
Here are the steps to reproduce the issue:
import org.apache.spark.sql._
val contact_df = sc.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
val acct_df = sc.makeRDD(1 to 5).map(i => (i, i / i)).toDF("value", "devide")
val dataframeJoins = Array(
Row("x","","","" ,"Y","",1,"contact_hotline_df","contact_df","acct_nbr","hotline_df","tm49_acct_nbr"),
Row("x","","","","Y","",2,"contact_hotline_acct_df","acct_df","tm06_acct_nbr" ,"contact_hotline_df","acct_nbr")
)
val dfJoinbroadcast = sc.broadcast(dataframeJoins)
val DFJoins1 = for ( row <- dfJoinbroadcast.value ) yield {
(row(8)+".registerTempTable(\""+row(8)+"\")" )
}
for (rows <- 0 until DFJoins1.size ){
println(DFJoins1(rows) )
DFJoins1(rows)
}
Here is the output of the above for loop :
contact_df.registerTempTable("contact_df")
acct_df.registerTempTable("acct_df")
I am not getting any error, but the table is not created.
When I run sqlContext.sql("select * from contact_df"), I get an error saying that the table does not exist.
Is there a way to convert a string to a DataFrame and execute it to create the temp table?
Please suggest.
Thanks,
Sreehari
Your code concatenates the strings and prints the result, and that's it. The registerTempTable method is never called, which is why you can't use the table in the SQL query. Try this:
// assuming we have this string to object mapping
val tableNameToDf = Map("contact_df" -> contact_df, "acct_df" -> acct_df)
You could then restructure your for loop into something like:
val dfJoins = for (row <- dfJoinbroadcast.value) yield {
val wannabeTable = row.getString(8) // table name stored at index 8
// look up the DataFrame by name and register it under that name
tableNameToDf(wannabeTable).createOrReplaceTempView(wannabeTable)
wannabeTable
}

Execute SQL on Ignite cache of BinaryObjects

I am creating a cache of BinaryObjects from a Spark DataFrame, and then I want to run SQL on that Ignite cache.
Here is my code where bank is the dataframe which contains three fields (id,name and age):
val ic = new IgniteContext(sc, () => new IgniteConfiguration())
val cacheConfig = new CacheConfiguration[BinaryObject, BinaryObject]()
cacheConfig.setName("test123")
cacheConfig.setStoreKeepBinary(true)
cacheConfig.setIndexedTypes(classOf[BinaryObject], classOf[BinaryObject])
val qe = new QueryEntity()
qe.setKeyType("TestKey")
qe.setValueType("TestValue")
val fields = new java.util.LinkedHashMap[String, String]()
fields.put("id", "java.lang.Long")
fields.put("name", "java.lang.String")
fields.put("age", "java.lang.Int")
qe.setFields(fields)
val qes = new java.util.ArrayList[QueryEntity]()
qes.add(qe)
cacheConfig.setQueryEntities(qes)
val cache = ic.fromCache[BinaryObject, BinaryObject](cacheConfig)
cache.savePairs(bank.rdd, (row: Bank, iContext: IgniteContext) => {
val keyBuilder = iContext.ignite().binary().builder("TestKey");
keyBuilder.setField("id", row.id);
val key = keyBuilder.build();
val valueBuilder = iContext.ignite().binary().builder("TestValue");
valueBuilder.setField("name", row.name);
valueBuilder.setField("age", row.age);
val value = valueBuilder.build();
(key, value);
}, true)
Now I am trying to execute an SQL query like this:
cache.sql("select age from TestValue")
which fails with the following exception:
Caused by: org.h2.jdbc.JdbcSQLException: Column "AGE" not found; SQL statement:
select age from TestValue [42122-191]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:345)
at org.h2.message.DbException.get(DbException.java:179)
at org.h2.message.DbException.get(DbException.java:155)
at org.h2.expression.ExpressionColumn.optimize(ExpressionColumn.java:147)
at org.h2.command.dml.Select.prepare(Select.java:852)
What am I doing wrong here?
The type of field age is incorrect, it should be the following:
fields.put("age", "java.lang.Integer")
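The underlying reason is that java.lang.Int is not a JVM class; Scala's Int corresponds to the boxed java.lang.Integer. Here is a small sketch in plain Scala (no Ignite involved, field names taken from the question) that derives the type names from classOf, so that a typo of this kind cannot slip in:

```scala
import scala.util.Try

// Build the field -> type-name map from class references instead of string literals.
val fields = new java.util.LinkedHashMap[String, String]()
fields.put("id", classOf[java.lang.Long].getName)
fields.put("name", classOf[java.lang.String].getName)
fields.put("age", classOf[java.lang.Integer].getName)

// There is no JVM class called java.lang.Int, so loading it fails.
val intClassMissing = Try(Class.forName("java.lang.Int")).isFailure

println(fields.get("age")) // java.lang.Integer
println(intClassMissing)   // true
```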
