create a register dynamic dataframe as temptable in spark - apache-spark

I am trying to registerTemptables, from dynamic dataframes.
I am getting the output as a string., i am not sure if there is a way to execute dataframe or convert a string to dataframe so that the temptable can be created.
Here are the steps to replicate this issue :
import org.apache.spark.sql._
val contact_df = sc.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
val acct_df = sc.makeRDD(1 to 5).map(i => (i, i / i)).toDF("value", "devide")
val dataframeJoins = Array(
Row("x","","","" ,"Y","",1,"contact_hotline_df","contact_df","acct_nbr","hotline_df","tm49_acct_nbr"),
Row("x","","","","Y","",2,"contact_hotline_acct_df","acct_df","tm06_acct_nbr" ,"contact_hotline_df","acct_nbr")
)
val dfJoinbroadcast = sc.broadcast(dataframeJoins)
val DFJoins1 = for ( row <- dfJoinbroadcast.value ) yield {
(row(8)+".registerTempTable(\""+row(8)+"\")" )
}
for (rows <- 0 until DFJoins1.size ){
println(DFJoins1(rows) )
DFJoins1(rows)
}
Here is the output of the above for loop :
contact_df.registerTempTable("contact_df")
acct_df.registerTempTable("acct_df")
I am not getting any error. But the table is not getting created.
When i say sqlContext.sql("select * from contact_df") i am getting an error that table is not created.
Is there a way to convert string to a dataframe and execute the dataframe to create temptable.
Please suggest.
Thanks,
Sreehari

Your code concatenates the strings and prints the result, that's it. The registerTempTable method is not being called, that's why you cant use it in the SQL query. Try to do this:
// assuming we have this string to object mapping
val tableNameToDf = Map("contact_df" -> contact_df, "acct_df" -> acct_df)
you could restructure your for loop into something like:
val dfJoins = for (row <- dfJoinbroadcast.value) yield {
val wannabeTable = row(8)
tableNameToRdd(wannabeTable).createOrReplaceTempView(wannabeTable)
wannabeTableName
}

Related

How to iterate through dataframe without converting to dataset in spark?

I have a dataframe through which I want to iterate, but I dont want to convert dataframe to dataset.
We have to convert spark scala code to pyspark and pyspark does not support dataset.
I have tried the following code with by converting to dataset
data in file:
abc,a
mno,b
pqr,a
xyz,b
val a = sc.textFile("<path>")
//creating dataframe with column AA,BB
val b = a.map(x => x.split(",")).map(x =>(x(0).toString,x(1).toString)).toDF("AA","BB")
b.registerTempTable("test")
case class T(AA:String, BB: String)
//creating dataset from dataframe
val d = b.as[T].collect
d.foreach{ x=>
var m = spark.sql(s"select * from test where BB = '${x.BB}'")
m.show()
}
Without converting to dataset it gives error i.e. with
val d = b.collect
d.foreach{ x=>
var m = spark.sql(s"select * from test where BB = '${x.BB}'")
m.show()
}
it gives error:
error: value BB is not member of org.apache.spark.sql.ROW
You cannot loop dataframe as you have given in the above code. Use dataframe's rdd.collect to loop dataframe.
import spark.implicits._
val df = Seq(("abc","a"), ("mno","b"), ("pqr","a"),("xyz","b")).toDF("AA", "BB")
df.registerTempTable("test")
df.rdd.collect.foreach(x => {
val BBvalue = x.mkString(",").split(",")(1)
var m = spark.sql(s"select * from test where BB = '$BBvalue'")
m.show()
})
Inside the loop I used mkString to convert an rdd row to string and then split the column values with comma and use the index of column for accessing the value. For example, in the above code I have used (1) which means, column BB column index is 2.
Please let me know if you have any questions.

Join apache spark dataframes properly with scala avoiding null values

Hellow everyone!
I have two DataFrames in apache spark (2.3) and I want to join them properly. I will explain below what I mean with 'properly'. First of all the two dataframes holds the following information:
nodeDf: ( id, year, title, authors, journal, abstract )
edgeDf: ( srcId, dstId, label )
The label could be 0 or 1 in case node1 is connected with node2 or not.
I want to combine this two dataframes to get one dataframe withe the following information:
JoinedDF: ( id_from, year_from, title_from, journal_from, abstract_from, id_to, year_to, title_to, journal_to, abstract_to, time_dist )
time_dist = abs(year_from - year_to)
When I said 'properly' I meant that the query must be as fast as it could be and I don't want to contain null rows or cels ( value on a row ).
I have tried the following but I took me 500 -540 sec to execute the query and the final dataframe contains null values. I don't even know if the dataframes ware joined correctly.
I want to mention that the node file from which I create the nodeDF has 27770 rows and the edge file (edgeDf) has 615512 rows.
Code:
val spark = SparkSession.builder().master("local[*]").appName("Logistic Regression").getOrCreate()
val sc = spark.sparkContext
val data = sc.textFile("resources/data/training_set.txt").map(line =>{
val fields = line.split(" ")
(fields(0),fields(1), fields(2).toInt)
})
val data2 = sc.textFile("resources/data/test_set.txt").map(line =>{
val fields = line.split(" ")
(fields(0),fields(1))
})
import spark.implicits._
val trainingDF = data.toDF("srcId","dstId", "label")
val testDF = data2.toDF("srcId","dstId")
val infoRDD = spark.read.option("header","false").option("inferSchema","true").format("csv").load("resources/data/node_information.csv")
val infoDF = infoRDD.toDF("srcId","year","title","authors","jurnal","abstract")
println("Showing linksDF sample...")
trainingDF.show(5)
println("Rows of linksDF: ",trainingDF.count())
println("Showing infoDF sample...")
infoDF.show(2)
println("Rows of infoDF: ",infoDF.count())
println("Joining linksDF and infoDF...")
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" === $"b.srcId")
println(joinedDF.count())
joinedDF = joinedDF.select($"a.srcId",$"a.dstId",$"a.label",$"b.year",$"b.title",$"b.authors",$"b.jurnal",$"b.abstract")
joinedDF.show(5)
val graphX = new GraphX()
val pageRankDf =graphX.computePageRank(spark,"resources/data/training_set.txt",0.0001)
println("Joining joinedDF and pageRankDf...")
joinedDF = joinedDF.as("a").join(pageRankDf.as("b"),$"a.srcId" === $"b.nodeId")
var dfWithRanks = joinedDF.select("srcId","dstId","label","year","title","authors","jurnal","abstract","rank").withColumnRenamed("rank","pgRank")
dfWithRanks.show(5)
println("Renameming joinedDF...")
dfWithRanks = dfWithRanks
.withColumnRenamed("srcId","id_from")
.withColumnRenamed("dstId","id_to")
.withColumnRenamed("year","year_from")
.withColumnRenamed("title","title_from")
.withColumnRenamed("authors","authors_from")
.withColumnRenamed("jurnal","jurnal_from")
.withColumnRenamed("abstract","abstract_from")
var infoDfRenamed = dfWithRanks
.withColumnRenamed("id_from","id_from")
.withColumnRenamed("id_to","id_to")
.withColumnRenamed("year_from","year_to")
.withColumnRenamed("title_from","title_to")
.withColumnRenamed("authors_from","authors_to")
.withColumnRenamed("jurnal_from","jurnal_to")
.withColumnRenamed("abstract_from","abstract_to").select("id_to","year_to","title_to","authors_to","jurnal_to","jurnal_to")
var finalDF = dfWithRanks.as("a").join(infoDF.as("b"),$"a.id_to" === $"b.srcId")
finalDF = finalDF
.withColumnRenamed("year","year_to")
.withColumnRenamed("title","title_to")
.withColumnRenamed("authors","authors_to")
.withColumnRenamed("jurnal","jurnal_to")
.withColumnRenamed("abstract","abstract_to")
println("Dropping unused columns from joinedDF...")
finalDF = finalDF.drop("srcId")
finalDF.show(5)
Here are my results!
Avoid all calculations and code related to pgRank! Is there any proper way to do this join works?
You can filter your data first and then join, in that case you will avoid nulls
df.filter($"ColumnName".isNotNull)
use <=> operator in your joining column condition
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" <=> $"b.srcId")
There is a function in spark 2.1 or greater is eqNullSafe
var joinedDF = trainingDF.join(infoDF,trainingDF("srcId").eqNullSafe(infoDF("srcId")))

How to set the property name when converting an array column to json in spark? (w/o udf) [duplicate]

Is there a simple way to converting a given Row object to json?
Found this about converting a whole Dataframe to json output:
Spark Row to JSON
But I just want to convert a one Row to json.
Here is pseudo code for what I am trying to do.
More precisely I am reading json as input in a Dataframe.
I am producing a new output that is mainly based on columns, but with one json field for all the info that does not fit into the columns.
My question what is the easiest way to write this function: convertRowToJson()
def convertRowToJson(row: Row): String = ???
def transformVenueTry(row: Row): Try[Venue] = {
Try({
val name = row.getString(row.fieldIndex("name"))
val metadataRow = row.getStruct(row.fieldIndex("meta"))
val score: Double = calcScore(row)
val combinedRow: Row = metadataRow ++ ("score" -> score)
val jsonString: String = convertRowToJson(combinedRow)
Venue(name = name, json = jsonString)
})
}
Psidom's Solutions:
def convertRowToJSON(row: Row): String = {
val m = row.getValuesMap(row.schema.fieldNames)
JSONObject(m).toString()
}
only works if the Row only has one level not with nested Row. This is the schema:
StructType(
StructField(indicator,StringType,true),
StructField(range,
StructType(
StructField(currency_code,StringType,true),
StructField(maxrate,LongType,true),
StructField(minrate,LongType,true)),true))
Also tried Artem suggestion, but that did not compile:
def row2DataFrame(row: Row, sqlContext: SQLContext): DataFrame = {
val sparkContext = sqlContext.sparkContext
import sparkContext._
import sqlContext.implicits._
import sqlContext._
val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
val dataFrame = rowRDD.toDF() //XXX does not compile
dataFrame
}
You can use getValuesMap to convert the row object to a Map and then convert it JSON:
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._
val df = Seq((1,2,3),(2,3,4)).toDF("A", "B", "C")
val row = df.first() // this is an example row object
def convertRowToJSON(row: Row): String = {
val m = row.getValuesMap(row.schema.fieldNames)
JSONObject(m).toString()
}
convertRowToJSON(row)
// res46: String = {"A" : 1, "B" : 2, "C" : 3}
I need to read json input and produce json output.
Most fields are handled individually, but a few json sub objects need to just be preserved.
When Spark reads a dataframe it turns a record into a Row. The Row is a json like structure. That can be transformed and written out to json.
But I need to take some sub json structures out to a string to use as a new field.
This can be done like this:
dataFrameWithJsonField = dataFrame.withColumn("address_json", to_json($"location.address"))
location.address is the path to the sub json object of the incoming json based dataframe. address_json is the column name of that object converted to a string version of the json.
to_json is implemented in Spark 2.1.
If generating it output json using json4s address_json should be parsed to an AST representation otherwise the output json will have the address_json part escaped.
Pay attention scala class scala.util.parsing.json.JSONObject is deprecated and not support null values.
#deprecated("This class will be removed.", "2.11.0")
"JSONFormat.defaultFormat doesn't handle null values"
https://issues.scala-lang.org/browse/SI-5092
JSon has schema but Row doesn't have a schema, so you need to apply schema on Row & convert to JSon. Here is how you can do it.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
def convertRowToJson(row: Row): String = {
val schema = StructType(
StructField("name", StringType, true) ::
StructField("meta", StringType, false) :: Nil)
return sqlContext.applySchema(row, schema).toJSON
}
Essentially, you can have a dataframe which contains just one row. Thus, you can try to filter your initial dataframe and then parse it to json.
I had the same issue, I had parquet files with canonical schema (no arrays), and I only want to get json events. I did as follows, and it seems to work just fine (Spark 2.1):
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.util.parsing.json.JSONFormat.ValueFormatter
import scala.util.parsing.json.{JSONArray, JSONFormat, JSONObject}
def getValuesMap[T](row: Row, schema: StructType): Map[String,Any] = {
schema.fields.map {
field =>
try{
if (field.dataType.typeName.equals("struct")){
field.name -> getValuesMap(row.getAs[Row](field.name), field.dataType.asInstanceOf[StructType])
}else{
field.name -> row.getAs[T](field.name)
}
}catch {case e : Exception =>{field.name -> null.asInstanceOf[T]}}
}.filter(xy => xy._2 != null).toMap
}
def convertRowToJSON(row: Row, schema: StructType): JSONObject = {
val m: Map[String, Any] = getValuesMap(row, schema)
JSONObject(m)
}
//I guess since I am using Any and not nothing the regular ValueFormatter is not working, and I had to add case jmap : Map[String,Any] => JSONObject(jmap).toString(defaultFormatter)
val defaultFormatter : ValueFormatter = (x : Any) => x match {
case s : String => "\"" + JSONFormat.quoteString(s) + "\""
case jo : JSONObject => jo.toString(defaultFormatter)
case jmap : Map[String,Any] => JSONObject(jmap).toString(defaultFormatter)
case ja : JSONArray => ja.toString(defaultFormatter)
case other => other.toString
}
val someFile = "s3a://bucket/file"
val df: DataFrame = sqlContext.read.load(someFile)
val schema: StructType = df.schema
val jsons: Dataset[JSONObject] = df.map(row => convertRowToJSON(row, schema))
if you are iterating through an data frame , you can directly convert the data frame to a new dataframe with json object inside and iterate that
val df_json = df.toJSON
I combining the suggestion from: Artem, KiranM and Psidom. Did a lot of trails and error and came up with this solutions that I tested for nested structures:
def row2Json(row: Row, sqlContext: SQLContext): String = {
import sqlContext.implicits
val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
val dataframe = sqlContext.createDataFrame(rowRDD, row.schema)
dataframe.toJSON.first
}
This solution worked, but only while running in driver mode.

Taking value from one dataframe and passing that value into loop of SqlContext

Looking to try do something like this:
I have a dataframe that is one column of ID's called ID_LIST. With that column of id's I would like to pass it into a Spark SQL call looping through ID_LIST using foreach returning the result to another dataframe.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val id_list = sqlContext.sql("select distinct id from item_orc")
id_list.registerTempTable("ID_LIST")
id_list.foreach(i => println(i)
id_list println output:
[123]
[234]
[345]
[456]
Trying to now loop through ID_LIST and run a Spark SQL call for each:
id_list.foreach(i => {
val items = sqlContext.sql("select * from another_items_orc where id = " + i
items.foreach(println)
}
First.. not sure how to pull the individual value out, getting this error:
org.apache.spark.sql.AnalysisException: cannot recognize input near '[' '123' ']' in expression specification; line 1 pos 61
Second: how can I alter my code to output the result to a dataframe I can use later ?
Thanks, any help is appreciated!
Answer To First Question
When you perform the "foreach" Spark converts the dataframe into an RDD of type Row. Then when you println on the RDD it prints the Row, the first row being "[123]". It is boxing [] the elements in the row. The elements in the row are accessed by position. If you wanted to print just 123, 234, etc... try
id_list.foreach(i => println(i(0)))
Or you can use native primitive access
id_list.foreach(i => println(i.getString(0))) //For Strings
Seriously... Read the documentation I have linked about Row in Spark. This will transform your code to:
id_list.foreach(i => {
val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
items.foreach(i => println(i.getString(0)))
})
Answer to Second Question
I have a sneaking suspicion about what you actually are trying to do but I'll answer your question as I have interpreted it.
Let's create an empty dataframe which we will union everything to it in a loop of the distinct items from the first dataframe.
import org.apache.spark.sql.types.{StructType, StringType}
import org.apache.spark.sql.Row
// Create the empty dataframe. The schema should reflect the columns
// of the dataframe that you will be adding to it.
val schema = new StructType()
.add("col1", StringType, true)
var df = ss.createDataFrame(ss.sparkContext.emptyRDD[Row], schema)
// Loop over, select, and union to the empty df
id_list.foreach{ i =>
val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
df = df.union(items)
}
df.show()
You now have the dataframe df that you can use later.
NOTE: An easier thing to do would probably be to join the two dataframes on the matching columns.
import sqlContext.implicits.StringToColumn
val bar = id_list.join(another_items_orc, $"distinct_id" === $"id", "inner").select("id")
bar.show()

How two RDD according to funcation get Result RDD

I am a beginner of Apache Spark. I want to filter two RDD into result RDD with the below code
def runSpark(stList:List[SubStTime],icList:List[IcTemp]): Unit ={
val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
val sc = new SparkContext(conf)
val st = sc.parallelize(stList).map(st => ((st.productId,st.routeNo),st)).groupByKey()
val ic = sc.parallelize(icList).map(ic => ((ic.productId,ic.routeNo),ic)).groupByKey()
//TODO
//val result = st.join(ic).mapValues( )
sc.stop()
}
here is what i want to do
List[ST] ->map ->Map(Key,st) ->groupByKey ->Map(Key,List[st])
List[IC] ->map ->Map(Key,ic) ->groupByKey ->Map(Key,List[ic])
STRDD join ICRDD get Map(Key,(List[st],List[ic]))
I have a function compare listST and listIC get the List[result] result contains both SubStTime and IcTemp information
def calcIcSt(st:List[SubStTime],ic:List[IcTemp]): List[result]
I don't know how to use mapvalues or other some way to get my result
Thanks
val result = st.join(ic).mapValues( x => calcIcSt(x._1,x._2) )

Resources