How to iterate through dataframe without converting to dataset in spark? - apache-spark

I have a dataframe through which I want to iterate, but I dont want to convert dataframe to dataset.
We have to convert spark scala code to pyspark and pyspark does not support dataset.
I have tried the following code with by converting to dataset
data in file:
abc,a
mno,b
pqr,a
xyz,b
val a = sc.textFile("<path>")
//creating dataframe with column AA,BB
val b = a.map(x => x.split(",")).map(x =>(x(0).toString,x(1).toString)).toDF("AA","BB")
b.registerTempTable("test")
case class T(AA:String, BB: String)
//creating dataset from dataframe
val d = b.as[T].collect
d.foreach{ x=>
var m = spark.sql(s"select * from test where BB = '${x.BB}'")
m.show()
}
Without converting to dataset it gives error i.e. with
val d = b.collect
d.foreach{ x=>
var m = spark.sql(s"select * from test where BB = '${x.BB}'")
m.show()
}
it gives error:
error: value BB is not member of org.apache.spark.sql.ROW

You cannot loop dataframe as you have given in the above code. Use dataframe's rdd.collect to loop dataframe.
import spark.implicits._
val df = Seq(("abc","a"), ("mno","b"), ("pqr","a"),("xyz","b")).toDF("AA", "BB")
df.registerTempTable("test")
df.rdd.collect.foreach(x => {
val BBvalue = x.mkString(",").split(",")(1)
var m = spark.sql(s"select * from test where BB = '$BBvalue'")
m.show()
})
Inside the loop I used mkString to convert an rdd row to string and then split the column values with comma and use the index of column for accessing the value. For example, in the above code I have used (1) which means, column BB column index is 2.
Please let me know if you have any questions.

Related

How to set the property name when converting an array column to json in spark? (w/o udf) [duplicate]

Is there a simple way to converting a given Row object to json?
Found this about converting a whole Dataframe to json output:
Spark Row to JSON
But I just want to convert a one Row to json.
Here is pseudo code for what I am trying to do.
More precisely I am reading json as input in a Dataframe.
I am producing a new output that is mainly based on columns, but with one json field for all the info that does not fit into the columns.
My question what is the easiest way to write this function: convertRowToJson()
def convertRowToJson(row: Row): String = ???
def transformVenueTry(row: Row): Try[Venue] = {
Try({
val name = row.getString(row.fieldIndex("name"))
val metadataRow = row.getStruct(row.fieldIndex("meta"))
val score: Double = calcScore(row)
val combinedRow: Row = metadataRow ++ ("score" -> score)
val jsonString: String = convertRowToJson(combinedRow)
Venue(name = name, json = jsonString)
})
}
Psidom's Solutions:
def convertRowToJSON(row: Row): String = {
val m = row.getValuesMap(row.schema.fieldNames)
JSONObject(m).toString()
}
only works if the Row only has one level not with nested Row. This is the schema:
StructType(
StructField(indicator,StringType,true),
StructField(range,
StructType(
StructField(currency_code,StringType,true),
StructField(maxrate,LongType,true),
StructField(minrate,LongType,true)),true))
Also tried Artem suggestion, but that did not compile:
def row2DataFrame(row: Row, sqlContext: SQLContext): DataFrame = {
val sparkContext = sqlContext.sparkContext
import sparkContext._
import sqlContext.implicits._
import sqlContext._
val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
val dataFrame = rowRDD.toDF() //XXX does not compile
dataFrame
}
You can use getValuesMap to convert the row object to a Map and then convert it JSON:
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._
val df = Seq((1,2,3),(2,3,4)).toDF("A", "B", "C")
val row = df.first() // this is an example row object
def convertRowToJSON(row: Row): String = {
val m = row.getValuesMap(row.schema.fieldNames)
JSONObject(m).toString()
}
convertRowToJSON(row)
// res46: String = {"A" : 1, "B" : 2, "C" : 3}
I need to read json input and produce json output.
Most fields are handled individually, but a few json sub objects need to just be preserved.
When Spark reads a dataframe it turns a record into a Row. The Row is a json like structure. That can be transformed and written out to json.
But I need to take some sub json structures out to a string to use as a new field.
This can be done like this:
dataFrameWithJsonField = dataFrame.withColumn("address_json", to_json($"location.address"))
location.address is the path to the sub json object of the incoming json based dataframe. address_json is the column name of that object converted to a string version of the json.
to_json is implemented in Spark 2.1.
If generating it output json using json4s address_json should be parsed to an AST representation otherwise the output json will have the address_json part escaped.
Pay attention scala class scala.util.parsing.json.JSONObject is deprecated and not support null values.
#deprecated("This class will be removed.", "2.11.0")
"JSONFormat.defaultFormat doesn't handle null values"
https://issues.scala-lang.org/browse/SI-5092
JSon has schema but Row doesn't have a schema, so you need to apply schema on Row & convert to JSon. Here is how you can do it.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
def convertRowToJson(row: Row): String = {
val schema = StructType(
StructField("name", StringType, true) ::
StructField("meta", StringType, false) :: Nil)
return sqlContext.applySchema(row, schema).toJSON
}
Essentially, you can have a dataframe which contains just one row. Thus, you can try to filter your initial dataframe and then parse it to json.
I had the same issue, I had parquet files with canonical schema (no arrays), and I only want to get json events. I did as follows, and it seems to work just fine (Spark 2.1):
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.util.parsing.json.JSONFormat.ValueFormatter
import scala.util.parsing.json.{JSONArray, JSONFormat, JSONObject}
def getValuesMap[T](row: Row, schema: StructType): Map[String,Any] = {
schema.fields.map {
field =>
try{
if (field.dataType.typeName.equals("struct")){
field.name -> getValuesMap(row.getAs[Row](field.name), field.dataType.asInstanceOf[StructType])
}else{
field.name -> row.getAs[T](field.name)
}
}catch {case e : Exception =>{field.name -> null.asInstanceOf[T]}}
}.filter(xy => xy._2 != null).toMap
}
def convertRowToJSON(row: Row, schema: StructType): JSONObject = {
val m: Map[String, Any] = getValuesMap(row, schema)
JSONObject(m)
}
//I guess since I am using Any and not nothing the regular ValueFormatter is not working, and I had to add case jmap : Map[String,Any] => JSONObject(jmap).toString(defaultFormatter)
val defaultFormatter : ValueFormatter = (x : Any) => x match {
case s : String => "\"" + JSONFormat.quoteString(s) + "\""
case jo : JSONObject => jo.toString(defaultFormatter)
case jmap : Map[String,Any] => JSONObject(jmap).toString(defaultFormatter)
case ja : JSONArray => ja.toString(defaultFormatter)
case other => other.toString
}
val someFile = "s3a://bucket/file"
val df: DataFrame = sqlContext.read.load(someFile)
val schema: StructType = df.schema
val jsons: Dataset[JSONObject] = df.map(row => convertRowToJSON(row, schema))
if you are iterating through an data frame , you can directly convert the data frame to a new dataframe with json object inside and iterate that
val df_json = df.toJSON
I combining the suggestion from: Artem, KiranM and Psidom. Did a lot of trails and error and came up with this solutions that I tested for nested structures:
def row2Json(row: Row, sqlContext: SQLContext): String = {
import sqlContext.implicits
val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
val dataframe = sqlContext.createDataFrame(rowRDD, row.schema)
dataframe.toJSON.first
}
This solution worked, but only while running in driver mode.

Spark ML - StringType is not supported [duplicate]

have a DataFrame with some categorical string values (e.g uuid|url|browser).
I would to convert it in a double to execute an ML algorithm that accept double matrix.
As convertion method I used StringIndexer (spark 1.4) that map my string values to double values, so I defined a function like this:
def str(arg: String, df:DataFrame) : DataFrame =
(
val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg+"_index")
val newDF = indexer.fit(df).transform(df)
return newDF
)
Now the issue is that i would iterate foreach column of a df, call this function and add (or convert) the original string column in the parsed double column, so the result would be:
Initial df:
[String: uuid|String: url| String: browser]
Final df:
[String: uuid|Double: uuid_index|String: url|Double: url_index|String: browser|Double: Browser_index]
Thanks in advance
You can simply foldLeft over the Array of columns:
val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df))
Still, I will argue that it is not a good approach. Since src discards StringIndexerModel it cannot be used when you get new data. Because of that I would recommend using Pipeline:
import org.apache.spark.ml.Pipeline
val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
cname => new StringIndexer()
.setInputCol(cname)
.setOutputCol(s"${cname}_index")
)
// Add the rest of your pipeline like VectorAssembler and algorithm
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???
val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(df)
model.transform(df)
VectorAssembler can be included like this:
val assembler = new VectorAssembler()
.setInputCols(df.columns.map(cname => s"${cname}_index"))
.setOutputCol("features")
val stages = transformers :+ assembler
You could also use RFormula, which is less customizable, but much more concise:
import org.apache.spark.ml.feature.RFormula
val rf = new RFormula().setFormula(" ~ uuid + url + browser - 1")
val rfModel = rf.fit(dataset)
rfModel.transform(dataset)

Taking value from one dataframe and passing that value into loop of SqlContext

Looking to try do something like this:
I have a dataframe that is one column of ID's called ID_LIST. With that column of id's I would like to pass it into a Spark SQL call looping through ID_LIST using foreach returning the result to another dataframe.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val id_list = sqlContext.sql("select distinct id from item_orc")
id_list.registerTempTable("ID_LIST")
id_list.foreach(i => println(i)
id_list println output:
[123]
[234]
[345]
[456]
Trying to now loop through ID_LIST and run a Spark SQL call for each:
id_list.foreach(i => {
val items = sqlContext.sql("select * from another_items_orc where id = " + i
items.foreach(println)
}
First.. not sure how to pull the individual value out, getting this error:
org.apache.spark.sql.AnalysisException: cannot recognize input near '[' '123' ']' in expression specification; line 1 pos 61
Second: how can I alter my code to output the result to a dataframe I can use later ?
Thanks, any help is appreciated!
Answer To First Question
When you perform the "foreach" Spark converts the dataframe into an RDD of type Row. Then when you println on the RDD it prints the Row, the first row being "[123]". It is boxing [] the elements in the row. The elements in the row are accessed by position. If you wanted to print just 123, 234, etc... try
id_list.foreach(i => println(i(0)))
Or you can use native primitive access
id_list.foreach(i => println(i.getString(0))) //For Strings
Seriously... Read the documentation I have linked about Row in Spark. This will transform your code to:
id_list.foreach(i => {
val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
items.foreach(i => println(i.getString(0)))
})
Answer to Second Question
I have a sneaking suspicion about what you actually are trying to do but I'll answer your question as I have interpreted it.
Let's create an empty dataframe which we will union everything to it in a loop of the distinct items from the first dataframe.
import org.apache.spark.sql.types.{StructType, StringType}
import org.apache.spark.sql.Row
// Create the empty dataframe. The schema should reflect the columns
// of the dataframe that you will be adding to it.
val schema = new StructType()
.add("col1", StringType, true)
var df = ss.createDataFrame(ss.sparkContext.emptyRDD[Row], schema)
// Loop over, select, and union to the empty df
id_list.foreach{ i =>
val items = sqlContext.sql("select * from another_items_orc where id = " + i.getString(0))
df = df.union(items)
}
df.show()
You now have the dataframe df that you can use later.
NOTE: An easier thing to do would probably be to join the two dataframes on the matching columns.
import sqlContext.implicits.StringToColumn
val bar = id_list.join(another_items_orc, $"distinct_id" === $"id", "inner").select("id")
bar.show()

create a register dynamic dataframe as temptable in spark

I am trying to registerTemptables, from dynamic dataframes.
I am getting the output as a string., i am not sure if there is a way to execute dataframe or convert a string to dataframe so that the temptable can be created.
Here are the steps to replicate this issue :
import org.apache.spark.sql._
val contact_df = sc.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
val acct_df = sc.makeRDD(1 to 5).map(i => (i, i / i)).toDF("value", "devide")
val dataframeJoins = Array(
Row("x","","","" ,"Y","",1,"contact_hotline_df","contact_df","acct_nbr","hotline_df","tm49_acct_nbr"),
Row("x","","","","Y","",2,"contact_hotline_acct_df","acct_df","tm06_acct_nbr" ,"contact_hotline_df","acct_nbr")
)
val dfJoinbroadcast = sc.broadcast(dataframeJoins)
val DFJoins1 = for ( row <- dfJoinbroadcast.value ) yield {
(row(8)+".registerTempTable(\""+row(8)+"\")" )
}
for (rows <- 0 until DFJoins1.size ){
println(DFJoins1(rows) )
DFJoins1(rows)
}
Here is the output of the above for loop :
contact_df.registerTempTable("contact_df")
acct_df.registerTempTable("acct_df")
I am not getting any error. But the table is not getting created.
When i say sqlContext.sql("select * from contact_df") i am getting an error that table is not created.
Is there a way to convert string to a dataframe and execute the dataframe to create temptable.
Please suggest.
Thanks,
Sreehari
Your code concatenates the strings and prints the result, that's it. The registerTempTable method is not being called, that's why you cant use it in the SQL query. Try to do this:
// assuming we have this string to object mapping
val tableNameToDf = Map("contact_df" -> contact_df, "acct_df" -> acct_df)
you could restructure your for loop into something like:
val dfJoins = for (row <- dfJoinbroadcast.value) yield {
val wannabeTable = row(8)
tableNameToRdd(wannabeTable).createOrReplaceTempView(wannabeTable)
wannabeTableName
}

Spark.ml DataFrame containing SparseVector

I have a spark.ml DataFrame that contains many columns, each of these columns containing a SparseVector per row. I would like to apply MultivariateStatisticalSummary.colStats to each column, and colStats signature is:
def colStats(X: RDD[Vector]): MultivariateStatisticalSummary
which seems perfect... except that I can't seem to select a column from that DataFrame and get it to be a RDD[Vector]. Here is my attempt:
val df: DataFrame = data.select(shardId)
val col = df.as[(org.apache.spark.mllib.linalg.Vector)].rdd
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
which doesn't compile with the message (in Scala):
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val col = df.as[(org.apache.spark.mllib.linalg.Vector)].rdd
I also tried:
val df = data.select(shardId)
val col: RDD[Vector] = df.map(x => x.asInstanceOf[org.apache.spark.mllib.linalg.Vector])
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
which fails at runtime with error:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to org.apache.spark.mllib.linalg.Vector
How can I bridge the gap between DataFrame and colStats?
I found the answer after all:
val df = data.select(shardId)
val col: RDD[Vector] = df.map { _.get(0).asInstanceOf[org.apache.spark.mllib.linalg.Vector] }
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
The trick was only to extract the first element of each row before casting it.

Resources