Unable to Parse from String to Int within Case Class - apache-spark

Can someone help me figure out what I am missing in this code? I am unable to parse the phone number from String to Int.
case class contactNew(id: Long, name: String, phone: Int, email: String)
val contactNewData = Array(
  "1#Avinash#Mob-8885453419#avinashbasetty#gmail.com",
  "2#rajsekhar#Mob-9848022338#raj#yahoo.com",
  "3#kamal#Mob-98032446443#kamal#gmail.com")
val contactNewDataToRDD = spark.sparkContext.parallelize(contactNewData)
val contactNewRDD = contactNewDataToRDD.map { l =>
  val contactArray = l.split("#")
  val MobRegex = contactArray(2).replaceAll("[a-zA-Z/-]", "")
  val MobRegex_Int = MobRegex.toInt
  contactNew(contactArray(0).toLong, contactArray(1), MobRegex_Int, contactArray(3))
}
contactNewRDD.collect.foreach(println)

You are getting the error because the last phone number (98032446443) is larger than the maximum value an Int can hold (2147483647).
Convert the number to Long and it should work.
case class contactNew(id: Long, name: String, phone: Long, email: String)
val contactNewData = Array(
  "1#Avinash#Mob-8885453419#avinashbasetty#gmail.com",
  "2#rajsekhar#Mob-9848022338#raj#yahoo.com",
  "3#kamal#Mob-98032446443#kamal#gmail.com")
val contactNewDataToRDD = spark.sparkContext.parallelize(contactNewData)
val contactNewRDD = contactNewDataToRDD.map { l =>
  val contactArray = l.split("#")
  // strip the "Mob-" prefix, leaving only the digits
  val mobDigits = contactArray(2).replaceAll("[a-zA-Z/-]", "")
  // parse as Long, since 98032446443 does not fit in an Int
  val mobNumber = mobDigits.toLong
  contactNew(contactArray(0).toLong, contactArray(1), mobNumber, contactArray(3))
}
contactNewRDD.collect.foreach(println)
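As a side note, a minimal sketch of guarding against malformed phone numbers with scala.util.Try, assuming the same '#' layout (the -1L fallback is just an illustration):
import scala.util.Try

val contactSafeRDD = contactNewDataToRDD.map { l =>
  val contactArray = l.split("#")
  // keep only the digits of the phone field
  val digitsOnly = contactArray(2).replaceAll("[^0-9]", "")
  // fall back to -1L if the remaining string still cannot be parsed as a Long
  val phone = Try(digitsOnly.toLong).getOrElse(-1L)
  contactNew(contactArray(0).toLong, contactArray(1), phone, contactArray(3))
}
contactSafeRDD.collect.foreach(println)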

Related

How to set the property name when converting an array column to json in spark? (w/o udf) [duplicate]

Is there a simple way to convert a given Row object to json?
Found this about converting a whole Dataframe to json output:
Spark Row to JSON
But I just want to convert one Row to json.
Here is pseudo code for what I am trying to do.
More precisely I am reading json as input in a Dataframe.
I am producing a new output that is mainly based on columns, but with one json field for all the info that does not fit into the columns.
My question is: what is the easiest way to write this function, convertRowToJson()?
def convertRowToJson(row: Row): String = ???

def transformVenueTry(row: Row): Try[Venue] = {
  Try {
    val name = row.getString(row.fieldIndex("name"))
    val metadataRow = row.getStruct(row.fieldIndex("meta"))
    val score: Double = calcScore(row)
    val combinedRow: Row = metadataRow ++ ("score" -> score)
    val jsonString: String = convertRowToJson(combinedRow)
    Venue(name = name, json = jsonString)
  }
}
Psidom's solution:
def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}
only works if the Row has a single level, not with nested Rows. This is the schema:
StructType(
  StructField(indicator, StringType, true),
  StructField(range,
    StructType(
      StructField(currency_code, StringType, true),
      StructField(maxrate, LongType, true),
      StructField(minrate, LongType, true)), true))
I also tried Artem's suggestion, but it did not compile:
def row2DataFrame(row: Row, sqlContext: SQLContext): DataFrame = {
  val sparkContext = sqlContext.sparkContext
  import sparkContext._
  import sqlContext.implicits._
  import sqlContext._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataFrame = rowRDD.toDF() // XXX does not compile
  dataFrame
}
You can use getValuesMap to convert the Row object to a Map and then convert it to JSON:
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._

val df = Seq((1, 2, 3), (2, 3, 4)).toDF("A", "B", "C")
val row = df.first() // this is an example row object

def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}

convertRowToJSON(row)
// res46: String = {"A" : 1, "B" : 2, "C" : 3}
I need to read json input and produce json output.
Most fields are handled individually, but a few json sub objects need to just be preserved.
When Spark reads a dataframe, it turns each record into a Row. A Row is a json-like structure that can be transformed and written out to json.
But I need to pull some json sub-structures out into a string to use as a new field.
This can be done like this:
dataFrameWithJsonField = dataFrame.withColumn("address_json", to_json($"location.address"))
location.address is the path to the json sub-object in the incoming json-based dataframe; address_json is the name of the new column holding that object converted to a json string.
to_json was introduced in Spark 2.1.
If the output json is generated with json4s, address_json should first be parsed back into an AST representation; otherwise the address_json part will appear in the output json as an escaped string.
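For example, a minimal sketch of re-parsing the string with json4s so it is embedded as a nested object rather than an escaped string (the embedAddress helper and the "name"/"address" field names are only illustrative):
import org.json4s._
import org.json4s.jackson.JsonMethods._

// addressJson is the string value of the address_json column for one record
def embedAddress(name: String, addressJson: String): String = {
  val addressAst: JValue = parse(addressJson) // back to an AST, not an escaped string
  val record = JObject("name" -> JString(name), "address" -> addressAst)
  compact(render(record))
}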
Note that the Scala class scala.util.parsing.json.JSONObject is deprecated and does not support null values:
@deprecated("This class will be removed.", "2.11.0")
"JSONFormat.defaultFormat doesn't handle null values"
https://issues.scala-lang.org/browse/SI-5092
JSON has a schema, but a Row doesn't always carry one, so you need to apply a schema to the Row and then convert it to JSON. Here is how you can do it.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

def convertRowToJson(row: Row): String = {
  val schema = StructType(
    StructField("name", StringType, true) ::
    StructField("meta", StringType, false) :: Nil)
  // createDataFrame expects an RDD of Rows, and toJSON yields a collection of json strings,
  // so wrap the single Row and take the first (and only) result
  sqlContext.createDataFrame(sqlContext.sparkContext.makeRDD(row :: Nil), schema).toJSON.first()
}
Essentially, you can have a dataframe that contains just one row. You can filter your initial dataframe down to that row and then serialize it to json.
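A minimal sketch of that idea (the filter expression is just a placeholder for whatever identifies the record):
// keep only the record of interest, then serialize the resulting one-row DataFrame
val singleRowJson: String = dataFrame.filter("id = 42").toJSON.first()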
I had the same issue: I had parquet files with a canonical schema (no arrays), and I only wanted to get json events. I did it as follows, and it seems to work just fine (Spark 2.1):
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.util.parsing.json.JSONFormat.ValueFormatter
import scala.util.parsing.json.{JSONArray, JSONFormat, JSONObject}

def getValuesMap[T](row: Row, schema: StructType): Map[String, Any] = {
  schema.fields.map { field =>
    try {
      if (field.dataType.typeName.equals("struct")) {
        // recurse into nested structs
        field.name -> getValuesMap(row.getAs[Row](field.name), field.dataType.asInstanceOf[StructType])
      } else {
        field.name -> row.getAs[T](field.name)
      }
    } catch {
      case e: Exception => field.name -> null.asInstanceOf[T]
    }
  }.filter(xy => xy._2 != null).toMap
}

def convertRowToJSON(row: Row, schema: StructType): JSONObject = {
  val m: Map[String, Any] = getValuesMap(row, schema)
  JSONObject(m)
}

// Since the map values are typed as Any, the regular ValueFormatter did not work,
// so the case for Map[String, Any] had to be added here.
val defaultFormatter: ValueFormatter = (x: Any) => x match {
  case s: String => "\"" + JSONFormat.quoteString(s) + "\""
  case jo: JSONObject => jo.toString(defaultFormatter)
  case jmap: Map[String, Any] => JSONObject(jmap).toString(defaultFormatter)
  case ja: JSONArray => ja.toString(defaultFormatter)
  case other => other.toString
}

val someFile = "s3a://bucket/file"
val df: DataFrame = sqlContext.read.load(someFile)
val schema: StructType = df.schema
val jsons: Dataset[JSONObject] = df.map(row => convertRowToJSON(row, schema))
If you are iterating through a DataFrame, you can directly convert it to a collection of json strings and iterate over that:
val df_json = df.toJSON
I combined the suggestions from Artem, KiranM and Psidom. After a lot of trial and error I came up with this solution, which I tested for nested structures:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}

def row2Json(row: Row, sqlContext: SQLContext): String = {
  // wrap the single Row in an RDD, rebuild a DataFrame with the Row's own schema,
  // and serialize that one-row DataFrame to json
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataframe = sqlContext.createDataFrame(rowRDD, row.schema)
  dataframe.toJSON.first
}
This solution worked, but only when running on the driver; because it builds a new DataFrame through the SQLContext, it cannot be called from inside an executor-side transformation.
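For example, calling it from the driver on a single row (assuming df and sqlContext are already in scope):
val firstRowJson: String = row2Json(df.head(), sqlContext)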

Spark Testing Base returning a tuple of Dstream

I'm using spark-testing-base and I can't get the code to compile.
test("Testging") {
val inputInsert = A("data2")
val inputDelete = A("data1")
val outputInsert = B(1)
val outputDelete = C(1)
val input = List(List(inputInsert), List(inputDelete))
val output = (List(List(outputInsert)), List(List(outputDelete)))
//Why doesn't it compile?? I have tried many things here.
testOperation[A,(B,C)](input, service.processing _, output)
}
My method is:
def processing(avroDstream: DStream[A]) : (DStream[B],DStream[C]) ={...}
What does the "_" mean in this case?

Spark scala: convert Iterator[char] to RDD[String]

I am reading data from a file and have reached a point where the data type is Iterator[Char]. Is there a way to transform Iterator[Char] to RDD[String], which I could then transform to a DataFrame/Dataset using a case class?
Below is the code:
val fileDir = "inputFileName"
val result = IOUtils.toByteArray(new FileInputStream (new File(fileDir)))
val remove_comp = result.grouped(171).map{arr => arr.update(2, 32);arr}.flatMap{arr => arr.update(3, 32); arr}
val convert_char = remove_comp.map( _.toChar)
This returns convert_char: Iterator[Char] = non-empty iterator
Thanks
Not sure what you are trying to do, but this should answer your question:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val ic: Iterator[Char] = ???
val spark: SparkSession = ???
val rdd: RDD[String] = spark.sparkContext.parallelize(ic.map(_.toString).toSeq)
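Since the question mentions going on to a DataFrame/Dataset via a case class, here is a minimal sketch of that next step (the Record class and its single field are hypothetical; adapt the mapping to the real record layout):
case class Record(value: String)

import spark.implicits._
// each String in the RDD becomes one Record; replace with real parsing logic as needed
val ds = rdd.map(Record(_)).toDS()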

How do we achieve sort by two different fields in Spark-Core?

I am doing some basic programming in Spark.
Input file:
2008,20
2008,40
2000,10
2000,30
2001,9
My Spark code:
scala> val dataRDD = sc.textFile("/user/cloudera/inputfiles/year.txt")
scala> val mapRDD = dataRDD.map(elem => elem.split(","))
scala> val keyValueRDD = mapRDD.map( elem => (elem(0),elem(1)))
scala> val sortRDD = keyValueRDD.sortByKey(true,1)
res29: Array[(String, String)] = Array((2000,30), (2000,10), (2001,9), (2008,20), (2008,40))
I want the output to be sorted by year in ascending order and, for each year, the values sorted in descending order.
Expected output:
2000,30
2000,10
2001,9
2008,40
2008,20
Can someone help me get this result?
You have to define a class that holds the year and its value. The class should extend Ordered and override the compare method. Then you use objects of this class as keys and apply a sortByKey operation.
case class TwoKeys(first: Int, second: Int) extends Ordered[TwoKeys] {
  def compare(that: TwoKeys): Int = {
    if (first == that.first) {
      that.second - second   // within the same year, larger values come first
    } else {
      first - that.first     // years in ascending order
    }
  }
}
...
val keyValueRDD = mapRDD.map(elem => (TwoKeys(elem(0).toInt, elem(1).toInt), (elem(0), elem(1))))
val sortRDD = keyValueRDD.sortByKey(true, 1)
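As an alternative to a custom Ordered key, the same ordering can be expressed directly with sortBy, ascending on the year and descending on the value by negating it (a sketch against the mapRDD from the question):
val sortedRDD = mapRDD
  .map(elem => (elem(0).toInt, elem(1).toInt))
  .sortBy({ case (year, value) => (year, -value) }, ascending = true, numPartitions = 1)
sortedRDD.collect().foreach { case (year, value) => println(s"$year,$value") }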

How to get Int values from Hbase using Hbase API getValue method

I am trying to fetch values from Hbase using the column names and below is my code:
val cf = Bytes.toBytes("cf")
val tkn_col_num = Bytes.toBytes("TKN_COL_NUM")
val tkn_col_val = Bytes.toBytes("TKN_COL_VAL")
val col_name = Bytes.toBytes("COLUMN_NAME")
val sc = new SparkContext("local", "hbase-test")
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, input_table)
conf.set(TableInputFormat.SCAN_COLUMNS, "cf:COLUMN_NAME cf:TKN_COL_NUM cf:TKN_COL_VAL")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
hBaseRDD.map{case (x,y) => (y)}.collect().foreach(println)
val colMap: Map[String, (Int, String)] = hBaseRDD.map { case (x, y) =>
  (Bytes.toString(y.getValue(cf, col_name)),
    (Bytes.toInt(y.getValue(cf, tkn_col_num)),
     Bytes.toString(y.getValue(cf, tkn_col_val))))
}.collect().toMap
colMap.foreach(println)
sc.stop()
Now Bytes.toString(y.getValue(cf, col_name)) works and I get the expected column names from the table; however, Bytes.toInt(y.getValue(cf, tkn_col_num)) gives me some seemingly random values (I guess they are offset values for the cell, but I am not sure). Below is the output I am getting:
(COL1,(-2147483639,sum))
(COL2,(-2147483636,sum))
(COL3,(-2147483645,count))
(COL4,(-2147483642,sum))
(COL5,(-2147483641,sum))
The integer values should be 1, 2, 3, 4, 5. Can anyone please guide me on how to get the true integer column data?
Thanks
