I have a string like the one below:
str = "Test(a=10, b=100, c=1.0, d=2.0)"
and the Test entity is:
data class Test(
    val a: Int = 0,
    val b: Int = 0,
    val c: Double = 0.0,
    val d: Double = 0.0
)
What should I do to convert the string str to a Test entity?
Thank you!
Regexes seem like an appropriate choice here:
data class Test(val a: Int = 0, val b: Int = 0, val c: Double = 0.0, val d: Double = 0.0)
fun main() {
    val str = "Test(a=10, b=100, c=1.0, d=2.0)"
    print(getTest(str))
}

fun getTest(str: String): Test {
    val regex = """Test\(a=(.+), b=(.+), c=(.+), d=(.+)\)""".toRegex()
    val matches = regex.find(str)
    return matches?.groupValues?.let { groups ->
        Test(groups[1].toInt(), groups[2].toInt(), groups[3].toDouble(), groups[4].toDouble())
    } ?: Test()
}
If you're looking at storing objects as strings to re-instantiate them later, consider serialization instead.
This works:
data class Test(
    val a: Int = 0,
    val b: Int = 0,
    val c: Double = 0.0,
    val d: Double = 0.0
)

fun main() {
    val str = "Test(a=10, b=100, c=1.0, d=2.0)"
    val numbers = "([\\d.]+)".toRegex().findAll(str).map { it.value }.toList()
    val test = Test(
        numbers[0].toInt(),
        numbers[1].toInt(),
        numbers[2].toDouble(),
        numbers[3].toDouble())
}
This question is a follow-up to How to store custom objects in Dataset?
Spark version: 3.0.1
Creating a Dataset of a non-nested custom type works:
import spark.implicits._
import org.apache.spark.sql.{Encoder, Encoders}
class AnObj(val a: Int, val b: String)
implicit val myEncoder: Encoder[AnObj] = Encoders.kryo[AnObj]
val d = spark.createDataset(Seq(new AnObj(1, "a")))
d.printSchema
root
|-- value: binary (nullable = true)
However, if the custom type is nested inside a product type (i.e. a case class), it gives an error:
java.lang.UnsupportedOperationException: No Encoder found for InnerObj
import spark.implicits._
import org.apache.spark.sql.{Encoder, Encoders}
class InnerObj(val a: Int, val b: String)
case class MyObj(val i: Int, val j: InnerObj)
implicit val myEncoder: Encoder[InnerObj] = Encoders.kryo[InnerObj]
// error
val d = spark.createDataset(Seq(new MyObj(1, new InnerObj(0, "a"))))
// it gives Runtime error: java.lang.UnsupportedOperationException: No Encoder found for InnerObj
How can we create a Dataset with a nested custom type?
Adding the encoders for both MyObj and InnerObj should make it work.
class InnerObj(val a:Int, val b: String)
case class MyObj(val i: Int, j: InnerObj)
implicit val myEncoder: Encoder[InnerObj] = Encoders.kryo[InnerObj]
implicit val objEncoder: Encoder[MyObj] = Encoders.kryo[MyObj]
The above snippet compiles and runs fine.
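For completeness, a minimal usage sketch, assuming a SparkSession named spark is in scope and the declarations above have been evaluated:
// With both Kryo encoders in scope, the Dataset from the question can be created:
val d = spark.createDataset(Seq(new MyObj(1, new InnerObj(0, "a"))))
// As in the non-nested example above, the whole object is stored in a single binary column:
d.printSchema
// root
//  |-- value: binary (nullable = true)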
Another solution apart from sujesh's:
import spark.implicits._
import org.apache.spark.sql.{Encoder, Encoders}
class InnerObj(val a: Int, val b: String)
case class MyObj[T](val i: Int, val j: T)
implicit val myEncoder: Encoder[MyObj[InnerObj]] = Encoders.kryo[MyObj[InnerObj]]
// works
val d = spark.createDataset(Seq(new MyObj(1, new InnerObj(0, "a"))))
This also shows a difference between the case where the inner type can be deduced from the type parameter, and the case where it cannot be deduced.
The former case can be handled with:
implicit val myEncoder: Encoder[MyObj[InnerObj]] = Encoders.kryo[MyObj[InnerObj]]
The latter case should be handled with:
implicit val myEncoder1: Encoder[InnerObj] = Encoders.kryo[InnerObj]
implicit val myEncoder2: Encoder[MyObj] = Encoders.kryo[MyObj]
I'd like to pass an Array as the input schema of a UDAF.
The example I give is pretty simple: it just sums two vectors. My actual use case is more complex, and I need to use a UDAF.
import sc.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions._
val df = Seq(
(1, Array(10.2, 12.3, 11.2)),
(1, Array(11.2, 12.6, 10.8)),
(2, Array(12.1, 11.2, 10.1)),
(2, Array(10.1, 16.0, 9.3))
).toDF("siteId", "bidRevenue")
class BidAggregatorBySiteId() extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(Array(StructField("bidRevenue", ArrayType(DoubleType))))

  def bufferSchema = StructType(Array(StructField("sumArray", ArrayType(DoubleType))))

  def dataType: DataType = ArrayType(DoubleType)

  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, Array(0.0, 0.0, 0.0))
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    val seqBuffer = buffer(0).asInstanceOf[IndexedSeq[Double]]
    val seqInput = input(0).asInstanceOf[IndexedSeq[Double]]
    buffer(0) = seqBuffer.zip(seqInput).map { case (x, y) => x + y }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val seqBuffer1 = buffer1(0).asInstanceOf[IndexedSeq[Double]]
    val seqBuffer2 = buffer2(0).asInstanceOf[IndexedSeq[Double]]
    buffer1(0) = seqBuffer1.zip(seqBuffer2).map { case (x, y) => x + y }
  }

  def evaluate(buffer: Row) = {
    buffer
  }
}
val fun = new BidAggregatorBySiteId()
df.select($"siteId", $"bidRevenue" cast(ArrayType(DoubleType)))
.groupBy("siteId").agg(fun($"bidRevenue"))
.show
All the transformations before the show action work fine, but show raises this error:
scala.MatchError: [WrappedArray(21.4, 24.9, 22.0)] (of class org.apache.spark.sql.execution.aggregate.InputAggregationBuffer)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:160)
The structure of my dataframe is:
root
|-- siteId: integer (nullable = false)
|-- bidRevenue: array (nullable = true)
| |-- element: double (containsNull = true)
df.dtypes = Array[(String, String)] = Array(("siteId", "IntegerType"), ("bidRevenue", "ArrayType(DoubleType,true)"))
Thanks for your valuable help.
def evaluate(buffer: Row): Any
This method is called once per group, after the group has been processed completely, to produce the final result.
Since you initialize and update only the buffer's 0th index, i.e. buffer(0), you need to return the value at index 0, because that is where the aggregated result is stored. Returning the whole buffer Row instead is what causes the MatchError above, when Spark tries to convert the Row to the declared ArrayType(DoubleType).
def evaluate(buffer: Row) = {
buffer.get(0)
}
The above modification to the evaluate() method will produce:
// +------+---------------------------------+
// |siteId|bidaggregatorbysiteid(bidRevenue)|
// +------+---------------------------------+
// | 1| [21.4, 24.9, 22.0]|
// | 2| [22.2, 27.2, 19.4]|
// +------+---------------------------------+
This code works fine in spark-shell, but it fails in the IntelliJ IDE.
This is the error message:
Error:(59, 7) value toDF is not a member of org.apache.spark.rdd.RDD[Weather]
possible cause: maybe a semicolon is missing before `value toDF'?
}.toDF()
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.sql.DataFrame
case class Weather(
date: String,
day_of_week: String,
avg_temp: Double,
max_temp: Double,
min_temp: Double,
rainfall: Double,
daylight_hours: Double,
max_depth_snowfall: Double,
total_snowfall: Double,
solar_radiation: Double,
mean_wind_speed: Double,
max_wind_speed: Double,
max_instantaneous_wind_speed: Double,
avg_humidity: Double,
avg_cloud_cover: Double)
case class Tracffic(date: String, down: Double, up: Double)
case class Predict(describe: String, avg_temp: Double, rainfall: Double, weekend: Double, total_snowfall: Double)
object weather2 {
def main(args : Array[String]): Unit = {
val conf = new SparkConf().setMaster("local").setAppName("weather2")
val sc = new SparkContext(conf)
val weatherCSVTmp = sc.textFile("D:\\shared\\weather.csv")
val weatherHeader = sc.parallelize(Array(weatherCSVTmp.first))
val weatherCSV = weatherCSVTmp.subtract(weatherHeader)
val weatherDF = weatherCSV.map(_.split(",")).map { p =>
Weather(p(0),
p(1),
p(2).trim.toDouble,
p(3).trim.toDouble,
p(4).trim.toDouble,
p(5).trim.toDouble,
p(6).trim.toDouble,
p(7).trim.toDouble,
p(8).trim.toDouble,
p(9).trim.toDouble,
p(10).trim.toDouble,
p(11).trim.toDouble,
p(12).trim.toDouble,
p(13).trim.toDouble,
p(14).trim.toDouble)
}.toDF()//error
val tracfficCSVTmp = sc.textFile("D:\\shared\\tracffic_volume.csv")
val tracfficHeader = sc.parallelize(Array(tracfficCSVTmp.first))
val tracfficCSV = tracfficCSVTmp.subtract(tracfficHeader)
val tracfficDF = tracfficCSV.map(_.split(",")).map { p =>
Tracffic(p(0),
p(1).trim.toDouble,
p(2).trim.toDouble)
}.toDF() //error
val tracfficAndWeatherDF = tracfficDF.join(weatherDF, "date")
val isWeekend = udf((t: String) =>
t match {
case x if x.contains("Sunday") => 1d
case x if x.contains("Saturday") => 1d
case _ => 0d
})
val replacedtracfficAndWeatherDF = tracfficAndWeatherDF.withColumn(
"weekend", isWeekend(tracfficAndWeatherDF("day_of_week"))
).drop("day_of_week")
val va = new VectorAssembler().setInputCols {
Array("avg_temp", "weekend", "rainfall")
}.setOutputCol("input_vec")
val scaler = new StandardScaler().setInputCol(va.getOutputCol).setOutputCol("scaled_vec")
va.explainParams
scaler.explainParams
//down predict
val lr = new LinearRegression().setMaxIter(10).setFeaturesCol(scaler.getOutputCol).setLabelCol("down")
val pipeline = new Pipeline().setStages(Array(va, scaler, lr))
val pipelineModel = pipeline.fit(replacedtracfficAndWeatherDF)
val test = sc.parallelize(Seq(
Predict("Ussally Day", 20.0, 20, 0, 0),
Predict("Weekend", 20.0, 20, 1, 0),
Predict("Cold day", 3.0, 20, 0, 20)
)).toDF //error
val predictedDataDF = pipelineModel.transform(test)
val desAndPred = predictedDataDF.select("describe", "prediction").collect()
desAndPred.foreach {
case Row(describe: String, prediction: Double) =>
println(s"($describe) -> prediction = $prediction")
    }
  }
}
What is the problem? The libraries are Spark 2.11.x. Would you help me?
Add the code below and try again:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
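As a side note, if you are on Spark 2.x or later, the usual entry point is SparkSession rather than SQLContext. A minimal sketch (an assumption on my part, since the post only shows a SparkContext):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("weather2")
  .getOrCreate()
val sc = spark.sparkContext

// the toDF()/toDS() conversions come from the session's implicits
import spark.implicits._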
Let's say I have 2 data frames.
DF1 may have values {3, 4, 5} in column A of various rows.
DF2 may have values {4, 5, 6} in column A of various rows.
I can aggregate these into a set of distinct elements using distinct_set(A), assuming all those rows fall into the same grouping.
At this point I have a set in the resulting data frame. Is there any way to aggregate that set with another set? Basically, if I have 2 data frames resulting from the first aggregation, I want to be able to aggregate their results.
While explode and collect_set could solve this, it made more sense to just write a custom aggregator to merge the sets themselves. The structure underlying them is a WrappedArray.
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

case class SetMergeUDAF() extends UserDefinedAggregateFunction {
  def deterministic: Boolean = false

  def inputSchema: StructType = StructType(StructField("input", ArrayType(LongType)) :: Nil)
  def bufferSchema: StructType = StructType(StructField("buffer", ArrayType(LongType)) :: Nil)
  def dataType: DataType = ArrayType(LongType)

  def initialize(buf: MutableAggregationBuffer): Unit = {
    buf(0) = mutable.WrappedArray.empty[LongType]
  }

  def update(buf: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      val result: mutable.WrappedArray[LongType] = mutable.WrappedArray.empty[LongType]
      val x = result ++ (buf.getAs[mutable.WrappedArray[Long]](0).toSet ++ input.getAs[mutable.WrappedArray[Long]](0).toSet).toArray[Long]
      buf(0) = x
    }
  }

  def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit = {
    val result: mutable.WrappedArray[LongType] = mutable.WrappedArray.empty[LongType]
    val x = result ++ (buf1.getAs[mutable.WrappedArray[Long]](0).toSet ++ buf2.getAs[mutable.WrappedArray[Long]](0).toSet).toArray[Long]
    buf1(0) = x
  }

  def evaluate(buf: Row): Any = buf.getAs[mutable.WrappedArray[LongType]](0)
}
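A hypothetical usage sketch follows; the DataFrame df and the column names "groupId" and "A" are illustrative assumptions, not from the original post:
import org.apache.spark.sql.functions.{col, collect_set, explode}

// Merge the per-row sets within each group using the custom aggregator
val mergeSets = SetMergeUDAF()
val merged = df.groupBy(col("groupId")).agg(mergeSets(col("A")).as("mergedSet"))

// The explode/collect_set route mentioned above would look roughly like this:
val viaCollectSet = df
  .withColumn("elem", explode(col("A")))
  .groupBy(col("groupId"))
  .agg(collect_set(col("elem")).as("mergedSet"))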
I have a CSV of both textual and numerical data. I need to convert it to feature vector data in Spark (Double values). Is there any way to do that?
I have seen some examples where each keyword is mapped to a double value and that mapping is used for the conversion. However, if there are many keywords, it is difficult to do it this way.
Is there any other way? I see that Spark provides feature extractors that convert data into feature vectors. Could someone please give an example?
48, Private, 105808, 9th, 5, Widowed, Transport-moving, Unmarried, White, Male, 0, 0, 40, United-States, >50K
42, Private, 169995, Some-college, 10, Married-civ-spouse, Prof-specialty, Husband, White, Male, 0, 0, 45, United-States, <=50K
In the end I did it this way: I iterate over each distinct value and build a map whose keys are the items, with an incrementing Double counter as the value.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

def createMap(data: RDD[String]): Map[String, Double] = {
  var mapData: Map[String, Double] = Map()
  var counter = 0.0
  data.collect().foreach { item =>
    counter = counter + 1
    mapData += (item -> counter)
  }
  mapData
}

def getLablelValue(input: String): Int = input match {
  case "<=50K" => 0
  case ">50K" => 1
}
val census = sc.textFile("/user/cloudera/census_data.txt")
val orgTypeRdd = census.map(line => line.split(", ")(1)).distinct
val gradeTypeRdd = census.map(line => line.split(", ")(3)).distinct
val marStatusRdd = census.map(line => line.split(", ")(5)).distinct
val jobTypeRdd = census.map(line => line.split(", ")(6)).distinct
val familyStatusRdd = census.map(line => line.split(", ")(7)).distinct
val raceTypeRdd = census.map(line => line.split(", ")(8)).distinct
val genderTypeRdd = census.map(line => line.split(", ")(9)).distinct
val countryRdd = census.map(line => line.split(", ")(13)).distinct
val salaryRange = census.map(line => line.split(", ")(14)).distinct
val orgTypeMap = createMap(orgTypeRdd)
val gradeTypeMap = createMap(gradeTypeRdd)
val marStatusMap = createMap(marStatusRdd)
val jobTypeMap = createMap(jobTypeRdd)
val familyStatusMap = createMap(familyStatusRdd)
val raceTypeMap = createMap(raceTypeRdd)
val genderTypeMap = createMap(genderTypeRdd)
val countryMap = createMap(countryRdd)
val salaryRangeMap = createMap(salaryRange)
val featureVector = census.map { line =>
  val fields = line.split(", ")
  LabeledPoint(
    getLablelValue(fields(14)),
    Vectors.dense(
      fields(0).toDouble, orgTypeMap(fields(1)), fields(2).toDouble, gradeTypeMap(fields(3)),
      fields(4).toDouble, marStatusMap(fields(5)), jobTypeMap(fields(6)), familyStatusMap(fields(7)),
      raceTypeMap(fields(8)), genderTypeMap(fields(9)), fields(10).toDouble, fields(11).toDouble,
      fields(12).toDouble, countryMap(fields(13)), salaryRangeMap(fields(14))))
}
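For reference, the DataFrame-based spark.ml feature transformers (StringIndexer to index categorical columns, VectorAssembler to build the feature vector) cover the same ground. A minimal sketch, assuming the CSV has been loaded into a DataFrame censusDF; the column names below are hypothetical, for illustration only:
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Hypothetical column names for the census data
val categoricalCols = Array("workclass", "education", "marital_status", "occupation")

// One StringIndexer per categorical column maps each distinct string to a numeric index
val indexers: Array[PipelineStage] = categoricalCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
}

// VectorAssembler concatenates the numeric and indexed columns into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week") ++ categoricalCols.map(_ + "_idx"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(indexers :+ assembler)
val featureDF = pipeline.fit(censusDF).transform(censusDF)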