Spark spelling correction via UDF

I need to correct some spellings using spark.
Unfortunately a naive approach like
val misspellings3 = misspellings1
  .withColumn("A", when('A === "error1", "replacement1").otherwise('A))
  .withColumn("A", when('A === "error2", "replacement2").otherwise('A))
  .withColumn("B", when(('B === "conditionC") and ('D === "conditionD"), "replacementC").otherwise('B))
does not work with Spark (see How to add new columns based on conditions without facing a JaninoRuntimeException or OutOfMemoryError).
The simple cases (the first 2 examples) can nicely be handled via
val spellingMistakes = Map(
  "error1" -> "fix1"
)
val spellingNameCorrection: (String => String) = (t: String) => {
  spellingMistakes.get(t) match {
    case Some(tt) => tt // correct spelling
    case None => t // keep original
  }
}
val spellingUDF = udf(spellingNameCorrection)
val misspellings1 = hiddenSeasonalities
  .withColumn("A", spellingUDF('A))
But I am unsure how to handle the more complex / chained conditional replacements in a UDF in a nice and generalizable manner.
If it is only a rather small list of spellings (< 50), would you suggest hard-coding them within a UDF?

You can make the UDF receive more than one column:
val spellingCorrection2 = udf((x: String, y: String) => if (x == "conditionC" && y == "conditionD") "replacementC" else x)
val misspellings3 = misspellings1.withColumn("B", spellingCorrection2($"B", $"C"))
To make this more general you can use a map from a tuple of the two condition values to the replacement string, the same as you did for the first case.
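For example, a rough sketch of that generalization (the map contents and column names here are just illustrative, following the example above):
import org.apache.spark.sql.functions.udf

// Hypothetical map from the pair (value of B, value of C) to the corrected value of B
val conditionalFixes: Map[(String, String), String] = Map(
  ("conditionC", "conditionD") -> "replacementC"
)

val conditionalUDF = udf((b: String, c: String) =>
  // fall back to the original value of B when the pair is not in the map
  conditionalFixes.getOrElse((b, c), b)
)

val misspellings3 = misspellings1.withColumn("B", conditionalUDF($"B", $"C"))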
If you want to generalize it even more, you can use Dataset mapping. Basically, create a case class with the relevant columns, then use as to convert the DataFrame to a Dataset of that case class. Use the Dataset's map with pattern matching on the input data to generate the relevant corrections, and convert back to a DataFrame.
This is easier to write but comes with a performance cost.
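A rough sketch of that Dataset approach, assuming the relevant columns are strings named A, B and C (adjust the case class to your actual schema):
import spark.implicits._

// Hypothetical case class mirroring the relevant columns
case class Record(A: String, B: String, C: String)

val corrected = misspellings1.as[Record]
  .map {
    case Record("error1", b, c)                => Record("replacement1", b, c)           // simple spelling fix on A
    case Record(a, "conditionC", "conditionD") => Record(a, "replacementC", "conditionD") // conditional fix on B
    case other                                 => other                                  // leave everything else untouched
  }
  .toDF()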

For now I will go with the following which seems to work just fine and is more understandable: https://gist.github.com/rchukh/84ac39310b384abedb89c299b24b9306
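For readers without access to the gist, one common pattern in the same spirit (not necessarily what the gist does) is to fold a Map of corrections into a single when/otherwise chain:
import org.apache.spark.sql.functions.{col, when}

val corrections = Map("error1" -> "fix1", "error2" -> "fix2")

// Build nested when(...).otherwise(...) expressions from the map, then overwrite column A
val correctedA = corrections.foldLeft(col("A")) {
  case (expr, (wrong, right)) => when(col("A") === wrong, right).otherwise(expr)
}
val fixed = misspellings1.withColumn("A", correctedA)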

Assume spellingMap is the map containing the correct spellings and df is the DataFrame:
val df: DataFrame = _
val spellingMap = Map.empty[String, String] //fill it up yourself
val columnsWithSpellingMistakes = List("abc", "def")
Write a UDF like this:
def spellingCorrectionUDF(spellingMap: Map[String, String]) =
  udf((value: String) => {
    if (spellingMap.contains(value)) spellingMap(value)
    else value
  })
And finally, you can apply it to every affected column like this:
val newColumns = df.columns.map { columnName =>
  if (columnsWithSpellingMistakes.contains(columnName)) spellingCorrectionUDF(spellingMap)(col(columnName)).as(columnName)
  else col(columnName)
}
df.select(newColumns: _*)

Related

how to efficiently parse dataframe object into a map of key-value pairs

I'm working with a DataFrame with the columns basketID and itemID. Is there a way to efficiently parse through the dataset and generate a map where the keys are basketID and the value is the set of all itemIDs contained within each basket?
My current implementation uses a for loop over the data frame, which isn't very scalable. Is it possible to do this more efficiently? Any help would be appreciated, thanks!
[screenshot of sample data]
The goal is to obtain basket = Map("b1" -> Set("i1", "i2", "i3"), "b2" -> Set("i2", "i4"), "b3" -> Set("i3", "i5"), "b4" -> Set("i6")). Here's the implementation I have using a for loop:
// create empty container
val basket = scala.collection.mutable.Map[String, Set[String]]()
// loop over all numerical indexes for baskets (b<i>)
for (i <- 1 to 4) {
  basket("b" + i.toString) = Set()
}
// loop over every row in df and store the items to the set
df.collect().foreach(row =>
  basket(row(0).toString) += row(1).toString
)
You can simply do an aggregateByKey operation, and then collectAsMap will directly give you the desired result. It is much more efficient than a simple groupBy.
import scala.collection.mutable
case class Items(basketID: String, itemID: String)
import spark.implicits._

val result = df.as[Items].rdd.map(x => (x.basketID, x.itemID))
  .aggregateByKey[mutable.Buffer[String]](new mutable.ArrayBuffer[String]())(
    (l: mutable.Buffer[String], p: String) => l += p,
    (l1: mutable.Buffer[String], l2: mutable.Buffer[String]) => (l1 ++ l2).distinct)
  .collectAsMap()
You can check other aggregation APIs like reduceByKey and groupByKey as well.
Please also check the differences between aggregateByKey, groupByKey and reduceByKey.
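For comparison, a sketch of the reduceByKey version of the same idea (reusing the Items case class above; each item is mapped to a one-element Set first so the sets can be merged per basket):
val resultViaReduce = df.as[Items].rdd
  .map(x => (x.basketID, Set(x.itemID)))  // one-element Set per item
  .reduceByKey(_ ++ _)                    // merge the per-basket sets, partially on the map side
  .collectAsMap()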
This is efficient assuming your dataset is small enough to fit into the driver's memory. .collect will give you an array of rows over which you can iterate, which is fine. If you want scalability, then instead of Map[String, Set[String]] (which will reside in driver memory) you can use PairRDD[String, Set[String]] (which will be distributed).
// NOT TESTED
// Assuming df is a dataframe with 2 columns: the first is your basketID and the second is itemID
df.rdd.map(row => (row.getAs[String](0), row.getAs[String](1))).groupByKey().mapValues(x => x.toSet)

opposite of spark dataframe `withColumn` method?

I'd like to be able to chain a transformation on my DataFrame that drops a column, rather than assigning the DataFrame to a variable (i.e. df.drop()). If I wanted to add a column, I could simply call df.withColumn(). What is the way to drop a column in an in-line chain of transformations?
For the entire example use this as baseline:
val testVariable = 10
var finalDF = spark.sql("select 'test' as test_column")
val iDF = spark.sql("select 'John Smith' as Name, cast('10' as integer) as Age, 'Illinois' as State")
val iDF2 = spark.sql("select 'Jane Doe' as Name, cast('40' as integer) as Age, 'Iowa' as State")
val iDF3 = spark.sql("select 'Blobby' as Name, cast('150' as integer) as Age, 'Non-US' as State")
val nameDF = iDF.unionAll(iDF2).unionAll(iDF3)
1. Conditional Drop
If you only want to drop on certain outputs and these are known outputs, you can build conditional logic to check whether the column needs to be dropped or not. In this case, if the test variable is 5 or greater it will drop the Name column, else it adds a new column.
finalDF = if (testVariable >= 5) {
  nameDF.drop("Name")
} else {
  nameDF.withColumn("Cooler_Name", lit("Cool_Name"))
}
finalDF.printSchema
2. Programmatically build the select statement
The select expression takes in independent strings and builds them into commands that Spark can read. In the case below we know we have a test for dropping, but we do not know which columns might be dropped. If a column gets a test value that does not equal 1, we do not include it in our command array. When we run the command array through selectExpr on the table, those columns are dropped.
val columnNames = nameDF.columns
val arrayTestOutput = Array(1, 0, 1)
var iteratorArray = 1
var commandArray = Array.empty[String]
while (iteratorArray <= columnNames.length) {
  if (arrayTestOutput(iteratorArray - 1) == 1) {
    commandArray = commandArray :+ columnNames(iteratorArray - 1)
  }
  iteratorArray = iteratorArray + 1
}
finalDF = nameDF.selectExpr(commandArray: _*)
finalDF.printSchema
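A more compact variant of the same idea, building the column list with zip and collect instead of a while loop (a sketch, assuming arrayTestOutput lines up with nameDF.columns):
val keptColumns = nameDF.columns.zip(arrayTestOutput)
  .collect { case (name, keep) if keep == 1 => name }  // keep only the columns flagged with 1
finalDF = nameDF.selectExpr(keptColumns: _*)
finalDF.printSchema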

How to Flatten spark dataframe Row to multiple Dataframe Rows

Hi, I have a Spark data frame which prints like this (single row):
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),1487530800317]
So inside a row I have wrapped arrays. I want to flatten them and create a dataframe which has a single value for each array element; for example, the above row should transform into something like this:
[abc,11918,46734,1487530800317]
[abc,1233,1234,1487530800317]
So I get a dataframe with 2 rows instead of 1, and each corresponding element from the wrapped arrays goes into a new row.
Edit 1 after 1st answer:
What if I have 3 arrays in my input?
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),WrappedArray(1,2),1487530800317]
My output should be:
[abc,11918,46734,1,1487530800317]
[abc,1233,1234,2,1487530800317]
Definitely not the best solution, but this would work:
case class TestFormat(a: String, b: Seq[String], c: Seq[String], d: String)

val data = Seq(TestFormat("abc", Seq("11918", "1233"),
  Seq("46734", "1234"), "1487530800317")).toDS

val zipThem: (Seq[String], Seq[String]) => Seq[(String, String)] = _.zip(_)
val udfZip = udf(zipThem)

data.select($"a", explode(udfZip($"b", $"c")) as "tmp", $"d")
  .select($"a", $"tmp._1" as "b", $"tmp._2" as "c", $"d")
  .show
The problem is that by default you cannot be sure that both Sequences are of equal length.
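If the lengths can differ, one defensive option (a sketch) is zipAll with a padding value instead of zip:
// Pads the shorter sequence with empty strings rather than silently dropping elements
val zipThemSafely: (Seq[String], Seq[String]) => Seq[(String, String)] =
  (b, c) => b.zipAll(c, "", "")
val udfZipSafe = udf(zipThemSafely)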
Probably the better solution would be to reformat the whole data frame into a structure that models the data, e.g.
root
 |-- a
 |-- d
 |-- records
 |    |-- b
 |    |-- c
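A rough sketch of one way to build such a structure from the example Dataset above (reusing data and udfZip; not necessarily how you would model it in production):
import org.apache.spark.sql.functions.{collect_list, explode, struct}

val nested = data
  .select($"a", $"d", explode(udfZip($"b", $"c")) as "pair")
  .groupBy($"a", $"d")
  .agg(collect_list(struct($"pair._1" as "b", $"pair._2" as "c")) as "records")

nested.printSchema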
Thanks for answering, @swebbo, your answer helped me get this done:
I did this:
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._
val zipColumns = udf((x: Seq[Long], y: Seq[Long], z: Seq[Long]) => (x.zip(y).zip(z)) map {
case ((a,b),c) => (a,b,c)
})
val flattened = subDf.withColumn("columns", explode(zipColumns($"col3", $"col4", $"col5"))).select(
$"col1", $"col2",
$"columns._1".alias("col3"), $"columns._2".alias("col4"), $"columns._3".alias("col5"))
flattened.show
Hope that is understandable :)

Spark - How to handle error case in RDD.map() method correctly?

I am trying to do some text processing using Spark RDD.
The format of the input file is:
2015-05-20T18:30 <some_url>/?<key1>=<value1>&<key2>=<value2>&...&<keyn>=<valuen>
I want to extract some fields from the text and convert them into CSV format like:
<value1>,<value5>,<valuek>,<valuen>
The following code is how I do this:
val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val records = lines.map { line =>
val mp = line.split("&")
.map(_.split("="))
.filter(_.length >= 2)
.map(t => (t(0), t(1))).toMap
(mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
}
What I would like to know is: if some line of the input text is in the wrong format or invalid, the map() function cannot return a valid value. This should be very common in text processing; what is the best practice to deal with this problem?
In order to manage these errors you can use Scala's Try class within a flatMap operation, in code:
import scala.util.{Try, Success}

val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val records = lines.flatMap (line =>
  Try {
    val mp = line.split("&")
      .map(_.split("="))
      .filter(_.length >= 2)
      .map(t => (t(0), t(1))).toMap
    (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
  } match {
    case Success(map) => Seq(map)
    case _ => Seq()
  })
With this you have only the "good ones", but if you want both (the errors and the good ones) I would recommend using a map function that returns a Scala Either and then using a Spark filter, in code:
import scala.util.{Try, Success, Failure}

val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val goodBadRecords = lines.map (line =>
  Try {
    val mp = line.split("&")
      .map(_.split("="))
      .filter(_.length >= 2)
      .map(t => (t(0), t(1))).toMap
    (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
  } match {
    case Success(map) => Right(map)
    case Failure(e) => Left(e)
  })
val records = goodBadRecords.filter(_.isRight)
val errors = goodBadRecords.filter(_.isLeft)
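If you then need the plain values rather than the Either wrappers, one option (a sketch) is to unwrap them with the partial-function overload of collect:
// Unwrap the Either now that each RDD contains only one side
val goodValues  = goodBadRecords.collect { case Right(values) => values }
val errorValues = goodBadRecords.collect { case Left(e) => e }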
I hope this will be useful

Save Spark org.apache.spark.mllib.linalg.Matrix to a file

The result of correlation in Spark MLlib is of type org.apache.spark.mllib.linalg.Matrix. (See http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations)
val data: RDD[Vector] = ...
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
I would like to save the result into a file. How can I do this?
Here is a simple and effective approach to save the Matrix to HDFS and specify the separator.
(The transpose is used since .toArray is in column-major format.)
val localMatrix: List[Array[Double]] = correlMatrix
  .transpose // Transpose since .toArray is column major
  .toArray
  .grouped(correlMatrix.numCols)
  .toList

val lines: List[String] = localMatrix
  .map(line => line.mkString(" "))

sc.parallelize(lines)
  .repartition(1)
  .saveAsTextFile("hdfs:///home/user/spark/correlMatrix.txt")
As Matrix is Serializable, you can write it using normal Scala.
You can find an example here.
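For completeness, a minimal local-filesystem sketch using plain Scala/Java I/O (path and separator are assumptions; this only makes sense when the matrix fits on the driver):
import java.io.PrintWriter

val writer = new PrintWriter("/tmp/correlMatrix.txt")  // hypothetical local path
try {
  correlMatrix
    .transpose                        // .toArray is column major, so transpose first
    .toArray
    .grouped(correlMatrix.numCols)
    .foreach(row => writer.println(row.mkString(",")))
} finally {
  writer.close()
}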
The answer by Dylan Hogg was great; to enhance it slightly, add a line index. (In my use case, once I created a file and downloaded it, it was not sorted due to the nature of the parallel processing, etc.)
Ref: https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch10s12.html
Substitute with this line and it will put a sequence number on each line (starting with 0), making it easier to sort when you go to view it:
val lines: List[String] = localMatrix
  .map(line => line.mkString(" "))
  .zipWithIndex.map { case (line, count) => s"$count $line" }
Thank you for your suggestion. I came up with this solution. Thanks to Ignacio for his suggestions.
val vtsd = sd.map(x => Vectors.dense(x.toArray))
val corrMat = Statistics.corr(vtsd)
val arrayCor = corrMat.toArray.toList
val colLen = columnHeader.size
val toArr2 = sc.parallelize(arrayCor).zipWithIndex().map(
  x => {
    if ((x._2 + 1) % colLen == 0) {
      (x._2, arrayCor.slice(x._2.toInt + 1 - colLen, x._2.toInt + 1).mkString(";"))
    } else {
      (x._2, "")
    }
  }).filter(_._2.nonEmpty).sortBy(x => x._1, true, 1).map(x => x._2)
toArr2.coalesce(1, true).saveAsTextFile("/home/user/spark/cor_" + System.currentTimeMillis())
