Spark: cannot resolve input columns

I have
val colNames = data.schema.fieldNames
.filter(colName => colName.split("-")(0) == "20003" || colName == "eid")
which I then use to select a subset of a dataframe:
var medData = data.select(colNames.map(c => col(c)): _*).rdd
but I get
cannot resolve '`20003-0.0`' given input columns:
[20003-0.0, 20003-0.1, 20003-0.2, 20003-0.3];;
What is going on?

I had to include backticks like this:
var medData = data.select(colNames.map(c => col(s"`$c`")): _*).rdd
Spark needs the backticks to escape column names that contain special characters such as dashes and dots; without them, a name like 20003-0.0 is parsed as an expression rather than as a single column reference.

Related

Spark spelling correction via UDF

I need to correct some spellings using spark.
Unfortunately a naive approach like
val misspellings3 = misspellings1
.withColumn("A", when('A === "error1", "replacement1").otherwise('A))
.withColumn("A", when('A === "error1", "replacement1").otherwise('A))
.withColumn("B", when(('B === "conditionC") and ('D === condition3), "replacementC").otherwise('B))
does not work well with Spark; see How to add new columns based on conditions (without facing JaninoRuntimeException or OutOfMemoryError)?
The simple cases (the first 2 examples) can be handled nicely via
val spellingMistakes = Map(
"error1" -> "fix1"
)
val spellingNameCorrection: (String => String) = (t: String) => {
spellingMistakes.get(t) match {
case Some(tt) => tt // correct spelling
case None => t // keep original
}
}
val spellingUDF = udf(spellingNameCorrection)
val misspellings1 = hiddenSeasonalities
.withColumn("A", spellingUDF('A))
But I am unsure how to handle the more complex / chained conditional replacements in a UDF in a nice and generalizable manner.
If it is only a rather small list of spellings (fewer than 50), would you suggest hard-coding them within a UDF?
You can make the UDF receive more than one column:
val spellingCorrection2 = udf((x: String, y: String) => if (x == "conditionC" && y == "conditionD") "replacementC" else x)
val misspellings3 = misspellings1.withColumn("B", spellingCorrection2($"B", $"D"))
To make this more generalized you can use a map from a tuple of the two condition values to the replacement string, the same as you did for the first case, as sketched below.
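For example, a minimal sketch along those lines (the map contents and the name pairCorrections are illustrative; the columns B and D follow the question):
import org.apache.spark.sql.functions.udf

// Map from the pair of condition values to the replacement for column B.
val pairCorrections: Map[(String, String), String] = Map(
  ("conditionC", "conditionD") -> "replacementC"
)

val pairCorrectionUDF = udf((b: String, d: String) =>
  pairCorrections.getOrElse((b, d), b) // keep B's original value if no rule matches
)

val misspellings4 = misspellings1.withColumn("B", pairCorrectionUDF($"B", $"D"))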
If you want to generalize it even more, you can use Dataset mapping. Basically, create a case class with the relevant columns, then use as to convert the DataFrame to a Dataset of that case class. Then use the Dataset's map and, inside it, pattern match on the input data to generate the relevant corrections, and finally convert back to a DataFrame; a sketch follows below.
This should be easier to write but would have a performance cost.
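As a rough sketch of that approach (assuming the relevant columns are the string columns A, B, C and D, and that a SparkSession named spark is in scope; on older versions use the sqlContext implicits instead):
import spark.implicits._

case class Record(A: String, B: String, C: String, D: String)

val corrected = misspellings1.as[Record]
  .map {
    case r @ Record("error1", _, _, _) => r.copy(A = "replacement1")
    case r                             => r
  }
  .map {
    case r @ Record(_, "conditionC", _, "conditionD") => r.copy(B = "replacementC")
    case r                                            => r
  }
  .toDF()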
For now I will go with the following, which seems to work just fine and is more understandable: https://gist.github.com/rchukh/84ac39310b384abedb89c299b24b9306
Assume spellingMap is the map containing the correct spellings and df is the dataframe.
import org.apache.spark.sql.DataFrame

val df: DataFrame = ??? // your DataFrame
val spellingMap = Map.empty[String, String] // fill it up yourself
val columnsWithSpellingMistakes = List("abc", "def")
Write a UDF like this
import org.apache.spark.sql.functions.udf

def spellingCorrectionUDF(spellingMap: Map[String, String]) =
  udf((value: String) =>
    // keep the original value if no correction exists
    spellingMap.getOrElse(value, value)
  )
And finally, you can apply it like this:
import org.apache.spark.sql.functions.col

val newColumns = df.columns.map { columnName =>
  if (columnsWithSpellingMistakes.contains(columnName))
    spellingCorrectionUDF(spellingMap)(col(columnName)).as(columnName)
  else
    col(columnName)
}
df.select(newColumns: _*)

Filter Spark RDD based on result from filtering of a different RDD

conf = SparkConf().setAppName("my_app")
with SparkContext(conf=conf) as sc:
    sqlContext = SQLContext(sc)
    df = sqlContext.read.parquet(*s3keys)

    # this gives me distinct values as list
    rdd = df.filter(
        (1442170800000 <= df.timestamp) &
        (df.timestamp <= 1442185200000) &
        (df.lat > 40.7480) & (df.lat < 40.7513) &
        (df.lon > -73.8492) & (df.lon < -73.8438)
    ).map(lambda p: p.userid).distinct()

    # how do I apply the above list to filter another rdd?
    df2 = sqlContext.read.parquet(*s3keys_part2)

    # example:
    rdd = df2.filter(df2.col1 in (rdd values from above))
As mentioned by Matthew Graves, what you need here is a join. It means more or less something like this:
pred = ((1442170800000 <= df.timestamp) &
(df.timestamp <= 1442185200000) &
(df.lat > 40.7480) &
(df.lat < 40.7513) &
(df.lon > -73.8492) &
(df.lon < -73.8438))
users = df.filter(pred).select("userid").distinct()
users.join(df2, users.userid == df2.col1)
This is Scala code, instead of Python, but hopefully it can still serve as an example.
val x = 1 to 9
val df2 = sc.parallelize(x.map(a => (a,a*a))).toDF()
val df3 = sc.parallelize(x.map(a => (a,a*a*a))).toDF()
This gives us two dataframes, each with columns named _1 and _2, which are the first nine natural numbers and their squares/cubes.
val fil = df2.filter("_1 < 5") // Nine is too many, let's go to four.
val filJoin = fil.join(df3, fil("_1") === df3("_1"))
filJoin.collect
This gets us:
Array[org.apache.spark.sql.Row] = Array([1,1,1,1], [2,4,2,8], [3,9,3,27], [4,16,4,64])
To apply this to your problem, I would start with something like the following:
rdd2 = rdd.join(df2, rdd.userid == df2.userid, 'inner')
But notice that we need to tell it which columns to join on, which might be something other than userid for df2. I'd also recommend that, instead of map(lambda p: p.userid), you use .select('userid').distinct() so that the result is still a DataFrame.
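Putting it together, a rough Scala sketch of the whole flow (mirroring the Python above; the column names userid and col1 are taken from the question, and the usual implicits for the $-column syntax are assumed to be imported):
val users = df.filter(
    $"timestamp" >= 1442170800000L && $"timestamp" <= 1442185200000L &&
    $"lat" > 40.7480 && $"lat" < 40.7513 &&
    $"lon" > -73.8492 && $"lon" < -73.8438)
  .select("userid").distinct()

// Keep only the rows of df2 whose col1 appears among the selected user ids.
val filtered = df2.join(users, df2("col1") === users("userid"), "inner")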
You can find out more about join here.

Multiple filters per column in Spark

This might sound like a stupid question but any help would be appreciated. I am trying to apply a filter on my RDD based on a date column.
val tran1 = sc
.textFile("TranData.Dat")
.map(_.split("\t"))
.map(p => postran(
p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8),
p(9).toDouble, p(10).toDouble,p(11).toDouble))
I was able to apply a single date filter like below.
val tran = tran1.filter(x => x.PeriodDate == "2015-03-21 00:00:00.000")
How do I add more dates to this filter? Is there a way I can read the comma-separated date values into a variable and just pass that variable inside filter()?
Thanks
The following SQL:
select * from Table where age in (25, 35, 45) and country in ('Ireland', 'Italy')
can be written with the following Scala:
val allowedAges: Seq[Int] = Seq(25, 35, 45)
val allowedCountries: Seq[String] = Seq("Ireland", "Italy")
val result = table.filter(x => allowedAges.contains(x.age) && allowedCountries.contains(x.country))
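Applied to the original date filter, a small sketch (the comma-separated string dateParam is an assumed input):
val dateParam = "2015-03-21 00:00:00.000,2015-04-21 00:00:00.000"
val allowedDates: Seq[String] = dateParam.split(",").map(_.trim).toSeq

val tran = tran1.filter(x => allowedDates.contains(x.PeriodDate))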

Spark - How to handle error case in RDD.map() method correctly?

I am trying to do some text processing using Spark RDD.
The format of the input file is:
2015-05-20T18:30 <some_url>/?<key1>=<value1>&<key2>=<value2>&...&<keyn>=<valuen>
I want to extract some fields from the text and convert them into CSV format like:
<value1>,<value5>,<valuek>,<valuen>
The following code is how I do this:
val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val records = lines.map { line =>
val mp = line.split("&")
.map(_.split("="))
.filter(_.length >= 2)
.map(t => (t(0), t(1))).toMap
(mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
}
If some line of the input text has the wrong format or is otherwise invalid, the map() function cannot return a valid value. This should be very common in text processing; what is the best practice for dealing with this problem?
In order to manage these errors you can use Scala's Try class within a flatMap operation, in code:
import scala.util.{Success, Try}

val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val records = lines.flatMap { line =>
  Try {
    val mp = line.split("&")
      .map(_.split("="))
      .filter(_.length >= 2)
      .map(t => (t(0), t(1))).toMap
    (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
  } match {
    case Success(map) => Seq(map)
    case _            => Seq()
  }
}
With this you keep only the "good ones", but if you want both (the errors and the good ones) I would recommend using a map function that returns a Scala Either and then a Spark filter, in code:
import scala.util.{Failure, Success, Try}

val lines = sc.textFile(s"s3n://${MY_BUCKET}/${MY_FOLDER}/test/*.gz")
val goodBadRecords = lines.map { line =>
  Try {
    val mp = line.split("&")
      .map(_.split("="))
      .filter(_.length >= 2)
      .map(t => (t(0), t(1))).toMap
    (mp.get("key1"), mp.get("key5"), mp.get("keyk"), mp.get("keyn"))
  } match {
    case Success(map) => Right(map)
    case Failure(e)   => Left(e)
  }
}
val records = goodBadRecords.filter(_.isRight)
val errors = goodBadRecords.filter(_.isLeft)
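If you then want to work with the contained values themselves, one way (a small sketch, not part of the original answer) is to unwrap the Either using collect with a partial function:
// Keep only the successfully parsed tuples, respectively only the error messages.
val goodValues = goodBadRecords.collect { case Right(map) => map }
val errorMessages = goodBadRecords.collect { case Left(e) => e.getMessage }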
I hope this will be useful.

Save Spark org.apache.spark.mllib.linalg.Matrix to a file

The result of a correlation in Spark MLlib is of type org.apache.spark.mllib.linalg.Matrix (see http://spark.apache.org/docs/1.2.1/mllib-statistics.html#correlations).
val data: RDD[Vector] = ...
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
I would like to save the result into a file. How can I do this?
Here is a simple and effective approach to save the Matrix to HDFS and specify the separator.
(The transpose is used since .toArray is in column-major format.)
val localMatrix: List[Array[Double]] = correlMatrix
.transpose // Transpose since .toArray is column major
.toArray
.grouped(correlMatrix.numCols)
.toList
val lines: List[String] = localMatrix
.map(line => line.mkString(" "))
sc.parallelize(lines)
.repartition(1)
.saveAsTextFile("hdfs:///home/user/spark/correlMatrix.txt")
As Matrix is Serializable, you can write it using normal Scala.
You can find an example here.
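As a rough illustration (a minimal sketch, not the linked example, assuming a local output path), you can also write the matrix row by row from the driver with plain Java I/O, since it fits in driver memory anyway:
import java.io.PrintWriter

val writer = new PrintWriter("correlMatrix.txt")
try {
  for (i <- 0 until correlMatrix.numRows) {
    // Matrix(i, j) reads a single entry, so no transpose is needed here.
    val row = (0 until correlMatrix.numCols).map(j => correlMatrix(i, j))
    writer.println(row.mkString(" "))
  }
} finally {
  writer.close()
}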
The answer by Dylan Hogg was great; to enhance it slightly, add a column index. (In my use case, once I created a file and downloaded it, it was not sorted due to the nature of parallel processing, etc.)
ref: https://www.safaribooksonline.com/library/view/scala-cookbook/9781449340292/ch10s12.html
Substitute with this line and it will put a sequence number on each line (starting with 0), making it easier to sort when you go to view it:
val lines: List[String] = localMatrix
.map(line => line.mkString(" "))
.zipWithIndex.map { case(line, count) => s"$count $line" }
Thank you for your suggestion. I came up with this solution; thanks to Ignacio for his suggestions.
val vtsd = sd.map(x => Vectors.dense(x.toArray))
val corrMat = Statistics.corr(vtsd)
val arrayCor = corrMat.toArray.toList
val colLen = columnHeader.size
// Emit one ';'-separated line per group of colLen consecutive values of the
// flattened matrix, keyed by the element index so the lines can be sorted.
val toArr2 = sc.parallelize(arrayCor).zipWithIndex().map { x =>
  if ((x._2 + 1) % colLen == 0) {
    (x._2, arrayCor.slice(x._2.toInt + 1 - colLen, x._2.toInt + 1).mkString(";"))
  } else {
    (x._2, "")
  }
}.filter(_._2.nonEmpty).sortBy(x => x._1, true, 1).map(x => x._2)
toArr2.coalesce(1, true).saveAsTextFile("/home/user/spark/cor_" + System.currentTimeMillis())
