Spark-Scala Try Select Statement - apache-spark

I'm trying to incorporate a Try().getOrElse() statement in my select statement for a Spark DataFrame. The project I'm working on is going to be applied to multiple environments. However, each environment is a little different in terms of the naming of the raw data for ONLY one field. I do not want to write several different functions to handle each different field. Is there an elegant way to handle exceptions, like the one below, in a DataFrame select statement?
val dfFilter = dfRaw
.select(
Try($"some.field.nameOption1).getOrElse($"some.field.nameOption2"),
$"some.field.abc",
$"some.field.def"
)
dfFilter.show(33, false)
However, I keep getting the following error, which makes sense because the field does not exist in this environment's raw data, but I'd expect the getOrElse statement to catch that exception.
org.apache.spark.sql.AnalysisException: No such struct field nameOption1 in...
Is there a good way to handle exceptions in Scala Spark for select statements? Or will I need to code up different functions for each case?

val selectedColumns = if (dfRaw.columns.contains("some.field.nameOption1")) $"some.field.nameOption1" else $"some.field.nameOption2"
val dfFilter = dfRaw
.select(selectedColumns, ...)
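Note that dfRaw.columns only returns top-level column names, so the contains check above will not see a nested struct path such as some.field.nameOption1. A minimal sketch of an alternative (the field names are the illustrative ones from the question) is to probe the schema with Try(df(name)), which fails eagerly with an AnalysisException for a missing struct field and can therefore be caught:

import scala.util.Try
import org.apache.spark.sql.{Column, DataFrame}

// Sketch only: return the first candidate path that resolves against the schema.
def firstResolvable(df: DataFrame, candidates: String*): Column =
  candidates
    .flatMap(name => Try(df(name)).toOption) // df(name) throws for a missing struct field
    .headOption
    .getOrElse(sys.error(s"None of ${candidates.mkString(", ")} resolve"))

val dfFilter = dfRaw
  .select(
    firstResolvable(dfRaw, "some.field.nameOption1", "some.field.nameOption2").as("name"),
    $"some.field.abc",
    $"some.field.def"
  )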

So I'm revisiting this question after a year. I believe this solution is much more elegant to implement. Please let me know what others think:
import scala.util.Try // needed for the Try(...) calls below

// Generate a fake DataFrame (run in spark-shell, so toDF is available)
val df = Seq(
("1234", "A", "AAA"),
("1134", "B", "BBB"),
("2353", "C", "CCC")
).toDF("id", "name", "nameAlt")
// Extract the column names
val columns = df.columns
// Add a "new" column name that is NOT present in the above DataFrame
val columnsAdd = columns ++ Array("someNewColumn")
// Let's then "try" to select all of the columns
df.select(columnsAdd.flatMap(c => Try(df(c)).toOption): _*).show(false)
// Let's reduce the DF again...should yield the same results
val dfNew = df.select("id", "name")
dfNew.select(columnsAdd.flatMap(c => Try(dfNew(c)).toOption): _*).show(false)
// Results
columns: Array[String] = Array(id, name, nameAlt)
columnsAdd: Array[String] = Array(id, name, nameAlt, someNewColumn)
+----+----+-------+
|id |name|nameAlt|
+----+----+-------+
|1234|A |AAA |
|1134|B |BBB |
|2353|C |CCC |
+----+----+-------+
dfNew: org.apache.spark.sql.DataFrame = [id: string, name: string]
+----+----+
|id |name|
+----+----+
|1234|A |
|1134|B |
|2353|C |
+----+----+

Related

Combine multiple columns into single column in SPARK

I have flattened incoming data in the below format in my parquet file:
I want to convert it into the below format, where I un-flatten my structure:
I tried the following:
Dataset<Row> rows = df.select(col("id"), col("country_cd"),
explode(array("fullname_1", "fullname_2")).as("fullname"),
explode(array("firstname_1", "firstname_2")).as("firstname"));
But it gives the below error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Only one generator allowed per select clause but found 2: explode(array(fullname_1, fullname_2)), explode(array(firstname_1, firstname_2));
I understand it is because you cannot use more than 1 explode in a query.
I am looking for options to do the above in Spark Java.
This type of problem is most easily solved with a .flatMap(). A .flatMap() is like a .map() except that it allows you to output n records for each input record, as opposed to a 1:1 ratio.
val df = Seq(
(1, "USA", "Lee M", "Lee", "Dan A White", "Dan"),
(2, "CAN", "Pate Poland", "Pate", "Don Derheim", "Don")
).toDF("id", "country_code", "fullname_1", "firstname_1", "fullname_2", "firstname_2")
// flatMap on a DataFrame needs an encoder for the output tuples, so spark.implicits._ must be in scope
df.flatMap(row => {
val id = row.getAs[Int]("id")
val cc = row.getAs[String]("country_code")
Seq(
(id, cc, row.getAs[String]("fullname_1"), row.getAs[String]("firstname_1")),
(id, cc, row.getAs[String]("fullname_2"), row.getAs[String]("firstname_2"))
)
}).toDF("id", "country_code", "fullname", "firstname").show()
This results in the following:
+---+------------+-----------+---------+
| id|country_code|   fullname|firstname|
+---+------------+-----------+---------+
|  1|         USA|      Lee M|      Lee|
|  1|         USA|Dan A White|      Dan|
|  2|         CAN|Pate Poland|     Pate|
|  2|         CAN|Don Derheim|      Don|
+---+------------+-----------+---------+
You need to wrap the first and last names into an array of structs, which you then explode:
Dataset<Row> rows = df.select(col("id"), col("country_cd"),
explode(
array(
struct(
col("firstname_1").as("firstname"), col("fullname_1").as("fullname")),
struct(
col("firstname_2").as("firstname"), col("fullname_2").as("fullname"))
)
)
);
This way you'll get a fast narrow transformation, keep Scala/Python/R portability, and it should run quicker than the df.flatMap solution, which turns the DataFrame into an RDD that the query optimizer cannot improve. The RDD route may also put extra pressure on the Java garbage collector because of the copying from unsafe byte arrays to Java objects.
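For comparison, here is a rough Scala sketch of the same approach, reusing the toy DataFrame from the flatMap answer above; the exploded struct is aliased and then flattened back into top-level columns:

import org.apache.spark.sql.functions.{array, col, explode, struct}

val exploded = df.select(
  col("id"),
  col("country_code"),
  explode(array(
    struct(col("fullname_1").as("fullname"), col("firstname_1").as("firstname")),
    struct(col("fullname_2").as("fullname"), col("firstname_2").as("firstname"))
  )).as("name")
)
// Flatten the aliased struct back into ordinary columns
exploded.select(col("id"), col("country_code"), col("name.fullname"), col("name.firstname")).show()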
As a database person, I like to use set-based operations for things like this, e.g. a union:
val df = Seq(
("1", "USA", "Lee M", "Lee", "Dan A White", "Dan"),
("2", "CAN", "Pate Poland", "Pate", "Don Derheim", "Don")
).toDF("id", "country_code", "fullname_1", "firstname_1", "fullname_2", "firstname_2")
val df_new = df
.select("id", "country_code", "fullname_1", "firstname_1").union(df.select("id", "country_code", "fullname_2", "firstname_2"))
.orderBy("id")
df_new.show
df.createOrReplaceTempView("tmp")
Or the equivalent SQL:
%sql
SELECT id, country_code, fullname_1 AS fullname, firstname_1 AS firstname
FROM tmp
UNION
SELECT id, country_code, fullname_2, firstname_2
FROM tmp
My results:
I suppose one advantage over the flatMap technique is that you don't have to specify the datatypes, and it appears simpler on the face of it. It's up to you, of course.

Spark aggregate rows with custom function

To make it simple, let's assume we have a dataframe containing the following data:
+----------+---------+----------+----------+
|firstName |lastName |Phone |Address |
+----------+---------+----------+----------+
|firstName1|lastName1|info1 |info2 |
|firstName1|lastName1|myInfo1 |dummyInfo2|
|firstName1|lastName1|dummyInfo1|myInfo2 |
+----------+---------+----------+----------+
How can I merge all rows, grouping by (firstName, lastName), and keep in the Phone and Address columns only the data starting with "my", to get the following:
+----------+---------+----------+----------+
|firstName |lastName |Phone |Address |
+----------+---------+----------+----------+
|firstName1|lastName1|myInfo1 |myInfo2 |
+----------+---------+----------+----------+
Should I maybe use the agg function with a custom UDAF? But how would I implement it?
Note: I'm using Spark 2.2 along with Scala 2.11.
You can use groupBy with the collect_set aggregation function and a udf function to pick the first string that starts with "my":
import org.apache.spark.sql.functions._
// pick the first collected value that starts with "my" (assumes each group has at least one such value)
def myudf = udf((array: Seq[String]) => array.filter(_.startsWith("my")).head)
df.groupBy("firstName", "lastName")
.agg(myudf(collect_set("Phone")).as("Phone"), myudf(collect_set("Address")).as("Address"))
.show(false)
which should give you
+----------+---------+-------+-------+
|firstName |lastName |Phone |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
I hope the answer is helpful
If only two columns are involved, filtering and a join can be used instead of a UDF:
val df = List(
("firstName1", "lastName1", "info1", "info2"),
("firstName1", "lastName1", "myInfo1", "dummyInfo2"),
("firstName1", "lastName1", "dummyInfo1", "myInfo2")
).toDF("firstName", "lastName", "Phone", "Address")
val myPhonesDF = df.filter($"Phone".startsWith("my"))
val myAddressDF = df.filter($"Address".startsWith("my"))
val result = myPhonesDF.alias("Phones").join(myAddressDF.alias("Addresses"), Seq("firstName", "lastName"))
.select("firstName", "lastName", "Phones.Phone", "Addresses.Address")
result.show(false)
Output:
+----------+---------+-------+-------+
|firstName |lastName |Phone |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
For many columns, when only one row is expected, a construction like this can be used:
val columnsForSearch = List("Phone", "Address")
val minExpressions = columnsForSearch.map(c => min(when(col(c).startsWith("my"), col(c)).otherwise(null)).alias(c))
df.groupBy("firstName", "lastName").agg(minExpressions.head, minExpressions.tail: _*)
Output is the same.
UDF with two parameters example:
val twoParamFunc = (firstName: String, Phone: String) => firstName + ": " + Phone
val twoParamUDF = udf(twoParamFunc)
df.select(twoParamUDF($"firstName", $"Phone")).show(false)

Get examples for rows that are removed by a filter from a spark dataframe

Suppose I have a spark dataframe df with some columns (id,...) and a string sqlFilter with a SQL filter, e.g. "id is not null".
I want to filter the dataframe df based on sqlFilter, i.e.
val filtered = df.filter(sqlFilter)
Now, I want to have a list of 10 ids from df that were removed by the filter.
Currently, I'm using a "leftanti" join to achieve this, i.e.
val examples = df.select("id").join(filtered.select("id"), Seq("id"), "leftanti")
.take(10)
.map(row => Option(row.get(0)) match { case None => "null" case Some(x) => x.toString})
However, this is really slow.
My guess is that this can be implemented faster, because Spark only has to keep a list per partition and add an id to that list whenever the filter removes a row and the list holds fewer than 10 elements. Once the action after the filter finishes, Spark has to collect the lists from the partitions until it has 10 ids.
I wanted to use accumulators as described here,
but I failed because I could not find out how to parse and use sqlFilter.
Does anybody have an idea how I can improve the performance?
Update
Ramesh Maharjan suggested in the comments to invert the SQL query, i.e.
df.filter(s"NOT ($filterString)")
.select(key)
.take(10)
.map(row => Option(row.get(0)) match { case None => "null" case Some(x) => x.toString})
This indeed improves the performance, but it is not 100% equivalent.
If there are multiple rows with the same id, the id will end up in the examples if at least one of its rows is removed by the filter. With the leftanti join it does not end up in the examples, because the id is still present in filtered.
However, that is fine with me.
I'm still interested if it is possible to create the list "on the fly" with accumulators or something similar.
Update 2
Another issue with inverting the filter is the logical value UNKNOWN in SQL: NOT UNKNOWN = UNKNOWN, i.e. NOT(null <> 1) <=> UNKNOWN, and hence such a row shows up neither in the filtered dataframe nor in the inverted one.
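A quick sketch to illustrate that gap (toy data, assuming spark.implicits._ is in scope): a row with a null id survives neither the filter nor its negation:

val toy = Seq((Option(1), "a"), (Option(2), "b"), (Option.empty[Int], "c")).toDF("id", "name")
toy.filter("id <> 1").show()       // keeps only id = 2; for the null row the predicate is UNKNOWN, so it is dropped
toy.filter("NOT (id <> 1)").show() // keeps only id = 1; the null row is dropped here as well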
You can use a custom accumulator (longAccumulator won't help you, as all the ids will be null), and you must formulate your filter statement as a function:
Suppose you have a dataframe :
+----+--------+
| id| name|
+----+--------+
| 1|record 1|
|null|record 2|
| 3|record 3|
+----+--------+
Then you could do :
import org.apache.spark.sql.Row
import org.apache.spark.util.AccumulatorV2

// Accumulates the rows that the filter function rejects
class RowAccumulator(var value: Seq[Row]) extends AccumulatorV2[Row, Seq[Row]] {
def this() = this(Seq.empty[Row])
override def isZero: Boolean = value.isEmpty
override def copy(): AccumulatorV2[Row, Seq[Row]] = new RowAccumulator(value)
override def reset(): Unit = value = Seq.empty[Row]
override def add(v: Row): Unit = value = value :+ v
override def merge(other: AccumulatorV2[Row, Seq[Row]]): Unit = value = value ++ other.value
}
val filteredAccum = new RowAccumulator()
ss.sparkContext.register(filteredAccum, "Filter Accum")
val filterIdIsNotNull = (r:Row) => {
if(r.isNullAt(r.fieldIndex("id"))) {
filteredAccum.add(r)
false
} else {
true
}}
df
.filter(filterIdIsNotNull)
.show()
println(filteredAccum.value)
gives
+---+--------+
| id| name|
+---+--------+
| 1|record 1|
| 3|record 3|
+---+--------+
List([null,record 2])
But personally I would not do this; I would rather do something like you've already suggested:
val dfWithFilter = df
.withColumn("keep",expr("id is not null"))
.cache() // check whether caching is feasible
// show 10 records which we do not keep
dfWithFilter.filter(!$"keep").drop($"keep").show(10) // or use take(10)
+----+--------+
| id| name|
+----+--------+
|null|record 2|
+----+--------+
// rows that we keep
val filteredDf = dfWithFilter.filter($"keep").drop($"keep")

How to create a dataframe from a string key=value delimited by ";"

I have a Hive table with the structure:
I need to read the string field, break out the keys, and turn them into Hive table columns; the final table should look like this:
Very important: the number of keys in the string is dynamic, and the names of the keys are also dynamic.
An attempt would be to read the string with Spark SQL, create a dataframe with a schema based on all the strings, and use the saveAsTable() function to write the dataframe to the final Hive table, but I do not know how to do this.
Any suggestions?
A naive solution (assuming unique (code, date) combinations and no embedded = and ; in the string) can look like this:
import org.apache.spark.sql.functions.{explode, first, split, udf}
val df = Seq(
(1, 1, "key1=value11;key2=value12;key3=value13;key4=value14"),
(1, 2, "key1=value21;key2=value22;key3=value23;key4=value24"),
(2, 4, "key3=value33;key4=value34;key5=value35")
).toDF("code", "date", "string")
val bits = split($"string", ";")
val kv = split($"pair", "=")
df
.withColumn("bits", bits) // Split column by `;`
.withColumn("pair", explode($"bits")) // Explode into multiple rows
.withColumn("key", kv(0)) // Extract key
.withColumn("val", kv(1)) // Extract value
// Pivot to wide format
.groupBy("code", "date")
.pivot("key")
.agg(first("val"))
// +----+----+-------+-------+-------+-------+-------+
// |code|date| key1| key2| key3| key4| key5|
// +----+----+-------+-------+-------+-------+-------+
// | 1| 2|value21|value22|value23|value24| null|
// | 1| 1|value11|value12|value13|value14| null|
// | 2| 4| null| null|value33|value34|value35|
// +----+----+-------+-------+-------+-------+-------+
This can be easily adjusted to handle the case when (code, date) is not unique, and you can process more complex string patterns using a UDF.
Depending on the language you use and the number of columns, you may be better off using an RDD or a Dataset. It is also worth considering dropping the full explode / pivot in favor of a UDF.
val parse = udf((text: String) => text.split(";").map(_.split("=")).collect {
case Array(k, v) => (k, v)
}.toMap)
val keysUdf = udf((pairs: Map[String, String]) => pairs.keys.toList)
// Parse strings to Map[String, String]
val withKVs = df.withColumn("kvs", parse($"string"))
val keys = withKVs
.select(explode(keysUdf($"kvs"))).distinct // Get unique keys
.as[String]
.collect.sorted.toList // Collect and sort
// Build a list of expressions for subsequent select
val exprs = keys.map(key => $"kvs".getItem(key).alias(key))
withKVs.select($"code" :: $"date" :: exprs: _*)
In Spark 1.5 you can try:
val keys = withKVs.select($"kvs").rdd
.flatMap(_.getAs[Map[String, String]]("kvs").keys)
.distinct
.collect.sorted.toList

Accessing column names with periods - Spark SQL 1.3

I have a DataFrame with fields which contain a period. When I attempt to use select() on them, Spark cannot resolve them, likely because '.' is used for accessing nested fields.
Here's the error:
enrichData.select("google.com")
org.apache.spark.sql.AnalysisException: cannot resolve 'google.com' given input columns google.com, yahoo.com, ....
Is there a way to access these columns? Or an easy way to change the column names (as I can't select them, how can I change the names?).
Having a period in a column name makes Spark treat it as a nested field (a field within a field). To counter that, you need to use a backtick "`". This should work:
scala> val df = Seq(("yr", 2000), ("pr", 12341234)).toDF("x.y", "e")
df: org.apache.spark.sql.DataFrame = [x.y: string, e: int]
scala> df.select("`x.y`").show
+---+
|x.y|
+---+
| yr|
| pr|
+---+
In short, you need to put backticks (`) around the column name.
You can drop the schema and recreate it without the periods like this:
val newEnrichData = sqlContext.createDataFrame(
enrichData.rdd,
StructType(enrichData.schema.fields.map(sf =>
StructField(sf.name.replace(".", ""), sf.dataType, sf.nullable)
))
)
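Another option (just a sketch, assuming a flat schema; the underscore replacement is an arbitrary choice here) is to rename every column in one pass with toDF:

// toDF takes the complete list of new column names
val renamed = enrichData.toDF(enrichData.columns.map(_.replace(".", "_")): _*)
renamed.select("google_com").show()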
