Accessing nested data in spark - apache-spark

I have a collection of nested case classes. I've got a job that generates a dataset using these case classes, and writes the output to parquet.
I was pretty annoyed to discover that I have to manually do a load of faffing around to load and convert this data back to case classes to work with it in subsequent jobs. Anyway, that's what I'm now trying to do.
My case classes are like:
case class Person(userId: String, tech: Option[Tech])
case class Tech(browsers: Seq[Browser], platforms: Seq[Platform])
case class Browser(family: String, version: Int)
So I'm loading my parquet data. I can get the tech data as a Row with:
val df = sqlContext.load("part-r-00716.gz.parquet")
val x = df.head
val tech = x.getStruct(x.fieldIndex("tech"))
But now I can't find how to actually iterate over the browsers. If I try val browsers = tech.getStruct(tech.fieldIndex("browsers")) I get an exception:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to org.apache.spark.sql.Row
How can I iterate over my nested browser data using spark 1.5.2?
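For reference, the exception indicates that an array-of-struct column comes back as a WrappedArray of Rows rather than a single Row, so iterating it looks roughly like this (a minimal sketch, assuming Spark 1.5's Row API and the simple Browser(family, version) schema above):
import org.apache.spark.sql.Row
// "browsers" is an array of structs, so it arrives as a Seq[Row], not a Row
val browsers: Seq[Row] = tech.getAs[Seq[Row]](tech.fieldIndex("browsers"))
browsers.foreach { b =>
  // getAs by name assumes the nested rows carry a schema; fall back to b.getString(0) etc. if they don't
  val family = b.getAs[String]("family")
  val version = b.getAs[Int]("version")
  println(s"$family / $version")
}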
Update
In fact, my case classes contain optional values, so Browser actually is:
case class Browser(family: String,
                   major: Option[String] = None,
                   minor: Option[String] = None,
                   patch: Option[String] = None,
                   language: String,
                   timesSeen: Long = 1,
                   firstSeenAt: Long,
                   lastSeenAt: Long)
I also have similar for Os:
case class Os(family: String,
              major: Option[String] = None,
              minor: Option[String] = None,
              patch: Option[String] = None,
              patchMinor: Option[String],
              override val timesSeen: Long = 1,
              override val firstSeenAt: Long,
              override val lastSeenAt: Long)
And so Tech is really:
case class Technographic(browsers: Seq[Browser],
                         devices: Seq[Device],
                         oss: Seq[Os])
Now, given the fact that some values are optional, I need a solution that will allow me to reconstruct my case classes correctly. The current solution doesn't support None values, so for example given the input data:
Tech(browsers=Seq(
  Browser(family=Some("IE"), major=Some(7), language=Some("en"), timesSeen=3),
  Browser(family=None, major=None, language=Some("en-us"), timesSeen=1),
  Browser(family=Some("Firefox"), major=None, language=None, timesSeen=1)
  )
)
I need it to load the data as follows:
family=IE, major=7, language=en, timesSeen=3,
family=None, major=None, language=en-us, timesSeen=1,
family=Firefox, major=None, language=None, timesSeen=1
Because the current solution doesn't support None values, each field's list in fact contains an arbitrary number of values, i.e.:
browsers.family = ["IE", "Firefox"]
browsers.major = [7]
browsers.language = ["en", "en-us"]
timesSeen = [3, 1, 1]
As you can see, there's no way of converting the final data (returned by spark) into the case classes that generated it.
How can I work around this insanity?

Some examples
// Select two columns
df.select("userId", "tech.browsers").show()
// Select the nested values only
df.select("tech.browsers").show(truncate = false)
+-------------------------+
|browsers |
+-------------------------+
|[[Firefox,4], [Chrome,2]]|
|[[Firefox,4], [Chrome,2]]|
|[[IE,25]] |
|[] |
|null |
+-------------------------+
// Extract the family (nested value)
// This way you can iterate over the persons, and get their browsers
// Family values are nested
df.select("tech.browsers.family").show()
+-----------------+
| family|
+-----------------+
|[Firefox, Chrome]|
|[Firefox, Chrome]|
| [IE]|
| []|
| null|
+-----------------+
// Normalize the family: One row for each family
// Then you can iterate over all families
// Family values are un-nested, empty values/null/None are handled by explode()
df.select(explode(col("tech.browsers.family")).alias("family")).show()
+-------+
| family|
+-------+
|Firefox|
| Chrome|
|Firefox|
| Chrome|
| IE|
+-------+
Based on the last example:
val families = df.select(explode(col("tech.browsers.family")))
.map(r => r.getString(0)).distinct().collect().toList
println(families)
gives the unique list of browsers as a "normal" local Scala list:
List(IE, Firefox, Chrome)
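If you need the full case classes back rather than individual columns, one option is to map over the rows and rebuild them by hand, wrapping every nullable piece in Option. A rough sketch (assuming the simplified Person/Tech/Browser(family, version) classes from the top of the question rather than the full Technographic schema):
import org.apache.spark.sql.Row

val people = df.map { row =>
  val userId = row.getAs[String]("userId")
  // "tech" is a nullable struct, so guard it with Option before descending into it
  val tech = Option(row.getAs[Row]("tech")).map { t =>
    val browsers = Option(t.getAs[Seq[Row]]("browsers")).getOrElse(Seq.empty).map { b =>
      Browser(b.getAs[String]("family"), b.getAs[Int]("version"))
    }
    Tech(browsers, Seq.empty) // platforms omitted here for brevity
  }
  Person(userId, tech)
}
The same Option(...) wrapping works for the optional fields of the real Browser/Os classes, since missing values come back as null from getAs.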

Related

Spark Aggregator on sorted Window never uses merge - is this reliable?

I am using org.apache.spark.sql.expressions.Aggregator to implement custom logic on a series of rows. I have noticed that the merge() function is never called when the Aggregator is applied to an ordered window with rows between unboundedPreceding and currentRow, i.e. the aggregation behavior is entirely determined by how new elements are added to the latest reduction, reduce().
If merge() is indeed never called in this case, UDAFs would be a great tool to integrate arbitrary custom logic on large partitions of ordered rows; see https://softwarerecs.stackexchange.com/questions/83666/foss-data-stack-to-perform-complex-custom-logic-on-billions-of-ordered-rows. However, I cannot find this being mentioned in the Spark documentation or the Spark issue tracker, and hence I am wondering if it is safe to use in this way - specifically for custom algorithms that don't allow for a merge()-like operation.
Below is some code specifically to test this behavior. I have locally checked the observation with a set of 300 million rows and partitioning based on three columns (each partition having a few million rows), and the observation holds up.
timestampdata.csv
category,eventTime
a,240
a,489
b,924
a,890
b,563
a,167
a,134
b,600
b,901
OrderedProcessing.scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.{UserDefinedFunction, Window}
import org.apache.spark.sql.functions.udaf

object OrderedProcessing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    val checkOrderingUdf: UserDefinedFunction = udaf[Int, OrderProcessingInfo, OrderProcessingInfo](CheckOrdering)
    val df_data = spark.read
      .options(Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true"))
      .csv("./timestampdata.csv")
    val df_checked = df_data
      .withColumn("orderProcessingInfo",
        checkOrderingUdf.apply($"eventTime").over(
          Window.partitionBy("category").orderBy("eventTime")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow)))
      .select($"category", $"eventTime",
        $"orderProcessingInfo".getItem("processedAllInOrder").alias("processedAllInOrder"),
        $"orderProcessingInfo".getItem("haveUsedReduce").alias("haveUsedReduce"),
        $"orderProcessingInfo".getItem("haveUsedMerge").alias("haveUsedMerge"))
    df_checked.groupBy("processedAllInOrder", "haveUsedReduce", "haveUsedMerge").count().show()
  }
}
OrderProcessingInfo.scala
case class OrderProcessingInfo(latestTime: Int, processedAllInOrder: Boolean, haveUsedReduce: Boolean, haveUsedMerge: Boolean)
CheckOrdering.scala
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

object CheckOrdering extends Aggregator[Int, OrderProcessingInfo, OrderProcessingInfo] {
  override def zero = OrderProcessingInfo(0, true, false, false)
  override def reduce(agg: OrderProcessingInfo, e: Int) = OrderProcessingInfo(
    latestTime = e, processedAllInOrder = agg.processedAllInOrder & (e >= agg.latestTime),
    haveUsedReduce = true, haveUsedMerge = agg.haveUsedMerge
  )
  override def merge(agg1: OrderProcessingInfo, agg2: OrderProcessingInfo) = OrderProcessingInfo(
    latestTime = agg1.latestTime.max(agg2.latestTime),
    processedAllInOrder = agg1.processedAllInOrder & agg2.processedAllInOrder & (agg2.latestTime >= agg1.latestTime),
    haveUsedReduce = agg1.haveUsedReduce | agg2.haveUsedReduce,
    haveUsedMerge = true
  )
  override def finish(agg: OrderProcessingInfo) = agg
  override def bufferEncoder: Encoder[OrderProcessingInfo] = implicitly(ExpressionEncoder[OrderProcessingInfo])
  override def outputEncoder: Encoder[OrderProcessingInfo] = implicitly(ExpressionEncoder[OrderProcessingInfo])
}
output
+-------------------+--------------+-------------+-----+
|processedAllInOrder|haveUsedReduce|haveUsedMerge|count|
+-------------------+--------------+-------------+-----+
| true| true| false| 9|
+-------------------+--------------+-------------+-----+

Use Str_to_map in bigquery

I have a function str_to_map() in Hive that I need to convert to BigQuery. As BigQuery doesn't have a MAP type, I want to find another way to get a map-like format and then extract values by their key name.
Example :
Select str_to_map('cars:0,kids:143,cats:1,lost:0,win:1,chances:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0,missed:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0',',',':')
If I call the key 'cars' I get the value '0'.
If I call the key 'chances' I should get '0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0'
It's necessary for me to have a type like the 'map' type (key-value).
Thank you 😀
Google provides some useful UDFs for BigQuery in the bigquery-utils repository.
Don't reinvent the wheel
So, here are two UDFs from that repository that answer this question.
1. get_value(k STRING, arr ANY TYPE)
Given a key and a list of key-value maps in the form [{'key': 'a', 'value': 'aaa'}], returns the SCALAR type value.
2. cw_map_parse(m string, pd string, kvd string)
Converts a string into a map (an array of key-value pairs).
With these, you can write a query like below:
SELECT get_value('kids', cw_map_parse(str, ',', ':')) kids,
get_value('chances', cw_map_parse(str, ',', ':')) chances,
FROM UNNEST(['cars:0,kids:143,cats:1,lost:0,win:1,chances:0,missed:0']) str;
+------+---------+
| kids | chances |
+------+---------+
| 143 | 0 |
+------+---------+
But because of the requirement below, the cw_map_parse implementation needs to be customized a little bit:
If I call the key 'cars' I get the value '0'. If I call the key 'chances' I should get '0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0'
Below is a query with customized UDFs; str_to_map() is a customized version of cw_map_parse().
CREATE TEMP FUNCTION str_to_map(m string, pd string, kvd string)
RETURNS ARRAY<STRUCT<key STRING, value STRING>> AS (
  ARRAY(
    SELECT AS STRUCT kv[SAFE_OFFSET(0)] AS key, kv[SAFE_OFFSET(1)] AS value
    FROM (
      SELECT SPLIT(REGEXP_REPLACE(kv, r'^(.*?)' || kvd, r'\1|'), '|') AS kv
      FROM UNNEST(SPLIT(m, pd)) AS kv
    )
  )
);
CREATE TEMP FUNCTION get_value(get_key STRING, arr ANY TYPE) AS (
(SELECT value FROM UNNEST(arr) WHERE key = get_key)
);
SELECT get_value('cars', map) cars,
get_value('kids', map) kids,
get_value('chances', map) chances,
get_value('missed', map) missed,
FROM UNNEST(['cars:0,kids:143,cats:1,lost:0,win:1,chances:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0,missed:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0']) str,
UNNEST([STRUCT(str_to_map(str, ',', ':') AS map)]);
+------+------+-------------------------------------+-------------------------------------+
| cars | kids | chances | missed |
+------+------+-------------------------------------+-------------------------------------+
| 0 | 143 | 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 | 0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0:0 |
+------+------+-------------------------------------+-------------------------------------+
Another super simple option for that particular case
select
json_value(json, '$.cars') cars,
json_value(json, '$.kids') kids,
json_value(json, '$.cats') cats,
json_value(json, '$.lost') lost,
json_value(json, '$.win') win,
json_value(json, '$.chances') chances,
json_value(json, '$.missed') missed
from your_table,
unnest([format('{%s}', regexp_replace(str, r'([^:,]+):([\d:]*\d)', r'"\1":"\2"'))]) json
with output

Extract Numeric data from the Column in Spark Dataframe

I have a DataFrame with 20 columns, and I want to update one particular column (where the data is null) with data extracted from another column, applying some formatting. Below is a sample input:
+------------------------+----+
|col1 |col2|
+------------------------+----+
|This_is_111_222_333_test|NULL|
|This_is_111_222_444_test|3296|
|This_is_555_and_666_test|NULL|
|This_is_999_test |NULL|
+------------------------+----+
and my output should be like below
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_and_666_test|555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
Here is the code I have tried; it works only when the numbers are contiguous. Could you please help me with a solution?
df.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).show(false)
I can do this by creating a UDF, but I am wondering whether it is possible with Spark's built-in functions. My Spark version is 2.2.0.
Thank you in advance.
A UDF is a good choice here. Performance is similar to that of the withColumn approach given in the OP (see benchmark below), and it works even if the numbers are not contiguous, which is one of the issues mentioned in the OP.
import org.apache.spark.sql.functions.udf
import scala.util.Try
def getNums = (c: String) => {
  c.split("_").map(n => Try(n.toInt).getOrElse(0)).filter(_ > 0)
}
I recreated your data as follows
val data = Seq(("This_is_111_222_333_test", null.asInstanceOf[Array[Int]]),
               ("This_is_111_222_444_test", Array(3296)),
               ("This_is_555_666_test", null.asInstanceOf[Array[Int]]),
               ("This_is_999_test", null.asInstanceOf[Array[Int]]))
  .toDF("col1", "col2")
data.createOrReplaceTempView("data")
Register the UDF and run it in a query
spark.udf.register("getNums",getNums)
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from data""").show
Which returns
+--------------------+---------------+
| col1| new_col|
+--------------------+---------------+
|This_is_111_222_3...|[111, 222, 333]|
|This_is_111_222_4...| [3296]|
|This_is_555_666_test| [555, 666]|
| This_is_999_test| [999]|
+--------------------+---------------+
Performance was tested with a larger data set created as follows:
val bigData = (0 to 1000).map(_ => data union data).reduce( _ union _)
bigData.createOrReplaceTempView("big_data")
With that, the solution given in the OP was compared to the UDF solution and found to be about the same.
// With UDF
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from big_data""").count
/// OP solution:
bigData.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).count
Here is another way (using higher-order functions, which require Spark 2.4 or later); please check the performance.
df.withColumn("col2", expr("coalesce(col2, array_join(filter(split(col1, '_'), x -> CAST(x as INT) IS NOT NULL), ','))"))
.show(false)
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_666_test |555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
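If you specifically want col2 as the comma-joined string shown in the expected output, and you are stuck on Spark 2.2 where the filter/array_join functions used above are not yet available, a small variant of the UDF approach would do it. This is a sketch only; getNumsStr is a hypothetical helper, not part of the answers above:
import org.apache.spark.sql.functions.{coalesce, col, udf}
import scala.util.Try

// Extract the numeric tokens from col1 and join them with commas,
// mirroring the expected output format in the question
val getNumsStr = udf((c: String) =>
  c.split("_").flatMap(n => Try(n.toInt).toOption).mkString(","))

df.withColumn("col2", coalesce(col("col2"), getNumsStr(col("col1")))).show(false)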

Order Spark SQL Dataframe with nested values / complex data types

My goal is to collect an ordered list of nested values. It should be ordered based on an element in the nested list. I tried out different approaches but have concerns in terms of performance and correctness.
Order globally
case class Payment(Id: String, Date: String, Paid: Double)
val payments = Seq(
  Payment("mk", "10:00 AM", 8.6D),
  Payment("mk", "06:00 AM", 12.6D),
  Payment("yc", "07:00 AM", 16.6D),
  Payment("yc", "09:00 AM", 2.6D),
  Payment("mk", "11:00 AM", 5.6D)
)
val df = spark.createDataFrame(payments)
// order globally
df.orderBy(col("Paid").desc)
  .groupBy(col("Id"))
  .agg(
    collect_list(struct(col("Date"), col("Paid"))).as("UserPayments")
  )
  .withColumn("LargestPayment", col("UserPayments")(0).getField("Paid"))
  .withColumn("LargestPaymentDate", col("UserPayments")(0).getField("Date"))
  .show(false)
+---+-------------------------------------------------+--------------+------------------+
|Id |UserPayments |LargestPayment|LargestPaymentDate|
+---+-------------------------------------------------+--------------+------------------+
|yc |[[07:00 AM,16.6], [09:00 AM,2.6]] |16.6 |07:00 AM |
|mk |[[06:00 AM,12.6], [10:00 AM,8.6], [11:00 AM,5.6]]|12.6 |06:00 AM |
+---+-------------------------------------------------+--------------+------------------+
This is a naive and straight-forward approach, but I have concerns in terms of correctness. Will the list really be ordered globally or only within a partition?
Window function
// use Window
val window = Window.partitionBy(col("Id")).orderBy(col("Paid").desc)

df.withColumn("rank", rank().over(window))
  .groupBy(col("Id"))
  .agg(
    collect_list(struct(col("rank"), col("Date"), col("Paid"))).as("UserPayments")
  )
  .withColumn("LargestPayment", col("UserPayments")(0).getField("Paid"))
  .withColumn("LargestPaymentDate", col("UserPayments")(0).getField("Date"))
  .show(false)
+---+-------------------------------------------------------+--------------+------------------+
|Id |UserPayments |LargestPayment|LargestPaymentDate|
+---+-------------------------------------------------------+--------------+------------------+
|yc |[[1,07:00 AM,16.6], [2,09:00 AM,2.6]] |16.6 |07:00 AM |
|mk |[[1,06:00 AM,12.6], [2,10:00 AM,8.6], [3,11:00 AM,5.6]]|12.6 |06:00 AM |
+---+-------------------------------------------------------+--------------+------------------+
This should work, or am I missing something?
Order in UDF on-the-fly
// order in UDF
val largestPaymentDate = udf((lr: Seq[Row]) => {
  lr.max(Ordering.by((l: Row) => l.getAs[Double]("Paid"))).getAs[String]("Date")
})

df.groupBy(col("Id"))
  .agg(
    collect_list(struct(col("Date"), col("Paid"))).as("UserPayments")
  )
  .withColumn("LargestPaymentDate", largestPaymentDate(col("UserPayments")))
  .show(false)
+---+-------------------------------------------------+------------------+
|Id |UserPayments |LargestPaymentDate|
+---+-------------------------------------------------+------------------+
|yc |[[07:00 AM,16.6], [09:00 AM,2.6]] |07:00 AM |
|mk |[[10:00 AM,8.6], [06:00 AM,12.6], [11:00 AM,5.6]]|06:00 AM |
+---+-------------------------------------------------+------------------+
I guess there is nothing to complain about here in terms of correctness. But for the subsequent operations, I'd prefer the list to be ordered so that I don't have to order it explicitly every time.
I tried to write a UDF which takes the list as input and returns the ordered list - but returning a list was too painful and I gave up.
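For what it's worth, returning a sorted list from a UDF is possible: if the UDF returns a Seq of tuples, Spark maps it back to an array of structs (with the fields named _1/_2). A sketch; sortPayments is a hypothetical name, not from the question or answer:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Sort the collected payments by Paid, descending, inside a UDF
val sortPayments = udf((lr: Seq[Row]) =>
  lr.map(r => (r.getAs[String]("Date"), r.getAs[Double]("Paid")))
    .sortBy(-_._2))

// applied after the collect_list aggregation, e.g.:
// .withColumn("UserPayments", sortPayments(col("UserPayments")))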
I'd reverse the order of the struct and aggregate with max:
val result = df
  .groupBy(col("Id"))
  .agg(
    collect_list(struct(col("Date"), col("Paid"))) as "UserPayments",
    max(struct(col("Paid"), col("Date"))) as "MaxPayment"
  )
result.show
// +---+--------------------+---------------+
// | Id| UserPayments| MaxPayment|
// +---+--------------------+---------------+
// | yc|[[07:00 AM,16.6],...|[16.6,07:00 AM]|
// | mk|[[10:00 AM,8.6], ...|[12.6,06:00 AM]|
// +---+--------------------+---------------+
You can later flatten the struct:
result.select($"id", $"UserPayments", $"MaxPayment.*").show
// +---+--------------------+----+--------+
// | id| UserPayments|Paid| Date|
// +---+--------------------+----+--------+
// | yc|[[07:00 AM,16.6],...|16.6|07:00 AM|
// | mk|[[10:00 AM,8.6], ...|12.6|06:00 AM|
// +---+--------------------+----+--------+
In the same way, you can sort_array over the reordered structs:
df
  .groupBy(col("Id"))
  .agg(
    sort_array(collect_list(struct(col("Paid"), col("Date")))) as "UserPayments"
  )
  .show(false)
// +---+-------------------------------------------------+
// |Id |UserPayments |
// +---+-------------------------------------------------+
// |yc |[[2.6,09:00 AM], [16.6,07:00 AM]] |
// |mk |[[5.6,11:00 AM], [8.6,10:00 AM], [12.6,06:00 AM]]|
// +---+-------------------------------------------------+
Finally:
This is a naive and straight-forward approach, but I have concerns in terms of correctness. Will the list really be ordered globally or only within a partition?
Data will be ordered globally, but the order will be destroyed by groupBy, so this is not a solution and can only work accidentally.
repartition (by Id) and sortWithinPartitions (by Id and Paid) should be a reliable replacement, as sketched below.
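In code, that suggestion looks roughly like this (a sketch of the answer's prose, not code from the answer itself):
// Co-locate each Id in a single partition, sort within partitions,
// then aggregate; collect_list picks up the rows in that per-partition order
df.repartition(col("Id"))
  .sortWithinPartitions(col("Id"), col("Paid").desc)
  .groupBy(col("Id"))
  .agg(collect_list(struct(col("Date"), col("Paid"))).as("UserPayments"))
  .show(false)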

How can I handle custom NULL string with Spark DataFrame

I have a data file that looks like below:
// data.txt
1 2016-01-01
2 \N
3 2016-03-01
I used \N to represent a null value for some reason. (It's not a special character; it's a string consisting of 2 characters: \ and N.)
I want to create DataFrame like below:
case class Data(
  val id: Int,
  val date: java.time.LocalDate)

val df = sc.textFile("data.txt")
  .map(_.split("\t"))
  .map(p => Data(
    p(0).toInt,
    _helper(p(1))
  ))
  .toDF()
My question is: how can I write the helper method?
def _helper(s : String) = s match {
  case "\\N" => null, // type error
  case _ => LocalDate.parse(s, dateFormat)
}
This is where an Option type will come in handy.
I changed the custom null value to make the case more explicit but it should work in your case. My data is in a .txt file like so:
Ryan,11
Bob,22
Kevin,23
Asop,-nnn-
Notice the -nnn- is my custom null. I use a slightly different case class:
case class DataSet(name: String, age: Option[Int])
And write a pattern matching function to capture the nuances of the situation:
def customNull(col: String): Option[Int] = col match {
  case "-nnn-" => None
  case _ => Some(Integer.parseInt(col))
}
From here it should work as expected when you combine the two:
val df = sc.textFile("./data.txt")
  .map(_.split(","))
  .map(p => DataSet(p(0), customNull(p(1))))
  .toDF()
When I do a df.show() I get the following:
+-----+----+
| name| age|
+-----+----+
| Ryan| 11|
| Bob| 22|
|Kevin| 23|
| Asop|null|
+-----+----+
Treating the ages as strings gets around the problem, though it might not be the fastest way to parse values like this. Ideally, you could also use an Either, but that can also get complex.
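Applied back to the original date column, the same Option-based pattern would look roughly like this (a sketch; dateFormat stands in for whatever formatter the question uses, and note that Option[java.time.LocalDate] also needs an encoder Spark understands, which may mean falling back to java.sql.Date or a plain String on older Spark versions):
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Assumed formatter; substitute whatever matches the file's dates (e.g. 2016-01-01)
val dateFormat = DateTimeFormatter.ISO_LOCAL_DATE

case class Data(id: Int, date: Option[LocalDate])

def helper(s: String): Option[LocalDate] = s match {
  case "\\N" => None // the custom null marker
  case _ => Some(LocalDate.parse(s, dateFormat))
}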
