how to select by max(date) with spark dataframe API - apache-spark

Given the following dataset
id v date
1 a1 1
1 a2 2
2 b1 3
2 b2 4
I want to select only the last value (regarding the date) for each id.
I've came up with this code :
scala> val df = sc.parallelize(List((41,"a1",1), (1, "a2", 2), (2, "b1", 3), (2, "b2", 4))).toDF("id", "v", "date")
df: org.apache.spark.sql.DataFrame = [id: int, v: string, date: int]
scala> val agg = df.groupBy("id").max("date")
agg: org.apache.spark.sql.DataFrame = [id: int, max(date): int]
scala> val res = df.join(agg, df("id") === agg("id") && df("date") === agg("max(date)"))
16/11/14 22:25:01 WARN sql.Column: Constructing trivially true equals predicate, 'id#3 = id#3'. Perhaps you need to use aliases.
res: org.apache.spark.sql.DataFrame = [id: int, v: string, date: int, id: int, max(date): int]
Is there a better way (more idiomatic, …) ?
Bonus : how to perform a max on a date column and avoid this error Aggregation function can only be applied on a numeric column. ?

You can try agg() with the max function :
import static org.apache.spark.sql.functions.*
df.groupBy("id").agg(max("date"))

For me it only worked taht way:
df = df.groupBy('CPF').agg({'DATA': 'max'})

Related

How to conditionally replace Spark SQL array values using SQL language?

I have this column inside myTable:
myColumn
[red, green]
[green, green, red]
I need to modify it so that I can replace red with 1, green with 2:
myColumn
[1, 2]
[2, 2, 1]
In short, is there a way to apply case clause for each element in the array, row wise?
The closest I've gotten so far:
select replace(replace(to_json(myColumn), 'red', 1), 'green', 2)
On the other hand, in case we have a column of strings, I could simply use:
select (
case
when myColumn='red' then 1
when myColumn='green' then 2
end
) from myTable;
Assuming that the dataframe has registered a temporary view named tmp, use the following SQL statement to get the result.
sql = """
select
collect_list(
case col
when 'red' then 1
when 'green' then 2
end)
myColumn
from
(select mid,explode(myColumn) col
from
(select monotonically_increasing_id() mid,myColumn
from tmp)
)
group by mid
"""
df = spark.sql(sql)
df.show(truncate=False)
In pure Spark SQL, you could convert your array into a string with concat_ws, make the substitutions with regexp_replace and then recreate the array with split.
select split(
regexp_replace(
regexp_replace(
concat_ws(',', myColumn)
, 'red', '1')
, 'green', '2')
, ',') myColumn from df
I could perform a simple transform (Spark 3 onwards)
select transform(myColumn, value ->
case value
when 'red' then 1
when 'green' then 2
end
from myTable
Let's create some sample data and a map that contains the substitutions: tou want to make
val df = Seq((1, Seq("red", "green")),
(2, Seq("green", "green", "red")))
.toDF("id", "myColumn")
val values = Map("red" -> "1", "green" -> "2")
The most straight forward way would be to define a UDF that does exactly what you want:
val replace = udf((x : Array[String]) =>
x.map(value => values.getOrElse(value, value)))
df.withColumn("myColumn", replace('myColumn)).show
+---+---------+
| id| myColumn|
+---+---------+
| 1| [1, 2]|
| 2|[2, 2, 1]|
+---+---------+
Without UDFs, you could transform the array into a string with concat_ws using separators that are not in your array. Then we could use string functions to make the edits:
val sep = ","
val replace = values
.foldLeft(col("myColumn")){ case (column, (key, value)) =>
regexp_replace(column, sep + key + sep, sep + value + sep)
}
df.withColumn("myColumn", concat(lit(sep), concat_ws(sep+sep, 'myColumn), lit(sep)))
.withColumn("myColumn", regexp_replace(replace, "(^,)|(,$)", ""))
.withColumn("myColumn", split('myColumn, sep+sep))
.show

How to add a column for equality of MapType in Spark withColumn?

I have two columns with type Map[String, Integer], I want to use withColumn to add a column to represent equality of the two maps.
I try df.withColumn("euqal", col("A") === col("B")), it throws exception EqualTo does not support ordering on type map<string,int>
We can do this using UDF:
import spark.implicits._
import org.apache.spark.sql.functions._
//UDF
val mapEqual = udf((col1: Map[StringType, IntegerType], col2: Map[StringType, IntegerType]) => {
col1.sameElements(col2)
})
//sample DF
var df = List((Map("1" -> 1), Map("1" -> 1)), (Map("2" -> 1), Map("2" -> 10)))
.toDF("col1", "col2")
df.withColumn("equal", mapEqual('col1, 'col2)).show(false)
+--------+---------+-----+
|col1 |col2 |equal|
+--------+---------+-----+
|[1 -> 1]|[1 -> 1] |true |
|[2 -> 1]|[2 -> 10]|false|
+--------+---------+-----+

Passing an entire row as an argument to spark udf through spark dataframe - throws AnalysisException

I am trying to pass an entire row to the spark udf along with few other arguments, I am not using spark sql rather I am using dataframe withColumn api, but I am getting the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) col3#9 missing from col1#7,col2#8,col3#13 in operator !Project [col1#7, col2#8, col3#13, UDF(col3#9, col2, named_struct(col1, col1#7, col2, col2#8, col3, col3#9)) AS contcatenated#17]. Attribute(s) with the same name appear in the operation: col3. Please check if the right attribute(s) are used.;;
The above exception can be replicated using the below code:
addRowUDF() // call invokes
def addRowUDF() {
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().config(new SparkConf().set("master", "local[*]")).appName(this.getClass.getSimpleName).getOrCreate()
import spark.implicits._
val df = Seq(
("a", "b", "c"),
("a1", "b1", "c1")).toDF("col1", "col2", "col3")
execute(df)
}
def execute(df: org.apache.spark.sql.DataFrame) {
import org.apache.spark.sql.Row
def concatFunc(x: Any, y: String, row: Row) = x.toString + ":" + y + ":" + row.mkString(", ")
import org.apache.spark.sql.functions.{ udf, struct }
val combineUdf = udf((x: Any, y: String, row: Row) => concatFunc(x, y, row))
def udf_execute(udf: String, args: org.apache.spark.sql.Column*) = (combineUdf)(args: _*)
val columns = df.columns.map(df(_))
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
val df3 = df2.withColumn("contcatenated", udf_execute("uudf", df2.col("col3"), lit("col2"), struct(columns: _*)))
df3.show(false)
}
output should be:
+----+----+-----------+----------------------------+
|col1|col2|col3 |contcatenated |
+----+----+-----------+----------------------------+
|a |b |xxxxxxxxxxx|xxxxxxxxxxx:col2:a, b, c |
|a1 |b1 |xxxxxxxxxxx|xxxxxxxxxxx:col2:a1, b1, c1 |
+----+----+-----------+----------------------------+
That happens because you refer to column that is no longer in the scope. When you call:
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
it shades the original col3 column, effectively making preceding columns with the same name accessible. Even if it wasn't the case, let's say after:
val df2 = df.select($"*", lit("xxxxxxxxxxx") as "col3")
the new col3 would be ambiguous, and indistinguishable by name from the one defined brought by *.
So to achieve the required output you'll have to use another name:
val df2 = df.withColumn("col3_", lit("xxxxxxxxxxx"))
and then adjust the rest of your code accordingly:
df2.withColumn(
"contcatenated",
udf_execute("uudf", df2.col("col3_") as "col3",
lit("col2"), struct(columns: _*))
).drop("_3")
If the logic is as simple as the one in the example, you can of course just inline things:
df.withColumn(
"contcatenated",
udf_execute("uudf", lit("xxxxxxxxxxx") as "col3",
lit("col2"), struct(columns: _*))
).drop("_3")

Iterating over a grouped dataset in Spark 1.6

In an ordered dataset, I want to aggregate data until a condition is met, but grouped by a certain key.
To set some context to my question I simplify my problem to the below problem statement:
In spark I need to aggregate strings, grouped by key when a user stops
"shouting" (the 2nd char in a string is not uppercase).
Dataset example:
ID, text, timestamps
1, "OMG I like bananas", 123
1, "Bananas are the best", 234
1, "MAN I love banana", 1235
2, "ORLY? I'm more into grapes", 123565
2, "BUT I like apples too", 999
2, "unless you count veggies", 9999
2, "THEN don't forget tomatoes", 999999
The expected result would be:
1, "OMG I like bananas Bananas are the best"
2, "ORLY? I'm more into grapes BUT I like apples too unless you count veggies"
via groupby and agg I can't seem to set a condition to "stop when an uppercase char" is found.
This only works in Spark 2.1 or above
What you want to do is possible, but it may be very expensive.
First, let's create some test data. As general advice, when you ask something on Stackoverflow please provide something similar to this so people have somewhere to start.
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = List(
(1, "OMG I like bananas", 1),
(1, "Bananas are the best", 2),
(1, "MAN I love banana", 3),
(2, "ORLY? I'm more into grapes", 1),
(2, "BUT I like apples too", 2),
(2, "unless you count veggies", 3),
(2, "THEN don't forget tomatoes", 4)
).toDF("ID", "text", "timestamps")
In order to get a column with the collected texts in order, we need to add a new column using a window function.
Using the spark shell:
scala> val df2 = df.withColumn("coll", collect_list("text").over(Window.partitionBy("id").orderBy("timestamps")))
df2: org.apache.spark.sql.DataFrame = [ID: int, text: string ... 2 more fields]
scala> val x = df2.groupBy("ID").agg(max($"coll").as("texts"))
x: org.apache.spark.sql.DataFrame = [ID: int, texts: array<string>]
scala> x.collect.foreach(println)
[1,WrappedArray(OMG I like bananas, Bananas are the best, MAN I love banana)]
[2,WrappedArray(ORLY? I'm more into grapes, BUT I like apples too, unless you count veggies, THEN don't forget tomatoes)]
To get the actual text we may need a UDF. Here's mine (I'm far from an expert in Scala, so bear with me)
import scala.collection.mutable
val aggText: Seq[String] => String = (list: Seq[String]) => {
def tex(arr: Seq[String], accum: Seq[String]): Seq[String] = arr match {
case Seq() => accum
case Seq(single) => accum :+ single
case Seq(str, xs #_*) => if (str.length >= 2 && !(str.charAt(0).isUpper && str.charAt(1).isUpper))
tex(Nil, accum :+ str )
else
tex(xs, accum :+ str)
}
val res = tex(list, Seq())
res.mkString(" ")
}
val textUDF = udf(aggText(_: mutable.WrappedArray[String]))
So, we have a dataframe with the collected texts in the proper order, and a Scala function (wrapped as a UDF). Let's piece it together:
scala> val x = df2.groupBy("ID").agg(max($"coll").as("texts"))
x: org.apache.spark.sql.DataFrame = [ID: int, texts: array<string>]
scala> val y = x.select($"ID", textUDF($"texts"))
y: org.apache.spark.sql.DataFrame = [ID: int, UDF(texts): string]
scala> y.collect.foreach(println)
[1,OMG I like bananas Bananas are the best]
[2,ORLY? I'm more into grapes BUT I like apples too unless you count veggies]
scala>
I think this is the result you want.

Spark HiveContext get the same format as hive client select

When a Hive table has values like maps or arrays, if you select it in the Hive client they are shown as JSON, e.g.: {"a":1,"b":1} or [1,2,2].
When you select those in Spark, they are map/array objects in the DataFrame. If you stringify each row they are Map("a" -> 1, "b" -> 1) or WrappedArray(1, 2, 2).
I want to have the same format as the Hive client when using Spark's HiveContext.
How can I do this?
Spark has its own functions to convert complex objects into their JSON representation.
Here is the documentation for the org.apache.spark.sql.functions package, which also comes with the to_json function that does the following:
Converts a column containing a StructType, ArrayType of StructTypes, a MapType or ArrayType of MapTypes into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.
Here is a short example as ran on the spark-shell:
scala> val df = spark.createDataFrame(
| Seq(("hello", Map("a" -> 1)), ("world", Map("b" -> 2)))
| ).toDF("name", "map")
df: org.apache.spark.sql.DataFrame = [name: string, map: map<string,int>]
scala> df.show
+-----+-----------+
| name| map|
+-----+-----------+
|hello|Map(a -> 1)|
|world|Map(b -> 2)|
+-----+-----------+
scala> df.select($"name", to_json(struct($"map")) as "json").show
+-----+---------------+
| name| json|
+-----+---------------+
|hello|{"map":{"a":1}}|
|world|{"map":{"b":2}}|
+-----+---------------+
Here is a similar example, with arrays instead of maps:
scala> val df = spark.createDataFrame(
| Seq(("hello", Seq("a", "b")), ("world", Seq("c", "d")))
| ).toDF("name", "array")
df: org.apache.spark.sql.DataFrame = [name: string, array: array<string>]
scala> df.select($"name", to_json(struct($"array")) as "json").show
+-----+-------------------+
| name| json|
+-----+-------------------+
|hello|{"array":["a","b"]}|
|world|{"array":["c","d"]}|
+-----+-------------------+

Resources