Spark 3.1 String Array to Date Array Conversion Error - apache-spark

I want to find out whether a given array of dates contains the incoming date or not; if yes, I need to flag it in an extra column.
dataset = dataset
    .withColumn("incoming_timestamp", col("incoming_timestamp").cast("timestamp"))
    .withColumn("incoming_date", to_date(col("incoming_timestamp")));
My incoming_timestamp is 2021-03-30 00:00:00; after converting to date it is 2021-03-30.
The output dataset looks like this:
+-----+-------------------+-------------+
|col 1|incoming_timestamp |incoming_date|
+-----+-------------------+-------------+
|val1 |2021-03-30 00:00:00|2021-07-06   |
|val2 |2020-03-30 00:00:00|2020-03-30   |
|val3 |1889-03-30 00:00:00|1889-03-30   |
+-----+-------------------+-------------+
I have a String declared like this:
String dates = "2021-07-06,1889-03-30";
I want to add one more column to the result dataset indicating whether the incoming date is present in the dates string, like this:
+-----+-------------------+-------------+------+
|col 1|incoming_timestamp |incoming_date|result|
+-----+-------------------+-------------+------+
|val1 |2021-03-30 00:00:00|2021-07-06   |true  |
|val2 |2020-03-30 00:00:00|2020-03-30   |false |
|val3 |1889-03-30 00:00:00|1889-03-30   |true  |
+-----+-------------------+-------------+------+
For that, I first need to convert this String into an array; then array_contains(array, value) returns true if the array contains the value.
I tried the following.
METHOD 1
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
Date[] dateArr = Arrays.stream(dates.split(","))
    .map(d -> LocalDate.parse(d, formatter))
    .toArray(Date[]::new);
It throws this error: java.lang.ArrayStoreException: java.time.LocalDate
METHOD 2
SimpleDateFormat formatter = new SimpleDateFormat("YYYY-MM-DD", Locale.ENGLISH);
formatter.setTimeZone(TimeZone.getTimeZone("America/New_York"));
Date[] dateArr = Arrays.stream(dates.split(",")).map(d -> {
    try {
        return formatter.parse(d);
    } catch (ParseException e) {
        e.printStackTrace();
    }
    return null;
}).toArray(Date[]::new);
dataset = dataset.withColumn("result", array_contains(col("incoming_date"), dateArr));
It throws this error:
org.apache.spark.sql.AnalysisException: Unsupported component type class java.util.Date in arrays
Can anyone help on this?

This can be solved by converting the String values to java.sql.Date.
import java.sql.Date
import org.apache.spark.sql.functions.{col, to_date}

val data: Seq[(String, String)] = Seq(
  ("val1", "2020-07-31 00:00:00"),
  ("val2", "2021-02-28 00:00:00"),
  ("val3", "2019-12-31 00:00:00"))

val compareDate = "2020-07-31, 2019-12-31"
val compareDateArray = compareDate.split(",").map(x => Date.valueOf(x.trim))

import spark.implicits._
val df = data.toDF("variable", "date")
  .withColumn("date_casted", to_date(col("date"), "y-M-d H:m:s"))
df.show()

val outputDf = df.withColumn("result", col("date_casted").isin(compareDateArray: _*))
outputDf.show()
Input:
+--------+-------------------+-----------+
|variable| date|date_casted|
+--------+-------------------+-----------+
| val1|2020-07-31 00:00:00| 2020-07-31|
| val2|2021-02-28 00:00:00| 2021-02-28|
| val3|2019-12-31 00:00:00| 2019-12-31|
+--------+-------------------+-----------+
root
|-- variable: string (nullable = true)
|-- date: string (nullable = true)
|-- date_casted: date (nullable = true)
Output:
+--------+-------------------+-----------+------+
|variable| date|date_casted|result|
+--------+-------------------+-----------+------+
| val1|2020-07-31 00:00:00| 2020-07-31| true|
| val2|2021-02-28 00:00:00| 2021-02-28| false|
| val3|2019-12-31 00:00:00| 2019-12-31| true|
+--------+-------------------+-----------+------+
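As a side note, here is an alternative sketch (my addition, not part of the original answer) that uses array_contains instead of isin, assuming the same df and compareDateArray as above:

import org.apache.spark.sql.functions.{array, array_contains, col, lit}

// Build a literal array column from the java.sql.Date values and check
// whether it contains the incoming date; semantics match the isin version.
val dateLits = compareDateArray.map(d => lit(d))
val outputDf2 = df.withColumn("result",
  array_contains(array(dateLits: _*), col("date_casted")))
outputDf2.show()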

Related

How to handle this in Spark

I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have a scenario with finance data coming from a Kafka topic. The data (base dataset) contains companyId, year, and prev_year fields.
If year === prev_year, I need to join with a different table, i.e. exchange_rates.
If year =!= prev_year, I need to return the base dataset itself.
How can I do this in spark-sql?
You can use the approach below for your case.
scala> Input_df.show
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
| 1|2016| 2017| 12|
| 1|2017| 2017|21.4|
| 2|2018| 2017|11.7|
| 2|2018| 2018|44.6|
| 3|2016| 2017|34.5|
| 4|2017| 2017| 56|
+---------+----+---------+----+
scala> exch_rates.show
+---------+----+
|companyId|rate|
+---------+----+
| 1|12.3|
| 2|12.5|
| 3|22.3|
| 4|34.6|
| 5|45.2|
+---------+----+
scala> val equaldf = Input_df.filter(col("year") === col("prev_year"))
scala> val notequaldf = Input_df.filter(col("year") =!= col("prev_year"))
scala> val joindf = notequaldf.alias("n").drop("rate").join(exch_rates.alias("e"), List("companyId"), "left")
scala> val finalDF = equaldf.union(joindf)
scala> finalDF.show()
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
| 1|2017| 2017|21.4|
| 2|2018| 2018|44.6|
| 4|2017| 2017| 56|
| 1|2016| 2017|12.3|
| 2|2018| 2017|12.5|
| 3|2016| 2017|22.3|
+---------+----+---------+----+
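As a further sketch (my variation, not part of the original answer), the same result can be produced with a single left join plus a conditional column instead of filter + union; only the row order differs:

import org.apache.spark.sql.functions.{col, when}

// Keep the incoming rate when year == prev_year, otherwise take the rate
// from exchange_rates (left join, so a missing companyId yields null).
val finalDF2 = Input_df.alias("i")
  .join(exch_rates.alias("e"), Seq("companyId"), "left")
  .select(
    col("companyId"),
    col("i.year"),
    col("i.prev_year"),
    when(col("i.year") === col("i.prev_year"), col("i.rate"))
      .otherwise(col("e.rate")).as("rate"))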

Spark: DataFrame Aggregation (Scala)

I have the below requirement to aggregate data on a Spark DataFrame in Scala.
I have two datasets.
Dataset 1 contains values (val1, val2, ...) for each "t" type, spread over several columns (t1, t2, ...).
val data1 = Seq(
("1","111",200,"221",100,"331",1000),
("2","112",400,"222",500,"332",1000),
("3","113",600,"223",1000,"333",1000)
).toDF("id1","t1","val1","t2","val2","t3","val3")
data1.show()
+---+---+----+---+----+---+----+
|id1| t1|val1| t2|val2| t3|val3|
+---+---+----+---+----+---+----+
| 1|111| 200|221| 100|331|1000|
| 2|112| 400|222| 500|332|1000|
| 3|113| 600|223|1000|333|1000|
+---+---+----+---+----+---+----+
Dataset 2 represents the same thing with a separate row for each "t" type.
val data2 = Seq(("1","111",200),("1","221",100),("1","331",1000),
("2","112",400),("2","222",500),("2","332",1000),
("3","113",600),("3","223",1000), ("3","333",1000)
).toDF("id*","t*","val*")
data2.show()
+---+---+----+
|id*| t*|val*|
+---+---+----+
| 1|111| 200|
| 1|221| 100|
| 1|331|1000|
| 2|112| 400|
| 2|222| 500|
| 2|332|1000|
| 3|113| 600|
| 3|223|1000|
| 3|333|1000|
+---+---+----+
Now, I need to group by the (id, t, t*) fields and print the balances sum(val) and sum(val*) as separate records.
Both balances should be equal.
My output should look like this:
+---+---+--------+---+---------+
|id1| t |sum(val)| t*|sum(val*)|
+---+---+--------+---+---------+
| 1|111| 200|111| 200|
| 1|221| 100|221| 100|
| 1|331| 1000|331| 1000|
| 2|112| 400|112| 400|
| 2|222| 500|222| 500|
| 2|332| 1000|332| 1000|
| 3|113| 600|113| 600|
| 3|223| 1000|223| 1000|
| 3|333| 1000|333| 1000|
+---+---+--------+---+---------+
I'm thinking of exploding dataset 1 into multiple records, one for each "t" type, and then joining with dataset 2.
But could you please suggest a better approach that won't hurt performance as the datasets grow?
The simplest solution is to do sub-selects and then union datasets:
val ts = Seq(1, 2, 3)
val dfs = ts.map(t => data1.select(col("id1"), col("t" + t) as "t", col("val" + t) as "v"))
val unioned = dfs.drop(1).foldLeft(dfs(0))((l, r) => l.union(r))
val ds = unioned.join(data2, 't === col("t*"))
Then do the aggregation on the joined result (see the sketch below).
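A minimal sketch of that aggregation step, assuming the ds built just above:

import org.apache.spark.sql.functions.{first, sum}

// Group on both type columns and sum the values coming from each side.
ds.groupBy("t", "t*")
  .agg(first("id1") as "id1", sum("v") as "sum(val)", sum("val*") as "sum(val*)")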
You can also try array with explode:
val df1 = data1
  .withColumn("t", explode(array('t1, 't2, 't3)))
val ds = df1.withColumn("val",
  when('t === 't1, 'val1)
    .when('t === 't2, 'val2)
    .when('t === 't3, 'val3)
    .otherwise(0))
  .select('id1, 't, 'val)
The last step is to join this Dataset with data2:
ds.join(data2, 't === col("t*"))
  .groupBy("t", "t*")
  .agg(first("id1") as "id1", sum("val") as "sum(val)", sum("val*") as "sum(val*)")

How to flatten columns of type array of structs (as returned by Spark ML API)?

Maybe it's just because I'm relatively new to the API, but I feel like Spark ML methods often return DFs that are unnecessarily difficult to work with.
This time, it's the ALS model that's tripping me up. Specifically, the recommendForAllUsers method. Let's reconstruct the type of DF it would return:
scala> val arrayType = ArrayType(new StructType().add("itemId", IntegerType).add("rating", FloatType))
scala> val recs = Seq((1, Array((1, .7), (2, .5))), (2, Array((0, .9), (4, .1)))).
toDF("userId", "recommendations").
select($"userId", $"recommendations".cast(arrayType))
scala> recs.show()
+------+------------------+
|userId| recommendations|
+------+------------------+
| 1|[[1,0.7], [2,0.5]]|
| 2|[[0,0.9], [4,0.1]]|
+------+------------------+
scala> recs.printSchema
root
|-- userId: integer (nullable = false)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemId: integer (nullable = true)
| | |-- rating: float (nullable = true)
Now, I only care about the itemId in the recommendations column. After all, the method is recommendForAllUsers not recommendAndScoreForAllUsers (ok ok I'll stop being sassy...)
How do I do this??
I thought I had it when I created a UDF:
scala> val itemIds = udf((arr: Array[(Int, Float)]) => arr.map(_._1))
but that produces an error:
scala> recs.withColumn("items", itemIds($"recommendations"))
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(recommendations)' due to data type mismatch: argument 1 requires array<struct<_1:int,_2:float>> type, however, '`recommendations`' is of array<struct<itemId:int,rating:float>> type.;;
'Project [userId#87, recommendations#92, UDF(recommendations#92) AS items#238]
+- Project [userId#87, cast(recommendations#88 as array<struct<itemId:int,rating:float>>) AS recommendations#92]
+- Project [_1#84 AS userId#87, _2#85 AS recommendations#88]
+- LocalRelation [_1#84, _2#85]
Any ideas? thanks!
wow, my coworker came up with an extremely elegant solution:
scala> recs.select($"userId", $"recommendations.itemId").show
+------+------+
|userId|itemId|
+------+------+
| 1|[1, 2]|
| 2|[0, 4]|
+------+------+
So maybe the Spark ML API isn't that difficult after all :)
With an array as the type of a column, e.g. recommendations, you'd be quite productive using the explode function (or the more advanced flatMap operator).
explode(e: Column): Column Creates a new row for each element in the given array or map column.
That gives you bare structs to work with.
import org.apache.spark.sql.types._
val structType = new StructType().
add($"itemId".int).
add($"rating".float)
val arrayType = ArrayType(structType)
val recs = Seq((1, Array((1, .7), (2, .5))), (2, Array((0, .9), (4, .1)))).
toDF("userId", "recommendations").
select($"userId", $"recommendations" cast arrayType)
val exploded = recs.withColumn("recs", explode($"recommendations"))
scala> exploded.show
+------+------------------+-------+
|userId| recommendations| recs|
+------+------------------+-------+
| 1|[[1,0.7], [2,0.5]]|[1,0.7]|
| 1|[[1,0.7], [2,0.5]]|[2,0.5]|
| 2|[[0,0.9], [4,0.1]]|[0,0.9]|
| 2|[[0,0.9], [4,0.1]]|[4,0.1]|
+------+------------------+-------+
Structs are nice in the select operator with * (star), which flattens them to one column per struct field.
You could do select($"element.*").
scala> exploded.select("userId", "recs.*").show
+------+------+------+
|userId|itemId|rating|
+------+------+------+
| 1| 1| 0.7|
| 1| 2| 0.5|
| 2| 0| 0.9|
| 2| 4| 0.1|
+------+------+------+
I think that could do what you're after.
p.s. Stay away from UDFs as long as possible since they "trigger" row conversion from the internal format (InternalRow) to JVM objects that can lead to excessive GCs.
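For completeness, here is a sketch of the flatMap alternative mentioned above, using hypothetical case classes that mirror the recs schema (and assuming there are no null elements):

import spark.implicits._

// Case classes mirroring userId plus the array of (itemId, rating) structs.
case class Rec(itemId: Int, rating: Float)
case class UserRecs(userId: Int, recommendations: Seq[Rec])

// One output row per (user, recommendation) pair.
val itemsPerUser = recs.as[UserRecs].flatMap { ur =>
  ur.recommendations.map(r => (ur.userId, r.itemId, r.rating))
}.toDF("userId", "itemId", "rating")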

How to filter Spark dataframe by array column containing any of the values of some other dataframe/set

I have a DataFrame A that contains a column of type array of strings.
...
|-- browse: array (nullable = true)
| |-- element: string (containsNull = true)
...
For example, three sample rows would be:
+---------+--------+---------+
| column 1|  browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1|
|     foo2|   [K,L]|     bar2|
|     foo3|     [M]|     bar3|
+---------+--------+---------+
And another DataFrame B that contains a column of strings:
|-- browsenodeid: string (nullable = true)
Some sample rows for it would be:
+------------+
|browsenodeid|
+------------+
|           A|
|           Z|
|           M|
+------------+
How can I filter A so that I keep all the rows whose browse contains any of the values of browsenodeid from B? In terms of the above example the result will be:
+---------+--------+---------+
| column 1|  browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1|   <- because Z is a value of B.browsenodeid
|     foo3|     [M]|     bar3|   <- because M is a value of B.browsenodeid
+---------+--------+---------+
If I had a single value then I would use something like
A.filter(array_contains(A("browse"), single_value))
But what do I do with a list or DataFrame of values?
I found an elegant solution for this, without the need to cast DataFrames/Datasets to RDDs.
Assuming you have a DataFrame dataDF:
+---------+--------+---------+
| column 1|  browse| column n|
+---------+--------+---------+
|     foo1| [X,Y,Z]|     bar1|
|     foo2|   [K,L]|     bar2|
|     foo3|     [M]|     bar3|
+---------+--------+---------+
and an array b containing the values you want to match in browse
val b: Array[String] = Array("M", "Z")
Implement the udf:
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray

def array_contains_any(s: Seq[String]): UserDefinedFunction =
  udf((c: WrappedArray[String]) => c.toList.intersect(s).nonEmpty)
and then simply use the filter or where function (with a little bit of fancy currying :P) to do the filtering like:
dataDF.where(array_contains_any(b)($"browse"))
In Spark >= 2.4.0 you can use arrays_overlap:
import org.apache.spark.sql.functions.{array, arrays_overlap, lit}
val df = Seq(
("foo1", Seq("X", "Y", "Z"), "bar1"),
("foo2", Seq("K", "L"), "bar2"),
("foo3", Seq("M"), "bar3")
).toDF("col1", "browse", "coln")
val b = Seq("M" ,"Z")
val searchArray = array(b.map(lit): _*) // wrap each value in lit(), then build a Spark array column
df.where(arrays_overlap($"browse", searchArray)).show()
// +----+---------+----+
// |col1| browse|coln|
// +----+---------+----+
// |foo1|[X, Y, Z]|bar1|
// |foo3| [M]|bar3|
// +----+---------+----+
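If the values to match live in a second DataFrame B (with a browsenodeid column) rather than in a local array, here are two common options, sketched against the df defined above (B is assumed here, not defined in the original answer):

import org.apache.spark.sql.functions.{array, arrays_overlap, explode, lit}
import spark.implicits._

// Option 1: collect B's values to the driver (fine while B stays small)
// and reuse the arrays_overlap approach above.
val vals = B.select("browsenodeid").as[String].collect()
df.where(arrays_overlap($"browse", array(vals.map(lit): _*)))

// Option 2: explode the browse array and left-semi join against B,
// then drop the duplicates introduced by rows matching several values.
df.withColumn("b", explode($"browse"))
  .join(B, $"b" === $"browsenodeid", "left_semi")
  .drop("b")
  .distinct()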
Assume this input data for DataFrame A:
browse
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
and you have to match it with DataFrame B.
browsenodeid (I flattened the column browsenodeid): 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\Dataframe_A.txt")
rawrdd.map(_.split("\\|"))
  .map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(","))
  .foreach(println)
Your output:
300,200
300
123
Updated
val matchSet = "A,Z,M".split(",").toSet
val rawrdd = sc.textFile("/FileStore/tables/mvv45x9f1494518792828/input_A.txt")
rawrdd.map(_.split("\\|"))
  .filter(r => r(1).split(",").toSet.intersect(matchSet).nonEmpty)
  .map(r => org.apache.spark.sql.Row(r(0), r(1), r(2)))
  .collect.foreach(println)
Output is
foo1,X,Y,Z,bar1
foo3,M,bar3

spark conditional replacement of values

For pandas I have a code snippet like this:
def setUnknownCatValueConditional(df, conditionCol, condition, colToSet, _valueToSet='KEINE'):
df.loc[(df[conditionCol] == condition) & (df[colToSet].isnull()), colToSet] = _valueToSet
which conditionally will replace values in a data frame.
Trying to port this functionality to Spark:
df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
This did not work out for me:
df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
warning: there was one feature warning; re-run with -feature for details
org.apache.spark.sql.AnalysisException: cannot resolve '((`A` = 'x') AND `B`)' due to data type mismatch: differing types in '((`A` = 'X') AND `B`)' (boolean and string).;;
even though df.printSchema returns string for both A and B.
What is wrong here?
edit
A minimal example:
import java.sql.{ Date, Timestamp }
case class FooBar(foo: Date, bar: String)
val myDf = Seq(
    ("2016-01-01", "first"),
    ("2016-01-02", "second"),
    ("2016-wrongFormat", "noValidFormat"),
    ("2016-01-04", "lastAssumingSameDate"))
  .toDF("foo", "bar")
  .withColumn("foo", 'foo.cast("Date"))
  .as[FooBar]
myDf.printSchema
root
|-- foo: date (nullable = true)
|-- bar: string (nullable = true)
scala> myDf.show
+----------+--------------------+
| foo| bar|
+----------+--------------------+
|2016-01-01| first|
|2016-01-02| second|
| null| noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+
myDf.withColumn("foo", when($"bar" === "noValidFormat" and $"foo" isNull, "noValue")).show
And the expected output
+----------+--------------------+
| foo| bar|
+----------+--------------------+
|2016-01-01| first|
|2016-01-02| second|
| "noValue"| noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+
edit2
In case chaining of conditions is required:
df
  .withColumn("A",
    when(
      (($"A" === "x") and ($"B" isNull)) or
      (($"A" === "y") and ($"B" isNull)), "replacement"))
should work.
Mind the operator precedence. It should be:
myDf.withColumn("foo",
when(($"bar" === "noValidFormat") and ($"foo" isNull), "noValue"))
This:
$"bar" === "noValidFormat" and $"foo" isNull
is evaluated as:
(($"bar" === "noValidFormat") and $"foo") isNull
