Spark: DataFrame Aggregation (Scala) - apache-spark

I have a requirement to aggregate data on a Spark DataFrame in Scala, and I have two datasets.
Dataset 1 contains values (val1, val2, ...) for each "t" type, distributed across several different columns (t1, t2, ...).
val data1 = Seq(
  ("1","111",200,"221",100,"331",1000),
  ("2","112",400,"222",500,"332",1000),
  ("3","113",600,"223",1000,"333",1000)
).toDF("id1","t1","val1","t2","val2","t3","val3")
data1.show()
+---+---+----+---+----+---+----+
|id1| t1|val1| t2|val2| t3|val3|
+---+---+----+---+----+---+----+
| 1|111| 200|221| 100|331|1000|
| 2|112| 400|222| 500|332|1000|
| 3|113| 600|223|1000|333|1000|
+---+---+----+---+----+---+----+
Dataset 2 represents the same thing, with a separate row for each "t" type.
val data2 = Seq(
  ("1","111",200), ("1","221",100), ("1","331",1000),
  ("2","112",400), ("2","222",500), ("2","332",1000),
  ("3","113",600), ("3","223",1000), ("3","333",1000)
).toDF("id*","t*","val*")
data2.show()
+---+---+----+
|id*| t*|val*|
+---+---+----+
| 1|111| 200|
| 1|221| 100|
| 1|331|1000|
| 2|112| 400|
| 2|222| 500|
| 2|332|1000|
| 3|113| 600|
| 3|223|1000|
| 3|333|1000|
+---+---+----+
Now I need to group by the (id, t, t*) fields and print the balances sum(val) and sum(val*) as separate columns of the same record, and both balances should be equal.
My output should look like this:
+---+---+--------+---+---------+
|id1|  t|sum(val)| t*|sum(val*)|
+---+---+--------+---+---------+
|  1|111|     200|111|      200|
|  1|221|     100|221|      100|
|  1|331|    1000|331|     1000|
|  2|112|     400|112|      400|
|  2|222|     500|222|      500|
|  2|332|    1000|332|     1000|
|  3|113|     600|113|      600|
|  3|223|    1000|223|     1000|
|  3|333|    1000|333|     1000|
+---+---+--------+---+---------+
I'm thinking of exploding dataset 1 into multiple records, one per "t" type, and then joining with dataset 2.
But could you please suggest a better approach that wouldn't hurt performance as the datasets become bigger?

The simplest solution is to do sub-selects and then union the datasets:
val ts = Seq(1, 2, 3)
val dfs = ts.map(t => data1.select(col("id1"), col("t" + t).as("t"), col("val" + t).as("val")))
val unioned = dfs.drop(1).foldLeft(dfs(0))((l, r) => l.union(r))
val ds = unioned.join(data2, 't === col("t*"))
and then aggregate on ds.
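A hedged sketch of that aggregation step (my addition, assuming the column names produced by the sub-selects above):
import org.apache.spark.sql.functions._
// Group by the id and both type columns, then sum each value column.
val aggregated = ds
  .groupBy("id1", "t", "t*")
  .agg(sum("val").as("sum(val)"), sum("val*").as("sum(val*)"))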
You can also try array with explode:
val df1 = data1.withColumn("colList", array('t1, 't2, 't3))
  .withColumn("t", explode('colList))
val ds = df1.withColumn("val",
    when('t === 't1, 'val1)
      .when('t === 't2, 'val2)
      .when('t === 't3, 'val3)
      .otherwise(0))
  .select('id1, 't, 'val)
The last step is to join this Dataset with data2:
ds.join(data2, 't === col("t*"))
  .groupBy("t", "t*")
  .agg(first("id1") as "id1", sum("val"), sum("val*"))
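Another option (my own sketch, not part of the original answer) is to unpivot data1 with the SQL stack function, which produces the long layout in a single projection and avoids both the repeated sub-selects and the explode/when chain:
import org.apache.spark.sql.functions._
// Unpivot (t1, val1), (t2, val2), (t3, val3) into one (t, val) row each,
// in a single pass over data1.
val longDf = data1.selectExpr(
  "id1",
  "stack(3, t1, val1, t2, val2, t3, val3) as (t, val)"
)
// Join against the already-long data2 and aggregate both value columns.
val result = longDf
  .join(data2, longDf("t") === data2("t*"))
  .groupBy("id1", "t", "t*")
  .agg(sum("val").as("sum(val)"), sum("val*").as("sum(val*)"))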

Related

Pyspark: aggregate in expression vs function call

Can anyone tell me why the aggregate call via expr works here but not through the functions API? Spark version 3.1.1.
I am trying to calculate the sum of an array column using aggregate.
Works:
import pyspark.sql.functions as f
from pyspark.sql.types import *
sdf_login_7.withColumn("new_col", f.array(f.col("a"), f.col("b"))
                                   .cast(ArrayType(IntegerType())))\
    .withColumn("sum", f.expr('aggregate(new_col, 0L, (acc,x) -> acc+x)'))\
    .select(["a", "b", "new_col", "sum"]).show(3)
+---+---+--------+---+
|  a|  b| new_col|sum|
+---+---+--------+---+
| 10| 41|[10, 41]| 51|
| 11| 74|[11, 74]| 85|
| 11| 80|[11, 80]| 91|
+---+---+--------+---+
only showing top 3 rows
Doesn't work:
sdf_login_7.withColumn("new_col", f.array(f.col("a"), f.col("b"))
                                   .cast(ArrayType(IntegerType())))\
    .withColumn("sum", f.aggregate("new_col", f.lit(0), lambda acc, x: acc+x))\
    .select(["a", "b", "new_col", "sum"]).show(3)
Py4JError:
org.apache.spark.sql.catalyst.expressions.UnresolvedNamedLambdaVariable.freshVarName
does not exist in the JVM

Spark 3.1 String Array to Date Array Conversion Error

I want to find out whether this array contains this date or not; if yes, I need to flag it in a new column.
dataset = dataset.withColumn("incoming_timestamp", col("incoming_timestamp").cast("timestamp"))
        .withColumn("incoming_date", to_date(col("incoming_timestamp")));
My incoming_timestamp is 2021-03-30 00:00:00; after converting to date it is 2021-03-30.
The output dataset looks like this:
+-----+-------------------+-------------+
|col 1|incoming_timestamp |incoming_date|
+-----+-------------------+-------------+
|val1 |2021-03-30 00:00:00|2021-07-06   |
|val2 |2020-03-30 00:00:00|2020-03-30   |
|val3 |1889-03-30 00:00:00|1889-03-30   |
+-----+-------------------+-------------+
I have a String declared like this:
String Dates = "2021-07-06,1889-03-30";
I want to add one more column to the result dataset indicating whether the incoming date is present in the Dates string, like this:
+-----+-------------------+-------------+------+
|col 1|incoming_timestamp |incoming_date|result|
+-----+-------------------+-------------+------+
|val1 |2021-03-30 00:00:00|2021-07-06   |true  |
|val2 |2020-03-30 00:00:00|2020-03-30   |false |
|val3 |1889-03-30 00:00:00|1889-03-30   |true  |
+-----+-------------------+-------------+------+
For that, first I need to convert this String into an array; then array_contains(array, value) returns true if the array contains the value.
I tried the following:
METHOD 1
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
Date[] dateArr = Arrays.stream(Dates.split(","))
        .map(d -> LocalDate.parse(d, formatter))
        .toArray(Date[]::new);
It throws an error: java.lang.ArrayStoreException: java.time.LocalDate
METHOD 2
SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd", Locale.ENGLISH);
formatter.setTimeZone(TimeZone.getTimeZone("America/New_York"));
Date[] dateArr = Arrays.stream(Dates.split(",")).map(d -> {
    try {
        return formatter.parse(d);
    } catch (ParseException e) {
        e.printStackTrace();
    }
    return null;
}).toArray(Date[]::new);
dataset = dataset.withColumn("result", array_contains(col("incoming_date"), dateArr));
It throws an error:
org.apache.spark.sql.AnalysisException: Unsupported component type class java.util.Date in arrays
Can anyone help on this?
This can be solved by converting the String values to java.sql.Date.
import java.sql.Date

val data: Seq[(String, String)] = Seq(
  ("val1", "2020-07-31 00:00:00"),
  ("val2", "2021-02-28 00:00:00"),
  ("val3", "2019-12-31 00:00:00"))

val compareDate = "2020-07-31, 2019-12-31"
val compareDateArray = compareDate.split(",").map(x => Date.valueOf(x.trim))

import spark.implicits._
val df = data.toDF("variable", "date")
  .withColumn("date_casted", to_date(col("date"), "y-M-d H:m:s"))
df.show()

val outputDf = df.withColumn("result", col("date_casted").isin(compareDateArray: _*))
outputDf.show()
Input:
+--------+-------------------+-----------+
|variable| date|date_casted|
+--------+-------------------+-----------+
| val1|2020-07-31 00:00:00| 2020-07-31|
| val2|2021-02-28 00:00:00| 2021-02-28|
| val3|2019-12-31 00:00:00| 2019-12-31|
+--------+-------------------+-----------+
root
|-- variable: string (nullable = true)
|-- date: string (nullable = true)
|-- date_casted: date (nullable = true)
Output:
+--------+-------------------+-----------+------+
|variable| date|date_casted|result|
+--------+-------------------+-----------+------+
| val1|2020-07-31 00:00:00| 2020-07-31| true|
| val2|2021-02-28 00:00:00| 2021-02-28| false|
| val3|2019-12-31 00:00:00| 2019-12-31| true|
+--------+-------------------+-----------+------+
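An alternative sketch (my addition, not part of the answer above): the comparison array can also be built as a column expression, so no JVM Date objects are needed at all.
import org.apache.spark.sql.functions._
// Split the literal string into an array<string>, cast it to array<date>,
// and check membership with array_contains.
val dateArrayCol = split(lit("2020-07-31, 2019-12-31"), ",\\s*").cast("array<date>")
val outputDf2 = df.withColumn("result", array_contains(dateArrayCol, col("date_casted")))
outputDf2.show()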

Pyspark: joining on date-type data

For example, I have two dataframes in Pyspark.
A_dataframe (table name: link_data_test), which is very big, about 1 billion rows:
+-----+--------------------+--------------+
|   id|           link_date|      tuch_url|
+-----+--------------------+--------------+
|day_1|2020-01-01 06:00:...|www.google.com|
|day_2|2020-01-01 11:00:...|www.33e.......|
|day_3|2020-01-03 22:21:...|www.3tg.......|
|day_4|2019-01-04 20:00:...|www.96g.......|
|  ...|                 ...|           ...|
+-----+--------------------+--------------+
B_dataframe (table name: url_data_test):
+--------------+----------+
|           url|extra_date|
+--------------+----------+
|www.google.com|2019-02-01|
|www.23........|2020-01-02|
|www.hsi.......|2020-01-03|
|www.cc........|2020-01-05|
|           ...|       ...|
+--------------+----------+
I can use spark.sql() to create a query:
sql_str = """
select
    t1.*, t2.*
from
    link_data_test as t1
inner join
    url_data_test as t2
on
    t1.link_date > t2.extra_date and t1.link_date < date_add(t2.extra_date, 8)
where
    t1.tuch_url like "%t2.url%"
"""
test1 = spark.sql(sql_str).write.saveAsTable("xxxx", mode="overwrite")
I want to replace the SQL above with the DataFrame API for some other tests, but I don't know how to write it:
A_dataframe.join(B_dataframe, ......, 'inner').select(....).write.saveAsTable("xxxx", mode="overwrite")
Thank you for your help!
Here is one way.
df1 = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv")
df1.show(10, False)
df2 = spark.read.option("header","true").option("inferSchema","true").csv("test2.csv")
df2.show(10, False)
+-----+-------------------+--------------+
|id   |link_date          |tuch_url      |
+-----+-------------------+--------------+
|day_1|2020-01-08 23:59:59|www.google.com|
+-----+-------------------+--------------+
+--------------+----------+
|url           |extra_date|
+--------------+----------+
|www.google.com|2020-01-01|
+--------------+----------+
from pyspark.sql.functions import broadcast, col, date_add

df1.join(broadcast(df2),
         col('link_date').between(col('extra_date'), date_add('extra_date', 7))
         & col('tuch_url').contains(col('url')),  # i.e. t1.tuch_url like "%t2.url%"
         'inner') \
    .show(10, False)
+-----+-------------------+--------------+--------------+----------+
|id   |link_date          |tuch_url      |url           |extra_date|
+-----+-------------------+--------------+--------------+----------+
|day_1|2020-01-08 23:59:59|www.google.com|www.google.com|2020-01-01|
+-----+-------------------+--------------+--------------+----------+

how to handle this in spark

I am using spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have a scenario with finance data coming from a Kafka topic. The data (base dataset) contains companyId, year, and prev_year fields.
If year === prev_year, then I need to join with a different table, i.e. exchange_rates.
If year =!= prev_year, then I need to return the base dataset itself.
How do I do this in spark-sql?
You can refer to the approach below for your case.
scala> Input_df.show
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
|        1|2016|     2017|  12|
|        1|2017|     2017|21.4|
|        2|2018|     2017|11.7|
|        2|2018|     2018|44.6|
|        3|2016|     2017|34.5|
|        4|2017|     2017|  56|
+---------+----+---------+----+
scala> exch_rates.show
+---------+----+
|companyId|rate|
+---------+----+
|        1|12.3|
|        2|12.5|
|        3|22.3|
|        4|34.6|
|        5|45.2|
+---------+----+
scala> val equaldf = Input_df.filter(col("year") === col("prev_year"))
scala> val notequaldf = Input_df.filter(col("year") =!= col("prev_year"))
scala> val joindf = notequaldf.alias("n").drop("rate").join(exch_rates.alias("e"), List("companyId"), "left")
scala> val finalDF = equaldf.union(joindf)
scala> finalDF.show()
+---------+----+---------+----+
|companyId|year|prev_year|rate|
+---------+----+---------+----+
|        1|2017|     2017|21.4|
|        2|2018|     2018|44.6|
|        4|2017|     2017|  56|
|        1|2016|     2017|12.3|
|        2|2018|     2017|12.5|
|        3|2016|     2017|22.3|
+---------+----+---------+----+
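An alternative sketch (my addition, not from the answer above): the same rows can be produced in a single pass, without splitting and unioning, by joining once and picking the rate conditionally.
import org.apache.spark.sql.functions._
// Join the exchange rates once, then keep the original rate when
// year === prev_year and take the exchange rate otherwise.
val singlePass = Input_df
  .join(exch_rates.withColumnRenamed("rate", "exch_rate"), Seq("companyId"), "left")
  .withColumn("rate", when(col("year") === col("prev_year"), col("rate")).otherwise(col("exch_rate")))
  .drop("exch_rate")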

best way to write Spark Dataset[U]

Ok, so I am aware of the fact that Dataset.as[U] just changes the view of the DataFrame for typed operations.
As seen in this example:
case class One(one: Int)

val df = Seq(
  (1, 2, 3),
  (11, 22, 33),
  (111, 222, 333)
).toDF("one", "two", "three")

val ds: Dataset[One] = df.as[One]
ds.show
prints
+----+----+-----+
| one| two|three|
+----+----+-----+
|   1|   2|    3|
|  11|  22|   33|
| 111| 222|  333|
+----+----+-----+
This is totally fine and works in my favor most of the time. BUT now I need to write that ds to disk, with only that column one.
To enforce the schema I could do .map(x => x): since this is a typed operation, the case class schema takes effect. This also results in a Dataset[One], with the underlying data reduced to column one. It just seems awfully expensive, looking at the execution plan:
== Physical Plan ==
*SerializeFromObject [assertnotnull(input[0, $line2012488405320.$read$$iw$$iw$One, true]).one AS one#9408]
+- *MapElements <function1>, obj#9407: $line2012488405320.$read$$iw$$iw$One
+- *DeserializeToObject newInstance(class $line2012488405320.$read$$iw$$iw$One), obj#9406: $line2012488405320.$read$$iw$$iw$One
+- LocalTableScan [one#9391]
What are alternative implementations to achieve
ds.show
+----+
| one|
+----+
|   1|
|  11|
| 111|
+----+
UPDATE 1
I was thinking of a general solution to the problem, maybe something along these lines:
import scala.reflect.runtime.universe.{MethodSymbol, TypeTag, typeOf}
import org.apache.spark.sql.{Column, DataFrame, Dataset}
import org.apache.spark.sql.functions.col

def caseClassAccessorNames[T <: Product](implicit tag: TypeTag[T]): Iterable[String] =
  typeOf[T]
    .members
    .collect {
      case m: MethodSymbol if m.isCaseAccessor => m.name
    }
    .map(_.toString)

def project[T <: Product](ds: Dataset[T])(implicit tag: TypeTag[T]): Dataset[T] = {
  import ds.sparkSession.implicits._
  val columnsOfT: Seq[Column] =
    caseClassAccessorNames[T].map(col)(scala.collection.breakOut)
  val t: DataFrame = ds.select(columnsOfT: _*)
  t.as[T]
}
I managed to get this working for the trivial example, but need to evaluate it further. I wonder if there are alternative, maybe built-in, ways to achieve something like this?
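For what it's worth, a minimal usage sketch of the project helper above (my addition, assuming the One case class and the ds from the example):
// Hypothetical usage of the project helper defined above.
val reduced: Dataset[One] = project(ds)
reduced.show()     // expected to print only the "one" column
reduced.explain()  // the plan should be a plain projection, without the
                   // SerializeFromObject/DeserializeToObject round trip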
