Spark SQL: replace collect_list empty lists with null - apache-spark

I have the below data in a dataframe:
+----------+--------------+-------------------+---------------+
|id |mid |ppp |qq |
+----------+--------------+-------------------+---------------+
|A |4 |[{P}] |null |
|B |4 |[{P}] |null |
|A |4 |null |[{P}] |
|A |4 |null |[{Q}] |
|C |4 |null |[{Q}] |
|D |4 |null |[{Q}] |
|A |4 |null |[{R}] |
+----------+--------------+-------------------+---------------+
I have the below code:
String[] array = {"id", "mid", "ppp", "qq"};
List<String> fields = Arrays.asList(array);
Column[] columns = fields
.stream()
.filter(field -> !field.equals("id") && !field.equals("mid"))
.map(column -> flatten(when(size(collect_list(column)).equalTo(0), null)
.otherwise(collect_list(column)))
.as(column))
.collect(Collectors.toList()).toArray(new Column[0]);
Dataset<Row> output = df
.groupBy(functions.col("id"), functions.col("mid"))
.agg(columns[0], Arrays.copyOfRange(columns, 1, columns.length));
The above code groups by id and mid, and then collect_list collects the elements of ppp and qq into arrays in both columns.
Output:
+----------+--------------+-------------------+----------------+
|id |mid | ppp |qq |
+----------+--------------+-------------------+----------------+
|A |4 |[[P]] |[[R], [P], [Q]] |
|B |4 |null |[[Q]] |
|C |4 |[[P]] |null |
|D |4 |null |[[Q]] |
+----------+--------------+-------------------+----------------+
The code works exactly as required: if collect_list creates an empty list, I replace it with null.
Is there a way to avoid calling collect_list twice (once in when and once in otherwise) and still replace empty lists with null?

Of course you can do that: just call size on the aggregated array and set it to null if it is 0, something like:
df
.groupBy()
.agg(
collect_list($"mycol").as("arr_mycol")
)
// set empty arrays to null
.withColumn("arr_mycol",when(size($"arr_mycol")>0,$"arr_mycol"))

Related

How to use when() .otherwise function in Spark with multiple conditions

This is my first post so let me know if I need to give more details.
I am trying to create a boolean column, "immediate", that shows true when at least one of the columns has some data in it. If all are null then the column should be false. I am using the when().otherwise function in Spark but I'm not getting the result I would expect.
Below is the code I'm using:
val evaluation = evaluation_raw
.withColumn("immediate",
when(col("intended_outcome_review").isNull
&& col("outcome").isNull
&& col("impact").isNull
&& col("impact_self").isNull
&& col("next_step").isNull,
lit(false))
.otherwise(lit(true)))
.select(
col("id"),
col("intended_outcome_review"),
col("outcome"),
col("impact"),
col("impact_self"),
col("next_step"),
col("immediate"))
Desired outcome:
+--------+------------------------+-------------+-------+------------+----------+----------+
|id |intended_outcome_review |outcome |impact |impact_self |next_step |immediate |
+--------+------------------------+-------------+-------+------------+----------+----------+
|1568 |null |null |4 |3 |null |true |
|1569 |null |null |null |null |null |false |
|1570 |null |null |null |null |null |false |
|1571 |1 |improved coms|3 |3 |email prof|true |
+--------+------------------------+-------------+-------+------------+----------+----------+
Actual outcome:
+--------+------------------------+-------------+-------+------------+----------+----------+
|id |intended_outcome_review |outcome |impact |impact_self |next_step |immediate |
+--------+------------------------+-------------+-------+------------+----------+----------+
|1568 |null |null |4 |3 |null |true |
|1569 |null |null |null |null |null |true |
|1570 |null |null |null |null |null |false |
|1571 |1 |improved coms|3 |3 |email prof|true |
+--------+------------------------+-------------+-------+------------+----------+----------+
If anyone knows what I may be doing wrong please let me know.
Thanks!
You can use a trick: cast each column's isNotNull() to an int and sum them. If the sum is above 0 then at least one column has data, so immediate is true.
.withColumn(
    'immediate',
    (
        F.col('intended_outcome_review').isNotNull().cast('int') +
        F.col('outcome').isNotNull().cast('int') +
        F.col('impact').isNotNull().cast('int') +
        F.col('impact_self').isNotNull().cast('int') +
        F.col('next_step').isNotNull().cast('int')
    ) > 0
)
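Since the question is in Scala, here is a minimal Scala sketch of the same trick (assuming the column names from the question):
import org.apache.spark.sql.functions.col

// Count how many of the checked columns are non-null; immediate is true when at least one has data.
val checked = Seq("intended_outcome_review", "outcome", "impact", "impact_self", "next_step")
val notNullCount = checked.map(c => col(c).isNotNull.cast("int")).reduce(_ + _)

val evaluation = evaluation_raw.withColumn("immediate", notNullCount > 0)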
It turns out some of the columns are converted from null to "" when other parts of the form are filled out.
The answer below considers both empty strings and null values:
.withColumn("immediate",
when((col("intended_outcome_review").isNull || col("intended_outcome_review") ==="")
&& (col("outcome").isNull || col("outcome") === "")
&& (col("impact").isNull || col("outcome") === "")
&& (col("impact_self").isNull || col("impact_self") === "")
&& (col("next_step").isNull || col("next_step") === ""),
lit(false))
.otherwise(lit(true)))
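If you prefer not to spell every column out twice, here is a compact sketch of the same null-or-empty check built from a list of column names (assuming the columns from the question):
import org.apache.spark.sql.functions.col

// A column is "empty" when it is null or the empty string; immediate is true unless all are empty.
val checked = Seq("intended_outcome_review", "outcome", "impact", "impact_self", "next_step")
val allEmpty = checked.map(c => col(c).isNull || col(c) === "").reduce(_ && _)

val evaluation = evaluation_raw.withColumn("immediate", !allEmpty)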

Filter rows with minimum and maximum count

This is what the dataframe looks like:
+---+-----------------------------------------+-----+
|eco|eco_name |count|
+---+-----------------------------------------+-----+
|B63|Sicilian, Richter-Rauzer Attack |5 |
|D86|Grunfeld, Exchange |3 |
|C99|Ruy Lopez, Closed, Chigorin, 12...cd |5 |
|A44|Old Benoni Defense |3 |
|C46|Three Knights |1 |
|C08|French, Tarrasch, Open, 4.ed ed |13 |
|E59|Nimzo-Indian, 4.e3, Main line |2 |
|A20|English |2 |
|B20|Sicilian |4 |
|B37|Sicilian, Accelerated Fianchetto |2 |
|A33|English, Symmetrical |8 |
|C77|Ruy Lopez |8 |
|B43|Sicilian, Kan, 5.Nc3 |10 |
|A04|Reti Opening |6 |
|A59|Benko Gambit |1 |
|A54|Old Indian, Ukrainian Variation, 4.Nf3 |3 |
|D30|Queen's Gambit Declined |19 |
|C01|French, Exchange |3 |
|D75|Neo-Grunfeld, 6.cd Nxd5, 7.O-O c5, 8.dxc5|1 |
|E74|King's Indian, Averbakh, 6...c5 |2 |
+---+-----------------------------------------+-----+
Schema:
root
|-- eco: string (nullable = true)
|-- eco_name: string (nullable = true)
|-- count: long (nullable = false)
I want to filter it so that only the two rows with the minimum and maximum counts remain.
The output dataframe should look something like:
+---+-----------------------------------------+--------------------+
|eco|eco_name |number_of_occurences|
+---+-----------------------------------------+--------------------+
|D30|Queen's Gambit Declined |19 |
|C46|Three Knights |1 |
+---+-----------------------------------------+--------------------+
I'm a beginner, I'm really sorry if this is a stupid question.
No need to apologize since this is the place to learn! One of the solutions is to use a Window and rank to find the min/max rows:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [('a', 1), ('b', 1), ('c', 2), ('d', 3)],
    schema=['col1', 'col2']
)
df.show(10, False)
+----+----+
|col1|col2|
+----+----+
|a |1 |
|b |1 |
|c |2 |
|d |3 |
+----+----+
Just use filtering to find the min/max count row after the ranking:
df\
.withColumn('min_row', func.rank().over(Window.orderBy(func.asc('col2'))))\
.withColumn('max_row', func.rank().over(Window.orderBy(func.desc('col2'))))\
.filter((func.col('min_row') == 1) | (func.col('max_row') == 1))\
.show(100, False)
+----+----+-------+-------+
|col1|col2|min_row|max_row|
+----+----+-------+-------+
|d |3 |4 |1 |
|a |1 |1 |3 |
|b |1 |1 |3 |
+----+----+-------+-------+
Please note that if several rows tie for the min/max count, they will all be kept (as with a and b above).
You can use the row_number function twice to order records by count, ascending and descending:
SELECT eco, eco_name, count
FROM (SELECT *,
row_number() over (order by count asc) as rna,
row_number() over (order by count desc) as rnd
FROM df)
WHERE rna = 1 or rnd = 1;
Note there's a tie for count = 1. If you care about it, add a secondary sort to control which record is selected, or use rank instead to select all tied rows.
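Applied to the dataframe from the question with the DataFrame API instead of SQL (a sketch assuming it is called df), the same idea looks like this:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

# Rank by count in both directions, keep the top-ranked row of each, then drop the helper columns.
result = (
    df
    .withColumn('rna', func.row_number().over(Window.orderBy(func.asc('count'))))
    .withColumn('rnd', func.row_number().over(Window.orderBy(func.desc('count'))))
    .filter((func.col('rna') == 1) | (func.col('rnd') == 1))
    .drop('rna', 'rnd')
)
result.show(truncate=False)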

When dynamically generating a join condition as a list in PySpark, how to apply "OR" between the elements instead of "AND"?

I am joining two dataframes site_bs and site_wrk_int1 and creating site_wrk using a dynamic join condition.
My code is like below:
join_cond=[ col(v_col) == col('wrk_'+v_col) for v_col in primaryKeyCols] #result would be
site_wrk=site_bs.join(site_wrk_int1,join_cond,'inner').select(*site_bs.columns)
join_cond will be dynamic and the value will be something like [ col(id) == col(wrk_id), col(id) == col(wrk_parentId)]
With the above join condition, the join only happens when both conditions are satisfied, i.e., the join condition will be
id = wrk_id and id = wrk_parentId
But I want an OR condition to be applied instead, like below:
id = wrk_id or id = wrk_parentId
How do I achieve this in PySpark?
Since logical operations on PySpark columns return column objects, you can chain these conditions in the join statement, such as:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
(1, "A", "A"),
(2, "C", "C"),
(3, "E", "D"),
], ['id', 'col1', 'col2']
)
df.show()
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| A| A|
| 2| C| C|
| 3| E| D|
+---+----+----+
df.alias("t1").join(
df.alias("t2"),
(f.col("t1.col1") == f.col("t2.col2")) | (f.col("t1.col1") == f.lit("E")),
"left_outer"
).show(truncate=False)
+---+----+----+---+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+---+----+----+
|1 |A |A |1 |A |A |
|2 |C |C |2 |C |C |
|3 |E |D |1 |A |A |
|3 |E |D |2 |C |C |
|3 |E |D |3 |E |D |
+---+----+----+---+----+----+
As you can see, rows 1 and 2 match on col1 == col2, and row 3 (where col1 == 'E') matches every row of the right-hand side. In terms of syntax, it's important that the individual conditions combined with the Python operators (|, &, ...) are wrapped in parentheses as in the example above, otherwise you might get confusing py4j errors.
Alternatively, if you wish to keep a notation similar to the one in your question, you can use functools.reduce and operator.or_ to apply this logic to your list.
In this first example, the list of conditions is combined with AND, so nothing matches and I get only NULLs on the right-hand side, as expected:
df.alias("t1").join(
df.alias("t2"),
[f.col("t1.col1") == f.col("t2.col2"), f.col("t1.col1") == f.lit("E")],
"left_outer"
).show(truncate=False)
+---+----+----+----+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+----+----+----+
|3 |E |D |null|null|null|
|1 |A |A |null|null|null|
|2 |C |C |null|null|null|
+---+----+----+----+----+----+
In this example, I leverage functools and operator to get the same result as above:
import functools
import operator

df.alias("t1").join(
    df.alias("t2"),
    functools.reduce(
        operator.or_,
        [f.col("t1.col1") == f.col("t2.col2"), f.col("t1.col1") == f.lit("E")]),
    "left_outer"
).show(truncate=False)
+---+----+----+---+----+----+
|id |col1|col2|id |col1|col2|
+---+----+----+---+----+----+
|1 |A |A |1 |A |A |
|2 |C |C |2 |C |C |
|3 |E |D |1 |A |A |
|3 |E |D |2 |C |C |
|3 |E |D |3 |E |D |
+---+----+----+---+----+----+
I am quite new to Spark SQL.
Please let me know if this could be a solution:
site_wrk = site_bs.join(site_work_int1, [(site_bs.id == site_work_int1.wrk_id) | (site_bs.id == site_work_int1.wrk_parentId)], how = "inner")
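For the dynamic list in the original question, the same reduce idea would look roughly like this (a sketch assuming primaryKeyCols, site_bs and site_wrk_int1 as defined there):
import functools
import operator

from pyspark.sql.functions import col

# OR together one equality condition per primary-key column.
join_cond = functools.reduce(
    operator.or_,
    [col(v_col) == col('wrk_' + v_col) for v_col in primaryKeyCols]
)
site_wrk = site_bs.join(site_wrk_int1, join_cond, 'inner').select(*site_bs.columns)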

Can we reorder spark dataframe's columns?

I am creating a dataframe as per the given schema; after that I want to create a new dataframe by reordering the columns of the existing one.
Is it possible to reorder the columns of a Spark dataframe?
object Demo extends Context {
  def main(args: Array[String]): Unit = {
    val emp = Seq(
      (1, "Smith", -1, "2018", "10", "M", 3000),
      (2, "Rose", 1, "2010", "20", "M", 4000),
      (3, "Williams", 1, "2010", "10", "M", 1000),
      (4, "Jones", 2, "2005", "10", "F", 2000),
      (5, "Brown", 2, "2010", "40", "", -1),
      (6, "Brown", 2, "2010", "50", "", -1)
    )
    val empColumns = Seq("emp_id", "name", "superior_emp_id", "year_joined",
      "emp_dept_id", "gender", "salary")

    import sparkSession.sqlContext.implicits._

    val empDF = emp.toDF(empColumns: _*)
    empDF.show(false)
  }
}
Current DF:
+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1 |Smith |-1 |2018 |10 |M |3000 |
|2 |Rose |1 |2010 |20 |M |4000 |
|3 |Williams|1 |2010 |10 |M |1000 |
|4 |Jones |2 |2005 |10 |F |2000 |
|5 |Brown |2 |2010 |40 | |-1 |
|6 |Brown |2 |2010 |50 | |-1 |
+------+--------+---------------+-----------+-----------+------+------+
I want the output as the following df, where the gender and salary columns are re-ordered:
New DF:
+------+--------+------+------+---------------+-----------+-----------+
|emp_id|name |gender|salary|superior_emp_id|year_joined|emp_dept_id|
+------+--------+------+------+---------------+-----------+-----------+
|1 |Smith |M |3000 |-1 |2018 |10 |
|2 |Rose |M |4000 |1 |2010 |20 |
|3 |Williams|M |1000 |1 |2010 |10 |
|4 |Jones |F |2000 |2 |2005 |10 |
|5 |Brown | |-1 |2 |2010 |40 |
|6 |Brown | |-1 |2 |2010 |50 |
+------+--------+------+------+---------------+-----------+-----------+
Just use select() to re-order the columns:
df = df.select('emp_id','name','gender','salary','superior_emp_id','year_joined','emp_dept_id')
The columns will be returned in the order you list them in the select() arguments.
The Scala way of doing it:
import org.apache.spark.sql.functions.col

// Order the column names as you want
val columns = Array("emp_id", "name", "gender", "salary", "superior_emp_id", "year_joined", "emp_dept_id")
  .map(col)

// Pass it to select
df.select(columns: _*)
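If you'd rather not type every column name, here is a Scala sketch (assuming the empDF from the question) that moves gender and salary right after name and keeps the remaining columns in their original order:
import org.apache.spark.sql.functions.col

// Split the remaining columns around "name" and splice the moved ones in after it.
val moved = Seq("gender", "salary")
val rest = empDF.columns.filterNot(moved.contains)
val (front, back) = rest.splitAt(rest.indexOf("name") + 1)
val reordered = (front ++ moved ++ back).map(col)

empDF.select(reordered: _*).show(false)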

Calculating sum, count of multiple top-K values in Spark

I have an input dataframe of the format
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A   |1000  |0    |1         |
|B   |947   |0    |2         |
|C   |923   |1    |3         |
|D   |900   |2    |4         |
|E   |850   |3    |5         |
|F   |800   |1    |6         |
+----+------+-----+----------+
I need to get sum(values) when score > 0 and row_number < K, i.e., the SUM of all values when score > 0 for the top K values in the dataframe.
I am able to achieve this by running the following query for the top 100 values:
val top_100_data = df.select(
  count(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("count_100"),
  sum(when(col("score") > 0 and col("row_number") <= 100, col("values"))).alias("sum_filtered_100"),
  sum(when(col("row_number") <= 100, col("values"))).alias("total_sum_100")
)
However, I need to fetch data for the top 100, 200, 300, ..., 2500, meaning I would need to run this query 25 times and finally union 25 dataframes.
I'm new to Spark and still figuring lots of things out. What would be the best approach to solve this problem?
Thanks!!
You can create an array of limits as:
val topFilters = Array(100, 200, 300) // you can add more
Then you can loop through the topFilters array and create the dataframe you require. I suggest you use join rather than union, as join will give you separate columns and union will give you separate rows. You can do the following.
Given your dataframe as
+----+------+-----+----------+
|name|values|score|row_number|
+----+------+-----+----------+
|A |1000 |0 |1 |
|B |947 |0 |2 |
|C |923 |1 |3 |
|D |900 |2 |200 |
|E |850 |3 |150 |
|F |800 |1 |250 |
+----+------+-----+----------+
You can do this by using the topFilters array defined above:
import sqlContext.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

var finalDF : DataFrame = Seq("1").toDF("rowNum")
for(k <- topFilters) {
  val top_100_data = df.select(lit("1").as("rowNum"), sum(when(col("score") > 0 && col("row_number") < k, col("values"))).alias(s"total_sum_$k"))
  finalDF = finalDF.join(top_100_data, Seq("rowNum"))
}
finalDF.show(false)
Which should give you final dataframe as
+------+-------------+-------------+-------------+
|rowNum|total_sum_100|total_sum_200|total_sum_300|
+------+-------------+-------------+-------------+
|1 |923 |1773 |3473 |
+------+-------------+-------------+-------------+
You can do the same for your 25 limits that you have.
If you intend to use union, then the idea is similar to above.
I hope the answer is helpful
Updated
If you require union, then you can apply the following logic with the same limit array defined above:
var finalDF : DataFrame = Seq((0, 0, 0, 0)).toDF("limit", "count", "sum_filtered", "total_sum")
for(k <- topFilters) {
val top_100_data = df.select(lit(k).as("limit"), count(when(col("score") > 0 and col("row_number")<=k, col("values"))).alias("count"),
sum(when(col("score") > 0 and col("row_number")<=k, col("values"))).alias("sum_filtered"),
sum(when(col("row_number") <=k, col("values"))).alias("total_sum"))
finalDF = finalDF.union(top_100_data)
}
finalDF.filter(col("limit") =!= 0).show(false)
which should give you
+-----+-----+------------+---------+
|limit|count|sum_filtered|total_sum|
+-----+-----+------------+---------+
|100 |1 |923 |2870 |
|200 |3 |2673 |4620 |
|300 |4 |3473 |5420 |
+-----+-----+------------+---------+
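If you want to avoid looping over the limits entirely, here is a sketch that builds one set of conditional aggregates per limit and computes them all in a single pass over the data (assuming the df and topFilters from above):
import org.apache.spark.sql.functions._

// One count/sum/total triple per limit, all evaluated in a single aggregation.
val aggs = topFilters.flatMap { k =>
  Seq(
    count(when(col("score") > 0 && col("row_number") <= k, col("values"))).alias(s"count_$k"),
    sum(when(col("score") > 0 && col("row_number") <= k, col("values"))).alias(s"sum_filtered_$k"),
    sum(when(col("row_number") <= k, col("values"))).alias(s"total_sum_$k")
  )
}

df.select(aggs: _*).show(false)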
