Spark Dataframe computing percentile on an array - apache-spark

I need to compute quantiles on a numeric field in Spark after a group by operation. Is there a way to apply approxPercentile to an aggregated list instead of a column?
E.g.
The Dataframe looks like
k1 | k2 | k3 | v1
a1 | b1 | c1 | 879
a2 | b2 | c2 | 769
a1 | b1 | c1 | 129
a2 | b2 | c2 | 323
I need to first run groupBy(k1, k2, k3) and collect_list(v1), and then compute quantiles [10th, 50th, ...] on the list of v1 values.

You can use percentile_approx in Spark SQL.
Assuming your data is in df, you can do:
df.registerTempTable("df_tmp")
val dfWithPercentiles = sqlContext.sql(
  "select k1, k2, k3, " +
  "percentile_approx(v1, 0.05) as 5th, " +
  "percentile_approx(v1, 0.50) as 50th, " +
  "percentile_approx(v1, 0.95) as 95th " +
  "from df_tmp group by k1, k2, k3")
On your sample data, this gives:
+---+---+---+-----+-----+-----------------+
| k1| k2| k3|  5th| 50th|             95th|
+---+---+---+-----+-----+-----------------+
| a1| b1| c1|129.0|129.0|803.9999999999999|
| a2| b2| c2|323.0|323.0|            724.4|
+---+---+---+-----+-----+-----------------+
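If you prefer to stay in the DataFrame API, here is a hedged sketch of the same aggregation, calling the percentile_approx SQL function through expr (assuming the DataFrame df from the question and that percentile_approx is available, as in the SQL above):
import org.apache.spark.sql.functions.expr
// Aggregate directly per group; no temp table or collect_list is needed.
// percentile_approx also accepts an array of percentiles if you want a single array column.
val percentiles = df
  .groupBy("k1", "k2", "k3")
  .agg(
    expr("percentile_approx(v1, 0.05)").as("5th"),
    expr("percentile_approx(v1, 0.50)").as("50th"),
    expr("percentile_approx(v1, 0.95)").as("95th"))
percentiles.show()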

Related

How to group by rollup on only some columns in Apache Spark SQL?

I'm using the SQL API for Spark 3.0 in a Databricks 7.0 runtime cluster. I know that I can do the following:
select
coalesce(a, "All A") as colA,
coalesce(b, "All B") as colB,
sum(c) as sumC
from
myTable
group by rollup (
colA,
colB
)
order by
colA asc,
colB asc
I'd then expect an output like:
+-------+-------+------+
| colA  | colB  | sumC |
+-------+-------+------+
| All A | All B |  300 |
| a1    | All B |  100 |
| a1    | b1    |   30 |
| a1    | b2    |   70 |
| a2    | All B |  200 |
| a2    | b1    |   50 |
| a2    | b2    |  150 |
+-------+-------+------+
However, I'm trying to write a query where only column b needs to be rolled up. I've written something like:
select
a as colA,
coalesce(b, "All B") as colB,
sum(c) as sumC
from
myTable
group by
a,
rollup (b)
order by
colA asc,
colB asc
And I'd expect an output like:
+-------+-------+------+
| colA  | colB  | sumC |
+-------+-------+------+
| a1    | All B |  100 |
| a1    | b1    |   30 |
| a1    | b2    |   70 |
| a2    | All B |  200 |
| a2    | b1    |   50 |
| a2    | b2    |  150 |
+-------+-------+------+
I know this sort of operation is supported in at least some SQL APIs, but I get Error in SQL statement: UnsupportedOperationException when trying to run the above query. Does anyone know whether this behavior is simply as-of-yet unsupported in Spark 3.0 or if I just have the syntax wrong? The docs aren't helpful on the subject.
I know that I can accomplish this with union all, but I'd prefer to avoid that route, if only for the sake of elegance and brevity.
Thanks in advance, and please let me know if I can clarify anything.
Try this GROUPING SETS option:
%sql
SELECT
COALESCE( a, 'all a' ) a,
COALESCE( b, 'all b' ) b,
SUM(c) c
FROM myTable
GROUP BY a, b
GROUPING SETS ( ( a , b ), a )
ORDER BY a, b
My results (with updated numbers):
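As a rough end-to-end sketch of the same query in Scala (assuming a SparkSession named spark; the sample rows are taken from the expected output above, and the table name myTable is from the question):
import spark.implicits._
// Hypothetical sample data shaped like myTable in the question.
Seq(("a1", "b1", 30), ("a1", "b2", 70), ("a2", "b1", 50), ("a2", "b2", 150))
  .toDF("a", "b", "c")
  .createOrReplaceTempView("myTable")
val rolledUp = spark.sql(
  "SELECT COALESCE(a, 'all a') a, COALESCE(b, 'all b') b, SUM(c) c " +
  "FROM myTable " +
  "GROUP BY a, b GROUPING SETS ((a, b), a) " +
  "ORDER BY a, b")
rolledUp.show()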

select top max rows which has sum = 50 in dataframe

This is my dataframe
+--------------+-----------+------------------+
|           _c3|sum(number)|              perc|
+--------------+-----------+------------------+
|        France|    5170305|1.3201573334529797|
|       Germany|    9912088|2.5308982087190754|
|       Vietnam|   14729566| 3.760966630301244|
|United Kingdom|   19435674| 4.962598446648971|
|   Philippines|   21994132| 5.615861086093151|
|         Japan|   35204549| 8.988936539189615|
|         China|   39453426|10.073821498682275|
|     Hong Kong|   39666589|  10.1282493704753|
|      Thailand|   57202857|14.605863902228613|
|      Malaysia|   72364309| 18.47710593603423|
|     Indonesia|   76509597|19.535541048174547|
+--------------+-----------+------------------+
I want to select only the top countries that sum up to 50 percent of passengers (country, number of passengers, percentage of passengers). How can I do it?
You can use a running total to store cumulative percentage and then filter by it. So, assuming your dataframe is small enough, something like this should do it:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val result = df.withColumn("cumulativepercentage",
    sum("perc").over(Window.orderBy(col("perc").desc)))
  .where(col("cumulativepercentage").leq(50))
result.show(false)
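The same logic as a hedged Spark SQL sketch (assuming a SparkSession named spark; the view name countries is made up). The subquery is needed because a window function cannot be referenced directly in a WHERE clause:
df.createOrReplaceTempView("countries")
val resultSql = spark.sql(
  "SELECT _c3, `sum(number)`, perc FROM ( " +
  "  SELECT *, SUM(perc) OVER (ORDER BY perc DESC) AS cumulativepercentage " +
  "  FROM countries ) t " +
  "WHERE cumulativepercentage <= 50")
resultSql.show(false)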

Pyspark agg function to "explode" rows into columns

Basically, I have a dataframe that looks like this:
+----+-------+------+------+
| id | index | col1 | col2 |
+----+-------+------+------+
| 1  | a     | a11  | a12  |
+----+-------+------+------+
| 1  | b     | b11  | b12  |
+----+-------+------+------+
| 2  | a     | a21  | a22  |
+----+-------+------+------+
| 2  | b     | b21  | b22  |
+----+-------+------+------+
and my desired output is this:
+----+--------+--------+--------+--------+
| id | col1_a | col1_b | col2_a | col2_b |
+----+--------+--------+--------+--------+
| 1  | a11    | b11    | a12    | b12    |
+----+--------+--------+--------+--------+
| 2  | a21    | b21    | a22    | b22    |
+----+--------+--------+--------+--------+
So basically I want to "explode" the index column into new columns after I group by id. By the way, the id counts are the same and each id has the same set of index values. I'm using PySpark.
Using pivot, you can achieve the desired output.
from pyspark.sql import functions as F
df = spark.createDataFrame([[1,"a","a11","a12"],[1,"b","b11","b12"],[2,"a","a21","a22"],[2,"b","b21","b22"]],["id","index","col1","col2"])
df.show()
+---+-----+----+----+
| id|index|col1|col2|
+---+-----+----+----+
|  1|    a| a11| a12|
|  1|    b| b11| b12|
|  2|    a| a21| a22|
|  2|    b| b21| b22|
+---+-----+----+----+
Using pivot:
df3 = df.groupBy("id").pivot("index").agg(F.first(F.col("col1")), F.first(F.col("col2")))
collist = ["id", "col1_a", "col2_a", "col1_b", "col2_b"]
Rename the columns:
df3.toDF(*collist).show()
+---+------+------+------+------+
| id|col1_a|col2_a|col1_b|col2_b|
+---+------+------+------+------+
|  1|   a11|   a12|   b11|   b12|
|  2|   a21|   a22|   b21|   b22|
+---+------+------+------+------+
Note: rearrange the columns to match your required order.

Running sum between two timestamp in pyspark

I have data in the below format:
+---------------------+----+----+---------+----------+
| date_time           | id | cm | p_count | bcm      |
+---------------------+----+----+---------+----------+
| 2018-02-01 04:38:00 | v1 | c1 | 1       | null     |
| 2018-02-01 05:37:07 | v1 | c1 | 1       | null     |
| 2018-02-01 11:19:38 | v1 | c1 | 1       | null     |
| 2018-02-01 12:09:19 | v1 | c1 | 1       | c1       |
| 2018-02-01 14:05:10 | v2 | c2 | 1       | c2       |
+---------------------+----+----+---------+----------+
I need to find the rolling sum of the p_count column between two date_time values, partitioned by id.
The logic for start_date_time and end_date_time for the rolling-sum window is below:
start_date_time=min(date_time) group by (id,cm)
end_date_time= bcm == cm ? date_time : null
In this case, start_date_time = 2018-02-01 04:38:00 and end_date_time = 2018-02-01 12:09:19.
The output should look like:
+---------------------+----+----+---------+----------+-------------+
| date_time           | id | cm | p_count | bcm      | p_sum_count |
+---------------------+----+----+---------+----------+-------------+
| 2018-02-01 04:38:00 | v1 | c1 | 1       | null     | 1           |
| 2018-02-01 05:37:07 | v1 | c1 | 1       | null     | 2           |
| 2018-02-01 11:19:38 | v1 | c1 | 1       | null     | 3           |
| 2018-02-01 12:09:19 | v1 | c1 | 1       | c1       | 4           |
| 2018-02-01 14:05:10 | v2 | c2 | 1       | c2       | 1           |
+---------------------+----+----+---------+----------+-------------+
var input = sqlContext.createDataFrame(Seq(
("2018-02-01 04:38:00", "v1", "c1",1,null),
("2018-02-01 05:37:07", "v1", "c1",1,null),
("2018-02-01 11:19:38", "v1", "c1",1,null),
("2018-02-01 12:09:19", "v1", "c1",1,"c1"),
("2018-02-01 14:05:10", "v2", "c2",1,"c2")
)).toDF("date_time","id","cm","p_count" ,"bcm")
input.show()
Next Code:
input.createOrReplaceTempView("input_Table");
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//val results = spark.sqlContext.sql("SELECT sum(p_count) from input_Table tbl GROUP BY tbl.cm")
val results = sqlContext.sql("select *, " +
"SUM(p_count) over ( order by id rows between unbounded preceding and current row ) cumulative_Sum " +
"from input_Table ").show
Results:
+-------------------+---+---+-------+----+--------------+
|          date_time| id| cm|p_count| bcm|cumulative_Sum|
+-------------------+---+---+-------+----+--------------+
|2018-02-01 04:38:00| v1| c1|      1|null|             1|
|2018-02-01 05:37:07| v1| c1|      1|null|             2|
|2018-02-01 11:19:38| v1| c1|      1|null|             3|
|2018-02-01 12:09:19| v1| c1|      1|  c1|             4|
|2018-02-01 14:05:10| v2| c2|      1|  c2|             5|
+-------------------+---+---+-------+----+--------------+
You need to partition by your grouping columns while windowing and add your own logic to get the expected results (see the sketch after the explanation below).
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Logically a Windowed Aggregate Function is newly calculated for each row within the PARTITION based on all ROWS between a starting row and an ending row.
Starting and ending rows might be fixed or relative to the current row based on the following keywords:
UNBOUNDED PRECEDING, all rows before the current row -> fixed
UNBOUNDED FOLLOWING, all rows after the current row -> fixed
x PRECEDING, x rows before the current row -> relative
y FOLLOWING, y rows after the current row -> relative
Possible kinds of calculation include:
Both starting and ending row are fixed, the window consists of all rows of a partition, e.g. a Group Sum, i.e. aggregate plus detail rows
One end is fixed, the other relative to current row, the number of rows increases or decreases, e.g. a Running Total, Remaining Sum
Starting and ending row are relative to current row, the number of rows within a window is fixed, e.g. a Moving Average over n rows
So SUM(x) OVER (ORDER BY col ROWS UNBOUNDED PRECEDING) results in a Cumulative Sum or Running Total
11 -> 11
2 -> 11 + 2 = 13
3 -> 13 + 3 (or 11+2+3) = 16
44 -> 16 + 44 (or 11+2+3+44) = 60
What is ROWS UNBOUNDED PRECEDING used for in Teradata?
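Building on that explanation, here is a hedged sketch of the partitioned running sum for this question, reusing the input_Table view and sqlContext from the code above. It reproduces the p_sum_count column of the expected output; the start/end boundary logic based on bcm == cm would still need to be layered on top of it.
val partitioned = sqlContext.sql(
  "select *, " +
  "SUM(p_count) over ( partition by id, cm order by date_time " +
  "rows between unbounded preceding and current row ) as p_sum_count " +
  "from input_Table")
partitioned.show()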

Add a new Column to my DataSet in spark Java API

I'm new to Spark.
My Dataset contains two columns. I want to add a third column that is the sum of the two columns.
My Dataset is:
+---------+-------------------+
|       C1|                 C2|
+---------+-------------------+
|       44|                 10|
|       55|                 10|
+---------+-------------------+
I want to obtain a Dataset like this:
+---------+-------------------+---------+
|       C1|                 C2|       C3|
+---------+-------------------+---------+
|       44|                 10|       54|
|       55|                 10|       65|
+---------+-------------------+---------+
Any help will be appreciated.
The correct solution is:
df.withColumn("C3", df.col1("C1").plus(df.col("C2")));
or
df.selectExpr("*", "C1 + C2");
For more arithmetic operators check Java-specific expression operators in the Column documentation.
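A minimal runnable sketch of both variants, shown in Scala for brevity (the Java Dataset API exposes the same withColumn and selectExpr methods; a SparkSession named spark is assumed):
import spark.implicits._
val df = Seq((44, 10), (55, 10)).toDF("C1", "C2")
// Option 1: column arithmetic through the Column API.
df.withColumn("C3", df.col("C1").plus(df.col("C2"))).show()
// Option 2: a SQL expression, aliasing the new column explicitly.
df.selectExpr("*", "C1 + C2 as C3").show()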

Resources