Add output of rollup as a new row in a PySpark DataFrame

I am converting SQL code into PySpark.
The SQL code uses ROLLUP to sum up the count for each state.
I am trying to do the same thing in PySpark, but I don't know how to get the total count row.
I have a table with state, city, and count, and I want to add a total count for each state at the end of that state's section.
This is a sample input:
State City Count
WA Seattle 10
WA Tacoma 11
MA Boston 11
MA Cambridge 3
MA Quincy 5
This is my desired output:
State City Count
WA Seattle 10
WA Tacoma 11
WA Total 21
MA Boston 11
MA Cambridge 3
MA Quincy 5
MA Total 19
I don't know how to add the total count rows in between the states.
I did try rollup; here is my code:
df2 = df.rollup('STATE').count()
and the result shows up like this:
State Count
WA 21
MA 19
But I want the Total after each state.

Since you want the Total as a new row inside your DataFrame, one option is to union the result of the groupBy() aggregation with the original DataFrame and sort by ["State", "City", "Count"] (which happens to place the "Total" row last within each state for this sample):
import pyspark.sql.functions as f

df.union(
    df.groupBy("State")
      .agg(f.sum("Count").alias("Count"))
      .select("State", f.lit("Total").alias("City"), "Count")
).sort("State", "City", "Count").show()
#+-----+---------+-----+
#|State| City|Count|
#+-----+---------+-----+
#| MA| Boston| 11|
#| MA|Cambridge| 3|
#| MA| Quincy| 5|
#| MA| Total| 19|
#| WA| Seattle| 10|
#| WA| Tacoma| 11|
#| WA| Total| 21|
#+-----+---------+-----+

Either use rollup directly:
df.rollup("State", "City").agg(count("*"))
or just register the table:
df.createOrReplaceTempView("df")
and apply your current SQL query with
spark.sql("...")

Fill missing timestamp with multiple categories using pyspark

I'm trying to fill missing timestamps using PySpark in AWS Glue.
My raw data's date column has values like 20220202, and I want to convert 20220202 to 2022-02-02, so I used the code below.
There are 5 columns:
(1) 'date' is the date column (like 20220202),
(2) 'sku' is categorical data like A, B, C... It has 25 different values, and each sku has its own timestamps,
(3) 'unitprice' is numeric data, and each sku has a different unitprice. For example, if sku A has unitprice 30 and sku A has 300 rows in the dataframe, those 300 rows all share that unitprice, while sku B has a different unitprice.
(4) 'trand_item' is categorical data. It is a kind of metadata of the sku, like color, and follows the same condition as (3).
(5) 'target' is numeric data, and each row has a different value.
When filling the missing timestamps, I want to fill them per day and keep the same 'unitprice' and 'trand_item' values for each sku, but fill 'target' with 0 for the newly added rows.
sparkDF = sparkDF.select('date', 'sku', 'unitprice', 'trand_item', 'target')
sparkDF = sparkDF.withColumn("date", sparkDF["date"].cast(StringType()))
sparkDF = sparkDF.withColumn("date", to_date(col("date"), "yyyyMMdd"))  # "MM" is month; lowercase "mm" would mean minutes
In the data there is an 'sku' column.
This column is categorical and has 25 different values like A, B, C...
Each value has its own timestamps, and each value's starting date is different (the ending date is the same).
sparkDF = sparkDF.dropDuplicates(['date', 'sku'])
sparkDF = sparkDF.sort("sku", "date")
Each sku (we have 25 in the data) has its own timestamps with gaps, so I want to fill them in.
How can I handle this?
<sample data>
date sku unitprice trand_item target
2018-01-01 A 10 Black 3
2018-02-01 A 10 Black 7
2018-04-01 A 10 Black 13
2017-08-01 B 20 White 4
2017-10-01 B 20 White 17
2017-11-01 B 20 White 9
<output I want>
date sku unitprice trand_item target
2018-01-01 A 10 Black 3
2018-02-01 A 10 Black 7
2018-03-01 A 10 Black 0
2018-04-01 A 10 Black 13
2017-08-01 B 20 White 4
2017-09-01 B 20 White 0
2017-10-01 B 20 White 17
2017-11-01 B 20 White 9
Your input:
data = [('2018-01-01', 'A', 10, 'Black', 3),
        ('2018-02-01', 'A', 10, 'Black', 7),
        ('2018-04-01', 'A', 10, 'Black', 13),
        ('2017-08-01', 'B', 20, 'White', 4),
        ('2017-10-01', 'B', 20, 'White', 17),
        ('2017-11-01', 'B', 20, 'White', 9)]
cols = ['date', 'sku', 'unitprice', 'trand_item', 'target']
df = sqlContext.createDataFrame(data, cols)
Inspired by the amazing solution from @blackbishop on PySpark generate missing dates and fill data with previous value:
from pyspark.sql import functions as F
from pyspark.sql import Window

df = df.withColumn("date", F.to_date(F.col("date"), "yyyy-dd-MM"))

dates_range = df.groupBy("sku").agg(
    F.date_trunc("dd", F.max(F.col("date"))).alias("max_date"),
    F.date_trunc("dd", F.min(F.col("date"))).alias("min_date")
).select(
    "sku",
    F.expr("sequence(min_date, max_date, interval 1 day)").alias("date")
).withColumn(
    "date", F.explode("date")
).withColumn(
    "date", F.date_format("date", "yyyy-MM-dd")
)

w = Window.partitionBy("sku").orderBy("date")

result = dates_range \
    .join(df, ["sku", "date"], "left") \
    .select("sku", "date",
            *[F.last(F.col(c), ignorenulls=True).over(w).alias(c)
              for c in df.columns if c not in ("sku", "date", "target")],
            "target") \
    .fillna(0, subset=['target'])

result.show()
+---+----------+---------+----------+------+
|sku| date|unitprice|trand_item|target|
+---+----------+---------+----------+------+
| A|2018-01-01| 10| Black| 3|
| A|2018-01-02| 10| Black| 7|
| A|2018-01-03| 10| Black| 0|
| A|2018-01-04| 10| Black| 13|
| B|2017-01-08| 20| White| 4|
| B|2017-01-09| 20| White| 0|
| B|2017-01-10| 20| White| 17|
| B|2017-01-11| 20| White| 9|
+---+----------+---------+----------+------+

Spark DataFrame Get Difference between values of two rows

I have calculated the average temperature for two cities grouped by season, but I'm having trouble getting the difference between the avg(TemperatureF) for City A vs City B. Here is an example of what my Spark Scala DataFrame looks like:
City Season avg(TemperatureF)
A    Fall   52
A    Spring 50
A    Summer 72
A    Winter 25
B    Fall   49
B    Spring 44
B    Summer 69
B    Winter 22
You may use the pivot function as follows (here the averaged column is assumed to be named avg):
import pyspark.sql.functions as f

df.groupBy('Season').pivot('City').agg(f.first('avg')) \
    .withColumn('diff', f.expr('A - B')) \
    .show()
+------+---+---+----+
|Season| A| B|diff|
+------+---+---+----+
|Spring| 50| 44| 6.0|
|Summer| 72| 69| 3.0|
| Fall| 52| 49| 3.0|
|Winter| 25| 22| 3.0|
+------+---+---+----+

How can I make a column pair with respect to a group?

I have a dataframe and an id column as a group. For each id I want to pair its elements in the following way:
title id
sal 1
summer 1
fada 1
row 2
winter 2
gole 2
jack 3
noway 3
output
title id pair
sal 1 None
summer 1 summer,sal
fada 1 fada,summer
row 2 None
winter 2 winter, row
gole 2 gole,winter
jack 3 None
noway 3 noway,jack
As you can see in the output, within each id group we pair each element with the element above it, starting from the last element. Since the first element of a group does not have a pair, I put None. I should also mention that this can be done in pandas with the following code, but I need PySpark code since my data is big.
df=data.assign(pair=data.groupby('id')['title'].apply(lambda x: x.str.cat(x.shift(1),sep=',')))
I can't emphasise enough that a Spark DataFrame is an unordered collection of rows, so saying something like "the element above it" is undefined without a column to order by. You can fake an ordering using F.monotonically_increasing_id(), but I'm not sure if that's what you want.
from pyspark.sql import functions as F, Window

w = Window.partitionBy('id').orderBy(F.monotonically_increasing_id())

df2 = df.withColumn(
    'pair',
    F.when(
        F.lag('title').over(w).isNotNull(),
        F.concat_ws(',', 'title', F.lag('title').over(w))
    )
)
df2.show()
+------+---+-----------+
| title| id| pair|
+------+---+-----------+
| sal| 1| null|
|summer| 1| summer,sal|
| fada| 1|fada,summer|
| jack| 3| null|
| noway| 3| noway,jack|
| row| 2| null|
|winter| 2| winter,row|
| gole| 2|gole,winter|
+------+---+-----------+
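If the rows do come from a source with a meaningful order (for example a file read top to bottom), here is a rough sketch, not part of the original answer, of how one could attach an explicit index at load time and order the window by it instead (column names df, id and title as in the question):

from pyspark.sql import functions as F, Window

# Attach a stable row index before any shuffling happens.
indexed = (
    df.rdd.zipWithIndex()
      .map(lambda row_idx: row_idx[0] + (row_idx[1],))
      .toDF(df.columns + ["row_idx"])
)

w = Window.partitionBy("id").orderBy("row_idx")
result = indexed.withColumn(
    "pair",
    F.when(
        F.lag("title").over(w).isNotNull(),
        F.concat_ws(",", "title", F.lag("title").over(w))
    )
)
result.show()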

Counting number of days in transaction data but from 6AM to 6AM of next day in PySpark

I have transaction data and I need to calculate the number of visits based on a countDistinct of dates. The problem is that I need to calculate it based on a 6 AM to 6 AM window, i.e. if a transaction happens between 12 AM and 6 AM on 04/08, it should still be counted as part of the 04/07 visit.
Is there any way I can achieve that?
CUSTOMER_ID TRANSACTION_ID TRANSACTION_DATETIME
C1 T1 04/07/2019 22:20:00
C1 T1 04/08/2019 1:00:00
C1 T2 04/07/2019 17:10:00
C1 T3 05/08/2019 12:00:00
So, as per the above, I need the number of visits for each CUSTOMER_ID.
This is the code I have so far:
testdfmod = df.groupBy("CUSTOMER_ID") \
    .agg(F.max(F.col('TRANSACTION_DATETIME')).alias("TRANSACTION_DATETIME"),
         F.countDistinct(
             F.to_date(F.col('TRANSACTION_DATETIME')).alias('TRANSACTION_DATETIME').cast("date")
         ).alias("TOTAL_TRIPS"))
Thank you so much for all the help.
IIUC, you can just add a new column whose value equals TRANSACTION_DATETIME minus 6 hours (6*3600 seconds):
from pyspark.sql import functions as F

df.withColumn(
    'adjusted_trx_date',
    F.from_unixtime(
        F.unix_timestamp('TRANSACTION_DATETIME', format='MM/dd/yyyy HH:mm:ss') - 6*3600,
        format='yyyy-MM-dd'
    )
).show()
#+-----------+--------------+--------------------+-----------------+
#|CUSTOMER_ID|TRANSACTION_ID|TRANSACTION_DATETIME|adjusted_trx_date|
#+-----------+--------------+--------------------+-----------------+
#| C1| T1| 04/07/2019 22:20:00| 2019-04-07|
#| C1| T1| 04/08/2019 1:00:00| 2019-04-07|
#| C1| T2| 04/07/2019 17:10:00| 2019-04-07|
#| C1| T3| 05/08/2019 12:00:00| 2019-05-08|
#+-----------+--------------+--------------------+-----------------+
Then you can do countDistinct() on the new column adjusted_trx_date with the code you had.
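Putting it together, a minimal sketch of that final aggregation (not from the original answer; it reuses the asker's column names and their TOTAL_TRIPS alias) could look like this:

from pyspark.sql import functions as F

visits = df.withColumn(
    'adjusted_trx_date',
    F.from_unixtime(
        F.unix_timestamp('TRANSACTION_DATETIME', format='MM/dd/yyyy HH:mm:ss') - 6*3600,
        format='yyyy-MM-dd'
    )
).groupBy('CUSTOMER_ID').agg(
    F.countDistinct('adjusted_trx_date').alias('TOTAL_TRIPS')
)
visits.show()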

NTILE function not working in Spark SQL 1.5

I'm testing the NTILE function on a simple dataset like this:
(id: string, value: double)
A 10
B 3
C 4
D 4
E 4
F 30
C 30
D 10
A 4
H 4
Running the following query against Hive (on MapReduce)
SELECT tmp.id, tmp.sum_val, NTILE(4) OVER (ORDER BY tmp.sum_val) AS quartile
FROM (SELECT id, sum(value) AS sum_val FROM testntile GROUP BY id) AS tmp
works fine with the following result:
(id, sum_val, quartile)
B 3 1
H 4 1
E 4 2
D 14 2
A 14 3
F 30 3
C 34 4
Running the same query against Hive on Spark (v 1.5) still works fine.
Running the same query against Spark SQL 1.5 (CDH 5.5.1)
val result = sqlContext.sql("SELECT tmp.id, tmp.sum_val, NTILE(4) OVER (ORDER BY tmp.sum_val) AS quartile FROM (SELECT id, sum(value) AS sum_val FROM testntile GROUP BY id) AS tmp")
result.collect().foreach(println)
I get the following wrong result:
[B,3.0,0]
[E,4.0,0]
[H,4.0,0]
[A,14.0,0]
[D,14.0,0]
[F,30.0,0]
[C,34.0,0]
IMPORTANT: the result is NOT deterministic; "sometimes" the correct values are returned.
Running the same algorithm directly on the dataframe
val x = sqlContext.sql("select id, sum(value) as sum_val from testntile group by id")
val w = Window.partitionBy("id").orderBy("sum_val")
val resultDF = x.select( x("id"),x("sum_val"), ntile(4).over(w) )
still returns a wrong result.
Am I doing something wrong? Any ideas? Thanks in advance for your answers.
If you use Window.partitionBy("id").orderBy("sum_val"), you are partitioning by id and then applying the ntile function within each partition. This way every partition contains a single element, so ntile assigns the same value to every id.
In order to achieve your first result, you need to remove partitionBy("id") and use only Window.orderBy("sum_val").
This is how I would modify your code:
val w = Window.orderBy("sum_val")
val resultDF = x.orderBy("sum_val").select( x("id"),x("sum_val"), ntile(4).over(w) )
And this is the output of resultDF.show():
+---+-------+-----+
| id|sum_val|ntile|
+---+-------+-----+
| B| 3| 1|
| E| 4| 1|
| H| 4| 2|
| D| 14| 2|
| A| 14| 3|
| F| 30| 3|
| C| 34| 4|
+---+-------+-----+
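One caveat worth adding (not from the original answer): a window with orderBy but no partitionBy moves all rows to a single partition, so this is fine for a small aggregated result like this one but will not scale to very large inputs.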
