Case-based scenario in PySpark - apache-spark

The DataFrame in PySpark looks like the one below.
model,DAYS
MarutiDesire,15
MarutiErtiga,30
Suzukicelerio,45
I10lxi,60
Verna,55
Output: I am trying to get the following classification:
when DAYS is less than 30, then ECONOMICAL,
when DAYS is between 30 and 60, then AVERAGE,
and when DAYS is greater than 60, then LOW PROFIT.
The code I tried, but it gives incorrect output:
dataset1.selectExpr("*", "CASE WHEN DAYS <=30 THEN 'ECONOMICAL' WHEN DAYS>30 AND LESS THEN 60 THEN 'AVERAGE' ELSE 'LOWPROFIT' END REASON").show()
Kindly share your suggestions. Is there a better way to do this in PySpark?

>>> from pyspark.sql.functions import *
>>> df.show()
+-------------+----+
| model|DAYS|
+-------------+----+
| MarutiDesire| 15|
| MarutiErtiga| 30|
|Suzukicelerio| 45|
| I10lxi| 60|
| Verna| 55|
+-------------+----+
>>> df.withColumn("REMARKS", when(col("DAYS") < 30, lit("ECONOMICAL")).when((col("DAYS") >= 30) & (col("DAYS") < 60), lit("AVERAGE")).otherwise(lit("LOWPROFIT"))).show()
+-------------+----+----------+
| model|DAYS| REMARKS|
+-------------+----+----------+
| MarutiDesire| 15|ECONOMICAL|
| MarutiErtiga| 30| AVERAGE|
|Suzukicelerio| 45| AVERAGE|
| I10lxi| 60| LOWPROFIT|
| Verna| 55| AVERAGE|
+-------------+----+----------+
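If you prefer the SQL-style CASE expression from the question, note that the original fails because "LESS THEN 60" is not valid SQL. A corrected sketch of the same selectExpr call (using the question's dataset1, with the result aliased as REMARKS to match the output above):
>>> dataset1.selectExpr("*", "CASE WHEN DAYS < 30 THEN 'ECONOMICAL' WHEN DAYS >= 30 AND DAYS < 60 THEN 'AVERAGE' ELSE 'LOWPROFIT' END AS REMARKS").show()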

Related

Spark SQL: get sample rows with a where clause

Is it possible to get a sample of n rows from a query with a where clause?
I tried to use TABLESAMPLE below, but I ended up only getting records from the first partition ('2021-09-14').
select * from (select * from table where ts in ('2021-09-14', '2021-09-15')) tablesample (100 rows)
You can use monotonically_increasing_id() or rand() to generate an additional column that can be used to order your dataset and produce the sampling field you need.
Both of these functions can be used in conjunction or individually.
Furthermore, you can use the LIMIT clause to sample your required N records.
NOTE - orderBy is a costly operation.
Data Preparation
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
sql = SparkSession.builder.getOrCreate()  # the SparkSession referenced as `sql` below
input_str = """
1 2/12/2019 114 2
2 3/5/2019 116 1
3 3/3/2019 120 6
4 3/4/2019 321 10
6 6/5/2019 116 1
7 6/3/2019 116 1
8 10/1/2019 120 3
9 10/1/2019 120 3
10 10/1/2020 120 3
11 10/1/2020 120 3
12 10/1/2020 120 3
13 10/1/2022 120 3
14 10/1/2021 120 3
15 10/6/2019 120 3
""".split()
input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "shipment_id ship_date customer_id quantity".split()))
n = len(input_values)
input_list = [tuple(input_values[i:i+4]) for i in range(0,n,4)]
sparkDF = sql.createDataFrame(input_list, cols)
sparkDF = sparkDF.withColumn('ship_date',F.to_date(F.col('ship_date'),'d/M/yyyy'))
sparkDF.show()
+-----------+----------+-----------+--------+
|shipment_id| ship_date|customer_id|quantity|
+-----------+----------+-----------+--------+
| 1|2019-12-02| 114| 2|
| 2|2019-05-03| 116| 1|
| 3|2019-03-03| 120| 6|
| 4|2019-04-03| 321| 10|
| 6|2019-05-06| 116| 1|
| 7|2019-03-06| 116| 1|
| 8|2019-01-10| 120| 3|
| 9|2019-01-10| 120| 3|
| 10|2020-01-10| 120| 3|
| 11|2020-01-10| 120| 3|
| 12|2020-01-10| 120| 3|
| 13|2022-01-10| 120| 3|
| 14|2021-01-10| 120| 3|
| 15|2019-06-10| 120| 3|
+-----------+----------+-----------+--------+
Order By - Monotonically Increasing ID & Rand
sparkDF.createOrReplaceTempView("shipment_table")
sql.sql("""
SELECT *
FROM (
    SELECT
        *,
        monotonically_increasing_id() AS increasing_id,
        RAND(10) AS random_order
    FROM shipment_table
    WHERE ship_date BETWEEN '2019-01-01' AND '2019-12-31'
    ORDER BY monotonically_increasing_id() DESC, RAND(10) DESC
    LIMIT 5
)
""").show()
+-----------+----------+-----------+--------+-------------+-------------------+
|shipment_id| ship_date|customer_id|quantity|increasing_id| random_order|
+-----------+----------+-----------+--------+-------------+-------------------+
| 15|2019-06-10| 120| 3| 8589934593|0.11682250456449328|
| 9|2019-01-10| 120| 3| 8589934592|0.03422639313807285|
| 8|2019-01-10| 120| 3| 6| 0.8078688178371882|
| 7|2019-03-06| 116| 1| 5|0.36664222617947817|
| 6|2019-05-06| 116| 1| 4| 0.2093704977577|
+-----------+----------+-----------+--------+-------------+-------------------+
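The same sampling can also be expressed with the DataFrame API instead of SQL; a rough equivalent of the query above (same sparkDF, with the seed and limit kept as placeholders):
sampled = (sparkDF
    .filter(F.col("ship_date").between("2019-01-01", "2019-12-31"))
    .withColumn("increasing_id", F.monotonically_increasing_id())
    .withColumn("random_order", F.rand(10))
    .orderBy(F.col("increasing_id").desc(), F.col("random_order").desc())
    .limit(5))
sampled.show()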
If you are using a Dataset, there is built-in functionality for this, as outlined in the documentation:
sample(withReplacement: Boolean, fraction: Double): Dataset[T]
Returns a new Dataset by sampling a fraction of rows, using a random seed.
withReplacement: Sample with replacement or not.
fraction: Fraction of rows to generate, range [0.0, 1.0].
Since: 1.6.0
Note: This is NOT guaranteed to provide exactly the fraction of the total count of the given Dataset.
To use this, you'd filter your dataset against whatever criteria you're looking for, then sample the result. If you need an exact number of rows rather than a fraction, you can follow the call to sample with limit(n), where n is the number of rows to return.
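A minimal PySpark sketch of that filter-then-sample-then-limit pattern (the ts column comes from the question's query; the fraction, seed, and row cap are placeholders):
from pyspark.sql import functions as F
sampled = (df
    .filter(F.col("ts").isin("2021-09-14", "2021-09-15"))  # your WHERE clause
    .sample(withReplacement=False, fraction=0.5, seed=42)  # random fraction of the filtered rows
    .limit(100))                                           # exact cap on the number of rows, if needed
sampled.show()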

How to use spark window function as cascading changes of previous row to next row

I tried to use a window function to calculate the current value based on the previous value in a dynamic way.
rowID | value
------------------
1 | 5
2 | 7
3 | 6
Logic:
if value > prev_value, then replace value with prev_value.
So in row 2, since 7 > 5, the value becomes 5.
The final result should be
rowID | value
------------------
1 | 5
2 | 5
3 | 5
However, using lag().over(w) gave the result as
rowID | value
------------------
1 | 5
2 | 5
3 | 6
It compares the third row's value 6 against the original "7", not the updated value "5".
Any suggestion how to achieve this?
df.show()
#exampledataframe
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 7|
| 3| 6|
| 4| 9|
| 5| 4|
| 6| 3|
+-----+-----+
Your required logic is too dynamic for window functions; therefore, we have to go row by row, updating our values. One solution could be to use a normal Python UDF on a collected list and then explode once the UDF has been applied. If you have relatively small data, this should be fine (Spark 2.4+ only, because of arrays_zip).
from pyspark.sql import functions as F
from pyspark.sql.types import *

def add_one(a):
    # cap each element at the (already updated) previous element
    for i in range(1, len(a)):
        if a[i] > a[i-1]:
            a[i] = a[i-1]
    return a

udf1 = F.udf(add_one, ArrayType(IntegerType()))

df.agg(F.collect_list("rowID").alias("rowID"), F.collect_list("value").alias("value"))\
  .withColumn("value", udf1("value"))\
  .withColumn("zipped", F.explode(F.arrays_zip("rowID", "value"))).select("zipped.*").show()
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 5| 4|
| 6| 3|
+-----+-----+
UPDATE:
Better yet, as you have groups of 5000, using a Pandas vectorized UDF (grouped map) should help a lot with processing. You do not have to collect_list 5000 integers and explode, or use pivot. I think this should be the optimal solution. Pandas grouped-map UDFs are available for Spark 2.3+.
The groupby below is empty, but you can add your grouping column to it.
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def grouped_map(df1):
    # same cascading logic as above, applied to one group as a pandas DataFrame
    for i in range(1, len(df1)):
        if df1.loc[i, 'value'] > df1.loc[i-1, 'value']:
            df1.loc[i, 'value'] = df1.loc[i-1, 'value']
    return df1

df.groupby().apply(grouped_map).show()
+-----+-----+
|rowID|value|
+-----+-----+
| 1| 5|
| 2| 5|
| 3| 5|
| 4| 5|
| 5| 4|
| 6| 3|
+-----+-----+
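On Spark 3.x, the same grouped-map logic would more commonly be written with groupby(...).applyInPandas instead of the PandasUDFType.GROUPED_MAP decorator; a minimal sketch under that assumption (plain function, no decorator, same df as above):
def grouped_map(pdf):
    # same cascading cap, on one group as a pandas DataFrame
    for i in range(1, len(pdf)):
        if pdf.loc[i, 'value'] > pdf.loc[i - 1, 'value']:
            pdf.loc[i, 'value'] = pdf.loc[i - 1, 'value']
    return pdf

df.groupby().applyInPandas(grouped_map, schema=df.schema).show()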

Sum last 2 days with day data gaps in Spark dataframe

I have a data frame
| Id | Date | Value |
| 1 | 1/1/2019 | 11 |
| 1 | 1/2/2019 | 12 |
| 1 | 1/3/2019 | 13 |
| 1 | 1/5/2019 | 14 |
| 1 | 1/6/2019 | 15 |
I want to calculate the sum of last 2 values by date:
| Id | Date | Value | Sum |
| 1 | 1/1/2019 | 11 | null |
| 1 | 1/2/2019 | 12 | null |
| 1 | 1/3/2019 | 13 | 23 |
| 1 | 1/5/2019 | 14 | -13 | // there is no 1/4 so 0 - 13
| 1 | 1/6/2019 | 15 | 14 | // there is no 1/4 so 14 - 0
Right now I have
let window = Window
.PartitionBy("Id")
.OrderBy(Functions.Col("Date").Cast("timestamp").Cast("long"))
data.WithColumn("Sum", Functions.Lag("Value", 1).Over(window) - Functions.Lag("Value", 2).Over(window))
With this approach I can only assume that the missing value is equal to the previous one (so 1/4 is treated as equal to 1/3 = 13).
How can I treat 1/4 as zero?
You have two ways to do this.
One would be to use the lag function with when and otherwise, and use the date functions to subtract one day from the date (a rough PySpark sketch of this follows).
The pro is that it works fine and quickly; the con is that each time you change your lag formula, you have to rewrite it...
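A rough sketch of that first idea, with the column names and the data variable taken from the question, Date assumed to already be a date column, and mirroring the question's Lag(1) - Lag(2) formula while treating a missing day as zero:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("Id").orderBy("Date")
lag1_val, lag1_date = F.lag("Value", 1).over(w), F.lag("Date", 1).over(w)
lag2_val, lag2_date = F.lag("Value", 2).over(w), F.lag("Date", 2).over(w)

# value on (Date - 1 day): only the immediately preceding row can hold it
val_d1 = F.when(lag1_date == F.date_sub("Date", 1), lag1_val).otherwise(F.lit(0))
# value on (Date - 2 days): it may sit one or two physical rows back
val_d2 = (F.when(lag1_date == F.date_sub("Date", 2), lag1_val)
           .when(lag2_date == F.date_sub("Date", 2), lag2_val)
           .otherwise(F.lit(0)))

result = data.withColumn("Sum", val_d1 - val_d2)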
However, I found a more generalizable method. The idea is to fill in the missing dates by working with the timestamps as Longs and using spark.range to generate every possible date between minDate and maxDate.
// Some imports
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // for the $"col" syntax (already in scope in spark-shell)
// Our DF
val df = Seq(
  (1, "1/1/2019", 11),
  (1, "1/2/2019", 12),
  (1, "1/3/2019", 13),
  (1, "1/5/2019", 14),
  (1, "1/6/2019", 15)
).toDF("id", "date", "value").
  withColumn("date", F.to_timestamp($"date", "MM/dd/yyyy"))
// min and max date
val (mindate, maxdate) = df.select(F.min($"date").cast("long"), F.max($"date").cast("long")).as[(Long, Long)].first
// Our step in seconds, so one day here
val step: Long = 24 * 60 * 60
// Generate missing dates
val reference = spark.
  range(mindate, ((maxdate / step) + 1) * step, step).
  select($"id".cast("timestamp").as("date"))
// Our df filled !
val filledDf = reference.join(df, Seq("date"), "leftouter").na.fill(0, Seq("value"))
/**
+-------------------+----+-----+
| date| id|value|
+-------------------+----+-----+
|2019-01-01 00:00:00| 1| 11|
|2019-01-02 00:00:00| 1| 12|
|2019-01-03 00:00:00| 1| 13|
|2019-01-04 00:00:00|null| 0|
|2019-01-05 00:00:00| 1| 14|
|2019-01-06 00:00:00| 1| 15|
+-------------------+----+-----+
*/
// Window ordered by date; the id column is null for the filled-in rows, so no partitioning here
val windowSpec = Window.orderBy($"date")
filledDf.
  withColumn("result", F.lag($"value", 1, 0).over(windowSpec) - F.lag($"value", 2, 0).over(windowSpec)).show
/**
+-------------------+----+-----+------+
| date| id|value|result|
+-------------------+----+-----+------+
|2019-01-01 00:00:00| 1| 11| 0|
|2019-01-02 00:00:00| 1| 12| 11|
|2019-01-03 00:00:00| 1| 13| 1|
|2019-01-04 00:00:00|null| 0| 1|
|2019-01-05 00:00:00| 1| 14| -13|
|2019-01-06 00:00:00| 1| 15| 14|
+-------------------+----+-----+------+
*/
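A rough PySpark translation of the same fill-the-gaps idea (Spark 2.4+, using sequence/explode per Id instead of spark.range; column names and the data variable from the question, with the M/d/yyyy date format assumed):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df2 = data.withColumn("Date", F.to_date("Date", "M/d/yyyy"))

# one row per Id per calendar day between that Id's min and max date
calendar = (df2.groupBy("Id")
    .agg(F.min("Date").alias("min_d"), F.max("Date").alias("max_d"))
    .select("Id", F.explode(F.sequence("min_d", "max_d")).alias("Date")))

# missing days appear with Value = 0 after the left join
filled = calendar.join(df2, ["Id", "Date"], "left").na.fill(0, ["Value"])

w = Window.partitionBy("Id").orderBy("Date")
result = filled.withColumn("Sum", F.lag("Value", 1, 0).over(w) - F.lag("Value", 2, 0).over(w))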

PySpark - get row number for each row in a group

Using pyspark, I'd like to be able to group a spark dataframe, sort the group, and then provide a row number. So
Group Date
A 2000
A 2002
A 2007
B 1999
B 2015
Would become
Group Date row_num
A 2000 0
A 2002 1
A 2007 2
B 1999 0
B 2015 1
Use a window function:
from pyspark.sql.window import *
from pyspark.sql.functions import row_number
df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))
The accepted solution almost has it right. Here is the solution based on the output requested in the question:
df = spark.createDataFrame([("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)], ["Group", "Date"])
+-----+----+
|Group|Date|
+-----+----+
| A|2000|
| A|2002|
| A|2007|
| B|1999|
| B|2015|
+-----+----+
# accepted solution above
from pyspark.sql.window import *
from pyspark.sql.functions import row_number
df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))
# accepted solution above output
+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
| B|1999| 1|
| B|2015| 2|
| A|2000| 1|
| A|2002| 2|
| A|2007| 3|
+-----+----+-------+
As you can see, row_number starts from 1, not 0, and the question wanted row_num to start from 0. A simple change, like the one I have made below, fixes this:
df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))-1).show()
Output:
+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
| B|1999| 0|
| B|2015| 1|
| A|2000| 0|
| A|2002| 1|
| A|2007| 2|
+-----+----+-------+
Then you can sort the "Group" column in whatever order you want. The above solution almost has it, but it is important to remember that row_number begins with 1, not 0.
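For example, one way to add the zero-based row_num and then display the result sorted by Group and Date (same df as above):
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

result = df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")) - 1)
result.orderBy("Group", "Date").show()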
