Hive query to find the count for the weeks in middle - apache-spark

I have a table like below
id week count
A100 201008 2
A100 201009 9
A100 201010 16
A100 201011 23
A100 201012 30
A100 201013 36
A100 201015 43
A100 201017 50
A100 201018 57
A100 201019 63
A100 201023 70
A100 201024 82
A100 201025 88
A100 201026 95
A100 201027 102
Here, we can see that below weeks are missing :
First 201014 is missing
Second 201016 is missing
Third weeks missing 201020, 201021, 201022
My requirement is whenever we have missing values we need to show the count of previous week.
In this case output should be :
id week count
A100 201008 2
A100 201009 9
A100 201010 16
A100 201011 23
A100 201012 30
A100 201013 36
A100 201014 36
A100 201015 43
A100 201016 43
A100 201017 50
A100 201018 57
A100 201019 63
A100 201020 63
A100 201021 63
A100 201022 63
A100 201023 70
A100 201024 82
A100 201025 88
A100 201026 95
A100 201027 102
How I can achieve this requirement using hive/pyspark?

Although this answer is in Scala, Python version will look almost the same & can be easily converted.
Step 1:
Find the rows that has missing week(s) value prior to it.
Sample Input:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//sample input
val input = sc.parallelize(List(("A100",201008,2), ("A100",201009,9),("A100",201014,4), ("A100",201016,45))).toDF("id","week","count")
scala> input.show
+----+------+-----+
| id| week|count|
+----+------+-----+
|A100|201008| 2|
|A100|201009| 9|
|A100|201014| 4| //missing 4 rows
|A100|201016| 45| //missing 1 row
+----+------+-----+
To find it, we can use .lead() function on week. And compute the difference between the leadWeek and week. The difference should not be > 1, if so there are missing row prior to it.
val diffDF = input
.withColumn("leadWeek", lead($"week", 1).over(Window.partitionBy($"id").orderBy($"week"))) // partitioning by id & computing lead()
.withColumn("diff", ($"leadWeek" - $"week") -1) // finding difference between leadWeek & week
scala> diffDF.show
+----+------+-----+--------+----+
| id| week|count|leadWeek|diff|
+----+------+-----+--------+----+
|A100|201008| 2| 201009| 0| // diff -> 0 represents that no rows needs to be added
|A100|201009| 9| 201014| 4| // diff -> 4 represents 4 rows are to be added after this row.
|A100|201014| 4| 201016| 1| // diff -> 1 represents 1 row to be added after this row.
|A100|201016| 45| null|null|
+----+------+-----+--------+----+
Step 2:
If the diff is >= 1: Create and Add n number of rows(InputWithDiff, check the case class below) as specified in
diff and increment week value accordingly. Return the newly
created rows along with the original row.
If the diff is 0, No additional computation is required. Return the original row as it is.
Convert diffDF to Dataset for ease of computation.
case class InputWithDiff(id: Option[String], week: Option[Int], count: Option[Int], leadWeek: Option[Int], diff: Option[Int])
val diffDS = diffDF.as[InputWithDiff]
val output = diffDS.flatMap(x => {
val diff = x.diff.getOrElse(0)
diff match {
case n if n >= 1 => x :: (1 to diff).map(y => InputWithDiff(x.id, Some(x.week.get + y), x.count,x.leadWeek, x.diff)).toList // create and append new Rows
case _ => List(x) // return as it is
}
}).drop("leadWeek", "diff").toDF // drop unnecessary columns & convert to DF
final output:
scala> output.show
+----+------+-----+
| id| week|count|
+----+------+-----+
|A100|201008| 2|
|A100|201009| 9|
|A100|201010| 9|
|A100|201011| 9|
|A100|201012| 9|
|A100|201013| 9|
|A100|201014| 4|
|A100|201015| 4|
|A100|201016| 45|
+----+------+-----+

PySpark solution
Sample Data
df = spark.createDataFrame([(1,201901,10),
(1,201903,9),
(1,201904,21),
(1,201906,42),
(1,201909,3),
(1,201912,56)
],['id','weeknum','val'])
df.show()
+---+-------+---+
| id|weeknum|val|
+---+-------+---+
| 1| 201901| 10|
| 1| 201903| 9|
| 1| 201904| 21|
| 1| 201906| 42|
| 1| 201909| 3|
| 1| 201912| 56|
+---+-------+---+
1) The basic idea is to create a combination of all id's and weeks (starting from the minimum possible value to the maximum) with a cross join.
from pyspark.sql.functions import min,max,sum,when
from pyspark.sql import Window
min_max_week = df.agg(min(df.weeknum),max(df.weeknum)).collect()
#Generate all weeks using range
all_weeks = spark.range(min_max_week[0][0],min_max_week[0][1]+1)
all_weeks = all_weeks.withColumnRenamed('id','weekno')
#all_weeks.show()
id_all_weeks = df.select(df.id).distinct().crossJoin(all_weeks).withColumnRenamed('id','aid')
#id_all_weeks.show()
2) Thereafter, left joining the original dataframe on to these combinations helps identify missing values.
res = id_all_weeks.join(df,(df.id == id_all_weeks.aid) & (df.weeknum == id_all_weeks.weekno),'left')
res.show()
+---+------+----+-------+----+
|aid|weekno| id|weeknum| val|
+---+------+----+-------+----+
| 1|201911|null| null|null|
| 1|201905|null| null|null|
| 1|201903| 1| 201903| 9|
| 1|201904| 1| 201904| 21|
| 1|201901| 1| 201901| 10|
| 1|201906| 1| 201906| 42|
| 1|201908|null| null|null|
| 1|201910|null| null|null|
| 1|201912| 1| 201912| 56|
| 1|201907|null| null|null|
| 1|201902|null| null|null|
| 1|201909| 1| 201909| 3|
+---+------+----+-------+----+
3) Then, use a combination of window functions, sum -> to assign groups
and max -> to fill in the missing values once the groups are classified.
w1 = Window.partitionBy(res.aid).orderBy(res.weekno)
groups = res.withColumn("grp",sum(when(res.id.isNull(),0).otherwise(1)).over(w1))
w2 = Window.partitionBy(groups.aid,groups.grp)
missing_values_filled = groups.withColumn('filled',max(groups.val).over(w2)) #select required columns as needed
missing_values_filled.show()
+---+------+----+-------+----+---+------+
|aid|weekno| id|weeknum| val|grp|filled|
+---+------+----+-------+----+---+------+
| 1|201901| 1| 201901| 10| 1| 10|
| 1|201902|null| null|null| 1| 10|
| 1|201903| 1| 201903| 9| 2| 9|
| 1|201904| 1| 201904| 21| 3| 21|
| 1|201905|null| null|null| 3| 21|
| 1|201906| 1| 201906| 42| 4| 42|
| 1|201907|null| null|null| 4| 42|
| 1|201908|null| null|null| 4| 42|
| 1|201909| 1| 201909| 3| 5| 3|
| 1|201910|null| null|null| 5| 3|
| 1|201911|null| null|null| 5| 3|
| 1|201912| 1| 201912| 56| 6| 56|
+---+------+----+-------+----+---+------+
Hive Query with the same logic as described above (assuming a table with all weeks can be created)
select id,weeknum,max(val) over(partition by id,grp) as val
from (select i.id
,w.weeknum
,t.val
,sum(case when t.id is null then 0 else 1 end) over(partition by i.id order by w.weeknum) as grp
from (select distinct id from tbl) i
cross join weeks_table w
left join tbl t on t.id = i.id and w.weeknum = t.weeknum
) t

Related

Running total with conditional in Pyspark/Hive

I have product, brand and percentage columns. I want to calculate the sum of the percentage column for the rows above the current row for those with different brand than the current row and also for those with same brand as the current row. How can I do it in PySpark or using using spark.sql?
sample data:
df = pd.DataFrame({'a': ['a1','a2','a3','a4','a5','a6'],
'brand':['b1','b2','b1', 'b3', 'b2','b1'],
'pct': [40, 30, 10, 8,7,5]})
df = spark.createDataFrame(df)
What I am looking for:
product brand pct pct_same_brand pct_different_brand
a1 b1 40 null null
a2 b2 30 null 40
a3 b1 10 40 30
a4 b3 8 null 80
a5 b2 7 30 58
a6 b1 5 50 45
This is what I have tried:
df.createOrReplaceTempView('tmp')
spark.sql("""
select *, sum(pct * (select case when n1.brand = n2.brand then 1 else 0 end
from tmp n1)) over(order by pct desc rows between UNBOUNDED PRECEDING and 1
preceding)
from tmp n2
""").show()
You can get the pct_different_brand column by subtracting the total rolling sum from the partitioned rolling sum (i.e. pct_same_brand column):
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
'pct_same_brand',
F.sum('pct').over(
Window.partitionBy('brand')
.orderBy(F.desc('pct'))
.rowsBetween(Window.unboundedPreceding, -1)
)
).withColumn(
'pct_different_brand',
F.sum('pct').over(
Window.orderBy(F.desc('pct'))
.rowsBetween(Window.unboundedPreceding, -1)
) - F.coalesce(F.col('pct_same_brand'), F.lit(0))
)
df2.show()
+---+-----+---+--------------+-------------------+
| a|brand|pct|pct_same_brand|pct_different_brand|
+---+-----+---+--------------+-------------------+
| a1| b1| 40| null| null|
| a2| b2| 30| null| 40|
| a3| b1| 10| 40| 30|
| a4| b3| 8| null| 80|
| a5| b2| 7| 30| 58|
| a6| b1| 5| 50| 45|
+---+-----+---+--------------+-------------------+

Filling gaps in time series Spark for different entities

I have a data frame containing daily events related to various entities in time.
I want to fill the gaps in those times series.
Here is the aggregate data I have (left), and on the right side, the data I want to have:
+---------+----------+-------+ +---------+----------+-------+
|entity_id| date|counter| |entity_id| date|counter|
+---------+----------+-------+ +---------+----------+-------+
| 3|2020-01-01| 7| | 3|2020-01-01| 7|
| 1|2020-01-01| 10| | 1|2020-01-01| 10|
| 2|2020-01-01| 3| | 2|2020-01-01| 3|
| 2|2020-01-02| 9| | 2|2020-01-02| 9|
| 1|2020-01-03| 15| | 1|2020-01-02| 0|
| 2|2020-01-04| 3| | 3|2020-01-02| 0|
| 1|2020-01-04| 14| | 1|2020-01-03| 15|
| 2|2020-01-05| 6| | 2|2020-01-03| 0|
+---------+----------+-------+ | 3|2020-01-03| 0|
| 3|2020-01-04| 0|
| 2|2020-01-04| 3|
| 1|2020-01-04| 14|
| 2|2020-01-05| 6|
| 1|2020-01-05| 0|
| 3|2020-01-05| 0|
+---------+----------+-------+
I have used this stack overflow topic, which was very useful:
Filling gaps in timeseries Spark
Here is my code (filter for only one entity), it is in Python but I think the API is the same in Scala:
(
df
.withColumn("date", sf.to_date("created_at"))
.groupBy(
sf.col("entity_id"),
sf.col("date")
)
.agg(sf.count(sf.lit(1)).alias("counter"))
.filter(sf.col("entity_id") == 1)
.select(
sf.col("date"),
sf.col("counter")
)
.join(
spark
.range(
df # range start
.filter(sf.col("entity_id") == 1)
.select(sf.unix_timestamp(sf.min("created_at")).alias("min"))
.first().min // a * a, # a = 60 * 60 * 24 = seconds in one day
(df # range end
.filter(sf.col("entity_id") == 1)
.select(sf.unix_timestamp(sf.max("created_at")).alias("max"))
.first().max // a + 1) * a,
a # range step, a = 60 * 60 * 24 = seconds in one day
)
.select(sf.to_date(sf.from_unixtime("id")).alias("date")),
["date"], # column which will be used for the join
how="right" # type of join
)
.withColumn("counter", sf.when(sf.isnull("counter"), 0).otherwise(sf.col("counter")))
.sort(sf.col("date"))
.show(200)
)
This work very well, but now I want to avoid the filter and do a range to fill the time series gaps for every entity (entity_id == 2, entity_id == 3, ...). For your information, depending on the entity_id value, the minimum and the maximum of the column date can be different, nevertheless if your help involves the global minimum and maximum of the whole data frame, it is ok for me as well.
If you need any other information, feel free to ask.
edit: add data example I want to have
When creating the elements of the date range, I would rather use the Pandas function than the Spark range, as the Spark range function has some shortcomings when dealing with date values. The amount of different dates is usually small. Even when dealing with a time span of multiple years, the number of different dates is so small that it can be easily broadcasted in a join.
#get the minimun and maximun date and collect it to the driver
min_date, max_date = df.select(F.min("date"), F.max("date")).first()
#use Pandas to create all dates and switch back to PySpark DataFrame
from pandas import pandas as pd
timerange = pd.date_range(start=min_date, end=max_date, freq='1d')
all_dates = spark.createDataFrame(timerange.to_frame(),['date'])
#get all combinations of dates and entity_ids
all_dates_and_ids = all_dates.crossJoin(df.select("entity_id").distinct())
#create the final result by doing a left join and filling null values with 0
result = all_dates_and_ids.join(df, on=['date', 'entity_id'], how="left_outer")\
.fillna({'counter':'0'}) \
.orderBy(['date', 'entity_id'])
This gives
+-------------------+---------+-------+
| date|entity_id|counter|
+-------------------+---------+-------+
|2020-01-01 00:00:00| 1| 10|
|2020-01-01 00:00:00| 2| 3|
|2020-01-01 00:00:00| 3| 7|
|2020-01-02 00:00:00| 1| 0|
|2020-01-02 00:00:00| 2| 9|
|2020-01-02 00:00:00| 3| 0|
|2020-01-03 00:00:00| 1| 15|
|2020-01-03 00:00:00| 2| 0|
|2020-01-03 00:00:00| 3| 0|
|2020-01-04 00:00:00| 1| 14|
|2020-01-04 00:00:00| 2| 3|
|2020-01-04 00:00:00| 3| 0|
|2020-01-05 00:00:00| 1| 0|
|2020-01-05 00:00:00| 2| 6|
|2020-01-05 00:00:00| 3| 0|
+-------------------+---------+-------+

Get the previous row value using spark sql

I have a table like this.
Id prod val
1 0 0
2 0 0
3 1 1000
4 0 0
5 1 2000
6 0 0
7 0 0
I want to add a new column new_val and the condition for this column is, if prod = 0, then new_val should be from the preceding row where prod = 1.
If prod = 1 it should have the same value as val column. How do I achieve this using spark sql?
Id prod val new_val
1 0 0 1000
2 0 0 1000
3 1 1000 1000
4 0 0 2000
5 1 2000 2000
6 1 4000 4000
7 1 3000 3000
Any help is greatly appreciated
You can use something like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window().orderBy("id")
df = df.withColumn("new_val", F.when(F.col("prod") == 0, F.lag("val").over(w)).otherwise(F.col("val")))
What we are basically doing is using an if-else condition:
When prod == 0, take lag of val which is value of previous row (over a window that is ordered by id column), and if prod == 1, then we use the present value of the column.
You can acheive that with
val w = Window.orderBy("id").rowsBetween(0, Window.unboundedFollowing)
df
.withColumn("new_val", when($"prod" === 0, null).otherwise($"val"))
.withColumn("new_val", first("new_val", ignoreNulls = true).over(w))
It first, creates new column with null values whenever value doesn't change:
+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
| 1| 0| 0| null|
| 2| 0| 0| null|
| 3| 1|1000| 1000|
| 4| 0| 0| null|
| 5| 1|2000| 2000|
| 6| 1|4000| 4000|
| 7| 1|3000| 3000|
+---+----+----+-------+
And it replaces values with first non-null value in the following records
+---+----+----+-------+
| id|prod| val|new_val|
+---+----+----+-------+
| 1| 0| 0| 1000|
| 2| 0| 0| 1000|
| 3| 1|1000| 1000|
| 4| 0| 0| 2000|
| 5| 1|2000| 2000|
| 6| 1|4000| 4000|
| 7| 1|3000| 3000|
+---+----+----+-------+

check if value of one column lies between the range of another column(array) in a dataframe

I have a dataframe where I need to compare a few values and deduce a few things out of it.
For instance,
my DF
CITY DAY MONTH TAG RANGE VALUE RANK
A 1 01 A [50, 90] 55 1
A 2 02 B [30, 40] 34 3
A 1 03 A [05, 10] 15 20
A 1 04 B [50, 60] 11 10
A 1 05 B [50, 60] 54 4
I have to check , for every row if the value of "VALUE" lies between the "RANGE". Here, arr[0] is the lower limit and arr[1] is the upper limit.
I need to create a new DF such that,
NEW-DF
TAG Positive Negative
A 1 1
B 2 1
If the "value" lies between the given range and the rank < 5 then I would add it to "positive"
If the value doesnt lie in the given range , then it is a negative
If the value lies in the given range, but the rank > 5, then I would count it as negative
"Positive" and "Negative" is nothing but the count of the values fulfilling either conditions.
We can use element_at to get the elements at each position and compare them to the corresponding value in each row, along with the rank condition, and then perform a groupby with sum on the tag:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
range_df = df.withColumn('in_range', (F.element_at('range', 1).cast(IntegerType()) < F.col('value')) &
(F.col('value') < F.element_at('range', 2).cast(IntegerType())) &
(F.col('rank') < 5))
range_df.show()
grouped_df = range_df.groupby('tag').agg(F.sum(F.col('in_range').cast(IntegerType())).alias('total_positive'),
F.sum((~F.col('in_range')).cast(IntegerType())).alias('total_negative'))
grouped_df.show()
Output:
+---+--------+-----+----+--------+
|tag| range|value|rank|in_range|
+---+--------+-----+----+--------+
| A|[50, 90]| 55| 1| true|
| B|[30, 40]| 34| 3| true|
| A|[05, 10]| 15| 20| false|
| B|[50, 60]| 11| 10| false|
| B|[50, 60]| 54| 4| true|
+---+--------+-----+----+--------+
+---+--------------+--------------+
|tag|total_positive|total_negative|
+---+--------------+--------------+
| B| 2| 1|
| A| 1| 1|
+---+--------------+--------------+
You have to use a UDF first to process the range :
val df = Seq(("A","1","01","A","[50,90]","55","1")).toDF("city","day","month","tag","range","value","rank")
+----+---+-----+---+-------+-----+----+
|city|day|month|tag| range|value|rank|
+----+---+-----+---+-------+-----+----+
| A| 1| 01| A|[50,90]| 55| 1|
+----+---+-----+---+-------+-----+----+
def checkRange(range : String,rank : String, value : String) : String = {
val rangeProcess = range.dropRight(1).drop(1).split(",")
if (rank.toInt > 5){
"negative"
} else {
if (value > rangeProcess(0) && value < rangeProcess(1)){
"positive"
} else {
"negative"
}
}
}
val checkRangeUdf = udf(checkRange _)
df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).show()
+----+---+-----+---+-------+-----+----+--------+
|city|day|month|tag| range|value|rank| Result|
+----+---+-----+---+-------+-----+----+--------+
| A| 1| 01| A|[50,90]| 55| 1|positive|
+----+---+-----+---+-------+-----+----+--------+
val result = df.withColumn("Result",checkRangeUdf(col("range"),col("rank"),col("value"))).groupBy("city","Result").count.show
+----+--------+-----+
|city| Result|count|
+----+--------+-----+
| A|positive| 1|
+----+--------+-----+

Keep track of the previous row values with additional condition using pyspark

I'm using pyspark to generate a dataframe where I need to update 'amt' column with previous row's 'amt' value only when amt = 0.
For example, below is my dataframe
+---+-----+
| id|amt |
+---+-----+
| 1| 5|
| 2| 0|
| 3| 0|
| 4| 6|
| 5| 0|
| 6| 3|
+---+-----+
Now, I want the following DF to be created. whenever amt = 0, modi_amt col will contain previous row's non zero value, else no change.
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 5|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
I'm able to get the previous rows value but need help for the rows where multiple 0 amt appears (example, id = 2,3)
code I'm using :
from pyspark.sql.window import Window
my_window = Window.partitionBy().orderBy("id")
DF= DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
DF= DF.withColumn("modi_amt",when(DF.amt== 0,DF.prev_amt).otherwise(DF.amt)).drop('prev_amt')
I'm getting the below DF
+---+-----+----------+
| id|amt |modi_amt |
+---+-----+----------+
| 1| 5| 5|
| 2| 0| 5|
| 3| 0| 0|
| 4| 6| 6|
| 5| 0| 6|
| 6| 3| 3|
+---+-----+----------+
basically id 3 also should have modi_amt = 5
I've used the below approach to get the output and it's working fine,
from pyspark.sql.window import Window
my_window = Window.partitionBy().orderBy("id")
# this will hold the previous col value
DF= DF.withColumn("prev_amt", F.lag(DF.amt).over(my_window))
# this will replace the amt 0 with previous column value, but not consecutive rows having 0 amt.
DF = DF.withColumn("amt_adjusted",when(DF.prev_amt == 0,DF.prev_OffSet).otherwise(DF.amt))
# define null for the rows where both amt and amt_adjusted are having 0 (logic for consecutive rows having 0 amt)
DF = DF.withColumn('zeroNonZero', when((DF.amt== 0)&(DF.amt_adjusted == 0),lit(None)).otherwise(DF.amt_adjusted))
# replace all null values with previous Non zero amt row value
DF= DF.withColumn('modi_amt',last("zeroNonZero", ignorenulls= True).over(Window.orderBy("id").rowsBetween(Window.unboundedPreceding,0)))
Is there any other better approach?

Resources