Pyspark : forward fill with last observation for a DataFrame - apache-spark

Using Spark 1.5.1,
I've been trying to forward fill null values with the last known observation for one column of my DataFrame.
It is possible to start with a null value, and in that case I would like to backward fill this null value with the first known observation. However, if that complicates the code too much, this point can be skipped.
In this post, a solution in Scala was provided for a very similar problem by zero323.
But I don't know Scala and I haven't managed to "translate" it into PySpark API code. Is it possible to do this with PySpark?
Thanks for your help.
Below is a simple example of the input:
| cookie_ID | Time | User_ID
| ------------- | -------- |-------------
| 1 | 2015-12-01 | null
| 1 | 2015-12-02 | U1
| 1 | 2015-12-03 | U1
| 1 | 2015-12-04 | null
| 1 | 2015-12-05 | null
| 1 | 2015-12-06 | U2
| 1 | 2015-12-07 | null
| 1 | 2015-12-08 | U1
| 1 | 2015-12-09 | null
| 2 | 2015-12-03 | null
| 2 | 2015-12-04 | U3
| 2 | 2015-12-05 | null
| 2 | 2015-12-06 | U4
And the expected output:
| cookie_ID | Time | User_ID
| ------------- | -------- |-------------
| 1 | 2015-12-01 | U1
| 1 | 2015-12-02 | U1
| 1 | 2015-12-03 | U1
| 1 | 2015-12-04 | U1
| 1 | 2015-12-05 | U1
| 1 | 2015-12-06 | U2
| 1 | 2015-12-07 | U2
| 1 | 2015-12-08 | U1
| 1 | 2015-12-09 | U1
| 2 | 2015-12-03 | U3
| 2 | 2015-12-04 | U3
| 2 | 2015-12-05 | U3
| 2 | 2015-12-06 | U4

Another way to get this working is to try something like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

window = (
    Window
    .partitionBy('cookie_id')
    .orderBy('Time')
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
# 'joined' is the input DataFrame with the cookie_ID, Time and User_ID columns
final = (
    joined
    .withColumn('UserIDFilled', F.last('User_ID', ignorenulls=True).over(window))
)
This constructs a window based on the partition key and the order column, with a frame that covers all rows from the start of the partition up to the current row. At each row, F.last with ignorenulls=True then returns the most recent non-null value, which, given this frame, includes the current row itself.
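For reference, here is a minimal self-contained sketch of the same idea applied to a subset of the sample data above (it assumes Spark 2.x or later, where ignorenulls on last and the Window boundary constants are available; in 1.5.x the frame bounds have to be given numerically):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "2015-12-01", None), (1, "2015-12-02", "U1"), (1, "2015-12-03", "U1"),
     (1, "2015-12-04", None), (1, "2015-12-05", None), (1, "2015-12-06", "U2"),
     (2, "2015-12-03", None), (2, "2015-12-04", "U3"), (2, "2015-12-05", None)],
    ["cookie_ID", "Time", "User_ID"])

# last non-null User_ID seen so far within each cookie_ID, ordered by Time
w = (Window.partitionBy("cookie_ID").orderBy("Time")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df.withColumn("UserIDFilled", F.last("User_ID", ignorenulls=True).over(w)).show()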

Below is the partitioned example code from "Spark / Scala: forward fill with last observation" rewritten for PySpark. This approach only works for data that can be partitioned (here, by cookie_id).
Load the data
values = [
    (1, "2015-12-01", None),
    (1, "2015-12-02", "U1"),
    (1, "2015-12-02", "U1"),
    (1, "2015-12-03", "U2"),
    (1, "2015-12-04", None),
    (1, "2015-12-05", None),
    (2, "2015-12-04", None),
    (2, "2015-12-03", None),
    (2, "2015-12-02", "U3"),
    (2, "2015-12-05", None),
]
rdd = sc.parallelize(values)
df = rdd.toDF(["cookie_id", "c_date", "user_id"])
df = df.withColumn("c_date", df.c_date.cast("date"))
df.show()
The DataFrame is
+---------+----------+-------+
|cookie_id| c_date|user_id|
+---------+----------+-------+
| 1|2015-12-01| null|
| 1|2015-12-02| U1|
| 1|2015-12-02| U1|
| 1|2015-12-03| U2|
| 1|2015-12-04| null|
| 1|2015-12-05| null|
| 2|2015-12-04| null|
| 2|2015-12-03| null|
| 2|2015-12-02| U3|
| 2|2015-12-05| null|
+---------+----------+-------+
Column used to sort the partitions
# get the sort key
def getKey(item):
    return item.c_date
The fill function. Can be used to fill in multiple columns if necessary.
# fill function
def fill(x):
    out = []
    last_val = None
    for v in x:
        if v["user_id"] is None:
            data = [v["cookie_id"], v["c_date"], last_val]
        else:
            data = [v["cookie_id"], v["c_date"], v["user_id"]]
            last_val = v["user_id"]
        out.append(data)
    return out
Convert to rdd, partition, sort and fill the missing values
# Partition the data
rdd = df.rdd.groupBy(lambda x: x.cookie_id).mapValues(list)
# Sort the data by date
rdd = rdd.mapValues(lambda x: sorted(x, key=getKey))
# fill missing value and flatten
rdd = rdd.mapValues(fill).flatMapValues(lambda x: x)
# discard the key
rdd = rdd.map(lambda v: v[1])
Convert back to DataFrame
df_out = sqlContext.createDataFrame(rdd)
df_out.show()
The output is
+---+----------+----+
| _1| _2| _3|
+---+----------+----+
| 1|2015-12-01|null|
| 1|2015-12-02| U1|
| 1|2015-12-02| U1|
| 1|2015-12-03| U2|
| 1|2015-12-04| U2|
| 1|2015-12-05| U2|
| 2|2015-12-02| U3|
| 2|2015-12-03| U3|
| 2|2015-12-04| U3|
| 2|2015-12-05| U3|
+---+----------+----+
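A small follow-up (not part of the original answer): because createDataFrame is called without column names, the result comes back as _1, _2, _3. Passing the names explicitly preserves the original schema, for example:
df_out = sqlContext.createDataFrame(rdd, ["cookie_id", "c_date", "user_id"])
df_out.show()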

Hope you find this forward fill function useful. It is written using native PySpark functions; no UDFs or RDDs are used (both of which are very slow, especially UDFs!).
Let's use the example provided by Sid.
values = [
    (1, "2015-12-01", None),
    (1, "2015-12-02", "U1"),
    (1, "2015-12-02", "U1"),
    (1, "2015-12-03", "U2"),
    (1, "2015-12-04", None),
    (1, "2015-12-05", None),
    (2, "2015-12-04", None),
    (2, "2015-12-03", None),
    (2, "2015-12-02", "U3"),
    (2, "2015-12-05", None),
]
df = spark.createDataFrame(values, ['cookie_ID', 'Time', 'User_ID'])
Functions (with the imports they need):
from pyspark.sql import Window
from pyspark.sql.functions import col, when, lit, collect_list, sum
from pyspark.sql.types import StringType

def cum_sum(df, sum_col, order_col, cum_sum_col_nm='cum_sum'):
    '''Find the cumulative sum of a column.
    Parameters
    ----------
    sum_col : String
        Column to perform the cumulative sum on.
    order_col : List
        Column/columns to sort by for the cumulative sum.
    cum_sum_col_nm : String
        The name of the resulting cum_sum column.
    Returns
    -------
    df : DataFrame
        Dataframe with the additional "cum_sum_col_nm" column.
    '''
    df = df.withColumn('tmp', lit('tmp'))
    windowval = (Window.partitionBy('tmp')
                 .orderBy(order_col)
                 .rangeBetween(Window.unboundedPreceding, 0))
    df = df.withColumn(cum_sum_col_nm, sum(sum_col).over(windowval).cast(StringType()))
    df = df.drop('tmp')
    return df
def forward_fill(df, order_col, fill_col, fill_col_name=None):
    '''Forward fill a column, ordered by a column/set of columns (order_col).
    Parameters
    ----------
    df : Dataframe
    order_col : String or list of strings
    fill_col : String (only works for a single column in this version.)
    Returns
    -------
    df : Dataframe
        Returns df with the filled column.
    '''
    # "value" and "constant" are tmp values created to enable the forward fill.
    df = df.withColumn('value', when(col(fill_col).isNull(), 0).otherwise(1))
    df = cum_sum(df, 'value', order_col).drop('value')
    df = df.withColumn(fill_col,
                       when(col(fill_col).isNull(), 'constant').otherwise(col(fill_col)))
    win = (Window.partitionBy('cum_sum')
           .orderBy(order_col))
    if not fill_col_name:
        fill_col_name = 'ffill_{}'.format(fill_col)
    df = df.withColumn(fill_col_name, collect_list(fill_col).over(win)[0])
    df = df.drop('cum_sum')
    df = df.withColumn(fill_col_name, when(col(fill_col_name) == 'constant', None).otherwise(col(fill_col_name)))
    df = df.withColumn(fill_col, when(col(fill_col) == 'constant', None).otherwise(col(fill_col)))
    return df
Let's see the results. (Note that cum_sum partitions by a constant 'tmp' column, so the whole DataFrame ends up in a single partition for that step; this is fine for small data but will not scale well.)
ffilled_df = forward_fill(df,
                          order_col=['cookie_ID', 'Time'],
                          fill_col='User_ID',
                          fill_col_name='User_ID_ffil')
ffilled_df.sort(['cookie_ID', 'Time']).show()

This combines a forward fill with a backward fill for any leading nulls: coalesce takes the last non-null value looking backwards (w1), and falls back to the first non-null value over the whole partition (w2) when nothing precedes the current row.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Forward filling
w1 = Window.partitionBy('cookie_id').orderBy('c_date').rowsBetween(Window.unboundedPreceding, 0)
# Backward filling
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
final_df = df.withColumn('UserIDFilled', F.coalesce(F.last('user_id', True).over(w1),
                                                    F.first('user_id', True).over(w2)))
final_df.orderBy('cookie_id', 'c_date').show(truncate=False)
+---------+----------+-------+------------+
|cookie_id|c_date |user_id|UserIDFilled|
+---------+----------+-------+------------+
|1 |2015-12-01|null |U1 |
|1 |2015-12-02|U1 |U1 |
|1 |2015-12-02|U1 |U1 |
|1 |2015-12-03|U2 |U2 |
|1 |2015-12-04|null |U2 |
|1 |2015-12-05|null |U2 |
|2 |2015-12-02|U3 |U3 |
|2 |2015-12-03|null |U3 |
|2 |2015-12-04|null |U3 |
|2 |2015-12-05|null |U3 |
+---------+----------+-------+------------+

Cloudera has released a library called spark-ts that offers a suite of useful methods for processing time series and sequential data in Spark. This library supports a number of time-windowed methods for imputing data points based on other data in the sequence.
http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/

Related

Pyspark add or remove rows in a dataframe based on another similar dataframe

Consider that I have two DataFrames, DF1 and DF2, with the same schema.
What I want to do is, for each row in DF1:
if DF1.uniqueId exists in DF2 and type is new, then add it to DF2 with a repeat count.
if DF1.uniqueId exists in DF2 and type is old, change the DF2 type to that of the DF1 type (old).
if DF1.uniqueId does not exist in DF2 and type is new, add a new row to DF2.
if DF1.uniqueId does not exist in DF2 and type is old, move that row to a new table, DF3.
I.e., if the tables are as shown below, the resulting (updated) DF2 should look like the resultDF2 table below.
DF1
+----------+--------------------------+
|UniqueID |type_ |
+----------+--------------------------+
|1 |new |
|1 |new |
|1 |new |
|2 |old |
|1 |new |
+----------+--------------------------+
DF2
+----------+--------------------------+
|UniqueID |type_ |
+----------+--------------------------+
| | |
+----------+--------------------------+
resultDF2
+----------+-------+-------------+
| UniqueID | type_ | repeatCount |
+----------+-------+-------------+
| 1        | new   | 3           |
+----------+-------+-------------+
resultDF3
+----------+-------+-------------+
| UniqueID | type_ | repeatCount |
+----------+-------+-------------+
| 1        | old   | 0           |
+----------+-------+-------------+
** If there is only one entry, repeatCount is zero.
I am trying to achieve this using PySpark.
Can anyone give me any pointers on how to achieve this, considering that I have both tables in memory?
The desired output can be obtained by:
Group df1 on UniqueID and compute repeatCount; during this step drop UniqueIDs that have both old and new type_ (they cancel out).
Apply a full join between the dataframe from step 1 and df2.
From the joined result, remove rows where the UniqueID is absent from df2 and df1.type_ is old.
Finally, select UniqueID, type_ and repeatCount.
from pyspark.sql import functions as F

data = [(1, "new",),  # Not exists and new
        (1, "new",),
        (1, "new",),
        (2, "old",),  # Not exists and old
        (1, "new",),
        (3, "old",),  # cancel out
        (3, "new",),  # cancel out
        (4, "new",),  # one entry count zero example
        (5, "new",),  # Exists and new
        (6, "old",), ]  # Exists and old

df1 = spark.createDataFrame(data, ("UniqueID", "type_", ))
df2 = spark.createDataFrame([(5, "new", ), (6, "new", ), ], ("UniqueID", "type_", ))

df1_grouped = (df1.groupBy("UniqueID").agg(F.collect_set("type_").alias("types_"),
                                           (F.count("type_") - F.lit(1)).alias("repeatCount"))
               .filter(F.size(F.col("types_")) == 1)  # when more than one type of `type_` is present they cancel out
               .withColumn("type_", F.col("types_")[0])
               .drop("types_")
               )

id_not_exists_old = (df2["UniqueID"].isNull() & (df1_grouped["type_"] == F.lit("old")))

(df1_grouped.join(df2, df1_grouped["UniqueID"] == df2["UniqueID"], "full")
 .filter(~(id_not_exists_old))
 .select(df1_grouped["UniqueID"], df1_grouped["type_"], "repeatCount")
 ).show()
"""
+--------+-----+-----------+
|UniqueID|type_|repeatCount|
+--------+-----+-----------+
| 1| new| 3|
| 4| new| 0|
| 5| new| 0|
| 6| old| 0|
+--------+-----+-----------+
"""

Filter in a spark window by comparing a single row element with all rows of the window

Suppose you have a dataframe as follows:
+---+----------+----------+
| id| date_a| date_b|
+---+----------+----------+
| 1|2020-01-30|2020-01-19|
| 1|2020-01-10|2020-01-19|
| 1|2020-01-10|2020-01-26|
| 1|2020-01-30|2020-01-26|
| 2|2020-01-05|2020-01-08|
| 3|2020-01-08|2020-01-10|
| 3|2020-01-12|2020-01-10|
+---+----------+----------+
For each id, there are date_a and date_b values, in various combinations.
I'd like to keep only those entries where, for a given id, the date_b is outside of a set time range around all of that id's date_a values.
A visual interpretation for id = 1 looks like this (horizontal is the time axis):
|---x---| o |-o--x---|
where x = date_a, o = date_b and |--- ---| indicates the time range (i.e. +- 5 days).
Thus, the "o" (date_b) entries that fall within none of the date_a time ranges should be kept (here, the first "o").
Example input/output:
Input:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, '2020-01-10', '2020-01-19'),
     (1, '2020-01-10', '2020-01-26'),
     (1, '2020-01-30', '2020-01-19'),
     (1, '2020-01-30', '2020-01-26'),
     (2, '2020-01-05', '2020-01-08'),
     (3, '2020-01-08', '2020-01-10'),
     (3, '2020-01-12', '2020-01-10'), ],
    ['id', 'date_a', 'date_b']
)
df = df.withColumn('date_a', F.to_date('date_a'))
df = df.withColumn('date_b', F.to_date('date_b'))
df = df.withColumn('diff', F.datediff(df.date_b, df.date_a))
df.orderBy('id', 'date_b').show()
+---+----------+----------+----+
| id| date_a| date_b|diff|
+---+----------+----------+----+
| 1|2020-01-30|2020-01-19| -11|
| 1|2020-01-10|2020-01-19| 9|
| 1|2020-01-30|2020-01-26| -4|
| 1|2020-01-10|2020-01-26| 16|
| 2|2020-01-05|2020-01-08| 3|
| 3|2020-01-08|2020-01-10| 2|
| 3|2020-01-12|2020-01-10| -2|
+---+----------+----------+----+
Within the same id, we want to get the date_b's where the diff is >5 or <-6 for all rows with the same date_b (i.e. date_b is outside of the interval [date_a - 6, date_a + 5]).
I.e.:
For id=1, date_b='2020-01-19': (-11 > 5 | -11 < -6) & (9 > 5 | 9 < -6) -> entry is kept (True & True)
For id=1, date_b='2020-01-26': (-4 > 5 | -4 < -6) & (16 > 5 | 16 < -6) -> entry is discarded (False & True)
...
Expected output:
+---+----------+----------+
| id| date_a| date_b|
+---+----------+----------+
| 1|2020-01-10|2020-01-19|
| 1|2020-01-30|2020-01-19|
+---+----------+----------+
Here is a possible approach you can try (comments inline):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id", "date_b").orderBy("id")
cond = (F.col("diff") > 5) | (F.col("diff") < -6)
# check if the condition is true and sum it over the window
sum_of_true_on_w = F.sum(cond.cast("Integer")).over(w)
# get the window size to compare with the sum; there might be a better way to get the size
size_of_window = F.max(F.row_number().over(w)).over(w)
# keep rows where the sum over the window equals the size of the window
(df.withColumn("Sum_bool", sum_of_true_on_w)
 .withColumn("Window_Size", size_of_window)
 .filter(F.col("Sum_bool") == F.col("Window_Size"))
 .drop("diff", "Sum_bool", "Window_Size")).show()
+---+----------+----------+
| id| date_a| date_b|
+---+----------+----------+
| 1|2020-01-10|2020-01-19|
| 1|2020-01-30|2020-01-19|
+---+----------+----------+
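A possibly simpler variant of the same idea (a sketch, not from the original answer): since the condition has to hold on every row of the (id, date_b) window, the minimum of the boolean over the whole partition can replace the sum-versus-size comparison:
w2 = Window.partitionBy("id", "date_b")
cond = (F.col("diff") > 5) | (F.col("diff") < -6)
# keep an (id, date_b) group only if the condition is true on all of its rows
(df.withColumn("all_true", F.min(cond.cast("int")).over(w2))
 .filter(F.col("all_true") == 1)
 .drop("diff", "all_true")).show()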

Spark filter multiple group of rows to a single row

I am trying to achieve the following.
Let's say I have a dataframe with the following columns:
id | name | alias
-------------------
1 | abc | short
1 | abc | ailas-long-1
1 | abc | another-long-alias
2 | xyz | short_alias
2 | xyz | same_length
3 | def | alias_1
I want to group by id and name and select the shortest alias.
The output I am expecting is
id | name | alias
-------------------
1 | abc | short
2 | xyz | short_alias
3 | def | alias_1
I can achieve this using a window and row_number; is there any other efficient method to get the same result? In general, the third-column filter condition could be anything; in this case it is the length of the field.
Any help would be much appreciated.
Thank you.
All you need to do is use the inbuilt length function inside a window function, as follows:
from pyspark.sql import functions as f
from pyspark.sql import Window
windowSpec = Window.partitionBy('id', 'name').orderBy('length')
df.withColumn('length', f.length('alias'))\
.withColumn('length', f.row_number().over(windowSpec))\
.filter(f.col('length') == 1)\
.drop('length')\
.show(truncate=False)
which should give you
+---+----+-----------+
|id |name|alias |
+---+----+-----------+
|3 |def |alias_1 |
|1 |abc |short |
|2 |xyz |short_alias|
+---+----+-----------+
A solution without a window (not very pretty...), plus what is, in my opinion, the easiest option: an RDD solution:
from pyspark.sql import functions as F
from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
rdd = sc.parallelize([(1, "abc", "short-alias"),
                      (1, "abc", "short"),
                      (1, "abc", "ailas-long-1"),
                      (1, "abc", "another-long-alias"),
                      (2, "xyz", "same_length"),
                      (2, "xyz", "same_length1"),
                      (3, "def", "short_alias")])
df = hiveCtx.createDataFrame(rdd, ["id", "name", "alias"])

# DataFrame solution: join each row against the minimum alias length per (id, name)
len_df = df.groupBy(["id", "name"]).agg(F.min(F.length("alias")).alias("alias_len"))
df = df.withColumn("alias_len", F.length("alias"))
cond = ["alias_len", "id", "name"]
df.join(len_df, cond).show()

# RDD solution: keep the shortest alias per (id, name) key
print(rdd.map(lambda x: ((x[0], x[1]), x[2]))
         .reduceByKey(lambda x, y: x if len(x) < len(y) else y).collect())
Output:
+---------+---+----+-----------+
|alias_len| id|name| alias|
+---------+---+----+-----------+
| 11| 3| def|short_alias|
| 11| 2| xyz|same_length|
| 5| 1| abc| short|
+---------+---+----+-----------+
[((2, 'xyz'), 'same_length'), ((3, 'def'), 'short_alias'), ((1, 'abc'), 'short')]
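One more possible approach (a sketch, not part of either answer above): aggregate with min over a struct whose first field is the alias length, which avoids both the window and the join, relying on Spark's lexicographic ordering of structs:
shortest = (df.groupBy("id", "name")
            .agg(F.min(F.struct(F.length("alias").alias("len"), F.col("alias"))).alias("m"))
            .select("id", "name", "m.alias"))
shortest.show()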

How to clone column values in spark with their original order

I would like to clone the values of a column n times, keeping their original order.
For example, if I want to replicate the column below 2 times:
+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
+---+
What I am looking for :
+---+
| v |
+---+
| 1 |
| 2 |
| 3 |
| 1 |
| 2 |
| 3 |
+---+
Using explode or flatMap I can only get:
+---+
| v |
+---+
| 1 |
| 1 |
| 2 |
| 2 |
| 3 |
| 3 |
+---+
Code:
%spark
val ds = spark.range(1, 4)
val cloneCount = 2
val clonedDs = ds.flatMap(r => Seq.fill(cloneCount)(r))
clonedDs.show()
I could probably do a self-union of the dataset ds, but if cloneCount is huge, e.g. cloneCount = 200000, is unioning in a loop that many times a preferred solution?
You can try this:
// If the column values are expected to be in an increasing/decreasing sequence,
// then we add clone_index and col_value to the orderBy
// to get the values in the order they were in initially
val clonedDs = ds.flatMap(col_value => Range(0, cloneCount)
  .map(clone_index => (clone_index, col_value)))
clonedDs.orderBy("_1", "_2").map(_._2).show()

// If the column values are not expected to follow a sequence,
// then we add another rank column and use that in the orderBy along with clone_index
// to get the col_values in the desired order
import org.apache.spark.sql.functions.monotonically_increasing_id
val clonedDs2 = ds.withColumn("rank", monotonically_increasing_id())
  .flatMap(row => Range(0, cloneCount).map(
    clone_index => (clone_index, row.getLong(1), row.getLong(0))
  ))
clonedDs2.orderBy("_1", "_2").map(_._3).show()

Compute a value using multiple preceding rows

I have a DataFrame, that contains events ordered by timestamp.
Certain events mark the beginning of a new epoch:
+------+-----------+
| Time | Type |
+------+-----------+
| 0 | New Epoch |
| 2 | Foo |
| 3 | Bar |
| 11 | New Epoch |
| 12 | Baz |
+------+-----------+
I would like to add a column with an epoch number that, for simplicity, can be equal to the timestamp of the epoch's beginning:
+------+-----------+-------+
| Time | Type | Epoch |
+------+-----------+-------+
| 0 | New Epoch | 0 |
| 2 | Foo | 0 |
| 3 | Bar | 0 |
| 11 | New Epoch | 11 |
| 12 | Baz | 11 |
+------+-----------+-------+
How can I achieve this?
The naive algorithm would be to write a function that goes backwards until it finds a row with $"Type" === "New Epoch" and takes its $"Time". If I knew the maximum number of events within an epoch, I could probably implement this by calling lag() that many times, but for the general case I don't have any ideas.
Below is my solution. Briefly, I create a dataframe that represents the epoch intervals and then join it with the original dataframe.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val ds = List((0, "New Epoch"), (2, "Fo"), (3, "Bar"), (11, "New Epoch"), (12, "Baz")).toDF("Time", "Type")
val epoch = ds.filter($"Type" === "New Epoch")
val spec = Window.orderBy("Time")
val epochInterval = epoch.withColumn("next_epoch", lead($"Time", 1).over(spec))
val result = ds.as("left").join(epochInterval.as("right"),
    $"left.Time" >= $"right.Time" && ($"left.Time" < $"right.next_epoch" || $"right.next_epoch".isNull))
  .select($"left.Time", $"left.Type", $"right.Time".as("Epoch"))
result.show(false)
+----+---------+-----+
|Time|Type |Epoch|
+----+---------+-----+
|0 |New Epoch|0 |
|2 |Fo |0 |
|3 |Bar |0 |
|11 |New Epoch|11 |
|12 |Baz |11 |
+----+---------+-----+
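As a side note, this is itself a forward-fill problem, so the window technique from the top of this page also applies. Here is a rough PySpark sketch (assuming a SparkSession named spark), with the caveat that an unpartitioned window pulls all rows onto a single partition:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

events = spark.createDataFrame(
    [(0, "New Epoch"), (2, "Foo"), (3, "Bar"), (11, "New Epoch"), (12, "Baz")],
    ["Time", "Type"])

# carry the most recent 'New Epoch' time forward to every subsequent row
w = Window.orderBy("Time").rowsBetween(Window.unboundedPreceding, Window.currentRow)
events.withColumn(
    "Epoch",
    F.last(F.when(F.col("Type") == "New Epoch", F.col("Time")), ignorenulls=True).over(w)
).show()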
