Compute a value using multiple preceding rows - apache-spark

I have a DataFrame that contains events ordered by timestamp.
Certain events mark the beginning of a new epoch:
+------+-----------+
| Time | Type |
+------+-----------+
| 0 | New Epoch |
| 2 | Foo |
| 3 | Bar |
| 11 | New Epoch |
| 12 | Baz |
+------+-----------+
I would like to add a column with the epoch number which, for simplicity, can be equal to the timestamp of the epoch's beginning:
+------+-----------+-------+
| Time | Type | Epoch |
+------+-----------+-------+
| 0 | New Epoch | 0 |
| 2 | Foo | 0 |
| 3 | Bar | 0 |
| 11 | New Epoch | 11 |
| 12 | Baz | 11 |
+------+-----------+-------+
How can I achieve this?
The naive algorithm would be to write a function that walks backwards until it finds a row with $"Type" === "New Epoch" and takes its $"Time". If I knew the maximum number of events within an epoch, I could probably implement it by calling lag() that many times. But for the general case I don't have any ideas.

Below is my solution. Briefly, I create a DataFrame that represents the epoch intervals and then join it with the original DataFrame.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val ds = List((0, "New Epoch"), (2, "Foo"), (3, "Bar"), (11, "New Epoch"), (12, "Baz")).toDF("Time", "Type")
val epoch = ds.filter($"Type" === "New Epoch")
val spec = Window.orderBy("Time")
val epochInterval = epoch.withColumn("next_epoch", lead($"Time", 1).over(spec))
val result = ds.as("left").join(epochInterval.as("right"), $"left.Time" >= $"right.Time" && ($"left.Time" < $"right.next_epoch" || $"right.next_epoch".isNull))
.select($"left.Time", $"left.Type", $"right.Time".as("Epoch"))
result.show(false)
+----+---------+-----+
|Time|Type |Epoch|
+----+---------+-----+
|0 |New Epoch|0 |
|2 |Foo |0 |
|3 |Bar |0 |
|11 |New Epoch|11 |
|12 |Baz |11 |
+----+---------+-----+
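For reference, the same result can also be obtained without a join, by carrying the most recent epoch start forward with last(..., ignoreNulls = true) over a window ordered by Time. This is only a sketch against the ds defined above, and since the window is unpartitioned it pulls all rows into a single partition, which is fine for small data but will not scale:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// running frame: everything from the start of the ordering up to the current row
val runningSpec = Window.orderBy("Time").rowsBetween(Window.unboundedPreceding, Window.currentRow)

val withEpoch = ds.withColumn(
  "Epoch",
  // Time where Type is "New Epoch", null otherwise; the last non-null value so far is the epoch start
  last(when($"Type" === "New Epoch", $"Time"), ignoreNulls = true).over(runningSpec)
)
withEpoch.show(false)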

Related

Collapse DataFrame using Window functions

I would like to collapse the rows in a dataframe based on an ID column and count the number of records per ID using window functions. Doing this, I would like to avoid partitioning the window by ID, because this would result in a very large number of partitions.
I have a dataframe of the form
+----+-----------+-----------+-----------+
| ID | timestamp | metadata1 | metadata2 |
+----+-----------+-----------+-----------+
| 1 | 09:00 | ABC | apple |
| 1 | 08:00 | NULL | NULL |
| 1 | 18:00 | XYZ | apple |
| 2 | 07:00 | NULL | banana |
| 5 | 23:00 | ABC | cherry |
+----+-----------+-----------+-----------+
where I would like to keep only the records with the most recent timestamp per ID, such that I have
+----+-----------+-----------+-----------+-------+
| ID | timestamp | metadata1 | metadata2 | count |
+----+-----------+-----------+-----------+-------+
| 1 | 18:00 | XYZ | apple | 3 |
| 2 | 07:00 | NULL | banana | 1 |
| 5 | 23:00 | ABC | cherry | 1 |
+----+-----------+-----------+-----------+-------+
I have tried:
import sys
from pyspark.sql import Window
from pyspark.sql.functions import asc, desc, first, count, col, row_number

window = Window.orderBy([asc('ID'), desc('timestamp')])
window_count = Window.orderBy([asc('ID'), desc('timestamp')]).rowsBetween(-sys.maxsize, sys.maxsize)
columns_metadata = ['metadata1', 'metadata2']

df = df.select(
    *(first(col_name, ignorenulls=True).over(window).alias(col_name) for col_name in columns_metadata),
    count(col('ID')).over(window_count).alias('count')
)
df = df.withColumn("row_tmp", row_number().over(window)).filter(col('row_tmp') == 1).drop(col('row_tmp'))
which is in part based on How to select the first row of each group?
Without the use of pyspark.sql.Window.partitionBy, this does not give the desired output.
I only read that you wanted to avoid partitioning by ID after I posted this; this approach is the only one I could think of.
Your dataframe:
df = sqlContext.createDataFrame(
    [
        ('1', '09:00', 'ABC', 'apple'),
        ('1', '08:00', '', ''),
        ('1', '18:00', 'XYZ', 'apple'),
        ('2', '07:00', '', 'banana'),
        ('5', '23:00', 'ABC', 'cherry')
    ],
    ['ID', 'timestamp', 'metadata1', 'metadata2']
)
We can use rank, partitioning by ID and ordering by timestamp descending:
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w1 = Window.partitionBy(df['ID']).orderBy(F.desc('timestamp'))
w2 = Window.partitionBy(df['ID'])
df\
.withColumn("rank", F.rank().over(w1))\
.withColumn("count", F.count('ID').over(w2))\
.filter(F.col('rank') == 1)\
.select('ID', 'timestamp', 'metadata1', 'metadata2', 'count')\
.show()
+---+---------+---------+---------+-----+
| ID|timestamp|metadata1|metadata2|count|
+---+---------+---------+---------+-----+
| 1| 18:00| XYZ| apple| 3|
| 2| 07:00| | banana| 1|
| 5| 23:00| ABC| cherry| 1|
+---+---------+---------+---------+-----+

PySpark: Timeslice and split rows in dataframe with 5 minutes interval on a specific condition

I have a dataframe with the following columns:
+-----+----------+--------------------------+-----------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-----------+
| 0 | 128 | 2019-12-03 12:00:00.0 | 0 |
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 2 | 128 | 2019-12-03 12:37:00.0 | 0 |
| 3 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:17:00.0 | 0 |
+-----+----------+--------------------------+-----------+
I am trying to split the timestamp column into rows of 5 minute time intervals for indicator values which are not 0.
Explanation:
The first entry has timestamp = 2019-12-03 12:00:00.0 and indicator = 0, so do nothing.
Moving on to the next entry with timestamp = 2019-12-03 12:30:00.0 and indicator = 1, I want to split the timestamp into rows at 5-minute intervals until we reach the next entry, which is timestamp = 2019-12-03 12:37:00.0 with indicator = 0.
If there is a case where timestamp = 2019-12-03 13:15:00.0 has indicator = 1 and the next timestamp = 2019-12-03 13:17:00.0 has indicator = 0, I'd like to split the row treating both times as having indicator 1, since 13:17:00.0 falls between 13:15:00.0 and 13:20:00.0, as shown below.
How can I achieve this with PySpark?
Expected Output:
+-----+----------+--------------------------+-------------+
|id | sourceid | timestamp | indicator |
+-----+----------+--------------------------+-------------+
| 1 | 128 | 2019-12-03 12:30:00.0 | 1 |
| 1 | 128 | 2019-12-03 12:35:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:15:00.0 | 1 |
| 4 | 128 | 2019-12-03 13:20:00.0 | 1 |
+-----+----------+--------------------------+-------------+
IIUC, you can filter rows based on the indicators of the current and the next rows, and then use array + explode to create new rows (for testing purposes, I added some more rows to your original example):
from pyspark.sql import Window, functions as F
w1 = Window.partitionBy('sourceid').orderBy('timestamp')
# add a flag to check if the next indicator is '0'
df1 = df.withColumn('next_indicator_is_0', F.lead('indicator').over(w1) == 0)
df1.show(truncate=False)
+---+--------+---------------------+---------+-------------------+
|id |sourceid|timestamp |indicator|next_indicator_is_0|
+---+--------+---------------------+---------+-------------------+
|0 |128 |2019-12-03 12:00:00.0|0 |false |
|1 |128 |2019-12-03 12:30:00.0|1 |true |
|2 |128 |2019-12-03 12:37:00.0|0 |false |
|3 |128 |2019-12-03 13:12:00.0|1 |false |
|4 |128 |2019-12-03 13:15:00.0|1 |true |
|5 |128 |2019-12-03 13:17:00.0|0 |false |
|6 |128 |2019-12-03 13:20:00.0|1 |null |
+---+--------+---------------------+---------+-------------------+
df1.filter("indicator = 1 AND next_indicator_is_0") \
.withColumn('timestamp', F.expr("explode(array(`timestamp`, `timestamp` + interval 5 minutes))")) \
.drop('next_indicator_is_0') \
.show(truncate=False)
+---+--------+---------------------+---------+
|id |sourceid|timestamp |indicator|
+---+--------+---------------------+---------+
|1 |128 |2019-12-03 12:30:00.0|1 |
|1 |128 |2019-12-03 12:35:00 |1 |
|4 |128 |2019-12-03 13:15:00.0|1 |
|4 |128 |2019-12-03 13:20:00 |1 |
+---+--------+---------------------+---------+
Note: you can reset the id column by using F.row_number().over(w1) or F.monotonically_increasing_id(), depending on your requirements.

Getting a column as concatenated column from a reference table and primary id's from a Dataset

I'm trying to get concatenated data as a single column using the datasets below.
Sample DS:
val df = sc.parallelize(Seq(
  ("a", 1, 2, 3),
  ("b", 4, 6, 5)
)).toDF("value", "id1", "id2", "id3")
+-------+-----+-----+-----+
| value | id1 | id2 | id3 |
+-------+-----+-----+-----+
| a | 1 | 2 | 3 |
| b | 4 | 6 | 5 |
+-------+-----+-----+-----+
from the Reference Dataset
+----+----------+--------+
| id | descr | parent|
+----+----------+--------+
| 1 | apple | fruit |
| 2 | banana | fruit |
| 3 | cat | animal |
| 4 | dog | animal |
| 5 | elephant | animal |
| 6 | Flight | object |
+----+----------+--------+
val ref = sc.parallelize(Seq(
  (1, "apple", "fruit"),
  (2, "banana", "fruit"),
  (3, "cat", "animal"),
  (4, "dog", "animal"),
  (5, "elephant", "animal"),
  (6, "Flight", "object")
)).toDF("id", "descr", "parent")
I am trying to get the desired output below:
+-----------------------+--------------------------+
| desc | parent |
+-----------------------+--------------------------+
| apple+banana+cat/M | fruit+fruit+animal/M |
| dog+Flight+elephant/M | animal+object+animal/M |
+-----------------------+--------------------------+
Also, I need to concatenate id2 and id3 only if they are not null; otherwise only id1 should be used.
I've been breaking my head over the solution.
Exploding the first dataframe df and joining it to ref, followed by a groupBy, should work as you expect:
import org.apache.spark.sql.functions._

val dfNew = df.withColumn("id", explode(array("id1", "id2", "id3")))
  .select("id", "value")

ref.join(dfNew, Seq("id"))
  .groupBy("value")
  .agg(
    concat_ws("+", collect_list("descr")) as "desc",
    concat_ws("+", collect_list("parent")) as "parent"
  )
  .drop("value")
  .show()
Output:
+-------------------+--------------------+
|desc |parent |
+-------------------+--------------------+
|Flight+elephant+dog|object+animal+animal|
|apple+cat+banana |fruit+animal+fruit |
+-------------------+--------------------+
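Regarding the additional requirement that id2 and id3 should only be used when they are not null: one option (a sketch, assuming nullable id2/id3 columns in the real data) is to drop the null ids right after the explode, so that such rows contribute only their id1; the join/groupBy part above stays unchanged:

val dfNew = df
  .withColumn("id", explode(array("id1", "id2", "id3")))
  .filter($"id".isNotNull) // rows with null id2/id3 fall back to id1 alone
  .select("id", "value")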

How to operate global variable in Spark SQL dataframe row by row sequentially on Spark cluster?

I have a dataset which looks like this:
+-------+------+-------+
|groupid|rownum|column2|
+-------+------+-------+
| 1 | 1 | 7 |
| 1 | 2 | 9 |
| 1 | 3 | 8 |
| 1 | 4 | 5 |
| 1 | 5 | 1 |
| 1 | 6 | 0 |
| 1 | 7 | 15 |
| 1 | 8 | 1 |
| 1 | 9 | 13 |
| 1 | 10 | 20 |
| 2 | 1 | 8 |
| 2 | 2 | 1 |
| 2 | 3 | 4 |
| 2 | 4 | 2 |
| 2 | 5 | 19 |
| 2 | 6 | 11 |
| 2 | 7 | 5 |
| 2 | 8 | 6 |
| 2 | 9 | 15 |
| 2 | 10 | 8 |
... and more rows follow.
I want to add a new column "column3": as long as consecutive column2 values are less than 10, they are assigned the same number, starting with 1. When a value larger than 10 appears in column2, that row is dropped, and the column3 value of the following rows increases by 1. For example, for groupid = 1, the column3 value for rownum 1 to 6 will be 1 and rownum 7 will be dropped; the column3 value of rownum 8 will be 2 and rownum 9 and 10 will be dropped. After the procedure, the table will look like this:
+-------+------+-------+-------+
|groupid|rownum|column2|column3|
+-------+------+-------+-------+
| 1 | 1 | 7 | 1 |
| 1 | 2 | 9 | 1 |
| 1 | 3 | 8 | 1 |
| 1 | 4 | 5 | 1 |
| 1 | 5 | 1 | 1 |
| 1 | 6 | 0 | 1 |
| 1 | 7 | 15 | drop | this row will be dropped, in fact not exist
| 1 | 8 | 1 | 2 |
| 1 | 9 | 13 | drop | like above
| 1 | 10 | 20 | drop | like above
| 2 | 1 | 8 | 1 |
| 2 | 2 | 1 | 1 |
| 2 | 3 | 4 | 1 |
| 2 | 4 | 2 | 1 |
| 2 | 5 | 19 | drop | ...
| 2 | 6 | 11 | drop | ...
| 2 | 7 | 5 | 2 |
| 2 | 8 | 6 | 2 |
| 2 | 9 | 15 | drop | ...
| 2 | 10 | 8 | 3 |
In our project, the dataset is expressed as a DataFrame in Spark SQL.
I tried to solve this problem with a UDF in this way:
import org.apache.spark.sql.functions.{callUDF, col}

var last_rowNum: Int = 1
var column3_Num: Int = 1

def assign_column3_Num(rowNum: Int): Int = {
  if (rowNum == 1) { // first row of a group: just assign 1
    column3_Num = 1
    last_rowNum = 1
    return column3_Num
  }
  /* if the difference between row numbers is 1, the rows get the same column3
   * value; if not, column3_Num is incremented so they differ
   */
  if (rowNum - last_rowNum == 1) {
    last_rowNum = rowNum
    column3_Num
  } else {
    column3_Num += 1
    last_rowNum = rowNum
    column3_Num
  }
}

spark.sqlContext.udf.register("assign_column3_Num", assign_column3_Num _)

df.filter("column2 <= 10") // keep only the small rows (drop the larger ones)
  .withColumn("column3", callUDF("assign_column3_Num", col("rownum"))) // add column3 based on the row number
As you can see, I use global variables. However, this only works in Spark local[1] mode. If I use local[8] or yarn-client, the result is totally wrong! This is because of Spark's execution model: the tasks update the global variables without respecting groupid or row order.
So the question is: how can I assign the right numbers when Spark runs on a cluster?
Should I use a UDF, a UDAF, RDDs, or something else?
thank you!
You can achieve your requirement by defining a udf function as below (comments are given for clarity)
import org.apache.spark.sql.functions._
def createNewCol = udf((rownum: collection.mutable.WrappedArray[Int], column2: collection.mutable.WrappedArray[Int]) => { // udf function
  var value = 1                                      // value for column3
  var previousValue = 0                              // previous column2 value, for checking the condition
  var arrayBuffer = Array.empty[(Int, Int, Int)]     // initialization of the array to be returned
  for ((a, b) <- rownum.zip(column2)) {              // zip the collected lists and loop
    if (b > 10 && previousValue < 10)                // checking the condition for column3
      value = value + 1                              // adding 1 for column3
    arrayBuffer = arrayBuffer ++ Array((a, b, value)) // adding the values
    previousValue = b
  }
  arrayBuffer
})
Now, to get the desired result with the algorithm defined in the udf function, collect the values of rownum and column2 grouped by groupid and sorted by rownum, and then call the udf function. The next steps are to explode the result and select the necessary columns (commented for clarity):
df.orderBy("rownum").groupBy("groupid").agg(collect_list("rownum").as("rownum"), collect_list("column2").as("column2")) //collecting in order for generating values for column3
.withColumn("new", createNewCol(col("rownum"), col("column2"))) //calling udf function and storing the array of struct(rownum, column2, column3) in new column
.drop("rownum", "column2") //droping unnecessary columns
.withColumn("new", explode(col("new"))) //exploding the new column array so that each row can have struct(rownum, column2, column3)
.select(col("groupid"), col("new._1").as("rownum"), col("new._2").as("column2"), col("new._3").as("column3")) //selecting as separate columns
.filter(col("column2") < 10) // filtering the rows with column2 greater than 10
.show(false)
You should have your desired output as
+-------+------+-------+-------+
|groupid|rownum|column2|column3|
+-------+------+-------+-------+
|1 |1 |7 |1 |
|1 |2 |9 |1 |
|1 |3 |8 |1 |
|1 |4 |5 |1 |
|1 |5 |1 |1 |
|1 |6 |0 |1 |
|1 |8 |1 |2 |
|2 |1 |8 |1 |
|2 |2 |1 |1 |
|2 |3 |4 |1 |
|2 |4 |2 |1 |
|2 |7 |5 |2 |
|2 |8 |6 |2 |
|2 |10 |8 |3 |
+-------+------+-------+-------+
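As a side note, if collecting whole groups into arrays is a memory concern, the same numbering can be sketched with window functions only (an alternative idea, not the answer above): flag the first row of every run of column2 values greater than 10, take a running sum of those flags per group, and then drop the large rows. A sketch, with values equal to 10 treated as "small" to match the "larger than 10" rule:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byGroup = Window.partitionBy("groupid").orderBy("rownum")
val running = byGroup.rowsBetween(Window.unboundedPreceding, Window.currentRow)

val result = df
  // 1 at the first row of each run of column2 > 10, 0 elsewhere
  .withColumn("runStart",
    when(col("column2") > 10 && coalesce(lag(col("column2"), 1).over(byGroup), lit(0)) <= 10, 1).otherwise(0))
  // column3 = 1 + number of such runs seen so far within the group
  .withColumn("column3", sum(col("runStart")).over(running) + 1)
  .filter(col("column2") <= 10) // finally drop the large rows
  .drop("runStart")

result.show(false)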

Pyspark : forward fill with last observation for a DataFrame

Using Spark 1.5.1,
I've been trying to forward fill null values with the last known observation for one column of my DataFrame.
It is possible to start with a null value, and in this case I would like to backward fill that null value with the first known observation. However, if that complicates the code too much, this point can be skipped.
In this post, a solution in Scala was provided for a very similar problem by zero323.
But I don't know Scala and I haven't managed to "translate" it into PySpark API code. Is it possible to do this with PySpark?
Thanks for your help.
Below, a simple example sample input:
| cookie_ID | Time | User_ID
| ------------- | -------- |-------------
| 1 | 2015-12-01 | null
| 1 | 2015-12-02 | U1
| 1 | 2015-12-03 | U1
| 1 | 2015-12-04 | null
| 1 | 2015-12-05 | null
| 1 | 2015-12-06 | U2
| 1 | 2015-12-07 | null
| 1 | 2015-12-08 | U1
| 1 | 2015-12-09 | null
| 2 | 2015-12-03 | null
| 2 | 2015-12-04 | U3
| 2 | 2015-12-05 | null
| 2 | 2015-12-06 | U4
And the expected output:
| cookie_ID | Time | User_ID
| ------------- | -------- |-------------
| 1 | 2015-12-01 | U1
| 1 | 2015-12-02 | U1
| 1 | 2015-12-03 | U1
| 1 | 2015-12-04 | U1
| 1 | 2015-12-05 | U1
| 1 | 2015-12-06 | U2
| 1 | 2015-12-07 | U2
| 1 | 2015-12-08 | U1
| 1 | 2015-12-09 | U1
| 2 | 2015-12-03 | U3
| 2 | 2015-12-04 | U3
| 2 | 2015-12-05 | U3
| 2 | 2015-12-06 | U4
Another workaround to get this working is to try something like this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
window = (
Window
.partitionBy('cookie_id')
.orderBy('Time')
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
final = (
joined
.withColumn('UserIDFilled', F.last('User_ID', ignorenulls=True).over(window))
)
What this does is construct your window based on the partition key and the order column, and tell the window to look back over all rows up to the current row. Finally, at each row, it returns the last value that is not null (which, remember, includes the current row according to this window definition).
The partitioned example code from Spark / Scala: forward fill with last observation, shown here in PySpark, only works for data that can be partitioned.
Load the data
values = [
(1, "2015-12-01", None),
(1, "2015-12-02", "U1"),
(1, "2015-12-02", "U1"),
(1, "2015-12-03", "U2"),
(1, "2015-12-04", None),
(1, "2015-12-05", None),
(2, "2015-12-04", None),
(2, "2015-12-03", None),
(2, "2015-12-02", "U3"),
(2, "2015-12-05", None),
]
rdd = sc.parallelize(values)
df = rdd.toDF(["cookie_id", "c_date", "user_id"])
df = df.withColumn("c_date", df.c_date.cast("date"))
df.show()
The DataFrame is
+---------+----------+-------+
|cookie_id| c_date|user_id|
+---------+----------+-------+
| 1|2015-12-01| null|
| 1|2015-12-02| U1|
| 1|2015-12-02| U1|
| 1|2015-12-03| U2|
| 1|2015-12-04| null|
| 1|2015-12-05| null|
| 2|2015-12-04| null|
| 2|2015-12-03| null|
| 2|2015-12-02| U3|
| 2|2015-12-05| null|
+---------+----------+-------+
Column used to sort the partitions
# get the sort key
def getKey(item):
    return item.c_date
The fill function. Can be used to fill in multiple columns if necessary.
# fill function
def fill(x):
    out = []
    last_val = None
    for v in x:
        if v["user_id"] is None:
            data = [v["cookie_id"], v["c_date"], last_val]
        else:
            data = [v["cookie_id"], v["c_date"], v["user_id"]]
            last_val = v["user_id"]
        out.append(data)
    return out
Convert to rdd, partition, sort and fill the missing values
# Partition the data
rdd = df.rdd.groupBy(lambda x: x.cookie_id).mapValues(list)
# Sort the data by date
rdd = rdd.mapValues(lambda x: sorted(x, key=getKey))
# fill missing value and flatten
rdd = rdd.mapValues(fill).flatMapValues(lambda x: x)
# discard the key
rdd = rdd.map(lambda v: v[1])
Convert back to DataFrame
df_out = sqlContext.createDataFrame(rdd)
df_out.show()
The output is
+---+----------+----+
| _1| _2| _3|
+---+----------+----+
| 1|2015-12-01|null|
| 1|2015-12-02| U1|
| 1|2015-12-02| U1|
| 1|2015-12-03| U2|
| 1|2015-12-04| U2|
| 1|2015-12-05| U2|
| 2|2015-12-02| U3|
| 2|2015-12-03| U3|
| 2|2015-12-04| U3|
| 2|2015-12-05| U3|
+---+----------+----+
Hope you find this forward fill function useful. It is written using native PySpark functions; neither UDFs nor RDDs are used (both of them are very slow, especially UDFs!).
Let's use the example provided by Sid.
values = [
(1, "2015-12-01", None),
(1, "2015-12-02", "U1"),
(1, "2015-12-02", "U1"),
(1, "2015-12-03", "U2"),
(1, "2015-12-04", None),
(1, "2015-12-05", None),
(2, "2015-12-04", None),
(2, "2015-12-03", None),
(2, "2015-12-02", "U3"),
(2, "2015-12-05", None),
]
df = spark.createDataFrame(values, ['cookie_ID', 'Time', 'User_ID'])
Functions:
from pyspark.sql import Window
from pyspark.sql.functions import col, when, lit, sum, collect_list
from pyspark.sql.types import StringType


def cum_sum(df, sum_col, order_col, cum_sum_col_nm='cum_sum'):
    '''Find the cumulative sum of a column.
    Parameters
    ----------
    sum_col : String
        Column to perform the cumulative sum on.
    order_col : List
        Column/columns to sort by for the cumulative sum.
    cum_sum_col_nm : String
        The name of the resulting cum_sum column.
    Returns
    -------
    df : DataFrame
        Dataframe with the additional "cum_sum_col_nm" column.
    '''
    df = df.withColumn('tmp', lit('tmp'))
    windowval = (Window.partitionBy('tmp')
                 .orderBy(order_col)
                 .rangeBetween(Window.unboundedPreceding, 0))
    df = df.withColumn('cum_sum', sum(sum_col).over(windowval).alias('cumsum').cast(StringType()))
    df = df.drop('tmp')
    return df


def forward_fill(df, order_col, fill_col, fill_col_name=None):
    '''Forward fill a column, ordered by a column/set of columns (order_col).
    Parameters
    ----------
    df : DataFrame
    order_col : String or list of strings
    fill_col : String (only works for a single column in this version.)
    Returns
    -------
    df : DataFrame
        Returns df with the filled column.
    '''
    # "value" and "constant" are tmp columns created to enable the forward fill.
    df = df.withColumn('value', when(col(fill_col).isNull(), 0).otherwise(1))
    df = cum_sum(df, 'value', order_col).drop('value')
    df = df.withColumn(fill_col,
                       when(col(fill_col).isNull(), 'constant').otherwise(col(fill_col)))
    win = (Window.partitionBy('cum_sum')
           .orderBy(order_col))
    if not fill_col_name:
        fill_col_name = 'ffill_{}'.format(fill_col)
    df = df.withColumn(fill_col_name, collect_list(fill_col).over(win)[0])
    df = df.drop('cum_sum')
    df = df.withColumn(fill_col_name, when(col(fill_col_name) == 'constant', None).otherwise(col(fill_col_name)))
    df = df.withColumn(fill_col, when(col(fill_col) == 'constant', None).otherwise(col(fill_col)))
    return df
Let's see the results.
ffilled_df = forward_fill(df,
order_col=['cookie_ID', 'Time'],
fill_col='User_ID',
fill_col_name = 'User_ID_ffil')
ffilled_df.sort(['cookie_ID', 'Time']).show()
from pyspark.sql import Window
from pyspark.sql import functions as F

# Forward filling within each cookie
w1 = Window.partitionBy('cookie_id').orderBy('c_date').rowsBetween(Window.unboundedPreceding, 0)
# Backward filling (first non-null value over the whole partition) as a fallback
w2 = w1.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

final_df = df.withColumn('UserIDFilled', F.coalesce(F.last('user_id', True).over(w1),
                                                    F.first('user_id', True).over(w2)))
final_df.orderBy('cookie_id', 'c_date').show(truncate=False)
+---------+----------+-------+------------+
|cookie_id|c_date |user_id|UserIDFilled|
+---------+----------+-------+------------+
|1 |2015-12-01|null |U1 |
|1 |2015-12-02|U1 |U1 |
|1 |2015-12-02|U1 |U1 |
|1 |2015-12-03|U2 |U2 |
|1 |2015-12-04|null |U2 |
|1 |2015-12-05|null |U2 |
|2 |2015-12-02|U3 |U3 |
|2 |2015-12-03|null |U3 |
|2 |2015-12-04|null |U3 |
|2 |2015-12-05|null |U3 |
+---------+----------+-------+------------+
Cloudera has released a library called spark-ts that offers a suite of useful methods for processing time series and sequential data in Spark. This library supports a number of time-windowed methods for imputing data points based on other data in the sequence.
http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/
