Conditionally get previous row value - apache-spark

I have the following dataset
columns = ['id','trandatetime','code','zip']
data = [('1','2020-02-06T17:33:21.000+0000', '0','35763'),('1','2020-02-06T17:39:55.000+0000', '0','35763'), ('1','2020-02-07T06:06:42.000+0000', '0','35741'), ('1','2020-02-07T06:28:17.000+0000', '4','94043'),('1','2020-02-07T07:12:13.000+0000','0','35802'), ('1','2020-02-07T08:23:29.000+0000', '0','30738')]
df = spark.createDataFrame(data).toDF(*columns)
df= df.withColumn("trandatetime",to_timestamp("trandatetime"))
+---+--------------------+----+-----+
| id| trandatetime|code| zip|
+---+--------------------+----+-----+
| 1|2020-02-06T17:33:...| 0|35763|
| 1|2020-02-06T17:39:...| 0|35763|
| 1|2020-02-07T06:06:...| 0|35741|
| 1|2020-02-07T06:28:...| 4|94043|
| 1|2020-02-07T07:12:...| 0|35802|
| 1|2020-02-07T08:23:...| 0|30738|
+---+--------------------+----+-----+
I am trying to get the zip of the previous row that has code = 0, looking back over a time period (here, the preceding 24 hours).
This is my attempt. The row where code is 4 gets a value, but it should be null. The row after it is null, but it should have a value.
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('id').orderBy('timestamp').rangeBetween(-60*60*24,-1)
df = df.withColumn("Card_Present_Last_Zip",F.last(F.when(col("code") == '0', col("zip"))).over(w))
+---+--------------------+----+-----+----------+---------------------+
| id| trandatetime|code| zip| timestamp|Card_Present_Last_Zip|
+---+--------------------+----+-----+----------+---------------------+
| 1|2020-02-06T17:33:...| 0|35763|1581010401| null|
| 1|2020-02-06T17:39:...| 0|35763|1581010795| 35763|
| 1|2020-02-07T06:06:...| 0|35741|1581055602| 35763|
| 1|2020-02-07T06:28:...| 4|94043|1581056897| 35741|
| 1|2020-02-07T07:12:...| 0|35802|1581059533| null|
| 1|2020-02-07T08:23:...| 0|30738|1581063809| 35802|
+---+--------------------+----+-----+----------+---------------------+

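Put the last expression (with ignorenulls set to True) inside another when clause, so that the window result is only kept for rows with code = '0'. Without ignorenulls, last returns the null produced by the inner when for the code = 4 row, which is why the row after it came out null: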
w = Window.partitionBy('id').orderBy('timestamp').rangeBetween(-60*60*24,-1)
df = (df
.withColumn("timestamp", F.unix_timestamp("trandatetime"))
.withColumn("Card_Present_Last_Zip", F.when(F.col("code") == '0', F.last(F.when(F.col("code") == '0', F.col("zip")), ignorenulls=True).over(w)))
)
df.show()
# +---+-------------------+----+-----+----------+---------------------+
# | id| trandatetime|code| zip| timestamp|Card_Present_Last_Zip|
# +---+-------------------+----+-----+----------+---------------------+
# | 1|2020-02-06 17:33:21| 0|35763|1581010401| null|
# | 1|2020-02-06 17:39:55| 0|35763|1581010795| 35763|
# | 1|2020-02-07 06:06:42| 0|35741|1581055602| 35763|
# | 1|2020-02-07 06:28:17| 4|94043|1581056897| null|
# | 1|2020-02-07 07:12:13| 0|35802|1581059533| 35741|
# | 1|2020-02-07 08:23:29| 0|30738|1581063809| 35802|
# +---+-------------------+----+-----+----------+---------------------+
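To make the window semantics concrete, here is a minimal plain-Python sketch (no Spark involved) of what the expression above computes for the sample data. The names `rows`, `DAY_SECONDS` and `last_code0_zip` are made up for this illustration, and the timestamps are copied from the output table.

```python
# Plain-Python model of:
#   when(code == '0', last(when(code == '0', zip), ignorenulls=True).over(w))
# where w is a range window over the previous 24 hours, excluding the current timestamp.
DAY_SECONDS = 60 * 60 * 24

rows = [  # (unix timestamp, code, zip) for id '1', already sorted by timestamp
    (1581010401, "0", "35763"),
    (1581010795, "0", "35763"),
    (1581055602, "0", "35741"),
    (1581056897, "4", "94043"),
    (1581059533, "0", "35802"),
    (1581063809, "0", "30738"),
]


def last_code0_zip(rows):
    result = []
    for ts, code, _ in rows:
        if code != "0":
            # outer when(): rows with another code get null
            result.append(None)
            continue
        # rangeBetween(-86400, -1): rows whose timestamp lies in [ts - 86400, ts - 1];
        # the inner when() plus ignorenulls=True skips rows whose code is not '0'
        candidates = [z for (t, c, z) in rows
                      if ts - DAY_SECONDS <= t <= ts - 1 and c == "0"]
        result.append(candidates[-1] if candidates else None)
    return result


card_present_last_zip = last_code0_zip(rows)
print(card_present_last_zip)
```

The row with code 4 yields None, and the row after it picks up 35741 because the code 4 row is skipped rather than treated as the last value.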

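You can use the window function lag(), which reads the zip of the immediately preceding row (whatever its code, and with no time limit):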
window_spec = Window.partitionBy('id').orderBy('timestamp')
df.withColumn('prev_zip', lag('zip').over(window_spec)).\
withColumn('Card_Present_Last_Zip', when(col('code') == 0, col('prev_zip')).otherwise(None)).show()
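As a rough check of what the lag() approach returns on the question's sample data, here is a plain-Python sketch (no Spark). The helper name `lag_when_code0` is hypothetical; it only mimics lag('zip') followed by when(code == 0, ...).

```python
rows = [  # (code, zip) ordered by timestamp, taken from the question
    ("0", "35763"),
    ("0", "35763"),
    ("0", "35741"),
    ("4", "94043"),
    ("0", "35802"),
    ("0", "30738"),
]


def lag_when_code0(rows):
    out = []
    prev_zip = None  # lag() is null on the first row of a partition
    for code, zip_code in rows:
        # keep the previous row's zip only when the current row has code '0'
        out.append(prev_zip if code == "0" else None)
        prev_zip = zip_code
    return out


lag_result = lag_when_code0(rows)
print(lag_result)
```

The row after the code 4 row receives 94043 here, whereas the question expects the zip of the last code 0 row (35741), so this approach only fits if the previous row's code does not matter.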

Related

Create sequential unique id for each group

I'm trying to find a PySpark equivalent of the following snippet (reference), which creates a unique id for every unique combination of two columns.
Pandas approach:
df['my_id'] = df.groupby(['foo', 'bar'], sort=False).ngroup() + 1
I tried the following, but it's creating more ids than required:
df = df.withColumn("my_id", F.row_number().over(Window.orderBy('foo', 'bar')))
Instead of row_number, use dense_rank:
from pyspark.sql import functions as F, Window
df = spark.createDataFrame(
[('r1', 'ph1'),
('r1', 'ph1'),
('r1', 'ph2'),
('s4', 'ph3'),
('s3', 'ph2'),
('s3', 'ph2')],
['foo', 'bar'])
df = df.withColumn("my_id", F.dense_rank().over(Window.orderBy('foo', 'bar')))
df.show()
# +---+---+-----+
# |foo|bar|my_id|
# +---+---+-----+
# | r1|ph1| 1|
# | r1|ph1| 1|
# | r1|ph2| 2|
# | s3|ph2| 3|
# | s3|ph2| 3|
# | s4|ph3| 4|
# +---+---+-----+
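The difference between the two numbering schemes can be sketched in plain Python (no Spark or pandas needed). The function names below are invented for this illustration: `dense_rank` mimics a dense rank over the sorted (foo, bar) pairs, and `ngroup_first_seen` mimics numbering groups in order of first appearance, as pandas does with sort=False.

```python
rows = [
    ("r1", "ph1"),
    ("r1", "ph1"),
    ("r1", "ph2"),
    ("s4", "ph3"),
    ("s3", "ph2"),
    ("s3", "ph2"),
]


def dense_rank(rows):
    # identical pairs share a rank; ranks follow the sorted order of the pairs
    rank_of = {key: i for i, key in enumerate(sorted(set(rows)), start=1)}
    return [rank_of[r] for r in rows]


def ngroup_first_seen(rows):
    # ids are handed out in the order in which each pair first appears
    ids = {}
    return [ids.setdefault(r, len(ids) + 1) for r in rows]


print(dense_rank(rows))         # ids in the original row order
print(ngroup_first_seen(rows))
```

Both give one id per unique pair, but the ids themselves can differ: with a sorted rank, ('s3', 'ph2') is numbered before ('s4', 'ph3') even though it appears later in the data.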

Spark: Find the value with the highest occurrence per group over rolling time window

Starting from the following spark data frame:
from io import StringIO
import pandas as pd
from pyspark.sql.functions import col
pd_df = pd.read_csv(StringIO("""device_id,read_date,id,count
device_A,2017-08-05,4041,3
device_A,2017-08-06,4041,3
device_A,2017-08-07,4041,4
device_A,2017-08-08,4041,3
device_A,2017-08-09,4041,3
device_A,2017-08-10,4041,1
device_A,2017-08-10,4045,2
device_A,2017-08-11,4045,3
device_A,2017-08-12,4045,3
device_A,2017-08-13,4045,3"""),infer_datetime_format=True, parse_dates=['read_date'])
df = spark.createDataFrame(pd_df).withColumn('read_date', col('read_date').cast('date'))
df.show()
Output:
+--------------+----------+----+-----+
|device_id | read_date| id|count|
+--------------+----------+----+-----+
| device_A|2017-08-05|4041| 3|
| device_A|2017-08-06|4041| 3|
| device_A|2017-08-07|4041| 4|
| device_A|2017-08-08|4041| 3|
| device_A|2017-08-09|4041| 3|
| device_A|2017-08-10|4041| 1|
| device_A|2017-08-10|4045| 2|
| device_A|2017-08-11|4045| 3|
| device_A|2017-08-12|4045| 3|
| device_A|2017-08-13|4045| 3|
+--------------+----------+----+-----+
I would like to find the most frequent id for each (device_id, read_date) combination, over a 3 day rolling window. For each group of rows selected by the time window, I need to find the most frequent id by summing up the counts per id, then return the top id.
Expected Output:
+--------------+----------+----+
|device_id | read_date| id|
+--------------+----------+----+
| device_A|2017-08-05|4041|
| device_A|2017-08-06|4041|
| device_A|2017-08-07|4041|
| device_A|2017-08-08|4041|
| device_A|2017-08-09|4041|
| device_A|2017-08-10|4041|
| device_A|2017-08-11|4045|
| device_A|2017-08-12|4045|
| device_A|2017-08-13|4045|
+--------------+----------+----+
I am starting to think this is only possible using a custom aggregation function. Since spark 2.3 is not out I will have to write this in Scala or use collect_list. Am I missing something?
Add window:
from pyspark.sql.functions import window, sum as sum_, date_add
df_w = df.withColumn(
"read_date", window("read_date", "3 days", "1 day")["start"].cast("date")
)
# Then handle the counts
df_w = df_w.groupBy('device_id', 'read_date', 'id').agg(sum_('count').alias('count'))
Then use one of the solutions from "Find maximum row per group in Spark DataFrame", for example:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
rolling_window = 3
top_df = (
df_w
.withColumn(
"rn",
row_number().over(
Window.partitionBy("device_id", "read_date")
.orderBy(col("count").desc())
)
)
.where(col("rn") == 1)
.orderBy("read_date")
.drop("rn")
)
# results are calculated on the start of the time window - adjust read_date as needed
final_df = top_df.withColumn('read_date', date_add('read_date', rolling_window - 1))
final_df.show()
# +---------+----------+----+-----+
# |device_id| read_date| id|count|
# +---------+----------+----+-----+
# | device_A|2017-08-05|4041| 3|
# | device_A|2017-08-06|4041| 6|
# | device_A|2017-08-07|4041| 10|
# | device_A|2017-08-08|4041| 10|
# | device_A|2017-08-09|4041| 10|
# | device_A|2017-08-10|4041| 7|
# | device_A|2017-08-11|4045| 5|
# | device_A|2017-08-12|4045| 8|
# | device_A|2017-08-13|4045| 9|
# | device_A|2017-08-14|4045| 6|
# | device_A|2017-08-15|4045| 3|
# +---------+----------+----+-----+
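The reasoning behind this result can be sketched without Spark. The sketch below assumes daily windows aligned to calendar days and a single device, so device_id is omitted. Each reading falls into three overlapping 3-day windows, and each window is labelled by its last day, which is what the date_add step does. The names `readings` and `top_id_per_window` are made up for this illustration.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

ROLLING_DAYS = 3

readings = [  # (read_date, id, count) for device_A, from the question
    (date(2017, 8, 5), 4041, 3),
    (date(2017, 8, 6), 4041, 3),
    (date(2017, 8, 7), 4041, 4),
    (date(2017, 8, 8), 4041, 3),
    (date(2017, 8, 9), 4041, 3),
    (date(2017, 8, 10), 4041, 1),
    (date(2017, 8, 10), 4045, 2),
    (date(2017, 8, 11), 4045, 3),
    (date(2017, 8, 12), 4045, 3),
    (date(2017, 8, 13), 4045, 3),
]


def top_id_per_window(readings, rolling_days=ROLLING_DAYS):
    totals = defaultdict(Counter)  # window end date -> summed count per id
    for read_date, id_, count in readings:
        # a reading on day d belongs to the windows ending on d, d+1, ..., d+rolling_days-1
        for offset in range(rolling_days):
            totals[read_date + timedelta(days=offset)][id_] += count
    # keep the (id, summed count) pair with the largest sum for each window
    return {end: counts.most_common(1)[0] for end, counts in sorted(totals.items())}


top = top_id_per_window(readings)
for end, (id_, total) in top.items():
    print(end, id_, total)
```

This also shows why the Spark output has rows for 2017-08-14 and 2017-08-15: the last readings still contribute to windows that end after the final read_date. Filter those rows out if only dates present in the input are wanted.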
I managed to find a very inefficient solution. Hopefully someone can spot improvements to avoid the python udf and call to collect_list.
from collections import Counter
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, first, udf
from pyspark.sql.types import IntegerType
def top_id(ids, counts):
    # Sum the counts per id and return the id with the largest total
    c = Counter()
    for cnid, count in zip(ids, counts):
        c[cnid] += count
    return c.most_common(1)[0][0]
rolling_window = 3
days = lambda i: i * 86400
# Define a rolling calculation window based on time
window = (
Window()
.partitionBy("device_id")
.orderBy(col("read_date").cast("timestamp").cast("long"))
.rangeBetween(-days(rolling_window - 1), 0)
)
# Use window and collect_list to store data matching the window definition on each row
df_collected = df.select(
'device_id', 'read_date',
collect_list(col('id')).over(window).alias('ids'),
collect_list(col('count')).over(window).alias('counts')
)
# Get rid of duplicate rows where necessary
df_grouped = df_collected.groupBy('device_id', 'read_date').agg(
first('ids').alias('ids'),
first('counts').alias('counts'),
)
# Register and apply udf to return the most frequently seen id
top_id_udf = udf(top_id, IntegerType())
df_mapped = df_grouped.withColumn('top_id', top_id_udf(col('ids'), col('counts')))
df_mapped.show(truncate=False)
returns:
+---------+----------+------------------------+------------+------+
|device_id|read_date |ids |counts |top_id|
+---------+----------+------------------------+------------+------+
|device_A |2017-08-05|[4041] |[3] |4041 |
|device_A |2017-08-06|[4041, 4041] |[3, 3] |4041 |
|device_A |2017-08-07|[4041, 4041, 4041] |[3, 3, 4] |4041 |
|device_A |2017-08-08|[4041, 4041, 4041] |[3, 4, 3] |4041 |
|device_A |2017-08-09|[4041, 4041, 4041] |[4, 3, 3] |4041 |
|device_A |2017-08-10|[4041, 4041, 4041, 4045]|[3, 3, 1, 2]|4041 |
|device_A |2017-08-11|[4041, 4041, 4045, 4045]|[3, 1, 2, 3]|4045 |
|device_A |2017-08-12|[4041, 4045, 4045, 4045]|[1, 2, 3, 3]|4045 |
|device_A |2017-08-13|[4045, 4045, 4045] |[3, 3, 3] |4045 |
+---------+----------+------------------------+------------+------+
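The helper behind the udf can be checked on its own, outside Spark. This is a minimal sketch using two of the collected rows from the table above as input. The tie-breaking remark in the comment relies on Counter keeping insertion order (Python 3.7+).

```python
from collections import Counter


def top_id(ids, counts):
    totals = Counter()
    for id_, count in zip(ids, counts):
        totals[id_] += count
    # most_common(1) returns [(id, total)]; on a tie the id seen first wins
    return totals.most_common(1)[0][0]


top_0810 = top_id([4041, 4041, 4041, 4045], [3, 3, 1, 2])  # window ending 2017-08-10
top_0811 = top_id([4041, 4041, 4045, 4045], [3, 1, 2, 3])  # window ending 2017-08-11
print(top_0810, top_0811)
```

Testing the function this way keeps the Spark part limited to collecting the lists and applying the udf.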

Removing NULL , NAN, empty space from PySpark DataFrame

I have a dataframe in PySpark which contains empty strings, NULL, and NaN.
I want to remove the rows which have any of those. I tried the commands below, but nothing seems to work.
myDF.na.drop().show()
myDF.na.drop(how='any').show()
Below is the dataframe:
+---+----------+----------+-----+-----+
|age| category| date|empId| name|
+---+----------+----------+-----+-----+
| 25|electronic|17-01-2018| 101| abc|
| 24| sports|16-01-2018| 102| def|
| 23|electronic|17-01-2018| 103| hhh|
| 23|electronic|16-01-2018| 104| yyy|
| 29| men|12-01-2018| 105| ajay|
| 31| kids|17-01-2018| 106|vijay|
| | Men| nan| 107|Sumit|
+---+----------+----------+-----+-----+
What am I missing? What is the best way to handle NULL, NaN, or empty strings so that they cause no problems in the actual calculation?
NaN (not a number) has a different meaning than NULL, and an empty string is just a normal value (it can be converted to NULL automatically by the csv reader), so na.drop won't match these.
You can convert all of them to null and then drop:
from pyspark.sql.functions import col, isnan, when, trim
df = spark.createDataFrame([
("", 1, 2.0), ("foo", None, 3.0), ("bar", 1, float("NaN")),
("good", 42, 42.0)])
def to_null(c):
    # Keep the value only if it is not NULL, not NaN and not blank; otherwise yield NULL
    return when(~(col(c).isNull() | isnan(col(c)) | (trim(col(c)) == "")), col(c))
df.select([to_null(c).alias(c) for c in df.columns]).na.drop().show()
# +----+---+----+
# | _1| _2| _3|
# +----+---+----+
# |good| 42|42.0|
# +----+---+----+
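The rule applied by to_null can be written as a plain-Python predicate, which makes the three cases explicit. This is only an illustration on the same sample tuples, not Spark code, and the names `is_missing` and `clean_rows` are made up here.

```python
import math


def is_missing(value):
    """True for None (NULL), float NaN, and empty or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str) and value.strip() == "":
        return True
    return False


rows = [
    ("", 1, 2.0),
    ("foo", None, 3.0),
    ("bar", 1, float("nan")),
    ("good", 42, 42.0),
]

# drop every row that contains at least one missing value
clean_rows = [row for row in rows if not any(is_missing(v) for v in row)]
print(clean_rows)
```

Only the last row survives, matching the Spark output above.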
Maybe it is not important in your case, but this code (a modified version of Alper t. Turker's answer) handles the different data types accordingly. The data types will of course vary with your DataFrame. (Tested on Spark 2.4.)
from pyspark.sql.functions import col, isnan, when, trim
# Find out dataType and act accordingly
def to_null_bool(c, dt):
    if dt == "double":
        # keep values that are neither NULL nor NaN
        return ~(c.isNull() | isnan(c))
    elif dt == "string":
        # keep values that are neither NULL nor blank
        return ~c.isNull() & (trim(c) != "")
    else:
        return ~c.isNull()
# Keep a value only when it passes the check above; everything else becomes NULL
def to_null(c, dt):
    c = col(c)
    return when(to_null_bool(c, dt), c)
df.select([to_null(c, dt[1]).alias(c) for c, dt in zip(df.columns, df.dtypes)]).na.drop(how="any").show()

Pyspark Unsupported literal type class java.util.ArrayList [duplicate]

I am using python3 on Spark(2.2.0). I want to apply my UDF to a specified list of strings.
df = ['Apps A','Chrome', 'BBM', 'Apps B', 'Skype']
def calc_app(app, app_list):
    browser_list = ['Chrome', 'Firefox', 'Opera']
    chat_list = ['WhatsApp', 'BBM', 'Skype']
    sum = 0
    for data in app:
        name = data['name']
        if name in app_list:
            sum += 1
    return sum
calc_appUDF = udf(calc_app)
df = df.withColumn('app_browser', calc_appUDF(df['apps'], browser_list))
df = df.withColumn('app_chat', calc_appUDF(df['apps'], chat_list))
But it fails with: 'Unsupported literal type class java.util.ArrayList'
If I understood your requirement correctly, then you should try this:
from pyspark.sql.functions import udf, col
#sample data
df_list = ['Apps A','Chrome', 'BBM', 'Apps B', 'Skype']
df = sqlContext.createDataFrame([(l,) for l in df_list], ['apps'])
df.show()
#some lists definition
browser_list = ['Chrome', 'Firefox', 'Opera']
chat_list = ['WhatsApp', 'BBM', 'Skype']
#udf definition
def calc_app(app, app_list):
    if app in app_list:
        return 1
    else:
        return 0
# Build a udf that closes over app_list instead of passing the list as a column argument
def calc_appUDF(app_list):
    return udf(lambda l: calc_app(l, app_list))
#add new columns
df = df.withColumn('app_browser', calc_appUDF(browser_list)(col('apps')))
df = df.withColumn('app_chat', calc_appUDF(chat_list)(col('apps')))
df.show()
Sample input:
+------+
| apps|
+------+
|Apps A|
|Chrome|
| BBM|
|Apps B|
| Skype|
+------+
Output is:
+------+-----------+--------+
| apps|app_browser|app_chat|
+------+-----------+--------+
|Apps A| 0| 0|
|Chrome| 1| 0|
| BBM| 0| 1|
|Apps B| 0| 0|
| Skype| 0| 1|
+------+-----------+--------+
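The answer works because the Python list is captured by a closure rather than passed to the udf as an argument, where Spark would try to turn it into a literal column. The pattern itself can be sketched in plain Python. The factory name `make_membership_flag` is hypothetical and stands in for calc_appUDF without the udf wrapper.

```python
def make_membership_flag(app_list):
    allowed = frozenset(app_list)  # captured by the inner function

    def flag(app):
        # 1 if the app name is in the captured list, else 0
        return 1 if app in allowed else 0

    return flag


apps = ['Apps A', 'Chrome', 'BBM', 'Apps B', 'Skype']
browser_list = ['Chrome', 'Firefox', 'Opera']
chat_list = ['WhatsApp', 'BBM', 'Skype']

is_browser = make_membership_flag(browser_list)
is_chat = make_membership_flag(chat_list)

app_browser = [is_browser(a) for a in apps]
app_chat = [is_chat(a) for a in apps]
print(app_browser, app_chat)
```

For a simple membership test like this one, Column.isin (for example col('apps').isin(browser_list).cast('int')) may avoid the Python udf altogether. That variant is a suggestion, not something tested in the answer above.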

Searching an instance in dataframe in Pyspark with filter takes too much time

I have a DataFrame with N attributes (Atr1, Atr2, Atr3, ..., AtrN) and a single instance that has values for attributes [1..N-1] but not for the Nth one.
I want to check whether the DataFrame contains a row whose attributes [1..N-1] match those of the instance and, if such a row exists, retrieve it with all attributes [1..N].
For example, if I have:
Instance:
[Row(Atr1=u'A', Atr2=u'B', Atr3=24)]
Dataframe:
+------+------+------+------+
| Atr1 | Atr2 | Atr3 | Atr4 |
+------+------+------+------+
| 'C' | 'B' | 21 | 'H' |
+------+------+------+------+
| 'D' | 'B' | 21 | 'J' |
+------+------+------+------+
| 'E' | 'B' | 21 | 'K' |
+------+------+------+------+
| 'A' | 'B' | 24 | 'I' |
+------+------+------+------+
I want to get the 4th row of the DataFrame, including the value of Atr4.
I tried the filter() method like this:
df.filter("Atr1 = 'A' and Atr2 = 'B' and Atr3 = 24").take(1)
This gives the result I wanted, but it takes a long time.
So, my question is: is there any way to do the same but in less time?
Thanks!
You can use locality-sensitive hashing (MinHashLSH) to find the closest neighbor and then check whether it is an exact match.
Since your data contains strings, you need to preprocess it before applying LSH.
We will use the pyspark.ml.feature module.
Start with string indexing and one-hot encoding:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, MinHashLSH
df = spark.createDataFrame([('C','B',21,'H'),('D','B',21,'J'),('E','c',21,'K'),('A','B',24,'J')], ["attr1","attr2","attr3","attr4"])
for col_ in ["attr1","attr2","attr4"]:
    stringIndexer = StringIndexer(inputCol=col_, outputCol=col_+"_")
    model = stringIndexer.fit(df)
    df = model.transform(df)
    # Spark 2.x: OneHotEncoder is a transformer; in Spark 3.x it is an estimator, so call .fit(df) before .transform(df)
    encoder = OneHotEncoder(inputCol=col_+"_", outputCol="features_"+col_, dropLast = False)
    df = encoder.transform(df)
df = df.drop("attr1","attr2","attr4","attr1_","attr2_","attr4_")
df.show()
+-----+--------------+--------------+--------------+
|attr3|features_attr1|features_attr2|features_attr4|
+-----+--------------+--------------+--------------+
| 21| (4,[2],[1.0])| (2,[0],[1.0])| (3,[1],[1.0])|
| 21| (4,[0],[1.0])| (2,[0],[1.0])| (3,[0],[1.0])|
| 21| (4,[3],[1.0])| (2,[1],[1.0])| (3,[2],[1.0])|
| 24| (4,[1],[1.0])| (2,[0],[1.0])| (3,[0],[1.0])|
+-----+--------------+--------------+--------------+
Add an id and assemble all feature vectors:
from pyspark.sql.functions import monotonically_increasing_id
df = df.withColumn("id", monotonically_increasing_id())
df.show()
assembler = VectorAssembler(
    inputCols=["features_attr1", "features_attr2", "features_attr4", "attr3"],
    outputCol="features")
df_ = assembler.transform(df)
df_ = df_.select("id", "features")
df_.show()
+----------+--------------------+
| id| features|
+----------+--------------------+
| 0|(10,[2,4,7,9],[1....|
| 1|(10,[0,4,6,9],[1....|
|8589934592|(10,[3,5,8,9],[1....|
|8589934593|(10,[1,4,6,9],[1....|
+----------+--------------------+
Create your MinHashLSH model and search for the nearest neighbors:
mh = MinHashLSH(inputCol="features", outputCol="hashes", seed=12345)
model = mh.fit(df_)
model.transform(df_)
key = df_.select("features").collect()[0]["features"]
model.approxNearestNeighbors(df_, key, 1).collect()
Output:
[Row(id=0, features=SparseVector(10, {2: 1.0, 4: 1.0, 7: 1.0, 9: 21.0}), hashes=[DenseVector([-1272095496.0])], distCol=0.0)]
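The distCol value reported above is a Jaccard distance. MinHash works on the set of non-zero indices of each vector and, as far as the Spark documentation describes it, treats every non-zero value as a binary 1. The plain-Python sketch below illustrates that distance on the index sets shown in the features table. The dictionaries and the `jaccard_distance` helper are made up for this illustration, and the row `same_but_attr3` is a hypothetical row that differs from the key only in attr3.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of non-zero indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# index -> value, following the sparse vectors in the table (index 9 holds attr3)
key = {2: 1.0, 4: 1.0, 7: 1.0, 9: 21.0}             # row with id 0
other = {0: 1.0, 4: 1.0, 6: 1.0, 9: 21.0}           # row with id 1
same_but_attr3 = {2: 1.0, 4: 1.0, 7: 1.0, 9: 24.0}  # hypothetical: only attr3 differs

d_self = jaccard_distance(key, key)
d_other = jaccard_distance(key, other)
d_attr3 = jaccard_distance(key, same_but_attr3)
print(d_self, d_other, d_attr3)
```

Because only the positions of non-zero entries count, a row that differs from the key solely in the numeric attr3 value also has distance 0. It is therefore worth comparing the returned neighbor with the instance attribute by attribute before treating it as an exact match.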

Resources