PySpark: create a lag column

I am using PySpark and have obtained the table below, table_1:
+--------+------------------+---------------+----------+-----+
| Number|Institution | datetime| date| time|
+--------+------------------+---------------+----------+-----+
|AE19075B| ABC| 7/20/2019 7:45|07/20/2019| 7:45|
|AE11688U| CBT|2/11/2019 20:31|02/11/2019|20:31|
+--------+------------------+---------------+----------+-----+
I would like to add a column to table_1 containing the time lagged by 15 minutes:
+--------+------------------+---------------+----------+-----+-----+
| Number|Institution | datetime| date| time|lag1 |
+--------+------------------+---------------+----------+-----+-----+
|AE19075B| ABC| 7/20/2019 7:45|07/20/2019| 7:45|7:30 |
|AE11688U| CBT|2/11/2019 20:31|02/11/2019|20:31|20:16|
+--------+------------------+---------------+----------+-----+-----+
from datetime import datetime, timedelta
table_2 = table_1.withColumn('lag1', (datetime.strptime(table_1['time'], '%H:%M') - timedelta(minutes=15)).strftime('%H:%M'))
The code above works on a plain string, but I don't understand why it cannot be applied to the table here. It raises "TypeError: strptime() argument 1 must be str, not Column". Is there any method to obtain a string from a column in PySpark? Thanks!

You can't use Python functions directly on Spark dataframe columns. You can use Spark SQL functions instead, as shown below:
import pyspark.sql.functions as F
df2 = df.withColumn(
'lag1',
F.expr("date_format(to_timestamp(time, 'H:m') - interval 15 minute, 'H:m')")
)
df2.show()
+--------+-----------+---------------+----------+-----+-----+
| Number|Institution| datetime| date| time| lag1|
+--------+-----------+---------------+----------+-----+-----+
|AE19075B| ABC| 7/20/2019 7:45|07/20/2019| 7:45| 7:30|
|AE11688U| CBT|2/11/2019 20:31|02/11/2019|20:31|20:16|
+--------+-----------+---------------+----------+-----+-----+
Alternatively, you can call the Python function as a UDF (but the performance should be worse than calling Spark SQL functions directly):
import pyspark.sql.functions as F
from datetime import datetime, timedelta
lag = F.udf(lambda t: (datetime.strptime(t, '%H:%M') - timedelta(minutes=15)).strftime('%H:%M'))
df2 = df.withColumn('lag1', lag('time'))
df2.show()
+--------+-----------+---------------+----------+-----+-----+
| Number|Institution| datetime| date| time| lag1|
+--------+-----------+---------------+----------+-----+-----+
|AE19075B| ABC| 7/20/2019 7:45|07/20/2019| 7:45|07:30|
|AE11688U| CBT|2/11/2019 20:31|02/11/2019|20:31|20:16|
+--------+-----------+---------------+----------+-----+-----+
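Another route, as a sketch (assuming the time strings always match the H:m pattern used above), is plain unix-timestamp arithmetic, which avoids both expr and a UDF:
import pyspark.sql.functions as F

# Parse the time of day to epoch seconds, subtract 15 minutes, format it back.
df2 = df.withColumn(
    'lag1',
    F.from_unixtime(F.unix_timestamp('time', 'H:m') - 15 * 60, 'H:m')
)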

Related

How to convert a date string to timestamp in Spark?

%scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

Seq(("20110813"), ("20090724")).toDF("Date").select(
  col("Date"),
  to_date(col("Date"), "yyyy-mm-dd").as("to_date")
).show()

+--------+-------+
|    Date|to_date|
+--------+-------+
|20110813|   null|
|20090724|   null|
+--------+-------+

Seq(("20110813"), ("20090724")).toDF("Date").select(
  col("Date"),
  to_date(col("Date"), "yyyymmdd").as("to_date")
).show()

+--------+----------+
|    Date|   to_date|
+--------+----------+
|20110813|2011-01-13|
|20090724|2009-01-24|
+--------+----------+

I am trying to convert a string to a timestamp, but I always get null or default values back in the date column.
You haven't given a value for the new column to convert. You should use withColumn to add the new date column and tell it to use the Date column values.
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.types._
val df = Seq((20110813),(20090724)).toDF("Date")
val newDf = df.withColumn("to_date", to_date(col("Date").cast(TimestampType), "yyyy-MM-dd"))
newDf.show()
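For what it's worth, the null and January-only results above come from the format pattern: the input string has no dashes, so "yyyy-mm-dd" cannot match, and lowercase mm means minutes rather than months, so "yyyymmdd" defaults the month to January. Parsing the original string with the month pattern MM should work; a minimal PySpark sketch of the same idea (assuming a string Date column):
from pyspark.sql.functions import col, to_date

df = spark.createDataFrame([("20110813",), ("20090724",)], ["Date"])
# yyyyMMdd: uppercase MM is the month-of-year pattern
df.select(col("Date"), to_date(col("Date"), "yyyyMMdd").alias("to_date")).show()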

How to capture frequency of words after group by with pyspark

I have tabular data with keys and values, and the keys are not unique.
For example:
+-----+------+
| key | value|
+-----+------+
| 1 | the |
| 2 | i |
| 1 | me |
| 1 | me |
| 2 | book |
| 1 |table |
+-----+------+
Now assume this table is distributed across the different nodes in a Spark cluster.
How do I use pyspark to calculate the frequencies of the words with respect to the different keys? For instance, in the above example I wish to output:
+-----+------+-------------+
| key | value| frequencies |
+-----+------+-------------+
| 1 | the | 1/4 |
| 2 | i | 1/2 |
| 1 | me | 2/4 |
| 2 | book | 1/2 |
| 1 |table | 1/4 |
+-----+------+-------------+
Not sure if you can combine multi-level operations with DFs, but doing it in 2 steps and leaving concat to you, this works:
# Running in Databricks, not all imports are required.
# You may want to convert to upper or lower case for better results.
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *

data = [("1", "the"), ("2", "I"), ("1", "me"),
        ("1", "me"), ("2", "book"), ("1", "table")]
rdd = sc.parallelize(data)
someschema = rdd.map(lambda x: Row(c1=x[0], c2=x[1]))
df = sqlContext.createDataFrame(someschema)

df1 = df.groupBy("c1", "c2") \
        .count()
df2 = df1.groupBy('c1') \
         .sum('count')

df3 = df1.join(df2, 'c1')
df3.show()
returns:
+---+-----+-----+----------+
| c1| c2|count|sum(count)|
+---+-----+-----+----------+
| 1|table| 1| 4|
| 1| the| 1| 4|
| 1| me| 2| 4|
| 2| I| 1| 2|
| 2| book| 1| 2|
+---+-----+-----+----------+
You can reformat the last 2 columns yourself, but I am curious whether we can do it all in one go; in normal SQL I suspect we would use inline views and combine them.
This works across the cluster as standard, which is what Spark is generally all about; the groupBy takes care of the distribution.
minor edit
As it is rather hot outside, I looked into this in a little more depth. This is a good overview: http://stevendavistechnotes.blogspot.com/2018/06/apache-spark-bi-level-aggregation.html. After reading this and experimenting I could not get it any more elegant, reducing to 5 rows of output all in 1 go appears not to be possible.
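For the concat step left to the reader above, a sketch of one way to finish it in a single chain (reusing the c1/c2 DataFrame df from that answer):
import pyspark.sql.functions as F

# per-(key, value) counts joined with per-key totals, then glued into "n/total"
df3 = (df.groupBy("c1", "c2").count()
         .join(df.groupBy("c1").count().withColumnRenamed("count", "total"), "c1")
         .withColumn("frequencies",
                     F.concat_ws("/", F.col("count").cast("string"),
                                 F.col("total").cast("string"))))
df3.show()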
Another viable option is window functions.
First, compute the number of occurrences per (value, key) pair and per key, then add another column with the fraction (note that fractions are reduced, e.g. 2/4 becomes 1/2):
from pyspark.sql import Row
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import *
from fractions import Fraction
from pyspark.sql.functions import udf
@udf(StringType())
def getFraction(value_occurrence, key_occurrence):
    # build a reduced fraction string, e.g. Fraction(2, 4) -> "1/2"
    return str(Fraction(value_occurrence, key_occurrence))

schema = StructType([StructField("key", IntegerType(), True),
                     StructField("value", StringType(), True)])

data = [(1, "the"), (2, "I"), (1, "me"),
        (1, "me"), (2, "book"), (1, "table")]

spark = SparkSession.builder.appName('myPython').getOrCreate()
input_df = spark.createDataFrame(data, schema)

(input_df.withColumn("key_occurrence",
                     F.count(F.lit(1)).over(Window.partitionBy(F.col("key"))))
         .withColumn("value_occurrence",
                     F.count(F.lit(1)).over(Window.partitionBy(F.col("value"), F.col("key"))))
         .withColumn("frequency",
                     getFraction(F.col("value_occurrence"), F.col("key_occurrence")))
         .dropDuplicates().show())

using window function sql inside pyspark

I have data like the example input below. Using PySpark, I'm trying to create a new column that holds the category of each customer's first event based on the timestamp, like the example output below.
I have an example below of what I think would accomplish it using a window function in sql.
I'm pretty new to PySpark. I understand you can run SQL inside of PySpark, and I'm wondering whether the code below is correct for running the SQL window function, i.e. whether I can just paste the SQL inside spark.sql as I have below.
Input:
eventid customerid category timestamp
1 3 a 1/1/12
2 3 b 2/3/14
4 2 c 4/1/12
Output:
eventid customerid category timestamp first_event
1 3 a 1/1/12 a
2 3 b 2/3/14 a
4 2 c 4/1/12 c
window function example:
select eventid, customerid, category, timestamp,
FIRST_VALUE(category) over (partition by customerid order by timestamp) first_event
from table
# implementing window function example with pyspark
PySpark:
# Note: assume df is dataframe with structure of table above
# (df is table)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Operations").getOrCreate()

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("table")

sql_results = spark.sql("""
    select eventid, customerid, category, timestamp,
    FIRST_VALUE(category) over (partition by customerid order by timestamp) first_event
    from table
""")

# display results
sql_results.show()
You can use window functions in PySpark as well:
>>> import pyspark.sql.functions as F
>>> from pyspark.sql.window import Window
>>>
>>> df.show()
+-------+----------+--------+---------+
|eventid|customerid|category|timestamp|
+-------+----------+--------+---------+
| 1| 3| a| 1/1/12|
| 2| 3| b| 2/3/14|
| 4| 2| c| 4/1/12|
+-------+----------+--------+---------+
>>> window = Window.partitionBy('customerid')
>>> df = df.withColumn('first_event', F.first('category').over(window))
>>>
>>> df.show()
+-------+----------+--------+---------+-----------+
|eventid|customerid|category|timestamp|first_event|
+-------+----------+--------+---------+-----------+
| 1| 3| a| 1/1/12| a|
| 2| 3| b| 2/3/14| a|
| 4| 2| c| 4/1/12| c|
+-------+----------+--------+---------+-----------+
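Note that without an explicit ordering, first is not guaranteed to pick the earliest row. A sketch that orders the window by the parsed timestamp (assuming the month/day/two-digit-year format shown above):
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Order the partition by the actual date so "first" means "earliest event".
w = Window.partitionBy('customerid').orderBy(F.to_date('timestamp', 'M/d/yy'))
df = df.withColumn('first_event', F.first('category').over(w))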

How to use lag and rangeBetween functions on timestamp values?

I have data that looks like this:
userid,eventtime,location_point
4e191908,2017-06-04 03:00:00,18685891
4e191908,2017-06-04 03:04:00,18685891
3136afcb,2017-06-04 03:03:00,18382821
661212dd,2017-06-04 03:06:00,80831484
40e8a7c3,2017-06-04 03:12:00,18825769
I would like to add a new boolean column that is true if there are 2 or more userid within a 5-minute window at the same location_point. My idea was to use the lag function over a window partitioned by userid, with a range between the current timestamp and the next 5 minutes:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
from pyspark.sql.functions import col
days = lambda i: i * 60*5
windowSpec = W.partitionBy(col("userid")).orderBy(col("eventtime").cast("timestamp").cast("long")).rangeBetween(0, days(5))
last_location_point = F.lag(col("location_point"), 1).over(windowSpec)
visitCheck = (last_location_point == output.location_point)
output.withColumn("visit_check", visitCheck).select("userid", "eventtime", "location_point", "visit_check")
This code is giving me an analysis exception when I use the RangeBetween function:
AnalysisException: u'Window Frame RANGE BETWEEN CURRENT ROW AND 1500
FOLLOWING must match the required frame ROWS BETWEEN 1 PRECEDING AND 1
PRECEDING;
Do you know any way to tackle this problem?
Given your data:
Let's add a column with a timestamp in seconds:
df = df.withColumn('timestamp', df.eventtime.astype('timestamp').cast('long'))
df.show()
+--------+-------------------+--------------+----------+
| userid| eventtime|location_point| timestamp|
+--------+-------------------+--------------+----------+
|4e191908|2017-06-04 03:00:00| 18685891|1496545200|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560|
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890|
+--------+-------------------+--------------+----------+
Now, let's define a window function with a partition by location_point, an order by timestamp, and a range between -300 seconds and the current row. We can count the number of elements in this window and put the result in a column named 'occurrences_in_5_min':
from pyspark.sql.window import Window

w = Window.partitionBy('location_point').orderBy('timestamp').rangeBetween(-60*5, 0)
df = df.withColumn('occurrences_in_5_min', F.count('timestamp').over(w))
df.show()
+--------+-------------------+--------------+----------+--------------------+
| userid| eventtime|location_point| timestamp|occurrences_in_5_min|
+--------+-------------------+--------------+----------+--------------------+
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920| 1|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380| 1|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560| 1|
|4e191908|2017-06-04 03:00:00| 18685891|1496545200| 1|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440| 2|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890| 1|
+--------+-------------------+--------------+----------+--------------------+
Now you can add the desired column, which is True if the number of occurrences is strictly greater than 1 in the last 5 minutes at a particular location:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

add_bool = udf(lambda col: True if col > 1 else False, BooleanType())
df = df.withColumn('already_occured', add_bool('occurrences_in_5_min'))
df.show()
+--------+-------------------+--------------+----------+--------------------+---------------+
| userid| eventtime|location_point| timestamp|occurrences_in_5_min|already_occured|
+--------+-------------------+--------------+----------+--------------------+---------------+
|40e8a7c3|2017-06-04 03:12:00| 18825769|1496545920| 1| false|
|3136afcb|2017-06-04 03:03:00| 18382821|1496545380| 1| false|
|661212dd|2017-06-04 03:06:00| 80831484|1496545560| 1| false|
|4e191908|2017-06-04 03:00:00| 18685891|1496545200| 1| false|
|4e191908|2017-06-04 03:04:00| 18685891|1496545440| 2| true|
|4e191908|2017-06-04 03:11:30| 18685891|1496545890| 1| false|
+--------+-------------------+--------------+----------+--------------------+---------------+
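As an aside, the same flag can be obtained without a UDF, as a plain column comparison (a sketch, reusing the F alias imported above):
# Boolean column directly from the comparison; no Python UDF needed.
df = df.withColumn('already_occured', F.col('occurrences_in_5_min') > 1)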
rangeBetween just doesn't make sense for a non-aggregate function like lag. lag always takes a specific row, denoted by the offset argument, so specifying a frame is pointless.
To get a window over a time series you can use window grouping with standard aggregates:
from pyspark.sql.functions import window, countDistinct
(df
    .groupBy("location_point", window("eventtime", "5 minutes"))
    .agg(countDistinct("userid")))
You can add more arguments to modify slide duration.
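For example, a third argument sets the slide duration; a sketch of a 5-minute window that advances every minute:
from pyspark.sql.functions import window, countDistinct

# Sliding window: 5-minute length, 1-minute slide.
(df
    .groupBy("location_point", window("eventtime", "5 minutes", "1 minute"))
    .agg(countDistinct("userid")))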
You can try something similar with window functions if you partition by location:
windowSpec = (W.partitionBy(col("location_point"))
    .orderBy(col("eventtime").cast("timestamp").cast("long"))
    .rangeBetween(0, days(5)))

df.withColumn("id_count", countDistinct("userid").over(windowSpec))

Spark SQL - UDF for calculating the difference between two dates and times

Is there any Spark SQL UDF available for calculating the difference between two dates and times?
I created one on my own. Here is how it goes:-
def time_delta(y, x):
    from datetime import datetime
    end = datetime.strptime(y, '%Y-%m-%d %H:%M:%S')
    start = datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
    delta = (end - start).total_seconds()
    return int(delta / (60 * 60 * 24))
This takes in the two dates y and x and returns the result in days. I used the code below to register it:-
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

f = udf(time_delta, IntegerType())
sqlContext.udf.register("time_diff", time_delta)
It works like a charm. Here is an example:-
df = df.withColumn("Duration", f(df.end_date, df.start_date))
df.show()
Results are:-
+---+-------------------+-------------------+-----+----+--------+
| id| end_date| start_date|state|city|Duration|
+---+-------------------+-------------------+-----+----+--------+
| 1|2015-10-14 00:00:00|2015-09-14 00:00:00| CA| SF| 30|
| 2|2015-10-15 01:00:20|2015-08-14 00:00:00| CA| SD| 62|
| 3|2015-10-16 02:30:00|2015-01-14 00:00:00| NY| NY| 275|
| 4|2015-10-17 03:00:20|2015-02-14 00:00:00| NY| NY| 245|
| 5|2015-10-18 04:30:00|2014-04-14 00:00:00| CA| SD| 552|
+---+-------------------+-------------------+-----+----+--------+
I am also able to use it in Spark SQL:-
%sql select time_diff(end_date,start_date) from data_loc
And the results are:-
[screenshot: Spark SQL results]
There is no function as of now (Spark 2.0) to compute the difference between two dates in hours, but there is one to compute the difference in days:
def datediff(end: Column, start: Column): Column
    Returns the number of days from start to end.
    Since: 1.5.0
Ref. Scaladoc - functions.
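For the difference in hours, one workaround (a sketch, assuming the end_date/start_date columns from the example above are timestamps or timestamp-formatted strings) is plain unix-timestamp arithmetic, with no UDF:
from pyspark.sql import functions as F

# Seconds between the two timestamps, divided into hours.
df = df.withColumn(
    "duration_hours",
    (F.unix_timestamp("end_date") - F.unix_timestamp("start_date")) / 3600
)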
