How to add a variable number of hours to a timestamp in PySpark - apache-spark

Dataframe schema is like this:
["id", "t_create", "hours"]
string, timestamp, int
Sample data is like:
["abc", "2022-07-01 12:23:21.343998", 5]
I want to add the hours to t_create and get a new column t_update: "2022-07-01 17:23:21.343998"
Here is my code:
df_cols = ["id", "t_create", "hours"]
df = spark.read.format("delta").load("blablah path")
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL 5 HOURS"))
This works fine. However, the number of hours should come from the hours column rather than being a literal. I could not figure out how to get the column value into the expr/f-string/INTERVAL expression, something like:
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL {df.hours} HOURS"))
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL {col(df.hours)} HOURS"))
etc. None of these work. I need help here.
Another idea was to write a UDF that builds the whole interval string as its return value:
@udf
def udf_interval(hours):
    return "INTERVAL " + str(hours) + " HOURS"
Then:
df = df.withColumn("t_update", df.t_create + expr(udf_interval(df.hours)))
Now I get TypeError: Column is not iterable.
I'm stuck. Help with either the UDF or non-UDF approach is appreciated. Thanks!

You can do this without the fiddly unix_timestamp conversion by using make_interval in Spark SQL.
SparkSQL - TO_TIMESTAMP & MAKE_INTERVAL
spark.sql("""
WITH INP AS (
  SELECT
    "abc" AS id,
    TO_TIMESTAMP("2022-07-01 12:23:21.343998", "yyyy-MM-dd HH:mm:ss.SSSSSS") AS t_create,
    5 AS t_hour
)
SELECT
  id,
  t_create,
  t_hour,
  t_create + MAKE_INTERVAL(0, 0, 0, 0, t_hour, 0, 0) AS t_update
FROM INP
""").show(truncate=False)
+---+--------------------------+------+--------------------------+
|id |t_create |t_hour|t_update |
+---+--------------------------+------+--------------------------+
|abc|2022-07-01 12:23:21.343998|5 |2022-07-01 17:23:21.343998|
+---+--------------------------+------+--------------------------+
Pyspark API
from io import StringIO

import pandas as pd
from pyspark.sql import functions as F

s = StringIO("""
id,t_create,t_hour
abc,2022-07-01 12:23:21.343998,5
""")

df = pd.read_csv(s, delimiter=',')

sparkDF = spark.createDataFrame(df) \
    .withColumn('t_create',
                F.to_timestamp(F.col('t_create'), 'yyyy-MM-dd HH:mm:ss.SSSSSS')) \
    .withColumn('t_update',
                F.expr('t_create + MAKE_INTERVAL(0, 0, 0, 0, t_hour, 0, 0)'))

sparkDF.show(truncate=False)
+---+--------------------------+------+--------------------------+
|id |t_create |t_hour|t_update |
+---+--------------------------+------+--------------------------+
|abc|2022-07-01 12:23:21.343998|5 |2022-07-01 17:23:21.343998|
+---+--------------------------+------+--------------------------+
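Applied directly to the dataframe from the question, a minimal sketch of the same idea (assuming t_create is already a timestamp and the offset column is named hours as in the posted schema; make_interval needs Spark 3.0+):
from pyspark.sql import functions as F

df = spark.read.format("delta").load("blablah path")

# make_interval(years, months, weeks, days, hours, mins, secs) accepts column
# references, so the per-row value of `hours` drives the interval directly.
df = df.withColumn(
    "t_update",
    F.expr("t_create + make_interval(0, 0, 0, 0, hours, 0, 0)")
)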

A simple way would be to cast the timestamp to bigint (or to decimal if dealing with fractions of a second) and add the number of seconds to it. Here's an example where I've created a column for every calculation for detailed understanding - you can merge all the calculations into a single column.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([("2022-07-01 12:23:21.343998",)]).toDF(['ts_str']). \
    withColumn('ts', func.col('ts_str').cast('timestamp')). \
    withColumn('hours_to_add', func.lit(5)). \
    withColumn('ts_as_decimal', func.col('ts').cast('decimal(20, 10)')). \
    withColumn('seconds_to_add_as_decimal',
               func.col('hours_to_add').cast('decimal(20, 10)') * 3600
               ). \
    withColumn('new_ts_as_decimal',
               func.col('ts_as_decimal') + func.col('seconds_to_add_as_decimal')
               ). \
    withColumn('new_ts', func.col('new_ts_as_decimal').cast('timestamp')). \
    show(truncate=False)
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
# |ts_str |ts |hours_to_add|ts_as_decimal |seconds_to_add_as_decimal|new_ts_as_decimal |new_ts |
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
# |2022-07-01 12:23:21.343998|2022-07-01 12:23:21.343998|5 |1656678201.3439980000|18000.0000000000 |1656696201.3439980000|2022-07-01 17:23:21.343998|
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
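Since the intermediate columns above are only for illustration, here is a minimal merged sketch of the same approach applied to the question's dataframe (assuming t_create is a timestamp column and hours is the integer column from the posted schema):
from pyspark.sql import functions as func

# cast the timestamp to epoch seconds (as decimal to keep the microseconds),
# add hours * 3600 seconds, and cast the result back to timestamp
df = df.withColumn(
    't_update',
    (func.col('t_create').cast('decimal(20, 10)')
     + func.col('hours').cast('decimal(20, 10)') * 3600).cast('timestamp')
)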

Related

PySpark - convert RDD to pair key value RDD

I created an RDD from a CSV:
lines = sc.textFile(data)
Now I need to convert lines into a key-value RDD, where the value will be the string (after splitting) and the key will be the column number in the CSV.
For example, given the CSV:
Col1,Col2
73,230666
55,149610
I want to get rdd.take(1):
[(1,73), (2, 230666)]
I created an RDD of lists:
lines_of_list = lines_data.map(lambda line : line.split(','))
I created a function that takes a list and returns a list of (key, value) tuples:
def list_of_tuple(l):
    list_tup = []
    for i in range(len(l[0])):
        list_tup.append((l[0][i], i))
    return list_tup
But I can't get the correct result when I map this function over the RDD.
You can use PySpark's create_map function, like so:
from pyspark.sql.functions import create_map, col, lit
df = spark.createDataFrame([(73, 230666), (55, 149610)], "Col1: int, Col2: int")
mapped_df = df.select(create_map(lit(1), col("Col1")).alias("mappedCol1"), create_map(lit(2), col("Col2")).alias("mappedCol2"))
mapped_df.show()
+----------+-------------+
|mappedCol1| mappedCol2|
+----------+-------------+
| {1 -> 73}|{2 -> 230666}|
| {1 -> 55}|{2 -> 149610}|
+----------+-------------+
If you still want to use the RDD API, it is available as a property of the DataFrame, so you can use it like so:
mapped_df.rdd.take(1)
Out[32]: [Row(mappedCol1={1: 73}, mappedCol2={2: 230666})]
I fixed the problem in this way:
def list_of_tuple(line_rdd):
    l = line_rdd.split(',')
    list_tup = []
    for i in range(len(l)):
        list_tup.append((l[i], i))
    return list_tup

pairs_rdd = lines_data.map(lambda line: list_of_tuple(line))
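For reference, a minimal RDD-only sketch (my own variant, not from the posts above) that yields 1-based (column number, value) pairs in the order shown in the expected output; note the values stay strings, since textFile reads raw lines:
pairs_rdd = lines_data.map(
    lambda line: [(i + 1, value) for i, value in enumerate(line.split(','))]
)

pairs_rdd.take(1)
# [[(1, '73'), (2, '230666')]]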

Update a column in PySpark while doing multiple inner joins?

I have a SQL query which I am trying to convert into PySpark. In the SQL query, we join three tables and update a column where there is a match. The SQL query looks like this:
UPDATE [DEPARTMENT_DATA]
INNER JOIN ([COLLEGE_DATA]
INNER JOIN [STUDENT_TABLE]
ON COLLEGE_DATA.UNIQUEID = STUDENT_TABLE.PROFESSIONALID)
ON DEPARTMENT_DATA.PUBLICID = COLLEGE_DATA.COLLEGEID
SET STUDENT_TABLE.PRIVACY = "PRIVATE"
The logic I have tried:
df_STUDENT_TABLE = (
df_STUDENT_TABLE.alias('a')
.join(
df_COLLEGE_DATA.alias('b'),
on=F.col('a.PROFESSIONALID') == F.col('b.UNIQUEID'),
how='left',
)
.join(
df_DEPARTMENT_DATA.alias('c'),
on=F.col('b.COLLEGEID') == F.col('c.PUBLICID'),
how='left',
)
.select(
*[F.col(f'a.{c}') for c in df_STUDENT_TABLE.columns],
F.when(
F.col('b.UNIQUEID').isNotNull() & F.col('c.PUBLICID').isNotNull(),
F.lit('PRIVATE')
).alias('PRIVACY')
)
)
This code adds a new column "PRIVACY", but it contains null values after running.
I took some sample data, and when I apply the join conditions, the following is the result I get (the requirement is that the matching records' privacy needs to be set to PRIVATE):
%sql
select student.*, college.*, department.*
from department
inner join college on department.public_id = college.college_id
inner join student on college.unique_id = student.professional_id
When I used your code (same logic), I got the same output, i.e., an additional column added to the dataframe with the required values while the actual privacy column still has nulls.
from pyspark.sql.functions import col,when,lit
df_s = (
    df_s.alias('a')
    .join(df_c.alias('b'), col('a.professional_id') == col('b.unique_id'), 'left')
    .join(df_d.alias('c'), col('b.college_id') == col('c.public_id'), 'left')
    .select(
        *[col(f'a.{c}') for c in df_s.columns],
        when(col('b.unique_id').isNotNull() & col('c.public_id').isNotNull(), 'PRIVATE')
        .otherwise(col('a.privacy'))
        .alias('req_value')
    )
)
df_s.show()
Since req_value is the column with the required values, and these values need to be reflected in privacy, you can use the following code directly.
final = df_s.withColumn('privacy', col('req_value')) \
    .select([column for column in df_s.columns if column != 'req_value'])
final.show()
UPDATE:
You can also use the following code where I have updated the column using withColumn instead of select.
df_s = (
    df_s.alias('a')
    .join(df_c.alias('b'), col('a.professional_id') == col('b.unique_id'), 'left')
    .join(df_d.alias('c'), col('b.college_id') == col('c.public_id'), 'left')
    .withColumn(
        'privacy',
        when(col('b.unique_id').isNotNull() & col('c.public_id').isNotNull(), 'PRIVATE')
        .otherwise(col('privacy'))
    )
    .select(*df_s.columns)
)
# Or you can use this as well, without using aliases:
#df_s = df_s.join(df_c, df_s['professional_id'] == df_c['unique_id'],'left').join(df_d, df_c['college_id'] == df_d['public_id'],'left').withColumn('privacy',when(df_c['unique_id'].isNotNull() & df_d['public_id'].isNotNull(), 'PRIVATE').otherwise(df_s['privacy'])).select(*df_s.columns)
df_s.show()
After the joins, you can use nvl2. It checks whether the join with the last dataframe (df_dept) was successful: if yes, it returns "PRIVATE", otherwise the value from df_stud.PRIVACY.
Inputs:
from pyspark.sql import functions as F
df_stud = spark.createDataFrame([(1, 'x'), (2, 'STAY')], ['PROFESSIONALID', 'PRIVACY'])
df_college = spark.createDataFrame([(1, 1)], ['COLLEGEID', 'UNIQUEID'])
df_dept = spark.createDataFrame([(1,)], ['PUBLICID'])
df_stud.show()
# +--------------+-------+
# |PROFESSIONALID|PRIVACY|
# +--------------+-------+
# | 1| x|
# | 2| STAY|
# +--------------+-------+
Script:
df = (
    df_stud.alias('s')
    .join(df_college.alias('c'), F.col('s.PROFESSIONALID') == F.col('c.UNIQUEID'), 'left')
    .join(df_dept.alias('d'), F.col('c.COLLEGEID') == F.col('d.PUBLICID'), 'left')
    .select(
        *[f's.`{c}`' for c in df_stud.columns if c != 'PRIVACY'],
        F.expr("nvl2(d.PUBLICID, 'PRIVATE', s.PRIVACY) PRIVACY")
    )
)
df.show()
# +--------------+-------+
# |PROFESSIONALID|PRIVACY|
# +--------------+-------+
# | 1|PRIVATE|
# | 2| STAY|
# +--------------+-------+
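For completeness, a minimal sketch of the same logic written purely with the DataFrame API (when/otherwise standing in for nvl2), in case you prefer to avoid expr; this is an equivalent formulation, not part of the original answer:
from pyspark.sql import functions as F

df = (
    df_stud.alias('s')
    .join(df_college.alias('c'), F.col('s.PROFESSIONALID') == F.col('c.UNIQUEID'), 'left')
    .join(df_dept.alias('d'), F.col('c.COLLEGEID') == F.col('d.PUBLICID'), 'left')
    .select(
        *[F.col(f's.{c}') for c in df_stud.columns if c != 'PRIVACY'],
        # nvl2(x, a, b) is equivalent to when(x IS NOT NULL, a).otherwise(b)
        F.when(F.col('d.PUBLICID').isNotNull(), F.lit('PRIVATE'))
         .otherwise(F.col('s.PRIVACY'))
         .alias('PRIVACY')
    )
)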

Glue/Spark: Filter a large dynamic frame with thousands of conditions

I am trying to filter a time-series Glue dynamic frame with millions of rows like:
id val ts
a 1.3 2022-05-03T14:18:00.000Z
a 9.2 2022-05-03T12:18:00.000Z
c 8.2 2022-05-03T13:48:00.000Z
I have another pandas dataframe with thousands of rows:
id start_ts end_ts
a 2022-05-03T14:00:00.000Z 2022-05-03T14:18:00.000Z
a 2022-05-03T11:38:00.000Z 2022-05-03T12:18:00.000Z
c 2022-05-03T13:15:00.000Z 2022-05-03T13:48:00.000Z
I want to keep all rows in the time-series dynamic frame that have a matching id and whose ts falls between start_ts and end_ts.
My current approach is too slow:
I first iterate over pandas_df and store the filtered Glue dynamic frames in a list,
dfs = []
for index, row in pandas_df.iterrows():
    df = Filter.apply(ts_dynamicframe,
                      f=lambda x: (row['start_ts'] <= x['ts'] <= row['end_ts']) and x['id'] == index)
    dfs.append(df)
and then unioning all the dynamicframes together.
df = dfs[0]
dfs.pop(0)
for _df in dfs:
    df = df.union(_df)
The materialization takes too long and never finishes:
print("Count: ", df.count())
What would be a more efficient approach to solving this problem with Spark/Glue?
Use a range join
Data
df=spark.createDataFrame([('a' , 1.3 ,'2022-05-03T14:18:00.000Z'),
('a' , 9.2, '2021-05-03T12:18:00.000Z'),
('c' , 8.2, '2022-05-03T13:48:00.000Z')],
('id' , 'val', 'ts' ))
df1=spark.createDataFrame([('a' , '2022-05-03T14:00:00.000Z' , '2022-05-03T14:18:00.000Z'),
('a' , '2022-05-03T11:38:00.000Z' , '2022-05-03T12:18:00.000Z'),
('c' , '2022-05-03T13:15:00.000Z' , '2022-05-03T13:48:00.000Z')],
('id' , 'start_ts' , 'end_ts' ))
from pyspark.sql.functions import to_timestamp

# Convert to timestamp if not yet converted
df = df.withColumn('ts', to_timestamp('ts'))
df1 = df1.withColumn('start_ts', to_timestamp('start_ts')).withColumn('end_ts', to_timestamp('end_ts'))
Solution
# Register the dataframes as temp views
df1.createOrReplaceTempView('df1')
df.createOrReplaceTempView('df')
# Range join: match on id where ts falls between start_ts and end_ts
spark.sql("SELECT * FROM df,df1 WHERE df.id= df1.id AND df.ts BETWEEN df1.start_ts and df1.end_ts").show()
Outcome
+---+---+-------------------+---+-------------------+-------------------+
| id|val| ts| id| start_ts| end_ts|
+---+---+-------------------+---+-------------------+-------------------+
| a|1.3|2022-05-03 14:18:00| a|2022-05-03 14:00:00|2022-05-03 14:18:00|
| c|8.2|2022-05-03 13:48:00| c|2022-05-03 13:15:00|2022-05-03 13:48:00|
+---+---+-------------------+---+-------------------+-------------------+
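The same range join can also be written with the DataFrame API and converted back to a DynamicFrame for the rest of the Glue job. A minimal sketch, assuming pandas_df has id, start_ts, end_ts as ordinary columns (reset_index() first if id is the index), both ts columns are proper timestamps (apply to_timestamp as above if not), and glueContext is the usual handle from the Glue job setup:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

ts_df = ts_dynamicframe.toDF()                 # Glue DynamicFrame -> Spark DataFrame
ranges_df = spark.createDataFrame(pandas_df)   # pandas -> Spark

filtered = (
    ts_df.alias('t')
    .join(
        ranges_df.alias('r'),
        (F.col('t.id') == F.col('r.id'))
        & F.col('t.ts').between(F.col('r.start_ts'), F.col('r.end_ts')),
        'inner'
    )
    .select('t.*')
)

# back to a DynamicFrame if the downstream Glue transforms need one
filtered_dyf = DynamicFrame.fromDF(filtered, glueContext, 'filtered_dyf')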

How to extract specific time interval on working days with sql in apache spark?

I loaded a CSV file into a SQL table in Databricks, which uses Apache Spark. I need to extract a SQL table column with content like:
01.01.2018,15:25
01.01.2018,00:10
01.01.2018,13:20
...
...
and keep only the data that falls on working days between 8:30 and 9:30 a.m. How should I do that? Should I first split the column into two columns? I found how to do some parts with data entered directly into Databricks, but this data is part of a SQL table.
Also, some commands from classical SQL do not work in Apache Spark, i.e. Databricks.
This is the query for reading the data:
# File location and type
file_location = "/FileStore/tables/NEZ_OPENDATA_2018_20190125-1.csv"
file_type = "csv"
# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","
# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
    .option("inferSchema", infer_schema) \
    .option("header", first_row_is_header) \
    .option("sep", delimiter) \
    .load(file_location)
display(df)
# Create a view or table
temp_table_name = "NEZ_OPENDATA_2018_20190125"
df.createOrReplaceTempView(temp_table_name)
%sql
/* Query the created temp table in a SQL cell */
select * from `NEZ_OPENDATA_2018_20190125`
permanent_table_name = "NEZ_OPENDATA_2018_20190125"
df.write.format("parquet").saveAsTable(permanent_table_name)
Reading it as a text file is probably more appropriate, since the timestamp consists of both the date and the time. Then you can filter on the day of week and the time using the relevant PySpark functions. Note that dayofweek returns 1 for Sunday, 2 for Monday, etc.
import pyspark.sql.functions as F
file_location = "/FileStore/tables/NEZ_OPENDATA_2018_20190125-1.csv"
df = spark.read.text(file_location).toDF('timestamp')
result = df.select(
    F.to_timestamp('timestamp', 'dd.MM.yyyy,HH:mm').alias('timestamp')
).filter(
    F.dayofweek('timestamp').isin([2, 3, 4, 5, 6]) & (
        ((F.hour('timestamp') == 8) & (F.minute('timestamp').between(30, 59))) |
        ((F.hour('timestamp') == 9) & (F.minute('timestamp').between(0, 30)))
    )
)
If you want to show the output, you can do result.show() or display(result).
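If the CSV has already been loaded with the comma delimiter as in the question, the date would land in one column and the time in another. A minimal sketch that rebuilds the timestamp from those two columns and applies the same filter (the default _c0/_c1 column names are an assumption about the header=false read):
import pyspark.sql.functions as F

# rebuild "dd.MM.yyyy,HH:mm" from the two columns produced by the comma split
ts_df = df.select(
    F.to_timestamp(F.concat_ws(',', '_c0', '_c1'), 'dd.MM.yyyy,HH:mm').alias('timestamp')
)

result = ts_df.filter(
    F.dayofweek('timestamp').isin([2, 3, 4, 5, 6]) &
    (
        ((F.hour('timestamp') == 8) & (F.minute('timestamp') >= 30)) |
        ((F.hour('timestamp') == 9) & (F.minute('timestamp') <= 30))
    )
)
result.show()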

How to use a function over an RDD and get new column (Pyspark)?

I'm looking for a way to apply a function to an RDD using PySpark and put the result in a new column. With DataFrames, it looks easy:
Given:
rdd = sc.parallelize([(u'1751940903', u'2014-06-19', '2016-10-19'), (u'_guid_VubEgxvPPSIb7W5caP-lXg==', u'2014-09-10', '2016-10-19')])
My code can look like this:
df= rdd.toDF(['gigya', 'inscription','d_date'])
df.show()
+--------------------+-------------------------+----------+
| gigya| inscription| d_date|
+--------------------+-------------------------+----------+
| 1751940903| 2014-06-19|2016-10-19|
|_guid_VubEgxvPPSI...| 2014-09-10|2016-10-19|
+--------------------+-------------------------+----------+
Then:
from datetime import datetime
from pyspark.sql.functions import split, udf, col
get_period_day = udf(lambda item: datetime.strptime(item, "%Y-%m-%d").timetuple().tm_yday)
df.select('d_date', 'gigya', 'inscription', get_period_day(col('d_date')).alias('period_day')).show()
+----------+--------------------+-------------------------+----------+
| d_date| gigya|inscription_service_6Play|period_day|
+----------+--------------------+-------------------------+----------+
|2016-10-19| 1751940903| 2014-06-19| 293|
|2016-10-19|_guid_VubEgxvPPSI...| 2014-09-10| 293|
+----------+--------------------+-------------------------+----------+
Is there a way to do the same thing without converting my RDD to a DataFrame? Something with map, for example?
This code gives me only part of the expected result:
rdd.map(lambda x: datetime.strptime(x[1], '%Y-%m-%d').timetuple().tm_yday).cache().collect()
Help?
Try:
rdd.map(lambda x:
        x + (datetime.strptime(x[1], '%Y-%m-%d').timetuple().tm_yday, ))
or:
def g(x):
    return x + (datetime.strptime(x[1], '%Y-%m-%d').timetuple().tm_yday, )

rdd.map(g)
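For reference, a quick check of what the map returns on the sample RDD (day-of-year computed from x[1], the inscription date, as in the snippet above; use x[2] instead if you want it based on d_date, which is what the DataFrame example did):
from datetime import datetime

result = rdd.map(
    lambda x: x + (datetime.strptime(x[1], '%Y-%m-%d').timetuple().tm_yday,)
)
result.take(2)
# [('1751940903', '2014-06-19', '2016-10-19', 170),
#  ('_guid_VubEgxvPPSIb7W5caP-lXg==', '2014-09-10', '2016-10-19', 253)]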
