Change the day of the date to a particular day - apache-spark

I basically have a requirement that needs a column that as the PeriodEndDate in. The period always ends on the 23rd of the month.
I need to take a date from a column in this case it is the last day of the month each day, and set the "day" of that date to be "23".
I have tried doing the following:
.withColumn("periodEndDate", change_day(jsonDF2.periodDate, sf.lit(23)))
cannot import name 'change_day' from 'pyspark.sql.functions'

You can use make_date
from pyspark.sql import functions as F
df = spark.createDataFrame([('2022-05-31',)], ['periodDate'])
df = df.withColumn('periodEndDate', F.expr("make_date(year(periodDate), month(periodDate), 23)"))
df.show()
# +----------+-------------+
# |periodDate|periodEndDate|
# +----------+-------------+
# |2022-05-31| 2022-05-23|
# +----------+-------------+

As far as I know, there is no function change_day however, you can make one using UDF. Pass a date and replace day.
Example:
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import TimestampType
from pyspark.sql import functions as F
def change_day(date, day):
return date.replace(day=day)
change_day = F.udf(change_day, TimestampType())
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"date": datetime(2022, 1, 31)}])
df = df.withColumn("23day", change_day(F.col("date"), F.lit(23)))
df.show(20, False)
Result:
+-------------------+-------------------+
|date |23day |
+-------------------+-------------------+
|2022-01-31 00:00:00|2022-01-23 00:00:00|
+-------------------+-------------------+

Related

How to use chaining in pyspark?

I have a dataframe called Incitoand in Supplier Inv Nocolumn of that data frame consists of comma separated values. I need to recreate the data frame by appropriately repeating those comma separated values using pyspark.I am using following python code for that.Can I convert this into pyspark?Is it possible via pyspark?
from itertools import chain
def chainer(s):
return list(chain.from_iterable(s.str.split(',')))
incito['Supplier Inv No'] = incito['Supplier Inv No'].astype(str)
# calculate lengths of splits
lens = incito['Supplier Inv No'].str.split(',').map(len)
# create new dataframe, repeating or chaining as appropriate
dfnew = pd.DataFrame({'Supplier Inv No': chainer(incito['Supplier Inv No']),
'Forwarder': np.repeat(incito['Forwarder'], lens),
'Mode': np.repeat(incito['Mode'], lens),
'File No': np.repeat(incito['File No'], lens),
'ETD': np.repeat(incito['ETD'], lens),
'Flight No': np.repeat(incito['Flight No'], lens),
'Shipped Country': np.repeat(incito['Shipped Country'], lens),
'Port': np.repeat(incito['Port'], lens),
'Delivered_Country': np.repeat(incito['Delivered_Country'], lens),
'AirWeight': np.repeat(incito['AirWeight'], lens),
'FREIGHT CHARGE': np.repeat(incito['FREIGHT CHARGE'], lens)})
This is what I tried in pyspark.But I am not getting the expected outcome.
from pyspark.context import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import functions as F
import pandas as pd
conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
ddf = spark.createDataFrame(dfnew)
exploded = ddf.withColumn('d', F.explode("Supplier Inv No"))
exploded.show()
Something like this, using repeat?
from pyspark.sql import functions as F
df = (spark
.sparkContext
.parallelize([
('ABCD',),
('EFGH',),
])
.toDF(['col_a'])
)
(df
.withColumn('col_b', F.repeat(F.col('col_a'), 2))
.withColumn('col_c', F.repeat(F.lit('X'), 10))
.show()
)
# +-----+--------+----------+
# |col_a| col_b| col_c|
# +-----+--------+----------+
# | ABCD|ABCDABCD|XXXXXXXXXX|
# | EFGH|EFGHEFGH|XXXXXXXXXX|
# +-----+--------+----------+

Changing string to timestamp in Pyspark

I'm trying to convert a string column to Timestamp column which is in the format:
c1
c2
2019-12-10 10:07:54.000
2019-12-13 10:07:54.000
2020-06-08 15:14:49.000
2020-06-18 10:07:54.000
from pyspark.sql.functions import col, udf, to_timestamp
joined_df.select(to_timestamp(joined_df.c1, '%Y-%m-%d %H:%M:%S.%SSSS').alias('dt')).collect()
joined_df.select(to_timestamp(joined_df.c2, '%Y-%m-%d %H:%M:%S.%SSSS').alias('dt')).collect()
When the dates are changed, I want a new column Date difference by subtracting c2-c1
In python I'm doing it:
df['c1'] = df['c1'].fillna('0000-01-01').apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
df['c2'] = df['c2'].fillna('0000-01-01').apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
df['days'] = (df['c2'] - df['c1']).apply(lambda x: x.days)
Can anyone help how to convert to pyspark?
If you want to get the date difference, you can use datediff:
import pyspark.sql.functions as F
df = df.withColumn('c1', F.col('c1').cast('timestamp')).withColumn('c2', F.col('c2').cast('timestamp'))
result = df.withColumn('days', F.datediff(F.col('c2'), F.col('c1')))
result.show(truncate=False)
+-----------------------+-----------------------+----+
|c1 |c2 |days|
+-----------------------+-----------------------+----+
|2019-12-10 10:07:54.000|2019-12-13 10:07:54.000|3 |
|2020-06-08 15:14:49.000|2020-06-18 10:07:54.000|10 |
+-----------------------+-----------------------+----+

How to extract time from timestamp in pyspark?

I have a requirement to extract time from timestamp(this is a column in dataframe) using pyspark.
lets say this is the timestamp 2019-01-03T18:21:39 , I want to extract only time "18:21:39" such that it always appears in this manner "01:01:01"
df = spark.createDataFrame(["2020-06-17T00:44:30","2020-06-17T06:06:56","2020-06-17T15:04:34"],StringType()).toDF('datetime')
df=df.select(df['datetime'].cast(TimestampType()))
I tried like below but did not get the expected result
df1=df.withColumn('time',concat(hour(df['datetime']),lit(":"),minute(df['datetime']),lit(":"),second(df['datetime'])))
display(df1)
+-------------------+-------+
| datetime| time|
+-------------------+-------+
|2020-06-17 00:44:30|0:44:30|
|2020-06-17 06:06:56| 6:6:56|
|2020-06-17 15:04:34|15:4:34|
+-------------------+-------+
my results are like this 6:6:56 but i want them to be 06:06:56
Use the date_format function.
from pyspark.sql.types import StringType
df = spark \
.createDataFrame(["2020-06-17T00:44:30","2020-06-17T06:06:56","2020-06-17T15:04:34"], StringType()) \
.toDF('datetime')
from pyspark.sql.functions import date_format
q = df.withColumn('time', date_format('datetime', 'HH:mm:ss'))
>>> q.show()
+-------------------+--------+
| datetime| time|
+-------------------+--------+
|2020-06-17T00:44:30|00:44:30|
|2020-06-17T06:06:56|06:06:56|
|2020-06-17T15:04:34|15:04:34|
+-------------------+--------+

Spark order by second field to perform timeseries function

I have a csv with a timeseries:
timestamp, measure-name, value, type, quality
1503377580,x.x-2.A,0.5281250,Float,GOOD
1503377340,x.x-1.B,0.0000000,Float,GOOD
1503377400,x.x-1.B,0.0000000,Float,GOOD
The measure-name should be my partition key and I would like to calculate a moving average with pyspark, here my code (for instance) to calculate the max
def mysplit(line):
ll = line.split(",")
return (ll[1],float(ll[2]))
text_file.map(lambda line: mysplit(line)).reduceByKey(lambda a, b: max(a , b)).foreach(print)
However, for the average I would like to respect the timestamp ordering.
How to order by a second column?
You need to use a window function on pyspark dataframes:
First you should transform your rdd to a dataframe:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
df = hc.createDataFrame(text_file.map(lambda l: l.split(','), ['timestamp', 'measure-name', 'value', 'type', 'quality'])
Or load it directly as a dataframe:
local:
import pandas as pd
df = hc.createDataFrame(pd.read_csv(path_to_csv, sep=",", header=0))
from hdfs:
df = hc.read.format("com.databricks.spark.csv").option("delimiter", ",").load(path_to_csv)
Then use a window function:
from pyspark.sql import Window
import pyspark.sql.functions as psf
w = Window.orderBy('timestamp')
df.withColumn('value_rol_mean', psf.mean('value').over(w))
+----------+------------+--------+-----+-------+-------------------+
| timestamp|measure_name| value| type|quality| value_rol_mean|
+----------+------------+--------+-----+-------+-------------------+
|1503377340| x.x-1.B| 0.0|Float| GOOD| 0.0|
|1503377400| x.x-1.B| 0.0|Float| GOOD| 0.0|
|1503377580| x.x-2.A|0.528125|Float| GOOD|0.17604166666666665|
+----------+------------+--------+-----+-------+-------------------+
in .orderByyou can order by as many columns as you want

Calculate time between two dates in pyspark

Hoping this is fairly elementary. I have a Spark dataframe containing a Date column, I want to add a new column with number of days since that date. Google fu is failing me.
Here's what I've tried:
from pyspark.sql.types import *
import datetime
today = datetime.date.today()
schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2016,12,1),)]
df = sqlContext.createDataFrame(l, schema)
df = df.withColumn('daysBetween',today - df.foo)
df.show()
it fails with error:
u"cannot resolve '(17212 - foo)' due to data type mismatch: '(17212 -
foo)' requires (numeric or calendarinterval) type, not date;"
I've tried fiddling around but gotten nowhere. I can't think that this is too hard. Can anyone help?
OK, figured it out
from pyspark.sql.types import *
import pyspark.sql.functions as funcs
import datetime
today = datetime.date(2017,2,15)
schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2017,2,14),)]
df = sqlContext.createDataFrame(l, schema)
df = df.withColumn('daysBetween',funcs.datediff(funcs.lit(today), df.foo))
df.collect()
returns [Row(foo=datetime.date(2017, 2, 14), daysBetween=1)]
You can simply do the following:
import pyspark.sql.functions as F
df = df.withColumn('daysSince', F.datediff(F.current_date(), df.foo))

Resources