Change the day of the date to a particular day

I have a requirement for a column that holds the period end date. The period always ends on the 23rd of the month.
I need to take a date from a column (in this case it is always the last day of a month) and set the "day" of that date to 23.
I have tried the following:
.withColumn("periodEndDate", change_day(jsonDF2.periodDate, sf.lit(23)))
but it fails with:
cannot import name 'change_day' from 'pyspark.sql.functions'

You can use make_date (available in Spark SQL since Spark 3.0):
from pyspark.sql import functions as F
df = spark.createDataFrame([('2022-05-31',)], ['periodDate'])
df = df.withColumn('periodEndDate', F.expr("make_date(year(periodDate), month(periodDate), 23)"))
df.show()
# +----------+-------------+
# |periodDate|periodEndDate|
# +----------+-------------+
# |2022-05-31|   2022-05-23|
# +----------+-------------+

As far as I know, there is no change_day function. However, you can build one as a UDF that takes a date and replaces its day.
Example:
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import TimestampType
from pyspark.sql import functions as F
def change_day(date, day):
    return date.replace(day=day)
change_day = F.udf(change_day, TimestampType())
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"date": datetime(2022, 1, 31)}])
df = df.withColumn("23day", change_day(F.col("date"), F.lit(23)))
df.show(20, False)
Result:
+-------------------+-------------------+
|date               |23day              |
+-------------------+-------------------+
|2022-01-31 00:00:00|2022-01-23 00:00:00|
+-------------------+-------------------+
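
One caveat with the UDF above: a null in the column reaches Python as None, and calling None.replace raises an AttributeError. Below is a minimal null-safe sketch of the helper in plain Python. Returning None for null input is an assumption about the desired behaviour; the function can be wrapped with F.udf in the same way as in the answer.

```python
from datetime import datetime


def change_day(value, day):
    """Return value with its day-of-month replaced, or None for null input."""
    if value is None:
        return None
    # datetime.replace raises ValueError if the day does not exist in that
    # month (for example day=31 in February); day=23 exists in every month.
    return value.replace(day=day)


result = change_day(datetime(2022, 1, 31), 23)
print(result)                # 2022-01-23 00:00:00
print(change_day(None, 23))  # None
```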

Related

How to use chaining in pyspark?

I have a dataframe called incito, and its Supplier Inv No column contains comma-separated values. I need to recreate the dataframe by repeating the other columns for each of those comma-separated values. I am using the following pandas code for that. Is it possible to do the same in PySpark?
from itertools import chain
import numpy as np
import pandas as pd

def chainer(s):
    # flatten the per-row lists of split values into one list
    return list(chain.from_iterable(s.str.split(',')))

incito['Supplier Inv No'] = incito['Supplier Inv No'].astype(str)
# calculate lengths of splits
lens = incito['Supplier Inv No'].str.split(',').map(len)
# create new dataframe, repeating or chaining as appropriate
dfnew = pd.DataFrame({'Supplier Inv No': chainer(incito['Supplier Inv No']),
                      'Forwarder': np.repeat(incito['Forwarder'], lens),
                      'Mode': np.repeat(incito['Mode'], lens),
                      'File No': np.repeat(incito['File No'], lens),
                      'ETD': np.repeat(incito['ETD'], lens),
                      'Flight No': np.repeat(incito['Flight No'], lens),
                      'Shipped Country': np.repeat(incito['Shipped Country'], lens),
                      'Port': np.repeat(incito['Port'], lens),
                      'Delivered_Country': np.repeat(incito['Delivered_Country'], lens),
                      'AirWeight': np.repeat(incito['AirWeight'], lens),
                      'FREIGHT CHARGE': np.repeat(incito['FREIGHT CHARGE'], lens)})
This is what I tried in PySpark, but I am not getting the expected outcome.
from pyspark.context import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import functions as F
import pandas as pd
conf = SparkConf().setAppName("appName").setMaster("local")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
ddf = spark.createDataFrame(dfnew)
exploded = ddf.withColumn('d', F.explode("Supplier Inv No"))
exploded.show()

Something like this, using repeat?
from pyspark.sql import functions as F
df = (spark
      .sparkContext
      .parallelize([
          ('ABCD',),
          ('EFGH',),
      ])
      .toDF(['col_a'])
      )
(df
 .withColumn('col_b', F.repeat(F.col('col_a'), 2))
 .withColumn('col_c', F.repeat(F.lit('X'), 10))
 .show()
 )
# +-----+--------+----------+
# |col_a|   col_b|     col_c|
# +-----+--------+----------+
# | ABCD|ABCDABCD|XXXXXXXXXX|
# | EFGH|EFGHEFGH|XXXXXXXXXX|
# +-----+--------+----------+

Changing string to timestamp in Pyspark

I'm trying to convert two string columns to timestamp columns. The data is in this format:
|c1                     |c2                     |
|2019-12-10 10:07:54.000|2019-12-13 10:07:54.000|
|2020-06-08 15:14:49.000|2020-06-18 10:07:54.000|
This is what I tried:
from pyspark.sql.functions import col, udf, to_timestamp
joined_df.select(to_timestamp(joined_df.c1, '%Y-%m-%d %H:%M:%S.%SSSS').alias('dt')).collect()
joined_df.select(to_timestamp(joined_df.c2, '%Y-%m-%d %H:%M:%S.%SSSS').alias('dt')).collect()
Once the columns are converted, I want a new date-difference column computed as c2 - c1.
In pandas I do it like this:
df['c1'] = df['c1'].fillna('0000-01-01').apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
df['c2'] = df['c2'].fillna('0000-01-01').apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
df['days'] = (df['c2'] - df['c1']).apply(lambda x: x.days)
Can anyone help me convert this to PySpark?

If you want to get the date difference, you can use datediff:
import pyspark.sql.functions as F
df = df.withColumn('c1', F.col('c1').cast('timestamp')).withColumn('c2', F.col('c2').cast('timestamp'))
result = df.withColumn('days', F.datediff(F.col('c2'), F.col('c1')))
result.show(truncate=False)
+-----------------------+-----------------------+----+
|c1                     |c2                     |days|
+-----------------------+-----------------------+----+
|2019-12-10 10:07:54.000|2019-12-13 10:07:54.000|3   |
|2020-06-08 15:14:49.000|2020-06-18 10:07:54.000|10  |
+-----------------------+-----------------------+----+

How to extract time from timestamp in pyspark?

I need to extract the time from a timestamp (a column in a dataframe) using PySpark.
For example, given the timestamp 2019-01-03T18:21:39, I want to extract only the time "18:21:39", always zero-padded in the form "01:01:01".
df = spark.createDataFrame(["2020-06-17T00:44:30","2020-06-17T06:06:56","2020-06-17T15:04:34"],StringType()).toDF('datetime')
df=df.select(df['datetime'].cast(TimestampType()))
I tried the following but did not get the expected result:
df1=df.withColumn('time',concat(hour(df['datetime']),lit(":"),minute(df['datetime']),lit(":"),second(df['datetime'])))
display(df1)
+-------------------+-------+
|           datetime|   time|
+-------------------+-------+
|2020-06-17 00:44:30|0:44:30|
|2020-06-17 06:06:56| 6:6:56|
|2020-06-17 15:04:34|15:4:34|
+-------------------+-------+
My results look like 6:6:56, but I want them to be 06:06:56.

Use the date_format function.
from pyspark.sql.types import StringType
df = spark \
    .createDataFrame(["2020-06-17T00:44:30","2020-06-17T06:06:56","2020-06-17T15:04:34"], StringType()) \
    .toDF('datetime')
from pyspark.sql.functions import date_format
q = df.withColumn('time', date_format('datetime', 'HH:mm:ss'))
q.show()
+-------------------+--------+
|           datetime|    time|
+-------------------+--------+
|2020-06-17T00:44:30|00:44:30|
|2020-06-17T06:06:56|06:06:56|
|2020-06-17T15:04:34|15:04:34|
+-------------------+--------+

Spark order by second field to perform timeseries function

I have a csv with a timeseries:
timestamp, measure-name, value, type, quality
1503377580,x.x-2.A,0.5281250,Float,GOOD
1503377340,x.x-1.B,0.0000000,Float,GOOD
1503377400,x.x-1.B,0.0000000,Float,GOOD
The measure-name should be my partition key, and I would like to calculate a moving average with PySpark. Here is my code that calculates the max, for instance:
def mysplit(line):
    ll = line.split(",")
    return (ll[1], float(ll[2]))
text_file.map(lambda line: mysplit(line)).reduceByKey(lambda a, b: max(a, b)).foreach(print)
However, for the average I would like to respect the timestamp ordering.
How to order by a second column?

You need to use a window function on a PySpark dataframe.
First, convert your RDD to a dataframe:
from pyspark.sql import HiveContext
hc = HiveContext(sc)
df = hc.createDataFrame(text_file.map(lambda l: l.split(',')), ['timestamp', 'measure-name', 'value', 'type', 'quality'])
Or load it directly as a dataframe.
From a local file:
import pandas as pd
df = hc.createDataFrame(pd.read_csv(path_to_csv, sep=",", header=0))
From HDFS:
df = hc.read.format("com.databricks.spark.csv").option("delimiter", ",").load(path_to_csv)
Then use a window function:
from pyspark.sql import Window
import pyspark.sql.functions as psf
w = Window.orderBy('timestamp')
df.withColumn('value_rol_mean', psf.mean('value').over(w))
+----------+------------+--------+-----+-------+-------------------+
| timestamp|measure_name|   value| type|quality|     value_rol_mean|
+----------+------------+--------+-----+-------+-------------------+
|1503377340|     x.x-1.B|     0.0|Float|   GOOD|                0.0|
|1503377400|     x.x-1.B|     0.0|Float|   GOOD|                0.0|
|1503377580|     x.x-2.A|0.528125|Float|   GOOD|0.17604166666666665|
+----------+------------+--------+-----+-------+-------------------+
In .orderBy you can order by as many columns as you want.

Calculate time between two dates in pyspark

Hoping this is fairly elementary. I have a Spark dataframe containing a Date column, I want to add a new column with number of days since that date. Google fu is failing me.
Here's what I've tried:
from pyspark.sql.types import *
import datetime
today = datetime.date.today()
schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2016,12,1),)]
df = sqlContext.createDataFrame(l, schema)
df = df.withColumn('daysBetween',today - df.foo)
df.show()
it fails with error:
u"cannot resolve '(17212 - foo)' due to data type mismatch: '(17212 -
foo)' requires (numeric or calendarinterval) type, not date;"
I've tried fiddling around but gotten nowhere. I can't think that this is too hard. Can anyone help?

OK, figured it out
from pyspark.sql.types import *
import pyspark.sql.functions as funcs
import datetime
today = datetime.date(2017,2,15)
schema = StructType([StructField("foo", DateType(), True)])
l = [(datetime.date(2017,2,14),)]
df = sqlContext.createDataFrame(l, schema)
df = df.withColumn('daysBetween',funcs.datediff(funcs.lit(today), df.foo))
df.collect()
returns [Row(foo=datetime.date(2017, 2, 14), daysBetween=1)]

You can simply do the following:
import pyspark.sql.functions as F
df = df.withColumn('daysSince', F.datediff(F.current_date(), df.foo))
