Converting yyyymmdd to MM-dd-yyyy format in pyspark - apache-spark

I have a large DataFrame df containing a date column in the format yyyymmdd. How can I convert it to MM-dd-yyyy in PySpark?

from datetime import datetime
from pyspark.sql.functions import col, udf, date_format
from pyspark.sql.types import DateType

rdd = sc.parallelize([('20161231',), ('20140102',), ('20151201',), ('20161124',)])
df1 = sqlContext.createDataFrame(rdd, ['old_col'])
# UDF to convert the yyyymmdd string to a date
func = udf(lambda x: datetime.strptime(x, '%Y%m%d'), DateType())
df = df1.withColumn('new_col', date_format(func(col('old_col')), 'MM-dd-yyyy'))
df.show()

This also works (note the parse format must match the yyyymmdd input):
from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType

func = udf(lambda x: datetime.strptime(str(x), '%Y%m%d'), DateType())
df2 = df.withColumn('date', func(col('InvcDate')))
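A UDF is not strictly required here. Assuming Spark 2.2 or later, the built-in to_date and date_format functions do the same conversion natively and avoid the Python serialization overhead of a UDF (a sketch, reusing df1 and old_col from above):
from pyspark.sql.functions import col, to_date, date_format

# parse the yyyymmdd string into a DateType, then render it as MM-dd-yyyy
df = df1.withColumn(
    'new_col',
    date_format(to_date(col('old_col'), 'yyyyMMdd'), 'MM-dd-yyyy')
)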

Related

Create dataframe with timestamp field

On Databricks, the following code snippet
%python
from pyspark.sql.types import StructType, StructField, TimestampType
from pyspark.sql import functions as F
data = [F.current_timestamp()]
schema = StructType([StructField("current_timestamp", TimestampType(), True)])
df = spark.createDataFrame(data, schema)
display(df)
displays a table with value "null". I would expect to see the current timestamp there. Why is this not the case?
createDataFrame does not accept PySpark Column expressions such as F.current_timestamp(); the data must be concrete Python values.
You could pass Python's datetime.datetime.now() instead:
import datetime
df = spark.createDataFrame([(datetime.datetime.now(),)], ['ts'])
Defining schema beforehand:
from pyspark.sql.types import *
import datetime
data = [(datetime.datetime.now(),)]
schema = StructType([StructField("current_timestamp", TimestampType(), True)])
df = spark.createDataFrame(data, schema)
Or add the timestamp column afterwards:
from pyspark.sql import functions as F
df = spark.range(3)
df1 = df.select(
    F.current_timestamp().alias('ts')
)
df2 = df.withColumn('ts', F.current_timestamp())

how to append a dataframe to existing partitioned table to specific partition

I have an existing table like below
create_table=""" create table tbl1 (tran int,count int) partitioned by (year string) """
spark.sql(create_table)
insert_query="insert into tbl1 partition(year='2022') values (101,500)"
spark.sql(insert_query)
and I create a dataframe like below
from pyspark.sql.functions import *
from datetime import datetime
rows = [
    (1, 501),
    (2, 502),
    (3, 503)
]
from pyspark.sql.types import *
myschema = StructType([
    StructField("id", LongType(), True),
    StructField("count", LongType(), True)
])
df = spark.createDataFrame(rows, myschema)
Now I want to append this dataframe to the table above, adding the values to the existing partition year='2022'.
How can I do that?
When you create the dataframe, you could include the year as well, then partitionBy and write into the table:
from pyspark.sql.types import StructType, StructField, LongType, StringType
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Test').getOrCreate()
rows = [
    (1, 501, '2022'),
    (2, 502, '2022'),
    (3, 503, '2022')
]
myschema = StructType([
    StructField("id", LongType(), True),
    StructField("count", LongType(), True),
    StructField("year", StringType(), True)
])
df = spark.createDataFrame(rows, myschema)
df.write.mode('append').partitionBy('year').saveAsTable('tbl1')
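Note that tbl1 was created with columns (tran, count) partitioned by year, while the dataframe names its first column id; if that name mismatch causes the append to be rejected, an alternative (assuming the dataframe columns are in the same positional order as the table) is insertInto, which matches columns by position rather than by name:
# append into the existing partitioned table; the 'year' value in each row
# decides which partition the row lands in
df.write.mode('append').insertInto('tbl1')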

Transform a column in a sparksql dataframe using python

Hi, I have a Spark SQL dataframe with a whole bunch of columns. One of the columns ("date") is a date field, and I want to apply the following transformation to every row in that column.
This is what I would do if it were a pandas dataframe; I can't seem to figure out the Spark equivalent:
df["date"] = df["date"].map(lambda x: x.isoformat() + "Z")
The column has values of the form
2020-12-07 01:01:48
I want the values to be of the form:
2020-12-07T01:01:48Z
Try something like this:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import functions as F

schema = StructType([
    StructField("date", StringType(), True),
    StructField("age", IntegerType(), True)])
df = spark.createDataFrame([(None, 22), (None, 25)], schema=schema)

Z = F.lit("Z").cast(StringType())
datetime = F.current_date().cast(StringType())
datetimeZ = F.concat(datetime, Z)
df = df.withColumn("date", datetimeZ)
df.show(5)
+-----------+---+
| date|age|
+-----------+---+
|2021-06-15Z| 22|
|2021-06-15Z| 25|
+-----------+---+
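The snippet above overwrites the column with the current date plus a "Z" suffix. Assuming the real goal is to reformat the existing timestamp values from the question, date_format can produce the ISO-8601 string with the trailing Z directly (a sketch, assuming the column is named date):
from pyspark.sql import functions as F

# render the existing timestamp column as e.g. 2020-12-07T01:01:48Z
df = df.withColumn("date", F.date_format("date", "yyyy-MM-dd'T'HH:mm:ss'Z'"))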

How to convert and work with datetime in pandas?

I have a datetime field like 2017-01-15T02:41:38.466Z and would like to convert it to the %Y-%m-%d format. How can this be achieved in pandas or Python?
I tried this
frame['datetime_ordered'] = pd.datetime(frame['datetime_ordered'], format='%Y-%m-%d')
but I get the error:
cannot convert the series to <class 'int'>
The following code worked:
from datetime import datetime
import pandas as pd

d_parser = lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S.%fZ')
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0,
                     parse_dates=['datetime_ordered'], date_parser=d_parser)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
import pandas as pd

date_str = "2017-01-15T02:41:38.466Z"
a_date = pd.to_datetime(date_str)
print("date time value", a_date)
# datetime to string with format
print(a_date.strftime('%Y-%m-%d'))
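For a whole column rather than a single string, the same pair of calls applies element-wise (a minimal sketch, assuming frame['datetime_ordered'] still holds the raw ISO strings; the date_only column name is just illustrative):
import pandas as pd

# parse the ISO strings, then format every value as YYYY-MM-DD
frame['date_only'] = pd.to_datetime(frame['datetime_ordered']).dt.strftime('%Y-%m-%d')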

Formatting duration as "hh:mm:ss" and write to pandas dataframe and to save it as CSV file

I imported data from a CSV file into a pandas dataframe.
Then I created a duration column by taking the difference of two datetime columns, as follows:
df['Drive Time'] = df['Delivered Time'] - df['Pickup Time']
Now I want to save it back to a CSV file, but I want the 'Drive Time' column displayed in "hh:mm:ss" format when the file is opened in Excel. The code I used is below:
import pandas as pd
import numpy as np
df = pd.read_csv("1554897620.csv", parse_dates = ['Pickup Time', 'Delivered Time'])
df['Drive Time'] = df['Delivered Time'] - df['Pickup Time']
df.to_csv(index=False)
df.to_csv('test.csv', index=False)
In conclusion, I want to save the Drive Time column in "hh:mm:ss" format when exporting to CSV.
If you know that the drive time is never greater than 24 hours, you can use this trick:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['Delivered Time', 'Pickup Time'])
df['Delivered Time'] = pd.date_range('2019-01-01 13:00', '2019-01-02 13:00', periods=12)
df['Pickup Time'] = pd.date_range('2019-01-01 12:00', '2019-01-02 12:00', periods=12)
df['Drive Time'] = df['Delivered Time'] - df['Pickup Time']
# Trick: transform timedelta to datetime object to enable strftime:
df['Drive Time'] = pd.to_datetime(df['Drive Time']).dt.strftime("%H:%M:%S")
df.to_csv('test.csv')
By transforming the timedelta to a datetime data type, you can use its strftime method.
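If the drive time can exceed 24 hours, or you prefer not to rely on the timedelta-to-datetime conversion, you can build the hh:mm:ss string from the total number of seconds instead (a sketch, using a hypothetical format_timedelta helper):
# format each timedelta as hh:mm:ss, with hours allowed to exceed 24
def format_timedelta(td):
    total = int(td.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

df['Drive Time'] = (df['Delivered Time'] - df['Pickup Time']).map(format_timedelta)
df.to_csv('test.csv', index=False)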
