I am new to PySpark and I am trying to create a function that can be reused across projects to convert an input column from StringType to TimestampType.
The input column string looks like: 23/04/2021 12:00:00 AM
I want this converted to TimestampType so I can get the latest date using PySpark.
Below is the function I so far created:
def datetype_change(self, key, col):
    self.log.info("datetype_change...".format(self.app_name.upper()))
    self.df[key] = self.df[key].withColumn("column_name", F.unix_timestamp(F.col("column_name"), 'yyyy-MM-dd HH:mm:ss').cast(TimestampType()))
When I run it I'm getting an error:
NameError: name 'TimestampType' is not defined
How do I change this function so it produces the intended output?
Found my answer. The NameError was because TimestampType was never imported, and the format string has to match the dd/MM/yyyy hh:mm:ss AM/PM input:
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

def datetype_change(self, key, col):
    self.log.info("{}-datetype_change...".format(self.app_name.upper()))
    self.df[key] = self.df[key].withColumn(col, F.unix_timestamp(self.df[key][col], 'dd/MM/yyyy hh:mm:ss aa').cast(TimestampType()))
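With the column converted, the latest date (the original goal) is just an aggregate away. A minimal sketch, assuming df is a DataFrame holding the converted column and event_time is a hypothetical column name:

import pyspark.sql.functions as F

# F.max on a timestamp column returns the most recent value
latest = df.agg(F.max("event_time").alias("latest")).collect()[0]["latest"]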
I have a date value in a column of string type that takes this format:
06-MAY-16 09.17.15
I want to convert it to this format:
20160506
I have tried using DATE_FORMAT(TO_DATE(<column>), 'yyyyMMdd') but a NULL value is returned.
Does anyone have any ideas about how to go about doing this in pyspark or spark SQL?
Thanks
I've got it! This is the code I used which seems to have worked:
FROM_UNIXTIME(UNIX_TIMESTAMP(<column>, 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd')
Hope this helps others!
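That expression runs anywhere Spark SQL is accepted; here is a quick sanity check (my_col is a hypothetical column name and spark an existing SparkSession):

df = spark.createDataFrame([("06-MAY-16 09.17.15",)], ["my_col"])
df.selectExpr(
    "FROM_UNIXTIME(UNIX_TIMESTAMP(my_col, 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd') AS formatted"
).show()
# formatted = 20160506

Note that Spark 3's stricter parser can reject the upper-case month name 'MAY'; setting spark.sql.legacy.timeParserPolicy=LEGACY restores the old lenient behaviour.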
Your original attempt is close to the solution. You just needed to add the format to the TO_DATE() function. This will work as well:
DATE_FORMAT(TO_DATE(<col>, 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd')
And for pyspark:
import pyspark.sql.functions as F
df = df.withColumn('<col>', F.date_format(F.to_date(F.col('<col>'), 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd'))
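A quick check of that line against the sample value (my_col is a hypothetical column name; the same Spark-version caveat about 'MAY' applies):

df = spark.createDataFrame([("06-MAY-16 09.17.15",)], ["my_col"])
df = df.withColumn('my_col', F.date_format(F.to_date(F.col('my_col'), 'dd-MMM-yy HH.mm.ss'), 'yyyyMMdd'))
df.show()  # my_col is now '20160506'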
Convert your string to a date before you try to 'reformat' it.
Convert pyspark string to date format -- to_timestamp(df.t, 'dd-MMM-yy HH.mm.ss').alias('my_date')
Pyspark date yyyy-mmm-dd conversion -- date_format(col("my_date"), "yyyyMMdd")
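Chained together, the two steps from those links look like this (a sketch; t is the string column name used in the linked question):

from pyspark.sql.functions import col, date_format, to_timestamp

df = df.withColumn("my_date", date_format(to_timestamp(col("t"), "dd-MMM-yy HH.mm.ss"), "yyyyMMdd"))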
I am taking a dataframe and inserting it into a Postgresql table.
One column in the dataframe is a datetime64 dtype. The column type in PostgreSQL is 'timestamp without time zone'. To prepare the dataframe for insert, I am using to_records:
listdf = df.to_records(index=False).tolist()
When I run to_records and then psycopg2's cur.executemany(), it raises an error saying I am trying to insert bigint into a timestamp without time zone column.
So I tried adding a dict of column_dtypes to to_records, but that doesn't work either. The code below gives the error: "ValueError: Cannot convert from specific units to generic units in NumPy datetimes or timedeltas"
DictofDTypes = dict.fromkeys(SQLdfColHeadings, 'float')
DictofDTypes['Date_Time'] = 'datetime64'
listdf = df.to_records(index=False,column_dtypes=DictofDTypes).tolist()
I have also tried str, int, and float as the dtype; none of them worked in the three lines above.
How do I convert the column properly to be able to insert the column into a timestamp sql column?
I removed the column_dtypes argument from to_records.
And before to_records, I converted the datetime column to str with:
df['Date_Time'] = df['Date_Time'].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))
The sql insert command then worked.
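Putting the pieces together, the working flow looks roughly like this (table name, column list, and connection DSN are all hypothetical):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
cur = conn.cursor()

# format the datetime column as strings, then build the record list
df['Date_Time'] = df['Date_Time'].apply(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))
listdf = df.to_records(index=False).tolist()

# psycopg2 adapts the formatted strings to timestamp without time zone
cur.executemany("INSERT INTO my_table (val, date_time) VALUES (%s, %s)", listdf)
conn.commit()

(For what it's worth, the ValueError came from the bare 'datetime64' having no unit; NumPy only accepts unit-qualified dtypes such as 'datetime64[us]' there.)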
I would like to convert a column in a dataframe to a string.
It looks like this:
company  department  id        family  name  start_date  end_date
abc      sales       38221925  Levy    nali  16/05/2017  01/01/2018
I want to convert the id from int to string
I tried
data['id']=data['id'].to_string()
and
data['id']=data['id'].astype(str)
but checking the dtype still gives dtype('O').
I expected the dtype to be string.
This is intended behaviour. This is how pandas stores strings.
From the docs
Pandas uses the object dtype for storing strings.
For a simple test, you can make a dummy dataframe and check its dtype too.
import pandas as pd
df = pd.DataFrame(["abc", "ab"])
df[0].dtype
# Output:
dtype('O')
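To confirm the values really are Python strings despite the object dtype, check the element types (a quick sketch):

import pandas as pd

df = pd.DataFrame(["abc", "ab"])
print(df[0].map(type).unique())  # [<class 'str'>] -- each element is a str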
You can do that by using the apply() function in this way:
data['id'] = data['id'].apply(lambda x: str(x))
This will convert all the values of id column to string.
You can check the type of the values like this:
type(data['id'][0])  # checks the first value of the 'id' column
This will give the output str.
And data['id'].dtype will give dtype('O'), that is, object.
You can also use data.info() to check all the information about that DataFrame.
str() easily converts a value to a string:
str(12)
>> '12'
I have a dataset with a datecreated column. This column is typically in the format 'dd/MM/yy', but sometimes it contains garbage text. I ultimately want to convert the column to a DATE and have the garbage text become NULL values.
I have been trying to use resolveChoice, but it results in all NULL values.
data_res = date_dyf.resolveChoice(specs=[('datescanned', 'cast:timestamp')])
Sample data:
3,1/1/18,text7
93,this is a test,text8
9,this is a test,text9
82,12/12/17,text10
Try converting the DynamicFrame into a Spark DataFrame and parsing the date with the to_date function:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import to_date

df = date_dyf.toDF()
parsedDateDf = df.withColumn("datescanned", to_date(df["datescanned"], "dd/MM/yy"))
dyf = DynamicFrame.fromDF(parsedDateDf, glueContext, "convertedDyf")
If a string doesn't match the format, a null value will be set.
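The null-on-mismatch behaviour is plain Spark, so it can be checked without Glue. A sketch using the sample rows above (spark is an existing SparkSession):

from pyspark.sql.functions import to_date

df = spark.createDataFrame(
    [("3", "1/1/18"), ("93", "this is a test"), ("9", "this is a test"), ("82", "12/12/17")],
    ["id", "datescanned"],
)
# on newer Spark versions the single-digit day/month may need the pattern d/M/yy
df.withColumn("datescanned", to_date(df["datescanned"], "dd/MM/yy")).show()
# the two 'this is a test' rows come back with datescanned = null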
I am in the process of converting multiple string columns to datetime columns, but I am running into the following issues:
Example column 1:
1/11/2018 9:00:00 AM
Code:
df = df.withColumn("column_name", to_timestamp(df.column_name, "MM/dd/yyyy hh:mm:ss aa"))
This works okay
Example column 2:
2019-01-10T00:00:00-05:00
Code:
df = df.withColumn("column_name", to_date(df.column_name, "yyyy-MM-dd'T'HH:mm:ss'-05:00'"))
This works okay
Example column 3:
20190112
Code:
df = df.withColumn("column_name", to_date(df.column_name, "yyyyMMdd"))
This does not work. I get this error:
AnalysisException: "cannot resolve 'unix_timestamp(t.`date`,
'yyyyMMdd')' due to data type mismatch: argument 1 requires (string or
date or timestamp) type, however, 't.`date`' is of int type.
I feel like it should be straightforward, but I am missing something.
The error is pretty self-explanatory: you need your column to be a String.
Are you sure your column is already a String? It seems not. You can cast it to String first with column.cast:
from pyspark.sql.types import StringType

df = df.withColumn("column_name", to_date(df.column_name.cast(StringType()), "yyyyMMdd"))
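A quick end-to-end check with the integer sample (column name hypothetical, spark an existing SparkSession):

from pyspark.sql.functions import to_date
from pyspark.sql.types import StringType

df = spark.createDataFrame([(20190112,)], ["d"])
df = df.withColumn("d", to_date(df["d"].cast(StringType()), "yyyyMMdd"))
df.show()  # d is now the date 2019-01-12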