PySpark string to timestamp conversion

How can I convert a timestamp stored as a string to a timestamp in "yyyy-mm-ddThh:mm:ss.sssZ" format using PySpark?
Input timestamp (string), df:
| col_string |
| :-------------------- |
| 5/15/2022 2:11:06 AM |
Desired output (timestamp), df:
| col_timestamp |
| :---------------------- |
| 2022-05-15T2:11:06.000Z |

to_timestamp can be used, providing the optional format parameter that describes the input string:
from pyspark.sql import functions as F
df = spark.createDataFrame([("5/15/2022 2:11:06 AM",)], ["col_string"])
df = df.select(F.to_timestamp("col_string", "M/dd/yyyy h:mm:ss a").alias("col_ts"))
df.show()
# +-------------------+
# | col_ts|
# +-------------------+
# |2022-05-15 02:11:06|
# +-------------------+
df.printSchema()
# root
# |-- col_ts: timestamp (nullable = true)
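If you also need the string representation from the question ("yyyy-mm-ddThh:mm:ss.sssZ" with a literal Z), date_format can render the parsed timestamp back to a string. A minimal sketch; note that 'Z' here is a literal suffix, not a time-zone conversion:
df.select(
    F.date_format("col_ts", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").alias("col_iso")
).show(truncate=False)
# +------------------------+
# |col_iso                 |
# +------------------------+
# |2022-05-15T02:11:06.000Z|
# +------------------------+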

Related

Copying column name as dictionary key in all values of column in Pyspark dataframe

I have pyspark df, distributed across the cluster as follows:
Name ID
A 1
B 2
C 3
I want to modify the 'ID' column so that every value becomes a Python dictionary, with the column name as the key and the existing value as the value, as follows:
Name TRACEID
A {ID:1}
B {ID:2}
C {ID:3}
How do I achieve this using PySpark code? I need an efficient solution, since it's a high-volume df distributed across the cluster.
Thanks in advance.
You can first construct a struct from the ID column, and then use the to_json function to convert it to the desired format.
from pyspark.sql import functions as F
df = df.select('Name', F.to_json(F.struct(F.col('ID'))).alias('TRACEID'))
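For the sample data above this gives one JSON string per row, roughly:
df.show()
# +----+--------+
# |Name| TRACEID|
# +----+--------+
# |   A|{"ID":1}|
# |   B|{"ID":2}|
# |   C|{"ID":3}|
# +----+--------+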
You can use the create_map function
from pyspark.sql.functions import col, lit, create_map
sparkDF.withColumn("ID_dict", create_map(lit("id"),col("ID"))).show()
# +----+---+---------+
# |Name| ID| ID_dict|
# +----+---+---------+
# | A| 1|{id -> 1}|
# | B| 2|{id -> 2}|
# | C| 3|{id -> 3}|
# +----+---+---------+
Rename/drop columns:
df = sparkDF.withColumn("ID_dict", create_map(lit("id"), col("ID"))).drop(col("ID")).withColumnRenamed("ID_dict", "ID")
df.show()
# +----+---------+
# |Name| ID|
# +----+---------+
# | A|{id -> 1}|
# | B|{id -> 2}|
# | C|{id -> 3}|
# +----+---------+
df.printSchema()
# root
# |-- Name: string (nullable = true)
# |-- ID: map (nullable = false)
# | |-- key: string
# | |-- value: long (valueContainsNull = true)
You get a column with the map data type, which is well suited to representing a dictionary.
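If you later need to read a value back out of the map, element access by key works directly on the column; a small sketch using the renamed df from above:
from pyspark.sql.functions import col

df.select("Name", col("ID")["id"].alias("id_value")).show()
# +----+--------+
# |Name|id_value|
# +----+--------+
# |   A|       1|
# |   B|       2|
# |   C|       3|
# +----+--------+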

column not present in pyspark dataframe?

I have a PySpark dataframe df with IP addresses as column names, like below:
summary `0.0.0.0` 8.8.8.8 1.0.0.0 1.1.1.1
count 14 14 14 14
min 123 231 423 54
max 2344 241 555 100
When I call df.columns it gives me the column list below, but the backquote special characters around the first column are missing from the list.
[0.0.0.0, 8.8.8.8, 1.0.0.0, 1.1.1.1]
And when I perform any operation using this list, it gives me an error that column 0.0.0.0 is not present in the dataframe.
Also, I tried to change the column names using the code below, but nothing changes because the backquotes are not in the list.
import re
df = df.select([F.col(c).alias(re.sub("[`]+", "", c)) for c in df.columns])
How to resolve this issue?
Schema of the df is like below after performing df.printSchema()
root
|-- summary: string (nullable = true)
|-- 0.0.0.0: string (nullable = true)
|-- 8.8.8.8: string (nullable = true)
|-- 1.0.0.0: string (nullable = true)
|-- 1.1.1.1: string (nullable = true)
With numbers as the first character of the column name, you can always force backticks around it when querying:
df.select('summary', '`0.0.0.0`').show()
# +-------+-------+
# |summary|0.0.0.0|
# +-------+-------+
# | count| 14|
# | min| 123|
# | max| 2344|
# +-------+-------+
df.select(['summary'] + [f'`{col}`' for col in df.columns if col != 'summary']).show()
# +-------+-------+-------+-------+-------+
# |summary|0.0.0.0|8.8.8.8|1.0.0.0|1.1.1.1|
# +-------+-------+-------+-------+-------+
# | count| 14| 14| 14| 14|
# | min| 123| 231| 423| 54|
# | max| 2344| 241| 555| 100|
# +-------+-------+-------+-------+-------+
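If you would rather rename the columns so backticks aren't needed at all, you can select with backticks and alias each column; a sketch that replaces the dots with underscores (the underscore naming is just an illustration):
from pyspark.sql import functions as F

df = df.select([F.col(f'`{c}`').alias(c.replace('.', '_')) for c in df.columns])
df.columns
# ['summary', '0_0_0_0', '8_8_8_8', '1_0_0_0', '1_1_1_1']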

How to create dataframe in pyspark with two columns, one string and one array?

I have a list of strings say
id = ['a','b','c']
A list of numpy arrays
value = [array([1,2]),array([2,3]),array([3,4])]
I want to create a pyspark dataframe like
| id | value |
| -------- | -------- |
| a | [1,2] |
| b | [2,3] |
| c | [3,4] |
How can I do it?
The other answer would not work for NumPy arrays. To create a dataframe from NumPy arrays, you need to convert each array to a Python list of integers first.
from numpy import array
id = ['a','b','c']
value = [array([1,2]),array([2,3]),array([3,4])]
df = spark.createDataFrame(
    [(i, list(map(int, j))) for (i, j) in zip(id, value)],
    ['id', 'value']
)
df.show()
+---+------+
| id| value|
+---+------+
| a|[1, 2]|
| b|[2, 3]|
| c|[3, 4]|
+---+------+
df.printSchema()
root
|-- id: string (nullable = true)
|-- value: array (nullable = true)
| |-- element: long (containsNull = true)
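If the arrays hold plain integers, NumPy's ndarray.tolist() does the same conversion a little more directly; an equivalent sketch:
df = spark.createDataFrame(
    [(i, j.tolist()) for (i, j) in zip(id, value)],
    ['id', 'value']
)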
You can do something like this (passing the column names explicitly so they don't default to _1 and _2):
id = ['a','b','c']
value = [[1,2],[2,3],[3,4]]
l = list(zip(id, value))
df = spark.createDataFrame(l, ['id', 'value'])

How to cast string to timestamp with nanoseconds in pyspark

I am working with data whose timestamps contain nanoseconds, and I am trying to convert the strings to timestamp format.
Here is what the 'Time' column looks like:
+---------------+
| Time |
+---------------+
|091940731349000|
|092955002327000|
|092955004088000|
+---------------+
I would like to cast it to:
+------------------+
| Timestamp |
+------------------+
|09:19:40.731349000|
|09:29:55.002327000|
|09:29:55.004088000|
+------------------+
From what I have found online, I don't need to use a udf to do this and there should be a native function that I can use.
I have tried cast and to_timestamp but got 'null' values:
df_new = df.withColumn('Timestamp', df.Time.cast("timestamp"))
df_new.select('Timestamp').show()
+---------+
|Timestamp|
+---------+
| null|
| null|
+---------+
There are two problems in your code:
1. The input is not a valid timestamp representation.
2. Spark doesn't provide a type that can represent a time of day without a date component.
The closest you can get to the required output is to convert the input to the JDBC-compliant java.sql.Timestamp format:
from pyspark.sql.functions import col, regexp_replace
df = spark.createDataFrame(
    ["091940731349000", "092955002327000", "092955004088000"],
    "string"
).toDF("time")

df.select(regexp_replace(
    col("time"),
    "^(\\d{2})(\\d{2})(\\d{2})(\\d{9}).*",
    "1970-01-01 $1:$2:$3.$4"
).cast("timestamp").alias("time")).show(truncate=False)
# +--------------------------+
# |time |
# +--------------------------+
# |1970-01-01 09:19:40.731349|
# |1970-01-01 09:29:55.002327|
# |1970-01-01 09:29:55.004088|
# +--------------------------+
If you want just a string, skip the cast and limit the output to:
df.select(regexp_replace(
    col("time"),
    "^(\\d{2})(\\d{2})(\\d{2})(\\d{9}).*",
    "$1:$2:$3.$4"
).alias("time")).show(truncate=False)
# +------------------+
# |time |
# +------------------+
# |09:19:40.731349000|
# |09:29:55.002327000|
# |09:29:55.004088000|
# +------------------+
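Keep in mind that Spark's timestamp type only stores microsecond precision, which is why the cast above drops the last three digits. If the full nanosecond value matters, one option is to carry the 9 fractional digits in their own column; a sketch using substring (the nanos column name is just an illustration):
from pyspark.sql.functions import col, substring

df_with_nanos = df.withColumn("nanos", substring(col("time"), 7, 9))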

Convert string in Spark dataframe to date. Month and date are incorrect [duplicate]

Any idea why I am getting the result below?
scala> val b = to_timestamp($"DATETIME", "ddMMMYYYY:HH:mm:ss")
b: org.apache.spark.sql.Column = to_timestamp(`DATETIME`, 'ddMMMYYYY:HH:mm:ss')
scala> sourceRawData.withColumn("ts", b).show(6,false)
+------------------+-------------------+-----------+--------+----------------+---------+-------------------+
|DATETIME |LOAD_DATETIME |SOURCE_BANK|EMP_NAME|HEADER_ROW_COUNT|EMP_HOURS|ts |
+------------------+-------------------+-----------+--------+----------------+---------+-------------------+
|01JAN2017:01:02:03|01JAN2017:01:02:03 | RBS | Naveen |100 |15.23 |2017-01-01 01:02:03|
|15MAR2017:01:02:03|15MAR2017:01:02:03 | RBS | Naveen |100 |115.78 |2017-01-01 01:02:03|
|02APR2015:23:24:25|02APR2015:23:24:25 | RBS |Arun |200 |2.09 |2014-12-28 23:24:25|
|28MAY2010:12:13:14| 28MAY2010:12:13:14|RBS |Arun |100 |30.98 |2009-12-27 12:13:14|
|04JUN2018:10:11:12|04JUN2018:10:11:12 |XZX | Arun |400 |12.0 |2017-12-31 10:11:12|
+------------------+-------------------+-----------+--------+----------------+---------+-------------------+
I am trying to convert DATETIME (which is in ddMMMYYYY:HH:mm:ss format) to a timestamp (shown in the last column above), but it doesn't seem to convert to the correct value.
I referred to the post below, but it didn't help:
Better way to convert a string field into timestamp in Spark
Can anyone help me?
Use y (year) not Y (week year):
spark.sql("SELECT to_timestamp('04JUN2018:10:11:12', 'ddMMMyyyy:HH:mm:ss')").show
// +--------------------------------------------------------+
// |to_timestamp('04JUN2018:10:11:12', 'ddMMMyyyy:HH:mm:ss')|
// +--------------------------------------------------------+
// | 2018-06-04 10:11:12|
// +--------------------------------------------------------+
Another example:
scala> sql("select to_timestamp('12/08/2020 1:24:21 AM', 'MM/dd/yyyy H:mm:ss a')").show
+-------------------------------------------------------------+
|to_timestamp('12/08/2020 1:24:21 AM', 'MM/dd/yyyy H:mm:ss a')|
+-------------------------------------------------------------+
| 2020-12-08 01:24:21|
+-------------------------------------------------------------+
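Applied to the DATETIME column from the question, the corrected lowercase-y pattern in PySpark would look like this (a sketch assuming the same sourceRawData dataframe):
from pyspark.sql import functions as F

sourceRawData = sourceRawData.withColumn(
    "ts", F.to_timestamp("DATETIME", "ddMMMyyyy:HH:mm:ss")
)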
Try this UDF:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{lit, udf}

val changeDtFmt = udf { (cFormat: String, rFormat: String, date: String) =>
  val formatterOld = new SimpleDateFormat(cFormat)
  val formatterNew = new SimpleDateFormat(rFormat)
  formatterNew.format(formatterOld.parse(date))
}
sourceRawData.
  withColumn("ts",
    changeDtFmt(lit("ddMMMyyyy:HH:mm:ss"), lit("yyyy-MM-dd HH:mm:ss"), $"DATETIME")).
  show(6, false)
Try the code below. I have created a sample dataframe "df" for the table:
+---+-------------------+
| id| date|
+---+-------------------+
| 1| 01JAN2017:01:02:03|
| 2| 15MAR2017:01:02:03|
| 3|02APR2015:23:24:25 |
+---+-------------------+
val t_s = unix_timestamp($"date", "ddMMMyyyy:HH:mm:ss").cast("timestamp")
df.withColumn("ts",t_s).show()
+---+-------------------+--------------------+
| id| date| ts|
+---+-------------------+--------------------+
| 1| 01JAN2017:01:02:03|2017-01-01 01:02:...|
| 2| 15MAR2017:01:02:03|2017-03-15 01:02:...|
| 3|02APR2015:23:24:25 |2015-04-02 23:24:...|
+---+-------------------+--------------------+
Thanks
