Spark Dataframe: Convert bigint to timestamp - apache-spark

I have a DataFrame with a bigint column. How do I convert a bigint column to a timestamp in Scala Spark?

You can use the from_unixtime/to_timestamp functions in Spark to convert a bigint column to a timestamp.
Example:
spark.sql("select timestamp(from_unixtime(1563853753,'yyyy-MM-dd HH:mm:ss')) as ts").show(false)
+-------------------+
|ts |
+-------------------+
|2019-07-22 22:49:13|
+-------------------+
(or)
spark.sql("select to_timestamp(1563853753) as ts").show(false)
+-------------------+
|ts |
+-------------------+
|2019-07-22 22:49:13|
+-------------------+
Schema:
spark.sql("select to_timestamp(1563853753) as ts").printSchema
root
|-- ts: timestamp (nullable = false)
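For a bigint column in an existing DataFrame, the same conversion is available through the DataFrame API. A minimal, untested sketch (the column name epoch_col is just for illustration):
import org.apache.spark.sql.functions._

// hypothetical DataFrame with a bigint column holding epoch seconds
val df = Seq(1563853753L).toDF("epoch_col")

df.withColumn("ts", to_timestamp(col("epoch_col"))).show(false)
// or equivalently: from_unixtime(col("epoch_col")).cast("timestamp")
// ts: 2019-07-22 22:49:13 (same value as the SQL examples above, timezone-dependent)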
Refer to this link for more details regarding converting different timestamp formats in Spark.

Related

PySpark string to timestamp conversion

How can I convert a timestamp stored as a string to a timestamp in "yyyy-mm-ddThh:mm:ss.sssZ" format using PySpark?
Input timestamp (string), df:
| col_string |
| :-------------------- |
| 5/15/2022 2:11:06 AM |
Desired output (timestamp), df:
| col_timestamp |
| :---------------------- |
| 2022-05-15T2:11:06.000Z |
to_timestamp can be used, providing the optional format parameter.
from pyspark.sql import functions as F
df = spark.createDataFrame([("5/15/2022 2:11:06 AM",)], ["col_string"])
df = df.select(F.to_timestamp("col_string", "M/dd/yyyy h:mm:ss a").alias("col_ts"))
df.show()
# +-------------------+
# | col_ts|
# +-------------------+
# |2022-05-15 02:11:06|
# +-------------------+
df.printSchema()
# root
# |-- col_ts: timestamp (nullable = true)

pyspark: read partitioned parquet "my_file.parquet/col1=NOW" string value replaced by <current_time> on read()

With pyspark 3.1.1 on WSL Debian 10:
When reading a parquet file partitioned on a column containing the string NOW, the string is replaced by the current time at the moment the read() function is executed. I suppose the NOW string is interpreted as now().
# steps to reproduce
df = spark.createDataFrame(data=[("NOW",1), ("TEST", 2)], schema = ["col1", "id"])
df.write.partitionBy("col1").parquet("test/test.parquet")
>>> /home/test/test.parquet/col1=NOW
df_loaded = spark.read.option(
    "basePath",
    "test/test.parquet",
).parquet("test/test.parquet/col1=*")
df_loaded.show(truncate=False)
>>>
+---+--------------------------+
|id |col1 |
+---+--------------------------+
|2 |TEST |
|1 |2021-04-18 14:36:46.532273|
+---+--------------------------+
Is that a bug or normal pyspark behaviour?
If the latter, is there a sparkContext option to avoid that behaviour?
I suspect that's an expected feature... but I'm not sure where it was documented. Anyway, if you want to keep the column as a string column, you can provide a schema while reading the parquet file:
df = spark.read.schema("id long, col1 string").parquet("test/test.parquet")
df.show()
+---+----+
| id|col1|
+---+----+
| 1| NOW|
| 2|TEST|
+---+----+
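If you would rather switch this off globally instead of passing a schema, Spark has a configuration flag for partition column type inference; disabling it before the read should keep all partition values as strings (set it on your session, then read the file as before):
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")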

How to write a streaming DataFrame out to Kafka with all rows as JSON array?

I am looking for a solution for writing Spark streaming data to Kafka.
I am using the following method to write data to Kafka:
df.selectExpr("to_json(struct(*)) AS value").writeStream.format("kafka")
But my issue is while writing to kafka the data showing as following
{"country":"US","plan":postpaid,"value":300}
{"country":"CAN","plan":0.0,"value":30}
my expected output is
[
{"country":"US","plan":postpaid,"value":300}
{"country":"CAN","plan":0.0,"value":30}
]
I want to enclose the rows inside an array. How can I achieve this in Spark streaming? Can someone advise?
I assume the schema of the streaming DataFrame (df) is as follows:
root
|-- country: string (nullable = true)
|-- plan: string (nullable = true)
|-- value: string (nullable = true)
I also assume that you want to write (produce) all rows in the streaming DataFrame (df) out to a Kafka topic as a single record in which the rows are in the form of an array of JSONs.
If so, you should groupBy the rows and use collect_list to group all rows into one record that you could write out to Kafka.
// df is a batch DataFrame so I can show it for demo purposes
scala> df.show
+-------+--------+-----+
|country| plan|value|
+-------+--------+-----+
| US|postpaid| 300|
| CAN| 0.0| 30|
+-------+--------+-----+
val jsons = df.selectExpr("to_json(struct(*)) AS value")
scala> jsons.show(truncate = false)
+------------------------------------------------+
|value |
+------------------------------------------------+
|{"country":"US","plan":"postpaid","value":"300"}|
|{"country":"CAN","plan":"0.0","value":"30"} |
+------------------------------------------------+
val grouped = jsons.groupBy().agg(collect_list("value") as "value")
scala> grouped.show(truncate = false)
+-----------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------+
|[{"country":"US","plan":"postpaid","value":"300"}, {"country":"CAN","plan":"0.0","value":"30"}]|
+-----------------------------------------------------------------------------------------------+
I'd do all the above in DataStreamWriter.foreachBatch to get ahold of a DataFrame to work on.
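A rough, untested sketch of that approach (the topic name, bootstrap servers and checkpoint location are placeholders):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.collect_list

df.writeStream
  .option("checkpointLocation", "/tmp/checkpoint")                    // placeholder
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch
      .selectExpr("to_json(struct(*)) AS value")                      // one JSON per row
      .groupBy()                                                      // single group = single record
      .agg(collect_list("value") as "value")                          // all rows into one array
      .selectExpr("concat('[', concat_ws(',', value), ']') AS value") // array -> JSON array string
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")                 // placeholder
      .option("topic", "my_topic")                                    // placeholder
      .save()
  }
  .start()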
I'm really not sure whether that is achievable, but I'll post my suggestion anyway; what you can do is transform your DataFrame afterwards:
//Input
inputDF.show(false)
+---+-------+
|int|string |
+---+-------+
|1 |string1|
|2 |string2|
+---+-------+
//convert that to json
inputDF.toJSON.show(false)
+----------------------------+
|value |
+----------------------------+
|{"int":1,"string":"string1"}|
|{"int":2,"string":"string2"}|
+----------------------------+
//then use collect and mkString
println(inputDF.toJSON.collect().mkString("[", "," , "]"))
[{"int":1,"string":"string1"},{"int":2,"string":"string2"}]

How do I change a string to HH:mm:ss only in Spark

I am getting the time as a string like 134455 and I need to convert it into 13:44:55 using Spark SQL. How can I get this in the right format?
You can try the regexp_replace function.
scala> val df = Seq((134455 )).toDF("ts_str")
df: org.apache.spark.sql.DataFrame = [ts_str: int]
scala> df.show(false)
+------+
|ts_str|
+------+
|134455|
+------+
scala> df.withColumn("ts",regexp_replace('ts_str,"""(\d\d)""","$1:")).show(false)
+------+---------+
|ts_str|ts |
+------+---------+
|134455|13:44:55:|
+------+---------+
scala> df.withColumn("ts",trim(regexp_replace('ts_str,"""(\d\d)""","$1:"),":")).show(false)
+------+--------+
|ts_str|ts |
+------+--------+
|134455|13:44:55|
+------+--------+
val df = Seq("133456").toDF
+------+
| value|
+------+
|133456|
+------+
df.withColumn("value", unix_timestamp('value, "HHmmss"))
.withColumn("value", from_unixtime('value, "HH:mm:ss"))
.show
+--------+
| value|
+--------+
|13:34:56|
+--------+
Note that a unix timestamp is stored as the number of seconds since 00:00:00, 1 January 1970. If you try to convert a time with millisecond accuracy to a timestamp, you will lose the millisecond part of the time. For times including milliseconds, you will need to use a different approach.
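For example, on Spark 3.x (where to_timestamp keeps fractional seconds), one possible sketch for a millisecond-precision input, not verified:
import spark.implicits._
import org.apache.spark.sql.functions.{col, date_format, to_timestamp}

// hypothetical input in HHmmssSSS form
val dfMs = Seq("133456123").toDF("value")

dfMs
  .withColumn("value", date_format(to_timestamp(col("value"), "HHmmssSSS"), "HH:mm:ss.SSS"))
  .show
// expected: 13:34:56.123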

Spark sql count changes on changing case

I have a table that has a data distribution like:
sqlContext.sql( """ SELECT
count(to_Date(PERIOD_DT)), to_date(PERIOD_DT)
from dbname.tablename group by to_date(PERIOD_DT) """).show
+-------+----------+
| _c0| _c1|
+-------+----------+
|1067177|2016-09-30|
|1042566|2017-07-07|
|1034333|2017-07-31|
+-------+----------+
However, when I run a query like the following:
sqlContext.sql(""" SELECT COUNT(*)
from dbname.tablename
where PERIOD_DT = '2017-07-07' """).show
Surprisingly, it returns:
+-------+
| _c0|
+-------+
|3144076|
+-------+
But if I change PERIOD_DT to lowercase, i.e., period_dt, it returns the correct result:
sqlContext.sql("""
SELECT COUNT(*)
from dbname.table
where period_dt='2017-07-07' """).show
+-------+
| _c0|
+-------+
|1042566|
+-------+
period_dt is the column on which the table is partitioned and its type is char(10).
The table data is stored as Parquet:
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
What might be causing this inconsistency?
It is a case-sensitivity issue. Because of limitations of the Hive metastore schema, the table's column names are always stored in lowercase, so use the lowercase name in filters. Reading the data directly as Parquet should resolve the issue.
