PySpark - Timezone issue when writing the dataframe to Kafka

In my transformation I cast the dates as follows:
f.date_format(
f.to_utc_timestamp(f.col("DATE"), "Europe/Paris"), "yyyy-MM-dd HH:mm:ss"
).alias("receipt_date"),
When writing the resulting dataframe to JSON
data_frame.write.mode("overwrite").option("ignoreNullFields", "false").format("json").save(path)
I get the correct output:
"receipt_date":"2022-07-07 00:00:00"
But when writing the dataframe to Kafka I get the following output instead:
"receipt_date":"2022-07-06 22:00:00"
data_frame.selectExpr(
"CAST(id AS STRING) AS key", "to_json(struct(metadata,payload)) AS value"
).write.format("kafka")
Note: I run the job on AWS Glue and I already added this config:
--conf spark.sql.session.timeZone=Europe/Paris
Any suggestions, please?
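A possible direction (a sketch, not a verified fix for this job): to_json serializes TimestampType columns using the session/JVM time zone, which is usually what produces the two-hour shift. Passing the JSON timeZone option to to_json, or formatting receipt_date to a string before building the struct, keeps the rendered value in Europe/Paris. The broker address and topic name below are placeholders:
from pyspark.sql import functions as f

kafka_ready = data_frame.select(
    f.col("id").cast("string").alias("key"),
    f.to_json(
        f.struct("metadata", "payload"),
        {"timeZone": "Europe/Paris"},  # JSON writer option; renders timestamps in Paris time
    ).alias("value"),
)

(
    kafka_ready.write.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("topic", "receipts")                       # placeholder topic
    .save()
)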

Related

Using PySpark to read in datalake table and can't parse timestamp column in Synapse Analytics

I can read in the datalake table and print the schema, but if I try to display the data I get the following error. I am working within Synapse Analytics using a PySpark Notebook and an Apache Spark Pool.
See error message:
You may get a different result due to the upgrading of Spark 3.0: Fail to parse '10/27/2022 1:14:31 PM' in the new parser.
You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
I don't want to use the LEGACY version.
I've tried converting using the following code
df = df.withColumn("SinkCreatedOn",to_date(col("SinkCreatedOn"),"M/dd/yyyy h:m:s"))
df = df.withColumn("SinkModifiedOn",to_date(col("SinkModifiedOn"),"M/dd/yyyy h:m:s"))
I've also tried converting the suspect columns to StringType() or DateType() but no luck.
Any help appreciated
Thank you
Try the script with the below date format:
df = df1.withColumn("SinkCreatedOn",to_date(col("SinkCreatedOn"),"MM/dd/yyyy h:mm:s a"))
I repro'd the same with sample input. Below is the approach.
Code:
from pyspark.sql.functions import *

df1 = spark.createDataFrame(
    data=[
        ("1", "Arpit", "10/27/2022 1:14:31 PM"),
        ("2", "Anand", "10/28/2022 1:14:31 PM"),
        ("3", "Mike", "10/29/2022 1:14:31 PM"),
    ],
    schema=["id", "Name", "SinkCreatedOn"],
)
df1.printSchema()
df_output = df1.withColumn("SinkCreatedOn", to_date(col("SinkCreatedOn"), "MM/dd/yyyy h:mm:s a"))
df1.show()
df_output.show()
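If the time of day matters (the sample values carry 1:14:31 PM), to_timestamp with the same pattern is worth trying instead of to_date, which truncates to a plain date. A sketch under that assumption:
from pyspark.sql.functions import col, to_timestamp

# Keep the full timestamp instead of truncating to a date; the pattern assumes
# values like "10/27/2022 1:14:31 PM".
df_ts = df1.withColumn(
    "SinkCreatedOn", to_timestamp(col("SinkCreatedOn"), "MM/dd/yyyy h:mm:ss a")
)
df_ts.printSchema()  # SinkCreatedOn: timestamp
df_ts.show(truncate=False)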

Spark SQL: Parse date string from dd/mm/yyyy to yyyy/mm/dd

I want to use Spark SQL or PySpark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
spark = SparkSession.builder.master("local[1]")\
.appName("date.com")\
.getOrCreate()
my_df = spark.createDataFrame(["13/04/2020", "16/04/2020", "19/04/2020"], StringType()).toDF("date")
expected_df = spark.createDataFrame(["2020/04/13", "2020/04/16", "2020/04/19"], StringType()).toDF("date")
I have tried the following Spark SQL command, but this returns the date as literally 'yyyy/MM/dd' rather than '2020/04/13'.
select date_format(date, 'dd/MM/yyyy'), 'yyyy/MM/dd' as reformatted_date
FROM my_df
I have also looked at the following documentation but didn't see anything that fits my scenario: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
If it's not possible in spark sql then pyspark would work.
Any ideas?
You need to convert to date type using to_date first:
select date_format(to_date(date, 'dd/MM/yyyy'), 'yyyy/MM/dd') as reformatted_date
from my_df
Or, in Scala, applied to a literal value:
df1.select(to_date(date_format(to_date(lit("12/12/2020"), "dd/MM/yyyy"), "yyyy-MM-dd")).as("campo")).show()
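For completeness, the same conversion on the DataFrame API side (a PySpark sketch using the my_df defined in the question):
from pyspark.sql.functions import col, date_format, to_date

# Parse the dd/MM/yyyy string into a date, then render it as yyyy/MM/dd.
reformatted_df = my_df.withColumn(
    "reformatted_date", date_format(to_date(col("date"), "dd/MM/yyyy"), "yyyy/MM/dd")
)
reformatted_df.show()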

how to get the sequence number of a kinesis record when consuming using pyspark and spark streaming

We are using PySpark and Spark Streaming to consume records from a Kinesis stream.
The code looks something like this:
streams = [
    KinesisUtils.createStream(
        ssc,
        app_name,
        stream_name,
        endpoint_url,
        region_name,
        InitialPositionInStream.TRIM_HORIZON,
        conf["stream"]["checkpoint_interval"],
        decoder=gzip.decompress,
    )
    for _ in range(number_of_streams)
]
ssc.union(*streams).pprint()
The output has a data column and some metadata columns that were added to the payload, but the metadata columns are empty.
The question is whether we should get metadata columns such as sequence number and partition key by default, and if not, whether there is a way to get them using PySpark.
We are using Spark 2.4.4 on EMR 5.27 with spark-streaming-kinesis-asl_2.11-2.4.4.jar.
Thanks.

Date type is saved as long type when pyspark write data to elasticsearch

Nice to meet you.
I am using Elasticsearch for Apache Hadoop to work with an Elasticsearch index
(https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html).
However, I have a problem when PySpark writes data with a date-type field to Elasticsearch.
Original Field:
created: timestamp (nullable = true)
However, when I save the data to Elasticsearch like below:
result.write.format("org.elasticsearch.spark.sql")\
.option("es.nodes","server")\
.option("es.mapping.date.rich", "true")\
.option("timestampFormat", "YYYY-MM-DD'T'hh:mm:ss.sss")\
.option("es.mapping.id","id")\
.mode("append")\
.option("es.resource", "index").save()
Fields with date type are converted to long type (Unix timestamp).
However, I want to save the data as a date type (like ISO 8601 format).
How can I save the type as it is?
Please help me.
The code I used:
# Import PySpark modules
from pyspark import SparkContext, SparkConf, SQLContext
# Spark Config
conf = SparkConf().setAppName("es_app")
conf.set("es.scroll.size", "1000")
sc = SparkContext(conf=conf)
# sqlContext
sqlContext = SQLContext(sc)
# Load data from elasticsearch
df = sqlContext.read.format("org.elasticsearch.spark.sql") \
.option("es.nodes","server")\
.option("es.nodes.discovery", "true")\
.option("es.mapping.date.rich", 'false')\
.load("index")
# Make view
df.registerTempTable("test")
all_data = sqlContext.sql("SELECT * from test")
result.write.format("org.elasticsearch.spark.sql")\
.option("es.nodes","server")\
.option("es.mapping.date.rich", "true")\
.option("timestampFormat", "YYYY-MM-DD'T'hh:mm:ss.sss")\
.option("es.mapping.id","id")\
.mode("append")\
.option("es.resource", "index").save()
How can I fix the problem?
Please define a mapping for your date field and use the date field type of Elasticsearch, which supports multiple date formats. Also note that date fields in Elasticsearch are internally stored as long values.
Ref: the date datatype in Elasticsearch.
Define the date field in the mapping with the various formats you need.
Also, please read this note about how date fields are internally stored and displayed:
Internally, dates are converted to UTC (if the time-zone is specified)
and stored as a long number representing milliseconds-since-the-epoch.
Dates will always be rendered as strings, even if they were initially
supplied as a long in the JSON document.
Example:
{
  "mappings": {
    "properties": {
      "date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}
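One way to line the Spark side up with such a mapping (a sketch only; it assumes the created column and the index/field names from the question, and the pattern must match whatever format the mapping declares) is to render the timestamp as an ISO 8601 string before writing, so ES-Hadoop indexes it against the date mapping instead of as epoch milliseconds:
from pyspark.sql.functions import col, date_format

# Render the timestamp as an ISO 8601 string; note the corrected pattern letters
# (yyyy = year, dd = day-of-month, HH = 24-hour clock) compared to the question.
result_str = result.withColumn(
    "created", date_format(col("created"), "yyyy-MM-dd'T'HH:mm:ss")
)

(
    result_str.write.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "server")
    .option("es.mapping.id", "id")
    .option("es.resource", "index")
    .mode("append")
    .save()
)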

Spark Structured streaming replacing values of a column

I have the following dataframe
val tDataJsonDF = kafkaStreamingDFParquet
  .filter($"value".contains("tUse"))
  .filter($"value".isNotNull)
  .selectExpr("cast (value as string) as tdatajson", "cast (topic as string) as env")
  .select(from_json($"tdatajson", schema = ParquetSchema.tSchema).as("data"), $"env".as("env"))
  .select("data.*", "env")
  .select(
    $"date", // format: YYYY/MM/dd
    $"time",
    $"event",
    $"serviceGroupId",
    $"userId",
    $"env")
This streaming dataframe has a column date with the format YYYY/MM/dd.
Because of this, when I use this column as a partitioning column in my Parquet write, Spark creates the partition as date=2018%04%12.
Is there a way I can modify the column value on the fly in the above code so that the date value is YYYY-MM-dd or YYYYMMdd?
Parquet write query:
val tunerQuery = tunerDataJsonDF
  .writeStream
  .format("parquet")
  .option("path", pathtodata)
  .option("checkpointLocation", pathtochkpt)
  .partitionBy("date", "env", "serviceGroupId")
  .start()
I assume you're using Spark 2.2+. Note that the parse pattern should use yyyy (year-of-era), not YYYY (week year):
tDataJsonDF.withColumn("formatted_date", date_format(to_date(col("date"), "yyyy/MM/dd"), "yyyy-MM-dd"))
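In PySpark (the language used elsewhere on this page) the equivalent would look like the sketch below; tDataJsonDF stands in for a DataFrame with the question's columns, the paths are placeholders, and partitioning then uses formatted_date instead of the raw date column:
from pyspark.sql.functions import col, date_format, to_date

# Re-parse the yyyy/MM/dd string and render it without slashes so the partition
# directories come out as formatted_date=2018-04-12 instead of date=2018%04%12.
json_df = tDataJsonDF.withColumn(
    "formatted_date", date_format(to_date(col("date"), "yyyy/MM/dd"), "yyyy-MM-dd")
)

query = (
    json_df.writeStream.format("parquet")
    .option("path", "/path/to/data")                      # placeholder path
    .option("checkpointLocation", "/path/to/checkpoint")  # placeholder path
    .partitionBy("formatted_date", "env", "serviceGroupId")
    .start()
)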
