Add leading zero to PySpark time components - apache-spark

I have this function which writes data partitioned by date and time
df = df.withColumn("year", F.year(col(date_column))) \
.withColumn("month", F.month(col(date_column))) \
.withColumn("day", F.dayofmonth(col(date_column))) \
.withColumn("hour", F.hour(col(date_column)))
df.write.partitionBy("year","month","day","hour").mode("append").format("csv").save(destination)
The output gets written to month=9 how can I make it be like month=09 same goes for hours, e.g. hour=04.

You could try
.withColumn("month", F.date_format(col(date_column), "MM")))
and
.withColumn("hour", F.date_format(col(date_column), "HH"))

Related

pyspark dates are offset

I have a large (5B rows) of timeseries data, as separate segment parquet files in a directory. If I read each parquet file separately using pandas read_parquet(engine="fastparquet") then I can see the correct data. For instance price on day 2022-08-01 is 150.98:
Date
Price
2022-08-01
150.98
However, if I read the same data via pyspark, I get the incorrect data for that date. It seems to be offset by one day
Date
Price
2022-08-01
153.37
The value 153.37 is in fact the price on 2022-08-02.
My code is as follows:
sc = SparkSession \
.builder \
.appName("test") \
.master('local[*]') \
.config("spark.sql.shuffle.partitions", "200") \
.config("spark.driver.maxResultSize","5g") \
.config("spark.executor.memory","40g") \
.config("spark.driver.memory","10g") \
.config("spark.rdd.compress", "true") \
.config("spark.sql.execution.arrow.pyspark.enabled", "true") \
.getOrCreate()
df = sc.read\
.option("primitivesAsString","true")\
.option("allowNumericLeadingZeros","true")\
.option("timestampFormat", "yyyy-MM-dd")\
.parquet(f'{data_rroot}/*.parquet')
The strange thing is that the dates at this stage of the ingestion are yyyy-MM-dd hh:mm:ss format, even though I've set the timestampFormat option to be yyyy-MM-dd (loading the same data via pandas read_parquet behaves correctly). pyspark reads dates using java's SimpleDateFormat class. To fix this problem I then do:
df = df.withColumn('Date', F.to_date(df["Date"],'yyy-MM-dd'))
I have tried setting the option .config ("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED") but that hasnt worked either.
I am trumped, and do not understand what is going on. Even chatGPT cant help :)
figured it out. The problem was timezone. Problem goes away with this addition
.config("spark.sql.session.timeZone", "UTC")
thank you for your time.

Can i exclude the column used for partitioning when writing to parquet?

i need to create parquet files, reading from jdbc. The table is quite big and all columns are varchars. So i created a new column with a random int to make partitioning.
so my read jdbc looks something like this:
data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load()
and my write to parquet looks something like this:
data_df.write.mode("overwrite").parquet("parquetfile.parquet").partitionBy('random_number')
The generated parquet also contains the 'random_number' column, but i only made that column for partitioning, is there a way to exclude that column to the writing of the parquet files?
Thanks for any help, i'm new to spark :)
I'm expecting to exclude the random_number column, but lack the knowledge if this is possible if i need the column for partitioning
So do you want to repartition in memory using a column but not writing it, you can just use .repartition(col("random_number")) before writing droping the column then write your data:
data_df = sparkSession.read.format('jdbc') \
.option('url', 'jdbc:netezza://host:port/db') \
.option('dbtable', """(SELECT * FROM schema.table) A""") \
.option('user', 'user') \
.option('password', 'password') \
.option('partitionColumn','random_number') \
.option('lowerBound','1') \
.option('upperBound','200') \
.option('numPartitions','200') \
.load()
.repartition(col("random_number")).drop("random_number")
then:
data_df.write.mode("overwrite").parquet("parquetfile.parquet")

Handling Duplicates in Databricks autoloader

I am new to this Databricks Autoloader, we have a requirement where we need to process the data from AWS s3 to delta table via Databricks autoloader. I was testing this autoloader so I came across duplicate issue that is if i upload a file with name say emp_09282021.csv having same data as emp_09272021.csv then it is not detecting any duplicate it is simply inserting them so if I had 5 rows in emp_09272021.csv file now it will become 10 rows as I upload emp_09282021.csv file.
below is the code that i tried:
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header",True) \
.schema("id string,name string, age string,city string") \
.load("s3://some-s3-path/source/") \
.writeStream.format("delta") \
.option("mergeSchema", "true") \
.option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
.start("s3://some-s3-path/spark_stream_processing/target/")
any guidance please to handle this?
It's not the task of the autoloader to detect duplicates, it provides you the possibility to ingest data, but you need to handle duplicates yourself. There are several approaches to that:
Use built-in dropDuplicates function. It's recommended to use it with watermarking to avoid creating a huge state, but you need to have some column that will be used as event time, and it should be part of dropDuplicate list (see docs for more details):
streamingDf \
.withWatermark("eventTime", "10 seconds") \
.dropDuplicates("col1", "eventTime")
Use Delta's merge capability - you just need to insert data that isn't in the Delta table, but you need to use foreachBatch for that. Something like this (please note that table should already exist, or you need to add a handling of non-existent table):
from delta.tables import *
def drop_duplicates(df, epoch):
table = DeltaTable.forPath(spark,
"s3://some-s3-path/spark_stream_processing/target/")
dname = "destination"
uname = "updates"
dup_columns = ["col1", "col2"]
merge_condition = " AND ".join([f"{dname}.{col} = {uname}.{col}"
for col in dup_columns])
table.alias(dname).merge(df.alias(uname), merge_condition)\
.whenNotMatchedInsertAll().execute()
# ....
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("header",True) \
.schema("id string,name string, age string,city string") \
.load("s3://some-s3-path/source/") \
.writeStream.foreachBatch(drop_duplicates)\
.option("checkpointLocation", "s3://some-s3-path/tgt_checkpoint_0928/") \
.start()
In this code you need to change the dup_columns variable to specify columns that are used to detect duplicates.

How to include partitioned column in pyspark dataframe read method

I am writing Avro file-based from a parquet file. I have read the file as below:
Reading data
dfParquet = spark.read.format("parquet").option("mode", "FAILFAST")
.load("/Users/rashmik/flight-time.parquet")
Writing data
I have written the file in Avro format as below:
dfParquetRePartitioned.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
As expected, I got data partitioned by OP_CARRIER.
Reading Avro partitioned data from a specific partition
In another job, I need to read data from the output of the above job, i.e. from datasink/avro directory. I am using the below code to read from datasink/avro
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load("datasink/avro/OP_CARRIER=AA")
It reads data successfully, but as expected OP_CARRIER column is not available in dfAvro dataframe as it is a partition column of the first job. Now my requirement is to include OP_CARRIER field also in 2nd dataframe i.e. in dfAvro. Could somebody help me with this?
I am referring documentation from the spark document, but I am not able to locate the relevant information. Any pointer will be very helpful.
You replicate the same column value with a different alias.
dfParquetRePartitioned.withColumn("OP_CARRIER_1", lit(df.OP_CARRIER)) \
.write \
.format("avro") \
.mode("overwrite") \
.option("path", "datasink/avro") \
.partitionBy("OP_CARRIER") \
.option("maxRecordsPerFile", 100000) \
.save()
This would give you what you wanted. But with a different alias.
Or you can also do it during reading. If location is dynamic then you can easily append the column.
path = "datasink/avro/OP_CARRIER=AA"
newcol = path.split("/")[-1].split("=")
dfAvro = spark.read.format("avro") \
.option("mode","FAILFAST") \
.load(path).withColumn(newcol[0], lit(newcol[1]))
If the value is static its way more easy to add it during the data read.

Spark Structural Streaming Output Mode problem

I am currently looking for an applicable solution to solve the following problem using Spark Structured Streaming API. I have searched through a lot of blog posts and Stackoverflow. Unfortunately, I still can't find a solution to this. Hence raising this ticket to call for expert help.
Use Case
Let said I have a Kafka Topic (user_creation_log) that has all the real-time user_creation_event. For those users who didn't do any transaction within 10 secs, 20 secs, and 30 secs then we will assign them a certain voucher. ( time windows is shortened for testing purpose)
Flag and sending the timeout row (more than 10 sec, more than 20 sec , more than 30 secs) to Kafka is the most problematic part!!! Too much rules, or perhaps i should break it 10sec,20sec and 30 secs into different script
My Tracking Table
I am able to track user no_action_sec by no_action_10sec,no_action_20sec,no_action_30sec flag(shown in code below). The no_action_sec is derived from (current_time - creation_time) which will be calculated in every microbatch.
Complete Output Mode
outputMode("complete") writes all the rows of a Result Table (and corresponds to a traditional batch structured query).
Update Output Mode
outputMode("update") writes only the rows that were updated (every time there are updates).
In this case Update Output Mode seems very suitable because it will write an updated row to output. However, whenever the flag10, flag20, flag30 columns have been updated, the row didn't write to the desired location.
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SparkSession \
.builder \
.appName("Notification") \
.getOrCreate()
lines = spark \
.readStream \
.format("socket") \
.option("host", "localhost") \
.option("port", 9999) \
.load()
split_col=split(lines.value, ' ')
df = lines.withColumn('user_id', split_col.getItem(0))
df = df.withColumn('create_date_time', split_col.getItem(1)) \
.groupBy("user_id","create_date_time").count()
df = df.withColumn("create_date_time",col("create_date_time").cast(LongType())) \
.withColumn("no_action_sec", current_timestamp().cast(LongType()) -col("create_date_time").cast(LongType()) ) \
.withColumn("no_action_10sec", when(col("no_action_sec") >= 10 ,True)) \
.withColumn("no_action_20sec", when(col("no_action_sec") >= 20 ,True)) \
.withColumn("no_action_30sec", when(col("no_action_sec") >= 30 ,True)) \
query = df \
.writeStream \
.outputMode("update") \
.format("console") \
.start()
query.awaitTermination()
Current Output
UserId = 0 is disappear in Batch 2. It's supposed to show up because no_action_30sec will changes from null to True.
Expected output
User Id should be write to output 3 times once it triggers the flag logic 10 sec, 20 sec and 30 sec
Can anyone shed light on this problem? Like what can i do to let rows write into output when no_action_10sec,no_action_20sec,no_action_30sec is flag to True?
Debug
OutputMode = Complete will output too much redundant data
Mock Data Generator
for i in {0..10000}; do echo "${i} $(date +%s)"; sleep 1; done | nc -lk 9999
Assume that the row has been showing up in console mode (.format("console") ) will send to Kafka for chaining action

Resources