I am using Spark Structured Streaming to calculate the monthly amount per user. I am using the code below:
from pyspark.sql.functions import window, col, sum, count

df = (spark
    .readStream
    .format('kafka')
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")
    .load())

df1 = (df
    .groupby('client_id', 'id', window(col('date'), "30 days"))
    .agg(sum(col('amount')).alias('amount'), count(col('id')).alias('count')))

(df1
    .selectExpr("CAST(client_id AS STRING) AS key", "to_json(struct(*)) AS value")
    .writeStream
    .format('kafka')
    ........
I observed that the output is not correct. For example:
input
{"client_id":"1", "id":"1", "date":"2022-08-01", "amount": 10.0}
{"client_id":"1", "id":"1", "date":"2022-08-15", "amount": 10.0}
{"client_id":"1", "id":"1", "date":"2022-08-25", "amount": 10.0}
{"client_id":"1", "id":"1", "date":"2022-08-26", "amount": 10.0}
{"client_id":"1", "id":"1", "date":"2022-08-27", "amount": 10.0}
{"client_id":"1", "id":"1", "date":"2022-08-28", "amount": 10.0}
{"client_id":"1", "id":"1", "date":"2022-08-29", "amount": 10.0}
output
{"client_id":"1","id":"1","amount":10.0,"count":1}
{"client_id":"1","id":"1","amount":20.0,"count":2}
{"client_id":"1","id":"1","amount":30.0,"count":3}
{"client_id":"1","id":"1","amount":40.0,"count":4}
{"client_id":"1","id":"1","amount":50.0,"count":5}
{"client_id":"1","id":"1","amount":10.0,"count":1}
{"client_id":"1","id":"1","amount":20.0,"count":2}
The first input record was on "2022-08-01" with an amount of 10, so it should sum the amounts for the next 30 days. The final sum should be 70, but instead it is 50 and then 20: it is calculating the sum for only the next 27 days. You can see that the "count" and "amount" are reset on "2022-08-28". It is not aggregating records for 30 days.
According to the documentation, there's this 4th argument...
startTime : str, optional
The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15… provide startTime as 15 minutes.
This means that windows are fixed durations measured from 1970-01-01 00:00:00 UTC, unless you specify some other startTime, which must be an offset interval from that point in time.
If you use the window function without startTime specified and with windowDuration="30 days", you will get intervals divided into 30-day periods starting from 1970-01-01 00:00:00 UTC. In your case, I don't really understand why the boundary between two 30-day windows falls on 2022-08-28, because for me it falls on 2022-08-26:
{2022-07-27 00:00:00, 2022-08-26 00:00:00}
{2022-08-26 00:00:00, 2022-09-25 00:00:00}
Even though there is a difference in the exact dates, the logic is the same: 30-day windows are fixed relative to some specific point in time.
If I specify startTime="1 day", I get the 30-day windows shifted:
{2022-07-28 00:00:00, 2022-08-27 00:00:00}
{2022-08-27 00:00:00, 2022-09-26 00:00:00}
Only by carefully choosing startTime can we make the time windows start on the date we want, but they will always stay fixed until startTime is changed again.
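For illustration, the two variants above could be expressed roughly like this (just the window expressions, reusing the date column from your code; this is a sketch, not a full job):

import pyspark.sql.functions as F

# 30-day tumbling windows anchored at 1970-01-01 00:00:00 UTC (the default)
w_default = F.window('date', "30 days")

# The same 30-day windows shifted by one day via startTime
w_shifted = F.window('date', "30 days", startTime="1 day")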
To get what you intend, you could probably use the slideDuration parameter of the window function and filter based on the aggregation result:
import pyspark.sql.functions as F

df1 = (df
    .groupby('client_id', 'id', F.window('date', "30 days", "1 day"))
    .agg(F.sum('amount').alias('amount'), F.count('id').alias('count'))
    .filter(F.date_add(F.sort_array(F.collect_set('date'), False)[0], 1) == F.col("window.end"))
).selectExpr("CAST(client_id AS STRING) AS key", "to_json(struct(*)) AS value")
Related
Our DBAs set up our Hive table with the date column as the partition column, but as a "string" in YYYYMMDD format.
How can I WHERE-filter this "date" column for something like the last 30 days?
Use date_format to format the system date minus 30 days into YYYYMMDD and then compare it with your partition column. Note that you should use the partition column as-is so Hive can choose the correct partitions.
If you want to pick the data for the day exactly 30 days ago:
select *
from mytable
where partition_col = date_format( current_date() - interval '30' days, 'yyyyMMdd')
If you want all data from the last 30 days:
select *
from mytable
where cast(partition_col as INT) >= cast(date_format(current_date() - interval '30' days, 'yyyyMMdd') as INT)
Casting shouldn't impact the partition-pruning benefits, but you need to check the performance before using it. Please get back to me in that case.
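If you also query this table from Spark, a rough PySpark equivalent of the second query might look like this (a sketch only, reusing the mytable / partition_col names from the SQL above):

import pyspark.sql.functions as F

# YYYYMMDD string for "30 days ago", compared as INT as in the SQL version
cutoff = F.date_format(F.date_sub(F.current_date(), 30), "yyyyMMdd")

last_30_days = (spark.table("mytable")
                .filter(F.col("partition_col").cast("int") >= cutoff.cast("int")))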
Whenever a Databricks notebook runs, I am trying to insert 1 record into a Delta table, but this is taking around 70 seconds. I am passing start_time as a variable.
val batchDf = Seq((1000, 40, start_time, null, null, status)).toDF("Key", "RunId", "Start_Time", "End_Time", "Duration", "In-progress")
batchDf.write.format("delta").mode("append").saveAsTable("t_audit")
Any idea why loading 1 record into a Delta table takes this long? I would expect this to finish in less than 5 seconds.
Databricks is horribly slow in comparison to anything I have used in the past 30 years, but in your case it could be related to auto optimize.
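If you want to rule auto optimize out, here is a sketch of what you could try from a Python cell (delta.autoOptimize.* are Databricks Delta table properties; adjust the names to your environment):

# Inspect the audit table's properties, then switch auto optimize off for this table
spark.sql("SHOW TBLPROPERTIES t_audit").show(truncate=False)

spark.sql("""
    ALTER TABLE t_audit SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'false',
        'delta.autoOptimize.autoCompact' = 'false'
    )
""")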
I am trying to convert a date string with AM/PM to a timestamp in Impala to check data conversion.
My date string is as below:
10/07/2017 02:04:01.575000000 PM
I tried to convert this in Impala with the below query:
select from_unixtime(unix_timestamp((Y_date), "MM/dd/yyyy HH:mm:ss.SSSSSSSSS 'ZZ'"), "yyyy-MM-dd HH:mm:ss.SSSSSS 'ZZ'") from table
The result I get is
2017-10-07 02:04:01.000000 .
I only lose the AM/PM; however, the hour part "02" is not getting converted to the 24-hour value "14". I need to get the below as the result:
2017-10-07 14:04:01.000000 .
I use Impala as my interface for querying Hadoop.
Any inputs would be helpful.
Thanks,
Vishu
Haven't found a built-in function for this. You have to do an inefficient double query, adding 12 hours for the PM rows:
SELECT cast(unix_timestamp(Y_date, "MM/dd/yyyy HH:mm:ss") + 43200 as timestamp) as action_time
FROM table_a
WHERE instr(Y_date, 'PM') > 0
UNION
SELECT cast(unix_timestamp(Y_date, "MM/dd/yyyy HH:mm:ss") as timestamp) as action_time
FROM table_a
WHERE instr(Y_date, 'AM') > 0
I have a DataFrame in PySpark in the below format:
Date Id Name Hours Dno Dname
12/11/2013 1 sam 8 102 It
12/10/2013 2 Ram 7 102 It
11/10/2013 3 Jack 8 103 Accounts
12/11/2013 4 Jim 9 101 Marketing
I want to partition based on Dno and save it as a table in Hive using the Parquet format.
df.write.saveAsTable(
'default.testing', mode='overwrite', partitionBy='Dno', format='parquet')
The query worked fine and created the table in Hive in Parquet format.
Now I want to partition based on the year and month of the date column. The timestamp is a Unix timestamp.
How can we achieve that in PySpark? I have done it in Hive but am unable to do it in PySpark.
Spark >= 3.1
Instead of cast, use timestamp_seconds:
from pyspark.sql.functions import timestamp_seconds
year(timestamp_seconds(col("timestamp")))
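Put together, a minimal sketch of the Spark >= 3.1 variant (mirroring the full example below and writing to the same default.testing table):

from pyspark.sql.functions import timestamp_seconds, year, month, col

# Derive the partition columns from the UNIX timestamp (in seconds) and write
(df
    .withColumn("year", year(timestamp_seconds(col("timestamp"))))
    .withColumn("month", month(timestamp_seconds(col("timestamp"))))
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("default.testing"))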
Spark < 3.1
Just extract the fields you want to use and provide a list of columns as an argument to the partitionBy of the writer. If timestamp is a UNIX timestamp expressed in seconds:
df = sc.parallelize([
    (1484810378, 1, "sam", 8, 102, "It"),
    (1484815300, 2, "ram", 7, 103, "Accounts")
]).toDF(["timestamp", "id", "name", "hours", "dno", "dname"])
add columns:
from pyspark.sql.functions import year, month, col
df_with_year_and_month = (df
    .withColumn("year", year(col("timestamp").cast("timestamp")))
    .withColumn("month", month(col("timestamp").cast("timestamp"))))
and write:
(df_with_year_and_month
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("default.testing"))
I want to ask about the timestamp format in the INSERT command.
In the following sample, when I insert any number like "12" or "15" into "message_sent_at",
I find that all the values of the timestamp field end up as the same value: 1970-01-01 02:00 Egypt Standard Time.
sample:
CREATE TABLE chat (
    id1 int,
    id2 int,
    message_sent_at timestamp,
    message text,
    PRIMARY KEY ((id1, id2), message_sent_at)
);
The units of the timestamp type are milliseconds since the epoch (1970-01-01 00:00:00 UTC). Entering 12 means 12 ms after that midnight, which is why it displays as 1970-01-01 02:00 in your timezone when printed in that format.
You can create timestamps from dates here: http://www.epochconverter.com/.
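For illustration, a small Python sketch of what the column actually stores (the 12 from your example, and how to compute a real value to insert):

from datetime import datetime, timezone, timedelta

# The literal 12 is interpreted as 12 milliseconds after the epoch:
print(datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=12))
# 1970-01-01 00:00:00.012000+00:00, i.e. 02:00:00.012 in Egypt Standard Time (UTC+2)

# To store an actual point in time, insert milliseconds since the epoch instead:
now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)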