Spark Structured Streaming ignore old records - apache-spark

I am new to Spark, so please help me arrive at a solution for this problem. I receive an input file that contains information about events that occurred, and each record carries a timestamp value. Event_Id is the primary column for this input. Refer to the sample input below (the actual file has many other columns).
Event_Id | Event_Timestamp
1 | 2018-10-11 12:23:01
2 | 2018-10-11 13:25:01
1 | 2018-10-11 14:23:01
3 | 2018-10-11 20:12:01
When we get the above input, we need to keep the latest record per event id based on the timestamp, so the expected output would be
Event_Id | Event_Timestamp
2 | 2018-10-11 13:25:01
1 | 2018-10-11 14:23:01
3 | 2018-10-11 20:12:01
After that, whenever I receive event information with a timestamp earlier than the value stored above, I need to ignore it. For example, consider the second input
Event_Id | Event_Timestamp
2 | 2018-10-11 10:25:01
1 | 2018-10-11 08:23:01
3 | 2018-10-11 21:12:01
Now I need to ignore event_id 1 and 2, since they carry timestamps older than the state we currently hold. Only event 3 would be passed through, and the expected output here is
3 | 2018-10-11 21:12:01
Assuming we have a large number of unique event ids (say 10 billion), how would this state be stored in Spark memory, and is there anything that needs to be taken care of?
Thanks in advance

We can keep the max timestamp per event id and use the persist() method with the DISK_ONLY or DISK_ONLY_2 storage levels; in that case we can achieve this, I think.
Since it is streaming data, we can also try the MEMORY_ONLY or MEMORY_ONLY_2 storage levels.
Please try it and update.
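A minimal sketch of that idea, assuming a batch DataFrame with columns Event_Id and Event_Timestamp (the file path, column names and storage level are assumptions):

from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("latest-event-state").getOrCreate()

# Hypothetical input path; the real source may well be a streaming one.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Keep only the newest timestamp per Event_Id and persist that state to disk.
latest = (events.groupBy("Event_Id")
                .agg(F.max("Event_Timestamp").alias("Event_Timestamp"))
                .persist(StorageLevel.DISK_ONLY))

# A new batch can then be joined against `latest` and rows kept only when
# their Event_Timestamp is strictly newer than the stored one.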

Related

Cassandra is unexpectedly prepending zeros in timestamp millisecond

My code is reading data from Kafka and writing it to Cassandra using Spark, but in some cases a zero is being prepended to the milliseconds.
For Example-
Kafka Data: 2022-10-11T08:46:12.220Z
Cassandra Data: 2022-10-11 14:16:12.022000+0000
Another example: we are expecting 2022-07-31 23:28:46.960000+0000, but in Cassandra it is present as 2022-07-31 23:28:46.096000+0000.
How is a zero getting prepended to the milliseconds, and how do I resolve it? It only happens in some cases; most of the timestamps come through correctly.
Note- The difference in hour and minute is due to change in timezone.
I suspect you're looking at different records because the timestamps you posted are not the same as what you think they should be.
I happen to have a table that contains timestamps and I inserted the same timestamps you posted above and I can confirm that Cassandra is not adding the leading zeros.
Here is my table schema:
CREATE TABLE community.tstamp_table (
    id int,
    tstamp timestamp,
    name text,
    PRIMARY KEY (id, tstamp)
)
And here are the table's contents with the timestamps you posted:
id | tstamp | name
----+---------------------------------+-------
1 | 2022-10-11 14:16:12.022000+0000 | alice
1 | 2022-10-11 14:16:12.220000+0000 | alice
2 | 2022-07-31 23:28:46.096000+0000 | bob
2 | 2022-07-31 23:28:46.960000+0000 | bob
The CQL timestamp data type is encoded as the number of milliseconds since Unix epoch (Jan 1, 1970 00:00 GMT). Knowing this, we can display the timestamp as an integer value using some native CQL functions:
system.blobasbigint(system.timestampasblob(tstamp)) | tstamp
-----------------------------------------------------+---------------------------------
1665497772022 | 2022-10-11 14:16:12.022000+0000
1665497772220 | 2022-10-11 14:16:12.220000+0000
1659310126096 | 2022-07-31 23:28:46.096000+0000
1659310126960 | 2022-07-31 23:28:46.960000+0000
You should be able to see from the sample data above that the encoded value of the timestamps are correct. For example, the first row is encoded correctly with 022 ms vs 220 ms and the third row is encoded as 096 ms vs 960 ms. Cheers!
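As a quick aside, the same millisecond values can be decoded outside Cassandra, for example with Python's datetime module (exact arithmetic via timedelta avoids floating-point rounding):

from datetime import datetime, timedelta, timezone

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
# 1665497772220 ms since the Unix epoch decodes to the .220000 timestamp above.
print(epoch + timedelta(milliseconds=1665497772220))
# 2022-10-11 14:16:12.220000+00:00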

Get the last 100 rows from cassandra table

I have a table in Cassandra and right now I cannot select the last 200 rows in the table.
The clustering order by clause was supposed to enforce sorting on disk.
CREATE TABLE t1 (
    id int,
    event text,
    receivetime timestamp,
    PRIMARY KEY (event, id)
) WITH CLUSTERING ORDER BY (id DESC);
The output is unsorted by id:
event | id | receivetime
---------+----+---------------------------------
event1 | 1 | 2021-07-12 08:11:57.702000+0000
event7 | 7 | 2021-05-22 05:30:00.000000+0000
event5 | 5 | 2021-05-25 05:30:00.000000+0000
event9 | 9 | 2021-05-22 05:30:00.000000+0000
event2 | 2 | 2021-05-21 05:30:00.000000+0000
event10 | 10 | 2021-05-23 05:30:00.000000+0000
event4 | 4 | 2021-05-24 05:30:00.000000+0000
event6 | 6 | 2021-05-27 05:30:00.000000+0000
event3 | 3 | 2021-05-22 05:30:00.000000+0000
event8 | 8 | 2021-05-21 05:30:00.000000+0000
How do I overcome this problem?
Thanks
The same question was asked on https://community.datastax.com/questions/11983/ so I'm re-posting my answer here.
The rows within a partition are sorted based on the order of the clustering column, not the partition key.
In your case, the table's primary key is defined as:
PRIMARY KEY (event, id)
This means that each partition key can have one or more rows, with each row identified by the id column. Since there is only one row in each partition, the sorting order is not evident. But if you had multiple rows in each partition, you'd be able to see that they would be sorted. For example:
event | id | receivetime
---------+----+---------------------------------
event1 | 7 | 2021-05-22 05:30:00.000000+0000
event1 | 5 | 2021-05-25 05:30:00.000000+0000
event1 | 1 | 2021-07-12 08:11:57.702000+0000
In the example above, the partition event1 has 3 rows sorted by the ID column in reverse order.
In addition, running unbounded queries (no WHERE clause filter) is an anti-pattern in Cassandra because it requires a full table scan. If you consider a cluster which has 500 nodes, an unbounded query has to request all the partitions (records) from all 500 nodes to return the result. It will not perform well and does not scale. Cheers!
The ordering defined by a clustering order is the order within a single partition key value, e.g. all of the rows for event1 would be in order within event1. It is not a global ordering.
From your results we can see you are selecting multiple partitions, which is why you are not seeing the order you expect.
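To make the per-partition point concrete, here is a sketch with the Python driver, assuming the t1 table above, a local contact point, and a keyspace named my_ks; restricting the query to a single partition is what makes the rows come back in clustering order:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])    # assumed contact point
session = cluster.connect("my_ks")  # assumed keyspace name

# Within one partition (event = 'event1') rows are returned in clustering
# order (id DESC), so LIMIT yields the highest ids first.
rows = session.execute(
    "SELECT id, event, receivetime FROM t1 WHERE event = %s LIMIT 100",
    ("event1",),
)
for row in rows:
    print(row.id, row.receivetime)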

Calculate time difference between two datetimes within group

I have a data frame with daily call records and what happened to each call. I need to group the call records by caller ID, date, and description of the call, count how many times that caller called in a day, and compute the time difference between the calls.
My current dataset is
Call record ID | DiscriptionCallEvent | CallDateTime |
14306187 | CallTerminated | 2021-02-08 17:48:04
14306189 | AbandonedQueue | 2021-02-08 18:18:44
14306189 | AbandonedQueue | 2021-02-08 18:19:20
14306189 | AbandonedQueue | 2021-02-08 18:20:00
14306189 | Caller terminated | 2021-02-08 19:20:00
The second caller here called back multiple times in quick succession. What I am trying to do is capture the information in this way.
Call record ID | DiscriptionCallEvent | CallDate | #Times | TimeDiff
14306189 | AbandonedQueue | 2021-02-08 | 3 | 2 minutes (say)
14306189 | Caller terminated | 2021-02-08 | 1 | 1 hour
I was able to group by record ID, description, and date and count the number of times the call occurred, but I am not able to find the time difference within the group. I tried many StackOverflow solutions but they did not work.
Currently, I am using this query:
CallsGrouped = df.groupby(["CallRecordID", "CallDate", "DiscriptionCallEvent"], as_index=False)["Call Record ID"].count()
How do I modify the query such that I can find the time difference as well?
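One possible sketch, assuming the column names shown in the question and interpreting TimeDiff as the span between the first and last call in each group (the intended semantics may differ):

import pandas as pd

df = pd.DataFrame({
    "CallRecordID": [14306187, 14306189, 14306189, 14306189, 14306189],
    "DiscriptionCallEvent": ["CallTerminated", "AbandonedQueue", "AbandonedQueue",
                             "AbandonedQueue", "Caller terminated"],
    "CallDateTime": pd.to_datetime([
        "2021-02-08 17:48:04", "2021-02-08 18:18:44", "2021-02-08 18:19:20",
        "2021-02-08 18:20:00", "2021-02-08 19:20:00"]),
})
df["CallDate"] = df["CallDateTime"].dt.date

# Count the calls per group and take the time between the first and last call.
CallsGrouped = (
    df.groupby(["CallRecordID", "CallDate", "DiscriptionCallEvent"], as_index=False)
      .agg(Times=("CallDateTime", "count"),
           TimeDiff=("CallDateTime", lambda s: s.max() - s.min()))
)
print(CallsGrouped)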

Using multiple parent IDs for cutoff times in deep feature synthesis

My data looks like: People <-- Events <-- Activities. The parent is People, whose only variable is the person_id. Events and Activities both have a time index, along with event_id and activity_id, both of which have a few features.
Members of the 'People' entity visit places at all different times. I am trying to generate deep features for people. If people is something like [1, 2, 3], how do I pass cutoff times that create deep features for (person, cutoff_time) pairs like [1, January 2], [1, January 3]?
If I have only 3 People, it seems like I can't pass a cutoff_time dataframe that has 10 rows (for example, person 1 with 10 possible cutoff times). Trying this gives me the error "Duplicated rows in cutoff time dataframe", despite dropping duplicates from my cutoff_times dataframe.
Must I include a time index in the People entity? That would leave my parent entity with the same person appearing multiple times in the index, each with a different time index value. My instinct is that the People entity should not include any datetime column. I would like to give cutoff times to the DFS function.
My cutoff_times df.head looks like this, and has multiple instances of some people_id:
+-------------------------------------------+
| person_id time label |
+-------------------------------------------+
| 0 f_GZSVLYU 2019-12-06 0.0 |
| 1 f_ATBJEQS 2019-12-06 1.0 |
| 2 f_GLFYVAY 2019-12-06 0.5 |
| 3 f_DIHPTPA 2019-12-06 0.5 |
| 4 f_GZSVLYU 2019-12-02 1.0 |
+-------------------------------------------+
The Parent People Entity is like this:
+-------------------+
| person_id |
+-------------------+
| 0 f_GZSVLYU |
| 1 f_ATBJEQS |
| 2 f_GLFYVAY |
| 3 f_DIHPTPA |
| 4 f_DVOYHRQ |
+-------------------+
How can I make featuretools understand what I'm trying to do?
'Duplicated rows in cutoff time dataframe.' I have explored my cutoff_times df and there are no duplicate rows. person_id, time, and label each have multiple occurrences, but no two rows are the same. Could the duplicates the error refers to be somewhere else in the EntitySet?
The answer: some rows of the cutoff_times dataframe had the same ID and time but different labels. That's the problem.
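A sketch of the fix under that diagnosis, assuming the pre-1.0 featuretools API (target_entity / cutoff_time keywords) and an existing EntitySet named es with a 'people' entity; how the conflicting labels should be collapsed is left as an assumption:

import featuretools as ft

# featuretools treats two cutoff rows with the same instance id and time as
# duplicates even when their labels differ, so collapse such rows first.
cutoff_times = cutoff_times.drop_duplicates(subset=["person_id", "time"])

feature_matrix, feature_defs = ft.dfs(
    entityset=es,              # assumed existing EntitySet
    target_entity="people",    # assumed entity name
    cutoff_time=cutoff_times,
)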

Spark: count events based on two columns

I have a table with events which are grouped by a uid. All rows have the columns uid, visit_num and event_num.
visit_num is an arbitrary counter that occasionally increases. event_num is the counter of interactions within the visit.
I want to merge these two counters into a single interaction counter that keeps increasing by 1 for each event and continues to increase when the next visit starts.
As I only look at the relative distance between events, it's fine if I don't start the counter at 1.
|uid |visit_num|event_num|interaction_num|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 2 | 4 |
| 2 | 1 | 1 | 500 |
| 2 | 2 | 1 | 501 |
| 2 | 2 | 2 | 502 |
I can achieve this by repartitioning the data and using the monotonically_increasing_id like this:
from pyspark.sql import functions as fn

df.repartition("uid") \
  .sort("visit_num", "event_num") \
  .withColumn("iid", fn.monotonically_increasing_id())
However the documentation states:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As the id seems to be monotonically increasing by partition this seems fine. However:
I am close to reaching the 1 billion partition/uid threshold.
I don't want to rely on the current implementation not changing.
Is there a way I can start each uid with 1 as the first interaction num?
Edit
After testing this some more, I notice that some of the users don't seem to have consecutive iid values using the approach described above.
Edit 2: Windowing
Unfortunately there are some (rare) cases where more than one row has the same visit_num and event_num. I've tried using the window function as below, but because it assigns the same rank to identical rows, this is not really an option.
from pyspark.sql.window import Window

iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid = df_sample.withColumn("iid", fn.rank().over(iid_window))
The best solution is the Windowing function with rank, as suggested by Jacek Laskowski.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid = df_sample.withColumn("iid", fn.rank().over(iid_window))
In my specific case some more data cleaning was required, but generally this should work.
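If the ties mentioned in Edit 2 must still receive distinct consecutive numbers, one alternative (not from the accepted answer) is row_number() over the same window, which numbers the rows 1, 2, 3, ... per uid and breaks ties arbitrarily unless an extra tie-breaking column is added to the ordering:

from pyspark.sql import functions as fn
from pyspark.sql.window import Window

iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid = df_sample.withColumn("iid", fn.row_number().over(iid_window))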
