Get the last 100 rows from cassandra table - cassandra

I have a table in Cassandra, but I cannot select the last 200 rows in the table.
The CLUSTERING ORDER BY clause was supposed to enforce sorting on disk.
CREATE TABLE t1(id int ,
event text,
receivetime timestamp ,
PRIMARY KEY (event, id)
) WITH CLUSTERING ORDER BY (id DESC)
;
The output is unsorted by id:
event | id | receivetime
---------+----+---------------------------------
event1 | 1 | 2021-07-12 08:11:57.702000+0000
event7 | 7 | 2021-05-22 05:30:00.000000+0000
event5 | 5 | 2021-05-25 05:30:00.000000+0000
event9 | 9 | 2021-05-22 05:30:00.000000+0000
event2 | 2 | 2021-05-21 05:30:00.000000+0000
event10 | 10 | 2021-05-23 05:30:00.000000+0000
event4 | 4 | 2021-05-24 05:30:00.000000+0000
event6 | 6 | 2021-05-27 05:30:00.000000+0000
event3 | 3 | 2021-05-22 05:30:00.000000+0000
event8 | 8 | 2021-05-21 05:30:00.000000+0000
How do I overcome this problem?
Thanks

The same question was asked on https://community.datastax.com/questions/11983/ so I'm re-posting my answer here.
The rows within a partition are sorted based on the order of the clustering column, not the partition key.
In your case, the table's primary key is defined as:
PRIMARY KEY (event, id)
This means that each partition key can have one or more rows, with each row identified by the id column. Since there is only one row in each partition, the sorting order is not evident. But if you had multiple rows in each partition, you'd be able to see that they would be sorted. For example:
event | id | receivetime
---------+----+---------------------------------
event1 | 7 | 2021-05-22 05:30:00.000000+0000
event1 | 5 | 2021-05-25 05:30:00.000000+0000
event1 | 1 | 2021-07-12 08:11:57.702000+0000
In the example above, the partition event1 has 3 rows sorted by the ID column in reverse order.
In addition, running unbounded queries (no WHERE clause filter) is an anti-pattern in Cassandra because it requires a full table scan. If you consider a cluster which has 500 nodes, an unbounded query has to request all the partitions (records) from all 500 nodes to return the result. It will not perform well and does not scale. Cheers!

The clustering order is the order of rows within a single partition key value, e.g. all of the rows for event1 would be in order within event1. It is not a global ordering.
From your results we can see you are selecting multiple partitions - which is why you are not seeing an order you expect.
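The difference between per-partition ordering and global ordering can be sketched in plain Python. This is only a toy model: the token() function below is a made-up stand-in based on MD5, not Cassandra's real partitioner, and scan() is a hypothetical helper that mimics how a full-table read returns partitions in token order with rows sorted by id DESC inside each partition.

```python
import hashlib
from collections import defaultdict

def token(partition_key: str) -> int:
    # Stand-in hash for illustration only (NOT Cassandra's actual partitioner).
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")

def scan(rows):
    """Return rows partition by partition (in token order),
    with each partition sorted by id in descending order."""
    partitions = defaultdict(list)
    for event, row_id in rows:
        partitions[event].append(row_id)
    result = []
    for event in sorted(partitions, key=token):
        for row_id in sorted(partitions[event], reverse=True):
            result.append((event, row_id))
    return result

rows = [("event1", 1), ("event2", 2), ("event1", 7), ("event3", 3), ("event1", 5)]
ordered = scan(rows)
event1_ids = [row_id for event, row_id in ordered if event == "event1"]
print(ordered)      # partition order depends on the hash, not on the key value
print(event1_ids)   # [7, 5, 1]
```

Only the ids inside one partition come back sorted; the order of the partitions themselves follows the hash, which is why a query across many single-row partitions looks unsorted.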

Related

Using multiple parent IDs for cutoff times in deep feature synthesis

My data looks like: People <-- Events <-- Activities. The parent is People, of which the only variable is the person_id. Events and Activities both have a time index, along with event_id and activity_id respectively, and both have a few features.
Members of the 'People' entity visit places at all different times. I am trying to generate deep features for people. If people is something like [1,2,3], how do I pass cut off times that create deep features for something like (Person,cutofftime): [1,January2], [1, January3]
If I have only 3 People, it seems like I can't pass a cutoff_time dataframe that has 10 rows (for example, person 1 with 10 possible cutoff times). Trying this gives me the error "Duplicated rows in cutoff time dataframe", despite dropping duplicates from my cutoff_times dataframe.
Must I include time index in the People Entity? This would leave my parent entity with multiple people in the index, although they would have different time index. My instinct is that the people entity should not include any datetime column. I would like to give cut off times to the DFS function.
My cutoff_times df.head looks like this, and has multiple instances of some people_id:
+-------------------------------------------+
| person_id time label |
+-------------------------------------------+
| 0 f_GZSVLYU 2019-12-06 0.0 |
| 1 f_ATBJEQS 2019-12-06 1.0 |
| 2 f_GLFYVAY 2019-12-06 0.5 |
| 3 f_DIHPTPA 2019-12-06 0.5 |
| 4 f_GZSVLYU 2019-12-02 1.0 |
+-------------------------------------------+
The Parent People Entity is like this:
+-------------------+
| person_id |
+-------------------+
| 0 f_GZSVLYU |
| 1 f_ATBJEQS |
| 2 f_GLFYVAY |
| 3 f_DIHPTPA |
| 4 f_DVOYHRQ |
+-------------------+
How can I make featuretools understand what I'm trying to do?
'Duplicated rows in cutoff time dataframe.' I have explored my cutoff_times df and there are no duplicate rows. Person_id, times, and labels all have multiple occurrences each but no 2 rows are the same. Could these duplicates the error is referring to be somewhere else in the EntitySet?
The answer: two rows of the cutoff_times dataframe had the same ID and time but different labels. That's a problem.
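A quick way to spot this situation is to look for duplicates on the (person_id, time) pair only, rather than on whole rows. The sketch below is plain Python over hypothetical sample rows (the fourth row is invented for illustration) and assumes, as the answer implies, that the label column is not part of what counts as a duplicate.

```python
from collections import Counter

# Hypothetical cutoff rows as (person_id, time, label).
cutoff_rows = [
    ("f_GZSVLYU", "2019-12-06", 0.0),
    ("f_ATBJEQS", "2019-12-06", 1.0),
    ("f_GZSVLYU", "2019-12-02", 1.0),
    ("f_GZSVLYU", "2019-12-06", 1.0),  # same id and time as row 1, different label
]

def duplicated_keys(rows):
    """Return the (person_id, time) pairs that occur more than once."""
    counts = Counter((person_id, time) for person_id, time, _label in rows)
    return sorted(key for key, n in counts.items() if n > 1)

# No two complete rows are identical, so dropping duplicates changes nothing...
full_row_dupes = len(cutoff_rows) - len(set(cutoff_rows))
# ...but one (person_id, time) pair still appears twice.
key_dupes = duplicated_keys(cutoff_rows)

print(full_row_dupes)  # 0
print(key_dupes)       # [('f_GZSVLYU', '2019-12-06')]
```

This matches the symptom in the question: dropping fully identical rows finds nothing, yet a pair of rows still collides on ID and time.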

Spark Structured Streaming ignore old records

I am new to Spark and would appreciate help finding a solution to this problem. I receive an input file containing information about events that occurred, and the file itself includes a timestamp value. Event Id is the primary column for this input. Refer to the sample input below (the actual file has many other columns).
Event_Id | Event_Timestamp
1 | 2018-10-11 12:23:01
2 | 2018-10-11 13:25:01
1 | 2018-10-11 14:23:01
3 | 2018-10-11 20:12:01
When we get the above input, we need to keep the latest record for each event id based on its timestamp, so the expected output would be
Event_Id | Event_Timestamp
2 | 2018-10-11 13:25:01
1 | 2018-10-11 14:23:01
3 | 2018-10-11 20:12:01
From then on, whenever I receive event information with a timestamp older than the stored value, I need to ignore it. For example, consider the second input
Event_Id | Event_Timestamp
2 | 2018-10-11 10:25:01
1 | 2018-10-11 08:23:01
3 | 2018-10-11 21:12:01
Now I need to ignore event_id 1 and 2, since their timestamps are older than the state we currently hold. Only event 3 would be passed on, and the expected output here is
3 | 2018-10-11 21:12:01
Assume we have n unique event ids (say 10 billion): how would this state be stored in Spark memory, and is there anything that needs to be taken care of?
Thanks in advance
You could keep the maximum timestamp per event id and persist it with the persist() method, using the DISK_ONLY or DISK_ONLY_2 storage level. I think that would achieve this.
Since it's streaming data, you could try the MEMORY_ONLY or MEMORY_ONLY_2 storage levels too.
Please try it and report back.
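The "keep the max timestamp per event id and ignore anything older" logic from the question can be sketched independently of Spark. This is plain Python, not Spark API: filter_batch and the state dictionary are hypothetical names, and in a real streaming job the state would have to live in whatever stateful mechanism the job uses rather than in a local dict.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def filter_batch(state, batch):
    """Emit, per event id, the newest record of the batch if it is newer
    than the timestamp already stored in `state`; update `state` in place."""
    # Step 1: reduce the batch to the latest timestamp per event id.
    latest = {}
    for event_id, ts_text in batch:
        ts = datetime.strptime(ts_text, FMT)
        if event_id not in latest or ts > latest[event_id]:
            latest[event_id] = ts
    # Step 2: keep only ids whose batch timestamp beats the stored one.
    emitted = []
    for event_id, ts in latest.items():
        if event_id not in state or ts > state[event_id]:
            state[event_id] = ts
            emitted.append((event_id, ts.strftime(FMT)))
    return sorted(emitted, key=lambda record: record[1])

state = {}
first = filter_batch(state, [
    (1, "2018-10-11 12:23:01"),
    (2, "2018-10-11 13:25:01"),
    (1, "2018-10-11 14:23:01"),
    (3, "2018-10-11 20:12:01"),
])
second = filter_batch(state, [
    (2, "2018-10-11 10:25:01"),
    (1, "2018-10-11 08:23:01"),
    (3, "2018-10-11 21:12:01"),
])
print(first)   # [(2, '2018-10-11 13:25:01'), (1, '2018-10-11 14:23:01'), (3, '2018-10-11 20:12:01')]
print(second)  # [(3, '2018-10-11 21:12:01')]
```

Note that the state holds one entry per distinct event id, so its size grows with the number of ids; with billions of ids that is the part that needs attention.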

Spark: count events based on two columns

I have a table with events which are grouped by a uid. All rows have the columns uid, visit_num and event_num.
visit_num is an arbitrary counter that occasionally increases. event_num is the counter of interactions within the visit.
I want to merge these two counters into a single interaction counter that keeps increasing by 1 for each event and continues to increase when the next visit has started.
As I only look at the relative distance between events, it's fine if I don't start the counter at 1.
|uid |visit_num|event_num|interaction_num|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 2 | 4 |
| 2 | 1 | 1 | 500 |
| 2 | 2 | 1 | 501 |
| 2 | 2 | 2 | 502 |
I can achieve this by repartitioning the data and using the monotonically_increasing_id like this:
from pyspark.sql import functions as fn

df.repartition("uid")\
.sort("visit_num", "event_num")\
.withColumn("iid", fn.monotonically_increasing_id())
However the documentation states:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As the id seems to be monotonically increasing by partition this seems fine. However:
I am close to reaching the 1 billion partition/uid threshold.
I don't want to rely on the current implementation not changing.
Is there a way I can start each uid with 1 as the first interaction num?
Edit
After testing this some more, I notice that some of the users don't seem to have consecutive iid values using the approach described above.
Edit 2: Windowing
Unfortunately there are some (rare) cases where more than one row has the same visit_num and event_num. I've tried using the windowing function as below, but because it assigns the same rank to two rows with identical values, this is not really an option.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
The best solution is the Windowing function with rank, as suggested by Jacek Laskowski.
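from pyspark.sql import Window
from pyspark.sql import functions as fn

# Number events per uid, ordered by visit and then by event within the visit
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid = df_sample.withColumn("iid", fn.rank().over(iid_window))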
In my specific case some more data cleaning was required but generally, this should work.
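The tie problem mentioned in the question comes from how rank numbers equal keys. The plain-Python sketch below mimics SQL-style RANK and ROW_NUMBER over one uid's events (the helper names and sample data are made up for illustration). In PySpark, fn.row_number() is the window function that corresponds to the second helper; with ties, which of the tied rows gets the lower number is arbitrary.

```python
def rank(sorted_keys):
    """SQL-style RANK: equal keys share a rank, and the next rank skips ahead."""
    ranks = []
    for position, key in enumerate(sorted_keys, start=1):
        if position > 1 and key == sorted_keys[position - 2]:
            ranks.append(ranks[-1])      # tie: reuse the previous rank
        else:
            ranks.append(position)
    return ranks

def row_number(sorted_keys):
    """SQL-style ROW_NUMBER: always consecutive, even for equal keys."""
    return list(range(1, len(sorted_keys) + 1))

# (visit_num, event_num) pairs for a single uid; (1, 2) appears twice.
events = sorted([(1, 1), (1, 2), (1, 2), (2, 1)])

print(rank(events))        # [1, 2, 2, 4]
print(row_number(events))  # [1, 2, 3, 4]
```

If the duplicated rows cannot be cleaned up first, a consecutive numbering like the second one avoids repeated counter values.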

cassandra composite index and compact storages

I am new to Cassandra and have not run it yet, but my business logic requires me to create a table like this.
CREATE TABLE Index(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, keyword, score); )
WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
Is it possible or not? I have only one column (fID) which is not part of my composite index, so I hope I will be able to apply the COMPACT STORAGE setting. Pay attention that I ordered by the third column of my composite index, not the second. I need to compact the storage as well, so the keywords will not be repeated for each fID.
A few things initially about your CREATE TABLE statement:
It will error on the semicolon (;) after your PRIMARY KEY definition.
You will need to pick a new name, as Index is a reserved word.
"Pay attention that I ordered by the third column of my composite index, not the second."
You cannot skip a clustering key when you specify CLUSTERING ORDER.
However, I do see an option here. Depending on your query requirements, you could simply re-order keyword and score in your PRIMARY KEY definition, and then it would work:
CREATE TABLE giveMeABetterName(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, score, keyword)
) WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
That way, you could query by user_id and your rows (keywords?) for that user would be ordered by score:
SELECT * FROM giveMeABetterName WHERE user_id=1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4;
If that won't work for your business logic, then you might have to retouch your data model. But it is not possible to skip a clustering key when specifying CLUSTERING ORDER.
Edit
But re-ordering of columns does not work for me. Can I do something like this WITH CLUSTERING ORDER BY (keyword asc, score desc)
Let's look at some options here. I created a table with your original PRIMARY KEY, but with this CLUSTERING ORDER. That will technically work, but look at how it treats my sample data (video game keywords):
aploetz#cqlsh:stackoverflow> SELECT * FROM givemeabettername WHERE user_id=dbeddd12-40c9-4f84-8c41-162dfb93a69f;
user_id | keyword | score | fid
--------------------------------------+------------------+-------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Assassin's creed | 87 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Battlefield 4 | 9 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Uncharted 2 | 91 | 0
(3 rows)
On the other hand, if I alter the PRIMARY KEY to cluster on score first (and adjust CLUSTERING ORDER accordingly), the same query returns this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
Note that you'll want to change the data type of score from TEXT to a numeric (int/bigint) to avoid ASCII-betical sorting, like this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
Something that might help is to read through the DataStax doc on Compound Keys and Clustering.
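The "ASCII-betical" effect is easy to reproduce outside Cassandra. The Python snippet below sorts the three sample scores in descending order, first as text and then as integers; it only illustrates the comparison rules and does not involve Cassandra itself.

```python
text_scores = ["87", "9", "91"]
int_scores = [int(score) for score in text_scores]

# Text compares character by character, so "9" sorts above "87".
text_desc = sorted(text_scores, reverse=True)
# Integers compare by value.
int_desc = sorted(int_scores, reverse=True)

print(text_desc)  # ['91', '9', '87']
print(int_desc)   # [91, 87, 9]
```

The text result is the same 91, 9, 87 order shown in the last table above, which is why a numeric type is the better choice for score.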

Is there a way to make clustering order by data type and not string in Cassandra?

I created a table in CQL3 in the cqlsh using the following CQL:
CREATE TABLE test (
locationid int,
pulseid int,
name text, PRIMARY KEY(locationid, pulseid)
) WITH CLUSTERING ORDER BY (locationid ASC, pulseid DESC);
Note that locationid is an integer.
However, after I inserted data, and ran a select, I noticed that locationid's ascending sort seems to be based upon string, and not integer.
cqlsh:citypulse> select * from test;
locationid | pulseid | name
------------+---------+------
0 | 3 | test
0 | 2 | test
0 | 1 | test
0 | 0 | test
10 | 3 | test
5 | 3 | test
Note the 0 10 5. Is there a way to make it sort via its actual data type?
Thanks,
Allison
In Cassandra, the first part of the primary key is the 'partition key'. That key is used to distribute data around the cluster: it is hashed so that rows spread evenly across nodes, which makes the resulting order look random. This means that you cannot order by the first part of your primary key.
What version of Cassandra are you on? In the most recent version of 1.2 (1.2.2), the CREATE statement you have used as an example is invalid.
