Paging Resultsets in Cassandra with compound primary keys - Missing out on rows - cassandra

So, my original problem was using the token() function to page through a large data set in Cassandra 1.2.9, as explained and answered here: Paging large resultsets in Cassandra with CQL3 with varchar keys
The accepted answer got the select working with tokens and chunk size, but another problem manifested itself.
My table looks like this in cqlsh:
key | column1 | value
---------------+-----------------------+-------
85.166.4.140 | county_finnmark | 4
85.166.4.140 | county_id_20020 | 4
85.166.4.140 | municipality_alta | 2
85.166.4.140 | municipality_id_20441 | 2
93.89.124.241 | county_hedmark | 24
93.89.124.241 | county_id_20005 | 24
The primary key is a composite of key and column1. In CLI, the same data looks like this:
get ip['85.166.4.140'];
=> (counter=county_finnmark, value=4)
=> (counter=county_id_20020, value=4)
=> (counter=municipality_alta, value=2)
=> (counter=municipality_id_20441, value=2)
Returned 4 results.
The problem
When using cql with a limit of i.e. 100, the returned results may stop in the middle of a record, like this:
key | column1 | value
---------------+-----------------------+-------
85.166.4.140 | county_finnmark | 4
85.166.4.140 | county_id_20020 | 4
leaving these to "rows" (columns) out:
85.166.4.140 | municipality_alta | 2
85.166.4.140 | municipality_id_20441 | 2
Now, when I use the token() function for the next page like, these two rows are skipped:
select * from ip where token(key) > token('85.166.4.140') limit 10;
Result:
key | column1 | value
---------------+------------------------+-------
93.89.124.241 | county_hedmark | 24
93.89.124.241 | county_id_20005 | 24
95.169.53.204 | county_id_20006 | 2
95.169.53.204 | county_oppland | 2
So, no trace of the last two results from the previous IP address.
Question
How can I use token() for paging without skipping over cql rows? Something like:
select * from ip where token(key) > token(key:column1) limit 10;

Ok, so I used the info in this post to work out a solution:
http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive
(section "CQL3 pagination").
First, I execute this cql:
select * from ip limit 5000;
From the last row in the resultset, I get the key (i.e. '85.166.4.140') and the value from column1 (i.e. 'county_id_20020').
Then I create a prepared statement evaluating to
select * from ip where token(key) = token('85.166.4.140') and column1 > 'county_id_20020' ALLOW FILTERING;
(I'm guessing it would work also without using the token() function, as the check is now for equal:)
select * from ip where key = '85.166.4.140' and column1 > 'county_id_20020' ALLOW FILTERING;
The resultset now contains the remaining X rows (columns) for this IP. The method then returns all the rows, and the next call to the method includes the last used key ('85.166.4.140'). With this key, I can execute the following select:
select * from ip where token(key) > token('85.166.4.140') limit 5000;
which gives me the next 5000 rows from (and including) the first IP after '85.166.4.140'.
Now, no columns are lost in the paging.
UPDATE
Cassandra 2.0 introduced automatic paging, handled by the client.
More info here: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
(note that setFetchSize is optional and not necessary for paging to work)

Related

COGNOS Report: COUNT IF

I am not sure how to go about creating a custom field to count instances given a condition.
I have a field, ID, that exists in two formats:
A#####
B#####
I would like to create two columns (one for A and one for B) and count instances by month. Something like COUNTIF ID STARTS WITH A for the first column resulting in something like below. Right now I can only create a table with the total count.
+-------+------+------+
| Month | ID A | ID B |
+-------+------+------+
| Jan | 100 | 10 |
+-------+------+------+
| Feb | 130 | 13 |
+-------+------+------+
| Mar | 90 | 12 |
+-------+------+------+
Define ID A as...
CASE
WHEN ID LIKE 'A%' THEN 1
ELSE 0
END
...and set the Default aggregation property to Total.
Do the same for ID B.
Apologies if I misunderstood the requirement, but you maybe able to spin the list into crosstab using the section off the toolbar, your measure value would be count(ID).
Try this
Query 1 to count A , filtering by substring(ID,1,1) = 'A'
Query 2 to count B , filtering by substring(ID,1,1) = 'B'
Join Query 1 and Query 2 by Year/Month
List by Month with Count A and Count B

Spark: count events based on two columns

I have a table with events which are grouped by a uid. All rows have the columns uid, visit_num and event_num.
visit_num is an arbitrary counter that occasionally increases. event_num is the counter of interactions within the visit.
I want to merge these two counters into a single interaction counter that keeps increasing by 1 for each event and continues to increase when then next visit has started.
As I only look at the relative distance between events, it's fine if I don't start the counter at 1.
|uid |visit_num|event_num|interaction_num|
| 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | 2 |
| 1 | 2 | 1 | 3 |
| 1 | 2 | 2 | 4 |
| 2 | 1 | 1 | 500 |
| 2 | 2 | 1 | 501 |
| 2 | 2 | 2 | 502 |
I can achieve this by repartitioning the data and using the monotonically_increasing_id like this:
df.repartition("uid")\
.sort("visit_num", "event_num")\
.withColumn("iid", fn.monotonically_increasing_id())
However the documentation states:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.
As the id seems to be monotonically increasing by partition this seems fine. However:
I am close to reaching the 1 billion partition/uid threshold.
I don't want to rely on the current implementation not changing.
Is there a way I can start each uid with 1 as the first interaction num?
Edit
After testing this some more, I notice that some of the users don't seem to have consecutive iid values using the approach described above.
Edit 2: Windowing
Unfortunately there are some (rare) cases where more thanone row has the samevisit_numandevent_num`. I've tried using the windowing function as below, but due to this assigning the same rank to two identical columns, this is not really an option.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
The best solution is the Windowing function with rank, as suggested by Jacek Laskowski.
iid_window = Window.partitionBy("uid").orderBy("visit_num", "event_num")
df_sample_iid=df_sample.withColumn("iid", fn.rank().over(iid_window))
In my specific case some more data cleaning was required but generally, this should work.

Cassandra how to add values in a single row on every hit

In this table application will feed us with the below data and it will be incremental as and when we will receive updates on the status . So initially table will look like the below as shown:-
+---------------+---------------+---------------+---------------+
| ID | Total count | Failed count | Success count |
+---------------+---------------+---------------+---------------+
| 1 | 30 | 10 | 20 |
+---------------+---------------+---------------+---------------+
Now let’s assume total 30 messages are pushed now out of which 10 Failed and 20 Success as shown above.Now again application is run and values changed . Now total 20 new records came in out of which all are success. This should be updated in the same row .
+---------------+---------------+---------------+---------------+
| ID | Total count | Failed count | Success count |
+---------------+---------------+---------------+---------------+
| 1 | 50 | 10 | 40 |
+---------------+---------------+---------------+---------------+
Is it feasible in Cassandra DB using Counter data type?
Of course you can use counter tables in your case.
Let's assume table structure like :
CREATE KEYSPACE Test WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
CREATE TABLE data (
id int,
data string,
PRIMARY KEY (id)
);
CREATE TABLE counters (
id int,
total_count counter,
failed_count counter,
success_coutn counter,
PRIMARY KEY (id)
);
You can increment counters by running queries like :
UPDATE counters
SET total_count = total_count + 1,
success_count = success_count + 1
WHERE id= 1;
Hope this can help you.

Cassandra Delete by Secondary Index or By Allowing Filtering

I’m trying to delete by a secondary index or column key in a table. I'm not concerned with performance as this will be an unusual query. Not sure if it’s possible? E.g.:
CREATE TABLE user_range (
id int,
name text,
end int,
start int,
PRIMARY KEY (id, name)
)
cqlsh> select * from dat.user_range where id=774516966;
id | name | end | start
-----------+-----------+-----+-------
774516966 | 0 - 499 | 499 | 0
774516966 | 500 - 999 | 999 | 500
I can:
cqlsh> select * from dat.user_range where name='1000 - 1999' allow filtering;
id | name | end | start
-------------+-------------+------+-------
-285617516 | 1000 - 1999 | 1999 | 1000
-175835205 | 1000 - 1999 | 1999 | 1000
-1314399347 | 1000 - 1999 | 1999 | 1000
-1618174196 | 1000 - 1999 | 1999 | 1000
Blah blah…
But I can’t delete:
cqlsh> delete from dat.user_range where name='1000 - 1999' allow filtering;
Bad Request: line 1:52 missing EOF at 'allow'
cqlsh> delete from dat.user_range where name='1000 - 1999';
Bad Request: Missing mandatory PRIMARY KEY part id
Even if I create an index:
cqlsh> create index on dat.user_range (start);
cqlsh> delete from dat.user_range where start=1000;
Bad Request: Non PRIMARY KEY start found in where clause
Is it possible to delete without first knowing the primary key?
No, deleting by using a secondary index is not supported: CASSANDRA-5527
When you have your secondary index you can select all rows from that index. When you have your rows you know the primary key and can then delete the rows.
I came here looking for a solution to delete rows from cassandra column family.
I ended up doing an INSERT and set a TTL (time to live) so that I don't have to worry about deleting it.
Putting it out there, might help someone.

Is there a way to make clustering order by data type and not string in Cassandra?

I created a table in CQL3 in the cqlsh using the following CQL:
CREATE TABLE test (
locationid int,
pulseid int,
name text, PRIMARY KEY(locationid, pulseid)
) WITH CLUSTERING ORDER BY (locationid ASC, pulseid DESC);
Note that locationid is an integer.
However, after I inserted data, and ran a select, I noticed that locationid's ascending sort seems to be based upon string, and not integer.
cqlsh:citypulse> select * from test;
locationid | pulseid | name
------------+---------+------
0 | 3 | test
0 | 2 | test
0 | 1 | test
0 | 0 | test
10 | 3 | test
5 | 3 | test
Note the 0 10 5. Is there a way to make it sort via its actual data type?
Thanks,
Allison
In Cassandra, the first part of the primary key is the 'partition key'. That key is used to distribute data around the cluster. It does this in a random fashion to achieve an even distribution. This means that you can not order by the first part of your primary key.
What version of Cassandra are you on? In the most recent version of 1.2 (1.2.2), the create statement you have used an example is invalid.

Resources