Querying 2 cells from each partition in Cassandra using CQL

I am new to Cassandra and I am using the DataStax driver to access my keyspace. I have a legacy table which was created with the Cassandra Thrift client. I need to retrieve two column values from each partition in one query, like a MultigetSliceQuery in the Hector API. How can I do this using CQL and the DataStax Java driver?
--edit--
My column family is a legacy table, which looks like the following in cqlsh.
CREATE TABLE messages (
    key blob,
    column1 text,
    value blob,
    PRIMARY KEY ((key), column1)
);
I need to select two values for each key. In this table I store the messages of each user: the user id is the row key, the message id is the column name, and the message is the value. I need to show the two latest messages from each user.

Try using an IN filter condition.
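A sketch of what that could look like, assuming Cassandra 3.6+ so that PER PARTITION LIMIT can cap each partition (a plain LIMIT applies to the whole result set, not per partition); the blob literals are hypothetical user ids:
SELECT key, column1, value
FROM messages
WHERE key IN (0x0102, 0x0304)
PER PARTITION LIMIT 2;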

I think you should execute one request per partition (executing them concurrently if you are fetching more than one partition). Assuming you want the top two in the natural order of column1:
SELECT column1, value FROM messages WHERE key=<blob> LIMIT 2;

Related

Is Cassandra really a NoSQL database?

I'm new to Cassandra and NoSQL databases. As per my understanding, when you say something is NoSQL, it means it should accept all data when you insert values (it is schema-free). I created a table in Cassandra; it contains 5 fields. With the first insert query I inserted only 5 values, and it succeeded. Next I tried 6 values, and it threw an error saying there are unmatched column names/values (the 6th field). If Cassandra is NoSQL, then that 6th field should be inserted into the table.
I googled this. A few people suggested using an ALTER query to change the schema. If that is the case, I can use ALTER in SQL as well, so why would I need to move to NoSQL?
Is my understanding correct?
com.datastax.driver.core.exceptions.InvalidQueryException: Unmatched column names/values
Yes, Cassandra is a NoSQL database. A NoSQL database can be broadly defined as a database which stores and maintains data in a non-relational way (no SQL), can store web-scale data easily, can scale out, and is generally distributed. Cassandra ticks all the boxes to be called a NoSQL database.
Coming back to your question about the requirement of a schema: Cassandra used to provide (and, as a deprecated feature, still provides) a way to add columns on the fly using the Thrift API. The Thrift API is going to be removed completely in Cassandra 4.0. Cassandra now supports schema-based CQL.
You can still design your table to add columns dynamically using CQL, for example:
CREATE TABLE my_keyspace.table_name (
    partition_key text,
    column_name text,
    column_value text,
    PRIMARY KEY ((partition_key), column_name)
);
Now all rows consisting of column_name and column_value are grouped by partition_key, with rows sorted by column_name.
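As a sketch (keyspace, table, and values are hypothetical), each inserted row then plays the role of one dynamic column:
INSERT INTO my_keyspace.table_name (partition_key, column_name, column_value)
VALUES ('user42', 'email', 'user42@example.com');
INSERT INTO my_keyspace.table_name (partition_key, column_name, column_value)
VALUES ('user42', 'city', 'Berlin');
-- all "columns" of user42, sorted by column_name
SELECT column_name, column_value FROM my_keyspace.table_name WHERE partition_key = 'user42';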

Filter on the partition and the clustering key with an additional criterion

I want to filter on a table that has a partition key and a clustering key, with an additional criterion on a regular column. I got the following warning.
InvalidQueryException: Cannot execute this query as it might involve
data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability,
use ALLOW FILTERING
I understand the problem if the partition and the clustering key are not used. In my case, is it a relevant error or can I ignore it?
Here is an example of the table and query.
CREATE TABLE mytable (
    name text,
    id uuid,
    deleted boolean,
    PRIMARY KEY ((name), id)
);
SELECT id FROM mytable WHERE name = 'myname' AND id = <uuid> AND deleted = false;
In Cassandra you can't filter data on a non-primary-key column unless you create an index on it.
In Cassandra 3.0 and up, you are allowed to filter data on a non-primary-key column, but with unpredictable performance.
In Cassandra 3.0 and up, if you provide the full primary key (as in your query), then you can use the query with ALLOW FILTERING and ignore the warning.
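With the full primary key supplied, the query from the question could look like this (a sketch, with a placeholder uuid):
SELECT id FROM mytable
WHERE name = 'myname'
  AND id = 123e4567-e89b-12d3-a456-426614174000
  AND deleted = false
ALLOW FILTERING;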
Otherwise, filter on the client side, or remove the field deleted and create another table:
Instead of updating the field deleted to true, move your data to another table, let's say mytable_deleted:
CREATE TABLE mytable_deleted (
    name text,
    id uuid,
    PRIMARY KEY (name, id)
);
Now you have only the non-deleted data in mytable and the deleted data in the mytable_deleted table.
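A sketch of the move (values are placeholders); a logged batch keeps the insert and the delete together:
BEGIN BATCH
  INSERT INTO mytable_deleted (name, id)
  VALUES ('myname', 123e4567-e89b-12d3-a456-426614174000);
  DELETE FROM mytable
  WHERE name = 'myname' AND id = 123e4567-e89b-12d3-a456-426614174000;
APPLY BATCH;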
or
Create an index on it:
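A sketch of the index creation against the table above:
CREATE INDEX ON mytable (deleted);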
The column deleted is a low-cardinality column, so remember:
A query on an indexed column in a large cluster typically requires collating responses from multiple data partitions. The query response slows down as more machines are added to the cluster. You can avoid a performance hit when looking for a row in a large partition by narrowing the search.
Read more: When not to use an index

Get first row for each partition key in Cassandra

I am considering Cassandra as intermediate storage during my ETL job to perform data deduplication.
Let's imagine I have a stream of events, each of them with a business entity id, a timestamp, and some value. I need to get only the latest value in terms of the in-event timestamp for each business key, but events may come unordered.
My idea was to create staging table with business id as a partition key and timestamp as a clustering key:
CREATE TABLE sample_keyspace.table1_copy1 (
    id uuid,
    time timestamp,
    value text,
    PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);
Now if I insert some data into this table, I can get the latest value for a given partition key:
select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
But that would require issuing such a query for every business key I'm interested in.
Is there some effective way I could do it in CQL?
I know we have the ability to list all available partition keys (by select distinct id from table1). So if I look at the storage model of Cassandra, getting the first row for each partition key should not be too hard.
Is that supported?
If you're using Cassandra 3.6 or later, there is an option for your query named PER PARTITION LIMIT (CASSANDRA-7017), which you can set to 1. This won't autocomplete in cqlsh until 3.10, with CASSANDRA-12803.
SELECT * FROM table1 PER PARTITION LIMIT 1;
In a word: no.
The partitioning key is why Cassandra can handle essentially any amount of data: it decides where to put/look for data using the hash of the partitioning key. That is why CQL SELECTs always need to do an equality filter on the entire partitioning key. In order to find the first time for each id, Cassandra would have to ask all nodes for every partition of the data, then perform a complex operation on each of them. Relational databases allow this; Cassandra does not. All it allows are full table scans (SELECT * FROM table1) or partition scans (SELECT DISTINCT id FROM table1), but those cannot* be linked to any complex operation.
*) I am omitting ALLOW FILTERING here, since it does not help in this context.

Data loss in cassandra because of frequent delete and insert of same column in a row

I have a column family posts which is used to store post details of my Facebook account. I am using Cassandra 2.0.9 and the DataStax Java driver 3.0.
CREATE TABLE posts (
    key blob,
    column1 text,
    value blob,
    PRIMARY KEY ((key), column1)
) WITH COMPACT STORAGE;
Here the row key is my user id, the column key is the post id, and the value is the post JSON. Whenever I refresh my application in the browser, it fetches data from Facebook and removes and re-adds the data for existing post ids. Sometimes I miss some posts from Cassandra. Can frequent deletes and inserts in the same column of a row cause data loss? How can I manage this?
It's not really data loss; if you're updating the same column at a very high frequency (like thousands of updates/sec) you may get unpredictable results.
Why? Because Cassandra uses the insert timestamp to determine at read time which value is the right one, by comparing the timestamps of the same column from different replicas.
Currently the resolution of the timestamp is on the order of milliseconds, so if your update rate is very high, for example 2 updates on the same column in the same millisecond, the bigger post JSON will win.
By bigger, I mean by using postJson1.compareTo(postJson2). The ordering is determined by the type of your column; in your case it's a String, so Cassandra breaks the tie by comparing the post JSON data lexicographically.
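As a small illustration (key, post id, and values are placeholders), two writes that carry the same write timestamp resolve to the lexically greater value:
INSERT INTO posts (key, column1, value) VALUES (0x01, 'p1', 0x61) USING TIMESTAMP 1500000000000000;
INSERT INTO posts (key, column1, value) VALUES (0x01, 'p1', 0x62) USING TIMESTAMP 1500000000000000;
-- returns 0x62: with equal timestamps, the greater value wins the tie-break
SELECT value FROM posts WHERE key = 0x01 AND column1 = 'p1';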
To avoid this, you can provide the write timestamp on the client side by generating a unique timeuuid yourself.
There are many alternatives for generating such a TimeUUID, for example the Java driver class com.datastax.driver.core.utils.UUIDs.timeBased().
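At the CQL level, a client-chosen write timestamp is supplied with USING TIMESTAMP; a sketch with placeholder key, post id, JSON blob, and microsecond value:
INSERT INTO posts (key, column1, value)
VALUES (0x0102, 'post123', 0x7b7d)
USING TIMESTAMP 1500000000000001;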

SparkSQL restrict queries by Cassandra partition key ranges

Imagine that my primary key is a timestamp.
I would like to restrict the query by timestamp ranges.
I can't seem to make it work, even when using token(). Also, I can't create a secondary index on the partition key.
How should this be done?
Cassandra doesn't allow range queries on the partition key.
One way of dealing with this problem is changing your schema so that the timestamp value becomes a clustering column. For this to work, you need to introduce a sentinel column as the partition key; see this question for more detailed answers: Range Queries in Cassandra (CQL 3.0)
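A sketch of that schema change (table and column names are hypothetical); with a bucket value as the partition key, the timestamp becomes a clustering column and range restrictions on it are allowed:
CREATE TABLE events_by_bucket (
    bucket int,        -- sentinel value, e.g. a day number, to avoid one huge partition
    ts timestamp,
    payload text,
    PRIMARY KEY ((bucket), ts)
);
SELECT * FROM events_by_bucket
WHERE bucket = 20170101
  AND ts >= '2017-01-01 00:00:00' AND ts < '2017-01-02 00:00:00';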
Another way is just to let Spark do the filtering. Range queries on the primary key should work in Spark SQL; they simply won't be pushed down to Cassandra, so Spark will fetch all the data and filter it on the Spark side.
