Cassandra Hector Client: Is a RangeSlicesQuery on Composite Row Keys possible when using Random Partitioning?

Is there any way to range query rows with a composite row key when using random partitioning?
I'm working with column families created via CQL v3 like this:
CREATE TABLE products (
rowkey CompositeType(UTF8Type,UTF8Type,UTF8Type,UTF8Type) PRIMARY KEY,
prod_id varchar,
class_id varchar,
date varchar
);
The data in the table looks like this:
RowKey: 6:3:2:19
=> (column=class_id, value=254, timestamp=1346800102625002)
=> (column=date, value=2034, timestamp=1346800102625000)
=> (column=prod_id, value=1922, timestamp=1346800102625001)
-------------------
RowKey: 0:14:1:16
=> (column=class_id, value=144, timestamp=1346797896819002)
=> (column=date, value=234, timestamp=1346797896819000)
=> (column=prod_id, value=4322, timestamp=1346797896819001)
-------------------
I'm trying to find a way to range-query over these composite row keys, analogous to the way we slice-query over composite columns. The following approach sometimes actually succeeds in returning something useful, depending on the start and stop key I choose.
Composite startKey = new Composite();
startKey.addComponent(0, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(1, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(2, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(3, "3", Composite.ComponentEquality.EQUAL);
Composite stopKey = new Composite();
stopKey.addComponent(0, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(1, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(2, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(3, "6" , Composite.ComponentEquality.GREATER_THAN_EQUAL);
RangeSlicesQuery<Composite, String, String> rangeSlicesQuery =
HFactory.createRangeSlicesQuery(keyspace, CompositeSerializer.get(), StringSerializer.get(), StringSerializer.get());
rangeSlicesQuery.setColumnFamily(columnFamilyName);
rangeSlicesQuery.setKeys(startKey,stopKey);
rangeSlicesQuery.setRange("", "", false, 3);
Most of the time the database returns this:
InvalidRequestException(why:start key's md5 sorts after end key's md5.
this is not allowed; you probably should not specify end key at all,
under RandomPartitioner)
Does somebody have an idea whether something like this can be achieved WITHOUT using the order-preserving partitioner? Do I have to build a custom row key index for this use case?
Thanks a lot!
Additional information:
What I'm trying to do is store sales transaction data in a table that uses both composite row keys to encode date/time/place and composite columns to store information about the sold items:
The set of items per transaction varies in size and includes information about size, color and quantity of every item:
{ ... items :
[ { item_id : 43523 , size : 050 , color : 123 , qty : 1 } ,
{ item_id : 64233 , size : 048 , color : 834 , qty : 1 } ,
{ item_id : 23984 , size : 000 , color : 341 , qty : 3 } ,
… ] }
There’s also information about where and when the transaction happened including a unique transaction id:
{ trx_id : 23324827346, store_id : 8934 , date : 20110303 , time : 0947 , …
My initial approach was to put every item in a separate row and let the application group items back together by transaction id. That works fine. But now I'm trying to leverage the structuring capabilities of composite columns to persist the nested item data, with a per-item representation like this:
item_id:’size’ = <value> ; item_id:’color’ = <value> ; item_id:’qty’ = <value> ; …
43523:size = 050 ; 43523:color = 123 ; 43523:qty = 1 ; …
The rest of the data would be encoded in a composite row key like this:
date : time : store_id : trx_id
20110303 : 0947 : 001 : 23324827346
I need to be able to run queries like: all items sold between the dates 20110301 and 20110310, between times 1200 and 1400, in stores 25-50. What I have achieved so far with composite columns is one wide row per store, with the rest of the data packed into three composite columns per item:
date:time:<type>:prod_id:transaction_id = <value> ; …
20110303:0947:size:43523:23324827346 = 050 ;
20110303:0947:color:43523:23324827346 = 123 ;
20110303:0947:qty:43523:23324827346 = 1 ;
It works, but it doesn't look particularly efficient.
Is there any other alternative?

You're creating one row per partition (each composite key is its own row), so it should be clear that RandomPartitioner, which orders partitions by the MD5 hash of the key, will not give you ordered range queries across those rows.
You can, however, do ordered ranges within a partition, which is a very common pattern; see e.g. http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/ and http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra
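To make that concrete for the sales example in the question, here is a minimal CQL sketch (table and column names are mine, not from the question): the store and day become the partition key, so the date/time/transaction components that used to live in the composite row key become clustering columns that can be range-sliced in order.
-- Hypothetical re-modeling: one partition per store per day,
-- rows ordered by time, transaction and item within that partition.
CREATE TABLE sales_by_store_day (
    store_id   text,
    sale_date  text,   -- e.g. '20110303'
    sale_time  text,   -- e.g. '0947'
    trx_id     text,
    item_id    text,
    size       text,
    color      text,
    qty        int,
    PRIMARY KEY ((store_id, sale_date), sale_time, trx_id, item_id)
);

-- Ordered range query inside a single partition:
SELECT * FROM sales_by_store_day
 WHERE store_id = '25'
   AND sale_date = '20110303'
   AND sale_time >= '1200' AND sale_time <= '1400';
A query spanning stores 25-50 or several days would then be issued as one partition read per (store, day) pair from the client, which is still hash-friendly under RandomPartitioner.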

Related

Save array of objects in cassandra

How can I save an array of objects in Cassandra?
I have a Node.js application and I'm using cassandra-driver to connect to the Cassandra DB. I want to save records like the one below:
{
"id" : "5f1811029c82a61da4a44c05",
"logs" : [
{
"conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667",
"source" : "source1",
"destination" : "destination1",
"url" : "https://asdasdas.com",
"data" : "data1"
},
{
"conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667",
"source" : "source2",
"destination" : "destination2",
"url" : "https://afdvfbwadvsffd.com",
"data" : "data2"
}
],
"conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667"
}
In the above record, I can use the type "text" to save the values of the columns "id" and "conversationId", but I'm not sure how to define the schema and save the data for the field "logs".
With Cassandra, you'll want to store the data the same way that you want to query it. Since you mentioned querying by conversationid, that's going to influence how the PRIMARY KEY definition should look. Given this, conversationid should make a good partition key. As for the clustering columns, I had to make some guesses about cardinality: source looked like it could be used to uniquely identify a log entry within a conversation, so I went with that next.
I thought about using id as the final clustering column, but it looks like all entries with the same conversationid would also have the same id. It might be a good idea to give each entry its own unique identifier, to help ensure uniqueness:
{
"uniqueid": "e53723ca-2ab5-441f-b360-c60eacc2c854",
"conversationId" : "e9b55229-f20c-4453-9c18-a1f4442eb667",
"source" : "source1",
"destination" : "destination1",
"url" : "https://asdasdas.com",
"data" : "data1"
},
This makes the final table definition look like this:
CREATE TABLE conversationlogs (
id TEXT,
conversationid TEXT,
uniqueid UUID,
source TEXT,
destination TEXT,
url TEXT,
data TEXT,
PRIMARY KEY (conversationid, source, uniqueid));
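With that layout, reading a whole conversation is a single-partition query. A minimal sketch against the table above (the UUID value is just the one from the question's sample data):
-- Fetch every log entry for one conversation; rows come back ordered by
-- the clustering columns (source, uniqueid).
SELECT source, destination, url, data
  FROM conversationlogs
 WHERE conversationid = 'e9b55229-f20c-4453-9c18-a1f4442eb667';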
You have a few options depending on how you want to query this data.
The first is to stringify the JSON in the logs field, save that to the database, and convert it back to JSON after querying the data.
The second option is similar to the first, but instead of stringifying the array, you store the data as a list in the database.
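For the second option, one way to keep the list structured rather than stringified is a user-defined type inside a frozen list. This is only a sketch under my own naming (the type logentry and table conversations are not from the question) and assumes a Cassandra version with UDT support (2.1+):
-- Hypothetical UDT mirroring one element of the "logs" array:
CREATE TYPE logentry (
    source text,
    destination text,
    url text,
    data text
);

-- One row per conversation, with the whole log array in a single column.
-- Collections of UDTs must be frozen, so the list is rewritten on update.
CREATE TABLE conversations (
    conversationid text PRIMARY KEY,
    logs list<frozen<logentry>>
);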
The third option is to define a new table for the logs, with the conversation as the partition key and clustering keys for each element of the logs. This will allow you to look up by the full key, or query by just the partition key and retrieve all the rows that match.
CREATE TABLE conversationlogs (
conversationid uuid,
logid timeuuid,
...
PRIMARY KEY ((conversationid), logid));
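For illustration, inserting and reading one log entry against that third layout could look like this. The columns beyond the key are my assumption for the elided "..." above (source, destination, url, data):
-- now() generates a timeuuid, so entries within a conversation sort by insertion time.
INSERT INTO conversationlogs (conversationid, logid, source, destination, url, data)
VALUES (e9b55229-f20c-4453-9c18-a1f4442eb667, now(), 'source1', 'destination1', 'https://asdasdas.com', 'data1');

-- All log entries for a conversation, in time order:
SELECT * FROM conversationlogs
 WHERE conversationid = e9b55229-f20c-4453-9c18-a1f4442eb667;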

SELECT rows with primary key of multiple columns

How do I select all relevant records according to the provided list of pairs?
table:
CREATE TABLE "users_groups" (
"user_id" INTEGER NOT NULL,
"group_id" BIGINT NOT NULL,
PRIMARY KEY (user_id, group_id),
"permissions" VARCHAR(255)
);
For example, say I have the following JavaScript array of pairs that I need to fetch from the DB:
[
{user_id: 1, group_id: 19},
{user_id: 1, group_id: 11},
{user_id: 5, group_id: 19}
]
Here we see that the same user_id can be in multiple groups.
I can loop over every array element and build the following query:
SELECT * FROM users_groups
WHERE (user_id = 1 AND group_id = 19)
OR (user_id = 1 AND group_id = 11)
OR (user_id = 5 AND group_id = 19);
But is this the best solution? Let's say the array is very long; as far as I know, query length can reach ~1 GB.
What is the best and quickest way to do this?
Bill Karwin's answer will work for Postgres just as well.
However, in my experience, joining against a VALUES clause is very often faster than a large IN list (with hundreds if not thousands of elements):
select ug.*
from users_groups ug
join (
values (1,19), (1,11), (5,19), ...
) as l(uid, guid) on l.uid = ug.user_id and l.guid = ug.group_id;
This assumes that there are no duplicates in the values provided, otherwise the JOIN would result in duplicated rows, which the IN solution would not do.
You tagged both mysql and postgresql, so I don't know which SQL database you're really using.
MySQL at least supports tuple comparisons:
SELECT * FROM users_groups WHERE (user_id, group_id) IN ((1,19), (1,11), (5,19), ...)
This kind of predicate can be optimized in MySQL 5.7 and later. See https://dev.mysql.com/doc/refman/5.7/en/range-optimization.html#row-constructor-range-optimization
I don't know whether PostgreSQL supports this type of predicate, or if it optimizes it.
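For the PostgreSQL side, the same row-constructor syntax is accepted there as well; whether the planner turns it into an efficient index range scan is version-dependent, so the honest test is an EXPLAIN on the real table. A sketch, using the pairs from the question:
-- Same tuple predicate against users_groups; EXPLAIN shows whether the
-- (user_id, group_id) primary key index is actually used on your version.
EXPLAIN ANALYZE
SELECT * FROM users_groups
 WHERE (user_id, group_id) IN ((1,19), (1,11), (5,19));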

Cassandra create table won't keep column order

I am creating a column family in Cassandra and I expect the column order to match the one I am specifying in the create clause.
This
CREATE TABLE cf.mycf (
timestamp timestamp,
id text,
score int,
type text,
publisher_id text,
embed_url text,
PRIMARY KEY (timestamp, id, score)
) WITH bloom_filter_fp_chance = 0.01
AND comment = ''
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE'
AND caching = {
'keys' : 'ALL',
'rows_per_partition' : 'NONE'
}
AND compression = {
'chunk_length_kb' : 64,
'crc_check_chance' : 1.0,
'sstable_compression' : 'LZ4Compressor'
}
AND compaction = {
'base_time_seconds' : 60,
'class' : 'DateTieredCompactionStrategy',
'enabled' : true,
'max_sstable_age_days' : 365,
'max_threshold' : 32,
'min_threshold' : 4,
'timestamp_resolution' : 'MICROSECONDS',
'tombstone_compaction_interval' : 86400,
'tombstone_threshold' : 0.2,
'unchecked_tombstone_compaction' : false
};
Should create a table like :
timestamp, id, score, type, publisher_id, embed_url
Instead I am getting this:
timestamp timestamp,
id text,
score int,
embed_url text,
publisher_id text,
type text,
I've created quite a few tables in the same way and this never happened so any help would be appreciated.
I put id and score as keys to show that these keep their respective positions, while the actual schema I am looking for has only the timestamp as the primary key.
It looks like there is no such thing as field order in Cassandra.
The other (non-key) columns are displayed in alphabetical order by Cassandra.
http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
You should make a clear distinction between how you want the data to be presented and how it is actually presented to you. Moreover, you should not rely on the ordinal position of the fields, only on their names.
In order to be efficient, and regardless of the order you declared (you specified an order for the columns when you modeled your schema), Cassandra needs to store the columns in a particular order, and for simplicity this is reflected in how it (the CQL interface or the driver) gives your data back.
I suggest you take a deeper look at how Cassandra stores data (column names included!) in Understanding How CQL3 Maps to Cassandra's Internal Data Structure.
By the way, if you absolutely need to keep your own order at the application level (and don't want to spell out all the fields in the SELECT instead of using SELECT *), you need to create an abstraction on your own, something like an ordered "field names" array (your order):
String myorder[] = { "timestamp", "id", "score", "type", "publisher_id", "embed_url"};
and then use this as a map in loops using ordinal values.
Keep in mind that the rendering of the CQL string in DESCRIBE in cqlsh is just a function call in the python driver iterating over the metadata. It has nothing to do with how C* stores or sends its results.
If it matters, you can control the order yourself. When you INSERT, you can list the columns explicitly:
INSERT INTO keyspace_name.table_name
( identifier, column_name, whatever, order)
VALUES ( value, value ... )
When you do a SELECT, you can define the order explicitly:
SELECT identifier, whatever, order, column_name FROM keyspace_name.table_name
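Applied to the table from the question, that simply means naming the columns in the order you want them back (a sketch, nothing more):
-- Returns columns in exactly this order, regardless of Cassandra's internal ordering.
SELECT timestamp, id, score, type, publisher_id, embed_url
  FROM cf.mycf;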

MongoDB Data Structure

I'm a bit of a noob with MongoDB, so I would appreciate some help figuring out the best solution/format/structure for storing some data.
Basically, the data that will be stored will be updated every second with a name, value and timestamp for a certain meter reading.
For example, one possibility is water level and temperature in a tank. The tank will have a name, and then the level and temperature will be read and stored every second. Overall, there will be hundreds of items (i.e. tanks), each with millions of timestamped values.
From what I've learnt so far (and please correct me if I'm wrong), there are a few options for how to structure the data:
A slightly RDBMS-like approach:
This would consist of two collections, Items and Values
Items : {
_id : "id",
name : "name"
}
Values : {
_id : "id",
item_id : "item_id",
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}
The more document-DB, denormalized method:
This method involves one collection of items each with an array of timestamped values
Items : {
_id : "id",
name : "name",
values : [{
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}]
}
A collection for each item
Save all the values in a collection named after that item.
ItemName : {
_id : "id",
name : "name", // temp or level etc
value : "value",
timestamp : "timestamp"
}
The majority of read queries will be to retrieve the timestamped values for a specified time period of an item (i.e. tank) and display in a graph. And for this, the first option makes more sense to me as I don't want to retrieve the millions of values when querying for a specific item.
Is it even possible to query for values between specific timestamps for option 2?
I will also need to query for a list of items, so maybe a combination of the first and third option with a collection for all the items and then a number of collections to store the values for each of those items?
Any feedback on this is greatly appreciated.
Don't store a separate timestamp field if you are not replacing the default ObjectId: the ObjectId itself already contains a timestamp, so you will save a lot of space by reusing it.
MongoDB Id Documentation
If you don't need the previous data, you can use an update query in MongoDB to update the fields every second instead of storing new documents.
If you do want to keep every reading, then instead of updating, store each one in a flat structure:
{ "_id" : ObjectId("XXXXXX"),
"name" : "ItemName",
"value" : "ValueOfItem"
"created_at" : "timestamp"
}
Edit 1: Added timestamp as per the comments

Cassandra CQL3 Composite keys return duplicate values

I am new to CQL & composite keys (I previously used CLI)
I am looking to implement my old super-column-family with composite keys instead.
In short, my look-up model is:
blocks[file_id][position][block_id]=size
I have the following CQL table with composite keys:
CREATE TABLE blocks (
file_id text,
start_position bigint,
block_id text,
size bigint,
PRIMARY KEY (file_id, start_position,block_id)
);
I insert these sample values:
/*Example insertions*/
INSERT INTO blocks (file_id, start_position, block_id,size) VALUES ('test_schema_file', 0, 'testblock1', 500);
INSERT INTO blocks (file_id, start_position, block_id,size) VALUES ('test_schema_file', 500, '2testblock2', 501);
I query using this Astyanax code:
OperationResult result = m_keyspace.prepareQuery(m_BlocksTable).getKey(file).execute();
ColumnList<BlockKey> columns = (ColumnList<BlockKey>) result.getResult();
for (Column<BlockKey> column : columns) {
System.out.println(StaticUtils.fieldsToString(column.getName()));
try{
long value=column.getLongValue();
System.out.println(value);
}catch(Exception e){
System.out.println("Can't get size");
}
}
When I iterate over the result, I get two entries for each row I inserted: one that contains a "size" component and one where the "size" component doesn't exist.
recorder.data.models.BlockKey Object {
m_StartPosition: 0
m_BlockId: testblock1
m_Extra: null
}
Can't get size
recorder.data.models.BlockKey Object {
m_StartPosition: 0
m_BlockId: testblock1
m_Extra: size
}
500
recorder.data.models.BlockKey Object {
m_StartPosition: 500
m_BlockId: 2testblock2
m_Extra: null
}
Can't get size
recorder.data.models.BlockKey Object {
m_StartPosition: 500
m_BlockId: 2testblock2
m_Extra: size
}
501
So I have two questions:
Theoretically I do not need a size column; it should be the value of the composite key: blocks[file_id][position][block_id]=size instead of blocks[file_id][position][block_id]['size'] = size. How do I correctly insert this data in CQL3 without creating the redundant size column?
Why am I getting the extra column without 'size', if I never inserted such a row?
The 'duplicates' are because, with CQL, there are extra thrift columns inserted to store extra metadata. With your example, from cassandra-cli you can see what's going on:
[default@ks1] list blocks;
-------------------
RowKey: test_schema_file
=> (column=0:testblock1:, value=, timestamp=1373966136246000)
=> (column=0:testblock1:size, value=00000000000001f4, timestamp=1373966136246000)
=> (column=500:2testblock2:, value=, timestamp=1373966136756000)
=> (column=500:2testblock2:size, value=00000000000001f5, timestamp=1373966136756000)
If you insert data with CQL, you should query with CQL too. You can do this with Astyanax by using m_keyspace.prepareCqlStatement().withCql("SELECT * FROM blocks").execute();.
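On the CQL side, the same data then comes back as one row per (file_id, start_position, block_id) with size as a regular column, with no metadata "duplicates". A sketch using the sample inserts from the question:
-- One logical row per inserted block; the empty thrift marker column is hidden.
SELECT file_id, start_position, block_id, size
  FROM blocks
 WHERE file_id = 'test_schema_file';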
