Cassandra CQL3 composite keys return duplicate values

I am new to CQL and composite keys (I previously used the CLI), and I am looking to reimplement my old super column family with composite keys instead.
In short, my lookup model is:
blocks[file_id][position][block_id] = size
I have the following CQL table with a composite primary key:
CREATE TABLE blocks (
    file_id text,
    start_position bigint,
    block_id text,
    size bigint,
    PRIMARY KEY (file_id, start_position, block_id)
);
I insert these sample values:
/* Example insertions */
INSERT INTO blocks (file_id, start_position, block_id, size) VALUES ('test_schema_file', 0, 'testblock1', 500);
INSERT INTO blocks (file_id, start_position, block_id, size) VALUES ('test_schema_file', 500, '2testblock2', 501);
I query using this Astyanax code:
OperationResult<ColumnList<BlockKey>> result =
        m_keyspace.prepareQuery(m_BlocksTable).getKey(file).execute();
ColumnList<BlockKey> columns = result.getResult();
for (Column<BlockKey> column : columns) {
    System.out.println(StaticUtils.fieldsToString(column.getName()));
    try {
        long value = column.getLongValue();
        System.out.println(value);
    } catch (Exception e) {
        System.out.println("Can't get size");
    }
}
When I iterate over the result, I get two entries for each row I inserted: one whose last component is "size" (and carries the value), and one where that component is null.
recorder.data.models.BlockKey Object {
m_StartPosition: 0
m_BlockId: testblock1
m_Extra: null
}
Can't get size
recorder.data.models.BlockKey Object {
m_StartPosition: 0
m_BlockId: testblock1
m_Extra: size
}
500
recorder.data.models.BlockKey Object {
m_StartPosition: 500
m_BlockId: 2testblock2
m_Extra: null
}
Can't get size
recorder.data.models.BlockKey Object {
m_StartPosition: 500
m_BlockId: 2testblock2
m_Extra: size
}
501
So I have two questions:
Theoretically I do not need a size column; it should be the value of the composite key: blocks[file_id][position][block_id] = size instead of blocks[file_id][position][block_id]['size'] = size. How do I correctly insert this data in CQL3 without creating the redundant size column?
Why am I getting the extra column without 'size' if I never inserted such a row?

The 'duplicates' appear because CQL3 inserts an extra, empty "marker" column for each row to hold row-level metadata. With your example, you can see what's going on from cassandra-cli:
[default@ks1] list blocks;
-------------------
RowKey: test_schema_file
=> (column=0:testblock1:, value=, timestamp=1373966136246000)
=> (column=0:testblock1:size, value=00000000000001f4, timestamp=1373966136246000)
=> (column=500:2testblock2:, value=, timestamp=1373966136756000)
=> (column=500:2testblock2:size, value=00000000000001f5, timestamp=1373966136756000)
If you insert data with CQL, you should query it with CQL too. You can do this with Astyanax by using m_keyspace.prepareCqlStatement().withCql("SELECT * FROM blocks").execute().
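Astyanax also exposes CQL through prepareQuery(...).withCql(...). A rough, untested sketch of reading the table that way (it defines a second, plain ColumnFamily<String, String> view of the table for the CQL query, since the composite-typed one isn't needed there):

ColumnFamily<String, String> BLOCKS_CQL = ColumnFamily.newColumnFamily(
        "blocks", StringSerializer.get(), StringSerializer.get());

OperationResult<CqlResult<String, String>> cqlResult = m_keyspace
        .prepareQuery(BLOCKS_CQL)
        .withCql("SELECT file_id, start_position, block_id, size FROM blocks WHERE file_id = 'test_schema_file'")
        .execute();
for (Row<String, String> row : cqlResult.getResult().getRows()) {
    ColumnList<String> cols = row.getColumns();
    System.out.println(cols.getLongValue("size", 0L)); // one entry per logical row
}

And on the first question: if you really do want the blocks[file_id][position][block_id] = size layout with no named size column, the usual CQL3 answer is COMPACT STORAGE, which maps the single non-key column directly onto the thrift cell value:

CREATE TABLE blocks (
    file_id text,
    start_position bigint,
    block_id text,
    size bigint,
    PRIMARY KEY (file_id, start_position, block_id)
) WITH COMPACT STORAGE;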

Related

Reading guarantees for full table scan while updating the table?

Given schema:
CREATE TABLE keyspace.table (
    key text,
    ckey text,
    value text,
    PRIMARY KEY (key, ckey)
)
...and Spark pseudocode:
val sc: SparkContext = ...
val connector: CassandraConnector = ...
sc.cassandraTable("keyspace", "table")
  .mapPartitions { partition =>
    connector.withSessionDo { session =>
      partition.foreach { row =>
        val key = row.getString("key")
        val ckey = Random.nextString(42)
        val value = row.getString("value")
        session.execute(s"INSERT INTO keyspace.table (key, ckey, value)" +
          s" VALUES ('$key', '$ckey', '$value')")
      }
    }
    Iterator.empty // mapPartitions expects an iterator back
  }
Is it possible for code like this to read a value it has just inserted, within a single application (Spark job) run? A more generalized version of my question: can a token-range-scan CQL query read newly inserted values while iterating over rows?
Yes, it is possible in general, exactly as Alex wrote, but I don't think it can happen with the code above.
Per the data model, the table is ordered by ckey in ascending order within each partition.
The interesting part, however, is the page size and how many pages are prefetched: since the default is 1000 rows (spark.cassandra.input.fetch.sizeInRows), a problem could only occur if you used something much bigger than 42 and/or the executor had not paged yet.
Also, I think the nesting is unnecessary; the code to achieve what you want can be simplified (after all, cassandraTable gives you an RDD), as in the sketch below.
(I hope I've understood you correctly: you want to read each partition (note that a partition in your case is all rows under one partition key, "key"), and for every row in it (distinguished by ckey) generate a new row, with a fresh ckey, that just duplicates the value. The use case for such code is a mystery to me, but I hope it makes sense. :-))
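A sketch of that simplification using the standard spark-cassandra-connector API (untested; it assumes the same keyspace/table names as above):

import com.datastax.spark.connector._
import scala.util.Random

// Read each row once, emit a copy with a fresh random ckey, and let the
// connector handle sessions and batching instead of nesting withSessionDo.
val copies = sc.cassandraTable[(String, String, String)]("keyspace", "table")
  .select("key", "ckey", "value")
  .map { case (key, _, value) => (key, Random.nextString(42), value) }

copies.saveToCassandra("keyspace", "table", SomeColumns("key", "ckey", "value"))

This still reads and writes the same table within one job, so the original visibility question applies to it just the same.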

Cassandra: If a field inside a UDT is set to null, does this create a tombstone?

Please look at the following example:
Insert
INSERT INTO my_keyspace.my_table (id, name, my_info) VALUES (
3464546,
'Sumit',
{ birthday : '1990-01-01', height : '6.2 feet', weight : '74 kg' }
);
Second Insert
INSERT INTO my_keyspace.my_table (id, name, my_info) VALUES (
3464546,
'Sumit',
{ birthday : '1990-01-01', height : '6.2 feet', weight : null }
);
Consider "id" as the Primary Key.
In the second insert "weight" attribute inside "my_info" UDT is set as null. Does this create a tombstone? How null inside an UDT is stored in the Cassandra database?
Yes, setting a column to null is the same as writing a tombstone in some cases. Broadly: for a frozen UDT the whole value is serialized as a single cell, so a null field is just encoded inside that cell and no per-field tombstone is written; for a non-frozen UDT (possible since Cassandra 3.6) each field is its own cell, and explicitly writing null to a field does create a tombstone for that cell.
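For reference, a minimal schema matching the inserts might look like this (the type name and field types are assumptions, since the question doesn't show the definition):

CREATE TYPE my_keyspace.my_info_type (
    birthday date,
    height text,
    weight text
);

CREATE TABLE my_keyspace.my_table (
    id bigint PRIMARY KEY,
    name text,
    my_info frozen<my_info_type>
);

With frozen, as here, the second insert simply overwrites the single my_info cell, and the null weight is just encoded inside the new cell. Declared non-frozen instead (my_info my_info_type, Cassandra 3.6+), each field is its own cell and the null weight becomes a cell tombstone.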

Alter a Column with migration file?

Using Orchard 1.6: in a migration file I have just altered a table and added a column. I need this column to be NOT NULL, but it doesn't allow you to alter a table and add a NOT NULL column, so I've used Nullable and entered data into the existing rows.
I then want to edit this column and change it to a Nullable, but am unsure how…
public int UpdateFrom37()
{
    SchemaBuilder.AlterTable("ManufacturedProductOrders", table => table
        .AddColumn<DateTime>("DateOrdered", c => c.Nullable())
    );
    return 38;
}

public int UpdateFrom38()
{
    SchemaBuilder.AlterTable("ManufacturedProductOrders", table => table
        .AlterColumn("DateOrdered", c => c.WithType(dbType.???????????
    );
}
I guess you want to change from NULL to NOT NULL, right? The code above clearly states that you already have a nullable column.
The AlterColumn command does not currently allow changing a column's nullability.
Your best option is to issue a manual ALTER TABLE command through SchemaBuilder.ExecuteSql(), or directly in the database, as sketched below.
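A sketch of what that manual step could look like (the prefixed table name and the SQL Server dialect are assumptions; check the table name Orchard actually generated for your module):

public int UpdateFrom38()
{
    // AlterColumn can't change nullability, so fall back to raw SQL.
    // "MyModule_ManufacturedProductOrders" is a guess at the prefixed name.
    SchemaBuilder.ExecuteSql(
        "ALTER TABLE MyModule_ManufacturedProductOrders " +
        "ALTER COLUMN DateOrdered DATETIME NOT NULL");
    return 39;
}

Make sure every existing row has a DateOrdered value first, or the ALTER will fail.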

Cassandra Hector Client: Is a RangeSlicesQuery on Composite Row Keys possible when using Random Partitioning?

Is there any way to range query rows with a composite row key when using random partitioning?
I'm working with column families created via CQL 3 like this:
CREATE TABLE products (
    rowkey CompositeType(UTF8Type,UTF8Type,UTF8Type,UTF8Type) PRIMARY KEY,
    prod_id varchar,
    class_id varchar,
    date varchar
);
The data in the table looks like this:
RowKey: 6:3:2:19
=> (column=class_id, value=254, timestamp=1346800102625002)
=> (column=date, value=2034, timestamp=1346800102625000)
=> (column=prod_id, value=1922, timestamp=1346800102625001)
-------------------
RowKey: 0:14:1:16
=> (column=class_id, value=144, timestamp=1346797896819002)
=> (column=date, value=234, timestamp=1346797896819000)
=> (column=prod_id, value=4322, timestamp=1346797896819001)
-------------------
I’m trying to find a way to range query over these composite row keys, analogous to how we slice-query over composite columns. The following approach sometimes actually succeeds in returning something useful, depending on the start and stop keys I choose.
Composite startKey = new Composite();
startKey.addComponent(0, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(1, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(2, "3", Composite.ComponentEquality.EQUAL);
startKey.addComponent(3, "3", Composite.ComponentEquality.EQUAL);
Composite stopKey = new Composite();
stopKey.addComponent(0, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(1, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(2, "6", Composite.ComponentEquality.EQUAL);
stopKey.addComponent(3, "6" , Composite.ComponentEquality.GREATER_THAN_EQUAL);
RangeSlicesQuery<Composite, String, String> rangeSlicesQuery =
        HFactory.createRangeSlicesQuery(keyspace, CompositeSerializer.get(),
                StringSerializer.get(), StringSerializer.get());
rangeSlicesQuery.setColumnFamily(columnFamilyName);
rangeSlicesQuery.setKeys(startKey, stopKey);
rangeSlicesQuery.setRange("", "", false, 3);
Most of the time the database returns this:
InvalidRequestException(why:start key's md5 sorts after end key's md5.
this is not allowed; you probably should not specify end key at all,
under RandomPartitioner)
Does somebody have an idea if something like this can be achieved WITHOUT using the order preserving partitioner? Do I have to build a custom row key index for this use case?
Thanks a lot!
Additional information:
What I’m trying to do is storing sales transaction data in a table which uses both composite row keys to encode date/time/place and composite columns to store information about the sold items:
The set of items per transaction varies in size and includes information about size, color and quantity of every item:
{ ... items :
[ { item_id : 43523 , size : 050 , color : 123 , qty : 1 } ,
{ item_id : 64233 , size : 048 , color : 834 , qty : 1 } ,
{ item_id : 23984 , size : 000 , color : 341 , qty : 3 } ,
… ] }
There’s also information about where and when the transaction happened including a unique transaction id:
{ trx_id : 23324827346, store_id : 8934 , date : 20110303 , time : 0947 , …
My initial approach was putting every item in a separate row and let the application group items back together by transaction id. That’s working fine. But now I’m trying to leverage the structuring capabilities of composite columns to persist the nested item data within a representation (per item) like this:
item_id:’size’ = <value> ; item_id:’color’ = <value> ; item_id:’qty’ = <value> ; …
43523:size = 050 ; 43523:color = 123 ; 43523:qty = 1 ; …
The rest of the data would be encoded in a composite row key like this:
date : time : store_id : trx_id
20110303 : 0947 : 001 : 23324827346
I need to be able to run queries like: All items which were sold between the dates 20110301 and 20110310 between times 1200 and 1400 in stores 25 - 50. What I achieved so far with composite columns was using one wide row per store and putting all the rest of the data into 3 different composite columns per item:
date:time:<type>:prod_id:transaction_id = <value> ; …
20110303:0947:size:43523:23324827346 = 050 ;
20110303:0947:color:43523:23324827346 = 123 ;
20110303:0947:qty:43523:23324827346 = 1 ;
It’s working, but it doesn’t really look highly efficient.
Is there any other alternative?
You're creating one row per partition, so it should be clear that RandomPartitioner will not give you ordered range queries across those rows.
You can do ordered ranges within a partition, which is very common; see e.g. http://rubyscale.com/blog/2011/03/06/basic-time-series-with-cassandra/ and http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra. A sketch of that approach follows.
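To make that concrete, one possible remodel along those lines (a sketch with assumed names, shown in CQL3 for brevity): make store_id the partition key, so the random partitioner only hashes stores, and let clustering order handle the date/time ranges inside each store's partition:

CREATE TABLE sales (
    store_id int,
    date text,      -- e.g. '20110303'
    time text,      -- e.g. '0947'
    trx_id bigint,
    item_id bigint,
    size text,
    color text,
    qty int,
    PRIMARY KEY (store_id, date, time, trx_id, item_id)
);

-- ordered range within one store's partition:
SELECT * FROM sales WHERE store_id = 25 AND date >= '20110301' AND date <= '20110310';

The store dimension (25-50) can't be a range under RandomPartitioner, so you'd issue one such query per store (they can run in parallel), and the time-of-day filter (1200-1400) has to be applied client-side, since only the first unrestricted clustering column can take a range in CQL3.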

Can't create table with composite key using Astyanax client

How do I create a table with composite keys using the Astyanax client? For now I've created it with cqlsh -3, and this is how it looks in cassandra-cli:
[default@KS] describe my_cf;
ColumnFamily: my_cf
Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.TimeUUIDType,org.apache.cassandra.db.marshal.UTF8Type)
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 0.1
DC Local Read repair chance: 0.0
Replicate on write: true
Caching: KEYS_ONLY
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
This is how I would expect it to be in cqlsh:
CREATE TABLE my_cf (
... key text,
... timeid timeuuid,
... flag boolean,
... data text,
... PRIMARY KEY (key, timeid));
I got it working, but the composite key is stored as a blob, which is a problem.
My code:
public class MyKey {
    @Component(ordinal = 0)
    private String key;
    @Component(ordinal = 1)
    private UUID timeid;
    // ...
}
CF
public static ColumnFamily<MyKey, String> MY_CF = ColumnFamily
        .newColumnFamily("my_cf",
                new AnnotatedCompositeSerializer<MyKey>(MyKey.class),
                StringSerializer.get());
KS
ksDef = cluster.makeKeyspaceDefinition();
ksDef.setName(keyspaceName)
        .setStrategyOptions(keyspaceOptions)
        .setStrategyClass("SimpleStrategy")
        .addColumnFamily(
                cluster.makeColumnFamilyDefinition()
                        .setName(MY_CF.getName())
                        .setComparatorType("UTF8Type")
                        .setDefaultValidationClass("UTF8Type")
                        // blob if no key validation class is specified, and something
                        // that looks like a string if I use this:
                        // .setKeyValidationClass("CompositeType(UTF8Type, TimeUUIDType)")
                        // either way there's a single column per composite key
                        .addColumnDefinition(
                                cluster.makeColumnDefinition()
                                        .setName("flag")
                                        .setValidationClass("BooleanType"))
                        .addColumnDefinition(
                                cluster.makeColumnDefinition()
                                        .setName("data")
                                        .setValidationClass("UTF8Type")));
cluster.addKeyspace(ksDef);
mutation
MutationBatch m = ks.prepareMutationBatch();
for (char keyName = 'A'; keyName <= 'C'; keyName++) {
    MyKey myKey = new MyKey("THEKEY_" + keyName, TimeUUIDUtils.getUniqueTimeUUIDinMillis());
    ColumnListMutation<String> cfm = m.withRow(MY_CF, myKey);
    cfm.putColumn("flag", true, null);
    cfm.putColumn("data", "DATA_" + keyName, null);
}
m.execute();
cqlsh:KS> describe columnfamily my_cf;
CREATE TABLE my_cf (
    KEY blob PRIMARY KEY,
    flag boolean,
    data text
) WITH ...

cqlsh:KS> select * from my_cf;
 key                                                      | flag | data
----------------------------------------------------------+------+--------
 00064953494e5f420000109f4513d0e3ac11e19c400022191ad62b00 | True | DATA_B
cqlsh:KS> select * from my_cf where key = 'THEKEY_B' order by timeid desc;
Bad Request: Order by on unknown column timeid
Doesn't it look right in cassandra-cli below? Why doesn't it work in cqlsh?
[default@KS] list my_cf;
RowKey: THEKEY_B:09f29941-e3c2-11e1-a7ef-0022191ad62b
=> (column=active, value=true, timestamp=1344695832788000)
=> (column=data, value=DATA_B, timestamp=1344695832788000)
What am I doing wrong?
(astyanax 1.0.6, cassandra 1.1.2)
cqlsh>[cqlsh 2.2.0 | Cassandra 1.1.2 | CQL spec 3.0.0 | Thrift protocol 19.32.0]
From what I've been able to figure out, composite primary keys represent a major divergence in the protocol and interface to Cassandra, and the protocol you use controls the features you have access to.
For instance, Astyanax and Hector are primarily thrift-protocol clients, while CQL, more than just a language, is (or will be?) a binary protocol. The two protocols are not equivalent, and CQL3 with composite primary keys makes things very different.
The thing to understand about "TABLES" with composite primary keys is that they essentially translate into wide rows with composite column names. The first part of the primary key is the row key, and the remaining parts are used as a prefix along with the TABLE-column name as the column name in the wide row.
In your instance, the row key is "key" and the column prefix is "timeid", so the flag field of what you are inserting is actually a column named <timeid>:flag, data is <timeid>:data, and so on.
In order for this to work, the CQL protocol interface to Cassandra converts "TABLES" into wide rows and transparently handles all of that column naming. The thrift interface doesn't take care of this, and when you do a mutation it just writes columns the way it always has, without the virtual addressing.
So, in fact, the results do not look right in your cassandra-cli. If you do an insert from cqlsh -3, here is what it should look like from the cassandra-cli point of view (with a simple text date):
[default@testapp] list my_cf;
RowKey: mykey
=> (column=20120827:data, value=some data, timestamp=1346090889361000)
=> (column=20120827:flag, value=, timestamp=1346090889361001)
CQL3 and tables look really attractive, but there are some trade-offs to be made and there doesn't seem to be solid java client support yet.
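That said, if you want to keep writing through thrift/Astyanax while matching the CQL3 layout, the composite belongs in the column name rather than the row key. A sketch of what the explanation above implies (Astyanax 1.x style; constructors and getters omitted, and the CQL3 row marker still won't be written, so treat this as fragile):

// Thrift view of CQL3's PRIMARY KEY (key, timeid): the row key is the plain
// text "key", and each logical row becomes columns named (timeid, "flag")
// and (timeid, "data").
public class ColumnName {
    @Component(ordinal = 0)
    UUID timeid;
    @Component(ordinal = 1)
    String field;
}

public static final ColumnFamily<String, ColumnName> MY_CF =
        ColumnFamily.newColumnFamily("my_cf",
                StringSerializer.get(), // plain-text row key
                new AnnotatedCompositeSerializer<ColumnName>(ColumnName.class));

The column family definition would then use setKeyValidationClass("UTF8Type") and setComparatorType("CompositeType(TimeUUIDType,UTF8Type)"), the inverse of the arrangement in the question.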
