How do I create a table with composite keys using the Astyanax client? For now I've created it with cqlsh -3, and this is what it looks like in the CLI:
[default@KS] describe my_cf;
ColumnFamily: my_cf
Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.TimeUUIDType,org.apache.cassandra.db.marshal.UTF8Type)
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 0.1
DC Local Read repair chance: 0.0
Replicate on write: true
Caching: KEYS_ONLY
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
This is how I would expect it to be in cqlsh:
CREATE TABLE my_cf (
... key text,
... timeid timeuuid,
... flag boolean,
... data text,
... PRIMARY KEY (key, timeid));
I got it working, but the composite key is stored as a blob, which is a problem.
my code
public class MyKey {
@Component(ordinal=0)
private String key;
@Component(ordinal=1)
private UUID timeid;
//...
}
CF
public static ColumnFamily<MyKey, String> MY_CF = ColumnFamily
.newColumnFamily("my_cf",
new AnnotatedCompositeSerializer<MyKey>(MyKey.class),
StringSerializer.get());
KS
ksDef = cluster.makeKeyspaceDefinition();
ksDef.setName(keyspaceName)
.setStrategyOptions(keyspaceOptions)
.setStrategyClass("SimpleStrategy")
.addColumnFamily(
cluster.makeColumnFamilyDefinition()
.setName(MY_CF.getName())
.setComparatorType("UTF8Type")
.setDefaultValidationClass("UTF8Type")
// blob if no key validation class specified
// and something looking as a string if I use this: .setKeyValidationClass("CompositeType(UTF8Type, TimeUUIDType)")
// anyway there's a single column per composite key
.addColumnDefinition(
cluster.makeColumnDefinition()
.setName("flag")
.setValidationClass(
"BooleanType"))
.addColumnDefinition(
cluster.makeColumnDefinition()
.setName("data")
.setValidationClass(
"UTF8Type")));
cluster.addKeyspace(ksDef);
mutation
MutationBatch m = ks.prepareMutationBatch();
for (char keyName = 'A'; keyName <= 'C'; keyName++) {
MyKey myKey = new MyKey("THEKEY_" + keyName, TimeUUIDUtils.getUniqueTimeUUIDinMillis());
ColumnListMutation<String> cfm = m.withRow(MY_CF, myKey);
cfm.putColumn("flag", true, null);
cfm.putColumn("data", "DATA_" + keyName, null);
}
m.execute();
cqlsh:KS>describe columnfamily my_cf;
CREATE TABLE my_cf (
KEY blob PRIMARY KEY,
flag boolean,
data text
) WITH ...
cqlsh:KS>select * from my_cf;
key | flag | data
----------------------------------------------------------+--------+---------
00064953494e5f420000109f4513d0e3ac11e19c400022191ad62b00 | True | DATA_B
cqlsh:KS> select * from my_cf where key = 'THEKEY_B' order by timeid desc;
Bad Request: Order by on unknown column timeid
Doesn't it look right in cassandra-cli below? Why doesn't it work in cqlsh?
cassandra-cli] list my_cf;
RowKey: THEKEY_B:09f29941-e3c2-11e1-a7ef-0022191ad62b
=> (column=active, value=true, timestamp=1344695832788000)
=> (column=data, value=DATA_B, timestamp=1344695832788000)
What am I doing wrong?
(astyanax 1.0.6, cassandra 1.1.2)
cqlsh>[cqlsh 2.2.0 | Cassandra 1.1.2 | CQL spec 3.0.0 | Thrift protocol 19.32.0]
From what I've been able to figure out, composite primary keys
represent a major divergence in the protocol and interface to
cassandra and the protocol you use controls the features you have
access to.
For instance, astyanax and hector are primarily thrift protocol
clients, while CQL, more than just a language, is (or will be?) a binary protocol.
The two protocols are not equivalent and CQL3 with composite primary
keys makes things very different.
The thing to understand about "TABLES" with composite primary keys is that they
essentially translate into wide rows with composite column names. The
first part of the primary key is the row key and the remaining parts
are used as a prefix along with the TABLE-column name as the column name in
the wide row.
In your instance, the row key is "key" and the column prefix is
"timeid", so the flag field of what you are inserting is actually a
column named <timeid>:flag, data is <timeid>:data, and so
on.
In order for this to work, the CQL protocol interface to cassandra is
converting "TABLES" into wide rows and transparently handling all of
that column naming.
The thrift interface doesn't take care of this stuff, and when you
do a mutation, it just writes columns like it is used to, without the
virtual addressing.
So, in fact, the results do not look right in your cassandra-cli. If you do an insert from cqlsh -3, here is what it should look like from the cassandra-cli point of view (with a simple text date):
[default@testapp] list my_cf;
RowKey: mykey
=> (column=20120827:data, value=some data, timestamp=1346090889361000)
=> (column=20120827:flag, value=, timestamp=1346090889361001)
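The mapping described above can be modelled in a few lines of plain Python. This is only a simplified sketch of the idea, not how Cassandra or any client implements it: the helper name to_wide_rows and the sample rows are made up for illustration, and real storage carries details (timestamps, serialized types) that are left out here.

```python
from collections import defaultdict


def to_wide_rows(cql_rows, partition_key, clustering_key):
    """Model how CQL3 rows turn into cells of a wide storage row.

    Hypothetical helper for illustration only: the first primary key
    part becomes the row key, and the clustering value is used as a
    prefix of every stored column name.
    """
    storage = defaultdict(dict)
    for row in cql_rows:
        row_key = row[partition_key]
        prefix = row[clustering_key]
        for name, value in row.items():
            if name in (partition_key, clustering_key):
                continue
            # Composite column name: (clustering value, CQL column name)
            storage[row_key][(prefix, name)] = value
    return dict(storage)


# Sample data is invented; it mirrors the "simple text date" example.
rows = [
    {"key": "mykey", "timeid": "20120827", "flag": True, "data": "some data"},
    {"key": "mykey", "timeid": "20120828", "flag": False, "data": "more data"},
]
wide = to_wide_rows(rows, "key", "timeid")
for column, value in sorted(wide["mykey"].items()):
    print(":".join(column), "=", value)
```

Both CQL rows end up in the single storage row "mykey", which is why a Thrift client that writes plain "flag" and "data" columns produces something cqlsh cannot interpret as the table.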
CQL3 and tables look really attractive, but there are some trade-offs to be made and there doesn't seem to be solid java client support yet.
Is there any way to query a SET type (or MAP/LIST) to find out whether it contains a value?
Something like this:
CREATE TABLE test.table_name(
id text,
ckk SET<INT>,
PRIMARY KEY((id))
);
Select * FROM table_name WHERE id = 1 AND ckk CONTAINS 4;
Is there any way to reach this query with YCQL api?
And can we use a SET type in a secondary index?
Is there any way to reach this query with YCQL api?
YCQL does not support the CONTAINS keyword yet (feel free to open an issue for this on the YugabyteDB GitHub).
One workaround can be to use MAP<INT, BOOLEAN> instead of SET<INT> and the [] operator.
For instance:
CREATE TABLE test.table_name(
id text,
ckk MAP<int, boolean>,
PRIMARY KEY((id))
);
SELECT * FROM table_name WHERE id = 'foo' AND ckk[4] = true;
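To make the workaround concrete, here is a minimal Python model of the same idea: each set element becomes a map key whose value is true, so a membership test turns into a keyed lookup. The helper names and sample rows are hypothetical, and nothing here talks to YCQL.

```python
def set_to_membership_map(values):
    """Turn a set of ints into the MAP<int, boolean> shape used above."""
    return {value: True for value in values}


def matches(row, row_id, element):
    """Emulate: WHERE id = row_id AND ckk[element] = true."""
    return row["id"] == row_id and row["ckk"].get(element) is True


# Invented sample rows standing in for the table contents.
rows = [
    {"id": "foo", "ckk": set_to_membership_map({1, 4, 9})},
    {"id": "bar", "ckk": set_to_membership_map({2, 3})},
]
hits = [row["id"] for row in rows if matches(row, "foo", 4)]
misses = [row["id"] for row in rows if matches(row, "bar", 4)]
print(hits, misses)
```

The cost of this design is that the application has to write the map form on every insert or update instead of a plain set.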
And can we use a SET type in a secondary index?
Generally, collection types cannot be part of the primary key, or an index key.
However, "frozen" collections (i.e. collections serialized into a single value internally) can actually be part of either primary key or index key.
For instance:
CREATE TABLE table2(
id TEXT,
ckk FROZEN<SET<INT>>,
PRIMARY KEY((id))
) WITH transactions = {'enabled' : true};
CREATE INDEX table2_idx on table2(ckk);
Another option is to use a compound primary key and define ckk as a clustering key:
cqlsh> CREATE TABLE ybdemo.tt(id TEXT, ckk INT, PRIMARY KEY ((id), ckk)) WITH CLUSTERING ORDER BY (ckk DESC);
cqlsh> SELECT * FROM ybdemo.tt WHERE id='foo' AND ckk=4;
I am dealing with a puzzling behaviour when doing SELECTs on Cassandra 2.2.3. I have 4 nodes in the ring, and I create the following keyspace, table and index.
CREATE KEYSPACE IF NOT EXISTS my_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE my_keyspace.my_table (
id text,
some_text text,
code text,
some_set set<int>,
a_float float,
name text,
type int,
a_double double,
another_set set<int>,
another_float float,
yet_another_set set<text>,
PRIMARY KEY (id, some_text, code)
) WITH read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
CREATE INDEX idx_my_table_code ON my_keyspace.my_table (code);
Then I insert some rows on the table. Some of them have empty sets. I perform this query through the default CQL client and get the row I am expecting:
SELECT * FROM my_table WHERE code = 'test';
Then I run some tests which are outside my control. I don't know what they do but I expect they read and possibly insert/update/delete some rows. I'm sure they don't delete or change any of the settings in the index, table or keyspace.
After the tests, I log in again through the default CQL client and run the following queries.
SELECT * FROM my_table WHERE code = 'test';
SELECT * FROM my_table;
SELECT * FROM my_table WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
The first one doesn't return anything.
The second one returns all the rows, including the one with code = 'test'.
The third one returns the expected row that the first query couldn't retrieve.
The only difference that I can see between this row and others is that it is one of the rows which contains some empty sets, as explained earlier. If I query for another of the rows that also contain some empty sets, I get the same behavior.
I would say the problem is related to the secondary index. Somehow, the operations performed during the tests leave the index in a state where it cannot see certain rows.
I'm obviously missing something. Do you have any ideas about what could cause this behavior?
Thanks in advance.
UPDATE:
I worked around the issue, but now I found the same problem somewhere else. Since the issue first happened, I found out more about the operations performed before the error: updates on specific columns that set a TTL for said columns. After some investigation I found some Jira issues which could be related to this problem:
https://issues.apache.org/jira/browse/CASSANDRA-6782
https://issues.apache.org/jira/browse/CASSANDRA-8206
However, those issues seem to have been solved on 2.0 and 2.1, and I'm using 2.2. I think these changes are included in 2.2, but I could be mistaken.
The main problem is the type of query you are running on Cassandra.
The Cassandra data model is query driven: tables are designed around the queries they serve.
Tables are created with a well-defined primary key (partition key and clustering key). Cassandra is not good at full-table-scan queries.
Now coming to your queries.
SELECT * FROM my_table WHERE code = 'test';
Here the column used is a clustering column, and since it is the column used for the equality search, it should be part of the partition key. The same clustering key value can be present in different partitions, so if the read consistency level is ONE the query may return an empty result.
SELECT * FROM my_table;
Cassandra is not good at this kind of table-scan query. It has to search the whole table to return all the rows (poor querying).
SELECT * FROM my_table
WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
Here you mentioned everything so the correct results were returned.
I opened a Jira issue and the problem was fixed on 2.1.18 and 2.2.10:
https://issues.apache.org/jira/browse/CASSANDRA-13412
I speak just from what I read in the Jira issue. I didn't test the above scenario again after the fix was implemented because by then I had moved to the 3.0 version.
In the end though I ended up removing almost every use of secondary indices in my application, as I learned that they led to bad performance.
The reason is that in most cases they will result in fan-out queries that will contact every node of the cluster, with the corresponding costs.
There are still some cases where they can perform well, e.g. when you query by partition key at the same time, as no other nodes will be involved.
But for anything else, my advice is: consider if you can remove your secondary indices and do lookups in auxiliary tables instead. You'll have the burden of maintaining the tables in sync, but performance should be better.
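As a rough illustration of the auxiliary-table approach, the sketch below keeps a lookup structure keyed by code alongside the main table and updates both on every write. It is an in-memory model only: the class name, the my_table_by_code name and the sample rows are assumptions for the example, and a real application would issue the two writes against Cassandra (for instance in a batch).

```python
class Store:
    """In-memory model of a main table plus a hand-maintained lookup table."""

    def __init__(self):
        self.my_table = {}          # (id, some_text, code) -> row
        self.my_table_by_code = {}  # code -> set of primary keys

    def insert(self, row):
        pk = (row["id"], row["some_text"], row["code"])
        self.my_table[pk] = row
        # Second write: keep the lookup table in sync with the main table.
        self.my_table_by_code.setdefault(row["code"], set()).add(pk)

    def delete(self, pk):
        row = self.my_table.pop(pk, None)
        if row is None:
            return
        keys = self.my_table_by_code.get(row["code"], set())
        keys.discard(pk)
        if not keys:
            self.my_table_by_code.pop(row["code"], None)

    def find_by_code(self, code):
        # One keyed read on the lookup table, then keyed reads on the
        # main table; no scan over every partition is needed.
        pks = sorted(self.my_table_by_code.get(code, ()))
        return [self.my_table[pk] for pk in pks]


store = Store()
store.insert({"id": "my_id", "some_text": "whatever", "code": "test"})
store.insert({"id": "other", "some_text": "x", "code": "prod"})
found = store.find_by_code("test")
store.delete(("my_id", "whatever", "test"))
after_delete = store.find_by_code("test")
print(found, after_delete)
```

The delete method shows the maintenance burden mentioned above: every change to the main table needs a matching change in the lookup table.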
I am creating a column family in Cassandra and I expect the column order to match the one I am specifying in the create clause.
This
CREATE TABLE cf.mycf (
timestamp timestamp,
id text,
score int,
type text,
publisher_id text,
embed_url text,
PRIMARY KEY (timestamp, id, score)
) WITH bloom_filter_fp_chance = 0.01
AND comment = ''
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE'
AND caching = {
'keys' : 'ALL',
'rows_per_partition' : 'NONE'
}
AND compression = {
'chunk_length_kb' : 64,
'crc_check_chance' : 1.0,
'sstable_compression' : 'LZ4Compressor'
}
AND compaction = {
'base_time_seconds' : 60,
'class' : 'DateTieredCompactionStrategy',
'enabled' : true,
'max_sstable_age_days' : 365,
'max_threshold' : 32,
'min_threshold' : 4,
'timestamp_resolution' : 'MICROSECONDS',
'tombstone_compaction_interval' : 86400,
'tombstone_threshold' : 0.2,
'unchecked_tombstone_compaction' : false
};
should create a table with the columns in this order:
timestamp, id, score, type, publisher_id, embed_url
Instead I am getting this:
timestamp timestamp,
id text,
score int,
embed_url text,
publisher_id text,
type text,
I've created quite a few tables in the same way and this never happened so any help would be appreciated.
I put id and score in the primary key to show that these keep their respective positions, while the actual schema I am looking for has only the timestamp as the primary key.
It looks like there is no such thing as field order in Cassandra.
The other columns are displayed in alphabetical order by Cassandra.
http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
You should make a clear distinction on how you want the data to be presented and how it is effectively presented to you. Moreover, you should not rely on the ordinal position of the fields but only on their names.
In order to be efficient, and against your will (you specified an order to the columns when you modeled your schema), Cassandra needs to store the columns in a particular order, and for simplicity this reflects on how it (the CQL interface or the driver) will give back your data.
I suggest you take a close look at how Cassandra stores data (column names included!) in Understanding How CQL3 Maps to Cassandra’s Internal Data Structure.
By the way, if you absolutely need to keep your order at application level (and are too lazy to specify all the fields in the SELECT instead of using SELECT *), you need to create an abstraction interface on your own, something like creating an ordered "field names" array (your order):
String myorder[] = { "timestamp", "id", "score", "type", "publisher_id", "embed_url"};
and then use this as a map in loops using ordinal values.
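A minimal sketch of that idea, written in Python for brevity: the application keeps its own ordered list of field names and uses it to lay out each row, whatever order the driver returned the columns in. The function name and the sample row are invented for illustration.

```python
# The order the application wants, independent of what Cassandra reports.
FIELD_ORDER = ["timestamp", "id", "score", "type", "publisher_id", "embed_url"]


def in_declared_order(row, order=FIELD_ORDER):
    """Return (name, value) pairs following the application's own order."""
    return [(name, row[name]) for name in order if name in row]


# Invented row, with columns in the alphabetical order Cassandra shows them.
row = {
    "timestamp": "2015-06-01 12:00:00",
    "id": "abc",
    "score": 7,
    "embed_url": "http://example.com/embed",
    "publisher_id": "pub-1",
    "type": "video",
}
ordered = in_declared_order(row)
print([name for name, _ in ordered])
```

Looking values up by name, as done here, is also what protects the application from relying on ordinal positions.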
Keep in mind that the rendering of the CQL string in DESCRIBE in cqlsh is just a function call in the python driver iterating over the metadata. It has nothing to do with how C* stores or sends its results.
If it matters, you can set the order yourself. When you insert, you can define the order explicitly:
INSERT INTO keyspace_name.table_name
( identifier, column_name, whatever, order)
VALUES ( value, value ... )
When you do a select you can define the order explicitly.
SELECT identifier, whatever, order, column_name FROM keyspace_name.table_name
I am trying to optimize my spark job by avoiding shuffling as much as possible.
I am using cassandraTable to create the RDD.
The column family's column names are dynamic, thus it is defined as follows:
CREATE TABLE "Profile" (
key text,
column1 text,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='ALL' AND
...
This definition results in CassandraRow RDD elements in the following format:
CassandraRow <key, column1, value>
key - the RowKey
column1 - the value of column1 is the name of the dynamic column
value - the value of the dynamic column
So if I have RK='profile1', with columns name='George' and age='34', the resulting RDD will be:
CassandraRow<key=profile1, column1=name, value=George>
CassandraRow<key=profile1, column1=age, value=34>
Then I need to group elements that share the same key together to get a PairRdd:
PairRdd<String, Iterable<CassandraRow>>
It is important to note that all the elements I need to group are on the same Cassandra node (they share the same row key), so I expect the connector to keep the locality of the data.
The problem is that using groupBy or groupByKey causes shuffling. I would rather group them locally, because all the data is on the same node:
JavaPairRDD<String, Iterable<CassandraRow>> rdd = javaFunctions(context)
.cassandraTable(ks, "Profile")
.groupBy(new Function<ColumnFamilyModel, String>() {
@Override
public String call(ColumnFamilyModel arg0) throws Exception {
return arg0.getKey();
}
})
My questions are:
Will using keyBy on the RDD cause shuffling, or will it keep the data local?
Is there a way to group the elements by key without shuffling? I read about mapPartitions, but didn't quite understand the usage of it.
Thanks,
Shai
I think you are looking for spanByKey, a cassandra-connector-specific operation that takes advantage of the ordering provided by Cassandra to allow grouping of elements without incurring a shuffle stage.
In your case, it should look like:
sc.cassandraTable("keyspace", "Profile")
.keyBy(row => (row.getString("key")))
.spanByKey
Read more in the docs:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md#grouping-rows-by-partition-key
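The reason no shuffle is needed can be illustrated outside Spark. When rows with the same key arrive next to each other, as they do when read from a single Cassandra partition, grouping is just a matter of cutting the stream wherever the key changes. The Python sketch below models that with itertools.groupby; it is not the connector's implementation, and the function name and sample rows are made up.

```python
from itertools import groupby
from operator import itemgetter


def span_by_key(pairs):
    """Group consecutive (key, value) pairs that share a key.

    Only neighbouring elements are compared, so no data has to be
    moved or sorted; this relies on the input already being ordered
    by key within the partition being processed.
    """
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Invented rows in the order one Spark partition might receive them.
partition = [
    ("profile1", ("age", "34")),
    ("profile1", ("name", "George")),
    ("profile2", ("name", "Ann")),
]
grouped = list(span_by_key(partition))
print(grouped)
```

If rows with the same key were not contiguous, this approach would emit several separate groups for that key, which is the trade-off for skipping the shuffle.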
I have the following column family in Cassandra for storing time series data in a small number of very "wide" rows:
CREATE TABLE data_bucket (
day_of_year int,
minute_of_day int,
event_id int,
data ascii,
PRIMARY KEY (day_of_year, minute_of_day, event_id)
)
On the CQL shell, I am able to run a query such as this:
select * from data_bucket where day_of_year = 266 and minute_of_day = 244
and event_id in (4, 7, 11, 1990, 3433)
Essentially, I fix the value of the first component of the composite column name (minute_of_day) and want to select a non-contiguous set of columns based on the distinct values of the second component (event_id). Since the "IN" relation is interpreted as an equality relation, this works fine.
Now my question is, how would I accomplish the same type of composite column slicing programmatically and without CQL. So far I have tried the Python client pycassa and the Java client Astyanax, but without any success.
Any thoughts would be welcome.
EDIT:
I'm adding the describe output of the column family as seen through cassandra-cli. Since I am looking for a Thrift-based solution, maybe this will help.
ColumnFamily: data_bucket
Key Validation Class: org.apache.cassandra.db.marshal.Int32Type
Default column value validator: org.apache.cassandra.db.marshal.AsciiType
Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.Int32Type,org.apache.cassandra.db.marshal.Int32Type)
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 0.1
DC Local Read repair chance: 0.0
Populate IO Cache on flush: false
Replicate on write: true
Caching: KEYS_ONLY
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
There is no "IN"-type query in the Thrift API. You could perform a series of get queries for each composite column value (day_of_year, minute_of_day, event_id).
If your event_ids were sequential (and your question says they are not), you could perform a single get_slice query, passing in the range (e.g., day_of_year, minute_of_day, and a range of event_ids). You could also grab a wider range this way and filter the response programmatically yourself (e.g., grab all data on the date with event ids between 4 and 3433). That means more data transfer and more processing on the client side, so it is not a great option unless you really are looking for a range.
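The slice-then-filter option can be sketched without a cluster. The Python below models one wide row as a sorted list of composite column names, takes the contiguous range covering the smallest and largest wanted event_id (which is what a single range slice would return), and then drops the unwanted columns on the client. The function name and sample columns are assumptions for illustration; no Thrift client call is shown.

```python
from bisect import bisect_left, bisect_right


def slice_and_filter(columns, minute_of_day, wanted_event_ids):
    """columns: list of ((minute_of_day, event_id), value), sorted by name."""
    names = [name for name, _ in columns]
    low, high = min(wanted_event_ids), max(wanted_event_ids)
    # Contiguous range, standing in for one range slice on the row.
    start = bisect_left(names, (minute_of_day, low))
    stop = bisect_right(names, (minute_of_day, high))
    fetched = columns[start:stop]
    # Client-side filtering of the columns that were not asked for.
    kept = [(name, value) for name, value in fetched
            if name[1] in wanted_event_ids]
    return fetched, kept


# Invented contents of the row for day_of_year = 266.
row_266 = [
    ((244, 4), "a"), ((244, 5), "b"), ((244, 7), "c"), ((244, 11), "d"),
    ((244, 1990), "e"), ((244, 2000), "f"), ((244, 3433), "g"),
    ((245, 4), "h"),
]
fetched, kept = slice_and_filter(row_266, 244, {4, 7, 11, 1990, 3433})
print(len(fetched), [name[1] for name, _ in kept])
```

The gap between the number of fetched and kept columns is the extra transfer cost mentioned above; it grows with how sparse the wanted ids are within the range.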
So, if you want to use "IN" with Cassandra you will need to switch to a CQL-based solution. If you are considering using CQL in python another option is cassandra-dbapi2. This worked for me:
import cql
# Replace settings as appropriate
host = 'localhost'
port = 9160
keyspace = 'keyspace_name'
# Connect
connection = cql.connect(host, port, keyspace, cql_version='3.0.1')
cursor = connection.cursor()
print("connected!")
# Execute CQL
cursor.execute("select * from data_bucket where day_of_year = 266 and minute_of_day = 244 and event_id in (4, 7, 11, 1990, 3433)")
for row in cursor:
    print(str(row))  # Do something with your data
# Shut the connection
cursor.close()
connection.close()
(Tested with Cassandra 2.0.1.)