Cassandra create table won't keep column order

I am creating a column family in Cassandra and I expect the column order to match the one I am specifying in the create clause.
This
CREATE TABLE cf.mycf (
timestamp timestamp,
id text,
score int,
type text,
publisher_id text,
embed_url text,
PRIMARY KEY (timestamp, id, score)
) WITH bloom_filter_fp_chance = 0.01
AND comment = ''
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE'
AND caching = {
'keys' : 'ALL',
'rows_per_partition' : 'NONE'
}
AND compression = {
'chunk_length_kb' : 64,
'crc_check_chance' : 1.0,
'sstable_compression' : 'LZ4Compressor'
}
AND compaction = {
'base_time_seconds' : 60,
'class' : 'DateTieredCompactionStrategy',
'enabled' : true,
'max_sstable_age_days' : 365,
'max_threshold' : 32,
'min_threshold' : 4,
'timestamp_resolution' : 'MICROSECONDS',
'tombstone_compaction_interval' : 86400,
'tombstone_threshold' : 0.2,
'unchecked_tombstone_compaction' : false
};
Should create a table like:
timestamp, id, score, type, publisher_id, embed_url
Instead I am getting this:
timestamp timestamp,
id text,
score int,
embed_url text,
publisher_id text,
type text,
I've created quite a few tables in the same way and this never happened, so any help would be appreciated.
I put id and score in the key to show that these keep their respective positions, while the actual schema I am looking for has only the timestamp as the primary key.

Looks like there is no such thing as field order in Cassandra.
The non-key columns are displayed in alphabetical order by Cassandra.
http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html

You should make a clear distinction between how you want the data to be presented and how it is effectively presented to you. Moreover, you should not rely on the ordinal position of the fields, only on their names.
In order to be efficient, and regardless of your intent (you specified an order for the columns when you modeled your schema), Cassandra needs to store the columns in a particular order, and for simplicity this is reflected in how it (the CQL interface or the driver) gives back your data.
I suggest you take a deep look at how Cassandra stores data (column names included!) in Understanding How CQL3 Maps to Cassandra's Internal Data Structure.
By the way, if you absolutely need to keep your order at the application level (and don't want to spell out all the fields in the SELECT instead of using SELECT *), you need to create an abstraction interface on your own, something like an ordered "field names" array (your order):
String[] myOrder = { "timestamp", "id", "score", "type", "publisher_id", "embed_url" };
and then iterate over this array to access the fields by name in your loops.
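A minimal sketch of that idea, assuming the DataStax Java driver (2.1+); the contact point is illustrative:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class OrderedRead {
    public static void main(String[] args) {
        // Your order, not Cassandra's alphabetical one
        String[] myOrder = { "timestamp", "id", "score", "type", "publisher_id", "embed_url" };
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("cf")) {
            for (Row row : session.execute("SELECT * FROM mycf")) {
                StringBuilder line = new StringBuilder();
                for (String col : myOrder) {
                    // Access by name, so the driver's column order is irrelevant
                    line.append(col).append('=').append(row.getObject(col)).append("  ");
                }
                System.out.println(line);
            }
        }
    }
}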

Keep in mind that the rendering of the CQL string by DESCRIBE in cqlsh is just a function call in the Python driver iterating over the metadata. It has nothing to do with how C* stores or sends its results.
If it matters, you can set the order yourself. When you INSERT, you can define the column order explicitly:
INSERT INTO keyspace_name.table_name
( identifier, column_name, whatever, order)
VALUES ( value, value ... )
When you do a SELECT, you can define the order explicitly as well:
SELECT identifier, whatever, order, column_name FROM keyspace_name.table_name
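With the table from the question, that would be:
SELECT timestamp, id, score, type, publisher_id, embed_url FROM cf.mycf;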

Related

Does using all fields as a partitioning keys in a table a drawback in cassandra?

My aim is to get the msgAddDate based on the query below:
select max(msgAddDate)
from sampletable
where reportid = 1 and objectType = 'loan' and msgProcessed = 1;
Design 1:
Here reportid, objectType and msgProcessed may not be unique. To add uniqueness I have added msgAddDate and msgProcessedDate (an additional unique value).
I use this design because I don't perform range queries.
CREATE TABLE sampletable (
    reportid INT,
    objectType TEXT,
    msgAddDate TIMESTAMP,
    msgProcessed INT,
    msgProcessedDate TIMESTAMP,
    PRIMARY KEY ((reportid, msgProcessed, objectType, msgAddDate, msgProcessedDate))
);
Design 2:
CREATE TABLE sampletable (
    reportid INT,
    objectType TEXT,
    msgAddDate TIMESTAMP,
    msgProcessed INT,
    msgProcessedDate TIMESTAMP,
    PRIMARY KEY ((reportid, msgProcessed, objectType), msgAddDate, msgProcessedDate)
);
Please advise which one to use and what the pros and cons of the two are in terms of performance.
Design 2 is the one you want.
In Design 1, the whole primary key is the partition key. This means you need to provide all the attributes (reportid, msgProcessed, objectType, msgAddDate, msgProcessedDate) to be able to query your data with a SELECT statement (which wouldn't be useful, as you would not retrieve any attributes beyond the ones you already provided in the WHERE clause).
In Design 2, your partition key is reportid, msgProcessed, objectType, which are the three attributes you want to query by. Great. msgAddDate is the first clustering column, so it will be automatically sorted for you. You don't even need to run max() since the data is sorted; all you need to do is use LIMIT 1:
SELECT msgAddDate FROM sampletable WHERE reportid = 1 and objectType = 'loan' and msgProcessed = 1 LIMIT 1;
Of course, make sure to define a DESC sort order on msgAddDate (by default it is ascending).
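For example, Design 2 with the descending clustering order spelled out would look like this:
CREATE TABLE sampletable (
    reportid INT,
    objectType TEXT,
    msgAddDate TIMESTAMP,
    msgProcessed INT,
    msgProcessedDate TIMESTAMP,
    PRIMARY KEY ((reportid, msgProcessed, objectType), msgAddDate, msgProcessedDate)
) WITH CLUSTERING ORDER BY (msgAddDate DESC, msgProcessedDate DESC);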
Hope it helps!

Cassandra SELECT on secondary index doesn't return row

I am dealing with a puzzling behaviour when doing SELECTs on Cassandra 2.2.3. I have 4 nodes in the ring, and I create the following keyspace, table and index.
CREATE KEYSPACE IF NOT EXISTS my_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE my_keyspace.my_table (
id text,
some_text text,
code text,
some_set set<int>,
a_float float,
name text,
type int,
a_double double,
another_set set<int>,
another_float float,
yet_another_set set<text>,
PRIMARY KEY (id, some_text, code)
) WITH read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
CREATE INDEX idx_my_table_code ON my_keyspace.my_table (code);
Then I insert some rows into the table. Some of them have empty sets. I perform this query through the default CQL client and get the row I am expecting:
SELECT * FROM my_table WHERE code = 'test';
Then I run some tests which are outside my control. I don't know what they do but I expect they read and possibly insert/update/delete some rows. I'm sure they don't delete or change any of the settings in the index, table or keyspace.
After the tests, I log in again through the default CQL client and run the following queries.
SELECT * FROM my_table WHERE code = 'test';
SELECT * FROM my_table;
SELECT * FROM my_table WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
The first one doesn't return anything.
The second one returns all the rows, including the one with code = 'test'.
The third one returns the expected row that the first query couldn't retrieve.
The only difference that I can see between this row and others is that it is one of the rows which contains some empty sets, as explained earlier. If I query for another of the rows that also contain some empty sets, I get the same behavior.
I would say the problem is related to the secondary index. Somehow, the operations performed during the tests leave the index in a state where it cannot see certain rows.
I'm obviously missing something. Do you have any ideas about what could cause this behavior?
Thanks in advance.
UPDATE:
I worked around the issue, but now I found the same problem somewhere else. Since the issue first happened, I found out more about the operations performed before the error: updates on specific columns that set a TTL for said columns. After some investigation I found some Jira issues which could be related to this problem:
https://issues.apache.org/jira/browse/CASSANDRA-6782
https://issues.apache.org/jira/browse/CASSANDRA-8206
However, those issues seem to have been solved in 2.0 and 2.1, and I'm using 2.2. I think these changes are included in 2.2, but I could be mistaken.
The main problem is the type of query you are running on Cassandra.
The Cassandra data model is query driven: tables are modeled around the queries they have to serve.
Tables are created with a well-defined primary key (partition key & clustering key). Cassandra is not good for full-table-scan types of queries.
Now coming to your queries.
SELECT * FROM my_table WHERE code = 'test';
Here the column used is a clustering column; as an equality-search column it should be part of the partition key. A clustering key value will be present in different partitions, so if the read consistency level is ONE it may give an empty result.
SELECT * FROM my_table;
Cassandra is not good for this kind of table-scan query. Here it will search the whole table and fetch all the rows (poor querying).
SELECT * FROM my_table
WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
Here you mentioned everything, so the correct results were returned.
I opened a Jira issue and the problem was fixed in 2.1.18 and 2.2.10:
https://issues.apache.org/jira/browse/CASSANDRA-13412
I speak just from what I read in the Jira issue. I didn't test the above scenario again after the fix was implemented because by then I had moved to the 3.0 version.
In the end though I ended up removing almost every use of secondary indices in my application, as I learned that they led to bad performance.
The reason is that in most cases they will result in fan-out queries that will contact every node of the cluster, with the corresponding costs.
There are still some cases where they can perform well, e.g. when you query by partition key at the same time, as no other nodes will be involved.
But for anything else, my advice is: consider whether you can remove your secondary indices and do lookups in auxiliary tables instead. You'll have the burden of keeping the tables in sync, but performance should be better.
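As an illustration (the table name my_table_by_code is hypothetical; the columns follow the schema above), a lookup table replacing idx_my_table_code could look like:
CREATE TABLE my_keyspace.my_table_by_code (
    code text,
    id text,
    some_text text,
    PRIMARY KEY (code, id, some_text)
);
The application then writes every row to both tables, queries my_table_by_code by code to obtain the primary key, and reads the full row from my_table.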

Cassandra timeout during read query at consistency ONE

I have a problem with the Cassandra db and hope somebody can help me. I have a table “log”. Into the log table I have inserted about 10000 rows. Everything works fine. I can do a
select * from
select count(*) from
As soon as I insert 100000 rows with TTL 50, I receive an error with
select count(*) from
Version: cassandra 2.1.8, 2 nodes
Cassandra timeout during read query at consistency ONE (1 responses
were required but only 0 replica responded)
Does someone have an idea what I am doing wrong?
CREATE TABLE test.log (
day text,
date timestamp,
ip text,
iid int,
request text,
src text,
tid int,
txt text,
PRIMARY KEY (day, date, ip)
) WITH read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
That error message indicates a problem with the READ operation. Most likely it is a READ timeout. You may need to update your cassandra.yaml with a larger read timeout as described in this SO answer.
Example for 200 seconds:
read_request_timeout_in_ms: 200000
If updating that does not work, you may need to tweak the JVM settings for Cassandra. See DataStax's "Tuning Java Ops" for more information.
count() is a very costly operation: imagine Cassandra needs to scan all the rows on all the nodes just to give you the count. With a small number of rows it works, but on bigger data you should use other approaches to avoid timeouts.
First of all, we have to retrieve the rows ourselves and forget about count(*).
We should make several (dozens? hundreds?) queries filtering by partition and clustering key, and sum the number of rows retrieved by each query.
Here is a good explanation of what clustering and partition keys are. In your case day is the partition key, and the clustering key consists of two columns: date and ip.
It is most likely impossible to do this with the cqlsh command-line client, so you should write a script yourself. Official drivers exist for popular programming languages: http://docs.datastax.com/en/developer/driver-matrix/doc/common/driverMatrix.html
Example of one of such queries:
select day, date, ip, iid, request, src, tid, txt from test.log where day='Saturday' and date='2017-08-12 00:00:00' and ip='127.0.0.1'
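If a per-partition total is enough, a count restricted to a single partition avoids the cluster-wide scan and is usually safe to run directly (the value of day is illustrative):
select count(*) from test.log where day = 'Saturday';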
Remarks:
If you just need to calculate the count and nothing more, it probably makes sense to look for a tool like https://github.com/brianmhess/cassandra-count
If Cassandra refuses to run your query without ALLOW FILTERING, that means the query is not efficient: https://stackoverflow.com/a/38350839/2900229

Composite columns and "IN" relation in Cassandra

I have the following column family in Cassandra for storing time series data in a small number of very "wide" rows:
CREATE TABLE data_bucket (
day_of_year int,
minute_of_day int,
event_id int,
data ascii,
PRIMARY KEY (day_of_year, minute_of_day, event_id)
)
On the CQL shell, I am able to run a query such as this:
select * from data_bucket where day_of_year = 266 and minute_of_day = 244
and event_id in (4, 7, 11, 1990, 3433)
Essentially, I fix the value of the first component of the composite column name (minute_of_day) and want to select a non-contiguous set of columns based on the distinct values of the second component (event_id). Since the "IN" relation is interpreted as an equality relation, this works fine.
Now my question is, how would I accomplish the same type of composite column slicing programmatically and without CQL. So far I have tried the Python client pycassa and the Java client Astyanax, but without any success.
Any thoughts would be welcome.
EDIT:
I'm adding the describe output of the column family as seen through cassandra-cli. Since I am looking for a Thrift-based solution, maybe this will help.
ColumnFamily: data_bucket
Key Validation Class: org.apache.cassandra.db.marshal.Int32Type
Default column value validator: org.apache.cassandra.db.marshal.AsciiType
Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.Int32Type,org.apache.cassandra.db.marshal.Int32Type)
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 0.1
DC Local Read repair chance: 0.0
Populate IO Cache on flush: false
Replicate on write: true
Caching: KEYS_ONLY
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
There is no "IN"-type query in the Thrift API. You could perform a series of get queries for each composite column value (day_of_year, minute_of_day, event_id).
If your event_ids were sequential (and your question says they are not) you could perform a single get_slice query, passing in the range (e.g., day_of_year, minute_of_day, and range of event_ids). You could grab bunches of them in this way and filter the response programatically yourself (e.g., grab all data on the date with event ids between 4-3433). More data transfer, more processing on the client side so not a great option unless you really are looking for a range.
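A sketch of that range variant with Astyanax (untested; it assumes a hypothetical EventColumn class with @Component fields for minute_of_day and event_id, mirroring the composite comparator above):
// Serializer for the (minute_of_day, event_id) composite column name
AnnotatedCompositeSerializer<EventColumn> serializer =
    new AnnotatedCompositeSerializer<EventColumn>(EventColumn.class);

ColumnList<EventColumn> columns = keyspace.prepareQuery(CF_DATA_BUCKET)
    .getKey(266)                          // day_of_year row key
    .withColumnRange(serializer.buildRange()
        .withPrefix(244)                  // fix minute_of_day
        .greaterThanEquals(4)             // event_id lower bound
        .lessThanEquals(3433)             // event_id upper bound
        .build())
    .execute()
    .getResult();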
So, if you want to use "IN" with Cassandra, you will need to switch to a CQL-based solution. If you are considering using CQL in Python, another option is cassandra-dbapi2. This worked for me:
import cql
# Replace settings as appropriate
host = 'localhost'
port = 9160
keyspace = 'keyspace_name'
# Connect
connection = cql.connect(host, port, keyspace, cql_version='3.0.1')
cursor = connection.cursor()
print "connected!"
# Execute CQL
cursor.execute("select * from data_bucket where day_of_year = 266 and minute_of_day = 244 and event_id in (4, 7, 11, 1990, 3433)")
for row in cursor:
print str(row) # Do something with your data
# Shut the connection
cursor.close()
connection.close()
(Tested with Cassandra 2.0.1.)

Can't create table with composite key using Astyanax client

How do I create a table with composite keys using the Astyanax client? For now I've created it with cqlsh -3, and this is how it looks in cli:
[default@KS] describe my_cf;
ColumnFamily: my_cf
Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.TimeUUIDType,org.apache.cassandra.db.marshal.UTF8Type)
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 0.1
DC Local Read repair chance: 0.0
Replicate on write: true
Caching: KEYS_ONLY
Bloom Filter FP chance: default
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor
This is how I would expect it to be in cqlsh:
CREATE TABLE my_cf (
... key text,
... timeid timeuuid,
... flag boolean,
... data text,
... PRIMARY KEY (key, timeid));
I got it working, but with the composite key stored as a blob, which is a problem.
my code
public class MyKey {
    @Component(ordinal=0)
    private String key;
    @Component(ordinal=1)
    private UUID timeid;
    //...
}
CF
public static ColumnFamily<MyKey, String> MY_CF = ColumnFamily
.newColumnFamily("my_cf",
new AnnotatedCompositeSerializer<MyKey>(MyKey.class),
StringSerializer.get());
KS
ksDef = cluster.makeKeyspaceDefinition();
ksDef.setName(keyspaceName)
.setStrategyOptions(keyspaceOptions)
.setStrategyClass("SimpleStrategy")
.addColumnFamily(
cluster.makeColumnFamilyDefinition()
.setName(MY_CF.getName())
.setComparatorType("UTF8Type")
.setDefaultValidationClass("UTF8Type")
// blob if no key validation class specified
// and something looking as a string if I use this: .setKeyValidationClass("CompositeType(UTF8Type, TimeUUIDType)")
// anyway there's a single column per composite key
.addColumnDefinition(
cluster.makeColumnDefinition()
.setName("flag")
.setValidationClass(
"BooleanType"))
.addColumnDefinition(
cluster.makeColumnDefinition()
.setName("data")
.setValidationClass(
"UTF8Type")));
cluster.addKeyspace(ksDef);
mutation
MutationBatch m = ks.prepareMutationBatch();
for (char keyName = 'A'; keyName <= 'C'; keyName++) {
MyKey myKey = new MyKey("THEKEY_" + keyName, TimeUUIDUtils.getUniqueTimeUUIDinMillis());
ColumnListMutation<String> cfm = m.withRow(MY_CF, myKey);
cfm.putColumn("flag", true, null);
cfm.putColumn("data", "DATA_" + keyName, null);
}
m.execute();
cqlsh:KS>describe columnfamily my_cf;
CREATE TABLE my_cf (
KEY blob PRIMARY KEY,
flag boolean,
data text
) WITH ...
cqlsh:KS>select * from my_cf;
key | flag | data
----------------------------------------------------------+--------+---------
00064953494e5f420000109f4513d0e3ac11e19c400022191ad62b00 | True | DATA_B
cqlsh:KS> select * from my_cf where key = 'THEKEY_B' order by timeid desc;
Bad Request: Order by on unknown column timeid
Doesn't it look right in cassandra-cli below? Why doesn't it work in cqlsh?
[default@KS] list my_cf;
RowKey: THEKEY_B:09f29941-e3c2-11e1-a7ef-0022191ad62b
=> (column=flag, value=true, timestamp=1344695832788000)
=> (column=data, value=DATA_B, timestamp=1344695832788000)
What am I doing wrong?
(astyanax 1.0.6, cassandra 1.1.2)
cqlsh>[cqlsh 2.2.0 | Cassandra 1.1.2 | CQL spec 3.0.0 | Thrift protocol 19.32.0]
From what I've been able to figure out, composite primary keys
represent a major divergence in the protocol and interface to
Cassandra, and the protocol you use controls the features you have
access to.
For instance, Astyanax and Hector are primarily Thrift protocol
clients, while CQL, more than just a language, is (or will be?) a binary protocol.
The two protocols are not equivalent, and CQL3 with composite primary
keys makes things very different.
The thing to understand about "TABLES" with composite primary keys is that they
essentially translate into wide rows with composite column names. The
first part of the primary key is the row key and the remaining parts
are used as a prefix along with the TABLE-column name as the column name in
the wide row.
In your instance, the row key is "key" and the column prefix is
"timeid", so the flag field of what you are inserting is actually a
column named <timeid>:flag and data is <timeid>:data and so
on.
In order for this to work, the CQL interface to Cassandra is
converting "TABLES" into wide rows and transparently handling all of
that column naming.
The Thrift interface doesn't take care of this stuff, and when you
do a mutation, it just writes columns like it always has, without the
virtual addressing.
So, in fact, the results do not look right in your cassandra-cli. If you do an insert from cqlsh -3, here is what it should look like from the cassandra-cli point of view (with a simple text date):
[default@testapp] list my_cf;
RowKey: mykey
=> (column=20120827:data, value=some data, timestamp=1346090889361000)
=> (column=20120827:flag, value=, timestamp=1346090889361001)
CQL3 and tables look really attractive, but there are trade-offs to be made, and there doesn't seem to be solid Java client support yet.
