Debugging Cassandra in CQLSH - ArrayIndexOutOfBoundsException

I am running some Cassandra queries. When I run
select * from logtable;
I get this error:
<ErrorMessage code=0000 [Server error] message="java.lang.ArrayIndexOutOfBoundsException">
However, if I run with limits, some rows are OK:
select * from logtable limit 100;
This works. If I keep increasing the limit, I eventually get the error again, so it seems clear that our software is writing some corrupted data. My question is this: is there a way to find out what is happening using cqlsh? I can't realistically study the application code, because it's extremely messed up (a nightmare, honestly), and I couldn't find anything useful in system.log.
This is the table:
cqlsh:mykeyspace> desc table logtable
CREATE TABLE mykeyspace.logtable (
key text,
key2 text,
column1 text,
column2 text,
column3 text,
column4 text,
value blob,
PRIMARY KEY ((key, key2), column1, column2, column3, column4)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC, column2 ASC, column3 ASC, column4 ASC)
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = 'NONE';
Thanks.
Regards,
Serban
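One way to narrow this down from the client side (cqlsh alone will not show much beyond the server error) is to page through the table and record the last partition that read back cleanly before the exception, then bisect around it with LIMIT. A minimal sketch with the Python driver against the schema above; the contact point, fetch size, and selected columns here are illustrative choices:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])          # assumed contact point
session = cluster.connect('mykeyspace')

# Page through the whole table in small chunks; remember the last key that
# came back successfully so the failing partition can be bracketed.
stmt = SimpleStatement("SELECT key, key2, column1 FROM logtable", fetch_size=100)
last_good = None
try:
    for row in session.execute(stmt):
        last_good = (row.key, row.key2, row.column1)
except Exception as exc:
    print("Server error after:", last_good)
    print(exc)
Once the offending partition is known, nodetool scrub on the affected nodes may also be worth a look, since errors like this can come from a corrupted SSTable rather than from the application's data.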

Related

Cassandra - Altering of types is not allowed

I have a Cassandra cluster (2 nodes) and I am trying to alter the value column's type to a Map.
After executing ALTER TABLE "keyspace"."table" ALTER value TYPE Map; in cqlsh, I got an error saying that the modification is not allowed. (The table is empty.)
CREATE TABLE "keyspace"."table" (
key text,
column1 bigint,
column2 bigint,
value text,
PRIMARY KEY (key, column1, column2)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC, column2 ASC)
AND bloom_filter_fp_chance = 0.1
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.SnappyCompressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.05
AND default_time_to_live = 0
AND gc_grace_seconds = 5
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
Is it possible to alter the column type with this table structure? What could cause this issue?
Thanks
That kind of alteration is unfortunately not supported in Cassandra. For a reference regarding CQL data types and supported conversions, please see the Datastax documentation.
I don't know if this will work, but you could try dropping the column and then creating it again?
Unfortunately, dropping and re-adding the column only works in certain cases where the types involved are similar in nature.
Bigger changes, for example changing from an int to a boolean, will produce an error along the lines of:
code=2200 [Invalid query] message="Cannot re-add previously dropped column 'MYCOLUMN' of type boolean, incompatible with previous type int"
If the table in question is part of a new development effort and has not been released into production, you might be better off dropping the table and re-creating it with the new type.
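If you do go the drop-and-recreate route, a rough sketch with the Python driver might look like the following; the map's key/value types here are only illustrative, and COMPACT STORAGE is dropped because compact tables cannot hold collection columns:
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()    # assumed contact point

# The table is stated to be empty, so no data migration step is shown.
session.execute('DROP TABLE "keyspace"."table"')
session.execute('''
    CREATE TABLE "keyspace"."table" (
        key text,
        column1 bigint,
        column2 bigint,
        value map<bigint, text>,    -- assumed map parameters
        PRIMARY KEY (key, column1, column2)
    ) WITH CLUSTERING ORDER BY (column1 ASC, column2 ASC)
''')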

Insert query replaces rows having same data field in Cassandra clustering column

I'm learning Cassandra, having started off with v3.8. My sample keyspace/table looks like this:
CREATE TABLE digital.usage (
provider decimal,
deviceid text,
date text,
hours varint,
app text,
flat text,
usage decimal,
PRIMARY KEY ((provider, deviceid), date, hours)
) WITH CLUSTERING ORDER BY (date ASC, hours ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I am using a composite PRIMARY KEY with the partition key made up of provider and deviceid, so that uniqueness and distribution are handled across the cluster nodes. The clustering keys are date and hours.
I have a few observations:
1) For a PRIMARY KEY((provider, deviceid), date, hours), when inserting multiple entries with the same hours value, only the latest is kept and the previous ones disappear.
2) For a PRIMARY KEY((provider, deviceid), date), when inserting multiple entries with the same date value, only the latest is kept and the previous ones disappear.
Though I'm happy with the above (point 1) behaviour, I want to know what's happening in the background. Do I have to understand more about the clustering keys?
A PRIMARY KEY is meant to be unique.
Most RDBMSs throw an error if you insert a duplicate value into a PRIMARY KEY.
Cassandra does not do a read before write. It creates a new version of the record with the latest timestamp. When you insert data with the same values for the primary key columns, a new record is written with a newer timestamp, and when you query (SELECT), only the record with the latest timestamp is returned.
Example:
PRIMARY KEY((provider, deviceid), date, hours)
INSERT INTO digital.usage (provider, deviceid, date, hours, app, flat) VALUES (1.0, 'a', '2017-07-27', 1, 'test', 'test')
---- This will create a new record with, let's say, timestamp 1
INSERT INTO digital.usage (provider, deviceid, date, hours, app, flat) VALUES (1.0, 'a', '2017-07-27', 1, 'test1', 'test1')
---- This will create a new record with, let's say, timestamp 2
SELECT app,flat FROM digital.usage WHERE provider=1.0 AND deviceid='a' AND date='2017-07-27' AND hours=1
Will give
 app   | flat
-------+-------
 test1 | test1
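As an aside: if the overwrite is ever unwanted, a lightweight transaction can reject duplicates instead of silently upserting, at the cost of an extra round trip. A sketch with the Python driver against the question's schema, with an assumed contact point:
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('digital')   # assumed contact point
result = session.execute(
    "INSERT INTO usage (provider, deviceid, date, hours, app, flat) "
    "VALUES (1.0, 'a', '2017-07-27', 1, 'test2', 'test2') IF NOT EXISTS")
# The first column of an LWT response is the [applied] boolean.
print(result.one()[0])   # False here, since this primary key already exists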

Issue with frequent truncates in Cassandra and 24 hour TTL creating large numbers of tombstones

We have the table below with a TTL of 24 hours (1 day). We have a 4-node Cassandra 3.0 cluster, and a Spark job processes this table. Once processed, it truncates all the data in the table and a new batch of data is inserted. This is a continuous process.
The problem I am seeing is that we are getting more and more tombstones because the data is truncated frequently, every day, after Spark finishes processing.
If I leave gc_grace_seconds at the default, there will be more tombstones. If I reduce gc_grace_seconds to 1 day, will that be an issue? Even if I run a repair on that table every day, will that be enough?
How should I approach this problem? I know frequent deletes are an anti-pattern in Cassandra; is there any other way to solve this issue?
CREATE TABLE b.stag (
xxxid bigint PRIMARY KEY,
xxxx smallint,
xx smallint,
xxr int,
xxx text,
xxx smallint,
exxxxx smallint,
xxxxxx tinyint,
xxxx text,
xxxx int,
xxxx text,
xxxxx text,
xxxxx timestamp
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 86400
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
thank you
A truncate of a table should not create tombstones. So when you say "truncating", I assume you mean deleting. You can, as you have already mentioned, lower the gc_grace_seconds value; however, this means you have a smaller window for repairs to run to reconcile the data and make sure each node has the right tombstone for a given key, or old data could reappear. It's a trade-off.
However, to be fair, if you are clearing out the table each time, why not use the TRUNCATE command? That way you clear the table without writing any tombstones.
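For completeness, a minimal sketch of that flow with the Python driver (contact point assumed): run TRUNCATE once after each Spark batch instead of issuing per-row deletes, since TRUNCATE discards the table's data files rather than writing a tombstone per row.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('b')   # assumed contact point
# Clear the staging table after the Spark job has finished processing it.
session.execute("TRUNCATE stag")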

Cassandra: only one row is inserted when 100k are expected

I tried the Cassandra Python driver to insert 100k rows:
# no_of_rows = 100k
for row in range(no_of_rows):
    session.execute("INSERT INTO test_table (key1, key2, key3) VALUES ('test', 'test', 'test')")
but only one row is inserted into test_table (checked using the Cassandra CQL Shell and select * from test_table). How can I fix this?
UPDATE
If I try
for row in range(no_of_rows):
    session.execute("INSERT INTO test_table (key1, key2, key3) VALUES ('test' + str(row), 'test', 'test')")
no rows are inserted. Here key1 is the primary key.
describe test_table:
CREATE TABLE test_keyspace.test_table (
key1 text PRIMARY KEY,
key2 text,
key3 text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Cassandra primary keys are unique. 100000 in-place writes to the same key(s) leave you with 1 row.
Which means if your primary key structure is PRIMARY KEY(key1,key2,key3) and you INSERT 'test','test','test' 100000 times...
...it'll write 'test','test','test' to the same partition 100000 times.
To get your Python code to work, I made some adjustments, such as creating a separate variable for the key (key1) and using a prepared statement:
pStatement = session.prepare("""
    INSERT INTO test_table (key1, key2, key3) VALUES (?, ?, ?);
""")
no_of_rows = 100000
for row in range(no_of_rows):
    key = 'test' + str(row)
    session.execute(pStatement, [key, 'test', 'test'])
Regarding "using Cassandra CQL Shell and select * from test_table":
I feel compelled to mention that both multi-key queries (querying more than one partition key at a time) and unbound queries (SELECTs without a WHERE clause) are definite anti-patterns in Cassandra. They may appear to work fine in a dev/test environment. But when you get to a production-scale cluster with dozens of nodes, these types of queries introduce a lot of network time into the equation, because they have to scan every node to compile the query results.
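As an illustration of that point (assuming the corrected loop above has been run), a single-partition read names the partition key in the WHERE clause and so only touches the replicas that own that key:
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('test_keyspace')   # assumed contact point
# Restricting the read to one partition key avoids the full-cluster scan.
print(session.execute(
    "SELECT key1, key2, key3 FROM test_table WHERE key1 = %s", ('test0',)).one())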
Your new code has a bug in string concatenation. It should be:
for row in range(no_of_rows):
    session.execute("INSERT INTO test_table (key1, key2, key3) VALUES ('test" + str(row) + "', 'test', 'test')")

Cassandra only returns some rows of a column family

I have the following column family in Cassandra:
CREATE TABLE tese.airplane_max_delay (
airplane_manufacturer text,
airplane_model text,
year int,
tailnumber text,
airplane_engine text,
airtime int,
arrdelay int,
cod uuid,
depdelay int,
description_carrier text,
distance int,
month int,
uniquecarrier text,
PRIMARY KEY ((airplane_manufacturer, airplane_model), year, tailnumber)) WITH CLUSTERING ORDER BY (year DESC, tailnumber DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
I imported a CSV file with 10 million records into this column family with the COPY command.
When I try to query the data, only 6629 rows are returned. The query should return 10 million rows.
Why does this happen? How can I change this?
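Given the upsert behaviour described in the questions above, a likely explanation is that many CSV rows share the same (airplane_manufacturer, airplane_model, year, tailnumber) primary key, so COPY silently overwrites them and only the last write per key survives. A quick way to check, sketched under the assumption that the CSV has a header row whose column names match the table (the file name is made up):
import csv

keys = set()
with open('airplanes.csv', newline='') as f:    # hypothetical file name
    for row in csv.DictReader(f):
        keys.add((row['airplane_manufacturer'], row['airplane_model'],
                  row['year'], row['tailnumber']))
# If this prints roughly 6629, nothing was lost by COPY: the remaining rows
# were upserts over the same primary keys.
print(len(keys))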
