What are the ways to select a distinct count in Cassandra?

I need to select a distinct count from a table in Cassandra.
As I understand it, a direct distinct count is not supported in Cassandra, and neither are nested queries as in an RDBMS.
select count(*) from (select distinct key_part_one from stackoverflow_composite) as count;
SyntaxException: line 1:21 no viable alternative at input '(' (select count(*) from [(]...)
What are the ways to get it? Can I get it directly from Cassandra, or do I need add-on tools or languages?
Below is my CREATE TABLE statement.
CREATE TABLE nishant_ana.ais_profile_table (
profile_key text,
profile_id text,
last_update_day date,
last_transaction_timestamp timestamp,
last_update_insertion_timestamp timeuuid,
profile_data blob,
PRIMARY KEY ((profile_key, profile_id), last_update_day)
) WITH CLUSTERING ORDER BY (last_update_day DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I have just started using Cassandra.

In Cassandra itself you can only do SELECT DISTINCT partition_key FROM ....
If you need something like this, you can use Spark with the Spark Cassandra Connector. It will work, but don't expect real-time answers, as it needs to read the necessary data from all nodes and then calculate the answer.
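For a small table, one alternative to Spark is to stream the distinct partition keys to the client and count them there. The following is a minimal Python sketch under that assumption; count_distinct_keys is a hypothetical helper, not part of any driver, and the commented-out usage assumes the DataStax Python driver and a reachable cluster. It still has to read from every node, so it will not be faster than the Spark approach on large tables.

```python
from typing import Iterable, Sequence


def count_distinct_keys(rows: Iterable[Sequence]) -> int:
    """Count distinct partition keys from an iterable of rows.

    Each row is a sequence holding the partition key columns. Rows are
    consumed one at a time, so a paged driver result set can be passed in.
    SELECT DISTINCT already returns each partition key once; the set only
    guards against duplicates if a non-DISTINCT query is passed.
    """
    seen = set()
    for row in rows:
        seen.add(tuple(row))
    return len(seen)


# Assumed usage with the DataStax Python driver (needs a running cluster):
#   from cassandra.cluster import Cluster
#   session = Cluster(["127.0.0.1"]).connect()
#   rows = session.execute(
#       "SELECT DISTINCT profile_key, profile_id "
#       "FROM nishant_ana.ais_profile_table")
#   print(count_distinct_keys(rows))

# Stand-in data so the sketch runs without a cluster.
sample_rows = [("k1", "p1"), ("k1", "p2"), ("k1", "p1")]
print(count_distinct_keys(sample_rows))
```

Note that SELECT DISTINCT must list every partition key column, which for the table above means both profile_key and profile_id.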

Related

How to make cqlsh display text correctly?

When I try to read from a Cassandra table I get what looks like binary output:
cqlsh 10.243.128.4 --debug -e "select enduser from test.cert limit 2;"
enduser
---------------------------------------
*7UDdnLg\x1135J"\x15%(
\x10\x1c\x1aHa\x7fO\x19)1#3b\x17\x
I am not sure why this is happening. The other fields are displayed correctly.
Table def:
CREATE TABLE test.cert (
enduser text,
cert_id int,
PRIMARY KEY (enduser, cert_id)
) WITH CLUSTERING ORDER BY (cert_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 1024
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I have tried UTF8 encoding with the command line but it did not help.
The CQL text data type is a UTF-8 encoded string, so cqlsh just displays the data stored in the cell.
It looks like you've stored some data with hex-encoding in it. If you post examples of what the data should be versus how it is displayed, it will provide clues as to what's going on. Cheers!
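One way to follow up on this is to check on the client side whether the stored strings contain control characters, which would suggest that binary data was written into the text column. Below is a minimal Python sketch, assuming the value has already been read into a Python str; find_control_chars is a hypothetical helper, and the sample value is copied from the first output row in the question.

```python
import unicodedata


def find_control_chars(value: str) -> list:
    """Return (index, escaped_char) pairs for control characters in value."""
    return [
        (i, ch.encode("unicode_escape").decode("ascii"))
        for i, ch in enumerate(value)
        # Unicode category "Cc" covers control characters such as \x11.
        if unicodedata.category(ch) == "Cc"
    ]


# First output row from the question, written as a Python string literal.
sample = '*7UDdnLg\x1135J"\x15%('
print(find_control_chars(sample))
```

A non-empty result for values that were supposed to be plain text points at the code that wrote the data rather than at cqlsh.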

Reduce gc_grace_seconds to 0 for TTL'ed data in Cassandra

Does it make sense to reduce gc_grace_seconds to 0 (or some other really low number) if the table only contains TTL'ed data (with no manual deletes)? The table has default_time_to_live set to 30 days. Also, as mentioned here:
In a single-node cluster, this property can safely be set to zero. You
can also reduce this value for tables whose data is not explicitly
deleted — for example, tables containing only data with TTL set,
More details of the schema.
CREATE TABLE Foo (
user_uuid uuid,
ts bigint,
... //skipped a few columns
PRIMARY KEY (user_uuid, ts, event_uuid)
) WITH CLUSTERING ORDER BY (ts DESC, event_uuid ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '24', 'compaction_window_unit': 'HOURS', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 2592000
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
You need to be careful: with gc_grace_seconds set to 0 you effectively disable hint collection, so if a node is down even for 5 minutes, you'll need to run a repair. In Cassandra 3.0, hints obey the gc_grace_seconds value, and if it's shorter than max_hint_window_in_ms, hints will be collected for that time period only. But you can reduce this value to several hours if necessary, as hinted in the linked documentation.
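The interaction between the two settings can be written down as a small calculation. This is a sketch of the rule as stated in this answer, not Cassandra's actual code; the function name is hypothetical, and the 3-hour figure assumes the default max_hint_window_in_ms of 10800000.

```python
# Assumed default hint window: 3 hours, expressed in milliseconds.
DEFAULT_MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000


def effective_hint_window_seconds(
        gc_grace_seconds: int,
        max_hint_window_ms: int = DEFAULT_MAX_HINT_WINDOW_MS) -> float:
    """Hints are kept for the shorter of gc_grace_seconds and the hint window."""
    return min(gc_grace_seconds, max_hint_window_ms / 1000)


# Compare a zero, a one-hour, and the default ten-day gc_grace_seconds.
for gc_grace in (0, 3600, 864000):
    print(gc_grace, effective_hint_window_seconds(gc_grace))
```

With gc_grace_seconds at 0 the window collapses to nothing, which is why any outage then requires a repair; a value of a few hours keeps hints useful while still letting expired data be purged soon.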
See this very good blog post on that topic.

cassandra read high iowait

data read per second
I have a three-node Cassandra cluster. When multiple threads query the cluster, the I/O load is very high. The cluster holds about 80 GB of data per node. I use the time window compaction strategy with a ten-hour window, and one SSTable is about 1 GB. Can somebody help me with this? Thank you.
one sstable information
The data rate is about 10,000 records per second, and the cluster holds about 10 billion records.
Below is the schema information:
CREATE TABLE point_warehouse.point_period (
point_name text,
year text,
time timestamp,
period int,
time_end timestamp,
value text,
PRIMARY KEY ((point_name, year), time)
) WITH CLUSTERING ORDER BY (time DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy', 'compaction_window_size': '10', 'compaction_window_unit': 'HOURS', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 2592000
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
The query is SELECT * FROM point_period WHERE point_name=? AND year='2017' AND time > '2017-05-23 12:53:24' ORDER BY time ASC LIMIT 1 ALLOW FILTERING
When this query is executed concurrently, the I/O load becomes extremely high, around 200 MB/s. Thank you.

Connection timed out while dropping keyspace

I'm having trouble deleting a keyspace.
The keyspace in question has 4 tables similar to this one:
CREATE KEYSPACE demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = false;
CREATE TABLE demo.t1 (
t11 int,
t12 timeuuid,
t13 int,
t14 text,
t15 text,
t16 boolean,
t17 boolean,
t18 int,
t19 timeuuid,
t110 text,
PRIMARY KEY (t11, t12)
) WITH CLUSTERING ORDER BY (t13 DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE INDEX t1_idx ON demo.t1 (t14);
CREATE INDEX t1_deleted_idx ON demo.t2 (t15);
When I try to drop the keyspace using this code:
Session session = cluster.connect();
PreparedStatement prepared = session.prepare("drop keyspace if exists " + schemaName);
BoundStatement bound = prepared.bind();
session.execute(bound);
Then the query gets timed out (or takes over 10 seconds to execute), even when the tables are empty:
com.datastax.driver.core.exceptions.OperationTimedOutException: [/192.168.0.1:9042] Timed out waiting for server response
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:44)
at com.datastax.driver.core.exceptions.OperationTimedOutException.copy(OperationTimedOutException.java:26)
at com.datastax.driver.core.DriverThrowables.propagateCause(DriverThrowables.java:37)
at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:245)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:64)
at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:39)
I tried this on multiple machines and the result was the same. I'm using Cassandra 3.9. A similar thing happens using cqlsh. I know that I can increase the read timeout in the cassandra.yaml file, but how can I make this dropping faster? Another thing is that if I do two consecutive requests, the first one gets timed out and the second one goes through fast.
Try to run it with an increased timeout:
cqlsh --request-timeout=3600 (in seconds; the default is 10 seconds)
There should also be an equivalent setting at the driver level. Review the timeout section in this link:
http://docs.datastax.com/en/developer/java-driver/3.1/manual/socket_options/
Increasing the timeout just hides the issue and is usually a bad idea. Have a look at this answer: https://stackoverflow.com/a/16618888/7413631

What is the difference between Varchar and text type in Cassandra CQL

What is the difference between the varchar and text data types in Cassandra CQL?
https://docs.datastax.com/en/cql/3.0/cql/cql_reference/cql_data_types_c.html
When I try to create a table with a column of type varchar, the column is created with the text data type:
CREATE TABLE test ( empID int, first_name varchar, last_name varchar, PRIMARY KEY (empID) );
DESC TABLE test gives me the result below.
CREATE TABLE test (
empid int PRIMARY KEY,
first_name text,
last_name text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
[cqlsh 5.0.1 | Cassandra 3.0.7.1158 | DSE 5.0.0 | CQL spec 3.4.0 | Native protocol v4]
According to the CQL data types documentation, both text and varchar are UTF-8 encoded strings.
So the two are one and the same.
