From a Cassandra table, how do I get records based on the date portion of a timestamp column? The table details are:
CREATE TABLE hlragent_logs.hlragent_logs_2021 (
msisdn text,
date_time timestamp,
cmd_no text,
agent_id text,
cmd_executed text,
dummy text,
id bigint,
imsi text,
mml_cmd text,
module text,
node text,
node_id text,
node_ip text,
p text,
pno text,
serial text,
vhlr_name text,
PRIMARY KEY (msisdn, date_time, cmd_no)
) WITH CLUSTERING ORDER BY (date_time ASC, cmd_no ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
CREATE INDEX indx_agentlogs_2021 ON hlragent_logs.hlragent_logs_2021 (imsi);
select * from hlragent_logs_2021 where todate(date_time)="2021-08-10" allow filtering;
SyntaxException: line 1:45 no viable alternative at input '(' (select * from hlragent_logs_2021 where todate
You can't use functions (such as the native toDate()) in the WHERE clause (there is a Jira ticket for it, but I don't remember anyone working on it).
It isn't necessary to use the toDate() function to work on timestamp columns anyway. You can filter directly on the column with:
SELECT * FROM ... WHERE ... AND date_time < '2021-08-10';
Note that an equality operator (=) on a bare date rarely does what you want: the CQL timestamp data type is encoded as the number of milliseconds since the Unix epoch (Jan 1, 1970 00:00 GMT), so a value like '2021-08-10' matches only that single instant, and you need to be precise when you're working with timestamps.
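For illustration, here is a minimal sketch against your table (the msisdn value is made up): this equality filter only returns rows whose date_time is exactly midnight UTC on that date, not the whole day.
SELECT * FROM hlragent_logs_2021
WHERE msisdn = '27831234567'
  AND date_time = '2021-08-10 00:00:00+0000';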
Depending on where you're running the query, the filter could be translated in the local timezone. Let me illustrate with this example table:
CREATE TABLE community.tstamptbl (
id int,
tstamp timestamp,
PRIMARY KEY (id, tstamp)
);
These 2 statements may appear similar but translate to 2 different entries:
INSERT INTO tstamptbl (id, tstamp) VALUES (5, '2021-08-09');
INSERT INTO tstamptbl (id, tstamp) VALUES (5, '2021-08-09 +0000');
The first statement creates an entry with a timestamp in my local timezone (Melbourne, Australia) while the second statement creates an entry with a timestamp in UTC (+0000):
cqlsh:community> SELECT * FROM tstamptbl WHERE id = 5;
id | tstamp
----+---------------------------------
5 | 2021-08-08 14:00:00.000000+0000
5 | 2021-08-09 00:00:00.000000+0000
Similarly, you need to be precise when reading the data. You need to specify the timezone to remove ambiguity. Here are some examples:
SELECT * FROM tstamptbl WHERE id = 5 AND tstamp < '2021-08-09 +0000';
SELECT * FROM tstamptbl WHERE id = 5 AND tstamp < '2021-08-10 12:00+0000';
SELECT * FROM tstamptbl WHERE id = 5 AND tstamp < '2021-08-08 12:34:56+0000';
Again, since timestamps are encoded in milliseconds (instead of days), there is a whole range of possible values for a given date. If I want to retrieve all rows for the date 2021-08-09, I need to filter on a range as in this example:
SELECT * FROM tstamptbl
WHERE id = 5
AND tstamp >= '2021-08-09 +0000'
AND tstamp < '2021-08-10 +0000';
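Applied to your table (again with a made-up msisdn value), the whole-day filter looks like this; because the partition key is restricted, ALLOW FILTERING isn't needed:
SELECT * FROM hlragent_logs_2021
WHERE msisdn = '27831234567'
  AND date_time >= '2021-08-10 +0000'
  AND date_time < '2021-08-11 +0000';
If you need all msisdns for a given date, you would still need ALLOW FILTERING (or a different table design with the date as part of the partition key).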
Cheers!
Related
I have a table susbcriber, which will contain millions of rows.
The table schema in Cassandra is as below:
CREATE TABLE susbcriber (
id int PRIMARY KEY,
age_identifier text,
alternate_mobile_identifier text,
android_identifier text,
batch_id text,
circle text,
city_identifier text,
country text,
country_identifier text,
created_at text,
deleted_at text,
email_identifier text,
gender_identifier text,
ios_identifier text,
list_master_id int,
list_subscriber_id text,
mobile_identifier text,
operator text,
partition_id text,
raw_data map<text, text>,
region_identifier text,
unique_identifier text,
updated_at text,
web_push_identifier text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I mostly have to run filter queries on the raw_data map<text, text> column, which contains JSON keys and values. How can I model the data so that SELECT and UPDATE are fast?
I am trying to achieve some bulk update operations.
Any suggestion is highly appreciated.
Yes, you can.
A map is used to store dynamic data in a table.
You can have an index based on the keys, entries, or values of a map.
There are three options, shown below:
if your use case is to search the keys of the dynamic data, use the first;
if you want to search on the value of a known key in the map, use the second;
if you don't know the keys and just want to search the values in the map, use the third.
CREATE INDEX idx_first ON <keyspaceName.tableName> (KEYS(<mapColumn>));
CREATE INDEX idx_second ON <keyspaceName.tableName> (ENTRIES(<mapColumn>));
CREATE INDEX idx_third ON <keyspaceName.tableName> (VALUES(<mapColumn>));
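As a rough sketch of what each index enables on the susbcriber table (the 'city' key and 'Mumbai' value are invented for illustration):
-- with the KEYS index: rows whose map contains a given key
SELECT * FROM susbcriber WHERE raw_data CONTAINS KEY 'city';
-- with the ENTRIES index: rows where a known key has a given value
SELECT * FROM susbcriber WHERE raw_data['city'] = 'Mumbai';
-- with the VALUES index: rows whose map contains a given value under any key
SELECT * FROM susbcriber WHERE raw_data CONTAINS 'Mumbai';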
If the data is already in the map, you don't really need to keep the values in their own columns as well, and if it's just a key to a map, it's easier on Cassandra to represent it as a clustering key instead of a collection, like:
CREATE TABLE susbcriber_data (
id int,
key text,
value text,
PRIMARY KEY((id), key))
Then you can query by any id and key. If you are looking for rows where a specific key has a particular value, then:
CREATE TABLE susbcriber_data_by_value (
id int,
shard int,
key text,
value text,
PRIMARY KEY((key, shard), value, id))
Then when you insert, you set shard to be id % 12 or some value such that your partitions do not get too large (this needs some guessing based on expected load). Then, to see all the ids where key = value, you query all 12 of those shards (an async call to each, merging the results), as sketched below. If the cardinality of your key/value pairs is low enough, the shard might be unnecessary. You then have a list of the ids which you can look up. If you want to avoid the lookup, you can add the additional keys and values to that table, but your data may explode quite a bit depending on the number of keys you have in your map, and keeping everything updated will be painful.
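A minimal sketch of that pattern, with made-up key/value data and 12 shards (the shard is computed client-side as id % 12):
-- id = 1234, so shard = 1234 % 12 = 10
INSERT INTO susbcriber_data_by_value (key, shard, value, id)
VALUES ('city', 10, 'Mumbai', 1234);
-- to find every id where key = 'city' and value = 'Mumbai',
-- query each shard (ideally asynchronously) and merge the results:
SELECT id FROM susbcriber_data_by_value
WHERE key = 'city' AND shard = 0 AND value = 'Mumbai';
-- ... repeat for shard = 1 through 11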
An option that I would not recommend, but which is available, is to index the map, i.e.:
CREATE INDEX raw_data_idx ON susbcriber ( ENTRIES (raw_data) );
SELECT * FROM susbcriber WHERE raw_data['ios_identifier'] = 'id';
Keeping in mind the issues with secondary indexes.
I'm learning Cassandra, started off with v3.8. My sample keyspace/table looks like this
CREATE TABLE digital.usage (
provider decimal,
deviceid text,
date text,
hours varint,
app text,
flat text,
usage decimal,
PRIMARY KEY ((provider, deviceid), date, hours)
) WITH CLUSTERING ORDER BY (date ASC, hours ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
I'm using a composite PRIMARY KEY with provider and deviceid as the partition key, so that uniqueness and distribution across the cluster nodes are taken care of. The clustering keys are date and hours.
I have a few observations:
1) For a PRIMARY KEY((provider, deviceid), date, hours), when inserting multiple entries with the same hours value, only the latest is kept and the previous ones disappear.
2) For a PRIMARY KEY((provider, deviceid), date), when inserting multiple entries with the same date value, only the latest is kept and the previous ones disappear.
Though I'm happy with the above (point 1) behaviour, I want to know what's happening in the background. Do I need to understand more about the clustering keys?
A PRIMARY KEY is meant to be unique.
Most RDBMSs throw an error if you insert a duplicate value into a PRIMARY KEY.
Cassandra does not do a read before write. It creates a new version of the record with the latest timestamp. When you insert data with the same values for the primary key columns, new data is written with a later timestamp, and when querying (SELECT), only the record with the latest timestamp is returned.
Example:
PRIMARY KEY((provider, deviceid), date, hours)
INSERT INTO digital.usage (provider, deviceid, date, hours, app, flat) VALUES (1.0, 'a', '2017-07-27', 1, 'test', 'test');
---- This will create a new record with, let's say, timestamp 1
INSERT INTO digital.usage (provider, deviceid, date, hours, app, flat) VALUES (1.0, 'a', '2017-07-27', 1, 'test1', 'test1');
---- This will create a new record with, let's say, timestamp 2
SELECT app,flat FROM digital.usage WHERE provider=1.0 AND deviceid='a' AND date='2017-07-27' AND hours=1
Will give
| app   | flat  |
|-------|-------|
| test1 | test1 |
I tried to use the CQL Python driver to insert 100k rows:
# no_of_rows = 100k
for row in range(no_of_rows):
    session.execute("INSERT INTO test_table (key1, key2, key3) VALUES ('test', 'test', 'test')")
but only one row is inserted into test_table (checking with the Cassandra CQL shell and select * from test_table). How do I fix this?
UPDATE
If I tried
for row in range(no_of_rows):
    session.execute("INSERT INTO test_table (key1, key2, key3) VALUES ('test' + str(row), 'test', 'test')")
no rows were inserted. Here key1 is the primary key.
Output of describe test_table:
CREATE TABLE test_keyspace.test_table (
key1 text PRIMARY KEY,
key2 text,
key3 text
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Cassandra primary keys are unique. 100000 in-place writes to the same key(s) leaves you with 1 row.
Which means if your primary key structure is PRIMARY KEY(key1,key2,key3) and you INSERT 'test','test','test' 100000 times...
...it'll write 'test','test','test' to the same partition 100000 times.
To get your Python code to work, I made some adjustments, such as creating a separate variable for the key (key1) and using a prepared statement:
pStatement = session.prepare("""
INSERT INTO test_table (key1, key2, key3) VALUES (?, ?, ?);
""")
no_of_rows = 100000
for row in range(no_of_rows):
    key = 'test' + str(row)
    session.execute(pStatement, [key, 'test', 'test'])
using Cassandra CQL Shell and select * from test_table
I feel compelled to mention that both multi-key queries (querying for more than one partition key at a time) and unbound queries (SELECTs without a WHERE clause) are definite anti-patterns in Cassandra. They may appear to work fine in a dev/test environment. But when you get to a production-scale cluster with dozens of nodes, these types of queries introduce a lot of network time into the equation, as they have to scan each node to compile the query results.
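As a rough illustration of the difference (the key value below is just one of those written by the corrected loop above):
-- unbound query: the coordinator has to scan every node
SELECT * FROM test_table;
-- partition-restricted query: the coordinator knows exactly which replicas hold the row
SELECT key1, key2, key3 FROM test_table WHERE key1 = 'test42';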
Your new code has a bug in string concatenation. It should be:
for row in range(no_of_rows):
session.execute("INSERT INTO test_table (key1, key2, key3) VALUES ('test" + str(row) + "', 'test', 'test')")
I am running some Cassandra queries. When I run
select * from logtable;
I get this error:
<ErrorMessage code=0000 [Server error] message="java.lang.ArrayIndexOutOfBoundsException">
However, if I run with limits, some rows are OK:
select * from logtable limit 100;
This works. I keep increasing the limit and eventually, I get the error again. It's obvious that our software adds some corrupted data. My question is this: is there a way to find out what is happening using cqlsh? I can't study the code, because it's extremely messed up, it's a nightmare. I couldn't find anything useful in system.log.
This is the table:
cqlsh:mykeyspace> desc table logtable
CREATE TABLE mykeyspace.logtable (
key text,
key2 text,
column1 text,
column2 text,
column3 text,
column4 text,
value blob,
PRIMARY KEY ((key, key2), column1, column2, column3, column4)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC, column2 ASC, column3 ASC, column4 ASC)
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = 'NONE';
Thanks.
Regards,
Serban
I have the following column family in Cassandra:
CREATE TABLE tese.airplane_max_delay (
airplane_manufacturer text,
airplane_model text,
year int,
tailnumber text,
airplane_engine text,
airtime int,
arrdelay int,
cod uuid,
depdelay int,
description_carrier text,
distance int,
month int,
uniquecarrier text,
PRIMARY KEY ((airplane_manufacturer, airplane_model), year, tailnumber)) WITH CLUSTERING ORDER BY (year DESC, tailnumber DESC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
I imported a CSV file with 10 million records into this column family using the COPY command.
When I try to query the data, only 6,629 rows are returned. The query should return 10 million rows.
Why does this happen? How can I change this?