I've got 3 nodes in 2 datacenters. On each node I'm using the same cqlshrc file, containing just the following lines:
[connection]
request_timeout = 3600
The replication strategy is as follows:
{'class': 'NetworkTopologyStrategy', 'dc1': 2, 'dc2': 1}
I've inserted more than 100,000 rows into the database. However, when I execute
select count(*) from table;
I get "operation timed out" on 2 of the 3 nodes, i.e. the query succeeds on only one node.
Why is the query unsuccessful on 2 out of 3 nodes despite each having the same cqlshrc file?
OS: RHEL 6
Cassandra: 3.0.14
In Cassandra, count(*) is a very costly operation: it needs to scan all the rows on all the nodes just to give you the count, and it can easily generate a timeout exception.
So instead of using count(*), maintain a counter table, for example:
CREATE TABLE page_view_counts (
counter_value counter,
url_name varchar,
page_name varchar,
PRIMARY KEY (url_name, page_name)
);
Whenever a new row is inserted into the base table, increment the counter by one:
UPDATE page_view_counts
SET counter_value = counter_value + 1
WHERE url_name = 'stackoverflow.com' AND page_name = 'questions';
Now, to get the count, just query it like below:
SELECT * FROM page_view_counts
WHERE url_name = 'stackoverflow.com' AND page_name = 'questions';
I have two Athena tables, 1 and 2. Table 1 is partitioned, table 2 is not. When I create table 3 from the result of joining 1 and 2 on a shared field, the partitioning of table 1 isn't propagated.
I know it's possible to do CTAS queries with partitions, but that requires the partition to be an existing column.
Is there a way to keep the partition in table 1 when creating table 3, something like this:
CREATE TABLE table_3
WITH (
format='PARQUET',
partitioned_by = ARRAY['existing_partition_in_table_1']
) AS
SELECT table_1.field
FROM table_1
JOIN table_2
ON table_1.field = table_2.field
Figured it out five minutes later... I just need to select the partition column from table 1 as well; then the CTAS statement can access the partition:
CREATE TABLE table_3
WITH (
format='PARQUET',
partitioned_by = ARRAY['partition_name']
) AS
SELECT table_1.field, table_1.partition_name
FROM table_1
JOIN table_2
ON table_1.field = table_2.field
*facepalm
Using Java, can I scan a Cassandra table and just update the TTL of a row? I don't want to change any data; I just want to scan the table and set the TTL of a few rows.
Also, using Java, can I set a TTL that is absolute, for example (2016-11-22 00:00:00)? That is, I don't want to specify the TTL in seconds, but as an absolute point in time.
Cassandra doesn't allow you to set a TTL for a whole row; it allows setting TTLs on column values only.
In case you're wondering why you're experiencing row expiration: if all the values of all the columns of a record are TTLed, then the row disappears when you try to SELECT it.
However, this is only true if you perform an INSERT with USING TTL. If you INSERT without a TTL and then do an UPDATE with a TTL, you'll still see the row, but with null values. Here are a few examples and some gotchas:
Example with a TTLed INSERT only:
CREATE TABLE test (
k text PRIMARY KEY,
v int
);
INSERT INTO test (k,v) VALUES ('test', 1) USING TTL 10;
... 10 seconds after...
SELECT * FROM test ;
k | v
---------------+---------------
Example with a TTLed INSERT and a TTLed UPDATE:
INSERT INTO test (k,v) VALUES ('test', 1) USING TTL 10;
UPDATE test USING TTL 10 SET v=0 WHERE k='test';
... 10 seconds after...
SELECT * FROM test;
k | v
---------------+---------------
Example with a non-TTLed INSERT and a TTLed UPDATE:
INSERT INTO test (k,v) VALUES ('test', 1);
UPDATE test USING TTL 10 SET v=0 WHERE k='test';
... 10 seconds after...
SELECT * FROM test;
k | v
---------------+---------------
test | null
Now you can see that the only way to solve your problem is to rewrite all the values of all the columns of your row with a new TTL.
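For example, with the test table above, refreshing a row's TTL means re-writing the whole row (a minimal sketch; the 86400-second value is just an illustration):
INSERT INTO test (k, v) VALUES ('test', 1) USING TTL 86400;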
In addition, there's no way to specify an explicit expiration date, but you can derive a TTL value in seconds with simple math (as others suggested).
Have a look at the official documentation about data expiration. And don't forget to have a look at the DELETE section for updating TTLs.
HTH.
You can't update only the TTL of a row. You have to update or re-insert all the columns.
You can select all the regular columns along with the primary key columns, then update the regular columns (keyed by the primary key) or re-insert the row using a TTL in seconds.
In Java you can calculate the TTL in seconds from a date using the method below.
public static long ttlFromDate(Date ttlDate) throws Exception {
    // Remaining lifetime in seconds from now until the given expiration date
    long ttl = (ttlDate.getTime() - System.currentTimeMillis()) / 1000;
    if (ttl < 1) {
        // The expiration date is in the past (or less than a second away)
        throw new Exception("Invalid ttl date");
    }
    return ttl;
}
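For completeness, here is a minimal sketch of how the two pieces could fit together with the DataStax Java driver 3.x (the contact point, keyspace name, and bound values are illustrative assumptions, not part of the original answer):
import java.text.SimpleDateFormat;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Sketch: re-insert a row with a TTL derived from an absolute expiration date
Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_keyspace"); // assumed keyspace name
PreparedStatement ps = session.prepare(
        "INSERT INTO test (k, v) VALUES (?, ?) USING TTL ?");
long ttl = ttlFromDate(new SimpleDateFormat("yyyy-MM-dd").parse("2016-11-22"));
session.execute(ps.bind("test", 1, (int) ttl)); // the TTL bind marker expects an int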
Alternatively, you can set a TTL value on the entire table while creating it.
CREATE TABLE test (
k text PRIMARY KEY,
v int
) WITH default_time_to_live = 63113904;
The above example creates a table whose rows will disappear after 2 years (63,113,904 seconds).
select count(*) from my_table gives me OperationTimedOut: errors={}, last_host=127.0.0.1
I have already tried changing the values of request_timeout_in_ms in cassandra.yaml and request_timeout in cqlshrc.sample (both are in C:\Programs\DataStax-DDC\apache-cassandra\conf), but without success.
How can I increase the timeout?
select count(*) is not doing what you think. It is actually expensive, as it counts the rows one by one. You can track the number of records using a separate column family with a counter, which you will need to increment for every insert you do into your table. For example:
CREATE TABLE IF NOT EXISTS my_table_counter (
mykey text,
count counter,
PRIMARY KEY (mykey)
);
Then for every insert into your table, update the counter:
INSERT INTO my_table (mykey, mydata) VALUES (?, ?);
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;
To get the count:
SELECT count FROM my_table_counter WHERE mykey = ?;
Note that counters are not idempotent, so in the rare event of a failure your data might be under- or over-counted. Also, the code above assumes that you only insert with a new key.
If you need precise counting, Cassandra may not be a good fit. Also, if you are not inserting with unique keys, you may need to consider using a lightweight transaction: insert with IF NOT EXISTS and update the counter only if the transaction was applied, as sketched below.
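A rough sketch of that LWT variant, reusing the my_table/my_table_counter schema above (checking the [applied] flag happens in your client code):
INSERT INTO my_table (mykey, mydata) VALUES (?, ?) IF NOT EXISTS;
-- the result contains an [applied] column; increment only when it is true
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;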
When I try to execute the query below, I always get a QueryTimeOutException.
The exception is:
com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 0 replica responded)
The query is:
SELECT * FROM my_test.my_table WHERE key_1 = 101 ORDER BY key_2 ASC LIMIT 25;
I am using Cassandra version 2.1.0 with 3 nodes, a single DC with a replication factor of 3; cassandra.yaml has all default values, and I have the following keyspace and table schema:
CREATE KEYSPACE my_test
WITH REPLICATION = {
'class' : 'SimpleStrategy',
'replication_factor' : 3
};
CREATE TABLE my_test.my_table (
key_1 bigint,
key_2 bigint,
key_3 text,
key_4 text,
key_5 text,
key_6 text,
key_7 text,
key_8 text,
key_9 text,
key_10 text,
key_11 timestamp,
PRIMARY KEY (key_1, key_2)
);
Currently the table has around 39,000 records; initially it had 50,000, but 11,000 records were deleted for business-logic reasons.
One solution to avoid such exceptions is to increase the query read timeout, but my schema and query are quite direct, so why should I have to increase it?
Since I have given the partition key (key_1) in my query, it should reach the right node directly; after that I specified the start of the clustering key (key_2) range,
so it should return within a maximum of 2 seconds, but it does not. Yet the query below works fine and returns results in less than 1 second (the only difference is that ASC is not working and DESC is working):
SELECT * FROM my_test.my_table WHERE key_1 = 101 ORDER BY key_2 DESC LIMIT 25;
Also, per the schema, the clustering key's default order is ASC, so retrieving the data in ASC order should be faster than in DESC order according to the Cassandra documentation.
But it is the reverse in my case.
Some more clues: the following queries were tried through cqlsh.
The following query works and returns results in less than 1 second:
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 > 1 AND key_2 < 132645 LIMIT 1;
But the following query does not work and throws a timeout exception:
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 > 1 AND key_2 < 132646 LIMIT 1;
Yet the following queries work and return results in less than 1 second:
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 = 132644;
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 = 132645;
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 = 132646;
SELECT * FROM my_test.my_table WHERE key_1 = 101 AND key_2 = 132647;
Strange behaviour; any help would be appreciated.
For each key_1 there will be around 1,000,000 key_2 values.
And this is what happens when you take the 2 billion cells per partition limit and try to use all of it. I know I've answered plenty of posts here by acknowledging that there is a hard limit of 2 billion cells per partition, but your (very) wide row will become ungainly and will probably time out long before reaching that limit. This is what I believe you are seeing.
The solution here is a technique called "bucketing." Basically, you have to find an additional key to partition your data by. Too many CQL rows are being written to the same data partition, and bucketing would help bring the ratio of partition to clustering keys back to a sane level.
The logical way to go about bucketing is with a time element. I see your last column is a timestamp. I don't know how many rows each key_1 gets in a day, but let's say you only get a few thousand every month. In that case, I would create an additional partition key of month_bucket:
CREATE TABLE my_test.my_table (
key_1 bigint,
key_2 bigint,
...
key_11 timestamp,
month_bucket text,
PRIMARY KEY ((key_1, month_bucket), key_2)
);
That would allow you to support a query like this:
SELECT * FROM my_test.my_table
WHERE key_1 = 101 AND month_bucket = '201603'
AND key_2 > 1 AND key_2 < 132646 LIMIT 1;
Again, bucketing on month is just an example. But basically, you need to find an additional column to partition your data on.
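For instance, writes would then derive the bucket from the timestamp at insert time (a sketch with illustrative values, reusing the '201603' bucket from the query above):
INSERT INTO my_test.my_table (key_1, month_bucket, key_2, key_11)
VALUES (101, '201603', 132646, '2016-03-15 10:00:00+0000');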
The issue got resolved after restarting all 3 Cassandra servers. I don't know what exactly caused the trouble; since it is a production server, I wasn't able to determine the exact root cause.
CQL execution [returns instantly, presumably using the clustering key index]:
cqlsh:stats> select count(*) from events where month='2015-04' and day = '2015-04-02';
count
-------
5447
Presto execution [takes around 8 seconds]:
presto:default> select count(*) as c from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02';
c
------
5447
(1 row)
Query 20150228_171912_00102_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:08 [147K rows, 144KB] [17.6K rows/s, 17.2KB/s]
Why does Presto process 147K rows when Cassandra itself responds with just 5447 rows for the same query (I tried select * too)?
Why is Presto not able to use the clustering key optimization?
I tried all possible values, such as timestamp, date, and different date formats, with no effect on the number of rows being fetched.
CF Reference:
CREATE TABLE events (
month text,
day timestamp,
test_data text,
some_random_column text,
event_time timestamp,
PRIMARY KEY (month, day, event_time)
) WITH comment='Test Data'
AND read_repair_chance = 1.0;
Added event_time as a constraint too, in response to Dain's answer:
presto:default> select count(*) from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02 00:00:00+0000' and event_time = timestamp '2015-04-02 00:00:34+0000';
_col0
-------
1
(1 row)
Query 20150301_071417_00009_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:07 [147K rows, 144KB] [21.3K rows/s, 20.8KB/s]
The Presto engine will push down simple WHERE clauses like this to a connector (you can see this in the Hive connector), so the question is: why does the Cassandra connector not take advantage of this? To see why, we'll have to look at the code.
The pushdown system first interacts with connectors in the ConnectorSplitManager.getPartitions(ConnectorTableHandle, TupleDomain) method, so looking at the CassandraSplitManager, I see it is delegating the logic to getPartitionKeysSet. This method looks for a range constraint (e.g., x=33 or x BETWEEN 1 AND 10) for every column in the primary key, so in your case, you would need to add a constraint on event_time.
I don't know why the code insists on having a constraint on every column in the primary key, but I'd guess that it is a bug. It should be easy to tweak this code to remove that constraint.