I have all the keyspaces and tables copied from another Cassandra data folder. How can I restore them on my Cassandra node?
I don't have the snapshots that are normally required for a restore.
You might be able to do this with the Cassandra Bulk Loader.
Assuming a packaged install (with default data and bin locations), try this from one of your nodes:
$ sstableloader -d hostname1,hostname2 /var/lib/cassandra/data/yourKeyspaceName/tableName/
Check out the documentation on the Bulk Loader for more details.
You can do this but:
1. You need to know the schema for all the tables you are restoring. If you don't know it, use sstable2json (example below, but this can be tricky and requires understanding how sstable2json formats things).
2. You will have to start a new node, create the keyspace and its tables using the schema derived in step 1, and then use the Bulk Loader as described in the docs referenced by Aaron (BryceAtNetwork23).
Example of retrieving a schema (an offline process) using sstable2json. This example assumes your keyspace is named test and the table is named example1:
sstable2json /var/lib/cassandra/data/test/example1-55639910d46a11e4b4335dbb0aaeeb24/test-example1-ka-1-Data.db
// output:
WARN 10:25:34 JNA link failure, one or more native method will be unavailable.
[
{"key": "7d700500-d46b-11e4-b433-5dbb0aaeeb24", <-- key = bytes of what is in the PRIMARY KEY()
"cells": [["coolguy:","",1427451885901681], <-- cql3 row marker (empty cell that tells us table was created using cql3)
["coolguy:age","29",1427451885901681], <-- age
["coolguy:email:_","coolguy:email:!",1427451885901680,"t",1427451885], <-- collection cell marker
["coolguy:email:6367406d61696c2e6e6574","",1427451885901681], <-- first entry in collection
["coolguy:email:636f6f6c677579383540676d61696c2e636f6d","",1427451885901681], <-- second entry in collection
["coolguy:password","xQajKe2fa?af",1427451885901681]]}, <-- another text field for password
{"key": "52641f40-d46b-11e4-b433-5dbb0aaeeb24",
"cells": [["lyubent:","",1427451813663728],
["lyubent:age","109",1427451813663728],
["lyubent:email:_","lyubent:email:!",1427451813663727,"t",1427451813],
["lyubent:email:66616b65406162762e6267","",1427451813663728],
["lyubent:email:66616b6540676d61696c2e636f6d","",1427451813663728],
["lyubent:password","password",1427451813663728]]}
]
The above equates to:
CREATE TABLE test.example1 (
id timeuuid,
username text,
age int,
email set<text>,
password text,
PRIMARY KEY (id, username)
) WITH CLUSTERING ORDER BY (username ASC)
// the below are settings that you have no way of knowing,
// unless you are hardcore enough to start digging through
// system tables with the debug tool, but this is beyond
// the scope of the question.
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
You can see clearly that the names of the key columns (id and username) get lost in the translation, because they form the key. You can still tell that there is a compound key from the fact that every cell name has a section ending in : prepended to it; in the two entries above these are coolguy: and lyubent:. Going on this, you know that the key is of the form PRIMARY KEY(something ?, username text). If you're lucky, your primary key will be simple and working out the schema from it will be straightforward. If not, post it here and we'll see how far we can get.
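The reading of the sstable2json output described above can be sketched in a few lines of Python. This is only an illustration, not part of any Cassandra tooling: the `summarize` helper is hypothetical, it assumes the clustering value itself contains no `:` character, and it assumes the collection elements are UTF-8 text (as they are for a set<text>). The input is one partition copied from the output above.

```python
import json

# One partition copied from the sstable2json output shown above.
RAW = '''
[{"key": "7d700500-d46b-11e4-b433-5dbb0aaeeb24",
  "cells": [["coolguy:", "", 1427451885901681],
            ["coolguy:age", "29", 1427451885901681],
            ["coolguy:email:_", "coolguy:email:!", 1427451885901680, "t", 1427451885],
            ["coolguy:email:6367406d61696c2e6e6574", "", 1427451885901681],
            ["coolguy:email:636f6f6c677579383540676d61696c2e636f6d", "", 1427451885901681],
            ["coolguy:password", "xQajKe2fa?af", 1427451885901681]]}]
'''


def summarize(partition):
    """Split cell names into clustering values, regular columns and collections.

    Hypothetical helper: assumes a single clustering column whose values
    contain no ':' and collections whose elements are UTF-8 text.
    """
    clustering_values = set()
    regular_columns = set()
    collections = {}
    for cell in partition["cells"]:
        parts = cell[0].split(":")
        clustering_values.add(parts[0])
        column = parts[1] if len(parts) > 1 else ""
        if column == "":
            continue  # CQL3 row marker, carries no column name
        if len(parts) == 2:
            regular_columns.add(column)
        elif parts[2] == "_":
            collections.setdefault(column, [])  # collection marker cell
        else:
            # Collection elements are hex-encoded inside the cell name.
            element = bytes.fromhex(parts[2]).decode("utf-8")
            collections.setdefault(column, []).append(element)
    return sorted(clustering_values), sorted(regular_columns), collections


for partition in json.loads(RAW):
    print(summarize(partition))
```

Running this on the first entry shows one clustering value (coolguy), two regular columns (age and password) and one collection column (email) with its decoded elements, which matches the CREATE TABLE statement above apart from the key column names.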
I have the following table definition:
CREATE TABLE snap_websites.backend (
key blob,
column1 blob,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.100000001
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = 'backend'
AND compaction = {'class': 'org.apache.cassandra.db.compaction.DateTieredCompactionStrategy', 'max_threshold': '10', 'min_threshold': '4', 'tombstone_threshold': '0.02'}
AND compression = {'enabled': 'false'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 3600
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 3600000
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
Looking at the compaction setup, it seems that the table should get compacted once in a while. However, after about 2 years the table was really slow on a SELECT and I could see 12,281 files in the corresponding data folder! I only checked one node; I would imagine that all the nodes had similar piles of files.
Why does that happen? Could it be because I never give Cassandra a break, so it is never really given time to run the compaction? (I pretty much always have some process running against that table, but I did not expect things to get this bad!)
The command line worked well:
nodetool compact snap_websites backend
and the number of files went all the way down to 9 (after all, I have just 2 lines of data in that table at the moment!)
What I really need to know is: what is preventing Cassandra from running the compaction process?
I don't remember much about DTCS, but if you can, I'd consider replacing it with TWCS, which works well for time series data (DTCS was mentioned to be going away in the near future).
I am dealing with a puzzling behaviour when doing SELECTs on Cassandra 2.2.3. I have 4 nodes in the ring, and I create the following keyspace, table and index.
CREATE KEYSPACE IF NOT EXISTS my_keyspace
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE my_keyspace.my_table (
id text,
some_text text,
code text,
some_set set<int>,
a_float float,
name text,
type int,
a_double double,
another_set set<int>,
another_float float,
yet_another_set set<text>,
PRIMARY KEY (id, some_text, code)
) WITH read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
CREATE INDEX idx_my_table_code ON my_keyspace.my_table (code);
Then I insert some rows on the table. Some of them have empty sets. I perform this query through the default CQL client and get the row I am expecting:
SELECT * FROM my_table WHERE code = 'test';
Then I run some tests which are outside my control. I don't know what they do but I expect they read and possibly insert/update/delete some rows. I'm sure they don't delete or change any of the settings in the index, table or keyspace.
After the tests, I log in again through the default CQL client and run the following queries.
SELECT * FROM my_table WHERE code = 'test';
SELECT * FROM my_table;
SELECT * FROM my_table WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
The first one doesn't return anything.
The second one returns all the rows, including the one with code = 'test'.
The third one returns the expected row that the first query couldn't retrieve.
The only difference that I can see between this row and others is that it is one of the rows which contains some empty sets, as explained earlier. If I query for another of the rows that also contain some empty sets, I get the same behavior.
I would say the problem is related to the secondary index. Somehow, the operations performed during the tests leave the index in a state where it cannot see certain rows.
I'm obviously missing something. Do you have any ideas about what could cause this behavior?
Thanks in advance.
UPDATE:
I worked around the issue, but now I found the same problem somewhere else. Since the issue first happened, I found out more about the operations performed before the error: updates on specific columns that set a TTL for said columns. After some investigation I found some Jira issues which could be related to this problem:
https://issues.apache.org/jira/browse/CASSANDRA-6782
https://issues.apache.org/jira/browse/CASSANDRA-8206
However, those issues seem to have been solved on 2.0 and 2.1, and I'm using 2.2. I think these changes are included in 2.2, but I could be mistaken.
The main problem is the type of query you are running on Cassandra.
The Cassandra data model is query driven: tables are designed up front to serve specific queries.
Tables are created with a well-defined primary key (partition key and clustering key). Cassandra is not good at full-table-scan queries.
Now coming to your queries.
SELECT * FROM my_table WHERE code = 'test';
Here the column used is a clustering column. Since it is the column in the equality search, it should be part of the partition key. The same clustering key value can be present in different partitions, so if the read consistency level is ONE the query may return an empty result.
SELECT * FROM my_table;
Cassandra is not good at this kind of table-scan query. Here it will scan the whole table and return all the rows (poor querying).
SELECT * FROM my_table
WHERE id = 'my_id' AND some_text = 'whatever' AND code = 'test';
Here you mentioned everything so the correct results were returned.
I opened a Jira issue and the problem was fixed on 2.1.18 and 2.2.10:
https://issues.apache.org/jira/browse/CASSANDRA-13412
I speak just from what I read in the Jira issue. I didn't test the above scenario again after the fix was implemented because by then I had moved to the 3.0 version.
In the end though I ended up removing almost every use of secondary indices in my application, as I learned that they led to bad performance.
The reason is that in most cases they will result in fan-out queries that will contact every node of the cluster, with the corresponding costs.
There are still some cases where they can perform well, e.g. when you query by partition key at the same time, as no other nodes will be involved.
But for anything else, my advice is: consider if you can remove your secondary indices and do lookups in auxiliary tables instead. You'll have the burden of maintaining the tables in sync, but performance should be better.
In my Spark job I am reading data from Cassandra using the Java Cassandra utilities. My query looks like this:
JavaRDD<CassandraRow> cassandraRDD = functions.cassandraTable("keyspace", "column_family")
        .select("timeline_id", "shopper_id", "product_id")
        .where("action=?", "Viewed");
The row key is the action column. When I run my Spark job with this filter it causes excessive CPU utilisation, but when I remove the filter on the action column it works fine.
Please find below the create table script for the column family-
CREATE TABLE keyspace.column_family (
action text,
timeline_id timeuuid,
shopper_id text,
product_id text,
publisher_id text,
referer text,
remote_ip text,
seed_product text,
strategy text,
user_agent text,
PRIMARY KEY (action, timeline_id, shopper_id)
) WITH CLUSTERING ORDER BY (timeline_id DESC, shopper_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
What I suspect is that, since action is the row key, all the data is served from a single node (a hot spot), which is why that node's CPU usage shoots up. Also, only a single RDD partition is created in the Spark job when reading. Any help will be appreciated.
OK, you have a data model issue here. action is the partition key, so all rows with the same action are stored in a single partition (one node plus its replicas).
How many distinct actions do you have in total? Your intuition about having a hot spot is justified.
You probably need a different partition key, or you need to add an extra column to the partition key, so that Cassandra can distribute the data evenly across the cluster.
Read this blog post : http://www.planetcassandra.org/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key/
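One common way to "add an extra column to the partition key" is a synthetic bucket column. The sketch below is only an illustration under stated assumptions: the `bucket` column, the bucket count of 16 and both helper functions are hypothetical and not part of the original schema. With it, the primary key would become something like PRIMARY KEY ((action, bucket), timeline_id, shopper_id).

```python
import zlib

NUM_BUCKETS = 16  # assumption: tune to cluster size and data volume


def bucket_for(shopper_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a shopper_id to a stable bucket number.

    crc32 is used instead of Python's built-in hash() because hash() of a
    string changes between interpreter runs, while the bucket must not.
    """
    return zlib.crc32(shopper_id.encode("utf-8")) % num_buckets


def partition_keys_for(action: str, num_buckets: int = NUM_BUCKETS):
    """List every (action, bucket) partition a reader has to query for one action."""
    return [(action, bucket) for bucket in range(num_buckets)]


# A writer stores each row under (action, bucket_for(shopper_id)).
print(bucket_for("shopper-42"))
# A reader fans out over all buckets for the action it is interested in.
print(partition_keys_for("Viewed")[:3])
```

The trade-off is that a read for one action now touches several partitions, but those partitions can live on different nodes, so the load is no longer concentrated on one replica set.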
I have a problem with the cassandra db and hope somebody can help me. I have a table “log”. In the log table, I have inserted about 10000 rows. Everything works fine. I can do a
select * from
select count(*) from
As soon as I insert 100000 rows with TTL 50, I receive an error with
select count(*) from
Version: cassandra 2.1.8, 2 nodes
Cassandra timeout during read query at consistency ONE (1 responses
were required but only 0 replica responded)
Does anyone have an idea what I am doing wrong?
CREATE TABLE test.log (
day text,
date timestamp,
ip text,
iid int,
request text,
src text,
tid int,
txt text,
PRIMARY KEY (day, date, ip)
) WITH read_repair_chance = 0.0
AND dclocal_read_repair_chance = 0.1
AND gc_grace_seconds = 864000
AND bloom_filter_fp_chance = 0.01
AND caching = { 'keys' : 'ALL', 'rows_per_partition' : 'NONE' }
AND comment = ''
AND compaction = { 'class' : 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy' }
AND compression = { 'sstable_compression' : 'org.apache.cassandra.io.compress.LZ4Compressor' }
AND default_time_to_live = 0
AND speculative_retry = '99.0PERCENTILE'
AND min_index_interval = 128
AND max_index_interval = 2048;
That error message indicates a problem with the read operation, most likely a read timeout. You may need to set a larger read timeout in your cassandra.yaml, as described in this SO answer.
Example for 200 seconds:
read_request_timeout_in_ms: 200000
If updating that does not work you may need to tweak the JVM settings for Cassandra. See DataStax's "Tuning Java Ops" for more information
count(*) is a very costly operation: Cassandra needs to scan all the rows on all the nodes just to give you the count. It works on a small number of rows, but on larger data sets you should use another approach to avoid the timeout.
First of all, forget about count(*) and retrieve the rows themselves in order to count them.
Run several (dozens, hundreds?) queries that filter by partition key and clustering key, and sum the number of rows returned by each query.
Here is a good explanation of what clustering and partition keys are. In your case day is the partition key, and the clustering key consists of two columns: date and ip.
It is most likely impossible to do this with the cqlsh command-line client, so you should write a script yourself. Official drivers for popular programming languages: http://docs.datastax.com/en/developer/driver-matrix/doc/common/driverMatrix.html
Example of one of such queries:
select day, date, ip, iid, request, src, tid, txt from test.log where day='Saturday' and date='2017-08-12 00:00:00' and ip='127.0.0.1'
Remarks:
If you only need the count and nothing more, it probably makes sense to look for a tool like https://github.com/brianmhess/cassandra-count
If Cassandra refuses to run your query without ALLOW FILTERING, that means the query is not efficient: https://stackoverflow.com/a/38350839/2900229
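The per-partition counting approach described above can be sketched as follows. This is a minimal illustration, not a drop-in script: `count_rows` and `fetch_rows` are hypothetical names, and `fetch_rows` stands in for whatever driver call you use to run a query such as SELECT date FROM test.log WHERE day = ? and page through its results. A small in-memory dictionary replaces the cluster so the sketch runs on its own.

```python
from typing import Callable, Iterable


def count_rows(days: Iterable[str], fetch_rows: Callable[[str], Iterable]) -> int:
    """Sum row counts one partition at a time instead of one global count(*)."""
    total = 0
    for day in days:
        # Each call is expected to read a single partition (one value of day),
        # so no individual query has to scan the whole table.
        total += sum(1 for _ in fetch_rows(day))
    return total


# Stand-in for a real driver call so the sketch runs without a cluster.
fake_table = {"2017-08-12": ["r1", "r2", "r3"], "2017-08-13": ["r4"]}
print(count_rows(fake_table, lambda day: fake_table.get(day, [])))  # prints 4
```

The caller has to know (or enumerate) the partition key values in advance, which is the price of avoiding the full scan.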
I am receiving an OperationTimedOut error while running an ALTER TABLE command in cqlsh. How is that possible? Since this is just a table metadata update, shouldn't this operation run almost instantaneously?
Specifically, this is an excerpt from my cqlsh session
cqlsh:metric> alter table metric with gc_grace_seconds = 86400;
OperationTimedOut: errors={}, last_host=sandbox73vm230
The metric table currently has a gc_grace_seconds of 864000. I am seeing this behavior in a 2-node cluster and in a 6-node 2-datacenter cluster. My nodes seem to be communicating fine in general (e.g. I can insert in one and read from the other). Here is the full table definition (a cyanite 0.1.3 schema with DateTieredCompactionStrategy, clustering and caching changes):
CREATE TABLE metric.metric (
tenant text,
period int,
rollup int,
path text,
time bigint,
data list<double>,
PRIMARY KEY ((tenant, period, rollup, path), time)
) WITH CLUSTERING ORDER BY (time ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'timestamp_resolution': 'SECONDS', 'class': 'org.apache.cassandra.db.compaction.DateTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.0
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = 'NONE';
I realize at this point the question is pretty old, and you may have either figured out the answer or otherwise moved on, but wanted to post this in case others stumbled upon it.
The default cqlsh request timeout is 10 seconds. You can adjust this by starting up cqlsh with the --request-timeout option set to some value that allows your ALTER TABLE to run to completion, e.g.:
cqlsh --request-timeout=1000000