Cassandra: create duplicate table with different primary key

I'm new to Apache Cassandra and have the following issue:
I have a table with PRIMARY KEY (userid, countrycode, carid). As described in many tutorials, this table can be queried using the following filter criteria:
userid = x
userid = x and countrycode = y
userid = x and countrycode = y and carid = z
This is fine for most cases, but now I need to query the table by filtering only on
userid = x and carid = z
Here, the documentation says that the best solution is to create another table with a modified primary key, in this case PRIMARY KEY (userid, carid, countrycode).
The question is: how do I copy the data from the "original" table to the new one with the different primary key?
On small tables
On huge tables
And another important question concerning the duplication of a huge table: What about the storage needed to save both tables instead of only one?

You can use the COPY command to export from one table and import into the other table.
From your example, I created two tables, user_country and user_car, with the respective primary keys.
CREATE KEYSPACE user WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 } ;
CREATE TABLE user.user_country ( user_id text, country_code text, car_id text, PRIMARY KEY (user_id, country_code, car_id));
CREATE TABLE user.user_car ( user_id text, country_code text, car_id text, PRIMARY KEY (user_id, car_id, country_code));
Let's insert some dummy data into one table.
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('1', 'IN', 'CAR1');
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('2', 'IN', 'CAR2');
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('3', 'IN', 'CAR3');
cqlsh> select * from user.user_country ;
user_id | country_code | car_id
---------+--------------+--------
3 | IN | CAR3
2 | IN | CAR2
1 | IN | CAR1
(3 rows)
Now we will export the data into a CSV. Note the order of the columns listed.
cqlsh> COPY user.user_country (user_id,car_id, country_code) TO 'export.csv';
Using 1 child processes
Starting copy of user.user_country with columns [user_id, car_id, country_code].
Processed: 3 rows; Rate: 4 rows/s; Avg. rate: 4 rows/s
3 rows exported to 1 files in 0.824 seconds.
export.csv can now be imported directly into the other table.
cqlsh> COPY user.user_car(user_id,car_id, country_code) FROM 'export.csv';
Using 1 child processes
Starting copy of user.user_car with columns [user_id, car_id, country_code].
Processed: 3 rows; Rate: 6 rows/s; Avg. rate: 8 rows/s
3 rows imported from 1 files in 0.359 seconds (0 skipped).
cqlsh>
cqlsh>
cqlsh> select * from user.user_car ;
user_id | car_id | country_code
---------+--------+--------------
3 | CAR3 | IN
2 | CAR2 | IN
1 | CAR1 | IN
(3 rows)
cqlsh>
About your other question: yes, the data will be duplicated, but that is how Cassandra is meant to be used. Denormalizing into one table per query pattern is the standard modelling approach, and the extra storage is the accepted trade-off. For very large tables, cqlsh COPY can be slow, and a bulk tool such as Spark with the Cassandra connector is commonly used instead.

Related

Cassandra clustering key uniqueness

In the book Cassandra: The Definitive Guide it is said that the combination of partition key and clustering key guarantees a unique record in the database... I understand that the partition key is the one that determines where the record is stored - the node where the record lives - and the clustering key is for the sorting of records within a partition. Can someone help me understand this?
Thanks, and sorry for the question...
A single partition key (without a clustering key) is a primary key, which has to be unique.
A partition key + clustering key combination has to be unique, but that doesn't mean either the partition key or the clustering key has to be unique on its own.
You can insert
(a,b) (first record)
(a,c) (same partition key with the first record)
(d,b) (same clustering key with the first record)
When you insert (a,b) again, it will update the non-primary-key values for the existing primary key (Cassandra inserts are upserts).
In the following example userid is partition key and date is clustering key.
cqlsh:play> CREATE TABLE example (userid int, date int, name text, PRIMARY KEY (userid, date));
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (1, 20200530, 'a');
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (1, 20200531, 'a');
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (2, 20200531, 'a');
cqlsh:play> SELECT * FROM example;
userid | date | name
--------+----------+------
1 | 20200530 | a
1 | 20200531 | a
2 | 20200531 | a
(3 rows)
cqlsh:play> INSERT INTO example (userid, date, name) VALUES (2, 20200531, 'b');
cqlsh:play> SELECT * FROM example;
userid | date | name
--------+----------+------
1 | 20200530 | a
1 | 20200531 | a
2 | 20200531 | b
(3 rows)
cqlsh:play>

Cassandra where clause as a tuple

Table12
CustomerId CampaignID
1 1
1 2
2 3
1 3
4 2
4 4
5 5
val CustomerToCampaign = ((1,1),(1,2),(2,3),(1,3),(4,2),(4,4),(5,5))
Is it possible to write a query like
select CustomerId, CampaignID from Table12 where (CustomerId, CampaignID) in (CustomerToCampaign_1, CustomerToCampaign_2)
???
So the input is a tuple, but the columns are not a tuple but rather individual columns.
Sure, it's possible, but only on the clustering keys. That means you need to use something else as a partition key, or a "bucket." For this example, I'll assume that marketing campaigns are time-sensitive and that we'll get good distribution and ease of querying by using the month as the bucket (partition).
CREATE TABLE stackoverflow.customertocampaign (
campaign_month int,
customer_id int,
campaign_id int,
customer_name text,
PRIMARY KEY (campaign_month, customer_id, campaign_id)
);
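For reference, loading the pairs from the CustomerToCampaign variable could look like this (the campaign_month value 202004 and the customer names are my assumptions; they weren't in the original data):
INSERT INTO stackoverflow.customertocampaign (campaign_month, customer_id, campaign_id, customer_name) VALUES (202004, 1, 1, 'Alice');
INSERT INTO stackoverflow.customertocampaign (campaign_month, customer_id, campaign_id, customer_name) VALUES (202004, 1, 2, 'Alice');
-- ...and similarly for the remaining (customer_id, campaign_id) pairs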
Now, I can INSERT the data described in your CustomerToCampaign variable. Then, this query works:
aploetz#cqlsh:stackoverflow> SELECT campaign_month, customer_id, campaign_id
FROM customertocampaign WHERE campaign_month=202004
AND (customer_id,campaign_id) = (1,2);
campaign_month | customer_id | campaign_id
----------------+-------------+-------------
202004 | 1 | 2
(1 rows)

Timestamp with auto increment in Cassandra

I want Cassandra to write System.currentTimeMillis into the table for each row. For example:
writeToCassandra(name, email)
in cassandra table:
--------------------------------
name | email| currentMiliseconds
Can Cassandra populate the currentMiliseconds column automatically, like auto-increment?
BR!
Cassandra has a columnar-database flavor inside. If you read the docs on how columns are stored inside an SSTable, you'll notice that each column has its own write timestamp appended (used for conflict resolution, i.e. the last-write-wins strategy). You can query that timestamp using the writetime() function:
cqlsh:so> create table ticks ( id text primary key, value int);
cqlsh:so> insert into ticks (id, value) values ('foo', 1);
cqlsh:so> insert into ticks (id, value) values ('bar', 2);
cqlsh:so> insert into ticks (id, value) values ('baz', 3);
cqlsh:so> select id, value from ticks;
id | value
-----+-------
bar | 2
foo | 1
baz | 3
(3 rows)
cqlsh:so> select id, writetime(value) from ticks;
id | writetime(value)
-----+------------------
bar | 1448282940862913
foo | 1448282937031542
baz | 1448282945591607
(3 rows)
As you requested, I did not explicitly insert a write timestamp into the DB, but I am still able to query it. Note that you cannot use the writetime() function on primary key columns.
You can try with: dateof(now())
e.g.
INSERT INTO YOUR_TABLE (NAME, EMAIL, DATE)
VALUES ('NAME', 'EMAIL', dateof(now()));
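Note that dateof() was deprecated in Cassandra 2.2 in favor of toTimestamp(); on newer versions the same insert (using the hypothetical table above) would be written as:
INSERT INTO YOUR_TABLE (NAME, EMAIL, DATE)
VALUES ('NAME', 'EMAIL', toTimestamp(now()));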

Select a specific record in Cassandra using cql

This is the schema I use:
CREATE TABLE playerInfo (
key text,
column1 bigint,
column2 bigint,
column3 bigint,
column4 bigint,
column5 text,
value bigint,
PRIMARY KEY (key, column1, column2, column3, column4, column5)
)
WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
Note I use a composite key. And there is a record like this:
key | column1 | column2 | column3 | column4 | column5 | value
----------+------------+---------+----------+---------+--------------------------------------------------+-------
Kitty | 1411 | 3 | 713 | 4 | American | 1
In cqlsh, how to select it? I try to use:
cqlsh:game> SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column5 = 'American';
but the output is:
Bad Request: PRIMARY KEY part column5 cannot be restricted (preceding part column4 is either not restricted or by a non-EQ relation)
Then how could I select such cell?
You have chosen the primary key as PRIMARY KEY (key, column1, column2, column3, column4, column5), so if you are going to put a where clause on column5, you also need to specify where clauses on key, column1, column2, column3, and column4. For example:
SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column1 = 1411 AND column2 = 3 AND column3 = 713 AND column4 = 4 AND column5 = 'American';
If you are going to put a where clause on column2, you also need to specify where clauses on key and column1. For example:
SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column1 = 1411 AND column2 = 3;
If you want a where clause on a particular column of the primary key, then where clauses for all preceding columns also need to be given. So you need to approach Cassandra data modelling carefully to get good read and write performance while still satisfying your business needs. There is often a trade-off between query flexibility and performance; that is the nature of Cassandra.
There is a way to select rows based on columns that are not part of the primary key, by creating a secondary index.
Let me explain this with an example.
In this schema:
CREATE TABLE playerInfo (
player_id int,
name varchar,
country varchar,
age int,
performance int,
PRIMARY KEY ((player_id, name), country)
);
the first part of the primary key, i.e. (player_id, name), is the partition key. Its hash value determines which node in the Cassandra cluster the row will be written to.
Hence we need to specify both these values in the where clause to fetch a record. For example
SELECT * FROM playerinfo WHERE player_id = 1000 and name = 'Mark B';
player_id | name | country | age | performance
-----------+--------+---------+-----+-------------
1000 | Mark B | USA | 26 | 8
If the second part of your primary key contains more than two columns, you would have to specify values for all the columns to the left of that column in the key, including that column.
In this example
PRIMARY KEY ((key, column1), column2, column3, column4, column5)
For filtering based on column3 you would have to specify values for "key, column1, column2 and column3".
For filtering based on column5 you need to specify values for "key, column1, column2, column3, column4, and column5".
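Using the values from the sample row earlier in this question, a query filtering down to column3 would therefore look like:
SELECT * FROM playerInfo WHERE key = 'Kitty' AND column1 = 1411 AND column2 = 3 AND column3 = 713;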
But if your application demands filtering on particular columns which are not part of the partition key, you can create secondary indexes on those columns.
To create an index on a column use the following command
CREATE INDEX player_age on playerinfo (age) ;
Now you can filter columns based on age.
SELECT * FROM playerinfo where age = 26;
player_id | name | country | age | performance
-----------+---------+---------+-----+-------------
2000 | Sarah L | UK | 26 | 24
1000 | Mark B | USA | 26 | 8
Be very careful about using indexes in Cassandra. Use them only if a table has few records or, more precisely, few distinct values in the indexed columns.
Also you can drop an index using
DROP INDEX player_age ;
Refer to http://wiki.apache.org/cassandra/SecondaryIndexes and http://www.datastax.com/docs/1.1/ddl/indexes for more details.

CQL generates two columns per value?

I am wondering why Cassandra creates two columns when I add a cell with CQL.
This is my schema:
DROP KEYSPACE IF EXISTS tsdb;
CREATE KEYSPACE tsdb WITH replication =
{
'class': 'SimpleStrategy',
'replication_factor' : 3
};
USE tsdb;
CREATE TABLE datapoints (
tsid int,
key text,
value blob,
PRIMARY KEY (tsid, key)
);
INSERT INTO datapoints (tsid, key, value)
VALUES (
1,
'foo',
0x012345
);
INSERT INTO datapoints (tsid, key, value)
VALUES (
2,
'foo',
0x500000
);
Querying it in CQLSH looks good:
cqlsh:tsdb> SELECT * FROM datapoints;
tsid | key | value
------+-----+----------
1 | foo | 0x012345
2 | foo | 0x500000
(2 rows)
but when I list the rows via cassandra-cli I get two columns per row:
[default#tsdb] list datapoints;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: 1
=> (name=foo:, value=, timestamp=1405353603216000)
=> (name=foo:value, value=012345, timestamp=1405353603216000)
-------------------
RowKey: 2
=> (name=foo:, value=, timestamp=1405353603220000)
=> (name=foo:value, value=500000, timestamp=1405353603220000)
2 Rows Returned.
Elapsed time: 6.9 msec(s).
I was expecting to get something like:
-------------------
RowKey: 1
=> (name=foo:value, value=012345, timestamp=1405353603216000)
-------------------
RowKey: 2
=> (name=foo:value, value=500000, timestamp=1405353603220000)
2 Rows Returned.
Why does CQL create columns with the name "foo:" and an empty value? What are these good for?
Thank you!
Best,
Malte
The empty "foo:" cell is the CQL3 row marker: for each CQL row, Cassandra writes one internal cell per regular column plus an empty marker cell for the row itself, so that the row continues to exist even when all of its non-key columns are null or deleted. For an in-depth explanation of how CQL maps to the internal storage, see "Understanding How CQL3 Maps to Cassandra's Internal Data Structure" by John Berryman.
