This is the schema I use:
CREATE TABLE playerInfo (
key text,
column1 bigint,
column2 bigint,
column3 bigint,
column4 bigint,
column5 text,
value bigint,
PRIMARY KEY (key, column1, column2, column3, column4, column5)
)
WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
Note that I use a composite key, and there is a record like this:
key | column1 | column2 | column3 | column4 | column5 | value
----------+------------+---------+----------+---------+--------------------------------------------------+-------
Kitty | 1411 | 3 | 713 | 4 | American | 1
In cqlsh, how do I select it? I tried:
cqlsh:game> SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column5 = 'American';
but the output is:
Bad Request: PRIMARY KEY part column5 cannot be restricted (preceding part column4 is either not restricted or by a non-EQ relation)
How can I select such a cell?
You have chosen PRIMARY KEY (key, column1, column2, column3, column4, column5), so if you want a WHERE clause on column5, you must also restrict key, column1, column2, column3, and column4 with equality. For example:
SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column1 = 1411 AND column2 = 3 AND column3 = 713 AND column4 = 4 AND column5 = 'American';
Likewise, a WHERE clause on column2 requires equality restrictions on key and column1. For example:
SELECT * FROM playerInfo WHERE KEY = 'Kitty' AND column1 = 1411 AND column2 = 3;
To restrict a particular column of the primary key, every preceding column must also be restricted (by equality). So you need to approach Cassandra data modelling carefully to get good read and write performance while still satisfying your business needs. Often the two pull against each other: if the model satisfies your business logic, Cassandra's performance may not satisfy you, and if the performance satisfies you, your business logic may not. That is the beauty of Cassandra; it certainly still has room to improve.
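Note that the error message also hints that the last restricted clustering column may use a non-EQ relation. A minimal sketch against the schema above (the values are hypothetical):
SELECT * FROM playerInfo WHERE key = 'Kitty' AND column1 = 1411 AND column2 >= 3;
Here key and column1 are restricted by equality, so a range on column2 is allowed; a range on column3 would not be unless column2 were also restricted by equality.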
There is a way to select rows based on columns that are not part of the primary key: by creating a secondary index.
Let me explain this with an example.
In this schema:
CREATE TABLE playerInfo (
player_id int,
name varchar,
country varchar,
age int,
performance int,
PRIMARY KEY ((player_id, name), country)
);
the first part of the primary key, i.e. (player_id, name), is the partition key. Its hash value determines which node in the Cassandra cluster the row is written to.
Hence we need to specify both of these values in the WHERE clause to fetch a record. For example:
SELECT * FROM playerinfo WHERE player_id = 1000 and name = 'Mark B';
player_id | name | country | age | performance
-----------+--------+---------+-----+-------------
1000 | Mark B | USA | 26 | 8
If the second part of your primary key (the clustering columns) contains more than two columns, you have to specify values for all the columns to the left of the column you filter on, as well as that column itself.
In this example
PRIMARY KEY ((key, column1), column2, column3, column4, column5)
For filtering based on column3 you would have to specify values for key, column1, column2, and column3.
For filtering based on column5 you would have to specify values for key, column1, column2, column3, column4, and column5.
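For instance, a sketch of the column3 case against a hypothetical table with that key (the table name, values, and types are made up for illustration):
SELECT * FROM some_table WHERE key = 'k1' AND column1 = 'a' AND column2 = 'b' AND column3 = 'c';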
But if your application demands filtering on particular columns that are not part of the partition key, you can create secondary indexes on those columns.
To create an index on a column, use the following command:
CREATE INDEX player_age ON playerinfo (age);
Now you can filter columns based on age.
SELECT * FROM playerinfo where age = 26;
player_id | name | country | age | performance
-----------+---------+---------+-----+-------------
2000 | Sarah L | UK | 26 | 24
1000 | Mark B | USA | 26 | 8
Be very careful about using indexes in Cassandra. Use them only if the table has few records, or more precisely, few distinct values in the indexed column.
You can also drop an index using:
DROP INDEX player_age;
Refer to http://wiki.apache.org/cassandra/SecondaryIndexes and http://www.datastax.com/docs/1.1/ddl/indexes for more details.
Related
I'm new to Apache Cassandra and have the following issue:
I have a table with PRIMARY KEY (userid, countrycode, carid). As described in many tutorials, this table can be queried using the following filter criteria:
userid = x
userid = x and countrycode = y
userid = x and countrycode = y and carid = z
This is fine for most cases, but now I need to query the table by filtering only on
userid = x and carid = z
Here, the documentation says the best solution is to create another table with a modified primary key, in this case PRIMARY KEY (userid, carid, countrycode).
The question is: how do I copy the data from the "original" table to the new one with the different key?
On small tables
On huge tables
And another important question concerning the duplication of a huge table: What about the storage needed to save both tables instead of only one?
You can use the COPY command to export from one table and import into the other.
From your example, I created two tables, user_country and user_car, with the respective primary keys.
CREATE KEYSPACE user WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 2 } ;
CREATE TABLE user.user_country ( user_id text, country_code text, car_id text, PRIMARY KEY (user_id, country_code, car_id));
CREATE TABLE user.user_car ( user_id text, country_code text, car_id text, PRIMARY KEY (user_id, car_id, country_code));
Let's insert some dummy data into one table.
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('1', 'IN', 'CAR1');
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('2', 'IN', 'CAR2');
cqlsh> INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('3', 'IN', 'CAR3');
cqlsh> select * from user.user_country ;
user_id | country_code | car_id
---------+--------------+--------
3 | IN | CAR3
2 | IN | CAR2
1 | IN | CAR1
(3 rows)
Now we will export the data to a CSV file. Note the column order specified.
cqlsh> COPY user.user_country (user_id,car_id, country_code) TO 'export.csv';
Using 1 child processes
Starting copy of user.user_country with columns [user_id, car_id, country_code].
Processed: 3 rows; Rate: 4 rows/s; Avg. rate: 4 rows/s
3 rows exported to 1 files in 0.824 seconds.
export.csv can now be imported directly into the other table.
cqlsh> COPY user.user_car(user_id,car_id, country_code) FROM 'export.csv';
Using 1 child processes
Starting copy of user.user_car with columns [user_id, car_id, country_code].
Processed: 3 rows; Rate: 6 rows/s; Avg. rate: 8 rows/s
3 rows imported from 1 files in 0.359 seconds (0 skipped).
cqlsh> select * from user.user_car ;
user_id | car_id | country_code
---------+--------+--------------
3 | CAR3 | IN
2 | CAR2 | IN
1 | CAR1 | IN
(3 rows)
About your other question: yes, the data will be duplicated, but that is how Cassandra is meant to be used; you trade disk space for fast, partition-local reads.
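Going forward, you can keep the two tables in sync by applying every write to both, for example with a logged batch. A minimal sketch (the values are hypothetical):
BEGIN BATCH
INSERT INTO user.user_country (user_id, country_code, car_id) VALUES ('4', 'IN', 'CAR4');
INSERT INTO user.user_car (user_id, country_code, car_id) VALUES ('4', 'IN', 'CAR4');
APPLY BATCH;
A logged batch guarantees that both writes are eventually applied, at the cost of some coordination overhead.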
I have a Cassandra table where each column can contain a value or a NULL. But if it contains a NULL, I know that all the next values in that column are also NULL.
Something like this:
+------------+---------+---------+---------+
| date | column1 | column2 | column3 |
+------------+---------+---------+---------+
| 2017-01-01 | 1 | 'a' | NULL |
| 2017-01-02 | 2 | 'b' | NULL |
| 2017-01-03 | 3 | NULL | NULL |
| 2017-01-04 | 4 | NULL | NULL |
| 2017-01-05 | NULL | NULL | NULL |
+------------+---------+---------+---------+
I need a query that, for a given column, returns the date of the last row with a non-null value in that column. In this case:
For column1, '2017-01-04'
For column2, '2017-01-02'
For column3, no result returned.
In SQL it would be something like this:
SELECT date
FROM my_table
WHERE column1 IS NOT NULL
ORDER BY date DESC LIMIT 1
Is it possible in any way, or should I break the table into one table per column to avoid the NULL situation altogether?
tl;dr: Create a new table that tracks this separately.
This would only be possible if column1 were part of the primary key, or with secondary indexes, or with a materialized view.
You don't want your primary key to contain nulls. As an aside, make sure you're writing UNSET in place of null to the rest of your table. This should be handled by the driver, but some drivers are not terribly mature. Writing nulls is effectively a delete operation and will create tombstones.
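For instance, with the my_table definition sketched further down, an explicit null writes a tombstone while an omitted column does not:
-- explicit null: a tombstone is written for column2
INSERT INTO my_table (partition, date, column2) VALUES (1, '2017-01-05', null);
-- column omitted: nothing is written for column2, no tombstone
INSERT INTO my_table (partition, date) VALUES (1, '2017-01-05');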
Secondary indexes come with performance problems, as they can potentially hit the entire cluster, and they don't scale well beyond a certain point.
Materialized views are being deprecated, so probably avoid those.
You are likely better served by creating a separate table that tracks this exact functionality. This would mean multiple writes and multiple reads but would avoid large table scans and secondary indexes.
I'm going to assume your partition key isn't the date and that you've got wide rows, because it makes this simpler. This is what that would look like:
CREATE TABLE my_table (
partition bigint,
date text,
column1 bigint,
column2 text,
column3 text,
PRIMARY KEY (partition, date)
);
CREATE TABLE offset_tracker (
partition bigint,
date text,
PRIMARY KEY (partition)
);
Here you would run SELECT date FROM offset_tracker WHERE partition = x; to get your "largest date with values".
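A sketch of the dual write that keeps the tracker current (values hypothetical; note that only non-null columns are written):
BEGIN BATCH
INSERT INTO my_table (partition, date, column1) VALUES (1, '2017-01-04', 4);
INSERT INTO offset_tracker (partition, date) VALUES (1, '2017-01-04');
APPLY BATCH;
To track the last non-null date per column rather than per partition, you would add a column-name component to offset_tracker's key.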
So I have a CF whose schema looks something like this:
CREATE TABLE "emp" (
id text,
column1 text,
column2 text,
PRIMARY KEY (id, column1, column2)
)
I have an entry which looks like this and I want to delete it:
20aff8144049 | name | someValue
So I tried this command:
Delete column2 from emp where id='20aff8144049';
It failed with the error below:
no viable alternative at input '20aff8144049' (...column2 from emp where id=["20aff8144049]...)
Can someone help with where I'm going wrong? Thanks!
You can't delete or set null on a primary key column.
You have to delete the entire row.
You can only delete an entry using valid values for your primary key. You defined your primary key to include (id, column1, column2), which means you have to put all the corresponding values in your WHERE clause.
However, I assume you wanted to be able to delete by id only. Therefore, I'd suggest you re-define your column family like this:
CREATE TABLE "emp" (
id text,
column1 text,
column2 text,
PRIMARY KEY ((id), column1, column2)
)
where id is your partition key and column1 and column2 are your clustering columns.
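With that definition, a sketch against the example entry: deleting by id alone removes the whole partition, while supplying the clustering columns removes a single entry.
DELETE FROM emp WHERE id = '20aff8144049';
DELETE FROM emp WHERE id = '20aff8144049' AND column1 = 'name' AND column2 = 'someValue';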
Given a table with a composite partition key like:
CREATE TABLE testtable (
column1 text,
column2 text,
column3 text,
column4 text,
column5 text,
column6 text,
PRIMARY KEY ((column1, column2, column3, column4))
)
How could I get all the unique values of only one partition column? Now, I know this doesn't work:
SELECT DISTINCT column2 from testtable;
However, I can do
SELECT DISTINCT column1, column2, column3, column4 from testtable;
So, is there a way (within CQL, because the result of that query might be quite large) to query that result like you would in SQL? Something like this:
SELECT DISTINCT column2 FROM (SELECT DISTINCT column1, column2, column3, column4 from testtable);
Which doesn't work. Or do I really have to use Python (or other alternatives) for this?
Simply put, there is no way to achieve this in CQL. The partition key as a whole determines which Cassandra node is responsible for the data and for queries on it. Therefore, it always has to be given in its entirety.
How can I count the number of columns in different rows of a column family?
I am a Cassandra newbie and do not know where to start. The only option I can think of is to make the application fetch the data one row at a time. That does not sound right to me. I am using Hector to connect to Cassandra.
This is how you get the total column count for a particular row key (the sliceQuery setup, missing from the original snippet, is added for completeness):
SliceQuery<String, String, String> sliceQuery = HFactory.createSliceQuery(keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
sliceQuery.setColumnFamily("**your column family**");
sliceQuery.setKey("**your row key**");
// a null start and finish with a large count fetches (up to) all columns in the row
sliceQuery.setRange(null, null, false, Integer.MAX_VALUE);
QueryResult<ColumnSlice<String, String>> result = sliceQuery.execute();
ColumnSlice<String, String> cs = result.get();
long noOfColumnsInRowKey = cs.getColumns().size();
Assume that you have a wide row (let's create it using the CLI):
create column family cf3
with column_type = 'Standard' and
comparator = 'TimeUUIDType' and
key_validation_class = 'UTF8Type' and
default_validation_class = 'UTF8Type';
This is what I see in CQL3:
cqlsh:ks> desc table cf3;
CREATE TABLE cf3 (
key text,
column1 timeuuid,
value text,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'SnappyCompressor'};
I inserted some values from CQL3, which feels like good ol' MySQL (the insert below is the last of five similar ones):
cqlsh:ks> insert into cf3 (key, column1, value) values ('user1', now(), 'time5');
cqlsh:ks> select * from cf3;
key | column1 | value
-------+--------------------------------------+-------
user1 | f0c687b0-d114-11e2-8002-2f4261da0d90 | time1
user1 | fb9fa130-d114-11e2-8002-2f4261da0d90 | time2
user1 | 09512f10-d115-11e2-8002-2f4261da0d90 | time3
user1 | 0f5c93e0-d115-11e2-8002-2f4261da0d90 | time4
user1 | 21155220-d115-11e2-8002-2f4261da0d90 | time5
But it's your wide row (as seen from the CLI):
[default@ks] list cf3;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: user1
=> (column=f0c687b0-d114-11e2-8002-2f4261da0d90, value=time1, timestamp=1370789864363000)
=> (column=fb9fa130-d114-11e2-8002-2f4261da0d90, value=time2, timestamp=1370789882563000)
=> (column=09512f10-d115-11e2-8002-2f4261da0d90, value=time3, timestamp=1370789905537000)
=> (column=0f5c93e0-d115-11e2-8002-2f4261da0d90, value=time4, timestamp=1370789915678000)
=> (column=21155220-d115-11e2-8002-2f4261da0d90, value=time5, timestamp=1370789945410000)
1 Row Returned.
Elapsed time: 105 msec(s).
Now, you wanted to count the number of columns from a given time onwards, right? Here is the CQL3 for that:
cqlsh:ks> select count(*) from cf3 where key = 'user1' and column1 >= 09512f10-d115-11e2-8002-2f4261da0d90 ;
count
-------
3
Now, I am somewhat doubtful about what goes on underneath, but my intuition says that all the columns get fetched to the coordinator node and counted in memory, which is probably similar to what you were planning to do manually on the client machine.
Also, I am unaware whether cassandra-cli provides such functionality, but you mentioned you are using Hector. So you can leverage get_count or CountQuery as mentioned here, except with null as the range finish and a large count value. Like this:
// column names are TimeUUIDs, so the name type parameter should be UUID
CountQuery<String, UUID> cq = HFactory.createCountQuery(keyspace, StringSerializer.get(), TimeUUIDSerializer.get());
cq.setColumnFamily(cf).setKey("user1");
// null finish plus a large max count counts everything from `timestamp` onwards
cq.setRange(timestamp, null, Integer.MAX_VALUE);
QueryResult<Integer> r = cq.execute();
(uncompiled code above)
HTH
Old answer:
See Hector documentation:
CQL:
// se = StringSerializer.get(), le = LongSerializer.get()
CqlQuery<String,String,Long> cqlQuery = new CqlQuery<String,String,Long>(keyspace, se, se, le);
cqlQuery.setQuery("SELECT COUNT(*) FROM StandardLong1 WHERE KEY = 'cqlQueryTest_key1'");
QueryResult<CqlRows<String,String,Long>> result = cqlQuery.execute();
assertEquals(2, result.get().getAsCount());
You can just omit the WHERE condition and use LIMIT to achieve what you want.
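For instance, a sketch in CQL (note that in these older Cassandra versions COUNT is capped at 10,000 rows unless you raise the LIMIT):
SELECT COUNT(*) FROM StandardLong1 LIMIT 1000000;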