We've recently decided to migrate an application to Cassandra (from Oracle) because it may help with performance, and as I have a decent Oracle background, I gotta admit I struggle with the Cassandra "way of thinking".
Basically I have a table with ~15 fields, some of which are dates. One of these dates is used for ordering, so I need to be able to do an "order by" on it. At the same time, though, this field can be nullable.
Now I've figured out that making that field part of the primary key lets me actually do the order-by part, but then I can't assign a null value to it anymore...
Any ideas?
You are correct in that you cannot query by NULL values in Cassandra. There's a really good reason for that: NULL values don't really exist. A row simply does not contain a value for the "NULL" column, and the CQL interface abstracts that with "NULL" output because that's easier to explain to people.
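To see this for yourself, here's a quick sketch (the table null_demo is just for illustration):

CREATE TABLE null_demo (id int PRIMARY KEY, note text);
INSERT INTO null_demo (id) VALUES (1);   -- no value is ever written for "note"
SELECT * FROM null_demo;                 -- cqlsh displays note as null, but no cell actually exists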
Cassandra also does not allow NULLs (or an absence of a column value) in its key fields. So the best you can do in this case is to come up with a timestamp constant that you (and your application) recognize as NULL without breaking anything. Consider this example table structure:
aploetz@cqlsh:stackoverflow> CREATE TABLE eventsByMonth (
monthBucket text,
eventTime timestamp,
event text,
PRIMARY KEY (monthBucket,eventTime))
WITH CLUSTERING ORDER BY (eventTime DESC);
Next I'll insert some values to test with:
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201509','2015-09-19 00:00:00','Talk Like A Pirate Day');
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201509','2015-09-25 00:00:00','Hobbit Day');
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201509','2015-09-19 21:00:00','dentist appt');
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201503','2015-03-14 00:00:00','Pi Day');
Let's say that I have two events that I want to keep track of, but I don't know the eventTimes, so instead of INSERTing a NULL, I'll just specify a zero. For the sake of the example, I'll put one in September 2015 and the other in October 2015:
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201510',0,'Some random day I want to keep track of');
aploetz@cqlsh:stackoverflow> INSERT INTO eventsByMonth (monthBucket,eventTime,event)
VALUES ('201509',0,'Some other random day I want to keep track of');
Now when I query for September of 2015, I'll get the following output:
aploetz@cqlsh:stackoverflow> SELECT * FROM eventsbymonth WHERE monthbucket = '201509';
monthbucket | eventtime | event
-------------+--------------------------+-----------------------------------------------
201509 | 2015-09-25 00:00:00-0500 | Hobbit Day
201509 | 2015-09-19 21:00:00-0500 | dentist appt
201509 | 2015-09-19 00:00:00-0500 | Talk Like A Pirate Day
201509 | 1969-12-31 18:00:00-0600 | Some other random day I want to keep track of
(4 rows)
Notes:
This is probably something you want to avoid doing, if possible.
INSERT/UPDATE (Upsert) with a "NULL" value is the same as a DELETE operation, and creates tombstone(s).
Upserting a zero (0) as a TIMESTAMP defaults to 1970-01-01 00:00:00 UTC. My current timezone offset is -0600, which is why the value of 1969-12-31 18:00:00 appears.
I don't need to specify an ORDER BY clause in my query, because the defined clustering order is what I want. It is a good idea to configure this as per your query requirements, because all ORDER BY can really do is enforce ASCending or DESCending. You cannot specify a column in your ORDER BY that differs from your table's defined clustering order.
An advantage of using a zero TIMESTAMP is that all rows containing that key are ordered at the bottom of the result set (in DESCending order), so you'll always know where to look for them.
Not sure what your partitioning key is, but I used monthBucket for mine. FYI: "bucketing" is a Cassandra modeling technique used when working with time series data, to evenly distribute data in your cluster.
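For instance, once events are bucketed by month, reading a range that spans months means querying each bucket (computing the bucket string from the event's date happens in your application). A small sketch using the table above:

SELECT * FROM eventsByMonth WHERE monthBucket IN ('201509','201510');

Each bucket is one partition, so this query touches exactly two partitions instead of scanning across the cluster.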
I am currently trying to model some time series data in Cassandra.
For example, I have a table bigint_table, which was created by the following query:
CREATE TABLE bigint_table (
    name_id int,
    tuuid timeuuid,
    timestamp timestamp,
    value text,
    PRIMARY KEY ((name_id), tuuid, timestamp)
) WITH CLUSTERING ORDER BY (tuuid ASC, timestamp ASC)
The tuuid column was added because without it I had problems and lost some data while inserting it into the DB. name_id represents the ID of the channel the data comes from. In one table there is a lot of data with the same ID, but rows are unique by timestamp and tuuid (values can also sometimes be the same).
I repeatedly execute two different queries to get values and timestamps:
1. select value from bigint_table where name_id=6 and timestamp>'2017-11-01 08:26:47.970+0000' and timestamp<'2017-11-30 08:26:52.048+0000' order by tuuid asc, timestamp asc allow filtering
2. select timestamp from bigint_table where name_id=6 and timestamp>'2017-11-01 08:26:47.970+0000' and timestamp<'2017-11-30 08:26:52.048+0000' order by tuuid asc, timestamp asc allow filtering
In this post the author says one needs to resist the urge to just add ALLOW FILTERING, and should instead think about the data, the model, and what one is trying to do.
I thought a lot about whether or not to use ALLOW FILTERING, and concluded that in my case I have no choice and need to use it. But the words in the post I mentioned above keep me in doubt. I would like to know your advice and what you think about my problem. Is there another way to model my data tables so that the queries do not require ALLOW FILTERING? I would be very thankful for any advice.
The reason you need ALLOW FILTERING is that you have the clustering columns (tuuid, timestamp) in the wrong order. In this case the data is stored first by tuuid and then by timestamp, but you're actually selecting data by timestamp and then ordering by tuuid, so Cassandra can't make use of the clustering order you have specified. The order in which you define the primary key matters.
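To make that concrete, here is a minimal sketch of the reordering (the table name bigint_table_by_time is hypothetical). With timestamp as the first clustering column, the range restriction lines up with the on-disk sort order, so ALLOW FILTERING is no longer required:

CREATE TABLE bigint_table_by_time (
    name_id int,
    timestamp timestamp,
    tuuid timeuuid,
    value text,
    PRIMARY KEY ((name_id), timestamp, tuuid)
) WITH CLUSTERING ORDER BY (timestamp ASC, tuuid ASC);

SELECT value, timestamp FROM bigint_table_by_time
WHERE name_id = 6
  AND timestamp > '2017-11-01 08:26:47.970+0000'
  AND timestamp < '2017-11-30 08:26:52.048+0000';

Keeping tuuid as the last clustering column preserves the uniqueness you relied on it for.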
I'm trying to get data from a date range on Cassandra, the table is like this:
CREATE TABLE test6 (
time timeuuid,
id text,
checked boolean,
email text,
name text,
PRIMARY KEY ((time), id)
)
But when I select a data range I get nothing:
SELECT * FROM test6 WHERE time IN ( minTimeuuid('2013-01-01 00:05+0000'), now() );
(0 rows)
How can I get a date range from a Cassandra Query?
The IN condition is used to specify multiple keys for a SELECT query. You're close, but to run a date range query on your table you'll want to use greater-than and less-than.
Of course, you can't run a greater-than/less-than query on a partition key, so you'll need to flip your keys for this to work. This also means that you'll need to specify your id in the WHERE clause, as well:
CREATE TABLE teste6 (
time timeuuid,
id text,
checked boolean,
email text,
name text,
PRIMARY KEY ((id), time)
)
INSERT INTO teste6 (time,id,checked,email,name)
VALUES (now(),'B26354',true,'rdeckard@lapd.gov','Rick Deckard');
SELECT * FROM teste6
WHERE id='B26354'
AND time >= minTimeuuid('2013-01-01 00:05+0000')
AND time <= now();
id | time | checked | email | name
--------+--------------------------------------+---------+-------------------+--------------
B26354 | bf0711f0-b87a-11e4-9dbe-21b264d4c94d | True | rdeckard@lapd.gov | Rick Deckard
(1 rows)
Now while this will technically work, partitioning your data by id might not work for your application. So you may need to put some more thought behind your data model and come up with a better partition key.
Edit:
Remember with Cassandra, the idea is to get a handle on what kind of queries you need to be able to fulfill. Then build your data model around that. Your original table structure might work well for a relational database, but in Cassandra that type of model actually makes it difficult to query your data in the way that you're asking.
Take a look at the modifications that I have made to your table (basically, I just reversed your partition and clustering keys). If you still need help, Patrick McFadin (DataStax's Chief Evangelist) wrote a really good article called Getting Started with Time Series Data Modeling. He has three examples that are similar to yours. In fact his first one is very similar to what I have suggested for you here.
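For instance, one possible direction in the spirit of his first example (all names and values here are hypothetical, not a drop-in fix) is to add a date bucket to the partition key, so one id's history doesn't grow into a single unbounded partition:

CREATE TABLE teste6_by_day (
    id text,
    day text,
    time timeuuid,
    checked boolean,
    email text,
    name text,
    PRIMARY KEY ((id, day), time)
);

SELECT * FROM teste6_by_day
WHERE id = 'B26354'
  AND day = '20150220'
  AND time >= minTimeuuid('2015-02-20 00:00+0000');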
I have a Cassandra column family, or CQL table, with the following schema:
CREATE TABLE user_actions (
company_id varchar,
employee_id varchar,
inserted_at timeuuid,
action_type varchar,
PRIMARY KEY ((company_id, employee_id), inserted_at)
) WITH CLUSTERING ORDER BY (inserted_at DESC);
Basically a composite partition key that is made up of a company ID and an employee ID, and a clustering column, representing the insertion time, that is used to order the columns in reverse chronological order (newest actions are at the beginning of the row).
Here's what an insert looks like:
INSERT INTO user_actions (company_id, employee_id, inserted_at, action_type)
VALUES ('acme', 'xyz', now(), 'started_project')
USING TTL 1209600; // two weeks
Nothing special here, except the TTL which is set to expire in two weeks.
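For what it's worth, the remaining time-to-live can be read back with CQL's TTL() function to confirm the expiry is applied:

SELECT action_type, TTL(action_type) FROM user_actions
WHERE company_id = 'acme' AND employee_id = 'xyz';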
The read path is also quite simple - we always want the latest 100 actions, so it looks like this:
SELECT action_type FROM user_actions
WHERE company_id = 'acme' and employee_id = 'xyz'
LIMIT 100;
The issue: I would expect that since we order in reverse chronological order, and the TTL is always the same amount of seconds on insertion - that such a query should not scan through any tombstones - all "dead" columns are at the tail of the row, not the head. But in practice we see many warnings in the log in the following format:
WARN [ReadStage:60452] 2014-09-08 09:48:51,259 SliceQueryFilter.java (line 225) Read 40 live and 1164 tombstoned cells in profiles.user_actions (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=1410169639669000, localDeletion=1410169639}
and on rare occasions the tombstone number is large enough to abort the query completely.
Since I see this type of schema design being advocated quite often, I wonder if I'm doing something wrong here?
Your SELECT statement is not giving an explicit sort order and is hence defaulting to ASC (even though your clustering order is DESC).
So if you change your query to:
SELECT action_type FROM user_actions
WHERE company_id = 'acme' and employee_id = 'xyz'
ORDER BY inserted_at DESC
LIMIT 100;
you should be fine.
Perhaps data is reappearing because a node failed and gc_grace_seconds had already expired; when the node comes back into the cluster, Cassandra can't replay/repair the deletes because the tombstone disappeared after gc_grace_seconds: http://www.datastax.com/documentation/cassandra/2.1/cassandra/dml/dml_about_deletes_c.html
The 2.1 incremental repair sounds like it might be right for you: http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html
I have the below table in CQL:
create table test (
employee_id text,
employee_name text,
value text,
last_modified_date timeuuid,
primary key (employee_id)
);
I inserted a couple of records into the above table like this, the same way I will be inserting them in our actual use case scenario:
insert into test (employee_id, employee_name, value, last_modified_date) values ('1', 'e27', 'some_value', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('2', 'e27', 'some_new_value', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('3', 'e27', 'some_again_value', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('4', 'e28', 'some_values', now());
insert into test (employee_id, employee_name, value, last_modified_date) values ('5', 'e28', 'some_new_values', now());
Now I was trying a select query for: give me all the employee_ids for employee_name e27.
select employee_id from test where employee_name = 'e27';
And this is the error I am getting -
Bad Request: No indexed columns present in by-columns clause with Equal operator
Perhaps you meant to use CQL 2? Try using the -2 option when starting cqlsh.
Is there anything wrong I am doing here?
My use cases in general are:
Give me everything for a given employee_name.
Give me everything that has changed in the last 5 minutes.
Give me the latest employee_id and value for a given employee_name.
Give me all the employee_ids for a given employee_name.
I am running Cassandra 1.2.11
The general rule is simple: "you can only query by columns that are part of the key". The explanation: all other queries would require a complete scan of the table, which might mean a lot of data sifting.
There are things that can modify this rule:
use secondary indexes for columns with low cardinality (more details here)
define multi-column keys (e.g. PRIMARY KEY (col1, col2), which would allow queries like col1 = value1, or col1 = value1 AND col2 COND; see the sketch after this list)
use ALLOW FILTERING in queries. This will result in a warning as Cassandra will have to sift through a lot of data and there will be no performance guarantees. For more details see details of ALLOW FILTERING in CQL and this SO thread
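To illustrate the second option against the table from the question (test_by_name is a hypothetical name), a denormalized table keyed for the "all employee_ids for a name" use case might look like this:

CREATE TABLE test_by_name (
    employee_name text,
    last_modified_date timeuuid,
    employee_id text,
    value text,
    PRIMARY KEY ((employee_name), last_modified_date, employee_id)
) WITH CLUSTERING ORDER BY (last_modified_date DESC, employee_id ASC);

SELECT employee_id, value FROM test_by_name WHERE employee_name = 'e27';

The DESC clustering order also covers the "latest employee_id and value for a name" case with a LIMIT 1.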
Cassandra takes a little getting used to :) Some of us have been spoiled by the extra stuff an RDBMS does for you that you do not get for free from NoSQL.
If you think back to a regular RDBMS table, when you SELECT on a column that has no index, the DB must do a full-table scan to find all the matches you seek. This is a no-no in Cassandra, and it will complain if you try to do it. Imagine if you found 10^32 matches to this query; it is not a reasonable ask.
In your table, you have coded PRIMARY KEY (employee_id); this is the row's primary and unique identifying key. You can now SELECT * from TEST where employee_id='123'; this is perfectly reasonable and Cassandra will happily return the result.
However, your SELECT from TEST WHERE employee_name = 'e27'; tells Cassandra to go and read EVERY record until it finds a match on 'e27'. With no index to rely on, it politely asks you to 'forget it'.
If you want to filter on a column, make sure you have an index on that column so that Cassandra can perform the filtering you need.
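For the table in the question, that would be something along these lines (the index name is arbitrary). Note that secondary indexes are best suited to low-cardinality columns; a query-specific table is usually the better long-term answer:

CREATE INDEX employee_name_idx ON test (employee_name);
SELECT employee_id FROM test WHERE employee_name = 'e27';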
Prior to CQL3 one could insert arbitrary columns such as columns that are named by a date:
cqlsh:test>CREATE TABLE seen_ships (day text PRIMARY KEY)
WITH comparator=timestamp AND default_validation=text;
cqlsh:test>INSERT INTO seen_ships (day, '2013-02-02 00:08:22')
VALUES ('Tuesday', 'Sunrise');
Per this post, it seems that things are different in CQL3. Is it still somehow possible to insert arbitrary columns? Here's my failed attempt:
cqlsh:test>CREATE TABLE seen_ships (
day text,
time_seen timestamp,
shipname text,
PRIMARY KEY (day, time_seen)
);
cqlsh:test>INSERT INTO seen_ships (day, 'foo') VALUES ('Tuesday', 'bar');
Here I get Bad Request: line 1:29 no viable alternative at input 'foo'
So I try a slightly different table because maybe this is a limitation of compound keys:
cqlsh:test>CREATE TABLE seen_ships ( day text PRIMARY KEY );
cqlsh:test>INSERT INTO seen_ships (day, 'foo') VALUES ('Tuesday', 'bar');
Again with the Bad Request: line 1:29 no viable alternative at input 'foo'
What am I missing here?
There's a good blog post over on the Datastax blog about this: http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
The answer is that yes, CQL3 supports dynamic columns, just not the way it worked in earlier versions of CQL. I don't really understand your example; you mix timestamps with strings in a way that I don't see working in CQL2 either. If I understand you correctly, you want to make a timeline of ship sightings, where the partition key (row key) is the day and each sighting is a time/name pair. Here's a suggestion:
CREATE TABLE ship_sightings (
day TEXT,
time TIMESTAMP,
ship TEXT,
PRIMARY KEY (day, time)
)
And you insert entries with
INSERT INTO ship_sightings (day, time, ship) VALUES ('Tuesday', NOW(), 'Titanic')
However, you should probably use a TIMEUUID instead of TIMESTAMP (and the partition key could be a DATE), since otherwise you might add two sightings with the same timestamp and only one would survive.
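A sketch of that variant (the table name is just for illustration):

CREATE TABLE ship_sightings_tu (
  day TEXT,
  time TIMEUUID,
  ship TEXT,
  PRIMARY KEY (day, time)
)

INSERT INTO ship_sightings_tu (day, time, ship) VALUES ('Tuesday', NOW(), 'Titanic')

Since NOW() generates a unique TIMEUUID, two sightings in the same millisecond no longer collide.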
This was an example of wide rows, but then there's the issue of dynamic columns, which isn't exactly the same thing. Here's an example of that in CQL3:
CREATE TABLE ship_sightings_with_properties (
day TEXT,
time TIMEUUID,
ship TEXT,
property TEXT,
value TEXT,
PRIMARY KEY (day, time, ship, property)
)
which you can insert into like this:
INSERT INTO ship_sightings_with_properties (day, time, ship, property, value)
VALUES ('Sunday', NOW(), 'Titanic', 'Color', 'Black')
-- you need to repeat the INSERT INTO for each statement; multiple VALUES isn't
-- supported, but I've not included them here to make this example shorter
VALUES ('Sunday', NOW(), 'Titanic', 'Captain', 'Edward John Smith')
VALUES ('Sunday', NOW(), 'Titanic', 'Status', 'Steaming on')
VALUES ('Monday', NOW(), 'Carapathia', 'Status', 'Saving the passengers off the Titanic')
The downside with this kind of dynamic columns is that the property names will be stored multiple times (so if you have a thousand sightings in a row and each has a property called "Captain", that string is saved a thousand times). On-disk compression takes away most of that overhead, and most of the time it's nothing to worry about.
Finally, a note about collections in CQL3. They're a useful feature, but they are not a way to implement wide rows or dynamic columns. First of all, they have a limit of 65536 items, but Cassandra can't enforce this limit, so if you add too many elements you might not be able to read them back later. Collections are mostly for small multi-value fields -- the canonical example is an address book where each row is an entry, and where entries have only a single name but multiple phone numbers, email addresses, etc.
It is not a truly dynamic column, but most of the time you can get away with collections. Using a map column you can store some dynamic data, as in the sketch below.
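A hedged sketch of that approach, reusing the sightings example (the table name is hypothetical):

CREATE TABLE ship_sightings_with_map (
  day TEXT,
  time TIMEUUID,
  ship TEXT,
  properties MAP<TEXT, TEXT>,
  PRIMARY KEY (day, time)
)

INSERT INTO ship_sightings_with_map (day, time, ship, properties)
VALUES ('Sunday', NOW(), 'Titanic', {'Color': 'Black', 'Captain': 'Edward John Smith'})

-- individual entries can later be set or overwritten in place, e.g.
-- UPDATE ship_sightings_with_map SET properties['Status'] = 'Steaming on'
--  WHERE day = 'Sunday' AND time = <that sighting's timeuuid>;

This keeps each sighting's properties together in one cell group, subject to the collection limits mentioned above.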