Cassandra: get last with a non-null value in a column - cassandra

I have a Cassandra table where each column can contain a value or a NULL. But if it contains a NULL, I know that all the next values in that column are also NULL.
Something like this:
+------------+---------+---------+---------+
| date | column1 | column2 | column3 |
+------------+---------+---------+---------+
| 2017-01-01 | 1 | 'a' | NULL |
| 2017-01-02 | 2 | 'b' | NULL |
| 2017-01-03 | 3 | NULL | NULL |
| 2017-01-04 | 4 | NULL | NULL |
| 2017-01-05 | NULL | NULL | NULL |
+------------+---------+---------+---------+
I need a query that, for a given column, returns the date of the last column with a non-null value. In this case:
For column1, '2017-01-04'
For column2, '2017-01-02'
For column3, no result returned.
In SQL it would be something like this:
SELECT date
FROM my_table
WHERE column1 IS NOT NULL
ORDER BY date DESC LIMIT 1
Is it possible in any way, or should I break the table into one table for each column to avoid the NULL situation at all?

tldr; Create a new table that tracks this separately.
This would only be possible if 'column 1' was part of the primary key, with secondary indexes or with a materialized view.
You don't want your primary key to have nulls. As an aside make sure you're writing 'UNSET' inplace of null to the rest of your table. This should be handled by the driver but some drivers are not terribly mature. Writing nulls is effectively a delete operation and will cause tombstones.
Secondary indexes come with performance problems as potentially they hit the entire cluster and don't scale very well beyond a certain point.
Materialized views are being deprecated, so probably avoid those.
You are likely better served by creating a separate table that tracks this exact functionality. This would mean multiple writes and multiple reads but would avoid large table scans and secondary indexes.
I'm going to assume your partition isn't by date and that you've got wide rows because it makes this simpler but this is what that would look like.
CREATE TABLE my_table (
partition bigint,
date text,
column1 bigint,
column2 text,
column3 text,
PRIMARY KEY(partition, date);
CREATE TABLE offset_tracker(
partition bigint,
date text,
PRIMARY KEY(partition);
Here you would do a select date FROM offset_tracker WHERE partition=x to get your 'largest date with values'.

Related

Cassandra find records where list is empty [duplicate]

How do I query in cassandra for != null columns.
Select * from tableA where id != null;
Select * from tableA where name != null;
Then I wanted to store these values and insert these into different table.
I don't think this is possible with Cassandra. First of all, Cassandra CQL doesn't support the use of NOT or not equal to operators in the WHERE clause. Secondly, your WHERE clause can only contain primary key columns, and primary key columns will not allow null values to be inserted. I wasn't sure about secondary indexes though, so I ran this quick test:
create table nullTest (id text PRIMARY KEY, name text);
INSERT INTO nullTest (id,name) VALUES ('1','bob');
INSERT INTO nullTest (id,name) VALUES ('2',null);
I now have a table and two rows (one with null data):
SELECT * FROM nullTest;
id | name
----+------
2 | null
1 | bob
(2 rows)
I then try to create a secondary index on name, which I know contains null values.
CREATE INDEX nullTestIdx ON nullTest(name);
It lets me do it. Now, I'll run a query on that index.
SELECT * FROM nullTest WHERE name=null;
Bad Request: Unsupported null value for indexed column name
And again, this is done under the premise that you can't query for not null, if you can't even query for column values that may actually be null.
So, I'm thinking this can't be done. Also, if null values are a possibility in your primary key, then you may want to re-evaluate your data model. Again, I know the OP's question is about querying where data is not null. But as I mentioned before, Cassandra CQL doesn't have a NOT or != operator, so that's going to be a problem right there.
Another option, is to insert an empty string instead of a null. You would then be able to query on an empty string. But that still doesn't get you past the fundamental design flaw of having a null in a primary key field. Perhaps if you had a composite primary key, and only part of it (the clustering columns) had the possibility of being empty (certainly not part of the partitioning key). But you'd still be stuck with the problem of not being able to query for rows that are "not empty" (instead of not null).
NOTE: Inserting null values was done here for demonstration purposes only. It is something you should do your best to avoid, as inserting a null column value WILL create a tombstone. Likewise, inserting lots of null values will create lots of tombstones.
1) select * from test;
name | id | address
------------------+----+------------------
bangalore | 3 | ramyam_lab
bangalore | 4 | bangalore_ramyam
bangalore | 5 | jasgdjgkj
prasad | 11 | null
prasad | 12 | null
india | 6 | karnata
india | 7 | karnata
ramyam-bangalore | 3 | jasgdjgkj
ramyam-bangalore | 5 | jasgdjgkj
2)cassandra does't support null values selection.It is showing null for our understanding.
3) For handling null values use another strings like "not-available","null",then we can select data

Usage of cqlsh is similar with mysql, what's the difference?

cqlsh create table:
CREATE TABLE emp(
emp_id int PRIMARY KEY,
emp_name text,
emp_city text,
emp_sal varint,
emp_phone varint
);
insert data
INSERT INTO emp (emp_id, emp_name, emp_city,
emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000);
select data
SELECT * FROM emp;
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000
3 | Chennai | rahman | 9848022330 | 45000
looks just same as mysql, where is column family, column?
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered collection of columns.
A column is the basic data structure of Cassandra with three values, namely key or column name, value, and a time stamp.
so table emp is a column family?
INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000); is a row which contains columns?
column here is something like emp_id=>1 or emp_name=>ram ??
In Cassandra, although the column families are defined, the columns are not. You can freely add any column to any column family at any time.
what does this mean?
I can have something like this?
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000 | asdfasd | asdfasdf
3 | Chennai | rahman | 9848022330 | 45000
A super column is a special column, therefore, it is also a key-value pair. But a super column stores a map of sub-columns.
Where is super column, how to create it?
Column family is an old name, now it's called just table.
About super column, also an old term, you have "Map" data type for example, or user defined data types for more complex structures.
About freely adding columns - in the old days, Cassandra was working with unstructured data paradigm, so you didn't had to define columns before you insert them, for now it isn't possible, since Cassandra team moved to be "structured" only (as many in the DB's industry came to conclusion that unstructured data makes more problems than effort).
Anyway, Cassandra's data representation on storage level is very different from MySQL, and indeed saves only data for the columns that aren't empty. It may look same row when you are running select from cqlsh, but it is stored and queried in very different way.
The name column family is an old term for what's now simply called a table, such as "emp" in your example. Each table contains one or many columns, such as "emp_id", "emp_name".
When saying something like being able to freely add columns any time, this would mean that you're always able to omit values for columns (will be null) or add columns using the ALTER TABLE statement.

Delete Rows using timestamp column cassandra

I want to delete data between timestamp from my table.
CREATE TABLE propatterns_test.test (
clientId text,
meterId text,
meterreading text,
date timestamp,
PRIMARY KEY (meterId, date) );
My delete query is:
DELETE FROM test WHERE meterid = 'M5' AND date > '2016-12-27 10:00:00+0000';
Which returned this error :
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Invalid operator < for PRIMARY KEY part date"
After that I tried to delete a specific row :
DELETE FROM test WHERE meterid = 'M5' AND date = '2016-12-27 09:42:30+0000';
Actually the table contains the same record, but it was not deleted.
This is what my data looks like:
meterid | date | clientid | meterreading
---------+--------------------------+----------+--------------
M5 | 2016-12-27 09:42:30+0000 | RDS | 35417.8
M5 | 2016-12-27 09:42:44+0000 | RDS | 35417.8
M5 | 2016-12-27 09:47:20+0000 | RDS | 35417.8
M5 | 2016-12-27 09:47:33+0000 | RDS | 35417.8
Nothing is deleting from table. So how can I delete data between timestamp dates which is part of the primary key?
I see a couple of things happening here. First of all, like iconnj mentioned, range deletes are not possible in versions prior to Cassandra 3.0.
Secondly, your single-row delete attempt is failing (I believe) due to the fact that you are not accounting for the milliseconds present on the timestamp. You can see this if you nest your date column inside the timestsampasblob and blobasbigint functions:
aploetz#cqlsh:stackoverflow> SELECT meterid,date,blobAsBigint(timestampAsBlob(date))
FROM propatterns WHERE meterid='M5';
meterid | date | system.blobasbigint(system.timestampasblob(date))
---------+--------------------------+---------------------------------------------------
M5 | 2016-12-27 09:42:30+0000 | 1482831750000
M5 | 2016-12-30 17:31:53+0000 | 1483119113231
M5 | 2016-12-30 17:32:08+0000 | 1483119128812
(3 rows)
Note the zeros on the end of the 2016-12-27 09:42:30+0000 row, that I explicitly INSERTed from your example. Note that the two rows I INSERTed using the dateof(now()) nested functions actually has the milliseconds as the last three digits on the timestamps.
Watch what happens when I take those three digits and add them as milliseconds when I delete one of the rows:
aploetz#cqlsh:stackoverflow> DELETE FROM propatterns WHERE meterid='M5'
AND date='2016-12-30 17:32:08.812+0000';
aploetz#cqlsh:stackoverflow> SELECT meterid,date,blobAsBigint(timestampAsBlob(date))
FROM propatterns WHERE meterid='M5';
meterid | date | system.blobasbigint(system.timestampasblob(date))
---------+--------------------------+---------------------------------------------------
M5 | 2016-12-27 09:42:30+0000 | 1482831750000
M5 | 2016-12-30 17:31:53+0000 | 1483119113231
(2 rows)
In summary:
You cannot perform range deletes prior to Cassandra 3.0.
You cannot delete individual rows keyed by timestamps without specifying milliseconds, if milliseconds are indeed present.
Delete with range clause is possible in C* 3.0 onwards. Looking at the error you got I think you are on a pre 3.0 version in which case you won't be able to do this via CQL
In Cassandra 3 you can use the "...from Y using timestamp XXX where ..." command:
create table mytime (
location_id text,
tour_id text,
mytime timestamp,
PRIMARY KEY (location_id, tour_id));
INSERT INTO mytime (location_id, tour_id, mytime) values ('location1', '1', toTimeStamp(now()));
INSERT INTO mytime (location_id, tour_id, mytime) values ('location1', '2', toTimeStamp(now()));
Be aware: the value you need to use for the timestamp is nanoseconds not miliseconds:
select location_id, mytime, blobAsBigint(mytime), WRITETIME(mytime) from mytime;
location_id |mytime |system.blobasbigint(mytime) |writetime(mytime) |
------------|------------------------|----------------------------|------------------|
location1 |2018-11-28-09.53.52.110 |1543395232110 |1543395232109517 |
location1 |2018-11-28-09.53.52.742 |1543395232742 |1543395232740055 |
So now you can do
delete from mytime using timestamp 1543395232109517 where location_id = 'location1';
Which correctly deletes the entry <= 1543395232109517:
select location_id, mytime, blobAsBigint(mytime), WRITETIME(mytime) from mytime;
location_id |mytime |system.blobasbigint(mytime) |writetime(mytime) |
------------|------------------------|----------------------------|------------------|
location1 |2018-11-28-09.53.52.742 |1543395232742 |1543395232740055 |

How to delete a record in Cassandra?

I have a table like this:
CREATE TABLE mytable (
user_id int,
device_id ascii,
record_time timestamp,
timestamp timeuuid,
info_1 text,
info_2 int,
PRIMARY KEY (user_id, device_id, record_time, timestamp)
);
When I ask Cassandra to delete a record (an entry in the columnfamily) like this:
DELETE from my_table where user_id = X and device_id = Y and record_time = Z and timestamp = XX;
it returns without an error, but when I query again the record is still there. Now if I try to delete a whole row like this:
DELETE from my_table where user_id = X
It works and removes the whole row, and querying again immediately doesn't return any more data from that row.
What I am doing wrong? How you can remove a record in Cassandra?
Thanks
Ok, here is my theory as to what is going on. You have to be careful with timestamps, because they will store data down to the millisecond. But, they will only display data to the second. Take this sample table for example:
aploetz#cqlsh:stackoverflow> SELECT id, datetime FROM data;
id | datetime
--------+--------------------------
B25881 | 2015-02-16 12:00:03-0600
B26354 | 2015-02-16 12:00:03-0600
(2 rows)
The datetimes (of type timestamp) are equal, right? Nope:
aploetz#cqlsh:stackoverflow> SELECT id, blobAsBigint(timestampAsBlob(datetime)),
datetime FROM data;
id | blobAsBigint(timestampAsBlob(datetime)) | datetime
--------+-----------------------------------------+--------------------------
B25881 | 1424109603000 | 2015-02-16 12:00:03-0600
B26354 | 1424109603234 | 2015-02-16 12:00:03-0600
(2 rows)
As you are finding out, this becomes problematic when you use timestamps as part of your PRIMARY KEY. It is possible that your timestamp is storing more precision than it is showing you. And thus, you will need to provide that hidden precision if you will be successful in deleting that single row.
Anyway, you have a couple of options here. One, find a way to ensure that you are not entering more precision than necessary into your record_time. Or, you could define record_time as a timeuuid.
Again, it's a theory. I could be totally wrong, but I have seen people do this a few times. Usually it happens when they insert timestamp data using dateof(now()) like this:
INSERT INTO table (key, time, data) VALUES (1,dateof(now()),'blah blah');
CREATE TABLE worker_login_table (
worker_id text,
logged_in_time timestamp,
PRIMARY KEY (worker_id, logged_in_time)
);
INSERT INTO worker_login_table (worker_id, logged_in_time)
VALUES ("worker_1",toTimestamp(now()));
after 1 hour executed the above insert statement once again
select * from worker_login_table;
worker_id| logged_in_time
----------+--------------------------
worker_1 | 2019-10-23 12:00:03+0000
worker_1 | 2015-10-23 13:00:03+0000
(2 rows)
Query the table to get absolute timestamp
select worker_id, blobAsBigint(timestampAsBlob(logged_in_time )), logged_in_time from worker_login_table;
worker_id | blobAsBigint(timestampAsBlob(logged_in_time)) | logged_in_time
--------+-----------------------------------------+--------------------------
worker_1 | 1524109603000 | 2019-10-23 12:00:03+0000
worker_1 | 1524209403234 | 2019-10-23 13:00:03+0000
(2 rows)
The below command will not delete the entry from Cassandra as the precise value of timestamp is required to delete the entry
DELETE from worker_login_table where worker_id='worker_1' and logged_in_time ='2019-10-23 12:00:03+0000';
By using the timestamp from blob we can delete the entry from Cassandra
DELETE from worker_login_table where worker_id='worker_1' and logged_in_time ='1524209403234';

Cassandra: can you add dynamic columns within existing column clustering?

I'm using Cassandra 1.2.12 with CQL 3, and am having trouble modeling my column family.
I currently store snapshots of customer data at particular times. Works great:
CREATE TABLE data (
cust_id varchar,
time timeuuid,
data_text text,
PRIMARY KEY (cust_id, time)
);
The cust_id is the partition key and time is the clustering id, so, as I understand it, I can think of each row in the table like:
| cust_id | timeuuid1 : data_text | timeuuid2 : data_text |
| CUST1 | data at this time | data at this time |
Now I'd like to store another group of metrics for each snapshot - but the name of each of these columns isn't fixed. So something like:
| cust_id | timeuuid1 : data_text | timeuuid1 : dynamicCol1 | timeuuid1 : dynamicCol2 | timeuuid1 : dynamicColN |
| CUST1 | data |{some value} |{some value} |{some value} |
I've achieved dynamic columns for timestamp by using a composite primary key, but I can't see how to achieve this within each cluster of columns, if you see what I mean.
If I add, say, "dynamicColumnName" to the existing composite key, I'll end up with customer data stored for each dynamic column, which is not what I want.
Is this possible, without using a Map column? Hope you can help, thanks!
I am not a CQL user... With the thrift API you dynamically add a column to a column family by inserting/updating a record with a value for a column with name X. The column X will start to exist right there and then for that record.
Have you tried an INSERT statement specifying a column that you have not explicitly defined? I would expect that to have the same effect (column is created).

Resources