How do I query for rows that are missing a value? - cassandra

Got a column family that looks like:
CREATE TABLE data (
    id uuid,
    order_id text,
    order_ts timestamp,
    product_category text,
    product_distributor text,
    store_state text,
    transaction_discount decimal,
    transaction_id text,
    transaction_qty int,
    transaction_total decimal,
    PRIMARY KEY (id)
)
How do I query all rows that don't have transaction_total? Seems like it'd be simple (ISNULL) but that doesn't exist in Cassandra.

Being able to filter rows where a column is NULL would imply that:
the storage engine actually stores a value for that column
the NULL is considered a value and not a marker of a missing value
As a side note, there have been long discussions in the SQL space about the meaning, interpretation, and implications of NULL as a marker vs. a value, and its 3-valued logic (see the Wikipedia article on NULL).
Getting back to Cassandra:
Cassandra doesn't store missing values (so a NULL column will actually not exist -- there will be no marker, or flag, or value stored)
To avoid the NULL-is-it-a-value-or-a-marker problem you could use a default value (for this particular example it seems like setting transaction_total to -1 would make it clear that the value needs to be computed)
Update: posting the above got me thinking about whether there would be a way to introduce an is_column_missing operator (one that would also not be a performance hog). Cassandra uses bloom filters to reduce the number of disk seeks -- a bloom filter can tell with certainty that a row is not present in a file. Unfortunately there's no per-row column index available to check the same sort of information, so C* would basically have to read all entries for a row in order to determine whether a column is present or not. As you can imagine, that would be terrible.
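If you go the default-value route, here is a minimal sketch using the DataStax Python driver. The contact point and keyspace name are assumptions, and the ALLOW FILTERING scan is only reasonable for small tables or occasional maintenance queries:
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace

# Write the sentinel instead of leaving the column unset.
session.execute(
    "INSERT INTO data (id, order_id, transaction_total) VALUES (%s, %s, %s)",
    (uuid.uuid4(), "order-1", -1),
)

# Equality on a non-key column needs a secondary index or ALLOW FILTERING;
# this is a full scan, so use it sparingly.
rows = session.execute(
    "SELECT id, order_id FROM data WHERE transaction_total = -1 ALLOW FILTERING"
)
for row in rows:
    print(row.id, row.order_id)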

Note that this is not actually supported. A query like
select * from data where transaction_total <> null
is invalid CQL: there is no <> operator, and null cannot be used in a WHERE comparison (IS NOT NULL exists only in materialized view definitions). This follows from the storage model described above -- a missing column simply isn't stored, so there is nothing to compare against.
See additional information here.

Related

Is there a difference in storing a list of floats vs. denormalising into multiple rows?

I need to store multiple floating point numbers per record in Cassandra. My current schema looks like:
CREATE TABLE data_point (
    account ASCII,
    groupkey TINYINT,
    productid TEXT,
    vectors LIST<FLOAT>,
    PRIMARY KEY ((account, groupkey), productid)
) WITH CLUSTERING ORDER BY (productid ASC);
Each record has 1280 floats. These rows, once inserted, are never updated or deleted. While this works, I've been wondering whether it would be better to store these as 1280 separate rows:
CREATE TABLE data_point (
    account ASCII,
    groupkey TINYINT,
    productid TEXT,
    vector FLOAT,
    PRIMARY KEY ((account, groupkey), productid)
) WITH CLUSTERING ORDER BY (productid ASC);
The DataStax docs read:
Collections are meant for storing/denormalizing relatively small amount of data.
...but I'm unsure what counts as a small or large amount. The ordering of the list is not relevant. The rows are never read individually. All reads come from Spark and use token ranges to read large swathes of data.
If the data never changes, then use the frozen version of the list, so all points will be stored as one binary object:
vectors frozen<LIST<FLOAT>>
Using separate rows makes sense only if you need to read a single value or something along those lines. If you always read the whole dataset, use the frozen list.
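For illustration, a minimal write of the frozen variant with the DataStax Python driver; the contact point and keyspace name are assumptions:
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace

# With vectors frozen<list<float>>, the whole list is serialized as one
# cell; a prepared statement takes a plain Python list as the bound value.
insert = session.prepare(
    "INSERT INTO data_point (account, groupkey, productid, vectors) "
    "VALUES (?, ?, ?, ?)"
)
session.execute(insert, ("acct-1", 0, "product-1", [0.0] * 1280))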
I would echo Alex's advice; a frozen list would suit your use case better than the non-frozen one above. However, there are also some points I would add.
In the 2nd table example, there is no additional column to denote the different list items once normalized -- the primary key remains the same, so in essence it would store just 1 value per primary key, not the 1,280 you intended. There would have to be an additional column within the key to make each list entry a unique row, for example as sketched below.
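A hedged sketch of what the normalized variant would need: an extra clustering column (here a hypothetical idx holding the value's position) so each of the 1280 values gets its own row. The table and keyspace names are assumptions:
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS data_point_rows (
        account ascii,
        groupkey tinyint,
        productid text,
        idx smallint,   -- added: position of the value within the vector
        vector float,
        PRIMARY KEY ((account, groupkey), productid, idx)
    )
""")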
For the 1st table, while you can use a frozen list, if there is no actual order to the items within the list and no duplication, you could opt for a set, which would be simpler since no ordinal position is stored or considered. (The lack of any ordering in the 2nd table design is the catalyst for this consideration.)

How to find the delta difference for a table in Cassandra using a uuid column type

I have the following table in my Cassandra DB, and I want to find the delta difference in terms of a Cassandra query. For example, if I perform any insert, update, or delete operation on the table, I should be able to show which row/rows are impacted as my final result.
Say that on the first pass I insert some 10 rows; if I then take the delta difference, the output should show only that those 10 rows were inserted. Likewise, if we modify or delete any number of rows, those changes should be captured.
The next time we run the query it should ideally return 0, as we have not inserted/modified/deleted any rows.
Here is the following table
CREATE TABLE datainv (
    datainv_account_id uuid,
    datainv_run_id uuid,
    id uuid,
    datainv_summary text,
    json text,
    number text,
    PRIMARY KEY (datainv_account_id, datainv_run_id)
);
I have searched for many things on the internet, but most of the solutions are based on timeuuid; in this case I have uuid columns only, so I haven't found any solution showing that the same use case can be achieved with uuid.
It's not so easy to generate a diff between 2 table states in Cassandra, because you can't easily detect whether you have inserted new partitions or not. You can implement something based on a timeuuid or a timestamp as a clustering column -- in that case you'll be able to filter out the data since the latest change, because you have an ordering of values that you don't get with uuid, which is completely random. But it still requires a full scan of the whole table. Plus it won't detect deletions...
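As an illustration of that clustering-column idea, a hedged sketch of a separate change-log table; the table, keyspace, and column names here are hypothetical, and it still won't capture deletes made directly against datainv:
import uuid
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")
session.execute("""
    CREATE TABLE IF NOT EXISTS datainv_changes (
        day date,
        change_id timeuuid,
        datainv_account_id uuid,
        datainv_run_id uuid,
        op text,   -- 'insert' or 'update'
        PRIMARY KEY (day, change_id)
    )
""")

# Alongside every write to datainv, record the change; change_id (now()) is
# time-ordered, so "what changed since X" becomes a range query within a day.
session.execute(
    "INSERT INTO datainv_changes (day, change_id, datainv_account_id, "
    "datainv_run_id, op) VALUES (toDate(now()), now(), %s, %s, 'insert')",
    (uuid.uuid4(), uuid.uuid4()),
)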
Theoretically you can implement this with Spark as follows (a minimal sketch appears after the list):
read all primary key values & store this data in some other table/on disk;
next time, read all primary key values & find the difference between the original set of primary keys & the new set -- for example, do a full outer join & treat the presence of None on the left as an addition, and the presence of None on the right as a deletion;
store the new set of primary keys in a separate table/on disk, truncating the previous version.
but it will consume quite a lot of resources.
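A minimal PySpark sketch of those steps, assuming the Spark Cassandra Connector is available; the keyspace name and snapshot paths are hypothetical:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.appName("datainv-diff").getOrCreate()
keys = ["datainv_account_id", "datainv_run_id"]

# Current primary key values, read straight from Cassandra.
current = (
    spark.read.format("org.apache.spark.sql.cassandra")
    .options(table="datainv", keyspace="my_keyspace")  # keyspace assumed
    .load()
    .select(*keys)
    .withColumn("in_current", lit(True))
)

# Snapshot of the keys saved by the previous run (hypothetical path).
previous = (
    spark.read.parquet("/snapshots/datainv_keys")
    .withColumn("in_previous", lit(True))
)

# Full outer join: a key present only on the current side is an insert,
# a key present only on the previous side is a delete.
joined = current.join(previous, on=keys, how="full_outer")
inserted = joined.filter(col("in_previous").isNull()).select(*keys)
deleted = joined.filter(col("in_current").isNull()).select(*keys)

# Save the current keys to a fresh path and rotate it in for the next run;
# overwriting the same path that is being read within one job is unsafe.
current.select(*keys).write.mode("overwrite").parquet("/snapshots/datainv_keys_new")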

Why does Cassandra/CQL restrict using a WHERE clause on a column that is not indexed?

I have a table as follows in Cassandra 2.0.8:
CREATE TABLE emp (
    empid int,
    deptid int,
    first_name text,
    last_name text,
    PRIMARY KEY (empid, deptid)
)
when I try to search with: "select * from emp where first_name='John';"
the CQL shell says:
"Bad Request: No indexed columns present in by-columns clause with Equal operator"
I searched for the issue, and everywhere it says to add a secondary index for the column 'first_name'.
But I need to know the exact reason why that column needs to be indexed.
The only thing I can figure out is performance.
Any other reasons?
Cassandra does not support searching by an arbitrary column, because that would involve scanning all the rows, and full scans are not supported.
The data is internally organised into something one can compare to HashMap[X, SortedMap[Y, Z]]. The key of the outer map is the partition key value, and the key of the inner map is a kind of concatenation of all clustering column values plus the name of some regular column.
Unless you have an index on a column, you need to provide a full (preferred) or partial path to the data you want to collect with the query. Therefore, you should design your schema so that queries contain the partition key value and some range on the clustering columns.
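For contrast, a query that follows that path -- partition key equality plus a range on the clustering column -- needs no index. A quick sketch with the DataStax Python driver; the contact point and keyspace are assumptions:
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")  # hypothetical keyspace

# Partition key equality plus a clustering-column range: valid without an index.
rows = session.execute("SELECT * FROM emp WHERE empid = 100 AND deptid >= 10")
for row in rows:
    print(row.empid, row.deptid, row.first_name)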
You may read about what is allowed and what is not here
Alternatively, you can create a secondary index in Cassandra, but that will hamper your write performance.
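If you do choose the index route, a minimal sketch (same assumed cluster and keyspace as above):
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# With a secondary index in place, the equality query from the question is
# accepted instead of raising "No indexed columns present...".
session.execute("CREATE INDEX IF NOT EXISTS ON emp (first_name)")
rows = session.execute("SELECT * FROM emp WHERE first_name = 'John'")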

How to make Cassandra have a varying column key for a specific row key?

I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply you can have varying column keys in Cassandra for a given row key. Is that true? And if it's true, how do you allow for varying column keys?
The reason I think this might be true is: say we have a user who can like many items, and we simply want the userId to be the row key. We let this row key (userID) map to all the items that specific user might like. Each user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemID a user likes, we could solve the problem that way.
So, is it possible to have a varying number of Cassandra column keys for a specific row key? (And how do you do it?)
Providing an example and/or some cql code would be awesome!
The thing that is confusing me is that I have seen some .cql files, and they define the schema beforehand; it seems pretty inflexible to make that dynamic, i.e. to allow additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
    test blob,
    tid timeuuid,
    result text,
    PRIMARY KEY (test, tid)
);
How can this even allow growing columns? Don't we need to specify the names beforehand anyway? Or additional custom columns as the application desires?
Yes, you can have a varying number of columns per row key. From a relational perspective it's not obvious, but tid is not a fixed attribute name: it acts as a placeholder for the varying column key. Note in the insert statements below that each insert supplies a different tid value under the same row key, so the row grows by one column with each insert.
CREATE TABLE IF NOT EXISTS results (
    test blob,
    tid timeuuid,
    data text,
    PRIMARY KEY (test, tid)
);
So in your example, you need to identify the row key, the column key, and the payload of the table.
The primary key contains both the row key and the column key.
test is your row key.
tid is your column key.
data is your payload.
The following inserts are all valid:
INSERT INTO your_keyspace.results (test, tid, data)
VALUES (textAsBlob('row_key_1'), a4a70900-24e1-11df-8924-001ff3591711, 'blob_1');
INSERT INTO your_keyspace.results (test, tid, data)
VALUES (textAsBlob('row_key_1'), a4a70900-24e1-11df-8924-001ff3591712, 'blob_2');
-- notice that the column key changed but the row key remained the same
INSERT INTO your_keyspace.results (test, tid, data)
VALUES (textAsBlob('row_key_2'), a4a70900-24e1-11df-8924-001ff3591711, 'blob_3');
See here
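To see the row grow, a quick read-back sketch with the DataStax Python driver (connection details assumed, values matching the inserts above):
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("your_keyspace")

# One partition, two clustering (column-key) values from the inserts above.
rows = session.execute(
    "SELECT tid, data FROM results WHERE test = textAsBlob('row_key_1')"
)
for row in rows:
    print(row.tid, row.data)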
Have you thought of exploring collection support in Cassandra for handling such relations in a colocated way (e.g., on the same data node)?
Not sure if it helps, but what about keeping the user id as the row key and a map containing the item id as key and some value?
-Vivel
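A quick sketch of that map idea; the table and names here are hypothetical, and it assumes a Cassandra version with the toTimestamp/now functions:
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")
session.execute("""
    CREATE TABLE IF NOT EXISTS user_likes (
        user_id text PRIMARY KEY,
        liked_items map<text, timestamp>   -- item id -> when it was liked
    )
""")

# Adding one entry grows the map without rewriting the whole row.
session.execute(
    "UPDATE user_likes SET liked_items['item42'] = toTimestamp(now()) "
    "WHERE user_id = 'alice'"
)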

Time UUID type in pycassa

I'm having problems with using the time_uuid type as a key in my columnfamily. I want to store my records, and have them ordered by when they were inserted, and then I figured that the time_uuid is a good way to go. This is how I've set up my column family:
sys.create_column_family("keyspace", "records", comparator_type=TIME_UUID_TYPE)
When I try to insert, I do this:
q=pycassa.ColumnFamily(pycassa.connect("keyspace"), "records")
myKey=pycassa.util.convert_time_to_uuid(datetime.datetime.utcnow())
q.insert(myKey, {'somedata': 'somevalue'})
However, when I insert data, I always get an error:
Argument for a v1 UUID column name or value was neither a UUID, a datetime, or a number.
If I change the comparator_type to UTF8_TYPE, it works, but the order of the items when returned are not as they should be. What am I doing wrong?
The problem is that in your data model, you are using the time as a row key. Although this is possible, you won't get a meaningful ordering unless you also use the ByteOrderedPartitioner.
For this reason, most people insert time-ordered data using the time as a column name, not a row key. In this model, your insert statement would look like:
q.insert(someKey, {datetime.datetime.utcnow(): 'somevalue'})
where someKey is a key that relates to the entire time series that you're inserting (for example, a username). (Note that you don't have to convert the time to UUID, pycassa does it for you.) To store something more than a single value, use a supercolumn or a composite key.
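For instance, a small pycassa sketch of that time-as-column model (connection details assumed):
import datetime
import pycassa

pool = pycassa.ConnectionPool("keyspace")
cf = pycassa.ColumnFamily(pool, "records")

# pycassa converts the datetime column name to a v1 UUID automatically.
cf.insert("someuser", {datetime.datetime.utcnow(): "somevalue"})

# Read back the 10 most recent columns for this row, newest first.
latest = cf.get("someuser", column_count=10, column_reversed=True)
for ts_uuid, value in latest.items():
    print(ts_uuid, value)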
If you really want to store the time in your row keys, then you need to specify key_validation_class, not comparator_type. comparator_type sets the type of the column names, while key_validation_class sets the type of the row keys.
sys.create_column_family("keyspace", "records", key_validation_class=TIME_UUID_TYPE)
Remember the rows will not be sorted unless you also use the ByteOrderedPartitioner.
The comparator for a column family is used for ordering the columns within each row. You are seeing that error because 'somedata' is valid utf-8 but not a valid uuid.
The ordering of the rows stored in cassandra is determined by the partitioner. Most likely you are using RandomPartitioner which distributes load evenly across your cluster but does not allow for meaningful range queries (the rows will be returned in a random order.)
http://wiki.apache.org/cassandra/FAQ#range_rp
