In HBase there's an HTable.exists(Get) method that lets me check whether a rowkey has at least one cell associated with it. What about Cassandra? There seems to be no corresponding feature.
Just make a request for one column. If anything is returned, the row exists.
For example, in pycassa you would do:
if column_family.get(key, column_count=1):
    print key, "exists"
Through CQL 3, depending on your schema, you would just do a simple select like:
SELECT * FROM mycf WHERE key = 'foo'
I need to output the write timestamp as part of a table export for lots of tables, but I cannot quite figure out a way that does not force me to explicitly select all columns in the statement.
Instead of being able to do just this:
SELECT *, writetime(data) AS timestamp FROM dls.licenses;
I have to do that:
SELECT column1, column2, ... , writetime(data) AS timestamp FROM dls.licenses;
This is pretty inconvenient, since it means I'd have to change the export tool every time the schema of any of the tables changes.
Is there a better way?
Edit: To clarify, the actual error I get is the following. Judging by how the syntax is presented in the error, one could think that the statement should be OK:
SELECT *, writetime(id) AS timestamp FROM dls.licenses;
SyntaxException: line 1:8 mismatched input ',' expecting K_FROM (SELECT *[,]...)
Edit 2: Here is the keyspace and create statement used for this table:
CREATE KEYSPACE IF NOT EXISTS dls WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '1' };
CREATE TABLE IF NOT EXISTS dls.licenses (subscription_id text, id text, key text, data text, PRIMARY KEY (key));
CREATE INDEX IF NOT EXISTS ON dls.licenses (id);
BTW: I'm using the fresh Cassandra 4.0.0 (GA).
If you are exporting to CSV or JSON files, you may consider using DataStax's dsbulk.
https://github.com/datastax/dsbulk
The latest version, dsbulk 1.8.0, added support for exporting writetime and TTL.
https://docs.datastax.com/en/dsbulk/doc/dsbulk/reference/schemaOptions.html#schemaOptions__schemaOptionsPreserveTimestamp
dsbulk unload -url myData.csv -k ks1 -t table1 --timestamp
The WHERE clause specifies which rows must be queried. It is composed of relations on the columns that are part of the PRIMARY KEY and/or have a secondary index defined on them.
The column specification of the relation must be one of the following:
One or more members of the partition key of the table
A clustering column, only if the relation is preceded by other relations that specify all columns in the partition key
A column that is indexed using CREATE INDEX.
In Cassandra 3.6 and later, add ALLOW FILTERING to filter only on a non-indexed cluster column.
You may be able to solve your query problem by creating a secondary index on the column you want the writetime for. Keep in mind that secondary indexes create overhead, which may result in unintended consequences.
The star (*) in SELECT * is the CQL syntax for "ALL columns", so by definition nothing else can be added to the selection, not even native CQL functions, since ALL columns are already selected. For this reason, you need to enumerate all column names plus the functions on columns.
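To avoid touching the export tool on every schema change, the column list can be generated rather than hard-coded. Below is a minimal Python sketch under a few assumptions: the function name build_export_query and the `_writetime` alias suffix are made up, and the (name, kind) pairs are supplied by the caller (on Cassandra 3.0 and later they could be read from system_schema.columns). WRITETIME() cannot be applied to primary key columns, so those are skipped; collection columns may need to be excluded as well, which this sketch does not handle.

```python
def build_export_query(keyspace, table, columns):
    """Build a SELECT listing every column plus WRITETIME() of the non-key ones.

    `columns` is a list of (name, kind) pairs, where kind is one of
    'partition_key', 'clustering', 'static' or 'regular'.
    """
    names = [name for name, _ in columns]
    # WRITETIME() is not allowed on primary key columns, so skip those.
    writetimes = [
        "WRITETIME({0}) AS {0}_writetime".format(name)
        for name, kind in columns
        if kind in ("regular", "static")
    ]
    return "SELECT {} FROM {}.{};".format(
        ", ".join(names + writetimes), keyspace, table)


# The dls.licenses table from the question.
licenses = [
    ("key", "partition_key"),
    ("subscription_id", "regular"),
    ("id", "regular"),
    ("data", "regular"),
]
print(build_export_query("dls", "licenses", licenses))
```

The generated statement names each column explicitly, so it stays valid CQL while still following whatever columns the schema currently has.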
+1 to Yuki's answer. I wanted to add that DSBulk adds a WRITETIME() column for every column in the table because it isn't possible to know in advance the write-time of each column in the partition until the full partition has been read.
Allow me to explain it using a couple of examples.
Schema
Consider this table:
CREATE TABLE users_by_email (
email text,
name text,
address text,
mobile text,
PRIMARY KEY (email)
)
Example 1
If we add a new record with a value specified for all columns:
INSERT INTO users_by_email (email, name, address, mobile)
VALUES ('alice@staysafe.com', 'Alice', '221B Baker St', '098-765-432-109');
then for this partition, all columns will have the same write-time.
Example 2
Consider a situation where a record is fragmented across multiple inserts over a period of time such as:
INSERT INTO users_by_email (email, name) VALUES ('dude@getvaccinated.now', 'Bob');
INSERT INTO users_by_email (email, address) VALUES ('dude@getvaccinated.now', '350 Fifth Ave');
INSERT INTO users_by_email (email, mobile) VALUES ('dude@getvaccinated.now', '012-555-123-456');
The columns name, address and mobile would each have a different write-time.
From these 2 examples, you should see that there isn't always a single write-time that applies to all columns in the partition.
For your specific use case, you need to figure out from the DSBulk output which write-time to use for situations where the partition fragments are inserted/updated at different times. Cheers!
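One possible policy, sketched in Python below, is to treat the newest per-column write-time as the write-time of the whole record. This is only one choice among several (the oldest value, or the value of one designated column, may fit other use cases better). The `_writetime` suffix and the sample values are assumptions for illustration; adjust the suffix to whatever column headers your export actually produces.

```python
def latest_write_time(row, suffix="_writetime"):
    """Return the newest write-time (microseconds since epoch) found in `row`.

    `row` maps column names to values; write-time columns are recognised by
    `suffix`. A column that was never written has a write-time of None.
    """
    stamps = [value for name, value in row.items()
              if name.endswith(suffix) and value is not None]
    return max(stamps) if stamps else None


# A record whose columns were written at different times (made-up values).
row = {
    "name": "Bob",
    "name_writetime": 1625000000000000,
    "address": "350 Fifth Ave",
    "address_writetime": 1625000300000000,
    "mobile": None,
    "mobile_writetime": None,
}
print(latest_write_time(row))  # 1625000300000000
```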
I have a table as follows in Cassandra 2.0.8:
CREATE TABLE emp (
empid int,
deptid int,
first_name text,
last_name text,
PRIMARY KEY (empid, deptid)
)
when I try to search by: "select * from emp where first_name='John';"
cql shell says:
"Bad Request: No indexed columns present in by-columns clause with Equal operator"
I searched for the issue, and everywhere it says to add a secondary index on the column 'first_name'.
But I need to know the exact reason why that column needs to be indexed.
Only thing I can figure out is performance.
Any other reasons?
Cassandra does not support searching by an arbitrary column, because that would require scanning all the rows.
The data are internally organised into something which one can compare to HashMap[X, SortedMap[Y, Z]]. The key of the outer map is a partition key value and the key of the inner map is a kind of concatenation of all clustering columns values and a name of some regular column.
Unless you have an index on a column, you need to provide full (preferred) or partial path to the data you want to collect with the query. Therefore, you should design your schema so that queries contain primary key value and some range on clustering columns.
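The HashMap[X, SortedMap[Y, Z]] picture can be mimicked with plain Python dictionaries. This is a toy model for intuition only, not how Cassandra is implemented; the sample data and function names are invented, and the inner map is sorted on read to stand in for a sorted map.

```python
# Outer map: partition key (empid) -> inner map.
# Inner map: (clustering value deptid, column name) -> cell value.
storage = {
    1: {(10, "first_name"): "John", (10, "last_name"): "Smith"},
    2: {(20, "first_name"): "Jane", (20, "last_name"): "Doe"},
}


def get_partition(empid):
    """Lookup by partition key: one hash access, then an ordered read."""
    return sorted(storage.get(empid, {}).items())


def find_by_first_name(name):
    """No path to the data is given, so every partition has to be scanned."""
    return sorted(
        (empid, deptid)
        for empid, cells in storage.items()
        for (deptid, column), value in cells.items()
        if column == "first_name" and value == name
    )


print(get_partition(1))
print(find_by_first_name("John"))  # [(1, 10)]
```

In this model get_partition touches a single entry of the outer map, while find_by_first_name walks all of them, which is the full scan that the error message is protecting you from.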
You may read about what is allowed and what is not here
Alternatively you can create an index in Cassandra, but that will hamper your write performance.
Got a column family that looks like:
CREATE TABLE data (
id uuid,
order_id text,
order_ts timestamp,
product_category text,
product_distributor text,
store_state text,
transaction_discount decimal,
transaction_id text,
transaction_qty int,
transaction_total decimal,
PRIMARY KEY (id)
)
How do I query all rows that don't have transaction_total? Seems like it'd be simple (ISNULL) but that doesn't exist in Cassandra.
Being able to filter rows where a column is NULL would imply that:
the storage engine actually stores a value for that column
the NULL is considered to be a value and not a marker of a missing value
As a side note, there have been long discussions in the SQL space about the meaning, interpretation, and implications of the NULL marker-vs-value and its 3-value logic (see this Wikipedia article).
Getting back to Cassandra:
Cassandra doesn't store missing values (so a NULL column will actually not exist -- there will be no marker, or flag, or value stored)
To avoid the NULL-is-it-a-value-or-a-marker problem you could use a default value (for this particular example it seems like setting transaction_total to -1 would make it clear that the value needs to be computed)
Update: posting the above got me thinking whether there could be a way to introduce an is_column_missing operator (that would also not be a performance hog). Cassandra uses bloom filters to reduce the number of disk seeks -- the bloom filter will basically tell with certainty if a row is not present in a file. Unfortunately there's no per-row column index available to check the same sort of information, so basically C* would have to read all entries for a row in order to determine if a column is present or not. As you can imagine that would be terrible.
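A toy Python model makes the cost visible. It is only an illustration with invented data: each row holds just the columns that were actually written, so the only way to find rows lacking a column is to look inside every row.

```python
# Each stored row only holds the columns that were actually written.
rows = {
    "id-1": {"order_id": "A-100", "transaction_total": "19.99"},
    "id-2": {"order_id": "A-101"},  # transaction_total was never written
    "id-3": {"order_id": "A-102", "transaction_total": "5.00"},
}


def ids_missing(column):
    """Return ids of rows lacking `column`; every row has to be inspected."""
    return sorted(row_id for row_id, cells in rows.items()
                  if column not in cells)


print(ids_missing("transaction_total"))  # ['id-2']
```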
You can just check for a null value, like below:
select * from data where transaction_total <> null
Check additional information here 3783
I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply you can have varying column keys in Cassandra for a given row key. Is that true? And if it's true, how do you allow for varying row keys?
The reason I think this might be true is because say we have a user and it can like many items and we simply want the userId to be the rowkey. We let this rowKey (userID) map to all the items that specific user might like. Each specific user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemID each user likes, then we could solve the problem that way.
Therefore, is it possible to have varying length of cassandra column keys for a specific rowKey? (and how do you do it)
Providing an example and/or some cql code would be awesome!
The thing that is confusing me is that I have seen some .cql files and they define keyspaces beforehand, and it seems pretty inflexible on how to make it dynamic, i.e. allow it to have additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
How can this even allow growing columns? Don't we need to specify the name beforehand anyway? Or additional custom columns as the application desires?
Yes, you can have a varying number of columns per row_key. From a relational perspective, it's not obvious that tid is the name of a variable. It acts as a placeholder for the variable column key. Note that in the insert statements below, it is the value supplied for tid, not the name "tid" itself, that ends up as the column key.
CREATE TABLE IF NOT EXISTS results (
test text,
tid timeuuid,
data blob,
PRIMARY KEY (test, tid)
);
So in your example, you need to identify the row_key, column_key, and payload of the table.
The primary key contains both the row_key and column_key.
test is your row_key.
tid is your column_key.
data is your payload.
The following inserts are all valid:
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_1', a4a70900-24e1-11df-8924-001ff3591711, textAsBlob('blob_1'));
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_1', a4a70900-24e1-11df-8924-001ff3591712, textAsBlob('blob_2'));
-- notice that the column_key changed but the row_key remained the same
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_2', a4a70900-24e1-11df-8924-001ff3591711, textAsBlob('blob_3'));
See here
Did you think of exploring the collection support in Cassandra for handling such relations in a colocated way (e.g. on the same data node)?
Not sure if it helps, but what about keeping user id as row key and a map containing item id as key and some value?
-Vivel
I've just had a crash course in Cassandra over the last week and went from the Thrift API to CQL to grokking SuperColumns to learning I shouldn't use them and should use Composite Keys instead.
I'm now trying out CQL3 and it would appear that I can no longer insert into columns that are not defined in the schema, or see those columns in a select *.
Am I missing some option to enable this in CQL3, or does it expect me to define every column in the schema (defeating the purpose of wide, flexible rows, imho)?
Yes, CQL3 does require columns to be declared before used.
But you can run as many ALTERs as you want; no locking or performance hit is entailed.
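If an application discovers new attributes at runtime, the ALTER statements can be generated from the difference between the declared columns and the wanted ones. The Python sketch below only builds the CQL strings; the function name, the users table and its columns are all hypothetical.

```python
def alter_statements(table, existing_columns, wanted_columns):
    """Return one ALTER TABLE ... ADD statement per column not yet declared.

    `wanted_columns` maps column name -> CQL type.
    """
    known = set(existing_columns)
    return [
        "ALTER TABLE {} ADD {} {};".format(table, name, cql_type)
        for name, cql_type in sorted(wanted_columns.items())
        if name not in known
    ]


statements = alter_statements(
    "users",
    existing_columns=["id", "name"],
    wanted_columns={"name": "text", "nickname": "text", "age": "int"},
)
for statement in statements:
    print(statement)
# ALTER TABLE users ADD age int;
# ALTER TABLE users ADD nickname text;
```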
That said, most of the places that you'd use "dynamic columns" in earlier C* versions are better served by a Map in C* 1.2.
I suggest you explore composite columns with "WITH COMPACT STORAGE".
A "COMPACT STORAGE" column family allows you to practically only define key columns:
Example:
CREATE TABLE entities_cargo (
entity_id ascii,
item_id ascii,
qt ascii,
PRIMARY KEY (entity_id, item_id)
) WITH COMPACT STORAGE
Actually, when you insert rows with different item_id values, you don't add a row with entity_id, item_id and qt; you add a column whose name is the item_id content and whose value is the qt content.
So:
insert into entities_cargo (entity_id,item_id,qt) values(100,'oggetto 1',3);
insert into entities_cargo (entity_id,item_id,qt) values(100,'oggetto 2',3);
Now, here is how you see these rows in CQL3:
cqlsh:goh_master> select * from entities_cargo where entity_id = 100;
entity_id | item_id | qt
-----------+-----------+----
100 | oggetto 1 | 3
100 | oggetto 2 | 3
And how they look if you check them from the CLI:
[default@goh_master] get entities_cargo[100];
=> (column=oggetto 1, value=3, timestamp=1349853780838000)
=> (column=oggetto 2, value=3, timestamp=1349853784172000)
Returned 2 results.
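The mapping between the two views can be sketched in a few lines of Python. This is a toy illustration of the idea rather than the storage engine itself, and the function name to_storage_view is made up: each CQL row contributes one column to the storage row identified by entity_id.

```python
from collections import OrderedDict


def to_storage_view(cql_rows):
    """Group (entity_id, item_id, qt) rows into {row key: {column name: value}}."""
    storage = OrderedDict()
    for entity_id, item_id, qt in cql_rows:
        # item_id becomes the column name, qt the column value.
        storage.setdefault(entity_id, OrderedDict())[item_id] = qt
    return storage


cql_rows = [("100", "oggetto 1", "3"), ("100", "oggetto 2", "3")]
for row_key, columns in to_storage_view(cql_rows).items():
    print(row_key, list(columns.items()))
# 100 [('oggetto 1', '3'), ('oggetto 2', '3')]
```

Two CQL rows collapse into a single storage row with two columns, matching the CLI output above.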
You can access a single column with
select * from entities_cargo where entity_id = 100 and item_id = 'oggetto 1';
Hope it helps
Cassandra still allows using wide rows. This answer references that DataStax blog entry, written after the question was asked, which details the links between CQL and the underlying architecture.
Legacy support
A dynamic column family defined through Thrift with the following command (notice there is no column-specific metadata):
create column family clicks
with key_validation_class = UTF8Type
and comparator = DateType
and default_validation_class = UTF8Type
Here is the exact equivalent in CQL:
CREATE TABLE clicks (
key text,
column1 timestamp,
value text,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE
Both of these commands create a wide-row column family that stores records ordered by date.
CQL Extras
In addition, CQL provides the ability to assign labels to the row id, column and value elements to indicate what is being stored. The following, alternative way of defining this same structure in CQL, highlights this feature on DataStax's example - a column family used for storing users' clicks on a website, ordered by time:
CREATE TABLE clicks (
user_id text,
time timestamp,
url text,
PRIMARY KEY (user_id, time)
) WITH COMPACT STORAGE
Notes
a Table in CQL is always mapped to a Column Family in Thrift
the CQL driver uses the first element of the primary key definition as the row key
Composite Columns are used to implement the extra columns that one can define in CQL
using WITH COMPACT STORAGE is not recommended for new designs because it fixes the number of possible columns. In other words, ALTER TABLE ... ADD is not possible on such a table. Just leave it out unless it's absolutely necessary.
interesting, something I didn't know about CQL3. In PlayOrm, the idea is it is a "partial" schema you must define and in the WHERE clause of the select, you can only use stuff that is defined in the partial schema BUT it returns ALL the data of the rows EVEN the data it does not know about....I would expect that CQL should have been doing the same :( I need to look into this now.
thanks,
Dean