Do specific rows in a partition have to be specified in order to update and/or delete static columns? - cassandra

The CQL3 specification description of the UPDATE statement begins with the following paragraph:
The UPDATE statement writes one or more columns for a given row in a
table. The (where-clause) is used to select the row to update and must
include all columns composing the PRIMARY KEY (the IN relation is only
supported for the last column of the partition key). Other columns
values are specified through (assignment) after the SET keyword.
The description in the specification of the DELETE statement begins with a similar paragraph:
The DELETE statement deletes columns and rows. If column names are provided
directly after the DELETE keyword, only those columns are deleted from the row
indicated by the (where-clause) (the id[value] syntax in (selection) is for
collection, please refer to the collection section for more details).
Otherwise whole rows are removed. The (where-clause) allows to specify the
key for the row(s) to delete (the IN relation is only supported for the last
column of the partition key).
The bolded portions of each of these descriptions state, in layman's terms, that these statements can be used to modify data in a solely row-based manner.
However, given the nature of the relationship (or lack thereof) between the rows of a table and its static columns (which exist independently of any particular row), it seems as though there should be a way to modify such columns given only the keys of the partitions that contain them. According to the specification, however, that does not seem to be possible, and I'm not sure whether that is because it would be difficult to allow in the CQL3 syntax, or because of something else.
If a static column cannot be updated or deleted independently of any row in its table, then such operations become coupled with their non-static counterparts, making the set of columns targeted by such operations difficult to determine. For example, given a populated table with the following definition:
CREATE TABLE IF NOT EXISTS example_table
(
partitionKeyColumn int,
clusteringColumn int,
nonPrimaryKeyColumn int,
staticColumn varchar static,
PRIMARY KEY (partitionKeyColumn, clusteringColumn)
);
... it is not immediately obvious if the following DELETE statements are equivalent:
//#1 (Explicitly specifies all of the columns in and "in" the target row)
DELETE partitionKeyColumn, clusteringColumn, nonPrimaryKeyColumn, staticColumn FROM example_table WHERE partitionKeyColumn = 1 AND clusteringColumn = 2
//#2 (Implicitly specifies all of the columns in (but not "in"?) the target row)
DELETE FROM example_table WHERE partitionKeyColumn = 1 AND clusteringColumn = 2
So, phrasing my observations in question form:
Are the above DELETE statements equivalent?
Does the primary key of at least one row in a CQL3 table have to be supplied in order to update or delete a static column in said table? If so, why?

I do not know about the specification, but in the real Cassandra world your two DELETE statements are not equivalent.
The first statement deletes staticColumn whereas the second one does not. The reason is that static columns are shared by all rows of a partition, so you have to name a static column explicitly to actually delete it.
Furthermore, I do not think it's a good idea to DELETE static columns and non-static columns at the same time. By the way, this statement won't work:
DELETE staticColumn FROM example_table WHERE partitionKeyColumn = 1 AND clusteringColumn = 2
The error output is:
Bad Request: Invalid restriction on clustering column priceable_name since the DELETE statement modifies only static columns
As the message indicates, a statement that modifies only static columns must not restrict a clustering column; its WHERE clause should name the partition key alone.

Related

How to identify all columns that have different values in a Spark self-join

I have a Databricks Delta table of financial transactions that is essentially a running log of all changes that ever took place on each record. Each record is uniquely identified by 3 keys, so a record can have multiple entries in this table, each representing a historical change (across one or more columns of that record). If I wanted to find cases where a specific column value changed, I could easily achieve that with something like this:
SELECT t1.Key1, t1.Key2, t1.Key3, t1.Col12 AS "Before", t2.Col12 AS "After"
FROM table1 t1 INNER JOIN table1 t2 ON t1.Key1 = t2.Key1 AND t1.Key2 = t2.Key2
AND t1.Key3 = t2.Key3 WHERE t1.Col12 != t2.Col12
However, these tables have a large number of columns. What I'm trying to achieve is a way to identify any columns that changed in a self-join like this: essentially, a list of all columns that changed. I don't care about the actual values that changed, just a list of the column names that changed across all records. It doesn't even have to be per row. The 3 keys will always be excluded, since they uniquely define a record.
In other words, I'm trying to find the columns that are susceptible to change, so that I can focus on them specifically for some other purpose.
Any suggestions would be really appreciated.
Databricks has change data feed (CDF / CDC) functionality that can simplify these types of use cases: https://docs.databricks.com/delta/delta-change-data-feed.html
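As a rough illustration of the comparison the question asks for (independent of the change data feed), the following pure-Python sketch groups log entries by the three keys and collects every non-key column whose value differs between consecutive entries of the same record. The sample rows, the column name Col11, and the function name changed_columns are hypothetical, and the rows are assumed to arrive in chronological order; on a real Delta table the same logic would have to be expressed in Spark.

```python
from collections import defaultdict

KEY_COLUMNS = ("Key1", "Key2", "Key3")  # the three keys that identify a record


def changed_columns(rows, key_columns=KEY_COLUMNS):
    """Return the set of non-key columns whose value differs between
    consecutive log entries of the same record.

    `rows` is an iterable of dicts, assumed to be in chronological order.
    """
    # Group every log entry under the record it belongs to.
    history = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in key_columns)
        history[key].append(row)

    changed = set()
    for versions in history.values():
        # Compare each entry with the next entry of the same record.
        for before, after in zip(versions, versions[1:]):
            for column, old_value in before.items():
                if column in key_columns:
                    continue
                if after.get(column) != old_value:
                    changed.add(column)
    return changed


# Hypothetical sample log: two entries for one record, one entry for another.
sample_log = [
    {"Key1": 1, "Key2": "A", "Key3": "x", "Col11": 10, "Col12": "open"},
    {"Key1": 1, "Key2": "A", "Key3": "x", "Col11": 10, "Col12": "closed"},
    {"Key1": 2, "Key2": "B", "Key3": "y", "Col11": 7, "Col12": "open"},
]

print(sorted(changed_columns(sample_log)))
```

Comparing only consecutive entries is enough here: if any two versions of a record differ in a column, at least one consecutive pair must differ in it as well.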

How to understand the 'Flexible schema' in Cassandra?

I am new to Cassandra, and found the passage below on Wikipedia.
A column family (called "table" since CQL 3) resembles a table in an RDBMS (Relational Database Management System). Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp. Unlike a table in an RDBMS, different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.
It says that 'different rows in the same column family do not have to share the same set of columns', but how do I implement that? I have read almost all the documents on the official site.
I can create table and insert data like below.
CREATE TABLE Emp_record(E_id int PRIMARY KEY,E_score int,E_name text,E_city text);
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (101, 85, 'ashish', 'Noida');
INSERT INTO Emp_record(E_id, E_score, E_name, E_city) values (102, 90, 'ankur', 'meerut');
This is very much like what I did in a relational database. So how do I create multiple rows with different columns?
I also found that the official documentation mentions a 'Flexible schema'; how should I understand it here?
Thanks very much in advance.
Column family is a term from the original design of Cassandra, when the data model looked like that of Google BigTable or Apache HBase, and the Thrift protocol was used for communication. But this required the schema to be defined inside the application, which makes access to the data from many applications more problematic, as you need to update the schema inside all of them.
CREATE TABLE and INSERT are part of the Cassandra Query Language (CQL), which was introduced a long time ago and replaced the Thrift-based implementation (Cassandra 4.0 completely removed Thrift support). In CQL you need to have a schema defined for a table, in which you provide each column's name and type. If you really need dynamic columns, there are several approaches (I'll link answers that I have already written over time, so there won't be duplicates):
If you have values of the same type, you can use one column for the name of the attribute/column and another to store the value, as described here.
If you have values of different types, you can also use one column for the name of the attribute/column and define multiple value columns, one for each of the data types (int, text, ...), inserting each value into the corresponding column only (described here).
You can use maps (described here). This is similar to the first or second approach, but is mostly designed for a very small number of "dynamic columns", and has other limitations; for example, you need to read the full map to fetch one value.
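The first approach in the list above can be sketched as follows. Since a running Cassandra cluster is not assumed here, the sketch uses Python's built-in sqlite3 module as a stand-in; the table name emp_attributes, the columns attr_name and attr_value, and the E_nickname attribute are hypothetical. In CQL the table would have the same shape, with E_id as the partition key and attr_name as a clustering column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE emp_attributes (
        E_id       INTEGER,
        attr_name  TEXT,   -- plays the role of the "dynamic column" name
        attr_value TEXT,   -- the value stored under that name
        PRIMARY KEY (E_id, attr_name)
    )
    """
)

# Each employee gets only the attributes that apply to it.
conn.executemany(
    "INSERT INTO emp_attributes (E_id, attr_name, attr_value) VALUES (?, ?, ?)",
    [
        (101, "E_name", "ashish"),
        (101, "E_city", "Noida"),
        (102, "E_name", "ankur"),
        (102, "E_nickname", "anky"),  # hypothetical attribute only row 102 has
    ],
)


def attributes_of(emp_id):
    """Return all attribute/value pairs stored for one employee as a dict."""
    cursor = conn.execute(
        "SELECT attr_name, attr_value FROM emp_attributes WHERE E_id = ?",
        (emp_id,),
    )
    return dict(cursor.fetchall())


print(attributes_of(101))
print(attributes_of(102))
```

The two employees end up with different sets of "columns" even though the table definition itself never changes.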

How to retrieve data from child cells EXCEL

I want to retrieve all items within a specific column of a table.
In this scenario, I have 2 tables. The first table contains a primary key, and the second table contains a foreign key. A one-to-many relationship is set up between the tables.
I want a function or some other way of retrieving all items within a column in table 2 whose foreign key matches the primary key in table 1.
One way of doing this is through a VLOOKUP, though surely through using DAX, or some other function set, I can exploit the relationship I have made in the DataModel to make this easier for me to do.
Why don't you just get the required data from the DB with a proper SELECT statement? Something like
SELECT column
FROM t1, t2
WHERE t1.key = t2.fkey
AND t1.key = 'whatever you search for';
Then you should get the data you want.
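A minimal, runnable sketch of that query, using Python's built-in sqlite3 module with an in-memory database. The table layouts, the column name item, the helper items_for, and the sample rows are hypothetical; only t1, t2, key, and fkey come from the query above. The join is written with an explicit JOIN ... ON, which is equivalent to the comma form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 ("key" TEXT PRIMARY KEY);
    CREATE TABLE t2 (fkey TEXT REFERENCES t1 ("key"), item TEXT);
    INSERT INTO t1 VALUES ('A'), ('B');
    INSERT INTO t2 VALUES ('A', 'apple'), ('A', 'avocado'), ('B', 'banana');
    """
)


def items_for(key):
    """Return every t2.item whose foreign key matches the given t1 key."""
    cursor = conn.execute(
        'SELECT t2.item FROM t1 JOIN t2 ON t1."key" = t2.fkey '
        'WHERE t1."key" = ? ORDER BY t2.item',
        (key,),
    )
    return [row[0] for row in cursor]


print(items_for("A"))
```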

How to make Cassandra have a varying column key for a specific row key?

I was reading the following article about Cassandra:
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/#.UzIcL-ddVRw
and it seemed to imply that you can have varying column keys in Cassandra for a given row key. Is that true? And if it's true, how do you allow for varying column keys?
The reason I think this might be true is because say we have a user and it can like many items and we simply want the userId to be the rowkey. We let this rowKey (userID) map to all the items that specific user might like. Each specific user might like a different number of items. Therefore, if we could have multiple column keys, one for each itemID each user likes, then we could solve the problem that way.
Therefore, is it possible to have varying length of cassandra column keys for a specific rowKey? (and how do you do it)
Providing an example and/or some cql code would be awesome!
The thing that is confusing me is that I have seen some .cql files that define keyspaces beforehand, and it seems pretty inflexible with regard to making things dynamic, i.e. allowing additional columns as we please. For example:
CREATE TABLE IF NOT EXISTS results (
test blob,
tid timeuuid,
result text,
PRIMARY KEY(test, tid)
);
How can this even allow growing columns? Don't we need to specify the names beforehand anyway? Or can we add custom columns as the application desires?
Yes, you can have a varying number of columns per row_key. From a relational perspective, it's not obvious that tid is more than an ordinary column name: its values act as the variable column keys. Note that in the INSERT statements below, each new tid value adds another column under the same row_key, without any schema change.
CREATE TABLE IF NOT EXISTS results (
test text,
tid timeuuid,
data text,
PRIMARY KEY (test, tid)
);
So in your example, you need to identify the row_key, column_key, and payload of the table.
The primary key contains both the row_key and column_key.
test is your row_key.
tid is your column_key.
data is your payload.
The following inserts are all valid:
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_1', a4a70900-24e1-11df-8924-001ff3591711, 'blob_1');
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_1', a4a70900-24e1-11df-8924-001ff3591712, 'blob_2');
-- notice that the column_key changed but the row_key remained the same
INSERT INTO your_keyspace.results (test, tid, data) VALUES ('row_key_2', a4a70900-24e1-11df-8924-001ff3591711, 'blob_3');
See here
Have you thought of exploring the collection support in Cassandra for handling such relations in a colocated way (e.g. on the same data node)?
Not sure if it helps, but what about keeping the user id as the row key, with a map that has the item id as key and some value?
-Vivel

How to retrieve all the values of a super column for a set of row IDs from a column family in Hector Cassandra

I want to retrieve the different row id values depending on super column name.
For that purpose I have used this code
SuperColumnQuery<String, String, String, String> superColumnQuery =
HFactory.createSuperColumnQuery(keyspaceOperator, se, se,se,se);
superColumnQuery.setColumnFamily(COLUMN_FAMILY).setKey(rowID).setSuperName(superColumnName);
QueryResult<HSuperColumn<String, String, String>> result = superColumnQuery.execute();
//rowID contains a list of rows separated by ','
But it's not working.
Given that you're trying to select row keys based on column names, I'd venture to guess that your data model is backwards. You should generally be moving from the outside in -- select on row key, then on supercolumn name, then on column name. Otherwise you're going to be stuck iterating over rows in your code trying to match a column name, instead of using the Cassandra engine to select what you need. This approach is never going to scale.
So I'd suggest redoing your data model -- or if you need to have it this way, consider adding another ColumnFamily that serves as an index for the first. Contrary to old-school SQL databases, the credo in NoSQL dbs like Cassandra is "If you're denormalizing -- you're doing it right".
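The idea of a second column family acting as an index for the first can be sketched in plain Python, with dictionaries standing in for column families. This is only an in-memory model of the write and read paths, not Hector or Cassandra code; the names data_cf, index_cf, write_super_column, and rows_with_super_column, as well as the sample data, are hypothetical.

```python
from collections import defaultdict

# Stand-in for the original column family:
# row key -> super column name -> columns.
data_cf = defaultdict(dict)
# Stand-in for the index column family:
# super column name -> set of row keys that contain it.
index_cf = defaultdict(set)


def write_super_column(row_key, super_name, columns):
    """Write to the data column family and keep the index in step with it."""
    data_cf[row_key].setdefault(super_name, {}).update(columns)
    index_cf[super_name].add(row_key)


def rows_with_super_column(super_name):
    """Look up row keys by super column name without scanning every row."""
    return sorted(index_cf.get(super_name, set()))


write_super_column("row1", "address", {"city": "Noida"})
write_super_column("row2", "address", {"city": "Meerut"})
write_super_column("row2", "phone", {"mobile": "555-0100"})

print(rows_with_super_column("address"))
print(rows_with_super_column("phone"))
```

Every write goes to both structures, which is the denormalization the answer describes: the extra write buys a direct lookup by super column name.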
