Delete lots of rows from a very large Cassandra table

I have a table Foo with 4 columns A, B, C, D. The partitioning key is A. The clustering key is B, C, D.
I want to scan the entire table and find all rows where D is in set (X, Y, Z).
Then I want to delete these rows, but I don't want to "kill" Cassandra with the resulting compactions; I'd like these rows deleted with minimal disruption or risk.
How can I do this?

You have a big problem here. Indeed, you really can't find the rows without scanning all of your partitions. The real problem is that C* only lets you restrict your queries by the partition key first, and then by your clustering keys in the order in which they appear in the PRIMARY KEY declaration. So if your PK is like this:
PRIMARY KEY (A, B, C, D)
then you'd need to filter by A first, then by B, C, and only at the end by D.
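Concretely, a hedged CQL sketch with placeholder values (assuming foo is the table from the question and a-d are text columns):

-- valid: clustering keys restricted in declaration order
SELECT * FROM foo WHERE a = 'a1' AND b = 'b1' AND c = 'c1' AND d = 'X';
-- invalid as-is: d is restricted while b and c are not;
-- Cassandra rejects this unless you add ALLOW FILTERING
SELECT * FROM foo WHERE a = 'a1' AND d = 'X';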
That being said, for the part of finding your rows, if this is something you have to run only once, you:
1. Could scan the whole table and compare D in your application logic.
2. Could, if you know the values of A, query every partition in parallel and then compare D in your application.
3. Could attach a secondary index and try to exploit speed from there.
Please note that, depending on how many nodes you have, option 3 is really not an option (secondary indexes don't scale). A rough sketch of the find-and-delete approach follows below.
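As a hedged sketch (again assuming foo is the table from the question and a-d are text columns): you can either page through the whole table and compare d client-side, or push the comparison to the server with ALLOW FILTERING; either way the whole table is read. The delete must then name the full primary key of every matching row:

-- full scan filtering on d; this touches every partition,
-- so run it once, off-peak
SELECT a, b, c, d FROM foo WHERE d = 'X' ALLOW FILTERING;

-- then, for each (a, b, c, d) returned, delete that exact row
DELETE FROM foo WHERE a = ? AND b = ? AND c = ? AND d = ?;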
If you need to perform such tasks multiple times, I'd suggest you create another table that satisfies this query, something with PRIMARY KEY (D); you'd then just scan three partitions, which would be very fast. A sketch follows below.
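A hedged sketch of such a lookup table (name and types are placeholders):

CREATE TABLE foo_by_d (
d TEXT,
a TEXT,
b TEXT,
c TEXT,
PRIMARY KEY (d, a, b, c)
);

-- each value of d is now one partition, so finding the rows
-- is just three single-partition reads:
SELECT a, b, c FROM foo_by_d WHERE d IN ('X', 'Y', 'Z');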
About deleting your rows: I don't think there's a way to do it without creating tombstones and eventually triggering compactions; they are part of C* and you have to live with them. If you really can't tolerate tombstone creation and/or compactions, the only alternative is to not delete rows from a C* cluster at all, and that often means rethinking your data model into one that won't need deletes.

Related

Cannot create materialized view using SELECT query with PER PARTITION LIMIT

Table:
CREATE TABLE IF NOT EXISTS table (
a TEXT,
b TEXT,
c BIGINT,
PRIMARY KEY ((a, b), c)
) WITH CLUSTERING ORDER BY (c DESC);
I need to get only one record from each (a, b) partition, across the entire selection, with c in DESC order and b in ASC order:
SELECT * FROM table WHERE a='a-1' ORDER BY b ASC PER PARTITION LIMIT 1 ALLOW FILTERING;
Result:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
I tried to create a materialized view to order by b:
CREATE MATERIALIZED VIEW IF NOT EXISTS table_view AS
SELECT a, b, c
FROM table
WHERE a IS NOT NULL AND b IS NOT NULL AND c IS NOT NULL
PER PARTITION LIMIT 1
PRIMARY KEY (a, b, c)
WITH CLUSTERING ORDER BY (b ASC, c DESC);
I get an error on the PER PARTITION LIMIT clause while creating it.
Is it really possible to do this? Or maybe there is some workaround for this case?
I'll try to explain why Scylla (and Cassandra) do not support the things you tried to do.
In Scylla (and Cassandra), partition keys are not ordered in any useful way - they are ordered by a hash function of the partition key, not by the partition key itself. In your case, the partition key is (a, b) - that is - the full pair. The restriction WHERE a='...' may match a million different partitions with partition keys ('...', b) for a million different b's, and these are not ordered by b's... Not only are they not ordered by b's - they aren't even colocated on the same node. The only way for Scylla to implement the WHERE a='...' restriction is to do a full-table scan across the entire cluster. This is why you had to add ALLOW FILTERING.
But even then, there is no O(n) way to implement the ORDER BY b, and this is why Scylla refuses to do it. As I said above, the query WHERE a='...' may return a million different partitions (a, b). Scylla would need to collect those million results, sort them all, and only then return them ordered by b. It can't do that. Scylla can scan an already-sorted partition (this is what the error message is telling you), but it cannot sort unsorted results.
You can argue that Scylla could do in this case what search engines do, namely not sort the full result list up-front (O(n log n) time, O(n) space), but rather collect only the top K results while scanning the entire table. But that makes paging through the entire result set inefficient: Scylla would need to redo the entire scan for each page. That's not something Scylla does in any other case.
Finally, for the materialized view, there is a different problem. You're right that PER PARTITION LIMIT is not supported there; there is a real difficulty in implementing it. Imagine the following scenario:
1. You add an item with key a=1, b=1, c=1 to the base table. It is added to the view as well.
2. You add an item with key a=1, b=1, c=2. Because of the per-partition limit, and because there is already an item with the same partition key (a=1, b=1), this new item is not inserted into the view.
3. You now delete the item with key a=1, b=1, c=1. It is deleted from the view as well, but now Scylla needs to figure out that it must add a=1, b=1, c=2 to the view, because there is now room for this item within the per-partition limit.
Step 3 is difficult and inefficient, so Scylla does not currently support this use case.
Your query is invalid. As the error states, you can only use the ORDER BY clause if you specify the partition key.
In your case, the partition key is (a, b) -- not just column a, but BOTH a AND b. You cannot use ORDER BY on column b because it is part of the partition key.
You can however ORDER BY c because it is a clustering column and not part of the partition key.
SELECT ... FROM table
WHERE a = ?
AND b = ?
ORDER BY c ...
Note that in this example, BOTH a and b are restricted by the equality (EQ) operator. Cheers!

Sparse matrix using column store on MemSQL

I am new to the column-store DB family and some of the concepts are not yet completely clear to me. I want to use MemSQL to store a sparse matrix.
The table would look something like this:
CREATE TABLE matrix (
r_id INT,
c_id INT,
cell_data VARCHAR(10),
KEY (`r_id`, `c_id`) USING CLUSTERED COLUMNSTORE
);
The Queries:
Q1: SELECT c_id, cell_data FROM matrix WHERE r_id=<val>; -- i.e. a whole row
Q2: SELECT r_id, cell_data FROM matrix WHERE c_id=<val>; -- i.e. a whole column
Q3: SELECT cell_data FROM matrix WHERE r_id=<val1> AND c_id=<val2>; -- i.e. one cell
Q4: UPDATE matrix SET cell_data=<val> WHERE r_id=<val1> AND c_id=<val2>;
Q5: INSERT INTO matrix VALUES (<v1>, <v2>, <v3>);
Queries Q1 and Q2 are about equally frequent, and Q3, Q4 and Q5 are also about equally frequent among themselves; the two groups occur in roughly a 1:1 ratio (Q1,2 : Q3,4,5 ~= 1:1).
I do realize that inserting into a columnstore one row at a time creates a row segment group per insert, degrading performance. I cannot batch the inserts. Also, I cannot use the in-memory rowstore (the matrix is too big).
I have three questions:
1. Does the issue with single-row inserts concern updates too, if only cell_data is changed (i.e. Q4)?
2. Would it be possible to have an in-memory row table into which I would do the INSERT (and UPDATE?) operations, and periodically batch its contents into the column table?
2a. How would I perform Q1,2 if I need the most recent data (UNION ALL?)?
2b. Is it possible to avoid executing Q3 against both tables (which would mean two round trips)?
3. I am concerned about the execution speed of Q1 and Q2. Is the clustered key optimal for those? I am not sure how the records would be stored with the table above.
1.
Yes, single-row updates also perform poorly - they are essentially a delete and an insert.
2.
Yes, and in fact we automatically do this behind the scenes - the most recently inserted data (if it is too small a number of rows to be a good columnar segment) is kept in an in-memory rowstore form, and read queries are essentially looking at a UNION ALL of that data and the column-oriented data. We then batch up this data to write into column-oriented form.
If that doesn't work well enough, depending on your workload, you may benefit from explicitly keeping some of your data in a rowstore table instead of relying on the above behavior, in which case:
2a. yes, to see the most recent data you would use UNION ALL
2b. the data could be in either table, so you would have to query both (as for Q1,2, UNION ALL works). This does not require two round trips, just one.
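For example, a hedged sketch of the explicit two-table approach, where matrix_hot is a hypothetical rowstore table holding recent writes alongside the columnstore table above:

-- Q1 against both tables in one statement and one round trip
SELECT c_id, cell_data FROM matrix WHERE r_id = 42
UNION ALL
SELECT c_id, cell_data FROM matrix_hot WHERE r_id = 42;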
3.
You can order by either r or c first in the columnstore key - r in your current schema. This makes queries for a row efficient, but queries for a column will be very inefficient; they may have to scan basically the full table (depending on the patterns in your data). Unfortunately, columnstore tables do not support multiple keys, so there is no good way around this. One potential hacky solution is to maintain two copies of your table, one with key (r, c) and one with key (c, r) - essentially maintaining two indexes by hand (sketched below).
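A hedged sketch of that two-copies hack (table names are placeholders; your application has to write every change to both tables):

CREATE TABLE matrix_by_row (
r_id INT,
c_id INT,
cell_data VARCHAR(10),
KEY (`r_id`, `c_id`) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE matrix_by_col (
c_id INT,
r_id INT,
cell_data VARCHAR(10),
KEY (`c_id`, `r_id`) USING CLUSTERED COLUMNSTORE
);

-- Q1 goes to matrix_by_row, Q2 to matrix_by_col
SELECT c_id, cell_data FROM matrix_by_row WHERE r_id = 42;
SELECT r_id, cell_data FROM matrix_by_col WHERE c_id = 7;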
Based on the workload you're describing, it sounds like you are doing many single-row queries (Q3,4,5, which is 50% of the workload), which rowstore is much better suited for than columnstore (see http://docs.memsql.com/latest/concepts/columnstore/). Unfortunately, if it doesn't fit in memory, there isn't really a good way around this other than perhaps to add more memory.

Update Different Columns in One Row Concurrently in Cassandra

In Cassandra, if I update different columns concurrently in one row, will there be any write conflicts?
For example I have a table
CREATE TABLE foo (k text, a text, b text, PRIMARY KEY (k))
One thread in my code updates column a
INSERT INTO foo (k, a) VALUES ('hello', 'foo')
while the other thread updates column b
INSERT INTO foo (k, b) VALUES ('hello', 'bar').
When running concurrently, it is possible that the two queries arrive at the server at the same time.
Could I always expect the same result as I update the two columns in one CQL?
INSERT INTO foo(k, a, b) VALUES ('hello', 'foo', 'bar')
Will there be any write conflicts? Is each insertion atomic?
As Tom mentioned in his reply, in Cassandra all operations are column-based, so each column has its own timestamp. In that case the scenario above should not cause any trouble, given that one thread only updates column a while the other only updates column b. Is my understanding correct?
Thank you!
Write conflicts are resolved by having each write carry a timestamp, with the highest timestamp winning. If two writes to the same column arrive with exactly the same timestamp, Cassandra picks a winner deterministically (by comparing the written values themselves).
So write conflicts are not something you need to worry about here: one thread writes column a, the other writes column b, and combining those two queries into the single one will do the right thing.
Of course, it is very important that your servers' clocks are synchronized, or funny things may happen.
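If you want to see those per-column timestamps yourself, CQL exposes them through the WRITETIME function; a quick sketch against the foo table above:

SELECT a, WRITETIME(a), b, WRITETIME(b) FROM foo WHERE k = 'hello';
-- each column carries its own timestamp, which is why concurrent
-- writes to different columns of the same row do not conflict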

performance of using primary key and secondary index in a query in cassandra

Let's say there is a table with 3 columns: A, B, and C, where A is the primary key.
I have two types of queries: one that searches by A and B, and another that searches by A and C.
Is it better to add a secondary index on C to search by A and C, or to make a new table with A, C, and B columns?
To put it in a different perspective: in general it is a bad idea to have secondary indexes on two columns and a WHERE clause that conditions on both of them. Is it the same when combining the primary key with a secondary index?
https://www.youtube.com/watch?v=CbeRmb8fI9s#t=56
https://www.youtube.com/watch?v=N6UY1y3dgAk#t=30
Secondary indexes almost never aid performance; they are mostly a tool of convenience for letting queries explore your data. Almost all performance gains come from properly structuring your primary key and creating schemas that model the queries you want to perform.
So having two tables, A-by-B and A-by-C, would most likely be the ideal solution, and it will actually scale with your data. A sketch of what that could look like follows below.
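A hedged sketch of those two tables, assuming text columns (names are placeholders):

CREATE TABLE data_by_b (
a TEXT,
b TEXT,
c TEXT,
PRIMARY KEY (a, b)
);

CREATE TABLE data_by_c (
a TEXT,
c TEXT,
b TEXT,
PRIMARY KEY (a, c)
);

-- each query becomes a direct primary-key lookup:
SELECT * FROM data_by_b WHERE a = ? AND b = ?;
SELECT * FROM data_by_c WHERE a = ? AND c = ?;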

Why do many refer to Cassandra as a column-oriented database?

Reading several papers and documents on the internet, I found much contradictory information about the Cassandra data model. Many sources identify it as a column-oriented database, others as row-oriented, and still others define it as a hybrid of both.
According to what I know about how Cassandra stores files, it uses the *-Index.db file to seek to the right position in the *-Data.db file, where the bloom filter, the column index, and finally the columns of the required row are stored.
In my opinion, this is strictly row-oriented. Is there something I'm missing?
If you take a look at the Readme file in the Apache Cassandra git repo, it says:
Cassandra is a partitioned row store. Rows are organized into tables
with a required primary key.
Partitioning means that Cassandra can distribute your data across
multiple machines in an application-transparent manner. Cassandra will
automatically repartition as machines are added and removed from the
cluster.
Row store means that like relational databases, Cassandra organizes
data by rows and columns.
Column oriented or columnar databases are stored on disk column wise.
e.g., a Bonuses table:
ID  Last   First  Bonus
1   Doe    John   8000
2   Smith  Jane   4000
3   Beck   Sam    1000
In a row-oriented database management system, the data would be stored like this: 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
In a column-oriented database management system, the data would be stored like this:
1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
Cassandra is basically a column-family store
Cassandra would store the above data as,
"Bonuses" : {
row1 : { "ID":1, "Last":"Doe", "First":"John", "Bonus":8000},
row2 : { "ID":2, "Last":"Smith", "First":"Jane", "Bonus":4000}
...
}
Also, the number of columns in each row doesn't have to be the same. One row can have 100 columns and the next row can have only 1 column.
Yes, the "column-oriented" terminology is a bit confusing.
The model in Cassandra is that rows contain columns. To access the smallest unit of data (a column) you have to specify first the row name (key), then the column name.
So in a columnfamily called Fruit you could have a structure like the following example (with 2 rows), where the fruit types are the row keys, and the columns each have a name and value.
apple -> colour weight price variety
"red" 100 40 "Cox"
orange -> colour weight price origin
"orange" 120 50 "Spain"
One difference from a table-based relational database is that one can omit columns (orange has no variety), or add arbitrary columns (orange has origin) at any time. You can still imagine the data above as a table, albeit a sparse one where many values might be empty.
However, a "column-oriented" model can also be used for lists and time series, where every column name is unique (and here we have just one row, but we could have thousands or millions of columns):
temperature -> 2012-09-01 2012-09-02 2012-09-03 ...
40 41 39 ...
which is quite different from a relational model, where one would have to model the entries of a time series as rows, not columns. This type of usage is often referred to as "wide rows".
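In modern CQL, this wide-row pattern is what a clustering column gives you; a hedged sketch (table and column names are placeholders):

CREATE TABLE temperature (
sensor TEXT,
day DATE,
value INT,
PRIMARY KEY (sensor, day)
);

-- one partition per sensor; each day becomes one "column" (cell)
-- in that wide row, stored sorted by day
SELECT day, value FROM temperature WHERE sensor = 'station-1';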
You both make good points and it can be confusing. In the example where
apple -> colour weight price variety
"red" 100 40 "Cox"
apple is the key value and the column is the data, which contains all 4 data items. From what was described, it sounds like all 4 data items are stored together as a single object and then parsed by the application to pull just the value required. Therefore, from an IO perspective, I need to read the entire object. IMHO this is inherently row-based (or object-based), not column-based.
Column-based storage became popular for warehousing because it offers extreme compression and reduced IO for full table scans (DW), but at the cost of increased IO for OLTP when you need to pull every column (select *). Most queries don't need every column, and thanks to compression the IO can be greatly reduced for full table scans over just a few columns. Let me provide an example:
apple -> colour weight price variety
"red" 100 40 "Cox"
grape -> colour weight price variety
"red" 100 40 "Cox"
We have two different fruits, but both have colour = red. If we store colour in a separate disk page (block) from weight, price and variety, so that the only thing stored there is colour, then when we compress the page we can achieve extreme compression thanks to heavy de-duplication. Instead of storing 100 rows (hypothetically) in a page, we can store 10,000 colours. Now, reading everything with colour red might take 1 IO instead of thousands of IOs, which is really good for warehousing and analytics, but bad for OLTP if I need to update the entire row, since the row might have hundreds of columns and a single update (or insert) could require hundreds of IOs.
Unless I'm missing something, I wouldn't call this columnar-based; I'd call it object-based. It's still not clear how objects are arranged on disk. Are multiple objects placed into the same disk page? Is there any way of ensuring objects with the same metadata go together? Given that one fruit might contain different data than another fruit, since it's just metadata or XML or whatever you want to store in the object itself, is there a way to ensure certain matching fruit types are stored together to increase efficiency?
Larry
The most unambiguous term I have come across is wide-column store.
It is a kind of two-dimensional key-value store, where you use a row key and a column key to access data.
The main difference between this model and the relational ones (both row-oriented and column-oriented) is that the column information is part of the data.
This implies data can be sparse: different rows don't need to share the same column names or the same number of columns. This enables semi-structured data or schema-free tables.
You can think of wide-column stores as tables that can hold an unlimited number of columns, and thus are wide.
Here's a couple of links to back this up:
This mongodb article
This Datastax article mentions it too, although it classifies Cassandra as a key-value store.
This db-engines article
This 2013 article
Wikipedia
Column family does not mean column-oriented. Cassandra is a column-family store but not column-oriented: it stores each row with all its column families together.
HBase is also a column-family store, but it stores column families in a column-oriented fashion: different column families are stored separately on a node, or they can even reside on different nodes.
IMO that's the wrong term for Cassandra. Instead, it is more appropriate to call it a row-partition store. Let me give you some details:
Primary Key, Partitioning Key, Clustering Columns, and Data Columns:
Every table must have a primary key with a unique constraint.
Primary Key = Partition key + Clustering Columns
# Example
Primary Key: ((col1, col2), col3, col4)  # uniquely identifies a row; we choose the
                                         # partition key and clustering columns so
                                         # that each row is uniquely identified
Partition Key: (col1, col2)              # decides which node stores the data;
                                         # mandatory, made up of one or more columns
Clustering Columns: col3, col4           # decide the arrangement within a
                                         # partition; optional
The partition key is the first component of the primary key. Its hashed value is used to determine the node that stores the data. The partition key can be a compound key consisting of multiple columns. We want an almost equal spread of data across nodes, and we keep this in mind while choosing the primary key.
Any fields listed after the partition key in the primary key are called clustering columns. These order the data within a partition (ascending by default). The clustering columns also help make each row's primary key unique.
You can use as many clustering columns as you like, but you cannot use them out of order in a SELECT statement: you may omit trailing clustering columns from your WHERE clause, but you cannot restrict a clustering column without also restricting all the clustering columns defined before it. Likewise, you cannot filter on a regular column unless the columns of the primary key are restricted too. For example, if the primary key is (year, artist_name, album_name) and you want to use a city column in your WHERE clause, you can do so only if the WHERE clause also makes use of all of the columns that are part of the primary key. A sketch follows below.
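A hedged CQL sketch of these rules (table and data are made up for illustration):

CREATE TABLE albums (
year INT,
artist_name TEXT,
album_name TEXT,
city TEXT,
PRIMARY KEY ((year), artist_name, album_name)
);

-- OK: clustering columns used in declaration order
SELECT * FROM albums WHERE year = 2012 AND artist_name = 'Adele';

-- OK: trailing clustering columns omitted entirely
SELECT * FROM albums WHERE year = 2012;

-- NOT OK: album_name is restricted but artist_name, which precedes it, is not
-- SELECT * FROM albums WHERE year = 2012 AND album_name = '21';

-- filtering on the regular column city works once the primary key columns
-- are restricted as well (Cassandra may still require ALLOW FILTERING)
SELECT * FROM albums WHERE year = 2012 AND artist_name = 'Adele'
AND album_name = '21' AND city = 'London' ALLOW FILTERING;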
Tokens:
Cassandra uses tokens to determine which node holds which data. A token is a 64-bit integer, and Cassandra assigns ranges of these tokens to nodes so that every possible token is owned by some node. Adding nodes to the cluster or removing old ones redistributes these token ranges among the nodes.
A row's partition key is used to calculate a token using a given partitioner (a hash function for computing the token of a partition key) to determine which node owns that row.
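You can ask for a row's token directly in CQL; a small sketch using the example key above (the table name is a placeholder, and token() must be given the full partition key):

-- token() returns the 64-bit token that decides which
-- node's range the row falls into
SELECT col1, col2, token(col1, col2) FROM my_table;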
Cassandra is Row-partition store:
Row is the smallest unit that stores related data in Cassandra.
Don't think of Cassandra's column family (that is, table) as an RDBMS table, but think of it as a dict of a dict (here a dict is a data structure similar to Python's OrderedDict):
the outer dict is keyed by the row key (primary key): this determines which partition, and which row within that partition, the data lives in
the inner dict is keyed by the column key (data columns): this holds the data, with column names as keys
both dicts are sorted by their keys: the outer dict by primary key, the inner one by column name
This model allows you to omit columns or add arbitrary columns at any time, as it allows you to have different data columns for different rows.
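A minimal hedged sketch of that flexibility (hypothetical table; each row materializes only the columns it was actually given):

CREATE TABLE users (
id TEXT PRIMARY KEY,
name TEXT,
email TEXT,
phone TEXT
);

INSERT INTO users (id, name, email) VALUES ('u1', 'Ada', 'ada@example.com');
INSERT INTO users (id, phone) VALUES ('u2', '555-0100');
-- 'u1' has no phone cell and 'u2' has no name or email cells;
-- the missing columns simply do not exist on disk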
Cassandra has a concept of column families (tables), which originally comes from Bigtable. However, as you mentioned, it is really misleading to call them column-oriented. Within each column family, all the columns of a row are stored together, along with the row key, and no column compression is used. Thus, the Bigtable model is still mostly row-oriented.
