Cassandra 1.2 : Updating type in primary Key CQL3 - cassandra

We currently have a table defined as below
create table tableA(id int,
seqno int,
data text,
PRIMARY KEY((id), seqno))
WITH CLUSTERING ORDER BY (seqno DESC);
We need to update the type for the id column from int to text. We are wondering out of the two approaches, would be the most advisable.
ALTER TABLE tableA ALTER id TYPE varchar; (the command succeeds but then we have issues reading the data. Is this because the ALTER table doesn't update the underlying storage of the id column?)
COPY to/from oldtable/newtable. This works but we have issues with the RPC timeout (which we can change), but is this a bad idea on a table across a cluster?
We have checked the online docs and these are only 2 options we can find around this. are there other options??
Thanks
Paul

I would say option 1 isn't really supported. If your integers don't map to actual strings you're going to have problem, you're probably seeing key validation errors.
for option 2 you probably just need to copy smaller chunks of data for each read/write.

Related

Cloud Spanner complex primary key and queries

I'm playing with Cloud Spanner and I created an imgur clone with the schema as follows:
CREATE TABLE Images (id STRING(36) NOT NULL, createdAt TIMESTAMP, caption STRING(1024), fileType STRING(10)) PRIMARY KEY (id, createdAt DESC)
The id is a version 4 UUID as the GCP documentation specifies so that I avoid hotspots. The createdAt is a timestamp when an image is first created. I have my PRIMARY KEY defined as (id, createdAt DESC) so that I can more easily query by latest added images.
What I don't understand is what happens if I want to get a single image using only SELECT * FROM Images WHERE id = 'some UUID? Will Spanner still search by key in an efficient way, meaning getting the information from the server that stores the specific key in its key range even though I only specified a part of the primary key?
In your simple example, yes. It will try to come up with an efficient execution plan which may include using an index (automatically created for PKs) even though your predicate is on just 1 of the 2-column composite PK because it is on the 1 column. If your predicate was just "...createdAt= then it will scan the table. It would be far more expensive to find matches for col2 in your composite PK of (col1, col2) than it is to just scan col2.
This assumes there's enough data to matter. For example, if you have 42 rows, it really won't matter how you execute the query or what predicates were provided; the number off I/O requests (often the most expensive part of a query) will be the same.
In general, Spanner tries to pick the index it thinks will be most efficient. The actual physical steps don't work like that but conceptually, it's a reasonable way to think about it.
Whether an index is helpful or not depends on a few things and whether it gets picked or not also has dependencies. Does it have statistics, are the statistics correct/fresh, is it making correct estimates on row counts, etc... Composite indexes/keys are a just a bit more interesting as noted above.
Just make sure you always test with enough data (closely matching your production environment if possible).

Regarding Cassandra's (sloppy, still confusing) documentation on keys, partitions

I have a high-write table I'm moving from Oracle to Cassandra. In Oracle the PK is a (int: clientId, id: UUID). There are about 10 billion rows. Right off the bat I run into this nonsensical warning:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useWhenIndex.html :
"If you create an index on a high-cardinality column, which has many distinct values, a query between the fields will incur many seeks for very few results. In the table with a billion songs, looking up songs by writer (a value that is typically unique for each song) instead of by their artist, is likely to be very inefficient. It would probably be more efficient to manually maintain the table as a form of an index instead of using the Cassandra built-in index."
Not only does this seem to defeat efficient find by PK it fails to define what it means to "query between the fields" and what the difference is between a built-in index, a secondary-index, and the primary_key+clustering subphrases in a create table command. A junk description. This is 2019. Shouldn't this be fixed by now?
AFAIK it's misleading anyway:
CREATE TABLE dev.record (
clientid int,
id uuid,
version int,
payload text,
PRIMARY KEY (clientid, id, version)
) WITH CLUSTERING ORDER BY (id ASC, version DESC)
insert into record (id,version,clientid,payload) values
(d5ca94dd-1001-4c51-9854-554256a5b9f9,3,1001,'');
insert into record (id,version,clientid,payload) values
(d5ca94dd-1002-4c51-9854-554256a5b9e5,0,1002,'');
The token on clientid indeed shows they're in different partitions as expected.
Turning to the big point. If one was looking for a single row given the clientId, and UUID ---AND--- Cassandra allowed you to skip specifying the clientId so it wouldn't know which node(s) to search, then sure that find could be slow. But it doesn't:
select * from record where id=
d5ca94dd-1002-4c51-9854-554256a5b9e5;
InvalidRequest: ... despite the performance unpredictability,
use ALLOW FILTERING"
And ditto with other variations that exclude clientid. So shouldn't we conclude Cassandra handles high cardinality tables searches that return "very few results" just fine?
Anything that requires reading the entire context of the database wont work which is the case with scanning on id since any of your clientid partition key's may contain one. Walking through potentially thousands of sstables per host and walking through each partition of each of those to check will not work. If having hard time with data model and not totally getting difference between partition keys and clustering keys I would recommend you walk through some introduction classes (ie datastax academy), youtube videos or book etc before designing your schema. This is not a relational database and designing around your data instead of your queries will get you into trouble. When moving from oracle you should not just copy your tables over and move the data or it will not work as well.
The clustering key is the order in which the data for a partition is ordered on disk which is what it is referring to as "build-in index". Each sstable has an index component that contains the partition key locations for that sstable. This also includes an index of the clustering keys for each partition every 64kb (by default at least) that can be searched on. The clustering keys that exist between each of these indexed points are unknown so they all have to be checked. A long time ago there was a bloom filter of clustering keys kept as well but it was such a rare use case where it helped vs the overhead that it was removed in 2.0.
Secondary indexes are difficult to scale well which is where the warning comes from about cardinality, I would strongly recommend just denormalizing data and not using index in any form as using large scatter gather queries across a distributed system is going to have availability and performance issues. If you really need it check out http://www.doanduyhai.com/blog/?p=13191 to try to get the data right (not worth it in my opinion).

Cassandra secondary vs extra table and read

I'm facing a dilemma that my small knowledge of Cassandra doesn't allow me to solve.
I have a index table used to retrieve data from an item (a notification) using an external id. However, the data contained in that table (in that case the status of the notification) is modified so I need to update the index table as well. Here is the tables design:
TABLE notification_by_external_id (
external_id text,
partition_key_date text,
id uuid,
status text,
...
PRIMARY KEY (external_id, partition_key_date, id)
);
TABLE notification (
partition_key_date text,
status text,
id uuid,
...
PRIMARY KEY (partition_key_date, status, id)
);
The problem is that when I want to update the notification status (and hence the notification_by_external_id table), I don't have access to the external ID.
So far I came up to 2 solutions, none of which seems optimal, and I can't decide which one to go with.
Solution 1
Create an index on notification_by_external_id.id, but this will obviously be a high cardinality column. There can be several external IDs for each notifications, but we're talking about something around 5-10 to one top.
Solution 2
Create a table
TABLE external_id_notification (
notification_id uuid,
external_id text
PRIMARY KEY (notification_id, external_id)
);
but that would mean making one extra read operation (and of course maintain another table) which I understood is also a bad practice.
The thing to understand about secondary indexes is, that their scalability issue is not with the number of rows in the table, but with the amount of nodes in your cluster. A select on an index column means that every single node will have to process it and respond to it, just that it itself will be able to process the select efficiently.
Use secondary indexes for administrative purposes (i.e. you on cqlsh) only. Do not use it for productive purposes.
That being said. You could duplicate all the information into your external_id_notification table. That would alleviate the need for an extra read operation. I know that relational databases taught you, that duplicate data is bad (what if it differs?), and that you should always normalize. But you are not on a relational database. Denormalization is a thing, and on Cassandra, you should always go for that, unless you absolutely cannot.

How to copy data from a Cassandra table to another structure for better performance

In several places it's advised to design our Cassandra tables according to the queries we are going to perform on them. In this article by DataScale they state this:
The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you’ll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.
[...]
If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn’t a handicap against multiple writes.
I have understood this, and now my question is: provided that I have an existing table, say
CREATE TABLE invoices (
id_invoice int PRIMARY KEY,
year int,
id_client int,
type_invoice text
)
But I want to query by year and type instead, so I'd like to have something like
CREATE TABLE invoices_yr (
id_invoice int,
year int,
id_client int,
type_invoice text,
PRIMARY KEY (type_invoice, year)
)
With id_invoice as the partition key and year as the clustering key, what's the preferred way to copy the data from one table to another to perform optimized queries later on?
My Cassandra version:
user#cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]
You can use cqlsh COPY command :
To copy your invoices data into csv file use :
COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv';
And Copy back from csv file to table in your case invoices_yr use :
COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv';
If you have huge data you can use sstable writer to write and sstableloader to load data faster.
http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
To echo what was said about the COPY command, it is a great solution for something like this.
However, I will disagree with what was said about the Bulk Loader, as it is infinitely harder to use. Specifically, because you need to run it on every node (whereas COPY needs to only be run on a single node).
To help COPY scale for larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters.
COPY invoices(id_invoice, year, id_client, type_invoice)
TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.
For more info, check out this article titled: New options and better performance in cqlsh copy.
An alternative to using COPY command (see other answers for examples) or Spark to migrate data is to create a materialized view to do the denormalization for you.
CREATE MATERIALIZED VIEW invoices_yr AS
SELECT * FROM invoices
WHERE id_client IS NOT NULL AND type_invoice IS NOT NULL AND id_client IS NOT NULL
PRIMARY KEY ((type_invoice), year, id_client)
WITH CLUSTERING ORDER BY (year DESC)
Cassandra will fill the table for you then so you wont have to migrate yourself. With 3.5 be aware that repairs don't work well (see CASSANDRA-12888).
Note: that Materialized Views are probably not best idea to use and has been changed to "experimental" status

An Approach to Cassandra Data Model

Please note that I am first time using NoSQL and pretty much every concept is new in this NoSQL world, being from RDBMS for long time!!
In one of my heavy used applications, I want to use NoSQL for some part of the data and move out from MySQL where transactions/Relational model doesn't make sense. What I would get is, CAP [Availability and Partition Tolerance].
The present data model is simple as this
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (interger)|
We can safely assume that this part of application is similar to Logging of the Activity!
I would like to move this to NoSQL as per my requirements and separate from Performance Oriented MySQL DB.
Cassandra says, everything in it is simple Map<Key,Value> type! Thinking in terms of Map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as key and store the rest of the data in values!
After reading through User Defined Types in Cassandra, can I use UserDefinedType as value which essentially leverage as One Key and multiple values! Otherwise, Use it as normal column level without UserDefinedType! One idea is to use the same model for different applications across systems where it would be simple logging/activity data can be pushed to the same, since the key varies from application to application and within application each entity will be unique!
No application/business function to access this data without Key, or in simple terms no requirement to get data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the cassandra data model a bit (or at least, a part of it). You create tables like so:
create table event(
id uuid,
timestamp timeuuid,
some_column text,
some_column2 list<text>,
some_column3 map<text, text>,
some_column4 map<text, text>,
primary key (id, timestamp .... );
Note the primary key. There's multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys. To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done on the selected partition. If you don't specify a partition key, you make a cluster wide query, which may be slow or most likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections - sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections. e.g. A Person may have a map of addresses: map. You would typically store info in columns if you needed to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column which would let you store "arbitrary" key-value data; which is what it seems you're looking to do.
One thing to watch out for... your primary key is unique per records. If you do another insert with the same pk, you won't get an error, it'll simply overwrite the existing data. Everything in cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources....so you should be able to aggregate data across mysql and cassandra for analytics).
Lastly, if your data is time series log data, cassandra is a very very good choice.

Resources