Change order of composite primary key - Cassandra

I have a Cassandra cluster and I want to use the CQL "IN" query. Therefore I have to change the order of the elements in my composite primary key (only the last element is available for "IN" queries). The table is quite big but does not span multiple nodes right now.
So what I have tried now (which is not working) is the following:
create a new column family with identical columns but different order of primary key elements
stop write processes and nodetool flush
copy all /data/keyspace/columnfamily/ files
rename the files to match the new column family name
use the sstable loader to load the files into the new column family
But afterwards the primary key is just messed up:
Failed to decode value '53ccb45d4ab0d3560e8c36fd' (for column 'cent') as int: unpack requires a string argument of length 4
I also cannot use COPY ... TO ... because it just keeps timing out.
Any ideas?

There are a couple of good bulk loaders available on GitHub that work better and won't time out like the cqlsh COPY TO/FROM tool.
You can find them here or here.
Otherwise I'd recommend using something like Spark to move the data for you.
You could also use Scala (with the Spark Cassandra connector) once you have your second table created:
// requires the spark-cassandra-connector on the classpath
import com.datastax.spark.connector._

val mydata = sc.cassandraTable("mykeyspace", "mytable")
  .select("key", "column1", "column2", "column3")
mydata.saveToCassandra("whateverkeyspace", "whatevertable", SomeColumns("key", "column1", "column2", "column3"))
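For reference, the "second table" is just the same columns with the key components reordered so that the column you want to use with IN comes last. A hypothetical sketch reusing the names from the snippet above (the types and key layout are assumptions, not from the question):
-- suppose the original key was PRIMARY KEY ((key), column1, column2)
CREATE TABLE whateverkeyspace.whatevertable (
    key text,
    column1 text,
    column2 text,
    column3 text,
    PRIMARY KEY ((key), column2, column1)
);

-- column1 is now the last clustering column, so it can be used with IN
SELECT * FROM whateverkeyspace.whatevertable
WHERE key = 'k' AND column2 = 'x' AND column1 IN ('a', 'b');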

Related

Is it possible to insert or update without providing all primary keys in a Cassandra database?

I have an application which uses Cassandra as its database, and a row of the table is filled in three separate moments (by three inputs). I have four primary key columns in that table, and not all of them are available at every moment an insert or update happens.
The error is:
'Some partition key parts are missing' when trying to insert or update.
Please consider that my application has to do a lot of writes (nearly 300,000) to the database in a short interval, so I want to make the most of the write capacity the database offers.
Maybe one approach could solve the issue: 'first read from the database, then write to it, and use dummy values for any primary key that is not available at the moment of inserting or updating'. But that would add roughly another 300,000 reads to the database, which would slow down the entire database and my application.
So I am looking for another solution.
four primary key columns in that table, and these keys are not all available at the moment of the insert or update.
As you are finding out, that is not possible. Partition keys in particular are used (hashed) to determine which node in the cluster is primarily responsible for the data. As they are a fundamental part of the Cassandra write path, they must be complete at write time and cannot be changed or updated later.
The same is true with clustering keys (keys which determine the on-disk sort order of all data within a partition). Omitting one or more will yield this message:
Some clustering keys are missing
Unfortunately, there isn't a good way around this. A row's keys must all be known before writing. It's worth mentioning that keys in Cassandra are unique, so any attempt to update them will result in a new row.
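As a hedged illustration (the table and column names here are made up, not taken from the question):
CREATE TABLE events (
    source_id int,
    input_no int,
    step_no int,
    item_no int,
    payload text,
    PRIMARY KEY ((source_id, input_no), step_no, item_no)
);

-- rejected: the partition key column input_no is missing
INSERT INTO events (source_id, step_no, item_no, payload) VALUES (1, 1, 1, 'x');

-- accepted: every partition key and clustering column is supplied
INSERT INTO events (source_id, input_no, step_no, item_no, payload) VALUES (1, 7, 1, 1, 'x');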
Perhaps writing the data to a streaming topic or message broker (like Pulsar or Kafka) beforehand would be a better option? Then, once all keys are known, the data (message) can be consumed from the topic and written to Cassandra.

Spark SQL - Identifying What Partitions Were Output by an INSERT Statement

I am trying to find out if there is a way for a Spark SQL job to identify what partition values were output in both static and dynamic partition scenarios.
I am already familiar with ApplicationListeners and QueryListeners and have found that, combined, they let you access the exact static partition values that were written. When a dynamic partition write is performed, however, you only get the partition keys and not the corresponding values.
For these jobs, there is a constraint that they run off of only SQL so I cannot use some sort of accumulator or broadcast variable.
e.g.
spark.sql("INSERT OVERWRITE .....")
As an example, say I have a job writing out to the base path
s3://bucket/table
With partition columns (part_a String, part_b String)
I want to be able to, as a post process, see the partition values and/or paths I've written to.
If I wrote to part_a=a and part_b=b (via SQL), and then to part_a=1 and part_b=2, resulting in the paths below:
s3://bucket/table/part_a=a/part_b=b
s3://bucket/table/part_a=1/part_b=2
How can I obtain the above paths after the run, or at a bare minimum the dynamic partition values? I have used AspectJ to this effect to catch the actual filesystem paths being passed on write, but this is very brittle with respect to the Spark version itself and essentially breaks on every upgrade.

Cassandra get latest entry for each element contained within IN clause

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters they can specify is to get only the most recent row, in which case I append "LIMIT 1" to the end of the CQL statement, since it's ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device IDs to get back the latest entries for. So my question is: is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when there are a lot of parameters for it; under the hood it makes requests to multiple partitions anyway and puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not suggested. If you specify a LIMIT, it applies to the whole statement; you basically can't pick just the first item out of each partition. The simplest option would be to issue multiple queries to the cluster (every element in IN becomes one query) and put a LIMIT 1 on every one of them, as in the sketch below.
To be honest, this was my solution in a lot of projects and it works pretty much fine. The coordinator would go to multiple nodes under the hood anyway, but it would also have to do more work to collect all the results for you, and it might run into timeouts, etc.
In short, it's far better for the cluster and more performant if the client asks multiple times (using multiple coordinators with smaller requests) than to make a single coordinator do all the work.
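A minimal sketch of that per-device approach, using the table from the question (the bind values are placeholders):
-- one query per device id, each limited to the newest row
SELECT * FROM data
WHERE application_id = ? AND partner_id = ? AND location_id = ?
  AND device_id = ?   -- first device id
  AND data_schema = ?
LIMIT 1;

SELECT * FROM data
WHERE application_id = ? AND partner_id = ? AND location_id = ?
  AND device_id = ?   -- second device id, and so on
  AND data_schema = ?
LIMIT 1;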
This is all in case you can't afford more disk space for your cluster.
Usual Cassandra solution
Data in Cassandra should be modeled for the queries you run (query-first design). So basically you would have one additional table with the same partition key as you have now, but without the clustering column activity_timestamp, i.e.
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
The double parentheses (()) are intentional.
Every time you write to your table, you would also write the data to latest_entry (the table without activity_timestamp). Then you can run the query you need with IN against this table; since it holds only the latest entry per partition key, you don't need LIMIT 1 because there is only one row per key. That would be the usual solution in Cassandra.
If you are afraid of the additional writes, don't worry; they are inexpensive and CPU bound. With Cassandra it's always "bring on the writes", I guess :)
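A sketch of that second table and the query it enables (the column types are assumptions, since the full base table definition isn't shown):
-- same partition key as the base table, but no clustering column;
-- keep activity_timestamp (and any payload columns) as regular columns
CREATE TABLE latest_entry (
    application_id text,
    partner_id text,
    location_id text,
    device_id text,
    data_schema text,
    activity_timestamp timestamp,
    PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
);

-- one row per key, so no LIMIT is needed; note that on older Cassandra
-- versions IN is only allowed on the last partition key column
SELECT * FROM latest_entry
WHERE application_id = ? AND partner_id = ? AND location_id = ?
  AND device_id IN ? AND data_schema = ?;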
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
Your table definition is not suitable for such a use of the IN clause. IN is supported only on the last field of the partition key or on the last clustering column. So you can either:
swap the last two fields of your partition key so that device_id comes last (see the sketch after this list)
use one query for each device id
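A sketch of the first option (the table name and column types are assumptions); with device_id moved to the last position of the partition key, the IN restriction becomes legal:
CREATE TABLE data_by_device (
    application_id text,
    partner_id text,
    location_id text,
    data_schema text,
    device_id text,
    activity_timestamp timestamp,
    PRIMARY KEY ((application_id, partner_id, location_id, data_schema, device_id), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);

-- IN on the last partition key column is now allowed;
-- a LIMIT would still apply to the whole statement, not per device
SELECT * FROM data_by_device
WHERE application_id = ? AND partner_id = ? AND location_id = ?
  AND data_schema = ? AND device_id IN ?;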

How to copy data from a Cassandra table to another structure for better performance

In several places it's advised to design our Cassandra tables according to the queries we are going to perform on them. In this article by DataScale they state this:
The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you’ll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.
[...]
If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn’t a handicap against multiple writes.
I have understood this, and now my question is: provided that I have an existing table, say
CREATE TABLE invoices (
    id_invoice int PRIMARY KEY,
    year int,
    id_client int,
    type_invoice text
);
But I want to query by year and type instead, so I'd like to have something like
CREATE TABLE invoices_yr (
    id_invoice int,
    year int,
    id_client int,
    type_invoice text,
    PRIMARY KEY (type_invoice, year)
);
With type_invoice as the partition key and year as the clustering key, what's the preferred way to copy the data from one table to the other so that I can run optimized queries later on?
My Cassandra version:
user#cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]
You can use the cqlsh COPY command.
To copy your invoices data into a CSV file, use:
COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv';
Then copy back from the CSV file into the table, in your case invoices_yr:
COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv';
If you have a huge amount of data, you can use the SSTable writer to write and sstableloader to load the data faster.
http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
To echo what was said about the COPY command, it is a great solution for something like this.
However, I will disagree with what was said about the bulk loader, as it is infinitely harder to use. Specifically, you need to run it on every node (whereas COPY only needs to be run on a single node).
To help COPY scale for larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters.
COPY invoices(id_invoice, year, id_client, type_invoice)
TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.
For more info, check out this article titled: New options and better performance in cqlsh copy.
An alternative to using the COPY command (see other answers for examples) or Spark to migrate data is to create a materialized view to do the denormalization for you.
CREATE MATERIALIZED VIEW invoices_yr AS
    SELECT * FROM invoices
    WHERE type_invoice IS NOT NULL AND year IS NOT NULL AND id_client IS NOT NULL
    PRIMARY KEY ((type_invoice), year, id_client)
    WITH CLUSTERING ORDER BY (year DESC);
Cassandra will then fill the view for you, so you won't have to migrate the data yourself. With 3.5, be aware that repairs don't work well with materialized views (see CASSANDRA-12888).
Note that materialized views are probably not the best idea to use, and they have since been changed to "experimental" status.

Cassandra 1.2: Updating type in primary key CQL3

We currently have a table defined as below
CREATE TABLE tableA (
    id int,
    seqno int,
    data text,
    PRIMARY KEY ((id), seqno)
) WITH CLUSTERING ORDER BY (seqno DESC);
We need to update the type of the id column from int to text. We are wondering which of the two approaches below would be the most advisable.
ALTER TABLE tableA ALTER id TYPE varchar; (the command succeeds but then we have issues reading the data. Is this because the ALTER table doesn't update the underlying storage of the id column?)
COPY to/from oldtable/newtable. This works, but we have issues with the RPC timeout (which we can change); is this a bad idea for a table spread across a cluster?
We have checked the online docs and these are the only two options we can find. Are there other options?
Thanks
Paul
I would say option 1 isn't really supported. If your integers don't map to actual strings, you're going to have problems; you're probably seeing key validation errors.
For option 2, you probably just need to copy smaller chunks of data for each read/write; a rough sketch is below.
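As a hedged sketch of option 2 (the new table name is an assumption), you would create a second table with id as text and copy the data across with cqlsh:
-- a second table with the same layout but id as text
CREATE TABLE tableA_text (
    id text,
    seqno int,
    data text,
    PRIMARY KEY ((id), seqno)
) WITH CLUSTERING ORDER BY (seqno DESC);

-- in cqlsh: export from the old table, then import into the new one
COPY tableA (id, seqno, data) TO 'tableA.csv';
COPY tableA_text (id, seqno, data) FROM 'tableA.csv';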
