Datamodel for Scylla/Cassandra for table partition key is not known beforehand -> static field? - cassandra

I am using ScyllaDb, but I think this also applies to Cassandra since ScyllaDb is compatible with Cassandra.
I have the following table (I got ~5 of this kind of tables):
create table batch_job_conversation (
conversation_id uuid,
primary key (conversation_id)
);
This is used by a batch job to make sure some fields are kept in sync. In the application, a lot of concurrent writes/reads can happen. Once in a while, I will correct the values with a batch job.
A lot of writes can happen to the same row, so it will overwrite the rows. A batch job currently picks up rows with this query:
select * from batch_job_conversation
Then the batch job will read the data at that point and makes sure things are in sync. I think this query is bad because it stresses all the partitions and the node coordinator because it needs to visit ALL partitions.
My question is if it is better for this kind of tables to have a fixed field? Something like this:
create table batch_job_conversation (
always_zero int,
conversation_id uuid,
primary key ((always_zero), conversation_id)
);
And than the query would be this:
select * from batch_job_conversation where always_zero = 0
For each batch job I can use a different partition key. The amount of rows in these tables will be roughly the same size (a few thousand at most). The tables will overwrite the same row probably a lot of times.
Is it better to have a fixed value? Is there another way to handle this? I don't have a logical partition key I can use.

second model would create a LARGE partition and you don't want that, trust me ;-)
(you would do a partition scan on top of large partition, which is worse than original full scan)
(and another advice - keep your partitions small and have a lot of them, then all your cpus will be used rather equally)
first approach is OK - and is called FULL SCAN, BUT
you need to manage it properly
there are several ways, we blogged about it in https://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/
and basically it boils down to divide and conquer
also note spark implements full scans too
hth
L

Related

Cassandra data model too many table

I have a single structured row as input with write rate of 10K per seconds. Each row has 20 columns. Some queries should be answered on these inputs. Because most of the queries needs different WHERE, GROUP BY or ORDER BY, The final data model ended up like this:
primary key for table of query1 : ((column1,column2),column3,column4)
primary key for table of query2 : ((column3,column4),column2,column1)
and so on
I am aware of the limit in number of tables in Cassandra data model (200 is warning and 500 would fail)
Because for every input row I should do an insert in every table, the final write per seconds became big * big data!:
writes per seconds = 10K (input)
* number of tables (queries)
* replication factor
The main question: am I on the right path? Is it normal to have a table for every query even when the input rate is already so high?
Shouldn't I use something like spark or hadoop instead of relying on bare datamodel? Or event Hbase instead of Cassandra?
It could be that Elassandra would resolve your problem.
The query system is quite different from CQL, but the duplication for indexing would automatically be managed by Elassandra on the backend. All the columns of one table will be indexed so the Elasticsearch part of Elassandra can be used with the REST API to query anything you'd like.
In one of my tests, I pushed a huge amount of data to an Elassandra database (8Gb) going non-stop and I never timed out. Also the search engine remained ready pretty much the whole time. More or less what you are talking about. The docs says that it takes 5 to 10 seconds for newly added data to become available in the Elassandra indexes. I guess it will somewhat depend on your installation, but I think that's more than enough speed for most applications.
The use of Elassandra may sound a bit hairy at first, but once in place, it's incredible how fast you can find results. It includes incredible (powerful) WHERE for sure. The GROUP BY is a bit difficult to put in place. The ORDER BY is simple enough, however, when (re-)ordering you lose on speed... Something to keep in mind. On my tests, though, even the ORDER BY equivalents was very fast.

What is better in cassandra many tables of same structure or one table with many rows

Suppose that I have 1000 entities with exactly the same structure. For example all entities have three fields:
String id;
String name;
int amount;
Also I expect that there will be huge amount of every type of entity in the system.
So I have two variants right now:
For each entity create separate table which looks like:
CREATE TABLE <SOME_ENTITY_NAME> (
id text PRIMARY KEY,
name text,
amount int
)
I'll create only one table but with composite priamry key:
CREATE TABLE ALL_ENTITIES_TABLE (
entity_name text,
id text,
name text,
amount int,
PRIMARY KEY ((entity_name, id))
);
Of course, supporting only one table is more simplier, but what is with performance?
So, the question is what variant is better in terms of performance, taking into account that each type of entity will have millions(may be billions) of records?
There is a limitation on the number of the tables that can be created in the Cassandra cluster. Usual recommendation is too keep this number lower than 200, with ~500 is like a "hard stop"...
The reason for this is that every table requires allocation of additional memory, and other resources to keep auxiliary data, like, key/row caches, bloom filters, etc. Depending on the Cassandra version, every table may require 1-2Mb of memory.
So in your case, the 2nd design is better because you keep all data in single table, and your partition key will allow to spread data evenly between nodes of cluster.
In my opinion the first approach is incorrect in terms of maintainability. Too much of dynamically created tables should be tough to maintain. Also, If you use your partitioning/clustering order properly (as per the need of data retrieval) it should be easier and efficient to query. Also if you are using 3.x version of Cassandra, secondary indexes can come in handy.
NOTE: Secondary indexes don't allow sorting.
Cassandra was designed around the fact that disk space is the cheapest resource among all. You must build your data model around the queries that you will be using the most regardless of whether this model would consume more disk space or not - as long as it serves the purpose of your queries in the most efficient way. I wouldn't be able to answer your question without taking a look at the queries you will be using. In general, you must feel free to create as many tables as needed as long as it serves the purpose of your queries. I would recommend having a look here.

Cassandra get latest entry for each element contained within IN clause

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters that they can specify to get the most recent row, and then I appent "LIMIT 1" to the end of the CQL statement since it's ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device id's to get back the latest entries for. So, my question is, is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when there are a lot of parameters for it and under the hood it's making reqs to multiple partitions anyway and it's putting pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not suggested. If you specify limit, it's for the whole statement, basically you can't pick just the first item out from partitions. The simplest option would be to issue multiple queries to the cluster (every element in IN would become one query) and put a limit 1 to every one of them.
To be honest this was my solution in a lot of the projects and it works pretty much fine. Basically coordinator would under the hood go to multiple nodes anyway but would also have to work more for you to get you all the requests, might run into timeouts etc.
In short it's far better for the cluster and more performant if client asks multiple times (using multiple coordinators with smaller requests) than to make single coordinator do to all the work.
This is all in case you can't afford more disk space for your cluster
Usual Cassandra solution
Data in cassandra is suggested to be ready for query (query first). So basically you would have to have one additional table that would have the same partitioning key as you have it now, and you would have to drop the clustering column activity_timestamp. i.e.
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
double (()) is intentional.
Every time you would write to your table you would also write data to the latest_entry (table without activity_timestamp) Then you can specify the query that you need with in and this table contains the latest entry so you don't have to use the limit 1 because there is only one entry per partitioning key ... that would be the usual solution in cassandra.
If you are afraid of the additional writes, don't worry , they are inexpensive and cpu bound. With cassandra it's always "bring on the writes" I guess :)
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
Your table definition is not suitable for such use of the IN clause. Indeed, it is supported on the last field of the primary key or the last field of the clustering key. So you can:
swap your two last fields of the primary key
use one query for each device id

Spark: Continuously reading data from Cassandra

I have gone through Reading from Cassandra using Spark Streaming and through tutorial-1 and tutorial-2 links.
Is it fair to say that Cassandra-Spark integration currently does not provide anything out of the box to continuously get the updates from Cassandra and stream them to other systems like HDFS?
By continuously, I mean getting only those rows in a table which have changed (inserted or updated) since the last fetch by Spark. If there are too many such rows, there should be an option to limit the number of rows and the subsequent spark fetch should begin from where it left off. At-least once guarantee is ok but exactly-once would be a huge welcome.
If its not supported, one way to support it could be to have an auxiliary column updated_time in each cassandra-table that needs to be queried by storm and then use that column for queries. Or an auxiliary table per table that contains ID, timestamp of the rows being changed. Has anyone tried this before?
I don't think Apache Cassandra has this functionality out of the box. Internally [for some period of time] it stores all operations on data in sequential manner, but it's per node and it gets compacted eventually (to save space). Frankly, Cassandra's (as most other DB's) promise is to provide latest view of data (which by itself can be quite tricky in distributed environment), but not full history of how data was changing.
So if you still want to have such info in Cassandra (and process it in Spark), you'll have to do some additional work yourself: design dedicated table(s) (or add synthetic columns), take care of partitioning, save offset to keep track of progress, etc.
Cassandra is ok for time series data, but in your case I would consider just using streaming solution (like Kafka) instead of inventing it.
I agree with what Ralkie stated but wanted to propose one more solution if you're tied to C* with this use case. This solution assumes you have full control over the schema and ingest as well. This is not a streaming solution though it could awkwardly be shoehorned into one.
Have you considered using composite key composed of the timebucket along with a murmur_hash_of_one_or_more_clustering_columns % some_int_designed_limit_row_width? In this way, you could set your timebuckets to 1 minute, 5 minutes, 1 hour, etc depending on how "real-time" you need to analyze/archive your data. The murmur hash based off of one or more of the clustering columns is needed to help located data in the C* cluster (and is a terrible solution if you're often looking up specific clustering columns).
For example, take an IoT use case where sensors report in every minute and have some sensor reading that can be represented as an integer.
create table if not exists iottable {
timebucket bigint,
sensorbucket int,
sensorid varchar,
sensorvalue int,
primary key ((timebucket, sensorbucket), sensorid)
} with caching = 'none'
and compaction = { 'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowedCompaction' };
Note the use of TimeWindowedCompaction. I'm not sure what version of C* you're using; but with the 2.x series, I'd stay away from DateTieredCompaction. I cannot speak to how well it performs in 3.x. Any any rate, you should test and benchmark extensively before settling on your schema and compaction strategy.
Also note that this schema could result in hotspotting as it is vulnerable to sensors that report more often than others. Again, not knowing the use case it's hard to provide a perfect solution -- it's just an example. If you don't care about ever reading C* for a specific sensor (or column), you don't have to use a clustering column at all and you can simply use a timeUUID or something random for the murmur hash bucketing.
Regardless of how you decide to partition the data, a schema like this would then allow you to use repartitionByCassandraReplica and joinWithCassandraTable to extract the data written during a given timebucket.

Cassandra CQL3 order by clustered key efficiency (with limit clause?)

I have the following table (using CQL3):
create table test (
shard text,
tuuid timeuuid,
some_data text,
status text,
primary key (shard, tuuid, some_data, status)
);
I would like to get rows ordered by tuuid. But this is only possible when I restrict shard - I get this is due to performance.
I have shard purely for sharding, and I can potentially restrict its range of values to some small range [0-16) say. Then, I could run a query like this:
select * from test where shard in (0,...,15) order by tuuid limit L;
I may have millions of rows in the table, so I would like to understand the performance characteristics of such a order by query. It would seem like the performance could be pretty bad in general, BUT with a limit clause of some reasonable number (order of 10K), this may not be so bad - i.e. a 16 way merge but with a fairly low limit.
Any tips, advice or pointers into the code on where to look would be appreciated.
Your data is sorted according to your column key. So the performance issue in your merge in your query above does not happen due to the WHERE clause but because of your LIMIT clause, afaik.
Your columns are inserted IN ORDER according to tuuid so there is no performance issue there.
If you are fetching too many rows at once, I recommended creating a test_meta table where you store the latest timeuuid every X-inserts, to get an upper bound on the rows your query will fetch. Then, you can change your query to:
select * from test where shard in (0,...,15) and tuuid > x and tuuid < y;
In short: make use of your column keys and get rid of the limit. Alternatively, in Cassandra 2.0, there will be pagination which will help here, too.
Another issue I stumbled over, you say that
I may have millions of rows in the table
But according to your data model, you will have exactly shard number of rows. This is your row key and - together with the partitioner - will determine the distribution/sharding of your data.
hope that helps!
UPDATE
From my personal experience, cassandra performances quite well during heavy reads as well as writes. If the result sets became too large, I rather experienced memory issues on the receiving/client side rather then timeouts on the server side. Still, to prevent either, I recommend having a look a the upcoming (2.0) pagination feature.
In the meanwhile:
Try to investigate using the trace functionality in 1.2.
If you are mostly reading the "latest" data, try adding a reversed type.
For general optimizations like caches etc, first, read how cassandra handles reads on a node and then, see this tuning guide.

Resources