equivalent of batch statement for SELECT in cassandra

I am dealing with a weird issue and I am not sure if my design is correct. I've got a table that looks like this:
CREATE TABLE Content (
group_id bigint,
content_id bigint,
metadata blob,
group_payload blob static,
PRIMARY KEY (group_id, content_id)
);
As you can see, group_payload is static. When I need to fetch all the data for a given group_id, I used to do so like this:
SELECT * FROM Content WHERE group_id = X;
However, this fetches group_payload once per row, which causes both performance and memory issues since it is a fairly big blob.
As a result I've split the query in two, as follows:
SELECT group_payload FROM Content WHERE group_id = X LIMIT 1;
SELECT metadata FROM Content WHERE group_id = X;
This worked wonders as a performance improvement, but it suffers from the occasional race condition, i.e. I get the group_payload, but by the time I get the metadata the group_payload is out of date.
Is there a way to "batch" the two SELECT queries? Should I capture the inconsistency and retry (the data allows detecting this), or is there a better way to do this altogether?
Thanks

Short answer - no - there is no such thing as a batch for SELECT, because Cassandra provides no snapshot isolation when reading data.
In your situation I would think about the logic of the data processing - maybe it's OK to get all the metadata first, and only then get the group payload?
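If the processing order can be flipped as suggested, the reads would simply become the same two queries, reversed:
SELECT metadata FROM Content WHERE group_id = X;
SELECT group_payload FROM Content WHERE group_id = X LIMIT 1;
-- the payload can now only be newer than the metadata it was read with,
-- which may be easier for the application to tolerate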

You won't be able to batch reads. If reading metadata is time consuming, then read the payload after you read the metadata. If you want to check whether the payload was updated after you fetched the metadata, Cassandra lets you read the write time of a particular column, but that doesn't look like a good solution.
What's interesting is that your use case needs a single payload per group plus all the metadata for the group. Also, since you are reading a single partition for the metadata, it should take a few milliseconds. If your payload is updated that frequently, then you can think of serving outdated but consistent (based on the write times of metadata and payload) results.
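For reference, the write-time check mentioned above can be done with CQL's WRITETIME() function; a minimal sketch (the retry logic itself would live in the application):
SELECT metadata, WRITETIME(metadata) FROM Content WHERE group_id = X;
SELECT group_payload, WRITETIME(group_payload) FROM Content WHERE group_id = X LIMIT 1;
-- if WRITETIME(group_payload) is newer than the metadata write times you saw,
-- re-read the metadata (or give up after N attempts)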

Related

Data model for Scylla/Cassandra when the table partition key is not known beforehand -> static field?

I am using ScyllaDb, but I think this also applies to Cassandra since ScyllaDb is compatible with Cassandra.
I have the following table (I have ~5 tables of this kind):
create table batch_job_conversation (
conversation_id uuid,
primary key (conversation_id)
);
This is used by a batch job to make sure some fields are kept in sync. In the application, a lot of concurrent writes/reads can happen. Once in a while, I will correct the values with a batch job.
A lot of writes can happen to the same row, overwriting it each time. A batch job currently picks up rows with this query:
select * from batch_job_conversation
The batch job then reads the data at that point and makes sure things are in sync. I think this query is bad because it stresses the coordinator node: it needs to visit ALL partitions.
My question is if it is better for this kind of tables to have a fixed field? Something like this:
create table batch_job_conversation (
always_zero int,
conversation_id uuid,
primary key ((always_zero), conversation_id)
);
And then the query would be this:
select * from batch_job_conversation where always_zero = 0
For each batch job I can use a different partition key. The number of rows in these tables will be roughly the same (a few thousand at most), and the same rows will probably be overwritten many times.
Is it better to have a fixed value? Is there another way to handle this? I don't have a logical partition key I can use.
The second model would create a LARGE partition and you don't want that, trust me ;-)
(you would do a partition scan on top of a large partition, which is worse than the original full scan)
(and another piece of advice - keep your partitions small and have a lot of them, then all your CPUs will be used rather equally)
The first approach is OK - and is called a FULL SCAN - BUT
you need to manage it properly.
There are several ways; we blogged about it in https://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/
and basically it boils down to divide and conquer (see the sketch below)
also note Spark implements full scans too
hth
L
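For reference, the divide-and-conquer approach boils down to splitting the token ring into sub-ranges and scanning each one separately, possibly in parallel; a minimal sketch with placeholder range bounds:
select * from batch_job_conversation
where token(conversation_id) >= -9223372036854775808
and token(conversation_id) < -4611686018427387904;
-- repeat with the next token sub-range until the whole ring is covered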

Prevent race condition while writing to Cassandra

I have a realtime streaming solution with Kafka, Spark (as the aggregation engine) and Cassandra (as the store). User defines the aggregates that are needed and the engine creates the aggregate and writes them to the store. Here is an example of how the aggregates are created
CREATE AGGR COUNT FROM input_data WHERE type,event,id
This creates a count aggregate for the 3 columns and writes to C*.
We have a requirement to process historical data as well. That means if an aggregate was created today, we need to go back and fix history for it. To cater to this use case, we have created a hvalue column in Cassandra. Here is the schema for reference
CREATE TABLE tbl (
key blob,
key2 blob,
key3 blob,
...
key15 blob,
column1 blob,
column2 blob,
...
column20 blob,
hvalue blob,
value blob,
PRIMARY KEY ((key, key2, key3 ... key15), column1 ... column20)
) WITH CLUSTERING ORDER BY (column1 ASC,column2 ASC .. column20 ASC)
value stores the facts that are computed during online processing; hvalue stores the value for historical processing. While querying, both columns are retrieved, merged and returned to the user.
We are using the DataStax leftJoinWithCassandraTable API to join with Cassandra.
RDD.leftJoinWithCassandraTable(keyspace, tableName)
.on(SomeColumns(...))
.map { case (ip, row) => row match {
// no matching Cassandra row: keep the incoming record as-is
case None => ip
// matching row found: merge it with the incoming record
case Some(data) => CASSANDRA_MAP_SCHEMA(...)
}
}.saveToCassandra(keyspace, tableName)
In short, we create a schema for the RDD, and write the row to Cassandra.
Now, here is the problem. During the historical process, we need to create a row to write to Cassandra. This means that we need to provide some data to the "value" column. If it is a new row that is not present in Cassandra, we create a null object and write back. If the row is present, we take the existing value and write it back.
The online and historical processes will run at the same time. This means that between the historical process reading a row and writing it back, the online process may have created the same row. This will result in corrupt data, since the historical process may read stale data and overwrite the value that was written by the online process.
I am not sure how to resolve this problem, and I'd appreciate any other solution that prevents it.
I tried to explain this as best I can; let me know if further clarification is needed and I'll try to add more input.
Thanks in advance for the help.
There are a few ways to work around this, but none are really simple. Fundamentally, write-after-write problems are hard.
The first is to introduce a shared external locking mechanism where you obtain a lock for the row and either release it when done or let it expire via a short TTL. You can use something like Redis for this.
A second option is to funnel all changes to Cassandra through a Kafka queue so that only one source is allowed to write. There is a chance, though, that this will make your problem worse. If you go this route, make sure you partition your queue by key so that the same key always routes to the same partition.
A third option is that the services are only allowed to operate on data for a given time range. If your online process only works on data from the last day (or X hours, etc.) and your historical process only works on data older than that, there is virtually no chance of running into conflicts.
The fourth option is to accept the possibility and judge that it is small enough not to be an issue. If the datacenter where your code runs is very close to (ideally colocated with) your db, and you aren't doing significant processing on the row between read and write, this may be a reasonable option.

Cassandra data modeling for real time data

I currently have an application that persists event driven real time streaming data to a column family which is modeled as such:
CREATE TABLE current_data (
account_id text,
value text,
PRIMARY KEY (account_id)
)
Data is being sent every X seconds per accountId, so we overwrite an existing row every time we receive an event. This data contains current real time information, and we only care about the most recent event (no use for older data, that is why we insert over an already existing key).
On the application side, we query with a SELECT by account_id.
I was wondering if there is a better way to model this behaviour and was looking at Cassandra's best practices and similar questions asked (How to model Cassandra DB for Time Series, server metrics).
I thought about something like this:
CREATE TABLE current_data_2 (
account_id text,
time timeuuid,
value text,
PRIMARY KEY (account_id, time)
) WITH CLUSTERING ORDER BY (time DESC)
No overwrites will occur, and each insertion will also be done with a TTL (can be a TTL of a few minutes).
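For illustration, an insert under the second model might look like this (the 300-second TTL is an arbitrary placeholder):
INSERT INTO current_data_2 (account_id, time, value)
VALUES ('account-1', now(), 'some value') USING TTL 300;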
The question is HOW MUCH better, if at all, the second data model is compared to the first one. From what I understand, the main advantage will be in the READS - since the data is ordered by time, all I need to do is a simple
SELECT * FROM metrics WHERE account_id = <id> LIMIT 1
while in the first data model Cassandra actually reads ALL the values that were written to the same key and then chooses the latest one by its write timestamp (please correct me if I'm wrong).
Thanks.
First of all, I encourage you to examine the official documentation about the read path.
"data is ordered by time"
This is only true in your second case, and only within each SSTable and MemTable that Cassandra reads (check the flow diagram).
"Cassandra actually reads ALL the values that were written to the same key and then chooses the latest one by its write timestamp"
This happens at the Merge Cells by Timestamp step in the documentation (again, check the flow diagram). Notice that in your first case, the number of rows in each SSTable will be one.
In both of your cases the main driving factor is how many SSTables have to be checked during the read. This is somewhat independent of how many records each SSTable contains.
But in the second case you will have much bigger SSTables, which leads to longer SSTable compaction. Also, TTL expiration performs additional writes. So the first model is somewhat preferable.

Cassandra data modeling - Do I choose hotspots to make the query easier?

Is it ever okay to build a data model that makes the fetch query easier even though it will likely create hotspots within the cluster?
While reading, please keep in mind I am not working with Solr right now and given the frequency this data will be accessed I didn’t think using spark-sql would be appropriate. I would like to keep this as pure Cassandra.
We have transactions, which are modeled using a UUID as the partition key so that the data is evenly distributed around the cluster. One of our access patterns requires that a UI get all records for a given user and date range, query like so:
select * from transactions_by_user_and_day where user_id = ? and created_date_time > ?;
The first model I built uses the user_id and created_date (day the transaction was created, always set to midnight) as the primary key:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date timestamp,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, created_date), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
This table seems to perform well. Using the created_date as part of the PK allows users to be spread around the cluster more evenly to prevent hotspots. However, from an access perspective it makes the data access layer do a bit more work than we would like. It ends up having to build an IN statement with all days in the provided range instead of supplying a date and a greater-than operator:
select * from transactions_by_user_and_day where user_id = ? and created_date in (?, ?, …) and created_date_time > ?;
To simplify the work to be done at the data access layer, I have considered modeling the data like so:
CREATE TABLE transactions_by_user_and_day (
user_id int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);
With the above model, the data access layer can fetch the transaction_ids for the user and filter on a specific date range within Cassandra. However, this creates a chance of hotspots within the cluster. Users with longevity and/or high volume will create quite a few more columns in the row. We intend to put a TTL on the data so anything older than 60 days drops off. Additionally, I've analyzed the size of the data and 60 days' worth of data for our most high-volume user is under 2 MB. Doing the math, if we assume that all 40,000 users (this number won't grow significantly) are spread evenly over a 3-node cluster at 2 MB of data per user, you end up with a max of just over 26 GB per node ((13333.33 * 2) / 1024). In reality, you aren't going to end up with 1/3 of your users doing that much volume, and you'd have to get really unlucky for Cassandra, using vnodes, to put all of those users on a single node. From a resources perspective, I don't think 26 GB is going to make or break anything either.
Thanks for your thoughts.
Data Model 1: Something else you could do would be to change your data access layer to query for each ID individually, instead of using the IN clause (see the sketch below). Check out this page to understand why that would be better.
https://lostechies.com/ryansvihla/2014/09/22/cassandra-query-patterns-not-using-the-in-query-for-multiple-partitions/
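In practice that means issuing one bounded query per day in the requested range (ideally in parallel through the driver's async API) instead of a single IN query; a sketch:
select * from transactions_by_user_and_day
where user_id = ? and created_date = ? and created_date_time > ?;
-- executed once per day in the range, binding each day's midnight timestamp to created_date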
Data Model 2: 26 GB of data per node doesn't seem like much, but a 2 MB fetch seems a bit large. Of course if this is an outlier, then I don't see a problem with it. You might try setting up a cassandra-stress job to test the model. As long as the majority of your partitions are smaller than 2 MB, that should be fine.
One other solution would be to use Data Model 2 with bucketing (sketched below). This would add overhead on writes, as you'd have to maintain a bucket lookup table as well. Let me know if you need me to elaborate more on this approach.
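For what it's worth, a hypothetical bucketed variant of Data Model 2 might look like this (the bucket computation and any lookup table are left to the application):
CREATE TABLE transactions_by_user_and_bucket (
user_id int,
bucket int,
created_date_time timestamp,
transaction_id uuid,
PRIMARY KEY ((user_id, bucket), created_date_time)
) WITH CLUSTERING ORDER BY (created_date_time DESC);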

Cassandra CQL3 order by clustered key efficiency (with limit clause?)

I have the following table (using CQL3):
create table test (
shard text,
tuuid timeuuid,
some_data text,
status text,
primary key (shard, tuuid, some_data, status)
);
I would like to get rows ordered by tuuid, but this is only possible when I restrict shard - I understand this is due to performance.
I have shard purely for sharding, and I can potentially restrict its range of values to some small range, say [0, 16). Then I could run a query like this:
select * from test where shard in (0,...,15) order by tuuid limit L;
I may have millions of rows in the table, so I would like to understand the performance characteristics of such an ORDER BY query. It would seem like the performance could be pretty bad in general, BUT with a LIMIT clause of some reasonable number (on the order of 10K) this may not be so bad - i.e. a 16-way merge but with a fairly low limit.
Any tips, advice or pointers into the code on where to look would be appreciated.
Your data is sorted according to your column key, so the performance cost of the merge in your query above comes not from the WHERE clause but from the LIMIT clause, afaik.
Your columns are inserted IN ORDER according to tuuid so there is no performance issue there.
If you are fetching too many rows at once, I recommend creating a test_meta table where you store the latest timeuuid every X inserts, to get bounds on the rows your query will fetch. Then you can change your query to:
select * from test where shard in (0,...,15) and tuuid > x and tuuid < y;
In short: make use of your column keys and get rid of the limit. Alternatively, in Cassandra 2.0 there will be pagination, which will help here too.
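A minimal sketch of such a test_meta table (the name and checkpointing scheme are hypothetical):
create table test_meta (
shard text,
checkpoint_no int,
latest_tuuid timeuuid,
primary key (shard, checkpoint_no)
);
-- every X inserts into test, record the current tuuid here;
-- reads can then use stored tuuids as the x and y bounds in the query above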
Another issue I stumbled over: you say that
"I may have millions of rows in the table"
but according to your data model, you will have exactly as many rows as you have distinct shard values. shard is your row key and - together with the partitioner - will determine the distribution/sharding of your data.
hope that helps!
UPDATE
From my personal experience, Cassandra performs quite well during heavy reads as well as writes. When result sets became too large, I experienced memory issues on the receiving/client side rather than timeouts on the server side. Still, to prevent either, I recommend having a look at the upcoming (2.0) pagination feature.
In the meanwhile:
Try to investigate using the trace functionality in 1.2.
If you are mostly reading the "latest" data, try adding a reversed type.
For general optimizations like caches etc., first read how Cassandra handles reads on a node and then see this tuning guide.
