Cassandra compound key performance

I am using Cassandra for saving logs, and on the client side I want to show, for example, all logs for a given day.
Of course, for one day there can be thousands of log records, so I need to use paging.
I saw that paging is not exactly "native" in Cassandra, and we need to use some "tricks", like saving the last retrieved record and looking for more records after that record.
My idea is to use a uuid and date for the primary key, and then order the column family by date, so I can pass a uuid and date and Cassandra should give me the records after that record, and so on.
Does anyone know if this is a good idea in terms of performance? Is it good to have uuid and date as a compound key? Or maybe there is a better solution for this?
Thank you!

As far as I can tell, your choice of primary key based on an id and date should help to retrieve all the logs for one day. What you probably need to validate is that:
each log entry is not a huge value
you won't have more than 2 billion log entries per day (in that case you'd need to change the primary key to use a sub-day interval)
As regards pagination, if you are using Cassandra 2.0 this should work (there were some corner-case issues with automatic paging until, IIRC, 2.0.9). The blog post Improvements on the driver side with Cassandra 2.0 should give you an idea of how pagination worked in Cassandra 1.2 and how it improved in 2.0.
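To make the "remember the last record" trick concrete, here is a minimal CQL sketch of the approach: a text day bucket as the partition key and a timeuuid clustering column. The names logs, day, log_id, and message are illustrative, not from the question.

CREATE TABLE logs (
    day text,            -- one partition per day
    log_id timeuuid,     -- time-ordered, also serves as the unique id
    message text,
    PRIMARY KEY (day, log_id)
) WITH CLUSTERING ORDER BY (log_id ASC);

-- first page
SELECT * FROM logs WHERE day = '2014-06-01' LIMIT 100;

-- next page: bind the last log_id seen on the previous page
SELECT * FROM logs WHERE day = '2014-06-01' AND log_id > ? LIMIT 100;

Because log_id is a clustering column, the "greater than the last seen id" query is an efficient slice within the day's partition rather than a scan.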

Related

Best way of querying table without providing the primary key

I am designing the data model of our Scylla database. For example, I created a table, intraday_history, with these fields:
CREATE TABLE intraday_history (
    id bigint,
    timestamp_seconds bigint,
    timestamp timestamp,
    sec_code text,
    open float,
    high float,
    low float,
    close float,
    volume float,
    trade int,
    PRIMARY KEY ((id, sec_code), timestamp_seconds, timestamp)
);
My id is a Twitter Snowflake-generated 64-bit integer. My problem is how I can use WHERE without always providing the id (most of the time I will use the bigint timestamp). I also run into this problem in other tables. Because the id is unique, I cannot query a batch of timestamps.
Would it be okay if, say, for a bunch of tables on my single node, I used an ID like cluster1, so that when I query by id I can just say id = cluster1? But that loses the uniqueness property.
ALLOW FILTERING comes up as an option here, but I keep reading that it is bad practice, especially when dealing with millions of queries.
I'm using ScyllaDB, a C++ implementation compatible with Apache Cassandra.
In Cassandra, as you've probably already read, the queries drive the table design, not the other way around. So a situation where you want to query by a different filter would ideally mean creating another Cassandra table; that's the optimal way. Partition keys are required in filters unless you provide the ALLOW FILTERING "switch", but that isn't recommended, as it performs a DC-wide (possibly cluster-wide) scan and you're still subject to timeouts.
You could consider using indexes or materialized views, which are basically Cassandra-maintained tables populated by the base table's changes. That would save you the trouble of having the application populate multiple tables (Cassandra would do it for you). We've had some luck with materialized views, but with either of these components there can be side effects like with any other Cassandra table (inconsistencies, latencies, additional rules, etc.).
I would say do a bit of research to determine the best approach, but most likely ALLOW FILTERING isn't the best choice (especially for high-volume, frequent queries, or tables containing large amounts of data). You could also investigate Solr if that's an option, depending on what you're filtering.
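For illustration, a hedged sketch of a materialized view on the intraday_history table from the question, so it can be queried by time without knowing the id. The view name is invented, and note that materialized views need Cassandra 3.0+ and are flagged experimental in some Scylla releases, so verify support in your version:

CREATE MATERIALIZED VIEW intraday_by_time AS
    SELECT * FROM intraday_history
    WHERE timestamp_seconds IS NOT NULL AND id IS NOT NULL
      AND sec_code IS NOT NULL AND timestamp IS NOT NULL
    -- every base primary key column must appear in the view's key
    PRIMARY KEY (timestamp_seconds, id, sec_code, timestamp);

-- now this works without the id:
SELECT * FROM intraday_by_time WHERE timestamp_seconds = 1467331200;

Keep in mind this puts every row for a given second into one partition, so pick the view's partition key to match your real query granularity.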
Hope that helps.
-Jim

Spark: Continuously reading data from Cassandra

I have gone through Reading from Cassandra using Spark Streaming and through tutorial-1 and tutorial-2 links.
Is it fair to say that Cassandra-Spark integration currently does not provide anything out of the box to continuously get the updates from Cassandra and stream them to other systems like HDFS?
By continuously, I mean getting only those rows in a table which have changed (inserted or updated) since the last fetch by Spark. If there are too many such rows, there should be an option to limit the number of rows, and the subsequent Spark fetch should begin from where it left off. At-least-once guarantees are ok, but exactly-once would be hugely welcome.
If it's not supported, one way to support it could be to have an auxiliary column updated_time in each Cassandra table that needs to be queried by Spark, and then use that column for queries. Or an auxiliary table per table that contains the ID and timestamp of the rows being changed. Has anyone tried this before?
I don't think Apache Cassandra has this functionality out of the box. Internally (for some period of time) it stores all operations on data sequentially, but that log is per node and it eventually gets compacted away (to save space). Frankly, Cassandra's promise, like that of most other databases, is to provide the latest view of the data (which by itself can be quite tricky in a distributed environment), not the full history of how the data changed.
So if you still want to have such info in Cassandra (and process it in Spark), you'll have to do some additional work yourself: design a dedicated table (or add synthetic columns), take care of partitioning, save an offset to keep track of progress, etc.
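To sketch what that additional work might look like (a hedged example; the change_log table and its columns are invented for illustration, not an established pattern):

CREATE TABLE change_log (
    day text,             -- one partition per day
    updated_at timeuuid,  -- clustered by write time
    row_id bigint,        -- key of the row that changed
    PRIMARY KEY (day, updated_at)
);

-- bounded fetch of changes after the last processed offset
-- (? is the timeuuid the previous Spark run stopped at)
SELECT row_id FROM change_log
WHERE day = '2016-07-01' AND updated_at > ?
LIMIT 1000;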
Cassandra is fine for time-series data, but in your case I would consider just using a streaming solution (like Kafka) instead of reinventing one.
I agree with what Ralkie stated, but wanted to propose one more solution if you're tied to C* for this use case. This solution assumes you have full control over the schema and ingest as well. It is not a streaming solution, though it could awkwardly be shoehorned into one.
Have you considered using a composite key composed of a time bucket along with murmur_hash_of_one_or_more_clustering_columns % some_int_designed_limit_row_width? This way you could set your time buckets to 1 minute, 5 minutes, 1 hour, etc., depending on how "real-time" you need to analyze/archive your data. The murmur hash based on one or more of the clustering columns is needed to spread the data across the C* cluster (and is a terrible solution if you're often looking up specific clustering columns).
For example, take an IoT use case where sensors report in every minute and have some sensor reading that can be represented as an integer.
create table if not exists iottable (
    timebucket bigint,
    sensorbucket int,
    sensorid varchar,
    sensorvalue int,
    primary key ((timebucket, sensorbucket), sensorid)
) with caching = 'none'
and compaction = { 'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowedCompaction' };
Note the use of TimeWindowedCompaction. I'm not sure what version of C* you're using, but with the 2.x series I'd stay away from DateTieredCompaction, and I cannot speak to how well it performs in 3.x. At any rate, you should test and benchmark extensively before settling on your schema and compaction strategy.
Also note that this schema could result in hotspotting, as it is vulnerable to sensors that report more often than others. Again, not knowing the use case, it's hard to provide a perfect solution; it's just an example. If you don't care about ever reading C* for a specific sensor (or column), you don't have to use a clustering column at all, and you can simply use a timeuuid or something random for the murmur-hash bucketing.
Regardless of how you decide to partition the data, a schema like this would then allow you to use repartitionByCassandraReplica and joinWithCassandraTable to extract the data written during a given timebucket.
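To sketch the read side in CQL (assuming, purely for illustration, that some_int_designed_limit_row_width is 10, so sensorbucket ranges over 0..9):

-- pull back everything written during one time bucket,
-- fanning out across all ten hash buckets
SELECT sensorid, sensorvalue
FROM iottable
WHERE timebucket = 1467331200
  AND sensorbucket IN (0, 1, 2, 3, 4, 5, 6, 7, 8, 9);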

Filtering tags from Cassandra 3.x

We have an activity metrics page where users can select a date period and see other users' aggregated activity (by action), optionally filtered by 4 or 5 fields. Actions happen sequentially, but one of the fields is Tags, and a user may change an old action's tags at any time. The data is in Cassandra 3.7 with the partition key being (company_id, action_year, action_week). For each week we have about 70k actions (there are 20 columns with long or int data for each action; each row has the partition key plus action_date and action_key as the clustering key).
PRIMARY KEY ((company_id, action_year, action_week), action_date, action_key)
) WITH CLUSTERING ORDER BY (action_date ASC, action_key ASC)
In a first version we are querying the full actions for a period and doing all the aggregation and filtering in memory. When the user selects a couple of weeks, the whole request takes 10 or 15 seconds, and we expect to scale to thousands of users requesting these analytics, which should work as near-real-time analytics.
We thought of moving the filtering into C* using ALLOW FILTERING, but the WHERE clause seems very limited. We are also worried about the frequent updates to the tags.
What's the right way of using C* here?
Since you are on Cassandra 3.7 you should consider using SASI. It is a newer feature allowing more powerful secondary indexes in Cassandra. Documentation is available here.
But as this is a new feature, make sure you fully test it for your use case, and really dig into the operational behavior of the index to make sure it fits well.
Another good article to start with is this preview.
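As a hedged illustration of what SASI looks like (the table name actions and the column tag are assumptions, since the question doesn't show the full schema):

CREATE CUSTOM INDEX actions_tag_idx ON actions (tag)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'CONTAINS' };

-- CONTAINS mode allows substring matching with LIKE
SELECT * FROM actions WHERE tag LIKE '%urgent%';

Unlike a regular secondary index, SASI in CONTAINS mode supports LIKE predicates, which is why it is a better fit for tag filtering than ALLOW FILTERING over the raw table.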

cassandra data purging for time series data based on timestamp column

I am storing time-series data in Cassandra on a daily basis, and we would like to archive/purge data older than 2 days, also daily. We are using the Hector API to store the data. Can someone suggest an approach for deleting Cassandra data older than 2 days on a daily basis? Using a TTL on each Cassandra row is not feasible, as the number of days after which to delete data is configurable. Right now there is no timestamp column in the table; we are planning to add one. But the problem is that a timestamp alone cannot be used in a WHERE clause, as the new column is not part of the primary key.
Please provide your suggestion.
TTL is the right answer; there is an internal timestamp attached to every mutation that is used for this, so you don't need to add one. Manual purging is almost never a good idea. You may need to work on your data model a bit; check the DataStax Academy examples for time series.
Also, Thrift has been frozen for two years and is now officially deprecated (removal in 4.0). Hector and other Thrift clients are not really maintained anymore (see here). Using CQL and the Java driver will give better results, with more resources available to learn from as well.
I don't see what is stopping you from using the TTL approach. A TTL can be applied not only when defining the schema, but also when saving data into the table with the DataStax Cassandra driver. So in practice you can have a separate TTL for each row, configured by your Java code. Also, as Chris already mentioned, TTL relies on the internal timestamp for this.
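For example (a minimal sketch with made-up table and values), the TTL can be supplied per write, or set as a table-wide default:

-- per-row TTL at write time; 2 days = 172800 seconds, and the
-- number can come straight from your application's configuration
INSERT INTO sensor_data (sensor_id, ts, value)
VALUES ('s1', '2016-01-01 00:00:00', 42)
USING TTL 172800;

-- or a table default that individual writes can override
ALTER TABLE sensor_data WITH default_time_to_live = 172800;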
Strictly based on what you describe, I think the only solution is to add that timestamp column and put a secondary index on it.
However, this is a strong indicator that your data model is far from adapted to the situation.
Emphasizing my initial comment:
Is your model adapted/designed for something else? Because this doesn't look like time-series data in Cassandra: a timestamp-like column should be part of the clustering key.

Group Cassandra entries by nearby timestamp

I have this Cassandra table:
CREATE TABLE events(
userId uuid,
datetime timestamp,
id uuid,
event text,
PRIMARY KEY (userId, datetime, id)
);
What I want to do is group events that happened around the same time for a specific user. So, for example, if events for one user happen at:
9:00:11 AM
9:00:13 AM
9:00:16 AM
9:03:55 AM
9:03:58 AM
9:04:03 AM
9:15:35 AM
9:15:38 AM
I would want to get 3 groups:
1: 9:00:11 AM to 9:00:16 AM
2: 9:03:55 AM to 9:04:03 AM
3: 9:15:35 AM to 9:15:38 AM
I hope a machine learning algorithm such as DBSCAN can figure out how the clustering should be done, but grouping events with an interval of less than a minute between them would probably be enough.
Bonus points if I can get a confidence interval on the start and end time of each group.
I've looked into using basic CQL like GROUP BY, Apache Spark's groupByKey, and MLlib clustering without any success. Ideally, results would be processed in near real time with Apache Spark Streaming.
This is a greenfield project, so Cassandra and Spark are not a must. I've also considered using Storm.
It seems you are talking about session windows. Right now, Google Dataflow is the only system I am aware of that supports these out of the box. If you use Storm, you would need to hand-code the sessionizing logic.
In any case, if you are using a streaming system, you first need to sort your data on timestamps and stream it through the system in ascending timestamp order.
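On the Cassandra side, the table from the question already supports this: datetime is the first clustering column, so a per-user read comes back in timestamp order (the uuid below is just a placeholder):

-- rows arrive sorted by the datetime clustering column (ASC by default)
SELECT datetime, id, event
FROM events
WHERE userId = 123e4567-e89b-12d3-a456-426655440000;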
Apache Flink might give you some more support than Storm for coding this, but it would be a manual effort too, even though Flink is closer to Google Dataflow than Storm is (Flink might also add session windows in the near future).
Btw: the groupBy / keyBy statements you mentioned would be appropriate for partitioning the data by user id, but not for building windows.
