Group Cassandra entries by nearby timestamp - apache-spark

I have this Cassandra table:
CREATE TABLE events (
  userId uuid,
  datetime timestamp,
  id uuid,
  event text,
  PRIMARY KEY (userId, datetime, id)
);
What I want to do is group events that happened around the same time for a specific user. So, for example, if events for one user happen at:
9:00:11 AM
9:00:13 AM
9:00:16 AM
9:03:55 AM
9:03:58 AM
9:04:03 AM
9:15:35 AM
9:15:38 AM
I would want to get 3 groups:
1: 9:00:11 AM to 9:00:16 AM
2: 9:03:55 AM to 9:04:03 AM
3: 9:15:35 AM to 9:15:38 AM
I hope a machine learning algorithm such as DBSCAN can figure out how the clustering should be done, but grouping events that have an interval of less than a minute between them would probably be enough.
Bonus points if I can get a confidence interval on the start and end time of each group.
I've looked into basic CQL (GROUP BY), Apache Spark's groupByKey, and MLlib clustering, without any success. Ideally, results would be processed in near real-time with Apache Spark Streaming.
This is a greenfield project, so Cassandra and Spark are not a must. I've also considered using Storm.

It seems you are talking about session windows. Right now, Google Dataflow is the only system I am aware of that gives you built-in support for this. If you use Storm, you would need to hand-code the sessioning logic.
In any case, if you are using a streaming system, you first need to sort your data by timestamp and stream it through the system in ascending timestamp order.
Apache Flink might give you some more support than Storm for coding this, but it would be a manual effort, too, even though Flink is closer to Google Dataflow than Storm is (Flink might also add session windows in the near future).
Btw: the groupBy / keyBy statements you mentioned would be appropriate to partition the data by user-id, but not for building windows.
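For illustration, here is a minimal sketch (in Scala, my choice, not from the original question) of the kind of hand-coded sessioning logic meant above. It assumes the per-user timestamps are already sorted ascending and uses the one-minute gap heuristic from the question:
// Gap-based sessioning sketch: timestamps are epoch millis, pre-sorted ascending.
def sessionize(timestamps: Seq[Long], gapMillis: Long = 60000L): Seq[Seq[Long]] =
  timestamps.foldLeft(List.empty[List[Long]]) {
    case (Nil, t) => List(List(t))                  // first event opens the first session
    case ((current @ (last :: _)) :: done, t) if t - last <= gapMillis =>
      (t :: current) :: done                        // within the gap: extend the current session
    case (sessions, t) => List(t) :: sessions       // gap exceeded: open a new session
  }.map(_.reverse).reverse
The first and last element of each resulting group give the start and end time of a session. In Spark you would apply something like this per user after a keyBy(userId); a true streaming version additionally has to deal with sessions that straddle batch boundaries.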

Related

Selecting records in Cassandra based on Time range in frequent intervals

I have a table in Cassandra where I am storing events as they come in; different processing is done on the events at different stages. Events are entered into the table with their occurrence time. I need to get all events whose event time is less than a certain time and do some processing on them. Since it is a range query, it will invariably use scatter-gather. Can someone suggest the best way to do this? The process runs every 5 seconds, and frequent scatter-gather in Cassandra is not a good idea, as it puts overhead on Cassandra itself and will degrade my overall application performance.
The table is as below:
PAS_REQ_STAGE (partition key = (EndpointID, category); clustering key = (Automation_flag, AlertID))
Columns:
  AlertID
  BatchPickTime: timestamp
  Automation_Threshold
  ResourceID
  ConditionID
  category
  Automation_time: timestamp
  Automation_flag
  FilterValue
The event time I referred to above is BatchPickTime.
A scheduler wakes up at a regular interval, gets all records whose BatchPickTime is less than the current wake-up time, and sweeps them off the table to process them.
Because of this use case I cannot provide any specific partition key for the query, since it has to fetch all data that has expired, i.e. whose time is less than the current scheduler wake-up time.
Hi and welcome to Stack Overflow.
Please post your schema and maybe some example code with your question - you can edit it :)
The Cassandra way of doing this is to denormalize data if necessary and build your schema around your queries. In your case I would suggest putting your events into a table together with a time bucket:
CREATE TABLE events (
  event_source int,
  bucket timestamp,
  event_time timestamp,
  event_text text,
  PRIMARY KEY ((event_source, bucket), event_time)
);
The reason for this is that it is very efficient in Cassandra to select a row by its so-called partition key (in this example (event_source, bucket)), as such a query hits only one node. The remainder of the primary key consists of the clustering columns, which define the order of the data; here, all events inside a bucket (say, one day) are sorted by event_time.
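As a concrete (and hedged) illustration, a single-partition read against this table from your scheduler could look roughly like this in Scala with the DataStax Java driver; the contact point, keyspace "ks", and the literal values are all placeholders of mine:
import com.datastax.driver.core.Cluster

// Sketch only: pin the partition key (event_source, bucket), then slice on the
// clustering column event_time -- this hits a single partition on one node.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("ks")   // "ks" is a placeholder keyspace
val rows = session.execute(
  "SELECT event_time, event_text FROM events " +
  "WHERE event_source = 1 AND bucket = '2015-08-04 00:00:00+0000' " +
  "AND event_time < '2015-08-04 12:00:00+0000'")
The scheduler would then iterate over the relevant buckets (e.g. today's and yesterday's) instead of scanning the whole cluster.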
Try to model your event table in a way that you do not need to make multiple queries. There is a good and free data modeling course from DataStax available: https://academy.datastax.com/resources/ds220-data-modeling
One note - be careful when using Cassandra as a queue - this is arguably an antipattern and you might be better off with a message queue such as ActiveMQ or RabbitMQ.

Is it okay to directly read from Cassandra to surface information from a web application?

I'm using Cassandra as my primary data store for a time series logging application. I receive a high volume of writes to this database, so Cassandra was a natural choice.
However, when I try showing statistics about the data on a web application, I make costly reads to this database and things start to slow down.
My initial idea is to run periodic cron jobs that pre-compute these statistics every hour. This would ensure no slow reads. I'm wondering whether there's a better way to read from a Cassandra database, and what the best solution is.
You are on the right track with your initial thinking.
How you store data in C*, and specifically how you select your primary key fields, has a direct influence on how you can read data out. If you are hitting a single partition of a table, reading data out of a C* cluster is very efficient, which makes it an excellent choice for showing data on a website.
In your case, if you want to show some level of aggregated data (e.g. by hour), I would suggest that you create your partition key in such a way that all the data you want to aggregate is contained in the same partition. Here is an example schema of what I mean:
CREATE TABLE data_by_hour (
  day text,
  hour int,
  minute int,
  data float,
  PRIMARY KEY ((day, hour), minute)
);
You can then use a cron job or some other mechanism to run a query and aggregate the data into another table to show on the website.
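A hedged sketch of what that aggregation step might look like with Spark and the spark-cassandra-connector (the keyspace "ks" and the summary table hourly_summary are assumptions of mine, not part of the question):
import com.datastax.spark.connector._

// Aggregate one (day, hour) partition into an average and persist it to a
// summary table the website can read with a single-partition query.
// Assumed summary schema: (day text, hour int, avg float, PRIMARY KEY (day, hour)).
val (day, hour) = ("2015-08-04", 9)
val (sum, count) = sc.cassandraTable[(Int, Float)]("ks", "data_by_hour")
  .select("minute", "data")
  .where("day = ? AND hour = ?", day, hour)
  .map { case (_, value) => (value, 1) }
  .reduce { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }

sc.parallelize(Seq((day, hour, sum / count)))
  .saveToCassandra("ks", "hourly_summary", SomeColumns("day", "hour", "avg"))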

Spark: Continuously reading data from Cassandra

I have gone through Reading from Cassandra using Spark Streaming and through tutorial-1 and tutorial-2 links.
Is it fair to say that Cassandra-Spark integration currently does not provide anything out of the box to continuously get the updates from Cassandra and stream them to other systems like HDFS?
By continuously, I mean getting only those rows in a table which have changed (inserted or updated) since the last fetch by Spark. If there are too many such rows, there should be an option to limit the number of rows, and the subsequent Spark fetch should begin from where the previous one left off. An at-least-once guarantee is OK, but exactly-once would be hugely welcome.
If it's not supported, one way to support it could be to have an auxiliary column updated_time in each Cassandra table that needs to be queried by Spark, and then use that column for the queries. Or, an auxiliary table per table that contains the ID and timestamp of each changed row. Has anyone tried this before?
I don't think Apache Cassandra has this functionality out of the box. Internally [for some period of time] it stores all operations on data in a sequential manner, but that is per node and it gets compacted eventually (to save space). Frankly, Cassandra's promise (like that of most other DBs) is to provide the latest view of the data (which by itself can be quite tricky in a distributed environment), but not the full history of how the data changed.
So if you still want to have such info in Cassandra (and process it in Spark), you'll have to do some additional work yourself: design dedicated table(s) (or add synthetic columns), take care of partitioning, save offset to keep track of progress, etc.
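To make the offset idea concrete, a speculative sketch (the table layout and every name below are mine, not an established connector pattern): keep a change-log table whose clustering column is the update time, and advance a saved offset after each fetch.
import com.datastax.spark.connector._

// Assumed change-log table, written alongside every insert/update:
//   CREATE TABLE ks.events_changelog (bucket text, updated_time timestamp,
//     id uuid, PRIMARY KEY (bucket, updated_time, id));
val currentBucket = "2015-08-04"           // assumed daily buckets
val lastOffset = new java.util.Date(0L)    // in reality, read back the persisted offset

val newRows = sc.cassandraTable("ks", "events_changelog")
  .where("bucket = ? AND updated_time > ?", currentBucket, lastOffset)
// After processing, persist the max updated_time seen as the next offset.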
Cassandra is OK for time series data, but in your case I would consider just using a streaming solution (like Kafka) instead of reinventing one.
I agree with what Ralkie stated but wanted to propose one more solution if you're tied to C* with this use case. This solution assumes you have full control over the schema and ingest as well. This is not a streaming solution though it could awkwardly be shoehorned into one.
Have you considered using a composite key composed of the timebucket along with a murmur_hash_of_one_or_more_clustering_columns % some_int_designed_limit_row_width? In this way, you could set your timebuckets to 1 minute, 5 minutes, 1 hour, etc., depending on how "real-time" you need to analyze/archive your data. The murmur hash based on one or more of the clustering columns is needed to help locate data in the C* cluster (and is a terrible solution if you're often looking up specific clustering columns).
For example, take an IoT use case where sensors report in every minute and have some sensor reading that can be represented as an integer.
create table if not exists iottable (
  timebucket bigint,
  sensorbucket int,
  sensorid varchar,
  sensorvalue int,
  primary key ((timebucket, sensorbucket), sensorid)
) with caching = 'none'
and compaction = { 'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowedCompaction' };
Note the use of TimeWindowedCompaction. I'm not sure what version of C* you're using, but with the 2.x series I'd stay away from DateTieredCompaction. I cannot speak to how well it performs in 3.x. At any rate, you should test and benchmark extensively before settling on your schema and compaction strategy.
Also note that this schema could result in hotspotting as it is vulnerable to sensors that report more often than others. Again, not knowing the use case it's hard to provide a perfect solution -- it's just an example. If you don't care about ever reading C* for a specific sensor (or column), you don't have to use a clustering column at all and you can simply use a timeUUID or something random for the murmur hash bucketing.
Regardless of how you decide to partition the data, a schema like this would then allow you to use repartitionByCassandraReplica and joinWithCassandraTable to extract the data written during a given timebucket.
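For what it's worth, here is a sketch of that extraction path with the connector; the keyspace "ks", the bucket count of 100, and the literal timebucket value are all assumptions on my part:
import com.datastax.spark.connector._
import scala.util.hashing.MurmurHash3

val bucketCount = 100                      // some_int_designed_limit_row_width
val timebucket = 1453939200L               // the time bucket to extract
// On write, a row's sensorbucket would be computed the same way, e.g.:
//   math.abs(MurmurHash3.stringHash(sensorid)) % bucketCount

// Enumerate every (timebucket, sensorbucket) partition of the window, move each
// key to a node holding a replica, then join against the table server-side.
val keys = sc.parallelize(0 until bucketCount).map(b => (timebucket, b))
val rows = keys
  .repartitionByCassandraReplica("ks", "iottable")
  .joinWithCassandraTable("ks", "iottable")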

Perform queries over the time-series stream of data

I'm trying to design an architecture of my streaming application and choose the right tools for the job.
This is how it works currently:
Messages from the "application-producer" part have the form of (address_of_sensor, timestamp, content) tuples.
I've already implemented all functionality before Kafka, and now I've encountered a major flaw in the design. In the "Spark Streaming" part, the consolidated stream of messages is translated into a stream of events. The problem is that events are for the most part composite - they consist of multiple messages that occurred at the same time at different sensors.
I can't rely on "time of arrival to Kafka" as a means of detecting "simultaneity", so I have to somehow sort the messages in Kafka before extracting them with Spark - or, more precisely, make queries over the Kafka messages.
Maybe Cassandra is the right replacement for Kafka here? I have a really simple data model, and only two possible types of queries to perform: query by address, and range query by timestamp. Maybe this is the right choice?
Does somebody have any numbers on Cassandra's throughput?
If you want to run queries on your time series, Cassandra may be the best fit - it is very write-optimized and you can build "wide" rows for your series. It is possible to take slices of your wide rows, so you can select a time range with only one query.
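To make the slice idea concrete, a hedged sketch of the two query shapes from the question, via the spark-cassandra-connector (the keyspace, table, and schema are assumed by me):
import com.datastax.spark.connector._

// Assumed schema: PRIMARY KEY (address, timestamp) -- address is the partition
// key of the wide row, timestamp the clustering column that orders it.
val byAddress = sc.cassandraTable("ks", "readings")
  .where("address = ?", "sensor-17")

val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ssZ")
val (from, to) = (fmt.parse("2015-08-04 09:00:00+0000"), fmt.parse("2015-08-04 10:00:00+0000"))
val timeSlice = sc.cassandraTable("ks", "readings")
  .where("address = ? AND timestamp >= ? AND timestamp < ?", "sensor-17", from, to)
Both are single-partition slices, which is exactly where Cassandra reads are cheap.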
On the other hand, Kafka can be considered a raw data flow - you don't have queries, only recently produced data. In order to collect data based on some key in the same partition, you have to select this key carefully. All data within the same partition is time-sorted.
Range queries on timestamp are the classic Cassandra use case. If you need address-based queries as well, you would have to make the address a clustering column. As far as Cassandra throughput is concerned, if you invest in proper performance analysis of your Cassandra cluster, you can achieve very high write throughput. But I have used Spark SQL, the Cassandra driver, and the Spark Cassandra connector, and they don't really give high query throughput until you have a big cluster with a high-CPU configuration; they do not work well with small datasets.
Kafka should not be used as a data source for queries; it's more of a commit log.

Spark Cassandra connector - Range query on partition key

I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation, it seems it is possible to do server-side filtering on a partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use those.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also I can assure that there is data to be returned since running the query on cqlsh (doing the appropriate conversion using 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to do is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over this CassandraRDD.
So your code (in Scala) should look something like this:
val format = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
val start = format.parse("2013-01-01T00:00:00.000Z")
val end = format.parse("2013-12-31T00:00:00.000Z")
val cassRDD = sc.cassandraTable("keyspace name", "table name")
  .filter(row => !row.getDate("timestamp").before(start) && row.getDate("timestamp").before(end))
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the plain Cassandra driver, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
But, as I said before, I don't know whether it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, but the least efficient, is to bring the full data set to your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
