How to store microsecond level timestamps in cassandra?

I'm trying to store data in cassandra which contains microsecond level timestamps.
Cassandra's docs say that the 'timestamp' data type can store milliseconds since epoch but several messages on the internet seem to imply that cassandra can natively store microsecond timestamps.
What is the best practice for storing microsecond-level times in cassandra? Should I just leave out the date part and store a long?
I'm trying to store a column which looks like this:
2015-11-18 07:30:46.700824
I get the following error:
ErrorMessage code=2200 [Invalid query] message="unable to coerce '2015-11-18 07:30:18.261543' to a formatted date (long)"
Aborting import at record #1. Previously inserted records are still present, and some records after that may be present as well.
My cassandra version:
[cqlsh 5.0.1 | Cassandra 2.1.11 | CQL spec 3.2.1 | Native protocol v3]
EDIT:
Here is an example of microsecond confusion in Cassandra's own docs:
"CAS and new features in CQL such as DROP COLUMN assume that cell timestamps are microseconds-since-epoch"
https://docs.datastax.com/en/upgrade/doc/upgrade/cassandra/upgradeChangesC_c.html
Another: https://issues.apache.org/jira/browse/CASSANDRA-8297
EDIT2:
I should mention that I intend to query this using spark. From what I understand, spark parses its own flavor of sql and translates it to cassandra (although I'm using CassandraContext in zeppelin). Is there anything which might help or hinder my search for microsecond level timestamps?

You can use a bigint or a timeuuid. Type 1 UUIDs have 100 ns precision, so that covers you. Some utilities, libraries, and convenience functions may not give you what you need though, so be prepared to write some UUID helper functions.
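If you go the bigint route, the conversion is on you. A minimal sketch in Scala (the helper name, the UTC assumption, and the fixed 6-digit fraction are my own choices, not part of any driver API):

import java.text.SimpleDateFormat
import java.util.TimeZone

// Turn "2015-11-18 07:30:46.700824" into microseconds since epoch so it can be
// written to a bigint column instead of a timestamp column.
def toMicrosSinceEpoch(s: String): Long = {
  val (secondsPart, fraction) = s.splitAt(s.indexOf('.'))   // "2015-11-18 07:30:46" / ".700824"
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  fmt.setTimeZone(TimeZone.getTimeZone("UTC"))              // assuming the source times are UTC
  fmt.parse(secondsPart).getTime * 1000L + fraction.drop(1).toLong  // assumes a 6-digit fraction
}

// toMicrosSinceEpoch("2015-11-18 07:30:46.700824") == 1447831846700824L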

Related

Spark: Continuously reading data from Cassandra

I have gone through Reading from Cassandra using Spark Streaming and through tutorial-1 and tutorial-2 links.
Is it fair to say that Cassandra-Spark integration currently does not provide anything out of the box to continuously get the updates from Cassandra and stream them to other systems like HDFS?
By continuously, I mean getting only those rows in a table which have changed (inserted or updated) since the last fetch by Spark. If there are too many such rows, there should be an option to limit the number of rows, and the subsequent Spark fetch should begin from where it left off. An at-least-once guarantee is OK, but exactly-once would be hugely welcome.
If it's not supported, one way to support it could be to have an auxiliary column updated_time in each cassandra-table that needs to be queried by Spark and then use that column for queries. Or an auxiliary table per table that contains the ID and timestamp of the rows being changed. Has anyone tried this before?
I don't think Apache Cassandra has this functionality out of the box. Internally [for some period of time] it stores all operations on data in a sequential manner, but it's per node and it gets compacted eventually (to save space). Frankly, Cassandra's (as most other DBs') promise is to provide the latest view of the data (which by itself can be quite tricky in a distributed environment), but not the full history of how the data changed.
So if you still want to have such info in Cassandra (and process it in Spark), you'll have to do some additional work yourself: design dedicated table(s) (or add synthetic columns), take care of partitioning, save offset to keep track of progress, etc.
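As a rough sketch of that extra work, assuming an auxiliary change-log table bucketed by day (all keyspace, table, and column names here are invented, and loadLastOffset is a stand-in for wherever you persist the offset):

// Assumed auxiliary table, written to alongside every normal write:
//   CREATE TABLE app.changes (day text, updated_at timeuuid, row_id text,
//                             PRIMARY KEY (day, updated_at))
import com.datastax.spark.connector._

val day = "2015-11-18"                                    // one partition per day
val lastOffset: java.util.UUID = loadLastOffset()         // hypothetical: offset saved by the previous run

val newChanges = sc.cassandraTable("app", "changes")
  .where("day = ? AND updated_at > ?", day, lastOffset)   // server-side: only rows newer than the offset
  .limit(10000)                                           // cap the batch; resume from the new offset next run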
Cassandra is OK for time series data, but in your case I would consider just using a streaming solution (like Kafka) instead of reinventing it.
I agree with what Ralkie stated but wanted to propose one more solution if you're tied to C* with this use case. This solution assumes you have full control over the schema and ingest as well. This is not a streaming solution though it could awkwardly be shoehorned into one.
Have you considered using a composite key composed of the timebucket along with a murmur_hash_of_one_or_more_clustering_columns % some_int_designed_limit_row_width? In this way, you could set your timebuckets to 1 minute, 5 minutes, 1 hour, etc depending on how "real-time" you need to analyze/archive your data. The murmur hash based off of one or more of the clustering columns is needed to help locate data in the C* cluster (and is a terrible solution if you're often looking up specific clustering columns).
For example, take an IoT use case where sensors report in every minute and have some sensor reading that can be represented as an integer.
create table if not exists iottable (
timebucket bigint,
sensorbucket int,
sensorid varchar,
sensorvalue int,
primary key ((timebucket, sensorbucket), sensorid)
) with caching = 'none'
and compaction = { 'class': 'com.jeffjirsa.cassandra.db.compaction.TimeWindowedCompaction' };
Note the use of TimeWindowedCompaction. I'm not sure what version of C* you're using, but with the 2.x series I'd stay away from DateTieredCompaction. I cannot speak to how well it performs in 3.x. At any rate, you should test and benchmark extensively before settling on your schema and compaction strategy.
Also note that this schema could result in hotspotting, as it is vulnerable to sensors that report more often than others. Again, not knowing the use case, it's hard to provide a perfect solution -- it's just an example. If you don't care about ever reading C* for a specific sensor (or column), you don't have to use a clustering column at all and you can simply use a timeUUID or something random for the murmur hash bucketing.
Regardless of how you decide to partition the data, a schema like this would then allow you to use repartitionByCassandraReplica and joinWithCassandraTable to extract the data written during a given timebucket.
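To make that last step concrete, a rough sketch against the iottable above (the keyspace name "iot", the 16 sensorbuckets, and the 5-minute bucket size are all assumptions for illustration):

import com.datastax.spark.connector._

// One key per (timebucket, sensorbucket) partition we want to read back.
case class IotKey(timebucket: Long, sensorbucket: Int)

val bucketMillis = 5 * 60 * 1000L                              // assumed 5-minute buckets
val timebucket = System.currentTimeMillis() / bucketMillis     // the bucket being extracted
val keys = sc.parallelize((0 until 16).map(b => IotKey(timebucket, b)))  // 16 sensorbuckets assumed

val rows = keys
  .repartitionByCassandraReplica("iot", "iottable")            // move each key next to its replica
  .joinWithCassandraTable("iot", "iottable")                   // fetch only those partitions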

cassandra data purging for time series data based on timestamp column

I am storing time series data in Cassandra on a daily basis. We would like to archive/purge the data older than 2 days on a daily basis. We are using the Hector API to store the data. Can someone suggest an approach to delete Cassandra data on a daily basis where the data is older than 2 days? Using the TTL approach for Cassandra rows is not feasible, as the number of days after which to delete data is configurable. Right now there is no timestamp column in the table. We are planning to add a timestamp column. But the problem is that the timestamp alone cannot be used in a where clause, as this new column is not part of the primary key.
Please provide your suggestion.
TTL is the right answer; there is an internal timestamp attached to every mutation that is used for this, so you don't need to add one. Manual purging is almost never a good idea. You may need to work on your data model a bit; check the DataStax Academy examples for time series.
Also, Thrift has been frozen for two years and is now officially deprecated (removal in 4.0). Hector and other Thrift clients are not really maintained anymore (see here). Using CQL and the Java driver will give better results, with more resources available for learning as well.
I don't see what is stopping you from using the TTL approach. TTL can be used not only while defining the schema, but also while saving data into a table using the DataStax Cassandra driver. So, in reality, you can have a separate TTL for each row, configured by your Java code.
Also, as Chris already mentioned, TTL uses the internal timestamp for this.
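For example, with the Java driver you can bind the TTL per statement. A minimal sketch in Scala (the keyspace, table, and column names are made up, and ttlSeconds would come from your configuration):

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("metrics")                    // hypothetical keyspace

val ttlSeconds = 2 * 24 * 3600                              // "older than 2 days" => expire after 2 days
val insert = session.prepare(
  "INSERT INTO readings (sensor_id, ts, value) VALUES (?, ?, ?) USING TTL ?")
session.execute(insert.bind("sensor-1", new java.util.Date(), Double.box(21.5), Int.box(ttlSeconds)))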
Strictly based on what you describe, I think the only solution is to add that timestamp column and add a secondary index on it.
However this is a huge indicator that your data model is far from being adapted to the situation.
Emphasising my initial comment:
Is your model adapted/designed to something else? Because this doesn't look like time series data in Cassandra: a timestamp-like column should be part of the clustering key.

How to generate UUID(Long) using cassandra timestamp in cluster environment?

I have a requirement where we need to generate a UUID as a Long value using Java, based on the Cassandra cluster timestamp. Can anyone help with how to generate it using a combination of Java and the Cassandra cluster timestamp?
Use the timeuuid CQL3 data type:
A value of the timeuuid type is a Type 1 UUID. A type 1 UUID includes the time of its generation and is sorted by timestamp, making them ideal for use in applications requiring conflict-free timestamps. For example, you can use this type to identify a column (such as a blog entry) by its timestamp and allow multiple clients to write to the same partition key simultaneously. Collisions that would potentially overwrite data that was not intended to be overwritten cannot occur.
In Java you can use UUIDs helper class from com.datastax.driver.core.utils.UUIDs:
UUIDs.timeBased()
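If you specifically need a Long rather than a UUID, you can pull one out of the generated timeuuid. A small sketch (note that a bare timestamp alone may not be unique across a cluster, since two nodes can hit the same instant):

import com.datastax.driver.core.utils.UUIDs

val id = UUIDs.timeBased()                  // type 1 (time-based) UUID

// 100-nanosecond units since the UUID epoch (1582-10-15), per RFC 4122.
val rawTimestamp: Long = id.timestamp()

// Milliseconds since the Unix epoch, if you want a conventional epoch-based Long.
val unixMillis: Long = UUIDs.unixTimestamp(id)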

Spark Cassandra connector - Range query on partition key

I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation it seems it's possible to do server-side filtering on the partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use them.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also I can assure that there is data to be returned since running the query on cqlsh (doing the appropriate conversion using 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to do is not allowed in Cassandra; you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over this CassandraRDD.
So your code (in Scala) should look something like this:
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
fmt.setTimeZone(java.util.TimeZone.getTimeZone("UTC"))
val (start, end) = (fmt.parse("2013-01-01T00:00:00.000Z"), fmt.parse("2013-12-31T00:00:00.000Z"))
val cassRDD = sc.cassandraTable("keyspace name", "table name")
  .filter(row => !row.getDate("timestamp").before(start) && row.getDate("timestamp").before(end))
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the direct Cassandra driver, which is not integrated with Spark, or you can have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
But, as I said before, I don't know if it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, but the least efficient, is to bring the full data set to your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!

How to get a range of data from Cassandra

[cqlsh 5.0.1 | Cassandra 2.1.0 | CQL spec 3.2.0 | Native protocol v3]
table:
CREATE TABLE dc.event (
id timeuuid PRIMARY KEY,
name text
) WITH bloom_filter_fp_chance = 0.01;
How do I get a time range of data from Cassandra?
For example, when I try 'select * from event where id > maxTimeuuid('2014-11-01 00:05+0000') and id < minTimeuuid('2014-11-02 10:00+0000')', as seen here http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/timeuuid_functions_r.html
I get the following error: 'code=2200 [Invalid query] message="Only EQ and IN relation are supported on the partition key (unless you use the token() function)"'
Can I keep timeuuid as primary key and meet the requirement?
Thanks
Can I keep timeuuid as primary key and meet the requirement?
Not really, no. From http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/select_r.html
WHERE clauses can include greater-than and less-than comparisons, but for a given partition key, the conditions on the clustering column are restricted to the filters that allow Cassandra to select a contiguous ordering of rows.
You could try adding "ALLOW FILTERING" to your query... but I doubt that would work. And I don't know of a good way (and neither do I believe there is a good way) to tokenize the timeuuids. I'm about 99% sure the ordering from the partitioner would yield unexpected, bad results, even though the query itself would execute and appear correct until you dug into it.
As an aside, you should really check out a similar question that was asked about a year ago: time series data, selecting range with maxTimeuuid/minTimeuuid in cassandra
Short answer: no. Long answer: you can do something similar, e.g.:
CREATE TABLE dc.event (
event_time timestamp,
id timeuuid,
name text,
PRIMARY KEY(event_time, id)
) WITH bloom_filter_fp_chance = 0.01;
The timestamp would presumably be truncated so that it only reflected a whole day (or hour or minute depending on the velocity of your data). Your where clause would have to include the "IN" parameter for the timestamps that are included in your timeuuid range.
If you use an appropriate chunking factor (how much you truncate your timestamp), you may even answer some of the questions you're looking for without using a range of timeuuids, just a simple where clause.
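For example, with day-sized buckets the query might look roughly like this (a sketch only: it assumes an already-connected com.datastax.driver.core.Session named session, day-level truncation of event_time, and purely illustrative bucket values):

val rows = session.execute(
  """SELECT * FROM dc.event
    |WHERE event_time IN ('2014-11-01', '2014-11-02')
    |  AND id > maxTimeuuid('2014-11-01 00:05+0000')
    |  AND id < minTimeuuid('2014-11-02 10:00+0000')""".stripMargin)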
Essentially this allows you the leeway to make the kind of query you're looking for while respecting the restrictions in Cassandra. As Raedwald pointed out, you can't use the partition key in continuous ranges because of the underpinning nature of Cassandra as a large hash. That being said, Cassandra is well known for doing some incredibly powerful things with time-series data.
Take a look at how Newts is doing time series for ranges. The author has a great set of slides and a talk describing the data model to get precisely what you seem to be looking for. https://github.com/OpenNMS/newts/
Cassandra can not do this kind of query because Cassandra is a key-value store implemented using a giant hash map, not a relational database. Just like an in memory hash map, the only way to find the key values within a sub range is to iterate through all the keys. That can be expensive enough for an in memory hash map, but for Cassandra it would be crippling.
Yes, you can do it by using spark with scala and spark-cassandra-connector!
I think you should reduce the number of partition keys by setting them to 'YYYY-MM-dd hh:00+0000' and filtering on dates and hours only.
Then you could use something like:
import com.datastax.spark.connector._

case class TableKey(id: String)
val dates = Array("2014-11-02 10:00+0000", "2014-11-02 11:00+0000", "2014-11-02 12:00+0000")
val selected_data = sc.parallelize(dates).map(x => TableKey(x)).joinWithCassandraTable("dc", "event")
And there you have your selected data rdd that you could collect:
val data = selected_data.collect
I had a similar problem...
