Fetching time series/range data in Cassandra

I am new to Cassandra and trying to see if it fits my data query needs. I am populating a table with test data and fetching it with a CQL client in Go.
I am storing time series data in Cassandra, sorted by timestamp. I store data on a per-minute basis.
Schema is like this:
parent: string
child: string
bytes: int
val2: int
timestamp: date/time
I need to answer queries where a timestamp range and a child name are given. The result needs to be the bytes value for that time range (a single value, not a series). I made (child, timestamp) the primary key. I followed this approach rather than a column family with a timeuuid comparator type, since that was not supported in CQL.
Since the data stored at every timestamp (every minute) is the accumulated value, when I get a range query for time t1 to t2, I need to find the bytes value at t2 and the bytes value at t1, and subtract the two before returning. This works fine if t1 and t2 actually have entries in the table. If they do not, I need to find the earliest and latest times within (t1, t2) that have data and return the difference between their values.
One approach I can think of is to "select * from tablename WHERE timestamp <= t2 AND timestamp >= t1;" and then find the difference between the first and last entry in the array of rows returned. Is this the best way to do it? Since MIN and MAX queries are not supported, is there a way to find the maximum timestamp in the table that is less than a given value? Thanks for your time.

Are you storing each entry as a new row with a different partition key (the first column in the primary key)? If so, select * from x where f < a and f > b is a cluster-wide query, which will cause you problems. Consider adding a "fake" partition key, or use a partition key per date / week / month etc., so that your queries hit a single partition.
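To make the "partition key per date" idea concrete, here is a minimal sketch in Python, assuming one bucket per UTC day stored as an extra partition key column next to child. The helper names day_bucket and buckets_for_range are made up for illustration and are not part of any driver.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def day_bucket(ts: datetime) -> str:
    """Return the (hypothetical) per-day bucket for a timestamp, e.g. '2015-04-02'."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def buckets_for_range(t1: datetime, t2: datetime) -> List[str]:
    """List every day bucket that a query for the range [t1, t2] has to visit."""
    day = t1.astimezone(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
    end = t2.astimezone(timezone.utc)
    buckets = []
    while day <= end:
        buckets.append(day.strftime("%Y-%m-%d"))
        day += timedelta(days=1)
    return buckets
```

Each bucket returned by buckets_for_range would then get its own single-partition query, and the client stitches the results together. Whether a day, week or month is the right bucket size depends on how many rows end up in each partition.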
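Also, range queries in Cassandra may be evaluated as >= and <= even if you specify > and < (older CQL versions documented the range operators as inclusive). If you need strictly greater than or less than, check what comes back and filter client side where necessary.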
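For the "difference between the first and last entry" part, a rough sketch of the client-side step in Python, assuming the rows for one child have already been fetched in timestamp order. bytes_delta is a hypothetical helper name. The two CQL strings are only an assumption about how the boundary rows could be fetched directly with the (child, timestamp) primary key from the question; they are not executed here.

```python
from bisect import bisect_left, bisect_right

# Assumed queries for fetching only the boundary rows of one child (not run by this sketch).
FIRST_IN_RANGE = (
    "SELECT timestamp, bytes FROM tablename "
    "WHERE child = ? AND timestamp >= ? AND timestamp <= ? "
    "ORDER BY timestamp ASC LIMIT 1"
)
LAST_IN_RANGE = (
    "SELECT timestamp, bytes FROM tablename "
    "WHERE child = ? AND timestamp >= ? AND timestamp <= ? "
    "ORDER BY timestamp DESC LIMIT 1"
)


def bytes_delta(rows, t1, t2):
    """rows: list of (timestamp, accumulated_bytes) for one child, sorted by timestamp.

    Returns the last value minus the first value inside [t1, t2],
    or None when no row falls inside the range.
    """
    timestamps = [ts for ts, _ in rows]
    lo = bisect_left(timestamps, t1)    # first index with timestamp >= t1
    hi = bisect_right(timestamps, t2)   # one past the last index with timestamp <= t2
    if lo >= hi:
        return None
    return rows[hi - 1][1] - rows[lo][1]
```

The DESC LIMIT 1 form is also one way to get "the maximum timestamp less than or equal to a given value" without MAX, as long as the partition key is fixed in the WHERE clause.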

Related

Delete records in Cassandra table based on time range

I have a Cassandra table with schema:
CREATE TABLE IF NOT EXISTS TestTable(
documentId text,
sequenceNo bigint,
messageData blob,
clientId text,
PRIMARY KEY(documentId, sequenceNo))
WITH CLUSTERING ORDER BY(sequenceNo DESC);
Is there a way to delete the records which were inserted between a given time range? I know internally Cassandra must be using some timestamp to track the insertion time of each record, which would be used by features like TTL.
Since there is no explicit column for insertion timestamp in the given schema, is there a way to use the implicit timestamp or is there any better approach?
There is never any update to the records after insertion.
It's an interesting question...
All columns that aren't part of the primary key have a so-called WriteTime that can be retrieved with the writetime(column_name) function of CQL (warning: it doesn't work with collection columns, and returns null for UDTs!). But because CQL has no nested queries, you will need to write a program that fetches the data, filters entries by WriteTime, and deletes the entries whose WriteTime is older than your threshold. (Note that the value of writetime is in microseconds, not milliseconds as in CQL's timestamp type.)
The easiest way is to use Spark Cassandra Connector's RDD API, something like this:
import com.datastax.spark.connector._

// writetime() is in microseconds, so convert epoch seconds accordingly
val timestamp = someDate.toInstant.getEpochSecond * 1000000L
val oldData = sc.cassandraTable(srcKeyspace, srcTable)
  .select("prk1", "prk2", "reg_col".writeTime as "writetime")
  .filter(row => row.getLong("writetime") < timestamp)
oldData.deleteFromCassandra(srcKeyspace, srcTable,
  keyColumns = SomeColumns("prk1", "prk2"))
where prk1, prk2, ... are all the components of the primary key (documentId and sequenceNo in your case), and reg_col is any of the "regular" columns of the table that isn't a collection or UDT (for example, clientId). It's important that the list of primary key columns is the same in select and in deleteFromCassandra.
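If Spark is not available, the same fetch, filter, delete loop can be done in a small client program. Below is a sketch of just the filtering step in Python, assuming the rows were already read with something like SELECT documentId, sequenceNo, writetime(clientId) FROM TestTable. The function names and the DELETE statement constant are illustrative assumptions, not driver API.

```python
from datetime import datetime, timezone

# Assumed per-row delete, to be executed by whatever driver you use (not run here).
DELETE_ROW = "DELETE FROM TestTable WHERE documentId = ? AND sequenceNo = ?"

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_writetime_micros(dt: datetime) -> int:
    """Convert a timezone-aware datetime to microseconds since the Unix epoch,
    which is the unit writetime() reports."""
    delta = dt - _EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def keys_written_between(rows, start: datetime, end: datetime):
    """rows: iterable of (documentId, sequenceNo, writetime_micros).

    Returns the primary keys whose write time lies in [start, end).
    """
    start_us = to_writetime_micros(start)
    end_us = to_writetime_micros(end)
    return [(doc, seq) for doc, seq, wt in rows if start_us <= wt < end_us]
```

Each key returned by keys_written_between would then be bound to the DELETE statement. Integer arithmetic is used for the conversion to avoid float rounding on microsecond values.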

Storing time specific data in cassandra

I am looking for a good way to store time specific data in cassandra.
Each entry can look like (start_time, value). Later, I would like to retrieve the current value.
The logic for retrieving the current value is as follows:
1. Find all rows with start_time <= current_time.
2. From the rows obtained in the first step, pick the value with the maximum start_time.
PS:- Edited the question to make it more clear
The exact requirement cannot be met as stated, but we can get close to it with one more column.
First, to be able to use the <= operator, your start_time column needs to be a clustering key of your table.
Then you need a different partition key. You could choose a fixed value, but that causes problems once the partition holds too many rows. It is better to use something like the year or the month of start_time.
CREATE TABLE time_specific_table (
year bigint,
start_time timestamp,
value text,
PRIMARY KEY((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The problem is that when you query the table, you need to know the value of the partition key.
Find all rows with start_time <= current_time:
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
Select the value with the maximum start_time:
SELECT * FROM time_specific_table
WHERE year = :year LIMIT 1;
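Because the partition key has to be known, a lookup of the current value may have to walk back through earlier years when the current year has no row yet. A minimal sketch in Python, under the assumption that fetch_latest(year, time) wraps a query like WHERE year = :year AND start_time <= :time LIMIT 1 (which, with the DESC clustering order, would return the newest matching row). fetch_latest, current_value and the in-memory stand-in are all hypothetical.

```python
def current_value(fetch_latest, now_year, now, oldest_year):
    """Try the current year first, then earlier years, until one has a row
    with start_time <= now. Returns that row, or None if nothing matches."""
    for year in range(now_year, oldest_year - 1, -1):
        row = fetch_latest(year, now)
        if row is not None:
            return row
    return None


def make_in_memory_fetch(partitions):
    """Build a stand-in for the real query. partitions maps
    year -> list of (start_time, value) sorted by start_time descending."""
    def fetch_latest(year, time):
        for start_time, value in partitions.get(year, []):
            if start_time <= time:
                return (start_time, value)
        return None
    return fetch_latest
```

With a real driver, fetch_latest would execute the prepared statement and return the first row or None. The oldest_year bound keeps the loop from probing empty partitions forever.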
Alternatively, create two separate tables like this:
CREATE TABLE data (
start_time timestamp,
value int,
PRIMARY KEY(start_time, value)
);
CREATE TABLE current_value (
partition int PRIMARY KEY,
value int
);
Now insert the data into both tables. For the second table, use a static partition value such as 1:
INSERT INTO current_value(partition, value) VALUES(1, 10);
In the current_value table every insert is an upsert on the same row, so a select always returns the latest value.

filter for key-value pair in cassandra wide rows

I am trying to model time series data with many sensors (> 50k) with cassandra. As I would like to do filtering on multiple sensors at the same time, I thought using the following (wide row) schema might be suitable:
CREATE TABLE data(
time timestamp,
session_id int,
sensor text,
value float,
PRIMARY KEY((time, session_id), sensor)
);
If every sensor value was a column in an RDBMS, my query would ideally look like:
SELECT * FROM data WHERE sensor_1 > 10 AND sensor_2 < 2;
Translated to my cassandra schema, I assumed the query might look like:
SELECT * FROM data
WHERE
sensor = 'sensor_1' AND
value > 10 AND
sensor = 'sensor_2' AND
value < 2;
I now have two problems:
1. Cassandra tells me that I can filter on the sensor column only once: "sensor cannot be restricted by more than one relation if it includes an Equal".
2. Obviously, the filter on value doesn't make sense at the moment. I wouldn't know how to express the relationship between sensor and value in the query in order to filter multiple columns in the same (wide) row.
I do know that a solution to the first question would be to use CQL's IN clause. This however doesn't solve the second problem.
Is this scenario even suitable for cassandra?
Many thanks in advance.
You could try to use IN clause here.
So your query would be like this:
SELECT * FROM data
WHERE time = <time> and session_id = <session id>
AND sensor IN ('sensor_1', 'sensor_2')
AND value > 10 AND value < 2
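Since a single value restriction cannot express a different condition per sensor, one option is to fetch the rows for the sensors of interest with IN and apply the per-sensor conditions on the client. A sketch of that client-side step in Python, assuming rows arrive as (time, session_id, sensor, value) tuples; matching_partitions is a made-up helper name.

```python
from collections import defaultdict


def matching_partitions(rows, predicates):
    """rows: iterable of (time, session_id, sensor, value).
    predicates: dict mapping sensor name -> function(value) -> bool.

    Returns the sorted (time, session_id) keys for which every sensor named in
    predicates has a reading and that reading satisfies its predicate.
    """
    readings = defaultdict(dict)
    for time, session_id, sensor, value in rows:
        if sensor in predicates:
            readings[(time, session_id)][sensor] = value
    return sorted(
        key
        for key, values in readings.items()
        if all(name in values and check(values[name]) for name, check in predicates.items())
    )
```

For the query in the question, predicates would be {"sensor_1": lambda v: v > 10, "sensor_2": lambda v: v < 2}. This trades server-side filtering for extra data transfer, so it only makes sense when the number of partitions read is bounded.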

Presto Cassandra Connector Clustering Index

CQL Execution [returns instantly, assuming uses clustering key index]:
cqlsh:stats> select count(*) from events where month='2015-04' and day = '2015-04-02';
count
-------
5447
Presto Execution [takes around 8secs]:
presto:default> select count(*) as c from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02';
c
------
5447
(1 row)
Query 20150228_171912_00102_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:08 [147K rows, 144KB] [17.6K rows/s, 17.2KB/s]
Why does Presto have to process 147K rows when Cassandra itself responds with just 5447 rows for the same query? (I tried select * too.)
Why is Presto not able to use the clustering key optimization?
I tried all possible values such as timestamp, date, and different date formats. None of them has any effect on the number of rows being fetched.
CF Reference:
CREATE TABLE events (
month text,
day timestamp,
test_data text,
some_random_column text,
event_time timestamp,
PRIMARY KEY (month, day, event_time)
) WITH comment='Test Data'
AND read_repair_chance = 1.0;
I added event_time as a constraint too, in response to Dain's answer:
presto:default> select count(*) from cassandra.stats.events where month = '2015-04' and day = timestamp '2015-04-02 00:00:00+0000' and event_time = timestamp '2015-04-02 00:00:34+0000';
_col0
-------
1
(1 row)
Query 20150301_071417_00009_cxzfb, FINISHED, 1 node
Splits: 2 total, 2 done (100.00%)
0:07 [147K rows, 144KB] [21.3K rows/s, 20.8KB/s]
The Presto engine will push down simple WHERE clauses like this to a connector (you can see this in the Hive connector), so the question is why the Cassandra connector does not take advantage of this. To see why, we'll have to look at the code.
The pushdown system first interacts with connectors in the ConnectorSplitManager.getPartitions(ConnectorTableHandle, TupleDomain) method, so looking at the CassandraSplitManager, I see it is delegating the logic to getPartitionKeysSet. This method looks for a range constraint (e.g., x=33 or x BETWEEN 1 AND 10) for every column in the primary key, so in your case, you would need to add a constraint on event_time.
I don't know why the code insists on having a constraint on every column in the primary key, but I'd guess that it is a bug. It should be easy to tweak this code to remove that constraint.

Query columns based on datetime in Cassandra

We are trying to create/query information from a CF based on the following structure (e.g. a datetime, datetime, integer)
e.g.
03-22-2012 10.00, 03-22-2012 10.30 100
03-22-2012 10.30, 03-22-2012 11.00 50
03-22-2012 11.00, 03-22-2012 11.30 200
How do I model the above structure in Cassandra and perform the following queries via Hector.
select * from <CF> where datetime1 > 03-22-2012 10.00 and datetime2 < 03-22-2012 10.30
select * from <CF> where datetime1 > 03-22-2012 10.00 and datetime2 < 03-22-2012 11.00
select * from <CF> where datetime = 03-22-2012 (i.e. for the entire day)
This is a great introduction to working with dates and times in Cassandra: Basic Time Series with Cassandra.
In short, use timestamps (or v1 UUIDs) as your column names and set the comparator to LongType (or TimeUUIDType) in order to get chronological sorting of the columns. It's then easy to get a slice of data between two points in time.
Your question isn't totally clear about this, but if you want to get all events that happened during a given range of time of day regardless of the date, then you will want to structure your data differently. In this case, column names may be CompositeType(LongType, AsciiType), where the first component is a normal timestamp mod 86400 (the number of seconds in a day), and the second component is the date or something else that changes over time, like a full timestamp. You would also want to break up the row in this case, perhaps dedicating a different row to each hour.
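A small sketch of that time-of-day layout in Python, assuming timestamps are Unix epoch seconds in UTC. The function names and the "hour-NN" row key format are illustrative choices, not something Cassandra or Hector prescribes.

```python
SECONDS_PER_DAY = 86400


def time_of_day_column(epoch_seconds: int):
    """Build the two-part column name: (seconds since midnight, full timestamp).
    Sorting these tuples groups events by time of day first, then by date."""
    return (epoch_seconds % SECONDS_PER_DAY, epoch_seconds)


def hour_row_key(epoch_seconds: int) -> str:
    """Pick a row per hour of the day so that no single row grows without bound."""
    return "hour-%02d" % ((epoch_seconds % SECONDS_PER_DAY) // 3600)
```

For example, 1332410400 is 2012-03-22 10:00:00 UTC, so its column name starts with 36000 and it lands in the row "hour-10". An event at 10:00 on the following day sorts right next to it, ahead of an 11:00 event from the first day.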
Unfortunately there is no easy way to do this with just one column family in Cassandra. The problem is that you want Cassandra to sort on two different things: datetime1 and datetime2.
The obvious structure for this would be to have your Columns being Composite types of Composite(TimeUUID, TimeUUID, Integer). In this case, they will get sorted by datetime1, then datetime2, then integer.
But you will always get the ordering based on datetime1 and not on datetime2 (though if two entries have the same datetime1 then it will then order just those entries based on datetime2).
A possible workaround would be to have two column families with duplicate data (or indeed two rows for each logical row). One row where data is inserted (datetime1:datetime2:integer) and the other where it is inserted (datetime2:datetime1:integer). You can then do a multigetslice operation on these two rows and combine the data before handing it off to the caller:
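import me.prettyprint.cassandra.serializers.CompositeSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.beans.Composite;
import me.prettyprint.hector.api.beans.Rows;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.MultigetSliceQuery;
import me.prettyprint.hector.api.query.QueryResult;

// Slice both rows (one ordered by datetime1, one by datetime2) in a single call
final MultigetSliceQuery<String, Composite, String> query = HFactory.createMultigetSliceQuery(keyspace,
    StringSerializer.get(),
    CompositeSerializer.get(),
    StringSerializer.get());
query.setColumnFamily("myColumnFamily");
query.setKeys("myRow.arrangedByDateTime1", "myRow.arrangedByDateTime2");
query.setRange(new Composite(startTime), new Composite(endTime), false, Integer.MAX_VALUE);
final QueryResult<Rows<String,Composite,String>> queryResult = query.execute();
final Rows<String,Composite,String> rows = queryResult.get();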
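The "combine the data" step is left open above. One possible reading, sketched in Python with in-memory lists standing in for the two rows: take the slice of the datetime1-ordered copy that satisfies the first bound, take the slice of the datetime2-ordered copy that satisfies the second bound, and intersect them. build_rows and range_query are hypothetical names, and plain integers stand in for the datetimes.

```python
from bisect import bisect_left, bisect_right


def build_rows(entries):
    """entries: iterable of (datetime1, datetime2, value).
    Returns the two duplicated orderings: one sorted by datetime1, one by datetime2."""
    entries = list(entries)
    by_dt1 = sorted(entries)
    by_dt2 = sorted((dt2, dt1, value) for dt1, dt2, value in entries)
    return by_dt1, by_dt2


def range_query(by_dt1, by_dt2, after_dt1, before_dt2):
    """Entries with datetime1 > after_dt1 AND datetime2 < before_dt2."""
    # Slice of the copy ordered by datetime1: everything strictly after the bound.
    start = bisect_right([entry[0] for entry in by_dt1], after_dt1)
    from_first = set(by_dt1[start:])
    # Slice of the copy ordered by datetime2: everything strictly before the bound.
    end = bisect_left([entry[0] for entry in by_dt2], before_dt2)
    from_second = {(dt1, dt2, value) for dt2, dt1, value in by_dt2[:end]}
    return sorted(from_first & from_second)
```

Each slice corresponds to one contiguous column range, which is the kind of read a sorted row serves cheaply. The intersection happens on the client, so the cost depends on how large the two slices are.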
