I have a use case and would like suggestions on the following.
Structure:
Rowkey_1:
Column1 = value1;
Column2 = value2;
Rowkey_2:
Column1 = value1;
Column2 = value2;
" Suppose i am writing 1000 rows into cassandra with each row having couple of columns. After sometime i update only 100 rows and make changes for column values ".
-> when i read data from cassandra i only want to get these 100 updated rows and not the entire row key information.
Is there a way to say to cassandra like give me all row keys from start - > end where time in between "Time_start" to "Time_end"
in SQL Lingo -- > select from "" to "" where time between "time_start" and "time_end".
P.S. I read Basic Time Series with Cassandra, where it says you can annotate the row key like the below.
Inserting data: {:key => 'server1-load-20110306', :column_name => TimeUUID(now), :column_value => 0.75}
Here the column family has TimeUUID columns.
My question is: can you annotate your row key with a date and time like this: { :key => '2012-11-18 16:00:15' }
OR is there any other way to get only the recently updated rows?
Any suggestions/guidance really appreciated.
You can't do range queries on keys unless you use ByteOrderedPartitioner, which you shouldn't use. The way to do this is by writing known sentinel values as keys, such as a timestamp representing the beginning of the day. Then you can do the column slice by time.
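In CQL terms, that sentinel-key pattern might look something like the sketch below (the table and column names are hypothetical):

-- One partition per day (the known sentinel key), with the update time as
-- a clustering column, so a time slice within a day is a plain range query.
CREATE TABLE updates_by_day (
    day text,              -- e.g. '2012-11-18', the sentinel key
    updated_at timestamp,
    row_key text,
    PRIMARY KEY (day, updated_at, row_key)
);

-- "give me all row keys where the time is between Time_start and Time_end":
SELECT row_key FROM updates_by_day
WHERE day = '2012-11-18'
AND updated_at >= '2012-11-18 16:00:00'
AND updated_at <= '2012-11-18 17:00:00';

You would write into this table alongside each update, effectively maintaining your own index of recently updated row keys.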
I have been trying to model data in Cassandra and to filter it by date, as given by an answer here on SO; the second answer there does not use ALLOW FILTERING.
This is my current schema:
CREATE TABLE Banking.BankData (
    acctID text,
    email text,
    transactionDate date,
    transactionAmount double,
    balance double,
    currentTime timestamp,
    PRIMARY KEY ((acctID, transactionDate), currentTime)
) WITH CLUSTERING ORDER BY (currentTime DESC);
Now I have inserted data with:
INSERT INTO banking.BankData (acctID, email, transactionDate, transactionAmount, balance, currentTime) VALUES ('11', 'alpitanand20@gmail.com', '2013-04-03', 10010, 10010, toTimestamp(now()));
Now when I try a query like:
SELECT * FROM banking.BankData WHERE acctID = '11' AND transactionDate > '2012-04-03';
It tells me to use ALLOW FILTERING; however, in the link mentioned above that was not the case.
The final requirement is to get data by year, month, week, and so on; that's why I partitioned by day. But the date range query is not working.
Any suggestions for remodeling, or am I doing something wrong?
Thanks
Cassandra supports only equality predicates on partition key columns, so you can use only the = operator on them.
Range predicates (>, <, >=, <=) are supported only on clustering columns, and only on the last clustering column appearing in the condition.
For example, if you have the following primary key: (pk, c1, c2, c3), you can have range predicates like these:
where pk = xxxx and c1 > yyyy
where pk = xxxx and c1 = yyyy and c2 > zzzz
where pk = xxxx and c1 = yyyy and c2 = zzzz and c3 > wwww
but you can't have:
where pk = xxxx and c2 > zzzz
where pk = xxxx and c3 > zzzz
because you need to restrict the preceding clustering columns before using a range operation.
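For reference, a table with that primary key might be declared like this (a minimal sketch; the table and column names are hypothetical):

CREATE TABLE example_table (
    pk int,
    c1 int,
    c2 int,
    c3 int,
    val text,
    PRIMARY KEY (pk, c1, c2, c3)
);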
If you want to perform a range query on this data, you need to declare the corresponding column as a clustering column, like this:
PRIMARY KEY (acctID, transactionDate, currentTime)
In this case you can perform your query. But because you have a time component, you can simply do:
PRIMARY KEY(acctID, currentTime )
and do the query like this:
SELECT * FROM banking.BankData WHERE acctID = '11'
AND currentTime > '2012-04-03T00:00:00Z';
But you need to take two things into consideration:
your primary key should be unique - you may need to add another clustering column, such as a transaction ID (for example, of type uuid); that way, even if two transactions happen in the same millisecond, they won't overwrite each other;
if you have a lot of transactions per account, you may need to add another column to the partition key - for example, year, or year/month - so you don't end up with very large partitions. A remodel combining both ideas is sketched below.
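A combined remodel might look like the following sketch (assuming a month bucket and a uuid transaction ID; the bucket column and table name are hypothetical):

CREATE TABLE banking.BankDataByMonth (
    acctID text,
    yearMonth int,          -- e.g. 201304; bounds the partition size
    currentTime timestamp,
    txID uuid,              -- keeps rows unique within the same millisecond
    email text,
    transactionAmount double,
    balance double,
    PRIMARY KEY ((acctID, yearMonth), currentTime, txID)
) WITH CLUSTERING ORDER BY (currentTime DESC, txID ASC);

A date range query then stays within a single partition, e.g. WHERE acctID = '11' AND yearMonth = 201304 AND currentTime > '2013-04-01'.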
P.S. In the linked answer, the use of a non-equality operation is possible because ts is a clustering column.
I am looking for a good way to store time-specific data in Cassandra.
Each entry can look like (start_time, value). Later, I would like to retrieve the current value.
The logic for retrieving the current value is as follows:
Find all rows with start_time <= current_time.
Then find the value with the maximum start_time among the rows obtained in the first step.
P.S. Edited the question to make it clearer.
The exact requirement is not achievable as stated, but we can get close to it with one more column.
First, to be able to use the <= operator, your start_time column needs to be a clustering key of your table.
Then you need a partition key. You could choose a fixed value, but that can cause problems once the partition holds too many rows, so you are better off using something like the year or the month of start_time.
CREATE TABLE time_specific_table (
year bigint,
start_time timestamp,
value text,
PRIMARY KEY((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The problem is that when you query the table, you will need to know the value of the partition key:
Find all rows with start_time<=current_time
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
select the value with the maximum start_time (thanks to the DESC clustering order, it is the first row returned):
SELECT * FROM time_specific_table
WHERE year = :year LIMIT 1;
Create two separate tables like below:
CREATE TABLE data (
start_time timestamp,
value int,
PRIMARY KEY(start_time, value)
);
CREATE TABLE current_value (
partition int PRIMARY KEY,
value int
);
Now you have to insert data into both tables; to insert data into the second table, use a static partition value like 1:
INSERT INTO current_value(partition, value) VALUES(1, 10);
Now, in the current_value table the data will be upserted, and you will get the latest value whenever you select.
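For example, reading the latest value back could look like this (a minimal sketch against the current_value table above):

SELECT value FROM current_value WHERE partition = 1;

Because every insert uses partition = 1, each write overwrites the previous row, so this query always returns the most recently written value.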
I am trying to model time series data with many sensors (> 50k) in Cassandra. As I would like to filter on multiple sensors at the same time, I thought the following (wide row) schema might be suitable:
CREATE TABLE data(
time timestamp,
session_id int,
sensor text,
value float,
PRIMARY KEY((time, session_id), sensor)
);
If every sensor value was a column in an RDBMS, my query would ideally look like:
SELECT * FROM data WHERE sensor_1 > 10 AND sensor_2 < 2;
Translated to my cassandra schema, I assumed the query might look like:
SELECT * FROM data
WHERE
sensor = 'sensor_1' AND
value > 10 AND
sensor = 'sensor_2' AND
value < 2;
I now have two problems:
Cassandra tells me that I can filter on the sensor column only once:
sensor cannot be restricted by more than one relation if it includes an Equal
Obviously, the filter on value doesn't make sense at the moment. I wouldn't know how to express the relationship between sensor and value in the query in order to filter multiple columns in the same (wide) row.
I do know that a solution to the first problem would be to use CQL's IN clause. This, however, doesn't solve the second problem.
Is this scenario even suitable for Cassandra?
Many thanks in advance.
You could try to use the IN clause here.
So your query would look like this:
SELECT * FROM data
WHERE time = <time> and session_id = <session id>
AND sensor IN ('sensor_1', 'sensor_2')
AND value > 10 AND value < 2
I am new to Cassandra and trying to see if it fits my data query needs. I am populating test data in a table and fetching it using a CQL client in Golang.
I am storing time series data in Cassandra, sorted by timestamp. I store data on a per-minute basis.
Schema is like this:
parent: string
child: string
bytes: int
val2: int
timestamp: date/time
I need to answer queries where a timestamp range and a child name are provided. The result needs to be the bytes value in that time range (a single value, not a series). I made a primary key of (child, timestamp). I followed this approach rather than the column-family/comparator-type approach with timeuuid, since that is not supported in CQL.
Since the value stored at every timestamp (every minute) is an accumulated value, when I get a range query for time t1 to t2, I need to find the bytes value at t2 and the bytes value at t1, and subtract the two before returning. This works fine if t1 and t2 actually have entries in the table. If they do not, I need to find the times between (t1, t2) that do have data and return the difference.
One approach I can think of is to "select * from tablename WHERE timestamp <= t2 AND timestamp >= t1;" and then find the difference between the first and last entries in the returned rows. Is this the best way to do it? Since MIN and MAX queries are not supported, is there a way to find the maximum timestamp in the table less than a given value? Thanks for your time.
Are you storing each entry as a new row with a different partition key (the first column in the primary key)? If so, select * from x where f < a and f > b is a cluster-wide query, which will cause you problems. Consider adding a "fake" partition key, or use a partition key per date/week/month, etc., so that your queries hit a single partition.
Also, your queries in Cassandra are >= and <= even if you specify > and <. If you need strictly greater than or less than, you'll need to filter client-side.
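A bucketed layout along those lines might look like this sketch (the table and column names are hypothetical; it assumes one partition per child per day):

-- Accumulated per-minute counters, bucketed by child and day.
CREATE TABLE bytes_by_day (
    child text,
    day date,
    ts timestamp,
    bytes int,
    PRIMARY KEY ((child, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- Latest entry at or before t2; the DESC order makes it the first row,
-- which also answers "maximum timestamp less than a given value":
SELECT ts, bytes FROM bytes_by_day
WHERE child = 'child1' AND day = '2016-05-01'
AND ts <= '2016-05-01 10:00:00'
LIMIT 1;

Run the same query with ts <= t1 for the lower bound and subtract the two bytes values client-side.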
We are trying to create/query information from a CF based on the following structure (i.e. a datetime, a datetime, and an integer).
e.g.
03-22-2012 10.00, 03-22-2012 10.30 100
03-22-2012 10.30, 03-22-2012 11.00 50
03-22-2012 11.00, 03-22-2012 11.30 200
How do I model the above structure in Cassandra, and how do I perform the following queries via Hector?
select * from <CF> where datetime1 > 03-22-2012 10.00 and datetime2 < 03-22-2012 10.30
select * from <CF> where datetime1 > 03-22-2012 10.00 and datetime2 < 03-22-2012 11.00
select * from <CF> where datetime = 03-22-2012 (i.e. for the entire day)
This is a great introduction to working with dates and times in Cassandra: Basic Time Series with Cassandra.
In short, use timestamps (or v1 UUIDs) as your column names and set the comparator to LongType (or TimeUUIDType) in order to get chronological sorting of the columns. It's then easy to get a slice of data between two points in time.
Your question isn't totally clear about this, but if you want to get all events that happened during a given range of time of day regardless of the date, then you will want to structure your data differently. In this case, column names may be CompositeType(LongType, AsciiType), where the first component is a normal timestamp mod 86400 (the number of seconds in a day), and the second component is the date or something else that changes over time, like a full timestamp. You would also want to break up the row in this case, perhaps dedicating a different row to each hour.
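In CQL terms, that time-of-day layout might be sketched like this (a rough equivalent of the CompositeType(LongType, AsciiType) idea; all names are hypothetical):

-- One row per hour bucket; columns sort by seconds-of-day, then by the
-- full timestamp so entries from different dates don't collide.
CREATE TABLE events_by_time_of_day (
    hour_bucket int,        -- e.g. 10 for the 10:00-11:00 slot
    seconds_of_day bigint,  -- full timestamp mod 86400
    event_ts timestamp,     -- complete timestamp, keeps entries unique
    value int,
    PRIMARY KEY (hour_bucket, seconds_of_day, event_ts)
);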
Unfortunately there is no way to do this easily with just one column family in Cassandra. The problem is that you want Cassandra to sort based on two different things: datetime1 and datetime2.
The obvious structure for this would be to have your columns be Composite types of Composite(TimeUUID, TimeUUID, Integer). In this case, they will get sorted by datetime1, then datetime2, then integer.
But you will always get the ordering based on datetime1, not on datetime2 (though if two entries have the same datetime1, just those entries will be ordered by datetime2).
A possible workaround would be to have two column families with duplicate data (or indeed two rows for each logical row): one row where the data is inserted as (datetime1:datetime2:integer), and the other where it is inserted as (datetime2:datetime1:integer). You can then do a multiget slice operation on these two rows and combine the data before handing it off to the caller:
// Fetch both orderings of the same logical row in a single multiget slice.
final MultigetSliceQuery<String, Composite, String> query = HFactory.createMultigetSliceQuery(keyspace,
        StringSerializer.get(),
        CompositeSerializer.get(),
        StringSerializer.get());
query.setColumnFamily("myColumnFamily");
query.setKeys("myRow.arrangedByDateTime1", "myRow.arrangedByDateTime2");
// Slice the columns of both rows between the two composite bounds.
query.setRange(new Composite(startTime), new Composite(endTime), false, Integer.MAX_VALUE);
final QueryResult<Rows<String, Composite, String>> queryResult = query.execute();
final Rows<String, Composite, String> rows = queryResult.get();
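From there, you would iterate over both returned rows and merge their column slices into a single chronologically ordered result before handing it back to the caller (the merge itself happens client-side).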