How to select max timestamp in a partition using Cassandra

I have a problem modeling my data using Cassandra. I would like to use it as an event store. My events have a creation timestamp. Those events belong to a partition, which is identified by an id.
Now I'd like to see the most recent event for each id and then filter those ids according to the timestamp.
So I have something like this:
ID | CREATION_TIMESTAMP | CONTENT
---+---------------------------------+----------------
1 | 2018-11-09 12:15:45.841000+0000 | {SOME_CONTENT}
1 | 2018-11-09 12:15:55.654656+0000 | {SOME_CONTENT}
2 | 2018-11-09 12:15:35.982354+0000 | {SOME_CONTENT}
2 | 2018-11-09 12:35:25.321655+0000 | {SOME_CONTENT}
2 | 2018-11-09 13:15:15.068498+0000 | {SOME_CONTENT}
I tried grouping by partition id and querying for the max of creation_timestamp, but that is not allowed: I should specify the partition id using EQ or IN. Additional reading led me to believe that this is entirely the wrong way of approaching the problem, but I don't know whether NoSQL is simply not a suitable tool for the job, or whether I am approaching the problem from the wrong angle.

You can easily achieve this by having your CREATION_TIMESTAMP as a clustering column ordered DESC. Then you would query by your id using LIMIT 1 (which will return the most recent event, since the data is ordered DESC within that partition).
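A minimal sketch of that layout with the Python cassandra-driver (contact point, keyspace and table names here are made up):
from cassandra.cluster import Cluster

# Hypothetical keyspace; it must already exist.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('event_store')

session.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id int,
        creation_timestamp timestamp,
        content text,
        PRIMARY KEY (id, creation_timestamp)
    ) WITH CLUSTERING ORDER BY (creation_timestamp DESC)
""")

# The partition for an id is already ordered newest-first, so LIMIT 1
# returns the most recent event without any server-side sorting.
latest = session.execute(
    "SELECT id, creation_timestamp, content FROM events WHERE id = %s LIMIT 1",
    (1,)
).one()
If your Cassandra version supports it (3.6+), PER PARTITION LIMIT 1 extends the same idea to the newest event of several ids in a single query.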

Can you please share your table definition?
Looking at your data, you can use ID as the partition key and CREATION_TIMESTAMP as the clustering column.
Then you can use: SELECT MAX(CREATION_TIMESTAMP) FROM keyspace.table WHERE ID = 'value';

Related

Best way to filter to a specific row in pyspark dataframe

I have what seems like a simple question, but I cannot figure it out. I am trying to filter to a specific row, based on an id (primary key) column, because I want to spot-check it against the same id in another table where a transform has been applied.
More detail... I have a dataframe like this:
| id | name | age |
| 1112 | Bob | 54 |
| 1123 | Sue | 23 |
| 1234 | Jim | 37 |
| 1251 | Mel | 58 |
...
except it has ~3000MM rows and ~2k columns. The obvious answer is something like df.filter('id = 1234').show(). The problem is that I have ~300MM rows and this query takes forever (as in 10-20 minutes on a ~20 node AWS EMR cluster).
I understand that it has to do a table scan, but fundamentally I don't understand why something like df.filter('age > 50').show() finishes in ~30 seconds while the id query takes so long. Don't they both have to do the same scan?
Any insight is very welcome. I am using pyspark 2.4.0 on linux.
Don't they both have to do the same scan?
That depends on the data distribution.
First of all, show takes only as little data as possible, so as long as there is enough data to collect 20 rows (the default value) it can process as little as a single partition, using LIMIT logic (you can check Spark count vs take and length for a detailed description of LIMIT behavior).
If 1234 were in the first partition and you explicitly set the limit to 1
df.filter('id = 1234').show(1)
the time would be comparable to the other example.
But if the limit is smaller than the number of values that satisfy the predicate, or if the values of interest reside in later partitions, Spark will have to scan all the data.
If you want to make it work faster you'll need the data bucketed (on disk) or partitioned (in memory) on the field of interest, or use one of the proprietary extensions (like Databricks indexing) or specialized storage (like the unfortunately inactive Succinct).
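For illustration, a rough PySpark sketch of bucketing on id (source path, bucket count and table name are made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source path; replace with your real dataset.
df = spark.read.parquet("s3://my-bucket/events")

# Write the data bucketed by id; bucketBy requires saveAsTable.
(df.write
   .bucketBy(256, "id")
   .sortBy("id")
   .mode("overwrite")
   .saveAsTable("events_bucketed"))

# Lookups on the bucketing column go back to the bucketed table.
spark.table("events_bucketed").filter("id = 1234").show()
Note that in older Spark versions bucketing mainly helps joins and aggregations on id; whether a plain equality filter can actually prune buckets depends on your Spark version.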
But really, if you need fast lookups, use a proper database - that's what they are designed for.

Retain last row for given key in spark structured streaming

Similar to Kafka's log compaction, there are quite a few use cases where it is required to keep only the last update on a given key and use the result, for example, for joining data.
How can this be achieved in Spark Structured Streaming (preferably using PySpark)?
For example suppose I have table
key | time | value
----------------------------
A | 1 | foo
B | 2 | foobar
A | 2 | bar
A | 15 | foobeedoo
Now I would like to retain the last value for each key as state (with watermarking), i.e. to have access to the dataframe
key | time | value
----------------------------
B | 2 | foobar
A | 15 | foobeedoo
that I might like to join against another stream.
Preferably this should be done without wasting the one supported aggregation step. I suppose I would need a kind of dropDuplicates() function with reverse order.
Please note that this question is explicitly about Structured Streaming and how to solve the problem without constructs that waste the aggregation step (hence, anything with window functions or max aggregation is not a good answer). (In case you do not know: chaining aggregations is currently unsupported in Structured Streaming.)
Use flatMapGroupsWithState or mapGroupsWithState: group by key, sort the values by time inside the flatMapGroupsWithState function, and store the last row in the GroupState.
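flatMapGroupsWithState / mapGroupsWithState are only exposed in Scala and Java; on a recent PySpark (3.4+) the closest equivalent is applyInPandasWithState. A rough sketch under that assumption, with `events` standing in for your streaming DataFrame of (key, time, value):
import pandas as pd
from pyspark.sql.streaming.state import GroupStateTimeout

# `events` is assumed to be a streaming DataFrame with columns key (string),
# time (long) and value (string), e.g. parsed from Kafka.

def keep_latest(key, pdf_iter, state):
    # Start from the previously stored (time, value), if any.
    latest_time, latest_value = state.get if state.exists else (-1, None)
    for pdf in pdf_iter:
        for row in pdf.itertuples(index=False):
            if row.time > latest_time:
                latest_time, latest_value = row.time, row.value
    state.update((latest_time, latest_value))
    yield pd.DataFrame({"key": [key[0]], "time": [latest_time], "value": [latest_value]})

latest = events.groupBy("key").applyInPandasWithState(
    keep_latest,
    "key string, time long, value string",   # output schema
    "time long, value string",               # state schema
    "update",                                # output mode
    GroupStateTimeout.NoTimeout)
On Spark 2.x the same state logic would have to be written in Scala with flatMapGroupsWithState.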

Efficient Cassandra DB design to retrieve a summary of time series financial data

I am looking to use the Apache Cassandra database to store a time series of 1-minute OHLCV financial data for ~1000 symbols. This will need to be updated in real time as data is streamed in. All entries older than 24 hours are not needed and should be discarded.
Assuming there are 1000 symbols with entries for each minute from the past 24 hrs, the total number of entries will amount to 1000*(60*24) = 1,440,000.
I am interested in designing this database to efficiently retrieve a slice of all symbols from the past [30m, 1h, 12h, 24h] with fast query times. Ultimately, I need to retrieve the OHLCV that summarises this slice. The resulting output would be {symbol, FIRST(open), MAX(high), MIN(low), LAST(close), SUM(volume)} of the slice for each symbol. This essentially summarises the 1m OHLCV entries and creates a [30m, 1h, 12h, 24h] OHLCV from the time of the query. E.g. if I want to retrieve the past 1h OHLCV at 1:32pm, the query will give me a 1h OHLCV that represents data from 12:32pm-1:32pm.
What would be a good design to meet these requirements? I am not concerned with the database's footprint on disk. The real issue is fast query times that are light on CPU and RAM.
I have come up with a simple and naive way to store each record with clustering ordered by time:
CREATE TABLE symbols (
time timestamp,
symbol text,
open double,
high double,
low double,
close double,
volume double,
PRIMARY KEY (time, symbol)
) WITH CLUSTERING ORDER BY (time DESC);
But I am not sure how to select from this to meet my requirements. I would rather design it specifically for my query, and duplicate data if necessary.
Any suggestions will be much appreciated.
While not based on Cassandra, Axibase Time Series Database can be quite relevant to this particular use case. It supports SQL with time-series syntax extensions to aggregate data into periods of arbitrary length.
An OHLCV query for a 15-minute window might look as follows:
SELECT date_format(datetime, 'yyyy-MM-dd HH:mm:ss', 'US/Eastern') AS time,
FIRST(t_open.value) AS open,
MAX(t_high.value) AS high,
MIN(t_low.value) AS low,
LAST(t_close.value) AS close,
SUM(t_volume.value) AS volume
FROM stock.open AS t_open
JOIN stock.high AS t_high
JOIN stock.low AS t_low
JOIN stock.close AS t_close
JOIN stock.volume AS t_volume
WHERE t_open.entity = 'ibm'
AND t_open.datetime >= '2018-03-29T14:32:00Z' AND t_open.datetime < '2018-03-29T15:32:00Z'
GROUP BY PERIOD(15 MINUTE, END_TIME)
ORDER BY datetime
Note the GROUP BY PERIOD clause above which does all the work behind the scenes.
Query results:
| time | open | high | low | close | volume |
|----------------------|----------|---------|----------|---------|--------|
| 2018-03-29 10:32:00 | 151.8 | 152.14 | 151.65 | 152.14 | 85188 |
| 2018-03-29 10:47:00 | 152.18 | 152.64 | 152 | 152.64 | 88065 |
| 2018-03-29 11:02:00 | 152.641 | 153.04 | 152.641 | 152.69 | 126511 |
| 2018-03-29 11:17:00 | 152.68 | 152.75 | 152.43 | 152.51 | 104068 |
You can use a Type 4 JDBC driver, API clients or just curl to run these queries.
I'm using sample 1-minute data for the above example which you can download from Kibot as described in these compression tests.
Also, ATSD supports scheduled queries to materialize minutely data into OHLCV bars of longer duration, say for long-term retention.
Disclaimer: I work for Axibase.

Sum aggregation for each column in Cassandra

I have a Data model like below,
CREATE TABLE appstat.nodedata (
nodeip text,
timestamp timestamp,
flashmode text,
physicalusage int,
readbw int,
readiops int,
totalcapacity int,
writebw int,
writeiops int,
writelatency int,
PRIMARY KEY (nodeip, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
where nodeip is the partition key and timestamp is the clustering key (sorted in descending order to get the latest).
Sample data in this table,
SELECT * from nodedata WHERE nodeip = '172.30.56.60' LIMIT 2;
nodeip | timestamp | flashmode | physicalusage | readbw | readiops | totalcapacity | writebw | writeiops | writelatency
--------------+---------------------------------+-----------+---------------+--------+----------+---------------+---------+-----------+--------------
172.30.56.60 | 2017-12-08 06:13:07.161000+0000 | yes | 34 | 57 | 19 | 27 | 8 | 89 | 57
172.30.56.60 | 2017-12-08 06:12:07.161000+0000 | yes | 70 | 6 | 43 | 88 | 79 | 83 | 89
This is properly available, and whenever I need the statistics I am able to get the data using the partition key as shown below.
(The above logic seems similar to my previous question, Aggregation in Cassandra across partitions, but the expectation is different.)
I have a value for each column (like readbw, latency, etc.) populated every minute on all 4 nodes.
Now, if I need to get the max value for a column (example: readbw), it is possible using the following query:
SELECT max(readbw) FROM nodedata WHERE nodeip IN ('172.30.56.60','172.30.56.61','172.30.56.62','172.30.56.63') AND timestamp < 1512652272989 AND timestamp > 1512537899000;
1) First question: Is there a way to perform a max aggregation on a column (readbw) across all nodes without using an IN query?
2) Second question: Is there a way in Cassandra, whenever I insert data for Node 1, Node 2, Node 3 and Node 4, to have it aggregated and stored in another table, so that I can collect the aggregated value of each column from the aggregated table?
If any of my point is not clear, please let me know.
Thanks,
Harry
If you are on DSE Cassandra you can enable Spark and write the aggregation queries.
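For illustration, a sketch of such an aggregation with PySpark and the Spark Cassandra Connector (the connection host and date bounds are placeholders; on DSE Analytics the connector is already bundled):
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
    .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder host
    .getOrCreate())

nodedata = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="appstat", table="nodedata")
    .load())

# Max readbw per node over a time window, across all partitions at once.
(nodedata
    .where(F.col("timestamp").between("2017-12-06 00:00:00", "2017-12-08 00:00:00"))
    .groupBy("nodeip")
    .agg(F.max("readbw").alias("max_readbw"))
    .show())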
Disclaimer: in your question you should define restrictions on query speed. Readers do not know whether you're trying to show this in real time, or whether it's more for analytical purposes. It's also not clear how much data you're operating on, and the answers might depend on that.
Firstly, decide whether you want to do aggregation on read or on write. This largely depends on your read/write patterns.
1) First question: (aggregation on read)
The short answer is no - it's not possible. If you want to use Cassandra for this, the best approach would be to do the aggregation in your application by reading each nodeip with a timestamp restriction. That would be slow. But Cassandra aggregations are also potentially slow. This warning exists for a reason:
Warnings :
Aggregation query used without partition key
I found the C++ Cassandra driver to be the fastest option, if you're into that.
If your data size allows, I'd look into using other databases. Regular old MySQL or Postgres will do the job just fine, unless you have terabytes of data. There's also InfluxDB if you want a more exotic one. But I'm getting off-topic here.
2) Second question: (aggregation on write)
That's the approach I've been using for a while. Whenever I need some aggregations, I do them in memory (Redis) and then flush them to Cassandra. Remember, Cassandra is super efficient at writing data, so don't be afraid to create some extra tables for your aggregations. I can't say exactly how to do this for your data, as it all depends on your requirements. It doesn't seem feasible to provide results for arbitrary timestamp intervals when aggregating on write.
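As a rough sketch of that write path (all hosts, key, table and column names here are made up), keeping a running max per node and hour in Redis and flushing it into a pre-created rollup table:
import redis
from cassandra.cluster import Cluster

# Placeholder hosts and keyspace; the rollup table
# nodedata_hourly (nodeip text, hour timestamp, max_readbw int,
# PRIMARY KEY (nodeip, hour)) is assumed to exist.
r = redis.Redis()
session = Cluster(['127.0.0.1']).connect('appstat')

def record_sample(nodeip, hour, readbw):
    # Keep a running max in Redis while the raw sample is being ingested.
    key = f"max_readbw:{nodeip}:{hour.isoformat()}"
    current = r.get(key)
    if current is None or readbw > int(current):
        r.set(key, readbw)

def flush_hour(nodeip, hour):
    # Persist the finished hour into the pre-aggregated Cassandra table.
    value = int(r.get(f"max_readbw:{nodeip}:{hour.isoformat()}"))
    session.execute(
        "INSERT INTO nodedata_hourly (nodeip, hour, max_readbw) VALUES (%s, %s, %s)",
        (nodeip, hour, value))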
Just don't try to put large sets of data into a single partition. You're better off with traditional SQL databases then.

Cassandra where clause: could not find proper rows

I have a column family in Cassandra 1.2 as below:
time | class_name | level_log | message | thread_name
-----------------+-----------------------------+-----------+---------------+-------------
121118135945759 | ir.apk.tm.test.LoggerSimple | DEBUG | This is DEBUG | main
121118135947310 | ir.apk.tm.test.LoggerSimple | ERROR | This is ERROR | main
121118135947855 | ir.apk.tm.test.LoggerSimple | WARN | This is WARN | main
121118135946221 | ir.apk.tm.test.LoggerSimple | DEBUG | This is DEBUG | main
121118135951461 | ir.apk.tm.test.LoggerSimple | WARN | This is WARN | main
When I use this query:
SELECT * FROM LogTM WHERE token(time) > token(0);
I get nothing! But as you can see, all of the time values are greater than zero!
This is the CF schema:
CREATE TABLE logtm(
time bigint PRIMARY KEY ,
level_log text ,
thread_name text ,
class_name text ,
msg text
);
Can anybody help?
Thanks :)
If you're not using an ordered partitioner (and if you don't know what that means, you're not), that query doesn't do what you think. Just because two timestamps sort one way doesn't mean that their tokens do. The token is the (Murmur3) hash of the cell value (unless you've changed the partitioner).
If you need to do range queries, you can't do them on the partition key, only on clustering keys. One way to do it is to use a schema like this:
CREATE TABLE LogTM (
shard INT,
time INT,
class_name ASCII,
level_log ASCII,
thread_name ASCII,
message TEXT,
PRIMARY KEY (shard, time, class_name, level_log, thread_name)
)
If you set shard to zero the schema will be roughly equivalent to what you're doing now, but the query SELECT * FROM LogTM WHERE shard = 0 AND time > 0 will give you the results you expect.
However, the performance will be awful. With a single value of shard only a single partition/row will be created, and you will only use a single node of your cluster (and that node will be very busy trying to compact that single row).
So you need to figure out a way to spread the load across more nodes. One way is to pick a random shard between something like 0 and 359 (or 0 and 255 if you like powers of two; the exact range isn't important, it just needs to be an order of magnitude or so larger than the number of nodes) for each insert, and to read from all shards when you read back: SELECT * FROM LogTM WHERE shard IN (0,1,2,...) (you need to include all shards in the list, in place of ...).
You can also pick the shard by hashing the message, that way you don't have to worry about duplicates.
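A hedged sketch of that insert/read pattern with the Python driver, using 360 shards and a deterministic hash of the message (the contact point and keyspace are made up; LogTM is the sharded table above):
import zlib
from cassandra.cluster import Cluster

NUM_SHARDS = 360
session = Cluster(['127.0.0.1']).connect('logs')   # placeholder keyspace

def insert_log(time, class_name, level_log, thread_name, message):
    # A deterministic hash of the message picks the shard, so writing the exact
    # same row twice just overwrites it; use random.randrange(NUM_SHARDS)
    # instead if you only care about spreading the load.
    shard = zlib.crc32(message.encode()) % NUM_SHARDS
    session.execute(
        "INSERT INTO LogTM (shard, time, class_name, level_log, thread_name, message) "
        "VALUES (%s, %s, %s, %s, %s, %s)",
        (shard, time, class_name, level_log, thread_name, message))

def read_since(min_time):
    # Reads have to fan out over every shard.
    shards = ",".join(str(s) for s in range(NUM_SHARDS))
    return session.execute(
        f"SELECT * FROM LogTM WHERE shard IN ({shards}) AND time > %s",
        (min_time,))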
You need to tell us more about what exactly it is that you're trying to do, especially how you intend to query the data. Don't go do the thing I described above, it is probably completely wrong for your use case, I just wanted to give you an example so that I could explain what is going on inside Cassandra.
