Timeseries with Spark/Cassandra - How to find timestamps when values satisfy a condition? - apache-spark

I have timeseries stored in a Cassandra table, coming from several sensors. Here is the schema I use for storing the data:
CREATE TABLE data_sensors (
sensor_id int,
time timestamp,
value float,
PRIMARY KEY ((sensor_id), time)
);
Values can be temperature or pressure, for instance, depending on which sensor they come from.
My objective is to be able to find basic statistics (min, max, avg, std) on pressure, but only when temperature is higher than a certain value.
Here is a diagram of the whole process I would like to implement.
I think it might be better to change the Cassandra model, at least for the temperature data, so that I can filter on value. Is there another way, once the data is loaded into a Spark RDD, that avoids altering the Cassandra table?
Then, once the filtering on temperature is done, how do I get the sequence of timestamps to use for filtering the pressure data? Note that the temperature and pressure readings do not necessarily share the same timestamps, which is why I think I need periods of time rather than a list of precise timestamps.
Thanks for your help!
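For concreteness, here is a minimal PySpark sketch of the pipeline described above. It is only a sketch under assumptions: it uses the DataStax Spark Cassandra connector (connection settings omitted), a placeholder keyspace, placeholder sensor ids, an arbitrary 30.0 threshold, and 1-minute periods.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Load the table through the Spark Cassandra connector
# (keyspace name is a placeholder, connection settings omitted).
sensors = (spark.read.format("org.apache.spark.sql.cassandra")
           .options(keyspace="my_keyspace", table="data_sensors")
           .load())

TEMP_SENSOR_ID, PRESSURE_SENSOR_ID, THRESHOLD = 1, 2, 30.0

# 1. Bucket temperature readings into fixed 1-minute periods and keep
#    only the periods where the temperature exceeded the threshold.
hot_periods = (sensors
               .where(F.col("sensor_id") == TEMP_SENSOR_ID)
               .where(F.col("value") > THRESHOLD)
               .select(F.window("time", "1 minute").alias("period"))
               .distinct())

# 2. Bucket pressure readings into the same periods, keep only the hot
#    periods, and compute the statistics.
pressure_stats = (sensors
                  .where(F.col("sensor_id") == PRESSURE_SENSOR_ID)
                  .withColumn("period", F.window("time", "1 minute"))
                  .join(hot_periods, "period")
                  .agg(F.min("value"), F.max("value"),
                       F.avg("value"), F.stddev("value")))

pressure_stats.show()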

It's not really a Cassandra-specific answer, but you may want to look at time series databases that provide a SQL layer on top of NoSQL stores, with support for JOINs and aggregations.
Here's an example of ATSD SQL syntax that supports period aggregations and joins.
SELECT t1.entity, t1.datetime, min(t1.value), max(t1.value), avg(t2.value)
FROM mpstat.cpu_busy t1
JOIN meminfo.memfree t2
WHERE t1.datetime >= '2016-09-20T15:00:00Z' AND t1.datetime < '2016-09-20T15:15:00Z'
GROUP BY entity, t1.PERIOD(1 MINUTE)
HAVING max(t1.value) > 30
The query joins two metrics, filters out the 1-minute periods where the first metric was below the threshold, and then returns statistics for the second series.
If the two series are unevenly spaced, you can regularize the array using linear interpolation.
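For example, a generic way (not ATSD-specific, just a sketch) to regularize an unevenly spaced series onto a fixed grid with linear interpolation in Python:
import numpy as np

# Irregular timestamps (epoch seconds) and their observed values.
t_raw = np.array([0.0, 47.0, 130.0, 260.0, 300.0])
v_raw = np.array([20.1, 20.4, 21.0, 22.3, 22.5])

# Resample onto a regular 60-second grid via linear interpolation.
t_grid = np.arange(t_raw[0], t_raw[-1] + 1, 60.0)
v_grid = np.interp(t_grid, t_raw, v_raw)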
Disclosure: I work for Axibase that develops ATSD.

Related

Efficient reading/transforming partitioned data in delta lake

I have my data in a delta lake in ADLS and am reading it through Databricks. The data is partitioned by year and date and z ordered by storeIdNum, where there are about 10 store Id #s, each with a few million rows per date. When I read it, sometimes I am reading one date partition (~20 million rows) and sometimes I am reading in a whole month or year of data to do a batch operation. I have a 2nd much smaller table with around 75,000 rows per date that is also z ordered by storeIdNum and most of my operations involve joining the larger table of data to the smaller table on the storeIdNum (and some various other fields - like a time window, the smaller table is a roll up by hour and the other table has data points every second). When I read the tables in, I join them and do a bunch of operations (group by, window by and partition by with lag/lead/avg/dense_rank functions, etc.).
My question is: should I have the date in all of the joins, group by and partition by statements? Whenever I am reading one date of data, I always have the year and the date in the statement that reads the data, as I know I only want to read from a certain partition (or a year of partitions), but is it important to also reference the partition column in the windows and group bys for efficiency, or is this redundant? After the analysis/transformations, I am not going to overwrite/modify the data I am reading in, but instead write to a new table (likely partitioned on the same columns), in case that is a factor.
For example:
dfBig = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, UNIX_TS, BARCODE, CUSTNUM, .... FROM STORE_DATA_SECONDS WHERE YEAR = 2020 and DATE='2020-11-12'")
dfSmall = spark.sql("SELECT YEAR, DATE, STORE_ID_NUM, TS_HR, CUSTNUM, .... FROM STORE_DATA_HRS WHERE YEAR = 2020 and DATE='2020-11-12'")
Now, if I join them, do I want to include YEAR and DATE in the join, or should I just join on STORE_ID_NUM (plus any of the timestamp/customer ID fields I need to join on)? I definitely need STORE_ID_NUM, but I could forgo YEAR and DATE if they just add more columns to join on and make it less efficient. I don't know exactly how it works, so I wanted to check: by leaving them out of the join, am I perhaps making it less efficient because I am not taking advantage of the partitions when doing the operations? Thank you!
The key with Delta is to choose the partition columns very well; this can take some trial and error. If you want to optimize response times, one technique I learned is to choose a filter column with low cardinality (if the problem is time series, it will be the date; if it is a report over all clients, it may be convenient to choose the city, for example). Remember that with Delta each partition column represents a level of the file structure, whose cardinality is the number of directories.
In your case I think partitioning by YEAR is good, but I would also add MONTH given the number of records; that should help somewhat with Spark's dynamic partition pruning.
Another thing you can try is a BROADCAST JOIN if one table is very small compared to the other.
Broadcast Hash Join in Spark (article in Spanish)
Join Strategy Hints for SQL Queries
The following link explains how dynamic partition pruning helps in MERGE operations:
How to improve performance of Delta Lake MERGE INTO queries using partition pruning
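As a rough illustration of both suggestions (keeping the partition columns in the join so pruning can apply, and broadcasting the small table), here is a hedged PySpark sketch using the table and column names from the question:
from pyspark.sql import functions as F

# Read one date partition of each table (as in the question).
dfBig = spark.sql("SELECT * FROM STORE_DATA_SECONDS WHERE YEAR = 2020 AND DATE = '2020-11-12'")
dfSmall = spark.sql("SELECT * FROM STORE_DATA_HRS WHERE YEAR = 2020 AND DATE = '2020-11-12'")

# Keep the partition columns (YEAR, DATE) in the join condition and
# broadcast the small table so the large one is not shuffled.
joined = dfBig.join(
    F.broadcast(dfSmall),
    on=["YEAR", "DATE", "STORE_ID_NUM"],  # plus any time-window/customer keys you need
    how="inner",
)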

Cassandra time series table design for timestamp range queries

Our problem is a bit different from a usual time series problem, as we do not have a natural partition key in our data. Our system receives no more than 5k messages per second, so following many publications (like this one) we came up with the following schema (it is more complex in reality, but the part below matters most):
CREATE TABLE IF NOT EXISTS test.messages (
date TEXT,
hour INT,
createdAt TIMESTAMP,
uuid UUID,
data TEXT,
PRIMARY KEY ((date, hour), createdAt, uuid)
)
We mostly want to query the system based on the creation (event) time; other filtering will likely be done with different engines like Spark. The problem is that a query may span e.g. two months, so ideally we would have to put 60+ dates and 24 hours in the WHERE ... IN part of the query, which is cumbersome to say the least. Of course, we could execute queries like the one below:
SELECT * FROM messages WHERE createdat >= '2017-03-01 00:00:00' LIMIT 10 ALLOW FILTERING;
My understanding is that, while the above works, it will do a full scan, which will be expensive on a larger cluster. Or am I mistaken and C* knows which partitions to scan?
I was thinking of adding an index, but as I understand it, that likely falls into the high-cardinality antipattern.
EDIT: the question is not so much about the data model (though suggestions are welcome) as about the feasibility of querying with a createdAt range instead of listing all the required date and hour values in the WHERE ... IN part of the query, in order to avoid full scans.
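For what it's worth, the usual alternative to ALLOW FILTERING here is to enumerate the (date, hour) buckets covered by the createdAt range on the client and query each partition explicitly. A minimal sketch with the Python driver, assuming the schema above and dates stored as 'YYYY-MM-DD':
from datetime import datetime, timedelta
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("test")
query = session.prepare(
    "SELECT * FROM messages "
    "WHERE date = ? AND hour = ? AND createdat >= ? AND createdat < ?")

start, end = datetime(2017, 3, 1), datetime(2017, 5, 1)

# Fan the range out over every (date, hour) partition it covers.
futures, t = [], start
while t < end:
    futures.append(session.execute_async(
        query, (t.strftime("%Y-%m-%d"), t.hour, start, end)))
    t += timedelta(hours=1)

rows = [row for f in futures for row in f.result()]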

Cassandra Schema for standard SELECT/FROM/WHERE/IN query

Pretty new to Cassandra - I have data that looks like this:
<geohash text, category int, payload text>
The only query I want to run is:
SELECT category, payload FROM table WHERE geohash IN (list of 9 geohashes)
What would be the best schema in this case?
I know I could simply make my geohash the primary key and be done with it, but is there a better approach?
What are the benefits for defining PRIMARY KEY (geohash, category, payload)?
It depends on the size of your data for each row (geohash text, category int, payload text). If your payload does not reach tens of MB, then you may want to put more geohash values into the same partition by using an artificial bucketId int, so your query can be served by a single server. The schema would look like this:
geohash text, bucketId int, category int, payload text, where the partition key is (geohash, bucketId).
The recommendation is to have sizeable partitions of <= 100 MB, so you don't have to look up too many partitions. More is available here.
If you have a primary key on (geohash, category, payload), then you can have your data sorted on category and payload in the ascending order.
So based on the query, it sounds like you're considering a CQL schema that looks like this:
CREATE TABLE geohash_data (
geohash text,
category int,
data text,
PRIMARY KEY (geohash)
);
In Cassandra, the first (and in this case only) column in your PRIMARY KEY is the Partition Key. The Partition Key is what's used to distribute data around the cluster. So when you do your SELECT ... IN () query, you're basically querying for the data in 9 different partitions which, depending on how large your cluster is, the replication factor, and the consistency level you use to do the query, could end up querying at least 9 servers (and maybe more). Why does that matter?
Latency: The more partitions (and thus replicas/servers) involved in our query, the more potential for a slow server being able to negatively impact how quickly the data is returned.
Availability: The more partitions (and thus replicas/servers) involved in our query, the more potential that a single server going down could make it impossible for the query to be satisfied at all.
Both of those are bad scenarios so (as Toan rightly points out in his answer and the link he provided), we try to data model in Cassandra so that our queries will hit as few partitions (and thus replicas/servers) as possible. What does that mean for your scenario? Without knowing all the details, it's hard to say for sure, but let me make a couple guesses about your scenario and give you an example of how I'd try to solve it.
It sounds like maybe you already know the list of possible geohash values ahead of time (maybe they're at some regularly spaced interval of a predefined grid). It also sounds like maybe you're querying for 9 geohash values because you're doing some kind of "proximity" search where you're trying to get the data for the 9 geohashes in each direction around a given point.
If that's the case, the trick could be to denormalize the data at write time into a data model optimized for reading. For example, a schema like this:
CREATE TABLE geohash_data (
geohash text,
data_geohash text,
category int,
data text,
PRIMARY KEY (geohash, data_geohash)
);
When you INSERT a data point, you'd calculate the geohashes for the surrounding areas where you expect that data should show up in the results. You'd then INSERT the data multiple times for each geohash you calculated. So the value for geohash is the calculated value where you expect it to show up in the query results and the value for data_geohash is the actual value from the data you're inserting. Thus you'd have multiple (up to 9?) rows in your partition for a given geohash which represent the data of the surrounding geohashes.
This means your SELECT query now doesn't have to do an IN and hit multiple partitions. You just query WHERE geohash = ? for the point you want to search around.
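A hedged sketch of that write path with the Python driver; neighbours_of() is a placeholder for whatever geohash-neighbourhood function you use (e.g. from a geohash library), and the keyspace name is made up:
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("my_keyspace")
insert = session.prepare(
    "INSERT INTO geohash_data (geohash, data_geohash, category, data) "
    "VALUES (?, ?, ?, ?)")

def store(data_geohash, category, data):
    # Write the same data point once per surrounding geohash, so a later
    # read only has to hit the single partition of the query geohash.
    for gh in neighbours_of(data_geohash) + [data_geohash]:
        session.execute(insert, (gh, data_geohash, category, data))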

Cassandra Time Series Data Modelling and Limiting Partition Size

We are currently investigating Cassandra as the database for a large time series system.
I have read through https://academy.datastax.com/resources/getting-started-time-series-data-modeling about modelling time series data in Cassandra.
What we have is high velocity timeseries data coming in for many weather stations. Each weather station has a number of "sensors" that each collect three metrics: temperature, humidity, and light.
We are trying to store each series as a wide row. However, we expect to get billions of readings per station over the life of the project, so we would like to limit the row size.
We would like there to be a single row for each (weather_station_id, year, day_of_year), that is, a new row for every day. However, we still want the partition key to be weather_station_id - that is, we want all readings for a station to be on the same node.
We currently have the following schema, but I would like to get some feedback.
CREATE TABLE weather_station_data (
weather_station_id int,
year int,
day_of_year int,
time timestamp,
sensor_id int,
temperature int,
humidity int,
light int,
PRIMARY KEY ((weather_station_id), year, day_of_year, time, sensor_id)
) WITH CLUSTERING ORDER BY (year DESC, day_of_year DESC, time DESC, sensor_id DESC);
In the aforementioned document, they make use of this "limit partition row by date" concept. However, it is unclear to me whether or not the date in their examples is part of the partition key.
According to the tutorial, if we choose weather_station_id as the only partition key, the partition will be exhausted,
i.e. C* has a practical limit of 2 billion columns (cells) per partition.
So IMO, your data-model is bad.
However, it is unclear to me whether or not the date in their examples is part of the partition key.
The tutorial used
PRIMARY KEY ((weatherstation_id,date),event_time)
So, yes, they made the date part of the partition key.
we want all readings for a station to be on the same node.
I am not sure why you want such a requirement. You can always fetch weather data for more than one year using multiple queries.
select * from weather_station_data where weather_station_id=1234 and year= 2013;
select * from weather_station_data where weather_station_id=1234 and year= 2014;
So consider changing your structure to
PRIMARY KEY ((weather_station_id, year), day_of_year, time, sensor_id)
Hope it helps!
In my opinion the DataStax model isn't really great. The problem with this model:
They are using the weather station as the partition key. All rows with the same partition key are stored on the same machine. This means that if you have 10 years of raw data at 100 ms steps, you will hit Cassandra's limit really fast: 10 years × 365 days × 24 hours × 60 minutes × 60 seconds × 10 (for 100 ms steps) × 7 columns is roughly 2.2 × 10^10 cells, well over the limit of 2 billion per partition. In my opinion you will not get the benefits of Cassandra with this data model; you could just as well use a MongoDB, MySQL or another database per weather station.
The better solution: ask yourself how you will query this data. If the answer is "I query all data per year", then use the year as part of the partition key as well. If you also need data from more than one year, you can issue two queries with different years. This works and the performance is better. (The bottleneck may then only be the network to your client.)
One more tip: Cassandra isn't like MySQL; it's a denormalized database, which means it's not dirty to store your data more than once. If it's important for you to query your data per year, but also per hour, per day of year or per sensor_id, you can create column families with a different partition key and primary key order for each access pattern. It's okay to duplicate your data. Cassandra is optimized for write performance, not reads, so it's often better to write the data in the right order than to read it in the right order. In Cassandra 3.0 there is a new feature called materialized views for this automatic duplication. And if you think "oh no, I will multiply the storage I need", remember: storage is really cheap. It's okay to buy ten 1 TB HDDs; it costs next to nothing. The performance is what matters.
I have one question for you: can you aggregate your data? Cassandra has a column type called counter. You can create a Java/Scala application that aggregates your data as it is produced, and you can use a streaming framework such as Flink or Spark for this (if you need a bit more than just counting). One scenario: you aggregate your data per hour and per day. The data arrives in your streaming app, where you keep a running value for the current hour and count up or down as needed. When the hour finishes, you write that row to your hourly column family and to your daily column family; in the daily column family you use a counter. I hope you understand what I mean.
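A rough sketch of that counter idea with the Python driver. The hourly rollup table and its keys are hypothetical, the call would come from your Flink/Spark job, and averages would be reconstructed at read time as sum / count:
from cassandra.cluster import Cluster

# Hypothetical rollup table, using counters so increments can be applied
# as readings arrive:
#   CREATE TABLE hourly_data (
#     weather_station_id int, day text, hour int,
#     temperature_sum counter, reading_count counter,
#     PRIMARY KEY ((weather_station_id, day), hour));
session = Cluster(["127.0.0.1"]).connect("weather")
bump = session.prepare(
    "UPDATE hourly_data "
    "SET temperature_sum = temperature_sum + ?, reading_count = reading_count + ? "
    "WHERE weather_station_id = ? AND day = ? AND hour = ?")

def on_reading(station_id, day, hour, temperature):
    # Counters are integers, so the temperature is rounded here.
    session.execute(bump, (int(round(temperature)), 1, station_id, day, hour))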

Querying Cassandra for multiple columns

I am using Cassandra to store stock information. Each 'row' has some base fields like time, price, close, open, low, high, etc. On top of these fields I have a list of float-typed values which contain some internal system calculations.
Example for an object:
Class stockentry
time timestamp;
price float;
close float;
open float;
low float;
high float;
x float;
y float;
z float;
xx2 float;
xx3 float;
xx... yy... z...
a lot more...
Creating a lot of columns in a column family and storing all this data is no problem with Cassandra. The problem is querying it.
I would like to query on fields like x, y, xx2, etc., and these fields contain highly distinct values (floats with 4 decimal places).
Adding all these columns (100-150) as secondary indexes is not likely to be a good solution and is not recommended by the Cassandra docs.
What is the recommended data modeling, considering the requirements, when working with Cassandra?
Cassandra data modeling follows a query-driven design pattern. What this means is that instead of building a model to naturally represent the data (as we might in an RDBMS), we design schemas to accommodate the data access patterns instead.
So for example, if you knew that the majority of your queries would involve a WHERE clause on column x, with results ordered by column y, you might want to create an additional table in which the partition key is x and the clustering column is y. For example:
CREATE TABLE <tablename> (
"x" float,
"y" float,
"price" float,
.
.
<rest of columns>
.
.
PRIMARY KEY ("x", "y")
);
Now, querying in column x becomes very efficient as the data for a particular value of x is stored together.
For queries in which a range of values is required (e.g. x > pricerange), you would be wise to store such columns as clustering columns.
Admittedly, this leads to multiple writes, as the values in columns x and y must be written across both tables. Cassandra encourages writes as storing data in this day and age is cheap. Essentially, in Cassandra you trade off additional writes for blazing fast reads.
Therefore, before designing your data model, think about what kind of queries you would most likely be doing and design accordingly.
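For example, against the table sketched above, the reads this model is designed for look like this (a sketch; the keyspace, table name and bound values are placeholders):
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("stocks")

# Equality on the partition key x: a single partition is read.
rows = session.execute(
    'SELECT * FROM stock_entries WHERE "x" = %s', (1.2345,))

# Equality on x plus a range on the clustering column y: still a single
# partition, read as one contiguous slice.
rows = session.execute(
    'SELECT * FROM stock_entries WHERE "x" = %s AND "y" >= %s AND "y" < %s',
    (1.2345, 0.5, 0.9))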
CREATE TABLE pricing (
id blob,
price_tag text, // open, close, high, low, ...
time timestamp,
value float, // I would suggest blob with custom/thrift serialization
PRIMARY KEY (id, price_tag, time)
);
It will give very efficient queries for different price types over time.
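For instance (a sketch; the keyspace, id value and time bounds are placeholders):
from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("stocks")
stock_id = b"AAPL"  # id is a blob in the schema above

# One price type over a time range: the whole slice lives in a single
# partition, ordered by (price_tag, time).
rows = session.execute(
    "SELECT time, value FROM pricing "
    "WHERE id = %s AND price_tag = %s AND time >= %s AND time < %s",
    (stock_id, "close", datetime(2013, 1, 1), datetime(2013, 2, 1)))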
You can find more in this great presentation: http://www.slideshare.net/carlyeks/nyc-big-tech-day-2013?ref=http://techblog.bluemountaincapital.com/
