Spark SQL and detecting anomalies in time-series data - apache-spark

I have some time-series data, and let's say the model looks like this:
data {
    date: Date,
    greenPoint: Double,
    redPoint: Double,
    anomalyDetected: Boolean
}
Let's say that if the red point is under a specific threshold, I want to mark it as an anomaly.
I can easily load this data set (I'm using Spark SQL with Elasticsearch, so to be more specific I have a DataFrame) and mark some of the data as anomalous, but how can I group the anomalies after that, or at least count how many were detected?
If I could somehow add a unique ID to each anomaly (generated whenever the anomaly flag changes), then I could easily group them even in an Elasticsearch query.
It would also be great to have the anomaly start date and end date in the grouped data.
Basically, my goal in the example below is to detect five different anomalies (marked in red) and either mark the original data so that the anomalies can easily be grouped and displayed in a UI, or produce a list with five elements (the five anomalies, with some additional fields like start and end date, etc.).
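One common way to do this is with Spark window functions: flag the rows where the anomaly flag flips, take a running sum of that flag to get a run ID, and then group by the run ID. A minimal sketch in Scala, assuming the DataFrame is called df and matches the model above (variable and column names are illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Note: a window with no partitionBy pulls all rows into a single partition,
// which is fine for modest data sizes but will not scale indefinitely.
val w = Window.orderBy("date")

// 1 whenever the anomaly flag differs from the previous row (or there is no previous row).
val prevFlag = lag("anomalyDetected", 1).over(w)
val withChange = df.withColumn(
  "flagChanged",
  when(prevFlag.isNull || prevFlag =!= col("anomalyDetected"), 1).otherwise(0))

// A running sum of the change flag gives every contiguous run its own ID.
val withRunId = withChange.withColumn("runId", sum("flagChanged").over(w))

// Keep only anomalous runs and summarise each one with start/end dates and a point count.
val anomalies = withRunId
  .filter(col("anomalyDetected"))
  .groupBy("runId")
  .agg(min("date").as("startDate"), max("date").as("endDate"), count(lit(1)).as("points"))

The runId column can also be written back to Elasticsearch alongside the original rows, which makes grouping on the Elasticsearch side straightforward.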

Related

Apache Sedona - Query a Non-Rectangular Raster Mask Using Spark on Databricks

I'm referring to the manual on how to query spatial data, but in my scenario, there is a pixel-wise raster dataset of climate data over 8000+ days.
In simple words, the query could look something like this:
select temperature data for 5000 days for the defined amorphous region and aggregate the values using spark
My questions are:
Should all the pixels be converted into vector-like "points" to be usable in masking operations?
Can a third dimension (time) somehow be added to the spatial index, or should the date just be a regular column?
Grateful for any input!
(A visual representation of the region was attached to the original question.)
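One way to approach this, sketched below under the assumption of Sedona 1.4+ with the Sedona jars available on the cluster: represent each pixel centre as a point, keep the date as an ordinary column, and let a spatial predicate do the masking. The paths, view names, date range and column names are all illustrative.

import org.apache.sedona.spark.SedonaContext

// Register Sedona's SQL functions on an existing SparkSession (spark is predefined on Databricks).
val sedona = SedonaContext.create(spark)

// pixels: one row per pixel per day, with the pixel centre as lon/lat.
// Illustrative schema: (lon double, lat double, date date, temperature double)
sedona.read.parquet("/path/to/pixels").createOrReplaceTempView("pixels")
// mask: a single row holding the amorphous region as WKT.
sedona.read.parquet("/path/to/mask").createOrReplaceTempView("mask")

val result = sedona.sql("""
  SELECT p.date, avg(p.temperature) AS mean_temp
  FROM pixels p, mask m
  WHERE ST_Contains(ST_GeomFromWKT(m.wkt), ST_Point(p.lon, p.lat))
    AND p.date BETWEEN DATE'2000-01-01' AND DATE'2013-09-09'
  GROUP BY p.date
""")

In this shape the time dimension stays out of the spatial index entirely; partitioning or clustering the stored data by date is usually enough to keep the per-day scans cheap.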

Spark moving average

I am trying to implement a moving average for a dataset containing a number of time series. Each column represents one parameter being measured, while each row contains all parameters measured in a given second. So a row would look something like:
timestamp, parameter1, parameter2, ..., parameterN
I found a way to do something like that using window functions, but the following bugs me:
Partitioning Specification: controls which rows will be in the same partition with the given row. Also, the user might want to make sure all rows having the same value for the category column are collected to the same machine before ordering and calculating the frame. If no partitioning specification is given, then all data must be collected to a single machine.
The thing is, I don't have anything to partition by. So can I use this method to calculate the moving average without the risk of collecting all the data on a single machine? If not, what is a better way to do it?
Every nontrivial Spark job demands partitioning. There is just no way around it if you want your jobs to finish before the apocalypse. The question is simple: When it comes time to do the inevitable aggregation (in your case, an average), how can you partition your data in such a way as to minimize shuffle by grouping as much related data as possible on the same machine?
My experience with moving averages is with stocks. In that case it's easy; the partition would be on the stock ticker symbol. After all, the calculation of the 50-day moving average for Stock A has nothing to do with that for Stock B, so those data don't need to be on the same machine. The obvious partition makes this simpler than your situation, not to mention that it only requires one data point (probably) per day (the closing price of the stock at the end of trading) while you have one per second.
So I can only say that you need to consider adding a feature to your data set whose sole purpose is to serve as a partition key, even if it is irrelevant to what you're measuring. I would be surprised if there isn't one, but if not, then consider a time-based partition, on days for example.
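As a sketch of what that time-based partition might look like with DataFrame window functions (column names and the 60-second window length are illustrative); note that a window partitioned by day cannot see the previous day's rows, so the first seconds of each day average over fewer points:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Illustrative: df has a timestamp column plus parameter1 ... parameterN, one row per second.
val withDay = df.withColumn("day", to_date(col("timestamp")))

// 60-second trailing window, partitioned by day so no single executor has to hold everything.
val w = Window.partitionBy("day").orderBy("timestamp").rowsBetween(-59, 0)

val smoothed = withDay
  .withColumn("parameter1_ma", avg("parameter1").over(w))
  .withColumn("parameter2_ma", avg("parameter2").over(w))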

Timeseries with Spark/Cassandra - How to find timestamps when values satisfy a condition?

I have time series stored in a Cassandra table, coming from several sensors. Here is the schema I use for storing data:
CREATE TABLE data_sensors (
    sensor_id int,
    time timestamp,
    value float,
    PRIMARY KEY ((sensor_id), time)
);
Values can be temperature or pressure, for instance, depending on the sensor they come from.
My objective is to be able to find basic statistics (min, max, avg, std) on pressure, but only when temperature is higher than a certain value.
(A schema of the whole process I'd like to achieve was attached to the original question.)
I think it could be better if I changed the Cassandra model, at least for temperature data, to be able to filter on value. Is there another way, after importing data into a Spark RDD, to avoid altering the Cassandra table?
Then, once filtering on temperature is done, how do I get the sequence of timestamps I have to use to filter the pressure data? Please note that I don't necessarily have the same timestamps for temperature and pressure, which is why I think I need periods of time instead of a list of precise timestamps.
Thanks for your help!
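For the Spark side of this, one possible sketch: bucket both series into fixed periods so the temperature and pressure timestamps do not need to line up, keep only the periods where the temperature exceeds the threshold, and join the pressure readings against those periods. The DataFrame names, the 50-degree threshold and the 1-minute period are illustrative, and the DataFrames are assumed to have been loaded from Cassandra already (e.g. via the Spark Cassandra connector).

import org.apache.spark.sql.functions._

// temperature, pressure: DataFrames with columns (sensor_id, time, value).

// Periods (1-minute buckets) during which the temperature was above the threshold.
val hotPeriods = temperature
  .filter(col("value") > 50.0)
  .select(window(col("time"), "1 minute").as("period"))
  .distinct()

// Pressure statistics restricted to those periods.
val stats = pressure
  .withColumn("period", window(col("time"), "1 minute"))
  .join(hotPeriods, "period")
  .agg(min("value"), max("value"), avg("value"), stddev("value"))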
It's not really a Cassandra-specific answer, but maybe you want to look at time-series databases that provide a SQL layer on top of NoSQL stores, with support for JOINs and aggregations.
Here's an example of ATSD SQL syntax that supports period aggregations and joins.
SELECT t1.entity, t1.datetime, min(t1.value), max(t1.value), avg(t2.value)
FROM mpstat.cpu_busy t1
JOIN meminfo.memfree t2
WHERE t1.datetime >= '2016-09-20T15:00:00Z' AND t1.datetime < '2016-09-20T15:15:00Z'
GROUP BY entity, t1.PERIOD(1 MINUTE)
HAVING max(t1.value) > 30
The query joins two metrics, filters out the 1-minute periods where the first metric was below the threshold, and then returns a set of statistics for the second series.
If the two series are unevenly spaced, you can regularize the array using linear interpolation.
Disclosure: I work for Axibase that develops ATSD.

Excel Data Model - Creating a many-to-one summary table by mapping table and column names

I am trying to create a summary calculation on a set of tables which I have added to an Excel data model. I have 5 tables with the following columns:
Datetime (UTC)
Measured Data 1
Simulated Data 1
Measured Data 2
Simulated Data 2
etc.
I have created a master Date-time table which links these 5 tables together on their UTC date-time column.
My query is how to optimally create calculated fields on each of these tables without needing to explicitly specify the calculations on each individual table, as is the case with PivotTables (I need to select the specific measured and simulated data columns from one individual table in the data model). Instead I would like to be able to map all measured fields to one column and all simulated fields to another, and then use filters to select out the fields (or groups of fields) I want to compare.
I started by creating a summary table which lists all my tables in my data model by name along with the names of measured and simulated columns within each. However, I'm unsure what the next step should be. This seems like a pretty straightforward problem, but it has me stumped this morning. I'd appreciate any help. Let me know if I haven't fully explained anything.

Another Cassandra Data Modelling Approach

I have read various sources on this subject and understand the idea of modelling around the queries needed, but I wondered how far this can be stretched with Cassandra.
I need to store processing events which contain both measures and dimension data, in traditional data-warehouse terms.
The format of the data is something like
log_timestamp (timestamp) : user_id (text) : measure_1 (num) : measure_2 (num) : measure_3 (num) : dim_1 (text) : dim_2 (text) : ... : dim_n (text)
where there may be 10 or more dim data items.
The queries I would like to model include :
user_id by time (minute/hour/day/week/month/year) with measure aggregates
user_id by single dim by time with measure aggregates
single dim by time with measure aggregates
Some of the dimension fields form a natural hierarchy, so I would like the above queries with more than one dim field as well.
Before embarking on the creation of a large number of discrete column families to try to cover the permutations, I would like to know if anyone can recommend a better approach,
e.g. using a single column family for the dim data, with one column identifying the type of dim and another its value, and a similar idea for the hierarchy data with the hierarchy type and the member dims and values.
Alternatively, what might be a good model for storing the data at a relatively granular level, such that it could be read back out into an aggregation tool, e.g. Hive or Spark (which looks really quite interesting)?
Thanks.
Suppose you want to be able to query for aggregated data by week. You could use the following data structure.
Column Family = day
    Row Key: Date = day_identifier (e.g., time at beginning of some day this week)
    Column Name: Date = timestamp, Long = field_ordinal
    Column Value: field value

Column Family = week
    Row Key: Date = week_identifier (e.g., time at beginning of first day of a week)
    Column Name: Date = timestamp, Long = field_ordinal
    Column Value: field value
At the end of each week, you'd take the entries in the day column family and aggregate them into an entry in the week column family. Then you can remove the data per day if it's no longer that useful to you.
This concept allows you to store much less data, but you can still accomplish a lot. For instance, if you want to query for data aggregated over a month, you'd just access all the weeks in that month. Alternatively, you could use the same concept to roll up data for an entire month as well.
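As a rough sketch of that weekly rollup step done with Spark and the Spark Cassandra connector (the keyspace, table and column names are illustrative, and the right aggregate depends on the measure behind each field ordinal):

import org.apache.spark.sql.functions._

// Read the per-day entries accumulated during the week.
val days = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "metrics", "table" -> "day"))
  .load()

// Collapse each field to one value per week.
val weeks = days
  .withColumn("week_identifier", date_trunc("week", col("timestamp")))
  .groupBy("week_identifier", "field_ordinal")
  .agg(sum("field_value").as("field_value"))  // or avg/min/max, depending on the measure

// Append the rolled-up rows to the week table.
weeks.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "metrics", "table" -> "week"))
  .mode("append")
  .save()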
Good luck.

Resources