Real-time analytics time series database - Cassandra

I'm looking for a distributed time series database that is free to use in a clustered setup, production ready, and fits well into the Hadoop ecosystem.
I have an IoT project with roughly 150k sensors that send data every 10 minutes or every hour, so I'm looking for a time series database that offers useful functions such as aggregating metrics, downsampling, and pre-aggregation (roll-ups). I found a comparison in this Google spreadsheet: time series database comparison.
I have tested OpenTSDB; its HBase row key data model really suits my use case, but the functions that would still need to be developed for my use case are:
aggregating multiple metrics
doing roll-ups
I have also tested KairosDB, which is a fork of OpenTSDB with a richer API that uses Cassandra as its backend storage. Its API does everything I'm looking for: downsampling, roll-ups, querying multiple metrics, and a lot more.
I have also tested Warp10.io and Apache Phoenix, which I read here (Hortonworks link) will be used by Ambari Metrics, so I assume it is well suited for time series data too.
My question now is: what is the best time series database for real-time analytics with request latency under 1 s for all types of requests? Example: we want the average of the aggregated data sent by 50 sensors over a period of 5 years, resampled by month.
I assume such requests can't be answered in under 1 s without some roll-up/pre-aggregation mechanism, but I'm not sure, because there are a lot of tools out there and I can't decide which one suits my needs best.

I'm the lead for Warp 10 so my answer can be considered opinionated.
Given your projected data volume (150k sensors sending data every 10 minutes), that is a mean of 250 datapoints per second and fewer than 40 billion datapoints over a period of 5 years. Such a volume can easily fit on a single standalone Warp 10 instance, and if you later need a larger infrastructure you can migrate to a distributed Warp 10 deployment based on Hadoop.
In terms of requests, if your data is already resampled, fetching 5 years of monthly data for 50 sensors is only 3,000 datapoints; Warp 10 can do that in far less than 1 s, and doing the automatic roll-ups is just a matter of scheduling WarpScript code to run monthly, nothing fancy.
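To sanity-check those numbers, here is the back-of-the-envelope arithmetic spelled out in Python; only the figures already quoted above are used:

```python
sensors = 150_000
interval_s = 10 * 60                                   # one datapoint per sensor every 10 minutes

points_per_second = sensors / interval_s               # 250 datapoints/s
points_5_years = points_per_second * 86_400 * 365 * 5  # ~39.4 billion datapoints over 5 years

# The example query: 50 sensors, monthly roll-ups over 5 years
query_points = 50 * 12 * 5                             # 3,000 datapoints to fetch

print(points_per_second, f"{points_5_years:.3g}", query_points)
```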
Lastly, in terms of integration with the Hadoop ecosystem, Warp 10 is on top of things, with integration of the WarpScript language in Pig, Spark, Flink and Storm. With the Warp10InputFormat you can fetch data from a Warp 10 platform, or you can load data using any other InputFormat and then manipulate it with WarpScript.

At OVH we are heavy users of #OvhMetrics, which relies on Warp 10/HBase, and we provide a protocol abstraction over OpenTSDB/WarpScript/PromQL/...
I have no vested interest in Warp 10, but it has been a great success for us, both on the scaling challenge and for the use cases that WarpScript can cover.
Most of the time we don't even leverage the Hadoop/Flink integration, because our customers' needs are easily addressed with the real-time WarpScript API.

For real-time analytics you can try Druid, an open source project maintained by Apache, or you can also check out databases specialized for IoT such as GridDB and CrateDB. The best way is to test these databases yourself and see if they suit your needs. You can also connect these databases as a sink to Kafka.
When you are dealing with an IoT project, you need to forecast whether you will have to maintain a large data set in the future or whether you are happy with downsampled data. Some TSDBs have good compression, like InfluxDB, but others may not be scalable beyond tens of terabytes, so if you think you need to scale big, also look for one with a scale-out architecture.

Related

Data Factory V2 Copy Data activities and Data Flow ETL

My question is about the Data Factory V2 Copy Data activity; I have 5 questions.
Question 1
Should I use a Parquet file or a SQL Server staging table with 500 DTU? I want to transfer the data quickly into the staging table or the staging Parquet file.
Question 2
For the Copy Data activity, should I use Auto or 32 Data Integration Units?
Question 3
What is the benefit of the degree of copy parallelism, and should I use Auto or 32? Again, I want to transfer everything as quickly as possible; I have around 50 million rows every day.
Question 4
For the Data Flow integration runtime, should I use General Purpose, Compute Optimized, or Memory Optimized? As mentioned, we have 50 million rows every day, so we want to process the data as quickly as possible and, if we can, somewhat cheaply in Data Flow.
Question 5
Is a bulk insert better in the Data Factory and Data Flow sink?
I think you have too many questions about too many topics, the answers to which will depend entirely on your desired end result. Even so, I will do my best to briefly address your situation.
If you are dealing with large volume and/or frequency, Data Flow (ADFDF) would probably be better than the Copy activity. ADFDF runs on Spark via Databricks and is built from the ground up to run parallel workloads. Parquet is also built to support parallel workloads. If your SQL target is an Azure Synapse (SQL DW) instance, then ADFDF will use PolyBase to manage the upload, which is very fast because it is also built for parallel workloads. I'm not sure how this differs for Azure SQL, and there is no way to tell you which DTU level will work best for your task.
If having Parquet as your end result is acceptable, then that would probably be the easiest and least expensive option to configure, since it is just blob storage. ADFDF works just fine with Parquet, as either Source or Sink. For ETL workloads, Compute Optimized is the most likely IR configuration. The good news is it is the least expensive of the three; the bad news is that I have no way to know what the core count should be, so you'll just have to find out through trial and error. 50 million rows may sound like a lot, but it really depends on the row size (byte count and column count) and frequency. If the process runs many times a day, then you can include a "Time to live" value in the IR configuration (see the sketch below). This keeps the cluster warm while it waits for another job, potentially reducing startup time (but incurring more run-time cost).
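To make the "Time to live" remark concrete, here is a rough sketch of the data flow section of an Azure integration runtime definition, written as a Python dict that mirrors the ARM/REST shape as I recall it; treat the property names and values as assumptions to verify against the current Azure Data Factory reference:

```python
# Hypothetical Azure IR definition with Data Flow compute settings (sketch only).
integration_runtime = {
    "name": "DataFlowIR",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "ComputeOptimized",  # or "General" / "MemoryOptimized"
                    "coreCount": 16,                    # placeholder; tune by trial and error
                    "timeToLive": 15,                   # minutes to keep the cluster warm between runs
                },
            }
        }
    }
}
```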

Transparent Streaming & Batch processing

I'm still quite new to the world of stream and batch processing and trying to understand the concepts and terminology. It is admittedly very possible that the answer to my question is well known, easy to find, or even answered a hundred times here on SO, but I was not able to find it.
The background:
I am working on a big scientific project (nuclear fusion research), and we are producing tons of measurement data during experiment runs. Those data are mostly streams of samples tagged with a nanosecond timestamp, where a sample can be anything from a single ADC value, via an array of such values, via deeply structured data (with up to hundreds of entries ranging from 1-bit booleans to 64-bit double-precision floats), to raw HD video frames or even plain text messages. If I understand the common terminology correctly, I would regard our data as "tabular data" for the most part.
We are working mostly with self-made software solutions, from data acquisition over simple online (streaming) analysis (like scaling, subsampling and such) to our own data storage, management and access facilities.
In view of the scale of the operation and the effort for maintaining all those implementations, we are investigating the possibilities to use standard frameworks and tools for more of our tasks.
My question:
In particular, at this stage we are facing the need for more and more sophisticated (automated and manual) data analytics on live/online/real-time data, as well as "after the fact" offline/batch analytics of "historic" data. In this endeavor, I am trying to understand if and how existing analytics frameworks like Spark, Flink, Storm etc. (possibly supported by message queues like Kafka, Pulsar, ...) can support a scenario where
data is flowing/streamed into the platform/framework, with an identifier (like a URL or an ID) attached
the platform interacts with integrated or external storage to persist the streaming data (for years), associated with the identifier
analytics processes can now transparently query/analyse data addressed by an identifier and an arbitrary (open or closed) time window, and the framework supplies data batches/samples for the analysis either from backend storage or coming in live from data acquisition
Simply streaming the online data into storage and querying from there is not an option, as we need both raw and analysed data for live monitoring and real-time feedback control of the experiment.
Also, making users query a live input signal and a historic batch from storage in different ways would not be ideal, as our physicists are mostly not data scientists; we would like to keep such "technicalities" away from them, and ideally the exact same algorithms should be used for analysing new real-time data and old stored data from previous experiments.
Side notes:
we are talking about peak data loads in the range of tens of gigabits per second, coming in bursts of increasing length from seconds up to minutes - could this be handled by the candidates?
we are using timestamps at nanosecond resolution, and are even thinking about picoseconds - this poses some limitations on the list of possible candidates, if I understand correctly?
I would be very grateful if anyone could make sense of my question and shed some light on the topic for me :-)
Many Thanks and kind regards,
Beppo
I don't think anyone can say "yes, framework X can definitely handle your workload", because it depends a lot on what you need out of your message processing, e.g. regarding messaging reliability, and how your data streams can be partitioned.
You may be interested in the paper Benchmarking Distributed Stream Processing Engines. It uses versions of Storm/Flink/Spark that are a few years old (they look to have been released in 2016), but maybe the authors would be willing to let you use their benchmark to evaluate newer versions of the three frameworks?
A very common setup for streaming analytics is to go data source -> Kafka/Pulsar -> analytics framework -> long term data store. This decouples processing from data ingest, and lets you do stuff like reprocessing historical data as if it were new.
I think the first step for you should be to see if you can get the data volume you need through Kafka/Pulsar. Either generate a test set manually, or grab some data you think could be representative from your production environment, and see if you can put it through Kafka/Pulsar at the throughput/latency you need.
Remember to consider partitioning of your data. If some of your data streams could be processed independently (i.e. ordering doesn't matter), you should not be putting them in the same partitions. For example, there is probably no reason to mix sensor measurements and the video feed streams. If you can separate your data into independent streams, you are less likely to run into bottlenecks both in Kafka/Pulsar and the analytics framework. Separate data streams would also allow you to parallelize processing in the analytics framework much better, as you could run e.g. video feed and sensor processing on different machines.
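As a minimal illustration of key-based partitioning, here is a sketch using the kafka-python client; the topic name, key scheme, and payload layout are made up for the example:

```python
import json
from kafka import KafkaProducer

# Messages with the same key always land in the same partition, so keying by
# sensor/stream ID keeps each stream ordered while letting independent streams
# spread across partitions (and video frames can go to a separate topic).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_sample(sensor_id: str, ts_ns: int, value: float) -> None:
    producer.send(
        "sensor-measurements",
        key=sensor_id,
        value={"sensor_id": sensor_id, "ts_ns": ts_ns, "value": value},
    )

publish_sample("adc-042", 1_650_000_000_123_456_789, 3.14)
producer.flush()
```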
Once you know whether you can get enough throughput through Kafka/Pulsar, you should write a small example for each of the 3 frameworks. To start, I would just receive and drop the data from Kafka/Pulsar, which should let you know early whether there's a bottleneck in the Kafka/Pulsar -> analytics path. After that, you can extend the example to do something interesting with the example data, e.g. do a bit of processing like what you might want to do in production.
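A minimal "receive and drop" consumer for that first test could look like the following kafka-python sketch (broker address and topic name are assumptions):

```python
import time
from kafka import KafkaConsumer

# Read messages as fast as possible and throw them away, printing the rate.
# This isolates the Kafka -> client path from any analytics framework overhead.
consumer = KafkaConsumer(
    "sensor-measurements",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    enable_auto_commit=False,
)

count = 0
bytes_read = 0
start = time.time()
for msg in consumer:
    count += 1
    bytes_read += len(msg.value)
    if count % 100_000 == 0:
        elapsed = time.time() - start
        print(f"{count / elapsed:,.0f} msg/s, {bytes_read / elapsed / 1e6:,.1f} MB/s")
```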
You also need to consider which kinds of processing guarantees you need for your data streams. Generally you will pay a performance penalty for guaranteeing at-least-once or exactly-once processing. For some types of data (e.g. the video feed), it might be okay to occasionally lose messages. Once you decide on a needed guarantee, you can configure the analytics frameworks appropriately (e.g. disable acking in Storm), and try benchmarking on your test data.
Just to answer some of your questions more explicitly:
The live data analysis/monitoring use case sounds like it fits the Storm/Flink systems fairly well. Hooking it up to Kafka/Pulsar directly, and then doing whatever analytics you need sounds like it could work for you.
Reprocessing of historical data is going to depend on what kind of queries you need to do. If you simply need a time interval + ID, you can likely do that with Kafka plus a filter or appropriate partitioning. Kafka lets you start processing at a specific timestamp, and if your data is partitioned by ID, or you filter it as the first step in your analytics, you could start at the provided timestamp and stop processing when you hit a message outside the time window. This only applies if the timestamp you're interested in is the time the message was added to Kafka, though. I also don't believe Kafka supports below-millisecond resolution on the timestamps it generates.
If you need to do more advanced queries (e.g. you need to look at timestamps generated by your sensors), you could look at using Cassandra or Elasticsearch or Solr as your permanent data store. You will also want to investigate how to get the data from those systems back into your analytics system. For example, I believe Spark ships with a connector for reading from Elasticsearch, while Elasticsearch provides a connector for Storm. You should check whether such a connector exists for your data store/analytics system combination, or be willing to write your own.
Edit: Elaborating to answer your comment.
I was not aware that Kafka or Pulsar supported timestamps specified by the user, but sure enough, they both do. I don't see that Pulsar supports sub-millisecond timestamps though?
The idea you describe can definitely be supported by Kafka.
What you need is the ability to start a Kafka/Pulsar client at a specific timestamp, and read forward. Pulsar doesn't seem to support this yet, but Kafka does.
You need to guarantee that when you write data into a partition, they arrive in order of timestamp. This means that you are not allowed to e.g. write first message 1 with timestamp 10, and then message 2 with timestamp 5.
If you can make sure you write messages in order to Kafka, the example you describe will work. Then you can say "Start at timestamp 'last night at midnight'", and Kafka will start there. As live data comes in, it will receive it and add it to the end of its log. When the consumer/analytics framework has read all the data from last midnight to current time, it will start waiting for new (live) data to arrive, and process it as it comes in. You can then write custom code in your analytics framework to make sure it stops processing when it reaches the first message with timestamp 'tomorrow night'.
With regard to support of sub-millisecond timestamps, I don't think Kafka or Pulsar will support it out of the box, but you can work around it reasonably easily. Just put the sub-millisecond timestamp in the message as a custom field. When you want to start at e.g. timestamp 9ms 10ns, you ask Kafka to start at 9ms, and use a filter in the analytics framework to drop all messages between 9ms and 9ms 10ns.
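To make that workaround concrete, here is a hedged kafka-python sketch that seeks to a millisecond timestamp with offsets_for_times and then filters on a nanosecond timestamp carried inside the message payload; the topic, partition, and field names are assumptions:

```python
import json
from kafka import KafkaConsumer, TopicPartition

start_ns = 9 * 1_000_000 + 10          # 9 ms + 10 ns, as in the example above
end_ns = start_ns + 3_600_000_000_000  # close the window one hour later

consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         enable_auto_commit=False)
tp = TopicPartition("sensor-measurements", 0)
consumer.assign([tp])

# Kafka resolves a millisecond timestamp to the earliest offset at or after it.
offsets = consumer.offsets_for_times({tp: start_ns // 1_000_000})
if offsets[tp] is not None:
    consumer.seek(tp, offsets[tp].offset)

for msg in consumer:
    sample = json.loads(msg.value)
    if sample["ts_ns"] < start_ns:
        continue                       # drop the sub-millisecond remainder
    if sample["ts_ns"] >= end_ns:
        break                          # reached the end of the closed time window
    # ... hand the sample to the analytics code ...
```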
Allow me to add the following suggestions on how Apache Pulsar might help address some of your requirements. Food for thought as it were.
"data is flowing/streamed into the platform/framework, attached an identifier like a URL or an ID or such"
You might want to look at Pulsar Functions, which allow you to write simple functions (in Java or Python) that get executed on each individual message published to a topic. They are ideal for this type of data augmentation use case.
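For instance, a Pulsar Function in Python that tags every incoming message with an identifier might look roughly like this; the JSON payload, field names, and ID scheme are assumptions, and the returned value is published to the function's configured output topic:

```python
import json
from pulsar import Function

class AttachIdentifier(Function):
    """Augment each raw sample with a stable identifier before it is
    re-published to the function's output topic (sketch only)."""

    def process(self, input, context):
        sample = json.loads(input)
        # Hypothetical scheme: build a URL-style ID from fields already present.
        sample["id"] = f"fusion://experiment/{sample['shot']}/{sample['channel']}"
        return json.dumps(sample)
```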
the platform interacts with integrated or external storage to persist the streaming data (for years), associated with the identifier
Pulsar recently added tiered storage, which allows you to retain event streams in S3, Azure Blob Storage, or Google Cloud Storage. This would allow you to keep the data for years in a cheap and reliable data store.
analytics processes can now transparently query/analyse data addressed by an identifier and an arbitrary (open or closed) time window, and the framework supplies data batches/samples for the analysis either from backend storage or coming in live from data acquisition
Apache Pulsar has also added integration with the Presto query engine, which would allow you to query the data over a given time period (including data from tiered-storage) and place it into a topic for processing.
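A rough sketch of such a query from Python, using the presto-python-client package; the port, catalog layout, topic/table name, and the __publish_time__ column should all be treated as assumptions to verify against the Pulsar SQL documentation for your version:

```python
import prestodb  # pip install presto-python-client

# Pulsar SQL exposes topics as tables; names below are illustrative only.
conn = prestodb.dbapi.connect(
    host="localhost", port=8081, user="analytics",
    catalog="pulsar", schema="public/default",
)
cur = conn.cursor()
cur.execute("""
    SELECT channel, avg(value) AS mean_value
    FROM "sensor-measurements"
    WHERE __publish_time__ BETWEEN timestamp '2019-06-01 00:00:00'
                               AND timestamp '2019-06-02 00:00:00'
    GROUP BY channel
""")
for row in cur.fetchall():
    print(row)
```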

Which time series database to select for a large number of records?

I'm in a scenario where I have about 100,000 input records per second to store. The records are time series data.
I need to continuously run aggregations, other analytics, and also some machine learning algorithms over the data. Performance is the key factor here, as I'm looking for near real-time results.
What would you recommend as database engine?
Take a look at ClickHouse analytical database. It can accept millions of rows per second. It can scan billions of rows per second on a single computer. It scales horizontally to multiple nodes. It fits time series workloads.
If you still need a time series database, then try VictoriaMetrics. It is built on ClickHouse ideas, so it is fast and resource-efficient.
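To give an idea of what a time series workload looks like in ClickHouse, here is a small sketch using the clickhouse-driver Python package; the table layout and names are illustrative only:

```python
from datetime import datetime
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client("localhost")

# MergeTree ordered by (sensor_id, ts) keeps per-sensor data contiguous,
# which is what makes range scans and time-window aggregations fast.
client.execute("""
    CREATE TABLE IF NOT EXISTS sensor_data (
        sensor_id String,
        ts        DateTime,
        value     Float64
    ) ENGINE = MergeTree()
    PARTITION BY toYYYYMM(ts)
    ORDER BY (sensor_id, ts)
""")

# Batched inserts are how ClickHouse likes to ingest high record rates.
client.execute(
    "INSERT INTO sensor_data (sensor_id, ts, value) VALUES",
    [("s-001", datetime(2019, 6, 1, 12, 0, 0), 42.0)],
)

# One-minute averages per sensor.
rows = client.execute("""
    SELECT sensor_id, toStartOfMinute(ts) AS minute, avg(value)
    FROM sensor_data
    GROUP BY sensor_id, minute
    ORDER BY sensor_id, minute
""")
```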
I am adding my own solution...
ClickHouse is definitely a killer option. But for a new project I am now evaluating the open source GPU database OmniSci. Its open source version is limited to a single GPU node (up to 16 GPU devices; with OEM Tesla cards offering 64 GB per device you can get 1 TB of VRAM, though of course not as cheap as ClickHouse). It is simply a SQL database on steroids (a JDBC driver exists) with a Kafka data source.
OmniSci also has a cross-dashboarding solution, which requires a license, but you can have real-time dashboarding over, say, 20-50 billion time series records (8-16 GPUs) and multi-dashboard real-time analytics without any kind of pre-aggregation required, etc.
But it will cost money...
If you want to go purely open source, my second candidate is NVIDIA's RAPIDS framework, which implements cuDF (a CUDA DataFrame, a Spark-like data structure); you could use it to keep your data window (append new, delete obsolete), together with the cuxfilter solution, which is similar to OmniSci but more of a framework; with a skilled frontend coder you can achieve something very similar to OmniSci.
Of course, you can also implement your own solution on top of Cassandra with an appropriate data model for your use case. That may get you the best results, tailored to your needs.
You could look at KairosDB (https://kairosdb.github.io/), which is a time series database on top of Apache Cassandra; I got 50k writes per second on a medium-sized single (but bare metal) node.
It's quite well documented (https://kairosdb.github.io/docs/build/html/CassandraSchema.html) and it has aggregators out of the box (https://kairosdb.github.io/docs/build/html/restapi/QueryMetrics.html).
OpenTSDB was slower in my tests. Influx looks promising, but I have no experience with it myself: https://github.com/influxdata/influxdb
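For reference, the kind of aggregating query mentioned above looks roughly like this against KairosDB's REST API; the metric name, tags, and host are placeholders, and the exact schema is in the QueryMetrics docs linked above:

```python
import requests

# Monthly averages over the last 5 years for a couple of sensors,
# assuming datapoints are tagged with a "sensor_id" tag.
query = {
    "start_relative": {"value": 5, "unit": "years"},
    "metrics": [{
        "name": "sensor.temperature",
        "tags": {"sensor_id": ["s-001", "s-002"]},
        "aggregators": [{
            "name": "avg",
            "sampling": {"value": 1, "unit": "months"},
        }],
    }],
}

resp = requests.post("http://localhost:8080/api/v1/datapoints/query", json=query)
resp.raise_for_status()
print(resp.json()["queries"][0]["results"][0]["values"][:5])
```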

Hazelcast or AppScale to manage parallel computational tasks over a shared dataset

Starting out on a new project and looking for advice on a suitable platform. Current thinking is between Hazelcast or AppScale, given our team’s combined (but limited) experience covers an older version of Hazelcast and GAE. Both can also apparently be setup on EC2, which may be the easiest way to meet the CPU demand we expect.
Problem Profile
1). Our data consists of many small records stored by date (but not always time). Some are small numerical records (business stats; think daily weather info or stock market prices) and some are bulky text (log file entries). Data volumes are not huge: in the region of hundreds per day, between 1k and 50k each.
2). A very large number of instances of computationally expensive numerical models (think Monte Carlo simulations) operate constantly over fixed-size windows of the same data.
3). A number of monitoring agents make data available.
4). Larger (longer periods of time) sets of the same data to be processed offline once daily.
With Hazelcast we would add incoming data to maps and use the Executor service to run models over the shared data. Likely use of Tomcat to provide minimal front end access to the grid as required.
With AppScale we would add tables per data-type and use the Task Queues API to frame the numerical models. Servlets deployed to AppScale as per GAE to provide front end.
Question
Should we use AppScale or Hazelcast for requirements like this? That is - for the problem as stated, are there any stand-out factors for/against either platform that we should consider?
If you prefer/require a distributed, service-oriented programming model (bag of tasks) then the answer is AppScale. If you prefer/require a parallel programming model (single machine abstraction) then the answer is Hazelcast. AppScale is also a complete cloud platform (vs only a datastore) which enables you to do more things with your app as it evolves. If you go with AppScale, you can adjust the timing restriction on the tasks and customize the platform with the libraries you want to use, for your computationally expensive methods.

Hive/HDFS for real-time storage of sensor data on a massive scale?

I am evaluating sensor data collection systems with the following requirements,
1 million endpoints sending in 100 bytes of data every minute (as a time series).
Basically millions of small writes to the storage.
This data is write-once, so basically it never gets updated.
Access requirements
a. Full data for a user needs to be accessed periodically (less frequently).
b. Partial data for a user needs to be accessed periodically (more frequently). For example, I need sensor data collected over the last hour/day/week/month for analysis/reporting.
I have started looking at Hive/HDFS as an option. Can someone comment on the applicability of Hive in such a use case? I am concerned that, while it would meet the distributed storage needs, it seems more suited to data warehousing applications than to real-time data collection/storage.
Do HBase/Cassandra make more sense in this scenario?
I think HBase can be a good option for you. In fact, there's already an open source implementation on top of HBase that solves a similar problem and that you might want to use. Take a look at OpenTSDB. Here's a short excerpt from their blurb:
OpenTSDB is a distributed, scalable Time Series Database (TSDB)
written on top of HBase. OpenTSDB was written to address a common
need: store, index and serve metrics collected from computer systems
(network gear, operating systems, applications) at a large scale, and
make this data easily accessible and graphable. Thanks to HBase's
scalability, OpenTSDB allows you to collect many thousands of metrics
from thousands of hosts and applications, at a high rate (every few
seconds). OpenTSDB will never delete or downsample data and can easily
store billions of data points. As a matter of fact, StumbleUpon uses
it to keep track of hundred of thousands of time series and collects
over 600 million data points per day in their main production
datacenter.
There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would be like.
Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute.
Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a slice of time for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10 ms for a completely cold read. For the first type of query, you would simply run several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would look up all of the sensor IDs for a user with one query, and then your second query would fetch the data for all of those sensors (although you might break up this query if the number of sensors is high).
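A minimal sketch of that kind of data model with the DataStax Python driver; the keyspace, day-bucketing scheme, and column names are assumptions for illustration:

```python
from datetime import datetime
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("iot")  # keyspace assumed to already exist

# Partition by (sensor_id, day) so partitions stay bounded in size, while rows
# inside a partition are clustered by timestamp for cheap contiguous slices.
session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_data (
        sensor_id text,
        day       text,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

session.execute(
    "INSERT INTO sensor_data (sensor_id, day, ts, value) VALUES (%s, %s, %s, %s)",
    ("s-001", "2019-06-01", datetime(2019, 6, 1, 11, 30), 21.5),
)

# "Last hour for one sensor" is a contiguous slice of a single partition.
rows = session.execute(
    "SELECT ts, value FROM sensor_data "
    "WHERE sensor_id = %s AND day = %s AND ts >= %s",
    ("s-001", "2019-06-01", datetime(2019, 6, 1, 11, 0)),
)
for r in rows:
    print(r.ts, r.value)
```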
Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.
