Time-based distributed search

I have a stream of notifications arriving at a high rate (about 5 per ms). They are currently stored in a distributed NoSQL DB, indexed by a GUID attached to each notification. I want to make them searchable by time.
An example query would be: return all notifications between 5 and 6 pm from tomorrow.
To begin with, I would like to keep the data searchable for a week.
I want to know if there is a well-known solution for implementing such time-based search in distributed systems.
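A common pattern for this (a sketch, not any particular product's API) is to write each notification under a coarse time bucket in addition to its GUID, and then answer a range query by enumerating the bucket keys that cover the range. The one-minute granularity and the notif: key prefix below are arbitrary assumptions:

// Bucket notifications by minute so a time-range query maps to a known set of partition keys.
function minuteBucket(tsMillis) {
  return Math.floor(tsMillis / 60000);        // minutes since the Unix epoch
}

function bucketKeysForRange(fromMillis, toMillis) {
  const keys = [];
  for (let b = minuteBucket(fromMillis); b <= minuteBucket(toMillis); b++) {
    keys.push('notif:' + b);                  // e.g. "notif:28934220"
  }
  return keys;                                // a 5-6 pm query resolves to 60 bucket keys
}

// At ~5 notifications/ms a one-minute bucket holds ~300,000 entries; use second-level
// buckets if that is too large for one partition, and expire buckets older than a week
// (e.g. with a TTL) to keep only the searchable window.

This is essentially the time-bucketing approach discussed in the related questions below.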

Related

How to model time series data in cassandra when data has non-uniform generation rate?

I am planning to migrate data from my existing database (Postgres) to Cassandra. Here is a brief overview of the system:
Current data set size is around 2 billion events.
Each data point represents an event. Properties of this event are: user_id, event_name, timestamp.
This data is coming from a finite set of sources - for the sake of simplicity let's assume 3 different sources S1, S2, S3 - all of them pushing into a Kafka topic. The Cassandra microservice consumes data from this topic.
The rate of data coming from S1, S2 and S3 is different. Assume S1 pushes 1 event per user every minute, S2 pushes 1 event per user every 15 minutes, and S3 pushes 1 event per user every hour.
There are two types of queries this system should support
Get latest event for a given user
Get list of events for a given user and date range (this date range can span at most 30 days)
I am trying to model this data using a few different approaches.
Partition data for a single user into monthly buckets. For this, additional columns timestamp_year and timestamp_month are added, and timestamp is used as the clustering key (a sketch of this model is shown below).
Pros: less than 10 ms write latency. Max partition size is around ~60 MB (works well for Cassandra 3.11). Get latest event works in less than 10 ms (99.999th percentile).
Cons: getting month-level data is slow because too much data is read from a single partition. If I put a limit on the number of records fetched (let's say 10,000), the latency improves. Partition size is non-uniform because of the different data rates from the 3 sources.
I have tried using weekly buckets instead of monthly buckets, plus pagination, to improve the other parameters. But one thing I am not able to sort out: partition size is still non-uniform because of the different data rates from the 3 sources.
How can I keep partition sizes (almost) consistent in such a data model? Ideas are welcome.
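For illustration, a minimal sketch of the monthly-bucket model described above, using the Node.js cassandra-driver; the keyspace, table and column names are assumptions:

// Hypothetical CQL for the monthly-bucket model:
//   CREATE TABLE events_by_user_month (
//     user_id text, timestamp_year int, timestamp_month int,
//     event_name text, timestamp timestamp,
//     PRIMARY KEY ((user_id, timestamp_year, timestamp_month), timestamp)
//   ) WITH CLUSTERING ORDER BY (timestamp DESC);
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'events_ks'                                   // hypothetical keyspace
});

// "Get latest event for a given user": with DESC clustering, LIMIT 1 on the
// current month's partition returns the newest row.
async function latestEvent(userId) {
  const now = new Date();
  const result = await client.execute(
    'SELECT * FROM events_by_user_month WHERE user_id = ? AND timestamp_year = ? AND timestamp_month = ? LIMIT 1',
    [userId, now.getUTCFullYear(), now.getUTCMonth() + 1],
    { prepare: true });
  return result.first();                                  // null if this month has no events yet
}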
This is a classic problem and there is no easy way to make partition sizes uniform. If you can predict the rate of ingestion per user, you can probably put users into different buckets, such as high-, medium- and low-ingestion users.
Depending on the tier, the time bucket would differ: for a high-ingestion user a partition covers a day, while for a low-ingestion user a partition covers a month (see the sketch below).
To speed up your month query for a high-ingestion user, you can run 30 one-day queries in parallel and see if that improves your query time.
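A minimal sketch of that idea, assuming the per-user ingestion tier is already known (the tier names and bucket format are made up):

// High-rate users get daily buckets, low-rate users get monthly buckets,
// so partition sizes stay roughly comparable.
function partitionBucket(userId, ingestionTier, eventTime) {
  const d = new Date(eventTime);
  const ym = d.getUTCFullYear() + '-' + (d.getUTCMonth() + 1);
  if (ingestionTier === 'high') {
    // ~1 event/minute => a daily partition holds ~1,440 rows per user
    return { userId: userId, bucket: ym + '-' + d.getUTCDate() };
  }
  // ~1 event/15 min or ~1 event/hour => a monthly partition holds ~720-2,880 rows
  return { userId: userId, bucket: ym };
}

// A 30-day range query then fans out over at most ~31 daily partitions (high tier)
// or 1-2 monthly partitions (low tier) and merges the results client-side.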

node.js & MongoDB: when data is rapidly increasing, how to design a data archive?

I'm designing a platform where a large amount of data will be stored in the database every day, but this information is only actively used for three days to a week; after that it is only needed for occasional future queries.
This large amount of data will degrade the database's performance. My current thinking is to periodically move the data to another DB that will be used exclusively for queries.
How should I design this? Suppose I extract once a month: how do I get this month's data out of MongoDB and dump it into another repository?
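One straightforward shape for this is a scheduled job that copies a closed time window into the archive database and only then deletes it from the live database. A sketch with the Node.js MongoDB driver; the connection strings, database, collection and field names (createdAt) are assumptions:

const { MongoClient } = require('mongodb');

async function archiveMonth(year, month) {               // month: 1-12
  const start = new Date(Date.UTC(year, month - 1, 1));
  const end = new Date(Date.UTC(year, month, 1));        // exclusive upper bound
  const query = { createdAt: { $gte: start, $lt: end } };

  const live = new MongoClient('mongodb://localhost:27017');
  const archive = new MongoClient('mongodb://archive-host:27017');
  await live.connect();
  await archive.connect();
  try {
    const src = live.db('app').collection('events');
    const dst = archive.db('archive').collection('events');

    // Copy first, delete only after the copy has succeeded.
    const docs = await src.find(query).toArray();         // stream/batch this for very large months
    if (docs.length > 0) {
      await dst.insertMany(docs, { ordered: false });
      await src.deleteMany(query);
    }
  } finally {
    await live.close();
    await archive.close();
  }
}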

Real-time analytics time series database

I'm looking for a distributed time series database that is free to use in a clustered setup and production-ready, and it has to fit well into the Hadoop ecosystem.
I have an IoT project with roughly 150k sensors which send data every 10 minutes or every hour, so I'm looking at time series databases with useful functions like aggregating metrics, down-sampling and pre-aggregation (roll-ups). I have found a comparison in this Google spreadsheet: time series database comparative.
I have tested OpenTSDB; the HBase row key data model really suits my use case, but the functions that still need to be developed for my use case are:
aggregate multiple metrics
do rollups
I have also tested KairosDB, which is a fork of OpenTSDB with a richer API and uses Cassandra as its backend storage; their API does everything I'm looking for: downsampling, rollups, querying multiple metrics and a lot more.
I have tested Warp10.io and Apache Phoenix, which I read in a Hortonworks link will be used by Ambari Metrics, so I assume it's well suited for time series data too.
My question is: as of now, what's the best time series database for real-time analytics with request latency under 1 s for all types of requests? Example: we want the average of the aggregated data sent by 50 sensors over a period of 5 years, resampled by month.
I assume such requests can't be done in under 1 s, so I believe we need some rollup/pre-aggregation mechanism for them, but I'm not so sure because there are a lot of tools out there and I can't decide which one suits my needs best.
I'm the lead for Warp 10, so my answer can be considered opinionated.
Given your projected data volume (150k sensors sending data every 10 minutes), that is a mean of 250 datapoints per second and fewer than 40 billion datapoints over a period of 5 years. Such a volume can easily fit on a simple Warp 10 standalone instance, and if you later need a larger infrastructure you can migrate to a distributed Warp 10 based on Hadoop.
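For reference, the volume figures work out as follows (plain arithmetic):

// 150k sensors, one datapoint each per 10 minutes
const perSecond = 150000 / 600;                     // 250 datapoints/s
const fiveYears = perSecond * 3600 * 24 * 365 * 5;  // ~39.4 billion datapoints
console.log(perSecond, fiveYears);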
In terms of requests, if your data is already resampled, fetching 5 years of monthly data for 50 sensors is only 3,000 datapoints; Warp 10 can do that in far less than 1 s, and the automatic rollups are just a matter of scheduling WarpScript code to run monthly, nothing fancy.
Lastly, in terms of integration with the Hadoop ecosystem, Warp 10 is on top of things, with integration of the WarpScript language in Pig, Spark, Flink and Storm. With the Warp10InputFormat you can fetch data from a Warp 10 platform, or you can load data using any other InputFormat and then manipulate it with WarpScript.
At OVH we are heavy users of #OvhMetrics, which relies on Warp10/HBase, and we provide a protocol abstraction with OpenTSDB/WarpScript/PromQL/...
I have no stake in Warp 10, but it has been a great success for us, both on the scaling challenge and for the use cases that WarpScript can cover.
Most of the time we don't even leverage the Hadoop/Flink integration, because our customers' needs are easily addressed with the real-time WarpScript API.
For real-time analytics you can try Druid, an open-source project maintained by Apache, or you can check out databases specialized for IoT such as GridDB and CrateDB. The best way is to test these databases yourself and see if they suit your needs. You can also connect these databases as a sink to Kafka.
When you are dealing with an IoT project, you need to forecast whether you will have to maintain a large data set in the future or whether you are happy with downsampled data. Some TSDBs have good compression, like InfluxDB, but others may not be scalable beyond tens of terabytes, so if you think you need to scale big, also look for one with a scale-out architecture.

Ideal database for grouping data by timestamp

I'm in the process of testing some NoSQL solutions for handling some basic log analytics. I'm looking for something that is optimized for reads. The data has a timestamp and some other columns that I want to count and sum. I need the ability to group and sum by year, month, day, hour and the values of some of the other columns.
My data will likely grow beyond about 50 million records and will likely be served from a single server (no sharding or horizontal scaling required), but a RESTful API is handy for tying into other applications easily.
I'm currently trying out CouchDB, but would like to know if there's something better suited for this task.
I can probably improve this map function and the overall performance, but wanted to check some other options.
function (doc) {
  // Split the timestamp into its components: [year, month, day, hour, minute, ...].
  var ts = doc.timestamp.split(/[^A-Z0-9_]+/i);
  // Key: time components plus event type and name; value 1 so a reduce can count/sum per group.
  emit([ts[0], ts[1], ts[2], ts[3], ts[4], doc.eventtype, doc.name], 1);
}
I'm not using relational databases, because entries vary in the data they contain based on the event type, and I want to be able to handle the data dynamically rather than having to update the schema every time a new event type is logged.
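To actually group and sum by year/month/day/hour, the map function above needs a reduce function (the built-in _sum or _count works, since the map emits 1) and can then be queried with group_level. A sketch, assuming the view is saved as count_by_time in a design document _design/stats of a database named logs:

// Node.js (18+) sketch: group_level controls how many leading key parts are grouped.
// group_level=3 groups by [year, month, day]; 4 adds the hour.
const url = 'http://localhost:5984/logs/_design/stats/_view/count_by_time' +
            '?reduce=true&group_level=3';

fetch(url)
  .then(res => res.json())
  .then(body => {
    // Each row looks like { key: [year, month, day], value: <count> }.
    body.rows.forEach(row => console.log(row.key.join('-'), row.value));
  });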
Use a time series database, which is designed for exactly this kind of data persistence.

Design of Partitioning for Azure Table Storage

I have some software which collects data over a long period of time, approx. 200 readings per second. It uses an SQL database for this. I am looking to use Azure as the place to move a lot of my old "archived" data to.
The software uses a multi-tenant type architecture, so I am planning to use one Azure Table per Tenant. Each tenant is perhaps monitoring 10-20 different metrics, so I am planning to use the Metric ID (int) as the Partition Key.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
I am lacking a little understanding of how this will scale, however, so I was hoping somebody might be able to clear this up:
For performance, Azure will/might split my table by PartitionKey in order to keep things nice and quick. This would result in one partition per metric in this case.
However, my RowKey could potentially represent data over approx. 5 years, so I estimate approx. 2.5 million rows per partition.
Is Azure clever enough to then split based on RowKey as well, or am I designing in a future bottleneck? I know normally not to optimise prematurely, but with something like Azure that doesn't seem as sensible as usual!
I'm looking for an Azure expert to let me know whether I am on the right track or whether I should also be partitioning my data into more tables.
A few comments:
Apart from storing the data, you may also want to look into how you would want to retrieve the data as that may change your design considerably. Some of the questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or do I need to retrieve the data for all metrics for a particular date/time range? If this is the case then you're looking at a full table scan. Obviously you could avoid this by doing multiple queries (one query per PartitionKey).
Do I need to see the latest results first, or do I not really care? If it's the former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also, since PartitionKey is a string value, you may want to convert the int value to a string with some "0" pre-padding so that all your ids appear in order; otherwise you'll get 1, 10, 11, ..., 19, 2, ... etc.
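A quick sketch (Node.js, using BigInt because .NET ticks exceed JavaScript's safe integer range) of the two formatting tricks above; the 4-digit padding width for the metric id is an arbitrary choice:

// .NET ticks are 100 ns intervals since 0001-01-01.
const EPOCH_OFFSET_TICKS = 621355968000000000n;   // ticks at the Unix epoch (1970-01-01)
const MAX_TICKS = 3155378975999999999n;           // DateTime.MaxValue.Ticks

function utcNowTicks() {
  return BigInt(Date.now()) * 10000n + EPOCH_OFFSET_TICKS;
}

// RowKey: reversed ticks, zero-padded to 19 digits, so the newest entity sorts first.
function rowKey(ticks) {
  return (MAX_TICKS - ticks).toString().padStart(19, '0');
}

// PartitionKey: zero-pad the metric id so "2" does not sort after "19".
function partitionKey(metricId) {
  return String(metricId).padStart(4, '0');
}

console.log(partitionKey(7), rowKey(utcNowTicks()));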
To the best of my knowledge, Windows Azure partitions the data based on PartitionKey only and not the RowKey. Within a partition, the RowKey serves as the unique key. Windows Azure will try to keep data with the same PartitionKey on the same node, but since each node is a physical device (and thus has a size limitation), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition – a table partition are all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is: up to 2,000 entities per second.
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning can process up to the 20,000 entities/second, which is the overall account target described above.
Now, you mentioned that you have 10-20 different metric points and that for each metric point you'll write a maximum of 1 record per minute. That means you would be writing a maximum of 20 entities/minute/table, which is well under the scalability target of 2,000 entities/second.
Now the question of reading remains. Assume a user reads a maximum of 24 hours' worth of data (i.e. 24 * 60 = 1,440 points) per partition. Assuming the user gets the data for all 20 metrics for 1 day, each user (and thus each table) will fetch a maximum of 28,800 data points. The question left for you, I guess, is how many requests like this you can get per second while still meeting that threshold. If you can somehow extrapolate this information, I think you can reach some conclusion about the scalability of your architecture.
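Put as code, the write and read sides of that estimate look like this:

const metrics = 20;                                  // metric points per tenant (upper bound)
const writesPerMinutePerTable = metrics * 1;         // 1 record per metric per minute = 20
const pointsPerMetricPerDay = 24 * 60;               // 1,440
const pointsReadPerTablePerDay = metrics * pointsPerMetricPerDay;  // 28,800
console.log(writesPerMinutePerTable, pointsReadPerTablePerDay);
// Compare against the targets quoted above: 2,000 entities/s per partition,
// 20,000 entities/s for the whole storage account.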
I would also recommend watching this video as well: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.
