Best way to store high frequency, periodic time-series data? - node.js

I have created an MVP for a Node.js project. The following features are relevant to the question I am about to ask:
1. The application has a list of IP addresses with CRUD actions.
2. The application pings each IP address every 5 seconds.
3. It displays the status of each IP address, i.e. alive or dead, and the uptime if alive.
I created a working MVP on Node.js with the help of the net-ping library, Express, MongoDB and Angular. Now I have a new feature request:
"to calculate the round-trip time (latency) for each ping that is generated for each IP address and populate a bar chart or any type of chart that will display the RTT (latency) history (1 month to 1 year) of every connection"
I need to store the response of each ping in the database. Assuming the best case, in which each stored document is 0.5 KB, one ping every 5 seconds (17,280 per day) makes about 8.6 MB of data per day, roughly 260 MB per month and about 3.2 GB per year for a single IP address, and I am going to have 100-200 IP addresses in my application.
What is the best solution (including paid ones) for my requirements, considering that the app may need to scale further?
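The estimate can be sanity-checked with a few lines of JavaScript. This is only a rough sketch: the 0.5 KB document size and the 5-second interval are the assumptions stated above, and the function name is made up for illustration.

```javascript
// Assumptions taken from the question: one ping every 5 s, 0.5 KB per stored document.
const PING_INTERVAL_SECONDS = 5;
const DOC_SIZE_KB = 0.5;

// Estimated storage in KB for a number of days and monitored IP addresses.
function estimateStorageKB(days, hosts = 1) {
  const pingsPerDay = (24 * 60 * 60) / PING_INTERVAL_SECONDS; // 17280 pings per day
  return pingsPerDay * DOC_SIZE_KB * days * hosts;
}

console.log(estimateStorageKB(1));        // KB per day for one IP address
console.log(estimateStorageKB(365, 200)); // KB per year for 200 IP addresses
```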

Time-series data require special treatment from a database perspective, because they challenge traditional database management in terms of capacity, query performance and read/write optimisation targets.
I wouldn't recommend storing this data in a traditional RDBMS or an object/document database.
The best option is to use a specialised time-series database engine, such as InfluxDB, that supports downsampling (aggregation) and raw-data retention rules.
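To illustrate what downsampling means here, the following is a minimal sketch in plain JavaScript. It is not an InfluxDB API; the helper name and the sample shape ({ t, rtt }) are assumptions made for the example. It collapses raw pings into fixed-size time buckets and keeps only count, mean, min and max per bucket, which is the kind of aggregate a chart covering months would typically read.

```javascript
// Downsample raw ping samples into fixed-size time buckets.
// Hypothetical helper: each sample is { t: epoch milliseconds, rtt: milliseconds }.
function downsample(samples, bucketMs) {
  const buckets = new Map();
  for (const { t, rtt } of samples) {
    const start = Math.floor(t / bucketMs) * bucketMs; // start time of the bucket
    const b = buckets.get(start) ||
      { start, count: 0, sum: 0, min: Infinity, max: -Infinity };
    b.count += 1;
    b.sum += rtt;
    b.min = Math.min(b.min, rtt);
    b.max = Math.max(b.max, rtt);
    buckets.set(start, b);
  }
  // Return buckets in time order, replacing the running sum with the mean.
  return [...buckets.values()]
    .sort((a, b) => a.start - b.start)
    .map(({ start, count, sum, min, max }) =>
      ({ start, count, mean: sum / count, min, max }));
}
```

With a bucket size of one hour, a year of 5-second pings for one IP address shrinks from about 6.3 million raw samples to 8,760 aggregate rows.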

So I changed the schema design for the time-series data after reading this, and that massively reduced the numbers in my size calculation.
The previous schema looked like this:
{
timestamp: ISODate("2013-10-10T23:06:37.000Z"),
type: "Latency",
value: 1000000
},
{
timestamp: ISODate("2013-10-10T23:06:38.000Z"),
type: "Latency",
value: 15000000
}
Size of each document: 0.22 KB
Number of documents created in an hour = 720
Size of data generated in an hour = 0.22 * 720 = 158.4 KB
Size of data generated by one IP address in a day = 158.4 * 24 ≈ 3.7 MB
Since each timestamp is just the previous one plus 5 seconds, the schema can be optimized to cut the redundant data.
The new schema looks like this:
{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"), // truncated to the hour
  type: "Latency",
  values: { // one entry per 5-second ping within this hour
    0: 999999,
    …
    37: 1000000,
    38: 1500000,
    …
    719: 2000000
  }
}
Size of each document: 0.5 KB
Number of documents created in an hour = 1
Size of data generated in an hour = 0.5 KB
Size of data generated by one IP address in a day = 0.5 * 24 = 12 KB
So I am assuming the size of the data will not be an issue anymore. Although there is a debate about what type of storage should be used in such scenarios to ensure the best performance, I am going to trust MongoDB in my case.
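Writing into such an hourly document could look roughly like the sketch below. It assumes 5-second slots numbered from 0 within the hour; the helper name and the "host" field are hypothetical and not part of the schema above. The function only builds the filter and update objects, so it runs without a database.

```javascript
const SLOT_SECONDS = 5; // ping interval from the question

// Build the filter and update for one ping (hypothetical helper; "host" is an assumed field).
function buildHourlyUpsert(host, date, rttValue) {
  const hourStart = new Date(date.getTime());
  hourStart.setUTCMinutes(0, 0, 0); // truncate to the start of the hour (UTC)
  const secondsIntoHour = Math.floor((date.getTime() - hourStart.getTime()) / 1000);
  const slot = Math.floor(secondsIntoHour / SLOT_SECONDS); // 0..719
  return {
    filter: { host, type: "Latency", timestamp_hour: hourStart },
    update: { $set: { [`values.${slot}`]: rttValue } },
  };
}

// With the official MongoDB Node.js driver this could be applied as:
//   const { filter, update } = buildHourlyUpsert("10.0.0.1", new Date(), 1000000);
//   await collection.updateOne(filter, update, { upsert: true });
```

The upsert creates the hourly document on the first ping of the hour and fills in one slot per ping afterwards.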

Related

How to aggregate data by period in a rrdtool graph

I have an rrd file with average ping times to a server (GAUGE) every minute, and when the server is offline (which is very frequent, for reasons that don't matter now) it stores a NaN/unknown.
I'd like to create a graph with the percentage the server is offline each hour which I think can be achieved by counting every NaN within 60 samples and then dividing by 60.
For now I get to the point where I define a variable that is 1 when the server is offline and 0 otherwise, but I already read the docs and don't know how to aggregate this:
DEF:avg=server.rrd:rtt:AVERAGE CDEF:offline=avg,UN,1,0,IF
Is it possible to do this when creating a graph? Or will I have to store that info in another rrd?
I don't think you can do exactly what you want, but you have a couple of options.
You can define a sliding window average that shows the percentage of the previous hour that was unknown, and graph that, using TREND.
DEF:avg=server.rrd:rtt:AVERAGE:step=60
CDEF:offline=avg,UN,100,0,IF
CDEF:pcoffline=offline,3600,TREND
LINE:pcoffline#ff0000:Offline
This defines avg as the 1-min time series of ping data. Note we use step=60 to ensure we get the best resolution of data even in a smaller graph. Then we define offline as 100 when the sample is unknown (the server is offline) and 0 otherwise. Then, pcoffline is a 1-hour sliding window average of this, which will in effect be the percentage of time during the previous hour during which the server was offline.
However, there's a problem in that RRDTool will silently summarise the source data before you get your hands on it, if there are many data points to a pixel in the graph (this won't happen if doing a fetch, of course). To get around that, you'd need to have the offline CDEF done at store time, i.e. have a COMPUTE type DS that is 100 or 0 depending on whether the rtt DS is known. Then, any averaging will preserve data (normal averaging omits the unknowns, or the xff setting makes the whole cdp unknown).
rrdtool create ...
DS:rtt:GAUGE:120:0:9999
DS:offline:COMPUTE:rtt,UN,100,0,IF
rrdtool graph ...
DEF:offline=server.rrd:offline:AVERAGE:step=3600
LINE:offline#ff0000:Offline
If you are able to modify your RRD, and do not need historical data, then use of a COMPUTE in this way will allow you to display your data in a 1-hour stepped graph as you wanted.
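For comparison, the arithmetic the question asks for (count the unknowns in each block of 60 one-minute samples and divide by 60) can be sketched outside RRDTool, for example on values obtained with a fetch. This is only an illustration in JavaScript; the function name and the use of NaN for unknown samples are assumptions.

```javascript
// samples: one RTT value per minute, NaN when the sample is unknown (server offline).
// Returns the percentage of unknown samples for each complete hour.
function offlinePercentPerHour(samples, samplesPerHour = 60) {
  const result = [];
  for (let i = 0; i + samplesPerHour <= samples.length; i += samplesPerHour) {
    const hour = samples.slice(i, i + samplesPerHour);
    const unknown = hour.filter((v) => Number.isNaN(v)).length;
    result.push((unknown / samplesPerHour) * 100);
  }
  return result;
}
```

Unlike the sliding window above, this produces one value per fixed hour, which matches the stepped graph that the COMPUTE approach gives.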

Hazelcast management center shows get latency of 0 ms for replicated map

Setup :
3 member embedded cluster deployed as a spring boot jar.
Total keys on each member: 900K
Get operation is being attempted via a rest api.
Background:
I am trying to benchmark Hazelcast's replicated map.
The Management Center UI shows around 10k requests/s being executed, but the average get latency per second shows as 0 ms.
I believe it is not showing because it might be in microseconds.
Please let me know how to configure the Management Center UI to show latency in microseconds or nanoseconds.
The Management Center UI shows around 10k requests/s being executed, but the average get latency per second shows as 0 ms.
I believe you're talking about Replicated Map Throughput Statistics in the replicated map details page. The Avg Get Latency column in that table shows on average how much time it took for a cluster member to execute the get operations for the time period that is selected on the top right corner of the table. For example, if you select Last Minute there, you only see the average time it took for the get operations in the last minute.
I believe it is not showing because it might be in microseconds.
Cluster is sending it as milliseconds (calculating it as nanoseconds in a newer cluster version but still sending as milliseconds). However, since a replicated map replicates all data on all members and every member contains the whole data set, get latency is typically very low as there's no network trip.
I guess that the way we render very small metric values confused you. In Management Center UI, we only show two fractional digits. You can see it in action in the below screenshots:
As you can see, since the value is very low, it is shown as 0. I believe we can do a better job rendering these values though (using a smaller time unit for example). I will create an issue for this on our private issue tracker.
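The rounding effect is easy to reproduce. The snippet below is a hypothetical formatter, not Management Center code; it only shows how a sub-millisecond average collapses to zero when printed with two fractional digits, and how the same value reads in a smaller unit.

```javascript
// Hypothetical formatter for illustration only (not Management Center code).
function formatLatency(ms) {
  return {
    asMillis: ms.toFixed(2) + " ms",          // two fractional digits, in milliseconds
    asMicros: (ms * 1000).toFixed(2) + " us", // same value expressed in microseconds
  };
}

console.log(formatLatency(0.004));
```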

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries (3 hours); however, that takes more time to pull from the server, and more time and memory to plot. When looking at larger trends, the small movements are noise, so I would be better off retrieving the same number of documents (720) but only every 3rd one. That still gives me 3 hours of results with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data; however, it seems extremely inefficient, and I would rather the database handle this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in MongoDB for your problem.
The way forward would be to archive your data smartly, in fragments.
You can store your data in collections that each hold no more than a week or a month of data. A new month/week means storing your data in a different collection. That way you won't be doing a full collection scan and won't be collecting every single document, as you mentioned in your problem. Your application code will decide which collection to query.
If I were in your shoes, I would use a different tool, as MongoDB is more suited to being a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which can handle frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data to the collection "week0", and in the background run a weekly scheduler that moves the data from "week0" to the history collections "week1", "week2" and so on. The fragmentation logic depends on your requirements.
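How the application code might decide which collection to query can be sketched as follows. The naming scheme (one collection per calendar month, such as "readings_2013_10") and both helper names are assumptions made for this example, not anything MongoDB prescribes.

```javascript
// Hypothetical naming scheme: one collection per calendar month, e.g. "readings_2013_10".
function collectionNameFor(date, prefix = "readings") {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `${prefix}_${year}_${month}`;
}

// All collections that a query covering [from, to] has to touch, oldest first.
function collectionsForRange(from, to, prefix = "readings") {
  const names = [];
  // Start at the first day of the month so adding months never overflows.
  const cursor = new Date(Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), 1));
  while (cursor <= to) {
    names.push(collectionNameFor(cursor, prefix));
    cursor.setUTCMonth(cursor.getUTCMonth() + 1);
  }
  return names;
}
```

A query for the last hour then touches one collection, while a three-month chart touches three or four.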
I think the $bucketAuto stage might help you with it.
You can do something like this:
db.collection.aggregate([
  {
    $bucketAuto: {
      groupBy: "$_id", // the field to group by; in your example, 'temperature'
      buckets: 5 // the number of documents to return; for a sample of 500 documents, put 500 here
    }
  }
])
Each document in the result for the above query would be something like this,
"_id": {
"max": 3,
"min": 1
},
"count": 2
If you had grouped by temperature, then each document would have the minimum and maximum temperature found in that sample.
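For downsampling over time, one possible variant is to group by a time field and let each bucket report an average through the stage's output option. The sketch below only builds the pipeline; the field names "ts" and "temperature" and the helper name are assumptions about the data, not something taken from the question.

```javascript
// Build a $bucketAuto pipeline that splits documents into `buckets` groups by time
// and averages the value in each group (hypothetical helper and field names).
function bucketAutoPipeline(timeField, valueField, buckets) {
  return [
    {
      $bucketAuto: {
        groupBy: `$${timeField}`,
        buckets,
        output: {
          avg: { $avg: `$${valueField}` }, // mean value per bucket
          count: { $sum: 1 },              // number of documents per bucket
        },
      },
    },
  ];
}

// Possible use with the Node.js driver:
//   db.collection(name).aggregate(bucketAutoPipeline("ts", "temperature", 720))
```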
You might have another problem. Docs state not to rely on natural ordering:
This ordering is an internal implementation feature, and you should
not rely on any particular structure within it.
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
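A minimal sketch of that idea, assuming each document carries a numeric field "ts" with epoch seconds written on a fixed 5-second grid (the field name and the helper are hypothetical). It uses MongoDB's $mod query operator, which takes [divisor, remainder]. If the timestamps drift off the grid, the modulo test will skip documents, so this only works when writes are aligned.

```javascript
// Assumes documents have a numeric "ts" field (epoch seconds) aligned to a 5-second grid.
const SAMPLE_INTERVAL_SECONDS = 5;

// Build the filter, sort and limit for "every nth document, newest first".
function everyNthQuery(n, limit) {
  return {
    filter: { ts: { $mod: [n * SAMPLE_INTERVAL_SECONDS, 0] } },
    sort: { ts: -1 },
    limit,
  };
}

// Possible use with the Node.js driver:
//   const q = everyNthQuery(3, 720); // every 3rd sample, 720 documents = 3 hours
//   db.collection(name).find(q.filter).sort(q.sort).limit(q.limit)
```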

Liferay: huge DLFileRank table

I have a Liferay 6.2 server that has been running for years and is starting to take a lot of database space, despite limited actual content.
Table            Size     Number of rows
----------------------------------------
DLFileRank       5 GB     16 million
DLFileEntry      90 MB    60,000
JournalArticle   2 GB     100,000
The size of the DLFileRank table seems abnormally big to me (if it is totally normal, please let me know).
While the file ranking feature of Liferay is nice to have, we would not really mind resetting it if that halves the size of the database.
Question: Would a DELETE FROM DLFileRank be safe? (stop Liferay, run that SQL command, maybe set dl.file.rank.enabled=false in portal-ext.properties, start Liferay again)
Is there any better way to do it?
Bonus if there is a way to keep recent ranking data and throw away only the old data (not a strong requirement).
Wow. According to the documentation here (Ctrl-F rank), I'd not have expected the number of entries to be so high - did you configure those values differently?
Set the interval in minutes on how often CheckFileRankMessageListener
will run to check for and remove file ranks in excess of the maximum
number of file ranks to maintain per user per file. Defaults:
dl.file.rank.check.interval=15
Set this to true to enable file rank for document library files.
Defaults:
dl.file.rank.enabled=true
Set the maximum number of file ranks to maintain per user per file.
Defaults:
dl.file.rank.max.size=5
And according to the implementation of CheckFileRankMessageListener, it should be enough to just trigger DLFileRankLocalServiceUtil.checkFileRanks() yourself (e.g. through the scripting console). Why you have accumulated such a large number of file ranks is beyond me...
As you might know, I will never go on record recommending direct database manipulation; in fact, I refuse to approach the problem that way.

Strange data access time in Azure Table Storage while using .Take()

This is our situation:
We store user messages in Table Storage. The PartitionKey is the UserId and the RowKey is used as a message id.
When a user opens their message panel, we just want to .Take(x) messages; we don't care about the sort order. But we have noticed that the time it takes to get the messages varies greatly with the number of messages we take.
We did some small tests:
We did 50 * .Take(X) and compared the differences:
So we did .Take(1) 50 times and .Take(100) 50 times etc.
To make an extra check we did the same test 5 times.
Here are the results:
As you can see there are some HUGE differences. The difference between 1 and 2 is very strange. The same for 199-200.
Does anybody have any clue how this is happening? The Table Storage is on a live server btw, not development storage.
Many thanks.
X: # Takes
Y: Test Number
Update
The problem only seems to occur when I'm using a wireless network. When I'm using a cable, the times are normal.
Possibly the data is collected in batches of a certain number x. When you request x+1 rows, it would have to take two batches and then drop a certain number.
Try running your test with increments of 1 as the Take() parameter, to confirm or dismiss this assumption.

Resources