What is the best way to store usage reports over time? - linux

I currently have a few server reports that return usage statistics whenever run. The data is collected from several different sources (mostly log files), so they're not in a database to begin with.
The returned data are simple lists, for example, detailing how much disk space a user is using (user => space) average percent memory they've used for the month (user => memory), avg CPU time, etc.
Some of the information is a running total (like disk usage) and others are averages of snapshots taken throughout the month.
Running these reports and looking at the results works perfect, but I'd like to start storing these results to look at long-term trends.
What would be the best way to do this?

CACTI is very useful and highly configurable. Utilizes RRD Tool.
RRD Tool is great, b/c it stores data in a circular format and summarizes it. When RRD creates a data file, it creates it with every data point that it will ever store, so it never gets larger. You don't have to worry about log files getting too large. The key is to configure it to summarize in time periods, e.g., daily, monthly, yearly. The downside is that next year, you might not be able to know the CPU usage for the five minute period from January 1 of this year. But who really needs that?

RRDtool seems the obvious solution for this.
Or for that matter, one of the out-of-the box monitoring tools, some of which happen to use rrdtool for storing their data. E.g. Munin.

Related

In Microsoft Dataverse how can I see a real time capacity or data usage? I only see a daily values

We are moving to an online D365 environment and we're trying to determine how much data we're using in the Dataverse tables. Under Capacity I can go into the environment details and see how much space per table is used, but it's a daily value. We're trying to remove some data as we're reaching some of our capacity limits -- but I can't find where it shows how much data is being used per table in real time. Thanks for any advise on how to pull this.
This article does mention how to get to the capacity limits and usage, but all values appear to be updated just daily:
https://learn.microsoft.com/en-us/power-platform/admin/capacity-storage?source=docs
I'm trying to find some way to see the data used in real time.

Options for running data extraction on a daily basis

I currently have an excel based data extraction method using power query and vba (for docs with passwords). Ideally this would be programmed to run once or twice a day.
My current solution involves setting up a spare laptop on the network that will run the extraction twice a day on its own. This works but I am keen to understand the other options. The task itself seems to be quite a struggle for our standard hardware. It is 6 network locations across 2 servers with around 30,000 rows and increasing.
Any suggestions would be greatly appreciated
Thanks
if you are going to work with increasing data, and you are going to dedicate a exclusive laptot for the process, i will think about install a database in the laptot (MySQL per example), you can use Access too... but Access file corruptions are a risk.
Download to this db all data you need for your report, based on incremental downloads (only new, modified and deleted info).
then run the Excel report extracting from this database in the same computer.
this should increase your solution performance.
probably your bigger problem can be that you query ALL data on each report generation.

Adding and retrieving sorted counts in Cassandra

I have a case where I need to record a user action in Cassandra, then later retrieve a sorted list of users with the highest number of that action in an arbitrary time period.
Can anyone suggest a way to store and retrieve this data in a pre-aggregated method?
Outside of Cassandra I would recommend using stream-summary or count min sketch you would be able to solve this with much less space and have immediate results. Just update and periodically serialize and persist it (assuming you don't need guaranteed accuracy)
In Cassandra you can keep a row per period of time like by hours and have a counter per user in that row, incrementing them on use. Then use a batch job to run through them and find the heavy hitters. You would be constrained to having the minimal queryable time be 1 hour and it wont be particularly cheap or fast to compute but it would work.
Generally it would be good treating these as a log of operation, every time there is an event store it and have batch jobs do analytics against it with hadoop or custom. If need it realtime id recommend the above approach of keeping stream summaries in memory.

HIVE/HDFS for realtime storage of sensor data on a massive scale?

I am evaluating sensor data collection systems with the following requirements,
1 million endpoints sending in 100 bytes of data every minute (as a time series).
Basically millions of small writes to the storage.
This data is write-once, so basically it never gets updated.
Access requirements
a. Full data for a user needs to be accessed periodically (less frequent)
b. Partial data for a user needs to be access periodically (more frequent). For e.g I need sensor data collected over the last hour/day/week/month for analysis/reporting.
Have started looking at Hive/HDFS as an option. Can someone comments on the applicability of Hive in such a use case? I am concerned that while the distributed storage needs would work, it seems more suited to data warehousing applications than real time data collection/storage.
Do HBase/Cassandra make more sense in this scenario?
I think HBase can be a good option for you. In fact there's already an open/source implementation in HBase which solves similar problem that you might want to use. Take a look at openTSB which is an open source implementation for solving similar problems. Here's a short excerpt from their blurb:
OpenTSDB is a distributed, scalable Time Series Database (TSDB)
written on top of HBase. OpenTSDB was written to address a common
need: store, index and serve metrics collected from computer systems
(network gear, operating systems, applications) at a large scale, and
make this data easily accessible and graphable. Thanks to HBase's
scalability, OpenTSDB allows you to collect many thousands of metrics
from thousands of hosts and applications, at a high rate (every few
seconds). OpenTSDB will never delete or downsample data and can easily
store billions of data points. As a matter of fact, StumbleUpon uses
it to keep track of hundred of thousands of time series and collects
over 600 million data points per day in their main production
datacenter.
There are actually quite a few people collecting sensor data in a time-series fashion with Cassandra. It's a very good fit. I recommend you read this article on basic time series in Cassandra for an idea of what your data model would be like.
Writes in Cassandra are extremely cheap, so even a moderately sized cluster could easily handle one million writes per minute.
Both of your read queries could be answered very efficiently. For the second type of query, where you're reading data for a slice of time for a single sensor, you would end up reading a contiguous slice from a single row; this should take about 10ms for a completely cold read. For the first type of query, you would simply be running several of the per-sensor queries in parallel. Assuming you store a basic map of users to sensor IDs, you would lookup all of the sensor IDs for a user with one query, and then your second query would fetch the data for all of those sensors (although you might break up this query if the number of sensors is high).
Hive and HDFS don't really make sense when you're talking about real-time queries, as they're more suited for long-running batch jobs.

huge Oracle Redo Logs

At 10PM each Tuesday all of a sudden oracle is generating huge REDO logs until the disk runs out of space. My application is not running any huge queries or anything during this time according to the logs.
The only thing I can find is that the dba_scheduler_job_run_details table started an oracle job right at that time. I can't find any info on google about this job, so am desperate for any ideas.
Info from dba_scheduler_job_run_details:
JOB_NAME: ORA$AT_SA_SPC_SY_254
STATUS: STOPPED
ACTUAL_START_DATE: 11-03-22 22:00:02.125060000 CST6CDT
RUN_DURATION 9:4:19.0
10PM is usually the time that automatic statistics gathering starts. Although it normally runs every day. In 11g stats gathering uses auto tasks instead of the scheduler, try looking for the stats job with a query like this: select * from dba_autotask_job_history order by window_start_time desc;
But even if the problem is caused by statistics, it seems odd that it would cause too much REDO. Usually gathering statistics is a lot of reading and a very small amount of writing. Unless you've got many small tables that change all the time; in that case the amount of statistics information could be much larger than the actual data. If that's the case you may need to gather the stats more often, or maybe lock the stats.
Or possibly the statistics process is blowing up on a specific table. This will show you what table was last analyzed, maybe it will give you a clue: select last_analyzed, dba_tables.* from dba_tables order by 1 desc nulls last;
I something generates huge REDOLOG, then you must have huge DML activity. For examaple cleanup script which tries to purge some data, but fails, rollbacks, and then tries to do the same task again and again and again...
The best way how to prove/disprove your doubts is the "Log miner tool". It's not trivial to use, but it will tell you which statements (and against which table) generated most of the redo and that time.

Resources