High CPU usage issue because of cron jobs - python-3.x

I apologize for the long question but I have to explain it.
I am doing a change point detection through python for almost 50 customers and for different processes.
I am getting minute by minute interval numeric data from influx.
I am calculating the Z-Score and saving that in mongoDB locally on
the
machine on which the cron job is running.
I am comparing the Z-score with historic Z-score and then I am
alerting
the system.
Issues:
As i am doing this for 50 customers which can scale to say 500 or
5000 and each customer will
be having say 10 process so its not practical to have that many cron jobs.
Increasing cron jobs are creating high cpu usage and as my data is
sitting locally and if I
will loose the linux machine then i will loose all my data sitting there and won't be able to
compare it with historic data.
Solutions:
Create a clustered mongoDB server rather than having the data saved
locally.
Replace the cron jobs with multi threading and multi processing.
Suggestions:
What could be the best way to implement this to decrease the load on
cpu considering this is going to run all the time in a loop?
After fixing the above, what could be the best way to decrease the
false positive numbers in alerts?
Keep in mind:
Its a time series data.
Every time stamp has 5 or 6 variables and as of now i am doing the
same
operation for each variable separately.
Time var1 var2 var3.................varN
09:00 PM 10,000 5,000 150,000..............10
09:01 PM 10,500 5,050 160,000..............25
There is possibility that there is a correlation in these numbers.
Thanks!

Related

Stream Analytics: How can I start and stop a TUMBLINGWINDOW aggregation job inorder to reduce costs while still getting the same aggregation results?

Context
I have created a streaming job using Azure portal which aggregates data using a day wise TUMBLINGWINDOW. Have attached a code snippet below, modified from the docs, which shows similar logic.
SELECT
DATEADD(day, -1, System.Timestamp()) AS WindowStart
System.Timestamp() AS WindowEnd,
TollId,
COUNT(*)
FROM Input TIMESTAMP BY EntryTime
GROUP BY TumblingWindow(day, 1), TollId
Now that the job has been running and can see it producing output I want to be able to reduce the costs ideally by setting some sort of time scheduling so that the job can run and still produce the same output without being on all the time.
The only real constraint being that the aggregated output at the end of each TUMBLINGWINDOW has to remain the same as if it were running all the time (no impact of stop-starting on output).
This then brings me to my question.
Update: 2021-02-28
Before going into the question another thing that drove me was that through Azure portal you can manually start and stop a job. When you start/restart a job you can set a custom start time for the job/query. With this level of control say I start a job (or have a job running) and then decide to stop it for majority of the day and then turn it on at say 11:30pm each day with a custom start time of midnight of the current day then it would be able to be on for approx 30min before it would output the results (yet still to my understanding produce the same aggregation results/effect compared to if it was on the whole day up until that point). This job could then be paused again at 00:30am ( the next day for which it stays paused for the majority of the day (1380min total until 11:30pm again) upon which the same above logic is applied.
This way it remains off the majority of the day yet still can produce the same output for each day wise window (correct me if I am wrong in my thinking). The only issue with this to me seems to be the fact someone would have to manually perform this. Thus I was driven to the docs looking for a way to automate this.
Question
How can I start and stop a job in an automated fashion such that the required output would still remain intact but so that the job doesn't have to remain on all the time (like it currently is)?
Does the documentation linked above suffice given the context above, if so what are some possible arrangements for the N minutes (on) and M minutes (off) time variables for this to work?
Is this possible given the scenario that I want to aggregate on a one day TUMBLINGWINDOW window (whereby I want each window to start and end at midnight of each day, as per its default behaviour.)?
Eg
Window start: 2022-02-20 00:00:00 Window end: 2022-02-21 00:00:00 (aggregation performed),
Window start: 2022-02-21 00:00:00 Window end: 2022-02-22 00:00:00 (aggregation performed),
Window start: 2022-02-22 00:00:00 Window end: 2022-02-23 00:00:00 (aggregation performed),
....so on
Thoughts
I found this documentation from Microsoft regarding auto-pausing jobs using a few methods
However came across a paragraph (quoted below) which made me doubtful whether it seems reasonable in my particular use case (TUMBLING 1 day window as described in my question section).
Note
There are downsides to auto-pausing a job. The main ones being the loss of the low latency /real time capabilities, and the potential risks from allowing the input event backlog to grow unsupervised while a job is paused. Auto-pausing should not be considered for most production scenarios running at scale.
Could this method
There are 3 ways to lower costs:
downscale your job, you will have higher latency but for a lower cost, up to a point where your job crashes because it runs out of memory over time and/or can't catch up with its backlog. Here you need to keep an eye on your metrics to make sure you can react before it's too late
going further, you can regroup multiple queries into a single job. This job most likely won't be aligned in partitions, so it won't be able to scale linearly (adding SUs is not guaranteed to give you better performance). Same comment as above, plus you need to remember that when you need to scale back up, you probably will have to break down that job into multiple jobs to again be able to scale in a linear fashion
finally you can auto-pause a job, one way to implement that being explained in the doc you linked. I wrote that doc, and what I meant by that comment is that here again you are taking the risk of overloading the job if it can't run long enough to process the backlog of events. This is a risky proposition for most production scenarios
But if you know what you are doing, and are monitoring closely the appropriate metrics (as explained in the doc), this is definitely something you should explore.
Finally, all of these approaches, including the auto-pause one, will deal with tumbling windows transparently for you.
Update: 2022-03-03 following comments here
Update: 2022-03-04 following comments there
There are 3 time dimensions here:
When the job is running or not: the wall clock
When the time window is expected to output results: Tumbling(day,1) -> 00:00AM every day, this is absolute (on the day, on the hour, on the minute...) and independent of the job start time below
What output you want produced from the job, via the job start time
Let's say you have the job running 24/7 for multiple months, and decide to stop it at noon (12:00PM) on the 1st day of March.
It already has generated an output for the last day of February, at 00:00AM Mar1.
You won't see a difference in output until the following day, 00:00AM Mar2, when you expect to see the daily window of Mar1, but it's not output because the job is stopped.
Let's start the job at 01:00AM Mar2 wall clock time. If you want the missing time window, you should either pick a start time at 'when last stopped' (noon the day before), or a custom time any time before 23:59PM Mar1. What you are driving is the output window you want. Here you are telling ASA you want all the windows from that point onward.
ASA will then reload all the data it needs to generate that window (make sure the event hub has enough retention for that, we don't cache data between restarts in the job): Azure Stream Analytics will automatically look back at the data in the input source. For instance, if you start a job “Now” and if your query uses a 5-minutes Tumbling Window, Azure Stream Analytics will seek data from 5 minutes ago in the input. The first possible output event would have a timestamp equal to or greater than the current time, and ASA guarantees that all input events that may logically contribute to the output has been accounted for.

Invisible Delays between Spark Jobs

There are 4 major actions(jdbc write) with respect to application and few counts which in total takes around 4-5 minutes for completion.
But the total uptime of Application is around 12-13minutes.
I see there are certain jobs by name run at ThreadPoolExecutor.java : 1149. Just before this job being reflected on Spark UI, the invisible long delays occur.
I want to know what are the possible causes for these delays.
My application is reading 8-10 CSV files, 5-6 VIEWs from table. Number of joins are around 59, few groupBy with agg(sum) are there and 3 unions are there.
I am not able to reproduce the issue in DEV/UAT env since the data is not that much.
It's in the production where I get the app. executed run by my Manager.
If anyone has come across such delays in their job, please share your experience what could be the potential cause for this, currently I am working around the unions, i.e. caching the associated dataframes and calling count so as to get the benefit of cache in the coming union(yet to test, if union is the reason for delays)
Similarly, I tried the break the long chain of transformations with cache and count in between to break the long lineage.
The time reduced from initial 18 minutes to 12 minutes but the issue with invisible delays still persist.
Thanks in advance
I assume you don't have a CPU or IO heavy code between your spark jobs.
So it really sparks, 99% it is QueryPlaning delay.
You can use
spark.listenerManager.register(QueryExecutionListener) to check different metrics of query planing performance.

Spark streaming - waiting for data for window aggregations?

I have data in the format { host | metric | value | time-stamp }. We have hosts all around the world reporting metrics.
I'm a little confused about using window operations (say, 1 hour) to process data like this.
Can I tell my window when to start, or does it just start when the application starts? I want to ensure I'm aggregating all data from hour 11 of the day, for example. If my window starts at 10:50, I'll just get 10:50-11:50 and miss 10 minutes.
Even if the window is perfect, data may arrive late.
How do people handle this kind of issue? Do they make windows far bigger than needed and just grab the data they care about on every batch cycle (kind of sliding)?
In the past, I worked on a large-scale IoT platform and solved that problem by considering that the windows were only partial calculations. I modeled the backend (Cassandra) to receive more than 1 record for each window. The actual value of any given window would be the addition of all -potentially partial- records found for that window.
So, a perfect window would be 1 record, a split window would be 2 records, late-arrivals are naturally supported but only accepted up to a certain 'age' threshold. Reconciliation was done at read time. As this platform was orders of magnitude heavier in terms of writes vs reads, it made for a good compromise.
After speaking with people in depth on MapR forums, the consensus seems to be that hourly and daily aggregations should not be done in a stream, but rather in a separate batch job once the data is ready.
When doing streaming you should stick to small batches with windows that are relatively small multiples of the streaming interval. Sliding windows can be useful for, say, trends over the last 50 batches. Using them for tasks as large as an hour or a day doesn't seem sensible though.
Also, I don't believe you can tell your batches when to start/stop, etc.

How big the spark stream window could be?

I have some data flows need to be calculated. I am thinking about use spark stream to do this job. But there is one thing I am not sure and feel worry about.
My requirements is like :
Data comes in as CSV files every 5 minutes. I need report on data of recent 5 minutes, 1 hour and 1 day. So If I setup a spark stream to do this calculation. I need a interval as 5 minutes. Also I need to setup two window 1 hour and 1 day.
Every 5 minutes there will be 1GB data comes in. So the one hour window will calculate 12GB (60/5) data and the one day window will calculate 288GB(24*60/5) data.
I do not have much experience on spark. So this worries me.
Can spark handle such big window ?
How much RAM do I need to calculation those 288 GB data? More than 288 GB RAM? (I know this may depend on my disk I/O, CPU and the calculation pattern. But I just want some estimated answer based on experience)
If calculation on one day / one hour data is too expensive in stream. Do you have any better suggestion?

How to deal with a large amount of logs and redis?

Say I have about 150 requests coming in every second to an api (node.js) which are then logged in Redis. At that rate, the moderately priced RedisToGo instance will fill up every hour or so.
The logs are only necessary to generate daily\monthly\annual statistics: which was the top requested keyword, which was the top requested url, total number of requests daily, etc. No super heavy calculations, but a somewhat time-consuming run through arrays to see which is the most frequent element in each.
If I analyze and then dump this data (with a setInterval function in node maybe?), say, every 30 minutes, it doesn't seem like such a big deal. But what if all of sudden I have to deal with, say, 2500 requests per second?
All of a sudden I'm dealing with 4.5 ~Gb of data per hour. About 2.25Gb every 30 minutes. Even with how fast redis\node are, it'd still take a minute to calculate the most frequent requests.
Questions:
What will happen to the redis instance while 2.25 gb worth of dada is being processed? (from a list, I imagine)
Is there a better way to deal with potentially large amounts of log data than moving it to redis and then flushing it out periodically?
IMO, you should not use Redis as a buffer to store your log lines and process them in batch afterwards. It does not really make sense to consume memory for this. You will better served by collecting your logs in a single server and write them on a filesystem.
Now what you can do with Redis is trying to calculate your statistics in real-time. This is where Redis really shines. Instead of keeping the raw data in Redis (to be processed in batch later), you can directly store and aggregate the statistics you need to calculate.
For instance, for each log line, you could pipeline the following commands to Redis:
zincrby day:top:keyword 1 my_keyword
zincrby day:top:url 1 my_url
incr day:nb_req
This will calculate the top keywords, top urls and number of requests for the current day. At the end of the day:
# Save data and reset counters (atomically)
multi
rename day:top:keyword tmp:top:keyword
rename day:top:url tmp:top:url
rename day:nb_req tmp:nb_req
exec
# Keep only the 100 top keyword and url of the day
zremrangebyrank tmp:top:keyword 0 -101
zremrangebyrank tmp:top:url 0 -101
# Aggregate monthly statistics for keyword
multi
rename month:top:keyword tmp
zunionstore month:top:keyword 2 tmp tmp:top:keyword
del tmp tmp:top:keyword
exec
# Aggregate monthly statistics for url
multi
rename month:top:url tmp
zunionstore month:top:url 2 tmp tmp:top:url
del tmp tmp:top:url
exec
# Aggregate number of requests of the month
get tmp:nb_req
incr month:nb_req <result of the previous command>
del tmp:nb_req
At the end of the month, the process is completely similar (using zunionstore or get/incr on monthly data to aggregate the yearly data).
The main benefit of this approach is the number of operations done for each log line is limited while the monthly and yearly aggregation can easily be calculated.
how about using flume or chukwa (or perhaps even scribe) to move log data to a different server (if available) - you could store log data using hadoop/hbase or any other disk based store.
https://cwiki.apache.org/FLUME/
http://incubator.apache.org/chukwa/
https://github.com/facebook/scribe/

Resources