Server Cluster: One Task - cron

The Background
Our clients use a service for which they set a daily budget. This is a prepaid service, and we allocate a particular amount from the user's balance every day.
Tables:
budgets - how much we are allowed to spend per day
money - the client's real balance
money_allocated - the amount deducted from money that can be spent today (based on budgets)
There is a cron job that runs every few minutes and checks:
whether the user already has money_allocated for the given day
whether money_allocated >= budgets (the user may increase the budget during the day)
If there is no allocation yet, we allocate the full daily budget; if the existing allocation is smaller than the budget, we allocate the difference between the budget and the amount already allocated for that day (in this case we create an additional record in money_allocated for the same day).
Allocation has two stages: in the first round we add a row with status "pending" (allocation requested), and another cron job checks all "pending" allocations and moves money from money to money_allocated if the user has enough money, changing the status to "completed".
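For concreteness, a minimal sketch of one pass of the first-stage cron, assuming a DB-API connection with %s placeholders and hypothetical column names (daily_amount, alloc_date, status); the real schema is not shown in the question:

# Hypothetical sketch of one allocation pass (table/column names are assumed,
# not taken from the real schema).
from datetime import date

def allocate_for_client(conn, client_id, today=None):
    """Insert a 'pending' allocation for the part of today's budget not yet allocated."""
    today = today or date.today()
    cur = conn.cursor()

    # Daily budget for the client.
    cur.execute("SELECT daily_amount FROM budgets WHERE client_id = %s", (client_id,))
    budget = cur.fetchone()[0]

    # Sum of everything already allocated (pending or completed) for today.
    cur.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM money_allocated "
        "WHERE client_id = %s AND alloc_date = %s",
        (client_id, today),
    )
    already_allocated = cur.fetchone()[0]

    missing = budget - already_allocated
    if missing > 0:
        # Stage 1: request the allocation; the second cron moves the money
        # from the money table and flips the status to 'completed'.
        cur.execute(
            "INSERT INTO money_allocated (client_id, alloc_date, amount, status) "
            "VALUES (%s, %s, %s, 'pending')",
            (client_id, today, missing),
        )
    conn.commit()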
The Problem
We have a cluster of application servers (behind an NLB), and the above cron job runs on each of them, which means that money can accidentally be allocated multiple times (or not allocated at all if we implement the "already allocated" checks incorrectly).
Our options include:
Run the cron job on one server only - no redundancy; if that server fails, allocations stop, clients complain and money is lost
Add a unique index on money_allocated over (client_id, date, amount) - this won't allocate more money for a given day if the client doubles the budget or increases it multiple times by the same amount during the day
There is an option to record each movement in budgets and link all allocations to either "first allocation of the day" or "change of budget (id xxx)" (add this to the unique index as well). This does not look sexy enough, however.
Any other options? Any advice would be highly appreciated!

OK, so I ended up running this on one of the cluster's instances. If you use Amazon AWS and are in a similar situation, below is one of the options.
On each machine, at the beginning of your cron job's code, do the following:
Call describe_load_balancers (AWS API) and parse the response to get a list/array of all instances
Fetch http://169.254.169.254/latest/meta-data/instance-id - this returns the instance ID of the machine making the request
If the returned instance ID is the first one in the list/array of all instances, proceed; if not, exit
Also, be sure to automatically replace unhealthy instances under this load balancer within a short time, as describe_load_balancers returns both healthy and unhealthy instances. You may end up with the job not being done for a while if instance #1 goes down.
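A rough boto3/urllib version of that check, assuming the classic ELB API referenced above; the load balancer name is a placeholder. Sorting the instance IDs makes the "instance #1" choice deterministic regardless of the order the API returns them in:

import sys
import urllib.request
import boto3

LOAD_BALANCER_NAME = "my-load-balancer"  # placeholder

def i_am_the_elected_instance():
    # Instance ID of the machine running this script (EC2 instance metadata).
    with urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ) as resp:
        my_id = resp.read().decode()

    # All instances registered with the load balancer (healthy or not).
    elb = boto3.client("elb")
    lb = elb.describe_load_balancers(LoadBalancerNames=[LOAD_BALANCER_NAME])
    instance_ids = sorted(
        i["InstanceId"]
        for i in lb["LoadBalancerDescriptions"][0]["Instances"]
    )

    # Proceed only on the instance that sorts first.
    return bool(instance_ids) and my_id == instance_ids[0]

if __name__ == "__main__":
    if not i_am_the_elected_instance():
        sys.exit(0)  # another node is responsible for this run
    # ... the actual cron job body goes here ...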

Related

High CPU usage issue because of cron jobs

I apologize for the long question but I have to explain it.
I am doing change point detection in Python for almost 50 customers and for different processes.
I am getting minute-by-minute numeric data from Influx.
I am calculating the Z-score and saving it in MongoDB locally on the machine on which the cron job is running.
I am comparing the Z-score with the historic Z-score and then alerting the system.
Issues:
As I am doing this for 50 customers, which can scale to say 500 or 5000, and each customer will have say 10 processes, it is not practical to have that many cron jobs.
The increasing number of cron jobs is creating high CPU usage, and as my data sits locally, if I lose the Linux machine I will lose all the data sitting there and won't be able to compare it with historic data.
Solutions:
Create a clustered MongoDB server rather than having the data saved locally.
Replace the cron jobs with multithreading and multiprocessing (see the sketch at the end of this question).
Suggestions:
What could be the best way to implement this to decrease the CPU load, considering this is going to run all the time in a loop?
After fixing the above, what could be the best way to decrease the number of false positives in the alerts?
Keep in mind:
It is time-series data.
Every timestamp has 5 or 6 variables, and as of now I am doing the same operation for each variable separately.
Time      var1    var2   var3     ...  varN
09:00 PM  10,000  5,000  150,000  ...  10
09:01 PM  10,500  5,050  160,000  ...  25
There is a possibility that these numbers are correlated.
Thanks!
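A minimal sketch of the multiprocessing idea from the "Solutions" list above, using pandas for the rolling statistics. fetch_series() and alert() are hypothetical placeholders for the existing Influx query and alerting code, and the window length and 3-sigma threshold are assumptions:

from multiprocessing import Pool
import pandas as pd

WINDOW = 60  # minutes of history used for the rolling mean/std (assumed)

def fetch_series(customer, process):
    """Placeholder: return a pandas Series of minute-level values from Influx."""
    raise NotImplementedError

def alert(customer, process, timestamps):
    """Placeholder: push an alert for the given anomalous timestamps."""
    raise NotImplementedError

def check_one(job):
    customer, process = job
    series = fetch_series(customer, process)
    # Rolling Z-score of each point against the preceding window.
    mean = series.rolling(WINDOW).mean()
    std = series.rolling(WINDOW).std()
    zscore = (series - mean) / std
    anomalies = zscore[zscore.abs() > 3].index  # 3-sigma threshold (assumed)
    if len(anomalies):
        alert(customer, process, anomalies)

if __name__ == "__main__":
    jobs = [(c, p) for c in range(50) for p in range(10)]  # 50 customers x 10 processes
    with Pool(processes=4) as pool:  # one long-lived worker pool instead of many cron jobs
        pool.map(check_one, jobs)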

Azure: Resource usage API issue

I tried to pull Azure resource usage data for billing metrics. I followed the steps mentioned in the following article to get usage data for resources.
https://msdn.microsoft.com/en-us/library/azure/mt219001.aspx
Even if I set the "start and end time" parameters in the URL, they do not take effect. It returns the entire output [from the time the resource was created/added].
For example:
https://management.azure.com/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/providers/Microsoft.Commerce/UsageAggregates?api-version=2015-06-01-preview&reportedStartTime=2017-03-03T00%3a00%3a00%2b00%3a00&reportedEndTime=2017-03-04T00%3a00%3a00%2b00%3a00&aggregationGranularity=Hourly&showDetails=true
As per the above URL, it should return the data between 2017-03-03 and 2017-03-04, but it shows data from 2nd March [2017-03-02]. I don't know why it returns the entire output and the time filter is not working.
Note: the end time parameter does take effect, i.e. it shows output only up to what is specified as the end time, but it doesn't consider the start time.
Does anyone have a suggestion on this?
So there are a few things to consider:
There is usage date/time and then there is reported date/time. The former tells you the date/time when the resources were used, while the latter tells you the date/time when this information was received by the billing sub-system. There will be some delay between when the resources are used and when they are reported. From this link:
Set {dateTimeOffset-value} for reportedStartTime and reportedEndTime to valid dateTime values. Please note that this dateTimeOffset value represents the timestamp at which the resource usage was recorded within the Azure billing system. As Azure is a distributed system, spanning across 19 datacenters around the world, there is bound to be a delay between the resource usage time (when the resource was actually consumed) and the resource usage reported time (when the usage event reached the billing system) and callers need a predictable way to get all usage events for a subscription for a given time period.
The query only lets you search by reported date/time; there is no provision for searching by usage date/time. However, the data returned to you contains the usage date/time, not the reported date/time.
Long story short, because of the delay in propagating the usage information to the billing sub-system, the behavior you're seeing is correct. In my experience, it takes about 24 hours for all the usage information to show up in the billing sub-system.
The way we handle this scenario in our application is to fetch data for a longer duration and then pick up only the data we're interested in. For example, if I need to see the data for 1st of March, we query data with a reported date/time from 1st March to, say, 4th March (i.e. today's date) and then discard any data where the usage date is not 1st of March.
If we don't find any data (which is quite possible and is happening in your case as well), we simply tell the users that usage information is not yet available.
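A rough illustration of that workaround: query a wider reported-time window, then keep only rows whose usage date matches the day you care about. The subscription ID and bearer token are placeholders (the token is assumed to be obtained separately via Azure AD), field names follow the UsageAggregates response format, and nextLink paging is omitted for brevity:

import requests
from datetime import datetime

SUBSCRIPTION_ID = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # placeholder
TOKEN = "..."  # placeholder Azure AD bearer token

def usage_for_day(target_day):
    # e.g. usage_for_day(datetime(2017, 3, 3))
    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
        "/providers/Microsoft.Commerce/UsageAggregates"
    )
    params = {
        "api-version": "2015-06-01-preview",
        # Reported window deliberately wider than the usage day we want,
        # to allow for the propagation delay into the billing sub-system.
        "reportedStartTime": target_day.strftime("%Y-%m-%dT00:00:00+00:00"),
        "reportedEndTime": datetime.utcnow().strftime("%Y-%m-%dT00:00:00+00:00"),
        "aggregationGranularity": "Daily",
        "showDetails": "false",
    }
    resp = requests.get(url, params=params,
                        headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    items = resp.json().get("value", [])

    # Keep only aggregates whose *usage* time (not reported time) falls on that day.
    wanted = target_day.strftime("%Y-%m-%d")
    return [i for i in items
            if i["properties"]["usageStartTime"].startswith(wanted)]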

Azure SQL Database replication - 9 1/2 hour delay + can't remove the replica

Note
This is getting quite long so I will try and re-edit parts through the day.
These databases are no longer active, which means I can play with them to work out what is going wrong.
The only thing left to answer: given two databases running on Azure SQL Database at S3 (100 DTU), should a secondary ever get significantly behind the primary database, even while DTU is hammered to 100% for over half the day? The DTU was being hammered mostly by IO writes.
The Start: a few problems.
DTU limits were hit on Monday, Tuesday and to some extent on Wednesday for a significant amount of time. 3PM UTC - 6AM UTC.
Problem 1 (lag in data on the secondary): this appeared to have caused a data lag on the secondary of about 9 1/2 hours. The servers were effectively being spammed with updates, causing a lot of IO; 6-8 million records on one table over the 24-hour period, for example. This problem drove the reason for the post:
Shouldn't these be much more in sync?
The data became out of sync on Monday morning and continued out of sync until Friday. On Thursday some new databases were started up to replace these standard SQL databases and so they were left to rot. Well, for me to experiment with at least.
The application causing the redundant queries couldn't be fixed for a few days (I doubt they will ever fix it now), so that leaves changing the instance type. That was attempted on the current instance, but the database must disconnect from all standard replicas before it can move up to the premium tier. This led to the second problem (see below): the replica taking its time to be removed. It began on Wednesday morning and did not complete until Friday.
Problem 2 (can't remove the replica):
(Solved itself after two days)
The process of disconnecting the secondary database began around Wednesday 8:00 UTC (when the primary was at about 80 GB), with the secondary about 11 GB behind in size at that point.
Setup
The databases (primary and secondary) are S3, geo-replicated (North + West Europe).
There is an audit log table (which I read from the secondary, normally with an SQL query), but it is currently 9 1/2 hours behind the last entry in the primary database. Running the query again on the secondary a few seconds later, it is slowly catching up, but this appears to be in step with the refresh rather than actually playing catch-up.
Both primary and secondary (read-only) databases are S3 (about to be bumped to P2).
The Azure documentation states:
Active Geo-Replication (readable secondaries) is now available for all databases in all service tiers. In April 2017, the non-readable secondary type will be retired and existing non-readable databases will automatically be upgraded to readable secondaries.
How has the secondary got so far behind? Seconds to minutes would be acceptable; hours, not so much. The documentation above describes it as only "slightly" behind:
While at any given point, the secondary database might be slightly behind the primary database, the secondary data is guaranteed to always be transactionally consistent with changes committed to the primary database.
Given the secondary is about to be destroyed and replaced at a higher tier (replicas need to be removed when upgrading from standard to premium), I'm curious to know whether it will happen again, as well as what the definition of "slightly" might be in this instance.
Notes: the primary did reach maximum DTU for a few hours, but performance was not harmed significantly, which is when the 9-hour difference was noticed.
Stats
Update for TheGameiswar:
I can't query it right now, as it started removing itself as a replica (so the primary can be moved up to the P2 level); that began hours ago at ~8:30 UTC and 5 hours later it is still going. I think it's quite broken now.
Query - nothing special:
SELECT TOP 1000 [ID]
      ,[DateCreated]
      ,[SrcID]
      ,[HardwareID]
      ,[IP_Port]
      ,[Action]
      ,[ResponseTime]
      ,[ErrorCode]
      ,[ExtraInfo]
FROM [dbo].[Audit]
ORDER BY DateCreated DESC
I can't compare the tables anymore as it's quite stuck and refusing connections.
The 586 hours (10-14 GB) are the combined time of inserts into the primary database's audit table. It was similar yesterday when the 9 1/2 hour difference in data was noticed.
When the attempt was made to remove the replica (another person started the process), there was about a 10 GB difference in size.
Can't compare the data, but can show the DB size at the equivalent time:
Primary DB Size (last 24 hours):
Secondary DB Size (last 24 hours):
Primary database size - week view
Secondary database size - week view
As mentioned ... it is being removed as a replica... but is still playing catch up with the DB size if you observe the charts above.
Stop replication errored for serverName: ---------, databaseName: Cloud4
ErrorCode: 400
ErrorMessage: A terminate operation is already in progress for database 'Cloud4'.
Update 2 - Replication - dm_continuous_copy_status
Replication is still removing ... moving on...
select * from sys.dm_continuous_copy_status
sys.dm_exec_requests
Querying from Thursday
Appears to be quite empty. The only record being
Replica removed itself at last.
The replica removed itself after 2 days, at the 80 GB mark I predicted. It waited to replay the transactions (up to the point at which it was removed as a replica) before it would remove itself.
A week later on the P2 databases
DTU is holding between 20-40% at busy periods, and the database is currently taking ~12 million data entries every 24 hours (a similar amount for reads, but writing is much worse on the indexes and the table); 70-100% more inserts than a week ago. This time the replica is not struggling to keep up, which is good, but that is likely because it is not reaching 100% DTU.
Conclusion
The replicas are useful, but not in this case. This one caused degraded performance for several days that could have been averted by a simple increase to a higher performance tier until the cause of the problem was fixed. If the replica looks like it is dragging behind and you are on the border of Basic -> Standard or Standard -> Premium, it would be safest to remove the replica as soon as possible and move up to the higher tier.
Now we are on P2. The database is increasing at 20 GB a day... and they say they have fixed the problem that sends 15 thousand redundant updates per minute. Thanks to Query Performance Insight for highlighting that, as querying the table is extremely painful on the DTU (even querying the last minute of data in that table is bad on the DTU; ~15 thousand new records every minute).
62617: insert ...
62618: select ...
62619: select ...
A positive from the above is that the insert statements have moved from 586 hours of combined time (7.5 million rows inserted per day) on S3 to 3 hours on P2 (12.4 million rows per day) - an extremely significant decrease in processing time. It did start with an empty table on Thursday, but that table has surpassed the previous one's size within a week, whereas the previous one took a few months to get there.
It's doing well on the new tier. It should be at ~5% DTU if the applications were using the database responsibly, and the secondary is up to date.
Spoke too soon. Now on P2
Someone thought it was a good idea to run a repeating SQL query that deletes a thousand rows at a time, against a table receiving 12 million new rows a day.
Between 10 AM and 12 AM it managed to remove about 5.2 million rows. Now the database is showing signs of being in the same state as last week. I'm curious whether that is what happened then too.

How to analyse a log with > 30m measurements

Consider a Java application that receives financial trading transactions and determines their validity by applying several checks, such as whether the transaction is allowed under contractual and legal constraints. The application implements a JMS message handler to receive messages on one queue, and uses a second queue to send the message back to the consumer.
In order to measure response times and enable post-processing performance analysis, the application logs the start and end time of several steps, e.g. reception of the message, processing, preparing and sending the answer back to the client. There are approx. 3 million messages received by the application per day, and hence a multiple of this number of time measurements (around 18 million logged measurements a day). Each measurement consists of the following data: the ID of the measurement (e.g. RECEIVE_START/END, PROCESS_START/END, SEND_START/END), a timestamp as given by java.lang.System.nanoTime(), and a unique message id. The time measurements are written to a log file.
To find the processing times, the log file is transformed and stored in a MySQL database on a daily basis. This is done by a sequence of Python scripts that take the raw log data, transform and store it into a MySQL table, whereby each record corresponds to one processed message, with each measurement in one column (i.e. the table groups records by the unique message id).
My question is this: what are the best tactics and tools to analyse this relatively large data set (consider a month or several months' worth of log data)? In particular I would like to calculate and graph:
a) the distribution of measurements in terms of response times (e.g. SEND_END - RECEIVE_START), for a selected time frame (e.g. monthly, daily, hourly).
b) the frequencies of messages per time unit (second, hour, day, week, month), over a selected time period (e.g. day, week, month, year)
Any hints or reports on your own experience are appreciated.
We've had a lot of success with Splunk for processing/reporting on large log files. It's a tool that's built specifically for that purpose. You can run SQL-like queries on your data files to get the kind of reports/graphs you are looking for. I believe it can be pretty expensive though; IIRC they charge you based on the amount of data that you process.
http://www.splunk.com/?r=header
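Alternatively, a small pandas sketch of (a) and (b) from the question, run against the MySQL table described above. The column names (message_id, received_at, receive_start, send_end) and the DSN are assumptions, since the real schema is not shown; receive_start/send_end are taken to be nanosecond counters and received_at a wall-clock timestamp:

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://user:password@localhost/measurements")  # placeholder DSN

df = pd.read_sql(
    "SELECT message_id, received_at, receive_start, send_end FROM measurements",
    engine,
    parse_dates=["received_at"],
)

# (a) distribution of end-to-end response times in milliseconds
df["response_ms"] = (df["send_end"] - df["receive_start"]) / 1_000_000
print(df["response_ms"].describe(percentiles=[0.5, 0.9, 0.99]))
df["response_ms"].hist(bins=100)  # histogram of the latency distribution

# (b) message frequency per hour over the selected period
per_hour = df.set_index("received_at").resample("1h")["message_id"].count()
per_hour.plot()  # messages per hour
plt.show()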

Rebilling Customers using Cron?

We need to rebill a certain number of customers on any given day.
Currently, we run a cron job every 5 minutes to bill 20 people, send invoices, etc.
However, as the number of customers grows, extending this to 100 people per 5 minutes may result in cron runs overlapping and customers being billed twice.
I have two thoughts:
Running the cron job once, but making it sleep for some amount of time after every 20 customers billed/invoiced so that we don't spam the API.
Using a message queue where customers are added to the queue and "workers" process it (a rough sketch of this idea follows below). The problem is I have no experience with this, so I'm not sure which is the best route to take.
Does anyone have any experience in this?
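Not a full queue system, but a minimal sketch of the idea behind the second option: workers "claim" customers atomically in the database before billing them, so overlapping cron runs (or multiple workers) can never bill the same customer twice. Table and column names are assumptions, bill() is a placeholder for the real invoicing/API call, and SKIP LOCKED requires MySQL 8 or PostgreSQL 9.5+:

import time

BATCH_SIZE = 20

def bill(customer_id):
    """Placeholder for the real billing/invoice API call."""
    raise NotImplementedError

def claim_batch(conn):
    """Atomically mark up to BATCH_SIZE due customers as 'billing' and return their IDs."""
    cur = conn.cursor()
    cur.execute(
        "SELECT id FROM rebills WHERE status = 'due' "
        "ORDER BY id LIMIT %s FOR UPDATE SKIP LOCKED",
        (BATCH_SIZE,),
    )
    ids = [row[0] for row in cur.fetchall()]
    if ids:
        placeholders = ",".join(["%s"] * len(ids))
        cur.execute(
            f"UPDATE rebills SET status = 'billing' WHERE id IN ({placeholders})", ids
        )
    conn.commit()
    return ids

def worker(conn):
    while True:
        batch = claim_batch(conn)
        if not batch:
            break  # nothing left to bill
        for customer_id in batch:
            bill(customer_id)
            cur = conn.cursor()
            cur.execute("UPDATE rebills SET status = 'billed' WHERE id = %s", (customer_id,))
            conn.commit()
        time.sleep(1)  # throttle so the billing API is not spammed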
