AWS Aurora Serverless spewing 2 particular errors continuously - amazon-rds

Looking in CloudWatch logs, I see these errors repeated continuously, approximately twice a second:
2019-03-10 00:56:12 2b1be9674700[AWS_OSCAR_ERROR]:Failed to read persisted replica status info. Error code = 11
2019-03-10 00:56:12 2b1be9674700[AWS_OSCAR_ERROR]:Failed to update replication status table (stage 2)
Does anyone know what these are? I've tried Googling the exact and partial error messages and was amazed to find that there's nothing out there (well, Google at least has never seen them!).
I am concerned because, even if they are nothing to worry about, they are generating millions of log entries and thus costing me a lot in AWS fees.
Thanks in advance.

This looks like an Aurora-internal error message (which explains why Google didn't have much to offer when you searched for it). I think the right way forward is to contact AWS Support.

Related

OpenAI API 429 rate limit error without reaching the rate limit

On occasion I'm getting a rate limit error without being over my rate limit. I'm using the text completions endpoint on the paid API, which has a rate limit of 3,000 requests per minute. I am using at most 3-4 requests per minute.
Sometimes I will get the following error from the api:
Status Code: 429 (Too Many Requests)
Open Ai error type: server_error
Open Ai error message: That model is currently overloaded with other requests. You can retry your request, or contact us through our help center at help.openai.com if the error persists.
OpenAI's documentation states that a 429 error indicates you have exceeded your rate limit, which I clearly have not.
https://help.openai.com/en/articles/6891829-error-code-429-rate-limit-reached-for-requests
The weird thing is that the OpenAI error message doesn't say that. It gives the response I usually get from a 503 error (Service Unavailable).
I'd love to hear some thoughts on this, any theories, or if anyone else has been experiencing this.
I have seen a few posts on the OpenAI community forum with similar messages. I suggest checking out the error code guide, which has suggestions for mitigating these errors. In general, though, it's possible the model itself was down and this has nothing to do with your rate limit: https://platform.openai.com/docs/guides/error-codes
This error indicates the OpenAI servers have too many requests from all users and their servers have reached their capacity to service your request. It's pretty common at the moment.
Hopefully they will upgrade their servers soon. I'm not really sure why this is such a problem, since they run on Azure and should be able to scale with ramped-up demand. Maybe they are just trying to minimise costs.
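Whatever the cause, the practical mitigation for an intermittent 429 is a retry with exponential backoff, which OpenAI's error-code guide also recommends. Here is a minimal generic sketch (the `RateLimitError` class is a hypothetical stand-in for whatever exception your client raises on a 429):

```python
import random
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for the exception your client raises on HTTP 429."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying on RateLimitError with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # Sleep 1s, 2s, 4s, ... plus up to 1s of random jitter.
            time.sleep(base_delay * (2 ** attempt) + random.random())
```

The jitter matters: if many clients back off on the same schedule, they all retry at once and hit the overloaded server together.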

Azure Data Factory error DFExecutorUserError error code 1204

So I am getting an error in Azure Data Factory that I haven't been able to find any information about. I am running a data flow and eventually (after an hour or so) get this error:
{"StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: The service has encountered an error processing your request. Please try again. Error code 1204.","Details":"The service has encountered an error processing your request. Please try again. Error code 1204."}
Troubleshooting I have already done:
I have successfully run the data flow using the sample option. I did this with 1 million rows.
I am processing 3 years of data, and I have successfully processed all of it by filtering by year and running the data flow once for each year.
So I think I have shown the data isn't the problem, because I have processed all of it by breaking it into 3 runs.
I haven't found a pattern in the time the pipeline runs for before the error occurs that would indicate I am hitting any timeout value.
The source and sink for this data flow are both an Azure SQL Server database.
Does anyone have any thoughts? Any suggestions for getting a more verbose error out of Data Factory? (I already have the pipeline set to verbose logging.)
We are glad to hear that you have found the cause:
"I opened a Microsoft support ticket and they are saying it is a
database transient caused failure."
I think the error will resolve itself automatically. I'm posting this as an answer so it can benefit other community members. Thank you.
Update:
The most important thing is that, in the end, you resolved it by increasing the vCores.
"The only thing they gave me was their BS article on handling
transient errors. Maybe I’m just old but a database that cannot
maintain connections to it is not very useful. What I’ve done to
workaround this is increase my vCores. This sql database was a
serverless one. While performance didn’t look bad my guess is the
database must be doing some sort of resize in the background to
handle the hour long data builds I need it to do. I had already tried
setting the min/max vCores to be the same. The connection errors
disappeared when I increased the vCores count to 6."
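The "handling transient errors" guidance the asker mentions boils down to retrying only when the SQL error number is on a known-transient list. A minimal sketch of that check (the error numbers come from Microsoft's transient-fault guidance for Azure SQL; the list here is partial, so verify against the current docs before relying on it):

```python
# Azure SQL error numbers commonly treated as transient/retryable,
# per Microsoft's transient-fault guidance (partial list, an assumption
# to verify against the current documentation):
TRANSIENT_SQL_ERRORS = {4060, 40197, 40501, 40613, 49918, 49919, 49920}

def should_retry(error_number: int) -> bool:
    """Return True when the SQL error number is a known transient fault."""
    return error_number in TRANSIENT_SQL_ERRORS
```

A retry wrapper would call this on each failure and re-raise immediately for anything not on the list (e.g. a login failure), since retrying a permanent error just wastes time.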

Azure DocumentDB Throttled Requests

I have a DocumentDB database on Azure. I have a particularly heavy query that runs when I archive a user record and all of their data.
I was on the S1 plan and would get an exception indicating I was hitting the RU/s limit. The S1 plan has 250 RU/s.
I decided to switch to the Standard plan that lets you set the RU/s and pay for it.
I set it to 500 RU/s.
I did the same query and went back and looked at the monitoring chart.
At the time I did this latest query test it said I did 226 requests and 10 were throttled.
Why is that? I set it to 500 RU/s. The query had failed, by the way.
Firstly, Requests != Request Units, so your 226 requests will at some point have caused more than 500 Request Units to be needed within one second.
The DocumentDb API will tell you how many RUs each request costs, so you can examine that client-side to find out which request is causing the problem. From my experience, even a simple by-id request often costs at least a few RUs.
How you see that cost is dependent on which client-side SDK you use. In my code, I have added something to automatically log all requests that cost more than 10 RUs, just so I know and can take action.
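As an illustration of that kind of logging, here is a minimal sketch; it assumes your SDK exposes the raw response headers, and uses the `x-ms-request-charge` header that the DocumentDB REST API returns with every response:

```python
RU_THRESHOLD = 10.0  # flag anything costing more than 10 RUs

def log_if_expensive(operation_name, response_headers):
    """Record the RU charge of a request and flag expensive ones.

    `response_headers` is the raw header dict from the response; the
    x-ms-request-charge header is part of the DocumentDB REST API.
    """
    charge = float(response_headers.get("x-ms-request-charge", "0"))
    if charge > RU_THRESHOLD:
        print(f"Expensive request: {operation_name} cost {charge} RUs")
    return charge
```

Run every call through a wrapper like this for a day and the offending query usually identifies itself.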
It's also the case that the monitoring tools in the portal are quite inadequate, and I know the team is working on that; you can only see the total RUs per five-minute interval, so if you burn 600 RUs in a single second you can't really see that in the portal.
In your case, you may have a single big query that simply costs more than 500 RUs; the logging will tell you. In that case, look at the generated SQL to see why, and maybe even post it here.
Alternatively, it may be the cumulative effect of lots of small requests being fired off in a small time window. If you are doing 226 requests in response to one user action (and I don't know if you are) then you probably want to reconsider your design :)
Finally, you can retry failed requests. I'm not sure about other SDKs, but the .NET SDK automatically retries a request 9 times before giving up (that might be another explanation for the 226 requests hitting the server).
If your chosen SDK doesn't retry, you can easily do it yourself; the server returns status code 429 (Too Many Requests) along with a header telling you how long to wait before retrying.
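A rough sketch of that do-it-yourself retry loop, assuming `send` is a hypothetical function wrapping your HTTP call and returning a `(status_code, headers, body)` tuple; the `x-ms-retry-after-ms` header name follows the DocumentDB REST API:

```python
import time

def execute_with_retry(send, max_attempts=9):
    """Retry a request on HTTP 429, honoring the x-ms-retry-after-ms
    header the server sends back. `send` is a stand-in for your client call."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        # Wait as long as the server asked before trying again.
        wait_ms = int(headers.get("x-ms-retry-after-ms", 1000))
        time.sleep(wait_ms / 1000.0)
    return status, body
```

Honoring the server-supplied delay matters: retrying immediately just burns more of the same one-second RU budget that caused the throttle.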
Please examine the queries and update your question so we can help further.

Set visibilitytimeout to 7days means automatically delete for azure queue?

It seems the new Azure SDK extends the visibilitytimeout to up to 7 days. I know that by default, when I add a message to an Azure queue, its time-to-live is 7 days. If I get a message out and set the visibilitytimeout to 7 days, does that mean I don't need to delete the message if I don't care about message reliability? The message will disappear after 7 days anyway.
I want to take this approach because DeleteMessage is very slow. If I don't delete messages, will that have any impact on the performance of GetMessage?
Based on the documentation for Get Messages, I believe it is certainly possible to set the VisibilityTimeout period to 7 days so that messages are fetched only once. However I see some issues with this approach instead of just deleting the message once the process is done:
What happens when you get the message and start processing it and somehow the process fails? If you set the visibility timeout to be 7 days, then the message would never appear in the queue again and thus the process it was supposed to do never gets done.
Even though the message is hidden, it is still in the queue, so you keep incurring storage charges for it. The cost is trivial, but why keep a message you don't really need?
A lot of systems rely on the Approximate Messages Count property of a queue to check the health of the processes its messages drive. Note that even though a message is hidden, it is still in the queue and thus is included in the queue's total message count. So if you're building a system that relies on this for health checks, it will always report unhealthy because you're never deleting the messages.
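To make the trade-off concrete, the conventional receive/process/delete pattern looks roughly like this (the `receive`, `handle`, and `delete` callables are hypothetical stand-ins for your queue SDK's corresponding calls):

```python
def drain_queue(receive, handle, delete, visibility_timeout=30):
    """Receive/process/delete: each message stays invisible for
    `visibility_timeout` seconds while it is processed, and is deleted
    only after `handle` succeeds. If processing crashes, the message
    reappears after the timeout and is retried instead of being lost."""
    processed = 0
    while (msg := receive(visibility_timeout)) is not None:
        try:
            handle(msg)
        except Exception:
            continue  # leave it; it becomes visible again after the timeout
        delete(msg)
        processed += 1
    return processed
```

Setting the visibility timeout to 7 days collapses this into fire-and-forget: faster per message, but any failure during `handle` silently loses the work.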
I'm curious to know why you find deleting messages to be very slow. In my experience this is quite fast. How are you monitoring message deletion?
Rather than hacking around the problem, I think you should drill into why the deletes are slow. Have you enabled logging and looked at the E2ELatency and ServerLatency numbers across all your queue operations? Ideally you shouldn't see a large difference between the two for any of your queue operations. If you do see a large difference, it implies something is happening on the client that you should investigate further.
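A quick way to collect the client-side half of that comparison is to time each operation end to end and set the measurement against the server-side latency reported in the storage analytics $logs; a minimal sketch:

```python
import time

def timed_call(op):
    """Run a queue operation and return (result, end-to-end latency in ms).

    Compare the measured value against the ServerLatency field in the
    storage analytics $logs records; a large gap between the two points
    at a client-side problem rather than the storage service."""
    start = time.perf_counter()
    result = op()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

For example, `timed_call(lambda: queue.delete_message(msg))` (with `queue` being whatever client object your SDK gives you) tells you what the caller actually experiences per delete.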
For more information on logging take a look at the following articles:
http://blogs.msdn.com/b/windowsazurestorage/archive/tags/analytics+2d00+logging+_2600_amp_3b00_+metrics/
http://msdn.microsoft.com/en-us/library/azure/hh343262.aspx
Information on client-side logging can also be found in this blog post: http://blogs.msdn.com/b/windowsazurestorage/archive/2013/09/07/announcing-storage-client-library-2-1-rtm.aspx
Please let me know what you find.
Jason

Turning On Diagnostics in Azure eats up transactions - MACommand.xml

We were just trying out the Azure Storage Analytics Service, and something very unusual caught our attention.
The transaction count for the diagnostics storage account (the account to which the Diagnostics Service writes its data) was extremely high. We are talking about ~600 transactions per hour, all of which are GetBlob() operations, and all of them end in an error (ClientOtherError equals the total number of operations). Further investigation revealed that each running instance with Diagnostics turned on produces ~300 transactions per hour (we had 2 instances, hence the 600). Continuing the investigation and looking at the $logs the Analytics Service produces revealed what's really going on:
The log is filled with calls to an XML file that doesn't exist. The log file itself is very cluttered, but it's clear that most of the calls are looking for
https://****.blob.core.windows.net/mam/MACommand.xml and also /mam/MACommanda.xml and /mam/MACommandb.xml
All of those calls fail with a 404 error.
This issue is a real problem for us, and we have no idea what causing it.
Has anyone encountered this issue ?
(Edit: I forgot to mention that the Diagnostics Service is not logging anything; scheduledTransferPeriod is zero for all the categories.)
Those transactions are expected behavior since SDK 1.6.
See full explanation here:
http://social.msdn.microsoft.com/Forums/en-US/windowsazuretroubleshooting/thread/2e2f46dd-638a-4af1-b8ac-cfd7659a3171
