Is there any possibility that a request gets a 500 timeout error after swapping in the Azure Portal? - azure

I ran into an issue when performing a swap in the Azure Portal.
Our App Service has 2 slots, production and blue, and we have 2 instances. Both slots and both instances worked well before the swap. The swap action swaps blue into production (not a swap-with-preview).
After the swap, one of the instances did not seem to work properly. The website log looks like this:
06:07:44, 1 request for SiteWarmup, response as 200 OK
06:07:45 ~ 06:07:59, 2 times heartbeat and response as 200 OK
06:08:00 ~ 06:09:27, no request
06:09:28 ~ 06:09:36, 13 requests for our APIs, 200 responses but taking a long time (about 1 second normally, but over 20 seconds here)
06:09:37 ~ 06:11:41, no request again
06:11:42 ~ 06:23:38, slightly fewer than 2000 requests; all except 10 heartbeats responded with a 500 timeout, after 230 seconds for each API call.
In the application log there are only 213 records from 06:07:53 to 06:09:36, and nothing in the following 14 minutes. 202 of them are Token Acquisition operations for Microsoft.IdentityModel.Clients.ActiveDirectory.TokenCache, but they all got the same hash, which looks strange to me.
2017-04-19T06:09:34,Information,{my-service-name},d659c0,636281789745216643,0,4252,30,"4/19/2017 6:09:34 AM: 88404b06-f143-4de1-99c1-aa2779516c3b - AcquireTokenHandlerBase: === Token Acquisition started:
Authority: https://login.windows.net/{my-domain}/
Resource: https://graph.windows.net
ClientId: {my-client-id}
CacheType: Microsoft.IdentityModel.Clients.ActiveDirectory.TokenCache (1 items)
Authentication Target: Client
",00000000-0000-0000-0200-008000000056
2017-04-19T06:09:34,Information,{my-service-name},d659c0,636281789745216643,0,4252,30,4/19/2017 6:09:34 AM: 88404b06-f143-4de1-99c1-aa2779516c3b - TokenCache: Looking up cache for a token...,00000000-0000-0000-0200-008000000056
2017-04-19T06:09:34,Information,{my-service-name},d659c0,636281789745216643,0,4252,30,4/19/2017 6:09:34 AM: 88404b06-f143-4de1-99c1-aa2779516c3b - TokenCache: An item matching the requested resource was found in the cache,00000000-0000-0000-0200-008000000056
2017-04-19T06:09:34,Information,{my-service-name},d659c0,636281789745216643,0,4252,30,4/19/2017 6:09:34 AM: 88404b06-f143-4de1-99c1-aa2779516c3b - TokenCache: 58.305389355 minutes left until token in cache expires,00000000-0000-0000-0200-008000000056
2017-04-19T06:09:34,Information,{my-service-name},d659c0,636281789745216643,0,4252,30,4/19/2017 6:09:34 AM: 88404b06-f143-4de1-99c1-aa2779516c3b - TokenCache: A matching item (access token or refresh token or both) was found in the cache,00000000-0000-0000-0200-008000000056
2017-04-19T06:09:34,Information,{my-service-name},d659c0,636281789745216643,0,4252,30,"4/19/2017 6:09:34 AM: 88404b06-f143-4de1-99c1-aa2779516c3b - AcquireTokenHandlerBase: === Token Acquisition finished successfully. An access token was retuned:
Access Token Hash: {same-hash}
Refresh Token Hash: [No Refresh Token]
Expiration Time: 4/19/2017 7:07:52 AM +00:00
User Hash: null
",00000000-0000-0000-0200-008000000056
Meanwhile, the other instance worked well: over 100 requests per minute, nothing other than 200 OK, and processing times were also normal, mostly under 1 second.
I also noticed 2 things in the Metrics Per Instance monitoring:
the affected instance had CPU usage about 10 times higher than the other one;
the affected instance had an Http Queue Length of 8.0, while I could find nothing other than 0.0 at any other time or on the other instance.
Then I swapped production back to blue, and both instances and slots worked again.
This is the first time we have hit this issue; it never happened before on the same deployment. What could possibly cause it, and how can I resolve or avoid it?
Thank you for your advice in advance.
Regards
Kong

Related

Availability tests: is it possible to HTTP-ping an endpoint with AAD enabled?

To better understand this and then implement it in our projects, I've created a Function App with 3 endpoints (source code in this gist) and configured 6 Availability Tests (of the Ping Test kind):
oktest: https://myfuncname.azurewebsites.net/api/ok (parse dependent requests = false)
oktest2: https://myfuncname.azurewebsites.net/api/ok (parse dependent requests = true)
badtest: https://myfuncname.azurewebsites.net/api/bad (parse dependent requests = false)
badtest2: https://myfuncname.azurewebsites.net/api/bad (parse dependent requests = true)
errtest: https://myfuncname.azurewebsites.net/api/err (parse dependent requests = false)
errtest2: https://myfuncname.azurewebsites.net/api/err (parse dependent requests = true)
The Function App is configured with AAD using Client ID and Allowed Token Audiences.
/ok returns 200 content OK
/bad returns 400 content BAD
/err returns 500
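The behaviour of the three endpoints, as described above, can be sketched as plain handlers (an illustrative Python sketch of the described behaviour only, not the actual Function App code from the gist):

```python
# Hypothetical handlers mirroring the three test endpoints described above.

def ok_handler():
    # /ok returns 200 with body "OK"
    return 200, "OK"

def bad_handler():
    # /bad returns 400 with body "BAD"
    return 400, "BAD"

def err_handler():
    # /err returns 500 with no body
    return 500, ""
```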
Testing the Azure endpoints with Postman (using a valid Bearer Token) produces the expected results, the same as in the local hosting environment.
I'm expecting 100% success in oktest and oktest2, and 100% failures in the other tests.
I'm getting these results:
oktest: success 21 fail 0
oktest2: success 17 fail 0
badtest: success 20 fail 2
badtest2: success 17 fail 0
errtest: success 21 fail 1
errtest2: success 17 fail 0
Then I set authentication to Allow Anonymous and got these results after some cycles:
oktest: success 31 fail 0
oktest2: success 36 fail 0
badtest: success 29 fail 9
badtest2: success 25 fail 8
errtest: success 28 fail 8
errtest2: success 24 fail 7
It's quite clear that the first authentication settings prevented the endpoints from being HTTP-pinged. Is it possible to keep AAD authentication, or do we have to rethink our network architecture?
Any help or suggestion will be much appreciated!
Regards,
Giacomo S. S.
I think I've understood the situation.
When Allow Anonymous is enabled, the HTTP ping correctly fails in 'badtest'.
When AAD authentication is enabled, the HTTP ping returns a success because the
Availability Test is able to authenticate and then returns 200 OK for that (a successful authentication).
Looking at these details, it's possible to explain every single result.
My conclusion: Availability Tests are oriented more towards testing web pages than REST services.
Confirmation can be found in the MS docs.
If anyone has something to add, I'm very interested!
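A likely explanation for the "success" results under AAD is that the ping follows the redirect to the Azure AD login page and counts the login page's 200 as a pass. A hedged sketch of how one could classify such results (an illustrative function, not part of the gist or of Availability Tests):

```python
def is_real_success(status_code, final_url):
    """Classify a ping result: a 200 whose final URL (after redirects)
    is the AAD login page is an authentication redirect, not proof
    that the API endpoint itself is healthy."""
    login_hosts = ("login.microsoftonline.com", "login.windows.net")
    if status_code != 200:
        return False
    # If we ended up on the AAD login page, the 200 belongs to the
    # login page, not to the API we meant to ping.
    return not any(host in final_url for host in login_hosts)

# Example classifications:
is_real_success(200, "https://myfuncname.azurewebsites.net/api/ok")  # True
is_real_success(200, "https://login.microsoftonline.com/common/oauth2/authorize")  # False
```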

Azure website timing out after long process

Team,
I have an Azure website published on Azure. The application reads around 30,000 employees from an API, and after the read succeeds it updates the secondary Redis cache with all 30,000 employees.
The timeout occurs in the second step, when it updates the secondary Redis cache with all the employees. From my local machine it works fine, but as soon as I deploy to Azure it gives me:
500 - The request timed out.
The web server failed to respond within the specified time
From blog posts I learned that the default timeout for an Azure website is 4 minutes.
I have tried all the fixes suggested on the blogs, such as setting SCM_COMMAND_IDLE_TIMEOUT in the application settings to 3600.
I even tried putting the Azure Redis cache session state provider settings in the web.config with inflated timeout figures:
<add type="Microsoft.Web.Redis.RedisSessionStateProvider" name="MySessionStateStore" host="[name].redis.cache.windows.net" port="6380" accessKey="QtFFY5pm9bhaMNd26eyfdyiB+StmFn8=" ssl="true" abortConnect="False" throwOnError="true" retryTimeoutInMilliseconds="500000" databaseId="0" applicationName="samname" connectionTimeoutInMilliseconds="500000" operationTimeoutInMilliseconds="100000" />
The offending code responsible for the timeout is this:
public void Update(ReadOnlyCollection<ColleagueReferenceDataEntity> entities)
{
    // Trace.WriteLine("Updating the secondary cache with colleague data");
    var secondaryCache = this.Provider.GetSecondaryCache();
    foreach (var entity in entities)
    {
        try
        {
            secondaryCache.Put(entity.Id, entity);
        }
        catch (Exception ex)
        {
            // If a record fails - log and continue.
            this.Logger.Error(ex, string.Format("Error updating a colleague in secondary cache: Id {0}", entity.Id));
        }
    }
}
Is there anything I can change in this code?
Please, can anyone help? I have run out of ideas!
You're doing it wrong! Redis is not the problem. The main request thread itself is getting terminated before the process completes. You shouldn't make a request wait that long: there is a hard limit of 230 seconds on in-flight requests, and it can't be changed.
Read here: Why does my request time out after 230 seconds?
Assumption #1: You're loading the data on very first request from client-side!
Solution: If the 30000 employees record is for the whole application, and not per specific user - you can trigger the data load on app start-up, not on user request.
Assumption #2: You have individual users and for each of them you have to store 30000 employees data, on the first request from client-side.
Solution: Add a background job (maybe a WebJob or Azure Function) to process the task. Upon request from the client, return a 202 Accepted with the job-status location in the header. The client can then poll the task status at a certain frequency and update the user accordingly!
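The accept-and-poll pattern described in Assumption #2 can be sketched as follows (a minimal Python sketch with an in-memory job store and illustrative names; a real implementation would hand the work to a WebJob or Azure Function and use durable storage):

```python
import uuid

# In-memory job store for the sketch; a real app would use durable
# storage shared with the background worker.
jobs = {}

def start_refresh():
    """Handle the client's request: register the cache refresh job and
    return 202 Accepted with a status URL instead of blocking."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = "running"
    # In a real app, hand the work off to a background worker here.
    return 202, {"Location": f"/jobs/{job_id}"}

def complete_job(job_id):
    """Stands in for the background worker: load the 30,000 employees,
    update the Redis cache, then mark the job done."""
    jobs[job_id] = "done"

def poll_status(job_id):
    """The client polls this endpoint until the job completes."""
    state = jobs.get(job_id, "unknown")
    return (200, state) if state == "done" else (202, state)
```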
Edit 1:
For Assumption #1 - you can try batching the objects while pushing them to Redis. Currently you update one object at a time, which means 30,000 requests; that will definitely exhaust the 230-second limit. As a quick solution, batch multiple objects into one request to Redis. I hope it does the trick!
UPDATE:
As you're using StackExchange.Redis, use the pattern already mentioned here to batch the objects:
Batch set data from Dictionary into Redis
The number of objects per request varies depending on the payload size and the available bandwidth. As your site is hosted on Azure, I don't think bandwidth will be much of a concern.
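The batching idea can be illustrated with a simple chunking helper (a Python sketch of the grouping logic only; with StackExchange.Redis the equivalent is sending each chunk in a single call, e.g. StringSet with an array of key/value pairs, or a batch created via CreateBatch):

```python
def chunked(items, size):
    """Split a list into lists of at most `size` items, so each chunk
    becomes one round trip to Redis instead of one trip per item."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 30,000 single PUTs become 30 round trips with batches of 1,000.
entities = list(range(30000))
batches = chunked(entities, 1000)
```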
Hope that helps!

Taurus: Play a scenario every 5 minutes

I have an authentication scenario which returns a token. After 5 minutes (for example) the token expires, but this token is mandatory for the success of the other scenarios.
Now, I don't really want to run this scenario every time before the other scenarios.
Ideally, I would run it a first time, get the token, and rerun the authentication scenario when the expiration time is reached.
Currently, my YAML file follows this logic:
execution:
- scenario: mainload

scenarios:
  authenticate:
    requests:
    - http://auth.com
  mainload:
    requests:
    - include-scenario: http://needToken.com
    - http://needToken.com
So, how can I do this with Taurus inside a YAML file? For example, wait 5 minutes before relaunching the scenario?
Have a nice day.
You can create 2 scenario elements, one for authentication and another for the main load. The relevant Taurus YAML syntax would be something like:
execution:
- scenario: authenticate
- scenario: mainload

scenarios:
  authenticate:
    think-time: 5m
    requests:
    - http://example.com
  mainload:
    requests:
    - http://blazedemo.com
The think-time attribute basically adds a Constant Timer with 5 minutes of "sleep" time, so the request to example.com will be executed every 5 minutes while the others are fired without delays.
References:
Building Test Plan from Config
Taurus Configuration Syntax
Taurus - Working with Multiple JMeter Tests

Atlassian-connect: Error on 'installed' event

I'm trying to run the example Jira add-on.
I created the credentials.json file and ran npm i and node app.js.
But I have problems with the installed event. Here is the nodejs log:
Watching atlassian-connect.json for changes
Add-on server running at http://MacBook-Air.local:3000
Initialized sqlite3 storage adapter
Local tunnel established at https://a277dbdf.ngrok.io/
Check http://127.0.0.1:4040 for tunnel status
Registering add-on...
GET /atlassian-connect.json 200 13.677 ms - 784
Saved tenant details for 608ff294-74b9-3edf-8124-7efae2c16397 to database
{ key: 'my-add-on',
clientKey: '608ff294-74b9-3edf-8124-7efae2c16397',
publicKey: 'MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQCtKxrEBipTMXhRHlv9zcSLR2Y9h5YQgNQ5vpJ40tF9RmuIzByjkKTurCLHFwMAWU6aLQM+H+Z8wAlpL9AVlN5NKrEP8+a3mGFUOj/5nSJ7ZWHjgju0sqUruyEkKLvKuhWkKkd9NqBxogN0hxv7ue5msP5ezwei/nTJXmnmA5qOAQIDAQAB',
sharedSecret: 'LfT9elHM7iHkto5pHr+MnpH0SR1ypunIDoCyt6ugVJ1Q4hWHurG8k5DjVzLcvT2C98DDbiJiA89VNB0e3DiUvQ',
serverVersion: '100075',
pluginsVersion: '1.3.407',
baseUrl: 'https://gleb-olololololo-22.atlassian.net',
productType: 'jira',
description: 'Atlassian JIRA at https://gleb-olololololo-22.atlassian.net ',
eventType: 'installed' }
POST /installed?user_key=admin 204 51.021 ms - -
Failed to register with host https://gleb-olololololo-22%40yopmail.com:gleb-olololololo-22#gleb-olololololo-22.atlassian.net (200)
The add-on host did not respond when we tried to contact it at "https://a277dbdf.ngrok.io/installed" during installation (the attempt timed out). Please try again later or contact the add-on vendor.
{"type":"INSTALL","pingAfter":300,"status":{"done":true,"statusCode":200,"contentType":"application/vnd.atl.plugins.task.install.err+json","errorMessage":"The add-on host did not respond when we tried to contact it at \"https://a277dbdf.ngrok.io/installed\" during installation (the attempt timed out). Please try again later or contact the add-on vendor.","source":"https://a277dbdf.ngrok.io/atlassian-connect.json","name":"https://a277dbdf.ngrok.io/atlassian-connect.json"},"links":{"self":"/rest/plugins/1.0/pending/80928cb9-f64e-42d0-9a7e-a1fe8ba81055","alternate":"/rest/plugins/1.0/tasks/80928cb9-f64e-42d0-9a7e-a1fe8ba81055"},"timestamp":1513692335651,"userKey":"admin","id":"80928cb9-f64e-42d0-9a7e-a1fe8ba81055"}
Add-on not registered; no compatible hosts detected
I have reviewed tons of information on Google but didn't find an answer.
Some more details that may help:
It happened suddenly. It worked OK, but about 1 week ago I started getting this error and cannot fix it. I didn't change anything; I just ran the add-on again, as I do every day.
If I try to upload the add-on manually, I get this error in the terminal:
GET / 302 17.224 ms - 0
GET /atlassian-connect.json 200 2.503 ms - 783
Found existing settings for client 608ff294-74b9-3edf-8124-7efae2c16397. Authenticating reinstall request
Authentication verification error: 401 Could not find authentication data on request
POST /installed?user_key=admin 401 22.636 ms - 45
The most likely reason (that I've found on Google) is a wrong server time, but the time on my local machine is correct (at least for my timezone).
Anyone has any thoughts about this problem?
Thanks!
This kept happening to me at random. It would be working, then I'd run npm start and get the error. Since I'm not using a database right now, I simply removed all references to the juggling-sqlite database: in package.json, package-lock.json, and config.json, and I deleted store.db. That got it working for me. Pretty frustrating that this happens; I'm not sure of a better way around it.

Job schedule entry could not be created. Status code: 500

I have received the following error when trying to save a DSX Scheduled Job:
Job schedule entry could not be created. Status code: 500
Screenshot of the error message:
I've tried about six times over the last few hours and have consistently received the above error message.
Debugging via the browser network inspection tool I can see:
{
"code":"CDSX.N.1001",
"message":"The service is not responding, try again later.",
"description":"",
"httpMethod":"POST",
"url":"https://batch-schedule-prod.spark.bluemix.net:12100/schedules",
"qs":"",
"body":"While attempting to get Bluemix Token from BlueIDToken, unable to retrieve AccessToken, failed with status 400 Bad Request, message = {\"error\":\"invalid_grant\",\"error_description\":\"A redirect_uri can only be used by implicit or authorization_code grant types.\"} Entity: {\"error\":\"invalid_grant\",\"error_description\":\"A redirect_uri can only be used by implicit or authorization_code grant types.\"}",
"statusCode":500,
"duration":666,
"expectedStatusCode":201
}
As per Charles' comment, the functionality is working OK now. I guess if this happens to another user sometime in the future, they should contact support.
