I have a Web API hosted on IIS.
I've been getting "Connection Was Reset" at random times in response to my API calls. After several retries, the request gets a proper response.
I checked HTTPERR and found a lot of "Connection_Abandoned_By_ReqQueue" entries near the times I got the Connection Was Reset response. I also sometimes get an application crash in Event Viewer that contains the following:
*Fault bucket , type 0
Event Name: APPCRASH
Response: Not available
Cab Id: 0
Problem signature:
P1: w3wp.exe
P2: 8.5.9600.16384
P3: 52157ba0
P4: SAPbobsCOM90.dll
P5: 9.30.190.0
P6: 5c7f4d65
P7: c0000005
P8: 0089a548
P9:
P10:
Attached files:
These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_w3wp.exe_695eddbfadf3c4a0ec181813e91099c0502fcadb_234314c6_7e0a5f90
Analysis symbol:
Rechecking for solution: 0
Report Id: 6975c2ef-e9ad-11e9-80ca-00155d0b1f02
Report Status: 0
Hashed bucket: *
I managed to export a dump file from IIS at the time the crash happened.
Using WinDbg and DebugDiag, I found that the exception is System.NullReferenceException: Object reference not set to an instance of an object.
But this error makes no sense, since the same request gets a proper response after multiple retries.
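To line up the HTTPERR entries with the crash times, it can help to count the "Connection_Abandoned_By_ReqQueue" lines per minute. The following is a minimal sketch, assuming each log line starts with a date field and a time field (as in W3C-style logs); the sample lines, IP addresses and pool name are made up for illustration.

```python
from collections import Counter

REASON = "Connection_Abandoned_By_ReqQueue"


def count_reason_per_minute(lines, reason=REASON):
    """Count log lines that mention `reason`, bucketed by 'date HH:MM'.

    Assumes the first two whitespace-separated fields are date and time.
    """
    buckets = Counter()
    for line in lines:
        # Skip header/comment lines and lines without the reason phrase.
        if line.startswith("#") or reason not in line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        date, time_of_day = fields[0], fields[1]
        buckets[f"{date} {time_of_day[:5]}"] += 1
    return buckets


# Made-up sample lines, only to show the expected shape of the input.
sample_lines = [
    "#Fields: date time c-ip c-port s-ip s-port cs-version cs-method cs-uri sc-status s-siteid s-reason s-queuename",
    "2019-10-08 10:15:32 10.0.0.5 51234 10.0.0.1 80 HTTP/1.1 GET /api/orders - 1 Connection_Abandoned_By_ReqQueue DefaultAppPool",
    "2019-10-08 10:15:47 10.0.0.5 51240 10.0.0.1 80 HTTP/1.1 POST /api/orders - 1 Connection_Abandoned_By_ReqQueue DefaultAppPool",
    "2019-10-08 10:17:02 10.0.0.6 51301 10.0.0.1 80 - - - - - Timer_ConnectionIdle -",
]

for minute, count in sorted(count_reason_per_minute(sample_lines).items()):
    print(minute, count)
```

Minutes with a burst of entries can then be compared with the APPCRASH timestamps in Event Viewer.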
I use a combination of Lookup and ForEach activities to iterate through a set of data ingestion queries and execute them (the reasons behind that are a separate broad topic :)). As the data source is connected to a private network, I have provisioned a dedicated VM to run the self-hosted integration runtime. In most cases everything runs smoothly: I can see worker processes eating the CPU and high overall CPU utilization (which is good).
But sometimes, when most of the work is done and just 2-3 activities are still waiting in line, I can see that the runtime does no processing: CPU usage drops to zero and no new entries appear in the event log. After some time (approximately 10 minutes) I get event 30002 (an example is provided below) and the runtime happily completes the work.
Example event message:
Job ID: ***-fcab-429a-bb45-***
Task ID: ***-d820-414e-ad8c-***
Queue ID: ***-4f44-4c39-a1c1-***
Log ID: PulledOffNewTask
The question: What could be the root cause of such Azure Data Factory self-hosted integration runtime's behaviour? Can this be fine-tuned?
UPDATE 1
Errors have been spotted in the application log and warnings have been spotted in the integration runtime log.
The application log contains 3 sets of errors (see events [1] to [5] below) that occurred within an interval of ~2 minutes. Shortly after that, 8 events (exactly the number of my worker processes) were logged to the integration runtime log (see [6]), and straight after that "Windows Error Reporting" events appear. Then we face a "freeze".
So - looks like a bug :(
"application" log:
[1]
Application: diawp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException
at Microsoft.DataTransfer.TransferTask.CopyTaskBase.UpdateJobProgress(System.Object)
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.TimerQueueTimer.CallCallback()
at System.Threading.TimerQueueTimer.Fire()
at System.Threading.TimerQueue.FireNextTimers()
[2]
Faulting application name: diawp.exe, version: 3.5.6639.1, time stamp: 0x5aa8cf5f
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x00007ff914402c65
Faulting process id: 0x1bc4
Faulting application start time: 0x01d3d287ef6e34fa
Faulting application path: C:\Program Files\Microsoft Integration Runtime\3.0\Shared\diawp.exe
Faulting module path: unknown
Report Id: 1fe7de4d-5481-478d-b9e7-d542c24ab18a
Faulting package full name:
Faulting package-relative application ID:
[3]: Unable to open the Server service performance object. The first four bytes (DWORD) of the Data section contains the status code.
[4]: The Open Procedure for service "WmiApRpl" in DLL "C:\Windows\system32\wbem\wmiaprpl.dll" failed. Performance data for this service will not be available.
"Integration Runtime" log:
[6]
'Type=System.InvalidOperationException,Message=Instance 'diawp#10' does not exist in the specified Category.,Source=System,StackTrace= at System.Diagnostics.CounterDefinitionSample.GetInstanceValue(String instanceName)
at System.Diagnostics.PerformanceCounter.NextSample()
at System.Diagnostics.PerformanceCounter.NextValue()
at Microsoft.DataTransfer.TransferTask.FormatedPerfCounter.TryGet(Single& value),'
Job ID: 7b629411-c6cd-42d0-9939-e830e58db015
Log ID: Warning
It looks like this is caused by a worker crash. Could you please check the event log under Windows Logs => Application? Are there any errors in that category?
As far as I know, you don't have a lot of options to tune the Integration Runtime. My bet is a connectivity issue with your private network. Whenever you run the pipeline, open a command prompt on the VM and ping the database machine with -t. If the process hangs, take a look at the response times between pings.
Example ping:
ping 192.168.1.1 -t
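If ICMP is blocked on the private network, a similar check can be done over TCP. This is a rough sketch, not part of the original suggestion: it times how long a TCP connection to the database takes. The host and port in the usage comment are assumptions (192.168.1.1 is the address from the ping example, and 1433 is only a guess at the database port).

```python
import socket
import time


def tcp_connect_time(host, port, timeout=2.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return None


def probe(host, port, attempts=5, interval=1.0, timeout=2.0):
    """Run several connection attempts and return the list of timings."""
    results = []
    for i in range(attempts):
        results.append(tcp_connect_time(host, port, timeout))
        if i < attempts - 1:
            time.sleep(interval)
    return results


# Hypothetical usage while the pipeline is running:
#   for t in probe("192.168.1.1", 1433, attempts=10):
#       print("failed" if t is None else f"{t * 1000:.1f} ms")
```

A run of failed attempts or sharply rising timings during the "freeze" would point at the network rather than the runtime.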
Hope this helped!
30002 means IntegrationRuntime got new tasks assigned and started execution.
If the 10-minute "retry interval" can be reproduced consistently, then 30002 could further indicate that the IntegrationRuntime lost track of the previously failed tasks it had been assigned and had to retry them.
You can search for the specific Job IDs in the event logs to verify whether they showed up 10 minutes earlier and whether any exceptions are related to them.
Btw, the polling interval on the normal happy path is on the order of seconds.
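The Job ID check described above can be sketched as follows. It assumes the events have already been exported and reduced to (timestamp, job ID) pairs; the export step, the job IDs and the 5-minute threshold are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_retry_gaps(events, min_gap=timedelta(minutes=5)):
    """Return {job_id: [gap, ...]} for jobs seen again after at least `min_gap`.

    `events` is an iterable of (datetime, job_id) pairs in any order.
    """
    seen = defaultdict(list)
    for timestamp, job_id in events:
        seen[job_id].append(timestamp)

    gaps = {}
    for job_id, stamps in seen.items():
        stamps.sort()
        # Compare each sighting of the job with the next one.
        long_gaps = [b - a for a, b in zip(stamps, stamps[1:]) if b - a >= min_gap]
        if long_gaps:
            gaps[job_id] = long_gaps
    return gaps


# Made-up events: "job-a" reappears after about 10 minutes, "job-b" does not.
sample_events = [
    (datetime(2018, 4, 12, 10, 0, 0), "job-a"),
    (datetime(2018, 4, 12, 10, 10, 5), "job-a"),
    (datetime(2018, 4, 12, 10, 0, 2), "job-b"),
    (datetime(2018, 4, 12, 10, 0, 9), "job-b"),
]

for job_id, job_gaps in find_retry_gaps(sample_events).items():
    print(job_id, [str(gap) for gap in job_gaps])
```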
After running Azure Diagnostics 2.5 for a period of time, without any problem, it started to fail.
Here is my wadcfgx.
Here is the CommandExecution log from the sole instance of the app.
Here is my DiagnosticsPluginLauncher log.
Here is DiagnosticsPlugin log.
Where is the problem here?
DiagnosticsPlugin log ends with :
DiagnosticsPlugin.exe Error: 0 : [6/3/2015 12:02:41 PM] System.Xml.Schema.XmlSchemaValidationException: The element 'CounterSets' has incomplete content. List of possible elements expected: 'CounterSet'.
at Microsoft.Azure.Plugins.Plugin.BaseMonitoringConfig.Validate(String configFile, String schemaFile)
at Microsoft.Azure.Plugins.Plugin.WadParser.Translate(String baseMaResourcePath, Int32 actualDiskQuota, String& fullConfigFilePath)
DiagnosticsPlugin.exe Error: 0 : [6/3/2015 12:02:41 PM] Failed to convert WAD1.1 config to Monagent config format
DiagnosticsPlugin.exe Information: 0 : [6/3/2015 12:02:41 PM] DiagnosticPlugin.exe exit with code -108
I have a humble guess: the performance counter set is causing the problem. Am I right?
The suggestion was right - trifling with performance counters (a.k.a. selecting my own custom list) is punishable. Disabling them mitigated the problem. I suppose there is a default and allowed (a.k.a. possible) list of performance counters.
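The schema error in the log complains about a 'CounterSets' element with no 'CounterSet' children. A quick way to look for such empty elements in an XML config is sketched below. Which file actually contains the element is an assumption here (the log only names the element), and the sample XML is made up.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def find_empty_elements(xml_text, name="CounterSets"):
    """Return all elements called `name` that have no child elements."""
    root = ET.fromstring(xml_text)
    return [
        element
        for element in root.iter()
        if local_name(element.tag) == name and len(element) == 0
    ]


# Made-up fragments, only to illustrate the check.
broken_config = "<Config><CounterSets></CounterSets></Config>"
valid_config = "<Config><CounterSets><CounterSet name='demo'/></CounterSets></Config>"

print(len(find_empty_elements(broken_config)))
print(len(find_empty_elements(valid_config)))
```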
Here is a good article about diagnostics with an exhaustive error list and this very useful path:
%SystemDrive%\WindowsAzure\Logs\Plugins\Microsoft.Azure.Diagnostics.PaaSDiagnostics<DiagnosticsVersion>\CommandExecution.log
which led me to discover all the logs I needed.
To solve similar problems:
1. Publish your instances to the staging environment with Remote Desktop enabled.
2. RDP to an instance through the Server Explorer in VS.
3. Go to the folder mentioned above and find the logs.
4. Open every log, find the error and its error code, and look up the suggestions for that error code in the article.
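Going through every log by hand can be tedious, so the search can be scripted once you are on the instance. This is a minimal sketch, assuming the logs are plain-text files ending in .log and that error lines contain the word "Error" (as the DiagnosticsPlugin lines above do); the folder passed in the usage comment is a placeholder.

```python
import os


def find_error_lines(log_dir, needle="Error"):
    """Return (path, line_number, text) for every .log line containing `needle`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(log_dir):
        for name in sorted(filenames):
            if not name.lower().endswith(".log"):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" keeps the scan going on odd encodings.
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                for line_number, line in enumerate(handle, start=1):
                    if needle in line:
                        hits.append((path, line_number, line.rstrip()))
    return hits


# Hypothetical usage (replace the placeholder with the plugin log folder):
#   for path, line_number, text in find_error_lines(r"<plugin log folder>"):
#       print(f"{path}:{line_number}: {text}")
```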
We have an Azure PaaS service implementation. Each instance hosts 10 customers, and each customer owns a mounted cloud drive (page blob) to store some files.
This deployment has been running in Azure for the last year.
For the last 2-3 weeks we have observed that 1-2 cloud drives (page blobs) get unmounted from an instance. We got some error information from the System log in Event Viewer, which is included below; the error is also not consistent. Currently, as a workaround, we reboot the instance daily, which remounts the VHD (page blob).
Guest OS version-1.18
Azure SDK 1.7
Please let us know what the reason for this issue is.
Error details
Log Name: System
Source: PlugPlayManager
Date: 4/22/2013 11:10:50 AM
Event ID: 12
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: RD00155D477FE9
Description:
The device 'Msft VHD Disk SCSI Disk Device' (SCSI\Disk&Ven_Msft&Prod_VHD_Disk\1&26c3c0c&0&000002) disappeared from the system without first being prepared for removal.
Log Name: System
Source: WaDrivePrt
Date: 4/22/2013 11:10:49 AM
Event ID: 4
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: RD00155D477FE9
Description:
'/lwe_2f44e5e3.vhd' failed to renew lease the specified XDisk.
I developed an application site where one feature is an FAQ in which users can post text without any size limit.
I have two servers running the application. Whenever a single field (question or answer) is huge (like one page long), one of the servers returns Service Unavailable. I checked the log, and the error detail is:
-------------------
Event Type: Error
Event Source: W3SVC
Event Category: None
Event ID: 1002
Date: 1/23/2012
Time: 3:29:49 PM
User: N/A
Computer: BA5SWWW006
Description:
Application pool 'pool_name' is being automatically disabled due to a series of failures in the process(es) serving that application pool.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
--------------------
AND ALSO
-------------------
Event Type: Error
Event Source: VsJITDebugger
Event Category: None
Event ID: 4096
Date: 1/23/2012
Time: 3:29:44 PM
User: NT AUTHORITY\NETWORK SERVICE
Computer: BA5SWWW006
Description:
An unhandled win32 exception occurred in w3wp.exe [10896]. Just-In-Time debugging this exception failed with the following error: Debugger could not be started because no user is logged on.
Check the documentation index for 'Just-in-time debugging, errors' for more information.
For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 02 00 5c 80 ..\
------------------------
But the other server is working fine. I checked that all the basic settings on both servers are the same.
Also, no such error is logged for any other module.
The error does not even occur for the same module if the text in the question or answer is short.
When this occurs, it prompts for user credentials. I couldn't understand why it prompts for them.
I use MySQL with a LONGTEXT field to store the question or answer.
It may be best to try the IIS Debug Diagnostics Tool to diagnose the problem further.
This SO question has plenty of other suggestions: How to diagnose IIS fatal communication error problem