I have a basic SOAP-based web service with a method that retrieves an image from a database, saves it to a local directory, and returns the path to the image. It is hosted in IIS 8.5 on Windows Server 2012 R2 with a single-core 2.3 GHz processor.
This application runs fine for single-threaded calls; however, every morning we have a batch job that sends a few dozen requests to the service, which causes Rapid-Fail Protection (RFP) to kick in and restart the app.
I've tried changing the service behavior in the config file to the bare minimum:
<behavior name="serviceBehavior">
<serviceThrottling maxConcurrentInstances="1" maxConcurrentSessions="1" />
</behavior>
but the app still crashes. Here is a clip of the WER file from the App crash:
FriendlyEventName=Stopped working
ConsentKey=APPCRASH
AppName=IIS Worker Process
AppPath=C:\Windows\SysWOW64\inetsrv\w3wp.exe
I used a 3rd party tool called SoapUI to perform some load testing and the error it sees on the requesting end is:
java.net.SocketException: Connection reset
My question is whether this is something that can even be controlled by the web service, or whether it extends to the networking/hardware level.
EDIT: To clarify some points: the w3wp.exe process crashes and gets a new PID when it restarts. Here is the output that is repeated in the Event Viewer during the crashes.
Faulting application name: w3wp.exe, version: 8.5.9600.16384, time stamp: 0x52157ba0
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc00001a5
Fault offset: 0x069d1e99
Faulting process id: 0x3e44
Faulting application start time: 0x01d3c4dab9cec060
Faulting application path: C:\Windows\SysWOW64\inetsrv\w3wp.exe
Faulting module path: unknown
Report Id: a605ce41-30da-11e8-80ce-005056a725cd
Faulting package full name:
Faulting package-relative application ID:
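The "Faulting application start time" field in entries like the one above is a Windows FILETIME printed in hex: a count of 100-nanosecond ticks since 1601-01-01 UTC. Decoding it makes it easier to line the crash up with the batch job. The following is a minimal sketch; the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# FILETIME epoch: 1601-01-01 00:00:00 UTC, counted in 100-nanosecond ticks.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(hex_value: str) -> datetime:
    """Convert a hex FILETIME string (as printed in the event log) to UTC.

    The helper name is hypothetical; it is not part of any Windows tooling.
    """
    ticks = int(hex_value, 16)  # base 16 also accepts an optional "0x" prefix
    # 10 ticks per microsecond; sub-microsecond precision is dropped.
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


if __name__ == "__main__":
    # Value copied from the event log entry above.
    print(filetime_to_datetime("0x01d3c4dab9cec060").isoformat())
```

The result is in UTC, so it needs converting to the server's local time before comparing it with the batch schedule.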
I use a combination of Lookup and ForEach activities to iterate through a set of data ingestion queries and execute them (the reasons behind that are a separate, broad topic :)). As the data source is connected to a private network, I have provisioned a dedicated VM to run the self-hosted integration runtime. In most cases everything runs smoothly: I can see the worker processes eating the CPU and high overall CPU utilization (which is good).
But sometimes, when most of the work is done and just 2-3 activities are left waiting in line, the runtime does no processing, CPU usage drops to zero, and no new entries appear in the event log. After some time (approximately 10 minutes) I get event 30002 (an example is provided below) and the runtime happily completes the work.
Example event message:
Job ID: ***-fcab-429a-bb45-***
Task ID: ***-d820-414e-ad8c-***
Queue ID: ***-4f44-4c39-a1c1-***
Log ID: PulledOffNewTask
The question: What could be the root cause of such Azure Data Factory self-hosted integration runtime's behaviour? Can this be fine-tuned?
UPDATE 1
Errors have been spotted in the application log and warnings have been spotted in the integration runtime log.
The application log contains 3 sets of errors (see events [1] to [5] below) that occurred within an interval of about 2 minutes. Shortly after that, 8 events (exactly the number of my worker processes) were logged to the integration runtime log (see [6]), and straight after that "Windows Error Reporting" events appear. And then we face a "freeze".
So - looks like a bug :(
"application" log:
[1]
Application: diawp.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.NullReferenceException
at Microsoft.DataTransfer.TransferTask.CopyTaskBase.UpdateJobProgress(System.Object)
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.TimerQueueTimer.CallCallback()
at System.Threading.TimerQueueTimer.Fire()
at System.Threading.TimerQueue.FireNextTimers()
[2]
Faulting application name: diawp.exe, version: 3.5.6639.1, time stamp: 0x5aa8cf5f
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x00007ff914402c65
Faulting process id: 0x1bc4
Faulting application start time: 0x01d3d287ef6e34fa
Faulting application path: C:\Program Files\Microsoft Integration Runtime\3.0\Shared\diawp.exe
Faulting module path: unknown
Report Id: 1fe7de4d-5481-478d-b9e7-d542c24ab18a
Faulting package full name:
Faulting package-relative application ID:
[3]: Unable to open the Server service performance object. The first four bytes (DWORD) of the Data section contains the status code.
[4]: The Open Procedure for service "WmiApRpl" in DLL "C:\Windows\system32\wbem\wmiaprpl.dll" failed. Performance data for this service will not be available.
"Integration Runtime" log:
[6]
'Type=System.InvalidOperationException,Message=Instance 'diawp#10' does not exist in the specified Category.,Source=System,StackTrace= at System.Diagnostics.CounterDefinitionSample.GetInstanceValue(String instanceName)
at System.Diagnostics.PerformanceCounter.NextSample()
at System.Diagnostics.PerformanceCounter.NextValue()
at Microsoft.DataTransfer.TransferTask.FormatedPerfCounter.TryGet(Single& value),'
Job ID: 7b629411-c6cd-42d0-9939-e830e58db015
Log ID: Warning
It looks like this is caused by a worker crash. Could you please check the event log under Windows Logs => Application? Are there any errors in that category?
As far as I know, you don't have many options for tuning the Integration Runtime. My bet is a connectivity issue with your private network. Whenever you run the pipeline, open a command prompt on the VM and ping the database machine with -t. If the process hangs, take a look at the response times between pings.
Example ping:
ping 192.168.1.1 -t
Hope this helped!
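If ICMP is blocked on the private network, a similar check can be made at the TCP level. The sketch below swaps ping for a timed TCP connect, which is a different technique from the one suggested above, and it assumes the data source listens on a known port. The function names are made up for illustration.

```python
import socket
import time


def tcp_connect_latency(host, port, timeout=2.0):
    """Return the TCP connect time in seconds, or None if the attempt failed."""
    start = time.perf_counter()
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return None


def probe(host, port, count=5, interval=1.0):
    """Probe repeatedly and return the list of latencies (None = failure)."""
    results = []
    for attempt in range(count):
        results.append(tcp_connect_latency(host, port))
        if attempt < count - 1:
            time.sleep(interval)  # pause between probes, not after the last one
    return results
```

For example, `probe("192.168.1.1", 1433, count=60)` would sample the address from the ping example once a second for a minute; port 1433 is only a placeholder for whatever port the data source actually uses. A run of `None` values or sharply rising latencies during the "freeze" would point at the network rather than the runtime.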
30002 means IntegrationRuntime got new tasks assigned and started execution.
If the 10-minute "retry interval" can be reproduced consistently, then 30002 could further indicate that IntegrationRuntime lost track of the previously failed tasks it had been assigned and had to retry them.
You can search for the specific Job IDs in the event logs to verify whether they showed up 10 minutes earlier and whether any exceptions are related to them.
By the way, on the normal happy path the polling interval is on the order of seconds.
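The Job ID search can be scripted once the events are exported. This is a rough sketch that assumes the events have already been loaded as (timestamp, message) pairs; the export step, the function names and the sample timestamps are all hypothetical.

```python
from datetime import datetime


def entries_for_job(entries, job_id):
    """Return the (timestamp, message) pairs that mention job_id, oldest first."""
    matches = [(ts, msg) for ts, msg in entries if job_id in msg]
    return sorted(matches, key=lambda pair: pair[0])


def gaps_in_minutes(matches):
    """Minutes elapsed between consecutive matching entries."""
    return [
        (later[0] - earlier[0]).total_seconds() / 60
        for earlier, later in zip(matches, matches[1:])
    ]


# Hypothetical sample data: the Job ID is taken from the warning quoted in the
# question, but the timestamps and the other job are invented.
SAMPLE_EVENTS = [
    (datetime(2018, 4, 12, 9, 10), "Job ID: 7b629411-c6cd-42d0-9939-e830e58db015 Log ID: PulledOffNewTask"),
    (datetime(2018, 4, 12, 9, 0), "Job ID: 7b629411-c6cd-42d0-9939-e830e58db015 Log ID: Warning"),
    (datetime(2018, 4, 12, 9, 5), "Job ID: some-other-job Log ID: PulledOffNewTask"),
]

if __name__ == "__main__":
    found = entries_for_job(SAMPLE_EVENTS, "7b629411-c6cd-42d0-9939-e830e58db015")
    print(gaps_in_minutes(found))
```

A gap close to 10 minutes between a warning and the next PulledOffNewTask entry for the same job would be consistent with the retry explanation above.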
I hope someone can help with this very strange situation.
The w3wp process (for any site on our server) crashes with an Access Violation exception when it's recycled. This happens whether it recycles due to a time or request limit, or a manual trigger. It doesn't seem to be related to our application, as a dummy site with no content exhibits the same behaviour.
There are 2 servers in production with this behaviour, using NLB and ARR to load balance to an IIS server farm. We have 2 test servers with the same setup on our local infrastructure which also have this problem, but "single server" setups (and developer machines) of the same application don't have this problem.
The Event Logs of these errors are like the following:
Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: unknown, version: 0.0.0.0, time stamp: 0x00000000
Exception code: 0xc0000005
Fault offset: 0x0000000000000001
Faulting process id: 0x32d8
Faulting application start time: 0x01d34d9dc16ec0b0
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: unknown
Report Id: f3b2821a-b991-11e7-9902-00505683efbb
Sometimes they have a different Faulting module, but mostly it's "unknown":
Faulting module name: nativerd.dll, version: 7.5.7601.17855, time stamp: 0x4fc85321
or
Faulting module name: iiscore.dll, version: 7.5.7601.17514, time stamp: 0x4ce7c6c8
I have captured crash dumps when these errors occur using DebugDiag 2.2 and the analysis shows the following for the thread where the exception was raised:
Thread 0 - System ID 16208
Entry point w3wp!wmainCRTStartup
Create time 25/10/2017 09:19:17
Time spent in user mode 0 Days 00:00:00.078
Time spent in kernel mode 0 Days 00:00:00.015
Call Stack
iiscore!W3_URL_INFO::`vftable'
nativerd!TerminateNativeConfiguration+16
w3wphost!W3WP_HOST::~W3WP_HOST+1fb
w3wphost!AppHostInitialize+325
w3wp!wmain+470
w3wp!PerfStopProvider+19b
kernel32!BaseThreadInitThunk+d
ntdll!RtlUserThreadStart+1d
I've also looked at this in WinDbg and got the following:
OS Thread Id: 0x3f50 (0)
Current frame: iiscore!W3_URL_INFO::`vftable'
Child-SP RetAddr Caller, Callee
00000000001df570 000007fef91ef4d6 nativerd!TerminateNativeConfiguration+0x16
00000000001df5a0 000007fefb1c4797 w3wphost!W3WP_HOST::~W3WP_HOST+0x1fb, calling 0000000000010000
00000000001df5d0 000007fefb1c4269 w3wphost!AppHostInitialize+0x325, calling w3wphost!W3WP_HOST::~W3WP_HOST
00000000001df630 00000000ffbe3c60 w3wp!wmain+0x470
00000000001df670 000007feff3414e4 msvcrt!calloc_impl+0x85, calling ntdll!RtlAllocateHeap
00000000001df720 000007feff3541ba msvcrt!_wgetmainargs+0x7b, calling msvcrt!wsetenvp
00000000001df750 00000000ffbe10a2 w3wp!PerfStopProvider+0x4c, calling msvcrt!_wgetmainargs
00000000001df770 000000007715df6a ntdll! ?? ::FNODOBFM::`string'+0x149ca, calling ntdll!NtQueryPerformanceCounter
00000000001df790 000007feff348e47 msvcrt!initterm+0x1f
00000000001df7c0 00000000ffbe11f1 w3wp!PerfStopProvider+0x19b, calling w3wp!wmain
00000000001df7d0 00000000ffbe1351 w3wp!wmainCRTStartup+0x9, calling w3wp!_security_init_cookie
00000000001df800 0000000076ec59cd kernel32!BaseThreadInitThunk+0xd
00000000001df830 00000000770fa561 ntdll!RtlUserThreadStart+0x1d
But I'm struggling with how to interpret this.
Any insight into what this might be, or how I could continue the diagnosis, would be much appreciated.
Many thanks
On the off-chance that this helps someone else, we finally tracked this down to an extra IIS module, Dionach StripHeaders, that was causing the access violation exception. It was a known bug which has since been fixed, so updating the module should fix our issue.
In a recent deploy of our Cloud Service I'm getting lots of errors in my Application Event Log. Presumably it's a combination of us misconfiguring something in Azure Diagnostics and Azure Diagnostics being unable to deal with it gracefully. But what specifically is misconfigured?
Every 5 minutes I get this:
First Error with Source=AzureDiagnostics
System.ArgumentException: An item with the same key has already been added.
at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
at ApplicationInsightsExtension.WindowsEventLogPublisher..ctor(ILogger logger, ITelemetryClient client, String logTablesPath, IEnumerable`1 dataSources, ILocalTableReader tableReader, Dictionary`2 configProperties) in x:\bt\725234\repo\src\agent\extensions\AppInsightsExtension\Publishers\WindowsEventLogPublisher.cs:line 59
at ApplicationInsightsExtension.WAD2AIExtension.GetPublishersBasedOnConfig() in x:\bt\725234\repo\src\agent\extensions\AppInsightsExtension\WAD2AIExtension.cs:line 114
at ApplicationInsightsExtension.Program.Main(String[] args) in x:\bt\725234\repo\src\agent\extensions\AppInsightsExtension\Program.cs:line 43
Then Error Source=.NET Runtime
Application: ApplicationInsightsExtension.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ArgumentException
Stack:
at ApplicationInsightsExtension.Program.Main(System.String[])
Then Error with Source=Application Error
Faulting application name: ApplicationInsightsExtension.exe, version: 33.3.11.0, time stamp: 0x57224d6d
Faulting module name: KERNELBASE.dll, version: 6.3.9600.18202, time stamp: 0x569e7eb1
Exception code: 0xe0434352
Fault offset: 0x0000000000008a5c
Faulting process id: 0x1504
Faulting application start time: 0x01d1bcb0c4a62a77
Faulting application path: C:\Resources\directory\403a5550e74e40448af87aa6c4d6183a.OUR.APP.NAME.DiagnosticStore\WAD0106\Package\Ext\ApplicationInsightsExtension\Commit\ApplicationInsightsExtension.exe
Faulting module path: D:\Windows\system32\KERNELBASE.dll
Report Id: 02a628df-28a4-11e6-80c2-00155dc05ef6
Faulting package full name:
Faulting package-relative application ID:
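The "Exception code" field in Application Error entries like the one above is a 32-bit status value, and a small lookup makes the common ones readable. This is a sketch covering only the codes quoted in entries like these; the table and the function name are illustrative, not a complete NTSTATUS list.

```python
# Only a handful of codes; this is not a complete NTSTATUS table.
KNOWN_CODES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION (invalid memory access)",
    0xC00001A5: "STATUS_INVALID_EXCEPTION_HANDLER",
    0xE0434352: "unhandled .NET (CLR) exception",
}


def describe_exception_code(text):
    """Map an event-log exception code such as '0xe0434352' to a description."""
    code = int(text, 16)  # base 16 accepts both 'c0000005' and '0xc0000005'
    return KNOWN_CODES.get(code, f"unknown code 0x{code:08x}")


if __name__ == "__main__":
    for raw in ("0xe0434352", "c0000005", "0xc00001a5"):
        print(raw, "->", describe_exception_code(raw))
```

A 0xe0434352 entry usually just repeats a managed exception that the .NET Runtime entry has already logged with a stack trace, so the .NET Runtime entry is the more useful of the two.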
Looks like the problem was in my Diagnostics Configuration > Windows Event Logs. I had both System!* ticked and also System![System[(Level = 1 or Level=2)]]. D'oh.
I still get the odd other error like this, but it doesn't seem important.
System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Azure.Plugins.Diagnostics.dll.RoleInformation.get_IsWorkerRole()
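The duplicate-key failure can be checked for before deploying. The sketch below assumes the extension keys its dictionary on the channel name, the part before the "!". That is an inference from the stack trace and the fix above, not documented behaviour, and the function name is made up.

```python
from collections import Counter


def duplicate_channels(data_sources):
    """Return the event-log channel names that appear more than once.

    Each data source is expected in the 'Channel!XPathQuery' form used by the
    Windows Event Logs diagnostics setting.
    """
    # Assumption: the channel is everything before the first '!'.
    channels = [source.split("!", 1)[0].strip() for source in data_sources]
    counts = Counter(channels)
    return sorted(name for name, seen in counts.items() if seen > 1)


if __name__ == "__main__":
    configured = [
        "System!*",
        "System![System[(Level = 1 or Level=2)]]",
        "Application!*",
    ]
    print(duplicate_channels(configured))
```

With the two System entries from the configuration described above, the check reports the System channel as the duplicate.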
I've developed a VC++ application which runs fine on my laptop but crashes on a server PC. This is the error message I receive:
Problem signature:
Problem Event Name: APPCRASH
Application Name: Terminator.exe
Application Version: 0.0.0.0
Application Timestamp: 53e0fcee
Fault Module Name: Terminator.exe
Fault Module Version: 0.0.0.0
Fault Module Timestamp: 53e0fcee
Exception Code: c0000005
Exception Offset: 000000000000e79c
OS Version: 6.1.7601.2.1.0.272.7
Locale ID: 1033
Additional Information 1: 62f5
Additional Information 2: 62f5297269af48d65377b01a2aee9b2d
Additional Information 3: 1ec0
Additional Information 4: 1ec0dd9dc74b0802a47c92d98c459c66
Read our privacy statement online:
http://go.microsoft.com/fwlink/?linkid=104288&clcid=0x0409
If the online privacy statement is not available, please read our privacy statement offline:
C:\Windows\system32\en-US\erofflps.txt
In the console this message was shown:
Cause: EXCEPTION_ACCESS_VIOLATION Attempted to read from: 0x00000000
Most likely I'm accessing a bad pointer, indexing outside an array's bounds, or something like that. To diagnose this further, I want to know exactly which application thread causes the exception. Once I know the thread, I can add line-by-line tracing and find the exact line that fails.
Is there a way to find out which thread crashed the application? (Ideally, can I name each thread inside my application and then have the application display the name of the crashed thread?)
I have an existing MVC4 web project which I wanted to deploy to a cloud service and start using Azure data caching.
I have added the Windows Azure Caching NuGet packages to two projects in the solution, the web project and a class library project, both of which need them.
I then added a web role for the web project and updated the datacache identifier reference in the web.config to point to the web role, which is enabled for co-located caching.
I can run this locally on the emulator without any problems as long as I don't have any datacache code. But the moment I add code to access the datacache, the problems start. Just this code causes the web project to hang:
var cache = new DataCache("default");
There are no errors that I am aware of, either in the VS output or generated by the web application; it just hangs.
What is the best way to start diagnosing where this problem lies?
UPDATE
I have just noticed the following errors generated in the application event log:
Application: CacheServiceEmulator.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: Microsoft.ApplicationServer.Caching.DataCacheException
Stack:
at Microsoft.ApplicationServer.Caching.AzureCommon.AzureUtility.ProcessException(System.Exception)
at Microsoft.ApplicationServer.Caching.CacheServiceEmulator.CacheServiceEmulator.Main(System.String[])
Faulting application name: CacheServiceEmulator.exe, version: 1.0.4797.0, time stamp: 0x506f41ec
Faulting module name: KERNELBASE.dll, version: 6.2.9200.16451, time stamp: 0x50988aa6
Exception code: 0xe0434352
Fault offset: 0x000000000003811c
Faulting process ID: 0x13cc
Faulting application start time: 0x01ce1b74c41f996d
Faulting application path: D:\Users\Tony\My Documents\Visual Studio 2012\Projects\Seqential\Didbook_ws\Didbook.net\Didbook.net v1.0\didbook.net Web.Azure\csx\Debug\roles\didbook.net Web\plugins\Caching\CacheServiceEmulator.exe
Faulting module path: C:\WINDOWS\system32\KERNELBASE.dll
Report ID: 03114030-8768-11e2-beaf-68942335e1fe
Faulting package full name:
Faulting package-relative application ID:
Fault bucket -936878625, type 5
Event Name: CLR20r3
Response: Not available
Cab Id: 0
Problem signature:
P1: cacheserviceemulator.exe
P2: 1.0.4797.0
P3: 506f41ec
P4: Microsoft.ApplicationServer.Caching.AzureServerCommon
P5: 1.0.4797.0
P6: 506f41df
P7: 3d
P8: 18
P9: SWOUM0PNYW4I1S3EYHEY4VNB5OWO0LJ1
P10:
Attached files:
C:\Users\Tony\AppData\Local\Temp\WER90C9.tmp.WERInternalMetadata.xml
These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportArchive\AppCrash_cacheserviceemul_667e21a2e47da59aad2c601844d8dcfd3d291a_28d494fe
Analysis symbol:
Rechecking for solution: 0
Report ID: 03114030-8768-11e2-beaf-68942335e1fe
Report Status: 0
Hashed bucket: 700c7356d6308372410cf1d2baaf5d77
Does that help track down what is happening?
One other piece of info that may help: if I create a brand new solution, add a web role and enable co-located caching, it works fine, so this appears to be something specific to this solution.
The Azure Caching emulator starts logman.exe passing the cnf parameter as 30:00, but logman may reject it if the format is not compatible with your regional settings.
All you have to do is change the Long time setting to "HH:mm:ss" and it will work.
Can you post your cscfg files here, with the storage keys starred out? Also check the events and stack traces in the application server channel (Admin); it should have a better stack trace.