SB37 abend in production and cannot change the space parameter - mainframe

My colleague faced an issue where his sort job failed with an SB37 abend. I know that this error can be rectified by allocating more space to the output file, but my question is:
How can I remediate an SB37 abend without changing the space allocation?
It takes a week or more to move changes to production, so I can't change the space allocation of the file at the moment because the error is in production.

An SB37 abend indicates an out of space condition during end-of-volume processing.
B37  Explanation: The error was detected by the end-of-volume
routine. This system completion code is accompanied by message
IEC030I. Refer to the explanation of message IEC030I for complete
information about the task that was ended and for an explanation of
the return code (rc in the message text) in register 15.
This is accompanied by message IEC030I, which provides more information about the issue.
Depending on a few factors, your production control team may be able to adjust the environment so that the job can run. Lacking more detail, it is impossible to give an exact answer, so consider this a roadmap for how to approach the problem.
IEC030I B37-rc,mod,jjj,sss,ddname[-#],dev,ser,diagcode,dsname(member)
In the message there should be a volser that identifies the volume that was being written to. If you have the production control team look at the contents of that volume, there may be insufficient space that can be remedied by removing datasets. There are too many options to enumerate without specifics about the failure, the type of dataset, and other information to guide you.
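If operations ends up triaging many of these, the key fields can also be pulled out of the message text programmatically for a quick summary. The following is only a rough sketch based on the field layout quoted above; the sample message values (return code, module, job name, volser, and so on) are invented, and real messages may wrap across lines.

    #!/usr/bin/perl
    # Rough sketch only: extract the key fields from an IEC030I B37 message,
    # following the documented field order
    #   IEC030I B37-rc,mod,jjj,sss,ddname[-#],dev,ser,diagcode,dsname(member)
    # The sample message below is invented purely for illustration.
    use strict;
    use warnings;

    my $msg = 'IEC030I B37-04,MODNAME,MYJOB,STEP010,SORTOUT,0A8F,PRDVOL,...';

    if ($msg =~ /^IEC030I\s+B37-([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)/) {
        my ($rc, $mod, $job, $step, $dd, $dev, $volser) = ($1, $2, $3, $4, $5, $6, $7);
        print "Job $job, step $step: DD $dd ran out of space on volume $volser (rc=$rc)\n";
    }

The point is simply to surface the volser and ddname quickly so the production control team knows which volume to examine.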
However, as indicated in other comments, if you have a production control team that can run the job, they should be able to make changes to the JCL to direct the output dataset to another set of volumes or storage groups.
Changes to the JCL are likely the only way to correct the problem.

Related

About managing file system space

Space Issues in a filesystem on Linux
Let's call it FILESYSTEM1.
Normally, FILESYSTEM1 is only about 40-50% used.
Then clients run some reports or queries, and these produce massive files, about 4-5 GB in size, which instantly fill up FILESYSTEM1.
We have some cleanup scripts in place but they never catch this because it happens in a matter of minutes and the cleanup scripts usually clean data that is more than 5-7 days old.
Another set of scripts is also in place; these report when free space in a filesystem drops below a certain threshold.
We thought of possible solutions to detect and act on this proactively:
Increase the FILESYSTEM1 file system to double its size.
Set the threshold in the alert scripts for this filesystem to alert when it is 50% full.
This will hopefully give us enough time to catch this and act before the client reports issues due to FILESYSTEM1 being full.
Even though this solution works, it does not seem to be the best way to deal with the situation.
Any suggestions / comments / solutions are welcome.
Thanks
It sounds like what you've found is that simple threshold-based monitoring doesn't work well for the usage patterns you're dealing with. I'd suggest something that pairs high-frequency sampling (say, once a minute) with a monitoring tool that can do some kind of regression on your data to predict when space will run out.
In addition to knowing when you've already run out of space, you also need to know whether you're about to run out of space. Several tools can do this, or you can write your own. One existing tool is Zabbix, which has predictive trigger functions that can be used to alert when file system usage seems likely to cross a threshold within a certain period of time. This may be useful in reacting to rapid changes that, left unchecked, would fill the file system.
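If a full monitoring product is not an option, the same idea can be sketched in a short script: sample usage frequently, fit a simple trend over a recent window, and alert when the projected time-to-full drops below your reaction time. The mount point, sampling window, and alert threshold below are assumptions to tune for FILESYSTEM1; this is a sketch, not a hardened monitor.

    #!/usr/bin/perl
    # Sketch of trend-based space monitoring: sample a filesystem once a
    # minute, fit a least-squares line to the last few samples, and warn
    # when the projected time until 100% full drops below a threshold.
    # The mount point, window and alert threshold are assumptions.
    use strict;
    use warnings;

    my $mount     = '/filesystem1';   # hypothetical mount point
    my $interval  = 60;               # seconds between samples
    my $window    = 10;               # number of samples used for the trend
    my $alert_min = 30;               # alert if projected full in < 30 minutes
    my @samples;                      # [seconds_since_start, percent_used]

    sub percent_used {
        my ($mp) = @_;
        my @df = `df -P $mp`;         # POSIX format: header line + data line
        return unless @df > 1;
        my ($pct) = $df[1] =~ /(\d+)%/;
        return $pct;
    }

    my $t0 = time();
    while (1) {
        my $used = percent_used($mount);
        if (defined $used) {
            push @samples, [ time() - $t0, $used ];
            shift @samples while @samples > $window;

            if (@samples >= 2) {
                # Least-squares slope of percent-used versus time.
                my $n = scalar @samples;
                my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
                for my $s (@samples) {
                    $sx  += $s->[0];
                    $sy  += $s->[1];
                    $sxx += $s->[0] ** 2;
                    $sxy += $s->[0] * $s->[1];
                }
                my $denom = $n * $sxx - $sx ** 2;
                my $slope = $denom ? ($n * $sxy - $sx * $sy) / $denom : 0;  # percent per second

                if ($slope > 0) {
                    my $minutes_to_full = (100 - $used) / $slope / 60;
                    printf "Filling at %.2f%%/min, full in about %.0f minutes\n",
                           $slope * 60, $minutes_to_full;
                    warn "ALERT: $mount projected to be full soon\n"
                        if $minutes_to_full < $alert_min;
                }
            }
        }
        sleep $interval;
    }

Run it as a small daemon or from cron; the value is in the projection, not the exact numbers, since a 4-5 GB report landing in minutes shows up as a steep slope long before the filesystem actually fills.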

Netlogo 5.1 (and 5.05) Behavior Space Memory Leak

I have posted on this before, but thought I had tracked it down to the NW extension; however, memory leakage still occurs in the latest version. I found this thread, which discusses a similar issue but attributes it to BehaviorSpace:
http://netlogo-users.18673.x6.nabble.com/Behaviorspace-Memory-Leak-td5003468.html
I have found the same symptoms. My model starts out at around 650 MB, but over each run the private working set memory rises until it hits the 1024 MB limit. I have enough memory to raise this limit, but in reality that will only delay the onset. I am using the table output, as previous discussions suggest this helps, and it does, but it only slows the rate of increase. Eventually the memory usage rises to a point where the PC starts to struggle. I am clearing all data between runs, so there should be no hangover. I noticed in the highlighted thread that they were going to run headless; I will try this, but I wondered if anyone else had noticed the issue. My other option is to break the BehaviorSpace simulation into a few batches so the issue never arises, but it would be nice to let the model run and walk away, as it takes around 2 hours to go through.
Some possible next steps:
1) Isolate the exact conditions under which the problem does or does not occur. Can you make it happen without involving the nw extension, or not? Does it still happen if you remove some of the code from your model? What if you keep removing code — when does the problem go away? What is the smallest amount of code that still causes the problem? Almost any bug can be demonstrated with only a small amount of code — and finding that smallest demonstration is exactly what is needed in order to track down the cause and fix it.
2) Use standard memory profiling tools for the JVM to see what kind of objects are using the memory. This might provide some clues to possible causes.
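For step 2, if a JDK is available on the machine running NetLogo, even a crude class histogram captured between runs can show which object classes keep growing. A hypothetical wrapper (you supply the NetLogo JVM's pid; jmap ships with the JDK):

    #!/usr/bin/perl
    # Hypothetical helper for step 2: capture a JVM class histogram between
    # BehaviorSpace runs so you can see which object classes keep growing.
    # Requires a JDK so that jmap is on the PATH; pass the NetLogo JVM's pid.
    use strict;
    use warnings;

    my $pid = shift @ARGV or die "usage: $0 <netlogo-jvm-pid>\n";

    my @histo = `jmap -histo $pid`;
    die "jmap produced no output for pid $pid\n" unless @histo;

    # jmap sorts classes by bytes used, so the header plus the first ~20
    # entries are usually enough to spot what is accumulating.
    my $last = $#histo < 22 ? $#histo : 22;
    print @histo[0 .. $last];

Comparing two histograms taken a few runs apart will usually make a genuine leak obvious.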
In general, we are not receiving other bug reports from users along these lines. It's routine, and has been for many years now, for people to use BehaviorSpace (both headless and not) and do experiments that last for hours or even days. So whatever you're experiencing almost certainly has a more specific cause -- most likely in the nw extension -- that could be isolated.

Relevant debug data for a Linux target

For an embedded ARM system running in the field, there is a need to retrieve relevant debug information when a user-space application crashes. That information will be stored in non-volatile memory so it can be retrieved at a later time. All of it must be captured at runtime, and we cannot use third-party applications due to memory consumption concerns.
So far I have thought of the following:
Signal ID and the corresponding PC / memory addresses when a fatal signal occurs;
Process ID;
What other information do you think is relevant for identifying the cause of the problem and enabling a fast debug afterwards?
Thank you!
Usually, to be able to understand an issue, you'll need every register (from r0 to r15), the CPSR, and the top of the stack (to be able to determine what happened before the crash). Please also note that when your program is interrupted by an invalid operation (a jump to an invalid address, ...), the processor goes into an exception mode, whereas you need to dump the registers and stack in the context of your process.
To investigate using that data, you must also keep the ELF files (with debug information, if possible) from your build, so that you can interpret the contents of your registers and stack.
In the end, the more information you keep, the easier the debugging is, but it may be expensive to keep every memory section used by your program at the time of the failure (as a matter of fact, I've never done this).
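As a small illustration of the point about keeping the ELF files: the stored PC and LR values from a crash record can be translated back to function and file:line offline with binutils' addr2line. The paths and addresses below are placeholders, and for an ARM target you would use the cross toolchain's addr2line.

    #!/usr/bin/perl
    # Rough post-mortem helper: map saved PC/LR values from a crash record
    # back to function and file:line using the ELF kept from the build.
    # The ELF path and addresses are supplied by you; for an ARM target use
    # the cross toolchain's addr2line (e.g. arm-linux-gnueabihf-addr2line).
    use strict;
    use warnings;

    my ($elf, @addrs) = @ARGV;
    die "usage: $0 <elf-with-debug-info> <addr> [addr ...]\n"
        unless defined $elf and @addrs;

    for my $addr (@addrs) {
        # -e: the ELF to read, -f: print the function name, -C: demangle C++
        my @out = `addr2line -e $elf -f -C $addr`;
        chomp @out;
        next unless @out >= 2;        # addr2line failed or address unknown
        print "$addr => $out[0] at $out[1]\n";
    }

This runs on the development host, not the target, which is why keeping the exact build artifacts matters.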
In postmortem analysis, you will face some limits:
Dynamically linked libraries: if your crash occurs in dynamically loaded and linked code, you will also need the library binaries you are using on your target.
Memory corruption: memory corruption usually results in random data being executed as code. On ARM with Linux, this will probably lead to a segfault, since you can't jump into another process's memory area and your data will probably be marked as never-execute. Nevertheless, by the time the crash happens, you may already have corrupted the data that could have allowed you to identify the source of the corruption. Postmortem analysis isn't always able to identify the failure cause.

Disk failure detection perl script

I need to write a script to check the disk every minute and report if it is failing for any reason. The error could be an outright disk failure, a bad sector, and so on.
First, I wonder if there is any script out there that does the same as it should be a standard procedure (because I really do not want to reinvent the wheel).
Second, I wonder if I want to look for errors in /var/log/messages, is there any list of standard error strings for disks that I can use?
I have looked for this on the net a lot; there is lots of information and at the same time no definitive information about it.
Any help will be much appreciated.
Thanks,
You could simply parse the output of dmesg, which usually reports fairly detailed information about drive errors; that's how I've collected stats on failing drives before.
You might get better, more well-documented information by using Parse::Syslog or lower-level kernel reporting directly, though.
Logwatch handles the /var/log/messages part of the ordeal (as well as any other logfiles that you choose to add). You can either use it directly or use its code to roll your own solution (it's all written in Perl).
If your hard drives support SMART, I suggest you use smartctl output for diagnostics, as it includes a lot of useful info that can be monitored over time to detect impending failure.
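Pulling the suggestions above together, a minimal Perl sketch might combine a SMART health check with a scan of the kernel log for common disk-error strings. The device name and the error patterns below are assumptions to adapt to your hardware, and smartctl generally needs root.

    #!/usr/bin/perl
    # Minimal sketch combining the suggestions above: ask SMART for an
    # overall verdict and scan kernel messages for common disk-error strings.
    # Device name and patterns are assumptions; run as root for smartctl.
    use strict;
    use warnings;

    my $device   = '/dev/sda';            # adjust for your system
    my @patterns = (
        qr/I\/O error/i,
        qr/medium error/i,
        qr/Buffer I\/O error/i,
        qr/failed command/i,
        qr/UNC/,                          # uncorrectable sector reported by libata
    );

    # 1) Overall SMART health (requires smartmontools).
    my $smart = `smartctl -H $device 2>&1`;
    if ($smart =~ /PASSED/) {
        print "$device: SMART overall health: PASSED\n";
    } else {
        print "$device: SMART health check did not return PASSED -- investigate\n";
    }

    # 2) Scan kernel messages for disk-related error strings.
    for my $line (`dmesg`) {
        for my $pat (@patterns) {
            if ($line =~ $pat) {
                print "kernel: $line";
                last;
            }
        }
    }

Cron it every minute and mail whatever it prints; for anything more serious, smartd (part of smartmontools) already does scheduled self-tests and alerting for you.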

SQL Server 2000 Execution Plan: Statistics Missing

I have a situation in production where a procedure takes a different amount of time in two different environments. When I looked at the execution plan, some statistics were missing; I clicked on the icons (which were red to draw attention) and found that statistics are missing on both servers. But I am wondering about another message: there is a field called "number of executes" which was 23 on the slow server and 1 on the fast server. Can someone please explain the importance of this?
Edit: Fragmentation is not a problem, because when I checked I found that reorganizing would only relocate 2% of pages. The new server was created with merge replication. Please advise on "number of executes" in the execution plan and how we can work to reduce it.
Edit: Will rebuilding the indexes make any performance improvement?
SQL 2000 has had issues with statistics and some execution plans in the past, and you would have to add query hints in order to make sure the execution would happen the way you want it. For starters, make sure you are on SP4, and then apply the following patch:
http://support.microsoft.com/kb/936232
This patch, while described as fixing an illegal operation (it resolves crashes on 64-bit machines running SQL 2000), also resolves a few other execution plan issues. That said, I would ultimately recommend upgrading to SQL 2008, which seems to have resolved a number of the statistics issues we used to encounter.
Here is a link that explains in more detail the number of executes:
http://www.qdpma.com/CBO/ExecutionPlanCostModel.html

Resources