How can I reduce CPU in SORT operation - mainframe

I am using DFSORT to copy a tape data set to a temp file, processing around 80,000,000 records. It's taking 3 hours just to copy the data sets.
Is there any way to reduce the CPU time?
Suggestions would be very helpful.
Thank you.
//STEP40 EXEC SORTD
//SORTIN DD DSN=FILEONE(0),
// DISP=SHR
//SORTOUT DD DSN=&&TEMP,
// DISP=(NEW,PASS,DELETE),
// DCB=(RECFM=FB,LRECL=30050,BLKSIZE=0),
// UNIT=TAPE
//SYSOUT DD SYSOUT=*
//SYSPRINT DD SYSOUT=*
//SYSIN DD *
SORT FIELDS=(14,6,PD,A,8,6,PD,A,45,2,ZD,A)
OUTREC IFTHEN=(WHEN=(70,18,CH,EQ,C' encoding="IBM037"'),
OVERLAY=(70:C' encoding="UTF-8"'))
OPTION DYNALLOC=(SYSDA,255)
/*

I love diagnosing these kinds of problems...
80M records at 30K each is about 2.5TB, and since you're reading and writing this data, you're processing a minimum of 5TB (not including I/O to the work files). If I'm doing my math right, that averages out to roughly 500MB/second over three hours.
First thing to do is understand whether DFSORT is really actively running for 3 hours, or whether there are sources of wait time. For instance, if your tapes are multi-volume datasets, there may be wait time for tape mounts. Look for this in the joblog messages - it might be that 20 minutes of your 3 hours is simply spent waiting for the right tapes to be mounted.
You may also have a CPU access problem adding to the wait time. Depending on how your system is set up, your job might only be getting a small slice of CPU time and waiting the rest of the time. You can tell by looking at the CPU time consumed (it's also in the joblog messages) and comparing it to the elapsed time. For instance, if your job gets 1000 CPU seconds (TCB + SRB) over 3 hours, you're averaging about 9% CPU usage over that time. It may be that submitting your job in a different job class makes a difference - ask your local systems programmer.
Of course, 9% CPU time might not be a problem - your job is likely heavily I/O bound, so much of the wait time is spent waiting for I/O to complete, not waiting for more CPU. What you really want to know is whether your wait time is spent waiting for CPU access, waiting for I/O, or something else. Again, your local systems programmer should be able to help you answer this if they know how to read the RMF reports.
Next thing to do is understand your I/O a little better with a goal of reducing the overall number of physical I/O operations that need to be performed and/or making every I/O run a little faster.
Think of it this way: every physical I/O is going to take a minimum of maybe 2-3 milliseconds. In the worst case, if every one of the 160M records you're reading and writing needed its own 3ms I/O, the elapsed time would be 160,000,000 x 0.003 = 480,000 seconds, or roughly five and a half days!
As another responder mentions, blocksize and buffering are your friends. Since most of the time in an I/O operation comes down to firing off the I/O and waiting for the response, a "big I/O" doesn't take all that much longer than a "small I/O". Generally, you want to do as few and as large physical I/O operations as possible to push elapsed time down.
Depending on the type of tape device you're using, you should be able to get up to 256K blocksizes on your tape - with your 30050-byte LRECL that's 8 records per I/O. Your BLKSIZE=0 might already be getting you this, depending on how your system is configured. Note though that this is device dependent, and watch out if your site happens to use one of the virtual tape products that map "real" tape drives to disk - there, blocksizes over a certain limit (32K) tend to run slower.
Buffering is unfortunately more complex than the other answer suggests - it turns out BUFNO applies to relatively simple applications using IBM's QSAM access method, and that isn't what DFSORT uses. Indeed, DFSORT is quite smart about how it does its I/O, and it dynamically creates buffers based on available memory. Still, you might try running your job in a larger region (for instance, REGION=0M in your JCL), and you might find DFSORT options like MAINSIZE=MAX help - see this link for more information.
As for your disk I/O (which includes those SORTWK datasets), there are lots of options here too. Your 30K LRECL limits what you can do with blocking to some degree, but there are all sorts of disk tuning exercises you can go through, from using VIO datasets to PAVs (parallel access volumes). The point is, a lot of this is also configuration-specific, so the right answer is going to depend on what your site has and how it's all configured.
But maybe the most important thing is that you don't want to go at it purely by trial and error until you stumble across the right answer. If you want to learn, get familiar with RMF or whatever performance management tools your site has (or find a systems programmer who's willing to work with you) and dig in. Ask yourself: what's the bottleneck - why isn't this job running faster? Then find the bottleneck, fix it, and move on to the next one. These are tremendous skills to have, and once you know the basics, it stops feeling like a black art and starts feeling like a systematic process you can apply to anything.

A few comments on improving I/O performance, which should improve your overall elapsed time.
On your SORTIN and SORTOUT DD statements, add BUFNO to the DCB parameter, like this:
//SORTIN DD DSN=FILEONE(0),
// DISP=SHR,DCB=BUFNO=192
//SORTOUT DD DSN=&&TEMP,
// DISP=(NEW,PASS,DELETE),
// DCB=(RECFM=FB,LRECL=30050,BLKSIZE=0,BUFNO=192),
// UNIT=TAPE
I chose 192 as it's relatively cheap in terms of memory these days; adjust for your environment. This tells the system how many buffers to allocate for the data set, so many blocks can be read ahead (or written behind) instead of waiting on each one, which reduces time spent on I/O. You can play with this number to get an optimal result. The default is 5.
From IBM's MVS JCL Reference (page 143):
BUFNO=buffers
Specifies the number of buffers to be assigned to the DCB. The maximum normally is 255, but can be less because of the size of the region. Note: Do not code the BUFNO subparameter with the DCB subparameters BUFIN, BUFOUT, or the DD parameter QNAME.
You might also consider the block sizes. The block size on the output seems odd; make sure it is optimized for the device you are writing to. For tape devices it should be as large as possible. For 3480 or 3490 devices it can be as large as 65535. With your LRECL of 30050, you could specify a BLKSIZE of 60100, which gives two records per block - better I/O efficiency.
Here is more information on maximum BLKSIZE selection for tapes:
3490 Emulation (VTS): 262144 (256 KB)
3590: 262144 (256 KB) (note: on some older models the limit is 229376 (224 KB))
One last quick hint: if you are actually writing to tape, specify multiple tape devices. This allows one tape to be written while the next one is being mounted. I've included the BUFNO example here as well:
//SORTOUT DD DSN=&&TEMP,
// DISP=(NEW,PASS,DELETE),
// DCB=(RECFM=FB,LRECL=30050,BLKSIZE=0,BUFNO=192),
// UNIT=(TAPE,2)
Of course these optimizations depend on your physical environment and DFSMS setup.

Since you write
... it takes 3 hours to complete...
I guess what you really want is to reduce elapsed time, not CPU time. Elapsed time depends on many factors such as machine configuration, machine speed, total system load, priority of your job, etc. Without more information about the environment, it is difficult to give advice.
However, I see you're writing the sort output to a temporary data set, so I conclude there is another step that reads that data back in. Why write this data to tape? Disk will surely be faster and will reduce elapsed time.
Peter

Related

Estimate Core capacity required based on load?

I have a quad-core Ubuntu system. Say I see a load average of 60 over the last 15 minutes during peak time; the load average can go up to 150 as well.
This load generally happens only during peak time. Basically, I want to know if there is any standard formula to derive the number of cores ideally required to handle a given load.
Objective:
If I take the load as 60, does it mean that 60 tasks were in the queue on average at any point in time during the last 15 minutes? Will adding CPUs help me serve requests faster, or save the system from hanging or crashing?
Linux load average (as printed by uptime or top) includes tasks in I/O wait, so it can have very little to do with CPU time that could potentially be used in parallel.
If all the tasks were purely CPU bound, then yes 150 sustained load average would mean that potentially 150 cores could be useful. (But if it's not sustained, then it might just be a temporary long queue that wouldn't get that long if you had better CPU throughput.)
If you're getting crashes, that's a huge problem that isn't explainable by high loads. (Unless it's from the out-of-memory killer kicking in.)
It might help to use vmstat or dstat to see how much CPU time is spent in user/kernel space when your load avg. is building up, or if it's probably mostly I/O.
Or of course you probably know what tasks are running on your machine, and whether a single task is I/O bound or CPU bound on an otherwise-idle machine. I/O throughput usually scales somewhat positively with queue depth, except on magnetic hard drives, where deep queues turn sequential reads/writes into seek-heavy workloads.
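If it helps, here's a tiny C check I'd run first (my own sketch, nothing standard): read /proc/loadavg and compare the 15-minute figure against the number of online cores.
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    double l1, l5, l15;
    FILE *f = fopen("/proc/loadavg", "r");

    if (!f || fscanf(f, "%lf %lf %lf", &l1, &l5, &l15) != 3) {
        perror("/proc/loadavg");
        return 1;
    }
    fclose(f);

    long cores = sysconf(_SC_NPROCESSORS_ONLN);    /* online cores */
    printf("load: %.2f / %.2f / %.2f on %ld cores (15-min ratio %.1fx)\n",
           l1, l5, l15, cores, l15 / (double)cores);

    /* Remember the load figure includes tasks in uninterruptible I/O wait,
       so a high ratio does not by itself mean "add CPUs". */
    return 0;
}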

linux CPU cache slowdown

We're getting overnight lockups on our embedded (Arm) linux product but are having trouble pinning it down. It usually takes 12-16 hours from power on for the problem to manifest itself. I've installed sysstat so I can run sar logging, and I've got a bunch of data, but I'm having trouble interpreting the results.
The targets only have 512 MB of RAM (we have other models with 1 GB, which see this issue much less often), and have no disk swap file, to avoid wearing the eMMCs.
Some kind of paging / virtual memory event is initiating the problem. In the sar logs, pgpin/s, pgnscand/s, pgsteal/s and majflt/s all increase steadily before snowballing to crazy levels. This pushes the CPU load up to correspondingly high levels (30-60 on dual-core Arm chips). At the same time, the frmpg/s values go very negative, whilst campg/s goes highly positive. The upshot is that the system is trying to allocate a large number of cache pages all at once. I don't understand why this would be.
The target then essentially locks up until it's rebooted, or someone kills the main GUI process, or it crashes and is restarted (we have a monolithic GUI application that runs all the time and generally does all the serious work on the product). The network shuts down, telnet blocks forever, as do /proc filesystem queries and things that rely on them, like top. The memory allocation profile of the main application in this test is dominated by reading data in from file and caching it as textures in video memory (shared with main RAM) in an LRU using OpenGL ES 2.0. Most of the time it'll be accessing a single file (they are about 50 MB in size), but I guess it could be triggered by having to suddenly use a new file and trying to cache all 50 MB of it in one go. I haven't yet done the test (putting more logging in) to correlate this event with these system effects.
The odd thing is that the actual free and cached RAM levels don't show an obvious lack of memory (I have seen the oom-killer swoop in and kill the main application with >100 MB free and 40 MB of cache RAM). The main application's memory usage seems reasonably well-behaved, with a VmRSS value that is pretty stable. Valgrind hasn't found any progressive leaks that would happen during operation.
The behaviour seems like that of a system frantically swapping out to disk and making everything run dog slow as a result, but I don't know if this is a known effect in a free<->cache RAM exchange system.
My problem is superficially similar to question: linux high kernel cpu usage on memory initialization but that issue seemed driven by disk swap file management. However, dirty page flushing does seem plausible for my issue.
I haven't tried playing with the various vm files under /proc/sys/vm yet. vfs_cache_pressure and possibly swappiness would seem good candidates for some tuning, but I'd like some insight into good values to try here. vfs_cache_pressure seems ill-defined as to what the difference between setting it to 200 as opposed to 10000 would be quantitatively.
The other interesting fact is that it is a progressive problem. It might take 12 hours for the effect to happen the first time. If the main app is killed and restarted, it seems to happen every 3 hours after that fact. A full cache purge might push this back out, though.
Here's a link to the log data, with two files: sar1.log, the complete output of sar -A, and overview.log, an extract of free/cached memory, CPU load, MainGuiApp memory stats, and the -B and -R sar outputs for the interesting period between midnight and 3:40am:
https://drive.google.com/folderview?id=0B615EGF3fosPZ2kwUDlURk1XNFE&usp=sharing
So, to sum up, what's my best plan here? Tune vm to tend to recycle pages more often to make it less bursty? Are my assumptions about what's happening even valid given the log data? Is there a cleverer way of dealing with this memory usage model?
Thanks for your help.
Update, 5th June 2013:
I've tried the brute-force approach and put in a script which echoes 3 to drop_caches every hour. This seems to be maintaining the steady state of the system right now, and the sar -B stats stay on the flat portion, with very few major faults and 0.0 pgscand/s. However, I don't understand why keeping the cached RAM very low mitigates a problem in which the kernel is trying to add the universe to cache RAM.
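For reference, the hourly script boils down to the equivalent of the following (a C rendering of the usual echo into drop_caches, shown only to make the mechanism explicit; it needs root):
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    sync();                                        /* flush dirty pages first */

    int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* "3" = drop page cache plus dentries/inodes; only clean cache is
       discarded, so this is safe but heavy-handed. */
    if (write(fd, "3\n", 2) != 2) perror("write");

    close(fd);
    return 0;
}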

linux: smart fsync()?

I'm recording audio and writing it to an SD card; the data rate is around 1.5 MB/s. I'm using a class 4 SD card with an ext4 file system.
After a certain interval, the kernel automatically syncs the files. The downside of this is that my application's buffers pile up waiting to be written to disk.
I think that if the kernel synced more frequently than it does now, it might solve the issue.
I used fsync() in the application to sync at certain intervals. But this does not solve the problem, because sometimes the kernel has synced just before the application called fsync(), so that fsync() call was a waste of time.
I need a syncing mechanism (say, smart_fsync()), so that when the application calls smart_fsync(), the kernel will sync only if it has not synced in a while, and otherwise just return.
Since there is no such function as smart_fsync(), what could be a possible workaround?
The first question to ask is, what exactly is the problem you're experiencing? The kernel will flush dirty (unwritten cached) buffers periodically - this is because doing so tends to be faster than flushing synchronously (less latency hit for applications). The downside is that this means a larger latency hit if you reach the kernel's limit on dirty data (and potentially more data loss after an unclean shutdown).
If you want to ensure that the data hits disk ASAP, then you should simply open the file with the O_SYNC option. This will flush the data to disk immediately upon write(). Of course, this implies a significant performance penalty, but on the other hand you have complete control over when the data is flushed.
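For example (a minimal sketch - the filename and buffer are just placeholders):
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* O_SYNC: each write() returns only after the data has reached
       stable storage, so nothing piles up in the page cache - at the
       cost of paying the full device latency on every write. */
    int fd = open("audio.raw", O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[64 * 1024];
    memset(buf, 0, sizeof buf);              /* stand-in for audio data */

    if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf)
        perror("write");

    close(fd);
    return 0;
}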
If you are experiencing drops in throughput while the syncing is going on, most likely you are attempting to write faster than the disk can support, and reaching the dirty page memory limit. Unfortunately, this would mean the hardware is simply not up to the write rate you are attempting to push at it - you'll need to write slower, or buffer the data up on faster media (or add more RAM!).
Note also that your 'smart fsync' is exactly what the kernel implements - it will flush pages when one of the following is true:
* There is too much dirty data in memory. Triggers asynchronously (without blocking writes) when the total amount of dirty data exceeds /proc/sys/vm/dirty_background_bytes, or when the percentage of total memory exceeds /proc/sys/vm/dirty_background_ratio. Triggers synchronously (blocking your application's write() for an extended time) when the total amount of data exceeds /proc/sys/vm/dirty_bytes, or the percentage of total memory exceeds /proc/sys/vm/dirty_ratio.
* Dirty data has been pending in memory for too long. The pdflush daemon checks for old dirty blocks every /proc/sys/vm/dirty_writeback_centisecs centiseconds (1/100 seconds), and will expire blocks if they have been in memory for longer than /proc/sys/vm/dirty_expire_centisecs.
It's possible that tuning these parameters might help a bit, but you're probably better off figuring out why the defaults aren't keeping up as is.
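If you do still want a user-space throttle, a "sync only if we haven't synced lately" wrapper is only a few lines. Here's a sketch (the name smart_fsync and the interval are just illustrative, and it tracks one timestamp per process):
#include <time.h>
#include <unistd.h>

/* Call fsync(fd) at most once per min_interval seconds; otherwise
   return immediately. Extend to per-fd bookkeeping if you sync
   several files. */
static int smart_fsync(int fd, double min_interval)
{
    static struct timespec last;          /* zero on first call */
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    double elapsed = (now.tv_sec - last.tv_sec)
                   + (now.tv_nsec - last.tv_nsec) / 1e9;

    if (last.tv_sec != 0 && elapsed < min_interval)
        return 0;                         /* synced recently enough */

    last = now;
    return fsync(fd);
}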

Severe multi-threaded memory bottleneck after reaching a specific number of cores

We are testing our software for the first time on a machine with > 12 cores for scalability and we are encountering a nasty drop in performance after the 12th thread is added. After spending a couple days on this, we are stumped regarding what to try next.
The test system is a dual Opteron 6174 (2x12 cores) with 16 GB of memory, Windows Server 2008 R2.
Basically, performance peaks at 10-12 threads, then drops off a cliff and is soon down to about the same rate as with 4 threads. The drop-off is fairly steep, and by 16-20 threads it reaches bottom in terms of throughput. We have tested both with a single process running multiple threads and with multiple processes each running a single thread - the results are pretty much the same. The processing is fairly memory intensive and somewhat disk intensive.
We are fairly certain this is a memory bottleneck, but we don't believe it is a cache issue. The evidence is as follows:
CPU usage continues to climb from 50% to 100% when scaling from 12 to 24 threads. If we were having synchronization/deadlock issues, we would have expected CPU usage to top out before reaching 100%.
Testing while copying a large amount of files in the background had very little impact on the processing rates. We think this rules out disk i/o as the bottleneck.
The commit charge is only about 4 GBs, so we should be well below the threshold in which paging would become an issue.
The best data comes from using AMD's CodeAnalyst tool. CodeAnalyst shows the Windows kernel going from about 6% of CPU time with 12 threads to 80-90% of CPU time with 24 threads. A vast majority of that time is spent in the ExAcquireResourceSharedLite (50%) and KeAcquireInStackQueuedSpinLockAtDpcLevel (46%) functions. Here are the highlights of the kernel's factor change when going from running with 12 threads to running with 24:
Instructions: 5.56 (times more)
Clock cycles: 10.39
Memory operations: 4.58
Cache miss ratio: 0.25 (actual cache miss ratio is 0.1, 4 times smaller than with 12 threads)
Avg cache miss latency: 8.92
Total cache miss latency: 6.69
Mem bank load conflict: 11.32
Mem bank store conflict: 2.73
Mem forwarded: 7.42
We thought this might be evidence of the problem described in this paper, however we found that pinning each worker thread/process to a particular core didn't improve the results at all (if anything, performance got a little worse).
So that's where we're at. Any ideas on the precise cause of this bottleneck or how we might avoid it?
I'm not sure that I understand the issues completely enough to offer you a solution, but from what you've explained I may have some alternative viewpoints which could be of help.
I program in C so what works for me may not be applicable in your case.
Your processors have 12MB of L3 and 6MB of L2 which is big but in my view they're seldom big enough!
You're probably using rdtsc for timing individual sections. When I use it, I have a statistics structure into which I send the measurement results from different parts of the executing code. Average, minimum, maximum and number of observations are obvious, but standard deviation also has its place, in that it can help you decide whether a large maximum value should be investigated or not. Standard deviation only needs to be calculated when it needs to be read out: until then it can be stored in its components (n, sum x, sum x^2). Unless you're timing very short sequences, you can omit the preceding synchronizing instruction. Make sure you quantify the timing overhead, if only to be able to rule it out as insignificant.
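To illustrate the kind of statistics structure I mean (a sketch only - __rdtsc() is the gcc/clang intrinsic; use your compiler's equivalent):
#include <math.h>
#include <stdint.h>
#include <x86intrin.h>               /* __rdtsc() on gcc/clang */

typedef struct {
    uint64_t n;
    double   sum, sumsq;             /* components of the standard deviation */
    uint64_t min, max;
} TIMESTATS;

static inline void stats_add(TIMESTATS *s, uint64_t cycles)
{
    if (s->n == 0 || cycles < s->min) s->min = cycles;
    if (cycles > s->max)              s->max = cycles;
    s->n++;
    s->sum   += (double)cycles;
    s->sumsq += (double)cycles * (double)cycles;
}

static inline double stats_stddev(const TIMESTATS *s)
{
    double mean = s->sum / (double)s->n;
    return sqrt(s->sumsq / (double)s->n - mean * mean);
}

/* usage: uint64_t t0 = __rdtsc(); ...measured code...; stats_add(&st, __rdtsc() - t0); */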
When I program multi-threaded, I try to make each core's/thread's task as "memory limited" as possible. By memory limited I mean not doing things which require unnecessary memory accesses. Unnecessary memory access usually means as much inline code as possible and as little OS access as possible. To me, the OS is a great unknown in terms of how much memory work a call to it will generate, so I try to keep calls to it to a minimum. In the same manner, but usually to a lesser performance-impacting extent, I try to avoid calling application functions: if they must be called, I'd rather they didn't call a lot of other stuff.
In the same manner I minimize memory allocations: if I need several I add them together into one and then subdivide that one big allocation into smaller ones. This will help later allocations in that they will need to loop through fewer blocks before finding the block returned. I only block initialize when absolutely necessary.
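A sketch of what I mean by adding allocations together (the region names and the alignment note are illustrative):
#include <stdlib.h>

/* One allocation subdivided into three regions instead of three mallocs.
   Round the sizes up to a multiple of 16 if the sub-regions need alignment. */
typedef struct {
    char *base;
    void *headers, *index, *data;
} POOL;

static int pool_init(POOL *p, size_t hdr_sz, size_t idx_sz, size_t data_sz)
{
    p->base = malloc(hdr_sz + idx_sz + data_sz);
    if (!p->base) return -1;
    p->headers = p->base;
    p->index   = p->base + hdr_sz;
    p->data    = p->base + hdr_sz + idx_sz;
    return 0;
}

static void pool_free(POOL *p)
{
    free(p->base);               /* one free releases all three regions */
    p->base = NULL;
}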
I also try to reduce code size by inlining. When moving/setting small blocks of memory I prefer using intrinsics based on rep movsb and rep stosb rather than calling memcpy/memset, which are usually optimized for larger blocks and not especially limited in size.
I've only recently begun using spinlocks, but I implement them such that they become inline (anything is better than calling the OS!). I guess the OS alternative is critical sections, and though they are fast, local spinlocks are faster. Since critical sections perform additional processing, they prevent application processing from being performed during that time. This is the implementation:
inline void spinlock_init (SPINLOCK *slp)
{
slp->lock_part=0;  /* 0 = free, 1 = held */
}
/* Non-zero means the lock was already held, so callers spin on it:
   while (spinlock_failed (slp)) ;  -- __xchg is an atomic-exchange intrinsic */
inline char spinlock_failed (SPINLOCK *slp)
{
return (char) __xchg (&slp->lock_part,1);
}
Or more elaborate (but not overly so):
inline char spinlock_failed (SPINLOCK *slp)
{
if (__xchg (&slp->lock_part,1)==1) return 1;
slp->count_part=1;
return 0;
}
And to release
inline void spinlock_leave (SPINLOCK *slp)
{
slp->lock_part=0;
}
Or
inline void spinlock_leave (SPINLOCK *slp)
{
if (slp->count_part==0) __breakpoint ();  /* trap: releasing a lock that isn't held */
if (--slp->count_part==0) slp->lock_part=0;
}
The count part is something I've brought along from embedded (and other programming) where it is used for handling nested interrupts.
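For what it's worth, the simple variant restated with portable C11 atomics looks like this (a sketch, assuming <stdatomic.h> is available; __xchg above is essentially an atomic exchange):
#include <stdatomic.h>

typedef struct {
    atomic_int lock_part;        /* 0 = free, 1 = held */
    int        count_part;       /* nesting count, as in the variant above */
} SPINLOCK_C11;

static inline void spinlock_c11_init(SPINLOCK_C11 *slp)
{
    atomic_init(&slp->lock_part, 0);
    slp->count_part = 0;
}

/* Non-zero means the lock was already held; spin: while (spinlock_c11_failed(&sl)) ; */
static inline int spinlock_c11_failed(SPINLOCK_C11 *slp)
{
    return atomic_exchange_explicit(&slp->lock_part, 1, memory_order_acquire);
}

static inline void spinlock_c11_leave(SPINLOCK_C11 *slp)
{
    atomic_store_explicit(&slp->lock_part, 0, memory_order_release);
}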
I'm also a big fan of IOCPs (I/O completion ports) for their efficiency in handling I/O events and threads, but your description does not indicate whether your application could use them. In any case you appear to economize on them, which is good.
To address your bullet points:
1) If you have 12 cores at 100% usage and 12 cores idle, then your total CPU usage would be 50%. If your synchronization is spinlock-esque, then your threads would still be saturating their CPUs even while not accomplishing useful work.
2) skipped
3) I agree with your conclusion. In the future, you should know that Perfmon has a counter: Process\Page Faults/sec that can verify this.
4) If you don't have the private symbols for ntoskrnl, CodeAnalyst may not be able to tell you the correct function names in its profile. Rather, it can only point to the nearest function for which it has symbols. Can you get stack traces with the profiles using CodeAnalyst? This could help you determine what operation your threads perform that drives the kernel usage.
Also, my former team at Microsoft has provided a number of tools and guidelines for performance analysis here, including taking stack traces on CPU profiles.

How long does a context switch take in Linux?

I'm curious how many cycles it takes to change contexts in Linux. I'm specifically using an E5405 Xeon (x64), but I'd love to see how it compares to other platforms as well.
There's a free package called LMbench, written by Larry McVoy and friends. It provides a bunch of OS and hardware benchmarks.
One of the tests is called lat_ctx, and it measures context switch latencies.
Google for lmbench and check for yourself on your own hardware. It's the only way to get a number meaningful to you.
Gilad
Run vmstat on your machine while doing something that requires heavy context switching. It doesn't tell you how long the actual switch takes, but it will tell you how many switches you do per second.
Then you have to estimate how much of each timeslice is spent executing actual code, compared to switching context. Maybe 100:1 or so? I don't know. 1000:1?
A machine of mine is currently doing roughly 3000 switches per second, i.e. about 0.3 ms per timeslice. With a ratio of 100:1, that would mean the actual switch takes about 0.003 ms.
But with multiple cores, threads yielding execution, etc., I wouldn't draw any firm conclusion from such a guess :)
I've written code that's able to echo (small) UDP packets at 200k packets per second.
That suggests that it's possible to context switch in not more than 2.5 microseconds, with the actual context switch probably taking somewhat less than that.
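If you want a rough self-measured number rather than a benchmark suite, ping-ponging a byte between two processes over pipes forces a switch on every hop; pin both processes to one core (e.g. with taskset -c 0) so a real switch happens each time. Here's a sketch (it includes pipe overhead, so treat the result as an upper bound):
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ROUNDS 100000

int main(void)
{
    int p2c[2], c2p[2];                 /* parent->child and child->parent pipes */
    char b = 0;

    if (pipe(p2c) || pipe(c2p)) { perror("pipe"); return 1; }

    if (fork() == 0) {                  /* child: echo every byte straight back */
        for (int i = 0; i < ROUNDS; i++) {
            if (read(p2c[0], &b, 1) != 1) _exit(1);
            if (write(c2p[1], &b, 1) != 1) _exit(1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {  /* each round trip forces ~2 switches */
        if (write(p2c[1], &b, 1) != 1 || read(c2p[0], &b, 1) != 1) {
            perror("pingpong");
            return 1;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.0f ns per switch (including pipe overhead)\n", ns / (ROUNDS * 2.0));
    return 0;
}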
