ONCODE=451 The STORAGE condition was raised - mainframe

I recently issued an update to a HOST Reporting program. Our shop uses Enterprise PL/I.
I added 2 new structures declared as BASED. I basically use an ALLOC statement to allocate the required storage for the structures and then pass the pointers to a fetchable to get some details I need.
It failed with a storage error during a pilot run in production (LE message below). It was trying to process more than a million records, and it appears the job simply ran out of storage.
IBM0451S ONCODE=451 The STORAGE condition was raised.
From entry point MXXX at compile unit offset +000001EA at entry offset
More Details: IBM0451S
As a fix, I am issuing an update to explicitly add a FREE for the storage after the fetchable call, and I also updated the REGION parameter of my JCL to 0M.
Thought I'd check if you have seen this kind of error before and have any additional thoughts. Thanks.
Here is what my latest updated code looks like:
DECLARES
=======================================
DCL
01 IN_DATA BASED(IN_PTR),
% INCLUDE SYSLIB(XXXXXPAA);
DCL
01 OUT_DATA BASED(OUT_PTR),
% INCLUDE SYSLIB(YYYYYPAA);
DCL
01 IN_PTR PTR;
DCL
01 OUT_PTR PTR;
======================================
The block of code below runs for every record that gets processed. The
FREE statement is what I have now added. I was thinking that because I did
not have a FREE before, the ALLOC was getting new STORAGE
every time it executed that block of code and eventually ran out of storage.
ALLOC IN_DATA;
ALLOC OUT_DATA;
IN_DATA = '';
OUT_DATA = '';
IN_DATA.CODE = 'XXX';
CALL FABCD(IN_PTR,
OUT_PTR);
IF OUT_DATA.RTRN_CD <= 04 THEN
DETAIL_REC.XYZ = OUT_DATA.YYY_CODE;
ELSE
;
FREE IN_DATA; -------->> What I added now
FREE OUT_DATA; -------->> What I added now
=============================================

Apart from the storage problem, allocating and freeing storage for each and every record processed is a huge performance killer.
From the snippet you show, it is not clear to me a) why you ALLOC in the first place, and b) why you think you need a fresh piece of storage for each record.
Just allocate the structures locally, then pass a pointer to them to the function.
DCL 01 IN_DATA,
% INCLUDE SYSLIB(XXXXXPAA);
DCL 01 OUT_DATA,
% INCLUDE SYSLIB(YYYYYPAA);
DCL IN_PTR PTR INIT ( ADDR( IN_DATA) );
DCL OUT_PTR PTR INIT ( ADDR( OUT_DATA ) );
This will have PL/I allocate the structures only once, but still allows pointers to the storage to be passed to the function routine.
I would also remove the second performance killer: the probably unneeded initialization of the structures
IN_DATA = '';
OUT_DATA = '';
This does a field-by-field initialization. Don't do this unless you have a good reason.

This is expected behavior. From your description, your initial code had a memory leak, allocating storage without ever freeing it. Now that you have added code to free the allocated memory when it is no longer needed, you likely don't need REGION=0M, though, as indicated in a comment, it may not have been doing what you wanted anyway.

Related

Buffer overflow exploitation 101

I've heard in a lot of places that buffer overflows and illegal indexing in C-like languages may compromise the security of a system. But in my experience all they do is crash the program I'm running. Can anyone explain how buffer overflows could cause security problems? An example would be nice.
I'm looking for a conceptual explanation of how something like this could work. I don't have any experience with ethical hacking.
First, buffer overflows (BOF) are only one method of gaining code execution. When one occurs, the impact is that the attacker basically gains control of the process. This means the attacker will be able to make the process execute arbitrary code with the current process privileges (whether the process runs as a high- or low-privileged user on the system will respectively increase or reduce the impact of exploiting a BOF in that application). This is why it is always strongly recommended to run applications with the least privileges needed.
Basically, to understand how a BOF works, you have to understand how the code you have built gets compiled into machine code (ASM) and how the data managed by your software is stored in memory.
I will try to give you a basic example of a subcategory of BOF called stack-based buffer overflows:
Imagine you have an application asking the user to provide a username.
This data will be read from user input and then stored in a variable called USERNAME, which has been allocated as a 20-byte array of chars.
For this scenario to work, we will assume the program does not check the length of the user input.
At some point during data processing, the user input is copied to the USERNAME variable (20 bytes), but since the user input is longer (let's say 500 bytes), the data around this variable will be overwritten in memory:
Imagine such memory layout :
size in bytes 20 4 4 4
data [USERNAME][variable2][variable3][RETURN ADDRESS]
If you define the 3 local variables USERNAME, variable2 and variable3, they may be stored in memory the way shown above.
Notice the RETURN ADDRESS: this 4-byte memory region stores the address to return to in the function that called your current function (thanks to this, when you call a function in your program and reach the end of that function, the program flow naturally goes back to the next instruction just after the initial call to that function).
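As a minimal sketch, such a vulnerable function could look like the following (names and sizes are illustrative, and the actual stack layout is compiler-dependent):
#include <stdio.h>
#include <string.h>

void login(const char *input)
{
    char USERNAME[20];           /* the 20-byte buffer from the layout above   */
    int  variable2, variable3;   /* neighbours that can get clobbered          */

    strcpy(USERNAME, input);     /* no length check: anything past 19 chars    */
                                 /* plus the NUL spills over USERNAME's bounds */
    printf("Hello %s\n", USERNAME);
}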
If an attacker provides a username of 24 'A' chars, the memory layout would become something like this:
size in bytes 20 4 4 4
data [USERNAME][variable2][variable3][RETURN ADDRESS]
new data [AAA...AA][ AAAA ][variable3][RETURN ADDRESS]
Now, if an attacker sends 50 'A' chars as the USERNAME, the memory layout would look like this:
size in bytes 20 4 4 4
data [USERNAME][variable2][variable3][RETURN ADDRESS]
new data [AAA...AA][ AAAA ][ AAAA ][ AAAA ][OTHER AAA...]
In this situation, at the end of the function's execution, the program would crash because it tries to jump to the invalid address 0x41414141 (char 'A' = 0x41): the overwritten RETURN ADDRESS does not point to a correct code address.
If you replace the multiple 'A's with well-chosen bytes, you may be able to:
overwrite RETURN ADDRESS with an interesting location.
place "executable code" in the first 20 + 4 + 4 bytes
You could, for instance, set RETURN ADDRESS to the address of the first byte of the USERNAME variable (this method is mostly not usable anymore thanks to the many protections that have been added both to OSes and to compiled programs).
I know it is quite complex to understand at first, and this explanation is a very basic one. If you want more detail, please just ask.
I suggest you have a look at great tutorials like this one, which are quite advanced but more realistic.

Memory Leaks while using gstbuffer

I have a pipeline which takes data from a webcam and processes it.
For the processing I need to pull the buffer to an appsink and push it back into the pipeline using an appsrc element.
While pushing, I used the gst_buffer_new_wrapped function.
A new buffer is allocated every time I push data, but how to free that memory is the problem.
I tried gst_buffer_unref(buffer);
and got the error below.
Error in `./uuHiesSoaServer': free(): invalid pointer: 0x00007fddf52f6000
I had taken the data into an unsigned char pointer and then wrapped it into a GstBuffer based on the size.
Now how do I free the allocated memory?
g_signal_emit_by_name (Source, "push-buffer", Buffer, &ret);
I used the above call to push data into Source (an appsrc).
That call runs continuously on a separate thread.
When data is available, the thread function creates a buffer using
gst_buffer_new_wrapped((void *)data, Size);
When checking for memory leaks in valgrind, the above line was reported as a leak.
How do I solve this?
How do you push the buffer into appsrc?
If you use the gst_app_src_push_buffer function, I guess you do not have to free resources, because gst_app_src_push_buffer takes ownership of the buffer (which means it also frees it).
Check this example.
If you use the need-data callback, you may need to free the data - check this example.
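A sketch of the push-buffer path (an assumption here: the wrapped data was allocated with g_malloc, since gst_buffer_new_wrapped frees it with g_free when the buffer is destroyed):
#include <gst/gst.h>
#include <gst/app/gstappsrc.h>

/* appsrc takes ownership of the buffer, so no unref is needed after the push */
static GstFlowReturn push_frame (GstAppSrc *source, guint8 *data, gsize size)
{
    /* data must come from g_malloc(): the buffer will g_free() it on destroy */
    GstBuffer *buffer = gst_buffer_new_wrapped (data, size);
    return gst_app_src_push_buffer (source, buffer); /* consumes the buffer */
}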
HTH

Copy Constructor for MyString causes HEAP error. Only Gives Error in Debug Mode

So, I've never experienced this before. Normally when I get an error, it always triggers a breakpoint. However, this time when I build the solution and run it without debugging (ctrl+F5), it gives me no error and runs correctly. But when I try debugging it (F5), it gives me this error:
HEAP[MyString.exe]: HEAP: Free Heap block 294bd8 modified at 294c00 after it was freed
Windows has triggered a breakpoint in MyString.exe.
This may be due to a corruption of the heap, which indicates a bug in MyString.exe or any of the DLLs it has loaded.
This may also be due to the user pressing F12 while MyString.exe has focus.
The output window may have more diagnostic information.
This assignment is due tonight, so I'd appreciate any quick help.
My code is here:
https://gist.github.com/anonymous/8d84b21be6d1f4bc18bf
I've narrowed the problem down to line 18 in main.cpp ( c = a + b; ). The concatenation succeeds, but when the result is copied into c, the error occurs at line 56 in MyString.cpp ( pData = new char[length + 1]; ).
The kicker is that I hadn't had a problem with this line of code until I tried overloading operator>>. I've since scrapped that code for the sake of trying to debug this.
Again, any help would be appreciated!
Let's go through line 18:
1. In line 17 you create string c with dynamically allocated memory inside.
2. You make the assignment c = a + b:
2.1. Operator+ creates a LOCAL object 'cat'.
2.2. cat's memory is allocated dynamically.
2.3. cat becomes the concatenation of the two given strings.
2.4. Operator+ exits. cat is a LOCAL object, so it is destroyed.
2.4.1. cat's destructor runs.
2.4.2. The destructor deletes pData.
2.4.3. After the delete you do *pData = NULL. //ERROR - should be pData = NULL (1)
2.5. c is initialized with the result of operator+.
2.6. operator= calls copy().
2.7. copy() allocates new memory without checking the current one. //ERROR - memory leak (2)
(1)
pData is a char*. In the destructor we have delete[] pData (which frees the memory) and then *pData = NULL.
Because pData is a pointer, *pData is the same as pData[0]. So you write to already-freed memory. This is the cause of your error.
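A corrected destructor would look something like this (a sketch; the class shape is assumed from the walkthrough above):
#include <cstddef>

class MyString {
    char* pData;     // buffer allocated with new char[...]
public:
    ~MyString()
    {
        delete[] pData;  // free the buffer
        pData = NULL;    // reset the pointer itself, not *pData
    }
};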
(2)
An additional problem: copy() overwrites the current memory without checking. It should be:
copy()
{
if(this->pData)
{
delete[] this->pData; // pData was allocated with new char[...]
}
//now allocate the new buffer and copy
}
Also, when dealing with raw bytes (chars), you may not want to use new[] and delete[], but malloc() and free() instead. In that case, in functions like copy(), instead of calling delete[] and then new[], you would simply use realloc().
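A sketch of that realloc-based copy, written as a free helper function (hypothetical name and signature; this is only valid once the buffer is consistently allocated with malloc/realloc rather than new[]):
#include <cstdlib>
#include <cstring>

// Resize 'dst' to hold 'length' chars plus the terminator and copy 'src' in.
static char* copy_realloc(char* dst, const char* src, size_t length)
{
    char* p = (char*) realloc(dst, length + 1); // reuses the old block when possible
    if (p == NULL)
        return dst;                             // allocation failed; old buffer intact
    memcpy(p, src, length + 1);                 // copy including the NUL terminator
    return p;
}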
EDIT:
One more thing: errors caused by heap damage usually surface during debugging. In a release binary, this code would simply overwrite some freed (and maybe already reused) memory. That's why debugging is so important when playing with memory in C++.

x86 equivalent for LWARX and STWCX

I'm looking for an equivalent of LWARX and STWCX (as found on PowerPC processors) or a way to implement similar functionality on the x86 platform. Also, where would be the best place to find out about such things (i.e. good articles/websites/forums for lock/wait-free programming)?
Edit
I think I might need to give more details as it is being assumed that I'm just looking for a CAS (compare and swap) operation. What I'm trying to do is implement a lock-free reference counting system with smart pointers that can be accessed and changed by multiple threads. I basically need a way to implement the following function on an x86 processor.
int* IncrementAndRetrieve(int **ptr)
{
int val;
int *pval;
do
{
// fetch the pointer to the value
pval = *ptr;
// if it's NULL, then just return NULL; the smart pointer
// will then become NULL as well
if(pval == NULL)
return NULL;
// Grab the reference count
val = lwarx(pval);
// make sure the pointer we grabbed the value from
// is still the same one referred to by 'ptr'
if(pval != *ptr)
continue;
// Increment the reference count via 'stwcx' if any other threads
// have done anything that could potentially break then it should
// fail and try again
} while(!stwcx(pval, val + 1));
return pval;
}
I really need something that mimics LWARX and STWCX fairly accurately to pull this off (I can't figure out a way to do this with the CompareExchange, swap or add functions I've found so far for the x86).
Thanks
As Michael mentioned, what you're probably looking for is the cmpxchg instruction.
It's important to point out though that the PPC method of accomplishing this is known as Load Link / Store Conditional (LL/SC), while the x86 architecture uses Compare And Swap (CAS). LL/SC has stronger semantics than CAS in that any change to the value at the conditioned address will cause the store to fail, even if the other change replaces the value with the same value that the load was conditioned on. CAS, on the other hand, would succeed in this case. This is known as the ABA problem (see the CAS link for more info).
If you need the stronger semantics on the x86 architecture, you can approximate them by using the x86's double-width compare-and-swap (DWCAS) instruction cmpxchg8b, or cmpxchg16b under x86_64. This allows you to atomically swap two consecutive 'natural-sized' words at once, instead of just the usual one. The basic idea is that one of the two words contains the value of interest and the other contains an always-incrementing 'mutation count'. Although this does not technically eliminate the problem, the likelihood of the mutation counter wrapping between attempts is so low that it's a reasonable substitute for most purposes.
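A sketch of that idea using C++11 atomics (names are illustrative; on x86_64 the compiler may emit lock cmpxchg16b for the 16-byte struct, depending on flags, or fall back to a lock):
#include <atomic>
#include <cstdint>

// The pointer of interest plus an always-incrementing mutation count,
// compared and swapped together as one double-width unit.
struct VersionedPtr {
    int*     ptr;
    uint64_t ver;
};

std::atomic<VersionedPtr> slot{ VersionedPtr{nullptr, 0} };

void publish(int* newPtr)
{
    VersionedPtr oldv = slot.load();
    VersionedPtr newv;
    do {
        newv.ptr = newPtr;
        newv.ver = oldv.ver + 1;  // bump the counter on every successful swap
    } while (!slot.compare_exchange_weak(oldv, newv)); // oldv is refreshed on failure
}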
x86 does not directly support "optimistic concurrency" the way the PPC does -- rather, x86's support for concurrency is based on a "lock prefix", see here. (Some so-called "atomic" instructions such as XCHG actually get their atomicity by implicitly asserting the LOCK prefix, whether the assembly programmer has coded it or not.) It's not exactly "bomb-proof", to put it diplomatically (indeed, it's rather accident-prone, I would say;-).
You're probably looking for the cmpxchg family of instructions.
You'll need to precede these with the lock prefix to get equivalent behaviour.
Have a look here for a quick overview of what's available.
You'll likely end up with something similar to this:
mov ecx,dword ptr [esp+4]          ; destination address
mov edx,dword ptr [esp+8]          ; new value to store
mov eax,dword ptr [esp+12]         ; expected (comparand) value
lock cmpxchg dword ptr [ecx],edx   ; if [ecx]==eax then [ecx]=edx, else eax=[ecx]
ret 12
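The same operation is available without hand-written assembly through C++11 atomics (a sketch; on x86 this compiles down to a lock cmpxchg):
#include <atomic>

// Returns true and stores 'desired' if 'target' still holds 'expected';
// otherwise leaves 'target' unchanged.
bool cas32(std::atomic<int>& target, int expected, int desired)
{
    return target.compare_exchange_strong(expected, desired);
}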
You should read this paper...
Edit
In response to the updated question, are you looking to do something like the Boost shared_ptr? If so, have a look at that code and the files in that directory - they'll definitely get you started.
If you are on 64 bits and limit yourself to, say, 1 TB of heap, you can pack the counter into the 24 unused top bits. If you have word-aligned pointers, the bottom 5 bits are also available.
int* IncrementAndRetrieve(uintptr_t *ptr)
{
uintptr_t val;
int *unpacked;
do
{
val = *ptr;             // packed word: pointer plus counter bits
unpacked = unpack(val); // strip the counter bits to get the real pointer
if(unpacked == NULL)
return NULL;
// the counter sits in the otherwise unused bits, so val + 1 bumps it;
// the CAS on the whole packed word catches any intervening change
} while(!cas(ptr, val, val + 1));
return unpacked;
}
I don't know if LWARX and STWCX invalidate the whole cache line; CAS and DCAS do. This means that unless you are willing to throw away a lot of memory (64 bytes for each independently "lockable" pointer), you won't see much improvement if you are really pushing your software under stress. The best results I've seen so far were when people consciously sacrificed 64 bytes, planned their structures around it (packing stuff that won't be subject to contention), kept everything aligned on 64-byte boundaries, and used explicit read and write data barriers. Cache-line invalidation can cost approx. 20 to 100 cycles, making it a bigger real performance issue than mere lock avoidance.
Also, you'd have to plan a different memory allocation strategy to manage either controlled leaking (if you can partition code into logical "request processing" - one request "leaks" and then releases all its memory in bulk at the end) or detailed allocation management, so that one structure under contention never receives memory released by elements of the same structure/collection (to prevent ABA). Some of that can be very counter-intuitive, but it's either that or paying the price for GC.
What you are trying to do will not work the way you expect. What you implemented above can be done with the InterlockedIncrement function (a Win32 function; in assembly: lock XADD).
The reason your code does not do what you think it does is that another thread can still change the value between the second read of *ptr and the stwcx without invalidating the stwcx.
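For reference, a sketch of the atomic increment itself via the Win32 API (this bumps the count atomically but does not address the pointer-recheck race described above):
#include <windows.h>

// Atomically increments the reference count at 'pval' and returns the new value.
LONG AddRef(volatile LONG* pval)
{
    return InterlockedIncrement(pval);
}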

Why is thread local storage so slow?

I'm working on a custom mark-release style memory allocator for the D programming language that works by allocating from thread-local regions. It seems that the thread local storage bottleneck is causing a huge (~50%) slowdown in allocating memory from these regions compared to an otherwise identical single threaded version of the code, even after designing my code to have only one TLS lookup per allocation/deallocation. This is based on allocating/freeing memory a large number of times in a loop, and I'm trying to figure out if it's an artifact of my benchmarking method. My understanding is that thread local storage should basically just involve accessing something through an extra layer of indirection, similar to accessing a variable via a pointer. Is this incorrect? How much overhead does thread-local storage typically have?
Note: Although I mention D, I'm also interested in general answers that aren't specific to D, since D's implementation of thread-local storage will likely improve if it is slower than the best implementations.
The speed depends on the TLS implementation.
Yes, you are correct that TLS can be as fast as a pointer lookup. It can even be faster on systems with a memory management unit.
For the pointer lookup you need help from the scheduler though. The scheduler must - on a task switch - update the pointer to the TLS data.
Another fast way to implement TLS is via the Memory Management Unit. Here the TLS is treated like any other data with the exception that TLS variables are allocated in a special segment. The scheduler will - on task switch - map the correct chunk of memory into the address space of the task.
If the scheduler does not support any of these methods, the compiler/library has to do the following (see the sketch below):
Get the current ThreadId
Take a semaphore
Look up the pointer to the TLS block by ThreadId (may use a map or so)
Release the semaphore
Return that pointer.
Obviously, doing all this for each TLS data access takes a while and may need up to three OS calls: getting the ThreadId, and taking and releasing the semaphore.
The semaphore is btw required to make sure no thread reads from the TLS pointer list while another thread is in the middle of spawning a new thread (and as such allocating a new TLS block and modifying the data structure).
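A sketch of that slow fallback path (illustrative only, using a mutex as the semaphore and a map from ThreadId to TLS block):
#include <map>
#include <mutex>
#include <thread>

std::mutex tlsLock;                          // the "semaphore"
std::map<std::thread::id, void*> tlsBlocks;  // ThreadId -> TLS block

void* getTlsBlock()
{
    std::thread::id tid = std::this_thread::get_id(); // 1. get current ThreadId
    std::lock_guard<std::mutex> guard(tlsLock);       // 2. take the semaphore
    void* block = tlsBlocks[tid];                     // 3. look up the pointer
    return block;                                     // 4./5. release and return
}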
Unfortunately, it's not uncommon to see such slow TLS implementations in practice.
Thread locals in D are really fast. Here are my tests.
64 bit Ubuntu, core i5, dmd v2.052
Compiler options: dmd -O -release -inline -m64
// this loop takes 0m0.630s
void main(){
int a; // register allocated
for( int i=1000*1000*1000; i>0; i-- ){
a+=9;
}
}
// this loop takes 0m1.875s
int a; // thread local in D, not static
void main(){
for( int i=1000*1000*1000; i>0; i-- ){
a+=9;
}
}
So we lose only 1.2 seconds of one of the CPU's cores per 1000*1000*1000 thread-local accesses.
Thread locals are accessed using the %fs register - so there are only a couple of processor instructions involved:
Disassembling with objdump -d:
- this is the local variable in the %ecx register (loop counter in %eax):
8: 31 c9 xor %ecx,%ecx
a: b8 00 ca 9a 3b mov $0x3b9aca00,%eax
f: 83 c1 09 add $0x9,%ecx
12: ff c8 dec %eax
14: 85 c0 test %eax,%eax
16: 75 f7 jne f <_Dmain+0xf>
- this is the thread local; the %fs register is used for indirection, %edx is the loop counter:
6: ba 00 ca 9a 3b mov $0x3b9aca00,%edx
b: 64 48 8b 04 25 00 00 mov %fs:0x0,%rax
12: 00 00
14: 48 8b 0d 00 00 00 00 mov 0x0(%rip),%rcx # 1b <_Dmain+0x1b>
1b: 83 04 08 09 addl $0x9,(%rax,%rcx,1)
1f: ff ca dec %edx
21: 85 d2 test %edx,%edx
23: 75 e6 jne b <_Dmain+0xb>
Maybe the compiler could be even more clever and cache the thread local in a register before the loop
and write it back to the thread local at the end (it would be interesting to compare with the gdc compiler),
but even now matters are very good IMHO.
One needs to be very careful in interpreting benchmark results. For example, a recent thread in the D newsgroups concluded from a benchmark that dmd's code generation was causing a major slowdown in a loop that did arithmetic, but in actuality the time spent was dominated by the runtime helper function that did long division. The compiler's code generation had nothing to do with the slowdown.
To see what kind of code is generated for TLS, compile and obj2asm this code:
__thread int x;
int foo() { return x; }
TLS is implemented very differently on Windows than on Linux, and will be very different again on OSX. But in all cases, it will be many more instructions than a simple load of a static memory location. TLS is always going to be slow relative to simple access. Accessing TLS globals in a tight loop is going to be slow, too. Try caching the TLS value in a temporary instead, as in the sketch below.
I wrote some thread pool allocation code years ago, and cached the TLS handle to the pool, which worked well.
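A sketch of that caching pattern (illustrative, using C++ thread_local to mirror the D loop above):
thread_local int counter;          // one TLS slot

void hotLoop(int n)
{
    int local = counter;           // single TLS read before the loop
    for (int i = 0; i < n; ++i)
        local += 9;                // register-only arithmetic in the hot path
    counter = local;               // single TLS write after the loop
}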
I've designed multi-taskers for embedded systems, and conceptually the key requirement for thread-local storage is having the context switch method save/restore a pointer to thread-local storage along with the CPU registers and whatever else it's saving/restoring. For embedded systems which will always be running the same set of code once they've started up, it's easiest to simply save/restore one pointer, which points to a fixed-format block for each thread. Nice, clean, easy, and efficient.
Such an approach works well if one doesn't mind having space for every thread-local variable allocated within every thread--even those that never actually use it--and if everything that's going to be within the thread-local storage block can be defined as a single struct. In that scenario, accesses to thread-local variables can be almost as fast as access to other variables, the only difference being an extra pointer dereference. Unfortunately, many PC applications require something more complicated.
On some frameworks for the PC, a thread will only have space allocated for thread-static variables if a module that uses those variables has been run on that thread. While this can sometimes be advantageous, it means that different threads will often have their local storage laid out differently. Consequently, it may be necessary for the threads to have some sort of searchable index of where their variables are located, and to direct all accesses to those variables through that index.
I would expect that if the framework allocates a small amount of fixed-format storage, it may be helpful to keep a cache of the last 1-3 thread-local variables accessed, since in many scenarios even a single-item cache could offer a pretty high hit rate.
If you can't use compiler TLS support, you can manage TLS yourself.
I built a wrapper template for C++, so it is easy to swap out the underlying implementation.
In this example, I've implemented it for Win32.
Note: Since you cannot obtain an unlimited number of TLS indices per process (at least under Win32),
you should point to heap blocks large enough to hold all thread-specific data.
This way you have a minimum number of TLS indices and related queries.
In the "best case", you'd have just 1 TLS pointer pointing to one private heap block per thread.
In a nutshell: don't point to single objects; instead, point to thread-specific heap memory/containers holding object pointers, to achieve better performance.
Don't forget to free memory if it isn't used again.
I do this by wrapping a thread in a class (like Java does) and handling TLS in the constructor and destructor.
Furthermore, I store frequently used data like thread handles and IDs as class members.
usage:
for type*:
tl_ptr<type>
for const type*:
tl_ptr<const type>
for type* const:
const tl_ptr<type>
const type* const:
const tl_ptr<const type>
#include <windows.h>  // TlsAlloc, TlsSetValue, TlsGetValue, TlsFree
#include <cassert>

template<typename T>
class tl_ptr {
protected:
DWORD index;
public:
tl_ptr(void) : index(TlsAlloc()){
assert(index != TLS_OUT_OF_INDEXES);
set(NULL);
}
void set(T* ptr){
TlsSetValue(index,(LPVOID) ptr);
}
T* get(void)const {
return (T*) TlsGetValue(index);
}
tl_ptr& operator=(T* ptr){
set(ptr);
return *this;
}
tl_ptr& operator=(const tl_ptr& other){
set(other.get());
return *this;
}
T& operator*(void)const{
return *get();
}
T* operator->(void)const{
return get();
}
~tl_ptr(){
TlsFree(index);
}
};
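A hypothetical usage example along the lines the answer describes (one heap block per thread holding all the thread-specific data; names are illustrative):
struct ThreadData {            // all thread-specific state in one heap block
    int  requestCount;
    char scratch[256];
};

tl_ptr<ThreadData> g_tls;      // a single TLS index shared by all threads

void threadEntry()
{
    g_tls = new ThreadData();  // constructor-time setup, one block per thread
    g_tls->requestCount = 0;
    // ... do the thread's work ...
    delete g_tls.get();        // destructor-time cleanup of this thread's block
}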
We have seen similar performance issues from TLS (on Windows). We rely on it for certain critical operations inside our product's "kernel". After some effort I decided to try to improve on this.
I'm pleased to say that we now have a small API that offers a > 50% reduction in CPU time for an equivalent operation when the calling thread doesn't "know" its thread-id, and a > 65% reduction if the calling thread has already obtained its thread-id (perhaps for some other earlier processing step).
The new function ( get_thread_private_ptr() ) always returns a pointer to a struct we use internally to hold all sorts of things, so we only need one per thread.
All in all, I think the Win32 TLS support is poorly crafted, really.
