This is a sentence from NLog's documentation:
"Block - The application-thread will block until the background-writer-thread has taken the next batch. Avoids loosing important logevents, but can block all application-threads."
Who determines which log events are important? If it's based on log level, which log levels count as important?
I am doing some Dalvik bytecode instrumentation using dexlib2.
However, a couple of issues remain.
The register type merging that seems to happen after goto instructions
and at catch blocks, more precisely at the corresponding labels, somehow
derives an unexpected register type, which in turn breaks the instrumented code.
The instructions that get inserted look as follows:
move(-wide,-object,/16,/from16) vNew, v0
const-string v0, "some string"
invoke-static {v0}, LPathToSomeClass;->SomeMethod(Ljava/lang/String;)V
move(..) v0, vNew
So, v0 is used to hold a parameter for the static method call, while vNew is a new (local) register used to save and restore the original content of v0. The register type of v0 is determined in advance in order to select the right move instruction, i.e. move-wide, move or move-object. However, when this code is inserted inside a try block, the instrumentation breaks. The output of baksmali (baksmali d -b "" --register-info ALL,FULLMERGE --offsets) reveals that the type of v0 after the const-string instruction (which is Reference,Ljava/lang/String;) is considered as input for the merging procedure that happens, for instance, at the corresponding catch-block label. Assuming the type before the inserted code was Reference,[I (int array), the resulting type is now Reference,Ljava/lang/Object; (which produces a verification error), although the final move instruction restores the original register type.
Now to my questions:
1) When does this merging actually happen?
2) Why is the merging procedure considering the type of v0 after the const-string instruction? Is it considering every instruction modifying the type of any register?
3) Is this problem only related to try-catch blocks?
4) What are the restrictions for try-catch blocks in this matter?
5) Is there any solution to this problem apart from constructing a separate, parameterless method for each piece of code to inject? In particular, is it possible to use an additional register to solve this problem?
6) Can I detect try-catch blocks with dexlib2 and determine the set of instructions they cover?
7) Are there any notes/literature discussing this problem, e.g. the merging procedure, and related technicalities, e.g. further limitations/restrictions
for the instrumentation?
I highly appreciate any help in this matter. Thanks in advance!
When merging registers at the start of a catch block, there is an incoming edge from every instruction in the try block that can throw. Only certain instructions can throw - as determined by the CAN_THROW opcode flag.
In your particular example, the invoke-static instruction after the const-string instruction can throw, and so there's an edge from just before that instruction to the start of the catch block.
If you take a step back, execution can jump from any instruction in the try block that can throw to the start of the catch block. And so the code in the catch block must be prepared for the registers to be in a state that is consistent with the register contents just before any of those instructions that can throw.
So, for example, if there is one possible "jump" from the try block to the catch block where the register contains a primitive int type, and another possible jump where it contains an object, that register is considered "conflicted", because the register may contain either type at that point in the code, and the two types are not compatible with each other. E.g. a primitive int can never be passed to something expecting a reference type and vice versa. And there's no mechanism in the bytecode for static register type checking.
One possible solution might be to split the try block at the point where you insert your instrumentation, so that the instrumentation itself is not covered by a try block, but both "sides" of the original code are. And keep in mind that in the bytecode, the same catch block can be used by multiple try blocks, so you can split the original try block into two, and have both reference the original catch block.
Otherwise, you'll just have to figure out some way to manage the registers appropriately to avoid this problem.
As for 6), see MethodImplementation.getTryBlocks(), which will give you a list of try blocks in that method. Each try block specifies where it starts, how many instructions it covers, and all of the catch blocks associated with it (different catch blocks for different exceptions).
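For what it's worth, here is a minimal sketch of that enumeration. The helper class and method names are mine, not part of dexlib2; only getTryBlocks(), getStartCodeAddress(), getCodeUnitCount(), getExceptionHandlers(), getExceptionType() and getHandlerCodeAddress() are actual dexlib2 API:

import org.jf.dexlib2.iface.ExceptionHandler;
import org.jf.dexlib2.iface.Method;
import org.jf.dexlib2.iface.MethodImplementation;
import org.jf.dexlib2.iface.TryBlock;

final class TryBlockDumper {
    // Prints every try block of 'method' together with its catch handlers.
    static void dumpTryBlocks(Method method) {
        MethodImplementation impl = method.getImplementation();
        if (impl == null) {
            return; // abstract or native method: nothing to inspect
        }
        for (TryBlock<? extends ExceptionHandler> tryBlock : impl.getTryBlocks()) {
            int start = tryBlock.getStartCodeAddress();     // first covered code unit
            int end = start + tryBlock.getCodeUnitCount();  // first code unit after the try block
            for (ExceptionHandler handler : tryBlock.getExceptionHandlers()) {
                // getExceptionType() returns null for a catch-all handler
                System.out.println("try [" + start + ", " + end + ") catches "
                        + handler.getExceptionType()
                        + " at code address " + handler.getHandlerCodeAddress());
            }
        }
    }
}

The addresses are in 16-bit code units, so to map a try block onto concrete instructions you can walk impl.getInstructions() and accumulate each instruction's getCodeUnits() until you reach the covered range.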
I've been reading, and it seems that std::atomic doesn't support a compare-and-swap of the less-than/greater-than variant.
I'm using OpenMP and need to safely update a global minimum value.
I was thinking this would be as easy as using a built-in API.
But alas, it isn't, so instead I'm trying to come up with my own implementation.
My primary concern is that I don't want to use an omp critical section for a less-than comparison every single time, because it may incur significant synchronization overhead for very little gain in most cases.
But in those cases where a new global minimum is potentially found (less often), the synchronization overhead is acceptable. I'm thinking I can implement it using the following method; I'm hoping for someone to advise.
Use a std::atomic_uint as the global minimum.
Atomically read its value into a thread-local variable.
Compare the thread's candidate against that value and, if the candidate is less, attempt to enter a critical section.
Once synchronized, verify that the candidate is still less than the atomic value and update accordingly (the body of the critical section should be cheap, just updating a few values).
This is for a homework assignment, so I'm trying to keep the implementation my own. Please don't recommend various libraries to accomplish this. But please do comment on the synchronization overhead that this operation can incur, or, if the approach is bad, elaborate on why. Thanks.
What you're looking for would be called fetch_min() if it existed: fetch old value and update the value in memory to min(current, new), exactly like fetch_add but with min().
This operation is not directly supported in hardware on x86, but machines with LL/SC could emit slightly more efficient asm for it than they would get from emulating it with a CAS(old, min(old,new)) retry loop.
You can emulate any atomic operation with a CAS retry loop. In practice it usually doesn't have to retry, because the CPU that succeeded at doing a load usually also succeeds at CAS a few cycles later after computing whatever with the load result, so it's efficient.
See Atomic double floating point or SSE/AVX vector load/store on x86_64 for an example of creating a fetch_add for atomic<double> with a CAS retry loop, in terms of compare_exchange_weak and plain + for double. Do that with min and you're all set.
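To make the shape of that retry loop concrete, here is a small sketch; it uses Java's AtomicInteger purely for illustration (the same loop maps directly onto compare_exchange_weak on a std::atomic<unsigned> in C++):

import java.util.concurrent.atomic.AtomicInteger;

final class AtomicMin {
    // Emulates fetch_min: atomically sets 'target' to min(current, candidate)
    // and returns the previous value, using a CAS retry loop.
    static int fetchMin(AtomicInteger target, int candidate) {
        int current = target.get();
        while (candidate < current) {
            // CAS succeeds only if nobody changed the value since we loaded it;
            // in practice this rarely has to retry.
            if (target.compareAndSet(current, candidate)) {
                return current; // 'current' was the old value we just replaced
            }
            current = target.get(); // reload and re-check after a failed CAS
        }
        return current; // candidate >= current: nothing to update
    }
}

(Java 8's target.getAndAccumulate(candidate, Math::min) does essentially the same thing internally.)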
Re: clarification in comments: I think you're saying you have a global minimum, but when you find a new one, you want to update some associated data, too. Your question is confusing because "compare and swap on less/greater than" doesn't help you with that.
I'd recommend using atomic<unsigned> globmin to track the global minimum, so you can read it to decide whether or not to enter the critical section and update related state that goes with that minimum.
Only ever modify globmin while holding the lock (i.e. inside the critical section). Then you can update it + the associated data. It has to be atomic<> so readers that look at just globmin outside the critical section don't have data race UB. Readers that look at the associated extra data must take the lock that protects it and makes sure that updates of globmin + the extra data happen "atomically", from the perspective of readers that obey the lock.
#include <atomic>
#include <mutex>

static std::atomic<unsigned> globmin;
std::mutex globmin_lock;
static struct Extradata globmin_extra;

void new_min_candidate(unsigned newmin, const struct Extradata &newdata)
{
    // light-weight early-out check to avoid the critical section
    // No ordering requirement as long as globmin is monotonically decreasing with time
    if (newmin < globmin.load(std::memory_order_relaxed))
    {
        // enter a critical section.  Use OpenMP stuff if you want; this is plain ISO C++
        std::lock_guard<std::mutex> lock(globmin_lock);

        // Check globmin again, after we've excluded other threads from modifying it and globmin_extra
        if (newmin < globmin.load(std::memory_order_relaxed)) {
            globmin.store(newmin, std::memory_order_relaxed);
            globmin_extra = newdata;
        }
        // else leave the critical section with no update:
        // another thread raced with us *outside* the critical section

        // release the lock / leave critical section (lock goes out of scope here: RAII)
    }
    // else do nothing
}
std::memory_order_relaxed is sufficient for globmin: there's no ordering required with anything else, just atomicity. We get atomicity / consistency for the associated data from the critical section/lock, not from memory-ordering semantics of loading / storing globmin.
This way the only atomic read-modify-write operation is the locking itself. Everything on globmin is either load or store (much cheaper). The main cost with multiple threads will still be bouncing the cache line around, but once you own a cache line, each atomic RMW is maybe 20x more expensive than a simple store on modern x86 (http://agner.org/optimize/).
With this design, if most candidates aren't lower than globmin, the cache line will stay in the Shared state most of the time, so the globmin.load(std::memory_order_relaxed) outside the critical section can hit in L1D cache. It's just an ordinary load instruction, so it's extremely cheap. (On x86, even seq-cst loads are just ordinary loads (and release loads are just ordinary stores, but seq_cst stores are more expensive). On other architectures where the default ordering is weaker, seq_cst / acquire loads need a barrier.)
As explained in https://martinfowler.com/articles/lmax.html, I would need to process my RingBuffer's events first with an Unmarshaller and then with the Business Logic Processor. Suppose it is configured like this (https://lmax-exchange.github.io/disruptor/docs/com/lmax/disruptor/dsl/Disruptor.html):
Disruptor<MyEvent> disruptor = new Disruptor<MyEvent>(MyEvent.FACTORY, 32, Executors.newCachedThreadPool());
EventHandler<MyEvent> handler1 = new EventHandler<MyEvent>() { ... };
EventHandler<MyEvent> handler2 = new EventHandler<MyEvent>() { ... };
disruptor.handleEventsWith(handler1);
disruptor.after(handler1).handleEventsWith(handler2);
The idea is then that handler1 is the unmarshaller and handler2 consumes the data processed by handler1.
Question: How exactly can I code the "unmarshalling and putting back to the disruptor" part? I found this explanation (https://groups.google.com/forum/#!topic/lmax-disruptor/q6h5HBEBRUk) but I didn't quite understand it. Suppose the event has arrived at handler1's callback
void onEvent(T event, long sequence, boolean endOfBatch)
(javadoc: https://lmax-exchange.github.io/disruptor/docs/com/lmax/disruptor/EventHandler.html)
which unmarshals some data from the event. Now I need to attach the unmarshalled data to the event for handler2, which will be dealing with the unmarshalled object.
What needs to be done to "update" the event? Is modifying the "event" object enough?
The impact of this really depends on your particular scenario, and as always, if you are after low latency, you should try both and benchmark.
The most straightforward approach is to update the 'event' object; however, depending on your particular approach, this might lose a lot of the single-writer benefits of the disruptor. I will explain and offer some options.
Suppose for example you have handler1 and handler2, handler1 is running in thread1 and handler2 is running in thread2. The initial event publisher is on thread0.
Thread0 writes an entry into the buffer at slot 1
Thread1 reads the entry in slot 1 and writes into slot 1
Thread0 writes an entry into the buffer at slot 2
Thread2 reads from slot 1 and writes to output
Thread1 reads the entry in slot 2 and writes into slot 2
Thread2 reads from slot 2 and writes to output
If you think of the physical memory layout, slot1 and slot2 are hopefully next to each other in memory. For example they could be some subset of a byte array. As you can see, you are reading and writing alternatively from different threads (probably different cpu cores) into very adjacent chunks of memory, which can lead to false sharing / cache lines bouncing around. On top of that, your reads and writes through memory are not likely to be linear so you will miss out on some of the benefits of the CPU caches.
Some other options which might be nicer:
Have separate ringbuffers, where the first ringbuffer is raw data, and the second ringbuffer is unmarshalled events. This way the data is sufficiently separated in memory to avoid these costs. However this will have a bandwidth impact.
Have the unmarshaller and the work done directly in the same handler. Depending on the amount of work in your unmarshaller and your handler this might be viable.
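If you do go with the straightforward in-place update, a rough sketch could look like the following; the field and helper names (rawBuffer, unmarshalled, unmarshal(), process()) are made up for illustration and are not part of the Disruptor API:

import com.lmax.disruptor.EventHandler;

// Hypothetical event type: carries both the raw payload and the unmarshalled result.
class MyEvent {
    private byte[] rawBuffer;      // written by the publisher
    private Object unmarshalled;   // written by handler1, read by handler2

    byte[] getRawBuffer() { return rawBuffer; }
    void setRawBuffer(byte[] raw) { rawBuffer = raw; }
    Object getUnmarshalled() { return unmarshalled; }
    void setUnmarshalled(Object value) { unmarshalled = value; }
}

// handler1: unmarshals the raw bytes and stores the result back on the same event.
EventHandler<MyEvent> handler1 = new EventHandler<MyEvent>() {
    public void onEvent(MyEvent event, long sequence, boolean endOfBatch) {
        event.setUnmarshalled(unmarshal(event.getRawBuffer())); // unmarshal() is a hypothetical helper
    }
};

// handler2: because of disruptor.after(handler1).handleEventsWith(handler2),
// it only sees a slot after handler1 has finished with it, so the field is already populated.
EventHandler<MyEvent> handler2 = new EventHandler<MyEvent>() {
    public void onEvent(MyEvent event, long sequence, boolean endOfBatch) {
        process(event.getUnmarshalled()); // process() stands in for the business logic
    }
};

As far as I understand, the sequencing that after(handler1) sets up is what makes handler1's plain field write visible to handler2, so no extra synchronization on the event fields is needed; the cache-line effects described above are the main cost to weigh.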
I'm working with Azure Event Hubs, and initially, when sending data, I had code similar to the below that called EventData.GetBytes to try to calculate the batch size:
EventHubClient client; // initialized before the relevant code
EventData curr = new EventData(data);
//Setting a partition key, and other operations.
long itemLength = curr.GetBytes().LongLength;
client.SendAsync(curr);
Unfortunately I would receive an exception in the SDK code.
The message body cannot be read multiple times. To reuse it store the value after reading.
While removing the ultimately unnecessary call to GetBytes meant that I could send messages, the rationale for this exception to occur is rather puzzling. Calling GetBytes() twice in a row is an easy way to reproduce the same exception, but a single call will mean that the EventData cannot be sent successfully.
It seems likely that a Message is used underneath, and that it is set to throw an exception if read more than once, as Message.GetBody documents; however, there is no documentation to this effect in EventData's methods GetBodyStream, GetBody w/serializer, GetBody, or GetBytes.
I imagine this should either be documented or corrected, since currently it is an unpleasant surprise in a separate thread.
Have you tried using EventData.SerializedSizeInBytes to get the size? That is a much more accurate way to get the size for the batching calculation.
I used to use the following Parasoft rule sets for SCA:
JTest Rule ID and Rule Description:
BD.PB.ARRAY-1 (Avoid accessing arrays out of bounds)
BD.EXCEPT.NP-1 (Avoid Null Pointer Exception)
CDD.DUPC-2 (Avoid code duplication)
MOBILE.J2ME.OOME-3 (Catch 'OutOfMemoryError' for large array allocations)
BD.RES.LEAKS-1 (Ensure resources are deallocated)
BD.PB.ZERO-1 (Avoid division by zero)
UC.UPC-2 (Avoid unused "private" classes or interfaces)
UC.DEAD-1 (Avoid dead stores (variable never used))
OPT.CEL-3 (Do not call methods in loop condition statements)
BD.PB.DEREF-2 (Avoid dereferencing before checking for null)
DotTest ID and Rule Description:
OOM.CYCLO-2 (Follow the limit for Cyclomatic Complexity)
METRICS.CBO-1 (Follow the limit for Coupling between objects)
METRICS.MI-1 (Follow the limit for Maintainability Index (70))
CS.EXCEPT.RETHROW-2 (Avoid clearing stack trace while rethrowing exceptions)
IFD.SRII-4 (Implement IDisposable in types which are using system resources)
IFD.SRIF-1 (Provide finalizers in types which use resources)
CS.CDD.DUPC-2 (Avoid code duplication)
CS.CDD.DUPM-2 (Avoid method duplication)
OPU.IGHWE-1 (Override the GetHashCode method whenever you override the Equals method)
Now I am trying to find out whether the SonarQube rule sets include the above Parasoft rules. But it is hard for me to tell which rule in the SonarQube rule sets is equivalent to each of the above Parasoft rules. Does anyone know?
Thanks
June