Dalvik bytecode instrumentation - register type merging

I am doing some sort of dalvik bytecode instrumentation using dexlib2.
However, there are a couple of remaining issues.
The register type merging that seems to happen after goto instructions
and catch blocks, more precisely at the corresponding label, somehow
derives an unexpected register type, which in turn breaks the instrumented code.
The instructions that get inserted look as follows:
move(-wide,-object,/16,/from16) vNew, v0
const-string v0, "some string"
invoke-static {v0}, LPathToSomeClass;->SomeMethod(Ljava/lang/String;)V
move(..) v0, vNew
So, v0 is used to hold a parameter for the static method call, while
vNew is a new (local) register used to save and later restore the original content of v0. The register type of v0 is determined in advance in order to select the right
move instruction, i.e. move-wide, move or move-object. However, when this code is inserted inside a try block, the instrumentation breaks. The output of
baksmali (baksmali d -b "" --register-info ALL,FULLMERGE --offsets)
reveals that the type of v0 after the const-string instruction (which is Reference,Ljava/lang/String;) is considered as input for the merging procedure happening, for instance, at the corresponding catch-block label. Assuming that the type before the inserted code was Reference,[I (int array), the resulting
type is now Reference,Ljava/lang/Object; (which produces a verification error), although the final move instruction restores the original register type.
Now to my questions:
1) When is this merging actually happening?
2) Why is the merging procedure considering the type of v0 after the const-string instruction? Is it considering every instruction modifying the type of any register?
3) Is this problem only related to try-catch blocks?
4) What are the restrictions for try-catch blocks in this matter?
5) Is there any solution to this problem apart from moving each piece of code to inject into its own parameterless method? For instance, is it possible to use an additional register to solve this problem?
6) Can I detect with dexlib2 try-catch blocks and determine the set of instructions they include?
7) Are there any notes/literature discussing this problem, e.g. the merging procedure, and related technicalities, e.g. further limitations/restrictions
for the instrumentation?
I highly appreciate any help in this matter. Thanks in advance!

When merging registers at the start of a catch block, there is an incoming edge from every instruction in the try block that can throw. Only certain instructions can throw - as determined by the CAN_THROW opcode flag.
In your particular example, the invoke-static instruction after the const-string instruction can throw, and so there's an edge from just before that instruction to the start of the catch block.
If you take a step back, execution can jump from any instruction in the try block that can throw to the start of the catch block. And so the code in the catch block must be prepared for the registers to be in a state that is consistent with the register contents just before any of those instructions that can throw.
So, for example, if there is one possible "jump" from the try block to the catch block where the register contains a primitive int, and another possible jump where it contains an object, that register is considered "conflicted", because the register may contain either type at that point in the code, and the two types are not compatible with each other. E.g. a primitive int can never be passed to something expecting a reference type and vice versa, and there's no mechanism in the bytecode for checking a register's type at runtime.
One possible solution might be to split the try block at the point where you insert your instrumentation, so that the instrumentation itself is not covered by a try block, but both "sides" of the original code are. And keep in mind that in the bytecode, the same catch block can be used by multiple try blocks, so you can split the original try block into two and have both reference the original catch block.
Otherwise, you'll just have to figure out some way to manage the registers appropriately to avoid this problem.
As for 6), see MethodImplementation.getTryBlocks(), which will give you a list of the try blocks in that method. Each try block specifies where it starts, how many code units it covers, and all of the catch blocks associated with it (different catch blocks for different exceptions).

Related

Safely zeroing buffers after working with crypto/*

Is there a way to zero buffers containing e.g. private keys after
using them and make sure that compilers don't delete the zeroing code as
unused? Something tells me that a simple
copy(privateKey, make([]byte, keySize))
is not guaranteed to stay there.
Sounds like you want to prevent sensitive data from remaining in memory. But have you considered that the data might have been replicated, or swapped to disk?
For these reasons I use the https://github.com/awnumar/memguard package.
It provides features to destroy the data when no longer required, while keeping it safe in the meantime.
You can read about its background here: https://spacetime.dev/memory-security-go
How about checking (some of) the content of the buffer after zeroing it and passing it to another function? For example:
copy(privateKey, make([]byte, keySize))
if privateKey[0] != 0 {
    // If you pass the buffer to another function,
    // this check and above copy() can't be optimized away:
    fmt.Println("Zeroing failed", privateKey[0])
}
To be absolutely safe, you could XOR the passed buffer content with random bytes, but if / since the zeroing is not optimized away, the if body is never reached.
You might think a very intelligent compiler could deduce that the above copy() zeros privateKey[0], determine that the condition is always false, and still optimize the check away (although this is very unlikely). The solution is not to use make([]byte, keySize) as the source but e.g. a slice coming from a global variable or a function argument (whose value can only be determined at runtime), so the compiler cannot deduce at compile time that the condition will always be false.

What happens when a declarator (my/state) is in a for block?

The following blocks run a loop assigning the topic to a variable $var:
In the first, the my $var; is outside the loop.
In the second, the my $var; is inside the loop.
In the third, a state $var; is inside the loop.
my $limit = 10_000_000;
{
    my $var;
    for ^$limit { $var = $_; }
    say now - ENTER now;
}
{
    for ^$limit { my $var; $var = $_; }
    say now - ENTER now;
}
{
    for ^$limit { state $var; $var = $_; }
    say now - ENTER now;
}
Sample durations (in seconds) for each block are as follows:
0.5938845
1.8251226
2.60700803
The docs at https://docs.perl6.org/syntax/state mention that state variables have the same lexical scoping as my. Functionally, code block 1 and block 3 should achieve the same persistent storage across multiple calls to the respective loop block.
Why does the state (and the inner my) version take so much more time? What else is it doing?
Edit:
Similar to #HåkonHægland's comment, if I cut and paste the above code so that each block runs three times in total, the timing changes significantly for the my $var outside the loop (the first case):
0.600303
1.7917011
2.6640811
1.67793597
1.79197091
2.6816156
1.795679
1.81233942
2.77486777
Short version: in a world without any runtime optimization (type specialization, JIT, and so forth), the timings would match your expectations. The timings here are influenced by how well the optimizer deals with each example.
First of all, it's interesting to run the code without any kind of runtime optimization. In my (rather slow) VM on the box I'm currently on, sticking MVM_SPESH_DISABLE=1 into the environment results in these timings:
13.92366942
16.235372
14.4329288
These make some kind of intuitive sense:
In the first case, we have a simple lexical variable declared in the outer scope of the block
In the second case, we have to allocate, and then garbage collect, an extra Scalar every time around the loop, which accounts for the extra time
In the third case, we're using the state variable. A state variable is stored in the code object of the closure, and then copied into the call frame at entry time. That's cheaper than allocating a new Scalar every time, but still a little bit more work than not having to do that operation at all.
Next, let's run 3 programs with the optimizer enabled, each example in its own isolated program.
The first comes out at 0.86298831, a factor of 16 faster. Go optimizer! It has inlined the loop body.
The second comes out at 1.2288566, a factor of 13 faster. Not too shabby either. It has again inlined the loop body. (This case will also become rather cheaper in the future, once the escape analyzer is smart enough to eliminate the Scalar allocation.)
The third comes out at 2.0695035, a factor of 7 faster. That's comparatively unimpressive (even if still quite an improvement), and the major reason is that it has not inlined the loop body. Why? Because it doesn't know how to inline code that uses state variables yet. (How to see this: run with MVM_SPESH_INLINE_LOG=1 in the environment, and among the output is: Can NOT inline (1) with bytecode size 78 into (3): cannot inline code that declares a state variable.)
In short, the dominating factor here is the inlining of the loop body, and with state variables that is presently not possible.
It's not immediately clear why the optimizer does worse at the case with the outer declaration of $var when that isn't the first loop in the program; that feels more like a bug than a reasonable case of "this feature isn't optimized well yet". In its slight defense, it still consistently manages to deliver a big improvement, even when not so big as might be desired!

Confusion about C++11 lock free stack push() function

I'm reading C++ Concurrency in Action by Anthony Williams, and don't understand its push implementation of the lock_free_stack class. Listing 7.12 to be precise
void push(T const& data)
{
    counted_node_ptr new_node;
    new_node.ptr = new node(data);
    new_node.external_count = 1;
    new_node.ptr->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(new_node.ptr->next, new_node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed));
}
So imagine 2 threads (A, B) calling the push function. Both of them reach the while loop but have not yet started it. So they both read the same value from head.load(std::memory_order_relaxed).
Then we have the following things going on:
B thread gets swiped out for any reason
A thread starts the loop and obviously successfully adds a new node to the stack.
B thread gets back on track and also starts the loop.
And this is where it gets interesting as it seems to me.
Because the load operation uses std::memory_order_relaxed and compare_exchange_weak(...) uses std::memory_order_release in case of success, it looks like there is no synchronization between the threads whatsoever.
I mean it's like std::memory_order_relaxed - std::memory_order_release and not std::memory_order_acquire - std::memory_order_release.
So it seems thread B would simply add its new node on top of the stale head it read earlier (the state before thread A's push), resetting head and losing thread A's node.
I was doing my research all around this subject and the best I could find was in this post: Does exchange or compare_and_exchange reads last value in modification order?
So the question is: is it true that all RMW operations see the last value in the modification order? And no matter which std::memory_order we use, will an RMW operation synchronize with all threads (CPUs etc.) and operate on the latest value written to the atomic variable it is called upon?
So after some research and asking a bunch of people I believe I found the proper answer to this question, I hope it'll be a help to someone.
So the question is: is it true that all RMW operations see the last value in the modification order?
Yes, it is true.
No matter which std::memory_order we use, will an RMW operation synchronize with all threads (CPUs etc.) and operate on the latest value written to the atomic variable it is called upon?
Yes, it is also true, however there is something that needs to be highlighted.
An RMW operation will synchronize only the atomic variable it works with. In our case, that is head.
Perhaps you would like to ask why we need release-acquire semantics at all if an RMW operation does the synchronization even with relaxed memory order.
The answer is that the RMW operation only synchronizes the variable it works on; other operations which occurred before the RMW might not be visible to the other thread.
Let's look at the push function again:
void push(T const& data)
{
    counted_node_ptr new_node;
    new_node.ptr = new node(data);
    new_node.external_count = 1;
    new_node.ptr->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(new_node.ptr->next, new_node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed));
}
In this example, with two threads pushing, the threads won't be fully synchronized with each other, but that is acceptable here.
Both threads will always see the newest head, because compare_exchange_weak
provides this. And a new node will always be added to the top of the stack.
However, if we tried to read the value behind the pointer, i.e. *(new_node.ptr->next), after the line new_node.ptr->next = head.load(std::memory_order_relaxed), things could easily turn ugly: we might dereference a pointer to a node that has not been fully initialized yet from our thread's point of view.
This can happen because the processor (or compiler) may reorder instructions, and since there is no synchronization between the threads, the second thread could see the pointer to the top node even before that node was initialized!
And this is exactly where release-acquire semantics come to the rescue. They ensure that all operations which happened before the release operation will be visible to the thread that performs the matching acquire!
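To make that pairing concrete, here is a minimal sketch (my own, not the book's reference-counted lock_free_stack): a producer publishes a node with a release store on head, and a consumer uses an acquire load before dereferencing what it sees.

#include <atomic>
#include <cassert>

struct node {
    int value;
    node* next;
};

std::atomic<node*> head{nullptr};

void producer() {
    node* n = new node{42, nullptr};
    n->next = head.load(std::memory_order_relaxed);
    // release: everything written to *n above becomes visible to any
    // thread whose acquire load observes this new head value
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed))
        ;
}

void consumer() {
    // acquire: if we see the node the producer published, we are also
    // guaranteed to see its fully initialized contents
    node* n = head.load(std::memory_order_acquire);
    if (n)
        assert(n->value == 42);
}

If the consumer used a relaxed load instead, reading n->value would not be ordered after the producer's initialization of the node, which is exactly the danger described above.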
Check out and compare listings 5.5 and 5.8 in the book.
I also recommend reading this article about how processors work; it provides some useful background for better understanding.
memory barriers

When should the Win32 InterlockedExchange function be used?

I came across the function InterlockedExchange and was wondering when I should use it. As I understand it, setting a 32-bit value on an x86 processor should always be atomic anyway, shouldn't it?
In the case where I want to use the function, the new value does not depend on the old value (it is not an increment operation).
Could you provide an example where this method is mandatory? (I'm not looking for InterlockedCompareExchange.)
InterlockedExchange is both a write and a read -- it returns the previous value.
This is necessary to ensure another thread didn't write a different value just after you did. For example, say you're trying to increment a variable. You can read the value, add 1, then set the new value with InterlockedExchange. The value returned by InterlockedExchange must match the value you originally read, otherwise another thread probably incremented it at the same time, and you need to loop around and try again.
As well as writing the new value, InterlockedExchange also reads and returns the previous value; this whole operation is atomic. This is useful for lock-free algorithms.
(Incidentally, 32-bit writes are not guaranteed to be atomic. Consider the case where the write is unaligned and straddles a cache-line boundary, for instance.)
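One classic case where InterlockedExchange is the right tool even though the new value does not depend on the old one (exactly the situation in the question) is a simple test-and-set spin lock; the returned previous value tells you whether the lock was already held. A rough sketch (the names are mine, not from any particular API):

#include <windows.h>

volatile LONG lockFlag = 0;   // 0 = free, 1 = held

void AcquireSpinLock()
{
    // Atomically write 1 and look at what was there before. If the
    // previous value was already 1, another thread holds the lock and
    // we must retry; a plain "lockFlag = 1" could never tell us that.
    while (InterlockedExchange(&lockFlag, 1) != 0)
    {
        // spin (a real implementation would yield or pause here)
    }
}

void ReleaseSpinLock()
{
    InterlockedExchange(&lockFlag, 0);
}

The point is that the write and the read of the previous value happen as one indivisible step; doing them as two separate operations would let two threads both believe they acquired the lock.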
In a multi-processor or multi-core machine, each core has its own cache - so each core may have its own, potentially different, "view" of what the content of the system memory is.
Thread synchronization mechanisms take care of synchronizing memory between cores; for more information look at http://blogs.msdn.com/oldnewthing/archive/2008/10/03/8969397.aspx or google for acquire and release semantics.
Setting a 32-bit value is atomic, but only if you're setting a literal.
b = a is 2 operations:
mov eax,dword ptr [a]
mov dword ptr [b],eax
Theoretically there could be some interruption between the first and second operation.
Writing a value is not guaranteed to be atomic by default. When you write a value to a variable, several machine instructions may be generated. With modern, preemptive OSes, the OS might switch to another thread between the individual operations of the write.
This is even more a problem on multi-processor machines, where several threads could be executing at the same time, and trying to write to a single memory location simultaneously.
Interlocked operations avoid this by using specialized instructions to make the write (x86 has dedicated instructions for this kind of situation), which do the read-modify-write in one instruction. These instructions also lock the memory bus of all processors, to ensure that no other executing thread could be writing to the value at the same time.
InterlockedExchange makes sure that the change of a variable and the return of its original value are not interrupted by other threads.
So, if 'i' is an int, these calls (taken individually) do not need InterlockedExchange around 'i':
a = i;
i = 9;
i = a;
i = a + 9;
a = i + 9;
if(0 == i)
None of these statements rely upon BOTH the initial AND final values of 'i'. But these following calls DO need InterlockedExchange around 'i':
a = i++; //a = InterlockedExchange(&i, i + 1);
Without it, two threads running through this same code might get the same value of 'i' assigned to 'a' or 'a' may unexpectedly skip two or more numbers.
if(0 == i++) //if(0 == InterlockedExchange(&i, i + 1))
Two threads may both execute the code that is only supposed to happen once.
etc.
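As a concrete sketch of the "only supposed to happen once" case (DoExpensiveInitialization is a hypothetical placeholder):

#include <windows.h>

void DoExpensiveInitialization();   // hypothetical one-time setup

volatile LONG initialized = 0;

void InitOnce()
{
    // Whichever thread swaps the flag from 0 to 1 first sees the old
    // value 0 and wins; every later thread sees 1 and skips the block.
    if (InterlockedExchange(&initialized, 1) == 0)
    {
        DoExpensiveInitialization();
    }
}

Note that threads which lose the race do not wait for the initialization to finish; if they must, a one-time initialization primitive such as InitOnceExecuteOnce is the better fit.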
Wow, so many conflicting answers. It's hard to sift through who's right, who's wrong, and what information is misleading.
I'm unsure of the answer too, given the above half-answers, but I think it works like this. I may be wrong, and it will be interesting to find out if I am:
32-bit read & writes ARE atomic, but depending on your code, that may not mean much.
don't worry about non-aligned read/writes. ALL 32-bit writes to a 32-bit variable have to be aligned or the machine page-faults.
don't worry about a write wrapping around the end of a cached page, that can't happen.
If you need to write-then-read on one thread, and you're writing on another thread, then you need to use InterlockedExchange. If you're simply reading the value on one thread, and writing it on another, then you don't need to use it, but those values may be wiggly because of multithreading.

Is it ok to have multiple threads writing the same values to the same variables?

I understand about race conditions and how with multiple threads accessing the same variable, updates made by one can be ignored and overwritten by others, but what if each thread is writing the same value (not different values) to the same variable; can even this cause problems? Could this code:
GlobalVar.property = 11;
(assuming that property will never be assigned anything other than 11), cause problems if multiple threads execute it at the same time?
The problem comes when you read that state back, and do something about it. Writing is a red herring - it is true that as long as this is a single word, most environments guarantee the write will be atomic, but that doesn't mean that a larger piece of code that includes this fragment is thread-safe. First, presumably your global variable contained a different value to begin with - otherwise, if you know it's always the same, why is it a variable? Second, presumably you eventually read this value back again?
The issue is that presumably, you are writing to this bit of shared state for a reason - to signal that something has occurred? This is where it falls down: when you have no locking constructs, there is no implied order of memory accesses at all. It's hard to point to what's wrong here because your example doesn't actually contain the use of the variable, so here's a trivialish example in neutral C-like syntax:
int x = 0, y = 0;

//thread A does:
x = 1;
y = 2;
if (y == 2)
    print(x);

//thread B does, at the same time:
if (y == 2)
    print(x);
Thread A will always print 1, but it's completely valid for thread B to print 0. The order of operations in thread A is only required to be observable from code executing in thread A - thread B is allowed to see any combination of the state. The writes to x and y may not actually happen in order.
This can happen even on single-processor systems, where most people do not expect this kind of reordering - your compiler may reorder it for you. On SMP even if the compiler doesn't reorder things, the memory writes may be reordered between the caches of the separate processors.
If that doesn't seem to answer it for you, include more detail of your example in the question. Without the use of the variable it's impossible to definitively say whether such a usage is safe or not.
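For contrast, here is a sketch of the same x/y example with the signal made explicit through a C++11 atomic; with a release store and acquire loads on y, a thread that observes y == 2 is also guaranteed to observe x == 1 (the names follow the pseudocode above):

#include <atomic>
#include <cstdio>

int x = 0;
std::atomic<int> y{0};

// thread A does:
void thread_a() {
    x = 1;
    y.store(2, std::memory_order_release);   // publishes the write to x
    if (y.load(std::memory_order_acquire) == 2)
        std::printf("%d\n", x);               // always prints 1
}

// thread B does, at the same time:
void thread_b() {
    if (y.load(std::memory_order_acquire) == 2)
        std::printf("%d\n", x);               // now also guaranteed to print 1
}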
It depends on the work actually done by that statement. There can still be some cases where Something Bad happens - for example, if a C++ class has overloaded the = operator, and does anything nontrivial within that statement.
I have accidentally written code that did something like this with POD types (builtin primitive types), and it worked fine -- however, it's definitely not good practice, and I'm not confident that it's dependable.
Why not just lock the memory around this variable when you use it? In fact, if you somehow "know" this is the only write statement that can occur at some point in your code, why not just use the value 11 directly, instead of writing it to a shared variable?
(edit: I guess it's better to use a constant name instead of the magic number 11 directly in the code, btw.)
If you're using this to figure out when at least one thread has reached this statement, you could use a semaphore that starts at 1, and is decremented by the first thread that hits it.
I would expect the result to be undetermined. As in, it would vary from compiler to compiler, language to language, OS to OS, etc. So no, it is not safe.
Why would you want to do this though - adding a line to obtain a mutex lock is only one or two lines of code (in most languages), and would remove any possibility of a problem. If this is going to be too expensive then you need to find an alternative way of solving the problem.
In general, this is not considered a safe thing to do unless your system provides atomic operations (operations that are guaranteed to execute as a single, uninterruptible step).
The reason is that while the "C" statement looks simple, often there are a number of underlying assembly operations taking place.
Depending on your OS, there are a few things you could do:
Take a mutual exclusion semaphore (mutex) to protect access (see the sketch after this list)
In some OSes, you can temporarily disable preemption, which guarantees your thread will not be swapped out.
Some OSes provide reader/writer semaphores, which are more performant than a plain mutex.
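For the first option, a minimal sketch in C++ (the names are placeholders, not from the question):

#include <mutex>

std::mutex propertyMutex;
int sharedProperty = 0;

void setProperty()
{
    // hold the mutex for the duration of the write
    std::lock_guard<std::mutex> lock(propertyMutex);
    sharedProperty = 11;
}

int getProperty()
{
    // readers take the same mutex, so they always see a consistent value
    std::lock_guard<std::mutex> lock(propertyMutex);
    return sharedProperty;
}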
Here's my take on the question.
You have two or more threads running that write to a variable - like a status flag or something - where you only want to know if one or more of them was true. Then in another part of the code (after the threads complete), you want to check and see if at least one thread set that status... for example:
bool flag = false
threadContainer tc
threadInputs inputs

check(input)
{
    ...do stuff to input
    if(success)
        flag = true
}

start multiple threads
foreach(i in inputs)
    t = startthread(check, i)
    tc.add(t) // Keep track of all the threads started

foreach(t in tc)
    t.join( ) // Wait until each thread is done

if(flag)
    print "One of the threads was successful"
else
    print "None of the threads were successful"
I believe the above code would be OK, assuming you're fine with not knowing which thread set the status to true, and you can wait for all the multi-threaded stuff to finish before reading that flag. I could be wrong though.
If the operation is atomic, you should be able to get by just fine. But I wouldn't do that in practice. It is better just to acquire a lock on the object and write the value.
Assuming that property will never be assigned anything other than 11, I don't see a reason for the assignment in the first place. Just make it a constant then.
Assignment only makes sense when you intend to change the value, unless the act of assignment itself has other side effects - like volatile writes have memory-visibility side effects in Java. And if you change state shared between multiple threads, then you need to synchronize or otherwise "handle" the problem of concurrency.
When you assign a value, without proper synchronization, to some state shared between multiple threads, then there are no guarantees for when the other threads will see that change. And no visibility guarantees means that it is possible the other threads will never see the assignment.
Compilers, JITs, CPU caches - they're all trying to make your code run as fast as possible, and if you don't state any explicit requirements for memory visibility, they will take advantage of that. If not on your machine, then on somebody else's.
