Android app: Java / JNI call hooking strategies

My goal is to instrument AOSP in order to dynamically log all Java or JNI calls made by a targeted app, with or without the arguments and return values. I do not want to modify the application, which is why I am looking at modifying the Android source code. I am not very experienced with AOSP and its multitude of libs and frameworks, so I am looking for advice because I don't know where to start. Moreover, because of the potentially huge number of logged lines, the process has to be efficient (i.e. I do not believe that a debugger-like approach, where one must implement a hook class for each hooked method, can work).
What I have understood so far:
With the relatively new ART runtime, the app's DEX code is compiled into a sort of machine-executable code (OAT?), and it is more complex to instrument than it was with Dalvik.
The execution flow: compiled Java bytecode of the app (which depends on the compiled Android API) + libs.so -> DVM -> forked Zygote VM -> execution of the app.
If I try to hook at the root (Android API + libs.so), it will demand a tedious amount of work to hook each call. The ideal would be a spot that all Java calls pass through. Does such a spot even exist with ART?
The AOSP source code is hard to understand because there seems to be no documentation stating the role of each source file in the global architecture. So where is it best to hook the calls?
EDIT(s)
This topic is not well covered, so I'll share what I found for anyone interested.
My research led me to this blog post: http://blog.csdn.net/l173864930/article/details/45035521 (+ Google Translate).
It links to this interesting Java and ELF (ARM) call hooking project: https://github.com/boyliang/AllHookInOne
It is not exactly what I'm looking for, but I will try to implement an AOSP patch for dynamic analysis that suits my needs.

I have managed to answer my own question.
From what I could understand of the source code, there are three possible entry points for Java calls:
ArtMethod::Invoke (art/runtime/mirror/art_method.cc)
Execute (art/runtime/interpreter/interpreter.cc)
DoCall (art/runtime/interpreter/interpreter_common.cc)
ArtMethod::Invoke seems to be used for reflection and for calling a method directly through a pointer to its OAT code section. (Again, there is no documentation, so this may be inexact.)
Execute generally ends up calling DoCall.
Some ART optimizations make studying Java calls difficult, such as method inlining and direct code-offset invocation.
The first step is to disable these optimizations:
In device/brand-name/model/device.mk (in my case device/lge/hammerhead/device.mk for a Nexus 5):
Add the "interpret-only" option to dex2oat. With this option, ART compiles only the boot classpath, so applications will not be compiled to OAT code.
PRODUCT_PROPERTY_OVERRIDES := \
dalvik.vm.dex2oat-filter=interpret-only
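You can check that the filter was taken into account after flashing with getprop dalvik.vm.dex2oat-filter, which should print interpret-only.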
The second step is to disable inlining in art/compiler/dex/frontend.cc :
Uncomment "kSuppressMethodInlining".
/* Default optimizer/debug setting for the compiler. */
static uint32_t kCompilerOptimizerDisableFlags = 0 | // Disable specific optimizations
(1 << kLoadStoreElimination) |
// (1 << kLoadHoisting) |
// (1 << kSuppressLoads) |
// (1 << kNullCheckElimination) |
// (1 << kClassInitCheckElimination) |
(1 << kGlobalValueNumbering) |
// (1 << kPromoteRegs) |
// (1 << kTrackLiveTemps) |
// (1 << kSafeOptimizations) |
// (1 << kBBOpt) |
// (1 << kMatch) |
// (1 << kPromoteCompilerTemps) |
// (1 << kSuppressExceptionEdges) |
(1 << kSuppressMethodInlining) |
0;
The last step is to disable direct code offset invocation in art/compiler/driver/compiler_driver.cc :
-bool use_dex_cache = GetCompilerOptions().GetCompilePic();
+bool use_dex_cache = true;
With these changes, all the different kinds of calls fall through to the DoCall function, where we can finally add our targeted logging routine.
In art/runtime/interpreter/interpreter_common.h, add at the beginning of the includes :
#ifdef HAVE_ANDROID_OS
#include "cutils/properties.h"
#endif
In art/runtime/interpreter/interpreter_common.cc, add at the beginning of the DoCall function :
#ifdef HAVE_ANDROID_OS
  char targetAppVar[92];  // 92 == PROPERTY_VALUE_MAX
  property_get("target.app.pid", targetAppVar, "0");
  int targetAppPID = atoi(targetAppVar);
  // Only log when running inside the targeted application's process.
  if (targetAppPID != 0 && targetAppPID == getpid())
    LOG(INFO) << "DoCall - " << PrettyMethod(method, true);
#endif
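A possible variant (untested, just an idea): target by process name instead of PID, using a hypothetical target.app.name property. Since an Android app's process name is its package name, a rough sketch (assuming <cstdio> and <cstring> are available in this translation unit) could look like this:
#ifdef HAVE_ANDROID_OS
  char targetAppName[92];  // 92 == PROPERTY_VALUE_MAX
  property_get("target.app.name", targetAppName, "");
  if (targetAppName[0] != '\0') {
    // Cache the process name (the package name), read once from /proc/self/cmdline.
    static char procName[128] = {0};
    if (procName[0] == '\0') {
      FILE* f = fopen("/proc/self/cmdline", "r");
      if (f != nullptr) {
        fread(procName, 1, sizeof(procName) - 1, f);
        fclose(f);
      }
    }
    if (strcmp(procName, targetAppName) == 0)
      LOG(INFO) << "DoCall - " << PrettyMethod(method, true);
  }
#endif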
To target the application, I use a system property that holds the targeted PID.
This requires the system/core/libcutils library, and that library is only available when the AOSP is compiled for a real phone (without messing with the current makefiles).
So the solution will not work on an emulator. (Only guessing, never tried. EDIT: confirmed, "cutils/properties.h" cannot be added to the build of an emulator.)
After compiling and flashing the patched AOSP, start an app, find its PID with ps | grep, and set the property as root:
shell#android:/ # ps | grep contacts
u0_a2 4278 129 1234668 47356 ffffffff 401e8318 S com.android.contacts
shell#android:/ # setprop target.app.pid 4278
shell#android:/ # logcat
[...]
I/art ( 4278): DoCall - int android.view.View.getId()
I/art ( 4278): DoCall - void com.android.contacts.activities.PeopleActivity$ContactsUnavailableFragmentListener.onCreateNewContactAction()
I/art ( 4278): DoCall - void android.content.Intent.<init>(java.lang.String, android.net.Uri)
I/art ( 4278): DoCall - void android.app.Activity.startActivity(android.content.Intent)
I/ActivityManager( 498): START u0 {act=android.intent.action.INSERT dat=content://com.android.contacts/contacts cmp=com.android.contacts/.activities.ContactEditorActivity} from uid 10002 on display 0
V/WindowManager( 498): addAppToken: AppWindowToken{3a82282b token=Token{dc3f87a ActivityRecord{c0aaca5 u0 com.android.contacts/.activities.ContactEditorActivity t4}}} to stack=1 task=4 at 1
I/art ( 4278): DoCall - void android.app.Fragment.onPause()
I/art ( 4278): DoCall - void com.android.contacts.common.list.ContactEntryListFragment.removePendingDirectorySearchRequests()
I/art ( 4278): DoCall - void android.os.Handler.removeMessages(int)
I/art ( 4278): DoCall - void com.android.contacts.list.ProviderStatusWatcher.stop()
I/art ( 4278): DoCall - boolean com.android.contacts.list.ProviderStatusWatcher.isStarted()
I/art ( 4278): DoCall - void android.os.Handler.removeCallbacks(java.lang.Runnable)
I/art ( 4278): DoCall - android.content.ContentResolver com.android.contacts.ContactsActivity.getContentResolver()
I/art ( 4278): DoCall - void android.content.ContentResolver.unregisterContentObserver(android.database.ContentObserver)
I/art ( 4278): DoCall - void android.app.Activity.onPause()
I/art ( 4278): DoCall - void android.view.ViewGroup.drawableStateChanged()
I/art ( 4278): DoCall - void com.android.contacts.ContactsActivity.<init>()
I/art ( 4278): DoCall - void com.android.contacts.common.activity.TransactionSafeActivity.<init>()
I/art ( 4278): DoCall - void android.app.Activity.<init>()
I/art ( 4278): DoCall - void com.android.contacts.util.DialogManager.<init>(android.app.Activity)
I/art ( 4278): DoCall - void java.lang.Object.<init>()
[...]
When you are done:
shell#android:/ # setprop target.app.pid 0
Voilà !
The overhead is not noticeable from a user's point of view, but logcat fills up quickly.
PS: File paths and names match Android 5 (Lollipop); they will probably differ in later versions.
PS': If you want to print the method arguments as well, I would advise looking at the PrettyArguments function in art/runtime/utils.cc and finding a practical use of it somewhere in the code.

Maybe you can get more ideas from the Xposed project; it follows the same approach of disabling the method inlining and direct branching optimizations:
https://github.com/rovo89/android_art/commit/0f807a6561201230962f77a46120a53d3caa12c2
https://github.com/rovo89/android_art/commit/92e8c8e0309c4a584f4279c478d54d8ce036ee59

Related

Disabling the default class constructor in Rcpp modules

I would like to disable the default (zero argument) constructor for a C++ class exposed to R using RCPP_MODULE so that calls to new(class) without any further arguments in R give an error. Here are the options I've tried:
1) Specifying a default constructor and throwing an error in the function body: this does what I need it to, but means specifying a default constructor that sets dummy values for all const member variables (which is tedious for my real use case)
2) Specifying that the default constructor is private (without a definition): this means the code won't compile if .constructor() is used in the Rcpp module, but has no effect if .constructor() is not used in the Rcpp module
3) Explicitly using delete on the default constructor: requires C++11 and seems to have the same (lack of) effect as (2)
I'm sure that I am missing something obvious, but can't for the life of me work out what it is. Does anyone have any ideas?
Thanks in advance,
Matt
Minimal example code (run in R):
inc <- '
using namespace Rcpp;
class Foo
{
public:
Foo()
{
stop("Disallowed default constructor");
}
Foo(int arg)
{
Rprintf("Foo OK\\n");
}
};
class Bar
{
private:
Bar();
// Also has no effect:
// Bar() = delete;
public:
Bar(int arg)
{
Rprintf("Bar OK\\n");
}
};
RCPP_MODULE(mod) {
class_<Foo>("Foo")
.constructor("Disallowed default constructor")
.constructor<int>("Intended 1-argument constructor")
;
class_<Bar>("Bar")
// Won't compile unless this line is commented out:
// .constructor("Private default constructor")
.constructor<int>("Intended 1-argument constructor")
;
}
'
library('Rcpp')
library('inline')
fx <- cxxfunction(signature(), plugin="Rcpp", include=inc)
mod <- Module("mod", getDynLib(fx))
# OK as expected:
new(mod$Foo, 1)
# Fails as expected:
new(mod$Foo)
# OK as expected:
new(mod$Bar, 1)
# Unexpectedly succeeds:
new(mod$Bar)
How can I get new(mod$Bar) to fail without resorting to the solution used for Foo?
EDIT
I have discovered that my question is actually a symptom of something else:
#include <Rcpp.h>
class Foo {
public:
int m_Var;
Foo() {Rcpp::stop("Disallowed default constructor"); m_Var=0;}
Foo(int arg) {Rprintf("1-argument constructor\n"); m_Var=1;}
int GetVar() {return m_Var;}
};
RCPP_MODULE(mod) {
Rcpp::class_<Foo>("Foo")
.constructor<int>("Intended 1-argument constructor")
.property("m_Var", &Foo::GetVar, "Get value assigned to m_Var")
;
}
/*** R
# OK as expected:
f1 <- new(Foo, 1)
# Value set in the 1-parameter constructor as expected:
f1$m_Var
# Unexpectedly succeeds without the error message:
tryCatch(f0 <- new(Foo), error = print)
# This is the type of error I was expecting to see:
tryCatch(f2 <- new(Foo, 1, 2), error = print)
# Note that f0 is not viable (and sometimes brings down my R session):
tryCatch(f0$m_Var, error = print)
*/
[With acknowledgements to @RalfStubner for the improved code]
So in fact it seems that new(Foo) is not actually calling any C++ constructor at all for Foo, so my question was somewhat off-base ... sorry.
I guess there is no way to prevent this happening at the C++ level, so maybe it makes most sense to use a wrapper function around the call to new(Foo) at the R level, or continue to specify a default constructor that throws an error. Both of these solutions will work fine - I was just curious as to exactly why my expectation regarding the absent default constructor was wrong :)
As a follow-up question: does anybody know exactly what is happening in f0 <- new(Foo) above? My limited understanding suggests that although f0 is created in R, the associated pointer leads to something that has not been (correctly/fully) allocated in C++?
After a bit of experimentation I have found a simple solution to my problem that is obvious in retrospect...! All I needed to do was use .factory in place of the default constructor, with a factory function that takes no arguments and just throws an error. The default constructor of the class is never actually referenced, so it doesn't need to be defined, but this gives the desired behaviour in R (i.e. an error if the user mistakenly calls new with no additional arguments).
Here is an example showing the solution (Foo_A) and a clearer illustration of the problem (Foo_B):
#include <Rcpp.h>
class Foo {
private:
Foo(); // Or for C++11: Foo() = delete;
const int m_Var;
public:
Foo(int arg) : m_Var(arg) {Rcpp::Rcout << "Constructor with value " << m_Var << "\n";}
void ptrAdd() const {Rcpp::Rcout << "Pointer: " << (void*) this << "\n";}
};
Foo* dummy_factory() {Rcpp::stop("Default constructor is disabled for this class"); return 0;}
RCPP_MODULE(mod) {
Rcpp::class_<Foo>("Foo_A")
.factory(dummy_factory) // Disable the default constructor
.constructor<int>("Intended 1-argument constructor")
.method("ptrAdd", &Foo::ptrAdd, "Show the pointer address")
;
Rcpp::class_<Foo>("Foo_B")
.constructor<int>("Intended 1-argument constructor")
.method("ptrAdd", &Foo::ptrAdd, "Show the pointer address")
;
}
/*** R
# OK as expected:
fa1 <- new(Foo_A, 1)
# Error as expected:
tryCatch(fa0 <- new(Foo_A), error = print)
# OK as expected:
fb1 <- new(Foo_B, 1)
# No error:
tryCatch(fb0 <- new(Foo_B), error = print)
# But this terminates R with the following (quite helpful!) message:
# terminating with uncaught exception of type Rcpp::not_initialized: C++ object not initialized. (Missing default constructor?)
tryCatch(fb0$ptrAdd(), error = print)
*/
As was suggested to me in a comment I have started a discussion at https://github.com/RcppCore/Rcpp/issues/970 relating to this.

Why does my implementation of the sbrk system call not work?

I am trying to write a very simple OS to better understand the basic principles, and I need to implement a user-space malloc. So first I want to implement and test it on my Linux machine.
I initially implemented the sbrk() function in the following way:
void* sbrk( int increment ) {
return ( void* )syscall(__NR_brk, increment );
}
But this code does not work, whereas when I use the sbrk provided by the OS, everything works fine.
I have also tried another implementation of sbrk():
static void *sbrk(signed increment)
{
size_t newbrk;
static size_t oldbrk = 0;
static size_t curbrk = 0;
if (oldbrk == 0)
curbrk = oldbrk = brk(0);
if (increment == 0)
return (void *) curbrk;
newbrk = curbrk + increment;
if (brk(newbrk) == curbrk)
return (void *) -1;
oldbrk = curbrk;
curbrk = newbrk;
return (void *) oldbrk;
}
sbrk is invoked from this function:
static Header *morecore(unsigned nu)
{
char *cp;
Header *up;
if (nu < NALLOC)
nu = NALLOC;
cp = sbrk(nu * sizeof(Header));
if (cp == (char *) -1)
return NULL;
up = (Header *) cp;
up->s.size = nu; // ***Segmentation fault
free((void *)(up + 1));
return freep;
}
This code also does not work; on the line marked (***) I get a segmentation fault.
Where is the problem?
Thanks all. I have solved my problem using a new implementation of sbrk. The following code works fine:
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

void* __sbrk__(intptr_t increment)
{
    void *new, *old = (void *)syscall(__NR_brk, 0);
    new = (void *)syscall(__NR_brk, ((uintptr_t)old) + increment);
    return (((uintptr_t)new) == (((uintptr_t)old) + increment)) ? old : (void *)-1;
}
The first sbrk should probably take a long (or intptr_t) increment, and you forgot to handle errors (and set errno). More importantly, the raw brk system call does not take an increment at all: it takes the desired new program break address and returns the current break, so syscall(__NR_brk, increment) merely asks the kernel to set the break to the bogus address increment and hands you back the (unchanged) break.
The second sbrk function does not change the address space (as the real sbrk does). Note also that the libc brk() wrapper returns 0 on success and -1 on failure rather than the break address, so curbrk = oldbrk = brk(0) never holds the real break, and morecore ends up dereferencing a bogus pointer. You could use mmap to change the address space (but using mmap instead of sbrk won't update the kernel's idea of the data segment end the way sbrk does). You can cat /proc/1234/maps to query the address space of the process with pid 1234, or even read /proc/self/maps (e.g. with fopen & fgets) from inside your program.
BTW, sbrk is obsolete (most malloc implementations use mmap), and by definition every system call (listed in syscalls(2)) is executed by the kernel (for sbrk the kernel maintains the "data segment" limit!). So you cannot avoid the kernel, and I don't even understand why you want to emulate any system call. Almost by definition, you cannot emulate syscalls since they are the only way to interact with the kernel from a user application. From the user application, every syscall is an atomic elementary operation (done by a single SYSENTER machine instruction with appropriate contents in machine registers).
You could use strace(1) to understand the actual syscalls done by your running program.
BTW, the GNU libc is free software. You could look into its source code. musl-libc is a simpler libc and its code is more readable.
Finally, compile with gcc -Wall -Wextra -g and use the gdb debugger (you can even query the registers if you want to). Perhaps read the x86-64 ABI specification and the Linux Assembly HowTo.

Conditional Compilation of CUDA Function

I created a CUDA function for calculating the sum of an image using its histogram.
I'm trying to compile the kernel and the wrapper function for multiple compute capabilities.
Kernel:
__global__ void calc_hist(unsigned char* pSrc, int* hist, int width, int height, int pitch)
{
int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
#if __CUDA_ARCH__ > 110 //Shared Memory For Devices Above Compute 1.1
__shared__ int shared_hist[256];
#endif
int global_tid = yIndex * pitch + xIndex;
int block_tid = threadIdx.y * blockDim.x + threadIdx.x;
if(xIndex>=width || yIndex>=height) return;
#if __CUDA_ARCH__ == 110 //Calculate Histogram In Global Memory For Compute 1.1
atomicAdd(&hist[pSrc[global_tid]],1); /*< Atomic Add In Global Memory */
#elif __CUDA_ARCH__ > 110 //Calculate Histogram In Shared Memory For Compute Above 1.1
shared_hist[block_tid] = 0; /*< Clear Shared Memory */
__syncthreads();
atomicAdd(&shared_hist[pSrc[global_tid]],1); /*< Atomic Add In Shared Memory */
__syncthreads();
if(shared_hist[block_tid] > 0) /* Only Write Non Zero Bins Into Global Memory */
atomicAdd(&(hist[block_tid]),shared_hist[block_tid]);
#else
return; //Do Nothing For Devices Of Compute Capabilty 1.0
#endif
}
Wrapper Function:
int sum_8u_c1(unsigned char* pSrc, double* sum, int width, int height, int pitch, cudaStream_t stream = NULL)
{
#if __CUDA_ARCH__ == 100
printf("Compute Capability Not Supported\n");
return 0;
#else
int *hHist,*dHist;
cudaMalloc(&dHist,256*sizeof(int));
cudaHostAlloc(&hHist,256 * sizeof(int),cudaHostAllocDefault);
cudaMemsetAsync(dHist,0,256 * sizeof(int),stream);
dim3 Block(16,16);
dim3 Grid;
Grid.x = (width + Block.x - 1)/Block.x;
Grid.y = (height + Block.y - 1)/Block.y;
calc_hist<<<Grid,Block,0,stream>>>(pSrc,dHist,width,height,pitch);
cudaMemcpyAsync(hHist,dHist,256 * sizeof(int),cudaMemcpyDeviceToHost,stream);
cudaStreamSynchronize(stream);
(*sum) = 0.0;
for(int i=1; i<256; i++)
(*sum) += (hHist[i] * i);
printf("sum = %f\n",(*sum));
cudaFree(dHist);
cudaFreeHost(hHist);
return 1;
#endif
}
Question 1:
When compiling for sm_10, the wrapper and the kernel shouldn't execute. But that is not what happens. The whole wrapper function executes. The output shows sum = 0.0.
I expected the output to be Compute Capability Not Supported as I have added the printf statement in the start of the wrapper function.
How can I prevent the wrapper function from executing on sm_10? I don't want to add any run-time checks like if statements etc. Can it be achieved through template meta programming?
Question 2:
When compiling for greater than sm_10, the program executes correctly only if I add cudaStreamSynchronize after the kernel call. But if I do not synchronize, the output is sum = 0.0. Why is this happening? I want the function to be as asynchronous with respect to the host as possible. Is it possible to move the only loop (the final summation) inside the kernel?
I am using GTX460M, CUDA 5.0, Visual Studio 2008 on Windows 8.
Ad. Question 1
As Robert already explained in the comments, __CUDA_ARCH__ is defined only when compiling device code. To clarify: when you invoke nvcc, the code is parsed and compiled twice - once for the CPU and once for the GPU. The existence of __CUDA_ARCH__ can be used to check which of those two passes is occurring and then, in device code - as you do in the kernel - which GPU you are targeting.
However, for the host side not all is lost. While you don't have __CUDA_ARCH__ there, you can call the API function cudaGetDeviceProperties, which returns lots of information about your GPU. In particular, you will be interested in the fields major and minor, which indicate the compute capability. Note that this happens at run time, not at a preprocessing stage, so the same CPU code will work on all GPUs.
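For instance, a minimal host-side check could look like the sketch below (my wording, not the asker's code); calling it at the top of sum_8u_c1 would replace the __CUDA_ARCH__ == 100 branch:
#include <cstdio>
#include <cuda_runtime.h>

// Returns true if the currently selected device is compute capability 1.0 (sm_10).
static bool device_is_sm10()
{
    cudaDeviceProp prop;
    int dev = 0;
    cudaGetDevice(&dev);                  // current device
    cudaGetDeviceProperties(&prop, dev);  // query its properties
    return prop.major == 1 && prop.minor == 0;
}

// At the start of sum_8u_c1:
//   if (device_is_sm10()) { printf("Compute Capability Not Supported\n"); return 0; }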
Ad. Question 2
Kernel calls and cudaMemcpyAsync are asynchronous. This means that if you don't call cudaStreamSynchronize (or the like), the follow-up CPU code will continue running even though your GPU hasn't finished its work. As a result, the data you copy from dHist to hHist might not be there yet when you begin operating on hHist in the loop. If you want to work on the output of a kernel, you have to wait until the kernel finishes.
Note that cudaMemcpy (without Async) has an implicit synchronization inside.

Reading x86 MSR from kernel module

My main aim is to get the address values of the last 16 branches maintained by the LBR registers when a program crashes. I have tried two ways so far:
1) msr-tools
This allows me to read the MSR values from the command line. I invoke it from the C program itself and try to read the values, but the register values seem to have nothing to do with the addresses in the program itself. Most probably the registers are getting polluted by other branches in system code. I tried turning off the recording of branches in ring 0 and of far jumps, but that doesn't help; I still get unrelated values.
2) accessing through kernel module
OK, I wrote a very simple module (I've never done this before) to access the MSRs directly and possibly avoid register pollution.
Here's what I have -
#define LBR 0x1d9 //IA32_DEBUGCTL MSR
//I first set this to some non 0 value using wrmsr (msr-tools)
static void __init do_rdmsr(unsigned msr, unsigned unused2)
{
uint64_t msr_value;
__asm__ __volatile__ (" rdmsr"
: "=A" (msr_value)
: "c" (msr)
);
printk(KERN_EMERG "%lu \n",msr_value);
}
static int hello_init(void)
{
printk(KERN_EMERG "Value is ");
do_rdmsr (LBR,0);
return 0;
}
static void hello_exit(void)
{
printk(KERN_EMERG "End\n");
}
module_init(hello_init);
module_exit(hello_exit);
But the problem is that every time I use dmesg to read the output I get just
Value is 0
(I have tried other registers - the value always comes out as 0)
Is there something that I am forgetting here?
Any help? Thanks
Use the following:
unsigned long long x86_get_msr(int msr)
{
unsigned long msrl = 0, msrh = 0;
/* NOTE: rdmsr always returns its result in the EDX:EAX register pair */
asm volatile ("rdmsr" : "=a"(msrl), "=d"(msrh) : "c"(msr));
return ((unsigned long long)msrh << 32) | msrl;
}
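For example, reusing hello_init from the question and the IA32_DEBUGCTL MSR number it already defines (a minimal sketch):
static int hello_init(void)
{
    unsigned long long val = x86_get_msr(0x1d9); /* IA32_DEBUGCTL, as in the question */
    printk(KERN_EMERG "MSR 0x1d9 = 0x%llx\n", val);
    return 0;
}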
You can use Ilya Matveychikov's answer... or... OR :
#include <asm/msr.h>
int err;
unsigned int msr, cpu;
unsigned long long val;
/* rdmsr without exception handling; rdmsrl is a macro that stores the value into val */
rdmsrl(msr, val);
/* rdmsr with exception handling */
err = rdmsrl_safe(msr, &val);
/* rdmsr on a given CPU (instead of current one) */
err = rdmsrl_safe_on_cpu(cpu, msr, &val);
And there are many more functions, such as :
int msr_set_bit(u32 msr, u8 bit)
int msr_clear_bit(u32 msr, u8 bit)
void rdmsr_on_cpus(const struct cpumask *mask, u32 msr_no, struct msr *msrs)
int rdmsr_safe_regs_on_cpu(unsigned int cpu, u32 regs[8])
Have a look at /lib/modules/<uname -r>/build/arch/x86/include/asm/msr.h

MPI program with a VC++ GUI?

I need to write an application using MPICH2 (64 bit, in case you're wondering). A GUI is entirely optional but would of course be a huge plus. Will mpiexec have any difficulties running managed VC++ code? Are there any other problems I might run into with compiling/linking (calling conventions, etc)?
Just to give you an idea, the general structure of the program would be like this:
int main(array<System::String ^> ^args)
{
/* Get MPI rank */
if ( rank == 0 )
{
// Enabling Windows XP visual effects before any controls are created
Application::EnableVisualStyles();
Application::SetCompatibleTextRenderingDefault(false);
// Create the main window and run it
// Send/receive messages in Form1's code
Application::Run(gcnew Form1());
}
else
{
/* Send/receive messages to/from process #0 only */
}
return 0;
}
MPI is just another library, so there is no magic. Your code should look something like this:
init MPI
if (rank == 0) init your GUI;
while (1) {
    if (rank == 0) get input;
    perform MPI computation on input
    make sure rank 0 ends up with the final result
    if (rank == 0) display result on GUI;
}
if (rank == 0) clean up GUI;
clean up MPI
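A minimal sketch of that structure in plain (unmanaged) MPI C++, without the GUI, might look like the following; the rank-0 input/output stubs are where the Windows Forms code from the question would go:
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int input = 0;
    if (rank == 0) {
        input = 42;                      // in the real app: value obtained from the GUI
    }
    MPI_Bcast(&input, 1, MPI_INT, 0, MPI_COMM_WORLD);

    int partial = input + rank;          // each rank does its share of the work
    int result = 0;
    MPI_Reduce(&partial, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        std::printf("result = %d\n", result);  // in the real app: display on the form
    }

    MPI_Finalize();
    return 0;
}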
