Monitoring the instructions of a running program in ubuntu?

Monitoring the instructions of a running program in ubuntu? - linux

I'm a little stuck here.
The idea is that I'd like to get a file of every instruction run by a program during it's execution. I'd like to do it with just the executable in hand (no source) and be able to determine what operation is occuring on what address when.
For example, I'd like to be able to run it on Google Chrome, Firefox, etc.
I want to use this for a performance prediction system I'm working on. I figure if I'm able to obtain each instruction that is executed in order it is executed on system 1, I can attempt to simulate/model the run time of an identical program being run on system 2, because I'll be able to predict(although I know not with 100% accuracy) L1/L2 cache-misses, L1/L2 cache-hits, TLB hits/misses, page faults, time taken on floating point multiplication operations, etc.
I'd like to try to do this on two different systems:
System 1: Ubuntu 10.10 on Intel Core 2 Duo CPU
System 2: Ubuntu 12.04 on system with 2x AMD Sixteen Core Opteron model 6274
(I can definitely change the OS's as neccessary, but would prefer to stay with Ubuntu, if possible)
Is this possible / how could I go about doing it? I know with debuggers, you can use them to step through everything, but I don't have the source available.

I think, you can use qemu (or even bochs) or valgrind to monitor every executed instruction. They are x86 binary translation tools (excluding bochs - which is an interpreter of x86 code). There is a valgrind tool called cachegrind (+ kcachegrind gui), which is ready to emulate cache by instrumenting every memory access and simulating some L1/L2 cache model (sizes may be configured via command line options).
To get deeper (into pipeline) you may want to look on free ptlsim (http://www.ptlsim.org/)

Related

Record dynamic instruction trace or histogram in QEMU?

I've written and compiled a RISC-V Linux application.
I want to dump all the instructions that get executed at run-time (which cannot be achieved by static analysis).
Is it possible to get a dynamic assembly instruction execution historgram from QEMU (or other tools)?

For instruction tracing, I go with -singlestep -d nochain,cpu, combined with some awk. This can become painfully slow and large depending on the code you run.
Regarding the statistics you'd like to obtain, delegate it to R/numpy/pandas/whatever after extracting the program counter.
The presentation or video of user "yvr18" on that topic, might cover some aspects of QEMU tracing at various levels (as well as some interesting heatmap visualization).

QEMU doesn't currently support that sort of trace of all instructions executed.
The closest we have today is that there are various bits of debug logging under the -d switch, and you can combine the tracing of "instructions translated from guest to native" with the "blocks of translated code executed" translation to work out what was executed, but this is pretty awkward.
Alternatively you could try scripting the gdbstub interface to do something like "disassemble instruction at PC; singlestep" which will (slowly!) give you all the instructions executed.
Note: There ongoing work to improve QEMU's ability to introspect guest execution so that you can write a simple 'plugin' with functions that are called back on events like guest instruction execution; with that it would be fairly easy to write a dump of guest instructions executed (or do more interesting processing), but this is still work-in-progress, so not available yet.

It seems you can do something similar with rv8 (https://github.com/rv8-io/rv8), using the command:
rv-jit -l

The "spike" RISC-V emulator allows tracing instructions executed, new values stored into registers, or just simply a histogram of PC values (from which you can extract what instruction was at each PC location).
It's not as fast as qemu, but runs at 100 to 200 MIPS on current x86 hardware (at least without tracing enabled)

Embedded Linux Boot Optimization

I am doing project on Pandaboard using Embedded Linux (UBUNTU 12.10 Server Prebuild image) to optimize boot time. I need techniques or tools through which I can find boot time and techniques to optimize the boot time. If anyone can help.

Just remove application which is not required from /etc/init.d/rc file also put echo after every process initialization and check which process is taking much time for starting,
if you find application which is taking more time then debug that application and so on.

There is program that can be helpful to know the approximate boot-up time. Check this link
Time Stamp.

First of all the best you have to do is to compile yourself your own made kernel, get the source on the internet and do a make xconfig and then unselected everythin you don't need.
In a second time create your own root filesystem using Buildroot and make xconfig to select/unselect everything you need or not.
Hope this help.
I had the same problem and do that way, now it's clearly not the same ;)
EDIT: Everything you need will be here

to analyze the boot process, you can use Bootchart2, its available on github: https://github.com/mmeeks/bootchart
or Bootchart, from the Ubuntu packages:
sudo apt-get update
sudo apt-get install bootchart pybootchartgui

There are broadly 3 areas where you can reduce boot time
Bootloader:
Modify the linker script to initialize only the required h/w. Also, if you are using an SD card to boot, merge kernel and bootloader image to save time.
Kernel:
Remove unwanted modules from kernel config. Also try using compressed and uncompressed image. If your CPU is good enough to handle it go compressed image and check uncompression time required for different compression types.
Filesystem:
FS size can be significantly reduced by removing the unwanted bins and libs. Check for dependencies and use only the one's that are required.
For more techniques and information on tools that help in measuring the boot time please refer to the following link.
Refer to Training Material

The basic rule is: the fastest code is code that never gets loaded and
run, so remove everything you don't need:
in U-Boot: don't load and run the full U-Boot at all; use FALCON
mode and have the SPL load the Linux kernel and DTB directly
in Linux: remove all drivers and other stuff you don't really need;
load all drivers that are not essential for your core application as
modules - and load them after your application was started. If you
take this serious, you may even want to start only one CPU core
initially (and start the remaining ones after your application is
running).
in user space: minimize the size of the root file system. throuw
out anything you don't need; configure tools (like busybox) to
contain only the really needed functionality; use efficient code
(for example, link against musl libc instead of glibc) etc.
What can be acchieved by combining all these measures can be seen in
this video - and yes, the complete code for this optimization is
available here.

Optimizing embedded Linux Boot process , needs modifications in three level of embedded Linux design.
Note: you will need the source codes of bootloader and kernel
Boot : the first step in optimizing and reducing boot time of board is optimizing boot loader. first you should know what is your bootloader is. If your bootloader is an opensource bootloader like u-boot than you have the opportunity to modify and optimize it. In u-boot we have a procedure that we can skip unnecessary system check and just upload kernel image to ram and start. the documentation and instruction for this is available in u-boot website. by doing this you will save about 4 ~ 5 second in boot.
Kernel : for having a quicker kernel , you should optimize kernel in many sections. for editing you can use on of Linux config menu. I always use a low graphic menu. it need some dependency you can use it by this command:
$ make menuconfig
our goal for Linux kernel is to have smaller kernel image and less module to load in boot. first change the algorithm of compression from gzip to LZO. the point of this action is gzip algorithm will take much time to extract kernel. by using LZO we have a quicker kernel decompression process. the second , disable any unnecessary driver or module that you don’t have it on your board or you don’t use it any more. by doing this , you will lose some device access and cannot use them in Linux but you will have two positive points: less Ram usage , quicker boot time.
but please remind that some driver are necessary for Linux and by disabling them you will lose some of main features (for example if you disable I2C driver in Linux you will no longer have a HDMI interface) that you need or in worst case you will have a boot problem (such as boot-loop). The third is to disable some of unusable filesystem to reduce kernel size and boot time. The Fourth is to remove some of compression algorithm to have smaller kernel image.
the last thing , If you are using a u-boot bootloader create a uImage instead of zImage. the following steps , are general and main actions , for having quicker boot as 1 second after power attach you should change more option.
after two base layer modifications, now we should optimize boot process in user-space (root file system). depend on witch system are you using , we have different changes to do. in abstract root file system of Linux that have necessary package and system to boot Linux we should use systemd instead of Unix systemv , because systemd have a multi-task init. system and it is faster , after that is udev that you should modify some of loading modules. if you have a graphical user-interface , we can use an easy trick to have a big boot time reduction by initing GUI first and load other module after loading GUI.
if you do all of following tasks , you can have quick boot time and fast system to work with.

Xen binary rewriting method

In full virtualization, what is the CPL of guest OS?
in paravertualiation, CPL of guest OS is 1(ring 1)
is it same in full virtualization?
and I heard that some of the x86 privileged instructions are
not easily handled, thus "binary rewriting" method is required...
how does this "binary rewriting" happens??
I understand that in virtualization, CPU is not emulated.
so how can hypervisor change the binary instruction codes before
the CPU executes them?? do they predict the next instruction on memory and
update the memory contents before CPU gets there??
if this is true, I think hypervisor code(performing binary rewriting)
needs to intercept the CPU every time before some instruction of guest OS is
executed. I think this is absurd.
specific explanation will be appreciated.
thank you in advance..!!

If by full virtualization, you mean hardware supported virtualization, then the CPL of the guest is identical to if it was running on bare-metal.
Xen never rewrites the binary.
This is something that VMWare (as far as I understand). To the best of my understanding (but I have never seen the VMWare source code), the method consists of basically doing runtime patching of code that needs to run differently - typically, this involves replacing an existing op-code with something else - either causing a trap to the hypervisor, or a replacement set of code that "does the right thing". If I understand how this works in VMWare is that the hypervisor "learns" the code by single-stepping through a block, and either applies binary patches or marks the section as "clear" (doesn't need changing). The next time this code gets executed, it has already been patched or is clear, so it can run at "full speed".
In Xen, using paravirtualization (ring compression), then the code in the OS has been modified to be aware of the virtualized environment, and as such is "trusted" to understand certain things. But the hypervisor will still trap for example writes to the page-table (otherwise someone could write a malicious kernel module that modifies the page-table to map in another guest's memory, or some such).
The HVM method does intercept CERTAIN instructions - but the rest of the code runs at normal full speed, thanks to the hardware support in modern processors, such as SVM in AMD and VMX in Intel processors. ARM has a similar technology in the latest models of their processors, but I'm not sure what the name of it is.
I'm not sure if I've answered quite all of your questions, if I've missed something, or it's not clear enough, feel free to ask...

Address space identifiers using qemu for i386 linux kernel

Friends, I am working on an in-house architectural simulator which is used to simulate the timing-effect of a code running on different architectural parameters like core, memory hierarchy and interconnects.
I am working on a module takes the actual trace of a running program from an emulator like "PinTool" and "qemu-linux-user" and feed this trace to the simulator.
Till now my approach was like this :
1) take objdump of a binary executable and parse this information.
2) Now the emulator has to just feed me an instruction-pointer and other info like load-address/store-address.
Such approaches work only if the program content is known.
But now I have been trying to take traces of an executable running on top of a standard linux-kernel. The problem now is that the base kernel image does not contain the code for LKM(Loadable Kernel Modules). Also the daemons are not known when starting a kernel.
So, my approach to this solution is :
1) use qemu to emulate a machine.
2) When an instruction is encountered for the first time, I will parse it and save this info. for later.
3) create a helper function which sends the ip, load/store address when an instruction is executed.
i am stuck in step2. how do i differentiate between different processes from qemu which is just an emulator and does not know anything about the guest OS ??
I can modify the scheduler of the guest OS but I am really not able to figure out the way forward.
Sorry if the question is very lengthy. I know I could have abstracted some part but felt that some part of it gives an explanation of the context of the problem.

In the first case, using qemu-linux-user to perform user mode emulation of a single program, the task is quite easy because the memory is linear and there is no virtual memory involved in the emulator. The second case of whole system emulation is a lot more complex, because you basically have to parse the addresses out of the kernel structures.
If you can get the virtual addresses directly out of QEmu, your job is a bit easier; then you just need to identify the process and everything else functions just like in the single-process case. You might be able to get the PID by faking a system call to get_pid().
Otherwise, this all seems quite a bit similar to debugging a system from a physical memory dump. There are some tools for this task. They are probably too slow to run for every instruction, though, but you can look for hints there.

How to test the kernel for kernel panics?

I am testing the Linux Kernel on an embedded device and would like to find situations / scenarios in which Linux Kernel would issue panics.
Can you suggest some test steps (manual or code automated) to create Kernel panics?

There's a variety of tools that you can use to try to crash your machine:
crashme tries to execute random code; this is good for testing process lifecycle code.
fsx is a tool to try to exercise the filesystem code extensively; it's good for testing drivers, block io and filesystem code.
The Linux Test Project aims to create a large repository of kernel test cases; it might not be designed with crashing systems in particular, but it may go a long way towards helping you and your team keep everything working as planned. (Note that the LTP isn't proscriptive -- the kernel community doesn't treat their tests as anything important -- but the LTP team tries very hard to be descriptive about what the kernel does and doesn't do.)
If your device is network-connected, you can run nmap against it, using a variety of scanning options: -sV --version-all will try to find versions of all services running (this can be stressful), -O --osscan-guess will try to determine the operating system by throwing strange network packets at the machine and guessing by responses what the output is.
The nessus scanning tool also does version identification of running services; it may or may not offer any improvements over nmap, though.
You can also hand your device to users; they figure out the craziest things to do with software, they'll spot bugs you'd never even think to look for. :)

You can try following key combination
SysRq + c
or
echo c >/proc/sysrq-trigger

Crashme has been known to find unknown kernel panic situations, but it must be run in a potent way that creates a variety of signal exceptions handled within the process and a variety of process exit conditions.
The main purpose of the messages generated by Crashme is to determine if sufficiently interesting things are happening to indicate possible potency. For example, if the mprotect call is needed to allow memory allocated with malloc to be executed as instructions, and if you don't have the mprotect enabled in the source code crashme.c for your platform, then Crashme is impotent.
It seems that operating systems on x64 architectures tend to have execution turned off for data segments. Recently I have updated the crashme.c on http://crashme.codeplex.com/ to use mprotect in case of __APPLE__ and tested it on a MacBook Pro running MAC OS X Lion. This is the first serious update to Crashme since 1994. Expect to see updated Centos and Freebsd support soon.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string