I have got a machine instruction's address from the EIP register. This machine instruction may change the value of a certain area of memory, and I want to get the address of that memory but cannot.
Of course, I could read the data at the machine instruction's address, but the content is machine code like 0x8b0c4d8b..., which is unreadable to me (I cannot use debugging tools like gdb).
How can I get the address that a machine instruction will write to?
If you know the machine code EIP points to and you just want to disassemble it, do something like this (I took your example of 0x8b0c4d8b):
#create binary file
$ echo -en "\x8b\x0c\x4d\x8b" > foo.bin
#disassemble it
$ objdump -D -b binary -m i386 foo.bin
foo.bin: file format binary
Disassembly of section .data:
00000000 <.data>:
0: 8b .byte 0x8b
1: 0c 4d or $0x4d,%al
3: 8b .byte 0x8b
So, in this case, it doesn't change any memory location, but if it did, you could easily see it from the assembly code.
Edit: It seems from the comments that you want to do this programmatically. Take a look at udis86. It allows examining operands of instructions. For ARM, see disarm.
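For example, here is a minimal sketch using udis86 (assuming its documented C API: ud_init, ud_set_input_buffer, ud_set_mode, ud_set_syntax, ud_disassemble, ud_insn_asm; link with -ludis86) that decodes the bytes from the question:
/* Minimal sketch: feed raw bytes to udis86 and print the decoded instructions.
   The operands of each instruction can then be inspected (e.g. with
   ud_insn_opr() in recent udis86 versions) to check for memory writes. */
#include <stdio.h>
#include <udis86.h>

int main(void)
{
    unsigned char code[] = { 0x8b, 0x0c, 0x4d, 0x8b };  /* bytes from the question */
    ud_t u;

    ud_init(&u);
    ud_set_input_buffer(&u, code, sizeof(code));
    ud_set_mode(&u, 32);              /* 32-bit x86 */
    ud_set_syntax(&u, UD_SYN_ATT);    /* AT&T syntax, like objdump's default */

    while (ud_disassemble(&u))
        printf("%s\n", ud_insn_asm(&u));

    return 0;
}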
I am building a RISC-V emulator which basically loads a whole ELF file into memory.
Up to now, I used the pre-compiled test binaries that the RISC-V Foundation provided, which conveniently had an entry point exactly at the start of the .text section.
For example:
> riscv32-unknown-elf-objdump ../riscv32i-emulator/tests/simple -d
../riscv32i-emulator/tests/simple: file format elf32-littleriscv
Disassembly of section .text.init:
80000000 <_start>:
80000000: 0480006f j 80000048 <reset_vector>
...
Going into this project I didn't know much about ELF files, so I just assumed that every ELF's entry point is exactly the same as the start of the .text section.
The problem arose when I compiled my own binaries: I found out that the actual entry point is not always the same as the start of the .text section, but it might be anywhere inside it, like here:
> riscv32-unknown-elf-objdump a.out -d
a.out: file format elf32-littleriscv
Disassembly of section .text:
00010074 <register_fini>:
10074: 00000793 li a5,0
10078: 00078863 beqz a5,10088 <register_fini+0x14>
1007c: 00010537 lui a0,0x10
10080: 43850513 addi a0,a0,1080 # 10438 <__libc_fini_array>
10084: 3a00006f j 10424 <atexit>
10088: 00008067 ret
0001008c <_start>:
1008c: 00002197 auipc gp,0x2
10090: cec18193 addi gp,gp,-788 # 11d78 <__global_pointer$>
...
So, after reading more about ELF files, I found out that the actual entry point address is provided by the "Entry point address" field in the ELF header:
> riscv32-unknown-elf-readelf a.out -h | grep Entry
Entry point address: 0x1008c
The problem now becomes that this address is not an offset into the file (from byte 0) but a virtual address, so obviously if I set the program counter of my emulator to this address, the emulator crashes.
Reading a bit more, I heard people talk about calculations regarding offsets from program headers and whatnot, but no one had a concrete answer.
My question is: what is the actual "formula" of how exactly you get the entry point address of the _start procedure as an offset from byte 0?
Just to be clear, my emulator doesn't support virtual memory and the binary is the only thing that is loaded into my emulator's memory, so I have no use for the abstraction of virtual memory. I just want every memory address as an offset into the file on disk.
My question is: what is the actual "formula" of how exactly you get the entry point address of the _start procedure as an offset from byte 0?
First, forget about sections. Only segments matter at runtime.
Second, use readelf -Wl to look at segments. They tell you exactly which chunk of file ([.p_offset, .p_offset + .p_filesz)) goes into which in-memory region ([.p_vaddr, .p_vaddr + .p_memsz)).
The exact calculation of "at which offset in the file does _start reside" is:
Find Elf32_Phdr which "covers" the address contained in Elf32_Ehdr.e_entry.
Using that phdr, file offset of _start is: ehdr->e_entry - phdr->p_vaddr + phdr->p_offset.
Update:
So, am I always looking for the 1st program header?
No.
Also by "covers" you mean that the 1st phdr->p_vaddr is always equal to e_entry?
No.
You are looking for the program header (describing the relationship between in-memory and on-file data) which overlaps ehdr->e_entry in memory. That is, you are looking for the segment for which phdr->p_vaddr <= ehdr->e_entry && ehdr->e_entry < phdr->p_vaddr + phdr->p_memsz. This segment is often the first, but that is in no way guaranteed. See also this answer.
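Put together, a minimal sketch in C of that calculation (assuming a 32-bit little-endian ELF and the system <elf.h>; error handling kept short for clarity):
/* Minimal sketch: find the file offset of the ELF entry point by locating
   the PT_LOAD segment whose virtual address range covers e_entry. */
#include <elf.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <elf-file>\n", argv[0]); return 1; }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf32_Ehdr ehdr;
    fread(&ehdr, sizeof ehdr, 1, f);

    for (int i = 0; i < ehdr.e_phnum; i++) {
        Elf32_Phdr phdr;
        fseek(f, ehdr.e_phoff + (long)i * ehdr.e_phentsize, SEEK_SET);
        fread(&phdr, sizeof phdr, 1, f);

        /* the segment that "covers" e_entry in memory */
        if (phdr.p_type == PT_LOAD &&
            phdr.p_vaddr <= ehdr.e_entry &&
            ehdr.e_entry < phdr.p_vaddr + phdr.p_memsz) {
            printf("entry point is at file offset 0x%x\n",
                   (unsigned int)(ehdr.e_entry - phdr.p_vaddr + phdr.p_offset));
            fclose(f);
            return 0;
        }
    }

    fprintf(stderr, "no segment covers the entry point\n");
    fclose(f);
    return 1;
}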
I'm trying to gather statistics about the percentage of library code that is used vs. executed. To do this I'm invoking qemu-user with the -d in_asm flag. I log this to a file and get a sizeable file listing the translated instructions, which looks like this:
----------------
IN:
0x4001a0f1e9: 48 83 c4 30 addq $0x30, %rsp
0x4001a0f1ed: 85 c0 testl %eax, %eax
0x4001a0f1ef: 74 b7 je 0x4001a0f1a8
----------------
IN:
0x4001a0f1f1: 49 8b 0c 24 movq (%r12), %rcx
0x4001a0f1f5: 48 83 7c 24 50 00 cmpq $0, 0x50(%rsp)
0x4001a0f1fb: 0f 84 37 01 00 00 je 0x4001a0f338
----------------
To map blocks to their associated files, I extract /proc/pid/maps for the QEMU process and compare the addresses of the executed instructions to the address ranges of the files within the guest program. This appears to work reasonably well; however, the majority of the instructions executed appear to lie outside any of the files contained in the maps file. The bottom of the guest address space is listed as follows:
.
.
.
40020a0000-4002111000 r--p 00000000 103:02 2622381 /lib/x86_64-linux-gnu/libpcre.so.3.13.3
4002111000-4002112000 r--p 00070000 103:02 2622381 /lib/x86_64-linux-gnu/libpcre.so.3.13.3
4002112000-4002113000 rw-p 00071000 103:02 2622381 /lib/x86_64-linux-gnu/libpcre.so.3.13.3
4002113000-4002115000 rw-p 00000000 00:00 0
555555554000-5555555a1000 r--p 00000000 103:02 12462104 /home/name/Downloads/qemu-5.2.0/exe/bin/qemu-x86_64
The guest program appears to end at 0x4002115000, with a sizeable gap between the guest and QEMU, which begins at 0x555555554000. I can match instructions in the libraries to the actual binaries, so the approach isn't entirely faulty. However, there are almost 60,000 blocks executed whose origin is between 0x400aa20000 and 0x407c8ae138. This region of memory is nominally unmapped; however, QEMU seems to be translating and successfully executing code there. The program appears to run correctly, so I am unsure where these instructions originate. I had initially thought it might be the vDSO, but the range appears to be much too large, and there are too many separate addresses. I looked at the preceding code for a couple of these blocks and it was in ld.so, but I can't say if all the calls are generated there. I think it's possible that this is kernel code, but I'm not sure how to validate whether or not this is true. I'm at a loss as to how to approach this problem.
Is there a way to trace the provenance of these instructions? Perhaps using the gdb stub or some other logging functionality?
When you search /proc/pid/maps, the corresponding modules may already have been unloaded. Running LD_DEBUG=files <your qemu command line> will print module loading info, including each module's load address and size. Search there for the missing code addresses.
I want to learn about the Linux eBPF VM. If I write an eBPF program test.c and compile it with LLVM:
clang -O2 -target bpf -o test.o test.c
how do I get the eBPF assembly, like tcpdump -d does for classic BPF? Thanks.
This depends on what you mean exactly by “learn[ing] linux ebpf vm”.
The language itself
If you mean learning about the instructions of eBPF, the assembly-like language itself, you can have a look at the documentation from the kernel (quite dense) or at this summarized version of the syntax from the bcc project.
The virtual machine
If you prefer to see how the internals of the eBPF virtual machine work, you can either have a look at various presentations (I recommend those from D. Borkmann; I have a list in this blog post), or you can read the kernel sources directly, under linux/kernel/bpf (in particular the file core.c). Alternatively, there is a simpler userspace implementation available.
Dump eBPF instructions
Now if you want to see the code that has been compiled from C to eBPF, here are a couple of solutions.
Read the object file
For my part I compile with the command presented in the tc-bpf man page:
__bcc() {
clang -O2 -emit-llvm -c $1 -o - | \
llc -march=bpf -filetype=obj -o "`basename $1 .c`.o"
}
alias bcc=__bcc
The code is translated into eBPF and stored in one of the sections of the ELF file produced. Then I can examine my program with tools such as objdump or readelf. For example, if my program is in the classifier section:
$ bcc return_zero.c
$ readelf -x classifier return_zero.o
Hex dump of section 'classifier':
0x00000000 b7000000 02000000 95000000 00000000 ................
In the above output, two instructions are displayed (little-endian; the first field starting with 0x is the offset inside the section). We could parse this to lay out the instructions and obtain:
b7 0 0 0000 00000002 // Load 0x02 in register r0
95 0 0 0000 00000000 // Exit and return value in r0
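For reference, a hypothetical return_zero.c that compiles to roughly those two instructions could look like this (the section name "classifier" matches the readelf example above; struct __sk_buff comes from the kernel UAPI header):
/* Hypothetical source for the dump above: a minimal tc classifier that
   just returns 2, i.e. "r0 = 2; exit". */
#include <linux/bpf.h>

__attribute__((section("classifier"), used))
int cls_main(struct __sk_buff *skb)
{
    return 2;
}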
[April 2019 edit] Dump an eBPF program loaded in the kernel
It is possible to dump the instructions of programs loaded (and then possibly attached to one of the available BPF hooks) in the kernel, either as eBPF assembly instructions, or as machine instructions if the program has been JIT-compiled. bpftool, relying on libbpf, is the go-to utility for doing such things. For example, one can see what programs are currently loaded, and note their ids, with:
# bpftool prog show
Then dumping the instructions for a program of a given id is as simple as:
# bpftool prog dump xlated id <id>
# bpftool prog dump jited id <id>
for eBPF or JITed (if available) instructions respectively. Output can also be formatted as JSON if necessary.
Advanced tools
Depending on the tools you use to inject BPF into the kernel, you can generally dump the output of the in-kernel verifier, which contains most of the instructions formatted in a human-friendly way.
With the bcc set of tools (not directly related to the previous command, and not related at all to the old 16-bit compiler), you can get this by using the relevant flags on the BPF object instance, while with tc filter add dev eth0 bpf obj … verbose this is done with the verbose keyword.
Disassemblers
The aforementioned userspace implementation (uBPF) has its own assembler and disassembler that might be of interest to you: it takes the “human-friendly” (add32 r0, r1 and the likes) instructions as input and converts into object files, or the other way round, respectively.
But probably more interesting is the support for debug info, which comes along with a BPF disassembler, in LLVM itself: as of today it has just been merged, and its author (A. Starovoitov) has sent an email about it to the netdev mailing list. This means that with clang/LLVM 4.0+, you should be able to use llvm-objdump -S -no-show-raw-insn my_file.o to obtain a nicely formatted output.
On Microsoft Windows you can get the Processor ID (not Process ID) via WMI, which in this case (only when acquiring the Processor ID) is based on the CPUID instruction.
Is there a similar method to acquire this ID on Linux?
I do not know what WMI or the MS-Windows "CPUID instruction" is, since I do not know or use MS-Windows (few users here do). So I cannot say for sure whether this offers the same information, but have a try with cat /proc/cpuinfo. If you require a specific value, you can grep it out easily.
If you need to do this from within a program, then you can use the file utilities to read such information. Always keep in mind one of the most basic principles of 'unix'-style operating systems: everything is a file.
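For instance, a minimal sketch in C that reads /proc/cpuinfo like any other file and prints a single field ("model name" is chosen here purely for illustration):
/* Minimal sketch: read /proc/cpuinfo and print the first "model name" line. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) { perror("/proc/cpuinfo"); return 1; }

    char line[512];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "model name", 10) == 0) {
            fputs(line, stdout);
            break;              /* the first CPU is enough for this example */
        }
    }

    fclose(f);
    return 0;
}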
For context on the OP's question, the ProcessorID value returned by WMI is documented thus:
Processor information that describes the processor features. For an x86 class CPU, the field format depends on the processor support of the CPUID instruction. If the instruction is supported, the property contains 2 (two) DWORD formatted values. The first is an offset of 08h-0Bh, which is the EAX value that a CPUID instruction returns with input EAX set to 1. The second is an offset of 0Ch-0Fh, which is the EDX value that the instruction returns. Only the first two bytes of the property are significant and contain the contents of the DX register at CPU reset—all others are set to 0 (zero), and the contents are in DWORD format.
As an example, on my system:
C:\>wmic path Win32_Processor get ProcessorId
ProcessorId
BFEBFBFF000206A7
Note that the ProcessorID is simply a binary-encoded format of information usually available in other formats, specifically the signature (Family/Model/Stepping/Processor type) and feature flags. If you only need the information, you may not actually need this ID -- just get the already-decoded information from /proc/cpuinfo.
If you really want these 8 bytes, there are a few ways to get the ProcessorID in Linux.
With root/sudo, the ID is contained in the output of dmidecode:
<snip>
Handle 0x0004, DMI type 4, 35 bytes
Processor Information
Socket Designation: CPU Socket #0
Type: Central Processor
Family: Other
Manufacturer: GenuineIntel
ID: A7 06 02 00 FF FB EB BF
<snip>
Note the order of bytes is reversed: Windows returns the results in Big-Endian order, while Linux returns them in Little-Endian order.
If you don't have root permissions, it is almost possible to reconstruct the ProcessorID from /proc/cpuinfo by binary-encoding the values it reports. For the "signature" (the first four bytes on Windows / the last four bytes on Linux) you can binary-encode the identification fields extracted from /proc/cpuinfo to conform to Figure 5-2 of the Intel documentation (other manufacturers follow it for compatibility):
Stepping is in bits 3-0
Model is in bits 19-16 and 7-4
Family is in bits 27-20 and 11-8
Processor type is in bits 13-12 (/proc/cpuinfo doesn't tell you this, assume 0)
Similarly, you can populate the remaining four bytes by iterating over the feature flags (flags key in /proc/cpuinfo) and setting bits as appropriate per Table 5-5 of the Intel doc linked above.
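As an illustration, here is a minimal sketch of that signature encoding in C (the family/model/stepping values used below are the ones /proc/cpuinfo would report for the example CPU with EAX 0x000206a7; the processor type is assumed to be 0, as noted above):
/* Minimal sketch: pack the displayed family/model/stepping back into the
   CPUID signature DWORD, following the bit layout listed above. */
#include <stdio.h>
#include <stdint.h>

static uint32_t cpuid_signature(unsigned family, unsigned model, unsigned stepping)
{
    unsigned ext_family  = family > 0x0f ? family - 0x0f : 0;
    unsigned base_family = family > 0x0f ? 0x0f : family;
    unsigned ext_model   = model >> 4;     /* simple split; enough for family 6 */
    unsigned base_model  = model & 0x0f;

    return (stepping & 0x0f)
         | (base_model << 4)
         | (base_family << 8)
         | (0u << 12)                      /* processor type, assumed 0 */
         | (ext_model << 16)
         | ((ext_family & 0xff) << 20);
}

int main(void)
{
    /* /proc/cpuinfo: cpu family : 6, model : 42, stepping : 7 */
    printf("0x%08x\n", cpuid_signature(6, 42, 7));   /* prints 0x000206a7 */
    return 0;
}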
Finally, you can install the cpuid package (e.g., on Ubuntu, sudo apt-get install cpuid). Then by running the cpuid -r (raw) command you can parse its output. You would combine the values from the EAX and EDX registers for an initial EAX value of 1:
$ cpuid -r
CPU 0:
0x00000000 0x00: eax=0x0000000d ebx=0x756e6547 ecx=0x6c65746e edx=0x49656e69
0x00000001 0x00: eax=0x000206a7 ebx=0x00020800 ecx=0x9fba2203 edx=0x1f8bfbff
<snip>
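Alternatively, if you are writing C anyway, you can skip parsing the cpuid tool's output and execute the CPUID instruction yourself. A minimal sketch using the GCC/clang helper header <cpuid.h>, printing EDX followed by EAX, which is the same order as the WMI string shown earlier:
/* Minimal sketch: run CPUID with EAX=1 and print EDX then EAX,
   mirroring the Win32_Processor ProcessorId format shown above. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 1 is not supported\n");
        return 1;
    }

    printf("%08X%08X\n", edx, eax);   /* e.g. BFEBFBFF000206A7 on the CPU above */
    return 0;
}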
I seem to be losing the battle against my stupidity.
This site explains the system calls under various versions of CP/M.
However, when I try to use call 2 (C_WRITE, console output), nothing much happens.
I have the following code.
ORG 100h
LD E,'a'
LD C,2
CALL 5
CALL 0
I recite this here from memory. If there are typos, rest assured they were not in the original since the file did compile and I had a COM file to start.
I am thinking the lines mean the following:
Make sure this gets loaded at address 100h (0h to FFh being the zero page).
Load ASCII 'a' into E register for system call 2.
Load integer 2 into C register for system call 2.
Make system call (JMP to system call is at address 5 in zero page).
End program (Exit command is at address 0 in zero page).
The program starts and exits with no problems. If I remove the last command, it hangs the computer (which I guess is also expected and shows that CALL 0 works).
However, it does not print the ASCII character. (It does print an extra newline, but the system might have done that.)
How can I get my CP/M program to do what the system call is supposed to do? What am I doing wrong?
UPDATE: The problem was that all assemblers I tried expected a certain format of the source file. This file worked with Microsoft's macro assembler:
.Z80
START: LD E,'a'
LD C,2
CALL 5
JP 0
I think (I am guessing) that asm.com (DR's assembler) and m80.com (Microsoft's macro assembler) expect Intel 8080 mnemonics and have to be told when to expect Z80 mnemonics, which are apparently different.
I'll accept the answer below anyway because it is also correct, since it suggests simply writing the binary image itself without worrying about asm.com.
Obvious possibility: is your assembler taking 'a' to be a hexadecimal rather than an ASCII character? 0xa is ASCII for new line. Maybe try 'g' or inspect a hex dump of your assembler output?
Other than that your code looks fine, though an RST 0 would save a few bytes.
EDIT:
I hand assembled your code to:
1e 61       ; LD E,'a'
0e 02       ; LD C,2
cd 05 00    ; CALL 0005h (BDOS entry point)
cd 00 00    ; CALL 0000h (warm boot)
I saved that to disk as mytest.com. I then launched this CP/M emulator (warning: that's a direct file download link; the emulator appears to be titled Joan Riff's "Z80MU PROFESSIONAL" Z80 and CP/M 2.2 Emulator and is itself more than twenty years old so doesn't seem to have a web page) for DOS inside DOSBox and ran mytest.com. It output the letter 'a'. So either your toolchain or your CP/M is at fault.
A picture because it really did happen:
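For completeness, here is a minimal sketch (in C, purely as an illustration; the filename mytest.com comes from the answer above) of writing those hand-assembled bytes straight to disk, i.e. producing the COM file without involving asm.com at all:
/* Minimal sketch: write the hand-assembled CP/M program to mytest.com. */
#include <stdio.h>

int main(void)
{
    static const unsigned char program[] = {
        0x1e, 0x61,           /* LD E,'a'               */
        0x0e, 0x02,           /* LD C,2                 */
        0xcd, 0x05, 0x00,     /* CALL 0005h (BDOS)      */
        0xcd, 0x00, 0x00      /* CALL 0000h (warm boot) */
    };

    FILE *f = fopen("mytest.com", "wb");
    if (!f) { perror("mytest.com"); return 1; }

    fwrite(program, 1, sizeof program, f);
    fclose(f);
    return 0;
}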