Disassembler for Linux capable of disassembling old DOS .COM/.EXE files

My first question here, hopefully I'm not doing it wrong.
My problem is that I have a certain old DOS program that has hacked the file format to the extreme to save space. (Yes, it's a demoscene prod, for those who know.)
Objdump doesn't want to help me with it; quick Googling yielded no real results for the problem and the manpage doesn't seem too generous in this regard either.
There are other disassemblers, yes, like lida. However, for some reason I couldn't get lida to work; I believe there are alternatives.
Does anyone have any experience disassembling DOS executables on Linux? Or should I just try some DOS-based disassembler and run it under DOSEMU?

IDA is the best disassembler, and there is also a Linux version. It's better than a simple disassembler because it's interactive.
Also, if you want to see nice "hand made" assembly, the best place to look is old viruses. And not the binaries, but the sources, because they are commented. You can try Netlux for that.

ndisasm comes with NASM, the netwide assembler. It is pretty versatile, including the ability to disassemble raw streams of bytes (since you mentioned COM files) and also a few object file formats. Strictly speaking I think it's also possible to disassemble raw streams of bytes with some objdump option, but I don't remember how that goes.
However, self-modifying code can make this rather tricky. Looking at a stream of bytes, it's hard to predict what the final executed instructions might be if the program modifies itself, a common space-saving trick in the DOS era. You mentioned booting into DOS, which gives me some interesting ideas: perhaps you could step through it using a DOS debugger, or run DOS under qemu and use its debugging options (some of which include dumping assembly output and register state during execution).
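To make the self-modifying problem concrete, here is a minimal Python sketch of a toy decoder that understands only three instructions. The byte values are real 16-bit x86 encodings (C6 06 is mov byte [disp16], imm8; 40 is inc ax; 48 is dec ax; C3 is ret), but the 7-byte program, the function names, and the simplified execution model (no registers, no prefetch queue) are made up purely for illustration.

```python
LOAD_OFFSET = 0x100  # DOS loads a .COM image at offset 0x100 of its segment

# Hypothetical 7-byte .COM image: mov byte [0x0105], 0x48 ; inc ax ; ret
IMAGE = bytes([0xC6, 0x06, 0x05, 0x01, 0x48, 0x40, 0xC3])

ONE_BYTE = {0x40: "inc ax", 0x48: "dec ax", 0xC3: "ret"}


def step(mem, ip):
    """Decode one instruction at ip; return (text, next_ip, patch_or_None)."""
    i = ip - LOAD_OFFSET
    if mem[i] == 0xC6 and mem[i + 1] == 0x06:
        addr = mem[i + 2] | (mem[i + 3] << 8)  # little-endian disp16
        value = mem[i + 4]
        return f"mov byte [0x{addr:04x}], 0x{value:02x}", ip + 5, (addr, value)
    return ONE_BYTE[mem[i]], ip + 1, None


def static_listing(image):
    """What a disassembler sees: decode the bytes without running anything."""
    mem, ip, out = bytearray(image), LOAD_OFFSET, []
    while ip < LOAD_OFFSET + len(mem):
        text, ip, _ = step(mem, ip)
        out.append(text)
    return out


def executed_trace(image):
    """What actually runs: apply each memory write before decoding further."""
    mem, ip, out = bytearray(image), LOAD_OFFSET, []
    while True:
        text, ip, patch = step(mem, ip)
        out.append(text)
        if patch:
            addr, value = patch
            mem[addr - LOAD_OFFSET] = value  # the program rewrites itself
        if text == "ret":
            return out


print(static_listing(IMAGE))
print(executed_trace(IMAGE))
```

The static listing shows inc ax at offset 0x105, while the trace shows dec ax there, because the first instruction overwrote that byte before it was reached. A debugger or an emulator's trace output sees the second view, which is why stepping through the program can be more reliable than a plain disassembly.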

Related

What is wrong with self-modifying codes with static-recompilation emulations?

I was researching how to write an emulator and the techniques involved. The following paragraph made me wonder, because I can't figure out what can go wrong if self-modifying code is emulated through static recompilation.
In this technique, you take a program written in the emulated code and attempt to translate it into the assembly code of your computer. The result will be a usual executable file which you can run on your computer without any special tools. While static recompilation sounds very nice, it is not always possible. For example, you cannot statically recompile self-modifying code as there is no way to tell what it will become without running it. To avoid such situations, you may try combining static recompiler with an interpreter or a dynamic recompiler.
That is what I was reading, and this line made me wonder:
For example, you cannot statically recompile self-modifying code as there is no way to tell what it will become without running it
A good explanation with examples would be very instructive, thanks.
Edit: By the way, I know what self-modifying means. I just wonder what problems we would run into after static recompilation, and where: what exactly breaks our self-modifying code?

Self-modifying code heavily relies on the instruction set encoding of the original CPU. For example, it could flip some bits in a specific memory location to turn one instruction into another. With static recompilation, flipping those same bits will have an entirely different effect since the instructions are encoded completely differently for the host CPU.
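As a rough illustration of the point above, the following Python sketch flips the same bit in a guest instruction and in its translated host counterpart. The guest bytes are real 16-bit x86 encodings (0x40 is inc ax, 0x48 is dec ax); the host encodings and the translation table are invented for this example and do not describe any real CPU.

```python
# Real 16-bit x86 encodings: flipping bit 3 turns inc ax into dec ax.
GUEST = {0x40: "inc ax", 0x48: "dec ax"}

# Hypothetical host CPU encodings, made up for this illustration only.
HOST = {0x91: "inc r0", 0x17: "dec r0"}

# What a static recompiler would emit for each guest opcode (assumed mapping).
TRANSLATE = {0x40: 0x91, 0x48: 0x17}


def decode(table, byte):
    """Look up a byte in an encoding table."""
    return table.get(byte, "<not the intended instruction>")


def flip_bit3(byte):
    """The self-modification the guest program performs on its own code."""
    return byte ^ 0x08


guest_byte = 0x40
host_byte = TRANSLATE[guest_byte]

# On the original CPU the patch does what the author intended.
print(decode(GUEST, guest_byte), "->", decode(GUEST, flip_bit3(guest_byte)))

# Applied to the recompiled code, the same patch produces garbage.
print(decode(HOST, host_byte), "->", decode(HOST, flip_bit3(host_byte)))
```

In a real recompiled program the situation is usually worse still: the patch is written to the guest address, where the translated code no longer lives at all, so the recompiled instruction never changes.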

How to inspect Haskell bytecode

I am trying to figure out a bug (a serious performance downgrade). Unfortunately, I wasn't able to figure out why by going back many different versions of my code.
I am suspecting it could be some modifications to libraries that I've updated, not to mention in the meanwhile I've updated to GHC 7.6 from 7.4 (and if anybody knows if some laziness behavior has changed I would greatly appreciate it!).
I have an older executable of this code that does not have this bug and thus I wonder if there are any tools to tell me the library versions I was linking to from before? Like if it can figure out the symbols, etc.

GHC creates executables, which are notoriously hard to understand... On my Linux box I can view the assembly code by typing in
objdump -d <executable filename>
but I get back over 100K lines of code from just a simple "Hello, World!" program written in Haskell.
If you happen to have the GHC .hi files, you can get some information about the executable by typing in
ghc --show-iface <hi filename>
This won't give you the assembly code, but you can get some extra information that may prove useful.
As I mentioned in the comment above, on Linux you can use "ldd" to see what C-system libraries you used in the compile, but that is also probably less than useful.
You can try to use a decompiler, but those are generally written to decompile to C, not anything higher level and certainly not Haskell. That being said, GHC used to compile via C as an intermediary (modern versions default to a native code generator), so you might be able to learn something.
Personally I often find viewing system calls in action much more interesting than viewing pure assembly. On my Linux box, I can view all system calls by using strace (use Wireshark for the network traffic equivalent):
strace <program executable>
This also will generate a lot of data, so it might only be useful if you know of some specific place where direct real world communication (i.e., changes to a file on the hard disk drive) goes wrong.
In all honesty, you are probably better off just debugging the problem from source, although, depending on the actual problem, some of these techniques may help you pinpoint something.
Most of these tools have Mac and Windows equivalents.

Since much has changed in the last 9 years, and apparently this is still the first result a search engine gives on this question (like for me, again), an updated answer is in order:
First of all, yes, while Haskell does not specify a bytecode format, bytecode is also just a kind of machine code, for a virtual machine. So for the rest of the answer I will treat them as the same thing. The “Core” as well as the LLVM intermediate language, or even WASM, could be considered equivalent too.
Secondly, if your old binary is statically linked, then of course, no matter the format your program is in, no symbols will be available to check out. Because that is what linking does. Even with bytecode, and even with just classic static #include in simple languages. So your old binary will be no good, no matter what. And given the optimisations compilers do, a classic decompiler will very likely never be able to figure out which optimised bits originally came from which libraries. Especially with stream fusion and such “magic”.
Third, you can do the things you asked with a modern Haskell program. But you need to have your binaries compiled with -dynamic and -rdynamic, so that not only the C-calling-convention libraries (e.g. .so) and the Haskell libraries, but also the runtime itself is dynamically loaded. That way you end up with a very small binary, consisting of only your actual code, dynamic linking instructions, and the exact data about what libraries and runtime were used to build it. And since the runtime is compiler-dependent, you will know the compiler too. So it would give you everything you need, but only if you compiled it right. (I recommend using such dynamic linking by default in any case as it saves memory.)
The last factor that one might forget, is that even the exact same compiler version might behave vastly differently, depending on what IT was compiled with. (E.g. if somebody put a backdoor in the very first version of GHC, and all GHCs after that were compiled with that first GHC, and nobody ever checked, then that backdoor could still be in the code today, with no traces in any source or libraries whatsoever. … Or for a less extreme case, that version of GHC your old binary was built with might have been compiled with different architecture options, leading to it putting more optimised instructions into the binaries it compiles for unless told to cross-compile.)
Finally, of course, you can profile even compiled binaries, by profiling their system calls. This will give you clues about which part of the code acted differently and how. (E.g. if you notice that your new binary floods the system with some slow system calls where the old one just used a single fast one. A classic OpenGL example would be using fast display lists versus slow direct calls to draw triangles. Or using a different sorting algorithm, or having switched to a different kind of data structure that fits your work load badly and thrashes a lot of memory.)

Reading integers from keyboard in Assembly (Linux IA-32 x86 gcc gas)

I'd like to know how to read integers from keyboard in assembly. I'm using Linux/x86 IA-32 architecture and GCC/GAS (GNU Assembler). The examples I found so far are for NASM or some other Windows/DOS related compiler.
I heard that it has something to do with the "int 16h" interrupt, but I don't know how it works (does it need parameters? Does the result go to %eax or one of its sub-registers [AX, AH, AL]?).
Thanks in advance,
Flayshon.
:D

Simple answer is that you don't read integers from the keyboard, you read characters from the keyboard. You don't print integers to the screen, either - you print characters. You will need routines to convert "ascii-to-integer" and "integer-to-ascii". You can "just call scanf" for the one, and "just call printf" for the other. "scanf" works okay if the user is well-behaved and confines input to characters representing decimal digits, but it's difficult to get rid of any "junk" entered! "printf" isn't too bad.
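For reference, the two conversion routines can be sketched digit by digit, the same way one would write them in assembly (multiply by 10 and add going in, divide by 10 and take the remainder going out). This is only a Python sketch of the logic; the function names are made up, and a real (G)as version would do the same steps with registers.

```python
def ascii_to_int(text: bytes) -> int:
    """Parse an optional '-' followed by decimal digits; stop at the first non-digit."""
    i = 0
    negative = False
    if text[:1] == b"-":
        negative = True
        i = 1
    value = 0
    while i < len(text) and 0x30 <= text[i] <= 0x39:  # '0'..'9'
        value = value * 10 + (text[i] - 0x30)
        i += 1
    return -value if negative else value


def int_to_ascii(value: int) -> bytes:
    """Produce the decimal digits of value as ASCII bytes."""
    if value == 0:
        return b"0"
    negative = value < 0
    value = abs(value)
    digits = bytearray()
    while value:
        value, remainder = divmod(value, 10)
        digits.append(0x30 + remainder)  # digits come out least significant first
    if negative:
        digits.append(ord("-"))
    digits.reverse()
    return bytes(digits)


print(ascii_to_int(b"123\n") + 1)
print(int_to_ascii(-42))
```

Stopping at the first non-digit is one simple way of dealing with the trailing newline and any "junk" the user typed after the number.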
Although I'm a Nasm user (it works fine for Linux - not really "Windows/dos related"), I might have routines in (G)as syntax lying around. I'll see if I can find 'em if you can't figure it out.
As Brian points out, int 16h is a BIOS interrupt - 16-bit code - and is not useful in Linux.
Best,
Frank

In 2012, I don't recommend coding an entire program in assembly. Code only the most critical parts in it (if you absolutely want some assembly code). Compilers optimize better than humans. So use C or C++ for low-level software, and higher-level languages (e.g. OCaml) for everything else.
On Linux, you need to understand the role of the Linux kernel and of system calls, which are documented in section 2 of the man pages. You probably want at least read(2) and write(2) (if only handling stdin and stdout, which should already have been opened by the parent process, e.g. a shell), and you probably need many other syscalls (e.g. open(2) and close(2)). Don't forget to do your own buffering (for efficiency).
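To show the shape of such syscall-level I/O without any assembly, here is a small Python sketch. os.read and os.write are thin wrappers over read(2) and write(2); the helper names and the 4096-byte chunk size are arbitrary choices for this example.

```python
import os


def read_until_newline(fd: int, chunk: int = 4096) -> bytes:
    """Call read(2) in chunks until a newline or end of file is seen.

    Returns everything read so far, which may extend past the newline.
    """
    buffer = bytearray()
    while b"\n" not in buffer:
        data = os.read(fd, chunk)  # may return fewer bytes than requested
        if not data:               # empty result means end of file
            break
        buffer += data
    return bytes(buffer)


def write_all(fd: int, data: bytes) -> None:
    """Call write(2) until every byte is written; short writes are legal."""
    sent = 0
    while sent < len(data):
        sent += os.write(fd, data[sent:])
```

Calling read_until_newline(0) reads from stdin and write_all(1, b"...") writes to stdout, since descriptors 0 and 1 are already open when the program starts. An assembly version performs the same two loops around the raw system calls.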
I strongly recommend learning the Linux system interfaces by reading a good book such as Advanced Unix Programming.
How system calls are done at the machine level in assembly is documented in the Linux Assembly Howto (at least for x86 Linux in 32 bits).

If your goal is to "obtain" a program, I would agree entirely with Basile. If your goal is to "learn assembly language", these other languages aren't really going to help. If your goal is to learn the nitty-gritty details of the hardware, you probably want assembly language, but Linux (or any other "protected mode" OS) isolates us from the hardware, so you might want to use clunky old DOS or even "write your own OS". Flayshon doesn't actually say what his goal is, but since he's asking here, he's probably interested in assembly language...
Some of us have a mental illness that makes us think it's "fun" to write in assembly language. Humor us!
Best,
Frank

Linux kernel/os source code documentation?

Is there a Linux distro (other than Minix) with good documentation for the source code? Or, is there some good documentation to describe the general Linux source code?
I have downloaded the Kernel source code, but, it is (unsurprisingly) a little overwhelming to find my way around and I wondered if there were some higher-level documentation to go with how the Linux kernel works?

Have you tried having a look at The Linux Documentation Project? I've found it quite exhaustive regarding Linux.
They have a section, The Linux Kernel, which is an online book that explains how the Linux kernel works and why it behaves in certain ways. You should definitely look into it because it's very well made.

Some of the Linux kernel code has decent commenting as documentation, but if you're going to be getting into kernel development, I'd recommend picking up a good book. A good, relatively easy-to-read one is Linux Kernel Development, by Robert Love. I got started on the Second Edition when I was in college, and keep a copy of the third on my bookshelf now.
I also find the Linux Cross Reference site helpful in jumping around the kernel source code. It's nice for tracking down functions that are in different files, and getting at what you need.

If you want to learn about operating systems and their basics, I strongly suggest you start with a small kernel and then ramp up to learning about Linux. Starting with an operating system like Linux would be overwhelming in terms of code and documentation.
There is the xv6 operating system, which follows the basic Unix notion of files and processes. You can get the code listing and the documentation explaining the code properly. Here is a link to it: link.
Since it is used as the baseline for university courses, you should find good support material for understanding it.

Linux Core Kernel Commentary is a little dated, but is still an excellent source of info.

For something which is not obsolete (like kernel.org/doc is), you may see:
Free Electrons Linux/Documentation/ (3.8)
Linux Cross Reference kernel/Documentation/
kernel-doc (3.6.10)
The first is the one I prefer personally (clean, readable, pleasant, up‑to‑date).
The second is the most well known.
The third is for download, if you wish to browse and search it off-line (which may be handy in some cases).
My two cents as a side note before I leave: I find it weird that for something as famous as the Linux kernel, a web search for documentation returns masses of obsolete documents, while the reasonably up-to-date ones seem to be rather hidden and far from the top of the search results.

Linux user-space ELF loader

I need to do a rather unusual thing: manually execute an elf executable. I.e. load all sections into right places, query main() and call it (and cleanup then). Executable will be statically linked, so there will be no need to link libraries. I also control base address, so no worries about possible conflicts.
So, are there any libraries for that?
I found OSKit and its liboskit_exec, but the project seems to have been dead since 2002.
I'm OK with taking parts of projects (respecting licenses, of course) and tailoring them to my needs, but as I'm quite a noob in the Linux world, I don't even know where to find those parts! :)
PS. I need that for ARM platform.
UPD Well, the matter of loading elfs seems to require some good knowledge about it (sigh), so I'm out to read some specs and manuals. And I think I will stick to bionic/linker and libelfsh. Thanks guys!
Summarized findings:
libelf: http://directory.fsf.org/project/libelf/
elfsh and libelfsh (are now part of eresi): http://www.eresi-project.org/
elfio (another elf library): http://sourceforge.net/projects/elfio/
OSKit and liboskit_exec (outdated): http://www.cs.utah.edu/flux/oskit/
bionic/linker: https://android.googlesource.com/platform/bionic

A quick apt-cache search suggests libelf1, libelfg0 and/or libelfsh0. I think the elfsh program (in the namesake package) might be an interesting practical example of how to use libelfsh0.
I haven't tried any myself, but I hope they might be helpful. Good luck :-)

Google's Android, in its "bionic" libc implementation, has a completely reimplemented ELF loader. It's reasonably clean, and probably a better source than glibc if you're looking for something simple.

Take a look at libelf for reading the executable format. You are going to have trouble with this I think.
Sounds like, as you don't need libraries for anything, why not just mmap your executable, set data about various memory areas and jmp/b in?
I don't know if ARM has an NX-bit equivalent, but worth checking.
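As a starting point for the "read the headers, then map and jump" approach, here is a minimal Python sketch that pulls the entry point and the loadable segments out of a 32-bit little-endian ELF image (the layout used on typical 32-bit ARM Linux). The field layouts follow the ELF specification; the function name and the returned tuple shape are just choices made for this example, and it only inspects the file, it does not map or execute anything.

```python
import struct

# ELF32 file header (52 bytes) and program header (32 bytes), little-endian.
EHDR = struct.Struct("<16sHHIIIIIHHHHHH")
PHDR = struct.Struct("<8I")
PT_LOAD = 1  # segment type that must be mapped into memory


def parse_elf32(data: bytes):
    """Return (entry_point, [(vaddr, file_offset, filesz, memsz, flags), ...])."""
    (ident, _type, _machine, _version, e_entry, e_phoff, _shoff, _flags,
     _ehsize, e_phentsize, e_phnum, _shentsize, _shnum, _shstrndx) = \
        EHDR.unpack_from(data, 0)
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 1 or ident[5] != 1:
        raise ValueError("this sketch only handles 32-bit little-endian ELF")

    segments = []
    for index in range(e_phnum):
        (p_type, p_offset, p_vaddr, _paddr, p_filesz, p_memsz,
         p_flags, _align) = PHDR.unpack_from(data, e_phoff + index * e_phentsize)
        if p_type == PT_LOAD:
            segments.append((p_vaddr, p_offset, p_filesz, p_memsz, p_flags))
    return e_entry, segments
```

Two things are worth noting. A loader works from the program headers (segments), not from sections: each PT_LOAD entry says which filesz bytes at file_offset go to vaddr, with the rest up to memsz zero-filled. And control is transferred to the header's entry point, which in a statically linked executable is normally the startup code that later calls main(), not main() itself.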

This tool contains an ELF loader: http://bitwagon.com/rtldi/rtldi.html
I reused the code from rtldi for an ELF chainloader in another project. The code is here: http://svn.gna.org/viewcvs/plash/trunk/chroot-jail/elf-chainloader/?rev=877 and there is some background here: http://plash.beasts.org/wiki/Story16. (Apparently I have to break these links because stackoverflow won't let me post >1 link!)
