How do I add a new opcode to PHP? - php-internals

Searching around, I found this blog post: https://www.npopov.com/2017/04/14/PHP-7-Virtual-machine.html
Does it cover "everything" needed to add a new opcode, or all the places I'd need to touch in the engine? Or is it better to just start grepping in the code-base? Maybe there's a commit that can be used as a prototype or example?
Edit: the last opcode added was this one: https://github.com/php/php-src/pull/7019/files#diff-773bdb31563a0f907c75068675f6056b25f003e61f46928a31d9837ae107460d

The PHP source code is written in the C programming language. That is what you are looking at in your link to the "last opcode added": not opcodes themselves, but the C source that implements them. The PHP interpreter (the Zend Virtual Machine) reads the contents of PHP files and compiles them into opcodes. Opcodes are not written directly by programmers.
If you want to write a custom extension for PHP, you will need to write it in C and structure it so it can be loaded as an external module. To say "it's not trivial" would be a massive understatement.
There is a hybrid system that allows you to write PHP-like code that can be compiled into a standalone PHP extension: check out the Zephir language. Stack Overflow has lots of questions tagged with 'zephir'. I have read about it but not used it.
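If the goal really is a new opcode inside php-src itself (rather than an extension), the usual starting point is Zend/zend_vm_def.h, which the linked blog post describes. As a rough, untested sketch of the shape of a handler there (the opcode number, name, and body are made up for illustration; after editing the file you regenerate zend_vm_execute.h with Zend/zend_vm_gen.php):

    /* In Zend/zend_vm_def.h: a hypothetical handler that reads one
     * operand and advances to the next instruction. Pick an unused
     * opcode number; 208 is invented for this sketch. */
    ZEND_VM_HANDLER(208, ZEND_MY_NOOP, CONST|TMPVAR|CV, UNUSED)
    {
        USE_OPLINE
        zval *op1;

        op1 = GET_OP1_ZVAL_PTR(BP_VAR_R);
        /* ... do something with op1 here ... */
        FREE_OP1();
        ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();
    }

The handler alone is not enough: the compiler (Zend/zend_compile.c) also has to be taught to emit the new opcode somewhere, which is why the PR linked in the question is the best checklist of the files involved.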

Related

Create a C and C++ preprocessor using ANTLR

I want to create a tool that can analyze C and C++ code and detect unwanted behaviors, based on a config file. I thought about using ANTLR for this task, as I already created a simple compiler with it from scratch a few years ago (variables, conditions, loops, and functions).
I grabbed C.g4 and CPP14.g4 from the ANTLR grammars repository. However, I noticed that they don't handle preprocessing, as that's a separate step in compilation.
I tried to find a grammar that does the preprocessing part (updated to ANTLR4), with no luck. Moreover, I also realized that if I go with two-step parsing I won't be able to retain the original location of each character, as I'd already have modified the input stream.
I wonder if there's a good ANTLR grammar or program (preferably Python, but I can deal with other languages as well) that can help me preprocess the C code. I also thought about using gcc -E, but then I wouldn't be able to inspect the macro definitions. For example, I want to warn if a user used a #pragma GCC directive (some students at my university, for whom I am writing this program, used this to bypass some of the course's coding-style restrictions). Moreover, gcc -E will include library header contents, which I don't want to process.
My question is, therefore, whether you can recommend a grammar or program that I can use to preprocess C and C++ code. Alternatively, if you can guide me on how to create a grammar myself, that'd be perfect. I was able to write the basic #define, #pragma, etc. processing, but I'm unable to deal with conditionals and with macro functions, as I'm unsure how to handle them.
Thanks in advance!
This question is almost off-topic as it asks for an external resource. However, it also has a part that deserves some attention.
The term "preprocessor" already indicates what the handling of macros etc. is about: the parser never sees the disabled parts of the input, which also means those parts can contain anything, including text that is not valid in the language being parsed. Hence a good approach for parsing C-like languages is to send the input through a preprocessor (which can be a specialized input stream) to strip out all preprocessing constructs, resolve macros, and remove disabled text. The parse position is not a problem, because you can push the current token position before you open a new input stream and restore it when you are done. Store reported errors together with your input stream stack; this way you keep the correct token positions. I have used exactly this approach in my Windows resource file parser.

Why can't we extract source code from executable file?

I need some information on executable files; thanks in advance, this is a new topic in our class.
I've seen a lot of questions asking how to extract code, but my question is why can't we get the original source code? Yes, using decompilers we can extract something, but that code is not the exact code used to develop the program.
I mean, if a computer is running a piece of software it obviously has to have some code to execute, so why can't we get that code? Also, do exe files contain the same code the programmer wrote? Or are operating systems developed in such a way that they don't leak source code from an executable file?
An .exe file is made up of binary data: 1's and 0's. These files also contain additional support code linked in from many sources.
Operating systems execute this binary form directly; it is what we call machine code. (Getting the code back from the exe is like getting the apple back from the apple juice.) ;)
Also check Compiled vs. Interpreted Languages
The process that transforms source code into an exe file is extremely complex. For example, during compilation, the language of the source code (e.g. C++) is translated into machine code. It is like eating: what goes in is turned into something else entirely by your stomach, and there is no way back. That is why it's practically impossible to revert an exe file to its source code.
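To see that loss concretely, compile a tiny function and compare the two views. The disassembly in the comment is typical gcc -O2 output for x86-64; the exact bytes vary with compiler, flags, and target:

    /* add.c - build and inspect with: gcc -O2 -c add.c && objdump -d add.o */

    /* The programmer's view: names, a comment, types. */
    int add(int first, int second)
    {
        return first + second;  /* overflow is the caller's problem */
    }

    /*
     * The binary's view (typical x86-64 gcc -O2 output):
     *
     *   0:  8d 04 37    lea eax,[rdi+rsi]
     *   3:  c3          ret
     *
     * The parameter names, the comment, and even the fact that this
     * was an addition of ints (rather than some other computation that
     * compiles to the same bytes) are unrecoverable from those four bytes.
     */

A decompiler can turn those bytes back into legal C, but it has to invent new names and structure, which is why the recovered code never matches the original.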

Determine source language from a binary?

I responded to another question about developing for the iPhone in non-Objective-C languages, and I made the assertion that using, say, C# to write for the iPhone would strike an Apple reviewer wrong. I was speaking largely about UI elements differing between the ObjC and C# libraries in question, but a commenter made an interesting point, leading me to this question:
Is it possible to determine the language a program is written in, solely from its binary? If there are such methods, what are they?
Let's assume for the purposes of the question:
That from an interaction standpoint (console behavior, any GUI appearance, etc.) the two are identical.
That performance isn't a reliable indicator of language (no comparing, say, Java to C).
That you don't have an interpreter or something between you and the language - just raw executable binary.
Bonus points if your answer is as language-agnostic as possible.
Short answer: YES
Long answer:
If you look at a binary, you can find the names of the libraries that have been linked in. Opening cmd.exe in TextPad, you can easily find the following at hex offset 0x270: msvcrt.dll, KERNEL32.dll, NTDLL.DLL, USER32.dll, etc. msvcrt is the Microsoft C runtime support library. KERNEL32, NTDLL, and USER32.dll are OS-specific libraries which tell you either the target platform or the platform on which it was built, depending on how well the cross-platform development environment segregates the two.
Setting aside those clues, almost any C/C++ compiler has to insert the names of functions into the binary: there is a list of all functions (or entry points) stored in a table. C++ 'mangles' function names to encode the arguments and their types, to support overloaded methods. It is possible to obfuscate the function names, but they would still exist. The function signatures would include the number and types of the arguments, which can be used to trace the system or internal calls used in the program. At offset 0x4190 is "SetThreadUILanguage", which can be searched for to find out a lot about the development environment. I found the entry-point table at offset 0x1ED8A, and could easily see names like printf, exit, and scanf, along with __p__fmode, __p__commode, and __initenv.
Any executable for the x86 processor will have a data segment containing any static text that was included in the program. Back in cmd.exe (offset 0x42C8) is the text "S.o.f.t.w.a.r.e..P.o.l.i.c.i.e.s..M.i.c.r.o.s.o.f.t..W.i.n.d.o.w.s..S.y.s.t.e.m.". The string takes twice as many bytes as normally necessary because it was stored using double-wide characters, probably for internationalization. Error codes and messages are a prime source here.
At offset B1B0 is "p.u.s.h.d" followed by mkdir, rmdir, chdir, md, rd, and cd; I left out the unprintable characters for readability. Those are all command arguments to cmd.exe.
For other programs, I've sometimes been able to find the path from which a program was compiled.
So, yes, it is possible to determine the source language from the binary.
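The manual inspection described above is easy to automate. Here is a minimal sketch in C of a strings-style scanner: it prints every run of at least four printable ASCII bytes, which is enough to surface import names like KERNEL32.dll and static message text (real tools also decode the double-wide strings mentioned above and walk the PE/ELF import tables properly):

    #include <ctype.h>
    #include <stdio.h>

    #define MIN_RUN 4   /* shortest run of printable bytes worth printing */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <binary>\n", argv[0]);
            return 1;
        }
        FILE *f = fopen(argv[1], "rb");
        if (!f) {
            perror(argv[1]);
            return 1;
        }

        char run[4096];
        size_t len = 0;
        int c;
        while ((c = fgetc(f)) != EOF) {
            if (isprint(c) && len < sizeof run - 1) {
                run[len++] = (char)c;    /* extend the current run */
            } else {
                if (len >= MIN_RUN) {    /* flush a long-enough run */
                    run[len] = '\0';
                    puts(run);
                }
                len = 0;
            }
        }
        if (len >= MIN_RUN) {            /* flush whatever is left at EOF */
            run[len] = '\0';
            puts(run);
        }
        fclose(f);
        return 0;
    }

Piping its output through something like grep -i dll (or grepping for known runtime symbols such as msvcrt or __libc_start_main) recovers the same clues found by hand above.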
I'm not a compiler hacker (someday, I hope), but I figure that you may be able to find telltale signs in a binary file that would indicate what compiler generated it and some of the compiler options used, such as the level of optimization specified.
Strictly speaking, however, what you're asking is impossible. It could be that somebody sat down with a pen and paper and worked out the binary codes corresponding to the program that they wanted to write, and then typed that stuff out in a hex editor. Basically, they'd be programming in assembly without the assembler tool. Similarly, you may never be able to tell with certainty whether a native binary was written in straight assembler or in C with inline assembly.
As for virtual machine environments such as JVM and .NET, you should be able to identify the VM by the byte codes in the binary executable, I would expect. However you may not be able to tell what the source language was, such as C# versus Visual Basic, unless there are particular compiler quirks that tip you off.
What about these tools:
PE Detective
PEiD
Both are PE identifiers. OK, they're both for Windows, but that's what I was dealing with when I landed here.
I expect you could, if you disassemble the binary, at least identify the compiler: not all compilers generate the same code for printf, for example, so Objective-C and GNU C should differ there.
You have excluded all byte-code languages so this issue is going to be less common than expected.
First, run 'what' on some binaries and look at the output. CVS (and SVN) identifiers are scattered throughout the binary image, and most of those are from libraries.
Also, there's often a "map" to the various library functions. That's a big hint, also.
When the libraries are linked into the executable, there is often a map that's included in the binary file with names and offsets. It's part of creating "position independent code". You can't simply "hard-link" the various object files together. You need a map and you have to do some lookups when loading the binary into memory.
Finally, the start-up module for C, C++ (and I imagine C#) is unique to that compiler's default set of libraries.
Well, C is initially converted to ASM, so you could write all C code in ASM.
No, the machine code is language-agnostic. Different compilers can even take the same source code and generate different binaries. That's why you don't see general-purpose decompilers that will work on arbitrary binaries.
The command 'strings' could be used to get some hints as to what language was used (for instance, I just ran it on the stripped binary for a C application I wrote and the first entries it finds are the libraries linked by the executable).

Linux disassembler

How can I write a simple disassembler for Linux from scratch?
Are there any libs to use? I need something that "just works".
Instead of writing one, try objdump.
Based on your comment, and your desire to implement from scratch, I take it this is a school project. You could get the source for objdump and see what libraries and techniques it uses.
The BFD library might be of use.
You have to understand the ELF file format first. Then you can start processing the various code sections according to the opcodes of your architecture.
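To make that first step concrete, here is a small sketch that reads the ELF header using the structures from <elf.h> and reports the machine type and where the section headers live, which is the minimum you need before locating .text and feeding its bytes to a decoder (64-bit ELF only, and it validates nothing beyond the magic number):

    #include <elf.h>
    #include <stdio.h>
    #include <string.h>

    /* Print basic layout info from an ELF64 header.  Usage: ./elfinfo /bin/ls */
    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <elf-file>\n", argv[0]);
            return 1;
        }
        FILE *f = fopen(argv[1], "rb");
        if (!f) {
            perror(argv[1]);
            return 1;
        }

        Elf64_Ehdr eh;
        if (fread(&eh, sizeof eh, 1, f) != 1 ||
            memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) {
            fprintf(stderr, "%s: not an ELF file\n", argv[1]);
            fclose(f);
            return 1;
        }

        printf("machine:      %d (EM_X86_64 is %d)\n", (int)eh.e_machine, EM_X86_64);
        printf("entry point:  0x%lx\n", (unsigned long)eh.e_entry);
        printf("section hdrs: %u entries of %u bytes at offset 0x%lx\n",
               eh.e_shnum, eh.e_shentsize, (unsigned long)eh.e_shoff);
        /* Next step: read the section headers, find ".text", and hand
         * its bytes to an instruction decoder for your architecture. */
        fclose(f);
        return 0;
    }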
You can use libbfd and libopcodes, which are libraries distributed as part of binutils.
http://www.gnu.org/software/binutils/
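Fair warning if you go this route: libopcodes is not a stable public API, and both disassembler() and init_disassemble_info() have changed signatures across binutils releases. The following is a sketch only, written against the post-2.29 four-argument disassembler() and the older three-argument init_disassemble_info() (newer binutils adds a styled-printer argument), so expect to adjust it to your dis-asm.h:

    /* Sketch: disassemble a raw x86-64 buffer with libopcodes.
     * Build roughly as: gcc dis.c -lopcodes -lbfd  (varies by distro).
     * The API differs between binutils versions; treat as pseudocode. */
    #define PACKAGE "dis-sketch"       /* bfd.h refuses to build without these */
    #define PACKAGE_VERSION "0"
    #include <dis-asm.h>
    #include <stdio.h>

    int main(void)
    {
        /* lea eax,[rdi+rsi]; ret */
        bfd_byte code[] = { 0x8d, 0x04, 0x37, 0xc3 };

        struct disassemble_info info;
        init_disassemble_info(&info, stdout, (fprintf_ftype)fprintf);
        info.arch          = bfd_arch_i386;
        info.mach          = bfd_mach_x86_64;
        info.buffer        = code;
        info.buffer_length = sizeof code;
        info.buffer_vma    = 0;
        disassemble_init_for_target(&info);

        disassembler_ftype disasm =
            disassembler(bfd_arch_i386, 0 /* little-endian */, bfd_mach_x86_64, NULL);

        bfd_vma pc = 0;
        while (pc < sizeof code) {
            printf("%4lx:\t", (unsigned long)pc);
            pc += disasm(pc, &info);   /* returns the instruction length */
            printf("\n");
        }
        return 0;
    }

This is essentially what objdump -d does per section, minus all the file-format handling that BFD contributes.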
As an example of the power of these libraries, check out the Online Disassembler (ODA).
http://www.onlinedisassembler.com
ODA supports a myriad of architectures and provides a basic feature set. You can enter binary data in the Live View and watch the disassembly appear as you type, or you can upload a file to disassemble. A nice feature of this site is that you can share the link to the disassembly with others.
You can take a look at the code of ERESI
The ERESI Reverse Engineering Software Interface is a multi-architecture binary analysis framework with a tailored domain specific language for reverse engineering and program manipulation.

How to code in the unix spirit? (small single task tools)

I have written a little script which retrieves pictures and movies from my camera, renames them based on their date, and then copies them onto my hard drive, managing conflicts automatically (same name? same size? same md5?).
Works pretty well.
But this is ONE script.
From time to time I need to check if a picture is already hidden somewhere in a volume, so I'd like to apply the "conflict manager" only. I guess if I had properly followed the unix spirit of tiny single-task tools, I could do that.
What are the best resources, best practices and your experience on the subject?
Thank you.
Edit: Although I'd love to read Unix books and gain a deep understanding of the subject, I am looking for the Great Principles first. Plus, I tend to limit myself to online resources.
I would look at the book called The Art of Unix Programming.
I've found that most code doesn't start out being reusable, it evolves to be. Take your existing code and factor out the "conflict manager" portion into its own function or program, then call that program instead of having it be a part of your original application. After that you'll be able to reuse that part of your code that you have a need to reuse. Sometimes it's impossible to design software up front for reusability because you simply don't know which parts you'll want to reuse.
As for resources, it seems like the store shelves are packed with books for Linux desktop users and system administrators, but it's hard to find good Linux programming books. A few good ones:
Beginning Linux Programming
Professional Linux Programming
Linux Programming by Example: The Fundamentals
The Linux Programmer's Toolbox
Lastly, Eric Raymond has made The Art of Unix Programming available online for free.
Check out this book:
The Art of Unix Programming by Eric S. Raymond
http://www.amazon.com/UNIX-Programming-Addison-Wesley-Professional-Computing/dp/0131429019
Here is his website:
http://www.catb.org/~esr/writings/taoup/
Personally, whenever I see that the script I'm planning to write will be longer than a dozen lines, I use Python instead of shell script. One neat trick with Python scripting is that it's very easy to program in a style where you create both "unix spirit" command-line tools and libraries. E.g. for your "conflict manager", create a file (a Python module) and put the functionality in functions and/or classes, then at the end add a Python "main" section (the usual if __name__ == '__main__': dance) where you parse command-line options (use the OptionParser class from the built-in optparse module for this, it's very nice!) and call into those functions/classes.
This way you can use the utility both as a stand-alone command-line program, and you can import the module in another Python script and use the functionality defined there via the functions/classes rather than by going through command-line parsing.
Start with Wikipedia (Dataflow programming).
The book Software Tools (amazon) by Kernighan and Plauger is a classic on this subject. I think it should be required reading for any serious student of software development.
The Art of Unix Programming is quite a nice book on "the Unix way", insofar as one exists. OTOH, if the way is "do as little work as gets your job done", you may already be there. :)
I think some of the keys to good GNU code are (a small sketch follows below):
Handling system signals properly, e.g. cleaning up open files if SIGTERM is received.
Proper use of pipes and standard input/output.
Following common command-line flag conventions.
I would also recommend this book. Pretty old, but I think it's quite clear in explaining the principles of Unix.
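Those three points fit in one small example. Below is a hypothetical sketch in C of a single-task filter in that spirit: it reads file paths on stdin, writes "size<TAB>path" on stdout so it composes with sort and friends, sends diagnostics to stderr, and exits cleanly on SIGINT/SIGTERM. A second small tool (the "conflict manager") could then compare checksums only for the size collisions this one surfaces:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    /* filesize: one path per line on stdin -> "size\tpath" on stdout.
     * Compose it:  find . -type f | ./filesize | sort -n */

    static volatile sig_atomic_t stop;

    static void on_signal(int sig)
    {
        (void)sig;
        stop = 1;   /* finish the current line, then exit cleanly */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_signal;
        sigaction(SIGINT, &sa, NULL);
        sigaction(SIGTERM, &sa, NULL);

        char path[4096];
        while (!stop && fgets(path, sizeof path, stdin)) {
            path[strcspn(path, "\n")] = '\0';   /* strip trailing newline */
            struct stat st;
            if (stat(path, &st) != 0) {
                perror(path);    /* diagnostics go to stderr, not stdout */
                continue;        /* one bad path shouldn't kill the stream */
            }
            printf("%lld\t%s\n", (long long)st.st_size, path);
        }
        return 0;
    }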
