How can i reverse/decompile a cython file(.so / .c) back to a .py file [duplicate]

How can i reverse/decompile a cython file(.so / .c) back to a .py file [duplicate] - python-3.x

I have used cythonize to compile my python modules. This way speed of code is increased and also the code can not be read by developers. However I have doubts if some python developer can crack that cython module to hack the code.
Question is, can someone decompile them back to python or other readable format to crack the code?

There are reasonably good C decompilers which will get the Cython extension back to (somewhat readable) C. It won't be the same C code that Cython generated, but it will likely be possible to work out the details of your algorithm. You wouldn't be able to get it back to the original Python/Cython code very easily (but given that Cython generates code in a fairly predictable way it might be possible...)
In particular, things like string constants will be fairly easy to extract from the C file (or even directly from the so file). Since a lot of Python code is based around attribute lookups from string constants (e.g. np.ones(...) looks up a global with the string constant "np", then looks up an attribute with the string constant "ones", then some variation of PyObject_Call), then that code will be fairly easy to decompile. Because of this, a typical Cython extension module is probably a little easier to decompile than a typical C program.
In short you should assume:
if you've messed up and deleted your .py/.pyx file then you should assume it's lost for good, and you can't get it back.
If someone else has a sufficient interest in working out what your code does, then you should assume they will be able to do it.

Related

Is it possible to "customize" python?

Can I change the core functionality of Python, for example, rewrite it to use say("Hello world") instead of print("Hello world")?
If this is possible, how can this be done?

I see a few possibilities as to how to accomplish this. I've arranged them in order of how much programming is needed/how obnoxious they are:
Renaming builtins
If, as in your example, you are simply more comfortable using say() or printf() than print(), then you can, as others have answered, just alias the builtin function to your own function with something like say=print.
Rewriting builtins
Let's pretend we don't trust the official implementation of print() and we want to implement our own. A lot of the internals in Python such as stdin are contained in the sys library. You could, if you wanted, implement your own. I asked a question a couple years ago here that discussed how to rename the _ variable to ans which might be illuminating to take a look at.
Sending your code through a preprocessor
Ok, so gcc doesn't require C code as input. If you use the right precompiler flags, then you could get away with evaluating #define macros in your source code before you send it to python. Technically a valid answer, but obnoxious as heck.
Writing modules in another language
Cython (python written in C) can have modules written for it in C. You could build a wrapper for printf in C (or assembly, if you'd rather) and use that library in your python code.
Recompiling Python
Unfortunately, doing the above is not possible with all tokens. What if, in a fit of fancy, we'd like to use whilst loops instead of while loops? The only way to accomplish this is actually altering the functioning of python itself. Now, this isn't for the faint of heart or the new programmer. Compilers are really complicated.
Since, however, Python is open source and you can download the source code here, in theory, you could go into the compiler and manually make all the edits you want, then compile your version of python and use that. By no means would your code be portable (as essentially you'd be making a fork of python) but you could technically do it.
Or just conform to the Python standards. That works too.
Writing a PEP
Python is a living language. It's constantly being updated. The ruling body of "What gets included" is the BDFL-delegate and the Council, but anyone can write a Python Enhancement Proposal that proposes to change the language in some way. Most features in Python started out as a PEP. See PEP 0001 for more details.

yes you can just write
say = print
say("hello")

How to Decompile Bytenode "jsc" files?

I've just seen this library ByteNode it's the same as ByteCode of java but this is for NodeJS.
This library compiles your JavaScript code into V8 bytecode, which protect your source code, I'm wondering is there anyway to Decompile byteNode therefore it's not secure enough. I'm wondering because I would like to protect my source code using this library?

TL;DR It'll raise the bar to someone copying the code and trying to pass it off as their own. It won't prevent a dedicated person from doing so. But the primary way to protect your work isn't technical, it's legal.
This library compiles your JavaScript code into V8 bytecode, which protect your source code...
Well, we don't know it's V8 bytecode, but it's "compiled" in some sense. All we know is that it creates a "code cache" via the built-in vm.Script.prototype.createCachedData API, which is officially just a cache used to speed up recompiling the code a second time, third time, etc. In theory, you're supposed to also provide the original source code as a string to the vm.Script constructor. But if you go digging into Node.js's vm.Script and V8 far enough it seems to be the actual code in some compiled form (whether actual V8 bytecode or not), and the code string you give it when running is ignored. (The ByteNode library provides a dummy string when running the code from the code cache, so clearly the actual code isn't [always?] needed.)
I'm wondering is there anyway to Decompile byteNode therefore it's not secure enough.
Naturally, otherwise it would be useless because Node.js wouldn't be able to run it. I didn't find a tool to do it that already exists, but since V8 is open source, it would presumably be possible to find the necessary information to write a decompiler for it that outputs valid JavaScript source code which someone could then try to understand.
Experimenting with it, local variable names appear to be lost, although function names don't. Comments appear to get lost (this may not be as obvious as it seems, given that Function.prototype.toString is required to either return the original source text or a synthetic version [details]).
So if you run the code through a minifier (particularly one that renames functions), then run it through ByteNode (or just do it with vm.Script yourself, ByteNode is a fairly thin wrapper), it will be feasible for someone to decompile it into something resembling source code, but that source code will be very hard to understand. This is very similar to shipping Java class files, which can be decompiled (there's even a standard tool to do it in the JDK, javap), except that the format Java class files are well-documented and don't change from one dot release to the next (though they can change from one major release to another; new releases always support the older format, though), whereas the format of this data is not documented (though it's an open source project) and is subject to change from one dot release to the next.
Certain changes, such as changing the copyright message, are probably fairly easy to make to said source code. More meaningful changes will be harder.
Note that the code cache appears to have a checksum or other similar integrity mechanism, since directly editing the .jsc file to swap one letter for another in a literal string makes the code cache fail to load. So someone tampering with it (for instance, to change a copyright notice) would either need to go the decompilation/recompilation route, or dive into the V8 source to find out how to correct the integrity check.
Fundamentally, the way to protect your work is to ensure that you've put all the relevant notices in the relevant places such that the fact copying it is a violation of copyright is clear, then pursue your legal recourse should you find out about someone passing it off as their own.

is there any way
You could get a hundred answers here saying "I don't know a way", but that still won't guarantee that there isn't one.
not secure enough
Secure enough for what? What's your deployment scenario? What kind of scenario/attack are you trying to defend against?
FWIW, I don't know of an existing tool that "decompiles" V8 bytecode (i.e. produces JavaScript source code with the same behavior). That said, considering that the bytecode is a fairly straightforward translation of the source code, I'm sure it wouldn't be very hard to write such a tool, if someone had a reason to spend some time on it. After all, V8's JS-to-bytecode compiler is open source, so one would only have to look at those sources and implement the reverse direction. So I would assume that shipping as bytecode provides about as much "protection" as shipping as uglified JavaScript, i.e. none that I would trust.
Before you make any decisions, please also keep in mind that bytecode is considered an internal implementation detail of V8; in particular it is not versioned and can change at any time, so it has to be created by exactly the same V8 version that consumes it. If you want to update your Node.js you'll have to recreate all the bytecode, and there is no checking or warning in place that will point out when you forgot to do that.

Node.js source already contains code for decompiling binary bytecode.
You can get a text string from your V8 bytecode and then you would need to analyze it.
But text string would be very long and miss some important information such as a constant pool. So you need to modify the Node.js source.
Please check https://github.com/3DGISKing/pkg10.17.0
I have attached exported xml file.
If you study V8, it would be possible to analyze it and get source code from it.

It keeping it short and sweet, You can try Ghidra node.js package which is based on Ghidra reverse engineering framework which was open-sourced by NSA in the year 2019. Ghidra is capable of disassembling and decompiling the v8 bytecode. The inner working of disassembling is quite complex, this answer is short but sufficient.

How to programmatically wrap a C++ dll with Python

I know how to use ctypes to call a function from a C++ .dll in Python by creating a "wrapper" function that casts the Python input types to C. I think of this as essentially recreating the function signatures in Python, where the function body contains the type cast to C and a corresponding .dll function call.
I currently have a set of C++ .dll files. Each library contains many functions, some of which are overloaded. I am tasked with writing a Python interface for each of these .dll files. My current way forward is to "use the hammer I have" and go through each function, lovingly crafting a corresponding Python wrapper for each... this will involve my looking at the API documentation for each of the functions within the .dlls and coding them up one by one. My instinct tells me, though, that there may be a much more efficient way to go about this.
My question is: Is there a programmatic way of interfacing with a Windows C++ .dll that does not require crafting corresponding wrappers for each of the functions? Thanks.

I would recommend using Cython to do your wrapping. Cython allows you to use C/C++ code directly with very little changes (in addition to some boilerplate). For wrapping large libraries, it's often straightforward to get something up and running very quickly with minimal extra wrapping work (such as in Ctypes). It's also been my experience that Cython scales better... although it takes more front end work to stand Cython up rather than Ctypes, it is in my opinion more maintainable and lends itself well to the programmatic generation of wrapping code to which you allude.

Customising Cabal libraries (I think?)

Perhaps it's just better to describe my problem.
I'm developing a Haskell library. But part of the library is written in C, and another part actually in raw LLVM. To actually get GHC to spit out the code I want I have to follow this process:
Run ghc -emit-llvm on both the code that uses the Haskell module and the "Main" module.
Run clang -emit-llvm on the C file
Now I've got three .ll files from above. I add the part of the library I've handwritten in raw LLVM and llvm-link these into one .ll file.
I then run LLVM's opt on the linked file.
Lastly, I feed the LLVM bitcode fileback into GHC (which pleasantly accepts it) and produces an executable.
This process (with appropriate optimisation settings of course) seems to be the only way I can inline code from C, removing the function call overhead. Since many of these C functions are very small this is significant.
Anyway, I want to be able to distribute the library and for users to be able to use it as painlessly as possible, whilst still gaining the optimisations from the process above. I understand it's going to be a bit more of a pain than an ordinary library (for example, you're forced to compile via LLVM) but as painlessly as possible is what I'm looking for advice for.
Any guidance will be appreciated, I don't expect a step by step answer because I think it will be complex, but just some ideas would be helpful.

What would be involved in calling ARPACK++ (a C++ library) from Haskell?

I've spent a couple of days developing a program in Haskell, while learning the language. Now I realize that I'll need to call Arpack (a Fortran library) or Arpack++ (a C++ wrapper to Arpack) -- I can't find a good implementation of Lanczos method with Haskell bindings. Do any more experienced Haskell programers have an opinion of how difficult this would be?
I've been able to get ".so" ("shared object") versions of libarpack and libarpack++ installed through Ubuntu's repository, but I'm not sure that will suffice. I suspect I'm going to ultimately need to build Arpack++ from source code, which is possible, but I'm getting a lot of build errors, so it will take time. Is there any way to use just the ".so" files, without knowing exactly which version of the header files were used to generate them?
I'm considering using GreenCard, because it looks like the most well maintained Haskell/C bridge. I can't find much documentation though, so I'm wondering whether it will support C++ too.
I'm also starting to wonder whether I should rewrite my program in Python, and use scipy to call Arpack, but I've already sunk a couple of days into writing Haskell. I really like Haskell too, so I'm hoping I can make this work. I guess my overall question is this: What would be involved in making this work with Haskell?
Thanks much.

ELF format is standard format of executables and shared libraries, so accessing the code in these compiled modules is only a matter of knowing function names. If I understand correctly, Fortran is interoperable with C. As a consequence, Fortran should be interoperable with any language which can use C bindings, including Haskell. FYI, you can find all names exported by a module (executable or shared object or simple object archive) using nm tool (it is usually available in all linux distros by default). This of course would work if the binary file was not "stripped", but AFAIK it is not common practice.
However, Haskell cannot use C++ bindings in sane way, since C++ polymorphic features require name mangling, and the method of this name transformation is highly compiler-dependent. It is well-known problem which is not specific to Haskell. Of course, you could try to get a list of exported symbols from C++ shared object and then bind them using FFI, but... It isn't worth it.
As dsign said, you can use Foreign Function Interface GHC feature to create bindings to foreign code. All you would require is library headers (and the library itself of course). In case of C language that would be header files (*.h), but since your library is written in Fortran, you have to find header files analogue in library sources, refere to this page to match Fortran and C types, and then use this information to write FFI bindings. It would be helpful first to write C bindings, i.e. write C header. Then you can even use automatic FFI binding programs like c2hs.
It maybe also helpful to look through C++ bindings. It is possible that it has the header file I've described above. If it has one, then writing FFI bindings will be no more difficult than writing them for any other library.
So, it is not entirely impossible, but it may require some thorough work. Writing bindings to scientific/pure computational libraries is way easier than writing them for some system library which does a lot of IO and keeps its own internal state, but since this library is written not in C... Well, it may be advisable to invest your time in easier alternatives. I cannot say anythin about scipy, I've never used it, but since Python as a language is much more simpler than Haskell, it may be good alternative.

I can tell you that using a C/Fortran library from Haskell, with the help of the Foreign Function Interface would be certainly possible and not terribly complicated. Here is an introduction. In my understanding, you should be able to call anything with a C calling convention, and perhaps even Fortran, without need of recompiling the code. The only exception is with things that look like function calls but are indeed macros, in which case you will have to figure out what the macros do and reproduce them in Haskell.
As of greencard, I have never used it, so I can not vouch for it.
Your second idea of using Python could potentially save you more than a couple of days. Sad as it is, I have never managed Haskell code to easily adapt to my changing requirements, while I find that trivial in Python. Of course, that could be a limitation on my skills with Haskell or my thinking process rather that something to blame to the language.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string