Why do MacOS apps store a lot of strings?

Why do MacOS apps store a lot of strings? - string

Why is it that in MacOS, my application has lots of strings in the executable. Like it's like a bunch of binary non-human-readable nonsense, then I see a bunch of function and variable names, type names, like NSString and other NS+something strings, a lot of OBJC+something IIRC, but why? Why does it store all that? Other than bloating the executable size?

I can't specifically answer this for MacOS binaries, but there are generally 2 reasons for a binary to have things like function names:
It allows someone to debug the program. Without these symbols, every variable, function, stacktrace would contain information that's not that useful.
It allows in some cases for other processes to call functions inside another binary. If the function name is not encoded, it's not possible for some other program to call that function by its name (because it doesn't have one).
It's often possible to create binaries stripped with debug symbols, which would reduce the binary size.

Related

How to obfuscate string of variable, function and package names in Golang binary?

When use command "nm go_binary", I find the names of variables, functions and packages and even the directory where my code is located are all displayed, is there any way to obfuscate the binary generated by the command "go build" and prevent go binary from being exploited by hackers?

Obfuscating can't stop reverse engineering but in a way prevent info leakage
That is what burrowers/garble (Go 1.16+, Feb. 2021):
Literal obfuscation
Using the -literals flag causes literal expressions such as strings to be replaced with more complex variants, resolving to the same value at run-time.
This feature is opt-in, as it can cause slow-downs depending on the input code.
Literal expressions used as constants cannot be obfuscated, since they are resolved at compile time. This includes any expressions part of a const declaration.
Tiny mode
When the -tiny flag is passed, extra information is stripped from the resulting Go binary.
This includes line numbers, filenames, and code in the runtime that prints panics, fatal errors, and trace/debug info.
All in all this can make binaries 2-5% smaller in our testing, as well as prevent extracting some more information.
With this flag, no panics or fatal runtime errors will ever be printed, but they can still be handled internally with recover as normal.
In addition, the GODEBUG environmental variable will be ignored.
But:
Exported methods are never obfuscated at the moment, since they could be required by interfaces and reflection. This area is a work in progress.

I think the best answer to this question is here How do I protect Python code?, specifically this answer.
While that question is about Python, it applies to all code in general.
I was gonna mark this question as a duplicate, but maybe someone will provide more insight into it.

Is it possible to extract constants and other predefined values from binary executables?

Let's say we have this program here
class Message{
public static SUPER_SECRET_STRING = "bar";
public static void Main(){
string SECRET = "foo";
Console.Write(sha(SUPER_SECRET_STRING) + "" + sha(SECRET));
}
}
Now, after building this program, is there any way using a hex editor or some other utility to extract the values "foo" and "bar" from the compiled binary file?
Also let's assume that memory editors are not allowed.
Is this applicable to all compiled languages like C++? What about ones that are run in another environment like Java or C#?

The answer from Mene is correct, but I wanted to put in my two cents to let you know how ridiculously easy it is to extract strings from compiled binaries (regardless of the language). If you have Linux, all you have to do is run the command strings <compiled binary> and you have the extracted strings. You don't have to be any sort of reverse engineer to pull this off. I just ran it against the eclipse binary on my Ubuntu machine and check out the (truncated) output:
> strings eclipse
ATSH
0[A\
8.uCH
The %s executable launcher was unable to locate its
companion shared library.
There was a problem loading the shared library and
finding the entry point.
setInitialArgs
-vmargs
-name
--launcher.library
--launcher.suppressErrors
--launcher.ini
eclipse
Notice how the string "The %s executable launcher was unable to locate its companion shared library. There was a problem loading the shared library and finding the entry point." appears in the output. This string is no doubt hard coded into the program.
When strings (and other data) are hard coded into a program, most compilers place them into a special section in the binary where they can be mapped directly into memory for access by the program as it needs them. If you were to open the binary with a hex editor, you could find this string easily.

Yes you could easily use a decompiler to extract those kinds of constants, especially strings (since they require a larger chunk of memory). This will even work in machine-code binaries and is even easier for VM-languages like Java and C#.
If you need to keep something secret in there you will need to go great lengths. Simply encrypting the string for example would add a layer of security, but for someone who knows what she does this won't be a big barrier. For example scanning the the file for places with uncommon entropy is likely to reveal the key which was used for encryption. There are even systems which encode secrets by altering the used low-level commands in the binary. Those tools replace certain combinations of commands with other equivalent commands. But even thous systems are not too hard to circumvent, as the uncommon combination of commands will reveal the use of such tools.
And even if you manage to protect the string by some kind of encryption in your binary, you will at some point require a decrypted version for your execution. Creating a memory-dump at a point in time where the string is used will thus also contain a copy of the secret value. This is especially problematic in Java as you cannot deallocate a chunk of memory and a string is immutable (meaning that a "change" to the string will lead to a new chunk of memory).
As you see the problem is far from trivial. And of course there is no way to give you 100% security (think of all the cracked games and so on).
Something that can be implemented in a secure way is using Public-key cryptography. In that case you will need to keep the private key hidden. That might be possible if you could for example send things to your server to encrypt them or you have hardware which provides a Trusted Platform Module. But those things might not be feasible for your case.

Verifying two different build architectures (one a re-write of the other) are functionally equivalent

I'm re-writing a build that produces a number of things (shared/static libraries, jars, executables, etc). The question came up whether there's a way to verify that the results are functionally equivalent without doing a full top-to-bottom test of the resulting software.
However, that is proving to be more difficult to do than I anticipated.
As an example, I expected that the md5 of two objects produced from the same source (sun studio C++ compiler) and command-line parameters would have the same md5 hash, but that isn't the case. I can build the file, rename it, build again, and they have different hashes.
With that said ... is there a way do a quick check to verify that two files produced from separate build architectures of the same source tree (eg, two shared objects) are functionally equivalent?
edit I am sorry, I neglected to mention this is for a debug build ... when debugging flags aren't used the binaries are identical, but they've been using debugging flags by default for so many years their stuff breaks when you remove the debugging flags (part of the reason I'm re-writing the build is to take that particular 'feature' out of the build so we can get some proper testing going)

Windows DLLs have a link timestamp (TimeDateStamp) as part of PE image.
Looking at linker options, I don't see an option to suppress that. So re-linking a DLL (or an EXE) will always produce a different binary.
You could write a tool to zero out these timestamps (always at a fixed offset from file start), and compare MD5s afterwards. But you'll likely discover lots of other differences as well. In particular, any program that uses __DATE__ or __TIME__ builtins will give you trouble.
We've had to work quite hard to achieve bit-identical rebuilds (using GNU toolchain). It's possible (at least for open-source tools, on Linux), but not easy (as you've discovered).

I forgot about this question; I'm revisiting so I can give the answer I came up with.
objcopy can be used to produce a new binary file in different formats. It's been a few years since I worked on this, so the specifics escape me, but here's what I recall:
objcopy can strip various things out (debug info, symbol information, etc), but even after stripping stuff out I was still seeing different hashes between objects.
In the end I found I could convert it from ELF to other formats. I ended up dumping it to another format (I think I chose SREC) that consistently provided the same MD5 for objects built at different times with identical source/flags.
I'm betting I could have done this a better way with objcopy (or perhaps another binutils tool), but it was good enough to satisfy our concerns.

Determine source language from a binary?

I responded to another question about developing for the iPhone in non-Objective-C languages, and I made the assertion that using, say, C# to write for the iPhone would strike an Apple reviewer wrong. I was speaking largely about UI elements differing between the ObjC and C# libraries in question, but a commenter made an interesting point, leading me to this question:
Is it possible to determine the language a program is written in, solely from its binary? If there are such methods, what are they?
Let's assume for the purposes of the question:
That from an interaction standpoint (console behavior, any GUI appearance, etc.) the two are identical.
That performance isn't a reliable indicator of language (no comparing, say, Java to C).
That you don't have an interpreter or something between you and the language - just raw executable binary.
Bonus points if you're language-agnostic as possible.

Short answer: YES
Long answer:
If you look at a binary, you can find the names of the libraries that have been linked in. Opening cmd.exe in TextPad easily finds the following at hex offset 0x270: msvcrt.dll, KERNEL32.dll, NTDLL.DLL, USER32.dll, etc. msvcrt is the Microsoft 'C' runtime support functions. KERNEL32, NTDLL, and USER32.dll are OS specific libraries which tell you either the target platform, or the platform on which it was built, depending on how well the cross-platform development environment segregates the two.
Setting aside those clues, most any c/c++ compiler will have to insert the names of the functions into the binary, there is a list of all functions (or entrypoints) stored in a table. C++ 'mangles' the function names to encode the arguments and their types to support overloaded methods. It is possible to obfuscate the function names but they would still exist. The functions signatures would include the number and types of the arguments which can be used to trace into the system or internal calls used in the program. At offset 0x4190 is "SetThreadUILanguage" which can be searched for to find out a lot about the development environment. I found the entry-point table at offset 0x1ED8A. I could easily see names like printf, exit, and scanf; along with __p__fmode, __p__commode, and __initenv
Any executable for the x86 processor will have a data segment which will contain any static text that was included in the program. Back to cmd.exe (offset 0x42C8) is the text "S.o.f.t.w.a.r.e..P.o.l.i.c.i.e.s..M.i.c.r.o.s.o.f.t..W.i.n.d.o.w.s..S.y.s.t.e.m.". The string takes twice as many characters as is normally necessary because it was stored using double-wide characters, probably for internationalization. Error codes or messages are a prime source here.
At offset B1B0 is "p.u.s.h.d" followed by mkdir, rmdir, chdir, md, rd, and cd; I left out the unprintable characters for readability. Those are all command arguments to cmd.exe.
For other programs, I've sometimes been able to find the path from which a program was compiled.
So, yes, it is possible to determine the source language from the binary.

I'm not a compiler hacker (someday, I hope), but I figure that you may be able to find telltale signs in a binary file that would indicate what compiler generated it and some of the compiler options used, such as the level of optimization specified.
Strictly speaking, however, what you're asking is impossible. It could be that somebody sat down with a pen and paper and worked out the binary codes corresponding to the program that they wanted to write, and then typed that stuff out in a hex editor. Basically, they'd be programming in assembly without the assembler tool. Similarly, you may never be able to tell with certainty whether a native binary was written in straight assembler or in C with inline assembly.
As for virtual machine environments such as JVM and .NET, you should be able to identify the VM by the byte codes in the binary executable, I would expect. However you may not be able to tell what the source language was, such as C# versus Visual Basic, unless there are particular compiler quirks that tip you off.

what about these tools:
PE Detective
PEiD
both are PE Identifiers. ok, they're both for windows but that's what it was when i landed here

I expect you could, if you disassemble the source, or at least you may know the compiler, as not all compilers will use the same code for printf for example, so Objective-C and gnu C should differ here.
You have excluded all byte-code languages so this issue is going to be less common than expected.

First, run what on some binaries and look at the output. CVS (and SVN) identifiers are scattered throughout the binary image. And most of those are from libraries.
Also, there's often a "map" to the various library functions. That's a big hint, also.
When the libraries are linked into the executable, there is often a map that's included in the binary file with names and offsets. It's part of creating "position independent code". You can't simply "hard-link" the various object files together. You need a map and you have to do some lookups when loading the binary into memory.
Finally, the start-up module for C, C++ (and I imagine C#) is unique to that compiler's defaiult set of libraries.

Well, C is initially converted the ASM, so you could write all C code in ASM.

No, the bytecode is language agnostic. Different compilers could even take the same code source and generate different binaries. That's why you don't see general purpose decompilers that will work on binaries.

The command 'strings' could be used to get some hints as to what language was used (for instance, I just ran it on the stripped binary for a C application I wrote and the first entries it finds are the libraries linked by the executable).

Is there a way to convert from a string to pure code in C++?

I know that its possible to read from a .txt file and then convert various parts of that into string, char, and int values, but is it possible to take a string and use it as real code in the program?
Code:
string codeblock1="cout<<This is a test;";
string codeblock2="int array[5]={0,6,6,3,5};}";
int i;
cin>>i;
if(i)
{
execute(codeblock1);
}
else
{
execute(codeblock2);
}
Where execute is a function that converts from text to actual code (I don't know if there actually is a function called execute, I'm using it for the purpose of my example).

In C++ there's no simple way to do this. This feature is available in higher-level languages like Python, Lisp, Ruby and Perl (usually with some variation of an eval function). However, even in these languages this practice is frowned upon, because it can result in very unreadable code.
It's important you ask yourself (and perhaps tell us) why you want to do it?
Or do you only want to know if it's possible? If so, it is, though in a hairy way. You can write a C++ source file (generate whatever you want into it, as long as it's valid C++), then compile it and link to your code. All of this can be done automatically, of course, as long as a compiler is available to you in runtime (and you just execute it with system). I know someone who did this for some heavy optimization once. It's not pretty, but can be made to work.

You can create a function and parse whatever strings you like and create a data structure from it. This is known as a parse tree. Subsequently you can examine your parse tree and generate the necessary dynamic structures to perform the logic therin. The parse tree is subsequently converted into a runtime representation that is executed.
All compilers do exactly this. They take your code and they produce machine code based on this. In your particular case you want a language to write code for itself. Normally this is done in the context of a code generator and it is part of a larger build process. If you write a program to parse your language (consider flex and bison for this operation) that generates code you can achieve the results you desire.

Many scripting languages offer this sort of feature, going all the way back to eval in LISP - but C and C++ don't expose the compiler at runtime.
There's nothing in the spec that stops you from creating and executing some arbitrary machine language, like so:
char code[] = { 0x2f, 0x3c, 0x17, 0x43 }; // some machine code of some sort
typedef void (FuncType*)(); // define a function pointer type
FuncType func = (FuncType)code; // take the address of the code
func(); // and jump to it!
but most environments will crash if you try this, for security reasons. (Many viruses work by convincing ordinary programs to do something like this.)
In a normal environment, one thing you could do is create a complete program as text, then invoke the compiler to compile it and invoke the resulting executable.
If you want to run code in your own memory space, you could invoke the compiler to build you a DLL (or .so, depending on your platform) and then link in the DLL and jump into it.

First, I wanted to say, that I never implemented something like that myself and I may be way off, however, did you try CodeDomProvider class in System.CodeDom.Compiler namespace? I have a feeling the classes in System.CodeDom can provide you with the functionality you are looking for.
Of course, it will all be .NET code, not any other platform
Go here for sample

Yes, you just have to build a compiler (and possibly a linker) and you're there.
Several languages such as Python can be embedded into C/C++ so that may be an option.

It's kind of sort of possible, but not with just straight C/C++. You'll need some layer underneath such as LLVM.
Check out c-repl and ccons

One way that you could do this is with Boost Python. You wouldn't be using C++ at that point, but it's a good way of allowing the user to use a scripting language to interact with the existing program. I know it's not exactly what you want, but perhaps it might help.

Sounds like you're trying to create "C++Script", which doesn't exist as far as I know. C++ is a compiled language. This means it always must be compiled to native bytecode before being executed. You could wrap the code as a function, run it through a compiler, then execute the resulting DLL dynamically, but you're not going to get access to anything a compiled DLL wouldn't normally get.
You'd be better off trying to do this in Java, JavaScript, VBScript, or .NET, which are at one stage or another interpreted languages. Most of these languages either have an eval or execute function for just that, or can just be included as text.
Of course executing blocks of code isn't the safest idea - it will leave you vulnerable to all kinds of data execution attacks.
My recommendation would be to create a scripting language that serves the purposes of your application. This would give the user a limited set of instructions for security reasons, and allow you to interact with the existing program much more dynamically than a compiled external block.

Not easily, because C++ is a compiled language. Several people have pointed round-about ways to make it work - either execute the compiler, or incorporate a compiler or interpreter into your program. If you want to go the interpreter route, you can save yourself a lot of work by using an existing open source project, such as Lua

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string