Haskell: unnecessary binary growth with module imports

When I import a (big) module into a Main module in one of the following ways:
import Mymodule
import qualified Mymodule as M
import Mymodule (MyDatatype)
the compiled binary grows by the same huge amount compared to when I don't import that module at all. This happens regardless of whether or not I use anything from that module in the Main module. Shouldn't the compiler (I am using GHC on Debian Testing) only add to the binary what is needed to run it?
In my specific case I have a huge Map in Mymodule which I don't use in the Main module. Selectively importing only what I really need did not change the growth of the compiled binary.

As far as GHC is concerned, import lists are only there for readability and avoiding name clashes; they don't affect what's linked in at all.
Also, even if you only import a few functions from a library, they might still depend on the bulk of the library internally, so you shouldn't necessarily expect a size decrease just from using part of the available interface.
By default, GHC links in entire libraries rather than only the pieces you use. You can avoid this by building libraries with the -split-objs option to GHC (or by putting split-objs: True in your cabal-install configuration file, ~/.cabal/config on Unix), but it slows down compilation and is seemingly not recommended by the GHC developers:
-split-objs
Tell the linker to split the single object file that would normally be generated into multiple object files, one per top-level Haskell function or type in the module. This only makes sense for libraries, where it means that executables linked against the library are smaller as they only link against the object files that they need. However, assembling all the sections separately is expensive, so this is slower than compiling normally. Additionally, the size of the library itself (the .a file) can be a factor of 2 to 2.5 larger. We use this feature for building GHC’s libraries.
— The GHC manual
This will omit unused parts of libraries you use, regardless of what you import.
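For reference, a minimal sketch of the two ways of enabling this that are mentioned above (nothing else needs to change):

-- in ~/.cabal/config, so that cabal-install builds libraries with split objects
split-objs: True

or, when building a library directly with GHC:

ghc -c -split-objs Mymodule.hs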
You might also be interested in using shared Haskell libraries.
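With shared Haskell libraries the executable refers to the libHS*.so files at run time instead of having the library code copied into it, so the binary stays small; a sketch, assuming your GHC installation provides shared versions of its libraries:

ghc --make -dynamic Main.hs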

Related

RcppEigen and package size

I am maintaining a package that uses RcppEigen. The package itself has a modest amount of code (roughly 1000 lines at the moment).
What I don't understand is that the file size of my library is very large: 14MB for <packagename>.so and 11MB for <packagename>.o.
I would imagine that the package would link dynamically to the RcppEigen libraries (thus keeping the binaries of my package relatively small), but my guess is that it instead links the libraries statically into my .o and .so files.
Am I correct that this is what happens?
Can I/should I avoid this?
If so, how?
I see in the RcppEigen.package.skeleton documentation that NAMESPACE should include "a useDynLib directive"; it is indeed present in my NAMESPACE file.
(On a side note, when I submit to CRAN the large package size is NOTEd, but has not been cause for rejection.)
This is expected behavior. I have not checked, but I expect that the majority of packages using RcppEigen (or RcppArmadillo) get this NOTE. That's because Eigen (and Armadillo) is a header-only library, i.e. it is not dynamically linked. Instead the respective function is compiled into each *.o file. This is potentially even worse than static linking: If a function is used in multiple compilation units, it will end up in multiple *.o files, leading to multiple versions of the same function in the *.so. That is the price we all have to pay for the convenience of header-only libraries. Getting dynamic (or static) linking correct can be really difficult, in particular on Windows.
Concerning the useDynLib: If you look into the NAMESPACE file in your package, you should see a line like useDynLib(<packagename> [...]). That tells R to load the dynamic library associated with your package and is required for any R package using compiled code.

Interpose statically linked binaries

There's a well-known technique for interposing dynamically linked binaries: creating a shared library and using the LD_PRELOAD variable. But it doesn't work for statically linked binaries.
One way is to write a static library that interposes the functions and link it with the application at compile time. But this isn't practical because re-compiling isn't always possible (think of third-party binaries, libraries, etc.).
So I am wondering if there's a way to interpose statically linked binaries in the same way LD_PRELOAD works for dynamically linked binaries, i.e., with no code changes or re-compilation of existing binaries.
I am only interested in ELF on Linux. So it's not an issue if a potential solution is not "portable".
"One way is to write a static library that interposes the functions and link it with the application at compile time."
One difficulty with such an interposer is that it can't easily call the original function (since it has the same name). The linker's --wrap=<symbol> option can help here.
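With --wrap, any undefined reference to <symbol> is resolved to __wrap_<symbol>, and the wrapper can still reach the original through __real_<symbol>. A sketch of the re-link step, using malloc purely as an example symbol (wrapper.o is a hypothetical object file defining __wrap_malloc):

gcc main.o wrapper.o -Wl,--wrap=malloc -o prog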
"But this isn't practical because re-compiling"
Re-compiling is not necessary here, only re-linking.
"isn't always possible (think of third-party binaries, libraries, etc)."
Third-party libraries work fine (you can re-link against them), but binaries are trickier. It is still possible using the displaced execution technique, but the implementation is quite tricky to get right.
I'll assume you want to interpose symbols in the main executable that came from a static library, which is equivalent to interposing a symbol defined in the executable itself. The question thus reduces to whether it's possible to intercept a function defined in the executable.
This is not possible (EDIT: at least not without a lot of work - see comments to this answer) for two reasons:
by default, symbols defined in the executable are not exported and so are not accessible to the dynamic linker (you can alter this via -export-dynamic or export lists, but this has unpleasant performance or maintenance side effects)
even if you export the necessary symbols, ELF requires the executable's dynamic symtab to always be searched first during symbol resolution (see section 1.5.4 "Lookup Scope" in dsohowto); the symtab of an LD_PRELOAD-ed library will always come after that of the executable and thus won't get a chance to intercept the symbols
What you are looking for is called binary instrumentation (e.g., using Dyninst or ptrace). The idea is you write a mutator program that attaches to (or statically rewrites) your original program (called mutatee) and inserts code of your choice at specific points in the mutatee. The main challenge usually revolves around finding those insertion points using the API provided by the instrumentation engine. In your case, since you are mainly looking for static symbols, this can be quite challenging and would likely require heuristics if the mutatee is stripped of non-dynamic symbols.

Does GHC optimize away unused code and packages?

Let's say a big package is included in a project and only one function from the package is used; is the rest of the code optimized away when compiling the final binary?
And if a package is included but in the end never used (for example, another library imports a type from it, and that other library is itself never used), is the whole package stripped?

Within a project, can I compile a module and interactively load the compiled module within ghci?

Typically in a Haskell project, I either work interactively with ghci or compile the entire project with cabal build.
However, in some use cases, I may have a computationally intensive routine along with some higher level scripting functionality, say for picking inputs to an analysis algorithm.
Is it possible to use GHCi + GHC such that I compile the computationally intensive module and then load the compiled code into GHCi to re-run it with different inputs?
Yes, you can load compiled modules in ghci; if there is an appropriately named .hi and .o file, ghci will use those instead of interpreting the code in the corresponding .hs file. You will then only have access to the operations that are exported from that module.
In case you find yourself using a compiled loaded module when you wanted the interpreted one, you can :load *foo.hs to instruct ghci to ignore the compiled version and interpret foo.hs.
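A minimal sketch of the workflow, with made-up module and function names:

-- Expensive.hs
module Expensive (heavyComputation) where

-- stand-in for the computationally intensive routine
heavyComputation :: Int -> Integer
heavyComputation n = sum (map fromIntegral [1..n])

Compiling it with ghc -c -O2 Expensive.hs produces Expensive.hi and Expensive.o; then, inside ghci:

:load Expensive            -- uses the compiled .hi and .o if they are up to date
heavyComputation 10000000
:load *Expensive.hs        -- switch back to the interpreted version

Note that with a dynamically linked ghci (the default on many Linux installations) the object file may also need to be built with -dynamic for ghci to use it; otherwise ghci falls back to interpreting the source.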

How do I statically compile a C library into a Haskell module that I can later load with the GHC API?

Here is my desired use case:
I have a package with a single module that reads HDF5 files and writes some of their data to Haskell records. To do the work, the library uses the bindings-hdf5 package. Here is the build-depends section from my cabal file; reader-types is a module I wrote that defines the types of the Haskell records that contain the read-in data.
build-depends: base >=4.7 && <4.8
, text
, vector
, containers
, bindings-hdf5
, reader-types
Note that my cabal file does not currently use extra-libraries or ghc-options. I can load my module, src/Mabel.hs, in ghci as long as I specify the required hdf5_hl library:
ghci src/Mabel.hs -lhdf5_hl -L/long/nixos/path/lib
and within ghci, I can run my function perfectly fine.
Now, what I want to do is compile this library/module into a single, compiled file that I can later load with the GHC API in a different Haskell program. By single file, I mean that it needs to run even if the hdf5_hl library does not exist on the system. Preferably, it would also run even if text, vector, and/or containers are missing, but this is not essential because reader-types requires those types anyway. When loading the module with the GHC API, I want it to load in already compiled form, and not run interpreted.
My purpose for doing this is that I want the self-contained file to act as a single, pre-compiled plugin file that is later loaded and executed by a different Haskell executable. Other plugins might not use hdf5 at all, and the only package they are guaranteed to use is reader-types, which essentially defines the plugin interface types.
The hdf5 library on my system contains the following files: libhdf5_la.la, libhdf5_hl.so, libhdf5.la, libhdf5.so, and similar files that have the version number in the file name.
I have done a lot of googling, but am getting confused by all the edge cases I am finding. Here are some examples that I'm either sure don't fit my case, or I can't tell.
I do not want to compile a Haskell library to use from C or Python, only a Haskell program using GHC API.
I do not want to compile C wrappers for a C++ library into a Haskell module because the bindings already exist and the library is already a C library.
I do not want to compile a library that is entirely self-contained because, since I am loading it with the GHC API, I don't need the GHC runtime included in the library. (My understanding is that plugins must be compiled with the same GHC version that will later load them via the GHC API.)
I do not want to compile C bindings and the C library at the same time because the C library is already compiled and the bindings are specified in separate package (bindings-hdf5).
The closest resource for what I want to do is this exchange on the mailing list from 2009. However, I added extra-libraries: hdf5_hl or extra-libraries: hdf5 to my cabal file, and in both cases the resulting .a, .so, .dyn_hi, .dyn_o, .hi, and .o files in dist/build are all the exact same size as without using extra-libraries, so I'm confident it is not working correctly.
What changes to my cabal file do I need to make to create a self-contained, standalone file that I can later load with the GHC API? If this is not possible, what are the alternatives?
Instead of using the GHC API, I am also open to using the plugins library to load the plugin, but the self-contained requirements are still the same.
EDIT: I do not care what form the compiled "plugin" must take (I assume an object file is the right way), but I want to load it dynamically from a separate executable at run time and execute functions it defines with known names and known types. The reason I want a single file is that there will eventually be other, different plugins, and I want them all to behave the same way without having to worry about lib paths and dependencies for each one. A compiled, single file is a simpler interface for doing this than zipping/unzipping archives that include Haskell object code and their dependencies.
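I imagine the loading side would look roughly like this with the plugins package (Plugin.o and the exported symbol name "plugin" are made up; I have not verified the exact interface of load):

{-# LANGUAGE ScopedTypeVariables #-}
import System.Plugins (LoadStatus (..), load)

main :: IO ()
main = do
  -- load: object file, include paths, package.conf files, symbol to look up
  status <- load "Plugin.o" ["."] [] "plugin"
  case status of
    LoadFailure errs                -> mapM_ putStrLn errs
    LoadSuccess _ (f :: Int -> Int) -> print (f 3)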
