What is the fastest way to get just the preprocessed source code with MSVC? - visual-c++

I'm trying to find the fastest way to get the complete preprocessed source code (I don't need #line information other comments, just the raw source code) for a C source file.
I have the following little test program which just includes the Windows header file (mini.c):
#include <windows.h>
Using Microsoft Visual Studio 2005, I then run this command:
cl /nologo /P mini.c
This takes 6 seconds to generate a 2.5MB mini.i file; changing this to
cl /nologo /EP mini.c > mini.i
(which skips comments and #line information) needs just 0.5 seconds to write 2.1MB of output.
Is anybody aware of good techniques for improving this even further, without using precompiled headers?
The reason I'm asking is that I wrote a variant of the popular ccache tool for MSVC. As part of the work, the program needs to compute a hash sum of the preprocessed source code (and a few other things). I'd like to make this as fast as possible.
Maybe there is a dedicated preprocessor binary available, or other command line switches which might help?
UPDATE: One idea which just came to my mind: define the WIN32_LEAN_AND_MEAN macro to strip out lots of rarely needed code. This speeds the above preprocessor run up by a factor of approx 3x.

You're repeatedly processing the same source file (<windows.h>) which in turn pulls in a lot of other files. This <windows.h> is located in the default SDK directory.
Now, processing this unchanging file takes serious time. Yet you can rely on it not changing - it's part of the public interface, after all. Hence you could preprocess it - strip out comments, for instance - and pass that version to cl /EP.
Of course, this is typically an I/O bound task, but with a significant CPU part intermixed. An approach which processes multiple sources in parallel will help the total throughput. Measuring single source preprocesing times isn't too relevant.
Finally, measure the time to write the output to NUL. You shouldn't be including the time needed to write mini.i to disk, since you'll intend to pipe the output to md5sum.

There's a preprocess to file project setting that you can use. In addition, some of the newer MSVC versions offer multithreaded compilation.

The /P option has been there for years, it creates the .i file, and no object file.

Related

How to get an ETA?

I am build several large set of source files (targets) using scons. Now, I would like to know if there is a metric I can use to show me:
How many targets remain to be build.
How long it will take -- this to be honest, this is probably a no-go as it is really hard to tell!
How can I do that in scons?
There is currently no progress indicator built into SCons, and it's also not trivial to provide one. The problem is, that SCons doesn't build the complete DAG first, and then starts the build...such that you'd have a total number of targets to visit that you could use as a reference (=100%).
Instead, it makes up the DAG on the go... It looks at each target, and then expands the list of its children (sources and implicit dependencies like headers) to check whether they are up-to-date. If a child has changed, it gets rebuilt by applying the same "build step" recursively.
In this way, SCons crawls itself from the list of targets, as given on the command-line (with the "." dir being the default), down the DAG...where only the parts are ever visited, that are required for (or, in other words: have a dependency to) the requested targets.
This makes it possible for SCons to handle things like "header files, generated by a program that must be compiled first" in the first go...but it also means that the total number of targets/children to get visited changes constantly.
So, a standard progress indicator would continuously climb towards the 80%-90%, just to then fall back to 50%...and I don't think this would give you the information you're really after.
Tip: If your builds are large and you don't want to wait, do incremental builds and only build the library/program you're currently doing work on ("scons lib1"). This will still take into account all dependencies, but only a fraction of the DAG has to get expanded. So, you use less memory and get faster update times...especially if you use the "interactive" mode. In a project with a 100000 C files total, the update of a single library with 500 C files takes about 1s on my machine. For more infos on this topic check out http://scons.org/wiki/WhySconsIsNotSlow .

Minimizing IOPS with Fortran

I have a Fortran program that writes out a large amount of ASCII data, one line at-a-time, and there is some concern from system admins (and evidence from my runs) that this is adversely affecting system performance. I/O generally works better for fewer big writes than many small writes. So, I'd like to get the program to minimize the number of IOPS by writing out bigger chunks of data without changing the format of the output file (this is a large set of software with lots of related software depending on assumed file formats). I had thought turning a loop like this:
nwrite=100000000 !total number of lines to write
do cnt=1,nwrite
write(11,'(i22,3x,f16.14)')cnt,numar(cnt)
enddo
into a loop like this:
nwrite=100000000
nblock=10000 !number of lines to write in each block
do cnt=1,nwrite/nblock
write(11,'(i22,3x,f16.14)')(nblock*(cnt-1)+j,numar(nblock*(cnt-1)+j),j=1,nblock)
enddo
would do the trick. But I made two small scripts doing the above and they didn't show any real difference in run time. It's a fairly major time commitment to make the change in the actual code, so I'd like to be fairly sure before committing to an approach. I haven't completely unrolled the loop into a single write command because that might not work well for my current problem, though approaches that do this are also welcome.
Can anyone confirm whether the above code would reduce the actual number of write commands or what else might achieve what I'm looking for? Thanks in advance.
Based on input from other users and reading more, Fortran leaves control over this to the compiler, therefore it is compiler-dependent. Buffered writes are the default behavior for the Portland Group Fortran compiler, and it looks to be the same for GFortran. Intel does not buffer files by default. For the Intel compiler, adding the option -assume buffered_io will make file I/O buffered by default.

Is a core dump executable by itself?

The Wikipedia page on Core dump says
In Unix-like systems, core dumps generally use the standard executable
image-format:
a.out in older versions of Unix,
ELF in modern Linux, System V, Solaris, and BSD systems,
Mach-O in OS X, etc.
Does this mean a core dump is executable by itself? If not, why not?
Edit: Since #WumpusQ.Wumbley mentions a coredump_filter in a comment, perhaps the above question should be: can a core dump be produced such that it is executable by itself?
In older unix variants it was the default to include the text as well as data in the core dump but it was also given in the a.out format and not ELF. Today's default behavior (in Linux for sure, not 100% sure about BSD variants, Solaris etc.) is to have the core dump in ELF format without the text sections but that behavior can be changed.
However, a core dump cannot be executed directly in any case without some help. The reason for that is that there are two things missing from a simple core file. One is the entry point, the other is code to restore the CPU state to the state at or just before the dump occurred (by default also the text sections are missing).
In AIX there used to be a utility called undump but I have no idea what happened to it. It doesn't exist in any standard Linux distribution I know of. As mentioned above (#WumpusQ) there's also an attempt at a similar project for Linux mentioned in above comments, however this project is not complete and doesn't restore the CPU state to the original state. It is, however, still good enough in some specific debugging cases.
It is also worth mentioning that there exist other ELF formatted files that cannot be executes as well which are not core files. Such as object files (compiler output) and .so (shared object) files. Those require a linking stage before being run to resolve external addresses.
I emailed this question the creator of the undump utility for his expertise, and got the following reply:
As mentioned in some of the answers there, it is possible to include
the code sections by setting the coredump_filter, but it's not the
default for Linux (and I'm not entirely sure about BSD variants and
Solaris). If the various code sections are saved in the original
core-dump, there is really nothing missing in order to create the new
executable. It does, however, require some changes in the original
core file (such as including an entry point and pointing that entry
point to code that will restore CPU registers). If the core file is
modified in this way it will become an executable and you'll be able
to run it. Unfortunately, though, some of the states are not going to
be saved so the new executable will not be able to run directly. Open
files, sockets, pips, etc are not going to be open and may even point
to other FDs (which could cause all sorts of weird things). However,
it will most probably be enough for most debugging tasks such running
small functions from gdb (so that you don't get a "not running an
executable" stuff).
As other guys said, I don't think you can execute a core dump file without the original binary.
In case you're interested to debug the binary (and it has debugging symbols included, in other words it is not stripped) then you can run gdb binary core.
Inside gdb you can use bt command (backtrace) to get the stack trace when the application crashed.

How to speed up compilation time in linux

While compiling under linux I use flag -j16 as i have 16 cores. I am just wondering if it makes any sense to use sth like -j32. Actually this is a quesiton about scheduling of processor time and if it is possible to put more pressure on particular process than any other this way (let say i have like to pararell compilations each with -j16 and what if one would be -j32?).
I think it does not make much sense but I am not sure as do not know how kernel solves such things.
Kind regards,
I use a non-recursive build system based on GNU make and I was wondering how well it scales.
I ran benchmarks on a 6-core Intel CPU with hyper-threading. I measured compile times using -j1 to -j20. For each -j option make ran three times and the shortest time was recorded. Using -j9 results in shortest compile time, 11% better than -j6.
In other words, hyper-threading does help a little, and an optimal formula for Intel processors with hyper-threading is number_of_cores * 1.5:
Chart data is here.
The rule of thumb is to use the number of processors+1. Hyper-Thready counts, so a quad core CPU with HT should have -j9
Setting the value too high is counter-productive, if you do want to speed up compile times consider ccache to cache compiled objects that do not change in each compilation, and distcc to distribute the compilation across several machines.
We have a machine in our shop with the following characteristics:
256 core sparc solaris
~64gb RAM
Some of that memory used for a ram drive for /tmp
Back when it was originally setup, before other users discovered its existence, I ran some timing tests to see how far I could push it. The build in question is non-recursive, so all jobs are kicked off from a single make process. I also cloned my repo into /tmp to take advantage of the ram drive.
I saw improvements up to -j56. Beyond that my results flat lined much like Maxim's graph, until somewhere above (roughly) -j75 where performance began to degrade. Running multiple parallel builds I could push it beyond the apparent cap of -j56.
The primary make process is single-threaded; after running some tests I realized the ceiling I was hitting had to do with how many child processes the primary thread could service -- which was further hampered by anything in the makefiles that either required extra time to parse (eg., using = instead of := to avoid unnecessary delayed evaluation, complex user defined macros, etc) or used things like $(shell).
These are the things I've been able to do to speed up builds that have a noticeable impact:
Use := wherever possible
If you assign to a variable once with :=, then later with +=, it'll continue to use immediate evaluation. However, ?= and +=, when a variable hasn't been assigned previously, will always delay evaluation.
Delayed evaluation doesn't seem like a big deal until you have a large enough build. If a variable (like CFLAGS) doesn't change after all the makefiles have been parsed, then you probably don't want to use delayed evaluation on it (and if you do, you probably already know enough about what I'm talking about anyway to ignore my advice).
If you create macros you execute with the $(call) facility, try to do as much of the evaluation ahead of time as possible
I once got it in my head to create macros of the form:
IFLINUX = $(strip $(if $(filter Linux,$(shell uname)),$(1),$(2)))
IFCLANG = $(strip $(if $(filter-out undefined,$(origin CLANG_BUILD)),$(1),$(2)))
...
# an example of how I might have made the worst use of it
CXXFLAGS = ${whatever flags} $(call IFCLANG,-fsanitize=undefined)
This build produces over 10,000 object files, about 8,000 of which are from C++ code. Had I used CXXFLAGS := (...), it would only need to immediately replace ${CXXFLAGS} in all of the compile steps with the already evaluated text. Instead it must re-evaluate the text of that variable once for each compile step.
An alternative implementation that can at least help mitigate some of the re-evaluation if you have no choice:
ifneq 'undefined' '$(origin CLANG_BUILD)'
IFCLANG = $(strip $(1))
else
IFCLANG = $(strip $(2))
endif
... though that only helps avoid the repeated $(origin) and $(if) calls; you'd still have to follow the advice about using := wherever possible.
Where possible, avoid using custom macros inside recipes
The reasoning should be pretty obvious here after the above; anything that requires a variable or macro to be repeatedly evaluated for every compile/link step will degrade your build speed. Every macro/variable evaluation occurs in the same thread as what kicks off new jobs, so any time spent parsing is time make delays kicking off another parallel job.
I put some recipes in custom macros whenever it promotes code re-use and/or improves readability, but I try to keep it to a minimum.

Profiling partial programs in Linux

I have a program in which significant amount of time is spent loading and saving data. Now I want to know how much time each function is taking in terms of percentage of the total running time. However, I want to exclude the time taken by loading and saving functions from the total time considered by the profiler. Is there any way to do so using gprof or any other popular profiler?
Similarly you can use
valgrind --tool=callgrind --collect-atstart=no --toggle-collect=<function>
Other options to look at:
--instr-atstart # to avoid runtime overhead while not profiling
To get instructionlevel stats:
--collect-jumps=yes
--dump-instr=yes
Alternatively you can 'remote control' it on the fly: callgrind_control or annotate your source code (IIRC also with branch predictions stats): callgrind_annotate.
The excellent tool kcachegrind is a marvellous visualization/navigation tool. I can hardly recommend it enough:
I would consider using something more modern than gprof, such as OProfile. When generating a report using opreport you can use the --exclude-symbols option to exclude functions you are not interested in.
See the OProfile webpage for more details; however for a quick start guide see the OProfile docs page.
Zoom from RotateRight offers a system-wide time profile for Linux. If your code spends a lot of time in i/o, then that time won't show up in a time profile of the CPUs. Alternatively, if you want to account for time spent in i/o, try the "thread time profile".
for a simple, basic solution, you might want log data to a csv file.
e.g. Format [functionKey,timeStamp\n]
... then load that up in Excel. Get the deltas, and then include or exclude based on if functions. Nothing fancy. On the upside, you could get some visualisations fairly cheaply.

Resources