garbage collection for specific circumstances - garbage-collection

I'm working through the basics of garbage collection and the different algorithms (plus their pros and cons, etc.). I'm trying to determine the best garbage collection algorithm to use for different scenarios,
such as: everything on the heap is the same size, everything is small with a short lifespan, everything is large with a longer lifespan.
- If everything is the same size, heap fragmentation isn't an issue, and I wouldn't have to worry about compaction. So maybe reference counting?
- Small objects with a short lifespan?
- Large objects with a longer lifespan? (Possibly generational, because of the lifespan.)
I'm looking at: reference counting, Mark & Sweep, Stop & Copy, and Generational.
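To make the same-size case concrete, here is a minimal sketch of why equal-sized cells make fragmentation a non-issue: any freed cell can satisfy any later request, so a plain free list is enough. The class name `FixedSizePool` and its methods are hypothetical, chosen only for illustration; this is a toy model, not a real allocator.

```python
class FixedSizePool:
    """Toy heap of equal-sized cells managed with a free list (illustrative only)."""

    def __init__(self, num_cells):
        # Every cell index starts out free; any free cell fits any request.
        self.free_list = list(range(num_cells))
        self.cells = [None] * num_cells

    def allocate(self, value):
        if not self.free_list:
            raise MemoryError("pool exhausted")
        index = self.free_list.pop()
        self.cells[index] = value
        return index

    def free(self, index):
        self.cells[index] = None
        self.free_list.append(index)


pool = FixedSizePool(4)
handles = [pool.allocate(f"obj{i}") for i in range(4)]

# Free two non-adjacent cells, leaving "holes" in the heap.
pool.free(handles[0])
pool.free(handles[2])

# Both holes are immediately reusable, so no compaction step is needed.
a = pool.allocate("new-a")
b = pool.allocate("new-b")
```

Whether the cells are found to be dead by reference counting or by a mark phase is a separate choice; the sketch only shows that reclaiming them never requires moving objects.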

Paul Wilson's paper, "Uniprocessor Garbage Collection Techniques" is a very handy survey of garbage collection algorithms. It's a few years old, but most of what he covers is still relevant today. And, he includes information on performance, and so on. Just remember that CPU instructions aren't as expensive as they were 20 years ago. ;)
http://www.cse.nd.edu/~dthain/courses/cse40243/spring2006/gc-survey.pdf

Related

What's the size of Garbage Collector implementations in common VMs and what can be learned from it?

Looking at Java/OpenJDK, it seems that every “new” garbage collection implementation is roughly one order of magnitude larger than the preceding one.
What are the sizes of Garbage Collector implementations in other runtimes like the CLR?
Can a non-trivial improvement in garbage collection be equated with a sharp increase in the size/complexity of the implementation?
Going further, can this observation be made about Garbage Collector implementations in general or are there certain design decisions (in Java or elsewhere) which especially foster or mandate these size increases?
Really interesting question... really broad but I'll try my best to give decent input.
Going further, can this observation be made about Garbage Collector implementations in general or are there certain design decisions (in Java or elsewhere) which especially foster or mandate these size increases?
Well, Java's garbage collector initially didn't support generations, so adding this feature made it grow in size. Another thing that adds to the size/complexity of the JVM garbage collector is its configuration. The user can tweak the GC in a number of ways, which increases the complexity. See this doc if you really want to know all the tunable features of the JVM garbage collector: http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html
This Stack Overflow answer goes into this in a bit more depth: https://stackoverflow.com/a/492821/25981
As to comparing size vs features...
Here is a very simple garbage collector for C:
https://code.google.com/p/dpgc/
It has very few features and even requires the user to mark blocks as references are shared. Its size is very small, weighing in at one C file and one header file.
Compare this to a fully featured GC as used in the .NET framework. Below I've included a bunch of talks with the two architects of the .NET garbage collector.
http://channel9.msdn.com/Tags/garbage+collector
Specifically, in this talk: http://channel9.msdn.com/Shows/Going+Deep/Patrick-Dussud-Garbage-Collection-Past-Present-and-Future they discuss the evolution of the .NET GC both in terms of features and complexity (which is related to lines of code).

Do any array based, bounded, wait free stacks exist?

I have a problem that requires me to use a highly concurrent, wait-free implementation of a stack. I have to preallocate all memory in advance (no garbage collection or mallocs), and it is acceptable that the stack is bounded in size (push may return false if the stack is full).
I am familiar with the stack implementation from Nir Shavit: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.156.8728 ... But that relies on linked lists and garbage collection. I need something that is array based.
It looks like there is one on ACM here: http://dl.acm.org/citation.cfm?id=1532611 though I am sceptical about the low download rate and citations.
The ideal answer would be reference code (in C/C++) I could simply steal :-)
Have you looked at the Windows DDI functions InterlockedPushEntrySList() and InterlockedPopEntrySList()? They are not array based, but they are lock free and use processor atomic instructions to add and remove items from the stack. They are not wait free, but maybe they can be useful to you...

For real time programming, does reference counting have an advantage over garbage collection in terms of determinism?

If you were designing a programming language that features automatic memory management, would using reference counting allow for determinism guarantees that are not possible with a garbage collector?
Would there be a different answer to this question for functional vs. imperative languages?
Would using reference counting allow for determinism guarantees that are not possible with a garbage collector?
The word guarantee is a strong one. Here are the guarantees you can provide with reference counting:
Constant time overhead at an assignment to adjust reference counts.
Constant time to free an object whose reference count goes to zero. (The key is that you must not decrement that object's children right away; instead you must do it lazily when the object is used to satisfy a future allocation request.)
Constant time to allocate a new object when the relevant free list is not empty. This guarantee is conditional and isn't worth much.
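The lazy-decrement idea in the second guarantee can be sketched as follows. This is a toy model under stated assumptions: the names `Obj`, `LazyRCHeap`, `release` and `allocate` are hypothetical, and "constant time" here assumes the number of children per object is bounded.

```python
class Obj:
    def __init__(self, name, children=()):
        self.name = name
        self.refcount = 0
        self.children = list(children)
        for child in self.children:
            child.refcount += 1


class LazyRCHeap:
    def __init__(self):
        self.free_list = []   # dead objects whose children are not yet released
        self.reclaimed = []   # names in reclaim order, kept only for illustration

    def release(self, obj):
        """Drop one reference. Children are NOT touched here, so this is O(1)."""
        obj.refcount -= 1
        if obj.refcount == 0:
            self.free_list.append(obj)

    def allocate(self, name):
        """Reuse one dead cell, releasing only that one cell's children."""
        if self.free_list:
            dead = self.free_list.pop()
            for child in dead.children:   # bounded by the object's fan-out
                self.release(child)
            dead.children = []
            self.reclaimed.append(dead.name)
        # A real allocator would reuse the dead cell's memory; this toy
        # simply creates a fresh Python object instead.
        return Obj(name)


# Chain a -> b -> c, with one external (root) reference to a.
c = Obj("c")
b = Obj("b", [c])
a = Obj("a", [b])
a.refcount = 1

heap = LazyRCHeap()
heap.release(a)                      # a is dead, but b and c are untouched
b_count_after_release = b.refcount   # still 1
heap.allocate("x")                   # reclaims a, which only now releases b
```

The cost of tearing down the chain is spread over later allocations, one dead object per allocation, instead of being paid all at once when the root reference is dropped.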
Here are some things you can't guarantee with reference counting:
Constant time to allocate a new object. (In the worst case, the heap may be growing, and depending on the system the delay to organize new memory may be considerable. Or even worse, you may fill the heap and be unable to allocate.)
All unreachable objects are reclaimed and reused while maintaining constant time for other operations. (A standard reference counter can't collect cyclic garbage. There are a variety of ingenious workarounds, but generally they invalidate constant-time guarantees for simple operations.)
There are now some real-time garbage collectors that provide pretty interesting guarantees about pause times, and in the last 5 years there have been pretty interesting developments in both reference counting and garbage collection. From where I sit as an informed outsider, there's no obvious winner.
Some of the best recent work on reference counting is by David Bacon of IBM and by Erez Petrank of Technion. If you want to learn what a sophisticated, modern reference-counting system can do, look up their papers. Among other things, they are using multiple processors in amazing ways.
For information about memory management and real-time guarantees more generally, check out the International Symposium on Memory Management.
Would there be a different answer to this question for functional vs. imperative languages?
Because you asked about guarantees, no. But for memory management in general, the performance tradeoffs are quite different for an imperative language (lots of mutation but low allocation rates), an impure functional language (hardly any mutation but high allocation rates), and a pure, lazy functional language (lots of mutation, with all those thunks being updated, and high allocation rates).
would using reference counting allow for determinism guarantees that are not possible with a garbage collector?
I don't see how. The process of lowering the reference count of an object is not time-bounded, as that object may be the single root of an arbitrarily large object graph.
The only way to approach the problem of GC for real-time systems is by using either a concurrent collector or an incremental one, regardless of whether the system uses reference counting. In my opinion, your distinction between reference counting and "collection" is not precise anyway: systems that use reference counting might still occasionally perform some memory sweep (for example, to handle cycles).
You might be interested in IBM's Metronome, and I also know Microsoft has done some research in the direction of good, real-time memory management.
If you look at the RTSJ spec (JSR-1), you'll see they did an end-run around the problem by providing for no-heap realtime threads. By having a separate category of thread that isn't allowed to touch any object that might require the thread to be stopped for garbage collection, JSR-1 sidestepped the issue. There aren't many RTSJ implementations right now, but the area of realtime garbage collection is a hot topic in that community.
For real time programming, does reference counting have an advantage over garbage collection in terms of determinism?
Yes. The main advantage of reference counting is simplicity.
If you were designing a programming language that features automatic memory management, would using reference counting allow for determinism guarantees that are not possible with a garbage collector?
A GC like Baker's Treadmill should attain the same level of guarantees regarding determinism that reference counting offers.
Would there be a different answer to this question for functional vs. imperative languages?
Yes. Reference counting alone does not handle cycles. Some functional languages make it impossible to create cycles by design (e.g. Erlang and Mathematica) so they trivially permit reference counting alone as an exact approach to GC.
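The cycle problem is easy to observe in CPython, which combines reference counting with a separate cycle detector. The sketch below assumes CPython specifically (other Python implementations manage memory differently); the `Node` class is hypothetical.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.other = None


gc.disable()                    # leave only reference counting active
try:
    a, b = Node(), Node()
    a.other, b.other = b, a     # a and b now reference each other
    probe = weakref.ref(a)      # lets us observe whether a was reclaimed
    del a, b                    # no external references remain
    alive_after_refcounting = probe() is not None
    gc.collect()                # run the tracing cycle detector explicitly
    alive_after_tracing = probe() is not None
finally:
    gc.enable()
```

With reference counting alone the pair stays alive, because each object keeps the other's count at one; only the tracing pass reclaims it. A language that cannot build such cycles never hits this case.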
In real-time programming, garbage collection can be harmful because you don't know when the garbage collector will run... so yes, reference counting is definitely better in this context.
As a side note, in a real-time system usually only some parts need real-time processing, so you could avoid garbage collection in just the sensitive components. A real-world example is a C# program running on a Windows CE target.
From some involvement in various projects migrating significant chunks of code from C++ (with various smart pointer classes, including reference counted) to garbage collected Java/C#, I observe that the biggest pain-points all seem to be related to classes with non-empty destructors (particularly when used for RAII). This is a pretty big flag that deterministic cleanup is expected.
The issue is surely much the same for any language with objects; I don't think hybrid OO-functional languages like Scala or OCaml enjoy any particular advantages in this area. The situation might be different for more "pure" functional languages.

Cobol: science and fiction [closed]

There are a few threads about the relevance of the Cobol programming language on this forum, e.g. this thread links to a collection of them. What I am interested in here is a frequently repeated claim based on a study by Gartner from 1997: that there were around 200 billion lines of code in active use at that time!
I would like to ask some questions to verify or falsify a couple of related points. My goal is to understand if this statement has any truth to it or if it is totally unrealistic.
I apologize in advance for being a little verbose in presenting my line of thought and my own opinion on the things I am not sure about, but I think it might help to put things in context and thus highlight any wrong assumptions and conclusions I have made.
Sometimes, the "200 billion lines" number is accompanied by the added claim that this corresponded to 80% of all programming code in any language in active use. Other times, the 80% merely refers to so-called "business code" (or some other vague phrase hinting that the reader is not to count mainstream software, embedded systems or anything else where Cobol is practically non-existent). In the following I assume that the code does not include double-counting of multiple installations of the same software (since that is cheating!).
In particular, in the run-up to the Y2K problem, it was noted that a lot of Cobol code was already 20 to 30 years old. That would mean it was written in the late 60s and 70s. At that time, the market leader was IBM with the IBM/370 mainframe. IBM has put up a historical announcement on its website quoting prices, configuration and availability. According to the sheet, prices were about one million dollars for machines with up to half a megabyte of memory.
Question 1: How many mainframes have actually been sold?
I have not found any numbers for those times; the latest numbers are for the year 2000, again by Gartner. :^(
I would guess that the actual number is in the hundreds or the low thousands per year; if the market size was 50 billion in 2000 and the market has grown exponentially like any other technology, it might have been merely a few billion back in 1970. Since the IBM/370 was sold for twenty years, twenty times a few thousand gives a few tens of thousands of machines (and that is pretty optimistic)!
Question 2: How large were the programs in lines of code?
I don't know how many bytes of machine code result from one line of source code on that architecture. But since the IBM/370 was a 32-bit machine, any address access must have used 4 bytes plus instruction (2, maybe 3 bytes for that?). If you count in operating system and data for the program, how many lines of code would have fit into the main memory of half a megabyte?
Question 3: Was there no standard software?
Did every single machine sold run a unique hand-coded system without any standard software? Seriously, even if every machine was programmed from scratch without any reuse of legacy code (wait ... didn't that violate one of the claims we started from to begin with???) we might have O(50,000 l.o.c./machine) * O(20,000 machines) = O(1,000,000,000 l.o.c.).
That is still far, far, far away from 200 billion! Am I missing something obvious here?
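As a quick check of the order-of-magnitude estimate above, using only the question's own round figures (50,000 lines per machine and 20,000 machines are the asker's guesses, not measured data):

```python
loc_per_machine = 50_000          # asker's guess
machines = 20_000                 # asker's guess
claimed_loc = 200_000_000_000     # the "200 billion lines" claim

estimated_loc = loc_per_machine * machines         # 1,000,000,000
shortfall_factor = claimed_loc / estimated_loc     # 200.0
```

Under these assumptions the estimate falls short of the claim by a factor of 200, so either figure (or both) would have to be off by one to two orders of magnitude.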
Question 4: How many programmers did we need to write 200 billion lines of code?
I am really not sure about this one, but if we take an average of 10 l.o.c. per day, we would need 55 million man-years to achieve this! In the time-frame of 20 to 30 years this would mean that there must have existed two to three million programmers constantly writing, testing, debugging and documenting code. That would be about as many programmers as we have in China today, wouldn't it?
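The man-year figure can be reproduced as follows. Note the assumption, flagged in the code, that a year is counted as 365 days; that is the only reading under which the 55 million figure comes out.

```python
claimed_loc = 200_000_000_000
loc_per_day = 10
days_per_year = 365   # assumption: calendar days; 250 working days would give 80 million

person_years = claimed_loc / loc_per_day / days_per_year   # about 54.8 million
programmers_over_20_years = person_years / 20              # about 2.7 million
programmers_over_30_years = person_years / 30              # about 1.8 million
```

So the "two to three million programmers" range corresponds to the 20-to-30-year window, and it is quite sensitive to the assumed lines per day.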
EDIT: Several people have brought up automatic templating systems/code generators or so. Could somebody elaborate on this? I have two issues with that: a) I need to tell the system what it is supposed to do for me; for that I need to communicate with the computer and the computer will output the code. This is exactly what a compiler of a programming language does. So essentially I am using a different high-level programming language to generate my Cobol code. Shouldn't I work with that other high-level language instead of Cobol? Why the middle-man? b) In the 70s and 80s the most precious commodity was memory. So if I have a programming language output something, it had better be concise. Using my hypothetical meta-language, would I really generate verbose and repetitive Cobol code with it rather than bytecode/p-code like other compilers of that time did? END OF EDIT
Question 5: What about the competition?
So far, I have come up with two things here:
1) IBM had their own programming language, PL/I. Above I have assumed that the majority of code has been written exclusively using Cobol. However, all other things being equal I wonder if IBM marketing had really pushed their own development off the market in favor of Cobol on their machines. Was there really no relevant code base of PL/I?
2) Sometimes (also on this board in the thread quoted above) I come across the claim that the "200 billion lines of code" are simply invisible to anybody outside of "governments, banks ..." (and whatnot). Actually, the DoD had funded their own language in order to increase cost effectiveness and reduce the proliferation of programming languages. This led to their use of Ada. Would they really worry about having so many different programming languages if they had predominantly used Cobol? If there was any language running on "government and military" systems outside the perception of mainstream computing, wouldn't that language be Ada?
I hope someone can point out any flaws in my assumptions and/or conclusions and shed some light on whether the above claim has any truth to it or not.
On the surface, the numbers Gartner produces are akin to answering the
question: How many angels can dance on the head of a pin?
Unless you obtain a full copy of their report (costing big bucks) you will never know how they came up
with or justified the 200 billion lines of COBOL claim. Having said that, Gartner is a well
respected information technology research and advisory
firm so I would think they would not have made such a claim without justification or
explanation.
It is amazing how this study has been quoted over the years. A Google search for "200 billion lines of COBOL"
got me about 19,500 hits. I sampled a bunch of them and most attribute the number directly to the 1997 Gartner report.
Clearly, this figure has captured the attention of many.
The method that you have taken to "debunk" the claim has a few problems:
1) How many mainframes have been sold? This is a big question in and of itself, probably just as difficult
as answering the 200 billion lines of code question. But more importantly, I don't see how determining the number of
mainframes could be used in constraining the number of lines of code running on them.
2) How large were the programs in lines of code? COBOL programs tend to be large. A modest program can
run to a few thousand lines, a big one into tens of thousands. One of the jokes COBOL programmers
often make is that only one COBOL program has ever been written, the rest are just modified
copies of it. As with many jokes there is a grain of truth in it. Most shops have a large program inventory
and a lot of those programs were built by cutting and pasting from each other. This really "fluffs" up the
size of your code base.
Your assumption that a program must fit into physical memory in order to run is wrong. The size problem
was solved in several different ways (e.g. program overlays, virtual memory etc.). It was not unusual in the
60's and 70's to run large programs on machines with tiny physical memories.
3) Was there no standard software? A lot of COBOL is written
from scratch or from templates. A number of financial packages were developed by software houses in the 70's and 80's.
Most of these
were distributed as source code libraries. The customer then copied and modified the source to
fit their particular business
requirement. More "fluffing" of the code base - particularly given that large segments of that code
were logically unexecutable once the package had been "configured".
4) How many programmers did we need to write 200 billion lines of code? Not as many as you might think!
Given that COBOL tends to be verbose and highly replicated, a programmer can have huge "productivity".
Program generating systems were in vogue during the 70's and early 80's.
I once worked with a product (now defunct fortunately) that let me write "business logic" and it
generated all of the "boiler plate" code around it - producing a fully functional COBOL program. The code
it generated was, to be polite, verbose in the extreme. I could produce a 15K line COBOL program from
about 200 lines of input! We are talking serious fluff here!
5) What about the competition? COBOL has never really had much serious competition in the
financial and insurance sectors. PL/1 was a major IBM initiative to produce the one programming language
that met every possible computing need. Like all such initiatives it was too ambitious and
has pretty much collapsed inward. I believe IBM still uses and supports it today. During the 70's
several other languages were predicted to replace COBOL - ALGOL, Ada and C come to mind, but
none have. Today I hear the same said of Java and .Net. I think the major reason COBOL is still with us is that it
does what it is supposed to do very well and
the huge multi-billion-line code legacy makes moving to a "better" language both expensive and
risky from a business point of view.
Do I believe the 200 billion lines of code story? I find the number high but not impossibly high given
the number of ways COBOL code tends to "fluff" itself up over time.
I also think that getting too involved in analyzing these numbers quickly degrades into a
"counting angels" exercise - something people can get really wound up over but has no
significance in the real world.
EDIT
Let me address a few of the comments left below...
Never seen a COBOL program used by an investment bank. Quite possible. Depends
which application areas you were working in. Investment banks tend to have
large computing infrastructures and employ a wide range of technologies. The shop
I have been working in
for the past 20 years (although not an investment bank) is one of the largest in
the country and it has a significant
COBOL investment. It also has significant Java, C and C++ investments as
well as pockets of just about every other technology
known to man. I have also met some fairly senior applications developers here that
were completely unaware that COBOL was still in use. I did a
rough line count through our source control system and found around 70 million lines of
production COBOL. Quite a few people that have worked here for years are completely oblivious to it!
I am also aware that COBOL is rapidly declining as a language of choice, but the fact
is, there is still a lot of it around today. In 1997, the period to which this question
relates, I believe COBOL would have been the dominant language in terms of LOC. The
point of the question is: Could there have been 200 billion lines of it in 1997?
Counting mainframes. Even if one were able to obtain the number of mainframes it would
be difficult to assess the "compute" power they represented. Mainframes, like most
other computers, come in a wide range of configurations and processing capacity.
If we could say there were exactly "X" mainframes in use in 1997, you still need to
estimate the processing capacity they represented, then you need to figure out what
percentage of the work load was due to running COBOL programs and a bunch of other
confounding factors. I just don't see how this line of reasoning would ever
yield an answer with an acceptable margin of error.
Multi-counting lines of code. That was exactly my point when
referring to the COBOL "fluff" factor. Counting lines of COBOL can be a very misleading statistic
simply because a significant amount of it was never written by programmers in the
first place. Or if it was, quite a bit of it was done using the cut-paste-tinker
"design pattern".
Your observation that memory was a valuable commodity in 1997 and prior is true. One would think that
this would have led to using the most efficient coding techniques and languages
available to maximize its use. However, there are other factors: The opportunity cost of having an application
backlog was often perceived to outweigh the cost of bringing in more memory/cpu to deal with less than
optimal code (which could be cranked out quite a bit faster). This thinking was further reinforced by the
observation that Moore's Law leads to ever
declining hardware costs whereas software development costs remain constant. The "logical" conclusion
was to crank out less than optimal code, suffer for a while, then reap the benefits
in the years to come (IMO, a lesson in bad planning and greed, still common today).
The push to deliver applications during the 70's through 90's led to the rise of a host of
code generators (today I see frameworks of various flavours fulfilling this role).
Many of these code generators emitted tons of COBOL code. Why emit COBOL code? Why not emit
assembler or p-code or something much more efficient? I believe the answer is
one of risk mitigation. Most code generators are proprietary pieces of software owned by some
third party who may or may not be in business or supporting their product 10 years from now.
It is a very hard sell if you can't provide an iron-clad guarantee that the generated application can be
supported into the distant future. The solution is to have the "generator" produce something
familiar - COBOL for example! Even if the "generator" dies, the resulting application can
be maintained by your existing staff of COBOL programmers. Problem solved ;) (today we see
open source used as a risk mitigation argument).
Returning to the LOC question. Counting lines of COBOL code is, in my opinion, open to
a wide margin of error or at least misinterpretation. Here are a few statistics from an application
I work on (quoted approximately). This
application was built with and is maintained using Basset Frame Technology (frame-work) so
we don't actually write COBOL but we generate COBOL from it.
Lines of COBOL: 7,000,000
Non-Procedure Division: 3,000,000
Procedure Division: 3,500,000
Comment/blank : 500,000
Non-expanded COPY directives: 40,000
COBOL verbs: 2,000,000
Programmer written procedure Division: 900,000
Application frame generated: 270,000
Corporate infrastructure frame generated: 2,330,000
As you can see, almost half of our COBOL programs are non-procedure Division code (data declaration
and the like). The ratio of LOC to Verbs (statement count) is about 7:2. Using our framework
leverages code production by about a factor of 7:1.
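The figures quoted above are internally consistent, which can be checked directly. The variable names below are hypothetical labels for the quoted (approximate) numbers, and the two "leverage" ratios are my own reading of what the 7:1 factor might refer to.

```python
total = 7_000_000
non_procedure = 3_000_000
procedure = 3_500_000
comment_blank = 500_000
verbs = 2_000_000
hand_written = 900_000
app_frame = 270_000
infra_frame = 2_330_000

# The three top-level categories add up to the total line count,
# and the three sources of Procedure Division code add up to its size.
assert non_procedure + procedure + comment_blank == total
assert hand_written + app_frame + infra_frame == procedure

non_procedure_share = non_procedure / total      # about 0.43 ("almost half")
loc_per_verb = total / verbs                     # 3.5, i.e. 7:2
leverage_total = total / hand_written            # about 7.8
leverage_procedure = procedure / hand_written    # about 3.9
```

The quoted factor of roughly 7:1 matches total lines per hand-written Procedure Division line; counting Procedure Division code only, the ratio is closer to 4:1.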
So what should you make of all this? I really don't know - except that there is a lot of room to
fluff up the COBOL line counts.
I have worked with other COBOL program generators in the past. Some of these had absolutely
stupid inflation factors (e.g. the 200 lines to 15K line fluffing mentioned earlier). Given all these
inflationary factors and the counting methodology used by Gartner, it may very well have
been possible to "fluff" up to 200 billion lines of COBOL in 1997 - but the question
remains: Of what real use is this number? What could it really mean? I have no idea. Now
let's get back to the counting angels problem!
I would never defend those clowns at Gartner, but still:
Your ideas about IBM/370s are wrong. The 370 was an architecture, not a specific machine - it was implemented on everything from water-cooled monsters to small, mini-computer sized machines (same size as a VAX). The number sold was thus probably far larger, by orders of magnitude, than you suspect.
Also, COBOL was heavily used on DEC's VAX lineup, and before that on the DEC-10 and DEC-20 lines. In the UK it was used on all ICL mainframes. Lots of other platforms also supported it.
[Usual disclaimer - I work for a COBOL vendor]
It's an interesting question and it's always difficult to get a provable number. On the estimates of the number of COBOL programmers: the 2 - 3 million number may not be orders of magnitude in error. I think there have been estimates of 1 or 2 million in the past. And each one of those can generate a lot of code in a 20 year career. In India tens of thousands of new COBOL programmers are added to the pool every year (perhaps every month!).
I think the automatically generated code is probably bigger than might be thought. For example PACBASE is a very popular application in the banking industry. A very large global bank I know of uses it extensively and they generate all their code into COBOL and estimate this generated code is 95% of their total code base with the other 5% being hand coded/maintained. I don't think this is uncommon. The maintenance and development of those systems is typically done at the model-level not the generated code as you say.
There is a group of applications that were missing from the original question - COBOL isn't only a mainframe language. The early years of Micro Focus were almost entirely spent in the OEM marketplace - we used to have something like 200 different OEMs (lots of long-gone names like DEC, Stratus, Bull, ...). Every OEM had to have a COBOL compiler on their box alongside C and Assembler. A lot of big applications were built at that time and are still going strong - think about the largest HR ERP systems, the largest mobile phone billing systems etc. My point is that there is a lot of COBOL code that was never on an IBM mainframe and is often overlooked.
And finally, the size of the code base may be larger in COBOL shops than the "average". That's not just because COBOL is verbose (or was - that's not been the case for a long time) but because the systems are just bigger - they're in large organizations, doing a large number of disparate tasks. It's very common for sites to have tens of millions of LoC.
I don't have figures, but my first "real" job was on IBM 370s.
First: Number sold. In 1974, a large railway ran on three 370s. These were big 370s, though, and you could get a small one for a whole lot less. (For perspective, at that time whether to get another megabyte was a decision on the VP level.)
Second: COBOL code is what you might call "fluffy", since a typical sentence (COBOLese for line) might be "ADD 1 TO MAIN-ACCOUNT OF CUSTOMER." There would be relatively few machine instructions per line. (This is especially true on the IBM 360 and onwards, which had machine instructions designed around executing COBOL.) BTW, addresses were 16 bits, four to designate a register (using the bottom 24 bits as a base address) and 12 as an offset. This meant that something under 64K could be addressed at a time, since not all of the 16 registers could be used as base registers for various reasons. Don't underestimate the ability of people to fit programs into small memory.
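The base-plus-offset addressing described above can be sketched like this. It is a simplified illustration only: the function name and register contents are hypothetical, and details of the real architecture such as index registers are ignored.

```python
ADDRESS_MASK = 0xFFFFFF      # 24-bit addresses
DISPLACEMENT_BITS = 12


def effective_address(registers, operand):
    """Decode a 16-bit operand: 4-bit base register number, 12-bit offset."""
    base_reg = (operand >> DISPLACEMENT_BITS) & 0xF
    displacement = operand & 0xFFF
    base = registers[base_reg] & ADDRESS_MASK   # only the low 24 bits are used
    return (base + displacement) & ADDRESS_MASK


registers = [0] * 16
registers[12] = 0x00012000                       # hypothetical base address
operand = (12 << DISPLACEMENT_BITS) | 0x0A4      # register 12, offset 0x0A4
addr = effective_address(registers, operand)     # 0x120A4

reach_per_register = 1 << DISPLACEMENT_BITS      # 4096 bytes per base register
max_reach = 16 * reach_per_register              # 65536 if all 16 were usable
```

Each base register covers a 4K window, which is where the "something under 64K at a time" limit comes from once the registers that cannot serve as base registers are subtracted.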
Also, don't underestimate the number of programs. The program library would be on disk and tape, and was essentially only limited by cost and volume. (Earlier on, they'd be on punch cards, which had serious problems as data and program storage.)
Third: Yes, most software was hand-written for the business at that time. Machines were far more expensive then, and people were cheaper. Programs were written to automate the existing business processes, and the idea that you could get outside software and change your business practices was almost heresy. This changed over time, of course.
Fourth: Programmers could go much faster than today, in lines of code per person-year, since these were largely big dumb programs for big dumb problems. In my experience, the DATA DIVISION was a large part of each COBOL program, and that would frequently take large descriptions of file layouts and repeat them in each program in the series.
I have limited experience with program generators, but it was very common at the time to use one to generate an application and then modify the output. This was partly just bad practice, and partly because a program generator couldn't do everything needed.
Fifth: PL/I was not heavily used, despite IBM's efforts. It ran into early bad press, although as far as I know the only real major problem that couldn't be fixed was figuring out the precision system. The Defense Department used Ada and COBOL for entirely different things. You are omitting assembly language as a competitor, and lots of small shops used BAL (also called ASM) instead of COBOL. After all, programmers were cheap, compilers were expensive, and there were a whole lot of things COBOL couldn't do. It was actually a very nice assembly language, and I liked it a lot.
Well, you're asking in the wrong place here. This forum is dominated by .NET programmers, with a significant Java minority, and has such an age distribution that only a very small minority has any Cobol experience.
The CASE tool market consisted for a large part of cobol code generators. Most tools were write-only, not two-way. That ensures there are a lot of lines of code. This market was somewhat newer than the 70s, so the volume of cobol code grew fast in the 80s and 90s.
A lot of cobol development is done by people who have no significant internet access and therefore no visibility. There is no need for it. Cobol programmers are used to having in-house programming courses and paper documentation (lots of it).
[edit] Generating cobol source made a lot of sense. Cobol is very verbose and low level. The various cobol implementations are all slightly different, so configuring the code generator eliminates a lot of potential errors.
With regards to #4: how much of that could have been machine-generated code? I don't know if template-based code was used a lot with Cobol, but I see a lot of it used now for all sorts of things. If my application has thousands of LOC that were machine generated, that doesn't mean much. The last code-generating script I wrote took 20 minutes to write, 10 minutes to format the input, 2 minutes to run, then an hour to execute a suite of automatic tests to verify it actually worked, but the code it generated would have taken several days to do by hand (or the time between the morning meeting and lunch, doin' it my way ;) ). Ok I admit it's not always that easy and there is often manual tweaking involved, but I still don't think the LOC metric has much meaning if code-generators are in heavy use.
Maybe that's how they generated so much code in so little time.

Advice on starting a large multi-threaded programming project

My company currently runs a third-party simulation program (natural catastrophe risk modeling) that sucks up gigabytes of data off a disk and then crunches for several days to produce results. I will soon be asked to rewrite this as a multi-threaded app so that it runs in hours instead of days. I expect to have about 6 months to complete the conversion and will be working solo.
We have a 24-proc box to run this. I will have access to the source of the original program (written in C++ I think), but at this point I know very little about how it's designed.
I need advice on how to tackle this. I'm an experienced programmer (~ 30 years, currently working in C# 3.5) but have no multi-processor/multi-threaded experience. I'm willing and eager to learn a new language if appropriate. I'm looking for recommendations on languages, learning resources, books, architectural guidelines, etc.
Requirements: Windows OS. A commercial grade compiler with lots of support and good learning resources available. There is no need for a fancy GUI - it will probably run from a config file and put results into a SQL Server database.
Edit: The current app is C++ but I will almost certainly not be using that language for the re-write. I removed the C++ tag that someone added.
Numerical process simulations are typically run over a single discretised problem grid (for example, the surface of the Earth or clouds of gas and dust), which usually rules out simple task farming or concurrency approaches. This is because a grid divided over a set of processors representing an area of physical space is not a set of independent tasks. The grid cells at the edge of each subgrid need to be updated based on the values of grid cells stored on other processors, which are adjacent in logical space.
In high-performance computing, simulations are typically parallelised using either MPI or OpenMP. MPI is a message passing library with bindings for many languages, including C, C++, Fortran, Python, and C#. OpenMP is an API for shared-memory multiprocessing. In general, MPI is more difficult to code than OpenMP, and is much more invasive, but is also much more flexible. OpenMP requires a memory area shared between processors, so is not suited to many architectures. Hybrid schemes are also possible.
This type of programming has its own special challenges. As well as race conditions, deadlocks, livelocks, and all the other joys of concurrent programming, you need to consider the topology of your processor grid - how you choose to split your logical grid across your physical processors. This is important because your parallel speedup is a function of the amount of communication between your processors, which itself is a function of the total edge length of your decomposed grid. As you add more processors, this total edge length increases, increasing the amount of communication overhead. Eventually, making the decomposition any finer costs more in communication than it gains in computation.
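To put rough numbers on the edge-length point, here is a minimal sketch in Python (picked only for brevity). The function names, the square n x n grid, the k x k block layout and the one-cell-wide exchange layer are all assumptions made for illustration, not details of the original simulation.

```python
def cut_length(n: int, k: int) -> int:
    """Total length, in cells, of the internal cuts made when an
    n x n grid is split into k x k blocks (hypothetical layout)."""
    # (k - 1) vertical cuts plus (k - 1) horizontal cuts, each n cells long.
    return 2 * (k - 1) * n


def comm_to_compute_ratio(n: int, k: int) -> float:
    """Cells exchanged per step divided by cells updated per step."""
    # Every cut has one cell on each side that a neighbour needs to see.
    exchanged = 2 * cut_length(n, k)
    return exchanged / (n * n)


if __name__ == "__main__":
    n = 1024
    for k in (1, 2, 4, 8):
        print(k * k, "blocks:", cut_length(n, k), "cut cells,",
              "ratio", comm_to_compute_ratio(n, k))
```

The work per step stays fixed at n * n cells while the cut length grows with k, so the ratio only gets worse as blocks are added.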
The other important consideration is the proportion of the code which can be parallelised. Amdahl's law then dictates the maximum theoretically attainable speedup. You should be able to estimate this before you start writing any code.
Both of these facts will conspire to limit the maximum number of processors you can run on. The sweet spot may be considerably lower than you think.
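Amdahl's law is simple enough to evaluate before writing any code. The sketch below is illustrative only: the function name is made up, and the parallel fractions are guesses you would replace with measurements from profiling the existing program.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Theoretical speedup when `parallel_fraction` of the run time
    can be spread perfectly over `processors` processors."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Assumed fractions, purely for illustration.
    for fraction in (0.50, 0.90, 0.95, 0.99):
        print(fraction, round(amdahl_speedup(fraction, 24), 2))
```

For instance, if 95% of the run time parallelises, 24 processors give a speedup of a little over 11x, and no number of processors can exceed 20x.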
I recommend the book High Performance Computing, if you can get hold of it. In particular, the chapter on performance benchmarking and tuning is priceless.
An excellent online overview of parallel computing, which covers the major issues, is this introduction from Lawrence Livermore National Laboratory.
Your biggest problem in a multithreaded project is that too much state is visible across threads - it is too easy to write code that reads / mutates data in an unsafe manner, especially in a multiprocessor environment where issues such as cache coherency, weakly consistent memory etc might come into play.
Debugging race conditions is distinctly unpleasant.
Approach your design as you would if, say, you were considering distributing your work across multiple machines on a network: that is, identify what tasks can happen in parallel, what the inputs to each task are, what the outputs of each task are, and what tasks must complete before a given task can begin. The point of the exercise is to ensure that each place where data becomes visible to another thread, and each place where a new thread is spawned, are carefully considered.
Once such an initial design is complete, there will be a clear division of ownership of data, and clear points at which ownership is taken / transferred; and so you will be in a very good position to take advantage of the possibilities that multithreading offers you - cheaply shared data, cheap synchronisation, lockless shared data structures - safely.
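One way to write such a design down is as a dependency graph, from which you can read off which tasks may run at the same time. This is only a sketch: the task names are invented, and it uses Python's standard `graphlib` module (Python 3.9 or later) rather than anything from .NET or C++.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each name maps to the tasks that must finish first.
DEPENDENCIES = {
    "load_events": set(),
    "load_exposure": set(),
    "simulate_losses": {"load_events", "load_exposure"},
    "aggregate": {"simulate_losses"},
    "write_results": {"aggregate"},
}


def execution_waves(dependencies):
    """Group tasks into waves. Tasks in the same wave do not depend on
    each other, so they are candidates for running in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished so successors unlock
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(DEPENDENCIES), start=1):
        print(number, wave)
```

Each edge in the graph is a place where data crosses from one task to another, which is exactly where ownership needs to be decided.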
If you can split the workload up into non-dependent chunks of work (i.e., the data set can be processed in bits, there aren't lots of data dependencies), then I'd use a thread pool / task mechanism. Presumably whatever C# has as an equivalent to Java's java.util.concurrent. I'd create work units from the data, and wrap them in a task, and then throw the tasks at the thread pool.
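The shape of that approach, sketched in Python with the standard `concurrent.futures` module, looks roughly like this. `make_work_units`, `process_unit` and `run` are hypothetical names, and the sum of squares is a stand-in for the real number crunching.

```python
from concurrent.futures import ThreadPoolExecutor


def make_work_units(data, unit_size):
    """Cut the data into independent, non-overlapping chunks."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def process_unit(unit):
    """Placeholder for the real computation on one chunk.
    It touches only its own chunk, so no locking is needed."""
    return sum(x * x for x in unit)


def run(data, unit_size=1000, workers=4):
    units = make_work_units(data, unit_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() hands each unit to a pool thread and keeps result order.
        partial_results = list(pool.map(process_unit, units))
    return sum(partial_results)


if __name__ == "__main__":
    print(run(list(range(10_000))))
```

In CPython specifically, threads will not speed up pure-Python arithmetic because of the global interpreter lock; `ProcessPoolExecutor` offers the same interface if you want to try real parallelism there. The structure (chunk, submit, combine) is what carries over to a .NET or C++ thread pool.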
Of course performance might be a necessity here. If you can keep the original processing code kernel as-is, then you can call it from within your C# application.
If the code has lots of data dependencies, it may be a lot harder to break up into threaded tasks, but you might be able to break it up into a pipeline of actions. This means thread 1 passes data to thread 2, which passes data to threads 3 through 8, which pass data onto thread 9, etc.
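A pipeline like that can be sketched with one thread per stage and a queue between neighbouring stages. The names below (`stage`, `run_pipeline`, the `_DONE` sentinel) are made up for illustration, and this simplified version does not fan out to several threads in the middle as the description above does.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass results to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # forward the shutdown signal downstream
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Push items through funcs, one thread per function."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10]))
```

Because each stage is a single thread reading from a FIFO queue, results come out in the order the items went in; a stage with several worker threads would need extra bookkeeping to preserve order.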
If the code has a lot of floating point mathematics, it might be worth looking at rewriting in OpenCL or CUDA, and running it on GPUs instead of CPUs.
For a 6-month project I'd say it definitely pays to start by reading a good book on the subject. I would suggest Joe Duffy's Concurrent Programming on Windows. It's the most thorough book I know on the subject, and it covers both .NET and native Win32 threading. I had been writing multithreaded programs for 10 years when I discovered this gem, and I still found things I didn't know in almost every chapter.
Also, "natural catastrophe risk modeling" sounds like a lot of math. Maybe you should have a look at Intel's IPP library: it provides primitives for many common low-level math and signal processing algorithms. It supports multi threading out of the box, which may make your task significantly easier.
There are a lot of techniques that can be used to deal with multithreading if you design the project for it.
The most general and universal is simply "avoid shared state". Whenever possible, copy resources between threads, rather than making them access the same shared copy.
If you're writing the low-level synchronization code yourself, you have to remember to make absolutely no assumptions. Both the compiler and CPU may reorder your code, creating race conditions or deadlocks where none would seem possible when reading the code. The only way to prevent this is with memory barriers. And remember that even the simplest operation may be subject to threading issues. Something as simple as ++i is typically not atomic, and if multiple threads access i, you'll get unpredictable results.
And of course, just because you've assigned a value to a variable, that's no guarantee that the new value will be visible to other threads. The compiler may defer actually writing it out to memory. Again, a memory barrier forces it to "flush" all pending memory I/O.
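If you are not writing the barriers yourself, the usual fix for the `++i` problem is to put the read-modify-write under a lock. Here is a minimal sketch in Python rather than C# or C++; the `Counter` class and the `hammer` and `run` helpers are invented for the example.

```python
import threading


class Counter:
    """A counter that is safe to increment from several threads."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "+= 1" is a read, an add and a write; holding the lock
        # stops another thread from interleaving between those steps.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def hammer(counter, times):
    for _ in range(times):
        counter.increment()


def run(threads=8, times=10_000):
    counter = Counter()
    workers = [
        threading.Thread(target=hammer, args=(counter, times))
        for _ in range(threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value


if __name__ == "__main__":
    print(run())
```

With the lock in place the result is always threads * times. The price is that every increment is serialised, which is one more reason to keep shared counters out of inner loops.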
If I were you, I'd go with a higher level synchronization model than simple locks/mutexes/monitors/critical sections if possible. There are a few CSP libraries available for most languages and platforms, including .NET languages and native C++.
This usually makes race conditions and deadlocks trivial to detect and fix, and allows a ridiculous level of scalability. But there's a certain amount of overhead associated with this paradigm as well, so each thread might get less work done than it would with other techniques. It also requires the entire application to be structured specifically for this paradigm (so it's tricky to retrofit onto existing code, but since you're starting from scratch, it's less of an issue -- but it'll still be unfamiliar to you).
Another approach might be Transactional Memory. This is easier to fit into a traditional program structure, but also has some limitations, and I don't know of many production-quality libraries for it (STM.NET was recently released, and may be worth checking out. Intel has a C++ compiler with STM extensions built into the language as well).
But whichever approach you use, you'll have to think carefully about how to split the work up into independent tasks, and how to avoid cross-talk between threads. Any time two threads access the same variable, you have a potential bug. And any time two threads access the same variable or just another variable near the same address (for example, the next or previous element in an array), data will have to be exchanged between cores, forcing it to be flushed from CPU cache to memory and then read into the other core's cache, which can be a major performance hit.
Oh, and if you do write the application in C++, don't underestimate the language. You'll have to learn the language in detail before you'll be able to write robust code, much less robust threaded code.
One thing we've done in this situation that has worked really well for us is to break the work to be done into individual chunks and the actions on each chunk into different processors. Then we have chains of processors and data chunks can work through the chains independently. Each set of processors within the chain can run on multiple threads each and can process more or less data depending on their own performance relative to the other processors in the chain.
Also breaking up both the data and actions into smaller pieces makes the app much more maintainable and testable.
There's plenty of specific bits of individual advice that could be given here, and several people have done so already.
However nobody can tell you exactly how to make this all work for your specific requirements (which you don't even fully know yourself yet), so I'd strongly recommend you read up on HPC (High Performance Computing) for now to get the over-arching concepts clear and have a better idea which direction suits your needs the most.
The model you choose to use will be dictated by the structure of your data. Is your data tightly coupled or loosely coupled? If your simulation data is tightly coupled then you'll want to look at OpenMP or MPI (parallel computing). If your data is loosely coupled then a job pool is probably a better fit... possibly even a distributed computing approach could work.
My advice is get and read an introductory text to get familiar with the various models of concurrency/parallelism. Then look at your application's needs and decide which architecture you're going to need to use. After you know which architecture you need, then you can look at tools to assist you.
A fairly highly rated book which works as an introduction to the topic is "The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications".
Read about Erlang and the "Actor Model" in particular. If you make all your data immutable, you will have a much easier time parallelizing it.
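To give a feel for the idea outside Erlang, here is a rough actor-style sketch in Python. Everything in it (the `Loss` and `Stop` messages, the `TotalActor` class) is hypothetical; the point is only that the actor owns its state, and other threads talk to it by sending immutable messages.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Loss:
    amount: int


@dataclass(frozen=True)
class Stop:
    pass


class TotalActor:
    """Owns a running total; nobody else reads or writes it directly."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._total = 0
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        self._mailbox.put(message)

    def _run(self):
        # Only this thread ever touches self._total while the actor runs.
        while True:
            message = self._mailbox.get()
            if isinstance(message, Stop):
                return
            if isinstance(message, Loss):
                self._total += message.amount

    def stop_and_get_total(self):
        self.send(Stop())
        self._thread.join()  # after join, reading the total is safe
        return self._total


if __name__ == "__main__":
    actor = TotalActor()
    for amount in (5, 7, 30):
        actor.send(Loss(amount))
    print(actor.stop_and_get_total())
```

No lock appears in the actor's own logic: the mailbox is the only shared object, and the messages cannot be modified once sent.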
Most of the other answers offer good advice regarding partitioning the project - look for tasks that can be cleanly executed in parallel with very little data sharing required. Be aware of non-thread safe constructs such as static or global variables, or libraries that are not thread safe. The worst one we've encountered is the TNT library, which doesn't even allow thread-safe reads under some circumstances.
As with all optimisation, concentrate on the bottlenecks first. Threading adds a lot of complexity, so you want to avoid it where it isn't necessary.
You'll need a good grasp of the various threading primitives (mutexes, semaphores, critical sections, conditions, etc.) and the situations in which they are useful.
One thing I would add, if you're intending to stay with C++, is that we have had a lot of success using the boost.thread library. It supplies most of the required multi-threading primitives, although it does lack a thread pool (and I would be wary of the unofficial "boost" thread pool one can locate via google, because it suffers from a number of deadlock issues).
I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now.
You can either use C#, which you're more familiar with, or you can use managed C++.
At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For (or ForEach) and/or PLINQ where possible.
If you do this, a lot of the heavy lifting will be done for you in a very efficient way. It's the direction that Microsoft is going to increasingly support.
http://msdn.microsoft.com/en-us/library/dd321424%28VS.100%29.aspx
Sorry, I just want to add a pessimistic, or rather realistic, answer here.
You are under time pressure: a 6-month deadline, and you don't even know for sure what language this system is written in, what it does, or how it is organized. If it is not a trivial calculation, that is a very bad start.
Most importantly: you say you have never done multithreaded programming before. This is where I get 4 alarm clocks ringing at once. Multithreading is difficult and takes a long time to learn if you want to do it right, and you need to do it right if you want to win a huge speed increase. Debugging is extremely nasty even with good tools like the TotalView debugger or Intel's VTune.
Then you say you want to rewrite the app in another language. Well, this isn't as bad, as you have to rewrite it anyway. The chance of turning a single-threaded program into a well-working multithreaded one without a total redesign is almost zero.
But learning multithreading and a new language (what are your C++ skills?) with a timeline of 3 months (you have to write a throwaway prototype, so I cut the timespan into two halves) is extremely challenging.
My advice here is simple and you will not like it: learn multithreading now, because it will be a required skill set in the future, but leave this job to someone who already has experience. Well, unless you don't care about the program being successful and are just looking for 6 months of pay.
If it's possible to have all the threads working on disjoint sets of process data, and have other information stored in the SQL database, you can quite easily do it in C++, and just spawn off new threads to work on their own parts using the Windows API. The SQL server will handle all the hard synchronization magic with its DB transactions! And of course C++ will perform a lot faster than C#.
You should definitely revise C++ for this task, and understand the C++ code, and look for efficiency bugs in the existing code as well as adding the multi-threaded functionality.
You've tagged this question as C++ but mentioned that you're a C# developer currently, so I'm not sure if you'll be tackling this assignment from C++ or C#. Anyway, in case you're going to be using C# or .NET (including C++/CLI): I have the following MSDN article bookmarked and would highly recommend reading through it as part of your prep work.
Calling Synchronous Methods Asynchronously
Whatever technology you're going to write this in, take a look at the must-read book on concurrency, "Concurrent Programming in Java"; for .NET I highly recommend the Retlang library for concurrent apps.
I don't know if it was mentioned yet, but if I were in your shoes, what I would be doing right now (aside from reading every answer posted here) is writing a multithreaded example application in your favorite (most used) language.
I don't have extensive multithreaded experience. I've played around with it in the past for fun but I think gaining some experience with a throw-away application will suit your future efforts.
I wish you luck in this endeavor and I must admit I wish I had the opportunity to work on something like this...

Resources