Why don't large programs (such as games) use loads of different threads? [closed]


Closed 9 years ago.
I don't know much about how commercial games work internally, but the open source games I have come across don't seem to be massively into threading. The same goes for most other desktop applications: normally only two or three threads seem to be used (e.g. program logic and GUI updates).
Why don't games have many threads, e.g. separate threads for physics, sound, graphics, AI, etc.?

I don't know about the games that you have played, but most games run the sound on a separate thread. Networking code, or at least the socket listeners, also runs on a separate thread.
However, the rest of the game engine generally runs in a single thread. There are reasons for this. For example, most processing in a game runs a single chain of dependencies. Graphics depend on state of physics engine as does the artificial intelligence. Designing for multiple threads means that you have to have frame latency between the various subsystems for concurrency. You get quicker response time and snappier game play if these subsystems are computed linearly each frame. The part of the game that benefits the most from parallelization is of course the rendering subsystem which is offloaded to highly parallelized graphics accelerator cards.

You need to think, what are the actual benefits of threads? Remember that on a single core machine, threads don't actually allow concurrent execution, just the impression of it. Behind the scenes, the CPU is context-switching between the different threads, doing a little work on each every time. Therefore, if I have several tasks that involve no waiting, running them concurrently (on a single core) will be no quicker than running them linearly. In fact, it will be slower, due to the added overhead of the frequent context-switching.
If that is the case then, why ever use threads on a single core machine? Well firstly, because sometimes tasks can involve long periods of waiting on some external resource, such as a disk or other hardware device, to become available. Whilst the task is in a waiting stage, threading allows other tasks to continue, thus using the CPU's time more efficiently.
Secondly, tasks may have a deadline of some sort in which to complete, particularly if they are responding to an event. The classic example is the user interface of an application. The computer should respond to user action events as quickly as possible, even if it is busy performing some other long running task, otherwise the user will become agitated and may believe the application has crashed. Threading allows this to happen.
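To make the "waiting" case concrete, here is a minimal Python sketch. The names fetch, run_sequentially and run_threaded are made up for illustration, and time.sleep is only a stand-in for a real disk or device wait.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(resource: str, delay: float = 0.05) -> str:
    """Pretend to wait on a slow device; time.sleep stands in for real I/O."""
    time.sleep(delay)
    return f"{resource}: done"


def run_sequentially(resources):
    # Each wait has to finish before the next one starts.
    return [fetch(r) for r in resources]


def run_threaded(resources):
    # While one thread is waiting, the others can start their own waits.
    with ThreadPoolExecutor(max_workers=max(1, len(resources))) as pool:
        return list(pool.map(fetch, resources))


if __name__ == "__main__":
    names = ["disk", "network", "sensor"]
    for runner in (run_sequentially, run_threaded):
        start = time.perf_counter()
        runner(names)
        print(runner.__name__, round(time.perf_counter() - start, 2), "s")
```

Both functions return the same results. The threaded version should finish sooner only because the waits overlap, not because any extra computation is happening.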
As for games, I am not a games programmer, but my understanding of the situation is this: 3D games create a programmatic model of the game world; players, enemies, items, terrain, etc. This game world is updated in discrete steps, based on the amount of time that has elapsed since the previous update. So, if 1ms has passed since the last time round the game loop, the position of an object is updated by using its velocity and the elapsed time to determine the delta (obviously the physics is a bit more complicated than that, but you get the idea). Other factors such as AI and input keys may also contribute to the update. When everything is finished, the updated game world is rendered as a new frame and the process begins again. This process usually occurs many times per second.
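As a rough sketch of the loop described above (not how any particular engine does it), the update step can be written as a plain function of the elapsed time. Body, update and run_frames are hypothetical names, and a list of positions stands in for rendering.

```python
from dataclasses import dataclass


@dataclass
class Body:
    position: float
    velocity: float  # units per second


def update(body: Body, dt: float) -> None:
    """Advance one body by dt seconds (a simple Euler step)."""
    body.position += body.velocity * dt


def run_frames(body: Body, frame_times):
    """Run the update-then-render cycle once per elapsed frame time."""
    rendered = []
    for dt in frame_times:
        update(body, dt)                # physics; AI and input would be stepped here too
        rendered.append(body.position)  # stand-in for drawing the frame
    return rendered
```

For example, a body moving at 10 units per second, stepped with frame times of 0.5, 0.5 and 1.0 seconds, ends up at position 20.0.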
When we think about the game loop in this way, we can see that the engine is in fact achieving a very similar goal to that of threading. It has a number of long running tasks (updating the world's physics, handling user input, etc), and it gives the impression that they are happening concurrently by breaking them down into small pieces of work and interleaving these pieces, but instead of relying on the CPU or operating system to manage the time spent on each, it is doing it itself. This means it can keep all the different tasks properly synchronized, and avoid the complexities that come with real threading: locks, pre-emption, re-entrant code, etc. There is no performance implication to this approach either, because as we said a single core machine can only really execute code linearly anyway.
Things change when we have a multi-core system. Now, tasks can be running genuinely concurrently and there may indeed be a benefit to using threading to handle different parts of the game world updates, so long as we can manage to synchronise the results to render consistent frames. We would expect therefore, that with the advent of multi-core systems, games engine developers would be working on this. And so it turns out, they are. Valve, the makers of Half Life, have recently introduced multi-processor support into their Source Engine, and I imagine many other engine developers are following suit.
Well, that turned out a little longer than I expected. I'm not a threading or games expert, but I hope I haven't made any especially glaring errors. If I have I'm sure people will correct me :)

The main reason is that, as elegant as it sounds, using multiple threads in a program as complicated as a 3D game is really, really, really difficult. Also, before the fairly recent introduction of low cost multi-core systems, using multiple threads did not offer much of a performance incentive.

Many games these days are using "task" or "job" systems for parallel processing. That is, the game spawns a fixed number of worker threads which are used for multiple tasks. Work is divided up into small pieces and queued, then sent to be processed by the worker threads as they become available.
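A job system of the kind described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: run_jobs is a made-up name, jobs are plain callables, and there is no error handling or job dependency tracking, which a real engine would need.

```python
import queue
import threading


def run_jobs(jobs, num_workers=4):
    """Run callables on a fixed pool of worker threads; results keep job order."""
    todo = queue.Queue()
    results = [None] * len(jobs)

    def worker():
        while True:
            item = todo.get()
            if item is None:       # sentinel: no more work for this worker
                break
            index, job = item
            results[index] = job()  # each slot is written by exactly one worker

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for index, job in enumerate(jobs):
        todo.put((index, job))      # small pieces of work, queued up
    for _ in threads:
        todo.put(None)              # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

The number of threads stays fixed no matter how many jobs are queued, which is the main difference from giving each subsystem its own thread.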
This is becoming especially common on consoles. The PS3 is based on Cell architecture so you need to use parallel processing to get the best performance out of the system. The Xbox 360 can emulate a task/job setup that was designed for PS3 as it has multiple cores. You would probably find for most games that a lot of the system design is shared among the 360, PS3, and PC codebases, so PC most likely uses the same sort of tactic.
While it is hard to write threadsafe code, as many of the other answers indicate, I think there are a few other reasons for the things you're seeing:
First, many open source games are a few years old. Especially with this generation of consoles parallel programming is becoming popular and even necessary as mentioned above.
Second, very few open source projects seem concerned about getting the highest possible performance. As John Carmack pointed out to the Utah GLX project, highly optimized code is often harder to maintain than unoptimized code, so the latter would generally be preferred in open source contexts.
Third, I wouldn't take a small number of threads created by a game to mean that it's not using parallel jobs well.

I was about to post the same thing as William, but I'd like to expand on it a little bit. It's very hard to write optimal code for the future. Given the choice between writing something that will scale to hardware you don't have vs. writing something that will work on hardware you do have, most people will choose to do the latter. Since the single-core paradigm has been with us for so long, most code that has been written (especially for games where there is extreme pressure to get it out the door) isn't that future proof.
x86 has been very kind to game programmers, since we haven't had to think about the ramifications of less forgiving hardware platforms.

The fact that everybody here is correctly claiming that multithreading is hard is very sad. We desperately need to make concurrency systems easy.
Personally I think we are going to need a paradigm shift and new tools.

Other than the technical challenges of programming for multiple cores, commercial games have to run well on low end systems w/o multiple cores to make money.
Now that multi-core processors have been out for a while and the major game consoles have multiple cores it's only a matter of time before dual core shows up on the minimum system requirements list for PC games.
Here's a link to an interview with Orion Granatir from Intel where he's talking about getting game developers to take advantage of multi-threading.

There are many issues with race conditions and data locking when using lots of threads. Since the different parts of games are fairly reliant on each other it doesn't make much sense to do all the extra engineering required to use loads of threads.

It's very difficult to use threads without problems, and most GUI APIs are based on event driven coding anyway. Threads mandate the use of locking mechanisms which add delay to the code, and often that delay is unpredictable depending on who is currently holding the lock.
It seems sensible to me to have a single (or perhaps very few) threads handling things in an event driven way rather than hundreds of threads all causing strange and unrepeatable bugs.

Threads are dead, baby.
Realistically, in game development, threads don't scale beyond offloading very dedicated tasks like networking and loading. Job-systems seem to be the only way forward, given 8 CPU systems are becoming more commonplace even on PCs. And you can pretty much guarantee that upcoming super-multicore systems like Intel's Larrabee will be job-system based.
This has been a somewhat painful realization on Playstation3 and XBOX360 projects, and it seems now even Apple has jumped on board with their "revolutionary" Grand Central Dispatch system in Snow Leopard.
Threads have their place, but the naive promise of "put everything in a thread and it will all run faster" simply doesn't work in practice.

Related

Difference between SIMD and Multi-threading [closed]

Closed 3 years ago.
What is the difference between the SIMD and multi-threading concepts that one comes across in the parallel programming paradigm?

SIMD means "Single Instruction, Multiple Data" and is an umbrella term describing a method whereby many elements are loaded into extra-wide CPU registers at the same time and a single low-level instruction (such as ADD, MULTIPLY, AND, XOR) is applied to all the elements in parallel. Specific examples are MMX, SSE2/3 and AVX on Intel processors, or NEON on ARM processors, or AltiVec on PowerPC. It is very low-level and typically only for a few clock cycles. An example might be that, rather than go into a for loop increasing the brightness of the pixels in an image one-by-one, you load 64 8-bit pixels into a single 512-bit wide register and multiply them all up at the same time in one or two clock cycles. SIMD is often implemented for you in high-performance libraries (like OpenCV) or is generated for you by your compiler when you compile with vectorisation enabled, typically at optimisation level 3 or higher (-O3 switch). Very experienced programmers may choose to write their own, using "intrinsics".
Multi-threading refers to when you have multiple threads of execution, normally running on different CPU cores, at the same time. It is higher-level than SIMD and typically threads exist a lot longer. One thread might be acquiring images, another thread might be detecting objects, another might be tracking the objects and a final one might be displaying the results. A feature of multi-threading is that the threads all share the same address space, so data in one thread can be seen and manipulated by others. This makes threads light-weight compared to multiple processes, but can make for harder debugging. Threads are called "light-weight" because they typically take much less time to create and start than full-blown processes.
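The shared address space can be seen in a small Python sketch. Here count_with_threads is a hypothetical helper: every thread updates the same counter object, and a lock keeps the updates from interfering with each other.

```python
import threading


def count_with_threads(num_threads=4, increments=1000):
    """Have several threads increment one shared counter; return the final value."""
    counter = {"value": 0}   # lives in the one address space all threads share
    lock = threading.Lock()

    def work():
        for _ in range(increments):
            with lock:       # without this, concurrent updates could be lost
                counter["value"] += 1

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

With separate processes the same code would not work as written, because each process would get its own copy of counter.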
Multi-processing is similar to multi-threading except each process has its own address space, so if you want to share data between the processes, you need to work harder to do it. It has the benefit over multi-threading that one process is unlikely to crash another or interfere with its data, making it somewhat easier to debug.
If I make an analogy with cooking a meal, then SIMD is like lining up all your green beans and slicing them in one go. The single instruction is "slice", the multiple, repeated data are the beans. In fact, lining things up ("memory alignment") is an important aspect of SIMD.
Then multi-threading is like having multiple chefs all taking ingredients from a shared vegetable larder, preparing them and putting them in a big shared cook-pot. You get the job done faster because there are multiple chefs - analogous to CPU cores - working at once.
In this little analogy, multi-processing is more like each chef having his own vegetable larder and cook-pot, so if one chef runs out of vegetables, or cooking gas, the others are not affected - things are more independent. You get the job done faster, because there are more chefs, just you have to do a bit more organisation (or "synchronisation") to get all the chefs to serve their meals at the same time at the end.
There is nothing to prevent an application using SIMD as well as multi-threading and multi-processing at the same time. Going back to the cooking analogy, you can have multiple chefs (multi-threading or multi-processing) who are all slicing their green beans efficiently (SIMD). It is my impression that most applications either use SIMD and multi-threading, or SIMD and multi-processing, but relatively few use both multi-threading and multi-processing. YMMV on this bit!

Are there concurrent designs that the actor model isn't good for?

I've noticed that all designs I have come across can be multi-threaded using the actor model - separating each work module into a different actor and using a message queue (for me a .NET ConcurrentQueue) to pass messages. What other good multi threaded models exist?

Communicating Sequential Processes is, I think, a far better model for concurrency than the actor model. It addresses a number of problems with the actor model (and other models) such as deadlock, livelock, starvation. Take a look at this and, more practically useful, this.
The main difference is as follows. In the actor model a message is sent asynchronously. However in CSP messages are sent synchronously; the sender cannot send until the receiver is ready to receive.
This one simple restriction makes the world of difference. If you've got an incorrect design with deadlock potential then in the actor model it may or may not occur (and it usually occurs only when demo-ing to the boss...). However in CSP the deadlock will always occur, leaving you in no doubt that your design is incorrect. Ok, so you've still got to fix it but that's OK; fixing problems you know are there is much easier than attempting to exhaustively test for the absence of problems (your only choice in the actor model).
The strictly synchronous approach of CSP seems like it will cause problems with response times; for example one fears that a GUI thread can't move on because it's not been able to send a message to a busy worker thread that's not got as far as its 'read'. What you have to do is to ensure that the workload is spread across enough threads so that they can all get back to waiting for new messages within an acceptable period of time. CSP doesn't let you get away with it. The actor model does, however don't be deceived; you're just building up future problems.
In .NET a ConcurrentQueue is not the right primitive for CSP, not unless you layer a synchronising mechanism on top. I've added strict synchronisation on top of TCP sockets too. In fact I generally end up writing some sort of library that abstracts both sockets and pipes so that it becomes immaterial as to whether a 'Process' (as they're known in CSP parlance) is a thread on this machine or a whole other process on another machine at the end of a network connection. Nice - scalability built in from the very beginning.
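To show what "layering a synchronising mechanism on top" of a queue can look like, here is a minimal Python sketch of a rendezvous channel. RendezvousChannel is a made-up class, not part of any CSP library, and it only covers the basic send/receive handshake (no select over several channels).

```python
import queue
import threading


class RendezvousChannel:
    """Synchronous channel: send() returns only after a receiver took the item."""

    def __init__(self):
        self._queue = queue.Queue(maxsize=1)
        self._send_lock = threading.Lock()  # one sender in the handshake at a time

    def send(self, item):
        with self._send_lock:
            self._queue.put(item)
            self._queue.join()       # block until the receiver calls task_done()

    def receive(self):
        item = self._queue.get()     # block until a sender offers an item
        self._queue.task_done()      # release the waiting sender
        return item
```

With a plain queue the sender would return as soon as the item was buffered. Here the sender stays blocked until the receiver has actually taken the message, which is the CSP-style behaviour described above.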
I've been doing it the CSP way for 23 years now, I won't do it any other way. Built some big systems with thousands of threads that way.
==EDIT==
It seems this answer is still attracting some attention, so I thought I'd add to it. For Windows developers there is the DataFlow namespace for the Task Parallel Library. It has to be separately downloaded. Microsoft describe it thusly: "This dataflow model promotes actor-based programming by providing in-process message passing for coarse-grained dataflow and pipelining tasks." Excellent! It uses classes like BufferBlocks as communications channels. The important thing is that a BufferBlock has a BoundedCapacity property that defaults to Unbounded, which fits the Actor model. Set this to a value of 1, and you have now transformed it into a CSP-style communication channel.

To add to my last, there are various other multi threading models beyond CSP. This Wikipedia page lists several others like CCS, ACP, and LOTOS. Reading those articles hints at a deep and dark cavern where academics roam, waiting to pounce on a stray software developer.
The problem is that academic obscurity often means a complete lack of tools and libraries at the practical, usable level. It takes a lot of effort to convert a sound, proven academic study into a set of libraries and tools. There's little real incentive for the wider software community to take up a theoretical paper and turn it into a practical reality.
I like CSP because it's actually dead simple to implement your own CSP library based on select() or pselect(). I've done that several times now (I must learn about code re-use), plus the nice people at Kent University put together JCSP for those who like Java. I don't recommend developing in Occam (though it's still just about possible); support and maintainability are going to be issues going forward. CSP is probably the easiest one to get into, and given its good characteristics it's well worthwhile.

@JeremyFriesner
Future Problems
To expand on what I meant by "future problems", I was referring to the fact that in an asynchronous system the sender of messages has no knowledge as to whether the receiver is actually keeping up with the demand. The sender doesn't know because all it knows is that some message buffer has accepted the message. The transport underneath (e.g. tcp) then gets on with the job of pushing the message over as and when the receiver is willing to accept it.
Thus it might be that when under stress the system fails to perform as required, because the message transport will inevitably have a limited capacity to absorb messages that the receiver can't accept yet. The sender only finds this out after the problem has already begun to develop, by which time it might be too late to do anything about it.
Testing of course can reveal this problem, but you have to be careful that the testing really has exhausted the transport's ability to absorb messages. Just a quick blast at full speed might be deceiving.
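The testing pitfall can be illustrated with a bounded buffer in Python. This is a toy sketch: send_burst is a hypothetical helper, queue.Queue stands in for the message transport, and there is deliberately no receiver at all.

```python
import queue


def send_burst(buffer_size, messages):
    """Send with no receiver; return how many messages the transport accepted."""
    transport = queue.Queue(maxsize=buffer_size)
    accepted = 0
    for message in messages:
        try:
            transport.put_nowait(message)  # the sender sees only "buffer accepted it"
        except queue.Full:
            break                          # first sign that the receiver is behind
        accepted += 1
    return accepted
```

A short burst of 10 messages into a buffer of 100 is accepted in full even though nothing is being received, so the test looks fine. Only a burst larger than the buffer exposes the problem.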
Of course, a synchronous system imposes an overhead ("are you ready yet?", "no, not yet", "now?", "yes!", "here you are then") which just doesn't happen in an asynchronous system. So on average the asynchronous system will be more efficient, might actually have a higher throughput, etc. Which is why most of the world's systems are actually asynchronous, but also the reason why systems don't always reach the full capacity that the raw network bandwidths / processing times might suggest. When approaching full capacity asynchronous systems tend not to limit gracefully, in my opinion. Token Bus (nb not Token Ring) was a good example of a synchronous network with totally dependable and deterministic throughput but was just a little bit slower than Ethernet and Token Ring...
Having always been blessed with a surfeit of bandwidth in my problems I've chosen the synchronous route for certainty-of-success reasons; I'm not really losing out much on bandwidth, but I am losing tons of risk, which is good.
Convert from Synchronous to Asynchronous
Maybe, but it's possibly of little value. In a synchronous system it only works as per the requirement if you have successfully balanced the division of labour between threads. That is, there are enough threads doing the slow bits so that the fast bits aren't held back. Get that wrong and the system definitely isn't quick enough.
But having done that you have a system where every component is able to send messages onwards with no delay, because everything it is sending to is ready and waiting (because of your skill and judgement at balancing out the workloads). So if you did then convert to an asynchronous message transport all you're doing is saving fractionally small amounts of time in the transport of those messages. You're not making changes that will result in the workloads getting processed quicker. However, if saving bandwidth is the goal then perhaps it's worthwhile.
Of course, doing this balancing can be a difficult thing, and dealing with variabilities like HDD access times, networks, etc can be difficult to overcome. I've often had to implement a 'next available' workload sharing scheme. But certainly in real time signal processing systems like the ones I play with you're basically dealing with a very dependable transport like OpenVPX's RapidIO, you're only doing sums on the data (not dealing with databases, disks, etc), and the data rates are very high (1GByte/sec is perfectly doable these days, and in fact I was handling data rates that high 13 years ago; that was haaard work). Being strictly synchronous means that you're either definitely keeping up with the data rate or definitely not. With asynchronous, it's more of a maybe...
Real Time OS for Everyone!
Having a real time OS is an essential component too, and these days it seems to be the PREEMPT_RT patch set for Linux that does the job for a lot of people in the trade. Redhat do a prepack spin of that (RedHat MRG), but for a freebie Scientific Linux from the nice people at CERN is good and free! I strongly suspect that a lot of systems would work much more smoothly near their capacity limits if PREEMPT_RT was used - it does a good job of smoothing things out.

Concurrency is a fascinating topic with a lot of approaches to implementation with the fundamental question being - "How do I coordinate parallel computations?".
Some models of concurrency are:
Futures
Futures, also known as Promises or Tasks, are objects that act as proxies for an asynchronously calculated result. When the value is actually needed for a calculation, the thread blocks until the calculation is complete, and thus synchronization is achieved.
Futures are the preferred concurrency model for .NET and ES6.
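The same idea exists in Python's standard library as concurrent.futures. A minimal sketch, where slow_square and sum_of_squares are made-up names for illustration:

```python
from concurrent.futures import ThreadPoolExecutor


def slow_square(n):
    """Stand-in for some longer-running calculation."""
    return n * n


def sum_of_squares(values):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # submit() returns immediately with a Future, a proxy for the result.
        futures = [pool.submit(slow_square, v) for v in values]
        # result() blocks until that particular calculation has finished.
        return sum(f.result() for f in futures)
```

The only synchronization point is the call to result(), at the moment the value is actually needed.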
Software Transactional Memory
Software Transactional Memory (STM) synchronizes access to shared memory (much like locks) by grouping actions into transactions. Any single transaction only sees a single view of the shared memory and is atomic. This is conceptually similar to how many databases deal with concurrency.
STM is the preferred concurrency model for Clojure and Haskell.
The Actor Model
The Actor Model focuses on message passing. An actor receives a message and can decide to send a message in response, spawn other actors, make local changes etc. This is, probably, the least tightly coupled model of these discussed as Actors exchange messages only and nothing else.
The Actor Model is the preferred concurrency model for Erlang and Rust.
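For illustration, a very small actor can be sketched in Python with a thread and a mailbox queue. CounterActor and its message tuples ("add", "get", "stop") are invented for this example and do not come from any actor library.

```python
import queue
import threading


class CounterActor:
    """Toy actor: private state, changed only in response to mailbox messages."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._total = 0  # touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message):
        self._mailbox.put(message)  # asynchronous: returns straight away

    def _run(self):
        while True:
            kind, payload = self._mailbox.get()
            if kind == "add":
                self._total += payload
            elif kind == "get":
                payload.put(self._total)  # reply on the queue the sender supplied
            elif kind == "stop":
                break

    def join(self):
        self._thread.join()
```

A caller sends ("add", 5) messages, then ("get", reply_queue) and reads the answer from reply_queue. No lock is needed because nothing outside the actor ever touches _total.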
Note that unlike the languages mentioned above most languages don't have canonical or preferred concurrency models, and even those languages that show a strong preference for one model usually have the other ones implemented as libraries.
My personal opinion is that Futures outclass STM and Actors in simplicity of use and reasoning but none of these models are inherently "wrong" and I can think of no disadvantages for any of them. You could use whichever you preferred with no consequences.

The most general model for parallel processing is Petri Nets. It represents computation as a pure data dependency graph, which expresses maximum parallelism. All other models stem from it.
Dataflow Computing model http://www.cs.colostate.edu/cameron/dataflow.html, http://en.wikipedia.org/wiki/Dataflow_programming is almost as powerful. It restricts Petri Net places to have only one output arc. In practice, this is useful, as places with multiple output arcs are hard to implement, cause indeterminism, and are rarely needed.
Actor model is a dataflow model where nodes may have only 2 input edges - one for input messages and one for actor's state. This is a serious restriction if you want to program functions with side-effect and more than one argument.

Threads and processes? [closed]

Closed 11 years ago.
In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system.
This is very abstract!
What would be real world/tangible/physical interpretation of threads?
How could I tell if an application (by looking at its code) is a single threaded or multi threaded?
Are there any benefits in making an application multi threaded? In which cases could one do that?
Is there anything like multi process application too?
Is technology a limiting factor to decide if it could be made multi threaded or is it just the design choice? e.g. is it possible to make multi threaded applications in flex?
If anyone could give me an example or an analogy to explain these things, it would be very helpful.

What would be real world/tangible/physical interpretation of threads?
Think about a thread as an independent unit of execution that can be executed concurrently (at the same time) on a given CPU(s). A good analogy would be multiple cars driving around independently on the same road. Here a "car" is a thread, and a road is that CPU. So the function of all these cars is somewhat the same: "drive people around", but the kicker is that people should not stand in line to wait for a single car => they can drive at the same time in different cars (threads).
Technically, however, depending on number of CPU cores, and overall hardware / OS architecture there will be some context switching, where CPU would make it seem it happens simultaneously, but in reality it switches from one thread to another.
How could I tell if an application (by looking at its code) is a single threaded or multi threaded?
This depends on several things, a language the code is written in, your understanding of the language, what code is trying to accomplish, etc.. Usually you can tell, but I do not believe this will solve anything. If you already have access to the code, it's a lot simpler to just ask the developer, or, in case it is an open source product, read documentation, post on user forums to figure it out.
Are there any benefits in making an application multi threaded? In which cases could one do that?
Yes, think about the car example above. The benefit = at the same time and decoupled execution of things. For example, you need to calculate how many stars are in a known universe. You can either have a single process go over all the stars and count them, or you can "spawn" multiple threads, and give each thread a galaxy to solve: "thread 1 counts all the stars in Milky Way, thread 2 counts all the stars in Andromeda, etc.."
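The star-counting split can be sketched in Python like this. The galaxy data, count_stars and count_all_stars are all made up for illustration. Note that in standard CPython the global interpreter lock stops threads from running Python code truly in parallel, so this shows how the work is divided rather than a guaranteed speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def count_stars(stars):
    """Count the stars of one galaxy (each galaxy is handled by one task)."""
    return sum(1 for _ in stars)


def count_all_stars(galaxies):
    """Give each galaxy to a worker thread, then add up the partial counts."""
    with ThreadPoolExecutor(max_workers=max(1, len(galaxies))) as pool:
        return sum(pool.map(count_stars, galaxies.values()))


if __name__ == "__main__":
    universe = {
        "Milky Way": ["star-1", "star-2"],  # hypothetical sample data
        "Andromeda": ["star-3"],
    }
    print(count_all_stars(universe))
```

Each task works on its own galaxy and shares nothing with the others, which is what makes this kind of problem easy to split across threads.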
Is there anything like multi process application too?
That is a matter of terminology, but the cleanest answer would be yes. For example, in Erlang, VM is capable of starting many very lightweight processes very fast, where each process does its own thing. On Unix servers if you do "ps -aux / ps -ef", for example, you'd see multiple "processes" executing, where each process may in fact have many threads doing its job.
Is technology a limiting factor to decide if it could be made multi threaded or is it just the design choice? e.g. is it possible to make multi threaded applications in flex?
A 2-threaded application is already multithreaded. You most likely already have 2 or more cores on your laptop / PC, so technology would pretty much always encourage you to utilize those cores, rather than limit you. Having said that, the problem and requirements should drive the decision. Not the technology or tools. But if you do decide to write a multithreaded application, make sure you understand all the gotchas and solutions to them. The best language I used so far to solve concurrency is Erlang, since concurrency is just built into it. However, other languages like Scala, Java, C#, and mostly functional languages, where shared state is not a problem would also be a good choice.

Are there any practical alternatives to threads?

While reading up on SQLite, I stumbled upon this quote in the FAQ: "Threads are evil. Avoid them."
I have a lot of respect for SQLite, so I couldn't just disregard this. I got thinking what else I could, according to the "avoid them" policy, use instead in order to parallelize my tasks. As an example, the application I'm currently working on requires a user interface that is always responsive, and needs to poll several websites from time to time (a process which takes at least 30 seconds for each website).
So I opened up the PDF linked from that FAQ, and essentially it seems that the paper suggests several techniques to be applied together with threads, such as barriers or transactional memory - rather than any techniques to replace threads altogether.
Given that these techniques do not fully dispense with threads (unless I misunderstood what the paper is saying), I can see two options: either the SQLite FAQ does not literally mean what it says, or there exist practical approaches that actually avoid the use of threads altogether. Are there any?
Just a quick note on tasklets/cooperative scheduling as an alternative - this looks great in small examples, but I wonder whether a large-ish UI-heavy application can be practically parallelized in a solely cooperative way. If you have done this successfully or know of such examples this certainly qualifies as a valid answer!
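For reference, the cooperative approach mentioned above looks roughly like this with Python's asyncio. It is a toy sketch: poll_site and ui_ticks are hypothetical coroutines, and asyncio.sleep stands in for both the slow website request and the UI's idle time.

```python
import asyncio


async def poll_site(name, delay):
    """Pretend to poll a website; the await hands control back while waiting."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def ui_ticks(count, interval, log):
    """Stand-in for a UI loop that keeps running while the polls are pending."""
    for i in range(count):
        log.append(f"ui tick {i}")
        await asyncio.sleep(interval)


async def main():
    log = []
    results = await asyncio.gather(
        poll_site("site-a", 0.05),
        poll_site("site-b", 0.05),
        ui_ticks(3, 0.01, log),
    )
    return results[:2], log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Everything here runs on a single thread. It only stays responsive as long as every task yields at an await regularly, which is exactly the concern about large applications raised above.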

Note: This answer no longer accurately reflects what I think about this subject. I don't like its overly dramatic, somewhat nasty tone. Also, I am not so certain that the quest for provably correct software has been so useless as I seemed to think back then. I am leaving this answer up because it is accepted, and up-voted, and to edit it into something I currently believe would pretty much vandalize it.
I finally got around to reading the paper. Where do I start?
The author is singing an old song, which goes something like this: "If you can't prove the program is correct, we're all doomed!" It sounds best when screamed loudly accompanied by over modulated electric guitars and a rapid drum beat. Academics started singing that song when computer science was in the domain of mathematics, a world where if you don't have a proof, you don't have anything. Even after the first computer science department was cleaved from the mathematics department, they kept singing that song. They are singing that song today, and nobody is listening. Why? Because the rest of us are busy creating useful things, good things out of software that can't be proved correct.
The presence of threads makes it even more difficult to prove a program correct, but who cares? Even without threads, only the most trivial of programs can be proved correct. Why do I care if my non-trivial program, which could not be proved correct, is even more unprovable after I use threading? I don't.
If you weren't sure the author was living in an academic dreamworld, you can be sure of it after he maintains that the coordination language he suggests as an alternative to threads could best be expressed with a "visual syntax" (drawing graphs on the screen). I've never heard that suggestion before, except every year of my career. A language that can only be manipulated by GUI and does not play with any of the programmer's usual tools is not an improvement. The author goes on to cite UML as a shining example of a visual syntax which is "routinely combined with C++ and Java." Routinely in what world?
In the meantime, I and many other programmers go on using threads without all that much trouble. How to use threads well and safely is pretty much a solved problem, as long as you don't get all hung up on provability.
Look. Threading is a big kid's toy, and you do need to know some theory and usage patterns to use them well. Just as with databases, distributed processing, or any of the other beyond-grade-school devices that programmers successfully use every day. But just because you can't prove it correct doesn't mean it's wrong.

The statement in the SQLite FAQ, as I read it, is just a comment on how difficult threading can be to the uninitiated. It is the author's opinion, and it might be a valid one. But saying you should never use threads is throwing the baby out with the bath water, in my opinion. Threads are a tool. Like all tools, they can be used and they can be abused. I can read his paper and be convinced that threads are the devil, but I have used them successfully, without killing kittens.
Keep in mind that SQLite is written to be as lightweight and easy to understand (from a coding standpoint) as possible, so I would imagine that threading is kind of the antithesis to this lightweight approach.
Also, SQLite is not meant to be used in a highly-concurrent environment. If you have one of these, you might be better off working with a more enterprisey database like Postgres.

Evil, but a necessary evil. High level abstractions of threads (Tasks in .NET for example) are becoming more common but for the most part the industry is not trying to find a way to avoid threads, just making it easier to deal with the complexities that come with any kind of concurrent programming.

One trend I've noticed, at least in the Cocoa domain, is help from the framework. Apple has gone to great lengths to help developers with the relatively difficult concept of concurrent programming. Some things I've seen:
Different granularity of threading. Cocoa supports everything from POSIX threads (low level) to object-oriented threading with NSLock and NSThread, to high-level parallelism such as NSOperation. Depending on your task, using a high-level tool like NSOperation is easier and gets the job done.
Threading behind the scenes via an API. Lots of the UI and animation stuff in Cocoa is hidden behind an API. You are responsible for calling an API method and providing an asynchronous callback that is executed when the secondary thread completes (for example, at the end of some animation).
OpenMP. There are tools like OpenMP that let you provide pragmas telling the compiler that some task may be safely parallelized, for example iterating over a set of items independently.
It seems like a big push in this industry is to make things simple for the application developers and leave the gory thread details to the system developers and framework developers. There is a push in academia for formalizing parallel patterns. As mentioned, you can't always avoid threading, but there are an increasing number of tools in your arsenal to make it as painless as possible.

If you really want to live without threads, you can, so long as you don't call any functions that can potentially block. This may not be possible.
One alternative is to implement the tasks you would have made into threads as finite state machines. Basically, the task does what it can do immediately, then goes to its next state, waiting for an event, such as input arriving on a file or a timer going off. X Windows, as well as most GUI toolkits, support this style. When something happens, they call a callback, which does what it needs to do and returns. For a FSM, the callback checks to see what state the task is in and what the event is to determine what to do immediately and what the next state will be.
Say you have an app that needs to accept socket connections, and for each connection, parse command lines, execute some code, and return the results. A task would then be what listens to a socket. When select() (or Gtk+, or whatever) tells you the socket has something to read, you read it into a buffer, then check to see if you have enough input buffered to do something. If so, you advance to a "start doing something" state, otherwise you stay in the "reading a line" state. (What you "do" could be multiple states.) When done, your task drops the line from the buffer and goes back to the "reading a line" state. No threads or preemption needed.
This lets you act multithreaded by way of being event-driven. If your state machines are complicated, however, your code can get hard to maintain pretty fast, and you'll need to work up some kind of FSM-management library to separate the grunt work of running the FSM from the code that actually does things.
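As a rough illustration of the socket example above, here is a minimal sketch in Python. The class name LineTask, the state names, and the "command" (upper-casing the line) are all made up for the example; the standard selectors module stands in for select() or a GUI toolkit's event loop. Note that sendall() can itself block on a large reply, so a fuller version would also treat writing as a state.

```python
import selectors
import socket


class LineTask:
    """One connection's state machine: READING_LINE -> EXECUTING -> READING_LINE."""

    def __init__(self, conn):
        self.conn = conn
        self.buffer = b""
        self.state = "READING_LINE"

    def on_readable(self):
        """Callback for 'socket has data'. Returns False once the peer has closed."""
        data = self.conn.recv(4096)
        if not data:
            return False
        self.buffer += data
        # Stay in READING_LINE until a complete line has been buffered.
        while b"\n" in self.buffer:
            line, _, self.buffer = self.buffer.partition(b"\n")
            self.state = "EXECUTING"
            self.conn.sendall(self.execute(line) + b"\n")
            self.state = "READING_LINE"
        return True

    def execute(self, line):
        # Hypothetical command handler: just upper-case the input line.
        return line.strip().upper()


def serve(tasks):
    """Drive every task from a single thread until all peers have disconnected."""
    sel = selectors.DefaultSelector()
    for task in tasks:
        sel.register(task.conn, selectors.EVENT_READ, task)
    remaining = len(tasks)
    while remaining:
        for key, _events in sel.select():
            if not key.data.on_readable():
                sel.unregister(key.fileobj)
                key.fileobj.close()
                remaining -= 1
    sel.close()
```

Each callback does only what it can do immediately and then returns to the loop, which is what lets one thread serve many connections.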
P.S. Another way to get threads without really using threads is the GNU Pth library. It doesn't do preemption, but it is another option if you really don't want to deal with threads.

Another approach to this may be to use a different concurrency model rather than avoid multithreading altogether (you have to utilize all these CPU cores in parallel somehow).
Take a look at mechanisms used in Clojure (e.g. agents, software transactional memory).

Software Transactional Memory (STM) is a good alternative concurrency control mechanism. It scales well with multiple processors and does not have most of the problems of conventional concurrency control mechanisms. It is implemented as part of the Haskell language. It is worth a try, although I do not know how applicable it is in the context of SQLite.

Alternatives to threads:
coroutines
goroutines
mapreduce
workerpool
Apple's Grand Central Dispatch + lambdas
OpenCL
Erlang
(interesting to note that half of those technologies were invented or popularised by Google.)
Another thing is many web frameworks transparently use multiple threads/processes for handling requests, and usually in such a way that mostly eliminates the problems associated with multithreading (for the user of the framework), or at least makes the threading rather invisible. The web being stateless, the only shared state is session state (which isn't really a problem since by definition, a single session isn't going to be doing concurrent things), and data in a database that already has its multithreading nonsense sorted out for you.
It's somewhat important to note though that these are all abstractions. The underlying implementations of these things still use threads. But this is still incredibly useful. In the same way you wouldn't use assembler to write a web application, you wouldn't use threads directly to write any important application. Designing an application to use threads is too complicated to leave for a human to deal with.

Threading is not the only model of concurrency. The actors model (Erlang, Scala) is an example of a somewhat different approach.
http://www.scala-lang.org/node/242

If your task is really, really easily isolatable, you can use processes instead of threads, like Chrome does for its tabs.
Otherwise, inside a single process, there is no way to achieve real parallelism without threads, because you need at least two coroutines if you want two things to happen at the same time (assuming you're having multiple processors/cores at hand, of course; otherwise real parallelism is simply not possible).
The complexity of threading a program is always relative to the degree of isolation of the tasks the threads will perform. There's no trouble in running several threads if you know for sure these will never use the same variables. Then again, multiple high-level constructs exist in modern languages to help synchronize access to shared resources.
It's really a matter of application. If your task is simple enough to fit in some kind of high-level Task object (depends on your development platform; your mileage may vary), then using a task queue is your best bet. My rule of thumb is that if you can't find a cool name for your thread, then its task is not important enough to justify a thread (instead of a task going on an operation queue).

Threads give you the opportunity to do some evil things, specifically sharing state among different execution paths. But they offer a lot of convenience; you don't have to do expensive communication across process boundaries. Plus, they come with less overhead. So I think they're perfectly fine, used correctly.
I think the key is to share as little data as possible among the threads; just stick to synchronization data. If you try to share more than that, you have to engage in complex code that is hard to get right the first time around.

One method of avoiding threads is multiplexing - in essence you make a lightweight mechanism similar to threads which you manage yourself.
Thing is this is not always viable. In your case the 30s polling time per website - can it be split into 60 0.5s pieces, in between which you can stuff calls to the UI? If not, sorry.
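To make the "split it into pieces" idea concrete, here is a small sketch using Python generators as the hand-managed lightweight tasks. The names poll_site and run_round_robin, and the between_slices hook standing in for "calls to the UI", are invented for this example.

```python
from collections import deque


def poll_site(name, slices):
    """A long job cut into short slices; every yield hands control back."""
    for i in range(slices):
        # ... one short piece of the real polling work would go here ...
        yield f"{name}:{i}"


def run_round_robin(tasks, between_slices=lambda: None):
    """Run all tasks on one thread, one slice at a time, in rotation."""
    ready = deque(tasks)
    log = []
    while ready:
        task = ready.popleft()
        try:
            log.append(next(task))
        except StopIteration:
            continue  # this task is finished; do not requeue it
        between_slices()  # e.g. let the UI process pending events
        ready.append(task)
    return log
```

This only works if every slice is short and never blocks; a single slice that waits 30 seconds stalls everything, which is the viability caveat above.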
Threads aren't evil, they are just easy to shoot your foot with. If doing Query A takes 30s and then doing Query B takes another 30s, doing them simultaneously in threads will take 120s instead of 60 due to thread overhead, fighting for disk access and various bottlenecks.
But if Operation A consists of 5s of activity and 55 seconds of waiting, mixed randomly, and Operation B takes 60s of actual work, doing them in threads will take maybe 70s, compared to plain 120 when you execute them in sequence.
The rule of thumb is: threads should idle and wait most of the time. They are good for I/O, slow reads, low-priority work and so on. If you want performance, use multiplexing, which requires more work but is faster, more efficient and has far fewer caveats. (Synchronizing threads and avoiding race conditions is a whole different chapter of thread headaches...)
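A quick way to see the "threads should mostly wait" rule is to time jobs that do nothing but wait. This is only a sketch: time.sleep stands in for a slow read or network poll, and the function names are made up for the example.

```python
import threading
import time


def wait_bound(seconds):
    """Stands in for a slow read or a network poll: it only waits."""
    time.sleep(seconds)


def run_sequential(jobs):
    """Run the waits one after another and return the elapsed time."""
    start = time.perf_counter()
    for seconds in jobs:
        wait_bound(seconds)
    return time.perf_counter() - start


def run_threaded(jobs):
    """Run each wait in its own thread and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=wait_bound, args=(s,)) for s in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

With four waits of equal length, the threaded run should take roughly as long as one wait, while the sequential run takes roughly four times that. Replacing the sleep with real computation would remove most of that gain, which is the point made above.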

Advice on starting a large multi-threaded programming project

My company currently runs a third-party simulation program (natural catastrophe risk modeling) that sucks up gigabytes of data off a disk and then crunches for several days to produce results. I will soon be asked to rewrite this as a multi-threaded app so that it runs in hours instead of days. I expect to have about 6 months to complete the conversion and will be working solo.
We have a 24-proc box to run this. I will have access to the source of the original program (written in C++ I think), but at this point I know very little about how it's designed.
I need advice on how to tackle this. I'm an experienced programmer (~ 30 years, currently working in C# 3.5) but have no multi-processor/multi-threaded experience. I'm willing and eager to learn a new language if appropriate. I'm looking for recommendations on languages, learning resources, books, architectural guidelines, etc.
Requirements: Windows OS. A commercial grade compiler with lots of support and good learning resources available. There is no need for a fancy GUI - it will probably run from a config file and put results into a SQL Server database.
Edit: The current app is C++ but I will almost certainly not be using that language for the re-write. I removed the C++ tag that someone added.

Numerical process simulations are typically run over a single discretised problem grid (for example, the surface of the Earth or clouds of gas and dust), which usually rules out simple task farming or concurrency approaches. This is because a grid divided over a set of processors representing an area of physical space is not a set of independent tasks. The grid cells at the edge of each subgrid need to be updated based on the values of grid cells stored on other processors, which are adjacent in logical space.
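A toy example may make the edge-cell dependency clearer. The sketch below uses a 1D grid and a simple three-point smoothing step, both chosen purely for illustration; step_decomposed mimics two processors that each own half the grid and must first receive one "ghost" cell from their neighbour. In a real code that copy would be a message or a shared-memory read.

```python
def step(cells):
    """One smoothing step; the first and last cells are held fixed."""
    new = cells[:]
    for i in range(1, len(cells) - 1):
        new[i] = (cells[i - 1] + cells[i] + cells[i + 1]) / 3.0
    return new


def step_decomposed(left, right):
    """The same step on two subgrids, as two processors would each run it.

    Each side first receives one ghost cell from its neighbour; without
    that exchange the cells next to the cut cannot be updated correctly.
    """
    left_padded = left + [right[0]]     # ghost cell received from the right subgrid
    right_padded = [left[-1]] + right   # ghost cell received from the left subgrid
    new_left = step(left_padded)[:-1]   # drop the ghost cell again
    new_right = step(right_padded)[1:]
    return new_left, new_right
```

Stepping the two halves independently, without the ghost cells, gives a different answer at the cut, which is why simple task farming does not work here.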
In high-performance computing, simulations are typically parallelised using either MPI or OpenMP. MPI is a message passing library with bindings for many languages, including C, C++, Fortran, Python, and C#. OpenMP is an API for shared-memory multiprocessing. In general, MPI is more difficult to code than OpenMP, and is much more invasive, but is also much more flexible. OpenMP requires a memory area shared between processors, so is not suited to many architectures. Hybrid schemes are also possible.
This type of programming has its own special challenges. As well as race conditions, deadlocks, livelocks, and all the other joys of concurrent programming, you need to consider the topology of your processor grid - how you choose to split your logical grid across your physical processors. This is important because your parallel speedup is a function of the amount of communication between your processors, which itself is a function of the total edge length of your decomposed grid. As you add more processors, this surface area increases, increasing the amount of communication overhead. Increasing the granularity will eventually become prohibitive.
The other important consideration is the proportion of the code which can be parallelised. Amdahl's law then dictates the maximum theoretically attainable speedup. You should be able to estimate this before you start writing any code.
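For a quick estimate, Amdahl's law says that if a fraction P of the run time can be parallelised over N processors, the best possible speedup is 1 / ((1 - P) + P / N). A small helper (the function name is arbitrary) makes it easy to try numbers:

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only part of the work can be parallelised."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

For instance, assuming 95% of the run time parallelises, 24 processors give a bound of about 11x rather than 24x, and no number of processors can exceed 20x.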
Both of these facts will conspire to limit the maximum number of processors you can run on. The sweet spot may be considerably lower than you think.
I recommend the book High Performance Computing, if you can get hold of it. In particular, the chapter on performance benchmarking and tuning is priceless.
An excellent online overview of parallel computing, which covers the major issues, is this introduction from Lawrence Livermore National Laboratory.

Your biggest problem in a multithreaded project is that too much state is visible across threads - it is too easy to write code that reads / mutates data in an unsafe manner, especially in a multiprocessor environment where issues such as cache coherency, weakly consistent memory etc might come into play.
Debugging race conditions is distinctly unpleasant.
Approach your design as you would if, say, you were considering distributing your work across multiple machines on a network: that is, identify what tasks can happen in parallel, what the inputs to each task are, what the outputs of each task are, and what tasks must complete before a given task can begin. The point of the exercise is to ensure that each place where data becomes visible to another thread, and each place where a new thread is spawned, are carefully considered.
Once such an initial design is complete, there will be a clear division of ownership of data, and clear points at which ownership is taken / transferred; and so you will be in a very good position to take advantage of the possibilities that multithreading offers you - cheaply shared data, cheap synchronisation, lockless shared data structures - safely.

If you can split the workload up into non-dependent chunks of work (i.e., the data set can be processed in bits, there aren't lots of data dependencies), then I'd use a thread pool / task mechanism. Presumably whatever C# has as an equivalent to Java's java.util.concurrent. I'd create work units from the data, and wrap them in a task, and then throw the tasks at the thread pool.
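The shape of that approach, sketched in Python with the standard concurrent.futures module rather than C# or Java: the names make_work_units, process_unit and run are invented, and the sum-of-squares kernel is only a placeholder for the real computation.

```python
from concurrent.futures import ThreadPoolExecutor


def make_work_units(data, unit_size):
    """Cut the data set into independent chunks."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def process_unit(unit):
    # Placeholder kernel: the real per-chunk computation would go here.
    return sum(x * x for x in unit)


def run(data, unit_size, executor):
    """Wrap each chunk in a task, throw the tasks at the pool, combine results."""
    units = make_work_units(data, unit_size)
    # map() submits one task per unit and yields results in submission order.
    return sum(executor.map(process_unit, units))


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(run(list(range(10)), 3, pool))
```

Because run() accepts any executor, a process pool can be swapped in without changing the task code; in CPython that is what CPU-bound kernels would need, since threads there do not run Python bytecode in parallel.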
Of course performance might be a necessity here. If you can keep the original processing code kernel as-is, then you can call it from within your C# application.
If the code has lots of data dependencies, it may be a lot harder to break up into threaded tasks, but you might be able to break it up into a pipeline of actions. This means thread 1 passes data to thread 2, which passes data to threads 3 through 8, which pass data onto thread 9, etc.
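A minimal sketch of such a pipeline, with one thread per stage and queues between them. The names stage, run_pipeline and the DONE sentinel are made up for the example, and a real pipeline could run several threads for a slow stage.

```python
import queue
import threading

DONE = object()  # sentinel that tells each stage to shut down


def stage(func, inbox, outbox):
    """Take items from inbox, apply func, pass results on to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # forward the shutdown signal downstream
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Push items through one thread per function and collect the outputs."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

The only data shared between threads is what travels through the queues, so each stage can be written and tested as ordinary single-threaded code.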
If the code has a lot of floating point mathematics, it might be worth looking at rewriting in OpenCL or CUDA, and running it on GPUs instead of CPUs.

For a 6-month project I'd say it definitely pays to start by reading a good book about the subject first. I would suggest Joe Duffy's Concurrent Programming on Windows. It's the most thorough book I know about the subject and it covers both .NET and native Win32 threading. I had been writing multithreaded programs for 10 years when I discovered this gem and still found things I didn't know in almost every chapter.
Also, "natural catastrophe risk modeling" sounds like a lot of math. Maybe you should have a look at Intel's IPP library: it provides primitives for many common low-level math and signal processing algorithms. It supports multi threading out of the box, which may make your task significantly easier.

There are a lot of techniques that can be used to deal with multithreading if you design the project for it.
The most general and universal is simply "avoid shared state". Whenever possible, copy resources between threads, rather than making them access the same shared copy.
If you're writing the low-level synchronization code yourself, you have to remember to make absolutely no assumptions. Both the compiler and CPU may reorder your code, creating race conditions or deadlocks where none would seem possible when reading the code. The only way to prevent this is with memory barriers. And remember that even the simplest operation may be subject to threading issues. Something as simple as ++i is typically not atomic, and if multiple threads access i, you'll get unpredictable results.
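The ++i problem is a read-modify-write: the value is loaded, incremented and stored, and another thread can slip in between those steps. The sketch below shows the usual fix, in Python for brevity; the Counter class and hammer helper are invented for the example. Python's value += 1 has the same read-modify-write shape, so the lock is what makes the final count reliable.

```python
import threading


class Counter:
    """A counter whose increment is safe to call from several threads."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Load, add and store must happen as one unit, so hold the lock
        # for the whole read-modify-write.
        with self._lock:
            self.value += 1


def hammer(counter, threads=8, per_thread=10_000):
    """Increment the counter from many threads and return the final value."""
    def work():
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value
```

In C++ or C# the equivalent would be an atomic or interlocked increment, which avoids the lock but still exists precisely because the plain operation is not atomic.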
And of course, just because you've assigned a value to a variable, that's no guarantee that the new value will be visible to other threads. The compiler may defer actually writing it out to memory. Again, a memory barrier forces it to "flush" all pending memory I/O.
If I were you, I'd go with a higher level synchronization model than simple locks/mutexes/monitors/critical sections if possible. There are a few CSP libraries available for most languages and platforms, including .NET languages and native C++.
This usually makes race conditions and deadlocks trivial to detect and fix, and allows a ridiculous level of scalability. But there's a certain amount of overhead associated with this paradigm as well, so each thread might get less work done than it would with other techniques. It also requires the entire application to be structured specifically for this paradigm (so it's tricky to retrofit onto existing code, but since you're starting from scratch, it's less of an issue -- but it'll still be unfamiliar to you)
Another approach might be Transactional Memory. This is easier to fit into a traditional program structure, but also has some limitations, and I don't know of many production-quality libraries for it (STM.NET was recently released, and may be worth checking out. Intel has a C++ compiler with STM extensions built into the language as well)
But whichever approach you use, you'll have to think carefully about how to split the work up into independent tasks, and how to avoid cross-talk between threads. Any time two threads access the same variable, you have a potential bug. And any time two threads access the same variable or just another variable near the same address (for example, the next or previous element in an array), data will have to be exchanged between cores, forcing it to be flushed from CPU cache to memory, and then read into the other core's cache. Which can be a major performance hit.
Oh, and if you do write the application in C++, don't underestimate the language. You'll have to learn the language in detail before you'll be able to write robust code, much less robust threaded code.

One thing we've done in this situation that has worked really well for us is to break the work to be done into individual chunks and the actions on each chunk into different processors. Then we have chains of processors and data chunks can work through the chains independently. Each set of processors within the chain can run on multiple threads each and can process more or less data depending on their own performance relative to the other processors in the chain.
Also breaking up both the data and actions into smaller pieces makes the app much more maintainable and testable.

There's plenty of specific bits of individual advice that could be given here, and several people have done so already.
However nobody can tell you exactly how to make this all work for your specific requirements (which you don't even fully know yourself yet), so I'd strongly recommend you read up on HPC (High Performance Computing) for now to get the over-arching concepts clear and have a better idea which direction suits your needs the most.

The model you choose to use will be dictated by the structure of your data. Is your data tightly coupled or loosely coupled? If your simulation data is tightly coupled then you'll want to look at OpenMP or MPI (parallel computing). If your data is loosely coupled then a job pool is probably a better fit... possibly even a distributed computing approach could work.
My advice is get and read an introductory text to get familiar with the various models of concurrency/parallelism. Then look at your application's needs and decide which architecture you're going to need to use. After you know which architecture you need, then you can look at tools to assist you.
A fairly highly rated book which works as an introduction to the topic is "The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications".

Read about Erlang and the "Actor Model" in particular. If you make all your data immutable, you will have a much easier time parallelizing it.

Most of the other answers offer good advice regarding partitioning the project - look for tasks that can be cleanly executed in parallel with very little data sharing required. Be aware of non-thread safe constructs such as static or global variables, or libraries that are not thread safe. The worst one we've encountered is the TNT library, which doesn't even allow thread-safe reads under some circumstances.
As with all optimisation, concentrate on the bottlenecks first. Threading adds a lot of complexity, so you want to avoid it where it isn't necessary.
You'll need a good grasp of the various threading primitives (mutexes, semaphores, critical sections, conditions, etc.) and the situations in which they are useful.
One thing I would add, if you're intending to stay with C++, is that we have had a lot of success using the boost.thread library. It supplies most of the required multi-threading primitives, although it does lack a thread pool (and I would be wary of the unofficial "boost" thread pool one can locate via Google, because it suffers from a number of deadlock issues).

I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now.
You can either use C# that you're more familiar with or you can use managed C++.
At a high level, try to break up the program into System.Threading.Tasks.Task objects, which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For (or ForEach) and/or PLINQ where possible.
If you do this, a lot of the heavy lifting will be done for you in a very efficient way. It's the direction that Microsoft is going to increasingly support.

Sorry, I just want to add a pessimistic, or better, realistic answer here.
You are under time pressure. You have a 6-month deadline and you don't even know for sure what language this system is written in, what it does, or how it is organized. If it is not a trivial calculation, then this is a very bad start.
Most importantly: you say you have never done multithreaded programming before. This is where I get 4 alarm clocks ringing at once. Multithreading is difficult and takes a long time to learn when you want to do it right - and you need to do it right when you want to win a huge speed increase. Debugging is extremely nasty even with good tools like the TotalView debugger or Intel's VTune.
Then you say you want to rewrite the app in another language - well, this isn't so bad, as you have to rewrite it anyway. The chance of turning a single-threaded program into a well-working multithreaded one without a total redesign is almost zero.
But learning multithreading and a new language (what are your C++ skills?) with a timeline of 3 months (you have to write a throwaway prototype, so I cut the timespan in half) is extremely challenging.
My advice here is simple and you will not like it: learn multithreading now, because it will be a required skill set in the future, but leave this job to someone who already has experience. Well, unless you don't care about the program being successful and are just looking for 6 months' pay.

If it's possible to have all the threads working on disjoint sets of process data, and have other information stored in the SQL database, you can quite easily do it in C++, and just spawn off new threads to work on their own parts using the Windows API. The SQL server will handle all the hard synchronization magic with its DB transactions! And of course C++ will perform a lot faster than C#.
You should definitely revise C++ for this task, and understand the C++ code, and look for efficiency bugs in the existing code as well as adding the multi-threaded functionality.

You've tagged this question as C++ but mentioned that you're a C# developer currently, so I'm not sure if you'll be tackling this assignment from C++ or C#. Anyway, in case you're going to be using C# or .NET (including C++/CLI): I have the following MSDN article bookmarked and would highly recommend reading through it as part of your prep work.
Calling Synchronous Methods Asynchronously

Whatever technology you're going to write this in, take a look at this must-read book on concurrency, "Concurrent Programming in Java", and for .NET I highly recommend the Retlang library for concurrent apps.

I don't know if it was mentioned yet, but if I were in your shoes, what I would be doing right now (aside from reading every answer posted here) is writing a multithreaded example application in your favorite (most used) language.
I don't have extensive multithreaded experience. I've played around with it in the past for fun but I think gaining some experience with a throw-away application will suit your future efforts.
I wish you luck in this endeavor and I must admit I wish I had the opportunity to work on something like this...
