How to benchmark a crypto library? - security

What are good tests to benchmark a crypto library?
Which unit (time, CPU cycles...) should we use to compare different crypto libraries?
Are there any tools or procedures?
Any ideas or comments are welcome!
Thank you for your inputs!

I assume you mean performance benchmarks. I would say that both time and cycles are valid benchmarks, as some code may execute differently on different architectures (perhaps wildly differently if they're different enough).
If it is extremely important to you, I would do the testing myself. You can use some timer (almost all languages have one) or you can use some profiler (almost all languages have one of these too) to figure out the exact performance for the algorithms you are looking for on your target platform.
If you are looking at one algorithm vs. another one, you can look for data that others have already gathered and that will give you a rough idea. For instance, here are some benchmarks from Crypto++:
http://www.cryptopp.com/benchmarks.html
Note that they use MB/Second and Cycles/Byte as metrics. I think those are very good choices.
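If you do the timing yourself, the two metrics fall out of one measurement. Here is a minimal Python sketch, not the Crypto++ harness: it times the standard library's hashlib digests and reports MB/s plus an estimated cycles/byte. The function name benchmark_hash and the 2.0 GHz default are assumptions for illustration; substitute the real clock speed of your target machine.

```python
import hashlib
import time


def benchmark_hash(name, size_bytes=1 << 20, repeats=20, cpu_hz=2.0e9):
    """Time a hashlib digest; return (MB/s, estimated cycles per byte).

    cpu_hz is an ASSUMED clock frequency, not a measured one.
    """
    data = b"\x00" * size_bytes
    start = time.perf_counter()
    for _ in range(repeats):
        hashlib.new(name, data).digest()
    elapsed = time.perf_counter() - start

    total_bytes = size_bytes * repeats
    mb_per_s = total_bytes / elapsed / (1024 * 1024)  # MB taken as 1024*1024 bytes
    cycles_per_byte = cpu_hz * elapsed / total_bytes
    return mb_per_s, cycles_per_byte


if __name__ == "__main__":
    for algo in ("sha256", "sha512"):
        mbps, cpb = benchmark_hash(algo)
        print(f"{algo}: {mbps:.1f} MB/s, ~{cpb:.2f} cycles/byte (assuming 2.0 GHz)")
```

Run it several times and on an otherwise idle machine; frequency scaling and background load can move the numbers noticeably.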

Some very good answers before me, but keep in mind that optimizations are a very good way to leak key material through timing attacks (for example, see how devastating this can be for AES). If there is any chance an attacker can time your operations, you want not the fastest library but the most constant-time one available (and possibly the one with the most constant power usage, if there is any chance someone can monitor yours). OpenSSL does a great job of keeping on top of current attacks; I can't necessarily say the same of other libraries.
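A small illustration of what "constant time" means, using a different and simpler case than the AES attack: comparing two secret byte strings, such as MAC tags. This is a sketch in Python; leaky_equal is a hypothetical name for the naive version, and hmac.compare_digest is the standard library's comparison designed to resist this kind of timing analysis.

```python
import hmac


def leaky_equal(a: bytes, b: bytes) -> bool:
    """Naive comparison: stops at the first mismatch, so its run time
    depends on how many leading bytes the attacker guessed correctly."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False
    return True


def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Comparison whose run time does not depend on where the inputs differ."""
    return hmac.compare_digest(a, b)


if __name__ == "__main__":
    tag = bytes.fromhex("00112233445566778899aabbccddeeff")
    guess = bytes.fromhex("00112233445566778899aabbccddee00")
    print(leaky_equal(tag, guess), constant_time_equal(tag, guess))
```

Both functions return the same answers; the faster-on-average one is the one you do not want when an attacker can measure it.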

What are good tests to benchmark a crypto library?
The answers below are in the context of Crypto++. I don't know about other libraries, like OpenSSL, Botan, BouncyCastle, etc.
The Crypto++ library has a built-in benchmarking suite.
Which unit (time, CPU cycles...) should we use to compare different crypto libraries?
You typically measure performance in cycles-per-byte. Cycles-per-byte depends upon the CPU frequency. Another related metric is throughput measured in MB/s. It also depends upon CPU frequency.
Are there any tools, procedures....?
git clone https://github.com/weidai11/cryptopp.git
cd cryptopp
make static cryptest.exe
# 2.0 GHz (use KB=1024; not 1000)
make bench CRYPTOPP_CPU_SPEED=1.8626
make bench will create a file called benchmark.html.
If you want to manually run the tests, then:
./cryptest.exe b <time in seconds> <cpu speed in GHz>
It will output an HTML-like table without <HEAD> and <BODY> tags. You will still be able to view it in a web browser.
You can also check the Crypto++ benchmark page at Crypto++ Benchmarks. The information is dated, and updating it is on our TODO list.
You also need acumen for what looks right. For example, SSE4.2 and ARMv8 have a CRC32 instruction. Cycles-per-byte should go from about 3 or 5 cpb (software only) to about 1 or 1.5 cpb (hardware acceleration). It should equate to a change of roughly 300 or 500 MB/s (software only) to roughly 1.5 GB/s (hardware acceleration) on modern hardware running around 2 GHz.
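The two metrics are tied together by the clock speed: bytes per second = cycles per second / cycles per byte. A minimal Python sketch of that conversion follows, useful for sanity-checking figures like the ones above. The function names are made up for illustration, and MiB is taken as 1024*1024 bytes, matching the KB=1024 convention used by make bench.

```python
MIB = 1024 * 1024


def throughput_mib_per_s(cycles_per_byte, cpu_ghz):
    """Convert cycles per byte to MiB/s at the given clock speed."""
    bytes_per_s = cpu_ghz * 1e9 / cycles_per_byte
    return bytes_per_s / MIB


def cycles_per_byte(mib_per_s, cpu_ghz):
    """Inverse conversion: MiB/s back to cycles per byte."""
    return cpu_ghz * 1e9 / (mib_per_s * MIB)


if __name__ == "__main__":
    for cpb in (5.0, 3.0, 1.5, 1.0):
        print(f"{cpb} cpb at 2.0 GHz -> {throughput_mib_per_s(cpb, 2.0):.0f} MiB/s")
```

The exact results will not match rounded rules of thumb, but they should land in the same ballpark; if a measured number is off by a factor of several, suspect the benchmark or the build flags first.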
Other technologies, like SSE2 and NEON, are trickier to work with. There's a theoretical cycles-per-byte and throughput you should see, but you may not know what it is. You may need to contact the authors of the algorithm to find out. For example, we contacted the authors of BLAKE2 to learn if our ARMv7/ARMv8 NEON implementation was performing as expected because benchmark results for it were missing from the authors' homepage.
I've also found GCC 4.6 (and above) and -O3 can make a big difference in software-only implementations. That's because GCC heavily vectorizes at -O3, and you might witness a 2x to 2.5x speedup. For example, the compiler may generate code that runs at 40 cpb at -O2. At -O3 it may run at 15 or 19 cpb. A good SSE2 or NEON implementation should outperform the software-only implementation by at least a few cycles per byte. In the same example, the SSE2 or NEON implementation may run at 8 to 13 cpb.
There are also sites like OpenBenchmarking.org that may be able to provide some metrics for you.

My comments above aside, the US government has the FIPS program that you might want to look at. It's not perfect (by a long shot) but it's a start -- you can get an idea of the things they were looking at when evaluating cryptography.
I also suggest looking at the Computer Security Division of the NIST.
Also, on a side note ... reviewing what the master has to say (Bruce Schneier) on the subject of Security Pitfalls in Cryptography is always good. Also: Security is harder than it looks.

Related

Measuring the scaling behaviour of multithreaded applications

I am working on an application which supports many-core MIMD architectures (on consumer/desktop computers). I am currently worrying about the scaling behaviour of the application. It's designed to be massively parallel and to address next-gen hardware. That's actually my problem. Does anyone know any software to simulate/emulate many-core MIMD processors with >16 cores on a machine-code level? I've already implemented a software-based thread scheduler with the ability to simulate multiple processors, using simple timing techniques.
I was curious if there's any software which could do this kind of simulation on a lower level, preferably on an assembly language level, to get better results. I want to emphasize once again that I'm only interested in MIMD architectures. I know about OpenCL/CUDA/GPGPU but that's not what I'm looking for.
Any help is appreciated and thanks in advance for any answers.
You will rarely find all-purpose testing tools that are ALSO able to target very narrow (high-performance) corners - for a simple reason: the overhead of the "general-purpose" object will defeat that goal in the first place.
This is especially true with parallelism, where locality and scheduling have a huge impact.
All this to say that I am afraid you will have to write your own testing tool to target your exact usage pattern.
That's the price to pay for relevance.
If you are writing your application in C/C++, I might be able to help you out.
My company has created a tool that will take any C or C++ code and simulate its run-time behavior at bytecode level to find parallelization opportunities. The tool highlights parallelization opportunities and shows how parallel tasks behave.
The only mismatch is that our tool will also provide refactoring recipes to actually get to parallelization, whereas it sounds like you already have that.

Advice on starting a large multi-threaded programming project

My company currently runs a third-party simulation program (natural catastrophe risk modeling) that sucks up gigabytes of data off a disk and then crunches for several days to produce results. I will soon be asked to rewrite this as a multi-threaded app so that it runs in hours instead of days. I expect to have about 6 months to complete the conversion and will be working solo.
We have a 24-proc box to run this. I will have access to the source of the original program (written in C++ I think), but at this point I know very little about how it's designed.
I need advice on how to tackle this. I'm an experienced programmer (~ 30 years, currently working in C# 3.5) but have no multi-processor/multi-threaded experience. I'm willing and eager to learn a new language if appropriate. I'm looking for recommendations on languages, learning resources, books, architectural guidelines, etc.
Requirements: Windows OS. A commercial grade compiler with lots of support and good learning resources available. There is no need for a fancy GUI - it will probably run from a config file and put results into a SQL Server database.
Edit: The current app is C++ but I will almost certainly not be using that language for the re-write. I removed the C++ tag that someone added.
Numerical process simulations are typically run over a single discretised problem grid (for example, the surface of the Earth or clouds of gas and dust), which usually rules out simple task farming or concurrency approaches. This is because a grid divided over a set of processors representing an area of physical space is not a set of independent tasks. The grid cells at the edge of each subgrid need to be updated based on the values of grid cells stored on other processors, which are adjacent in logical space.
In high-performance computing, simulations are typically parallelised using either MPI or OpenMP. MPI is a message passing library with bindings for many languages, including C, C++, Fortran, Python, and C#. OpenMP is an API for shared-memory multiprocessing. In general, MPI is more difficult to code than OpenMP, and is much more invasive, but is also much more flexible. OpenMP requires a memory area shared between processors, so is not suited to many architectures. Hybrid schemes are also possible.
This type of programming has its own special challenges. As well as race conditions, deadlocks, livelocks, and all the other joys of concurrent programming, you need to consider the topology of your processor grid - how you choose to split your logical grid across your physical processors. This is important because your parallel speedup is a function of the amount of communication between your processors, which itself is a function of the total edge length of your decomposed grid. As you add more processors, this surface area increases, increasing the amount of communication overhead. Increasing the granularity will eventually become prohibitive.
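To see why the processor topology matters, here is a rough Python sketch under a simplifying assumption: a square n x n grid split into px x py equal blocks, with communication cost taken as proportional to the total length of the internal cuts. The function name and the 1024-cell grid are made up for illustration; a real code would also account for halo width, latency and the physical network.

```python
def cut_length(n, px, py):
    """Total length (in cells) of the internal cuts when an n x n grid
    is split into px columns by py rows of blocks."""
    return (px - 1) * n + (py - 1) * n


if __name__ == "__main__":
    n = 1024
    # Several ways to lay out 24 processors, plus smaller counts for comparison.
    for px, py in ((1, 1), (2, 2), (4, 4), (4, 6), (2, 12), (1, 24)):
        cuts = cut_length(n, px, py)
        print(f"{px * py:2d} procs as {px}x{py}: cut length {cuts}, "
              f"cells per cut cell {n * n / cuts if cuts else float('inf'):.0f}")
```

Under this model a 4x6 layout of 24 processors has about a third of the boundary of a 1x24 strip layout, and the ratio of work to communication falls steadily as processors are added.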
The other important consideration is the proportion of the code which can be parallelised. Amdahl's law then dictates the maximum theoretically attainable speedup. You should be able to estimate this before you start writing any code.
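Amdahl's law is simple enough to evaluate before writing any code: with a fraction p of the run time parallelisable over n processors, the speedup is 1 / ((1 - p) + p / n). A short Python sketch follows; the 95% figure is an assumed example, not a measurement of any real program.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Maximum theoretical speedup under Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    p = 0.95  # assumed parallelisable fraction
    for n in (2, 4, 8, 16, 24, 1000):
        print(f"{n:4d} processors: speedup {amdahl_speedup(p, n):.2f}")
```

With 95% of the work parallelisable, 24 processors give a speedup of roughly 11, and no number of processors can exceed 20. Profiling the existing serial program to estimate p is therefore worth doing early.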
Both of these facts will conspire to limit the maximum number of processors you can run on. The sweet spot may be considerably lower than you think.
I recommend the book High Performance Computing, if you can get hold of it. In particular, the chapter on performance benchmarking and tuning is priceless.
An excellent online overview of parallel computing, which covers the major issues, is this introduction from Lawrence Livermore National Laboratory.
Your biggest problem in a multithreaded project is that too much state is visible across threads - it is too easy to write code that reads / mutates data in an unsafe manner, especially in a multiprocessor environment where issues such as cache coherency, weakly consistent memory etc might come into play.
Debugging race conditions is distinctly unpleasant.
Approach your design as you would if, say, you were considering distributing your work across multiple machines on a network: that is, identify what tasks can happen in parallel, what the inputs to each task are, what the outputs of each task are, and what tasks must complete before a given task can begin. The point of the exercise is to ensure that each place where data becomes visible to another thread, and each place where a new thread is spawned, are carefully considered.
Once such an initial design is complete, there will be a clear division of ownership of data, and clear points at which ownership is taken / transferred; and so you will be in a very good position to take advantage of the possibilities that multithreading offers you - cheaply shared data, cheap synchronisation, lockless shared data structures - safely.
If you can split the workload up into non-dependent chunks of work (i.e., the data set can be processed in bits and there aren't lots of data dependencies), then I'd use a thread pool / task mechanism, presumably whatever C# has as an equivalent to Java's java.util.concurrent. I'd create work units from the data, wrap them in tasks, and then throw the tasks at the thread pool.
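The shape of that approach, sketched in Python with the standard concurrent.futures module rather than C# or Java. process_chunk is a hypothetical stand-in for the real per-chunk computation, and the chunk size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Hypothetical work unit: stands in for the real per-chunk computation."""
    return sum(x * x for x in chunk)


def split_into_chunks(data, chunk_size):
    """Cut the data set into independent, non-overlapping work units."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def run_in_pool(data, chunk_size=1000, workers=4):
    """Submit one task per chunk to a pool and combine the partial results."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(run_in_pool(list(range(10_000))))
```

In CPython, threads will not speed up pure-Python CPU-bound work because of the global interpreter lock; ProcessPoolExecutor has the same interface and is the usual substitute there. The structure (split, submit, combine) is the part that carries over to other languages.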
Of course performance might be a necessity here. If you can keep the original processing code kernel as-is, then you can call it from within your C# application.
If the code has lots of data dependencies, it may be a lot harder to break up into threaded tasks, but you might be able to break it up into a pipeline of actions. This means thread 1 passes data to thread 2, which passes data to threads 3 through 8, which pass data onto thread 9, etc.
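A minimal pipeline sketch in Python, assuming one thread per stage connected by queues (a simplification: the fan-out to several threads in the middle stage is left out). The names stage and run_pipeline are made up for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Apply func to each item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(items, funcs):
    """Run items through funcs, each function in its own thread."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2]))
```

Because each stage has a single thread and the queues are FIFO, results come out in input order. Throughput is limited by the slowest stage, which is why the slow stage is the one you would replicate across several threads.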
If the code has a lot of floating point mathematics, it might be worth looking at rewriting in OpenCL or CUDA, and running it on GPUs instead of CPUs.
For a 6-month project I'd say it definitely pays off to start by reading a good book about the subject. I would suggest Joe Duffy's Concurrent Programming on Windows. It's the most thorough book I know about the subject and it covers both .NET and native Win32 threading. I had been writing multithreaded programs for 10 years when I discovered this gem and still found things I didn't know in almost every chapter.
Also, "natural catastrophe risk modeling" sounds like a lot of math. Maybe you should have a look at Intel's IPP library: it provides primitives for many common low-level math and signal processing algorithms. It supports multi threading out of the box, which may make your task significantly easier.
There are a lot of techniques that can be used to deal with multithreading if you design the project for it.
The most general and universal is simply "avoid shared state". Whenever possible, copy resources between threads, rather than making them access the same shared copy.
If you're writing the low-level synchronization code yourself, you have to remember to make absolutely no assumptions. Both the compiler and CPU may reorder your code, creating race conditions or deadlocks where none would seem possible when reading the code. The only way to prevent this is with memory barriers. And remember that even the simplest operation may be subject to threading issues. Something as simple as ++i is typically not atomic, and if multiple threads access i, you'll get unpredictable results.
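The ++i problem exists in most languages, because the increment is a read, an add and a write. The race itself is nondeterministic and hard to show on demand, so this Python sketch shows only the fix: guarding the read-modify-write with a lock. SafeCounter and count_with_threads are hypothetical names for illustration.

```python
import threading


class SafeCounter:
    """Counter whose increment is protected by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Without the lock, two threads could both read the old value
        # and one of the two increments would be lost.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def count_with_threads(n_threads=8, increments=10_000):
    """Have several threads increment one shared counter; return the total."""
    counter = SafeCounter()

    def worker():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(count_with_threads())
```

In C# the equivalent for a plain counter would be an interlocked increment, and in C++ an atomic integer; the lock version generalises to updates that touch more than one variable.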
And of course, just because you've assigned a value to a variable, that's no guarantee that the new value will be visible to other threads. The compiler may defer actually writing it out to memory. Again, a memory barrier forces it to "flush" all pending memory I/O.
If I were you, I'd go with a higher level synchronization model than simple locks/mutexes/monitors/critical sections if possible. There are a few CSP libraries available for most languages and platforms, including .NET languages and native C++.
This usually makes race conditions and deadlocks trivial to detect and fix, and allows a ridiculous level of scalability. But there's a certain amount of overhead associated with this paradigm as well, so each thread might get less work done than it would with other techniques. It also requires the entire application to be structured specifically for this paradigm (so it's tricky to retrofit onto existing code, but since you're starting from scratch, it's less of an issue -- but it'll still be unfamiliar to you)
Another approach might be Transactional Memory. This is easier to fit into a traditional program structure, but also has some limitations, and I don't know of many production-quality libraries for it (STM.NET was recently released, and may be worth checking out. Intel has a C++ compiler with STM extensions built into the language as well)
But whichever approach you use, you'll have to think carefully about how to split the work up into independent tasks, and how to avoid cross-talk between threads. Any time two threads access the same variable, you have a potential bug. And any time two threads access the same variable or just another variable near the same address (for example, the next or previous element in an array), data will have to be exchanged between cores, forcing it to be flushed from CPU cache to memory, and then read into the other core's cache. Which can be a major performance hit.
Oh, and if you do write the application in C++, don't underestimate the language. You'll have to learn the language in detail before you'll be able to write robust code, much less robust threaded code.
One thing we've done in this situation that has worked really well for us is to break the work to be done into individual chunks and the actions on each chunk into different processors. Then we have chains of processors and data chunks can work through the chains independently. Each set of processors within the chain can run on multiple threads each and can process more or less data depending on their own performance relative to the other processors in the chain.
Also breaking up both the data and actions into smaller pieces makes the app much more maintainable and testable.
There's plenty of specific bits of individual advice that could be given here, and several people have done so already.
However nobody can tell you exactly how to make this all work for your specific requirements (which you don't even fully know yourself yet), so I'd strongly recommend you read up on HPC (High Performance Computing) for now to get the over-arching concepts clear and have a better idea which direction suits your needs the most.
The model you choose to use will be dictated by the structure of your data. Is your data tightly coupled or loosely coupled? If your simulation data is tightly coupled then you'll want to look at OpenMP or MPI (parallel computing). If your data is loosely coupled then a job pool is probably a better fit... possibly even a distributed computing approach could work.
My advice is get and read an introductory text to get familiar with the various models of concurrency/parallelism. Then look at your application's needs and decide which architecture you're going to need to use. After you know which architecture you need, then you can look at tools to assist you.
A fairly highly rated book which works as an introduction to the topic is "The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Applications".
Read about Erlang and the "Actor Model" in particular. If you make all your data immutable, you will have a much easier time parallelizing it.
Most of the other answers offer good advice regarding partitioning the project - look for tasks that can be cleanly executed in parallel with very little data sharing required. Be aware of non-thread safe constructs such as static or global variables, or libraries that are not thread safe. The worst one we've encountered is the TNT library, which doesn't even allow thread-safe reads under some circumstances.
As with all optimisation, concentrate on the bottlenecks first: threading adds a lot of complexity, so you want to avoid it where it isn't necessary.
You'll need a good grasp of the various threading primitives (mutexes, semaphores, critical sections, conditions, etc.) and the situations in which they are useful.
One thing I would add, if you're intending to stay with C++, is that we have had a lot of success using the boost.thread library. It supplies most of the required multi-threading primitives, although does lack a thread pool (and I would be wary of the unofficial "boost" thread pool one can locate via google, because it suffers from a number of deadlock issues).
I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now.
You can either use C# that you're more familiar with or you can use managed C++.
At a high level, try to break up the program into System.Threading.Tasks.Task objects, which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For (or ForEach) and/or PLINQ where possible.
If you do this, a lot of the heavy lifting will be done for you in a very efficient way. It's the direction that Microsoft is going to increasingly support.
http://msdn.microsoft.com/en-us/library/dd321424%28VS.100%29.aspx
Sorry, I just want to add a pessimistic, or rather realistic, answer here.
You are under time pressure. You have a 6-month deadline and you don't even know for sure what language this system is written in, what it does and how it is organized. If it is not a trivial calculation then this is a very bad start.
Most importantly: you say you have never done multithreaded programming before. This is where I get 4 alarm clocks ringing at once. Multithreading is difficult and takes a long time to learn if you want to do it right - and you need to do it right if you want to win a huge speed increase. Debugging is extremely nasty even with good tools like the TotalView debugger or Intel's VTune.
Then you say you want to rewrite the app in another language - well, this isn't as bad, since you have to rewrite it anyway. The chance of turning a single-threaded program into a well-working multithreaded one without a total redesign is almost zero.
But learning multithreading and a new language (what are your C++ skills?) with a timeline of 3 months (you have to write a throw-away prototype, so I cut the timespan into two halves) is extremely challenging.
My advice here is simple and you will not like it: learn multithreading now, because it is a required skill set in the future, but leave this job to someone who already has experience. Well, unless you don't care about the program being successful and are just looking for 6 months of payment.
If it's possible to have all the threads working on disjoint sets of process data, and have other information stored in the SQL database, you can quite easily do it in C++, and just spawn off new threads to work on their own parts using the Windows API. The SQL server will handle all the hard synchronization magic with its DB transactions! And of course C++ will perform a lot faster than C#.
You should definitely revise C++ for this task, and understand the C++ code, and look for efficiency bugs in the existing code as well as adding the multi-threaded functionality.
You've tagged this question as C++ but mentioned that you're a C# developer currently, so I'm not sure if you'll be tackling this assignment from C++ or C#. Anyway, in case you're going to be using C# or .NET (including C++/CLI): I have the following MSDN article bookmarked and would highly recommend reading through it as part of your prep work.
Calling Synchronous Methods Asynchronously
Whatever technology you're going to write this in, take a look at this must-read book on concurrency, "Concurrent Programming in Java", and for .NET I highly recommend the Retlang library for concurrent apps.
I don't know if it was mentioned yet, but if I were in your shoes, what I would be doing right now (aside from reading every answer posted here) is writing a multithreaded example application in your favorite (most used) language.
I don't have extensive multithreaded experience. I've played around with it in the past for fun but I think gaining some experience with a throw-away application will suit your future efforts.
I wish you luck in this endeavor and I must admit I wish I had the opportunity to work on something like this...

Programming on future hardware?

I want to practice writing code for future hardware. What should that be? The two main things that come to mind are 64-bit and multicore. I also note that cache is important, and GPUs have their own tech, but right now I am not interested in any graphics programming.
What else should I know about?
-edit- I know a lot of these are in the present, but pretty soon all CPUs will be multicore and threading will be more important. I considered endianness (big vs. little) but found it not to be important, and I already have a big-endian CPU to test on.
My recommendation for future :)
nVidia CUDA
nVidia Tegra
Or you can focus on ray tracing.
If you'd like to dive into a "mainstream" OS that has full 64-bit support, I suggest you start coding against the beta of Mac OS X "Snow Leopard" (codename for 10.6). One of the big enhancements is Grand Central, which is a "facility" for developers to code for multicore systems. Grand Central should distribute workload not only between cores, but also to the GPU.
Also very important is the explosion of smart devices such as the iPhone, Android, etc. I strongly believe that some upcoming so-called "netbooks" will rely on OSes such as Android and iPhone OS, and as such knowing how to code against their SDKs, and knowing how to optimize code for mobile devices, is very important (e.g. optimizing performance, graphics or otherwise, and battery usage).
I can't foretell the future, but one aspect to look into is something like the CELL processor used in the PS3, where instead of many identical general purpose cores, there is only one (although capable of simultaneous multithreading) plus many cores that are more specific purpose.
In a simple analysis, the Cell processor can be split into four components: external input and output structures, the main processor called the Power Processing Element (PPE) (a two-way simultaneous multithreaded Power ISA v.2.03 compliant core), eight fully-functional co-processors called the Synergistic Processing Elements, or SPEs, and a specialized high-bandwidth circular data bus connecting the PPE, input/output elements and the SPEs, called the Element Interconnect Bus or EIB.
CUDA and OpenCL are similar in that you separate your general purpose code and high performance computations into separate parts that may run on different hardware and language/api.
64 bits and multicore are the present not the future.
About the future:
Quantum computing or something like that?
How about learning OpenCL? It's a massively parallel processing language based on C. It's similar to nVidia's CUDA but is vendor agnostic. There are no major implementations yet, but expect to see some pretty soon.
As for 64 bit, don't really worry about it. Programming will not really be any different unless you're doing really low level stuff (kernels). Higher level frameworks such as Java and .NET allow you to run code on 32 bit and 64 bit machines. Even C/C++ allows you to do this (but not quite so transparently).
I agree with Oli's answer (+1) and would add that in addition to 64-bit environments, you look at multi-core environments. The industry is getting pretty close to the end of the cycle of improvements in raw speed. But we're seeing more and more multi-core CPUs. So parallel or concurrent programming -- which is rilly rilly hard -- is quickly becoming very much in demand.
How can you prepare for this and practice it? I've been asking myself the same question. So far, it seems to me like functional languages such as ML, Haskell, LISP, Arc, Scheme, etc. are a good place to begin, since truly functional languages are generally free of side effects and therefore very "parallelizable". Erlang is another interesting language.
Other interesting developments that I've found include
The Singularity Research OS
Transactional Memory and Software Isolated Processes
The many Software Engineering Podcast episodes on concurrency. (Here's the first one.)
This article from ACM Queue on "Real World Concurrency"
Of course this question is hard to answer because nobody knows what future hardware will look like (at least in the long term), but multi-threading/parallel programming is important and will definitely be even more important for some years.
I'd also suggest working with GPU computing like CUDA/Stream, but this could be a problem because it's very likely to change a lot in the next few years.

Trivial mathematical problems as language benchmarks

Why do people insist on using trivial mathematical problems like finding numbers in the Fibonacci sequence for language benchmarks? Don't these usually get optimized to relativistic speeds? Isn't the brunt of the bottlenecks usually in I/O, system API calls, operations on strings and structures, processing large quantities of data, abstract object-oriented stuff, etc?
It is a throwback to the old days, when compiler technology for what we would now call basic math was still evolving rapidly.
Now, compiler evolution is more focused on exploiting new instructions for niche operations, 64-bit math, and so on.
Micro-benchmarks such as the ones you mention were useful, though, when evaluating the efficiency of the hotspot compiler when Java was first launched, and in evaluating the efficiency of .NET versus C/C++.
Your suggestion that I/O and system calls are the likely bottlenecks is correct, at least for some space of problems. But I notice you suggested string operations. One person's irrelevant micro-benchmark is another person's critical performance metric.
EDIT: ps, I also remember using linpack and other micro-benchmarks to compare versions of the JVM, and to compare vendors of the JVM. From v4 to v5 there was a big jump in perf, I guess the JIT compiler got more effective. Also, IBM's JVM was ahead of Sun's at that time, on Windows-x86.
Because if you want to benchmark the language/compiler, these "math problems" are good indicators of the "bare speed" of the generated code. Either they use the iterative solution, which is a tight loop and indicates how well the compiler can push instructions to the processor, or they use the recursive solution, which indicates how it handles recursive calls of short functions (inlining, tail recursion, etc.), although the Ackermann function is often used for that too.
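The two variants look like this in a typical micro-benchmark. This is a sketch in Python using the standard timeit module; the argument 20 and the repeat count are arbitrary choices, and the absolute timings say more about the interpreter than about any compiler.

```python
import timeit


def fib_iterative(n):
    """Tight loop: stresses basic arithmetic and loop overhead."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_recursive(n):
    """Naive recursion: stresses the cost of many short function calls."""
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)


if __name__ == "__main__":
    for func in (fib_iterative, fib_recursive):
        seconds = timeit.timeit(lambda: func(20), number=50)
        print(f"{func.__name__}: {seconds:.4f} s for 50 calls")
```

The recursive version makes thousands of calls for the same result, so the gap between the two timings is essentially a measurement of call overhead.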
Usually, the benchmark suite for the language contains tests benchmarking other parts as well, e.g. gzip compression, text searching, object creation, virtual function calls, and exception throw/catch.
The other things you've noticed, syscalls and IO, are usually not included because:
syscalls are in fact not that slow - applications don't spend a significant portion of their time in the kernel, except for tests specifically targeted at them or when something is seriously wrong with the program
syscall and IO performance does not depend on the language, but rather on the OS & hardware
I'd think a simple, well-established algorithm would remove the possibility that the benchmark is biased (whether through ignorance or malice) to favor one language. It is very difficult to write a complex program in two different languages exactly the same. Testing something like the efficiency of a multithreaded application in C# vs. Java, for example, would require developers skilled in multithreaded development in both languages, and there would still be questions as to whether the benchmark app properly represents the general case, or if it is misrepresenting a special case that only one language handles well.
Back when the Sieve of Eratosthenes was a popular benchmark for C compilers, I thought it would be funny if one of the compiler authors would recognize the sieve code and replace it with a pre-computed lookup.

Machine dependent languages

Why might a machine-dependent language be more appropriate for writing certain types of programs? What types of programs would be appropriate?
Why might a machine-dependent language be more appropriate for writing certain types of programs?
Speed
Some machines have special instruction sets (like MMX or SSE on x86, for example) that let you exploit the architecture in ways that compilers may not utilize well (or at all). If speed is critical (such as in video games or data-crunching programs), then you'd want to get the most out of the architecture you're on.
Where Portability is Useless
When coding a program for a specific device (take the iPhone or the Nintendo DS as examples), portability is the least of your concerns. This code will most likely never go to another platform as it's specifically designed for that architecture/hardware combination.
Developer Ignorance and/or Market Demand
Computer video games are a prime example - Windows is the dominant computer game OS, so why target others? It lets the developers focus on known variables for speed/size/ease-of-use. Some developers are ignorant - they learn to code only on one platform (such as .NET) and 'forget' that other platforms exist because they don't know about them. They seem to take an approach similar to "It works on my machine, why should I bother porting it to a bizarre combination that I will never use?"
No other choice.
I will take the iPhone again as it is a very good example. While you can program for it in C or C++, you cannot access any of the UI widgets that are linked against the Objective-C runtime. You have no other choice but to code in Objective-C if you want to access any of those widgets.
What types of programs would be appropriate?
Embedded systems
All of the above apply - When you're coding for an embedded system, you want to take advantage of the full potential of the hardware you're working on. Be it memory management (Such as the CP15 on ARM9) or even obscure hardware that is only attached to the target device (servo motors, special sensors etc).
The best example I can think of is for small embedded devices. When you have to have full control over every detail of optimization due to extremely limited computing power (only a few kilobytes of RAM, for example), you might want to drop down to the assembler level yourself to make everything work perfectly in those small confines.
On the other hand, compilers have gotten sophisticated enough these days that you really don't need to drop below C for most situations, including embedded devices and microcontrollers. The situations where this is necessary are pretty rare.
Consider virtually any graphics engine. Since your run-of-the-mill general purpose CPU cannot perform operations in parallel, you would have a bare minimum of one cycle per pixel to be modified.
However, since modern GPUs can operate on many pixels (or other pieces of data) all at the same time, the same operation can be finished much more quickly. GPUs are very well-suited for embarrassingly parallel problems.
Granted, we have high-level-language APIs to control our video cards nowadays, but as you get "closer to the metal", the raw language used to control a GPU is a different animal from the language to control a general purpose CPU, due to the vast difference in architectures.

Resources