Language choice for small memory footprint

Language choice for small memory footprint - programming-languages

I have a tiny VPS where the memory is very scarce. I was thinking that for fun I might want to write some servers to run on it, that use as little memory as possible. Maybe something like a git-daemon, or anything else that comes up later, there are a lot of interesting techs out there which I'd love to try out for myself.
What programming language would you recommend if memory usage has the highest priority? I'm glad (even prefer) to learn something new.

Forth can be extremely compact.

The good old C, unless you're brave enough to all way down to assembly.
Why?
You probably don't want any VMT.
You probably don't want any dynamic typing.
You probably don't want any memory hungry VM.
It's the standard non assembly language for microcontrollers (very little memory), and C low memory footprint is one of the reasons.

I would suggest a language that has a dense virtual machine instruction set. Another answer here suggested Forth, which is surely a VM, but which I think fails that test by virtue of using pointers (non-dense fullwords) to select execution routines.
Google's compiled version of Java, Dalvik, is supposed to be designed with the intent to minimize memory footprint while being pretty fast to interpret. Being open source, apparantly you can get it and use it for your own purposes. You can likely bend it to avoid use of garbage collection to help manage the data storage footprint.
There is also a Cint, an interpreter for C with a small VM. Probably not as fast as Dalvik, which uses simulated registers rather than a stack.

Related

Does Racket offer anything for implementing a language with its own GC that could manage GPU memory?

I am working on this and I have some regrets with that I am going to have to do some kind of region based memory allocation scheme for GPU memory because .NET does not allow the adequate level of control over its GC.
I was too naive. I admit that it did cross my mind that just because I was on a platform with GC that I would (and should) not have to do manual memory management, nor would I need to know how the C malloc works nor how it is implemented. I want to do better than this.
What are Racket's facilities in this area?

No. GPU processors are not like CPUs, and practically speaking don't run any GC-ed language implementation, but some very low-level code (e.g. using OpenCL or CUDA or OpenACC or SPIR). They don't really have some general purpose dynamic memory allocation, and they might not even have any virtual memory or MMU. Their memory is generally separate.
What you could do is use some existing library having some GPU compute kernels (like TensorFlow, OpenCV, etc...) and call that library from your Racket based thing using some foreign function interface.
What you might do with a lot of work (probably several years) is to generate some kernel code in OpenCL or CUDA (or SPIR) -mixed with some other generated code managing that kernel code-, that is to implement a compiler from a small subset (to be painfully defined) of your Spiral language into OpenCL or CUDA kernels. In that case, the evil is in the details (and the kernel code you'll generate would depend upon the particular GPU model). You could look into SPOC for inspiration.
nor would I need to know how the C malloc works nor how it is implemented.
It is much worse than that. You'll need to care of a lot of low level details, you'll need to code stuff specific to your OS and hardware, and understanding C malloc is easier than taking care of all the GPU details (that is, generating the "right" GPU and glue code: dive into the specifications of OpenCL for more).
(I believe that it is not worth the effort -several years- to compile your Spiral into GPU kernel code and the necessary glue code running in the CPU)
You should also read more about garbage collection, e.g. the GC handbook.
I was too naive.
You probably still are. Your subject is harder than what you think, if you want an efficient and competitive implementation. Coding a naïve GC (or VM) is easy, but coding an efficient one is hard (requiring several years of work).
I want to do better than this.
You'll need several years of full time work.

Advice on starting a large multi-threaded programming project

My company currently runs a third-party simulation program (natural catastrophe risk modeling) that sucks up gigabytes of data off a disk and then crunches for several days to produce results. I will soon be asked to rewrite this as a multi-threaded app so that it runs in hours instead of days. I expect to have about 6 months to complete the conversion and will be working solo.
We have a 24-proc box to run this. I will have access to the source of the original program (written in C++ I think), but at this point I know very little about how it's designed.
I need advice on how to tackle this. I'm an experienced programmer (~ 30 years, currently working in C# 3.5) but have no multi-processor/multi-threaded experience. I'm willing and eager to learn a new language if appropriate. I'm looking for recommendations on languages, learning resources, books, architectural guidelines. etc.
Requirements: Windows OS. A commercial grade compiler with lots of support and good learning resources available. There is no need for a fancy GUI - it will probably run from a config file and put results into a SQL Server database.
Edit: The current app is C++ but I will almost certainly not be using that language for the re-write. I removed the C++ tag that someone added.

Numerical process simulations are typically run over a single discretised problem grid (for example, the surface of the Earth or clouds of gas and dust), which usually rules out simple task farming or concurrency approaches. This is because a grid divided over a set of processors representing an area of physical space is not a set of independent tasks. The grid cells at the edge of each subgrid need to be updated based on the values of grid cells stored on other processors, which are adjacent in logical space.
In high-performance computing, simulations are typically parallelised using either MPI or OpenMP. MPI is a message passing library with bindings for many languages, including C, C++, Fortran, Python, and C#. OpenMP is an API for shared-memory multiprocessing. In general, MPI is more difficult to code than OpenMP, and is much more invasive, but is also much more flexible. OpenMP requires a memory area shared between processors, so is not suited to many architectures. Hybrid schemes are also possible.
This type of programming has its own special challenges. As well as race conditions, deadlocks, livelocks, and all the other joys of concurrent programming, you need to consider the topology of your processor grid - how you choose to split your logical grid across your physical processors. This is important because your parallel speedup is a function of the amount of communication between your processors, which itself is a function of the total edge length of your decomposed grid. As you add more processors, this surface area increases, increasing the amount of communication overhead. Increasing the granularity will eventually become prohibitive.
The other important consideration is the proportion of the code which can be parallelised. Amdahl's law then dictates the maximum theoretically attainable speedup. You should be able to estimate this before you start writing any code.
Both of these facts will conspire to limit the maximum number of processors you can run on. The sweet spot may be considerably lower than you think.
I recommend the book High Performance Computing, if you can get hold of it. In particular, the chapter on performance benchmarking and tuning is priceless.
An excellent online overview of parallel computing, which covers the major issues, is this introduction from Lawerence Livermore National Laboratory.

Your biggest problem in a multithreaded project is that too much state is visible across threads - it is too easy to write code that reads / mutates data in an unsafe manner, especially in a multiprocessor environment where issues such as cache coherency, weakly consistent memory etc might come into play.
Debugging race conditions is distinctly unpleasant.
Approach your design as you would if, say, you were considering distributing your work across multiple machines on a network: that is, identify what tasks can happen in parallel, what the inputs to each task are, what the outputs of each task are, and what tasks must complete before a given task can begin. The point of the exercise is to ensure that each place where data becomes visible to another thread, and each place where a new thread is spawned, are carefully considered.
Once such an initial design is complete, there will be a clear division of ownership of data, and clear points at which ownership is taken / transferred; and so you will be in a very good position to take advantage of the possibilities that multithreading offers you - cheaply shared data, cheap synchronisation, lockless shared data structures - safely.

If you can split the workload up into non-dependent chunks of work (i.e., the data set can be processed in bits, there aren't lots of data dependencies), then I'd use a thread pool / task mechanism. Presumably whatever C# has as an equivalent to Java's java.util.concurrent. I'd create work units from the data, and wrap them in a task, and then throw the tasks at the thread pool.
Of course performance might be a necessity here. If you can keep the original processing code kernel as-is, then you can call it from within your C# application.
If the code has lots of data dependencies, it may be a lot harder to break up into threaded tasks, but you might be able to break it up into a pipeline of actions. This means thread 1 passes data to thread 2, which passes data to threads 3 through 8, which pass data onto thread 9, etc.
If the code has a lot of floating point mathematics, it might be worth looking at rewriting in OpenCL or CUDA, and running it on GPUs instead of CPUs.

For a 6 month project I'd say it definitely pays out to start reading a good book about the subject first. I would suggest Joe Duffy's Concurrent Programming on Windows. It's the most thorough book I know about the subject and it covers both .NET and native Win32 threading. I've written multithreaded programs for 10 years when I discovered this gem and still found things I didn't know in almost every chapter.
Also, "natural catastrophe risk modeling" sounds like a lot of math. Maybe you should have a look at Intel's IPP library: it provides primitives for many common low-level math and signal processing algorithms. It supports multi threading out of the box, which may make your task significantly easier.

There are a lot of techniques that can be used to deal with multithreading if you design the project for it.
The most general and universal is simply "avoid shared state". Whenever possible, copy resources between threads, rather than making them access the same shared copy.
If you're writing the low-level synchronization code yourself, you have to remember to make absolutely no assumptions. Both the compiler and CPU may reorder your code, creating race conditions or deadlocks where none would seem possible when reading the code. The only way to prevent this is with memory barriers. And remember that even the simplest operation may be subject to threading issues. Something as simple as ++i is typically not atomic, and if multiple threads access i, you'll get unpredictable results.
And of course, just because you've assigned a value to a variable, that's no guarantee that the new value will be visible to other threads. The compiler may defer actually writing it out to memory. Again, a memory barrier forces it to "flush" all pending memory I/O.
If I were you, I'd go with a higher level synchronization model than simple locks/mutexes/monitors/critical sections if possible. There are a few CSP libraries available for most languages and platforms, including .NET languages and native C++.
This usually makes race conditions and deadlocks trivial to detect and fix, and allows a ridiculous level of scalability. But there's a certain amount of overhead associated with this paradigm as well, so each thread might get less work done than it would with other techniques. It also requires the entire application to be structured specifically for this paradigm (so it's tricky to retrofit onto existing code, but since you're starting from scratch, it's less of an issue -- but it'll still be unfamiliar to you)
Another approach might be Transactional Memory. This is easier to fit into a traditional program structure, but also has some limitations, and I don't know of many production-quality libraries for it (STM.NET was recently released, and may be worth checking out. Intel has a C++ compiler with STM extensions built into the language as well)
But whichever approach you use, you'll have to think carefully about how to split the work up into independent tasks, and how to avoid cross-talk between threads. Any time two threads access the same variable, you have a potential bug. And any time two threads access the same variable or just another variable near the same address (for example, the next or previous element in an array), data will have to be exchanged between cores, forcing it to be flushed from CPU cache to memory, and then read into the other core's cache. Which can be a major performance hit.
Oh, and if you do write the application in C++, don't underestimate the language. You'll have to learn the language in detail before you'll be able to write robust code, much less robust threaded code.

One thing we've done in this situation that has worked really well for us is to break the work to be done into individual chunks and the actions on each chunk into different processors. Then we have chains of processors and data chunks can work through the chains independently. Each set of processors within the chain can run on multiple threads each and can process more or less data depending on their own performance relative to the other processors in the chain.
Also breaking up both the data and actions into smaller pieces makes the app much more maintainable and testable.

There's plenty of specific bits of individual advice that could be given here, and several people have done so already.
However nobody can tell you exactly how to make this all work for your specific requirements (which you don't even fully know yourself yet), so I'd strongly recommend you read up on HPC (High Performance Computing) for now to get the over-arching concepts clear and have a better idea which direction suits your needs the most.

The model you choose to use will be dictated by the structure of your data. Is your data tightly coupled or loosely coupled? If your simulation data is tightly coupled then you'll want to look at OpenMP or MPI (parallel computing). If your data is loosely coupled then a job pool is probably a better fit... possibly even a distributed computing approach could work.
My advice is get and read an introductory text to get familiar with the various models of concurrency/parallelism. Then look at your application's needs and decide which architecture you're going to need to use. After you know which architecture you need, then you can look at tools to assist you.
A fairly highly rated book which works as an introduction to the topic is "The Art of Concurrency: A Thread Monkey's Guide to Writing Parallel Application".

Read about Erlang and the "Actor Model" in particular. If you make all your data immutable, you will have a much easier time parallelizing it.

Most of the other answers offer good advice regarding partitioning the project - look for tasks that can be cleanly executed in parallel with very little data sharing required. Be aware of non-thread safe constructs such as static or global variables, or libraries that are not thread safe. The worst one we've encountered is the TNT library, which doesn't even allow thread-safe reads under some circumstances.
As with all optimisation, concentrate on the bottlenecks first, because threading adds a lot of complexity you want to avoid it where it isn't necessary.
You'll need a good grasp of the various threading primitives (mutexes, semaphores, critical sections, conditions, etc.) and the situations in which they are useful.
One thing I would add, if you're intending to stay with C++, is that we have had a lot of success using the boost.thread library. It supplies most of the required multi-threading primitives, although does lack a thread pool (and I would be wary of the unofficial "boost" thread pool one can locate via google, because it suffers from a number of deadlock issues).

I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now.
You can either use C# that you're more familiar with or you can use managed C++.
At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For (or ForEach) and/or PLINQ where possible.
If you do this, a lot of the heavy lifting will be done for you in a very efficient way. It's the direction that Microsoft is going to increasingly support.
2: I would consider doing this in .NET 4.0 since it has a lot of new support specifically targeted at making writing concurrent code easier. Its official release date is March 22, 2010, but it will probably RTM before then and you can start with the reasonably stable Beta 2 now. At a high level, try to break up the program into System.Threading.Tasks.Task's which are individual units of work. In addition, I'd minimize use of shared state and consider using Parallel.For and/or PLINQ where possible. If you do this, a lot of the heavy lifting will be done for you in a very efficient way. 1: http://msdn.microsoft.com/en-us/library/dd321424%28VS.100%29.aspx

Sorry i just want to add a pessimistic or better realistic answer here.
You are under time pressure. 6 month deadline and you don't even know for sure what language is this system and what it does and how it is organized. If it is not a trivial calculation then it is a very bad start.
Most importantly: You say you have never done mulitithreading programming before. This is where i get 4 alarm clocks ringing at once. Multithreading is difficult and takes a long time to learn it when you want to do it right - and you need to do it right when you want to win a huge speed increase. Debugging is extremely nasty even with good tools like Total Views debugger or Intels VTune.
Then you say you want to rewrite the app in another lanugage - well this isn't as bad as you have to rewrite it anyway. THe chance to turn a single threaded Program into a well working multithreaded one without total redesign is almost zero.
But learning multithreading and a new language (what is your C++ skills?) with a timeline of 3 month (you have to write a throw away prototype - so i cut the timespan into two halfs) is extremely challenging.
My advise here is simple and will not like it: Learn multithreadings now - because it is a required skill set in the future - but leave this job to someone who already has experience. Well unless you don't care about the program being successfull and are just looking for 6 month payment.

If it's possible to have all the threads working on disjoint sets of process data, and have other information stored in the SQL database, you can quite easily do it in C++, and just spawn off new threads to work on their own parts using the Windows API. The SQL server will handle all the hard synchronization magic with its DB transactions! And of course C++ will perform a lot faster than C#.
You should definitely revise C++ for this task, and understand the C++ code, and look for efficiency bugs in the existing code as well as adding the multi-threaded functionality.

You've tagged this question as C++ but mentioned that you're a C# developer currently, so I'm not sure if you'll be tackling this assignment from C++ or C#. Anyway, in case you're going to be using C# or .NET (including C++/CLI): I have the following MSDN article bookmarked and would highly recommend reading through it as part of your prep work.
Calling Synchronous Methods Asynchronously

Whatever technology your going to write this, take a look a this must read book on concurrency "Concurrent programming in Java" and for .Net I highly recommend the retlang library for concurrent app.

I don't know if it was mentioned yet, but if I were in your shoes, what I would be doing right now (aside from reading every answer posted here) is writing a multiple threaded example application in your favorite (most used) language.
I don't have extensive multithreaded experience. I've played around with it in the past for fun but I think gaining some experience with a throw-away application will suit your future efforts.
I wish you luck in this endeavor and I must admit I wish I had the opportunity to work on something like this...

Mixing Erlang and Haskell

If you've bought into the functional programming paradigm, the chances are that you like both Erlang and Haskell. Both have purely functional cores and other goodness such as lightweight threads that make them a good fit for a multicore world. But there are some differences too.
Erlang is a commercially proven fault-tolerant language with a mature distribution model. It has a seemingly unique feature in its ability to upgrade its version at runtime via hot code loading. (Way cool!)
Haskell, on the otherhand, has the most sophisticated type system of any mainstream language. (Where I define 'mainstream' to be any language that has a published O'Reilly book so Haskell counts.) Its straightline single threaded performance looks superior to Erlang's and its lightweight threads look even lighter too.
I am trying to put together a development platform for the rest of my coding life and was wondering whether it was possible to mix Erlang and Haskell to achieve a best of breed platform. This question has two parts:
I'd like to use Erlang as a kind of fault tolerant MPI to glue GHC runtime instances together. There would be one Erlang process per GHC runtime. If "the impossible happened" and the GHC runtime died, then the Erlang process would detect that somehow and die too. Erlang's hot code loading and distribution features would just continue to work. The GHC runtime could be configured to use just one core, or all cores on the local machine, or any combination in between. Once the Erlang library was written, the rest of the Erlang level code should be purely boilerplate and automatically generated on a per application basis. (Perhaps by a Haskell DSL for example.) How does one achieve at least some of these things?
I'd like Erlang and Haskell to be able to share the same garabage collector. (This is a much further out idea than 1.) Languages that run on the JVM and the CLR achieve greater mass by sharing a runtime. I understand there are technical limitations to running Erlang (hot code loading) and Haskell (higher kinded polymorphism) on either the JVM or the CLR. But what about unbundling just the garbage collector? (Sort of the start of a runtime for functional languages.) Allocation would obviously still have to be really fast, so maybe that bit needs to be statically linked in. And there should be some mechansim to distinguish the mutable heap from the immutable heap (incuding lazy write once memory) as GHC needs this. Would it be feasible to modify both HIPE and GHC so that the garbage collectors could share a heap?
Please answer with any experiences (positive or negative), ideas or suggestions. In fact, any feedback (short of straight abuse!) is welcome.
Update
Thanks for all 4 replies to date - each taught me at least one useful thing that I did not know.
Regarding the rest of coding life thing - I included it slightly tongue in cheek to spark debate, but it is actually true. There is a project that I have in mind that I intend to work on until I die, and it needs a stable platform.
In the platform I have proposed above, I would only write Haskell, as the boilerplate Erlang would be automatically generated. So how long will Haskell last? Well Lisp is still with us and doesn't look like it is going away anytime soon. Haskell is BSD3 open source and has achieved critical mass. If programming itself is still around in 50 years time, I would expect Haskell, or some continuous evolution of Haskell, will still be here.
Update 2 in response to rvirding's post
Agreed - implementing a complete "Erskell/Haslang" universal virtual machine might not be absolutely impossible, but it would certainly be very difficult indeed. Sharing just the garbage collector level as something like a VM, while still difficult, sounds an order of magnitude less difficult to me though. At the garbage collection model, functional languages must have a lot in common - the unbiquity of immutable data (including thunks) and the requirement for very fast allocation. So the fact that commonality is bundled tightly with monolithic VMs seems kind of odd.
VMs do help achieve critical mass. Just look at how 'lite' functional languages like F# and Scala have taken off. Scala may not have the absolute fault tolerance of Erlang, but it offers an escape route for the very many folks who are tied to the JVM.
While having a single heap makes
message passing very fast it
introduces a number of other problems,
mainly that doing GC becomes more
difficult as it has to be interactive
and globally non-interruptive so you
can't use the same simpler algorithms
as the per-process heap model.
Absolutely, that makes perfect sense to me. The very smart people on the GHC development team appear to be trying to solve part of the problem with a parallel "stop the world" GC.
http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/par-gc-ismm08.pdf
(Obviously "stop the world" would not fly for general Erlang given its main use case.) But even in the use cases where "stop the world" is OK, their speedups do not appear to be universal. So I agree with you, it is unlikely that there is a universally best GC, which is the reason I specified in part 1. of my question that
The GHC runtime could be configured to
use just one core, or all cores on the
local machine, or any combination in
between.
In that way, for a given use case, I could, after benchmarking, choose to go the Erlang way, and run one GHC runtime (with a singlethreaded GC) plus one Erlang process per core and let Erlang copy memory between cores for good locality.
Alternatively, on a dual processor machine with 4 cores per processor with good memory bandwidth on the processor, benchmarking might suggest that I run one GHC runtime (with a parallel GC) plus one Erlang process per processor.
In both cases, if Erlang and GHC could share a heap, the sharing would probably be bound to a single OS thread running on a single core somehow. (I am getting out of my depth here, which is why I asked the question.)
I also have another agenda - benchmarking functional languages independently of GC. Often I read of results of benchmarks of OCaml v GHC v Erlang v ... and wonder how much the results are confounded by the different GCs. What if choice of GC could be orthogonal to choice of functional language? How expensive is GC anyway? See this devil advocates blog post
http://john.freml.in/garbage-collection-harmful
by my Lisp friend John Fremlin, which he has, charmingly, given his post title "Automated garbage collection is rubbish". When John claims that GC is slow and hasn't really sped up that much, I would like to be able to counter with some numbers.

A lot of Haskell and Erlang people are interested in the model where Erlang supervises distribution, while Haskell runs the shared memory nodes in parallel doing all the number crunching/logic.
A start towards this is the haskell-erlang library: http://hackage.haskell.org/package/erlang
And we have similar efforts in Ruby land, via Hubris: http://github.com/mwotton/Hubris/tree/master
The question now is to find someone to actually push through the Erlang / Haskell interop to find out the tricky issues.

You're going to have an interesting time mixing GC between Haskell and Erlang. Erlang uses a per-process heap and copies data between processes -- as Haskell doesn't even have a concept of processes, I'm not sure how you would map this "universal" GC between the two. Furthermore, for best performance, Erlang uses a variety of allocators, each with slightly tweaked behaviours that I'm sure would affect the GC sub-system.
As with all things in software, abstraction comes at a cost. In this case, I rather suspect you'd have to introduce so many layers to get both languages over their impedance mismatch that you'd wind up with a not very performant (or useful) common VM.
Bottom line -- embrace the difference! There are huge advantages to NOT running everything in the same process, particularly from a reliability standpoint. Also, I think it's a little naive to expect one language/VM to last you for the rest of your life (unless you plan on a.) living a short time or b.) becoming some sort of code monk that ONLY works on a single project). Software development is all about mental agility and being willing to use the best available tools to build fast, reliable code.

Although this is a pretty old thread, if readers are still interested then it's worth taking a look at Cloud Haskell, which brings Erlang style concurrency and distribution to the GHC stable.
The forthcoming distributed-process-platform library adds support for OTP-esque constructs like gen_servers, supervision trees and various other "haskell flavoured" abstractions borrowed from and inspired by Erlang/OTP.

You could use an OTP gen_supervisor process to monitor Haskell instances that you spawn with open_port(). Depending on how the "port" exited, you would then be able to restart it or decide that it stopped on purpose and let the corresponding Erlang process die, too.
Fugheddaboudit. Even these language-independent VMs you speak of have trouble with data passed between languages sometimes. You should just serialize data between the two somehow: database, XML-RPC, something like that.
By the way, the idea of a single platform for the rest of your life is probably impractical, too. Computing technology and fashion change too often to expect that you can keep using just one language forever. Your very question points this out: no one language does everything we might wish, even today.

As dizzyd mentioned in his comment not all data in messages is copied, large binaries exist outside of the process heaps and are not copied.
Using a different memory structure to avoid having separate per-process heaps is certainly possible and has been done in a number of earlier implementations. While having a single heap makes message passing very fast it introduces a number of other problems, mainly that doing GC becomes more difficult as it has to be interactive and globally non-interruptive so you can't use the same simpler algorithms as the per-process heap model.
As long as we use have immutable data-structures there is no problem with robustness and safety. Deciding on which memory and GC models to use is a big trade-off, and unfortunately there universally best model.
While Haskell and Erlang are both functional languages they are in many respects very different languages and have very different implementations. It would difficult to come up with an "Erskell" (or Haslang) machine which could handle both languages efficiently. I personally think it is much better to keep them separate and to make sure you have a really good interface between them.

The CLR supports tail call optimization with an explicit tail opcode (as used by F#), which the JVM doesn't (yet) have an equivalent, which limits the implementation of such a style of language. The use of separate AppDomains does allow the CLR to hot-swap code (see e.g. this blog post showing how it can be done).
With Simon Peyton Jones working just down the corridor from Don Syme and the F# team at Microsoft Research, it would be a great disappointment if we didn't eventually see an IronHaskell with some sort of official status. An IronErlang would be an interesting project -- the biggest piece of work would probably be porting the green-threading scheduler without getting as heavyweight as the Windows Workflow engine, or having to run a BEAM VM on top the CLR.

Your predictions on Languages evolution

Well, I know it's not all about speed and memory usage.
But I would like to know what you think will happen to most of the high-level programming languages. As far as I know, Java is much faster than it was in the past, what about python, php etc.

Speed has more to with Moore's law than the language itself. So if you are looking in absolute terms, you'll get more bangs for more buck by just upgrading your machine on a regular basis.
In terms of memory footprint, I expect most languages to continue gathering functionality thus increasing their footprint.

High level programming languages will continue to get more abstractions that make it easier for developers to specificy what they want a computer to do, without having to get their hands dirty with difficult underlying details that a compiler and/or runtime system is better at optimizing anyway than any developer might be able to do a priori.
Think about:
support for multi-threaded execution (like Parallel Extentions in latest .NET)
specifying structure and functional outcome instead of manually telling computer exactly how and in what order to shuffle which sets of bits around
Those kinds of things.

Parallelism, given that increasing the number of processing units (cores) is the principal way of gaining speed nowadays. To make it manageable to humans, software transactional memory seems to be one of the most promising real-world solutions.

Which language would you use in your OS?

This is probably more of a subjective question, but which language (not API like .NET or JDK) would you use should you write your own operating system? Which language provides flexibility, simplicity, and possibly a low-level interface to the hardware? I was thinking Java or C...

C, of course.

Haskell.
Once you have flipped the right hardware bits, C is a terrible language to use for the rest of the OS. Things like the scheduler, filesystems, drivers, etc. are complex high-level algorithms, and you don't want to be writing those in assembly language (or C; same thing). It's too hard to get right. (The VM subsystem and memory manager may need to be written in something low-level, as you will need to bootstrap your high-level langauge's runtime somehow.)
Anyway, this isn't just a crazy idea that I am coming up with for SO. Here is an OS written in Haskell: http://programatica.cs.pdx.edu/House/
Lisp is another good choice; the original Lisp machines were infinitely more tweakable (at runtime) than "modern" OSes like UNIX and Windows.
Sometimes history forgets good ideas (often in the name of "maximum performance"), and that makes me sad.

D would be an interesting choice. From its own description:
D is a systems programming language. Its focus is on combining the power and high performance of C and C++ with the programmer productivity of modern languages like Ruby and Python. Special attention is given to the needs of quality assurance, documentation, management, portability and reliability.
The D runtime assumes the existence of a garbage collector, which would not be appropriate for the very lowest levels of the kernel. However, it would be appropriate for many of the higher layers.

Build the basic components like task schedulers and drivers etc with Assembly, then build the higher level components like applications and tools with C
I believe this is how Windows XP was built too (unsure about Windows Vista and Windows 7).

Definitely... C

C, ASM, C#
Singularity

Low-level in something like Haskell or D. Productivity over performance, in my opinion. You can rewrite slow parts in C++ or even assembly later if the need arises.
High-level in Python or Ruby. Ideally I'd also have a really fast JIT-capable VM for that language, but that's not going to happen for either language for a while. Lua would be a good alternative if speed gets in the way.

The kernel has to be written in a low-level language, C is by far the best choice for this, because it is so memory efficient. The higher levels could be built with a combination of Java or more ideally Objective-C, and scripting languages like python and ruby, or lua.

Honestly, I would either use C or some hierarchy of languages that I had either designed or fit together completely seamlessly. What I would be looking for is a seamless experience that starts at the bare metal level and then I could move to higher and higher level languages as I moved up the problem space. I would probably chose something like:
C - for bare metal stuff like drivers, kernel, etc
Java/C# - for application-level things like administration consoles, OS apps
Python/PowerShell - for scripting activities like common administrative tasks (creating a new user, etc)
Personally, I think C/C#/PowerShell is more tightly integrated and the type of experience I'd be looking for. Of course, if I ever got so ambitious as to write an OS, I would have a lot of spare time on my hands and would probably really enjoy tackling the language stack first. So maybe it would be L/L#/LScript ...

BitC seems to have this in mind. Despite it's name it seems to be the midpoint of assembly language and lisp. The goal was to make a language with a strong correspondence with machine language but have an intermediate representation that supports stronger correctness inferences than is possible with most other common languages. The languages was created as part of the Coyotos project, an operating system with lofty goals of security and reliability. Formal verification is made significantly possible with the ir used in BitC.

Ada:
Ada is a structured, statically typed, imperative, and object-oriented high-level computer programming language, extended from Pascal and other languages. It was originally designed by a team led by Jean Ichbiah of CII Honeywell Bull under contract to the United States Department of Defense (DoD) from 1977 to 1983 to supersede the hundreds of programming languages then used by the DoD. Ada is strongly typed and compilers are validated for reliability in mission-critical applications, such as avionics software.
Ada, because it was not only specifically developed for such projects, but it also provides support for several very useful high level features (such as support for strong typing, concurrency and abstraction) that are simply not available in standard C.
So that, even as a project grows, you don't have to work around language limitations (think encapsulation, abstraction, namespaces in C).
Don't get me wrong, C works obviously for a great many of projects, but once a project has gained a certain size (think Linux kernel, gcc, GNOME), you will inevitably appreciate certain features of more high level languages to make the development process less tedious and also less obfuscated.
In C however, these features usually end up being -pretty poorly- emulated by excessive and almost pervert use of the pre-processor (this can for example be seen in the gcc code base), so that you get to see lots of nested macros, that from an implementation point of view, actually emulate features found in other programming languages.
In addition, Ada is the only programming language, that I am aware of, that actually provides standardized support for source code analysis using the ASIS, having such a facility in place is however the prerequisite to actually be able to maintain and transform/re-engineer a code base in the long run (think refactoring).
Having an interface like ASIS available, means that you can actually implement "semantic patching", where you can automatically rename a file, function or variable/data structure and it will actually work.

Java ?? no jave runs on a virtual machine which needs an os to run on top of ,
maybe C and some ASM ;)

I would go with D to see whether it can do it.

I would only pick the following 3 out of practicality.
C (good old fashioned)
C++ (C with stuff tacked on. Windows is partially written in this)
Java (the medium level language that just might have a capable garbage collector with controllable pauses with G1).

If I were going to start a new OS I'd do it with the subset of C++ recommended by the embedded industry. You can use things like classes and use it "as a better C" and be just as fast. Just avoid things that have massive overhead. You can even use some template features, if you stick to a certain subet that basically don't have any overhead. Look on embedded.com for features in C++ that have little to no overhead, but will allow you organize your code better than you ever could in C.

Oberon? I guess I miss Pascal too much some times. C paid the bills for quite a while, but I don't really love it.

Lisp of course!
Title text: Some say the world will end in fire; some say in segfaults.

For an OS, you want speed at the lowest levels. So assembly, C, C++, Objective-C, or Java seem to be the current choices. Although it's just recently that Java got fast, and it's hard for me to imagine an OS with garbage collection.
If I were writing my own, it would be a mix of assembly and C.

A C or C++ microkernel with a JIT for a highly dynamic language like Ruby or maybe a language with native support for the Prototype pattern. Even device drivers in that language.
Not because it's practical but because it's really cool. Cool in the way that NeXTStep was cool for using Obj-C for pretty much everything.

http://www.dwheeler.com/sloc/redhat71-v1/redhat71sloc.html - share of languages in Linux's source code.

C, by a number of reasons. Other candidates, like D, are great. However, C has this advantage: there's a lot of available open-sourced C code that you could reuse in your project (much more than is for other languages appropriate for system programming).

I would be torn between using some existing low level language and write my own based on C# but with much better generics support.
In second case I would make each method generic, but all the constraints will be resolved by compiler - to allow "duck typing" like in Scala but still language should be static. Also static virtual methods would lower the codebase.
I've had that idea for a long time, but it never seems to be doable in real timeframe, so who knows maybe in the future. :-)

Some would say Java.
Note that openfirmware is written partly in Forth, and it's very low level.
Have an open mind.

"The kernel has to be written in a low-level language, C is by far the best choice for this, because it is so memory efficient. "
Um... What about FORTH?
FORTH can be low level and high level, so you could have a whole operating system written in FORTH from the ground up, and still have a nice easy REPL scripting environment on top, also in FORTH.
However, any decent operating system should support lots of langauges on top, from C all the way to Python Ruby and Javascript. Making FORTH the basis for it all has a lot of benefits though.
edit: I'd only ever attempt this for an embedded environment with a single known hardware set. Trying to write an OS that could compete with Linux or Windows is a fools job.

If this isn't a hypothetical question, and you're looking to create your own OS, I'd probably go with C because most of the examples out there are written in C.
Also, (And I haven't build an OS yet so take this with a grain of salt), I'm thinking that the c runtime libraries would be a lot easier to port to your new OS than say .NET.

Pascal + Oberon: they have the power of C and C++ but they're not as daunting to use. Both these languages are grossly under appreciated.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string