How do laziness and parallelism coexist in Haskell?

People argue that Haskell has an advantage in parallelism because its data structures are immutable. But Haskell is also lazy, which means data can in fact be mutated: a thunk is overwritten with its evaluated result.
So it seems laziness can undermine the advantage of immutability. Am I wrong, or does Haskell have countermeasures for this problem? Or is this a feature of Haskell itself?

Yes, GHC’s RTS uses thunks to implement non-strict evaluation, and they use mutation under the hood, so they require some synchronisation. However, this is simplified due to the fact that most heap objects are immutable and functions are referentially transparent.
In a multithreaded program, evaluation of a thunk proceeds as follows:
The thunk is atomically† replaced with a BLACKHOLE object
If the same thread attempts to force the thunk after it’s been updated to a BLACKHOLE, this represents an infinite loop, and the RTS throws an exception (<<loop>>)
If a different thread attempts to force the thunk while it’s a BLACKHOLE, it blocks until the original thread has finished evaluating the thunk and updated it with a value
When evaluation is complete, the original thread atomically† replaces the thunk with its result
† e.g., using a compare-and-swap (CAS) instruction
So there is a potential race here: if two threads attempt to force the same thunk at the same time, they may both begin evaluating it. In that case, they will do some redundant work. However, only one thread will succeed in overwriting the BLACKHOLE with the result; the other thread's CAS will fail, and it will simply discard the result that it calculated.
Safe code cannot detect this, because it can’t obtain the address of an object or determine the state of a thunk. And in practice, this type of collision is rare for a couple of reasons:
Concurrent code generally partitions workloads across threads in a manner suited to the particular problem, so there is low risk of overlap
Evaluation of thunks is generally fairly “shallow” before you reach weak head normal form, so the probability of a “collision” is low
So thunks ultimately provide a good performance tradeoff when implementing non-strict evaluation, even in a concurrent context.
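The "both threads evaluate, one CAS wins" race described above can be modelled outside of GHC. The following Rust sketch is only a toy model under stated assumptions, not GHC's implementation: `ToyThunk`, `force` and `race` are hypothetical names, the value 0 stands in for "not yet evaluated", and the BLACKHOLE state and the blocking of other threads are deliberately left out so that only the redundant work and the failed CAS remain.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Hypothetical toy model: 0 means "not yet evaluated".
const UNEVALUATED: usize = 0;

struct ToyThunk {
    slot: AtomicUsize,
}

impl ToyThunk {
    fn new() -> Self {
        ToyThunk { slot: AtomicUsize::new(UNEVALUATED) }
    }

    // Returns the agreed value, and whether this caller's result was the one published.
    // In this model `compute` must return a nonzero value.
    fn force(&self, compute: impl Fn() -> usize) -> (usize, bool) {
        let seen = self.slot.load(Ordering::Acquire);
        if seen != UNEVALUATED {
            return (seen, false); // already evaluated: no work needed
        }
        let mine = compute(); // possibly redundant if another thread is doing the same
        match self
            .slot
            .compare_exchange(UNEVALUATED, mine, Ordering::AcqRel, Ordering::Acquire)
        {
            Ok(_) => (mine, true),          // our CAS won: the result is published
            Err(theirs) => (theirs, false), // our CAS failed: discard `mine`
        }
    }
}

// Forces the same thunk from several threads and collects what each one observed.
fn race(threads: usize) -> Vec<(usize, bool)> {
    let thunk = ToyThunk::new();
    thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..threads {
            handles.push(s.spawn(|| thunk.force(|| (1..=10usize).sum::<usize>())));
        }
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let results = race(4);
    println!("{:?}", results);
}
```

Every thread ends up with the same value, and exactly one of them reports that its CAS succeeded. Because the computation is pure, it does not matter which one that is.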


What costs are incurred when using Cell<T> as opposed to just T?

I ran across a comment on reddit that indicates that using Cell<T> prevents certain optimizations from occurring:
Cell works with no memory overhead (Cell is the same size as T) and little runtime overhead (it "just" inhibits optimisations, it doesn't introduce extra explicit operations)
This seems counter to other things I've read about Cell<T>, in particular that it's "zero-cost." The first place I encountered this categorization is here.
With all that said, I'd like to understand the actual cost of using Cell<T>, including whatever optimizations it may prevent.
TL;DR Cell is a Zero-Overhead Abstraction; that is, the same functionality implemented manually has the same cost.
The term Zero-Cost Abstractions is not English, it's jargon. The idea of Zero-Cost Abstractions is that the layer of abstraction itself does not add any cost compared to manually doing the same thing.
There are various misunderstandings that have sprung up: most notably, I have regularly seen zero-cost understood as "the operation is free", which is not the case.
To add to the confusion, the exception mechanism used by most C++ implementations, which Rust also uses for panic = unwind, is called Zero-Cost Exceptions, and purports [1] to add no overhead on the non-throwing path. It's a different kind of Zero-Cost...
Lately, my recommendation is to switch to the term Zero-Overhead Abstractions: first because it's distinct from Zero-Cost Exceptions, and so less likely to be confused with it, and second because it emphasizes that the Abstraction does not add Overhead, which is what we are trying to convey in the first place.
[1] The objective is only partially achieved. While the same assembly executed with and without the possibility of throwing indeed has the same performance, the presence of potential exceptions may hinder the optimizer and cause it to generate sub-optimal assembly in the first place.
On the memory side, there is no overhead:
size_of::<Cell<T>>() == size_of::<T>(),
given a cell of type Cell<T>, &cell == cell.as_ptr().
(You can peek at the source code)
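Both memory claims can be checked directly. This is a minimal sketch using only the standard library; the helper names `same_size` and `same_address` are made up for illustration.

```rust
use std::cell::Cell;
use std::mem::size_of;

// True if Cell<T> occupies exactly as many bytes as T.
fn same_size<T>() -> bool {
    size_of::<Cell<T>>() == size_of::<T>()
}

// True if the cell and the value stored inside it live at the same address.
fn same_address<T>(cell: &Cell<T>) -> bool {
    let cell_address = cell as *const Cell<T> as *const T;
    let value_address = cell.as_ptr() as *const T;
    std::ptr::eq(cell_address, value_address)
}

fn main() {
    let cell = Cell::new(42u64);
    println!("same size: {}", same_size::<u64>());
    println!("same address: {}", same_address(&cell));
}
```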
On the access side, Cell<T> does incur a run-time cost compared to T: the cost of the extra functionality.
The most immediate cost is that manipulating the value through a &Cell<T> requires copying it back and forth [1]. This is a bitwise copy, so the optimizer may elide it if it can prove that doing so is safe.
Another notable cost is that UnsafeCell<T>, on which Cell<T> is based, breaks the rule that a &T means the T cannot be modified.
When a compiler can prove that a portion of memory cannot be modified, it can optimize out further reads: read the value into a register once, then use the register rather than reading from memory again.
In traditional Rust code, a &T gives such a guarantee: no matter whether there are opaque function calls, calls to C code, etc. between two reads, the second read is guaranteed to return the same value as the first. With a &Cell<T>, there is no such guarantee any longer, so unless the optimizer can prove beyond doubt that the value is unmodified [2], it cannot apply such optimizations.
[1] You can manipulate the value at no cost through &mut Cell<T> or using unsafe code.
[2] For example, if the optimizer knows that the value resides on the stack, and that its address was never passed to anyone else, then it can reasonably conclude that no one else can modify the value. Although a stack-smashing attack may, of course.
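The access-side points can be sketched as follows. The function names are hypothetical, and the example shows only the semantics the optimizer has to preserve, not the assembly it actually produces.

```rust
use std::cell::Cell;

// Through a shared reference, the value is copied out, changed, and copied back in.
fn increment_shared(counter: &Cell<u32>) {
    let current = counter.get(); // bitwise copy out (requires T: Copy)
    counter.set(current + 1); // bitwise copy back in
}

// With exclusive access, `get_mut` hands out a plain `&mut u32`: no copies are needed.
fn increment_exclusive(counter: &mut Cell<u32>) {
    *counter.get_mut() += 1;
}

// `callback` may modify the cell through its own shared reference, so the second
// `get` cannot simply reuse the value returned by the first one.
fn read_twice(counter: &Cell<u32>, callback: impl Fn()) -> (u32, u32) {
    let before = counter.get();
    callback();
    let after = counter.get();
    (before, after)
}

fn main() {
    let mut counter = Cell::new(0);
    increment_shared(&counter);
    increment_exclusive(&mut counter);
    let (before, after) = read_twice(&counter, || increment_shared(&counter));
    println!("before = {before}, after = {after}");
}
```

With a plain &u32 in place of the &Cell<u32>, the closure could not have changed the value, and the compiler would be free to read it only once.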

C++ Threads writing to different parts of array of vector

I have an std::array<std::vector, NUM_THREADS> and I basically want each thread to go get some data, and store it in its own std::vector, and also to read from its vector.
Is this safe? Or am I going to have to use a mutex or something?
The rule regarding data-races is that if every memory location is either accessed by no more than one thread at a time, or is only read (by any number of threads, but no writes), you don't need atomicity. Otherwise, you need either atomicity or synchronization (such as mutual-exclusion).
If every thread is only writing to and reading from its own vector, this would be safe. If two threads are writing to the same vector elements without synchronization, or if they're both writing to the same vector itself (e.g., appending or truncating the vector), you're pretty much clobbered, since that's two simultaneous writes. If two threads are each writing to elements of their own vectors and reading from both vectors, it's more complicated, but in general I would expect it to be unsafe. There are very specific arrangements where it may be safe/legal, but they will be very brittle and likely hard to maintain, so it's probably better to re-architect to avoid them.
An example of a usage like this that would be legal (but again, brittle, and hard to keep safe during code maintenance) is one where none of the vectors changes size (a reallocation is a write to the vector itself, which precludes any reads of the vector or its elements by other threads) and each thread avoids reading any element that is written to by any other thread (for example, you have two threads, one reading from and writing to the even elements of the vectors and the other reading from and writing to the odd elements).
The above example is very artificial and probably not all that useful for real access patterns that might be desired. Other examples I could think of would probably also be artificial and unhelpful. And it's very easy to do some simple operation that would destroy the whole guarantee. In particular, if any thread performs push_back() on its own vector, any other thread that is concurrently reading that vector is almost guaranteed to trigger undefined behavior. (You might be able to align the stars using reserve() very carefully and make code that is legal, but I certainly wouldn't attempt it myself.)

"Wait-free" data in Haskell

I've been led to believe that the GHC implementation of TVars is lock-free, but not wait-free. Are there any implementations that are wait-free (e.g. a package on Hackage)?
Wait-freedom is a term from distributed computing. An algorithm is wait-free if a thread (or distributed node) is able to terminate correctly even if all input from other threads is delayed/lost at any time.
If you care about consistency, then you cannot guarantee wait-freedom (assuming that you always want to terminate correctly, i.e. guarantee availability). This follows from the CAP theorem [1], since wait-freedom essentially implies partition-tolerance.
Your question "Are there any implementations that are wait-free?" is a bit incomplete. STM (and thus TVar) is rather complex and has support built into the compiler - you can't build it properly with Haskell primitives.
If you're looking for any data container that allows mutation and can be non-blocking, then you want IORefs or MVars (though MVars can block if no value is available).

Why is concurrent haskell non deterministic while parallel haskell primitives (par and pseq) deterministic?

I don't quite understand determinism in the context of concurrency and parallelism in Haskell. Some examples would be helpful.
When dealing with pure values, the order of evaluation does not matter. That is essentially what parallelism does: evaluating pure values in parallel. In contrast to pure values, order usually matters for actions with side effects. Running such actions simultaneously is called concurrency.
As an example, consider the two actions putStr "foo" and putStr "bar". Depending on the order in which those two actions are executed, the output is either "foobar", "barfoo" or any interleaving in between. The output is nondeterministic, as it depends on the specific order of execution.
As another example, consider the two values sum [1..10] and 5 * 3. Regardless of the order in which those two get evaluated, they always reduce to the same results. This determinism is something you can usually only guarantee with pure values.
Concurrency and parallelism are two different things.
Concurrency means that you have multiple threads interacting non-deterministically. For example, you might have a chat server where each client is handled by one thread. The non-determinism is essential to the system you're trying to model.
Parallelism is about using multiple threads simply to make your program run faster. The end result, however, should be exactly the same as if you ran the algorithm sequentially.
Many languages don't have primitives for parallelism, so you have to implement it using concurrency primitives like threads and locks. However, this means that you, the programmer, have to be careful not to accidentally introduce unwanted non-determinism or other concurrency issues. With explicit parallelism primitives like par and pseq, many of these concerns simply go away.

How do Haskell compilers decide whether to allocate on the heap or the stack?

Haskell doesn't feature explicit memory management, and all objects are passed by value, so there's no obvious reference counting or garbage collection either. How does a Haskell compiler typically decide whether to generate code that allocates on the stack versus code that allocates on the heap for a given variable? Will it consistently heap or stack allocate the same variables across different call sites for the same function? And when it allocates, how does it decide when to free memory? Are stack allocations and deallocations still performed in the same function entrance/exit pattern as in C?
When you call a function like this
f 42 (g x y)
then the runtime behaviour is something like the following:
p1 = malloc(2 * sizeof(Word))
p1[0] = &Tag_for_Int
p1[1] = 42
p2 = malloc(3 * sizeof(Word))
p2[0] = &Code_for_g_x_y
p2[1] = x
p2[2] = y
f(p1, p2)
That is, arguments are usually passed as pointers to objects on the heap like in Java, but unlike Java these objects may represent suspended computations, a.k.a. thunks, such as (g x y/p2) in our example. Without optimisations, this execution model is quite inefficient, but there are ways to avoid many of these overheads.
GHC does a lot of inlining and unboxing. Inlining removes the function call overhead and often enables further optimisations. Unboxing means changing the calling convention: in the example above we could pass 42 directly instead of creating the heap object p1.
Strictness analysis finds out whether an argument is guaranteed to be evaluated. In that case, we don't need to create a thunk, but evaluate the expression fully and then pass the final result as an argument.
Small objects (currently only 8-bit Chars and Ints) are cached. That is, instead of allocating a new object each time, a pointer to the cached object is returned. Even if such an object is initially allocated on the heap, the garbage collector will de-duplicate it later (only small Ints and Chars). Since objects are immutable, this is safe.
Limited escape analysis. For local functions some arguments may be passed on the stack, because they are known to be dead by the time the outer function returns.
Edit: For (much) more information see "Implementing Lazy Functional Languages on Stock Hardware: The Spineless Tagless G-machine". This paper uses "push/enter" as the calling convention. Newer versions of GHC use the "eval/apply" calling convention. For a discussion of the trade-offs and reasons for that switch see "How to make a fast curry: push/enter vs eval/apply"
The only things GHC puts on the stack are evaluation contexts. Anything allocated with a let/where binding, and all data constructors and functions, are stored in the heap. Lazy evaluation makes everything you know about execution strategies in strict languages irrelevant.
