Need real example of circular reference not handled by ref-counting [closed] - garbage-collection

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
this may be theoretic a bit, I need this because I am designing my own programming language for fun as a hobby project. I see memory management as "four generations". This is my own view, not connected to any paper or school.
First generation is flat block of memory, and the application just manages it some how.
Second generation is malloc/free style heap. System will handle the memory, but you handle the objects.
Third generation is reference counted allocation/disposal, which happens "in place" or in other words during the call to new/delete.
The fourth generation is what you usually call garbage collector. Which may happen in another thread or sometimes just like third generation may happen during call to new/delete or other procedure.
The main difference is that 4th gen GC will sometimes "walk" or "scan" the heap objects and check out if some of them are "unrooted", so that it can delete tham. This is quite compilcated, there are many algorithms for that "scanning". Sometimes those algoritm names have a word "color" in them. Like "Tri-color marking" or "Tri-color counting async parallel" etc. The most obvious reason for having a "heap scan" is to handle circular references. Because there is no (yet?) other way to handle that.
Ok now to my question.
What could be an easy to understand, "Real life" example of the circular reference, that cannot be handled without the 4th generation GC? Of course you can just type two lines of code, to deliberately create circular reference. But that is not my question. I want a "real" example, I mean something that could happen naturally, and can be easily understood. And something that can really show to any newby how out of the blue you could have a circular reference that you need to handle with GC. Maybe I could use also words "practical" and "useful" to describe such an example.

Since you asked for a real-world example, here's one in Java, using LinkedList:
{
List<String> list = new LinkedList<String>();
list.add("hello");
list.add("world");
}
Why is there a cycle here? because Java's LinkedList is a doubly-linked list, and the implementation of its add is:
boolean add(E e) {
linkLast(e);
return true;
}
void linkLast(E e) {
final Node<E> l = last;
final Node<E> newNode = new Node<>(l, e, null);
last = newNode;
if (l == null)
first = newNode;
else
l.next = newNode;
size++;
modCount++;
}
With this inner class and these fields:
private static class Node<E> {
E item;
Node<E> next;
Node<E> prev;
Node(Node<E> prev, E element, Node<E> next) {
this.item = element;
this.next = next;
this.prev = prev;
}
}
Node<E> first, last;
After the scope ends, list has no references so it can be collected. But the two Nodes created from the two linkLast() invocations have two references each - the first from first and last.prev, the second from last and first.next. Collecting list will only remove 1 of these references for each object, leaving a cycle.

Related

Is ABA relevant for push/insertion operations when using the CAS idiom?

The following pseudo-code is taken from http://15418.courses.cs.cmu.edu/spring2013/article/46
while (1) {
n->next = p->next;
Node *old_next = p->next;
if (compare_and_swap(&p->next, old_next, n) == old_next)
return;
}
This is the push operation for a lock-free stack that uses the compare and swap idiom, but does it atomically. It does not seem an ABA issue is relevant here, and I am wondering if that's generally the case for push and insertion operations?
You are right that this function does not suffer from the ABA problem, because there is nothing that depends on the old_next value before the call to compare_and_swap.
Consider the (simplified) pop operation:
while (1) {
Node *top = s->top;
if (top == NULL)
return NULL;
Node *new_top = top->next;
if (compare_and_swap(&s->top, top, new_top);
return top;
}
}
Here we load s->top into top and later perform a CAS on s->top with top as expected value. But before the CAS, we read top->next, so we perform an operation that depends on top! This is what causes the ABA problem.
It is not possible to generally state that all push/insert operations are free of the ABA problem, since it depends on more details. As a somewhat contrived example, consider a push operation that conditionally pushes the value only if it is greater.
while (1) {
n->next = p->next;
Node *old_next = p->next;
if (n->value < old_next->value)
return;
if (compare_and_swap(&p->next, old_next, n) == old_next)
return;
}
This also suffers from the ABA problem, because it might break our invariant that the values are strictly ordered.
There is no ABA issue with this code, but that's really because it doesn't do very much.
The problem is fundamentally that you use compare_and_swap(&p->next, old_next, n) to make sure that the stack hasn't changed since you last looked at it, but it does not reliably do that.
Between the time you read n->next and the time you do the compare_and_swap, other threads could have:
Popped n->next
Pushed a bunch of other stuff
Re-pushed the old n->next
Then your compare_and_swap would succeed even though the stack has changed.
This is not a problem for you, because you never look at any of the fields of n->next. It doesn't matter that the stack has changed, because all you care about is the n->next pointer.
If you did have to look inside that object, though, then your code would be broken. For instance, if you added a stack_size field to atomically keep track of the stack size, then you'd set n->stack_size = old_next->stack_size+1, and this would be wrong if the stack had changed as above.
So it is not true in general that insert operations are ABA-proof.

Interning strings in declarative programming

The following scenario shows an abstraction that seems to me to be impossible to implement declaratively.
Suppose that I want to create a Symbol object which allows you to create objects with strings that can be compared, like Symbol.for() in JavaScript. A simple implementation in JS might look like this:
function MySymbol(text){//Comparable symbol object class
this.text = text;
this.equals = function(other){//Method to compare to other MySymbol
return this.text == other.text;
}
}
I could easily write this in a declarative language like Haskell:
data MySymbol = MySymbol String
makeSymbol :: String -> MySymbol
makeSymbol s = MySymbol s
compareSymbol :: MySymbol -> MySymbol -> Bool
compareSymbol (MySymbol s1) (MySymbol s2) = s1 == s2
However, maybe in the future I want to improve efficiency by using a global registry without changing the interface to the MySymbol objects. (The user of my class doesn't need to know that I've changed it to use a registry)
For example, this is easily done in Javascript:
function MySymbol(text){
if (MySymbol.registry.has(text)){//check if symbol already in registry
this.id = MySymbol.registry.get(text);//get id
} else {
this.id = MySymbol.nextId++;
MySymbol.registry.set(text, this.id);//Add new symbol with nextId
}
this.equals = function(other){//To compare, simply compare ids
return this.id == other.id;
}
}
//Setup initial empty registry
MySymbol.registry = new Map();//A map from strings to numbers
MySymbol.nextId = 0;
However, it is impossible to create a mutable global registry in Haskell. (I can create a registry, but not without changing the interface to my functions.)
Specifically, these three possible Haskell solutions all have problems:
Force the user to pass a registry argument or equivalent, making the interface implementation dependent
Use some fancy Monad stuff like Haskell's Control.Monad.Random, which would require either foreseeing the optimization from the start or changing the interface (and is basically just adding the concept of state into your program and therefore breaks referential transparency etc.)
Have a slow implementation which might not be practical in a given application
None of these solutions allow me to sufficiently abstract away implementation from my Haskell interface.
So, my question is: Is there a way to implement this optimization to a Symbol object in Haskell (or any declarative language) without causing one of the three problems listed above,
and are there any other situations where an imperative language can express an abstraction (for example an optimization like above) that a declarative language can't?
The intern package shows how. As discussed by #luqui, it uses unsafePerformIO at a few key moments, and is careful to hide the identifiers produced during interning.

Must access to scala.collection.immutable.List and Vector be synchronized?

I'm going through Learning Concurrent Programming in Scala, and encountered the following:
In current versions of Scala, however, certain collections that are
deemed immutable, such as List and Vector, cannot be shared without
synchronization. Although their external API does not allow you to
modify them, they contain non-final fields.
Tip: Even if an object
seems immutable, always use proper synchronization to share any object
between the threads.
From Learning Concurrent Programming in Scala by Aleksandar Prokopec, end of Chapter 2 (p.58), Packt Publishing, Nov 2014.
Can that be right?
My working assumption has always been that any internal mutability (to implement laziness, caching, whatever) in Scala library data structures described as immutable would be idempotent, such that the worst that might happen in a bad race is work would be unnecessarily duplicated. This author seems to suggest correctness may be imperiled by concurrent access to immutable structures. Is that true? Do we really need to synchronize access to Lists?
Much of my transition to an immutable-heavy style has been motivated by a desire to avoid synchronization and the potential contention overhead it entails. It would be an unhappy big deal to learn that synchronization cannot be eschewed for Scala's core "immutable" data structures. Is this author simply overconservative?
Scala's documentation of collections includes the following:
A collection in package scala.collection.immutable is guaranteed to be immutable for everyone. Such a collection will never change after it is created. Therefore, you can rely on the fact that accessing the same collection value repeatedly at different points in time will always yield a collection with the same elements.
That doesn't quite say that they are safe for concurrent access by multiple threads. Does anyone know of an authoritative statement that they are (or aren't)?
It depends on where you share them:
it's not safe to share them inside scala-library
it's not safe to share them with Java-code, reflection
Simply saying, these collections are less protected than objects with only final fields. Regardless that they're same on JVM level (without optimization like ldc) - both may be fields with some mutable address, so you can change them with putfield bytecode command. Anyway, var is still less protected by the compiler, in comparision with java's final, scala's final val and val.
However, it's still fine to use them in most cases as their behaviour is logically immutable - all mutable operations are encapsulated (for Scala-code). Let's look at the Vector. It requires mutable fields to implement appending algorithm:
private var dirty = false
//from VectorPointer
private[immutable] var depth: Int = _
private[immutable] var display0: Array[AnyRef] = _
private[immutable] var display1: Array[AnyRef] = _
private[immutable] var display2: Array[AnyRef] = _
private[immutable] var display3: Array[AnyRef] = _
private[immutable] var display4: Array[AnyRef] = _
private[immutable] var display5: Array[AnyRef] = _
which is implemented like:
val s = new Vector(startIndex, endIndex + 1, blockIndex)
s.initFrom(this) //uses displayN and depth
s.gotoPos(startIndex, startIndex ^ focus) //uses displayN
s.gotoPosWritable //uses dirty
...
s.dirty = dirty
And s comes to the user only after method returned it. So it's not even concern of happens-before guarantees - all mutable operations are performed in the same thread (thread where you call :+, +: or updated), it's just kind of initialization. The only problem here is that private[somePackage] is accessible directly from Java code and from scala-library itself, so if you pass it to some Java's method it could modify them.
I don't think you should worry about thread-safety of let's say cons operator. It also has mutable fields:
final case class ::[B](override val head: B, private[scala] var tl: List[B]) extends List[B] {
override def tail : List[B] = tl
override def isEmpty: Boolean = false
}
But they used only inside library methods (inside one-thread) without any explicit sharing or thread creation, and they always return a new collection, let's consider take as an example:
override def take(n: Int): List[A] = if (isEmpty || n <= 0) Nil else {
val h = new ::(head, Nil)
var t = h
var rest = tail
var i = 1
while ({if (rest.isEmpty) return this; i < n}) {
i += 1
val nx = new ::(rest.head, Nil)
t.tl = nx //here is mutation of t's filed
t = nx
rest = rest.tail
}
h
}
So here t.tl = nx is not much differ from t = nx in the meaning of thread-safety. They both are reffered only from the single stack (take's stack). Althrought, if I add let's say someActor ! t (or any other async operation), someField = t or someFunctionWithExternalSideEffect(t) right inside the while loop - I could break this contract.
A little addtion here about relations with JSR-133:
1) new ::(head, Nil) creates new object in the heap and puts its address (lets say 0x100500) into the stack(val h =)
2) as long as this address is in the stack, it's known only to the current thread
3) Other threads could be involved only after sharing this address by putting it into some field; in case of take it has to flush any caches (to restore the stack and registers) before calling areturn (return h), so returned object will be consistent.
So all operations on 0x100500's object are out of scope of JSR-133 as long as 0x100500 is a part of stack only (not heap, not other's stacks). However, some fields of 0x100500's object may point to some shared objects (which might be in scope JSR-133), but it's not the case here (as these objects are immutable for outside).
I think (hope) the author meant logical synchronization guarantees for library's developers - you still need to be careful with these things if you're developing scala-library, as these vars are private[scala], private[immutable] so, it's possible to write some code to mutate them from different threads. From scala-library developer's perspective, it usually means that all mutations on single instance should be applied in single thread and only on collection that invisible to a user (at the moment). Or, simply saying - don't open mutable fields for outer users in any way.
P.S. Scala had several unexpected issues with synchronization, which caused some parts of the library to be surprisely not thread-safe, so I wouldn't wonder if something may be wrong (and this is a bug then), but in let's say 99% cases for 99% methods immutable collections are thread safe. In worst case you might be pushed from usage of some broken method or just (it might be not just "just" for some cases) need to clone the collection for every thread.
Anyway, immutability is still a good way for thread-safety.
P.S.2 Exotic case which might break immutable collections' thread-safety is using reflection to access their non-final fields.
A little addition about another exotic but really terrifying way, as it pointed out in comments with #Steve Waldman and #axel22 (the author). If you share immutable collection as member of some object shared netween threads && if collection's constructor becomes physically (by JIT) inlined (it's not logically inlined by default) && if your JIT-implementation allows to rearrange inlined code with normal one - then you have to synchronize it (usually is enough to have #volatile). However, IMHO, I don't believe that last condition is a correct behaviour - but for now, can't neither prove nor disprove that.
In your question you are asking for an authoritative statement. I found the following in "Programming in Scala" from Martin Odersky et al:
"Third, there is no way for two threads concurrently accessing an immutable to corrupt its state once it has been properbly constructed, because no thread can change the state of an immutable"
If you look for example at the implementation you see that this is followed in the implementation, see below.
There are some fields inside vector which are not final and could lead to data races. But since they are only changed inside a method creating a new instance and since you need an Synchronization action to access the newly created instance in different threads anyway everyting is fine.
The pattern used here is to create and modify an object. Than make it visible to other threads, for example by assigning this instance to a volatile static or static final. And after that make sure that it is not changed anymore.
As an Example the creation of two vectors:
val vector = Vector(4,5,5)
val vector2 = vector.updated(1, 2);
The method updated uses the var field dirty inside:
private[immutable] def updateAt[B >: A](index: Int, elem: B): Vector[B] = {
val idx = checkRangeConvert(index)
val s = new Vector[B](startIndex, endIndex, idx)
s.initFrom(this)
s.dirty = dirty
s.gotoPosWritable(focus, idx, focus ^ idx) // if dirty commit changes; go to new pos and prepare for writing
s.display0(idx & 0x1f) = elem.asInstanceOf[AnyRef]
s
}
but since after creation of vector2 it is assigned to a final variable:
Bytecode of variable declaration:
private final scala.collection.immutable.Vector vector2;
Byte code of constructor:
61 invokevirtual scala.collection.immutable.Vector.updated(int, java.lang.Object, scala.collection.generic.CanBuildFrom) : java.lang.Object [52]
64 checkcast scala.collection.immutable.Vector [48]
67 putfield trace.agent.test.scala.TestVector$.vector2 : scala.collection.immutable.Vector [22]
Everything is o.k.

Is it worthwhile having less locked code but more lock calls?

I have some pieces of code in a multi-threaded environment that access a shared variable. Obviously I need to lock for these accesses, but my question is whether I save or create additional overhead by locking the variable for a short amount of time to get a value, unlocking, doing some things with what I retrieved, then re-locking to modify the data structure that I just unlocked.
Obviously for very expensive operations that can be done outside of a lock, you would benefit by not locking that segment. What about, for example, the following, where the operations being done are relatively inexpensive (creating a new object and an if statement):
Less threaded:
Value v;
lock (values)
{
v = values.FirstOrDefault(a => a.Thing == someValue);
if (v == null)
{
v = new Value { Thing = someValue };
values.Add(v);
}
}
More threaded:
Value v;
lock (values)
{
v = values.FirstOrDefault(a => a.Thing == someValue);
}
if (v == null)
{
v = new Value { Thing = someValue };
lock(values)
{
values.Add(v);
}
}
Both solutions are thread safe and both are quite readable (IMHO), but it would be nice to build into my habits the more efficient of the two, if there is even a slight difference between the two.
The second will, potentially, buy you extra performance, but realistically, it will only be an issue if the Value constructor is relatively expensive.
In the second case, you're avoiding the lock over the null check (which is very fast), but also while constructing the Value instance. This could, potentially, be a significant improvement in performance.
That being said, this is not thread safe. A second thread could add the value to the code between the two lock statements. As such, I'd recommend the first approach (realizing that you'd always have to lock values for any other operation that uses that collection).
If this lookup is a common one, a better approach might be to use ConcurrentDictionary<T,U> instead, with Thing as the key. You could then use the GetOrAdd method to safely add or retrieve the value.
The second case is not thread safe because while a thread is checking if v is null, another thread could have inserted a value equal to someValue (as you name it)

Best explanation for languages without null

Every so often when programmers are complaining about null errors/exceptions someone asks what we do without null.
I have some basic idea of the coolness of option types, but I don't have the knowledge or languages skill to best express it. What is a great explanation of the following written in a way approachable to the average programmer that we could point that person towards?
The undesirability of having references/pointers be nullable by default
How option types work including strategies to ease checking null cases such as
pattern matching and
monadic comprehensions
Alternative solution such as message eating nil
(other aspects I missed)
I think the succinct summary of why null is undesirable is that meaningless states should not be representable.
Suppose I'm modeling a door. It can be in one of three states: open, shut but unlocked, and shut and locked. Now I could model it along the lines of
class Door
private bool isShut
private bool isLocked
and it is clear how to map my three states into these two boolean variables. But this leaves a fourth, undesired state available: isShut==false && isLocked==true. Because the types I have selected as my representation admit this state, I must expend mental effort to ensure that the class never gets into this state (perhaps by explicitly coding an invariant). In contrast, if I were using a language with algebraic data types or checked enumerations that lets me define
type DoorState =
| Open | ShutAndUnlocked | ShutAndLocked
then I could define
class Door
private DoorState state
and there are no more worries. The type system will ensure that there are only three possible states for an instance of class Door to be in. This is what type systems are good at - explicitly ruling out a whole class of errors at compile-time.
The problem with null is that every reference type gets this extra state in its space that is typically undesired. A string variable could be any sequence of characters, or it could be this crazy extra null value that doesn't map into my problem domain. A Triangle object has three Points, which themselves have X and Y values, but unfortunately the Points or the Triangle itself might be this crazy null value that is meaningless to the graphing domain I'm working in. Etc.
When you do intend to model a possibly-non-existent value, then you should opt into it explicitly. If the way I intend to model people is that every Person has a FirstName and a LastName, but only some people have MiddleNames, then I would like to say something like
class Person
private string FirstName
private Option<string> MiddleName
private string LastName
where string here is assumed to be a non-nullable type. Then there are no tricky invariants to establish and no unexpected NullReferenceExceptions when trying to compute the length of someone's name. The type system ensures that any code dealing with the MiddleName accounts for the possibility of it being None, whereas any code dealing with the FirstName can safely assume there is a value there.
So for example, using the type above, we could author this silly function:
let TotalNumCharsInPersonsName(p:Person) =
let middleLen = match p.MiddleName with
| None -> 0
| Some(s) -> s.Length
p.FirstName.Length + middleLen + p.LastName.Length
with no worries. In contrast, in a language with nullable references for types like string, then assuming
class Person
private string FirstName
private string MiddleName
private string LastName
you end up authoring stuff like
let TotalNumCharsInPersonsName(p:Person) =
p.FirstName.Length + p.MiddleName.Length + p.LastName.Length
which blows up if the incoming Person object does not have the invariant of everything being non-null, or
let TotalNumCharsInPersonsName(p:Person) =
(if p.FirstName=null then 0 else p.FirstName.Length)
+ (if p.MiddleName=null then 0 else p.MiddleName.Length)
+ (if p.LastName=null then 0 else p.LastName.Length)
or maybe
let TotalNumCharsInPersonsName(p:Person) =
p.FirstName.Length
+ (if p.MiddleName=null then 0 else p.MiddleName.Length)
+ p.LastName.Length
assuming that p ensures first/last are there but middle can be null, or maybe you do checks that throw different types of exceptions, or who knows what. All these crazy implementation choices and things to think about crop up because there's this stupid representable-value that you don't want or need.
Null typically adds needless complexity. Complexity is the enemy of all software, and you should strive to reduce complexity whenever reasonable.
(Note well that there is more complexity to even these simple examples. Even if a FirstName cannot be null, a string can represent "" (the empty string), which is probably also not a person name that we intend to model. As such, even with non-nullable strings, it still might be the case that we are "representing meaningless values". Again, you could choose to battle this either via invariants and conditional code at runtime, or by using the type system (e.g. to have a NonEmptyString type). The latter is perhaps ill-advised ("good" types are often "closed" over a set of common operations, and e.g. NonEmptyString is not closed over .SubString(0,0)), but it demonstrates more points in the design space. At the end of the day, in any given type system, there is some complexity it will be very good at getting rid of, and other complexity that is just intrinsically harder to get rid of. The key for this topic is that in nearly every type system, the change from "nullable references by default" to "non-nullable references by default" is nearly always a simple change that makes the type system a great deal better at battling complexity and ruling out certain types of errors and meaningless states. So it is pretty crazy that so many languages keep repeating this error again and again.)
The nice thing about option types isn't that they're optional. It is that all other types aren't.
Sometimes, we need to be able to represent a kind of "null" state. Sometimes we have to represent a "no value" option as well as the other possible values a variable may take. So a language that flat out disallows this is going to be a bit crippled.
But often, we don't need it, and allowing such a "null" state only leads to ambiguity and confusion: every time I access a reference type variable in .NET, I have to consider that it might be null.
Often, it will never actually be null, because the programmer structures the code so that it can never happen. But the compiler can't verify that, and every single time you see it, you have to ask yourself "can this be null? Do I need to check for null here?"
Ideally, in the many cases where null doesn't make sense, it shouldn't be allowed.
That's tricky to achieve in .NET, where nearly everything can be null. You have to rely on the author of the code you're calling to be 100% disciplined and consistent and have clearly documented what can and cannot be null, or you have to be paranoid and check everything.
However, if types aren't nullable by default, then you don't need to check whether or not they're null. You know they can never be null, because the compiler/type checker enforces that for you.
And then we just need a back door for the rare cases where we do need to handle a null state. Then an "option" type can be used. Then we allow null in the cases where we've made a conscious decision that we need to be able to represent the "no value" case, and in every other case, we know that the value will never be null.
As others have mentioned, in C# or Java for example, null can mean one of two things:
the variable is uninitialized. This should, ideally, never happen. A variable shouldn't exist unless it is initialized.
the variable contains some "optional" data: it needs to be able to represent the case where there is no data. This is sometimes necessary. Perhaps you're trying to find an object in a list, and you don't know in advance whether or not it's there. Then we need to be able to represent that "no object was found".
The second meaning has to be preserved, but the first one should be eliminated entirely. And even the second meaning should not be the default. It's something we can opt in to if and when we need it. But when we don't need something to be optional, we want the type checker to guarantee that it will never be null.
All of the answers so far focus on why null is a bad thing, and how it's kinda handy if a language can guarantee that certain values will never be null.
They then go on to suggest that it would be a pretty neat idea if you enforce non-nullability for all values, which can be done if you add a concept like Option or Maybe to represent types that may not always have a defined value. This is the approach taken by Haskell.
It's all good stuff! But it doesn't preclude the use of explicitly nullable / non-null types to achieve the same effect. Why, then, is Option still a good thing? After all, Scala supports nullable values (is has to, so it can work with Java libraries) but supports Options as well.
Q. So what are the benefits beyond being able to remove nulls from a language entirely?
A. Composition
If you make a naive translation from null-aware code
def fullNameLength(p:Person) = {
val middleLen =
if (null == p.middleName)
p.middleName.length
else
0
p.firstName.length + middleLen + p.lastName.length
}
to option-aware code
def fullNameLength(p:Person) = {
val middleLen = p.middleName match {
case Some(x) => x.length
case _ => 0
}
p.firstName.length + middleLen + p.lastName.length
}
there's not much difference! But it's also a terrible way to use Options... This approach is much cleaner:
def fullNameLength(p:Person) = {
val middleLen = p.middleName map {_.length} getOrElse 0
p.firstName.length + middleLen + p.lastName.length
}
Or even:
def fullNameLength(p:Person) =
p.firstName.length +
p.middleName.map{length}.getOrElse(0) +
p.lastName.length
When you start dealing with List of Options, it gets even better. Imagine that the List people is itself optional:
people flatMap(_ find (_.firstName == "joe")) map (fullNameLength)
How does this work?
//convert an Option[List[Person]] to an Option[S]
//where the function f takes a List[Person] and returns an S
people map f
//find a person named "Joe" in a List[Person].
//returns Some[Person], or None if "Joe" isn't in the list
validPeopleList find (_.firstName == "joe")
//returns None if people is None
//Some(None) if people is valid but doesn't contain Joe
//Some[Some[Person]] if Joe is found
people map (_ find (_.firstName == "joe"))
//flatten it to return None if people is None or Joe isn't found
//Some[Person] if Joe is found
people flatMap (_ find (_.firstName == "joe"))
//return Some(length) if the list isn't None and Joe is found
//otherwise return None
people flatMap (_ find (_.firstName == "joe")) map (fullNameLength)
The corresponding code with null checks (or even elvis ?: operators) would be painfully long. The real trick here is the flatMap operation, which allows for the nested comprehension of Options and collections in a way that nullable values can never achieve.
Since people seem to be missing it: null is ambiguous.
Alice's date-of-birth is null. What does it mean?
Bob's date-of-death is null. What does that mean?
A "reasonable" interpretation might be that Alice's date-of-birth exists but is unknown, whereas Bob's date-of-death does not exist (Bob is still alive). But why did we get to different answers?
Another problem: null is an edge case.
Is null = null?
Is nan = nan?
Is inf = inf?
Is +0 = -0?
Is +0/0 = -0/0?
The answers are usually "yes", "no", "yes", "yes", "no", "yes" respectively. Crazy "mathematicians" call NaN "nullity" and say it compares equal to itself. SQL treats nulls as not equal to anything (so they behave like NaNs). One wonders what happens when you try to store ±∞, ±0, and NaNs into the same database column (there are 253 NaNs, half of which are "negative").
To make matters worse, databases differ in how they treat NULL, and most of them aren't consistent (see NULL Handling in SQLite for an overview). It's pretty horrible.
And now for the obligatory story:
I recently designed a (sqlite3) database table with five columns a NOT NULL, b, id_a, id_b NOT NULL, timestamp. Because it's a generic schema designed to solve a generic problem for fairly arbitrary apps, there are two uniqueness constraints:
UNIQUE(a, b, id_a)
UNIQUE(a, b, id_b)
id_a only exists for compatibility with an existing app design (partly because I haven't come up with a better solution), and is not used in the new app. Because of the way NULL works in SQL, I can insert (1, 2, NULL, 3, t) and (1, 2, NULL, 4, t) and not violate the first uniqueness constraint (because (1, 2, NULL) != (1, 2, NULL)).
This works specifically because of how NULL works in a uniqueness constraint on most databases (presumably so it's easier to model "real-world" situations, e.g. no two people can have the same Social Security Number, but not all people have one).
FWIW, without first invoking undefined behaviour, C++ references cannot "point to" null, and it's not possible to construct a class with uninitialized reference member variables (if an exception is thrown, construction fails).
Sidenote: Occasionally you might want mutually-exclusive pointers (i.e. only one of them can be non-NULL), e.g. in a hypothetical iOS type DialogState = NotShown | ShowingActionSheet UIActionSheet | ShowingAlertView UIAlertView | Dismissed. Instead, I'm forced to do stuff like assert((bool)actionSheet + (bool)alertView == 1).
The undesirability of having having references/pointers be nullable by default.
I don't think this is the main issue with nulls, the main issue with nulls is that they can mean two things:
The reference/pointer is uninitialized: the problem here is the same as mutability in general. For one, it makes it more difficult to analyze your code.
The variable being null actually means something: this is the case which Option types actually formalize.
Languages which support Option types typically also forbid or discourage the use of uninitialized variables as well.
How option types work including strategies to ease checking null cases such as pattern matching.
In order to be effective, Option types need to be supported directly in the language. Otherwise it takes a lot of boiler-plate code to simulate them. Pattern-matching and type-inference are two keys language features making Option types easy to work with. For example:
In F#:
//first we create the option list, and then filter out all None Option types and
//map all Some Option types to their values. See how type-inference shines.
let optionList = [Some(1); Some(2); None; Some(3); None]
optionList |> List.choose id //evaluates to [1;2;3]
//here is a simple pattern-matching example
//which prints "1;2;None;3;None;".
//notice how value is extracted from op during the match
optionList
|> List.iter (function Some(value) -> printf "%i;" value | None -> printf "None;")
However, in a language like Java without direct support for Option types, we'd have something like:
//here we perform the same filter/map operation as in the F# example.
List<Option<Integer>> optionList = Arrays.asList(new Some<Integer>(1),new Some<Integer>(2),new None<Integer>(),new Some<Integer>(3),new None<Integer>());
List<Integer> filteredList = new ArrayList<Integer>();
for(Option<Integer> op : list)
if(op instanceof Some)
filteredList.add(((Some<Integer>)op).getValue());
Alternative solution such as message eating nil
Objective-C's "message eating nil" is not so much a solution as an attempt to lighten the head-ache of null checking. Basically, instead of throwing a runtime exception when trying to invoke a method on a null object, the expression instead evaluates to null itself. Suspending disbelief, it's as if each instance method begins with if (this == null) return null;. But then there is information loss: you don't know whether the method returned null because it is valid return value, or because the object is actually null. It's a lot like exception swallowing, and doesn't make any progress addressing the issues with null outlined before.
Assembly brought us addresses also known as untyped pointers. C mapped them directly as typed pointers but introduced Algol's null as a unique pointer value, compatible with all typed pointers. The big issue with null in C is that since every pointer can be null, one never can use a pointer safely without a manual check.
In higher-level languages, having null is awkward since it really conveys two distinct notions:
Telling that something is undefined.
Telling that something is optional.
Having undefined variables is pretty much useless, and yields to undefined behavior whenever they occur. I suppose everybody will agree that having things undefined should be avoided at all costs.
The second case is optionality and is best provided explicitly, for instance with an option type.
Let's say we're in a transport company and we need to create an application to help create a schedule for our drivers. For each driver, we store a few informations such as: the driving licences they have and the phone number to call in case of emergency.
In C we could have:
struct PhoneNumber { ... };
struct MotorbikeLicence { ... };
struct CarLicence { ... };
struct TruckLicence { ... };
struct Driver {
char name[32]; /* Null terminated */
struct PhoneNumber * emergency_phone_number;
struct MotorbikeLicence * motorbike_licence;
struct CarLicence * car_licence;
struct TruckLicence * truck_licence;
};
As you observe, in any processing over our list of drivers we'll have to check for null pointers. The compiler won't help you, the safety of the program relies on your shoulders.
In OCaml, the same code would look like this:
type phone_number = { ... }
type motorbike_licence = { ... }
type car_licence = { ... }
type truck_licence = { ... }
type driver = {
name: string;
emergency_phone_number: phone_number option;
motorbike_licence: motorbike_licence option;
car_licence: car_licence option;
truck_licence: truck_licence option;
}
Let's now say that we want to print the names of all the drivers along with their truck licence numbers.
In C:
#include <stdio.h>
void print_driver_with_truck_licence_number(struct Driver * driver) {
/* Check may be redundant but better be safe than sorry */
if (driver != NULL) {
printf("driver %s has ", driver->name);
if (driver->truck_licence != NULL) {
printf("truck licence %04d-%04d-%08d\n",
driver->truck_licence->area_code
driver->truck_licence->year
driver->truck_licence->num_in_year);
} else {
printf("no truck licence\n");
}
}
}
void print_drivers_with_truck_licence_numbers(struct Driver ** drivers, int nb) {
if (drivers != NULL && nb >= 0) {
int i;
for (i = 0; i < nb; ++i) {
struct Driver * driver = drivers[i];
if (driver) {
print_driver_with_truck_licence_number(driver);
} else {
/* Huh ? We got a null inside the array, meaning it probably got
corrupt somehow, what do we do ? Ignore ? Assert ? */
}
}
} else {
/* Caller provided us with erroneous input, what do we do ?
Ignore ? Assert ? */
}
}
In OCaml that would be:
open Printf
(* Here we are guaranteed to have a driver instance *)
let print_driver_with_truck_licence_number driver =
printf "driver %s has " driver.name;
match driver.truck_licence with
| None ->
printf "no truck licence\n"
| Some licence ->
(* Here we are guaranteed to have a licence *)
printf "truck licence %04d-%04d-%08d\n"
licence.area_code
licence.year
licence.num_in_year
(* Here we are guaranteed to have a valid list of drivers *)
let print_drivers_with_truck_licence_numbers drivers =
List.iter print_driver_with_truck_licence_number drivers
As you can see in this trivial example, there is nothing complicated in the safe version:
It's terser.
You get much better guarantees and no null check is required at all.
The compiler ensured that you correctly dealt with the option
Whereas in C, you could just have forgotten a null check and boom...
Note : these code samples where not compiled, but I hope you got the ideas.
Microsoft Research has a intersting project called
Spec#
It is a C# extension with not-null type and some mechanism to check your objects against not being null, although, IMHO, applying the design by contract principle may be more appropriate and more helpful for many troublesome situations caused by null references.
Robert Nystrom offers a nice article here:
http://journal.stuffwithstuff.com/2010/08/23/void-null-maybe-and-nothing/
describing his thought process when adding support for absence and failure to his Magpie programming language.
Coming from .NET background, I always thought null had a point, its useful. Until I came to know of structs and how easy it was working with them avoiding a lot of boilerplate code. Tony Hoare speaking at QCon London in 2009, apologized for inventing the null reference. To quote him:
I call it my billion-dollar mistake. It was the invention of the null
reference in 1965. At that time, I was designing the first
comprehensive type system for references in an object oriented
language (ALGOL W). My goal was to ensure that all use of references
should be absolutely safe, with checking performed automatically by
the compiler. But I couldn't resist the temptation to put in a null
reference, simply because it was so easy to implement. This has led to
innumerable errors, vulnerabilities, and system crashes, which have
probably caused a billion dollars of pain and damage in the last forty
years. In recent years, a number of program analysers like PREfix and
PREfast in Microsoft have been used to check references, and give
warnings if there is a risk they may be non-null. More recent
programming languages like Spec# have introduced declarations for
non-null references. This is the solution, which I rejected in 1965.
See this question too at programmers
I've always looked at Null (or nil) as being the absence of a value.
Sometimes you want this, sometimes you don't. It depends on the domain you are working with. If the absence is meaningful: no middle name, then your application can act accordingly. On the other hand if the null value should not be there: The first name is null, then the developer gets the proverbial 2 a.m. phone call.
I've also seen code overloaded and over-complicated with checks for null. To me this means one of two things:
a) a bug higher up in the application tree
b) bad/incomplete design
On the positive side - Null is probably one of the more useful notions for checking if something is absent, and languages without the concept of null will endup over-complicating things when it's time to do data validation. In this case, if a new variable is not initialized, said languagues will usually set variables to an empty string, 0, or an empty collection. However, if an empty string or 0 or empty collection are valid values for your application -- then you have a problem.
Sometimes this circumvented by inventing special/weird values for fields to represent an uninitialized state. But then what happens when the special value is entered by a well-intentioned user? And let's not get into the mess this will make of data validation routines.
If the language supported the null concept all the concerns would vanish.
Vector languages can sometimes get away with not having a null.
The empty vector serves as a typed null in this case.

Resources