Tuesday, June 4, 2013

Null dereference is not something that you need

Some of the comments on this interesting blog post http://sebastiansylvan.com/2013/05/25/language-design-deal-breakers make me think that people familiar only with languages that allow null pointer dereference do not fully appreciate that it is an artifact of language design.  I want to go into some detail about that.  I'll introduce a typical language with NULL dereference, show how to transform it into a language without it, and then show how proper use of the new language really does completely eliminate the problem.

Start with a pointer language, like C++, but without pointer arithmetic.  And let's  assume garbage collection, to eliminate explicitly freeing memory.  Really, the language is more like C# or Java or Go, but let's stick with C++ syntax. I'm going to write NULL (not 0 or nullptr or nihilateration) for the null pointer.

Here are a couple of declarations in this language.

T t1; // t1 is a value of type T
T* p1; // p1 is a pointer to a value of type T

T can be whatever: int, float, another pointer, a struct, etc.  But for talking about NULL, the type doesn't matter, so we'll just stick with T.

There are three ways to create a pointer to a value of type T.

Most simply, you can take the address of a value of type T:

&t1 // is a pointer value pointing to t1
p1 = &t1; // stores that pointer value in the pointer variable

That's easy, but not really that common in C++; and it's forbidden in Java.  It's probably more common in Go.

The second way to create a pointer is by allocating a new chunk of memory:

new T() // allocates enough memory for a T value, initializes the value and returns a pointer to the memory
p1 = new T(); // stores that pointer in the variable

The third and final way to create a pointer is just to write NULL (possibly with a cast), which creates a value of type "pointer to T" that doesn't really point to anything.  So, kind of misleading.

NULL // the null pointer
p1 = NULL; // make the variable p1 not point to anything

Many languages implicitly set pointers that are not explicitly initialized to NULL.  In C and C++, that is true in some contexts; in other contexts the pointer just contains random garbage.  Let's assume that pointers not explicitly initialized are implicitly initialized to NULL.  So these are the same:

T* p1;
T* p1 = NULL;

The first two ways of creating a pointer value make it actually point to something; only the third way does not.  The only way to get a NULL value is to specify NULL (or imply it by not specifying anything).

Taking the address of a value in memory always succeeds; and the NULL value is just another literal.  new() can fail, either to allocate memory or to properly initialize the new value; if it fails it will throw; but if it gets past that, it can always successfully return the pointer value.  In C, malloc will return NULL if it fails, but you should always check for that and usually fail with an "out of memory" error.
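
For illustration, here is a minimal C sketch of that discipline (FailOutOfMemory is a hypothetical handler, not a standard function):

T* p = malloc(sizeof(T));
if (NULL == p) {
    FailOutOfMemory(); // hypothetical: report "out of memory" and abort
}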

Let me briefly mention that in C, which allows pretty flexible pointer arithmetic, such arithmetic still cannot turn a non-NULL pointer into a NULL, at least not in a way that is well defined.

So, what can you do with a pointer once you have it?  In the absence of pointer arithmetic, not too much.  A pointer is a value, so one thing you can do is copy it around: you can store it in variables, pass it to functions, return it from functions.

T* p2 = p1; // copy the value of the pointer p1 into p2 so that both now point to whatever p1 pointed to before
T* p2 = f(p1);

The second thing you can do with pointers is compare them:

if (p1 == p2) { ... }
if (NULL != p1) { ... }

In the absence of pointer arithmetic, less than (p1 < p2) and its brethren are forbidden.

And the last thing you can do with pointers is dereference them:

T t1 = *p1; // read through p1
*p2 = t1; // write through p2

That's it, three operations on pointer values: copying, comparing and dereferencing.

This is a pretty typical language.  C#, Java, Go, C and C++ all work basically this way.

In languages like this, whether a pointer is NULL doesn't matter when you're copying it.  Copying and comparing can never fail.  But dereferencing a NULL pointer causes a NULL pointer exception, which makes everyone involved very sad.  So you avoid NULL dereferences by judiciously sprinkling your code with lots of comparisons against NULL.

But wait, there's a better way.  Let's morph the language into a new and improved language.  In this new language, you can declare pointers that cannot point to NULL:

T*! v1 = &t1; // exclamation point (!) means never NULL
T*! v2 = NULL; // BZZT!  The compiler won't allow this.
T*! v3; // BZZT! This is also forbidden, because it implicitly initializes to NULL.
T*! v4 = new T(); // fine; if it doesn't throw, it returns a non-NULL value

But you can still declare a pointer which can hold a NULL or an actual pointer.

T*? n1 = &t1; // question mark (?) means sometimes NULL
T*? n2 = NULL;
T*? n3; // implicitly NULL
T*? n4 = n1;

So there are two types of pointers to T: never NULL (T*!) and sometimes NULL (T*?).  The new language does not support the traditional (T*) pointers from the old language. 

Neither of these types (T*! and T*?) alone supports all of the operations allowed on T* pointers.  T*!, as we saw already, forbids assigning NULL.  T*? forbids dereference, which prevents NULL dereference.  You can implicitly turn a T*! pointer into a T*? pointer, but not the other way:

T*! v1 = n1; // BZZT!  Cannot assign nullable pointer to non-nullable one.
T*! v1 = (T*!) n1; // BZZT!  Not even explicitly.
if (n1 == v1) { ...} // same as: if (n1 == (T*?) v1) { ... }

If T*? pointers don't allow dereference and cannot be turned into T*! pointers, what good are they?  Well, there is a way to get the T*! pointer back out, but it involves new syntax, the perhaps/butifnot statement:

perhaps (T*! v1 = n1) { // can assign T*? to T*! only in a "perhaps" header
   // here v1 is in scope and can be dereferenced
   t1 = *v1;
} butifnot {
   // here it is not because n1 is NULL
   t1 = t2;
}

This is sort of like

if (NULL != p1) {
   t1 = *p1;
} else {
   t1 = t2;
}  

except that you can't accidentally forget the NULL check.  Note that inside the perhaps clause you still can't dereference n1; you can dereference v1, the non-NULL pointer that n1 held.

People who have never used a language like this may initially think that it doesn't really add much value.  It forces you to do your NULL checks, but that seems more annoying than anything else.  Sure, there are technically no NULL dereferences, but don't you end up in "butifnot" blocks all the time with nothing to do but throw an exception, which might as well be a NullDereferenceException?

The answer to that question is: no, you never really end up in a "butifnot" block with nothing to do.  Think about how you would get to this point:

perhaps (T*! v1 = n1) {
   ... *v1 ...
} butifnot {
   // What should I do?  How did I get here?
}

Think in particular about where n1 came from.  Maybe you're writing a function, and it's a parameter:

void f(T*? n1) {
   // ...
   perhaps (T*! v1 = n1) {
      ... *v1 ...
   } butifnot {
      // How did I get here?
   }
   // ...
}

How did you get there?  You declared the wrong type for the function parameter.  You started writing the function thinking it would be ok for the pointer to be NULL, but you've now realized that you were wrong.  The correct way to proceed is not to throw a NullDereferenceException, or any exception.  The right thing to do is to fix the declaration of the parameter.  You need the parameter not to be NULL, so just declare it as non-nullable:

void f(T*! v1) {
  // ...
  ... *v1 ...
  // ...
}

No problem, no NULL dereference, no perhaps statement, no NULL checks.  Sure, someone is calling your function, but if they are passing it NULL, it's just going to throw anyway, so make them fix the problem the same way.
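
To make that concrete, here is a sketch of the caller's side (n0 and v0 are invented names).  A caller that already holds a T*! just calls f; a caller that holds only a T*? has to decide, at its own level, what NULL means:

f(v0); // v0 is a T*!, so this just works

perhaps (T*! v0 = n0) { // n0 is a T*?
   f(v0);
} butifnot {
   // handle the missing value here, where there is context for it
}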

Another way you might have gotten there is that you got n1 as the result of a function call:

T*? n1 = g();
// ...
perhaps (T*! v1 = n1) {
   ... *v1 ...
} butifnot {
   // How did I get here?
}
// ...

But you need to call g() and get a non-NULL pointer back.  You have two options.  The better one is fixing g() to return a non-NULL pointer, but that's not always feasible.  The other, which is always available, is to wrap it:

T*! wg() {
  T*? n = g();
  perhaps (T*! v = n) {
     return v;
  } butifnot {
     throw GFailed;
  }
}

And then you call your wrapper:

T*! v1 = wg();
// ...
... *v1 ...
// ...

Sure, this may still throw an exception, but it's really not equivalent to a NULL dereference.  First, it fails earlier, which is better.  Second, it tells you that g() didn't do what you were expecting, which is likely to be far more useful than a NULL dereference error.

By the way, a language like this doesn't need traditional NULL checks:

T*! v1 = &t1;
if (NULL == v1) { ... } // known false at compile time, just from the type of v1
T*? n1;
if (NULL != n1) { ... } // ok, but not as useful as a perhaps/butifnot statement

The other thing you don't understand when you're used to languages with only nullable pointers is how rarely you actually need them.  Most of the time non-nullable pointers are what you want.  Which really means that the case of the function g() returning a nullable pointer doesn't happen very often.

I will post sometime about sum types and matching, which is a great language feature that lets you write T*? and perhaps/butifnot yourself, along with many other useful things.

Sunday, May 26, 2013

INNER JOIN is not intersection

I've heard more than one person describe INNER JOINs as being like set intersection, usually accompanied by drawing a Venn diagram.  But that's wrong and misleading.

I'm not going to go into much detail on SQL syntax, but a JOIN operator is typically part of a SELECT statement, which looks kind of like this:

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        INNER JOIN RightTable R
            ON OnCondition
    WHERE WhereCondition;
   
LeftTable and RightTable are table expressions with aliases L and R respectively, and OnCondition and WhereCondition are boolean expressions which may involve the columns of the tables.  Here, "a" and "c" are columns of L, while "b" and "d" are columns of R.

The most important aspect of a JOIN (INNER, OUTER or otherwise) is that the result contains all of the columns of both tables, which is why the SELECT list (and the conditions) can reference both tables.

Since it's the most important aspect of a JOIN, I'm going to repeat it.  You get all of the columns of both tables.  Describing this as being anything like set intersection is completely missing the point.  With set intersection, both sets come from the same universe; and the result is also from that universe.  With INNER JOIN, the two input tables may have completely different columns; and the result has all columns from both, and therefore looks like neither.

While the ON clause is syntactically required for an INNER JOIN, this is also misleading.  You can freely move conditions between the ON clause and the WHERE clause.  There is absolutely no difference between the above statement and

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        INNER JOIN RightTable R
            ON 1=1
    WHERE OnCondition AND WhereCondition;
   
The expression 1=1 is a common idiom for writing TRUE in SQL.  An INNER JOIN like this one, where the OnCondition is TRUE, is called a Cartesian product, or a CROSS JOIN, and this statement is also equivalent:

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        CROSS JOIN RightTable R
    WHERE OnCondition AND WhereCondition;

What's a Cartesian product?  It means that each row of LeftTable is paired with every row of RightTable to make the rows of the result of the CROSS JOIN.  That is, if L contains the rows L1 and L2 and R contains the rows R1, R2 and R3, the result has rows
(L1, R1)
(L1, R2)
(L1, R3)
(L2, R1)
(L2, R2)
(L2, R3)
And this is how you get all of the columns: by pairing up all of the rows.  So if there are NL rows in L and NR rows in R, the result of the join has NL * NR rows.  You generally get more rows than you started with; how is this like set intersection?  With set intersection, you end up with at most the number of elements in the smaller of the two input sets.

SQL does contain an operator that is like set intersection.  It's called INTERSECT.  It is not like an INNER JOIN.
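
For contrast, here is roughly what INTERSECT looks like.  Unlike a JOIN, both inputs (and the result) must have the same column list, so this sketch assumes L.a, L.c and R.b, R.d have matching types:

SELECT L.a, L.c FROM LeftTable L
INTERSECT
SELECT R.b, R.d FROM RightTable R;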

Anyway, NL * NR is where the word "product" in "Cartesian product" comes into play.  It's Cartesian because this looks like the two-dimensional (x,y) coordinates that Descartes used to label the plane.  In fact, that's a pretty good way to think about it.  The result of a CROSS JOIN or INNER JOIN is a matrix (I'd say "table", but that would be pretty confusing here) with the rows of the two tables labeling the two dimensions.

       R1        R2        R3
L1  (L1, R1)  (L1, R2)  (L1, R3)
L2  (L2, R1)  (L2, R2)  (L2, R3)

In matrix jargon, this would be called an "outer product", but again that's pretty confusing here, since this is an INNER JOIN.  There's also an "inner product" in matrix algebra, but that is in no way relevant.  Forget that I said "matrix."  I meant to say "rectangular grid."  The number of elements in the rectangular grid is the product of the length of its sides, NL * NR.

Of course, the conditions may filter out rows, so NL * NR is the upper bound on the number of rows returned by the SELECT statement.  The extreme case is where the condition is FALSE (such as 0=1), which returns no rows.  You can think of the ON and WHERE clauses as checking each entry in the matrix to decide whether it's good enough to be in the final result set.

People (myself included) may tell you all sorts of rules for what sort of conditions belong in the ON clause of an INNER JOIN and what sort of conditions belong in the WHERE clause of the SELECT statement.  It's important to be aware that these are purely stylistic choices; there is no semantic difference.  (Although there is a difference for OUTER JOINs.)

People (the sort who will tell you that INNER JOIN is like intersection) will also try to suggest that there is a difference between an INNER JOIN and a CROSS JOIN.  But you'll know better.  As in my sample statements, every CROSS JOIN can be replaced by INNER JOIN ON 1=1, and every INNER JOIN can be replaced by CROSS JOIN, moving the OnCondition into the WHERE clause.  Whatever they tell you, INNER JOIN is a Cartesian product.

As long as we're going through equivalences, let me point out that INNER JOIN and CROSS JOIN are commutative; the order of the tables doesn't matter.  (Again, this is not true for OUTER JOINs.)  So, we could equally well write

SELECT L.a, R.b, L.c, R.d
    FROM RightTable R
        INNER JOIN LeftTable L
            ON OnCondition
    WHERE WhereCondition;

although I should mention that the phrase "left table" always refers to the table to the left of the JOIN, which here is RightTable.  Again, there may be stylistic reasons for deciding which order to put the tables in, but there is no semantic difference.

In general, the number of rows returned by an INNER JOIN is quadratic (NL * NR), but there is an important special case that guarantees a linear result.  Suppose columns K1 and K2 are a UNIQUE KEY of R; then include f1(L) = R.K1 AND f2(L) = R.K2 in the conditions, where f1 and f2 are arbitrary functions of the columns of L.  Then there is at most one row of R for every row of L; because the rows of L determine the values of the functions, and therefore the values of a UNIQUE KEY of R.  If that key is in R at all, there is exactly one such row in R.  So, in this case, there are at most NL resulting rows instead of NL * NR.
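
As a concrete sketch (with invented columns, where K1 and K2 are a UNIQUE KEY of RightTable, f1 is L.a + 1, and f2 is L.c):

SELECT L.a, R.b
    FROM LeftTable L
        INNER JOIN RightTable R
            ON L.a + 1 = R.K1
            AND L.c = R.K2;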

Of course, since the INNER JOIN is commutative, you could swap L and R.  So if a function of R determines a UNIQUE KEY of L, there are at most NR resulting rows.

One last point.  All of the example conditions I've mentioned involve equality (=), but there are many perfectly good conditions that do not:

ON L.a < R.b
    AND L.c <> R.d
   
WHERE L.a - R.b BETWEEN -2 AND 5
    AND L.c*R.d IN (12, 24, 60)

In summary:

- not all conditions are equality
- INNER JOIN is the same as CROSS JOIN, a Cartesian product
- INNER JOIN is quadratic, although you can prove tighter bounds in special cases
- INNER JOIN is not like intersection; it's like a union of columns and product of rows.

And don't let anyone tell you different.

Monday, May 20, 2013

RIFL vs. using

In the last several posts, I've introduced a resource management pattern I call RIFL, or Resource Is Function-Local.  I've given examples in Go, Perl, and C++, and compared to RAII.  Here, I'll give examples in C#, and discuss how it relates to the IDisposable pattern in C#.

In C#, a resource that needs management signals that to the coder by implementing the interface IDisposable.  That contains a single function, void Dispose(), which is about as simple as interfaces get.  Implementing IDisposable means that you should ensure that Dispose() gets called when you're done with the resource.  And the language provides the "using" statement, which is some lovely syntax making calling Dispose() easy.  A typical usage looks like this:

using (var resource = new Resource()) {
    resource.Use();
} // implicitly calls resource.Dispose()

This is roughly equivalent to

var resource = new Resource();
try {
    resource.Use();
} finally {
    resource.Dispose();
}

The differences are mostly too subtle to affect our discussion here.

Often, as with RAII, you want some resource to have the same lifetime as an instance of a class.  In that case, you construct the IDisposable resource in the constructor for the new class, and you have that class also implement IDisposable and call Dispose() on the resource in its own Dispose() method.  Actually, there are subtle, confusing and dangerous interactions between Dispose(), the actual destructor (well, finalizer), the garbage collector and OS resources, which make correctly implementing Dispose() rather tricky.  Fortunately, for our purposes, all of these issues are too subtle to affect our discussion much.
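
Here is a minimal sketch of that composition, ignoring the finalizer-related subtleties just mentioned (Resource is the hypothetical IDisposable class from the examples above):

public class Owner : IDisposable {
    private readonly Resource resource_ = new Resource(); // same lifetime as the Owner
    public void Use() { resource_.Use(); }
    public void Dispose() { resource_.Dispose(); } // pass the release along
}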

I'm not comfortable really talking much about Java, as I'm much more familiar with C#, but recent versions of Java have an AutoCloseable interface and an extension to the "try" syntax which are similar to IDisposable and "using".

C# has similar flexibility to C++ because IDisposable is so similar to RAII.  Here are a few examples of using resources and non-resources in C#:

public void ExampleUsingResource() {
    using (var resource = new Resource()) {
        resource.Use();
    }
}

public Resource ExampleUseOfEscapingResource() {
    var resource = new Resource();
    resource.Use();
    return resource;
}

public void ExampleUseOfLeakingResource() {
    var resource = new Resource();
    resource.Use();
} // failed to release resource properly by calling Dispose()

public void ExampleUseOfNonResource() {
    var nonResource = new NonResource();
    nonResource.Use();
} // no need to release nonResource, since it does not implement IDisposable

So this is even trickier than in C++: leaking a resource not only looks like a cross between "using" a resource and deliberately allowing a resource to escape the current scope; it looks identical to using a non-resource.  And here's where it gets really tricky: the difference between a non-resource and a resource is that a resource implements IDisposable.  And how do you know that when you're coding?  Well, if you're using Visual Studio with IntelliSense (or something roughly equivalent), usually you would try typing "resource.Disp" and seeing if the system wants to autocomplete it to Dispose.  This almost works, except for three issues.

First, and pretty minor, is that it's possible to implement Dispose without implementing IDisposable, and then you can't use "using".  But you find that out as soon as you try to use a "using" statement, so that's mostly ok.

Second, and pretty tricky, is that some classes explicitly implement IDisposable.Dispose, so that the Dispose method isn't available unless you cast to (IDisposable).  That means it won't autocomplete so you can be fooled into not using "using".
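
For example, here is a sketch of an explicit implementation (Sneaky is an invented class):

using System;

public class Sneaky : IDisposable {
    void IDisposable.Dispose() { } // explicit: hidden from Sneaky's public surface
}

public static class Demo {
    public static void Main() {
        var s = new Sneaky();
        // s.Dispose();             // does not compile, and won't autocomplete
        ((IDisposable)s).Dispose(); // reachable only through the interface
    }
}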

Third, sometimes it is apparently safe not to call Dispose on certain IDisposable objects, such as Task.  This turns out to be pretty convenient, since in many of the ways you are encouraged to use Tasks in recent C# versions, it's very hard to find the right way to Dispose them.  But this means that sometimes the Leaking code doesn't really leak, which is pretty confusing.

Oh, and did I mention that sometimes Dispose throws?

What it adds up to is that the IDisposable pattern seems fine, but ends up being really painful to use safely.  Fortunately, C# has pretty good lambda support, so you can easily implement and use RIFL.

public class RiflResource {
    public static void WithResource(Action<RiflResource> useFunc) {
        using (var raw = RawResource.Obtain()) { // often: new RawResource()
            var rifl = new RiflResource(raw);
            useFunc(rifl);
        } // implicitly Dispose (Release)
    }
   
    public void Use() { raw_.Use(); }
    private readonly RawResource raw_;
    private RiflResource(RawResource raw) { raw_ = raw; }
}

And it's easy to use:

public void ExampleRiflUsage() {
    RiflResource.WithResource(resource => {
        resource.Use();
    });
}

In fact, of the languages I've used in examples, C# has the best syntax for anonymous functions, which makes this the cleanest usage.  And I intentionally wrote my example so that it's obvious how to add multiple calls to Use; this simple case has an even briefer alternate syntax:

public void ExampleRiflUsage() {
    RiflResource.WithResource(resource => resource.Use());
}

or even

public void ExampleRiflUsage() {
    RiflResource.WithResource(Use);
}

but that last form only works if you want to call Use() alone, once, with no arguments; and strictly speaking it requires a compatible static method named Use, since C# won't convert a bare instance-method group like that to an Action<RiflResource>.

I am really happy to wrap up my posts on RIFL.  So far, I've only blogged on 2 topics, and each of those turned into multi-post discussions, where each post was too long.  I hope to find a reasonable-length topic for the next post.

Friday, May 17, 2013

RIFL vs. RAII

In previous posts, I introduced a resource-management pattern I call RIFL (pronounced "rifle"), or Resource Is Function-Local.  Here I will talk about the standard C++ way of managing resources, which is called RAII, or Resource Acquisition Is Initialization.  RAII is usually pronounced "are eh aye aye", but I prefer "rah eeeeeeeeeee".

In RAII, you create a class in which the resource is obtained in the constructor and released in the destructor.  Thus the resource necessarily has exactly the same lifetime as the instance of the class, so the instance is a proxy for the resource.  And, if you allocate that instance on the stack, it is released when the stack frame is destroyed.  Actually, in C++, it is released when control exits the lexical scope, but that's not too far different.

class RawResource {
public:
    static RawResource Obtain();
    void Use();
    void Release();
};

class RaiiResource {
public:
    RaiiResource() : raw_(RawResource::Obtain()) {}
    ~RaiiResource() { raw_.Release(); }
    void Use() { raw_.Use(); }
private:
    RawResource raw_; // held by value; a reference couldn't bind to the temporary returned by Obtain()
};

This is a bit simpler than a RiflResource.  Using the resource is also easy:

void ExampleRaiiUsage() {
    RaiiResource resource;
    resource.Use();
}

This is very similar to using a RiflResource, but again a bit simpler.  As with RIFL, there is no reference to the raw resource, and no likelihood of releasing the resource twice.

The next example usage is both a strength and a weakness of RAII.  If the RAII object is not allocated on the stack, the lifetime of the resource is still the lifetime of the object, whatever it may be.  The strength is that if you need a more flexible lifetime, you can create one:

RaiiResource *RaiiEscape() {
    RaiiResource *resource = new RaiiResource();
    resource->Use();
    return resource;
}

Before I get into the technical weaknesses of RAII, let me just warn you that confusing it with the RIAA may get you sued.

There's a weakness of RAII which is almost identical to the strength; you can accidentally fail to release the resource:

void RaiiLeak() {
    RaiiResource *resource = new RaiiResource();
    resource->Use();
}

This is a memory leak of the resource object, and therefore a resource leak as well.  The biggest problem with this weakness is that the code looks like a cross between the other two valid usages.  In this simple example, of course, it is easy to see the leak; but it takes a lot of discipline to avoid the leaks in large programs.  Of course, C++ offers tools (such as std::shared_ptr) to help manage memory (and therefore, with RAII, other resources).
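
For completeness, here is a sketch of that remedy using a C++11 smart pointer; the RaiiResource is destroyed when the last shared_ptr to it goes away:

#include <memory>

void RaiiNoLeak() {
    auto resource = std::make_shared<RaiiResource>();
    resource->Use();
} // the shared_ptr destroys the RaiiResource, which Releases the RawResource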

If you recall, I only showed fake C++ code (using a "finally" clause) to implement RIFL in C++.  The actual way to implement RIFL in C++ is on top of RAII.

class RiflResource {
public:
    static void WithResource(void useResource(RiflResource &resource)) {
        RaiiResource resource; // Obtain
        RiflResource riflResource(resource); // wrap
        useResource(riflResource); // use
    } // implicit release when RAII resource exits scope
    void Use() { raii_.Use(); }
private:
    RiflResource(RaiiResource &raii) : raii_(raii) {}
    RaiiResource &raii_;
};

RAII turns managing any resource into managing memory.  Managing memory is hard in the general case, but easy when the memory is on the stack.  RIFL turns managing any resource into managing the stack, which is always easy, but more limiting.

Holding a lock (not the lock itself) is an unusual resource, because there are no Use() methods, just Obtain (Lock) and Release (Unlock), in the usual implementation.  C++ has an RAII wrapper for locking, which gets used like this:

void AccessWhileRaiiLocked() {
    std::lock_guard<std::mutex> myLock(myMutex);
    x.Use(); // use some class data safe in the knowledge that it is locked.
    y.Use();
} // implicit unlock when lock goes out of scope

This is roughly equivalent to this:

void AccessWhileUnsafelyLowLevelLocked() {
    myMutex.lock();
    x.Use();
    y.Use();
    myMutex.unlock(); // not executed if either Use() throws
}

And to this RIFL example:

void AccessWhileRiflLocked() {
    nonstd::lock_rifl<std::mutex>::HoldingLock(myMutex, []() {
        x.Use();
        y.Use();
    });
}

Ignoring the fact that nothing keeps you from using "x" or "y" outside any of the locking mechanisms, I would argue that RIFL has an advantage over RAII in this case.  It's not that it's simpler; it's slightly more verbose.  But the RAII example looks like it simply has an unused variable.  It's not at all obvious from the code that the x.Use is somehow nested within the lock; or that it's at all related.  Or even, that the lock is needed.

Better would be to use RIFL to actually restrict access to the variables:

class RiflControlled {
public:
    RiflControlled(usable const &x0, usable const &y0) : x_(x0), y_(y0) {}
    void Sync(void useFunc(usable &x, usable &y)) {
        std::lock_guard<std::mutex> raiiLock(mutex_);
        useFunc(x_, y_);
    } // implicit unlock
private:
    usable x_;
    usable y_;
    std::mutex mutex_;
};

void AccessWhileRiflControlled(RiflControlled &rifl) {
    rifl.Sync([](usable &x, usable &y) {
        x.Use();
        y.Use();
    });
}

With RiflControlled, there is simply no way to access "x" or "y" without holding the lock.  You've guaranteed that the locking is correct.  Well, that's a bit overstated, but you really have to deliberately undermine it; it's not going to happen by accident.  Note that in this case, the RIFL function is a (non-static) method on the object, unlike all the previous RIFL examples.  This again suggests that RIFL is a flexible approach to resource management.

With RAII, you can't limit the access to the variables to the scope where the lock is held.  Challenge: prove me wrong.

RAII is necessary in C++ because of the lack of a "finally" clause on a try; there's really no good way around it.  However, it is also a relatively low-level mechanism, which can be abused.  RIFL in C++ can be used as a wrapper around RAII, providing more rigid control over resources (which is good); but less flexible control over resources (which is bad).  But RAII is a step up from the raw resource; one might say that RAII is semi-automatic, while RIFL is fully-automatic.

In the next post, I'll compare RIFL to another resource management pattern similar to RAII, IDisposable in C#, and talk about implementing RIFL in C#.

Wednesday, May 15, 2013

RIFL Examples

In the last post, I introduced a pattern I call RIFL, or "Resource Is Function-Local".  The goal is to control access to a resource in such a way that there is a guarantee that it is released exactly once.  The idea is that the resource never exists in the wild.  It is created within a function, and can only be used when passed into another function which is a parameter of the first.  That's a little confusing, but let me jump into some examples which should make it clearer.

First, here is a RIFL wrapper around writing files in Go.  Note that a function declaration in Go is something like "func funcName(argName argType) returnType".

package rifl
import "os"

type File struct{ file *os.File }

func WithFile(fileName string, useFile func(file File) error) error {
    file, err := os.Create(fileName) // create or truncate, open for writing
    if err != nil {
        return err
    }
    defer file.Close() // see note below
    return useFile(File{file})
}

func (file File) WriteString(s string) (int, error) {
    return file.file.WriteString(s)
}

I think this example is fairly readable for those who don't know Go.  The only thing that really needs explaining is the "defer" statement.  "defer" basically means: put the rest of the function in a "try" block, and put this call in the "finally" clause.  That is, it names a function call which is not executed until the rest of the function completes; but which is guaranteed to execute, even if there is an exception (which Go calls a "panic").
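
Here is a tiny self-contained sketch of that guarantee, independent of files:

package main

import "fmt"

func main() {
    defer fmt.Println("deferred") // queued now, runs on the way out
    fmt.Println("body")
    panic("boom") // "deferred" still prints before the panic propagates
}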

And this is how the rifl package might be used, with poor error handling:

package main
import "rifl"

func main() {
    rifl.WithFile("out.txt", func(file rifl.File) error {
        file.WriteString("Hello\n")
        file.WriteString("Goodbye\n")
        return nil
    })
}

Go is a really good language in many ways, but if you're used to C++'s destructors, you'll miss them mightily.  RIFL can help you feel a bit better.

Note that in the main package, there are no references to os.File, only to rifl.File; we're not using the raw resource, just the RIFL wrapper.  I only implemented WriteString, but I could have implemented the entire interface, or (in principle) improved it.  Or made it worse.  Also, unlike my abstract example in the previous post, the Obtain function (os.Create) and the Use function (WriteString) both take arguments and return multiple values.  So it's a bit noisier.

I deliberately made the useFile argument the last argument to WithFile, because it tends to run on over many lines.

The most important point, of course, is that, in the WithFile function, we guarantee that the file will be closed, by using "defer".  It doesn't matter how many goats are sacrificed while running "useFile"; the file will get closed.

Also, the os.File object "file" is hidden from users.  In Go, identifiers with an initial uppercase letter (File, WithFile, WriteString) are exported; others (file) are not.  This means the caller has no way to call Close().  But that's fine, since WithFile handles that.

Now for something completely different.  Here's a RIFL wrapper for a database transaction in Perl, annotated (comments after # signs) so that those who don't know Perl can (I hope) follow it:

package XAct; # class
use Moose;

has '_dbh' => (is => 'ro'); # private data

sub Atomically { # RIFL function
    my ($dbh, $func) = @_;
    my @results;
    $dbh->begin_work; # start transaction
    eval { # try
        my $wrap = XAct->new(_dbh => $dbh);
        @results = $func->($wrap); # in transaction
        $dbh->commit;
    };
    if ($@) { # catch
        $dbh->rollback;
        die $@; # rethrow
    }
    return @results;
}

sub prepare { # pass-through Use
    my ($self, @args) = @_;
    return $self->_dbh->prepare(@args);
}

1; # end package

The RIFL function Atomically guarantees that the transaction is in effect when the passed-in function, $func, is called; and that it is cleaned up afterwards.  A transaction, unlike a typical resource, can be Released either by a Commit or by a Rollback.  Atomically guarantees that if $func does not throw, the transaction is committed; and if it does throw, the transaction is rolled back.  So, this is an additional complexity easily handled by RIFL.

As before, the transaction object (which is really the database handle) is wrapped to prevent direct access to the commit or rollback.

Note that the database handle itself is a resource, which could also be managed by a RIFL function, but that is not included in this example.

Here is an example function using Atomically:

sub InsertSomeNames {
    my ($dbh, %map) = @_;
    my $sql = 'INSERT INTO SomeNames(Id, Name) VALUES(?,?)';
    XAct::Atomically($dbh, sub {
        my ($xact) = @_; # get resource
        my $sth = $xact->prepare($sql); # use directly
        while (my ($id, $name) = each(%map)) {
            $sth->bind_param(1, $id);
            $sth->bind_param(2, $name);
            $sth->execute; # use indirectly
        }
    });
}

Here the transaction is being passed to the inner function as $xact, which is being used to insert many values into the database.  If any of those inserts fail, all of them will be rolled back; they will only be committed if they all succeed.

The point of this post is that the RIFL pattern is pretty easy to implement and pretty easy to use in many languages.  It relies only on good anonymous functions (lambda functions), used when calling the RIFL function.  Of course, you also have to be able to reasonably manage the resource so that it is guaranteed to be freed at the end of the RIFL function; but it seems like that's a fundamental prerequisite to being able to manage resources in any way in the language.

In the next post, I'll compare RIFL with RAII and show some C++ examples.

Monday, May 13, 2013

Introduction to RIFL

I want to introduce a new pattern for resource management, which I call RIFL (pronounced "rifle"), which stands for "Resource Is Function-Local".  RIFL has some advantages (besides the better name) over another common pattern, RAII.  One big advantage of RIFL is that it can be used in languages like Go which do not have destructors.  I'll compare RIFL and RAII in detail in an upcoming post, but first I want to introduce RIFL, and provide some examples.

"Resource" is a word which here means "something which, when you're done using it, it's best to tell someone that you're done".  That is, when you're done with memory, you should free (or delete) it; when you're done with a file (or network connection), you should close it; when you're done with a transaction you should commit it (or roll it back); when you're done with a lock being locked, you should unlock it.  And a big "et cetera", because this sort of thing is pretty common.  Also, this is slightly broader than the more typical definition of resource; for instance, while a lock itself is often considered a resource, here I'm consider the state of it being locked to be a resource (i.e., "holding the lock", rather than just the existence of the lock).

When you're done reading the book, you should put it back on the shelf.

The life-cycle of a resource is: Obtain, Use (and Use some more), Release.  People often say "allocate" or "open" instead of "obtain", and "free" or "deallocate" or "close" instead of Release; and of course, there are a thousand variants of "use" depending on the resource: "read", "write", "update", etc.  There can be several "use"s for a single resource.  Files typically have "read", "write", and sometimes "seek".  With a database transaction, you have at least "select", "insert", "update", and "delete"; and perhaps "merge", "bulk insert", "create", and "drop".  But when talking about resources in general, I'll just stick with "Use".

The challenging part of resource management is making sure you never forget to put the book back on the shelf, even if you leave the library because of a fire alarm.  I mean, the hard part is being sure that you release the resource even with complicated control flow, including an exception being thrown.  Occasionally it also gets tricky not using the resource after it is released, and especially only releasing the resource once.

One resource that people don't usually think about managing is stack frames (i.e., local variables in a function).  You could say "When you're done with the stack frame, return from the function."  But you don't need to say that; it's pretty natural.  And RIFL takes advantage of this by turning managing other resources (which is hard) into managing the stack (which is easy).

My examples in this post will be mainly in a C++-like language that has a "finally" clause for the "try" statement.  Which is not standard C++, but is easy to understand.

A typical raw (unmanaged) resource then would look like this:

class RawResource {
public:
    static RawResource Obtain();
    void Use();
    void Release();
};

Often Obtain() and Use() take parameters, but that's not important for getting the basic idea.  As I mentioned before, there are often actually many Use() methods.  Occasionally, as with locks, there may be no Use() methods at all.

Also note that Obtain() is a static method returning an object of the class.  Essentially that's a constructor; so, Obtain() can just be the constructor for the class.

This is how a raw resource is supposed to be used:

void ExampleLeakyRawUsage() {
    RawResource resource = RawResource.Obtain();
    resource.Use();
    DoSomethingElse();
    resource.Use();
    resource.Release();
}

That's pretty straightforward, but if Use() or DoSomethingElse() can throw, the Release() will never get called.  It's also possible to have a dozen places in your code where you use the resource, and you could easily miss a Release in one of them; or just miss it along one code path if there is complicated code in the middle.  This leaks the resource, which is bad.

If you want to use the resource correctly, you have to do something like this:

void ExampleRawUsage() {
    RawResource resource = RawResource.Obtain();
    try {
        resource.Use();
        DoSomethingElse();
        resource.Use();
    } finally { // not C++
        resource.Release(); // guaranteed
    }
}

That's a lot of boilerplate every time you want to use the resource.  Here's a very similar function:

void AnotherExampleRawUsage() {
    RawResource resource = RawResource.Obtain();
    try {
        DoAnotherThing();
        resource.Use();
        DoSomethingElseEntirely();
        resource.Use();
    } finally { // not C++
        resource.Release();
    }
}

Not really very different.  Followers of the DRY (Don't Repeat Yourself) principle immediately look for a way to abstract out the duplication.  And it's pretty straightforward, if you have a background in functional programming.  The technique takes a little getting used to for a dysfunctional programmer, but it shouldn't; if you have almost-duplicate code, you turn it into a function, with the differences as parameters.  This is the same thing.

Everything is the same except for the body of the try block.  So make that a parameter to a function, and use that function to eliminate the redundancy.  When I say "make the body of the try block a parameter", I mean a parameter which is a function:

void AlmostARiflFunction(void useResource(RawResource &resource)) {
    RawResource resource = RawResource.Obtain();
    try {
        useResource(resource);
    } finally { // not C++
        resource.Release();
    }
}

Here "useResource" is a function which is passed to "AlmostARiflFunction".  And, in turn, AlmostARifl function passes the resource to "useResource", which then, well, uses the resource. 

The usage looks like this, with anonymous functions (which in real life I usually call lambdas or lambda functions) being passed in.

void ExampleAlmostRiflUsage() {
    AlmostARiflFunction([](RawResource &resource) {
        resource.Use();
        DoSomethingElse();
        resource.Use();
    });
}

void AnotherExampleAlmostRiflUsage() {
    AlmostARiflFunction([](RawResource &resource) {
        DoAnotherThing();
        resource.Use();
        DoSomethingElseEntirely();
        resource.Use();
    });
}

The repetition is vastly reduced.  If you're not familiar with it, the function arguments demonstrate the syntax for anonymous functions in C++-11.  C++ always has elegant syntax; the no-op anonymous function is "[](){}", but you can abbreviate that to "[]{}" for even better readability.  Does sarcasm come across in a blog?

RIFL usability depends on anonymous function usability; C++ isn't great, but it's better than C which doesn't have anonymous functions at all; or MATLAB which doesn't support multi-statement anonymous functions.

Higher-order functions (i.e., functions taking functions as parameters) are the bread of functional programming (the butter is immutability), so I'm sure RIFL is already widely used in functional programming, and probably better named.  But let me elaborate on how best to implement it.

The good thing about AlmostARiflFunction is that, as long as you don't obtain a RawResource on your own, and only use AlmostARiflFunction to get access to one, you cannot fail to release it; or rather AlmostARiflFunction guarantees that it will release it for you, which is even better.

The bad thing is that nothing stops you from releasing it yourself, which can cause trouble.  Or continuing to use it after you release it yourself.  The obvious fix is to hide the Release method.

class RiflResource {
public:
    static void WithResource(void useResource(RiflResource &resource)) {
        RawResource resource = RawResource.Obtain();
        try {
            RiflResource riflResource(resource);
            useResource(riflResource);
        } finally { // not C++
            resource.Release();
        }   
    }
    void Use() { raw_.Use(); }
private:
    RiflResource(RawResource &raw) : raw_(raw) {}
    RawResource &raw_;
};

We simply provide a wrapper class that provides Use() but does not provide Release() or Obtain().  And we can make the RIFL function a static method of the class, to tie everything together.  You can think of the RIFL function as a wrapper for both Obtain() and Release().

void ExampleRiflUsage() {
    RiflResource.WithResource([](RiflResource &resource) {
        resource.Use();
        DoSomethingElse();
        resource.Use();
    });
}

Now we have guaranteed that Release() is called exactly once, and there is no way to use the resource after it is released.

Also, note that the using code now has no references to the raw resource.  We presumably can't remove the raw resource from the language, but as long as we only use it through the RIFL class, we don't have to worry about releasing it (or even explicitly obtaining it).

In the same way that you get a stack frame by calling a function, and free it up by  returning (often implicitly by falling off the end of the function), we obtain the resource by calling the RIFL function, and it is released when our anonymous function ends.

In my next post, I'll give some examples of RIFL in Go and Perl.  And I promise to get to real C++ and RAII soon.

Sunday, May 5, 2013

High-level floats

In my last two posts, I talked about integers and rational numbers in high-level languages.  In this post, I'm going to talk about real numbers.

As with integers and rational numbers, you know the properties of real numbers.  One of the more obscure properties, but very important for computers, is that there are an uncountable number of them.

Rational numbers are countable, which basically means that there are no more rational numbers than there are integers.  That sounds strange, since the integers seem like a small subset of the rationals, but both are infinite, and infinity isn't that intuitive.

But real numbers are uncountable, which means that there's really no good way to represent them exactly in a computer in general, at least in the traditional way we think about representing values.

There is an important subset of the real numbers, called the algebraic numbers, which are countable.  Unlike real numbers, many people don't learn about algebraic numbers when they are young, so you may not be familiar with them.  They are solutions to polynomial equations with integer coefficients, which includes numbers like square roots and cube roots.  Algebraic numbers can be represented exactly in a computer, but still not in a way that permits easy computation.  So, they are important, but not for general purpose computer programming.

I hinted that there is a non-traditional way to represent real numbers in general.  The trick is that you don't have to represent all the reals, just the ones that you generate while running your program, and they can be represented as the series of operations that generated them.  That is, this program in an imaginary C variant:

real a = 3;
real b = 5;
real c = sqrt(a);
real d = log(b);
real e = c + d;
printf("%r\n", e);

would output "sqrt(3) + log(5)" which is exactly correct, but not necessarily useful.  In particular, there is no straightforward way to execute code like this:

bool Maybe(real x) { return sin(x) == sqrt(1 - cos(x)**2); }

Symbolic manipulation systems, like Mathematica, may be able to handle simple cases like this using automatic theorem provers and so on, but the general case is insoluble.  Note that Maybe should return true when x is 3 or 7, but false when x is 5 (sqrt returns the nonnegative root, and sin(5) is negative); it's not all that obvious.

What it comes down to is this: programming languages cannot provide good support for real numbers.  It's technically impossible.  But we need to work with square roots and sines and so on, so they provide approximations using limited precision floating point.

Many modern languages define a type called "float" or "double" or "real" essentially by just saying "IEEE 754 binary64" and leaving it at that.  And IEEE 754 is a very good standard, so that almost works out ok.  Other languages may define them slightly differently (usually slightly worse), but similarly enough for our purposes here.  And, of course, some languages have two or more floating-point types, with different representations.

I argue that there is a problem with "float" in high-level languages; but it's the opposite of the problem with integers.  Languages often have support for proper integers, but they make it hard to use by hiding it away in a "bigint" type.  But they make "float" easy to use, with built-in literals and so on.  I think it's bad that float seems easy to use in many languages, because it can't actually be easy; it can only seem that way.

Limited precision types are inherently tricky to work with.  I'm going to use, for purposes of example, decimal numbers with 2 digits of precision; that is, numbers that can be written as a.bEc for decimal digits a and b, and arbitrary integer c.  So, 1 is 1.0E0 and 12 is 1.2E1, but 123 can only be approximated as 1.2E2, which is really 120.

So, what is 100 + 4 + 4, or 1.0E2+4.0E0+4.0E0?  The answer, sadly but necessarily, is: it depends on how you associate.  1.0E2+(4.0E0+4.0E0) is 1.0E2 + 8.0E0, which should round to 1.1E2, which is as close to the right answer as you can get.  But in (1.0E2+4.0E0)+4.0E0 the first term rounds down to 1.0E2, so it reduces to 1.0E2+4.0E0, which again rounds down to 1.0E2, which is not the best answer.  Besides getting a slightly wrong answer (even in the best case), note that addition is not associative with limited precision numbers.  This makes it hard to reason about mathematical expressions, and also hard for compilers to optimize them.
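
The same effect is easy to reproduce with real IEEE 754 doubles, where the spacing between representable values near 1.0E16 is 2; a small C sketch:

#include <stdio.h>

int main(void) {
    double a = 1e16, b = 1.0, c = 1.0;
    printf("%.1f\n", (a + b) + c); // 10000000000000000.0; each 1 is rounded away
    printf("%.1f\n", a + (b + c)); // 10000000000000002.0; the best possible answer
    return 0;
}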

Adding three numbers can get you a slightly inaccurate answer; adding a thousand can be much worse.  Suppose you have a set of a thousand numbers to add, and you naively add them in order.  And suppose further that all of the numbers are 4, or 4.0E0.  The correct answer is, of course, 1000*4 = 4000 or 4.0E3.  After you add the first 25, you correctly get 100 or 1.0E2 as your partial solution, but then each additional number you add rounds back down to 1.0E2, so your final answer is 100, which is pretty much not 4000.

The most accurate algorithm for adding a set of fixed-precision numbers is shockingly complicated (and left as an exercise for the reader, since you look like you could use some exercise).  There's not a much simpler task than adding a bunch of numbers, but it's actually hard to get this right.  Why?  Because fixed-precision arithmetic is inherently difficult.
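
One well-known partial remedy (not the fully accurate algorithm alluded to above) is Kahan's compensated summation, which carries the rounding error of each addition forward into the next one; a sketch in C:

#include <stddef.h>

double KahanSum(const double *x, size_t n) {
    double sum = 0.0, comp = 0.0;    // comp tracks the lost low-order bits
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - comp;      // re-inject the error from last time
        double t = sum + y;          // low bits of y may be rounded away here...
        comp = (t - sum) - y;        // ...and this recovers them
        sum = t;
    }
    return sum;
}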

I said that IEEE 754 is a good standard.  By that I mean that it pretty much does the right thing for all the cases it covers.  So, if you add two numbers, you get the best answer you can get.  But it covers adding two numbers, not a thousand; that's outside its scope.

How hard is fixed-precision arithmetic?  Universities often cover it in graduate classes, which means that people with a B.S. in Computer Science may know very little about it.  If you're doing non-trivial math with floats, and you don't have this background, my advice is: find a good library and use it; and then figure out an independent way to check your results.

You may be thinking, "But I've used floats with no problem."  That's not skill, that's luck.  If the precision of the float is much larger than the precision you need, or you're only performing a few operations, it often works out fine.  But eventually it will bite you, unless you start worrying a lot about it.

If you take nothing else from this post, just remember this: It is easy to drown when you try to float.

Since fixed-precision arithmetic is so hard, a good high-level language should remind the user of the problems by making it a little awkward to use; and should help a little with the problems by being flexible.  These can go together.  Instead of saying "float", I want to say "float(precision:2, base:10)".  And I don't want to write "z = x*y;".  I want to write "z = mult<precision:x.precision+y.precision>(x,y)" specifying the precision of the result in terms of the precision of the operands, and requiring that the bases are the same.  And I certainly don't want the language implicitly casting floats from one precision or base to another; that's something I want to be completely aware of.

So, where people can get away with rational arithmetic, which is exact and therefore easy, encourage them to use it.  And where they really need fixed-precision, which is hard, make them specify it for every operation to remind them that they need to think about it for every operation.  And let people work in decimal precision; often it's the correct solution for the problem, and it's almost always easier to think about.  Again, I have no problem supporting binary precision as well, but if anything is the default, it should be decimal.  A language could even allow arbitrary bases; that's not that hard.

I'm not sure why so many languages get this wrong.  C, which dates back to when floating point was slow (and often inaccurate, before IEEE 754), makes using floating point easy, presumably because it was supported by the hardware that C was developed on.  Strangely, database languages like SQL seem to do this better than most general-purpose programming languages.

Finally, there are two minor difficulties using IEEE 754 which are not necessary in a high-level language.  IEEE 754, of course, is designed to be low-level; and is entirely appropriate to a low-level language.   One problem is the fact that the precision is not specified by the user (except by selecting the type).  If I need 100 decimal digits, I should be able to specify it.  Often SQL allows this, at least up to a point.

The other problem is that the exponent is essentially a bounded integer, and therefore floats have a limited range and can overflow. The limited range is rather large, but it can still impact code.  Similarly, there is a smallest positive value; and there are issues with denormalized small numbers with lower precision than expected.  A high-level language should provide a float type with an unlimited range, but fixed precision.  That is, the exponent should be an integer, not a bounded integer.  I know of no languages that support such a type.

Again, the language could support constraints to bound the exponent, and use machine floats where it knows the exponent (and precision) provided by the machine suffices.

I may have already mentioned that fixed-precision work is difficult; why make it even harder by bounding the exponents unnecessarily?

So, in summary: floating-point should not be easy; it should be hard, but not harder than it needs to be.  High-level languages should encourage the use of exact types, like rational, where they suffice; and should remind programmers that they need to think a lot about limited precision where it is needed; but should not force programmers to worry about incorrect bases or finite bounds where that is not needed.

Here endeth my series of posts on high-level languages and numeric types.  I hope that future high-level languages do a better job with numerics than current ones do, since the current ones are pretty shameful.