Sunday, May 26, 2013

INNER JOIN is not intersection

I've heard more than one person describe INNER JOINs as being like set intersection, usually accompanied by drawing a Venn diagram.  But that's wrong and misleading.

I'm not going to go into much detail on SQL syntax, but a JOIN operator is typically part of a SELECT statement, which looks kind of like this:

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        INNER JOIN RightTable R
            ON OnCondition
    WHERE WhereCondition;
   
LeftTable and RightTable are table expressions with aliases L and R respectively, and OnCondition and WhereCondition are boolean expressions which may involve the columns of the tables.  Here, "a" and "c" are columns of L, while "b" and "d" are columns of R.

The most important aspect of a JOIN (INNER, OUTER or otherwise) is that the result contains all of the columns of both tables, which is why the SELECT list (and the conditions) can reference both tables.

Since it's the most important aspect of a JOIN, I'm going to repeat it.  You get all of the columns of both tables.  Describing this as being anything like set intersection is completely missing the point.  With set intersection, both sets come from the same universe; and the result is also from that universe.  With INNER JOIN, the two input tables may have completely different columns; and the result has all columns from both, and therefore looks like neither.

While the ON clause is syntactically required for an INNER JOIN, this is also misleading.  You can freely move conditions between the ON clause and the WHERE clause.  There is absolutely no difference between the above statement and

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        INNER JOIN RightTable R
            ON 1=1
    WHERE OnCondition AND WhereCondition;
   
The expression 1=1 is a common idiom for writing TRUE in SQL.  An INNER JOIN like this one, where the OnCondition is TRUE, is called a Cartesian product, or a CROSS JOIN, and this statement is also equivalent:

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        CROSS JOIN RightTable R
    WHERE OnCondition AND WhereCondition;

What's a Cartesian product?  It means that each row of LeftTable is paired with every row of RightTable to make the rows of the result of the CROSS JOIN.  That is, if L contains the rows L1 and L2 and R contains the rows R1, R2 and R3, the result has rows
(L1, R1)
(L1, R2)
(L1, R3)
(L2, R1)
(L2, R2)
(L2, R3)
And this is how you get all of the columns: by pairing up all of the rows.  So if there are NL rows in L and NR rows in R, the result of the join has NL * NR rows.  You generally get more rows than you started with; how is this like set intersection?  With set intersection, you end up with at most the number of elements in the smaller of the two input sets.

SQL does contain an operator that is like set intersection.  It's called INTERSECT.  It is not like an INNER JOIN.

Anyway, the NL * NR is where the word "product" comes into play in "Cartesian product".  It's Cartesian because this looks like the two-dimensional (x,y) coordinates that Descartes used to label the plane.  In fact, that's a pretty good way to think about it.  The result of a CROSS JOIN or INNER JOIN is a matrix (I'd say "table", but that would be pretty confusing here) with the rows of the two tables labeling the two dimensions.

       R1        R2        R3
L1  (L1, R1)  (L1, R2)  (L1, R3)
L2  (L2, R1)  (L2, R2)  (L2, R3)

In matrix jargon, this would be called an "outer product", but again that's pretty confusing here, since this is an INNER JOIN.  There's also an "inner product" in matrix algebra, but that is in no way relevant.  Forget that I said "matrix."  I meant to say "rectangular grid."  The number of elements in the rectangular grid is the product of the length of its sides, NL * NR.

Of course, the conditions may filter out rows, so NL * NR is the upper bound on the number of rows returned by the SELECT statement.  The extreme case is where the condition is FALSE (such as 0=1), which returns no rows.  You can think of the ON and WHERE clauses as checking each entry in the matrix to decide whether it's good enough to be in the final result set.

People (myself included) may tell you all sorts of rules for what sort of conditions belong in the ON clause of an INNER JOIN and what sort of conditions belong in the WHERE clause of the SELECT statement.  It's important to be aware that these are purely stylistic choices; there is no semantic difference.  (Although there is a difference for OUTER JOINs.)

People (the sort who will tell you that INNER JOIN is like intersection) will also try to suggest that there is a difference between an INNER JOIN and a CROSS JOIN.  But you'll know better.  As in my sample statements, every CROSS JOIN can be replaced by INNER JOIN ON 1=1, and every INNER JOIN can be replaced by CROSS JOIN, moving the OnCondition into the WHERE clause.  Whatever they tell you, INNER JOIN is a Cartesian product.

As long as we're going through equivalences, let me point out that INNER JOIN and CROSS JOIN are commutative; the order of the tables doesn't matter.  (Again, this is not true for OUTER JOINs.)  So, we could equally well write

SELECT L.a, R.b, L.c, R.d
    FROM RightTable R
        INNER JOIN LeftTable L
            ON OnCondition
    WHERE WhereCondition;

although I should mention that the phrase "left table" always refers to the table to the left of the JOIN, which here is RightTable.  Again, there may be stylistic reasons for deciding which order to put the tables in, but there is no semantic difference.

In general, the number of rows returned by an INNER JOIN is quadratic (NL * NR), but there is an important special case that guarantees a linear result.  Suppose columns K1 and K2 are a UNIQUE KEY of R, and suppose the conditions include f1(L) = R.K1 AND f2(L) = R.K2, where f1 and f2 are arbitrary functions of the columns of L.  Then there is at most one row of R for every row of L, because each row of L determines the values of the functions, and therefore the values of a UNIQUE KEY of R.  If that key is in R at all, there is exactly one such row in R.  So, in this case, there are at most NL resulting rows instead of NL * NR.

Of course, since the INNER JOIN is commutative, you could swap L and R.  So if a function of R determines a UNIQUE KEY of L, there are at most NR resulting rows.

One last point.  All of the example conditions I've mentioned involve equality (=), but there are many perfectly good conditions that do not:

ON L.a < R.b
    AND L.c <> R.d
   
WHERE L.a - R.b BETWEEN -2 AND 5
    AND L.c*R.d IN (12, 24, 60)

In summary:

- not all conditions are equality
- INNER JOIN is the same as CROSS JOIN, a Cartesian product
- INNER JOIN is quadratic, although you can prove tighter bounds in special cases
- INNER JOIN is not like intersection; it's like a union of columns and product of rows.

And don't let anyone tell you different.

Monday, May 20, 2013

RIFL vs. using

In the last several posts, I've introduced a resource management pattern I call RIFL, or Resource Is Function-Local.  I've given examples in Go, Perl, and C++, and compared to RAII.  Here, I'll give examples in C#, and discuss how it relates to the IDisposable pattern in C#.

In C#, a resource that needs management signals that to the coder by implementing the interface IDisposable.  It contains a single method, void Dispose(), which is about as simple as interfaces get.  Implementing IDisposable means that you should ensure that Dispose() gets called when you're done with the resource.  And the language provides the "using" statement, which is some lovely syntax that makes calling Dispose() easy.  A typical usage looks like this:

using (var resource = new Resource()) {
    resource.Use();
} // implicitly calls resource.Dispose()

This is roughly equivalent to

var resource = new Resource();
try {
    resource.Use();
} finally {
    resource.Dispose();
}

The differences are mostly too subtle to affect our discussion here.

Often, as with RAII, you want some resource to have the same lifetime as an instance of a class.  In that case, you construct the IDisposable resource in the constructor for the new class, and you have that class also implement IDisposable and call Dispose() on the resource in its own Dispose() method.  Actually, there are subtle, confusing and dangerous interactions between Dispose(), the actual destructor (well, finalizer), the garbage collector and OS resources, which make correctly implementing Dispose() rather tricky.  Fortunately, for our purposes, all of these issues are too subtle to affect our discussion much.

I'm not comfortable really talking much about Java, as I'm much more familiar with C#, but recent versions of Java have an AutoCloseable interface and an extension to the "try" syntax which is similar to IDisposable and "using".

C# has similar flexibility to C++ because IDisposable is so similar to RAII.  Here are a few examples of using resources and non-resources in C#:

public void ExampleUsingResource() {
    using (var resource = new Resource()) {
        resource.Use();
    }
}

public Resource ExampleUseOfEscapingResource() {
    var resource = new Resource();
    resource.Use();
    return resource;
}

public void ExampleUseOfLeakingResource() {
    var resource = new Resource();
    resource.Use();
} // failed to release resource properly by calling Dispose()

public void ExampleUseOfNonResource() {
    var nonResource = new NonResource();
    nonResource.Use();
} // no need to release nonResource, since it does not implement IDisposable

So this is even trickier than in C++, since leaking a resource not only looks like a cross between "using" a resource and deliberately allowing a resource to escape the current scope, but also looks identical to using a non-resource.  And here's where it gets really tricky:  the difference between a non-resource and a resource is that a resource implements IDisposable.  And how do you know that when you're coding?  Well, if you're using Visual Studio with IntelliSense (or something roughly equivalent), usually you would try typing "resource.Disp" and seeing if the system wants to autocomplete it to Dispose.  This almost works, except for three issues.

First, and pretty minor, is that it's possible to implement Dispose without implementing IDisposable, and then you can't use "using".  But you find that out as soon as you try to use a "using" statement, so that's mostly ok.

Second, and pretty tricky, is that some classes explicitly implement IDisposable.Dispose, so that the Dispose method isn't available unless you cast to (IDisposable).  That means it won't autocomplete so you can be fooled into not using "using".

Third, sometimes it is apparently safe not to call Dispose on certain IDisposable objects, such as Task.  This turns out to be pretty convenient, since in many of the ways you are encouraged to use Tasks in recent C# versions, it's very hard to find the right way to Dispose them.  But this means that sometimes the Leaking code doesn't really leak, which is pretty confusing.

Oh, and did I mention that sometimes Dispose throws?

What it adds up to is that the IDisposable pattern seems fine, but ends up being really painful to use safely.  Fortunately, C# has pretty good lambda support, so you can easily implement and use RIFL.

public class RiflResource {
    public static void WithResource(Action<RiflResource> useFunc) {
        using (var raw = RawResource.Obtain()) { // often: new RawResource()
            var rifl = new RiflResource(raw);
            useFunc(rifl);
        } // implicitly Dispose (Release)
    }
   
    public void Use() { raw_.Use(); }
    private readonly RawResource raw_;
    private RiflResource(RawResource raw) { raw_ = raw; }
}

And it's easy to use

public void ExampleRiflUsage() {
    RiflResource.WithResource(resource => {
        resource.Use();
    });
}

In fact, of the languages I've used in examples, C# has the best syntax for anonymous functions, which makes this the cleanest usage.  And I intentionally wrote my example so that it's obvious how to add multiple calls to Use; this simple case has an even briefer alternate syntax:

public void ExampleRiflUsage() {
    RiflResource.WithResource(resource => resource.Use());
}

or even

public void ExampleRiflUsage() {
    RiflResource.WithResource(Use);
}

but that only works if you want to call Use() alone once with no arguments.

I am really happy to wrap up my posts on RIFL.  So far, I've only blogged on 2 topics, and each of those turned into multi-post discussions, where each post was too long.  I hope to find a reasonable-length topic for the next post.

Friday, May 17, 2013

RIFL vs. RAII

In previous posts, I introduced a resource-management pattern I call RIFL (pronounced "rifle"), or Resource Is Function-Local.  Here I will talk about the standard C++ way of managing resources, which is called RAII, or Resource Acquisition Is Initialization.  RAII is usually pronounced "are eh aye aye", but I prefer "rah eeeeeeeeeee".

In RAII, you create a class in which the resource is obtained in the constructor and released in the destructor.  Thus the resource necessarily has exactly the same lifetime as the instance of the class, so the instance is a proxy for the resource.  And, if you allocate that instance on the stack, it is released when the stack frame is destroyed.  Actually, in C++, it is released when control exits the lexical scope, but that's not too far different.

class RawResource {
public:
    static RawResource Obtain();
    void Use();
    void Release();
};

class RaiiResource {
public:
    RaiiResource() : raw_(RawResource::Obtain()) {}
    ~RaiiResource() { raw_.Release(); }
    void Use() { raw_.Use(); }
private:
    RawResource raw_;
};

This is a bit simpler than a RiflResource.  Using the resource is also easy:

void ExampleRaiiUsage() {
    RaiiResource resource;
    resource.Use();
}

This is very similar to using a RiflResource, but again a bit simpler.  As with RIFL, there is no reference to the raw resource, and no likelihood of releasing the resource twice.

The next example usage is both a strength and a weakness of RAII.  If the RAII object is not allocated on the stack, the lifetime of the resource is still the lifetime of the object, whatever it may be.  The strength is that if you need a more flexible lifetime, you can create one:

RaiiResource *RaiiEscape() {
    RaiiResource *resource = new RaiiResource();
    resource->Use();
    return resource;
}

Before I get into the technical weaknesses of RAII, let me just warn you that confusing it with the RIAA may get you sued.

There's a weakness of RAII which is almost identical to the strength; you can accidentally fail to release the resource:

void RaiiLeak() {
    RaiiResource *resource = new RaiiResource();
    resource->Use();
}

This is a memory leak of the resource object, and therefore a resource leak as well.  The biggest problem with this weakness is that the code looks like a cross between the other two valid usages.  In this simple example, of course, it is easy to see the leak; but it takes a lot of discipline to avoid the leaks in large programs.  Of course, C++ offers tools (such as std::shared_ptr) to help manage memory (and therefore, with RAII, other resources).
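For instance, here's a sketch (reusing the RaiiResource class above) of handing the object to std::shared_ptr, so that the destructor, and therefore Release(), runs when the last owner of the pointer goes away rather than being forgotten:

#include <memory>

std::shared_ptr<RaiiResource> SharedRaiiEscape() {
    auto resource = std::make_shared<RaiiResource>();
    resource->Use();
    return resource; // ~RaiiResource (and so Release) runs when the last
                     // shared_ptr owning this object is destroyed
}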

If you recall, I only showed fake C++ code (using a "finally" clause) to implement RIFL in C++.  The actual way to implement RIFL in C++ is on top of RAII.

class RiflResource {
public:
    static void WithResource(void useResource(RiflResource &resource)) {
        RaiiResource resource; // Obtain
        RiflResource riflResource(resource); // wrap
        useResource(riflResource); // use
    } // implicit release when RAII resource exits scope
    void Use() { raii_.Use(); }
private:
    RiflResource(RaiiResource &raii) : raii_(raii) {}
    RaiiResource &raii_;
};

RAII turns managing any resource into managing memory.  Managing memory is hard in the general case, but easy when the memory is on the stack.  RIFL turns managing any resource into managing the stack, which is always easy, but more limiting.

Holding a lock (not the lock itself) is an unusual resource, because there are no Use() methods, just Obtain (Lock) and Release (Unlock), in the usual implementation.  C++ has an RAII wrapper for locking, which gets used like this:

void AccessWhileRaiiLocked() {
    std::lock_guard<std::mutex> myLock(myMutex);
    x.Use(); // use some class data safe in the knowledge that it is locked.
    y.Use();
} // implicit unlock when lock goes out of scope

This is roughly equivalent to this:

void AccessWhileUnsafelyLowLevelLocked() {
    myMutex.lock();
    x.Use();
    y.Use();
    myMutex.unlock(); // not executed if either Use() throws
}

And to this RIFL example:

void AccessWhileRiflLocked() {
    nonstd::lock_rifl<std::mutex>::HoldingLock(myMutex, []() {
        x.Use();
        y.Use();
    });
}

Ignoring the fact that nothing keeps you from using "x" or "y" outside any of the locking mechanisms, I would argue that RIFL has an advantage over RAII in this case.  It's not that it's simpler; it's slightly more verbose.  But the RAII example looks like it simply has an unused variable.  It's not at all obvious from the code that the x.Use is somehow nested within the lock; or that it's at all related.  Or even, that the lock is needed.

Better would be to use RIFL to actually restrict access to the variables:

class RiflControlled {
public:
    void Sync(void useFunc(usable &x, usable &y)) {
        std::lock_guard<std::mutex> raiiLock(mutex_);
        useFunc(x_, y_);
    } // implicit unlock
private:
    RiflControlled(usable const &x0, usable const &y0) : x_(x0), y_(y0) {}
    usable x_;
    usable y_;
    std::mutex mutex_;
};

void AccessWhileRiflControlled(RiflControlled &rifl) {
    rifl.Sync([](usable &x, usable &y) {
        x.Use();
        y.Use();
    });
}

With RiflControlled, there is simply no way to access "x" or "y" without holding the lock.  You've guaranteed that the locking is correct.  Well, that's a bit overstated, but you really have to deliberately undermine it; it's not going to happen by accident.  Note that in this case, the RIFL function is a (non-static) method on the object, unlike all the previous RIFL examples.  This again suggests that RIFL is a flexible approach to resource management.

With RAII, you can't limit the access to the variables to the scope where the lock is held.  Challenge: prove me wrong.

RAII is necessary in C++ because of the lack of a "finally" clause on a try; there's really no good way around it.  However, it is also a relatively low-level mechanism, which can be abused.  RIFL in C++ can be used as a wrapper around RAII, providing more rigid control over resources (which is good); but less flexible control over resources (which is bad).  But RAII is a step up from the raw resource; one might say that RAII is semi-automatic, while RIFL is fully-automatic.

In the next post, I'll compare RIFL to another resource management pattern similar to RAII, IDisposable in C#, and show how to implement RIFL in C#.

Wednesday, May 15, 2013

RIFL Examples

In the last post, I introduced a pattern I call RIFL, or "Resource Is Function-Local".  The goal is to control access to a resource in such a way that there is a guarantee that it is released exactly once.  The idea is that the resource never exists in the wild.  It is created within a function, and can only be used when passed into another function which is a parameter of the first.  That's a little confusing, but let me jump into some examples which should make it clearer.

First, here is a RIFL wrapper around writing files in Go.  Note that a function declaration in Go is something like "func funcName(argName argType) returnType".

package rifl
import "os"

type File struct{ file *os.File }

func WithFile(fileName string, useFile func(file File) error) error {
    file, err := os.Create(fileName) // create or truncate, open for writing
    if err != nil {
        return err
    }
    defer file.Close() // see note below
    return useFile(File{file})
}

func (file File) WriteString(s string) (int, error) {
    return file.file.WriteString(s)
}

I think this example is fairly readable for those who don't know Go.  The only thing that really needs explaining is the "defer" statement.  "defer" basically means: put the rest of the function in a "try" block, and put this in the "finally" clause.  That is, it contains a function call which is not executed until the rest of the function is completed, but is guaranteed to execute, even if there is an exception (which Go calls a "panic").

And this is how the rifl package might be used, with poor error handling:

package main
import "rifl"

func main() {
    rifl.WithFile("out.txt", func(file rifl.File) error {
        file.WriteString("Hello\n")
        file.WriteString("Goodbye\n")
        return nil
    })
}

Go is a really good language in many ways, but if you're used to C++'s destructors, you'll miss them mightily.  RIFL can help you feel a bit better.

Note that in the main package, there are no references to os.File, only to rifl.File; we're not using the raw resource, just the RIFL wrapper.  I only implemented WriteString, but I could have implemented the entire interface, or (in principle) improved it.  Or made it worse.  Also, unlike my abstract example in the previous post, the Obtain function (os.Create) and the Use function (WriteString) both take arguments and return multiple values.  So it's a bit noisier.

I deliberately made the useFile argument the last argument to WithFile, because it tends to run on over many lines.

The most important point, of course, is that, in the WithFile function, we guarantee that the file will be closed, by using "defer".  It doesn't matter how many goats are sacrificed while running "useFile"; the file will get closed.

Also, the os.File object "file" is hidden from users.  In Go, identifiers with an initial uppercase letter (File, WithFile, WriteString) are exported; others (file) are not.  This means the caller has no way to call Close().  But that's fine, since WithFile handles that.

Now for something completely different.  Here's a RIFL wrapper for a database transaction in Perl, annotated (comments after # signs) so that those who don't know Perl can (I hope) follow it:

package XAct; # class
use Moose;

has '_dbh' => (is => 'ro'); # private data

sub Atomically { # RIFL function
    my ($dbh, $func) = @_;
    my @results;
    $dbh->begin_work; # start transaction
    eval { # try
        my $wrap = XAct->new(_dbh => $dbh);
        @results = $func->($wrap); # in transaction
        $dbh->commit;
    };
    if ($@) { # catch
        $dbh->rollback;
        die $@; # rethrow
    }
    return @results;
}

sub prepare { # pass-through Use
    my ($self, @args) = @_;
    return $self->_dbh->prepare(@args);
}

1; # end package

The RIFL function Atomically guarantees that the transaction is in effect when the passed-in function, $func, is called; and that it is cleaned up afterwards.  A transaction, unlike a typical resource, can be Released either by a Commit or by a Rollback.  Atomically guarantees that if $func does not throw, the transaction is committed; and if it does throw, the transaction is rolled back.  So, this is an additional complexity easily handled by RIFL.

As before, the transaction object (which is really the database handle) is wrapped to prevent direct access to the commit or rollback.

Note that the database handle itself is a resource, which could also be managed by a RIFL function, but that is not included in this example.

Here is an example function using Atomically:

sub InsertSomeNames {
    my ($dbh, %map) = @_;
    my $sql = 'INSERT INTO SomeNames(Id, Name) VALUES(?,?)';
    XAct::Atomically($dbh, sub {
        my ($xact) = @_; # get resource
        my $sth = $xact->prepare($sql); # use directly
        while (my ($id, $name) = each(%map)) {
            $sth->bind_param(1, $id);
            $sth->bind_param(2, $name);
            $sth->execute; # use indirectly
        }
    });
};

Here the transaction is being passed to the inner function as $xact, which is being used to insert many values into the database.  If any of those inserts fail, all of them will be rolled back; they will only be committed if they all succeed.

The point of this post is that the RIFL pattern is pretty easy to implement and pretty easy to use in many languages.  It relies only on good anonymous functions (lambda functions), used when calling the RIFL function.  Of course, you also have to be able to reasonably manage the resource so that it is guaranteed to be freed at the end of the RIFL function; but it seems like that's a fundamental prerequisite to being able to manage resources in any way in the language.

In the next post, I'll compare RIFL with RAII and show some C++ examples.

Monday, May 13, 2013

Introduction to RIFL

I want to introduce a new pattern for resource management, which I call RIFL (pronounced "rifle"), which stands for "Resource Is Function-Local".  RIFL has some advantages (besides the better name) over another common pattern, RAII.  One big advantage of RIFL is that it can be used in languages like Go which do not have destructors.  I'll compare RIFL and RAII in detail in an upcoming post, but first I want to introduce RIFL, and provide some examples.

"Resource" is a word which here means "something which, when you're done using it, it's best to tell someone that you're done".  That is, when you're done with memory, you should free (or delete) it; when you're done with a file (or network connection), you should close it; when you're done with a transaction you should commit it (or roll it back); when you're done with a lock being locked, you should unlock it.  And a big "et cetera", because this sort of thing is pretty common.  Also, this is slightly broader than the more typical definition of resource; for instance, while a lock itself is often considered a resource, here I'm consider the state of it being locked to be a resource (i.e., "holding the lock", rather than just the existence of the lock).

When you're done reading the book, you should put it back on the shelf.

The life-cycle of a resource is: Obtain, Use (and Use some more), Release.  People often say "allocate" or "open" instead of "obtain", and "free" or "deallocate" or "close" instead of Release; and of course, there are a thousand variants of "use" depending on the resource: "read", "write", "update", etc.  There can be several "use"s for a single resource.  Files typically have "read", "write", and sometimes "seek".  With a database transaction, you have at least "select", "insert", "update", and "delete"; and perhaps "merge", "bulk insert", "create", and "drop".  But when talking about resources in general, I'll just stick with "Use".

The challenging part of resource management is making sure you never forget to put the book back on the shelf, even if you leave the library because of a fire alarm.  I mean, the hard part is being sure that you release the resource even with complicated control flow, including an exception being thrown.  Occasionally it also gets tricky not using the resource after it is released, and especially only releasing the resource once.

One resource that people don't usually think about managing is stack frames (i.e., local variables in a function).  You could say "When you're done with the stack frame, return from the function."  But you don't need to say that; it's pretty natural.  And RIFL takes advantage of this by turning managing other resources (which is hard) into managing the stack (which is easy).

My examples in this post will be mainly in a C++-like language that has a "finally" clause for the "try" statement.  Which is not standard C++, but is easy to understand.

A typical raw (unmanaged) resource then would look like this:

class RawResource {
public:
    static RawResource Obtain();
    void Use();
    void Release();
};

Often Obtain() and Use() take parameters, but that's not important for getting the basic idea.  As I mentioned before, there are often actually many Use() methods.  Occasionally, as with locks, there may be no Use() methods at all.

Also note that Obtain() is a static method returning an object of the class.  Essentially that's a constructor; so, Obtain() can just be the constructor for the class.

This is how a raw resource is supposed to be used:

void ExampleLeakyRawUsage() {
    RawResource resource = RawResource.Obtain();
    resource.Use();
    DoSomethingElse();
    resource.Use();
    resource.Release();
}

That's pretty straightforward, but if Use() or DoSomethingElse() can throw, the Release() will never get called.  It's also possible to have a dozen places in your code where you use the resource, and you could easily miss a Release in one of them; or just miss it along one code path if there is complicated code in the middle.  This leaks the resource, which is bad.

If you want to use the resource correctly, you have to do something like this:

void ExampleRawUsage() {
    RawResource resource = RawResource.Obtain();
    try {
        resource.Use();
        DoSomethingElse();
        resource.Use();
    } finally { // not C++
        resource.Release(); // guaranteed
    }
}

That's a lot of boilerplate every time you want to use the resource.  Here's a very similar function:

void AnotherExampleRawUsage() {
    RawResource resource = RawResource.Obtain();
    try {
        DoAnotherThing();
        resource.Use();
        DoSomethingElseEntirely();
        resource.Use();
    } finally { // not C++
        resource.Release();
    }
}

Not really very different.  Followers of the DRY (Don't Repeat Yourself) principle immediately look for a way to abstract out the duplication.  And it's pretty straightforward, if you have a background in functional programming, though the technique takes a little getting used to for a dysfunctional programmer.  But it shouldn't: if you have almost-duplicate code, you turn it into a function, with the differences as parameters.  This is the same thing.

Everything is the same except for the body of the try block.  So make that a parameter to a function, and use that function to eliminate the redundancy.  When I say "make the body of the try block a parameter", I mean a parameter which is a function:

void AlmostARiflFunction(void useResource(RawResource &resource)) {
    RawResource resource = RawResource.Obtain();
    try {
        useResource(resource);
    } finally { // not C++
        resource.Release();
    }
}

Here "useResource" is a function which is passed to "AlmostARiflFunction".  And, in turn, AlmostARifl function passes the resource to "useResource", which then, well, uses the resource. 

The usage looks like this, with anonymous functions (which in real life I usually call lambdas or lambda functions) being passed in.

void ExampleAlmostRiflUsage() {
    AlmostARiflFunction([](RawResource &resource) {
        resource.Use();
        DoSomethingElse();
        resource.Use();
    });
}

void AnotherExampleAlmostRiflUsage() {
    AlmostARiflFunction([](RawResource &resource) {
        DoAnotherThing();
        resource.Use();
        DoSomethingElseEntirely();
        resource.Use();
    });
}

The repetition is vastly reduced.  If you're not familiar with it, the function arguments demonstrate the syntax for anonymous functions in C++-11.  C++ always has elegant syntax; the no-op anonymous function is "[](){}", but you can abbreviate that to "[]{}" for even better readability.  Does sarcasm come across in a blog?
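One thing these examples gloss over is that a real "use" function usually needs data from the enclosing function; the capture list in the brackets handles that.  Here's a sketch in the same style (ReportUsage is a made-up helper, just to show that results can flow back out of the lambda):

void ExampleAlmostRiflUsageWithCapture(int useCount) {
    int successes = 0;
    AlmostARiflFunction([&](RawResource &resource) {
        // [&] captures useCount and successes from the enclosing function by reference
        for (int i = 0; i < useCount; ++i) {
            resource.Use();
            ++successes;
        }
    });
    ReportUsage(successes); // hypothetical helper, not defined in this post
}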

RIFL usability depends on anonymous function usability; C++ isn't great, but it's better than C which doesn't have anonymous functions at all; or MATLAB which doesn't support multi-statement anonymous functions.

Higher-order functions (i.e., functions taking functions as parameters) are the bread of functional programming (the butter is immutability), so I'm sure RIFL is already widely used in functional programming, and probably better named.  But let me elaborate on how best to implement it.

The good thing about AlmostARiflFunction is that, as long as you don't obtain a RawResource on your own, and only use AlmostARiflFunction to get access to one, you cannot fail to release it; or rather AlmostARiflFunction guarantees that it will release it for you, which is even better.

The bad thing is that nothing stops you from releasing it yourself, which can cause trouble.  Or continuing to use it after you release it yourself.  The obvious fix is to hide the Release method.

class RiflResource {
public:
    static void WithResource(void useResource(RiflResource &resource)) {
        RawResource resource = RawResource.Obtain();
        try {
            RiflResource riflResource(resource);
            useResource(riflResource);
        } finally { // not C++
            resource.Release();
        }   
    }
    void Use() { raw_.Use(); }
private:
    RiflResource(RawResource &raw) : raw_(raw) {}
    RawResource &raw_;
};

We simply provide a wrapper class that provides Use() but does not provide Release() or Obtain().  And we can make the RIFL function a static method of the class, to tie everything together.  You can think of the RIFL function as a wrapper for both Obtain() and Release().

void ExampleRiflUsage() {
    RiflResource.WithResource([](RiflResource &resource) {
        resource.Use();
        DoSomethingElse();
        resource.Use();
    });
}

Now we have guaranteed that Release() is called exactly once, and there is no way to use the resource after it is released.

Also, note that the using code now has no references to the raw resource.  We presumably can't remove the raw resource from the language, but as long as we only use it through the RIFL class, we don't have to worry about releasing it (or even explicitly obtaining it).

In the same way that you get a stack frame by calling a function, and free it up by returning (often implicitly by falling off the end of the function), we obtain the resource by calling the RIFL function, and it is released when our anonymous function ends.

In my next post, I'll give some examples of RIFL in Go and Perl.  And I promise to get to real C++ and RAII soon.

Sunday, May 5, 2013

High-level floats

In my last two posts, I talked about integers and rational numbers in high-level languages.  In this post, I'm going to talk about real numbers.

As with integers and rational numbers, you know the properties of real numbers.  One of the more obscure properties, but very important for computers, is that there are an uncountable number of them.

Rational numbers are countable, which basically means that there are no more rational numbers than there are integers.  That sounds strange, since the integers seem like a small subset of the rationals, but both are infinite, and infinity isn't that intuitive.

But real numbers are uncountable, which means that there's really no good way to represent them exactly in a computer in general, at least in the traditional way we think about representing values.

There is an important subset of the real numbers, called the algebraic numbers, which are countable.  Unlike real numbers, many people don't learn about algebraic numbers when they are young, so you may not be familiar with them.  They are solutions to polynomial equations with integer coefficients, which includes numbers like square roots and cube roots.  Algebraic numbers can be represented exactly in a computer, but still not in a way that permits easy computation.  So, they are important, but not for general purpose computer programming.

I hinted that there is a non-traditional way to represent real numbers in general.  The trick is that you don't have to represent all the reals, just the ones that you generate while running your program, and they can be represented as the series of operations that generated them.  That is, this program in an imaginary C variant:

real a = 3;
real b = 5;
real c = sqrt(a);
real d = log(b);
real e = c + d;
printf("%r\n", e);

would output "sqrt(3) + log(5)" which is exactly correct, but not necessarily useful.  In particular, there is no straightforward way to execute code like this:

bool Maybe(real x) { return sin(x) == sqrt(1 - cos(x)**2); }

Symbolic manipulation systems, like Mathematica, may be able to handle simple cases like this using automatic theorem provers and so on, but the general case is insoluble.  Note that Maybe should return true when x is 3 or 7, but false when x is 5, since sqrt returns the non-negative root and sin(5) is negative; it's not all that obvious.

What it comes down to is this: programming languages cannot provide good support for real numbers.  It's technically impossible.  But we need to work with square roots and sines and so on, so they provide approximations using limited precision floating point.

Many modern languages define a type called "float" or "double" or "real" essentially by just saying "IEEE 754 binary64" and leaving it at that.  And IEEE 754 is a very good standard, so that almost works out ok.  Other languages may define them slightly differently (usually slightly worse), but similarly enough for our purposes here.  And, of course, some languages have two or more floating-point types, with different representations.

I argue that there is a problem with "float" in high-level languages; but it's the opposite of the problem with integers.  Languages often have support for proper integers, but they make it hard to use by hiding it away in a "bigint" type.  But they make "float" easy to use, with built-in literals and so on.  I think it's bad that float seems easy to use in many languages, because it can't actually be easy; it can only seem that way.

Limited precision types are inherently tricky to work with.  I'm going to use, for purposes of example, decimal numbers with 2 digits of precision; that is, numbers that can be written as a.bEc for decimal digits a and b and an arbitrary integer c.  So, 1 is 1.0E0 and 12 is 1.2E1, but 123 can only be approximated as 1.2E2, which is really 120.

So, what is 100 + 4 + 4, or 1.0E2+4.0E0+4.0E0?  The answer, sadly but necessarily, is: it depends on how you associate.  1.0E2+(4.0E0+4.0E0) is 1.0E2 + 8.0E0, which should round to 1.1E2, which is as close to the right answer as you can get.  But in (1.0E2+4.0E0)+4.0E0 the first sum rounds down to 1.0E2, so it reduces to 1.0E2+4.0E0, which again rounds down to 1.0E2, which is not the best answer.  Besides getting a slightly wrong answer (even in the best case), note that addition is not associative with limited precision numbers.  This makes it hard to reason about mathematical expressions, and also hard for compilers to optimize them.
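The same thing happens with ordinary binary doubles; here's a small C++ check (the exact digits assume IEEE 754 binary64, which is what essentially every current platform uses for double):

#include <cstdio>

int main() {
    double a = (0.1 + 0.2) + 0.3; // 0.60000000000000009
    double b = 0.1 + (0.2 + 0.3); // 0.59999999999999998
    std::printf("%.17g\n%.17g\n%s\n", a, b, a == b ? "equal" : "not equal");
    return 0;
}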

Adding three numbers can get you a slightly inaccurate answer; adding a thousand can be much worse.  Suppose you have a set of a thousand numbers to add, and you naively add them in order.  And suppose further that all of the numbers are 4, or 4.0E0.  The correct answer is, of course, 1000*4 = 4000 or 4.0E3.  After you add the first 25, you correctly get 100 or 1.0E2 as your partial solution, but then each additional number you add rounds back down to 1.0E2, so your final answer is 100, which is pretty much not 4000.

The most accurate algorithm for adding a set of fixed-precision numbers is shockingly complicated (and left as an exercise for the reader, since you look like you could use some exercise).  There's not a much simpler task than adding a bunch of numbers, but it's actually hard to get this right.  Why?  Because fixed-precision arithmetic is inherently difficult.
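To give a flavor of what getting it right involves, here's a sketch of Kahan's compensated summation in C++.  It is not the most accurate algorithm I alluded to above, but it tracks the rounding error of each addition and does far better than the naive loop:

#include <vector>

double KahanSum(const std::vector<double> &xs) {
    double sum = 0.0;
    double compensation = 0.0; // running estimate of the lost low-order digits
    for (double x : xs) {
        double adjusted = x - compensation;
        double next = sum + adjusted;           // low-order digits of adjusted can be lost here
        compensation = (next - sum) - adjusted; // recover what was lost, to subtract next time
        sum = next;
    }
    return sum;
}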

I said that IEEE 754 is a good standard.  By that I mean that it pretty much does the right thing for all the cases it covers.  So, if you add two numbers, you get the best answer you can get.  But it covers adding two numbers, not a thousand; that's outside its scope.

How hard is fixed-precision arithmetic?  Universities often cover it in graduate classes, which means that people with a B.S. in Computer Science may know very little about it.  If you're doing non-trivial math with floats, and you don't have this background, my advice is: find a good library and use it; and then figure out an independent way to check your results.

You may be thinking, "But I've used floats with no problem."  That's not skill, that's luck.  If the precision of the float is much larger than the precision you need, or you're only performing a few operations, it often works out fine.  But eventually it will bite you, unless you start worrying a lot about it.

If you take nothing else from this post, just remember this: It is easy to drown when you try to float.

Since fixed-precision arithmetic is so hard, a good high-level language should remind the user of the problems by making it a little awkward to use; and should help a little with the problems by being flexible.  These can go together.  Instead of saying "float", I want to say "float(precision:2, base:10)".  And I don't want to write "z = x*y;".  I want to write "z = mult<precision:x.precision+y.precision>(x,y)" specifying the precision of the result in terms of the precision of the operands, and requiring that the bases are the same.  And I certainly don't want the language implicitly casting floats from one precision or base to another; that's something I want to be completely aware of.
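No language I know of actually works this way, but here's a rough C++ sketch of the flavor I have in mind; Fixed, its fields, and mult are all made up for illustration:

#include <cstdint>

// Hypothetical limited-precision decimal type: at most Precision significant
// digits, stored as significand * Base^exponent.
template <int Precision, int Base = 10>
struct Fixed {
    std::int64_t significand;
    int exponent;
};

// The result type states its precision in terms of the operands' precisions,
// and the bases must match; no implicit conversions anywhere.
template <int PX, int PY, int Base>
Fixed<PX + PY, Base> mult(Fixed<PX, Base> x, Fixed<PY, Base> y) {
    // A PX-digit significand times a PY-digit significand fits in PX+PY digits,
    // so this particular operation needs no rounding.
    return {x.significand * y.significand, x.exponent + y.exponent};
}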

So, where people can get away with rational arithmetic, which is exact and therefore easy, encourage them to use it.  And where they really need fixed-precision, which is hard, make them specify it for every operation to remind them that they need to think about it for every operation.  And let people work in decimal precision; often times it's the correct solution for the problem, and it's almost always easier to think about.  Again, I have no problem supporting binary precision as well, but if anything is the default, it should be decimal.  A language could even allow arbitrary bases; that's not that hard.

I'm not sure why so many languages get this wrong.    C, which dates back to when floating point was slow (and often inaccurate before IEEE 754), makes using floating point easy, presumably because it was supported by the hardware that C was developed on. Strangely, database languages like SQL seem to do this better than most general-purpose programming languages.

Finally, there are two minor difficulties using IEEE 754 which are not necessary in a high-level language.  IEEE 754, of course, is designed to be low-level; and is entirely appropriate to a low-level language.   One problem is the fact that the precision is not specified by the user (except by selecting the type).  If I need 100 decimal digits, I should be able to specify it.  Often SQL allows this, at least up to a point.

The other problem is that the exponent is essentially a bounded integer, and therefore floats have a limited range and can overflow. The limited range is rather large, but it can still impact code.  Similarly, there is a smallest positive value; and there are issues with denormalized small numbers with lower precision than expected.  A high-level language should provide a float type with an unlimited range, but fixed precision.  That is, the exponent should be an integer, not a bounded integer.  I know of no languages that support such a type.

Again, the language could support constraints to bound the exponent, and use machine floats where it knows the exponent (and precision) provided by the machine suffices.

I may have already mentioned that fixed-precision work is difficult; why make it even harder by bounding the exponents unnecessarily?

So, in summary: floating-point should not be easy; it should be hard, but not harder than it needs to be.  High-level languages should encourage the use of exact types, like rational, where they suffice; and should remind programmers that they need to think a lot about limited precision where it is needed; but should not force programmers to worry about incorrect bases or finite bounds where that is not needed.

Here endeth my series of posts on high-level languages and numeric types.  I hope that future high-level languages do a better job with numerics than current ones do, since the current ones are pretty shameful.

Thursday, May 2, 2013

High-level rats

In my previous post, I talked about integers in high-level languages.  In this post, I'm going to talk about rational numbers.

Like integers, you're already familiar with rational numbers, although you may refer to them as fractions.  They are numbers like 3/2, ratios of integers.

You know properties of rational numbers, such as: closed under addition, subtraction and multiplication.  And closed under division, except for division by zero.  And so on.  And, of course, the result of dividing two integers is a rational number, again excepting division by zero.  Some rational numbers are also integers; but many are not.

As with integers, another property is that rational numbers can be exactly represented on a computer in such a way that basic math can be performed efficiently.

And, in fact, many languages, in the same library as "bigint", have a "bigrat" or "rational" or "fraction" type.  But, even in most languages in which "int" is actually an integer, the rational type is usually buried.  The only exception I've found is Smalltalk.  In Smalltalk, as you would expect, the result of dividing two integers is a rational number.

Sadly, in almost every other language I've looked at, the result of dividing two integers is an integer; and therefore, is usually wrong.  (Obviously, this only applies to languages with any integer types.)  In cases where the dividend (i.e., the numerator) is actually divisible by the divisor (the denominator), the result is correct.  In other cases (again, ignoring division by zero), the result is one of the two integers nearest the correct answer; which one depends on the language.

The options seem to be: truncate toward zero, truncate down, or round.   I've never seen any language deciding that 31/10 = 4, although technically that's no worse than 39/10 = 3. Truncate toward zero and truncate down only differ for negative numbers; these seem much more common than rounding.  And rounding exact halves also has options: round down, round up, round toward zero, round away from zero, or round toward the even number.
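For the record, here's what one real language actually picks; a quick C++ check (C++11, and C99 before it, require truncation toward zero):

#include <cstdio>

int main() {
    // C++ (since C++11) and C (since C99) truncate toward zero.
    std::printf("%d %d\n", 7 / 2, -7 / 2);  // prints: 3 -3
    std::printf("%d\n", 39 / 10);           // prints: 3, never rounded to 4
    // Python's // operator, by contrast, truncates down: -7 // 2 is -4.
    return 0;
}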

There are two problems with getting the wrong answer when you divide two integers. The first, of course, is that you have the wrong answer.  The second is that you've confused the programmer.

You may remember the shock you felt when "integer division" in a programming language was first explained to you; or worse, when you accidentally discovered it on your own.  You may have said to yourself, "This is not my high-level language.  How did this integer get here?"  Or perhaps you have had the unpleasant and guilty experience of explaining these shenanigans to someone else.

Here's a general rule: if you're explaining to someone something which they already know, it should be pretty easy; or there should be a good reason it isn't (for example, my next post will discuss real numbers which are actually tricky on a computer).  People know how to divide integers before they start programming; all that you should need to explain to them is that in most programming languages you need to use the forward slash character to divide.  That should be it; they should already know the rest.  If you're explaining that the rules have changed (for no good reason), there's a problem.

Aside from producing wrong answers, it also makes it hard to reason about programs.  In normal math, a/b = (1/b)*a; and (a/b)*(c/d) = (c/b)*(a/d) = (a*c)/(b*d) etc.  This is not true in most programming languages; only Smalltalk gets this right.

Fix this, and a class of bugs goes away, a set of optimizations becomes available, and a pool of confusion evaporates.

So kudos to Smalltalk, and shame on all of the so-called high-level languages that get this wrong.

But even Smalltalk doesn't get it quite right.  We all know that 3.1 is the same as 31/10; that is, it's a rational number.  Or we used to know that before we learned how to program.  Because that's not true in any programming language I can find.  Most programming languages treat 3.1 as an inexact number, a floating point number with limited precision.  And usually that number is stored in binary (because of the bit-orientation of numeric types even in supposedly high-level languages).  And binary can't even exactly represent 1/10.

A few languages consider 3.1 to be a decimal floating point number; there are still issues with the limited precision once you start doing math, but at least 3.1 is equal to 31/10.

In languages where 3.1 is a binary floating point number, simple literal expression can generate incorrect results.  For instance, if I type "0.1 + 0.2 - 0.3" into my gnu-smalltalk, it tells me "5.551115123125783d-17", which isn't the correct answer.  But it gets "(1/10) + (2/10) - (3/10)" correct.  It prints "0", which also happens to be the number of languages I can find that get support for integers and rationals right, including literals.
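The same computation in C++, which has no built-in rational type, shows the contrast; the toy fraction struct here is just a sketch (no reduction, no overflow checks), but it gets exactly zero:

#include <cstdio>

struct Rat { long long num, den; }; // toy rational: no normalization, no overflow checks

Rat add(Rat a, Rat b) { return {a.num * b.den + b.num * a.den, a.den * b.den}; }
Rat sub(Rat a, Rat b) { return {a.num * b.den - b.num * a.den, a.den * b.den}; }

int main() {
    double d = 0.1 + 0.2 - 0.3;                  // roughly 5.55e-17, not 0
    Rat r = sub(add({1, 10}, {2, 10}), {3, 10}); // exactly 0/1000
    std::printf("double:   %.17g\nrational: %lld/%lld\n", d, r.num, r.den);
    return 0;
}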

In summary, high-level languages should have easy support for rational numbers, because they are easy to work with; and in particular, the quotient of two integers should be a rational number; and numeric literals with a decimal place should be rational numbers.  To me this all seems fairly basic, and pretty obvious, but it seems to have slipped past quite a few language designers.  Which doesn't seem rational.

Oh, and let's call this type "rat" so it's as easy to type as "int" and more fun to say.  "The function chooses the smallest of a set of rats."

In the next post, I'll talk about real numbers.  Really.  Sorry.