Sunday, May 26, 2013

INNER JOIN is not intersection

I've heard more than one person describe INNER JOINs as being like set intersection, usually accompanied by drawing a Venn diagram.  But that's wrong and misleading.

I'm not going to go into much detail on SQL syntax, but a JOIN operator is typically part of a SELECT statement, which looks kind of like this:

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        INNER JOIN RightTable R
            ON OnCondition
    WHERE WhereCondition;
   
LeftTable and RightTable are table expressions with aliases L and R respectively, and OnCondition and WhereCondition are boolean expressions which may involve the columns of the tables.  Here, "a" and "c" are columns of L, while "b" and "d" are columns of R.

The most important aspect of a JOIN (INNER, OUTER or otherwise) is that the result contains all of the columns of both tables, which is why the SELECT list (and the conditions) can reference both tables.

Since it's the most important aspect of a JOIN, I'm going to repeat it.  You get all of the columns of both tables.  Describing this as being anything like set intersection is completely missing the point.  With set intersection, both sets come from the same universe, and the result is also from that universe.  With INNER JOIN, the two input tables may have completely different columns; and the result has all columns from both, and therefore looks like neither.
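
To make this concrete, here is a small sketch with hypothetical tables of my own: an Employees table with columns (name, dept_id) and a Departments table with columns (dept_id, dept_name).

SELECT E.name, E.dept_id, D.dept_id, D.dept_name
    FROM Employees E
        INNER JOIN Departments D
            ON E.dept_id = D.dept_id;

Every row of the result carries all four columns, (name, dept_id, dept_id, dept_name); it looks like neither input table.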

While the ON clause is syntactically required for an INNER JOIN, this is also misleading.  You can freely move conditions between the ON clause and the WHERE clause.  There is absolutely no difference between the above statement and

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        INNER JOIN RightTable R
            ON 1=1
    WHERE OnCondition AND WhereCondition;
   
The expression 1=1 is a common idiom for writing TRUE in SQL.  An INNER JOIN like this one, where the OnCondition is TRUE, is called a Cartesian product, or a CROSS JOIN, and this statement is also equivalent:

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        CROSS JOIN RightTable R
    WHERE OnCondition AND WhereCondition;

What's a Cartesian product?  It means that each row of LeftTable is paired with every row of RightTable to make the rows of the result of the CROSS JOIN.  That is, if L contains the rows L1 and L2 and R contains the rows R1, R2 and R3, the result has rows
(L1, R1)
(L1, R2)
(L1, R3)
(L2, R1)
(L2, R2)
(L2, R3)
And this is how you get all of the columns: by pairing up all of the rows.  So if there are NL rows in L and NR rows in R, the result of the join has NL * NR rows.  You generally get more rows than you started with; how is this like set intersection?  With set intersection, you end up with at most the number of elements in the smaller of the two input sets.
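
If you want to watch this happen, here is a tiny runnable sketch using VALUES table constructors (my own toy example; the exact syntax varies a bit between databases, but this form works in PostgreSQL, for instance):

SELECT *
    FROM (VALUES (1), (2)) AS L(x)
        CROSS JOIN (VALUES (10), (20), (30)) AS R(y);

It returns 2 * 3 = 6 rows, one for each pairing.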

SQL does contain an operator that is like set intersection.  It's called INTERSECT.  It is not like an INNER JOIN.
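
For contrast, here is roughly what INTERSECT looks like, with hypothetical tables ThisYear and LastYear that both have the columns (customer_id, product_id); INTERSECT requires both sides to produce the same columns, and returns only the rows that appear in both:

SELECT customer_id, product_id FROM ThisYear
INTERSECT
SELECT customer_id, product_id FROM LastYear;

The result has the same columns as each input and at most as many rows as the smaller one; that is what intersection actually looks like in SQL.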

Anyway, the NL * NR is where the word "product" in "Cartesian product" comes into play.  It's Cartesian because this looks like the two-dimensional (x,y) coordinates that Descartes used to label the plane.  In fact, that's a pretty good way to think about it.  The result of a CROSS JOIN or INNER JOIN is a matrix (I'd say "table", but that would be pretty confusing here) with the rows of the two tables labeling the two dimensions.

       R1        R2        R3
L1  (L1, R1)  (L1, R2)  (L1, R3)
L2  (L2, R1)  (L2, R2)  (L2, R3)

In matrix jargon, this would be called an "outer product", but again that's pretty confusing here, since this is an INNER JOIN.  There's also an "inner product" in matrix algebra, but that is in no way relevant.  Forget that I said "matrix."  I meant to say "rectangular grid."  The number of elements in the rectangular grid is the product of the lengths of its sides, NL * NR.

Of course, the conditions may filter out rows, so NL * NR is the upper bound on the number of rows returned by the SELECT statement.  The extreme case is where the condition is FALSE (such as 0=1), which returns no rows.  You can think of the ON and WHERE clauses as checking each entry in the matrix to decide whether it's good enough to be in the final result set.

People (myself included) may tell you all sorts of rules for what sort of conditions belong in the ON clause of an INNER JOIN and what sort of conditions belong in the WHERE clause of the SELECT statement.  It's important to be aware that these are purely stylistic choices; there is no semantic difference.  (Although there is a difference for OUTER JOINs.)

People (the sort who will tell you that INNER JOIN is like intersection) will also try to suggest that there is a difference between an INNER JOIN and a CROSS JOIN.  But you'll know better.  As in my sample statements, every CROSS JOIN can be replaced by INNER JOIN ON 1=1, and every INNER JOIN can be replaced by CROSS JOIN, moving the OnCondition into the WHERE clause.  Whatever they tell you, INNER JOIN is a Cartesian product.

As long as we're going through equivalences, let me point out that INNER JOIN and CROSS JOIN are commutative; the order of the tables doesn't matter.  (Again, this is not true for OUTER JOINs.)  So, we could equally well write

SELECT L.a, R.b, L.c, R.d
    FROM RightTable R
        INNER JOIN LeftTable L
            ON OnCondition
    WHERE WhereCondition;

although I should mention that the phrase "left table" always refers to the table to the left of the JOIN, which here is RightTable.  Again, there may be stylistic reasons for deciding which order to put the tables in, but there is no semantic difference.

In general, the number of rows returned by an INNER JOIN is quadratic (NL * NR), but there is an important special case that guarantees a linear result.  Suppose columns K1 and K2 are a UNIQUE KEY of R, and the conditions include f1(L) = R.K1 AND f2(L) = R.K2, where f1 and f2 are arbitrary functions of the columns of L.  Then there is at most one row of R for every row of L, because each row of L determines the values of the functions, and therefore the values of a UNIQUE KEY of R.  If that key is in R at all, there is exactly one such row in R.  So, in this case, there are at most NL resulting rows instead of NL * NR.
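
A common shape for this, sketched with hypothetical Orders and Customers tables where customer_id is a (single-column) UNIQUE KEY of Customers:

SELECT O.order_id, C.customer_name
    FROM Orders O
        INNER JOIN Customers C
            ON O.customer_id = C.customer_id;

Each Orders row matches at most one Customers row, so the result has at most as many rows as Orders.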

Of course, since the INNER JOIN is commutative, you could swap L and R.  So if a function of R determines a UNIQUE KEY of L, there are at most NR resulting rows.

One last point.  All of the example conditions I've mentioned involve equality (=), but there are many perfectly good conditions that do not:

ON L.a < R.b
    AND L.c <> R.d
   
WHERE L.a - R.b BETWEEN -2 AND 5
    AND L.c*R.d IN (12, 24, 60)

In summary:

- not all conditions are equality
- INNER JOIN is the same as CROSS JOIN, a Cartesian product
- INNER JOIN is quadratic, although you can prove tighter bounds in special cases
- INNER JOIN is not like intersection; it's like a union of columns and product of rows.

And don't let anyone tell you different.

Sunday, May 5, 2013

High-level floats

In my last two posts, I talked about integers and rational numbers in high-level languages.  In this post, I'm going to talk about real numbers.

As with integers and rational numbers, you know the properties of real numbers.  One of the more obscure properties, but very important for computers, is that there are uncountably many of them.

Rational numbers are countable, which basically means that there are no more rational numbers than there are integers.  That sounds strange, since the integers seem like a small subset of the rationals, but both are infinite, and infinity isn't that intuitive.

But real numbers are uncountable, which means that there's really no good way to represent them exactly in a computer in general, at least in the traditional way we think about representing values.

There is an important subset of the real numbers, called the algebraic numbers, which are countable.  Unlike real numbers, algebraic numbers aren't something most people learn about when they are young, so you may not be familiar with them.  They are solutions to polynomial equations with integer coefficients, which includes numbers like square roots and cube roots.  Algebraic numbers can be represented exactly in a computer, but still not in a way that permits easy computation.  So, they are important, but not for general purpose computer programming.

I hinted that there is a non-traditional way to represent real numbers in general.  The trick is that you don't have to represent all the reals, just the ones that you generate while running your program, and they can be represented as the series of operations that generated them.  That is, this program in an imaginary C variant:

real a = 3;
real b = 5;
real c = sqrt(a);
real d = log(b);
real e = c + d;
printf("%r\n", e);

would output "sqrt(3) + log(5)" which is exactly correct, but not necessarily useful.  In particular, there is no straightforward way to execute code like this:

bool Maybe(real x) { return sin(x) == sqrt(1 - cos(x)**2); }

Symbolic manipulation systems, like Mathematica, may be able to handle simple cases like this using automatic theorem provers and so on, but the general case is insoluble.  Note that Maybe should return true when x is 3 or 7, but false when x is 5; it's not all that obvious.  (The square root is the nonnegative root, so the equality holds exactly when sin(x) is nonnegative, which it is for 3 and 7 but not for 5.)

What it comes down to is this: programming languages cannot provide good support for real numbers.  It's technically impossible.  But we need to work with square roots and sines and so on, so they provide approximations using limited precision floating point.

Many modern languages define a type called "float" or "double" or "real" essentially by just saying "IEEE 754 binary64" and leaving it at that.  And IEEE 754 is a very good standard, so that almost works out ok.  Other languages may define them slightly differently (usually slightly worse), but similarly enough for our purposes here.  And, of course, some languages have two or more floating-point types, with different representations.

I argue that there is a problem with "float" in high-level languages; but it's the opposite of the problem with integers.  Languages often have support for proper integers, but they make it hard to use by hiding it away in a "bigint" type.  But they make "float" easy to use, with built-in literals and so on.  I think it's bad that float seems easy to use in many languages, because it can't actually be easy; it can only seem that way.

Limited precision types are inherently tricky to work with.  I'm going to use, for purposes of example, decimal numbers with 2 digits of precision; that is, numbers that can be written as a.bEc for decimal digits a and b, and an arbitrary integer c.  So, 1 is 1.0E0 and 12 is 1.2E1, but 123 can only be approximated as 1.2E2, which is really 120.

So, what is 100 + 4 + 4, or 1.0E2+4.0E0+4.0E0?  The answer, sadly but necessarily, is: it depends on how you associate.  1.0E2+(4.0E0+4.0E0) is 1.0E2+8.0E0, which should round to 1.1E2, which is as close to the right answer as you can get.  But in (1.0E2+4.0E0)+4.0E0 the first term rounds down to 1.0E2, so it reduces to 1.0E2+4.0E0, which again rounds down to 1.0E2, which is not the best answer.  Besides getting a slightly wrong answer (even in the best case), note that addition is not associative with limited precision numbers.  This makes it hard to reason about mathematical expressions, and also hard for compilers to optimize them.
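
The same thing happens with ordinary binary doubles.  Here is a minimal C sketch (real C rather than my imaginary variant, binary64 rather than 2-digit decimal, and values of my own choosing, but the phenomenon is identical); on a typical IEEE 754 machine the two lines print different numbers:

#include <stdio.h>

int main(void) {
    double a = 0.1, b = 0.2, c = 0.3;
    /* The grouping changes where the rounding happens, so the sums differ. */
    printf("%.17g\n", (a + b) + c);   /* 0.60000000000000009 */
    printf("%.17g\n", a + (b + c));   /* 0.59999999999999998 */
    return 0;
}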

Adding three numbers can get you a slightly inaccurate answer; adding a thousand can be much worse.  Suppose you have a set of a thousand numbers to add, and you naively add them in order.  And suppose further that all of the numbers are 4, or 4.0E0.  The correct answer is, of course, 1000*4 = 4000 or 4.0E3.  After you add the first 25, you correctly get 100 or 1.0E2 as your partial solution, but then each additional number you add rounds back down to 1.0E2, so your final answer is 100, which is pretty much not 4000.
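
Binary floats have the same failure mode.  A small C sketch with single-precision floats (again my own example):

#include <stdio.h>

int main(void) {
    float sum = 0.0f;
    for (int i = 0; i < 20000000; i++)
        sum += 1.0f;
    /* Prints 16777216, not 20000000: once sum reaches 2^24,
       adding 1.0f rounds right back down to the old value. */
    printf("%.0f\n", sum);
    return 0;
}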

The most accurate algorithm for adding a set of fixed-precision numbers is shockingly complicated (and left as an exercise for the reader, since you look like you could use some exercise).  There's not a much simpler task than adding a bunch of numbers, but it's actually hard to get this right.  Why?  Because fixed-precision arithmetic is inherently difficult.
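
I won't spoil the exercise, but to give a taste of the care involved, here is Kahan's compensated summation in C.  It is a well-known partial remedy, not the most accurate algorithm:

#include <stddef.h>

/* Kahan compensated summation: carry a running correction term to
   recover most of the low-order bits lost by each addition. */
double kahan_sum(const double *x, size_t n) {
    double sum = 0.0, comp = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - comp;   /* apply the correction from last time */
        double t = sum + y;       /* the low-order bits of y may be lost here */
        comp = (t - sum) - y;     /* algebraically zero; in floats, the error just made */
        sum = t;
    }
    return sum;
}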

I said that IEEE 754 is a good standard.  By that I mean that it pretty much does the right thing for all the cases it covers.  So, if you add two numbers, you get the best answer you can get.  But it covers adding two numbers, not a thousand; that's outside its scope.

How hard is fixed-precision arithmetic?  Universities often cover it in graduate classes, which means that people with a B.S. in Computer Science may know very little about it.  If you're doing non-trivial math with floats, and you don't have this background, my advice is: find a good library and use it; and then figure out an independent way to check your results.

You may be thinking, "But I've used floats with no problem."  That's not skill, that's luck.  If the precision of the float is much larger than the precision you need, or you're only performing a few operations, it often works out fine.  But eventually it will bite you, unless you start worrying a lot about it.

If you take nothing else from this post, just remember this: It is easy to drown when you try to float.

Since fixed-precision arithmetic is so hard, a good high-level language should remind the user of the problems by making it a little awkward to use; and should help a little with the problems by being flexible.  These can go together.  Instead of saying "float", I want to say "float(precision:2, base:10)".  And I don't want to write "z = x*y;".  I want to write "z = mult<precision:x.precision+y.precision>(x,y)" specifying the precision of the result in terms of the precision of the operands, and requiring that the bases are the same.  And I certainly don't want the language implicitly casting floats from one precision or base to another; that's something I want to be completely aware of.

So, where people can get away with rational arithmetic, which is exact and therefore easy, encourage them to use it.  And where they really need fixed-precision, which is hard, make them specify it for every operation to remind them that they need to think about it for every operation.  And let people work in decimal precision; often it's the correct solution for the problem, and it's almost always easier to think about.  Again, I have no problem supporting binary precision as well, but if anything is the default, it should be decimal.  A language could even allow arbitrary bases; that's not that hard.

I'm not sure why so many languages get this wrong.  C, which dates back to when floating point was slow (and often inaccurate before IEEE 754), makes using floating point easy, presumably because it was supported by the hardware that C was developed on.  Strangely, database languages like SQL seem to do this better than most general-purpose programming languages.

Finally, there are two minor difficulties using IEEE 754 which are not necessary in a high-level language.  IEEE 754, of course, is designed to be low-level; and is entirely appropriate to a low-level language.   One problem is the fact that the precision is not specified by the user (except by selecting the type).  If I need 100 decimal digits, I should be able to specify it.  Often SQL allows this, at least up to a point.
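
SQL's exact numeric types are the usual example; something along these lines, with a hypothetical Measurements table (the exact limits, and sometimes the spelling, vary by implementation):

CREATE TABLE Measurements (
    id      INTEGER,
    reading NUMERIC(100, 50)    -- 100 decimal digits of precision, 50 after the decimal point
);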

The other problem is that the exponent is essentially a bounded integer, and therefore floats have a limited range and can overflow. The limited range is rather large, but it can still impact code.  Similarly, there is a smallest positive value; and there are issues with denormalized small numbers with lower precision than expected.  A high-level language should provide a float type with an unlimited range, but fixed precision.  That is, the exponent should be an integer, not a bounded integer.  I know of no languages that support such a type.

Again, the language could support constraints to bound the exponent, and use machine floats where it knows the exponent (and precision) provided by the machine suffices.

I may have already mentioned that fixed-precision work is difficult; why make it even harder by bounding the exponents unnecessarily?

So, in summary: floating-point should not be easy; it should be hard, but not harder than it needs to be.  High-level languages should encourage the use of exact types, like rational, where they suffice; and should remind programmers that they need to think a lot about limited precision where it is needed; but should not force programmers to worry about incorrect bases or finite bounds where that is not needed.

Here endeth my series of posts on high-level languages and numeric types.  I hope that future high-level languages do a better job with numerics than current ones do, since the current ones are pretty shameful.