Sunday, May 26, 2013

INNER JOIN is not intersection

I've heard more than one person describe INNER JOINs as being like set intersection, usually accompanied by drawing a Venn diagram.  But that's wrong and misleading.

I'm not going to go into much detail on SQL syntax, but a JOIN operator is typically part of a SELECT statement, which looks kind of like this:

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        INNER JOIN RightTable R
            ON OnCondition
    WHERE WhereCondition;
   
LeftTable and RightTable are table expressions with aliases L and R respectively, and OnCondition and WhereCondition are boolean expressions which may involve the columns of the tables.  Here, "a" and "c" are columns of L, while "b" and "d" are columns of R.

The most important aspect of a JOIN (INNER, OUTER or otherwise) is that the result contains all of the columns of both tables, which is why the SELECT list (and the conditions) can reference both tables.

Since it's the most important aspect of a JOIN, I'm going to repeat it.  You get all of the columns of both tables.  Describing this as being anything like set intersection is completely missing the point.  WIth set intersection, both sets come from the same universe; and the result is also from that universe.  With INNER JOIN, the two input tables may have completely different columns; and the result has all columns from both, and therefore looks like neither.

While the ON clause is syntactically required for an INNER JOIN, this is also misleading.  You can freely move conditions between the ON clause to the WHERE clause.  There is absolutely no difference between the above statement and

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        INNER JOIN RightTable R
            ON 1=1
    WHERE OnCondition AND WhereCondition;
   
The expression 1=1 is a common idiom for writing TRUE in SQL.  An INNER JOIN like this one, where the OnCondition is TRUE is called a Cartesian product, or a CROSS JOIN, and this statement is also equivalent:

SELECT L.a, R.b, L.c, R.d
    FROM LeftTable L
        CROSS JOIN RightTable R
    WHERE OnCondition AND WhereCondition;

What's a Cartesian product?  It means that each row of LeftTable is paired with every row of RightTable to make the rows of the result of the CROSS JOIN.  That is, if L contains the rows L1 and L2 and R contains the rows R1, R2 and R3, the result has rows
(L1, R1)
(L1, R2)
(L1, R3)
(L2, R1)
(L2, R2)
(L2, R3)
And this is how you get all of the columns; by pairing up all of the rows.  So if there are NL rows in L and NR rows in R, the result of the join has NL * NR rows.  You generally get more rows than you started with; how is this like set intersection?  With set intersection, you end up with at most the number of elements in the smaller of the two input sets.

SQL does contain an operator that is like set intersection.  It's called INTERSECT.  It is not like an INNER JOIN.

Anyway, the NL * NR is where the word "product" comes into play. in "Cartesian product"  It's Cartesian because this looks like the two-dimensional (x,y) coordinates that Descartes used to label the plane.  In fact, that's a pretty good way to think about it.  The result of a CROSS JOIN or INNER JOIN is a matrix (I'd say "table", but that would be pretty confusing here) with the rows of the two tables labeling the two dimensions.

       R1        R2        R3
L1  (L1, R1)  (L1, R2)  (L1, R3)
L2  (L2, R1)  (L2, R2)  (L2, R3)

In matrix jargon, this would be called an "outer product", but again that's pretty confusing here, since this is an INNER JOIN.  There's also an "inner product" in matrix algebra, but that is in no way relevant.  Forget that I said "matrix."  I meant to say "rectangular grid."  The number of elements in the rectangular grid is the product of the length of its sides, NL * NR.

Of course, the conditions may filter out rows, so NL * NR is the upper bound on the number of rows returned by the SELECT statement.  The extreme case is where the condition is FALSE (such as 0=1), which returns no rows.  You can think of the ON and WHERE clauses as checking each entry in the matrix to decide whether it's good enough to be in the final result set.

People (myself included) may tell you all sorts of rules for what sort of conditions belong in the ON clause of an INNER JOIN and what sort of conditions belong in the WHERE class of the SELECT statement.  It's important to be aware that these are purely stylistic choices; there is no semantic difference.  (Although there is a difference for OUTER JOINs.)

People (the sort who will tell you that INNER JOIN is like intersection) will also try to suggest that there is a difference between an INNER JOIN and a CROSS JOIN.  But you'll know better.  As in my sample statements, every CROSS JOIN can be replaced by INNER JOIN ON 1=1, and every INNER JOIN can be replaced by CROSS JOIN, moving the OnCondition into the WHERE clause.  Whatever they tell you, INNER JOIN is a Cartesian product.

As long as we're going through equivalences, let me point out that INNER JOIN and CROSS JOIN are commutative; the order of the tables doesn't matter.  (Again, this is not true for OUTER JOINs.)  So, we could equally well write

SELECT L.a, R.b, L.c, R.d
    FROM RightTable R
        INNER JOIN LeftTable L
            ON OnCondition
    WHERE WhereCondition;

although I should mention that the phrase "left table" always refers to the table to the left of the JOIN, which here is RightTable.  Again, there may be stylistic reasons for deciding which order to put the tables in, but there is no semantic difference.

In general, the number of rows returned by an INNER JOIN is quadratic (NL * NR), but there is an important special case that guarantees a linear result.  Suppose columns K1 and K2 are a UNIQUE KEY of R; then include f1(L) = R.K1 AND f2(L) = R.K2 in the conditions, where f1 and f2 are arbitrary functions of the columns of L.  Then there is at most one row of R for every row of L; because the rows of L determine the values of the functions, and therefore the values of a UNIQUE KEY of R.  If that key is in R at all, there is exactly one such row in R.  So, in this case, there are at most NL resulting rows instead of NL * NR.

Of course, since the INNER JOIN is commutative, you could swap L and R.  So if a function of R determines a UNIQUE KEY of L, there are at most NR resulting rows.

One last point.  All of the example conditions I've mentioned involve equality (=), but there are many perfectly good conditions that do not:

ON L.a < R.b
    AND L.c <> R.d
   
WHERE L.a - R.b BETWEEN -2 AND 5
    AND L.c*R.d IN (12, 24, 60)

In summary:

- not all conditions are equality
- INNER JOIN is the same as CROSS JOIN, a Cartesian product
- INNER JOIN is quadratic, although you can prove tighter bounds in special cases
- INNER JOIN is not like intersection; it's like a union of columns and product of rows.

And don't let anyone tell you different.

1 comment:

  1. I was asked if there could also be performance implications for WHERE vs. ON; and for the order of tables in JOINs. The answer is that it depends on which version of which database management system you're using. For recent versions of SQL Server, I believe that the answer is that WHERE vs. ON makes no difference; and that "a JOIN b" is the same as "b JOIN a"; but that "a JOIN b JOIN c" may not have the same performance as "b JOIN a JOIN c" due to limitations of the optimizer.

    Technically, the answer for essentially all questions relating to SQL is "it depends on the RDBMS", and that sadly includes basic questions about the syntax. While there is a SQL standard, it seems to be more honored in the breach.

    ReplyDelete