Tuesday, April 30, 2013

High-level ints

In this blog, I am going to be talking about various aspects of computer programming and programming languages and everything else worthwhile in this world.

The first few entries will be about how I think numeric types should be handled in high-level languages.

Low-level languages, like assembler and C, are all about what the machine can do well.  High-level languages are supposed to abstract away the machine and allow the programmer to more easily focus on solving a problem.

Almost all programming languages have numeric types for doing math, because arithmetic is often useful.  But somehow, many get the basic numeric types wrong.

I'm going to discuss some numeric types in mathematics: integers, rational numbers, and real numbers; and also how they can be represented on the computer; and how they actually are represented in many programming languages.  I'll also talk about some extensions to these types and how they could be represented.

In this post, I'm going to talk about integers.  Integers, as you recall, are numbers with no fractional part:
        ..., -3, -2, -1, 0, 1, 2, 3, ...
You have been learning the properties of integers since you started learning to count, and you're familiar with them.  One useful property is that integers are closed under addition, multiplication and subtraction.  Or maybe that's three properties.  But, if you add, subtract or multiply two integers, you get another integer as a result.  Another property is that they are unbounded on both sides; there is neither a largest nor a smallest integer.  And there are lots more useful properties that you know, and that you use all the time when doing math, even if you don't consciously think about them.

Here's a useful property of integers that you may never have thought about: integers are exactly representable in a computer, in a way that readily supports implementing arithmetic efficiently.  In other words, they can reasonably be a type in a programming language.
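For example, here's a quick sketch of one common representation: a list of fixed-size "digits" in a large base, on which the schoolbook algorithms work and can grow the number as needed.  (It's written in Python, which already has unbounded ints, so this is purely illustrative.)

    BASE = 2 ** 32     # each "digit" is one machine word

    def to_digits(n):
        # Represent a non-negative integer as a little-endian list of base-2**32 digits.
        digits = []
        while n:
            digits.append(n % BASE)
            n //= BASE
        return digits or [0]

    def add_digits(a, b):
        # Schoolbook addition with carries; the result may grow by one digit.
        result, carry = [], 0
        for i in range(max(len(a), len(b))):
            total = carry + (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            result.append(total % BASE)
            carry = total // BASE
        if carry:
            result.append(carry)
        return result

    print(add_digits(to_digits(2**40), to_digits(2**40)))   # [0, 512], i.e. 2**41

Real implementations are much cleverer than this, but nothing about the idea depends on the machine's word size imposing a bound.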

Not surprising, then, that most languages have a type called "integer" or "int" or something similar.  What's surprising is that this type doesn't actually represent the integers in most languages.

From what I can tell, the most popular languages that get "int" basically right are Python and Ruby.  Some less popular languages, including Smalltalk, also get "int" right.  Smalltalk is notable in that it seems to get "int" more right than Python and Ruby, which I'll talk about in the next post.

If you're a programmer, there's a good chance that you've used one of the C-derived languages, which include C, C++, C#, Objective-C, Java, Go and D.  These languages, and many others, all have an "int" type which can represent only a subset of the integers.  And the particular subset is usually one that fits into 32 or 64 bits (or 16 bits for some older implementations).  This means there is a largest integer ("intMax") and a smallest one ("intMin").  Which means it's actually pretty different from the integers we know and love.  Integers are unbounded, so we can love them without limit.
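For reference, here are those bounds for the common sizes, computed in Python (which itself has no such bounds):

    int32_min, int32_max = -2**31, 2**31 - 1   # -2,147,483,648 and 2,147,483,647
    int64_min, int64_max = -2**63, 2**63 - 1   # roughly +/- 9.2 * 10**18
    print(int32_max)                            # 2147483647

Big numbers, but still bounds.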

C and assembly languages are the common low-level languages, although there are a few others.  Some languages, like C++, try to support both low-level and high-level programming.  But most languages, certainly including C# and Java, are trying to be high-level languages, and abstract machine details away from the programmer to make programming easier.  But they're failing on "int", which seems pretty fundamental.

If you're reading the definition of "int" in a high-level language, and you see the phrase "fits into 32 bits", that's a strong hint that your high-level language is not actually abstracting away from the machine; and therefore is not actually a high-level language.  You've been hoodwinked, bamboozled.  It is a low-level language in sheep's clothing.

The main problem with restricting "int" to some number of bits is that it can overflow (or underflow, which is similar, but which I will largely ignore).  There are a few different options for managing overflow:

- Wrapping (modular arithmetic).  This is very popular.  Here, intMax + 1 is intMin, which is pretty strange when you first see it.  But, if you're a programmer, chances are you've grown used to this behavior by now.  (There's a small Python sketch of these behaviors after this list.)
- Promoting to a larger type.  If you always automatically promote to a larger integer type, you're effectively implementing integers; this is how Python does it.  But if you promote to a different type, it can lead to problems I'll discuss later.  For example, PHP promotes "int" to "float" when "int" overflows.
- Failing (throwing an exception).  This is less common.  Here, intMax + 1 is an error.
- Clamping.  I can't find any examples of this at all.  Here, intMax + 1 = intMax.
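To make a few of these options concrete, here is a small Python sketch.  Python's own int never wraps, so the 32-bit behavior has to be simulated by hand; the helper names here are mine, not any standard API.

    def wrap32(n):
        # Wrapping: keep only the low 32 bits, interpreted as a signed value.
        n &= 0xFFFFFFFF
        return n - 2**32 if n >= 2**31 else n

    def clamp32(n):
        # Clamping: stick at the nearest bound instead of wrapping.
        return max(-2**31, min(n, 2**31 - 1))

    def checked32(n):
        # Failing: refuse to return a wrong answer.
        if not -2**31 <= n <= 2**31 - 1:
            raise OverflowError(n)
        return n

    INT_MAX = 2**31 - 1
    print(wrap32(INT_MAX + 1))    # -2147483648, i.e. intMin
    print(clamp32(INT_MAX + 1))   # 2147483647, i.e. intMax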

I said I'd ignore underflow, but it has the same options, and always seems to be handled the same way overflow is, although in principle you could design a very confusing language where overflow wrapped and underflow clamped.

Here's one explicit example of the problem with bounded ints.  The numbers in the example assume a typical signed 32-bit int; but you could create a similar example with different numbers for any bounded int.

You write a program which is supposed to count the number of characters in a file.  You test it carefully on a bunch of small files, where you can actually count the characters by hand, and it works perfectly.  Now you run it on the file you actually care about, which happens to have 6,442,450,941 characters in it, which is a large file.

The answer you get depends on how the language handles overflow.  With wrapping, you get the answer 2,147,483,645, which is a large number, and you believe it.  But it's only about a third of the right answer.  Your program was completely wrong, and you can't even tell.  If this were an actually useful result, something would catch on fire.
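You can check where that number comes from with a little modular arithmetic (done here in Python, which is happy to hold the true count):

    true_count = 6442450941
    wrapped = (true_count + 2**31) % 2**32 - 2**31   # signed 32-bit wrap-around
    print(wrapped)                                   # 2147483645
    print(wrapped / float(true_count))               # about 0.333: a third of the real count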

If your language is a rare language that clamps, you get 2,147,483,647 instead, which isn't all that different.

If your language throws on overflow, you just get an error.  At least you know you don't have the right answer, but it's still not useful.  Still, if your language is going to have bounded types, this is the best kind to have, because at least you know when you don't get the right answer.  We should rather bear those ills we have than fly to others we know not of.

If the language promotes to a floating-point type which is large enough to hold the correct answer, you get the correct answer.  But, for some larger file, floating-point types look kind of like clamping types:  There is some smallest number M such that M + 1 = M.  Unlike real clamping types, there's another number M' such that M' + 2 = M' (and also M' + 1 = M'), where M' > M.  But eventually you get the wrong answer.
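For the IEEE double-precision floats these languages typically promote to, that smallest M is 2^53, which is easy to check:

    M = 2.0 ** 53        # 9007199254740992.0
    print(M + 1 == M)    # True: the +1 is lost to rounding
    print(M + 2 == M)    # False: 2 is still representable at this size
    M2 = 2.0 ** 54
    print(M2 + 1 == M2)  # True: at this size the gaps have doubled
    print(M2 + 2 == M2)  # True

So a double-based count silently stops being exact somewhere above nine quadrillion characters.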

I should briefly mention here that there are a few languages which don't have integers, and only have floating-point numbers.  These suffer from this same issue.  I count Perl among these languages, although it is in principle possible to use int in Perl.

There are also a few languages that don't have any numeric types, and aren't good choices for doing even simple math, such as counting.  Languages which specifically target non-numeric programming are, of course, exempt from criticism that they have poor numeric types.

Many languages (such as Java) have a type hidden away in a library somewhere with a name like "bigint" which actually can represent all integers.  Simply having such a type is not sufficient to be a high-level language.  The easiest-to-use integer type, and the type of integer literals, needs to behave like the integers, not like bounded ints with some bit size.

There's nothing wrong with a high-level language including low-level types, and there are two ways it can do so.  One way, just throwing in the C types (int32, int64), is OK; it lets people use them where they want them.  There's another option, though, which is much more powerful:

define int16 as i : int such that -32768 <= i and i <= 32767;

A language could permit arbitrary constraints on a type, and check those constraints.  That gives bounded types that throw, which are the best kind of bounded types.  And it can use those constraints for optimization.

This lets the programmer put appropriate bounds on the types, not arbitrary ones because of bit sizes.  If I want a type that can only hold even numbers from -6 to 60, I just declare that type.
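No language I know of spells it quite like that, but here's a rough Python sketch of the idea; the constrained_int factory and the type names are hypothetical, just to show the shape of it:

    def constrained_int(predicate, name):
        # A hypothetical constrained integer type: every value is checked
        # against an arbitrary predicate, and violations fail loudly.
        class Constrained(int):
            def __new__(cls, value):
                if not predicate(value):
                    raise ValueError("%s: %r violates the constraint" % (name, value))
                return super(Constrained, cls).__new__(cls, value)
            def __add__(self, other):
                return type(self)(int(self) + int(other))
        Constrained.__name__ = name
        return Constrained

    Int16 = constrained_int(lambda i: -32768 <= i <= 32767, "Int16")
    EvenSmall = constrained_int(lambda i: -6 <= i <= 60 and i % 2 == 0, "EvenSmall")

    x = EvenSmall(58)
    print(x + 2)       # 60: still within the constraint
    # x + 4 would raise ValueError: 62 violates the constraint

A real language could do much better than a runtime check, of course: with the constraints visible to the compiler, it can catch some violations statically and pick an efficient machine representation.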

So, in summary, the rule for high-level languages is: Integers should be easy.  The integral type which is easiest to use, because it has the shortest, simplest name, and because it is the type of integer literals, should be unbounded (or at least act as if it is).

Python basically gets integers right, and Java (along with lots of other languages) basically gets them wrong.  So what difference does it make?  First, there's a class of bugs which can happen in Java that cannot happen in Python.  Second, there's one fewer bizarre feature that needs to be learned in Python than in Java.  It takes seconds to completely explain integer literals and addition, subtraction and multiplication of integers in Python, because it's just integers, and the student knows them already.  In Java, you're talking about bits and bounds.
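To put the character-counting example in those terms: in Python, the whole explanation fits in the code itself.

    count = 0
    for i in range(3):
        count += 2147483647      # add intMax-sized chunks three times
    print(count)                 # 6442450941: just the right answer

The same loop with a Java int quietly prints 2147483645.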

In the next post, I'll talk about dividing integers, which Smalltalk still gets right, but Python gets wrong.

2 comments:

  1. Matlab clamps integers:

    http://www.mathworks.com/help/matlab/matlab_prog/integers.html#f2-98095

    Now you have an example!

    Reply: I should probably have known that. Of course, MATLAB is very floating-point oriented; it doesn't even have integer literals.