Sunday, May 5, 2013

High-level floats

In my last two posts, I talked about integers and rational numbers in high-level languages.  In this post, I'm going to talk about real numbers.

As with integers and rational numbers, you know the properties of real numbers.  One of the more obscure properties, but one very important for computers, is that there are uncountably many of them.

Rational numbers are countable, which basically means that there are no more rational numbers than there are integers.  That sounds strange, since the integers seem like a small subset of the rationals, but both are infinite, and infinity isn't that intuitive.

But real numbers are uncountable, which means that there's really no good way to represent them exactly in a computer in general, at least in the traditional way we think about representing values.

There is an important subset of the real numbers, called the algebraic numbers, which are countable.  Unlike the reals, algebraic numbers aren't something most people learn about when they are young, so you may not be familiar with them.  They are the solutions to polynomial equations with integer coefficients, which includes numbers like square roots and cube roots.  Algebraic numbers can be represented exactly in a computer, but still not in a way that permits easy computation.  So they are important, but not for general-purpose computer programming.

I hinted that there is a non-traditional way to represent real numbers in general.  The trick is that you don't have to represent all the reals, just the ones that you generate while running your program, and they can be represented as the series of operations that generated them.  That is, this program in an imaginary C variant:

real a = 3;
real b = 5;
real c = sqrt(a);
real d = log(b);
real e = c + d;
printf("%r\n", e);

would output "sqrt(3) + log(5)" which is exactly correct, but not necessarily useful.  In particular, there is no straightforward way to execute code like this:

bool Maybe(real x) { return sin(x) == sqrt(1 - cos(x)**2); }

Symbolic manipulation systems, like Mathematica, may be able to handle simple cases like this using automatic theorem provers and so on, but the general case is insoluble.  Note that Maybe should return true when x is 3 or 7, but false when x is 5, because sqrt returns the non-negative root, so the identity holds exactly when sin(x) is non-negative; it's not all that obvious.
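To make the "series of operations" idea concrete, here is a minimal sketch in Python; the Real class and the helper functions are invented for illustration, and all it does is record expressions.  It makes no attempt to decide equality, which is exactly the hard part.

class Real:
    """A real number, represented by the expression that produced it (a sketch)."""
    def __init__(self, expr):
        self.expr = expr                      # the generating expression, as text
    def __add__(self, other):
        return Real(f"{self.expr} + {other.expr}")
    def __str__(self):
        return self.expr

def sqrt(x):
    return Real(f"sqrt({x.expr})")

def log(x):
    return Real(f"log({x.expr})")

a = Real("3")
b = Real("5")
e = sqrt(a) + log(b)
print(e)                                      # sqrt(3) + log(5) -- exact, but not very useful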

What it comes down to is this: programming languages cannot provide good support for real numbers.  It's technically impossible.  But we need to work with square roots and sines and so on, so languages provide approximations using limited-precision floating point.

Many modern languages define a type called "float" or "double" or "real" essentially by just saying "IEEE 754 binary64" and leaving it at that.  And IEEE 754 is a very good standard, so that almost works out ok.  Other languages may define them slightly differently (usually slightly worse), but similarly enough for our purposes here.  And, of course, some languages have two or more floating-point types, with different representations.

I argue that there is a problem with "float" in high-level languages, but it's the opposite of the problem with integers.  Languages often have support for proper integers, but they make it hard to use by hiding it away in a "bigint" type.  Float, on the other hand, they make easy to use, with built-in literals and so on.  I think it's bad that float seems easy to use in many languages, because it can't actually be easy; it can only seem that way.

Limited-precision types are inherently tricky to work with.  I'm going to use, for purposes of example, decimal numbers with 2 digits of precision; that is, numbers that can be written as a.bEc for decimal digits a and b, and arbitrary integer c.  So, 1 is 1.0E0 and 12 is 1.2E1, but 123 can only be approximated as 1.2E2, which is really 120.

So, what is 100 + 4 + 4, or 1.0E2+4.0E0+4.0E0?  The answer, sadly but necessarily, is: it depends on how you associate.  1.0E2+(4.0E0+4.0E0) is 1.0E2+8.0E0, which rounds to 1.1E2, which is as close to the right answer as you can get.  But in (1.0E2+4.0E0)+4.0E0 the first term rounds down to 1.0E2, so it reduces to 1.0E2+4.0E0, which again rounds down to 1.0E2, which is not the best answer.  Besides getting a slightly wrong answer (even in the best case), note that addition is not associative with limited-precision numbers.  This makes it hard to reason about mathematical expressions, and also hard for compilers to optimize them.
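Python's decimal module can stand in for my two-digit decimal type, so you can watch the associativity failure happen.  The prec=2 setting below mirrors the example above; it's not anything standard.

from decimal import Decimal, getcontext

getcontext().prec = 2              # two significant decimal digits, as above
hundred, four = Decimal(100), Decimal(4)

print(hundred + (four + four))     # 1.1E+2  (108, rounded to two digits)
print((hundred + four) + four)     # 1.0E+2  (each partial sum rounds back down)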

Adding three numbers can get you a slightly inaccurate answer; adding a thousand can be much worse.  Suppose you have a set of a thousand numbers to add, and you naively add them in order.  And suppose further that all of the numbers are 4, or 4.0E0.  The correct answer is, of course, 1000*4 = 4000, or 4.0E3.  After you add the first 25, you correctly get 100, or 1.0E2, as your partial sum, but then each additional number you add rounds back down to 1.0E2, so your final answer is 100, which is pretty much not 4000.
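The same two-digit setup shows the thousand-fours disaster; again, a sketch using Python's decimal module with an artificially tiny precision.

from decimal import Decimal, getcontext

getcontext().prec = 2
total = Decimal(0)
for _ in range(1000):
    total += Decimal(4)            # every partial sum is rounded to two digits
print(total)                       # 1.0E+2 -- that is, 100, nowhere near 4000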

The most accurate algorithm for adding a set of fixed-precision floating-point numbers is shockingly complicated (and left as an exercise for the reader, since you look like you could use some exercise).  There's hardly a simpler task than adding a bunch of numbers, but it's actually hard to get right.  Why?  Because fixed-precision arithmetic is inherently difficult.
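If you want to see the difference a careful algorithm makes without doing the exercise, Python happens to ship one: math.fsum tracks the bits a naive left-to-right sum of binary floats throws away, and returns the correctly rounded result.  (Compensated summation, like Kahan's, is a simpler middle ground I won't show here.)

import math

values = [0.1] * 10                # 0.1 is not exactly representable in binary

print(sum(values))                 # 0.9999999999999999 (naive left-to-right sum)
print(math.fsum(values))           # 1.0 (the correctly rounded sum)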

I said that IEEE 754 is a good standard.  By that I mean that it pretty much does the right thing for all the cases it covers.  So, if you add two numbers, you get the best answer you can get.  But it covers adding two numbers, not a thousand; that's outside its scope.

How hard is fixed-precision arithmetic?  Universities often cover it in graduate classes, which means that people with a B.S. in Computer Science may know very little about it.  If you're doing non-trivial math with floats, and you don't have this background, my advice is: find a good library and use it; and then figure out an independent way to check your results.

You may be thinking, "But I've used floats with no problem."  That's not skill, that's luck.  If the precision of the float is much larger than the precision you need, or you're only performing a few operations, it often works out fine.  But eventually it will bite you, unless you start worrying a lot about it.

If you take nothing else from this post, just remember this: It is easy to drown when you try to float.

Since fixed-precision arithmetic is so hard, a good high-level language should remind the user of the problems by making it a little awkward to use; and should help a little with the problems by being flexible.  These can go together.  Instead of saying "float", I want to say "float(precision:2, base:10)".  And I don't want to write "z = x*y;".  I want to write "z = mult<precision:x.precision+y.precision>(x,y)" specifying the precision of the result in terms of the precision of the operands, and requiring that the bases are the same.  And I certainly don't want the language implicitly casting floats from one precision or base to another; that's something I want to be completely aware of.
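I know of no language with that exact syntax, but Python's decimal module comes close in spirit: a Context carries an explicit precision, and you can apply it per operation rather than relying on an ambient default.  A rough sketch of what I mean:

from decimal import Decimal, Context

x, y = Decimal("3.14"), Decimal("2.7")

print(Context(prec=5).multiply(x, y))   # 8.478 (exact; it fits in five digits)
print(Context(prec=2).multiply(x, y))   # 8.5   (rounded to two significant digits)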

So, where people can get away with rational arithmetic, which is exact and therefore easy, encourage them to use it.  And where they really need fixed precision, which is hard, make them specify it for every operation, to remind them that they need to think about it for every operation.  And let people work in decimal; often it's the correct choice for the problem, and it's almost always easier to think about.  Again, I have no problem supporting binary precision as well, but if anything is the default, it should be decimal.  A language could even allow arbitrary bases; that's not that hard.

I'm not sure why so many languages get this wrong.  C, which dates back to when floating point was slow (and, before IEEE 754, often inaccurate), makes floating point easy to use, presumably because it was supported by the hardware C was developed on.  Strangely, database languages like SQL seem to do this better than most general-purpose programming languages.

Finally, there are two minor difficulties with IEEE 754 that are not necessary in a high-level language.  IEEE 754, of course, is designed to be low-level, and it is entirely appropriate for a low-level language.  One problem is that the precision is not specified by the user (except by selecting the type).  If I need 100 decimal digits, I should be able to say so.  SQL often allows this, at least up to a point.

The other problem is that the exponent is essentially a bounded integer, and therefore floats have a limited range and can overflow. The limited range is rather large, but it can still impact code.  Similarly, there is a smallest positive value; and there are issues with denormalized small numbers with lower precision than expected.  A high-level language should provide a float type with an unlimited range, but fixed precision.  That is, the exponent should be an integer, not a bounded integer.  I know of no languages that support such a type.
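The closest thing I can point to is, once more, Python's decimal module: the exponent is still bounded, but the bound is configurable and can be pushed enormously far beyond anything binary64 offers, so in practice it rarely bites.  Still bounded, though, which is my point.  A sketch:

from decimal import Decimal, Context, MAX_EMAX, MIN_EMIN

x = Decimal("9.999999999E+999999")       # near the default context's exponent limit
# Multiplying x by 10 under the default context raises decimal.Overflow;
# with a widened exponent range, the same operation is unremarkable.
roomy = Context(prec=10, Emax=MAX_EMAX, Emin=MIN_EMIN)
print(roomy.multiply(x, Decimal(10)))    # 9.999999999E+1000000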

Again, the language could support constraints to bound the exponent, and use machine floats where it knows the exponent (and precision) provided by the machine suffices.

I may have already mentioned that fixed-precision work is difficult; why make it even harder by bounding the exponent unnecessarily?

So, in summary: floating-point should not be easy; it should be hard, but not harder than it needs to be.  High-level languages should encourage the use of exact types, like rational, where they suffice; and should remind programmers that they need to think a lot about limited precision where it is needed; but should not force programmers to worry about incorrect bases or finite bounds where that is not needed.

Here endeth my series of posts on high-level languages and numeric types.  I hope that future high-level languages do a better job with numerics than current ones do, since the current ones are pretty shameful.

Thursday, May 2, 2013

High-level rats

In my previous post, I talked about integers in high-level languages.  In this post, I'm going to talk about rational numbers.

Like integers, you're already familiar with rational numbers, although you may refer to them as fractions.  They are numbers like 3/2, ratios of integers.

You know properties of rational numbers, such as: closed under addition, subtraction and multiplication.  And closed under division, except for division by zero.  And so on.  And, of course, the result of dividing two integers is a rational number, again excepting division by zero.  Some rational numbers are also integers; but many are not.

As with integers, another property is that rational numbers can be exactly represented on a computer in such a way that basic math can be performed efficiently.

And, in fact, many languages, in the same library as "bigint", have a "bigrat" or "rational" or "fraction" type.  But, even in most languages in which "int" is actually an integer, the rational type is usually buried.  The only exception I've found is Smalltalk.  In Smalltalk, as you would expect, the result of dividing two integers is a rational number.

Sadly, in almost every other language I've looked at, the result of dividing two integers is an integer; and therefore, is usually wrong.  (Obviously, this only applies to languages with any integer types.)  In cases where the dividend (i.e., the numerator) is actually divisible by the divisor (the denominator), the result is correct.  In other cases (again, ignoring division by zero), the result is one of the two integers nearest the correct answer; which one depends on the language.

The options seem to be: truncate toward zero, truncate down, or round.   I've never seen any language deciding that 31/10 = 4, although technically that's no worse than 39/10 = 3.  Truncate toward zero and truncate down only differ for negative numbers; both seem much more common than rounding.  And rounding exact halves has its own options: round down, round up, round toward zero, round away from zero, or round toward the even number.
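Python is a convenient place to see a few of these choices side by side: its // floor-divides (truncates down), math.trunc mimics the C-family truncation toward zero, and round rounds exact halves to the even number.

import math

print(31 // 10, -31 // 10)                 # 3 -4   floor division (truncate down)
print(math.trunc(3.1), math.trunc(-3.1))   # 3 -3   truncation toward zero, C-style
print(round(3.1), round(-3.1))             # 3 -3   rounding to nearest
print(round(2.5), round(3.5))              # 2 4    exact halves round to even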

There are two problems with getting the wrong answer when you divide two integers. The first, of course, is that you have the wrong answer.  The second is that you've confused the programmer.

You may remember the shock you felt when "integer division" in a programming language was first explained to you; or worse, when you accidentally discovered it on your own.  You may have said to yourself, "This is not my high-level language.  How did this integer get here?"  Or perhaps you have had the unpleasant and guilty experience of explaining these shenanigans to someone else.

Here's a general rule: if you're explaining to someone something which they already know, it should be pretty easy; or there should be a good reason it isn't (for example, my next post will discuss real numbers which are actually tricky on a computer).  People know how to divide integers before they start programming; all that you should need to explain to them is that in most programming languages you need to use the forward slash character to divide.  That should be it; they should already know the rest.  If you're explaining that the rules have changed (for no good reason), there's a problem.

Aside from producing wrong answers, it also makes it hard to reason about programs.  In normal math, a/b = (1/b)*a; and (a/b)*(c/d) = (c/b)*(a/d) = (a*c)/(b*d) etc.  This is not true in most programming languages; only Smalltalk gets this right.
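You can watch the identities break and then get repaired in Python, whose // is integer division and whose fractions.Fraction is an exact rational type; the particular numbers below are just an illustration.

from fractions import Fraction

a, b, c, d = 7, 2, 3, 2

print((a // b) * (c // d), (a * c) // (b * d))                     # 3 5 -- not equal
print(Fraction(a, b) * Fraction(c, d) == Fraction(a * c, b * d))   # True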

Fix this, and a class of bugs goes away, a set of optimizations becomes available, and a pool of confusion evaporates.

So kudos to Smalltalk, and shame on all of the so-called high-level languages that get this wrong.

But even Smalltalk doesn't get it quite right.  We all know that 3.1 is the same as 31/10; that is, it's a rational number.  Or we used to know that before we learned how to program.  Because that's not true in any programming language I can find.  Most programming languages treat 3.1 as an inexact number, a floating point number with limited precision.  And usually that number is stored in binary (because of the bit-orientation of numeric types even in supposedly high-level languages).  And binary can't even exactly represent 1/10.

A few languages consider 3.1 to be a decimal floating point number; there are still issues with the limited precision once you start doing math, but at least 3.1 is equal to 31/10.

In languages where 3.1 is a binary floating point number, simple literal expressions can generate incorrect results.  For instance, if I type "0.1 + 0.2 - 0.3" into my gnu-smalltalk, it tells me "5.551115123125783d-17", which isn't the correct answer.  But it gets "(1/10) + (2/10) - (3/10)" correct.  It prints "0", which also happens to be the number of languages I can find that get support for integers and rationals right, including literals.
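Python behaves the same way with its float literals, but its fractions.Fraction type (which even accepts decimal strings) shows what the exact behavior would look like.

from fractions import Fraction

print(0.1 + 0.2 - 0.3)                                      # 5.551115123125783e-17
print(Fraction("0.1") + Fraction("0.2") - Fraction("0.3"))  # 0
print(Fraction("3.1") == Fraction(31, 10))                  # True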

In summary, high-level languages should have easy support for rational numbers, because they are easy to work with; and in particular, the quotient of two integers should be a rational number; and numeric literals with a decimal place should be rational numbers.  To me this all seems fairly basic, and pretty obvious, but it seems to have slipped past quite a few language designers.  Which doesn't seem rational.

Oh, and let's call this type "rat" so it's as easy to type as "int" and more fun to say.  "The function chooses the smallest of a set of rats."

In the next post, I'll talk about real numbers.  Really.  Sorry.

Tuesday, April 30, 2013

High-level ints

In this blog, I am going to be talking about various aspects of computer programming and programming languages and everything else worthwhile in this world.

The first few entries will be about how I think numeric types should be handled in high-level languages.

Low-level languages, like assembler and C, are all about what the machine can do well.  High-level languages are supposed to abstract away the machine and allow the programmer to more easily focus on solving a problem.

Almost all programming languages have numeric types, to do math, because arithmetic is often useful.  But somehow, many get the basic numeric types wrong.

I'm going to discuss some numeric types in mathematics: integers, rational numbers, and real numbers; and also how they can be represented on the computer; and how they actually are represented in many programming languages.  I'll also talk about some extensions to these types and how they could be represented.

In this post, I'm going to talk about integers.  Integers, as you recall, are numbers with no fractional part:
        ..., -3, -2, -1, 0, 1, 2, 3, ...
You have been learning the properties of integers since you started learning to count, and you're familiar with them.  One useful property is that integers are closed under addition, multiplication and subtraction.  Or maybe that's three properties.  Either way, if you add, subtract or multiply two integers, you get another integer as a result.  Another property is that they are unbounded on both sides; there is neither a largest nor a smallest integer.  And there are lots more useful properties that you know, and that you use all the time when doing math, even if you don't consciously think about them.

Here's a useful property of integers that you may never have thought about: integers are exactly representable in a computer, in a way that readily supports implementing arithmetic efficiently.  In other words, they can reasonably be a type in a programming language.

Not surprising, then, that most languages have a type called "integer" or "int" or something similar.  What's surprising is that this type doesn't actually represent the integers in most languages.

From what I can tell, the most popular languages that get "int" basically right are Python and Ruby.  Some less popular languages, including Smalltalk, also get "int" right.  Smalltalk is notable in that it seems to get "int" more right than Python and Ruby, which I'll talk about in the next post.

If you're a programmer, there's a good chance that you've used one of the C-derived languages, which include C, C++, C#, Objective-C, Java, Go and D.  These languages, and many others, all have an "int" type which can represent only a subset of the integers.  And the particular subset is usually one that fits into 32 or 64 bits (or 16 bits for some older implementations).  This means there is a largest integer ("intMax") and a smallest one ("intMin").  Which means it's actually pretty different from the integers we know and love.  Integers are unbounded, so we can love them without limit.

C and assembly languages are the common low-level languages, although there are a few others.  Some languages, like C++, try to support both low-level and high-level programming.  But most languages, certainly including C# and Java, are trying to be high-level languages and abstract machine details away from the programmer to make programming easier.  But they're failing on "int", which seems pretty fundamental.

If you're reading the definition of "int" in a high-level language, and you see the phrase "fits into 32 bits", that's a strong hint that your high-level language is not actually abstracting away from the machine; and therefore is not actually a high-level language.  You've been hoodwinked, bamboozled.  It is a low-level language in sheep's clothing.

The main problem with restricting "int" to some number of bits is that it can overflow (or underflow, which is similar but which I will largely ignore).  There are a few different options for managing overflow, sketched in code after the list:

- Wrapping (modular arithmetic).  This is very popular.  Here, intMax + 1 is intMin, which is pretty strange when you first see it.  But, if you're a programmer, chances are you've eventually grown used to this behavior.
- Promoting to a larger type.  If you always automatically promote to a larger integer type, you're effectively implementing integers; this is how Python does it.  But if you promote to a different type, it can lead to problems I'll discuss later.  For example, PHP promotes "int" to "float" when "int" overflows.
- Failing (throwing an exception).  This is less common.  Here, intMax + 1 is an error.
- Clamping.  I can't find any examples of this at all.  Here, intMax + 1 = intMax.
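Here's the promised sketch of all four policies, written in Python precisely because its own ints never overflow, which makes the simulation easy; the function name, the policy strings, and the 32-bit bounds are all made up for illustration.

INT_MAX, INT_MIN = 2**31 - 1, -(2**31)

def add32(a, b, policy):
    """Add two numbers as a hypothetical bounded 32-bit int under each policy."""
    exact = a + b                      # Python ints are unbounded, so this is exact
    if INT_MIN <= exact <= INT_MAX:
        return exact
    if policy == "wrap":               # modular arithmetic, like C or Java int
        return (exact - INT_MIN) % 2**32 + INT_MIN
    if policy == "promote":            # hand back a wider (here: exact) result
        return exact
    if policy == "fail":               # checked arithmetic: report the overflow
        raise OverflowError(exact)
    if policy == "clamp":              # saturate at the nearest boundary
        return INT_MAX if exact > INT_MAX else INT_MIN

print(add32(INT_MAX, 1, "wrap"))       # -2147483648
print(add32(INT_MAX, 1, "clamp"))      #  2147483647
print(add32(INT_MAX, 1, "promote"))    #  2147483648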

I said I'd ignore underflow, but it has the same options, and always seems to be handled the same way overflow is, although in principle you could design a very confusing language where overflow wrapped and underflow clamped.

Here's one explicit example of the problem with bounded ints.  The numbers in the example assume a typical signed 32-bit int; but you could create a similar example with different numbers for any bounded int.

You write a program which is supposed to count the number of characters in a file.  You test it carefully on a bunch of small files, where you can actually count the characters by hand, and it works perfectly.  Now you run it on the file you actually care about, which happens to have 6,442,450,941 characters in it, which is a large file.

The answer you get depends on how the language handles overflow.  With wrapping, you get the answer 2,147,483,645 which is a large number, and you believe it.  But it's only about a third of the right answer.  Your program was completely wrong, and you can't even tell.  If this were an actually useful result, something would catch on fire.

If your language is a rare language that clamps, you get 2,147,483,647 instead, which isn't all that different.

If your language throws on overflow, you just get an error.  It's still not useful, but at least you know you don't have the right answer.  If your language insists on bounded types, this is the best kind, because knowing when you don't get the right answer beats silently believing a wrong one.  We should rather bear those ills we have than fly to others we know not of.

If the language promotes to a floating-point type which is large enough to hold the correct answer, you get the correct answer.  But, for some larger file, floating-point types look kind of like clamping types: there is some smallest number M such that M+1 = M.  Unlike real clamping types, there's another number M' such that M'+2 = M' (and also M'+1 = M'), where M' > M.  But eventually you get the wrong answer.
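Assuming IEEE 754 binary64 (which is what Python's float is), you can find M and M' directly: M is 2**53, where the spacing between adjacent floats becomes 2, and any larger power of two works for M'.

M = 2.0 ** 53                      # smallest binary64 integer where adding 1 is lost
print(M + 1 == M)                  # True  -- the spacing between floats here is 2
print(M + 2 == M)                  # False

M_prime = 2.0 ** 55                # here the spacing is 4
print(M_prime + 1 == M_prime)      # True
print(M_prime + 2 == M_prime)      # True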

I should briefly mention here that there are a few languages which don't have integers, and only have floating point numbers.  These suffer from this same issue.  I count Perl among these languages, although it is in principle possible to use int in Perl.

There are also a few languages that don't have any numeric types, and aren't good choices for doing even simple math, such as counting.   Languages which specifically target non-numeric programming are, of course, exempt from criticism that they have poor numeric types.

Many languages (such as Java) have a type hidden away in a library somewhere with a name like "bigint" which actually can represent all integers.  Simply having such a type is not sufficient to be a high-level language.  The easiest-to-use integer type, and the type of integer literals, needs to behave like the integers, not like bounded ints with some bit size.

There's nothing wrong with a high-level language including low-level types, and there are two ways to include them.  One way, just throwing in the C-style types (int32, int64), is ok; it lets people use them where they want them.  There's another option, though, which is much more powerful:

define int16 as i : int such that -32768 <= i and i <= 32767;

A language could permit arbitrary constraints on a type, and check those constraints.  That gives bounded types that throw, which are the best bounded types.  And the compiler can use those constraints for optimization.

This lets the programmer put appropriate bounds on the types, not arbitrary ones dictated by bit sizes.  If I want a type that can only hold even numbers from -6 to 60, I just declare that type.
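A library class can fake this today, though without the compiler's help or any optimization; here's a hypothetical sketch in Python, with the class name and checking strategy invented for illustration.

class EvenSixtyish:
    """A constrained integer: even values from -6 to 60, checked on every operation."""
    def __init__(self, value):
        if not (-6 <= value <= 60 and value % 2 == 0):
            raise ValueError(f"{value} violates the constraint")
        self.value = value

    def __add__(self, other):
        return EvenSixtyish(self.value + other.value)   # re-check the result

x = EvenSixtyish(58) + EvenSixtyish(2)        # fine: 60 satisfies the constraint
try:
    y = EvenSixtyish(58) + EvenSixtyish(4)    # 62 is out of range
except ValueError as err:
    print(err)                                # 62 violates the constraint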

So, in summary, the rule for high-level languages is: integers should be easy.  The integer type that is easiest to use, because it has the shortest, simplest name and because it is the type of integer literals, should be unbounded (or at least act as if it is).

Python basically gets integers right; and Java basically gets them wrong (along with lots of other languages).  So what difference does it make?  First, there's a class of bugs which can happen in Java that cannot happen in Python.  Second, there's one fewer bizarre feature that needs to be learned in Python than in Java.  It takes seconds to completely explain integer literals and addition, subtraction and multiplication of integers in Python, because it's just integers, and the student knows it already.  In Java, you're talking about bits and bounds.

In the next post, I'll talk about dividing integers, which Smalltalk still gets right, but Python gets wrong.