Sunday, April 11, 2010

Significant Figures in R and Info Zeros

The other day, I stumbled upon the signif function in R, so I thought I'd take a look at what it does and compare it with some results discussed in Chap. 3 "Damaging Digits in Capacity Calculations" of my GCaP book, viz., Example 3.5 on page 31. The measured numbers in that example are reproduced here in Table 1 using read.table in R.

Table 1.
Idx     Number Pdd SD
1           50   0  1
2      0.00341   5  3
3       1.0040   4  5
4      50.0005   4  6
5        6.751   3  4
6        0.157   3  3
7         28.0   1  3
8        40300   0  3
9        0.070   3  2
10       30.07   2  4
11       65000   0  2
12      0.0067   4  2
13 6.023*10^23   3  4

The column labeled Pdd counts the number of digits past the decimal point (if any). The last column, labeled SD, shows the number of significant digits as determined by applying the following recipe (Algorithm 3.1) from Chap. 3:

Scan the number from left to right.
Is there an explicit decimal point?

Yes:
  • Scan and locate the first nonzero digit.
  • Count it and all digits to its right. (including zeros)
No:
  • Append a decimal point.
  • Count the position of the last non-zero digit prior to the decimal point.
  • All zeros in between should be ignored.

Let's check this recipe against the number 314000.000314000000 with all those trailing zeros kept intentionally. Since it has a decimal point, the first non-zero digit (scanning from the left) is the '3' that starts the number. Counting it and all the digits to its right, including all zeros, we get SD = 18 significant digits. If the number was just the integer 314000 with no decimal fraction then, according to the second part of Algorithm 3.1, we would get SD = 3, i.e., the '4' is in position 3 and we ignore the three zeros between it and the decimal point.


Applying Algorithm 3.1 to the numbers in Table 1, we get the corresponding values in the SD column. Since that's a pain to do manually, let's see what R produces using signif() using the following R code:

sigdigs<-function(n) {
    d<-0
    while(signif(n,digits=d) != n) {
        d<-d+1
        next
    }
    return(d)
}
The results are shown in the column labeled Rsn of Table 2.
Table 2.
Idx                    Number Pdd SD Rsn
1                        50  0  1  0
2                   0.00341  5  3  3
3                    1.0040  4  5  4
4                   50.0005  4  6  6
5                     6.751  3  4  4
6                     0.157  3  3  3
7                      28.0  1  3  2
8                     40300  0  3  3
9                     0.070  3  2  0
10                     30.07  2  4  4
11                     65000  0  2  2
12                    0.0067  4  2  2
13  602299999999999975882752  0  4  4
Oh oh! There are discrepancies between SD and Rsn in rows 1, 3, 7, and 9. What's going on? It doesn't mean R is wrong. The difference arises from the interpretation of the word "significant" and whether it refers to numerical significance or measurement significance. Chapter 3 is about significant digits in numbers used to represent performance measurements. To understand this distinction more clearly, the digits in our earlier contrived number are generalized in Figure 1.

Significant digits

Figure 1

Looking at the integer component on the left side of the decimal point, we see that zeros between the decimal point and the first non-zero digit to its left (L) do not carry any new information. They simply scale the number. It's like 1000 meters being identical to 1 kilometer. The physical units are different, but that has nothing to do with significant digits. This might be easier to appreciate if I compare the phrase 1 thousand meters with the phrase 1 kilo meter. Now, the '1' is clearly the only digit that matters. The scale zeros do not tell us anything we didn't already know. The information content is identical. The same is true for zeros to the immediate right of the decimal point, but instead of scaling by multiples of 10, the fractional part is scaled by multiples of 1/10. For example, 3 milliseconds is identical to 0.003 seconds. Nothing new is revealed. The digits RRR... to the right of the fractional scale-zeros do, however, carry information, so they are significant digits.

What about those zeros to the right of the R-digits? That's where the ambiguity creeps in. Mathematically speaking, those trailing zeros carry no additional information because they do not alter the position of the number of the real number line so, they can be ignored and that's what R (and Mathematica) does. They are the complement of the suppressed zeros to the left of the L-digits. Both go on forever (theoretically) so there's no point in trying to write them down and anyway, they add no new information regarding location on the number line. Physically speaking, however, those trailing zeros do count when they represent a measured quantity. Since all information is physical, if there are trailing zeros in a measured number, then a lot of work must have been done in each decimal position to know that they are zeros and not some other digit.

It represents not just numerical precision but measurement resolution. In that sense, those trailing zeros do carry information—they're info zeros—therefore they are significant.


With this as background, let's examine the number in row 13 of Table 1, 6.023 × 1023, which is Avogadro's number (known in Germany as Loschmidt constant). It's roughly the number of water molecules in an ice cube. But 6.023 × 1023 is only a crude approximation, which we learn in chemistry as a simple mnemonic (the two 23s). The current best measured value is written more accurately as:
6.02214179(30) × 1023
It's an inconceivably large number, although by no means the largest number known in physics or mathematics. To get some idea of its size, it's been said that Intel manufactured on the order of 1018 transistors last year. Even at that impressive rate, however, it would take another million years to reach Avogadro's number of transistors!

As written above, you will immediately notice the strange use of parentheses. You never see digits enclosed in parens (in this way) in mathematical numbers, e.g., 3.1415926 representing π as a decimal fraction. That's because those parens indicate measurement uncertainty or lack of resolution in the 9th and 10th decimal places. It's analogous to trying to resolve more and more detail in a microscope image. Even at the highest magnification, the smallest details will remain unclear. No such thing occurs in the decimal representation of π. If you want to know the 9th and 10th decimal digits, you simply calculate them; you don't measure them. The number in row 13 of Table 1 is encoded in scientific notation, which is shorthand for:

6.02214179(30) × 1.00 000 000 000 000 000 000 000
What happens to the significant digits when Avogadro's number is expressed as an integer? Despite its resemblance to a decimal fraction, it must be an integer since it counts things like molecules.

Writing it out in full and ignoring the imprecise digits (for simplicity), it looks like this:

602 214 179 000 000 000 000 000
According to Algorithm 3.1, all the zeros beyond the '9' do not count as significant digits. As shown in Figure 1, they are merely scale zeros, so all is well and SD = 9 in this approximation. But wait! Look at row 13 in Table 2. There, I forced R to express Avogadro's number as an integer and it produced:
602 299 999 999 999 975 882 752
which is not the same integer we've just been discussing. Where did all those non-zero digits come from? And according to Algorithm 3.1, if they're not zeros they must be significant, in which case SD = 24. Once again, the problem resides in the distinction between numerical precision and measurement precision. The numbers in Row 13 of Tables 1 and 2 are correct to 4 significant digits. In Table 2, however, R didn't know what to put in the remaining 20 places of Avogadro's integer (it can't compute it, like π), so it essentially inserts random junk that rounds up correctly to
602 300 000 000 000 000 000 000
in the chemistry-mnemonic approximation, i.e., SD = 4. Obviously, you can't get extra precision out of the vacuum, even though R makes it seem like you can.

This is a good reminder that you have to remain vigilant when using sophisticated tools like R.

It's not that R is wrong, it's that you might draw the wrong conclusion from R's output.


We see now that it's the numerical truncation of trailing zeros that accounts for all the discrepancies between the SD and Rsn in Table 2. To avoid having your measurement zeros truncated or rounded away, those numbers need to be represented as literal strings of characters. In R, this can be accomplished using format(number, nsmall) with nsmall set to the number of digits to the right of the decimal point that you want preserved. That's the role of the Pdd (post-decimal digits) in Tables 1 and 2. Here's the R function I wrote to generate Table 2:

sigdigss<-function(n) {
    i <- 0
    # Check for decimal point is present
    if(length(grep("\\.", numstr[n])) > 0) { # real number
        # Separate integer and fractional parts
        intfrac <- unlist(strsplit(numstr[n], "\\."))
        digstring <- paste(intfrac[1], intfrac[2], sep = "")
        numfigs <- nchar(digstring)
        while(i < numfigs) {
            # Find index of 1st non-zero digit from LEFT
            if(substr(digstring,i+1,i+1) == "0") {
                i <- i + 1
                next
            } else {
                sigfigs <- numfigs - i
                break
            }
        }   
    } else {  # must be an integer      
        digstring <- numstr[n]
        numfigs <- nchar(digstring)
        while(i < numfigs) {
            # Find index of 1st non-zero digit from RIGHT
            if(substr(digstring, numfigs-i, numfigs-i) == "0") {
                i <- i + 1
                next
            } else {
                sigfigs <- numfigs - i
                break
            }
        }   
    }   
    return(sigfigs)
}
You can also compare this R code with the corresponding Mathematica and Perl codes.