Practical Data Management and Statistical Computing (BioEp691F)


Lecture 15


1. Using a LENGTH Statement to Create Character Values in a DATA Step

On occasion, we need to assign character values to a variable in a data set. Continuing the previous example, suppose that follow-up information is recorded on each subject over time. Some subjects are "active", while other subjects have "dropped out". We may add a variable holding the follow-up status to the listing of IDs. Suppose, for example, that the subjects with IDs 1, 3, 5, and 6 are "active", while the other subjects have "dropped out". A listing of the IDs, names, and follow-up status may be attempted with the program LEC15P11.SAS, which produces the accompanying OUTPUT.

In that first attempt, the character variable is too narrow to hold the longest status value. We correct the problem by adding a LENGTH statement prior to the INPUT statement in LEC15P12.SAS, resulting in the corrected OUTPUT.
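The idea can be sketched as follows. This is not the course program LEC15P12.SAS: the data set name FOLLOWUP, the variable names, and the six subject IDs are assumptions made for illustration. The LENGTH statement is placed before the INPUT statement so that STATUS is wide enough for the 11 characters of "dropped out". Without it, SAS would take the length of STATUS from the first assignment, "active" (6 characters), and the longer value would be cut off.

```sas
/* Illustrative sketch only; names and data are hypothetical. */
data followup;
   length status $ 11;              /* room for "dropped out" (11 characters) */
   input id;
   if id in (1, 3, 5, 6) then status = "active";
   else status = "dropped out";
   datalines;
1
2
3
4
5
6
;
run;

proc print data=followup noobs;
   var id status;
run;
```

Removing the LENGTH statement and rerunning the step is a quick way to see the truncation for comparison.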


2. Working with Numeric Variables

Numeric variables are relatively straightforward to work with in SAS. The main difficulty results from attempts to save space in SAS data sets. First, note that computers ultimately perform all calculations in base 2, and there is no exact base 2 expression for numbers like 1/3 or 1/6, or for many decimal fractions such as 0.1. Furthermore, there are limits in precision even for integers. These limits can cause problems when operating on numeric variables. The program LEC16P1.SAS illustrates some of these limits, as shown in its output.

Note that while the first statements produce a "MATCH" for variables A and B, the value stored in A is not exactly equal to the value 1/3 specified in the program. Furthermore, note that the values of D specified in the program differ, while the values of D in the output are the same. Both results are due to the limited precision with which the computer represents numeric values.
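The following sketch shows the same kind of behavior. It is not LEC16P1.SAS: the variable names and values are assumptions chosen for illustration, and the comments describe what is expected with 8-byte IEEE floating-point storage, as on a PC.

```sas
/* Illustrative sketch only; not the course program. */
data _null_;
   a = 1/3;
   b = 1/3;
   if a = b then put "MATCH: A and B hold the same stored value";
   put a= 22.20;                    /* displays the stored approximation of 1/3 */

   x = 0.1 + 0.2;                   /* neither 0.1 nor 0.2 is exact in base 2 */
   if x = 0.3 then put "x equals 0.3";
   else put "x differs from 0.3";

   d1 = 9007199254740992;           /* 2 to the 53rd power */
   d2 = d1 + 1;                     /* not representable in 8 bytes */
   if d1 = d2 then put "D1 and D2 are stored as the same value";
   else put "D1 and D2 differ";
run;
```

The messages are written to the SAS log. Comparing values after rounding (for example, with the ROUND function) is a common way to avoid surprises of this kind.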

SAS uses 8 bytes to represent a numeric value (such as the value of D in LEC16P1.SAS). With 8 bytes, SAS can represent an integer as large as 2 to the 53rd power = 9,007,199,254,740,992 exactly. Many integer values can be represented exactly with fewer bytes, and specifying a length of less than 8 bytes means that variables occupy less space in SAS data sets. Table 1 below (from SAS LANGUAGE GUIDE, Version 6.03, p. 198) gives the maximum integer that can be expressed exactly for each number of bytes.

 

Table 1. Number of Bytes and Maximum Integer that Can Be Represented Exactly

# of Bytes    Power of 2    Maximum Integer
3             13            8,192
4             21            2,097,152
5             29            536,870,912
6             37            137,438,953,472
7             45            35,184,372,088,832
8             53            9,007,199,254,740,992
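
The powers of 2 in Table 1 follow a simple pattern: each is 8 times the number of bytes, minus 11. A plausible reading, assuming the IEEE layout used on PCs (1 sign bit and 11 exponent bits, with one implied leading mantissa bit), is that a length of n bytes leaves 8n - 12 stored mantissa bits. The short DATA step below is a sketch that regenerates the table from that rule. The variable names are hypothetical.

```sas
/* Illustrative sketch only; assumes the 8*bytes - 11 pattern seen in Table 1. */
data _null_;
   do bytes = 3 to 8;
      power  = 8*bytes - 11;        /* usable bits for an exact integer */
      maxint = 2**power;            /* largest integer stored exactly */
      put bytes= power= maxint= comma25.;
   end;
run;
```

The limits differ on other hardware (mainframe SAS uses a different floating-point layout), so the rule should be read as specific to this table.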

In practice, there are many settings where variables take on discrete values (such as the values 1-5). SAS automatically assigns a LENGTH of 8 bytes to numeric variables. This length should not be changed if the variables contain non-integer (decimal) values. However, if the variables hold only integers no larger than the maximum values given above, they can be stored using fewer bytes without loss of precision. Such storage is accomplished using a LENGTH statement, in the same manner as a LENGTH statement is used with character variables.
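
A minimal sketch of a LENGTH statement for numeric variables follows. The data set SURVEY, its variables, and the data lines are assumptions made for illustration. Only the integer-valued variables are shortened, and the decimal variable WEIGHT keeps the default length of 8.

```sas
/* Illustrative sketch only; names and data are hypothetical. */
data survey;
   length id 4 rating 3;            /* integers well under the Table 1 limits */
   input id rating weight;          /* WEIGHT keeps the default length of 8 */
   datalines;
101 5 61.3
102 2 74.8
103 4 58.0
;
run;

proc contents data=survey;          /* lists the stored length of each variable */
run;
```

Note that no dollar sign appears in the LENGTH statement for numeric variables; the dollar sign is used only for character variables.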


3. Dangers in Setting Lengths For Numeric Variables

It is desirable to limit the length of integer variables so as to minimize data set size. For example, the program LEC16P2.SAS creates two data sets, each with 1000 records of 100 variables. In one data set all variables have length 8, while in the other all variables have length 3. The data set with the shorter variables is stored as a smaller file in the C:\TEMP directory.

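The file D1.SD2 contains 817,000 bytes, while the file D2.SD2 contains 333,000 bytes. The data values alone need 1000 x 100 x 8 = 800,000 bytes at length 8 and 1000 x 100 x 3 = 300,000 bytes at length 3. Thus the savings in storage space are roughly proportional to the reduction in the length of the variables.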
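
A comparison of this kind can be sketched as below. This is not LEC16P2.SAS: the libref TMP, the variable names V1-V100, and the way the values are generated are assumptions made for illustration. The file extension and exact file sizes depend on the SAS version, so they may not match the figures quoted above.

```sas
/* Illustrative sketch only; not the course program LEC16P2.SAS. */
libname tmp "C:\TEMP";              /* hypothetical libref for the directory */

data tmp.d1 (drop=i j);
   length v1-v100 8;                /* default length, stated explicitly */
   array v{100} v1-v100;
   do i = 1 to 1000;
      do j = 1 to 100;
         v{j} = mod(i + j, 5) + 1;  /* integer values 1 through 5 */
      end;
      output;
   end;
run;

data tmp.d2 (drop=i j);
   length v1-v100 3;                /* 3 bytes per value instead of 8 */
   array v{100} v1-v100;
   do i = 1 to 1000;
      do j = 1 to 100;
         v{j} = mod(i + j, 5) + 1;
      end;
      output;
   end;
run;

proc contents data=tmp.d1; run;     /* compare observation lengths */
proc contents data=tmp.d2; run;
```

Because every value is an integer between 1 and 5, both data sets hold identical values and only the storage differs.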

Problems occur when variables are stored with a length (chosen to save storage space) that is inadequate for some of their values. An example is given in LEC16P3.SAS, where a variable, age, is reported in years for subjects, with a value of 99.9 used as the missing value code. The length of age has been set to 3, since the integer values of age are small (well under the 8,192 limit for 3 bytes). However, since the missing value code is a decimal value, defining age with length 3 will misrepresent the decimal code, as illustrated in the output.
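
The effect can be sketched as follows. This is not LEC16P3.SAS: the data set names, the variable FLAG, and the data lines are assumptions made for illustration. Within the step that declares the length, the value is still held in full precision. The shortening takes place when the observation is written to the data set, so the comparison is made in a second step that reads the data set back.

```sas
/* Illustrative sketch only; not the course program LEC16P3.SAS. */
data ages;
   length age 3;                    /* too short for the decimal code 99.9 */
   input id age;
   datalines;
1 34
2 99.9
3 61
;
run;

/* The shortened value is what a later step reads back from the data set. */
data check;
   set ages;
   length flag $ 12;
   if age = 99.9 then flag = "missing code";
   else flag = "not matched";
run;

proc print data=check noobs;
   format age 12.8;                 /* show enough decimals to see the change */
run;
```

With 3-byte storage on a PC, the stored code for subject 2 is expected to be slightly below 99.9, so the test AGE = 99.9 would not identify it. Using an integer code such as 99, or a SAS missing value, avoids the problem.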



Produced and maintained by the Dept of BioEpi at UMASS
Send comments or questions about this web site to Ed Stanek
Email: stanek@schoolph.umass.edu
Last Update: 11/9/99