Practical Data Management and Statistical Computing (BioEp691F)

Contacts

Outline
Assignments

Resources


Outline: Lec11 Lec12 Lec13 lec14 Lec15 Lec16 Lec17 Lec18 Lec19 Lec20
Lectures: Lec11 Lec12 Lec13 Lec14 Lec15 Lec16 Lec17 Lec18 Lec19 Lec20


Lecture 18


1.Handling missing values

One of the strengths of SAS is the flexibility given in the handling of missing values. Almost all collections of data have some missing values or values that are so obviously invalid or out of range that they must be replaced with missing values.


Recogonizing Missing and Invalid Values When Reading Numeric Data in SAS

When reading data into a SAS data set from an ASCII file, missing data can occur for numeric data or character data. For numeric data, when using LIST input, a period (or special missing value code, such as "99") must be used to represent a missing value. If a "blank" is use for the missing value, the SAS program will read the next value for the variable, and all subsequent values, at least for that line, will be misread. For numeric data when using COLUMN input, the columns may simply be left blank (or a period can be used). The fact that no special missing value codes are needed is an advantage of COLUMN input.

As a general policy, all data should be duplicate entered. Such entry will usually avoid problems of errors in entry that may result in a "character" value entered in place of a numeric variable. However, such errors can still occur, resulting in invalid data for a numeric field. Without special attention, the SAS program will report an error in reading in such data. An example of how SAS processes missing or illegal values is given in LEC9P1.SAS.

These data are read using a trailing @@ format, so that multiple records are read from a given line of data. There are values for all three variables in a set, but some of the values are missing or invalid. The missing values have been denoted by ".". However, the invalid values (entered as characters) are detected by SAS as invalid values and recoded to a SAS missing value.

The LOG contains a list of the values that have been recoded by SAS. A message is produced for each variable where a missing value occurred. Note that for some of these missing values, the value entered was a lower case "L" instead of a "1", or an upper case "o" instead of a "0". These types of errors are difficult to identify by visual inspection of ASCII data.

The OUTPUT illustrates the errors in reading the data. Note that SAS properly reported missing values when non-numeric values were entered for values in the data. However, there was no distinction between missing values (where the value was unknown) and invalid entries (where the value may be invalid due to an entry error). This distinction can be made by using the statement OPTION INVALIDDATA="Z"; This option assigns the special missing value ".Z" to numeric variables where characters values have been entered. An example is given in LEC19P2.SAS with the following output.

Notice how the missing values are specified in the PUT statement in the second part of the program. Although the INVALIDDATA option specifies a value of "Z", the missing value is referred to as ".Z".


Recogonizing Missing and Invalid values When Reading Character Variables in SAS

Character variables are identified by a "blank" or by a simple "." as missing values. When reading data in LIST format, a "blank" will be skipped over and not recogonized as a missing character variable. For this reason, if character values are possibly missing, a "." should be used if data are read using LIST format. Preferably, data will be read using COLUMN format.

We illustrate an application using COLUMN format with LEC19P3.SAS in which the occupation is reported for subjects. Note in the output that character values are left justified, and that the missing value of "." is represented as a "blank" in the SAS data set.


2. Converting Numeric and Character Variables

A common problem that can occur is to have to convert a numeric variable to a character variable, or visa versa. When comparing values between two variables, the each variable must be of the same type. If one variable is a character value, while the other is a numeric value, a new variable must be created, or the type of one of the variables converted. We consider ways in which character values can be converted to numeric values here.


Converting Character values to Numeric Values

Suppose a study is conducted among adults 18-24 to assess smoking and drug use among high school drop-outs. As part of the study, the subjects are asked to report their age. Later on in the study, the subjects also report their date of birth. In some cases, the self reported age is given as ">18", with no specific age specified. When entering the data, the reponse reported was entered, resulting in special characters in the field for the self reported age. Here are data reported for the first 6 subjects.

Since character and numeric values are present for "AGE", these data were read using a character variable. The date of birth was read using a MMDDYY8. date INFORMAT. Subject's age was also evaluated using the date of birth as of Nov. 11, 1995. The program used to read the data is called LEC9P4.SAS. The output lists the self-reported age (AGE) and the calculated age (AGEN). The first variable is a character variable, while the second is a numeric variable, as illustrated in the Contents.


The listing of data illustrate several problems. From such data, we may want to create a single variable that is the best representation of the subject's age. We define this variable to be

Also we create an indicator variable that has a value of :

"1" if the subject's age is within the age range 18-24 (including subjects reporting > 18), or

"0" if the subject's age is unknown or outside of the age range for the study.

The program LEC9P5.SAS develops two variables, AGEF, and INSTUDY according to these guidelines. Note in the LOG that since some values of age have been recorded as character variables, they are converted to numeric variables automatically in the assignment statement for AGEF=AGE. The output contains the final age for use in the study.

Details on Numeric Variables on PCs and SAS


3. Recoding Missing Value Codes to SAS Missing Values and other Special Missing Values

In SAS data sets, missing character values are represented by blanks, and missing numeric values are normally represented by a period (.). Sometimes, missing values have been assigned a different code, such as 9 or 99 in an input data file. These values should be recoded to SAS missing values with simple assignment statements such as:

IF VAR1=9 THEN VAR1=.;
IF VAR2=99 THEN VAR2=.;

In this way, the missing values will be appropriately handled in SAS procedures. Certain reporting procedures, such as PROC FREQ offer choices in the handling of missing values, such as whether or not they are included in the computation of observation totals and percentages.

A period is the default for a missing value, but it is also possible to define special numeric missing values, represented by the letters A, ..., Z, and the underscore (_). For example, if you wish to distinguish between refusals (coded as 7), not applicable (coded as 8) and missing (coded as 9), the following statements could be used:

IF VAR1=7 THEN VAR1=.R;
ELSE IF VAR1=8 THEN VAR1=.N;
ELSE IF VAR1=9 THEN VAR1=.;

This assigns the special missing value R to refusals, and N to not applicable. This might be useful subsequently, should there be interest in finding out who the refusers were, or in getting a count of refusals, while still treating these as missing values for computational purposes. The values are stored in the SAS data set and printed as a letter, however in programming statements you must refer to them by preceding them with a period.


A Caution when Forming Groups if Some Values are Missing

When dealing with missing values it is important to note that missing values are considered less than all possible numeric values (even negative ones). When creating new variables from variables that have missing values this must be taken into account. For example in creating age groups from a variable AGE, representing age in years, the following statements would include those with missing AGE in the youngest age group:

IF AGE< 20 THEN AGEGR=1;
ELSE IF 20<=AGE< 40 THEN AGEGR=2;
ELSE IF 40<=AGE THEN AGEGR=3;

To avoid this problem it is necessary to say: IF 0<=AGE<20; so that those with missing AGE will also have missing AGEGR, the variable for age group. Missing values are also ordered from smallest to largest as _ . A B ... Z. To select all records where VAR1 is missing, use:

IF VAR1<=.Z THEN DO;

To delete all observations with missing values for VAR1 use;

IF VAR1>.Z;

In the same way, missing character values (blanks) precede all other characters in alphabetic sorting, and must be handled appropriately. Missing character values can be refered to by enclosing a blank in single quotes. For example if the variable SEX is represented by F and M, with blank for missing, all those with missing data for SEX could be deleted with the statement:

IF SEX=' ' THEN DELETE;



Produced and maintained by the Dept of BioEpi at UMASS
Send comments or questions about this web site to Ed Stanek
Email:
stanek@schoolph.umass.edu
\be691f\web\webready\lec18.html
Lst Update: 11/16/99