|
|
|
|
One of the strengths of SAS is the flexibility given in the
handling of missing values. Almost all collections of data have some
missing values or values that are so obviously invalid or out of
range that they must be replaced with missing values.
When reading data into a SAS data set from an ASCII file, missing data can occur for numeric data or character data. For numeric data, when using LIST input, a period (or special missing value code, such as "99") must be used to represent a missing value. If a "blank" is use for the missing value, the SAS program will read the next value for the variable, and all subsequent values, at least for that line, will be misread. For numeric data when using COLUMN input, the columns may simply be left blank (or a period can be used). The fact that no special missing value codes are needed is an advantage of COLUMN input.
As a general policy, all data should be duplicate entered. Such entry will usually avoid problems of errors in entry that may result in a "character" value entered in place of a numeric variable. However, such errors can still occur, resulting in invalid data for a numeric field. Without special attention, the SAS program will report an error in reading in such data. An example of how SAS processes missing or illegal values is given in LEC9P1.SAS.
These data are read using a trailing @@ format, so that multiple records are read from a given line of data. There are values for all three variables in a set, but some of the values are missing or invalid. The missing values have been denoted by ".". However, the invalid values (entered as characters) are detected by SAS as invalid values and recoded to a SAS missing value.
The LOG contains a list of the values that have been recoded by SAS. A message is produced for each variable where a missing value occurred. Note that for some of these missing values, the value entered was a lower case "L" instead of a "1", or an upper case "o" instead of a "0". These types of errors are difficult to identify by visual inspection of ASCII data.
The OUTPUT illustrates the errors in
reading the data. Note that SAS properly reported missing values when
non-numeric values were entered for values in the data. However,
there was no distinction between missing values (where the value was
unknown) and invalid entries (where the value may be invalid due to
an entry error). This distinction can be made by using the statement
OPTION INVALIDDATA="Z"; This
option assigns the special missing value ".Z" to numeric variables
where characters values have been entered. An example is given in
LEC19P2.SAS with the following output.
Notice how the missing values are specified in the PUT statement in
the second part of the program. Although the INVALIDDATA option
specifies a value of "Z", the missing value is referred to as
".Z".
Character variables are identified by a "blank" or by a simple "." as missing values. When reading data in LIST format, a "blank" will be skipped over and not recogonized as a missing character variable. For this reason, if character values are possibly missing, a "." should be used if data are read using LIST format. Preferably, data will be read using COLUMN format.
We illustrate an application using COLUMN format with LEC19P3.SAS in which the occupation is reported for subjects. Note in the output that character values are left justified, and that the missing value of "." is represented as a "blank" in the SAS data set.
A common problem that can occur is to have to convert a numeric variable to a character variable, or visa versa. When comparing values between two variables, the each variable must be of the same type. If one variable is a character value, while the other is a numeric value, a new variable must be created, or the type of one of the variables converted. We consider ways in which character values can be converted to numeric values here.
Since character and numeric values are present for "AGE", these data were read using a character variable. The date of birth was read using a MMDDYY8. date INFORMAT. Subject's age was also evaluated using the date of birth as of Nov. 11, 1995. The program used to read the data is called LEC9P4.SAS. The output lists the self-reported age (AGE) and the calculated age (AGEN). The first variable is a character variable, while the second is a numeric variable, as illustrated in the Contents.
Also we create an indicator variable that has a value of :
"1" if the subject's age is within the age range 18-24 (including subjects reporting > 18), or"0" if the subject's age is unknown or outside of the age range for the study.
The program LEC9P5.SAS develops two variables, AGEF, and INSTUDY according to these guidelines. Note in the LOG that since some values of age have been recorded as character variables, they are converted to numeric variables automatically in the assignment statement for AGEF=AGE. The output contains the final age for use in the study.
Details on Numeric Variables on PCs and SAS
In SAS data sets, missing character values are represented by blanks, and missing numeric values are normally represented by a period (.). Sometimes, missing values have been assigned a different code, such as 9 or 99 in an input data file. These values should be recoded to SAS missing values with simple assignment statements such as:
IF VAR1=9 THEN VAR1=.;
IF VAR2=99 THEN VAR2=.;
In this way, the missing values will be appropriately handled in SAS procedures. Certain reporting procedures, such as PROC FREQ offer choices in the handling of missing values, such as whether or not they are included in the computation of observation totals and percentages.
A period is the default for a missing value, but it is also
possible to define special numeric missing values, represented by the
letters A, ..., Z, and the underscore (_). For example, if you wish
to distinguish between refusals (coded as 7), not applicable (coded
as 8) and missing (coded as 9), the following statements could be
used:
IF VAR1=7 THEN VAR1=.R;
ELSE IF VAR1=8 THEN VAR1=.N;
ELSE IF VAR1=9 THEN VAR1=.;
This assigns the special missing value R to refusals, and N to not applicable. This might be useful subsequently, should there be interest in finding out who the refusers were, or in getting a count of refusals, while still treating these as missing values for computational purposes. The values are stored in the SAS data set and printed as a letter, however in programming statements you must refer to them by preceding them with a period.
When dealing with missing values it is important to note that missing values are considered less than all possible numeric values (even negative ones). When creating new variables from variables that have missing values this must be taken into account. For example in creating age groups from a variable AGE, representing age in years, the following statements would include those with missing AGE in the youngest age group:
IF AGE< 20 THEN AGEGR=1;
ELSE IF 20<=AGE< 40 THEN AGEGR=2;
ELSE IF 40<=AGE THEN AGEGR=3;
To avoid this problem it is necessary to say: IF 0<=AGE<20; so that those with missing AGE will also have missing AGEGR, the variable for age group. Missing values are also ordered from smallest to largest as _ . A B ... Z. To select all records where VAR1 is missing, use:
IF VAR1<=.Z THEN DO;
To delete all observations with missing values for VAR1 use;
IF VAR1>.Z;
In the same way, missing character values (blanks) precede all
other characters in alphabetic sorting, and must be handled
appropriately. Missing character values can be refered to by
enclosing a blank in single quotes. For example if the variable SEX
is represented by F and M, with blank for missing, all those with
missing data for SEX could be deleted with the statement:
IF SEX=' ' THEN DELETE;
|
|
Produced and maintained by the
Dept
of BioEpi at UMASS |