|
|
|
|
A simple way to read data into SAS is using LIST input format. While simple to use, this way or reading data is also the easiest way to "miss" read data. LIST input format should be used only when other approaches to reading the data fail, or for simple short data files. Data should be entered so that other methods of reading data can be used (such as in fixed columns). For routine creation and reading of data files, LIST input format should NOT be used. Instead, Column input format should be used.
Basics:
Several problems can occur when reading character variables using LIST input format. These problems can be resolved using additional SAS commands, but may be a source of frustration. We illustrate these problems, and show how the problems can be resolved.
Note that the age for John was unknown, and hence missing in the first line of data. In addition, no information was available for Jan, and Bobbie's SEX was unknown. The LOG gives an error message, but an error message that doesn't seem to match the problem. The problem is that SAS expects to find a numeric variable as the second variable, and finds instead the character "M". Subsequently, looking for the value of SEX, SAS continues to read the next record and determines that SEX is equal to "PETER". The SAS program now thinks that it has completed reading the first record, and skips to the next line of data.
The next line is for "JAN", but when looking for JAN's age, SAS concludes that JAN's age has a value of "JAMES", an illegal character value, and hence replaces JAN's age by the SAS default missing value. Looking for the next variable, SEX, SAS concludes that JAN's SEX is "16". A similar pattern is followed in attempting to read the additional data.
The best way to fix missing values for some records is to type in missing value codes for the missing values in the data. Since the SAS missing value code is ".", a "." can replace all missing values. This is a very manageable process for a small data file as in the example above. For a large file, this may be an impossibility. The SAS program LEC14P2.SAS uses such missing values and reads the data successfully.
When the amount of data are large, it will not be feasible to replace missing values manually in the data set. If missing values occur only for variables at the end of record, it is possible to prevent SAS from skipping to another record to complete the reading of the variables. To do so, the data must be specified with an INFILE statement, as opposed to a CARDS statement. The option MISSOVER to the INFILE statement prevents SAS from going to the next line of data to complete a record. This is illustrated in the program LEC14P3.SAS, using data LEC14P3.TXT.
Notice that while the OUTPUT produced by this program is closer to the representation of the true data, some of the data are still misread. Often, the only real way to make sure that such mistakes do not occur is to type the missing values whenever they occur. Of course, if you have control over how data are entered into an ASCII file, then choosing column input over list input makes the problems simpler to resolve.
Several problems can occur when reading character variables using LIST input format. These problems can be resolved using additional SAS commands, but may be a source of initial frustration. We illustrate these problems, and show how the problems can be resolved.
Note that the data for this program are complete for all subjects. However, after running this program, one ERROR is present in the LOG window. The error indicates a problem in reading the variable AGE for "Mary Jo". Inspection of the LOG indicates that SAS read the NAME as "Mary", and apparently ignored "Jo". Actually, since the next variable was recognized by SAS following a "space", SAS attempted to read the variable AGE for "Mary" as "Jo". Since this value is a character, an ERROR message was issued in the LOG window.
The problem that occurred in LEC14P4.SAS for "Mary Jo" was that the embedded blank that separated the two parts of her name was read as a SAS variable delimiter. We can replace the "delimiter" for the next variable by adding the character & following a variable name. The effect of adding & is to require two consecutive "spaces" to occur prior to SAS looking for the next variable.
Notice in the OUTPUT listing that inclusion of the "&" character has solved the problem of reading in the names correctly. However, John-Peter's name is not read correctly by the program. SAS chooses a "default" length for character variables of 8. This means that only the first 8 columns of a character variable are included unless otherwise specified. For this reason, "John-Peter" was read as "John-Pet".
It is simple to alter the maximum number of columns to be contained by a character variable. The specification is made by including the number of columns directly following the dollar sign. Thus, 10 columns are specified as "$10.".
Inclusion of the length for 10 columns for the variable NAME allow names up to 10 columns to be read correctly. Since an "&" is also included, SAS will not begin reading the next variable until two consecutive blank "spaces" occur. Notice that the value of NAME does not have take up 10 columns. When two consecutive blank "spaces" occur prior to the 10 columns, the next variable is read.
Sometimes an ASCII data set may come with variable values separated by commas instead of blanks. For example, when data are entered in an EXCEL spreadsheet (as for example, EXCEL1.XLS), the results may be saved in comma-deliminated format (using the SAVE-AS option with CSV (OS/2 or MS-DOS) *.csv) ( excel1.csv). Note that each variable is separated by a comma instead of a space.
It is best to organize ASCII files so that the resulting data can be read into SAS using Column input. The term "Column input" simply means that data are store in specified columns. Column input enables missing values to be easily identified, and allows for more flexible character variable input. When creating ASCII files, thought should be given so that column input can be used to read the files into SAS.
Column input in SAS consists of a specification of variable names and the associated columns that contain the variables. Since the associated columns are specified with the variables, no variable delimiter is needed. As a result, character variables can be read that include imbedded blanks without special instructions. In addition, if no values exist in the columns specified for a variable, missing values are automatically generated. These features of column input make its routine use advantageous.
Features:
Program LEC14P7.SAS requires data for each variable to be recorded in specific columns. When there are no values in the column, the value for the variable is missing. Column input automatically begins reading at the 1st column. The program specifies that the variable "NAME" will be located in the first 10 columns. Next, one column is skipped prior to reading the values for the variable "AGE" in the next two columns. After skipping three more columns, the variable "SEX is contained in the next 6 columns.
Inspect the OUTPUT from LEC14P7.SAS. Note that the listing for the variable "SEX" has all of the 1st characters for "SEX" appearing in the same column. All character variables read with an INPUT statement are "left aligned" by default. This means that any leading "blanks" are deleted prior to storing the value of the variable. For this reason, the value of SEX for "Midge" and "Mary Jo" in the above example is stored exactly the same.
The same Column Input can be specified in several different manners. Note that the DATA statements given here have a more direct link to the column positions in the ASCII data file. For this reason, this form of specifying the input is preferable. The OUTPUT is the same.
The columns corresponding to variables can be specified in one other manner. This specification has the advantage of providing a direct link to the column positions in the ASCII data file. Since the link between the variable names and the column positions are more explicit, this specification is best. Once again, the OUTPUT is the same.
Since the column positions are explicitly specified in LEC14P9.SAS, the order in which the variables are read can change without altering the values read for different variables. For example, the program LEC14P10.SAS will result in the same values of the variables being read.
Large numbers of variables may be collected in a study. It is not unusual to have 150 to 300 variables as part of a data set. The values of these variables may be listed on a single line, but when many variables exist, the length of the line will become very long. This creates no problems (since the LRECL options can be used with the INFILE command to recogonize line lengths over 256 columns), except for the fact that only about 80 columns can be viewed at any one time on a computer screen, or printed on a printer in a legible font. To avoid these problems, data are often read on several lines using fewer columns on each line. It is generally a good idea to keep line lengths less than 140 columns to facilitate printing and viewing data on the computer screen.
As an example, suppose that values for 100 variables are to be entered, with each value requiring five columns. If all variables are written on the same line, the line will have a length of 500 columns. An alternative way of entering the data is to include 20 variables on each of five lines, with values for the first 20 variables on the first line, etc. Each line of data will now contain 100 columns of data.
When data are entered on multiple lines, we need to define terms that distinguish the "line" of data from the "record" in the data file. For example, 5 lines of data in the previous example will correspond to data for a single record. Clearly, the values recorded in the same set of columns on different lines may differ, and will refer to different variables. It is important that the sets of 5 records be kept together so that they refer to responses for the same "record".
In practice, this linking of lines for a given record is accomplished by defining a variable for each line that has a common value. This ID variable, which is unique for each record, links the lines to the records. In addition, since the lines in the record contain different variables, it is important to include an additional variable on each line that identifies the line number. With an ID and line number variable, the values on a line can be clearly associated with variables and an ID.
- A Simple Example Reading Multiple Lines Per Record: LEC14P11.SAS
A blood pressure screening program has collected data from Screening Days over a one month period. The screenings were performed on certain days in the lobby of local area banks. Data collected at the screening included the values of subject's weight and height. Blood pressure data were automatically recorded by a blood pressure device, and will be later combined with the other data. In addition, the subjects name and address were recorded. Each subject was assigned an ID number at the site. The data have been entered in an ASCII file with two records per subject. The first record contains the following variables:
while the second record contained the name and address variables:
Variable
Columns
Description
ID
1-5
Line #
6
Ht (inches)
8-10
Reported without decimal point to 10th of inch
Wt (lbs)
11-13
Data on more than one line are indicated in SAS by #1 and #2 in the INPUT statement. The resulting data are listed in the OUTPUT.
Variable
Columns
Description
ID
1-5
Line #
6
Name (last, first)
7-37
Street #
38-41
Street Name
63-60
Additional lines for a record can be specifies by #3, #4, etc.. Using the "#" notation to identify line numbers is the clearest way of reading multiple lines per record. However, it is also possible to indicate a subsequent line with the symbol "/". (LEC14P12.SAS)
- There are several other features of the program that are worthy of comment.
- First, note that the variables for ID and LINE are assigned different names for the first and second line. If different variable names were not used, only the value of the variables from the second line would be written to the SAS data set.
- Second, note the correspondence between the way in which the height variable is entered in the ASCII file, and the variable is read by the INPUT statement. A height of 68.4 inches is entered in the ASCII file as 684 in columns 8-10. The decimal point is not entered. When reading this value into SAS, the number of columns for the variable is first indicated (3), followed by the number of columns that occur to the right of the decimal (1). Thus, the decimal point is automatically replaced for height in the SAS data set with the INPUT statement:
- @8 ht 3.1
- Third, note the manner in which the Last Name and First Name are read in the second line. These data are entered as LIST input, with two spaces separating the Last from the First names. The LIST input format allows embedded blanks in the names, and successfully separates the first from last names.
Frequently, data will be recorded that use a special format that combines numeric and character values. Two common examples are dates and dollar amounts. Dates are often reported with month, day, and year separated by "/" or by "-". For example, October 12, 1995 may be reported as:
10/12/95 or 10-12-95 or 10/12/1995
while May 4, 1949 may be reported as
5/5/49 or 5/04/1949 or ...
In a similar manner, dollar amounts may include commas. For example, annual income may be reported as
$20,000 or 20000 or $20000
Such data are recorded with a combination of numbers and characters, although the variables are inherently numeric. To read such data successfully, the special format in which the data is recorded must be recognized, and the values converted to numeric variables. The numeric variables representing dollar amounts correspond directly to the dollar amounts (without the commas). The numeric variables representing dates are to some extent arbitrary since the value of day=0 needs to be defined. SAS defines day "0" as Jan 1, 1960, and thus converts all dates to the number of days that follow this date, or precede this date. Thus, the SAS numeric value for Jan 3, 1960 is 2.
INFORMATS are associated with variables with a special format that are to be read into a SAS DATA set. Their purpose is to recognize the special format of the variables. INFORMATS can be specified in two ways:
In order for SAS to recognize a date and convert it to a numeric value, the date must be entered following some set of rules. Different rules can be used, and a choice must be made as to which rule will be followed when entering the dates in an ASCII data set.
- Example: LEC14P13.SAS with OUTPUT.
A more complete description is given in Chapter 13 of the SAS Language Reference Manual.
|
|
Produced and maintained by the
Dept
of BioEpi at UMASS |