Practical Data Management and Statistical Computing (BioEp691F)

Contacts

Outline
Assignments

Resources


Outline: Lec11 Lec12 Lec13 lec14 Lec15 Lec16 Lec17 Lec18 Lec19 Lec20
Lectures: Lec11 Lec12 Lec13 Lec14 Lec15 Lec16 Lec17 Lec18 Lec19 Lec20


Lecture 13


More on List Input

 A simple way to read data into SAS is using LIST input format. While simple to use, this way or reading data is also the easiest way to "miss" read data. LIST input format should be used only when other approaches to reading the data fail, or for simple short data files. Data should be entered so that other methods of reading data can be used (such as in fixed columns). For routine creation and reading of data files, LIST input format should NOT be used. Instead, Column input format should be used.

Basics:


1. Problems with List Input

Several problems can occur when reading character variables using LIST input format. These problems can be resolved using additional SAS commands, but may be a source of frustration. We illustrate these problems, and show how the problems can be resolved.

Notice that while the OUTPUT produced by this program is closer to the representation of the true data, some of the data are still misread. Often, the only real way to make sure that such mistakes do not occur is to type the missing values whenever they occur. Of course, if you have control over how data are entered into an ASCII file, then choosing column input over list input makes the problems simpler to resolve.


2. Problems Reading Character Variables with List Input Format

Several problems can occur when reading character variables using LIST input format. These problems can be resolved using additional SAS commands, but may be a source of initial frustration. We illustrate these problems, and show how the problems can be resolved.


3. Reading Data in LIST input when deliminators are not Blanks, and Telephone numbers

Sometimes an ASCII data set may come with variable values separated by commas instead of blanks. For example, when data are entered in an EXCEL spreadsheet (as for example, EXCEL1.XLS), the results may be saved in comma-deliminated format (using the SAVE-AS option with CSV (OS/2 or MS-DOS) *.csv) ( excel1.csv). Note that each variable is separated by a comma instead of a space.

 


4. Variations on Reading Data With Column Input

It is best to organize ASCII files so that the resulting data can be read into SAS using Column input. The term "Column input" simply means that data are store in specified columns. Column input enables missing values to be easily identified, and allows for more flexible character variable input. When creating ASCII files, thought should be given so that column input can be used to read the files into SAS.


5.Using Column Input to Read Data from Multiple Lines Into A Single Record

Large numbers of variables may be collected in a study. It is not unusual to have 150 to 300 variables as part of a data set. The values of these variables may be listed on a single line, but when many variables exist, the length of the line will become very long. This creates no problems (since the LRECL options can be used with the INFILE command to recogonize line lengths over 256 columns), except for the fact that only about 80 columns can be viewed at any one time on a computer screen, or printed on a printer in a legible font. To avoid these problems, data are often read on several lines using fewer columns on each line. It is generally a good idea to keep line lengths less than 140 columns to facilitate printing and viewing data on the computer screen.

As an example, suppose that values for 100 variables are to be entered, with each value requiring five columns. If all variables are written on the same line, the line will have a length of 500 columns. An alternative way of entering the data is to include 20 variables on each of five lines, with values for the first 20 variables on the first line, etc. Each line of data will now contain 100 columns of data.

When data are entered on multiple lines, we need to define terms that distinguish the "line" of data from the "record" in the data file. For example, 5 lines of data in the previous example will correspond to data for a single record. Clearly, the values recorded in the same set of columns on different lines may differ, and will refer to different variables. It is important that the sets of 5 records be kept together so that they refer to responses for the same "record".

In practice, this linking of lines for a given record is accomplished by defining a variable for each line that has a common value. This ID variable, which is unique for each record, links the lines to the records. In addition, since the lines in the record contain different variables, it is important to include an additional variable on each line that identifies the line number. With an ID and line number variable, the values on a line can be clearly associated with variables and an ID.

  • A Simple Example Reading Multiple Lines Per Record: LEC14P11.SAS

A blood pressure screening program has collected data from Screening Days over a one month period. The screenings were performed on certain days in the lobby of local area banks. Data collected at the screening included the values of subject's weight and height. Blood pressure data were automatically recorded by a blood pressure device, and will be later combined with the other data. In addition, the subjects name and address were recorded. Each subject was assigned an ID number at the site. The data have been entered in an ASCII file with two records per subject. The first record contains the following variables:

Variable

Columns

Description

ID

1-5

Line #

6

Ht (inches)

8-10

Reported without decimal point to 10th of inch

Wt (lbs)

11-13

while the second record contained the name and address variables:

Variable

Columns

Description

ID

1-5

Line #

6

Name (last, first)

7-37

Street #

38-41

Street Name

63-60

Data on more than one line are indicated in SAS by #1 and #2 in the INPUT statement. The resulting data are listed in the OUTPUT.

Additional lines for a record can be specifies by #3, #4, etc.. Using the "#" notation to identify line numbers is the clearest way of reading multiple lines per record. However, it is also possible to indicate a subsequent line with the symbol "/". (LEC14P12.SAS)

  • There are several other features of the program that are worthy of comment.
    • First, note that the variables for ID and LINE are assigned different names for the first and second line. If different variable names were not used, only the value of the variables from the second line would be written to the SAS data set.
    • Second, note the correspondence between the way in which the height variable is entered in the ASCII file, and the variable is read by the INPUT statement. A height of 68.4 inches is entered in the ASCII file as 684 in columns 8-10. The decimal point is not entered. When reading this value into SAS, the number of columns for the variable is first indicated (3), followed by the number of columns that occur to the right of the decimal (1). Thus, the decimal point is automatically replaced for height in the SAS data set with the INPUT statement:
      • @8 ht 3.1
    • Third, note the manner in which the Last Name and First Name are read in the second line. These data are entered as LIST input, with two spaces separating the Last from the First names. The LIST input format allows embedded blanks in the names, and successfully separates the first from last names.


6. Using Informats for Dates and Commas with Column Input

  Frequently, data will be recorded that use a special format that combines numeric and character values. Two common examples are dates and dollar amounts. Dates are often reported with month, day, and year separated by "/" or by "-". For example, October 12, 1995 may be reported as:

10/12/95 or 10-12-95 or 10/12/1995

while May 4, 1949 may be reported as

5/5/49 or 5/04/1949 or ...

In a similar manner, dollar amounts may include commas. For example, annual income may be reported as

$20,000 or 20000 or $20000

Such data are recorded with a combination of numbers and characters, although the variables are inherently numeric. To read such data successfully, the special format in which the data is recorded must be recognized, and the values converted to numeric variables. The numeric variables representing dollar amounts correspond directly to the dollar amounts (without the commas). The numeric variables representing dates are to some extent arbitrary since the value of day=0 needs to be defined. SAS defines day "0" as Jan 1, 1960, and thus converts all dates to the number of days that follow this date, or precede this date. Thus, the SAS numeric value for Jan 3, 1960 is 2.

INFORMATS are associated with variables with a special format that are to be read into a SAS DATA set. Their purpose is to recognize the special format of the variables. INFORMATS can be specified in two ways:


6. Reading Dates

In order for SAS to recognize a date and convert it to a numeric value, the date must be entered following some set of rules. Different rules can be used, and a choice must be made as to which rule will be followed when entering the dates in an ASCII data set.

A more complete description is given in Chapter 13 of the SAS Language Reference Manual.



Produced and maintained by the Dept of BioEpi at UMASS
Send comments or questions about this web site to Ed Stanek
Email:
stanek@schoolph.umass.edu
\be691f\web\webready\lec13.html
Lst Update: 11/2/99