|
|
|
|
1. Overview
Once SAS data sets have been created, they can be used to create new SAS data sets, combine data sets, or create new variables in a specific manner. SAS data sets are managed most often in a SAS DATA step. SAS data sets can also be managed using SAS PROCEDURE (PROC) steps. There are three principal methods in which data sets are managed with DATA steps (Table 1). We will discuss use of the SET and MERGE statements here.
Table 1. Statements for Managing SAS Data
files in a DATA
STEP
In addition, there are four principal methods in which data sets are managed using PROCEDURES. The SAS Procedures that combine data are summarized in Table 2. We will not discuss these procedures here.
Table 2. Statements for Managing SAS Data files with PROCedures.
We focus on combining data using a DATA step with either a SET or MERGE statement. We do not illustrate use of the UPDATE statement or other procedures that can be used to combine or manipulate SAS data sets. In some production settings, these other procedures are useful. In research study settings, data are usually combined using simple SET or MERGE statements.
We illustrate combining data sets in the context of a simple example. Suppose a study is conducted among women to reduce dietary fat intake. Such a study (called the Women's Health Initiative) is currently ongoing, with the intention of enrolling 160,000 women and following them over a 10 year period. We consider example data read by lec21p1.sas that may result from this study at three times:
baseline
1 month visit, and
2 month visits
on a group of women.
A total of 5 subjects are included in the study. Of these subject, three subjects have measures at 1 month (although one of these subjects (ID=1) has two measures, with one at each clinic), while 4 subjects have measures at 2 months.
A simple way to combine SAS data sets is to concatenate the files. This is accomplished with a SET statement. While such combinations of data are simple, the resulting data set may not be very useful. We illustrate such an example by concatenating the data for the three measures of Fat intake using LEC21P2.SAS, resulting in the following data.
The listing of the data illustrates the problems that occur when a simple concatenation of data is used. Missing values are created for variables that are not found in other data sets. In some cases, the missing values make some sense (as for example for AGE), where age is a subject characteristic, not a repeated measure. Still, the representation of AGE=. for follow-up times other than baseline does not really imply that the subject's age is unknown. In contrast, the missing values created for FAT0, FAT1, and FAT2 occur simply because the variables have been assigned different names in the data sets.
There may be interest is simple lists of data for each subject. Such lists are self-explainatory. The lists may be constructed by simply sorting the concatenated data by subject and follow-up date, and then printing the data. Such a listing is given by LEC21P4.SAS. Alternatively, if the files that are to be combined into a single data set are first sorted by the ID variable, then the records can be INTERLEAVED by including a "BY" statement along with the SET statement in the DATA step. We illustrate the INTERLEAVING of data sets in the same program resulting in the following identical output.
In the lists given above, age is missing for all records other than the first record for a subject. We can retain the age of a subject with a RETAIN statement in the DATA step. Such a statement will retain the value of a variable (ageyr) from the previous record for the next record. We tell SAS to change the value of ageyr by identifying the first occurance of a variable in the data (after sorting by that variable). The FIRST.var (and LAST.var) are automatic SAS variable that may be used after including a BY statement in a previously sorted SAS data set. These variables are indicator variables that have a value of 1 for the record that corresponds to the FIRST (or LAST) occurance of the value of the var. The program LEC21P5.SAS illustrates these steps, with the output given.
The SET (or MERGE) statement in a SAS DATA step can create variables that identify the source of a given record called IN variables. These variables are created as temporary variables and are not kept in the resulting SAS data set. We illustrate this with LEC21P6.SAS, resulting in a similar output . These variables can be made permanent by assigning a new variable to the IN variable name.
Concatenating data is valuable for descriptive analyses, and some other statistical analyses (such as plots, and repeated measures analyses). However, simple differences can be constructed if the data are arranged with one record per subject. We illustrate this arrangement here via the MERGE statement in the DATA step.
The first step in MERGING data is to place each data set in the same order. Usually, data will be combined making use of a matching variable (such as ID). When such a variable is not present, the order of the data in each separate data file will dicatate of final result. Since the order may not match like subjects from different data sets, only match merging should be routinely used. We merge the data sets using the program LEC21P7.SAS. Note that the data sets are first SORTed by the same common variable, ID. In the data step, the MERGE statement does not result in a MATCH MERGE unless it is once again accompanyied by the BY variable indicating the matching.
It is valuable to examine how the DATA step handles multiple records in one or another of the data sets. Note that in the MTH1 data set, there are two records for ID=1. These two records correspond to ID=1 having an intake measures at each clinic at 1 month. When combining data, a complete record (with all three fat intake measures) is created for each occurance of a duplicate record. Thus, the values of FAT0 and FAT2 are repeated for each value of FAT1 for ID=1 as shown in the resulting MERGEd data.
A closer examination of the original data, and the combined data results in some question raised as to the coding of CLINIC in the final data set. In the original data set, the subjects 1-3 were observed in CLINIC #1 at the initial visit. In addition, subject 1 and 2 were observed in CLINIC 1 at the 1 month visit, with only subject 2 observed at the 2-month visit.
The combined data set has only subject 2 reported at from CLINIC 2. This illustrates that in a MERGE statement, SAS will take the value of a variable from the last data set listed in a MERGE statement as the value output to the data set. For subjects 1,2,3, and 5, this last value is from D2, the 1-month data. For ID=4, no records were observed at month 1 or 2, so the CLINIC value is the value from the baseline data.
Suppose we wish to compare data for subjects where fat intake measures were made at the same clinic. We can obtain such a data set by sorting data first by clinic, and then merging by CLINIC and ID as shown in LEC21P10.SAS with the following results.
When combining data sets, it is possible to combine data sets using more than one matching variable. In the example data, two subjects (ID=1 and ID=3) have fat intake measured at different clinics. Suppose that we want to restrict the data to measures of FAT intake that were taken at the same clinic as the baseline record. When merging several data sets, special variables are automatically created in SAS that identify the presence of a record from the data set. These variables are created as temporary variables, and are normally not retained in the resultant data step. The values of these variables may be retained if they are assigned a variable name using an assignment statement. We first illustrate the manner in which the variables are created in LEC21P11.SAS with the results given in the table.
Variables created via the IN option are assigned as values of permanent variables. The variables a, b, and c are temporary variables with values of 0 (if the data set does not contribute a record) or 1 (if the data set does contribute a record), while the variables A1, B1, and C1 are permanent, and kept in the new SAS data set.
SAS automatically generates another set of variables that are often useful when combining data. These variables are generated whenever the data set has been previously sorted. The variables have values of "1" for the record representing the first occurrence of a sort group (for FIRST.), and a value of "0" for all other records. LAST. variables are created in a similar manner, where a value of "1" is assigned to the last record in a sort group.
FIRST and LAST variables are identified in SAS by the special SAS key word, FIRST.var or LAST.var, where "var" represents the variable that defines the sort categories.
We illustrate use of FIRST and LAST variables in the context of the FAT study. Suppose that clinics differ in the quality of measurement. For example, higher quality measures may be obtained from "core" clinics, with satellite clinics having lower quality scores. If this is the case, and assuming that a subject's fat intake may be measured at more than one clinic at a given time, then we may only want to use the fat intake values from the highest quality clinic. We add a quality variable to the clinics in the Baseline data, and then illustrate how FIRST and LAST variables can be used to obtain a data set with one record per subject, where the fat measures come only from the highest quality clinics. The program LEC21P13.SAS manipulates the data to produce these results.
|
|
Produced and maintained by the
Dept
of BioEpi at UMASS |