This document discusses the more frequently used data management techniques for working with and reshaping SAS data sets. Specifically the following points are discussed:

  • Conditional execution of groups of statements
  • Variable selection and deletion
  • Sorting SAS data set
  • Concatenation, interleaving, and merging of data sets

Part 1: The DO And END Statements

The two purposes of the DO and END statements are:
a) to execute several statements when certain conditions are specified, and
b) to execute certain statements in repetition using an index variable.

Form: The general form for condition (a) is:

IF expression THEN DO;
<< other SAS statements >>
END;

The general form for condition (b) is:
DO index variable = start TO stop BY increment;

<< other SAS statements >>

END:

Note: The BY increment phrase is optional.

Example 1: IF State = ‘NC’ THEN DO;
Adjpop = Pop / Area;
Name = ‘North Carolina’;
END;

Example 2: DO i = 1 TO 50;
y = sqrt(x) + (i * 10);
Output;
END;

Result: Example 1, will provide ADJPOP and NAME with values only in the event that the value of STATE is NC at the current observation. Example 2, will create 50 new observations for variable Y.

Part 2: The DROP And KEEP Statements

The purpose of these statements is to indicate which variables are to be included in the SAS data set that is being created. The DROP statement identifies variables to be excluded from the data set. Alternatively, the KEEP statement identifies variables to be included in the data set.

Form: DROP variable list;
-or-
KEEP variable list;

Notes: These statements are one of many SAS statements which can be used in both the DATA and the PROC step.

However, the DROP and KEEP statements are mutually exclusive. That is, both statements cannot be used together in the same DATA or PROC step.

In the DATA step they control which variables are stored in the data set being created.

In the PROC step they are valid only for the duration of the step and are equivalent to using the VAR statement.

Example 1: The DROP Statement

DATA Census;
INPUT State $ Pop Area;
PopSqm = Pop * 1000 / Area;
DROP Pop Area;

Example 2: The KEEP Statement

DATA Hischool College;
INPUT Name $ Age Yred;
IF Yred LE 12 THEN OUTPUT Hischool;
ELSE OUTPUT College;
KEEP Name Age;

Result: In Example 1, four variables are created but the variables POP and AREA are dropped. That is, the data set will retain only variables STATE and POPSQM. In Example 2, two data sets are created simultaneously. Each with three variables. However, the variables NAME and AGE will be retained and YRED will be dropped.

Part 3: Sample Data Set For The SET And MERGE Statements

The following data sets are defined for the discussion on the SET and MERGE statements.

Data CensusA;
Input State Pop;
Cards;
NC 5.08
FL 6.78
TN 3.92

Data CensusB;
Input State Area;
Cards;
FL 58560
TN 42244

Remarks: Both data sets have two variables.
Of which only one variable (STATE) is common to both data sets.
CensusA has POPulation data for three unique states.
However, CensusB has data on the AREA for only two of the three states.

Goal: The programmer’s goal is to join these two data sets into one so as to be able to perform statistical analysis on the combined data. As shown in the examples below the programmer should use match-merging. However, before we discuss match-merging it would be helpful to visualize the different forms of SET and MERGE and why they should not be used for the specified goal.

Part 4: The SET Statement

The SET statement is used to instruct SAS to read data from one or more SAS data sets. Recall: SAS data sets are not ASCII data sets. With the use of the SET statement the programmer can perform any one of the following data set manipulations:

* Reusing previously created data
* Concatenation
* Interleaving

The form of the SET statement follows.

Form: SET sasdataset1 sasdataset2 …;

Examples Of SET

Example 1: Reusing Data

In a previous Handout we discussed how to create SAS data sets. The question now is, “How does one re-use these SAS data sets at a later stage ?” The answer is simple. Use the SET statement.

Form: DATA Reuse;
SET CensusA; * temporary data set is CENSUSA;

-or-

LIBNAME sasd ‘D:’;

DATA Reuse;
SET sasd.censusa; * permanent data set is censusa.sas7bdat on drive d

Example 2: Concatenation

Two or more data sets are said to be concatenated when they are stacked together to form one large data set.

Form: DATA Both;
SET CensusA CensusB;

Result: State Pop Area
FL 6.78 .
NC 5.08 .
TN 3.92 .
FL . 58560
TN . 42244

Remark: All information contained in CensusA is stacked before information in CensusB.
Use this method when reading data for the same set of variables but different observations.

Example 2: Interleaving

If a BY statement follows the SET statement the resulting data set will be interleaved by the value of the BY variable. Note: The data sets to be interleaved must already be sorted by the variable(s) listed in the BY statement.

Form: DATA Both;
SET CensusA CensusB;
BY State;

Result: State Pop Area
FL 6.78 .
FL . 58560
NC 5.08 .
TN 3.92 .
TN . 42244

Remarks: The data set is sorted in ascending order by the values of STATE. Use this method when reading data for the same set of variables but different observations and when the resultant data set needs to be in some sort of sorted order.

Part 5: The MERGE Statement

The matching of observations across two or more data sets is referred to as merging. There are two types of merge operations. These are:
* One-to-One Merging
* Match Merging

Examples Of MERGE

Example 1: One-to-One Merging

In a one-to-one merge, if the two data sets to be merged have a common variable but it has a different value in both the data sets, the last value read will be the one to appear in the new data set. The number of variables in the new data set will equal to the sum of unique variables in the data sets being merged. The number of observations will equal to the number of observations of the largest data set (in terms of observations).

Form: DATA Both;
MERGE CensusA CensusB;

Result: State Pop Area
FL 5.08 58560
TN 6.78 42244
TN 3.92 .

Remarks: The new data set has 3 variables since STATE is common in both data sets.
The new data set has 3 observations since CensusA has three observations.
The new data set has matched up NC’s population with FL’s area.
The new data set has matched up FL’s population with TN’s area.
The new data set has no observation for the state of NC but two observations for TN

In short this usage of the MERGE statement is faulty.

This method should be used only when the original data sets have different variables and preferably the same number of observations.

Example 2: Match Merging

Controls which observations are matched. Match-merge requires that at least one variable is common to each data set. And that the data sets in use are sorted by the common variable.

Form: DATA Both;
MERGE CensusA CensusB; BY State;

Result: State Pop Area
FL 6.78 58560
NC 5.08 .
TN 3.92 42244

Remarks: The new data set has 3 variables and 3 observations.
Since the BY statement was used the observations are properly aligned for the states in question.

Part 6: Creating ASCII Files From A SAS Data Set

Often it is necessary to export data from a SAS data set to an ASCII file which may be transported to other software systems. This is achieved using the FILE and PUT Statements.

The FILE Statement

Purpose: Directs SAS to create an external data file.
Form: FILE ‘path_and_file_name’;
or
FILENAME fileref ‘path_and_file_name’; *before the DATA statement;
FILE fileref;
Keyword: FILE

The PUT Statement

Purpose: Describes to SAS the format of the ASCII data.
Form: PUT variable names variable formats;
Keyword: PUT
NamingRules: Limited to 32 characters. First character must be alphabetic or and underscore ‘_’.

The entire variable name must contain only letters, numbers, or underscore.

Notes: As a general rule all variables in SAS are assumed to be numeric. If a variable has character values you must place a $ sign after the variable’s name in the PUT statement in LIST output.

The default missing data value is a dot (period).
Two additional pointer directions available in the PUT statement are:

_PAGE_ Prints the current page and begins a new page

OVERPRINT causes the current line to be printed over the previous line. This option is useful for underlining text by overprinting underscores.
Example: Observe that the FILE and PUT statements use all of the same syntax as INFILE and INPUT. For example, suppose you wish to create an Ascii data set using the previously defined BUDGET data set of handout 1. You would then write the following SAS program:

DATA _NULL_;
SET Budget; /* bring in budget */
FILE ‘a:revenue.out’; /* create file: revenue.out on drive A */
PUT Name $ Dept $ Revenue; /* write values of the variables */