Proc Merge In Sas

 admin

The SQL implementation within the SAS System. Prior to the availability of PROC SQL in Version 6.0 of the SAS System, DATA step logic and a few utility procedures were the only tools available for creating, joining, sub-setting, transforming, and sorting data. You might want to overlay and compare ROC curves from multiple predictive models (either from PROC LOGISTIC or from other sources). PROC LOGISTIC can do that as well. You just need to merge the various predicted probabilities into a single SAS data set and then specify multiple ROC statements, as follows.

Previous PageNext Page
Merging SAS Data Sets
Merging with a BY Statement

Merging with a BY statement enables youto match observations according to the values of the BY variables that youspecify. Before you can perform a match-merge, all data sets must be sortedby the variables that you want to use for the merge.

In order to understand match-merging, you must understandthree key concepts:

BY variable

is a variable named in a BY statement.

BY value

is the value of a BY variable.

BY group

is the set of all observations withthe same value for the BY variable (if there is only one BY variable). Ifyou use more than one variable in a BY statement, then a BY group is the setof observations with a unique combination of values for those variables. Indiscussions of match-merging, BY groups commonly span more than one data set.

Input SAS Data Set for Examples

For example, the director of a small repertory theatercompany, the Little Theater, maintains company records in two SAS data sets,COMPANY and FINANCE.

Data SetVariableDescription
COMPANYNameplayer's name
Ageplayer's age
Genderplayer's gender
FINANCENameplayer's name
IdNumberplayer's employee ID number
Salaryplayer's annual salary

The following program creates,sorts, and displays COMPANYand FINANCE:

The following output displays the data sets. Notice that the FINANCEdata set does not contain an observation for Michael Morrison.

The Program

To avoid having to maintain two separate data sets,the director wants to merge the records for each player from both data setsinto a new data set that contains all the variables. The variable that iscommon to both data sets is Name. Therefore, Name is the appropriate BY variable.

The data sets are already sorted by NAME, so no furthersorting is required. The following program merges them by NAME:

The following output displays the merged data set:

Explanation

The new data set contains one observation for each playerin the company. Each observation contains all the variables from both datasets. Notice in particular the fourth observation. The data set FINANCE doesnot have an observation for Michael Morrison. In this case, the values ofthe variables that are unique to FINANCE (IdNumber and Salary) are missing.

Sas Match Merge

Match-Merging Data Sets with Multiple Observations in a BY Group

Input SAS Data Set for Examples

The Little Theaterhas a third data set, REPERTORY, that tracks the casting assignments in eachof the season's plays. REPERTORY contains these variables:

Play

is the name of one of the plays inthe repertory.

Role

is the name of a character in Play.

IdNumber

is the employee ID number of theplayer playing Role.

The following program creates and displays REPERTORY:

The following output displays the REPERTORY data set:

To maintain confidentiality during preliminary casting,this data set identifies players by employee ID number. However, casting decisionsare now final, and the manager wants to replace each employee ID number withthe player's name. Of course, it is possible to re-create the data set, enteringeach player's name instead of the employee ID number in the raw data. However,it is more efficient to make use of the data set FINANCE, which already containsthe name and employee ID number of all players (see The COMPANY and FINANCE Data Sets). When the data sets aremerged, SAS takes care of adding the players' names to the data set.

Of course, before you can merge the data sets, you mustsort them by IdNumber.

The following output displays the FINANCE and REPERTORY datasets, sorted by IdNumber:

These two data sets contain seven BY groups; that is,among the 23 observations are seven different values for the BY variable,IdNumber. The first BY group has a value of 029-46-9261 for IdNumber. FINANCEhas one observation in this BY group; REPERTORY has two. The last BY grouphas a value of 929-75-0218 for IdNumber. FINANCE has one observation in thisBY group; REPERTORY has three.


The Program

The following program merges the data sets FINANCE andREPERTORY and illustrates what happens when a BY group in one data set hasmore observations in it than the same BY group in the other data set.

The resulting data set contains all variables from bothdata sets.

Note: The OPTIONS statement extends theline size to 120 so that PROC PRINT can display all variables on one line.Most output in this section is created with line size set to 76 in the OPTIONSstatement. An OPTIONS statement appears only in examples using a differentline size. When you set the LINESIZE= option, it remains in effect untilyou reset it or end the SAS session.

The following output displays the merged data set:


Merge

Explanation

Carefully examine the first few observations in thenew data set and consider how SAS creates them.

  1. Before executing the DATA step, SAS reads thedescriptor portion of the two data sets and creates a program data vectorthat contains all variables from both data sets:

    • IdNumber, Name, and Salary from FINANCE

    • Play and Role fromREPERTORY.

    IdNumber is already in the program data vector because itis in FINANCE. SAS sets the values of all variables to missing, as the followingfigure illustrates.

    Program Data Vector before Reading from Data Sets

  2. SAS looks at the first BY group in each data setto determine which BY group should appear first. In this case, the first BYgroup, observations with the value 029-46-9261 for IdNumber, is the same inboth data sets.

  3. SAS reads and copies the first observation fromFINANCE into the program data vector, as the next figure illustrates.

    Program Data Vector after Reading FINANCE Data Set

  4. SAS reads and copies the first observation fromREPERTORY into the program data vector, as the next figure illustrates. Ifa data set does not have any observations in a BY group, then the programdata vector contains missing values for the variables that are unique to thatdata set.

    Program Data Vector after Reading REPERTORY Data Set

  5. SAS writes the observation to the new data setand retains the values in the program data vector. (If the program data vectorcontained variables created by the DATA step, then SAS would set them to missingafter writing to the new data set.)

  6. SAS looks for a second observation in the BY groupin each data set. REPERTORY has one; FINANCE does not. The MERGE statementreads the second observation in the BY group from REPERTORY. Because FINANCEhas only one observation in the BY group, the statement uses the values ofName (Rudelich , Herbert) and Salary (35000) retained in the programdata vector for the second observation in the new data set. The next figureillustrates this behavior.

    Program Data Vector with Second Observation in the BY Group

  7. SAS writes the observation to the new data set.Neither data set contains any more observations in this BY group. Therefore,as the final figure illustrates, SAS sets all values in the program data vectorto missing and begins processing the next BY group. It continues processingobservations until it exhausts all observations in both data sets.

    Program Data Vector before New BY Groups

Match-Merging Data Sets with Dropped Variables

Now that casting decisions are final,the director wants to post the casting list, but does not want to includesalary or employee ID information. As the next program illustrates, Salaryand IdNumber can be eliminated by using the DROP= data set option when creatingthe new data set.

Note: The difference in placement of thetwo DROP= data set options is crucial. Dropping IdNumber in the DATA statementmeans that the variable is available to the MERGE and BY statements (to whichit is essential) but that it does not go into the new data set. Dropping Salaryin the MERGE statement means that the MERGE statement does not even read thisvariable, so Salary is unavailable to the program statements. Because thevariable Salary is not needed for processing, it is more efficient to preventit from being read into the PDV in the first place.

The followingoutput displays the merged data set without the IdNumber and Salary variables:
Match-Merging Data Sets with the Same Variables

You canmatch-merge data sets that contain the same variables(variables with the same name) by using the RENAME= data set option, justas you would when performing a one-to-one merge (see Performing a One-to-One Merge on Data Sets with the Same Variables).

If you do not use the RENAME= option and avariableexists in more than one data set, then the value of that variable in the lastdata set read is the value that goes into the new data set.

Match-Merging Data Sets That Lack a Common Variable

You can name any number of datasets in the MERGE statement. However, if you are match-merging the data sets,then you must be sure they all have a common variable and are sorted by thatvariable. If the data sets do not have a common variable, then you mightbe able to use another data set that has variables common to the originaldata sets to merge them.

Set

For instance, consider the data sets that are used inthe match-merge examples. The table that follows shows the names of the datasets and the names of the variables in each data set.

Data SetVariables
COMPANYName, Age, Gender
FINANCEName, IdNumber, Salary
REPERTORYPlay, Role, IdNumber

These data sets donot share a common variable. However,COMPANY and FINANCE share the variable Name. Similarly, FINANCE and REPERTORYshare the variable IdNumber. Therefore, as the next program shows, you canmerge the data sets into one with two separate DATA steps. As usual, you mustsort the data sets by the appropriate BY variable. (REPERTORY is already sortedby IdNumber.)

In order to merge the three data sets, this program

  • sorts FINANCE and COMPANY byName

  • merges COMPANY and FINANCE into a temporary dataset, TEMP

  • sorts TEMP by IdNumber

  • merges TEMP and REPERTORY by IdNumber.

The followingoutput displays the resulting data set, ALL:
Previous PageNext PageTop of Page
Previous PageNext Page
Statements

Joins observations from two or more SAS data sets into a singleobservation.
in a DATA step
File-handling
Executable
Syntax
Arguments
Details
Overview
Using Data Set Lists with MERGE
One-to-One Merging
Match-Merging
Comparisons
Examples
Example 1: One-to-One Merging
Example 2: Match-Merging
Example 3: Merging with a Data Set List
See Also

Syntax

MERGE SAS-data-set-1 <(data-set-options)>
SAS-data-set-2 <(data-set-options) >
<...SAS-data-set-n<(data-set-options)>>
<END=variable>;

Arguments

SAS-data-set

specifies at least two existing SAS datasets from which observations are read. You can specify individual data sets,data set lists, or a combination of both.

You can specify additionalSAS data sets.
Using Data Set Lists with MERGE
(data-set-options)

specifies one or more SAS data set optionsin parentheses after a SAS data set name.

The data set optionsspecify actions that SAS is to take when it reads observations into the DATAstep for processing. For a list of data set options, see Data Set Options by Category.
Data set options that apply to a data setlist apply to all of the data sets in the list.
END=variable

names and creates a temporary variable thatcontains an end-of-file indicator.

The variable, whichis initialized to 0, is set to 1 when the MERGE statement processes the lastobservation. If the input data sets have different numbers of observations,the END= variable is set to 1 when MERGE processes the last observation fromall data sets.
The END= variable is notadded to any SAS data set that is being created.
Details

Overview

The MERGE statementis flexible and has a variety of uses in SAS programming. This section describesbasic uses of MERGE. Other applications include using more than one BY variable,merging more than two data sets, and merging a few observations with all observationsin another data set.

For more information, see How to Prepare Your Data Sets in SAS Language Reference: Concepts.


Using Data Set Lists with MERGE

Proc Join In Sas

You canuse data set lists with the MERGE statement. Data set lists provide a quickway to reference existing groups of data sets. These data set lists must beeither name prefix lists or numbered range lists.

Name prefix lists refer to all data sets that begin witha specified character string. For example, merge SALES1:; tells SAS to merge all data sets starting with 'SALES1'such as SALES1, SALES10, SALES11, and SALES12.

Numbered range lists require you to have a series of datasets with the same name, except for the last character or characters, whichare consecutive numbers. In a numbered range list, you can begin with anynumber and end with any number. For example, these lists refer to the samedata sets:

Note: If the numeric suffix of the first data setname contains leading zeros, the number of digits in the numeric suffix ofthe last data set name must be greater than or equal to the number of digitsin the first data set name. Otherwise, an error will occur. Forexample, the data set lists sales001-sales99 and sales01-sales9will cause an error. The data set list sales001-sales999 is valid. Ifthe numeric suffix of the first data set name does not contain leading zeros,the number of digits in the numeric suffix of the first and last data setnames do not have to be equal. For example, the data set list sales1-sales999is valid.

Some other rules to consider when using numbered data set lists areas follows:

  • You can specify groups of ranges.

  • You can mix numbered range lists with name prefix lists.

  • You can mix single data sets with data set lists.

  • Quotation marks around data set lists are ignored.

  • Spaces in data set names are invalid. If quotation marks are used,trailing blanks are ignored.

  • The maximum numeric suffix is 2147483647.

  • Physical pathnames are not allowed.


One-to-One Merging

One-to-one mergingcombines observations from two or more SAS data sets into a single observationin a new data set. To perform a one-to-one merge, use the MERGE statementwithout a BY statement. SAS combines the first observation from all data setsthat are named in the MERGE statement into the first observation in the newdata set, the second observation from all data sets into the second observationin the new data set, and so on. In a one-to-one merge, the number of observationsin the new data set is equal to the number of observations in the largestdata set named in the MERGE statement. See Example 1 for an example of a one-to-onemerge. For more information, see Reading, Combining, and ModifyingSAS Data Sets in SAS Language Reference: Concepts.

CAUTION:
Use care when you combine data sets with a one-to-onemerge.

One-to-one merges can sometimes produceundesirable results. Test your program on representative samples of the datasets before you use this method.

Sas

Match-Merging

Match-merging combines observationsfrom two or more SAS data sets into a single observation in a new data setaccording to the values of a common variable. The number of observations inthe new data set is the sum of the largest number of observations in eachBY group in all data sets. To perform a match-merge, use a BY statement immediatelyafter the MERGE statement. The variables in the BY statement must be commonto all data sets. Only one BY statement can accompany each MERGE statementin a DATA step. The data sets that are listed in the MERGE statement mustbe sorted in order of the values of the variables that are listed in the BYstatement, or they must have an appropriate index. See Example 2 for an exampleof a match-merge. For more information, see Reading, Combining,and Modifying SAS Data Sets in SAS Language Reference: Concepts.

Note: The MERGE statement does not produce a Cartesian product on amany-to-many match-merge. Instead it performs a one-to-one merge while thereare observations in the BY group in at least one data set. When all observationsin the BY group have been read from one data set and there are still moreobservations in another data set, SAS performs a one-to-many merge until allobservations have been read for the BY group.

Comparisons
  • MERGE combines observations from two or more SASdata sets. UPDATE combines observations from exactly two SAS data sets. UPDATEchanges or updates the values of selected observations in a master data setas well. UPDATE also might add observations.

  • Like UPDATE, MODIFY combines observations fromtwo SAS data sets by changing or updating values of selected observationsin a master data set.

  • The results that are obtained by reading observationsusing two or more SET statements are similar to the results that are obtainedby using the MERGE statement with no BY statement. However, with the SET statements,SAS stops processing before all observations are read from all data sets ifthe number of observations are not equal. In contrast, SAS continues processingall observations in all data sets named in the MERGE statement.

Examples

Example 1: One-to-One Merging

Merge Data In Sas Enterprise Guide

Thisexample shows how to combine observations from two data sets into a singleobservation in a new data set:


Example 2: Match-Merging

This exampleshows how to combine observations from two data sets into a single observationin a new data set according to the values of a variable that is specifiedin the BY statement:


Example 3: Merging with a Data Set List

This exampleuses a data list to define the data sets that are merged.

See Also

Statements:

Reading,Combining , Modifyingand SAS Data Sets in SAS Language Reference: Concepts

Proc Merge In Sas

Previous PageNext PageTop of Page