Introduction To Data Management and Programming in SAS
Introduction To Data Management and Programming in SAS
Introduction To Data Management and Programming in SAS
Text book
The Little SAS Book (Second Edition), Delwiche LD and Slaughter SJ, SAS Institute, Inc.,
1998. Available at the Harvard Coop. Identified as LSB.
Reference
SAS User's Guide: Basics and SAS User's Guide: Statistics.
SAS System Help (online)
Accessing external raw data sets and SAS data sets (LSB 2.8-2.10)
Create new variables and change existing variables (LSB 3.1-3.3, 3.7-3.8, 8.7-8.9)
Assignment statements............................................................................................... 37
Create new variables .................................................................................................. 37
Change existing variables ........................................................................................... 38
SAS functions .............................................................................................................39
Date functions ............................................................................................................. 40
Year 2K ....................................................................................................................... 41
Date difficulties............................................................................................................ 42
Useful SAS functions .................................................................................................. 43
Basics .........................................................................................................................71
Concatenate 2 SAS data sets ..................................................................................... 72
Interleave 2 SAS data sets: BY................................................................................... 72
Create a subset: IF ..................................................................................................... 73
Basics .........................................................................................................................74
One-to-one join ........................................................................................................... 75
One-to-many join ........................................................................................................ 76
Join with IN= option..................................................................................................... 77
"Many-to-one" join (doesn't use MERGE) ................................................................... 79
Data analysis - PROC step (LSB 7.1-7.8, SAS User's Guide: Statistics)
What is a SAS data set? a rectangular array of observations and variables with documentation.
Variables (columns)
Documentation
Observations (rows)
What is a SAS program? a series of SAS statements, each terminated by a semicolon (;),
executed in order on each observation (an implied loop with implicit output).
proc print;
run;
SAS LOG
8 ;
9 run;
10
11 proc print;
12 run;
SAS OUTPUT
1 roland 180
2 rhonda 172
BIO113 3
SAS names: You, the SAS programmer, get to name variables, arrays and data sets but there
are a few rules.
Examples:
regular list: name age sex wt bp1 bp2 bp3 bp4 bp5
'type' list: _numeric_, _character_
single dash list: bp1-bp5
double dash list: name--bp5
SAS statements: the building blocks of the SAS program. A few rules here, too.
Coding style: Develop a style that works for you. The idea is have a program that you and
people you work with can read and debug easily. Some suggestions...
When the statement extends beyond a single line, indent succeeding lines
1. Start SAS
Click Start, Statistics and SAS8 to open the SAS Windowing Environment. Three windows
(Editor, Log, Explorer) are visible when you start. Output and Results may be accessed by
clicking the appropriate tab at the bottom of the screen.
The title bar of the active window is blue while the others are gray. Click anywhere in a
window to make it the active window.
BIO113 6
The status bar is at the bottom of the screen. Error messages will appear in the area
on the left. The area on the right displays the current working folder (directory). You may
change the working folder by double-clicking this area and navigating through folders in the
usual Windows way (we will demonstrate this in class).
BIO113 7
arrow keys move you around, the End key moves you to the end of a line, Home moves you
to the beginning of a line, PageUp and PageDown move you up or down one page, Enter
moves you to the beginning of a new line. You can block areas of the program and cut and
paste in the usual way. Pay close attention to spelling and punctuation as you type the
program. If you misspell a SAS key word, it will be colored red until you correct it. Notice
that comments are green, key words are blue, data in the program are highlighted in yellow,
title text is purple, and everything else is black. In addition, rules appear to separate parts of
the program (data and proc steps). It's a very good idea to save your program before you
submit. It can be recalled from memory during this SAS session but, should you exit SAS
before saving it, it will be lost forever. Click File and Save and give the file the name
first.sas. Because your working folder is p:\bio113, that is where the file will be saved.
By the way, you could click on the save button and fill in the dialogue to save the file.
BIO113 8
visible and then on the little man in the tool bar. Text will whiz by in the LOG window
and, if the program runs successfully, output will appear in the OUTPUT window.
Your SAS program and messages from SAS are displayed in different colors (you'll see the colors
in class).
Black the program you submitted
Blue SAS notes--number of observations and processing time
Green warnings
Red errors
Green and red messages suggest that you should examine your program carefully for program
and/or logical errors. We'll review recalling, correcting and resubmitting problem programs in
class.
BIO113 9
Scroll through the output with PageUp and PageDown or by moving the bar on the right side
of the window.
Make the window you want to print active. Click File, Print and OK or click (the printer
button) on the tool bar.
Make the window you want to save active, click File and Save as or click on (the save
button) and fill in the dialogue. Name the LOG file first.log and the OUTPUT first.lst. Files
saved in this way may be imported into your favorite word processing program for formatting
and editing (notice that you cannot edit in the LOG or OUTPUT windows). You may also use
the usual copy and paste buttons or key strokes to copy LOG or OUTPUT material to a
word processing document.
BIO113 10
Example: data;
data whatever;
Where's the data? DATALINES and INFILE statements tell SAS where to find 'raw' data
Example:
Obs a b
1 11 35
2 95 7
BIO113 11
Describe the 'raw' data to SAS: The input statement begins with the SAS keyword INPUT
and ends with a semicolon (;).
1. Data values are separated by one or more spaces or by a delimiter such as a tab or
comma
2. Character data contain no imbedded spaces and are 8 or fewer characters
3. Missing data values are indicated by a dot (.)
4. There's nothing funny like a date (if so, use an INFORMAT statement, too)
5. You plan to read every variable on the data line
How? List the names you have chosen for the variables
Indicate character variables by following their names with $
Example program:
(♦ in data lines indicate blanks)
Result:
1 sam 42
Obs id a b c
1 21 1 123 4456
BIO113 12
1 90043 58 67 98 72 71
1 15 sam
2 17 william
1 2 5 41
2 11 2 54
BIO113 13
1. Data values for each variable are always in the same columns in all the data lines in
the raw data set
2. There's nothing funny like a date (INFORMAT may not be used with column input)
How? List the names you have chosen for the variables
Follow each name with the column location of the variable
Indicate character variables with $ before the column location
Example program:
(♦ in data lines indicate blanks)
1 sam 42
2 alexander 12
Obs id a b c
1 31 1 123 4456
BIO113 14
1 90043 58 67 98 72 71
1 pam 1 19 1960 18
BIO113 15
1. Data values for each variable are always in the same columns in all the data lines in the
raw data set
2. There are funny variables like dates
3. You want to reuse formatting instructions
How? List the names you have chosen for the variables
Follow each name with the format instruction for the variable (note the
'.' at the end of the format instruction)
Indicate character variables with $ before the format instruction
Use pointer instructions (@n and +n) to move to correct columns
Example program:
(♦ in data lines indicate blanks)
1 sam 42 15
Data:
data one;
infile 'p:\bio113\ex12.dat'; 311123♦4456
input id 2. a 1. b 3. @8 c 4.;
run;
proc print;
run; Result:
Obs id a b c
1 31 1 123 4456
BIO113 16
Result:
1 90043 58 67 98 72 71
Result:
Obs x1 x2 x3 y1 y2 y3 z1 z2 z3
1 a d f 3 1 2 42 65 9
2 m x p 1 8 5 11 93 85
data one;
infile 'p:\bio113\ex14.dat';
input x1 $1. y1 1. z1 2. x2 $1. y2 1. z2 2. x3 $1. y3 1. z3 2.;
run;
proc print;
run;
Result:
Obs x1 y1 z1 x2 y2 z2 x3 y3 z3
1 a 3 42 d 1 65 f 2 9
2 m 1 11 x 8 93 p 5 85
BIO113 17
MIXED input:
In most situations, you'll mix input styles, using what's convenient and requires the
least typing. Typically, you use list style or you use a mix of column and formatted
style. This example is a bit contrived to show that you might use all three styles.
Example program:
(♦ in data lines indicate blanks)
data one;
infile 'p:\bio113\ex15.dat';
input name $ age mrn 14-18 dob mmddyy8.
+1 sex $1. @33 (bp1-bp5)(2.);
run;
proc print;
run;
Data:
sam♦♦♦♦♦43♦♦♦9004301161960♦m♦♦♦♦5867987271
pam♦♦♦♦43♦♦♦♦9011301191960♦f♦♦♦♦6165807870
Result:
Obs name age mrn dob sex bp1 bp2 bp3 bp4 bp5
1 sam 43 90043 15 m 58 67 98 72 71
2 pam 43 90113 18 f 61 65 80 78 70
BIO113 18
How? Input the conditional variable and end the statement with @ (trailing at)
Test the value of the conditional variable
Execute additional input statements or delete the observation based on the
value found
run; Result:
Result:
Obs rec sex age employed
1 parent 1 33 1
2 parent 0 33 0
BIO113 19
Example program: Read different data lines with different input statements.
data one;
infile 'p:\bio113\ex17.dat';
input sex $6 @;
if sex='f'
then input name $1-4 @7 (softball fldhcky)(1.) age 9-10;
else if sex='m'
then input name $1-4 @7 (baseball soccer)(1.) age 10-11;
else delete;
Data:
run; 123456789012
proc print;
run; bob m11 10
jane f01 9
jill f1012
bill m01 12
lee 11
Result:
Obs sex name softball fldhcky age baseball soccer
1 m bob . . 10 1 1
2 f jane 0 1 9 . .
3 f jill 1 0 12 . .
4 m bill . . 12 0 1
Result:
Obs rec rooms num kids pets sex age employed kind
1 house 8 5 3 2 . . . .
2 parent . . . . 1 33 1 .
3 parent . . . . 0 33 0 .
4 child . . . . 1 10 . .
5 child . . . . 0 9 . .
6 child . . . . 1 6 . .
7 pet . . . . 0 . . 1
8 pet . . . . 0 . . 2
BIO113 20
data one;
infile 'p:\bio113\ex18.dat'; Data:
input id hr @@;
run; 1001 121 1005 93 1011
proc print; 107 1034 82
run;
OUTPUT:
Obs id hr
1 1001 121
2 1005 93
3 1011 107
4 1034 82
Read a single observation from multiple records – slash (/) or pound sign (#) pointer
instruction :
data one; Data:
infile 'p:\bio113\ex19.dat';
input id 4. hr1 2. / hr2 2-3 #3 hr3; 102195
run; f88
proc print; 101
run;
OUTPUT:
1 1021 95 88 101
BIO113 21
With no options on the infile statement, we get into trouble trying to read the following data
with a list input statement. Notice, there are only 3 data values on the first data record.
Example program:
* infile with no options;
data one;
infile 'p:\bio113\ex20.dat';
Data:
input bp1-bp4;
run; 121 83 100
proc print; 84 92 120 95
run;
Result:
OUTPUT:
Obs bp1 bp2 bp3 bp4
1 121 83 100 84
BIO113 22
MISSOVER If the list input statement describes variables beyond the end of the data
record, as above, SAS will go to the next line to complete the input statement.
MISSOVER tells SAS to stop reading variables at the end of the line; any variables in
the input statement not read before SAS finds the end of the line are assigned
missing (.). (Note: TRUNCOVER, page 23, will achieve the same effect but the
tradition is to use missover with list input and truncover with column and formatted
input)
Result:
OUTPUT:
1 121 83 100 .
2 84 92 120 95
BIO113 23
With no options on the infile statement, we get into trouble trying to read the following data
with a column or formatted input statement. You cannot ordinarily see that a record is
"short", but here I've marked the end of the data record with .
Result:
OUTPUT:
Obs x y
1 453 763
BIO113 24
TRUNCOVER If the column or formatted input statement describes variables beyond the
end of the data record, as above, SAS will go to the next line to complete
the input statement. TRUNCOVER tells SAS to read to the last column
specified in the input statement or to the end of the line, whichever
happens first. By the way, MISSOVER will not work here.
OUTPUT:
Obs x y
1 453 85
2 763 943
LRECL Short for Logical RECord Length. Somewhere deep inside SAS it is written that
data lines (records) are of maximum length 256 unless otherwise specified. If the
data lines are longer than 256 characters, then you must describe the longer
record length with LRECL in order to read the data beyond column 256.
PAD PAD, used with column or formatted input, will 'pad' the data line with spaces out
to the LRECL. This is another way to even out the 'ragged' right side of a data set
(you would not need to use truncover).
NOTE: You may end up using PAD on your infile statement but it is NOT a good idea to
use it unless it's needed. Ragged data lines in formatted files often indicate some
problem in the creation or copying of the data set. The 'SAS went to a new line'
message will warn you to look at the data file more carefully.
OUTPUT:
Obs x y
1 453 85
2 763 943
BIO113 26
ACCESSING EXTERNAL RAW DATA SETS AND SAS DATA SETS (LSB 2.8-2.10)
Data:
We've seen:
11 35
* what we've seen; 95 7
data one;
infile 'p:\bio113\ex01.dat'; * note: name of external raw data set in single quotes *;
input a b;
run;
You may also describe the external 'raw' data set with a FILENAME statement. A link is
created between the file named in the FILENAME statement and the INFILE statement with
the fileref. You choose the fileref but it can only be 8 characters long. The FILENAME
statement must precede the INFILE statement. Options (MISSOVER, TRUNCOVER,
LRECL, PAD) appear on the INFILE statement as before.
Example:
SAS LOG:
Test: Write FILENAME and INFILE statements to read the 'raw' data set, ivh.dat, from
p:\bio113.
Answer:
Temporary SAS file: Exists only for the duration of the SAS session
Is stored in the library, WORK (click on WORK to see what's
been stored there)
You gave it a 'single-level' name (though SAS gives it a 'two-
level' name
Permanent SAS file: Exists after you exit a SAS session (we'll make one shortly!)
Is stored in your own library
You gave it a 'two-level' name
Example:
* creating a temporary SAS dataset; Data
filename pups 'p:\bio113\ex22.dat';
data dogs; spot 2
infile pups; dot 4
input name $ age;
run;
Result: Examine the SAS log. SAS created DOGS with 2 observations and 2 variables.
While you gave the data set the single-level name, DOGS, SAS knows it by a
two-level name, WORK.DOGS. The first level of the name WORK) is the libref
and it points to the library (physically, a subfolder in c:\SASTemp\_Txxxx; xxxx is
a number generated for this session) where temporary SAS data files are stored.
The second level is DOGS, the name you gave the data set. Click Libraries in
Explorer and the library WORK to verify that DOGS is there.
BIO113 29
Test: Write a program to create a temporary data set with the single-level name,
INSECTS, and with variables, BUG and LEGS. Then see if you can find it in a
SAS library. Which library should you look in?
spider 8
cricket 6
Answer:
BIO113 30
How? Tell SAS the name of the library (folder) where you want to store the SAS data set.
Result: Examine the SAS Log. SAS created MUFFY.DOGS with 2 observations and 2
variables. The first level of the name (muffy) is the libref I chose and it points
to a library that's physically a folder (p:\bio113) I created before I ran the
program. This permanent SAS data set will be stored in p:\bio113. Click
Libraries and under Active Libraries, muffy, to verify; notice that SAS calls the
file DOGS (Not muffy.dogs). Just to confuse things, the file will have the name
dogs.sas7bdat (sas7bdat is the special suffix reserved for version 7 and 8
SAS data sets) when you look for it with Windows “My Computer” or “Search”.
Test: Write a program to read the insect data set (ex23.dat), as before, but, this time,
save it as a permanent SAS data set, insects.sas7bdat, on your P drive.
Answer:
BIO113 31
Example: Come back later to read the permanent data set, dogs.sas7bdat, into a
temporary data set (foo) and print the data. Note, this time the libref is alvin
but it points to the same physical location, p:\bio113.
Test: Write SAS statements to read the insect data set (ex23.dat) and save
insects.sas7bdat permanently in your own folder, p:\bio113. Write statements
that read insects.sas7bdat from your folder into a temporary data set (call it
ONE) and then print the data.
Answer:
BIO113 32
Another way to create and later use a permanent SAS data set:
Starting with SAS version 7, you can directly reference SAS data sets in programs. The
syntax is similar to that of the INFILE statement where the name of a raw data set is
included in quotation marks. On the DATA (or SET, MERGE, etc) statement, include the
path and member name of the SAS data set in quotes. Notice that SAS, in fact, creates a
libref for you (Wc000001, 2, etc) and lists it under Active Libraries. In this program the SAS
file created and later read is dogs.sas7bdat and it can be found in the folder, p:\bio113.
Test: Write SAS statements to read the insect data set (ex23.dat) and save
insects.sas7bdat permanently in your own folder, p:\bio113 without using a
libref statement.
Answer:
BIO113 33
Test: Read your permanent SAS data set, insects.sas7bdat, into a temporary data set, foo.
Answer:
BIO113 34
PROC CONTENTS: This PROC lists documentation stored with a temporary or permanent
SAS data set.
Result:
View the data: You may also view the data contained in the SAS data set. We've been
doing that using PROC PRINT but it can also be done from SAS's Explorer. Click Explorer,
Libraries and then click the active library WORK. Then double click Foo. (Note that if you
right click on the file and choose View Columns from the pull-down menu, you get an
abbreviated 'proc contents'.)
Click on the u to the right of Columns to choose to see the data set properties or engine
and host information.
BIO113 36
Stat/Transfer is a separate, easy to use, Windows program that transfers data among
several statistical, database and spreadsheet programs. Among the input and output files
Stat/Transfer works with are ASCII, Access, dBase Excel, Lotus 1-2-3, Paradox, SAS, S-
plus, SPSS, Stata and Systat.
Start Stat/Transfer by clicking Start, Miscellaneous and StatTransfer. You will see this
dialogue box.
Select an input file type—I've indicated a Stata data set—and click Browse to locate the file.
My Stata file is c:\bio113\endomet.dta. The suffix, .dta, indicates this is a Stata data set.
Next, select the output file type—I've indicated a SAS for Windows file. Stat/Transfer gave
the output file the same name as the input file except the suffix is .sas7bdat, which indicates
a Version 7/8 SAS data set is to be created. You may change any part of this name (use
Browse) but do not change the suffix. Finally, click Transfer and the SAS data set will be
created from the Stata data set. The tabs--Variables, Observations and Options--permit you
to transfer subsets of variables and observations and to select various transfer options.
When you create a SAS data set with Stat/Transfer you avoid the INPUT issue entirely! You
may immediately access the data set with SAS's SET statement. We will demonstrate this
handy program in class.
BIO113 37
CREATE NEW VARIABLES & CHANGE EXISTING VARIABLES (LSB 3.1-3.3, 3.7-3.8, 8.7-8.9,
8.12)
Assignment statements: New variables are created and existing ones modified with
assignment statements. Note that these SAS statements do not
begin with SAS keywords.
Create new variables: Assign a constant value or the result of an expression to a new
variable. The new variable appears to the left of the equals sign.
If any variable used in an expression is missing (.), the new
variable will also be missing.
Result:
Obs id temp a b c name age age_mons tempc x y
Change existing variables: Replace the value of a variable with a new value.
Result:
Obs id age temp a b c avg1 avg2
SAS functions: SAS has a library of functions that perform arithmetic, string and other
operations on constants and variables. Use these functions in expressions to
create new variables or replace existing ones. Unlike the assignment
statements above, missing values are not always generated by missing data
when you use SAS functions. We will find a listing of built-in SAS functions
using SAS on-line Help (Click in succession: Help, Getting Started with SAS
Software, Help on SAS software products, Base software, Using base SAS
software, Working with the SAS language, SAS Functions, Function
Categories).
name=trim(last)||', '||trim(first);
init=substr(first,1,1);
tot1=round(sum(2.2,1.3,6.8));
tot2=round((sum(2.2,1.3,6.8)),.5);
Result:
Date functions: SAS performs date/time arithmetic on dates expressed as number of days
before or since January 1, 1960 (1/1/1960 has the value 0) and time
expressed as seconds since midnight (midnight has the value 0). Date-time is
expressed as seconds before or since midnight, January 1, 1960. Earlier, we
used the date informat mmddyy8. to input a date, 01161960, which was then
represented in the SAS data set as 15. SAS date and time functions create
SAS dates from variables containing values for month, day, year, hours,
minutes, seconds, etc, or from constants. More date/time functions are listed
in SAS on-line help.
dob2=mdy(bm,bd,by);
tob2=hms(bh,bmn,0);
dob3=dhms(dob2,bh,bmn,0);
thisday=today();
now=time();
rightnow=24*60*60*thisday+now;
run;
Result:
Obs id bm bd by bh bmn dob1 tob1 dob2 tob2 dob3 thisday now rightnow
Year 2K?
yearcutoff=nnnn where nnnn is the first year of the 100-year span. The yearcutoff option
applies only to dates where year is described with 2 digits. Notice the effect of the two
output formats, mmddyy8. and mmddyy10.
Program: Data:
Result:
This is the output when you omit 'yearcutoff=1940' from the program:
Dates with yearcutoff=1920 15:15 Sunday, August 15, 2004
Date difficulties
Occasionally you'll be given a SAS data set with dates that were read in without using SAS
date functions. The variables may be stored as character or numeric but SAS will not have
interpreted them as dates. You will need to disassemble the variables into months, days and
years and then apply a SAS date function to get a proper SAS date. See if you can figure out
how to make SAS times out of time information that was read in incorrectly.
* extract month, day, and year so you can use the mdy function when a date
is read from raw data as a character or numeric variable instead of
as a date variable;
data one;
input cdate $10. +1 ndate 8.;
mon1=1*substr(cdate,1,2);
day1=1*substr(cdate,4,2);
yr1 =1*substr(cdate,7,4);
mon2=int(ndate/1000000);
day2=int(ndate/10000)-(mon2*100);
yr2 =ndate-(mon2*1000000+day2*10000);
date1=mdy(mon1,day1,yr1);
/* or date2=mdy(mon2,day2,yr2); */
datalines;
09-30-2003 09302003
11/06/1970 11061970
run;
proc print; run;
Obs cdate ndate mon1 day1 yr1 mon2 day2 yr2 date1
These SAS functions were used in a program executed on 09/01/2004 at 1:37:03 PM.
Data:
x1 x2 x3 x4 x5 f nm a1 a2 mon day yr hr min sec
10 5 23 . 37 21.655 -4 Muffy st♦♦♦ 9 01 2004 13 37 3
Result:
name age sex wt x1 x2 x3 x4 agecat1 agecat2 group complete miss
molly 9 f 30 1 . 3 3 1 1 . . 1
max 12 m 54 5 3 2 7 . . . 1 0
mike . m 32 . 5 . 7 1 . 1 . 1
matt . 62 3 4 1 6 1 1 . 1 0
* NOTE:
IF-THEN-DO-END: If you want to make more than one assignment, the syntax changes slightly.
libname in 'p:\bio113';
data one; set in.foo5;
run;
Result:
1 molly 9 f 30 1 . 3 3 1 . .
2 max 12 m 54 5 3 2 7 . 1 4.5
3 mandy 99 f . 1 1 1 1 . 1 .
4 mike . m 32 . 5 . 7 . . .
5 matt 0 f 62 3 4 1 6 1 1 .
BIO113 47
run;
proc print;
run;
Result:
Obs id name age bp agecat htn1 htn2 agehtn sex group
1 1 ham 52 100 3 1 1 1 .
2 2 pam 20 70 2 0 0 0 f 5
3 3 nan . 80 . 0 0 . .
4 4 sam 8 78 1 0 0 0 m 3
5 4 dan 18 . 2 . 0 1 .
6 6 ann 26 . 3 . 0 1 f 5
BIO113 48
1 1 ham 52 100 3 1 1 1 .
2 2 pam 20 70 2 0 0 0 f 5
3 3 nan . 80 . 0 0 . .
4 4 sam 8 78 1 0 0 0 m 3
5 4 dan 18 . 2 . 0 1 .
6 6 ann 26 . 3 . 0 1 f 5
Example program (make subsets of WORK.ONE, above, from previous page)
Result:
Define arrays: It is my habit to use caps for array names--it makes them stand out from the
variables! In any event, arrays are named using the rules for all SAS names.
In an explicit array, the number of variables in the array or '*' appears in
parentheses after the array name. In an implicit array, the number in
parentheses is omitted. Arrays may be multidimensional but, if so, must be
explicit. A $ indicates that the array contains character variables.
Process arrays: Arrays are usually processed in DO...END loops. In an explicit array, the
number of times the loop is processed (usually it's the number of variables in
the array) is specified by a subscript variable. You may process all variables
in an array in the same way or you may address a particular variable by
number (subscript). The first variable in the array has subscript, 1, the second,
2, and so forth. Several arrays may be processed in one DO...END if they
contain the same number of variables.
Example:
proc print;
run;
Result:
Obs id ga wt len i
1 1 29 1150 . 4
2 2 . 764 18 4
BIO113 50
proc print;
run;
The output is identical but look at the SAS log to see how values are reassigned by the
array.
Example:
proc print;
run;
Result:
1 1 n y y n 2 5
2 2 n n n n 0 5
3 3 y n y y 3 5
proc print;
run;
Result:
Obs id sbp1 sbp2 sbp3 dbp1 dbp2 dbp3 rat1 rat2 rat3 rep
Result:
139 libname in '.';
NOTE: Libref IN was successfully assigned as follows:
Engine: V8
Physical Name: P:BIO113
140 data one; set in.foo9;
141 * drop rep;
142 array BB(2,3) sbp1-sbp3 dbp1-dbp3;
143 array RR(3) rat1-rat3;
144 do rep=1 to 3;
145 RR(rep)=BB(1,rep)/BB(2,rep);
146 put _all_;
147 end;
148 run;
id=1 sbp1=120 sbp2=115 sbp3=132 dbp1=87 dbp2=70 dbp3=79 rat1=1.3793103448 rat2=. rat3=.
rep=1_ERROR_=0 _N_=1
id=1 sbp1=120 sbp2=115 sbp3=132 dbp1=87 dbp2=70 dbp3=79 rat1=1.3793103448
rat2=1.6428571429 rat3=. rep=2 _ERROR_=0 _N_=1
id=1 sbp1=120 sbp2=115 sbp3=132 dbp1=87 dbp2=70 dbp3=79 rat1=1.3793103448
rat2=1.6428571429 rat3=1.6708860759 rep=3 _ERROR_=0 _N_=1
id=2 sbp1=119 sbp2=123 sbp3=127 dbp1=78 dbp2=75 dbp3=86 rat1=1.5256410256 rat2=. rat3=.
rep=1 _ERROR_=0 _N_=2
id=2 sbp1=119 sbp2=123 sbp3=127 dbp1=78 dbp2=75 dbp3=86 rat1=1.5256410256 rat2=1.64
rat3=. rep=2 _ERROR_=0 _N_=2
id=2 sbp1=119 sbp2=123 sbp3=127 dbp1=78 dbp2=75 dbp3=86 rat1=1.5256410256 rat2=1.64
rat3=1.476744186 rep=3 _ERROR_=0 _N_=2
NOTE: There were 2 observations read from the dataset IN.FOO9.
NOTE: The data set WORK.ONE has 2 observations and 11 variables.
NOTE: DATA statement used:
real time 0.05 seconds
Obs id sbp1 sbp2 sbp3 dbp1 dbp2 dbp3 rat1 rat2 rat3 rep
libname in 'p:bio113'; id a b c d e
data one; set in.foo10;
* notice NO drop statement; 1 229 9999 871 381 173
array MISS a--e; 2 9999 637 377 9999 383
do over MISS;
if MISS=9999 then MISS=.;
end;
run;
Result:
Obs id a b c d e
proc print;
run;
Result:
Obs id sbp1 sbp2 sbp3 dbp1 dbp2 dbp3 rat1 rat2 rat3
Create observations: Up to now, we've read in a single observation and placed a single
observation in a SAS data set or we've deleted it (subsetting IF). An OUTPUT statement inside
the DO...END loop that processes an array, will cause an observation to be output each time the
array is processed so that multiple observations are placed in a new SAS data set. This program
changes the unit of observation in the data set from a subject to a subject-visit.
Example:
Result:
Obs id wt1 wt2 wt3 visit wt
1 1 88 90 97 1 88
2 1 88 90 97 2 90
3 1 88 90 97 3 97
4 2 91 85 87 1 91
5 2 91 85 87 2 85
6 2 91 85 87 3 87
Program:
libname in 'p:\bio113';
data one; set in.foo11;
drop wt1-wt3; * but don't drop visit;
array WW(3) wt1-wt3;
do visit=1 to 3;
wt=WW(visit);
output;
end;
run;
proc print;
run;
Result:
Obs id visit wt
1 1 1 88
2 1 2 90
3 1 3 97
4 2 1 91
5 2 2 85
6 2 3 87
BIO113 55
Program:
libname in 'p:\bio113';
data one; set in.foo11;
drop wt1-wt3;
array WW wt1-wt3;
do over WW;
wt=WW;
visit=_i_;
output;
end;
run;
proc print;
run;
Result:
Obs id visit wt
1 1 1 88
2 1 2 90
3 1 3 97
4 2 1 91
5 2 2 85
6 2 3 87
Program:
* SAS log with put _all_;
libname in 'p:\bio113';
data one; set in.foo11;
drop wt1-wt3;
array WW wt1-wt3;
do over WW;
wt=WW;
visit=_i_;
put _all_;
output;
end;
run;
Basics: Procedures process the most recently created SAS data set unless you explicitly
name another one
Most procedures produce some sort of a report in the OUTPUT window but SORT
doesn't so we must print to see what sort has done
SORT modifies the data set—it changes the order of observations in the data set
(verify this by looking at the data with VIEW between sorts).
Process a subset (without changing the SAS data set) with WHERE
Example: Create a SAS data set then PROC it (this is one SAS session)
sex=1
oreo 1 4
spot 4 5
---- -------
sex 9
=======
20
Number of
Dog's Dog's first place
name age sex ribbons
cleo 1 0 8
dot 2 0 3
oreo 1 1 4
spot 4 1 5
1 cleo f pup
2 dot f adu
3 oreo m pup
4 spot m adu
BIO113 60
If you find yourself using the same formats over and over again, you might want to put them
into your own private format library. The libname statement tells SAS where to put the
format library. The format library will always have the name formats.sas7bcat (the suffix is
sas7bcat, not sas7bdat, which reminds SAS that this special data set contains formats not
data).
Program:
Example: Return later and print variables using the format library. When you create the
format library you may choose the libref (I used mylib). However, when you
use a format library, its libref must be library.
Program:
proc print;
var name sex age;
format sex sex. age age.;
run;
Result:
Obs name sex age
Result:
1 m 31 1462 0 0
2 f 25 564 1 1
3 f 27 931 0 1
4 f 28 1002 1 0
5 m 28 1067 1 1
6 f 32 1327 0 0
7 m 26 880 0 1
8 f 33 1650 0 0
9 m 29 1250 1 0
10 m 26 736 0 1
11 f 30 1212 0 0
BIO113 62
Program continues:
proc format;
value yn 0='No' 1='Yes';
run;
Result:
Cesarian delivery
Cumulative Cumulative
cs Frequency Percent Frequency Percent
--------------------------------------------------------
No 7 63.64 7 63.64
Yes 4 36.36 11 100.00
Cumulative Cumulative
sex Frequency Percent Frequency Percent
--------------------------------------------------------
f 6 54.55 6 54.55
m 5 45.45 11 100.00
intraventricular hemorrhage
Cumulative Cumulative
ivh Frequency Percent Frequency Percent
--------------------------------------------------------
No 6 54.55 6 54.55
Yes 5 45.45 11 100.00
BIO113 63
Frequency|
Percent |
Row Pct |
Col Pct |No |Yes | Total
---------+--------+--------+
f | 4 | 2 | 6
| 36.36 | 18.18 | 54.55
| 66.67 | 33.33 |
| 66.67 | 40.00 |
---------+--------+--------+
m | 2 | 3 | 5
| 18.18 | 27.27 | 45.45
| 40.00 | 60.00 |
| 33.33 | 60.00 |
---------+--------+--------+
Total 6 5 11
54.55 45.45 100.00
Sample Size = 11
BIO113 64
Table 1 of cs by ivh
Controlling for sex=f
cs(Cesarian delivery)
ivh(intraventricular hemorrhage)
Frequency|
Expected |
Percent |
Row Pct |
Col Pct |No |Yes | Total
---------+--------+--------+
No | 3 | 1 | 4
| 2.6667 | 1.3333 |
| 50.00 | 16.67 | 66.67
| 75.00 | 25.00 |
| 75.00 | 50.00 |
---------+--------+--------+
Yes | 1 | 1 | 2
| 1.3333 | 0.6667 |
| 16.67 | 16.67 | 33.33
| 50.00 | 50.00 |
| 25.00 | 50.00 |
---------+--------+--------+
Total 4 2 6
66.67 33.33 100.00
Table 2 of cs by ivh
Controlling for sex=m
cs(Cesarian delivery)
ivh(intraventricular hemorrhage)
Frequency|
Expected |
Percent |
Row Pct |
Col Pct |No |Yes | Total
---------+--------+--------+
No | 1 | 2 | 3
| 1.2 | 1.8 |
| 20.00 | 40.00 | 60.00
| 33.33 | 66.67 |
| 50.00 | 66.67 |
---------+--------+--------+
Yes | 1 | 1 | 2
| 0.8 | 1.2 |
| 20.00 | 20.00 | 40.00
| 50.00 | 50.00 |
| 50.00 | 33.33 |
---------+--------+--------+
Total 2 3 5
40.00 60.00 100.00
BIO113 65
Calculates mean, standard deviation, variance, etc. for one or more variables
Use with BY-variable if data set is sorted by BY-variable
Perform paired t-test
Summaries may be output to a new SAS data set
Program:
Result:
Program:
proc sort; by ivh;
run;
proc means maxdec=1; by ivh; * with by variable *;
var ga bw;
format ivh yn.;
run;
Result:
intraventricular hemorrhage=No
Variable Label N Mean Std Dev Minimum
----------------------------------------------------------------------------
ga gestational age 6 30.5 1.9 28.0
bw birth weight 6 1317.2 222.1 1002.0
----------------------------------------------------------------------------
Variable Label Maximum
-----------------------------------------
ga gestational age 33.0
bw birth weight 1650.0
-----------------------------------------
intraventricular hemorrhage=Yes
Variable Label N Mean Std Dev Minimum
----------------------------------------------------------------------------
ga gestational age 5 26.4 1.1 25.0
bw birth weight 5 835.6 192.6 564.0
----------------------------------------------------------------------------
Variable Label Maximum
-----------------------------------------
ga gestational age 28.0
bw birth weight 1067.0
-----------------------------------------
BIO113 67
Program:
proc means maxdec=1; class ivh; * with class variable (no sort) *;
var ga bw;
format ivh yn.;
run;
Result:
intraventricular N
hemorrhage Obs Variable Label N Mean Std Dev
--------------------------------------------------------------------------------
No 6 ga gestational age 6 30.5 1.9
bw birth weight 6 1317.2 222.1
intraventricular N
hemorrhage Obs Variable Label Minimum Maximum
----------------------------------------------------------------------------
No 6 ga gestational age 28.0 33.0
bw birth weight 1002.0 1650.0
Program:
* with output of specific statistics to a new data set (two);
proc means noprint mean std; by ivh;
var ga bw;
output out=two
mean=ga bw std=ga_s bw_s;
run;
proc print data=two;
run;
Result:
Program:
Result:
Program:
GPLOT (graphic plot) makes plots in which points are not limited to locations determined by
dot matrix printers. The plots also can be in color--you can see color on the screen but,
obviously, you need a color printer to get a hard copy. What you get, in spite of being
'graphic', is not publication quality. Notice that SAS opens a fourth window, GRAPH, to
display the results from gplot.
Examples:
libname in 'p:\bio113';
* simple plot - gestational age vs birth weight - See top next page for output;
proc gplot;
plot ga*bw;
run;
* HIGHLIGHT SEX;
proc gplot;
plot ga*bw=sex;
run;
proc gplot;
plot ga*bw=sex;
symbol1 v=circle;
symbol2 v=plus;
run;
ga
33
32
31
30
29
28
27
26
25
24
500 600 700 800 900 1000 1100 1200 1300 1400 1500
bw
FREQUENCY
70
60
50
40
30
20
10
0
520 600 680 760 840 920 1000 1080 1160 1240 1320 1400 1480
bw MI DPOI NT
BIO113 73
Basics: SET reads existing SAS data sets not 'raw' data
Reads in observations, sequentially, one at a time
May read and modify a single data set
May read, combine (concatenate) and modify multiple data sets
Result:
one
Obs id ga bw cs
1 1 31 1462 1
2 8 . 1650 0
3 3 25 931 .
4 6 32 1327 0
two
Obs id ga bw cs acs
1 7 26 880 0 1
2 10 26 736 . 1
3 2 25 564 1 0
4 11 30 1212 1 0
BIO113 74
Example: Concatenate SAS data sets, work.one and work.two, with SET.
libname in 'p:\bio113';
data in.baby1; set one two; * creates baby1 *;
run;
proc print; title 'one and two concatenated'; run;
Result:
one and two concatenated
Obs id ga bw cs acs
1 1 31 1462 1 .
2 8 . 1650 0 .
3 3 25 931 . .
4 6 32 1327 0 .
5 7 26 880 0 1
6 10 26 736 . 1
7 2 25 564 1 0
8 11 30 1212 1 0
Example: Repeat, but this time interleave the 2 SAS data sets by ID and create the new
variable, gacat. Both data sets need to be sorted before interleaving.
proc sort data=one; by id; run;
proc sort data=two; by id; run;
data in.baby1; set one two; by id; * replaces old baby1 with new one *;
run;
proc print; title 'one and two interleaved'; run;
Result:
one and two interleaved
Obs id ga bw cs acs gacat
1 1 31 1462 1 . older
2 2 25 564 1 0 younger
3 3 25 931 . . younger
4 6 32 1327 0 . older
5 7 26 880 0 1 younger
6 8 . 1650 0 .
7 10 26 736 . 1 younger
8 11 30 1212 1 0 older
BIO113 75
Create a subset:
Example: Now let's create a new temporary data set, work.older, that's a subset of the data set
in.baby1 created in the previous step. The subsetting IF statement tells SET to bring
in a subset of observations, those with gacat='older'.
Result:
older babies
1 1 31 1462 1 . older
2 6 32 1327 0 . older
3 11 30 1212 1 0 older
Example: Now let's create a new temporary data set, work.cs, that's a different subset of
work.baby1. The subsetting IF statement tells SET to bring in the subset of
observations with non-missing cs.
Result:
1 1 31 1462 1 . older
2 2 25 564 1 0 younger
3 6 32 1327 0 . older
4 7 26 880 0 1 younger
5 8 . 1650 0 .
6 11 30 1212 1 0 older
BIO113 76
Basics: MERGE reads existing SAS data sets not 'raw' data
Reads in observations, sequentially, one at a time
Reads, combines (joins) and modifies multiple data sets
Meaningful joining requires one or more BY variables
Joining is one-to-one, one-to-many or many-to one but never many-to-many
First use SET to get a different subset of observations and variables from work.baby1 and then
create a second SAS data set. Notice KEEP= option and DROP statement.
Result:
old data
Obs id ga acs
1 2 25 0
2 3 25 .
3 7 26 1
4 10 26 1
new data
1 2 2 9 15 16 1
2 10 8 22 11 19 .
3 3 2 4 . 9 1
BIO113 77
One-to-one join:
Example: Perform a one-to-one merge, matching on ID. Data sets must be sorted prior to
merge. Note RENAME statement (rename old-varname=new-varname…;).
proc sort data=one; by id; run;
proc sort data=two; by id; run;
data in.baby2; merge one two; by id;
rename colloid1=col1 colloid2=col2 colloid3=col3;
run;
proc print; title 'baby2 - merged data set'; run;
Result:
baby2 - merged data set
1 2 25 1 2 9 15 16
2 3 25 1 2 4 . 9
3 7 26 1 . . . .
4 10 26 . 8 22 11 19
BIO113 78
One-to-many join:
Example: Create a SAS data set with information about mothers, sort it and in.baby2 by
MOMID, and join the two by MOMID.
data mom; Data:
infile 'p:\bio113\ex31.dat';
input momid age; 2 38
run; 5 34
8 29
proc sort data=in.baby2; by momid; run;
proc sort data=mom; by momid; run;
Result:
baby2
1 7 26 1 . . . .
2 2 25 1 2 9 15 16
3 3 25 1 2 4 . 9
4 10 26 . 8 22 11 19
mom
1 2 38
2 5 34
3 8 29
1 7 26 1 . . . . .
2 2 25 1 2 9 15 16 38
3 3 25 1 2 4 . 9 38
4 . . . 5 . . . 34
5 10 26 . 8 22 11 19 29
BIO113 79
Join with IN= option: Creates a special temporary SAS variable that indicates whether the
data set contributed data to the current observation. In the data step, the
value of the variable is 1 if the data set contributed data to the current
observation, and 0 otherwise. The variable is available during the data
step but is not placed in the output data set.
Example: Notice that obs 1 in in.baby3 has no data from work.mom and obs 4 has no data
from in.baby2. Now repeat the merge with the IN= option.
proc print; title 'babies have moms and moms have babies'; run;
Result:
babies have moms and moms have babies
1 2 25 1 2 9 15 16 38
2 3 25 1 2 4 . 9 38
3 10 26 . 8 22 11 19 29
BIO113 80
Look at what IN= is doing by creating variables that hold the values of the IN= variables:
1 7 26 1 . . . . . 1 0
2 2 25 1 2 9 15 16 38 1 1
3 3 25 1 2 4 . 9 38 1 1
4 . . . 5 . . . 34 0 1
5 10 26 . 8 22 11 19 29 1 1
1 2 25 1 2 9 15 16 38 1 1
2 3 25 1 2 4 . 9 38 1 1
3 10 26 . 8 22 11 19 29 1 1
1 7 26 1 . . . . . 1 0
2 2 25 1 2 9 15 16 38 1 1
3 3 25 1 2 4 . 9 38 1 1
4 10 26 . 8 22 11 19 29 1 1
BIO113 81
Example: Find how GA and BW of the CS babies differs from the average for the group. Use
SET, not MERGE, and a SAS automatic variable, _n_. Like the variable created with
IN=, _n_ is temporary and does not appear in your SAS data set.
baby1
1 1 31 1462 1 . older
2 2 25 564 1 0 younger
3 3 25 931 . . younger
4 6 32 1327 0 . older
5 7 26 880 0 1 younger
6 8 . 1650 0 .
7 10 26 736 . 1 younger
8 11 30 1212 1 0 older
Program:
gadiff=xga-ga;
bwdiff=xbw-bw;
run;
Result:
ga and bw differences
Creates one or more SAS data sets from a single SAS data set
Overrides SAS implicit output
Example: Return to baby1 and create separate SAS data sets for babies who were
delivered by C-section and those who were not.
Notice where we use the KEEP= (or DROP=) option to keep different
variables in the two datasets.
Program:
Result:
baby1
Obs id ga bw cs acs gacat
1 1 31 1462 1 . older
2 2 25 564 1 0 younger
3 3 25 931 . . younger
4 6 32 1327 0 . older
5 7 26 880 0 1 younger
6 8 . 1650 0 .
7 10 26 736 . 1 younger
8 11 30 1212 1 0 olde
cs no
Obs id ga acs
1 6 32 .
2 7 26 1
3 8 . .
cs yes
Obs id bw acs
1 1 1462 .
2 2 564 0
3 11 1212 0
BIO113 83
Example: Each baby has 3 colloid measures. Make an observation for each non-
missing colloid value for a baby by using OUTPUT inside the DO...END
processing of an array. This changes the unit of observation from a baby
to a baby-measurement time.
run;
Result:
1 2 9 15 16
2 3 4 . 9
1 2 1 9
2 2 2 15
3 2 3 16
4 3 1 4
5 3 3 9
BIO113 84
Example: Undo what was just done! Data set must be sorted by the id variable (ID) and
the indexing variable (TIME).
DD(time)=col;
run;
Result:
1 2 9 15 16
2 3 4 . 9
BIO113 85
DATA ANALYSIS - PROC STEP (LSB 7.1, 7.4-7.8, SAS User's Guide: Statistics)
Example: Are babies receiving more fluids/kg of birth weight on day 4 than on day 1. Is
the change normally distributed? Is it significantly different from 0?
Result:
The UNIVARIATE Procedure
Variable: delta
Moments
Location Variability
Quantiles (Definition 5)
Quantile Estimate
Extreme Observations
-----------Lowest---------- ----------Highest----------
Missing Values
-----Percent Of-----
Missing Missing
Value Count All Obs Obs
. 11 1.94 100.00
Example: Do babies with and without IVH have different pCO2 values on day 1 or day 2?
proc ttest;
class ivh; format ivh yn.;
var pco2_1 pco2_2;
run;
Result:
Statistics
T-Tests
Equality of Variances
Npar1way: Compare medians for two or more groups with the Wilcoxon rank sums test or
the Kruskal-Wallis test. Unfortunately, Npar1way does not display the
medians only the mean of the ranks for each category (mean score).
Example: Is MAP (mean blood pressure) on day 1 different for babies whose mothers
received no, partial or complete antenatal corticosteriod courses?
Result:
Kruskal-Wallis Test
Chi-Square 8.4380
DF 2
Pr > Chi-Square 0.0147
BIO113 90
Reg: Perform univariate or multivariate linear regression. This is one of several SAS procedures
for linear regression and is nice because diagnostic plots are built in.
Example: What's the relationship between birth weight and gestational age? Is the
relationship linear?
Result:
The REG Procedure
Model: MODEL1
Dependent Variable: bw
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Parameter Estimates
Parameter Standard
Variable DF Estimate Error t Value Pr > |t|
200
- 200
- 400
- 600
- 800
24 25 26 27 28 29 30 31 32 33
ga
BIO113 92
ANOVA: Perform one-way analysis of variance--compare means when there are more
than two groups. Make multiple comparisons with Bonferroni, Duncan, Tukey,
Scheffe, etc.
Example: Do the babies receive different amounts of colloid (in cc/kg birth weight) on days 1 to
4? Note we need to make a 'long skinny' data set.
proc anova;
class day;
model colloid=day;
means day / tukey bon;
run;
Result:
The ANOVA Procedure
day 4 1 2 3 4
NOTE: Due to missing values, only 657 observations can be used in this analysis.
Sum of
Source DF Squares Mean Square F Value Pr > F
Alpha 0.05
Error Degrees of Freedom 653
Error Mean Square 129.5161
Critical Value of Studentized Range 3.64260
Difference
day Between Simultaneous 95%
Comparison Means Confidence Limits
NOTE: This test controls the Type I experimentwise error rate, but it generally has a
higher
Type II error rate than Tukey's for all pairwise comparisons.
Alpha 0.05
Error Degrees of Freedom 653
Error Mean Square 129.5161
Critical Value of t 2.64632
Difference
day Between Simultaneous 95%
Comparison Means Confidence Limits
Example: Is the pCO2 on the first day of life associated with gestational age
(categorized), birth hospital, an interaction between the two?
Program:
* 2-way anova with interaction - use work.one & create ga categories;
data one; set one;
if . < ga < 26 then gacat=1;
else if 26 <= ga < 29 then gacat=2;
else if ga >= 29 then gacat=3;
proc glm;
class gacat hosp;
model pco2_1=gacat hosp gacat*hosp;
run;
Result:
gacat 3 1 2 3
hosp 2 1 2
NOTE: Due to missing values, only 533 observations can be used in this analysis.
Sum of
Source DF Squares Mean Square F Value Pr > F
Program:
* logistic regression example - use work.one and create dummy variables;
data one; set one;
if ga > . then ga1=(. < ga < 26);
if ga > . then ga2=(26 <= ga < 29);
if rom > . then romcat=(rom >= 1);
if acs > . then complete=(acs=3);
run;
proc logistic descending;
model ivh=ga1 ga2 pih romcat complete;
run;
Result:
The LOGISTIC Procedure
Model Information
Response Profile
Ordered Total
Value ivh Frequency
1 1 105
2 0 426
NOTE: 35 observations were deleted due to missing values for the response or
explanatory variables.
BIO113 99
Intercept
Intercept and
Criterion Only Covariates
Standard Wald
Parameter DF Estimate Error Chi-Square Pr > ChiSq
Example: Is hospital length of stay related to gestational age (categorical). Deaths are
censored
Program:
proc lifetest notable plots=(s) data=one;
time los*dead(1);
strata gacat;
run;
/* note: proc lifetest notable plots=(s) data=one lineprinter; for dot plot*/
Result:
The LIFETEST Procedure
Percent
Stratum gacat Total Failed Censored Censored
NOTE: There were 2 observations with missing values, negative time values or frequency
values less than 1.
Rank Statistics
1 -73.669 -26890
2 -19.678 -14009
3 93.346 40899
gacat 1 2 3
gacat 1 2 3
This graph would be in color on your screen. The line on the left is the babies in gacat=3 (29
weeks and above), the middle line is the babies in gacat=2 (26 to 28 weeks), and the line on the
right is the babies in gacat=1 (less than 26 weeks).
1. 00
0. 75
0. 50
0. 25
0. 00
l os