Functtion in Base Sas and Proc SQL
Functtion in Base Sas and Proc SQL
Coders' Corner
Tips to Use Character String Functions in Record Lookup Anjan Matlapudi Pharmacy Informatics, PerformRx, The Next Generation PBM, 200 Stevens Drive, Philadelphia, PA 19113
ABSTRACT
This paper gives you a better idea of how and where to use the record lookup functions to locate observations where a variable has some characteristic. Various related functions are illustrated to search numeric and character values in this process. Code is shown with time comparisons. I will discuss possible three ways to retrieve records using in SAS DATA step, PROC SQL and Perl regular expression. Real and CPU time processing issues will be highlighted when comparing to retrieve records using these methods. Program was written for the PC using SAS 9.2 on Windows XP 62 bit environment. All the tools discussed are in BASE SAS. The typical attendee or reader will have some experience in SAS, but not a lot of experience dealing with large number of data.
INTRODUCTION
A common task is to look for a particular field in databases ranging in size from less than a million to more than a billion. SAS Character functions are required to do this. I ask myself whether all of these functions would work in DATA step, PROC SQL and Perl Regular Expression. In this paper, I will introduce the code in several tables solving different problems for each function. In some cases PROC SQL is omitted because conditional use of the functions. This way many of us would be aware of how best we can use some of the commonly used functions in several ways. To make the tests I used the following code to randomly generate 10 million observations with character and numeric variables using RANUIN function. DO LOOP =1 to 10e7 will generate 10 million records, but this code can generate any number of records by changing log exponential. I tested most of the functions to find one or many records out of the 10 million and noted processing time it takes to read, modify and write character variables.
NESUG 2011
Coders' Corner
end; keep ssn Name; run; proc print data= TenMillonRecs (obs=10); title "First 10 Records out of 10 Million Test Records"; run;
FUNCTION
RETURNS
REALTIME/CPU
LIKE Purpose: Search option for specific string of characters. data Like_out; set TenMillonRecs; where name like 'Anjam%'; run; proc sql; create table Like_tbl as select * from TenMillonRecs where name like 'Anjam%'; quit; Anjamvfk Zmqzneup 440019853 A 1.56/1.42
A 1.93/2.67
SUBSTR Purpose: Extracts part of the string specified by the start and length parameters. data SubStr_out; set TenMillonRecs; where substr(name,1,5)='Anjam'; run; proc sql; create table SubStr_tbl as select * from TenMillonRecs where substr(name,1,5)= 'Anjam'; quit; Anjamvfk Zmqzneup 440019853 A 3.43/2.79
A 3.59/2.68
NESUG 2011
Coders' Corner
FIND Purpose: Locate substring with in a string. data Find_out; set TenMillonRecs; where find(name,'Anjamvfk Zmqzneup')=1; run;
A 3.93/2.29
proc sql; create table find_tbl as select * from TenMillonRecs where find(name,'Anjamvfk A Zmqzneup')=1; quit;
A 4.87/2.45
INDEX Purpose: To locate starting portion of substring of a string. data Index_out; set TenMillonRecs; where index(name,'Anjam')=1; run; proc sql; create table Index_tbl as select * from TenMillonRecs where index(name,'Anjam')=1; quit; Anjamvfk Zmqzneup 440019853 A 4.75/2.53
A 5.01/2.64
SCAN Purpose: Extracts specified word from a character string. data Scan_out; length FirstName LastName $10. MiddleName $1.; set TenMillonRecs; FirstName =scan(name,1,''); MiddleName =scan(name,2,''); LastName =scan(name,3,''); /*where name ='Anjamvfk A Zmqzneup';1*/ run; proc sql; create table Scan_tbl as select name,ssn, scan(name,1,' ') as firstName, scan(name,3,' ') as MiddleName, scan(name,2,' ') as LastName from TenMillonRecs /*1where name ='Anjamvfk A Zmqzneup';*/; quit;
1
32.12/6.71
4.06.03/30.92
We can test all statements to read, modify and write one record using where clause option or we can retrieve 10 million records by removing where clause
NESUG 2011
Coders' Corner
TRANSLATE Purpose: To exchange on character value to another. data translate_out (keep=result); set TenMillonRecs; result= translate(name,'Anjan','Anjamvfk'); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table translate_tbl as select translate(name,'Anjan','Anjamvfk') as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; CAT Purpose: Combine two strings. data Cat_out (keep=result); set TenMillonRecs; result= cat(name,'is' ,' funny name'); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Cat_tbl as select cat(name,'is' ,' funny name') as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;
Anjan ZnqzneupA
0.87/0.85
Anjan Znqzneup
0.88/0.90
0.87/0.87
0.87/0.89
VERIFY Purpose: Returns the position of the first character in a string that is not in any of several other strings. data Verify_out (keep=result); set TenMillonRecs; result= verify(name,'Anjam'); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Verify_tbl as select verify(name,'Anjam') as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;
1.30/0.95
1.49/0.92
NESUG 2011
Coders' Corner
TRIM Purpose: Removes trailing blanks from a character string. data trim_out; set TenMillonRecs; name= trim(name); run; proc sql; create table trim_tbl as select trim(name) as name,ssn from TenMillonRecs; quit; 10000000 observations and 2 variables
39.61/4.15
15.96/5.48
STRIP Purpose: To strip leading or trailing blanks from character string. data strip_out(keep=result); set TenMillonRecs; if name= 'Anjamvfk A Zmqzneup' then name1 = ' Anjamvfk'; result = strip(name1) ; where name ='Anjamvfk A Zmqzneup'; run; Proc SQL is omitted. RIGHT Purpose: Align right side of character string. data Right_out(keep= name1 result); set TenMillonRecs; if name= 'Anjamvfk A Zmqzneup' then name1 = 'Anjamvf '; result = right(name1) ; where name ='Anjamvfk A Zmqzneup'; run; Proc SQL is omitted. LEFT Purpose: Align left side of character string. data Left_out(keep= name1 result); set TenMillonRecs; if name= 'Anjamvfk A Zmqzneup' then name1 = ' Anjamvf'; result = left(name1) ; where name ='Anjamvfk A Zmqzneup'; run; proc print data =left_out; run; Proc SQL is omitted.
Anjamvfk
0.87/0.87
Anjamvfk
Anjamvfk
0.89/0.90
Anjamvfk Anjamvfk
0.87/0.87
NESUG 2011
Coders' Corner
COMPRESS Purpose: Remove specified character value (including blanks) from a string. data compress_out; set TenMillonRecs; name= compress(name); run; proc sql; create table compress_tbl as select compress(name) as name,ssn from TenMillonRecs; quit; 10000000 observations and 2 variables
9.59/4.82
47.37/6.12
COMPBL Purpose: Replace two or more blanks with single blank. data Compbl_out; length result $100.; set TenMillonRecs; result= compbl(name); where name ='Anjamvfk run;
0.90/0.89
proc sql; create table Compbl_tbl as select compbl(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; UPCASE Purpose: Convert all letters to upper case. data Upcause_out(keep=result); set TenMillonRecs; result= upcase(name); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Upcause_tbl as select upcase(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; LOWCASE Purpose: Converts all the letters to lower case. data Lowcause_out(keep=result); set TenMillonRecs; result= lowcase(name); where name ='Anjamvfk A Zmqzneup'; run;
Anjamvfk A Zmqzneup
1.29/0.89
ANJAMVFK ZMQZNEUP
0.87/0.87
ANJAMVFK ZMQZNEUP
0.87/0.87
anjamvfk zmqzneup
0.86/0.86
NESUG 2011
Coders' Corner
proc sql; create table Lowcause_tbl as select lowcase(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;
anjamvfk zmqzneup
0.89/0.89
PROPCASE Purpose: Capitalize the first letter of each word in a string. data Propcase_out(keep=result); set TenMillonRecs; result= propcase(name); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Propcase_tbl as select propcase(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; ANYSPACE Purpose: To locate first occurrence of white space. data Anyspace_out (keep=result); set TenMillonRecs; result = anyspace(name) ; where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Anyspace_tbl as select anyspace(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; FIRST Purpose: Extracts the first character from a string. data First_out(keep=result); set TenMillonRecs; result= first(name); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table First_tbl as select first(name) as name,ssn from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;
Anjamvfk Zmqzneup
0.92/0.89
Anjamvfk Zmqzneup
0.89/0.87
0.87/0.87
1.34/0.92
0.90/0.89
0.89/0.89
NESUG 2011
Coders' Corner
TRANWRD Purpose: Substitute one or more words in a string. data Tranwrd_out (keep=result); set TenMillonRecs; result = tranwrd(name,'Anjamvfk A Zmqzneup','Anjan Matlapudi') ; where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Tranwrd_tbl as select tranwrd(name,'Anjamvfk A Zmqzneup','Anjan Matlapudi') as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;
Anjan Matlapudi
2.32/0.96
Anjan Matlapud
2.42/0.96
LENGTH Purpose: Determine length of a character value (not counting trailing blanks). data Length_out; set TenMillonRecs; result = length(name) ; where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Length_tbl as select name,ssn,length(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; REVERSE Purpose: Reverse the order of character string. data Reverse_out(keep=result); set TenMillonRecs ; result = reverse(name) ; where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Reverse_tbl as select reverse(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;
A 2.33/0.89 21
A 21 2.29/0.89
puenzqmZ A kfvmajnA
0.83/0.84
puenzqmZ A kfvmajnA
0.89/0.87
NESUG 2011
Coders' Corner
REPEATE Purpose: Make several copies of a string. Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup A A A A A A A A 0.87/0.87 0.95/0.85
data Repeat_out(keep=result); set TenMillonRecs ; result = repeat(name,3) ; where name ='Anjamvfk A Zmqzneup'; run
proc sql; create table Repeat_tbl as select repeat(name,3) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;
SPEDIS Purpose: Computes spelling distance between words. data Spedis_out; set TenMillonRecs; if name = 'Anjamvfk A Zmqzneup' then Name = 'Knowledge'; if spedis(name,'nowledge' )le 27 then output; run; Proc SQL is omitted LAG Purpose: To obtain previous value from the current character variable. data Lag_out(keep=result); set TenMillonRecs; result = lag(name) ; run; Note: LAG does not work in Proc SQL. COUNT Purpose: Counts number of times in a given substring in a string. data Count_out(keep=result); length name $100.; set TenMillonRecs; if name = 'Anjamvfk A Zmqzneup' then name = 'Random generated Random Name in Random data'; result = count(Name,'Random') ; where name ='Anjamvfk A Zmqzneup';
Knowledge 440019853
35.27/30.79
13.73/2.87
087/0.86 3
NESUG 2011
Coders' Corner
run; Proc SQL is omitted. CHOOSEC Purpose: Returns a character value that represents the results of choosing from a list of arguments. data Choosec_out(keep=result); set TenMillonRecs; if name= 'Anjamvfk A Zmqzneup' then name1 = 'Anjan'; else if name = 'Kxlgoqhma A Pcotljsnz' then name2 = 'Matlapudi'; else if name = 'Taosezswy A Gzlhvahdb' then name3 = 'Anjan Matlapudi'; Anjan Matlapudi 7.01/1.23 result = Choosec(3,name1,name2,name3) ; where name in('Anjamvfk A Zmqzneup', 'Kxlgoqhma A Pcotljsnz','Taosezswy A Gzlhvahdb'); run; Proc SQL is omitted.
FUNCTION
RETURNS
REALTIME/CPU
PRXMATCH Purpose: Searches for a pattern match and returns the position at which the pattern is found. data PrxMatch_out; set TenMillonRecs ; if prxmatch("m/Anjamvfk/oi",name)> 0 then found =1; else found =0; run; Proc sql; create table PrxMatch_tbl as select name,ssn, prxmatch("m/Anjamvfk/oi",name) as found from TenMillonRecs; quit; 10000000 observations with found =1 record and remaining observation found = 0
36.37/0.79
37.02/6.26
10
NESUG 2011
Coders' Corner
PRXPARSE AND PRXSUBSTR Purpose: Perl regular expression (PRX) can be used for substring of character string matching using PRXPARSE. data PrxSubstr_out; set TenMillonRecs; if _n_=1 then do; retain re; re = prxparse('/\w \w.+/'); if missing(re) then do; stop; end; end; call prxsubstr(re,name,start,length); FirstName = substrn(name,start -11, length -4); LastName = substrn(name,start+2, length +1); if start > 0 then output; run;
58.61/11.90
VARIABLE
CHARACTER proc print data = TenMillonRecs; where name ='Anjamvfk A Zmqzneup'; run;
RETURNS
REALTIME/CPU
Anjamvfk 440019853
A Zmqzneup 0.95/0.90
proc sql; select * from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; NUMERIC proc print data = TenMillonRecs; where ssn =440019853; run; proc sql; select * from TenMillonRecs where ssn =440019853; quit;
Anjamvfk 440019853
A Zmqzneup
0.96/0.89
Anjamvfk 440019853
A Zmqzneup
0.92/0.84
Anjamvfk 440019853
A Zmqzneup
0.93/0.84
11
NESUG 2011
Coders' Corner
CONCLUSION
So far I have demonstrated possible ways for a record look up using DATA step, PROC SQL and Perl regular expression. I further generalize some tips using these functions for record look If you already know SQL, you will be pleased to know that you can use most of the functions in PORC SQL to create, read and modify variables in SAS data sets. Flexibility of character string functions available while we are working with small to a large scale data. We can use all these functions in DATA step and PROC SQL. All these functions would also work in Oracle, SQL server and Access database with slight syntax modification in code. Real time processing will give best suitable option to choose some of the functions. Based on the above queries, data step processing has been taken less time when compare PROC step.
By now you may have some clue that all most all the functions work in DATA step and PROC SQL and if you are dealing with large scale data, you can easily pick up best suitable function in terms of time taken to run each function. Also, I have included most commonly used function at one place; some of you may take advantage instead of spending more time finding them.
REFERENCES
Ronald Cody, An Introduction to SAS Character Functions. Coders Corner Paper 217-2007. Paul A. Choate Quick Record Lookup without an index. Tutorials, Paper 031-2007. David L. Cassell. The Basics of the PRX Functions SAS, Coders Corner Paper. Paper 223-2007.
ACKNOWLEDGMENTS
I like to acknowledge Ian Whitlock for his valuable suggestions. I would also like to acknowledge Mr. Shimels Afework, Senior Director, PerformRx. PerformRx, LLC provides pharmacy benefit management (PBM) services through proactively managing escalating pharmacy costs while focusing on clinical improvement and financial results.
CONTACT INFORMATION:
Your comments and questions are valued and encouraged. Contact the author at Name Address Anjan Matlapudi Senior Pharmacy Analyst PerformRx, The Next Generation PBM 200 Stevens Drive Philadelphia, PA 19113 (215)937-7252 (215)863-5100 anjan.matlapudi@performrx.com anjanmat@gmail.com
SAS is a registered trademark or trademark of SAS Institute, Inc. in the USA and other countries. IBM, OS/390, and MVS are registered trademarks of International Business Machines Inc.
12