[go: up one dir, main page]

0% found this document useful (0 votes)
415 views12 pages

Functtion in Base Sas and Proc SQL

This document discusses using character string functions in SAS to lookup records. It shows code examples of functions like LIKE, SUBSTR, FIND etc in DATA step and PROC SQL to retrieve records from a dataset of 10 million records. Processing time for each method is also compared.
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
415 views12 pages

Functtion in Base Sas and Proc SQL

This document discusses using character string functions in SAS to lookup records. It shows code examples of functions like LIKE, SUBSTR, FIND etc in DATA step and PROC SQL to retrieve records from a dataset of 10 million records. Processing time for each method is also compared.
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

NESUG 2011

Coders' Corner

Tips to Use Character String Functions in Record Lookup Anjan Matlapudi Pharmacy Informatics, PerformRx, The Next Generation PBM, 200 Stevens Drive, Philadelphia, PA 19113
ABSTRACT
This paper gives you a better idea of how and where to use the record lookup functions to locate observations where a variable has some characteristic. Various related functions are illustrated to search numeric and character values in this process. Code is shown with time comparisons. I will discuss possible three ways to retrieve records using in SAS DATA step, PROC SQL and Perl regular expression. Real and CPU time processing issues will be highlighted when comparing to retrieve records using these methods. Program was written for the PC using SAS 9.2 on Windows XP 62 bit environment. All the tools discussed are in BASE SAS. The typical attendee or reader will have some experience in SAS, but not a lot of experience dealing with large number of data.

INTRODUCTION
A common task is to look for a particular field in databases ranging in size from less than a million to more than a billion. SAS Character functions are required to do this. I ask myself whether all of these functions would work in DATA step, PROC SQL and Perl Regular Expression. In this paper, I will introduce the code in several tables solving different problems for each function. In some cases PROC SQL is omitted because conditional use of the functions. This way many of us would be aware of how best we can use some of the commonly used functions in several ways. To make the tests I used the following code to randomly generate 10 million observations with character and numeric variables using RANUIN function. DO LOOP =1 to 10e7 will generate 10 million records, but this code can generate any number of records by changing log exponential. I tested most of the functions to find one or many records out of the 10 million and noted processing time it takes to read, modify and write character variables.

CREATE TEST RECORDS USING RANUNI FUNCTION:


*----Test data-----*; data TenMillonRecs; format FirstName LastName $10. MiddleName $1. Name $25. SSN 9.; string1="abcdefghijklmnopqrstuvwxyz"; string2="ABCDEFGHIJKLMNOPQRSTUVWXYZ"; string3=reverse("abcdefghijklmnopqrstuvwxyz"); strnglen=length(string1); do i=1 to 1e7; random=ranuni(95959); FirstName =''; LastName = '' ; MiddleName = ''; length=int(ranuni(0)*3)+8; *int truncate decimal point ; do j=1 to length; pick=int(ranuni(0)*strnglen)+1; FirstName=substr(string1,pick,1)||FirstName; MiddleName = substr(string2,1,1); LastName=substr(string3,pick,1)||LastName; Name = propcase(FirstName||" "||MiddleName||" "||LastName); SSN = input(compress(put(1e9*random,z9.)),9.); end; output;

NESUG 2011

Coders' Corner

end; keep ssn Name; run; proc print data= TenMillonRecs (obs=10); title "First 10 Records out of 10 Million Test Records"; run;

DATA STEP AND PROC SQL:


SAS is an excellent tool to accommodate many functions in many ways; it has flexibility to use these functions in SAS for programmers with different sets of skills. SAS implemented SQL (PROC SQL) in version 6.0 and Perl Script in version 9.0 for better flexibility to retrieve information. In this paper, I will touch base some of the functions for records look up and also highlight the real and CPU (Central Processing Unit) time taken to run each functions in the same environment using DATA step and PROC SQL. Most of the function names itself describe the role of each function; however I highlighted purpose of each function in the tables. If you want to know more details, you can always approach SAS help or online documentation. The code bellow shows some of the highlighted functions in DATA step and PROC SQL and next column for each function represent value return while running each function real and CPU time in seconds is noted for each processing. All these functions are tested on same environment, each statement run several times for processing the real time comparison.

FUNCTION

RETURNS

REALTIME/CPU

LIKE Purpose: Search option for specific string of characters. data Like_out; set TenMillonRecs; where name like 'Anjam%'; run; proc sql; create table Like_tbl as select * from TenMillonRecs where name like 'Anjam%'; quit; Anjamvfk Zmqzneup 440019853 A 1.56/1.42

Anjamvfk Zmqzneup 440019853

A 1.93/2.67

SUBSTR Purpose: Extracts part of the string specified by the start and length parameters. data SubStr_out; set TenMillonRecs; where substr(name,1,5)='Anjam'; run; proc sql; create table SubStr_tbl as select * from TenMillonRecs where substr(name,1,5)= 'Anjam'; quit; Anjamvfk Zmqzneup 440019853 A 3.43/2.79

Anjamvfk Zmqzneup 440019853

A 3.59/2.68

NESUG 2011

Coders' Corner

FIND Purpose: Locate substring with in a string. data Find_out; set TenMillonRecs; where find(name,'Anjamvfk Zmqzneup')=1; run;

Anjamvfk Zmqzneup 440019853

A 3.93/2.29

proc sql; create table find_tbl as select * from TenMillonRecs where find(name,'Anjamvfk A Zmqzneup')=1; quit;

Anjamvfk Zmqzneup 440019853

A 4.87/2.45

INDEX Purpose: To locate starting portion of substring of a string. data Index_out; set TenMillonRecs; where index(name,'Anjam')=1; run; proc sql; create table Index_tbl as select * from TenMillonRecs where index(name,'Anjam')=1; quit; Anjamvfk Zmqzneup 440019853 A 4.75/2.53

Anjamvfk Zmqzneup 440019853

A 5.01/2.64

SCAN Purpose: Extracts specified word from a character string. data Scan_out; length FirstName LastName $10. MiddleName $1.; set TenMillonRecs; FirstName =scan(name,1,''); MiddleName =scan(name,2,''); LastName =scan(name,3,''); /*where name ='Anjamvfk A Zmqzneup';1*/ run; proc sql; create table Scan_tbl as select name,ssn, scan(name,1,' ') as firstName, scan(name,3,' ') as MiddleName, scan(name,2,' ') as LastName from TenMillonRecs /*1where name ='Anjamvfk A Zmqzneup';*/; quit;
1

10000000 observations and 5 variables

32.12/6.71

10000000 rows and 5 columns.

4.06.03/30.92

We can test all statements to read, modify and write one record using where clause option or we can retrieve 10 million records by removing where clause

NESUG 2011

Coders' Corner

TRANSLATE Purpose: To exchange on character value to another. data translate_out (keep=result); set TenMillonRecs; result= translate(name,'Anjan','Anjamvfk'); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table translate_tbl as select translate(name,'Anjan','Anjamvfk') as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; CAT Purpose: Combine two strings. data Cat_out (keep=result); set TenMillonRecs; result= cat(name,'is' ,' funny name'); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Cat_tbl as select cat(name,'is' ,' funny name') as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;

Anjan ZnqzneupA

0.87/0.85

Anjan Znqzneup

0.88/0.90

Anjamvfk A Zmqzneup is funny name

0.87/0.87

Anjamvfk A Zmqzneup is funny name

0.87/0.89

VERIFY Purpose: Returns the position of the first character in a string that is not in any of several other strings. data Verify_out (keep=result); set TenMillonRecs; result= verify(name,'Anjam'); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Verify_tbl as select verify(name,'Anjam') as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;

1.30/0.95

1.49/0.92

NESUG 2011

Coders' Corner

TRIM Purpose: Removes trailing blanks from a character string. data trim_out; set TenMillonRecs; name= trim(name); run; proc sql; create table trim_tbl as select trim(name) as name,ssn from TenMillonRecs; quit; 10000000 observations and 2 variables

39.61/4.15

10000000 rows and 2 columns

15.96/5.48

STRIP Purpose: To strip leading or trailing blanks from character string. data strip_out(keep=result); set TenMillonRecs; if name= 'Anjamvfk A Zmqzneup' then name1 = ' Anjamvfk'; result = strip(name1) ; where name ='Anjamvfk A Zmqzneup'; run; Proc SQL is omitted. RIGHT Purpose: Align right side of character string. data Right_out(keep= name1 result); set TenMillonRecs; if name= 'Anjamvfk A Zmqzneup' then name1 = 'Anjamvf '; result = right(name1) ; where name ='Anjamvfk A Zmqzneup'; run; Proc SQL is omitted. LEFT Purpose: Align left side of character string. data Left_out(keep= name1 result); set TenMillonRecs; if name= 'Anjamvfk A Zmqzneup' then name1 = ' Anjamvf'; result = left(name1) ; where name ='Anjamvfk A Zmqzneup'; run; proc print data =left_out; run; Proc SQL is omitted.

Anjamvfk

0.87/0.87

Anjamvfk

Anjamvfk

0.89/0.90

Anjamvfk Anjamvfk

0.87/0.87

NESUG 2011

Coders' Corner

COMPRESS Purpose: Remove specified character value (including blanks) from a string. data compress_out; set TenMillonRecs; name= compress(name); run; proc sql; create table compress_tbl as select compress(name) as name,ssn from TenMillonRecs; quit; 10000000 observations and 2 variables

9.59/4.82

10000000 rows and 2 columns

47.37/6.12

COMPBL Purpose: Replace two or more blanks with single blank. data Compbl_out; length result $100.; set TenMillonRecs; result= compbl(name); where name ='Anjamvfk run;

Anjamvfk A Zmqzneup A Zmqzneup';

0.90/0.89

proc sql; create table Compbl_tbl as select compbl(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; UPCASE Purpose: Convert all letters to upper case. data Upcause_out(keep=result); set TenMillonRecs; result= upcase(name); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Upcause_tbl as select upcase(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; LOWCASE Purpose: Converts all the letters to lower case. data Lowcause_out(keep=result); set TenMillonRecs; result= lowcase(name); where name ='Anjamvfk A Zmqzneup'; run;

Anjamvfk A Zmqzneup

1.29/0.89

ANJAMVFK ZMQZNEUP

0.87/0.87

ANJAMVFK ZMQZNEUP

0.87/0.87

anjamvfk zmqzneup

0.86/0.86

NESUG 2011

Coders' Corner

proc sql; create table Lowcause_tbl as select lowcase(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;

anjamvfk zmqzneup

0.89/0.89

PROPCASE Purpose: Capitalize the first letter of each word in a string. data Propcase_out(keep=result); set TenMillonRecs; result= propcase(name); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Propcase_tbl as select propcase(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; ANYSPACE Purpose: To locate first occurrence of white space. data Anyspace_out (keep=result); set TenMillonRecs; result = anyspace(name) ; where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Anyspace_tbl as select anyspace(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; FIRST Purpose: Extracts the first character from a string. data First_out(keep=result); set TenMillonRecs; result= first(name); where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table First_tbl as select first(name) as name,ssn from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;

Anjamvfk Zmqzneup

0.92/0.89

Anjamvfk Zmqzneup

0.89/0.87

0.87/0.87

1.34/0.92

0.90/0.89

0.89/0.89

NESUG 2011

Coders' Corner

TRANWRD Purpose: Substitute one or more words in a string. data Tranwrd_out (keep=result); set TenMillonRecs; result = tranwrd(name,'Anjamvfk A Zmqzneup','Anjan Matlapudi') ; where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Tranwrd_tbl as select tranwrd(name,'Anjamvfk A Zmqzneup','Anjan Matlapudi') as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;

Anjan Matlapudi

2.32/0.96

Anjan Matlapud

2.42/0.96

LENGTH Purpose: Determine length of a character value (not counting trailing blanks). data Length_out; set TenMillonRecs; result = length(name) ; where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Length_tbl as select name,ssn,length(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; REVERSE Purpose: Reverse the order of character string. data Reverse_out(keep=result); set TenMillonRecs ; result = reverse(name) ; where name ='Anjamvfk A Zmqzneup'; run; proc sql; create table Reverse_tbl as select reverse(name) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;

Anjamvfk Zmqzneup 440019853

A 2.33/0.89 21

Anjamvfk Zmqzneup 440019853

A 21 2.29/0.89

puenzqmZ A kfvmajnA

0.83/0.84

puenzqmZ A kfvmajnA

0.89/0.87

NESUG 2011

Coders' Corner

REPEATE Purpose: Make several copies of a string. Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup Anjamvfk Zmqzneup A A A A A A A A 0.87/0.87 0.95/0.85

data Repeat_out(keep=result); set TenMillonRecs ; result = repeat(name,3) ; where name ='Anjamvfk A Zmqzneup'; run

proc sql; create table Repeat_tbl as select repeat(name,3) as result from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit;

SPEDIS Purpose: Computes spelling distance between words. data Spedis_out; set TenMillonRecs; if name = 'Anjamvfk A Zmqzneup' then Name = 'Knowledge'; if spedis(name,'nowledge' )le 27 then output; run; Proc SQL is omitted LAG Purpose: To obtain previous value from the current character variable. data Lag_out(keep=result); set TenMillonRecs; result = lag(name) ; run; Note: LAG does not work in Proc SQL. COUNT Purpose: Counts number of times in a given substring in a string. data Count_out(keep=result); length name $100.; set TenMillonRecs; if name = 'Anjamvfk A Zmqzneup' then name = 'Random generated Random Name in Random data'; result = count(Name,'Random') ; where name ='Anjamvfk A Zmqzneup';

Knowledge 440019853

35.27/30.79

10000000 observations and 1 variables

13.73/2.87

087/0.86 3

NESUG 2011

Coders' Corner

run; Proc SQL is omitted. CHOOSEC Purpose: Returns a character value that represents the results of choosing from a list of arguments. data Choosec_out(keep=result); set TenMillonRecs; if name= 'Anjamvfk A Zmqzneup' then name1 = 'Anjan'; else if name = 'Kxlgoqhma A Pcotljsnz' then name2 = 'Matlapudi'; else if name = 'Taosezswy A Gzlhvahdb' then name3 = 'Anjan Matlapudi'; Anjan Matlapudi 7.01/1.23 result = Choosec(3,name1,name2,name3) ; where name in('Anjamvfk A Zmqzneup', 'Kxlgoqhma A Pcotljsnz','Taosezswy A Gzlhvahdb'); run; Proc SQL is omitted.

PERL REGULAR EXPERSION:


Perl Regular Expression in SAS has wide variety of functionality while working with matching patterns, text manipulation including validation and text replacement. Each variable holds 1-32676 bytes long as a charter string. I have tested some of the character string functions using 32676 bytes long character variable and I can able to successfully returned values of each function (data not included). PRXPARSE and other functions have great deal while we are working with text manipulation. Bellow mentioned prx functions show how to retrieve records in DATA step and PROC step.

FUNCTION

RETURNS

REALTIME/CPU

PRXMATCH Purpose: Searches for a pattern match and returns the position at which the pattern is found. data PrxMatch_out; set TenMillonRecs ; if prxmatch("m/Anjamvfk/oi",name)> 0 then found =1; else found =0; run; Proc sql; create table PrxMatch_tbl as select name,ssn, prxmatch("m/Anjamvfk/oi",name) as found from TenMillonRecs; quit; 10000000 observations with found =1 record and remaining observation found = 0

36.37/0.79

10000000 rows with found =1row and remaining found = 0

37.02/6.26

10

NESUG 2011

Coders' Corner

PRXPARSE AND PRXSUBSTR Purpose: Perl regular expression (PRX) can be used for substring of character string matching using PRXPARSE. data PrxSubstr_out; set TenMillonRecs; if _n_=1 then do; retain re; re = prxparse('/\w \w.+/'); if missing(re) then do; stop; end; end; call prxsubstr(re,name,start,length); FirstName = substrn(name,start -11, length -4); LastName = substrn(name,start+2, length +1); if start > 0 then output; run;

Anjamvfk A Zmqzneup 440019853 1 12 14 Anjamvfk Zmqzneup With 10000000 observations

58.61/11.90

RECORD LOOKUP BY CHARACTER STRING vs. NUNBER FIELD:


I further demonstrated records retrieval using character and number fields as mentioned bellow. Real time and CPU time is noted to read one record out of 10 million.

VARIABLE
CHARACTER proc print data = TenMillonRecs; where name ='Anjamvfk A Zmqzneup'; run;

RETURNS

REALTIME/CPU

Anjamvfk 440019853

A Zmqzneup 0.95/0.90

proc sql; select * from TenMillonRecs where name ='Anjamvfk A Zmqzneup'; quit; NUMERIC proc print data = TenMillonRecs; where ssn =440019853; run; proc sql; select * from TenMillonRecs where ssn =440019853; quit;

Anjamvfk 440019853

A Zmqzneup

0.96/0.89

Anjamvfk 440019853

A Zmqzneup

0.92/0.84

Anjamvfk 440019853

A Zmqzneup

0.93/0.84

11

NESUG 2011

Coders' Corner

CONCLUSION
So far I have demonstrated possible ways for a record look up using DATA step, PROC SQL and Perl regular expression. I further generalize some tips using these functions for record look If you already know SQL, you will be pleased to know that you can use most of the functions in PORC SQL to create, read and modify variables in SAS data sets. Flexibility of character string functions available while we are working with small to a large scale data. We can use all these functions in DATA step and PROC SQL. All these functions would also work in Oracle, SQL server and Access database with slight syntax modification in code. Real time processing will give best suitable option to choose some of the functions. Based on the above queries, data step processing has been taken less time when compare PROC step.

By now you may have some clue that all most all the functions work in DATA step and PROC SQL and if you are dealing with large scale data, you can easily pick up best suitable function in terms of time taken to run each function. Also, I have included most commonly used function at one place; some of you may take advantage instead of spending more time finding them.

REFERENCES
Ronald Cody, An Introduction to SAS Character Functions. Coders Corner Paper 217-2007. Paul A. Choate Quick Record Lookup without an index. Tutorials, Paper 031-2007. David L. Cassell. The Basics of the PRX Functions SAS, Coders Corner Paper. Paper 223-2007.

ACKNOWLEDGMENTS
I like to acknowledge Ian Whitlock for his valuable suggestions. I would also like to acknowledge Mr. Shimels Afework, Senior Director, PerformRx. PerformRx, LLC provides pharmacy benefit management (PBM) services through proactively managing escalating pharmacy costs while focusing on clinical improvement and financial results.

CONTACT INFORMATION:
Your comments and questions are valued and encouraged. Contact the author at Name Address Anjan Matlapudi Senior Pharmacy Analyst PerformRx, The Next Generation PBM 200 Stevens Drive Philadelphia, PA 19113 (215)937-7252 (215)863-5100 anjan.matlapudi@performrx.com anjanmat@gmail.com

Work Phone: Fax: E-mail:

SAS is a registered trademark or trademark of SAS Institute, Inc. in the USA and other countries. IBM, OS/390, and MVS are registered trademarks of International Business Machines Inc.

12

You might also like