Chapter 2 Reading Data into SAS

 

Topics

 

2.1 Methods for Reading Your Data

2.2 Reading Data with the Viewtable Window

2.3 The Import Wizard

2.4 Telling SAS Where to Find Your Raw Data

2.5 Reading Space- or Column-Formatted Raw Data

2.6 Reading Raw Data Not in Standard Format

2.7 Mixing Input Styles

2.8 Reading Messy Raw Data

2.9 Reading Multiple Lines of Raw Data per Observation

2.10 Reading Multiple Observations per Line of Raw Data

2.11 Reading Part of Raw Data

2.12 Options in the INFILE statement

2.13 Reading Delimited Files with the DATA Step

2.14 Reading Delimited Files with PROC IMPORT

2.15 Reading PC Files with PROC IMPORT

2.16 Reading PC Files with DDE

2.17 The LIBNAME Statement

2.18 Using Permanent SAS Data Sets by Direct Referencing

 

2.1 Methods for Reading Your Data

 

In general , there are 4 ways to read in data and create SAS data sets:

Reading data directly into SAS data sets

 

You can use the Viewtable window to enter your data in a tabular format. You can define variables and give them attributes such as name, length, and type (character or numeric). SAS Enterprise Guide software (for Windows only) also has a data entry window that is similar to the Viewtable window.

 

Reading data in from raw data files

 

There are two methods for reading raw data files:

Converting other software's data files into SAS data sets

 

You can use the IMPORT procedure and the Import Wizard to convert MS Excel and Access, Lotus and dBase files into SAS data sets if you have SAS/ACCESS for PC File Formats software installed. Otherwise, you can simply create your raw data files (particularly CSV files) in your application and read them with the DATA step or the IMPORT procedure. If you are working in the Windows operating environment, then you can use Dynamic Data Exchange (DDE) and the DATA step to convert data into SAS data sets.

 

Back to Top

 

2.2 Reading Data with the Viewtable Window

 

To open the Viewtable window, select Table Editor from the Tools menu. You can open the Column Attributes window by right-clicking on the header of a column to set up column attributes. When you finished entering or editing your new or existing data set, you can use it in programs. For example, if you saved your table in the SASUSER library and named it HELLO, you could print it with the following program:

 

PROC PRINT DATA=sasuser.hello;

RUN;

 

Back to Top

 

2.3 The Import Wizard

 

You can read a variety of data file types into SAS using the Import Wizard, which can read all types of delimited files including CSV files. To start the Import Wizard, select Import Data from the File menu.

 

Back to Top

 

2.4 Telling SAS Where to Find Your Raw Data

 

You will have the most flexibility to read the data using the DATA step if they are in raw data files. By the way, Spreadsheet files are examples of data files that are not raw data. You raw data may be either internal to your SAS program (namely, you type raw data directly in your SAS program), or in a separate file.

 

Internal raw data

 

Use the DATALINES (or CARDS) statement to indicate internal data. The DATALINES statement must be the last statement in the DATA step and all lines following the DATALINES statement are considered data until SAS encounters a semicolon. The semicolon can be on a line by itself or at the end of a SAS statement which follows the data lines. See section 1.4 for an example.

 

External raw data

 

Use the INFILE statement to tell SAS the filename and path of the external file containing the data. The INFILE statement follows the DATA statement and must precede the INPUT statement. For example,

 

Windows: INFILE 'c:\Temp\plants.dat';

UNIX: INFILE '/home/temp/plants.dat';

 

Back to Top

 

2.5 Reading Space- or Column-Formatted Raw Data

 

Space-Formatted

 

If the values in your raw data file are all separated by at least on space, then using list input to read the data may be appropriate. This method has some limitations. You must read all the data in a record---no skipping over unwanted values. Any missing data must be indicated with a period. Character data must be simple: no embedded spaces, and no values greater than 8 characters in length. (You can override this constraint using the LENGTH statement, which can change the length of character variables from the default of 8 to anything between 1 and 32,767.) If the data file contains dates or other values which need special treatment, then list input may not be appropriate.

 

To write an INPUT statement using list input, simply list the variable names after the INPUT keyword in the order they appear in the data file. Place a dollar sign ($) after a character variable name. Leave at least one space between names. For example,

 

INPUT Name $ Age Height;

 

Column-Formatted

 

Some raw data files do not have spaces (or other delimiters) between all the values or periods for missing data, so you can't use the list input to read them. But if each of the variable's values is always found in the same place in the data line then you can use column input as long as all the values are character or standard numeric (i.e., numerals, decimal points, plus and minus signs, and E for scientific notation). Note that numbers with embedded commas or dates are not standard.

 

With column input, the INPUT statement takes the following form: after the INPUT keyword, list the first variable's name. If it is character, place a $. After the $, or variable name if it is numeric, leave a space. Then list the column or range of columns for that variable. Repeat this for all the variables you want to read. For example,

 

INPUT Name $ 1-10 Age 11-13 Height 14-18;

 

Example:

 

Columbia Peaches      35    67   1   10    2   1

Plain Peanuts            210           2     5    0   2

Gilroy Garlics            151035   12 11    7   6

 

Back to Top

 

2.6 Reading Raw Data Not in Standard Format

 

Informats are useful when you have non-standard data. Numbers with embedded commas or dollar signs are examples of non-standard data. Dates are also non-standard. There are 3 general types of informats: character, numeric and date, which have the following general forms: $informatw. (character), informatw.d (numeric) and informatw. (date). w indicates the total width and d is the number of decimal places. Two informats do not have names: $w., which reads standard character data, and w.d, which reads standard numeric data.

 

For a formatted input, place informat after the variable name in the INPUT statement. For example,

 

INPUT Name $10. Age 3. Height 5.1 Birthday MMDDYY10. ;

 

The columns read for each variable are determined by the starting point and the width of the informat. SAS always starts with the first column; so the data values for the first variable, Name, are in columns 1 through 10. Now the starting point for the second variable is column 11, and SAS reads values for Age in columns 11 through 13, and so on. Note that the five columns for Height include the decimal place and the decimal point itself (e.g. 366.8).

 

Example:

 

Alicia Lee          13 c 10-28-2003 7.8 6.5

Elizabeth Garcia  9 d 10-30-2003 8.817.6

...

 

Use the following INPUT statement to read the data:

 

INPUT Name $16. Age 3. +1 Type $1. +1 Date MMDDYY10. (Score1 Score2) (4.1);

 

Note the +1 skips over one column. Score1 and Score2 both require the same informat, 4.1. You only have to list the informat once by putting the variables and the informat in separate sets of parentheses.

 

For definitions of more informats, see the SAS Help and Documentation.

 

Back to Top

 

2.7 Mixing Input Styles

 

We've mentioned three major input styles: list style, column style and formatted style. You can also mix these input styles for your convenience.

 

Example:

 

Yellowstone           ID/MT/WY 1872 *  4,065,493

Everglades            FL 1934 *        1,398,800

Great Smoky Mountains NC/TN 1926 *       760,917

 

Note: Asterisks are not part of data and used to mark the place where SAS starts reading the values of Acreage if the column pointer is removed from the INPUT statement below.

 

One way to read the above data is:

 

INPUT ParkName $ 1-22 State $ Year @40 Acreage COMMA9. ;

 

Note: SAS uses a column pointer to mark its place when it read a line for raw data. With list input, SAS scans to the next non-blank field and starts reading. With column input, SAS starts reading in the specified column. With formatted input, SAS starts reading right where the column pointer is. You can use the column pointer, @n, to move the pointer explicitly. You can use the pointer anytime you want SAS to skip backwards or forwards within a data line. For example, you could use it to skip over unwanted data, or to read a variable twice using different informats.

 

Back to Top

 

2.8 Reading Messy Raw Data

 

Sometimes the data don't line up in nice columns or have predictable lengths. Here are two more tools: @'character' column pointer and colon modifier.

 

@'character' column pointer

 

At times you don't know the starting column of the data so that you can't use @n column pointer, but you do know that the starting column always comes after a particular character or word. In this case, you can use the @'character' column pointer. For example, suppose the breed of the dog always follows the word Breed:. The statement

 

INPUT @'Breed:' DogBreed $;

 

reads the dog's breed.

 

Colon modifier

 

Use a colon modifier on the informat to read the data until it encounters a space. Simply put a colon (:) before the informat (e.g. :$20.). For example, in a Web log,

 

130.192.70.235 - - [08/Jun/2001:23:51:32 -0700] "GET /rover.jpg HTTP/1.1" 200 66820

128.32.236.8 - - [09/Jun/2001:00:26:66 -0700] "GET /grooming.html HTTP/1.0" 200 89

...

 

You want to read the data the files were accessed and the filename. Use the following statement:

 

INPUT @'[' AccessDate DATE11. @'GET' File :$20. ;

 

SAS uses @'[' to position the column pointer t read the date, then uses @'GET' to position the column pointer to read the filename. Because the filename is more than 8 characters, but not always the same number of characters, an informat with a colon modifier :$20. is used to read the filename.

 

Note: It's possible to define a variable's length in a LENGTH or INFORMAT statement (covered later) instead of in an INPUT statement. When a variable's length is defined before the INPUT statement, then SAS will read until it encounters a space or reaches the length of the variable---the same behavior as using the colon modifier.

 

Back to Top

 

2.9 Reading Multiple Lines of Raw Data per Observation

 

Sometimes each observation has more than one data lines. Since SAS will automatically go to the next line if it runs out of data before it has read all the variables in an INPUT statement, you could just let SAS figure out when to go to a new line. But SAS will generate notes in the log correspondingly:

 

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

 

To avoid these "bothersome" notes, you may wish to explicitly tell SAS when to go the next line by using the line pointers, slash (/) and pound-n (#n). A slash tells SAS to skip to the next line before reading the next variable and the #n tells SAS to move the the line you specify.

 

Example:

 

NOME AK

55 66

88 29

Miami FL

90 75

96 63

...

 

To read the above data, you could use:

 

INPUT City $ State $ / NormalHigh NormalLow #3 RecordHigh RecordLow;

 

The slash tells SAS to move the column 1 of the next line before reading NormalHigh and NormalLow. The #3 tells SAS to move to column 1 of the 3rd line of data FOR THAT OBSERVATION before reading RecordHigh and RecordLow. Of course, you could replace the slash with #2 or replace #3 with a slash.

 

Back to Top

 

2.10 Reading Multiple Observations per Line of Raw Data

 

You can use double trailing at signs (@@) at the end of the INPUT statement to tell SAS to read multiple observations per line. SAS will continue to read observations until it runs out of data.

 

For example:

 

Nome AK 3.7 18 Miami FL 8.63

16 Raleigh NC . 16

 

You can use the following statment:

 

INPUT City $ State $ NormalRain MeanDaysRain @@;

 

Back to Top

 

2.11 Reading Part of Raw Data

 

At times you only want to read a fraction of the records in a large data file. Instead of reading all the records and then throwing out the unneeded ones, which would waste time, you could read just enough variables to determine whether to keep the current observation, then end the INPUT statement with a trailing at (@). This tells SAS to hold the current line of raw data, while you can test the observation with an IF statement to see if it's one you want to keep. If it is, you can read data for the remaining variables with a second INPUT statement. With the trailing @, SAS will hold the line of data until it reaches the end of the DATA step, or an INPUT statement that does not end with a trailing @. Without the trailing @ SAS would automatically start reading the next line of raw data with each INPUT statement.

 

Example:

 

freeway 408                          3866 2938

surface Martin Luther King Jr. Blvd. 2863 1978

surface Broadway                     1256 1383

...

 

You could use the following statements:

 

INPUT Type $ @;

IF Type ='surface' THEN DELETE;

INPUT Name $ 9-38 AMTraffic PMTraffic;

 

The first INPUT statement reads the variable Type and ends with an @, which holds each line of data while the IF statement tests it. The second INPUT statement reads Name, AMTraffic, PMTraffic. If an observation does not pass the test (Type='freeway'), then the second INPUT statement won't execute. Instead, SAS returns to the beginning of the DATA step to process the next observation so that the unwanted observation won't be included.

 

@ vs. @@

 

Both are line-hold specifiers; the difference is how long they hold a line of data for input. @ holds a line of data for subsequent INPUT statements, but releases that line when SAS returns to the tope of the DATA step to begin building the next observation. @@ holds a line of data for subsequent INPUT statements even when SAS starts building a new observation. In both cases, the line of data is released if SAS reaches a subsequent INPUT statement that does not contain a line-hold specifier.

 

Back to Top

 

2.12 Options in the INFILE statement

 

FIRSTOBS=

 

This option tells SAS at what line to begin reading data. It's useful when the data file contains descriptive text or header information at the beginning.

 

OBS=

 

This option can be used anytime you want to read only a part of your data file. It can be used with the FIRSTOBS= option to read lines from the middle of the file. For example, with FIRSTOBS=3 and OBS=5, SAS will start reading the data on the third line and stop reading after the fifth line.

 

MISSOVER

 

By default, SAS will go to the next data line to read more data if SAS runs out of data and there are still more variables in the INPUT statement that have not been assigned values. The MISSOVER option tells SAS that if it reaches the end of the data line, don't go to the next data line. Instead, assign missing values to any remaining variables. For example,

 

Linda Hamilton    89 67 29

Jennifer Aniston  66 76 38 88 80

Meg Ryan          77 73 35 91

...

 

Use the following statements in a DATA step to read the data for the five test scores, assigning missing values to tests not completed:

 

INFILE 'c:\MyData\scores.dat' MISSOVER;

INPUT Name $ 1-17 test1 test2 test3 test4 test5;

 

TRUNCOVER

 

The TRUNCOVER option is necessary when you are reading data using column or formatted input and some data lines are shorter than others. If a variable's field extends past the end of the data line,  SAS will automatically go to the next line to start reading the variable's value.  This option tells SAS to read data for the variable until it reaches the end of the data line, or the last column specified in the informat or column range, whichever comes first. For example,

 

Jennifer Lopez   113 Sunset Ave.

Alicia Keys     1333 Pennsylvania Ave.

Jessica Simpson   63 76th St.

 

Use the following statements:

 

INFILE 'c:\MyData\Address.dat' TRUNCOVER;

INPUT Name $ 1-16 Number 17-20 Street $ 22-39;

 

Because some of the addresses stop before the end of the variable Street's field (columns 22-39), the TRUNCOVER option is required. Otherwise, SAS would try to go to the next line to read the data for Street on the first and third records.

 

TRUNCOVER vs. MISSOVER

 

Both assign missing values to variables if the data line ends before the variable's field starts. But when the data line ends in the middle of a variable's field, TRUNOVER will take as much as is there, whereas MISSOVER will assign the variable a missing value.

 

Back to Top

 

2.13 Reading Delimited Files with the DATA Step

 

Delimited files are raw data files that have a special character separating data values. Commas or tab characters are typical delimiters. There are two options for the INFILE statement.

 

DLM=

 

By default, the DATA step expects spaces between your data values. The DELIMITER=, or DLM=, option in the INFILE statement allows you read data files with other delimiters. Youd could read data files with any delimiter character (e.g. comma, tab and &) by enclosing the delimiter character in quotation marks (i.e., DLM=',' or DLM='&'). In particular, in ASCII, 09 is the hexadecimal equivalent of a tab character, and the notation '09'X means a hexadecimal 09. So you use DLM='09'X to specify the tab character as the delimiter.

 

By default, SAS interprets two or more delimiters in a row as a single delimiter. You will need the DSD option in the INFILE statement if you want two delimiters in a row to indicate a missing value.

 

DSD

 

The DSD (Delimiter-Sensitive Data) option for the INFILE statement does three things. First, it ignores delimiters in data values enclosed in quotation marks. Second, it does not read quotation marks as part of the data value. Third, it interprets two delilmiters in a row as a missing value. The DSD option assumes that the delimiter is a comma. If your delimiter is not a comma then you can use the DLM= option with the DSD option to specify the delimiter. For example, use the following statement to read a tab-delimited ASCII file with missing values indicated by two consecutive tab characters:

 

INFILE 'c:\MyData\flowers.dat' DLM='09'X DSD;

 

CSV files

 

Many programs (e.g. MS Excel) can save data in CSV format. These files have commas for delimiters and consecutive comas for missing values; if there are commas in any of the data values, then those values are enclosed in quotation marks. the DSD option can be used to read CSV files.

 

Example:

 

Liv Tyler, 12/3/2003, 45, 63, 70,

Nicole Kidman, 12/15/2003, 17, 28, 56, 12

"Beckinsale, Kate", 1/5/2004, 25, 62, 77, 91

Isabelle Adjani, 1/8/2004, 56, , 63, 98

 

Use the following statements:

 

INFILE 'c:\MyData\Stars.csv' DLM=',' DSD MISSOVER;

INPUT StarName :$20. Date :MMDDYY10. No1 No2 No3 No4;

 

When using the DSD option, it is prudent to add the MISSOVER option if there is any chance that you have missing data at the end of your data lines (as in the first line of this data file). Recall that the colon modifier tells SAS to read for the length of the informat, or until it reaches a delimiter, whichever comes first.

 

Back to Top

 

2.14 Reading Delimited Files with PROC IMPORT

 

The IMPORT procedure can also read delimited files. PROC IMPORT will scan your data file and automatically determine the variable types (character or numeric), will assign proper lengths to the character variables, and can recognize some date formats. PROC IMPORT will treat two consecutive delimiters as a missing value, will read values enclosed by quotation marks, and assign missing values to variables when it runs out of data on a line. Also, you can use the first line in your data file for the variable names. In fact, PROC IMPORT writes a DATA step for you, and you can find the DATA step it produced in the Log window after submitting your program.

 

The basic form of PROC IMPORT is

 

PROC IMPORT DATAFILE='filename' OUT=data-set;

 

where the DATAFILE= option specifies the file you want to read, and the OUT= option specifies the SAS data set you want to create. SAS will determine the file type by the extension of the file. If your file does not have the proper extension, or your file is of type DLM, then you must use the DBMS= option. The following table shows the DBMS identifiers corresponding to different file extensions:

 

Type of File                                              Extension                        DBMS Identifier

Comma-delimited                                     .csv                                 CSV

Tab-delimited                                           .txt                                  TAB

Delimiters other than commas or tabs                                              DLM

 

Use the REPLACE option if you overwrite an existing data set with the same name as you specified in the OUT= option. By default, PROC IMPORT will get variable names from the first line in your data file. Use the GETNAMES=NO statement to suppress this. PROC IMPORT will then assign the variables the names VAR1, VAR2, and so on. If your data file is of type DLM, PROC IMPORT assumes the delimiter is a space. If you have a different delimiter, then specify it in the DELIMITER= statement. The following shows an example:

 

PROC IMPORT DATAFILE='filename' OUT=data-set DBMS=DLM REPLACE;

  GETNAMES=NO;

  DELIMITER='delimiter-character';

RUN;

 

Back to Top

 

2.15 Reading PC Files with PROC IMPORT

 

If you have SAS/ACCESS for PC File Formats software, then you can use PROC IMPORT to import several types of PC files, such as MS Excel, Lotus, dBase and MS Access files. If you don't have this software, you can read PC files using Dynamic Data Exchange (DDE).

 

MS Excel, Lotus and dBase files

 

The general form of PROC IMPORT for reading PC files:

 

PROC IMPORT DATAFILE='filename' OUT=data-set DBMS=identifier REPLACE;

 

If your data file has the proper extension, as shown in the following table, then you may not need the DBMS= option. Of course, it doesn't hurt to specify it.

 

Type of File           Extension           DBMS identifier

MS Excel               .xls                     EXCEL

                                                         EXCEL4

                                                         EXCEL5                                                            

Lotus                      .wk4                   WK4

                               .wk3                  WK3

                               .wk1                  WK1

dBase                     .dbf                    DBF

 

By default, PROC IMPORT uses the first row of data for the variable names. If you don't want this, then you can use the GETNAMES=NO statement as in the previous section and SAS will name the variables F1, F2, and so on.

 

MS Access files

 

The general form:

 

PROC IMPORT DATABASE='database-path' DATATABLE='table-name' OUT=data-set DBMS=identifier REPLACE;

 

The DBMS identifiers for Access are the following:

 

Type of File            Extension             DBMS Identifier

MS Access             .mdb                     ACCESS

                                                            ACCESS97

 

Back to Top

 

2.16 Reading PC Files with DDE

 

Dynamic Data Exchange (DDE) can only be used in Windows, and the application (e.g. MS Excel) must be running while SAS is reading the file. DDE does not require any additional SAS products to be licensed. Look at the following three ways to access data through DDE:

Copying data to the clipboard

 

Copy the rows and columns that you want to read into SAS onto the clipboard, then use the similar statements in the following example:

 

* Read an Excel spreadsheet using DDE;

FILENAME baseball DDE 'CLIPBOARD';

DATA sales;

    INFILE baseball NOTAB DLM='09'X DSD MISSOVER;

    LENGTH VisitingTeam $ 20;

    INPUT VisitingTeam CSales BSales;

RUN;

 

The INFILE statement defines a fileref as type DDE and specifies that you want to read the contents of the clipboard. By default, DDE assumes there are spaces between your data values. So, if you have embedded spaces in your data (e.g. spaces in character data such as "Full Name"), then you will need the NOTAB and the DLM='09'x options in the INFILE statement. They tell SAS to put a tab character (NOTAB) between values and define the tab character as the delimiter (DLM='09'x). The DSD option interprets two consecutive delimiters as missing data and MISSOVER tells SAS not to go to the next line to continue reading data if it runs out of data on the current line.

 

Specify the DDE triplet

 

You can specify the triplet in the FILENAME statement if you don't want to cop the data to the clipboard beforehand and know the DDE triplet for the data you want to read. Each application has its own way of specifying a DDE triplet. In general, the DDE triplet takes on the following form:

 

application | topic ! item

 

Refer to the documentation for the application for specific information about DDE triplets. However, you can find out the DDE triplet from within SAS. First, copy the data you want onto the clipboard, then toggle to your SAS session. From the Solutions menu, select Accessories. The select DDE Triplet. A window will appear that will give the DDE triplet for the data you  copied to the clipboard. An example of the DDE triplet of an Excel spreadsheet is:

 

Excel | C:\MyFiles\[BaseBall.xls]sheet1!R2C1:R5C7

 

Note that "R2C1:R5C7" refers to the rows and columns on the spreadsheet. Then you could use the following FILENAME statement and the rest of the program is the same as the previous example:

 

FILENAME baseball DDE 'Excel | C:\MyFiles\[BaseBall.xls]sheet1!R2C1:R5C7';

 

Starting the PC application from SAS

 

Example code:

 

* Read an Excel spreadsheet using DDE;

OPTIONS NOXSYNC NOXWAIT;

X ' "C:\MyFiles\Base Ball.xls" ';

FILENAME baseball DDE 'Excel | C:\MyFiles\[BaseBall.xls]sheet1!R2C1:R5C7';

DATA sales;

    INFILE baseball NOTAB DLM='09'X DSD MISSOVER;

    LENGTH VisitingTeam $ 20;

    INPUT VisitingTeam CSales BSales;

RUN;

The NOXSYNC and NOXWAIT options tell SAS to return control back to SAS after executing the command and not to wait for input from the user. The X statement tells Windows to execute the program or open the file that follows in quotation marks. If you have embedded spaces in the path for your filename, then you need to enclose the filename in two sets of quotation marks, as shown in the example. With this method, you must specify the DDE triplet.

 

Back to Top

 

2.17 The LIBNAME Statement

 

You can use the LIBNAME statement to define a libref, which could also be done using the New Library window in SAS Explorer. The basic form of the LIBNAME statement is

 

LIBNAME libref 'your-SAS-data-library';

 

Windows: LIBNAME libref 'drive:\directory';

UNIX: LIBNAME libref '/home/path';

 

Back to Top

 

2.18 Using Permanent SAS Data Sets by Direct Referencing

 

You can use permanent SAS data sets by direct referencing without first setting up librefs. Simply take the operating environment's name for a file, enclose it in quotation marks and put it in the DATA statement. SAS actually defines a library and makes up a libref for you.

 

Windows: DATA 'drive:\directory\filename';

UNIX: DATA '/home/path/filename';

 

Back to Top

Back to the Listing of Lecture Notes