There are plenty of ways to remove duplicates using SAS. Here I show how I’ve done it when I was working with the Unemployment Insurance administrative registers of the Ministry of Economy in Brazil.
First things first, I used SAS version 9.3 to this project.
SAS version: 9.3
Create a token variable
The first step is to create a token variable equals to 1.
data your_data;
set your_data;
token = 1;
run;
Count duplicates
The second step is to count the number of duplicate records in your data. The way I do this is using PROC MEANS
to sum the token variable when a duplicate occurs.
PROC MEANS data = your_data noprint;
class column1 column2 column3 ... columnN /missing; /* use all the columns you want to check for duplicate records */
types column1*column2*column3*...*columnN; /* repeat the columns above here, but separated with an asterisk instead of a single space */
output out= new_data (drop=_type_ _freq_)
sum(token) = repetition; /* Sum the number of records with the same key */
run;
Then, you could either create a table with only duplicate records or a table without duplicate records.
Delete non-duplicates
If you want to analyse the duplicate records, you could work on a new table with duplicate records only. Here I will show one way to achieve this.
DATA duplicate_records;
set new_data;
if repetition = 1 then delete;
;
Delete duplicates
Maybe the fastest way to delete your duplicate records from your data is using PROC SQL
.
PROC SQL;
delete from work.new_data /* your new_data will be generated in the work folder if not told otherwise */
where repetition ge 2;
quit;