Before you use your data set you must be sure that each record in your data has a unique identifier (unique ID).
Why? The best reason to assign unique IDs is for tracking purposes. Trust me, you will run into a situation where you need to backtrack. There are many reasons, but here are two:
A unique identifier (ID) can be anything, as long as no two records in your data set share the same value. Typically a unique identifier is numeric, but it doesn’t have to be. If your data consists of survey respondents, you could use each respondent’s email address (assuming respondents haven’t shared an email address!). Or maybe you’d like to use alphanumeric IDs. A way to determine what would work for your data is to ask, “What is the best unique information I have about the records in my data set?” You can then think about ways to fashion a nice unique ID for each record.
OK, maybe you aren’t sure how to fashion unique ID numbers for your records, or maybe you don’t really care about the specifics of the unique ID, just that each record has one. Here are some tips on creating unique IDs using SPSS software. I hope you find them useful!
Let’s assume that your data has the typical structure, with one record per row, and you want to make a variable (a column) that gives each record (row) a unique identification number (ID).
Here is some SPSS syntax to use to give a unique ID to each record in the dataset. The unique ID number will be the same as the row number of the record. SPSS refers to the row number as the “case number”.
COMPUTE ID = $casenum.
FORMAT ID (F8.0).
EXECUTE.
The FORMAT command tells SPSS to display the variable you are creating (ID) with a width of 8 and 0 decimal places. IDs don’t really need decimal places, but if you want them, just change the 0 to whatever number you want.
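If you would rather use something you already have as the ID (an email address, say), it is worth confirming that it really is unique before relying on it. Here is a minimal sketch of one way to check, using AGGREGATE. It assumes the variable to check is called ID, and n_same_id is just a made-up name for the new count variable, so swap in your own names.

```spss
* Count how many records share each value of ID.
* n_same_id is a hypothetical name for the new count variable.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=ID
  /n_same_id=N.
* If every ID is unique, the only value in this table is 1.
FREQUENCIES VARIABLES=n_same_id.
```

Any record where n_same_id is greater than 1 shares its ID with at least one other record.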
OK, easy peasy! But what if you have a dataset in long format, with repeated measurements for some folks?
I recently worked on a data set that included 3 years of data, but some participants were only in 1 year, some were in 2 years, and some others were in all 3 years.
Every person had their own unique ID number, but the numbers were very long and it was hard to see the matches. I also wanted to see how many participants were in only 1 year, in 2 years, or in all 3 years.
Another time you may want to do this is when you want to protect the confidentiality of the participants. If the IDs are traceable to participant records, such as patient numbers in hospital records or student numbers in school records (or, hey, if the participant names are in the dataset!), then you should create a new number that can’t be used to match those records to outside sources.
So I needed unique ID numbers to replace the current ID numbers, and I needed a coding sequence to handle a person having more than one measurement time. I also wanted to know how many people were represented once, twice, or three times.
I used these steps:
1. First, sort cases by the name or ID of the individuals. In the syntax here, that variable is called current_id.
SORT CASES BY current_id(A).
2. Under the “Data” pull-down menu, choose “Identify Duplicate Cases”.
— Define matching cases by: Move over the variable that is currently the ID into the box.
— In the “Variables to Create” box, check the “Indicator of primary cases” box and specify “First Case in Each Group is primary”. You will see a variable name such as “PrimaryFirst” in the box to the right. You can change the name to whatever you want, but I usually leave it as is.
— Also check the box by “Sequential count of matching”. This will count the number of instances each participant is represented.
— Uncheck the “Move Matching cases to the top of the file” box. I don’t want to move anything around just yet.
— But do check the “Display frequencies for created variables” box.
3. Click “OK”.
4. You should now have two new variables at the right of the data set, (a) PrimaryFirst and (b) MatchSequence, along with some output showing the frequencies of the matches.
5. Now we will use the PrimaryFirst variable we just created and some syntax to assign the unique IDs. The first record for each participant gets its case number as the ID, and every repeated record copies the ID from the record above it.
BUT you will have some gaps in the numbering sequence, because repeated records use up case numbers without getting a new ID. For instance, you most likely won’t have IDs that run 1, 2, 3, 4, 5, and so on. Instead they will be something like 1, 3, 4, 6, etc. Still UNIQUE for each participant though, and that is what we want!
Here is the syntax to run:
* The file must still be sorted by current_id, with the primary record first in each group.
DO IF (PrimaryFirst EQ 1).
COMPUTE ID = $casenum.
ELSE.
COMPUTE ID = LAG(ID).
END IF.
EXECUTE.
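If the gaps bother you, one possible alternative is a running counter that only goes up when a new participant starts. This is a sketch rather than the method I used above. It assumes the file is still sorted by current_id, that PrimaryFirst is 1 for the first record of each participant and 0 otherwise, and that ID_seq (a name I made up) does not already exist in the file.

```spss
* ID_seq is a hypothetical name for a new, gap-free participant number.
* LEAVE keeps the value from one case to the next and starts it at 0.
COMPUTE ID_seq = ID_seq + PrimaryFirst.
LEAVE ID_seq.
FORMATS ID_seq (F8.0).
EXECUTE.
```

Because the counter increases by 1 at each primary record and stays put on repeated records, every participant gets one number and the numbers run 1, 2, 3, and so on. Run it only once, on a new variable name, since the running total depends on ID_seq starting from scratch.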
There should now be a variable called “ID” with a unique number given to each participant. Make a back-up of this file and don’t touch it any more. Then save a second copy as a working file, and delete the “traceable” identifiers you don’t want from your working file.
This working file will be used for all analyses and sent to whoever needs to see it, and you have now, hopefully, protected some of the confidentiality of the records.
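The back-up and working copies can also be made with syntax, which leaves a record of exactly what was dropped. A minimal sketch follows. The file names are placeholders I invented, and current_id stands in for whichever traceable variables you want to leave out.

```spss
* File names below are hypothetical, so replace them with your own.
* Back-up copy that keeps both the old and the new identifiers.
SAVE OUTFILE='master_with_all_ids.sav'.
* Working copy without the traceable identifier.
SAVE OUTFILE='working_file.sav'
  /DROP=current_id.
```

SAVE writes the files but leaves the data you have open unchanged, so the open data still contains current_id. Open the working file itself before you start your analyses.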
Also, if you need to restructure your data from long to wide, you can use your new ID variable as the identifier and the MatchSequence variable as the index variable.
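For the long-to-wide step, a sketch with CASESTOVARS might look like this. It assumes the ID and MatchSequence variables created above are both in the file, and it sorts by ID first so that each participant’s records are together.

```spss
* Assumes ID and MatchSequence exist as created in the steps above.
SORT CASES BY ID MatchSequence.
CASESTOVARS
  /ID=ID
  /INDEX=MatchSequence.
```

The new wide columns are named after the values of the index variable, so it is worth running frequencies on MatchSequence first to see which values to expect. As far as I know, the Identify Duplicate Cases dialog codes participants with only one record as 0 rather than 1, so check that before relying on the column names.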
Now you’ll know who’s who and what’s what! 🙂