De-dupe your data files

In this article I will show you one way that you could use to de-dupe your data collection datafiles. I am not sure how you would get duplicates in it in the first place , but let’s just assume you do have them. This example will use the Museum.ddf file in the DDL. This file & the ddf can be found here “.DDLDataData Collection File”

Ok , so what we need is a DMS script and some data. As mentioned we are going to use the museum.ddf file , and as an example we will check it against itself , so basically all records will be marked up as duplicates. The first thing in our DMS file we will add is some metadata.

Metadata(en-AU, Question, label)

sysDuplicate "" categorical [1] {
  Yes,No
};

End Metadata

The above code will add a question to our input file called sysDuplicate and this new question will either have a value of yes or no. depending on if the record is a duplicate or not.

Next we will have an input section.

InputDatasource(Input, "")
    ConnectionString = "Provider=mrOleDB.Provider.2;Data Source=mrDataFileDsc;Location=Museum.ddf;Initial Catalog=C:Program FilesIBMSPSSDataCollection6DDLDataData Collection FileMuseum.mdd" 
    SelectQuery = "SELECT * FROM VDATA"
    UseInputAsOutput = True 
End InputDatasource

As you can see the input section reads the museum.ddf file found in the DDL and we use the keyword “UseInputAsOutput” also. What this means is that any changes to the records will be made to the input file not a new output file. You could of course not use this option and have an OutputDataSource.

Now that we have our input , we need to have the onNextCase event to do the duplicate checking.

Event(OnNextCase, "")

Dim sSQL,sSQLTest
Dim oConnection , oRecordset,iCount

sSQL = "SELECT RESPONDENT.SERIAL FROM VDATA WHERE age = [age] and gender = [gender]"

Set oConnection = CreateObject("ADODB.Connection")

oConnection.ConnectionString = "Provider=mrOleDB.Provider.2;Data Source=mrDataFileDsc;Location=C:Program FilesIBMSPSSDataCollection6DDLDataData Collection FileMuseum1.ddf;Initial Catalog=C:Program FilesIBMSPSSDataCollection6DDLDataData Collection FileMuseum1.mdd"
oConnection.Open()
If oConnection.State = 1 Then
	Set oRecordset = CreateObject("ADODB.Recordset")
		sSQLTest = replace(sSQL,"[age]",age)
		sSQLTest = replace(sSQLTest,"[gender]",gender)
		
		oRecordset.Open(sSQLTest , oConnection,3,1)

		If oRecordset.EOF and oRecordset.BOF Then
			sysDuplicate = {No}
		Else
			sysDuplicate = {Yes}
		End If
		
	Set oRecordset = null
End If
oConnection.Close()
Set oConnection = null

End Event

if you have been a longtime follower of this site you will have seen this type of code before. Basicaly it is an ODBC connection to the file that we are comparing against. This file is then scanned to see if any records are returned that match our duplicate criteria and if they do , we mark the current record up as a duplicate.

Once we have made the connection , the key code is really these two lines.

sSQLTest = replace(sSQL,"[age]",age)
sSQLTest = replace(sSQLTest,"[gender]",gender)

These two lines take the sqltemplate string that we defined , take the values of the current record and replace them. So for example if the current record that the dms script was on have values of 10 and {male} then we would end up with a sql string that looked like this.

sSQL = "SELECT RESPONDENT.SERIAL FROM VDATA WHERE age = 10 and gender = {male}"

So , hopefully this script will be a start for someone to de-duplicate their data files if they need to. If you have a similar requirement and you need some help, just drop me a comment on this post and I will try and help you.

Leave a Comment