Unsupervised Duplicate Detection (UDD) of Query Results from Multiple Web Databases
The last decade has seen exponential growth of data in both the public and private domains. This thesis presents an unsupervised mechanism for identifying duplicate records that refer to the same real-world entity. Because the algorithm is unsupervised, no manual labeling of training data is needed. The thesis extends this approach with an additional classifier, known as the blocking classifier. Various experiments are conducted on a dataset to verify the effectiveness of the unsupervised algorithm in general and of the additional blocking classifier in particular.