In certain systems we have to deal with lots of low-quality data, containing some typos, duplicate entries, malformatted or missing fields, erroneous or inaccurate bits of information.
The reason for it may lay in the specifics of the data source:
careless humans using touch-screens,
keyboard cats and internet trolls,
cheap and/or faulty sensors,
underpaid or lazy staff,
primitive web crawlers, etc.
In some cases the data comes from multiple external providers, each using slightly different schema, and ends up dumped in a common database. The chance is that these providers share a common subset of entries, but for some reasons it’s difficult to uniquely identify an entry, so that we know which of them are likely to be duplicates.
This kind of datasets often contain vast numbers of similar entries. If not filtered out properly, these affect the quality of service delivered by the system. Better to get rid of those soon!
The most basic concept is as follows.
Take a new entry.
Ask elasticsearch if it can recognize something like the given entry within the index (e.g. fuzzy-search)
No: it’s a unique entry – add to the elasticsearch index and go to #1
Yes, but the similarity is way too low1 – it’s a unique entry – add to the elasticsearch index under a new group id and go to #1
Yes and the similarity is rather fair – it’s likely to be a duplicate. Take the closest one,combine2 the new entry with our finding, put it back to elasticsearch index under the same group id. Go to #1.
Ad. 1 How do I tell if similarity is sufficient? In the simplest solution, it’s a matter of putting a fixed score threshold, based on a series of test results. More advanced use cases might require more sublime techniques.
Ad. 2 Combine entries the easiest way – join their properties as they are. Here we benefit from the fact that groups which contain multiple occurrences of keywords that we search for, are ranked higher in the overall search results.
What kind of query types may come in handy in finding the most effective dedupe engine? All of them are thoroughly described in the reference, however most likely you’ll need to try these out:
If possible, it’s recommended to normalize the entries up-front before the processing (phone numbers, zip codes, etc.).
If applicable, make use of geocoding services. External services like Google Maps might prove helpful.
Deduplicating large datasets with this kind of engine may appear inefficient. To address this issue, apart from scaling elasticsearch and fine-tuning queries, one can think of parallelizing the process.
Although it looks pretty naive in this form, it definitely can do the job. With very little effort, one can achieve really encouraging results. And what do exactly encouraging results mean? Well, it’s all relative, all depends on your specific requirements, e.g. whether some level of chance of getting a false negative is acceptable. Just to give some reference point: 95+% accuracy (with barely any false negatives) is feasible.
For the sake of this PoC, I’ve prepared a set of 200 entries: 100 unique & 100 duplicates, along with an extra information about the duplicate groups to be able to measure the quality of experiment results. It’s obviously a modest and artificial dataset, but it suits as a model for predicting deduplication quality of much larger target datasets. But it’s pretty much the only way to work with it – try it out on a small test dataset, and project the results for large production data.
Facing elasticsearch problems? Drop us a line!
We just sent you an email. Please click the link in the email to confirm your subscription!