We get invited to solve some very complex data problems, and what we have found over the years is that when you take into account the exact nature of the data and the context it is used in, you can generate much better results.
Generic matching engines are good for 80% of the duplicates or matches you’re seeking; the remaining 20% need a lot more effort.
(I’m using the convention that duplicates refer to records matched within a single data set and matches refer to records matched across multiple data sets.)
For many people, finding 80% is good enough and they can live with the remaining 20% of duplicates. For others, finding every duplicate or match is critical, and in that case you need an in-depth technique to reach 95% or above. If you want to reach 98% or more, then you’ll need more techniques still.
The techniques you need depend on how many duplicates you can tolerate, and they vary with the type, volume and quality of the data.
Here are some examples of the different types of matching you can perform:
The above assumes all the information is within one field, but the information can be distributed across multiple fields. For example, here we are matching a company using its name, address and phone number. The address is different because the company has moved location, but the phone number helps to validate the match.
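A minimal Python sketch of this kind of multi-field matching is shown here. It is an illustration, not our production method: the field names, weights, threshold and sample records are all hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for whatever similarity measure suits your data.

```python
import re
from difflib import SequenceMatcher

# Hypothetical weights and threshold, chosen for illustration only.
WEIGHTS = {"name": 0.5, "address": 0.2, "phone": 0.3}
THRESHOLD = 0.7


def normalise_phone(phone: str) -> str:
    """Keep digits only so formatting differences do not block a match."""
    return re.sub(r"\D", "", phone)


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Combine per-field evidence into a single weighted score."""
    name = similarity(rec_a["name"], rec_b["name"])
    address = similarity(rec_a["address"], rec_b["address"])
    same_phone = normalise_phone(rec_a["phone"]) == normalise_phone(rec_b["phone"])
    phone = 1.0 if same_phone else 0.0
    return (WEIGHTS["name"] * name
            + WEIGHTS["address"] * address
            + WEIGHTS["phone"] * phone)


def is_match(rec_a: dict, rec_b: dict) -> bool:
    return match_score(rec_a, rec_b) >= THRESHOLD


# Hypothetical records: same company, new address, same phone number.
old = {"name": "Acme Trading Ltd",
       "address": "12 High Street, Leeds",
       "phone": "0113 496 0000"}
new = {"name": "ACME Trading Limited",
       "address": "Unit 4, Park Road, York",
       "phone": "(0113) 4960000"}

print(is_match(old, new))
```

With these assumed weights, the similar name alone does not reach the threshold once the address has changed; the identical phone number supplies the extra evidence that tips the pair into a match.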
On many occasions we see data in the wrong fields. In these cases, you must pre-process and cleanse the data before matching. Here is an example:
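As a rough illustration of this kind of cleansing, the Python sketch below moves an email address or phone number that has ended up in a name or address field into its proper field. The field names, the sample record and the deliberately loose regular expressions are assumptions made for the sketch; real data usually needs stricter, locale-aware rules.

```python
import re

# Loose patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")


def cleanse(record: dict) -> dict:
    """Return a copy with emails and phone numbers moved out of free-text fields."""
    cleaned = dict(record)
    for field in ("name", "address"):
        value = cleaned.get(field, "")
        for target, pattern in (("email", EMAIL_RE), ("phone", PHONE_RE)):
            found = pattern.search(value)
            # Only move the value when the proper field is empty,
            # so existing data is never overwritten.
            if found and not cleaned.get(target):
                cleaned[target] = found.group().strip()
                value = value.replace(found.group(), " ")
        # Tidy up whitespace and stray separators left behind.
        cleaned[field] = re.sub(r"\s+", " ", value).strip(" ,;")
    return cleaned


# Hypothetical record with a phone number typed into the address field.
raw = {"name": "Acme Trading Ltd",
       "address": "12 High Street, Leeds 0113 496 0000",
       "phone": "",
       "email": ""}

print(cleanse(raw))
```

Running the cleanse step before matching means the phone number can be compared field to field instead of being lost inside the address text.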
Here is a more difficult matching problem. We want to classify these product names as the same entity for reporting purposes:
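One way to approach this kind of problem, sketched in Python below, is to reduce each product name to a canonical key and group names that share a key. The product names and the synonym table are hypothetical, and in practice the synonym table would be built from the variants actually seen in your data.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: maps variants to one canonical token.
SYNONYMS = {"ltr": "l", "litre": "l", "litres": "l", "liter": "l", "btl": "bottle"}


def canonical_key(name: str) -> str:
    """Reduce a product name to a key that ignores case, punctuation,
    unit spelling and word order."""
    text = name.lower()
    # Split tokens such as "2l" or "330ml" into a number and a unit.
    text = re.sub(r"(\d)([a-z])", r"\1 \2", text)
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", text)
    tokens = [SYNONYMS.get(token, token) for token in tokens]
    # Sorting makes the key independent of word order.
    return " ".join(sorted(set(tokens)))


def group_products(names):
    """Group product names that share the same canonical key."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical_key(name)].append(name)
    return dict(groups)


# Hypothetical product names: the first three describe the same product.
names = ["Cola 2L Bottle", "COLA BOTTLE 2 LTR", "cola-2 litre btl", "Cola 330ml Can"]

for key, members in group_products(names).items():
    print(key, "->", members)
```

The design choice here is to encode the knowledge a person uses when reading the names (that "ltr" and "litre" mean the same thing, that word order is irrelevant) as explicit rules, so the grouping can be inspected and extended.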
There are many techniques for matching different types of data with different variations in quality. Our philosophy has been that if you can eye-ball a match, then you should be able to program it. When we eye-ball a match, we are using information that we are aware of to decide; our aim is to provide the matching software with all of that information so it can make the same decision. If this can be done successfully, then you can have a very accurate and exhaustive matching algorithm.
If you have a difficult matching problem, then contact us and we’ll see how we can help. Remember, if you can eye-ball it then there’s a good chance it can be automated.