Robust Entity Deduplication Process Developed to Reduce Manual Effort for the Virginia Public Access Project
The Virginia Public Access Project (VPAP) provides political contribution information to the public. VPAP receives contribution records from the Virginia Board of Elections and screens each incoming record for matches against a database of past campaign contributors. The process was largely manual, relying on individuals with subject matter expertise to compare incoming records with those in the database. That manual effort was time-consuming and sometimes imprecise. VPAP required a system that could run daily during peak production periods to perform the matching work. UDig developed a deduplication system that uses machine learning to evaluate each new record and calculate the probability that it matches any other record in the dataset. More than 70 percent of the records are now matched without the need for human intervention, freeing up the analysts’ time to pursue tasks that require more critical thinking and add more value.
STRATEGIC SNAPSHOT
Challenge
Streamline and automate the process of reviewing data for duplicate entries to yield time savings and to improve accuracy.
Strategy
Apply text analytics to automatically detect potential duplicates with a high degree of precision and reduce the manual effort.
Outcome
Deduplication engine utilizing Natural Language Processing techniques to identify potential duplicate entities using a variety of data points and route those entities appropriately.
Most of the records are processed without the need for human intervention, freeing up the analyst’s time to pursue tasks that require more critical thinking and add more value.
Challenge
Clean data is a necessity for companies to get the most out of robust analytics. Bad data can lead to bad conclusions. When an organization seeks to become data-driven, it’s important that the data is trustworthy. Machine learning can enable organizations to increase the integrity of their data. A common issue for organizations is entity matching: being able to recognize the same company or person across records in internal data. For instance, when your analysts report that you received loan applications from 1,000 distinct individuals over the past quarter, how are they measuring distinct individuals? Do you have the capability to account for the one customer who submits 20 separate loan applications, or is there no indication that these applications are connected?
Most organizations attempt to address this problem with simple rules checking to see if the name and address match, but this can be messy and imprecise. What if they abbreviate parts of the address, use a nickname, or get married and change their last name? These are not edge cases. The methodology used to account for such cases can materially change the data. As a result, the handling of these cases can alter analysts’ conclusions, reports, and business strategy. Determining how to handle cases like this is an important question with far-reaching consequences.
Outcome
The table below illustrates the kind of issue that UDig was tasked with solving for VPAP. VPAP is an organization that “connects Virginians to nonpartisan information about Virginia politics in easily understood ways.” One way they do this is by tracking political contribution and expenditure data for Virginia politics. When Bob Frank from Norfolk, VA donates $100 to his favorite politician, they need to be able to say whether this is the same Robert Frank of Norfolk who donated $100 to the same politician in the prior year.
VPAP’s existing system was like that of many organizations: a rule-based system that checked for matching fields and accounted for a handful of commonly encountered nicknames and abbreviations. Looking at the example below, it’s clear that the process can quickly become messy. Should you create rules that explicitly check for every possible nickname, misspelling, and abbreviation? Clearly, this is not practical.
First Name | Last Name | Company Name | Industry | Address | Match |
---|---|---|---|---|---|
Jacob | Ferraiolo | ABC Consulting | Consulting | 1234 Sesame Street Henrico Va | 1 |
Jake | Ferraiolo | UDig | Consulting | 1234 Sesame St Va | 1 |
Jacob | Ferraiolo | UDig | Consulting | 1234 Sasame St Glen Allen, Virginia | 1 |
Jacob | Ferraiolo | UDig | Consulting | 3241 North St Norfolk, Va | 0 |
Luke | Ferraiolo | Joe’s Bar | Bartender | 1234 Sesame Street Henrico Va | 0 |
Sample data showing the slight differences between records that can make entity matching challenging. A Match value of 1 marks records that refer to the same person; 0 marks a different person.
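To make the brittleness of a rule-based approach concrete, here is a minimal sketch of an exact-match rule with small lookup tables. It is not VPAP's actual rule set: the `NICKNAMES` and `ABBREVIATIONS` tables, the function names, and the `covered_variant` record are assumptions made up for illustration, while the other two records come from the sample table.

```python
# Hypothetical lookup tables; a real rule set would need far more entries.
NICKNAMES = {"jake": "jacob", "bob": "robert"}
ABBREVIATIONS = {"st": "street", "va": "virginia"}


def normalize_name(name):
    """Lowercase a first name and expand it if it is a known nickname."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)


def normalize_address(address):
    """Lowercase an address, drop commas, and expand known abbreviations."""
    tokens = address.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def rule_based_match(a, b):
    """Return True only if every normalized field is exactly equal."""
    return (
        normalize_name(a["first"]) == normalize_name(b["first"])
        and a["last"].lower() == b["last"].lower()
        and normalize_address(a["address"]) == normalize_address(b["address"])
    )


# First two rows of the sample table.
row_1 = {"first": "Jacob", "last": "Ferraiolo",
         "address": "1234 Sesame Street Henrico Va"}
row_2 = {"first": "Jake", "last": "Ferraiolo",
         "address": "1234 Sesame St Va"}
# Hypothetical record that only uses variations the rules already know about.
covered_variant = {"first": "Jake", "last": "Ferraiolo",
                   "address": "1234 Sesame St Henrico, VA"}

print(rule_based_match(row_1, covered_variant))  # covered by the lookup tables
print(rule_based_match(row_1, row_2))            # missing city defeats the rule
```

The rule succeeds only when every variation has been anticipated. The second row of the table is labeled a match, yet the missing city name alone is enough to make the exact comparison fail.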
Looking at the above examples, a human analyst could tell you that “Jacob” and “Jake” are similar enough that they most likely refer to the same person. Conversely, “3241 North St Norfolk, Va” is nothing like “1234 Sesame Street Henrico Va”, so these are much less likely to be a match. A human analyst also knows that the importance of each field is not the same. It’s more important that a last name matches than a company name. These are all rules that humans intuitively know but that become cumbersome to program into a computer by hand. This is where machine learning comes into play; instead of explicitly programming all these rules, we can let the machine find these rules on its own from years of historical data.
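The intuition of "similar enough" can be approximated with a string similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library purely as an illustration; the case study does not say which similarity measure the production system used, and the `similarity` helper is a name introduced here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# First names from the sample table.
jake_score = similarity("Jacob", "Jake")
luke_score = similarity("Jacob", "Luke")

# Addresses from the sample table: a typo-ridden variant versus a
# genuinely different address.
reference = "1234 Sesame Street Henrico Va"
typo_score = similarity(reference, "1234 Sasame St Glen Allen, Virginia")
other_score = similarity(reference, "3241 North St Norfolk, Va")

print(f"Jacob vs Jake: {jake_score:.2f}")
print(f"Jacob vs Luke: {luke_score:.2f}")
print(f"Address with typo: {typo_score:.2f}")
print(f"Different address: {other_score:.2f}")
```

A plain character-level ratio ranks "Jake" closer to "Jacob" than "Luke", and the misspelled address closer than the Norfolk one, but the scores are graded rather than yes or no. Deciding how much weight each field's score deserves is the part that is left to the model.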
The model uses hundreds of decision trees that compare the similarity of individual fields to determine an overall confidence level. Working from similarity rather than exact equality lets it account for the abbreviations, typos, and nicknames that are common in such data without having to explicitly program millions of exceptions. If the last name, first name, and address all meet similarity thresholds, the system can safely assume the records are the same person and reach a determination.
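As a rough illustration of how many trees can combine field similarities into one confidence value, the sketch below lets a few hand-written threshold rules vote. This is a stand-in, not the trained model: in the real system the trees and their thresholds are learned from historical data, whereas every threshold in `TREES`, the `difflib`-based similarity measure, and the function names here are assumptions chosen for the example.

```python
from difflib import SequenceMatcher

FIELDS = ("first", "last", "address")

# Each "tree" is a hand-written stand-in for a learned decision tree:
# a list of (field, minimum similarity) tests that must all pass.
TREES = [
    [("last", 0.9), ("address", 0.5)],
    [("last", 0.8), ("first", 0.4)],
    [("first", 0.4), ("address", 0.5)],
    [("last", 0.9), ("first", 0.4), ("address", 0.4)],
]


def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def field_scores(a, b):
    """Compute one similarity score per compared field."""
    return {field: similarity(a[field], b[field]) for field in FIELDS}


def confidence(a, b):
    """Return the fraction of trees that vote 'same person'."""
    scores = field_scores(a, b)
    votes = [all(scores[field] >= minimum for field, minimum in tree)
             for tree in TREES]
    return sum(votes) / len(votes)


# Rows 1, 3, and 4 of the sample table.
row_1 = {"first": "Jacob", "last": "Ferraiolo",
         "address": "1234 Sesame Street Henrico Va"}
row_3 = {"first": "Jacob", "last": "Ferraiolo",
         "address": "1234 Sasame St Glen Allen, Virginia"}
row_4 = {"first": "Jacob", "last": "Ferraiolo",
         "address": "3241 North St Norfolk, Va"}

print(confidence(row_1, row_3))  # typo in the address, labeled a match
print(confidence(row_1, row_4))  # different address, labeled a non-match
```

Averaging votes is what turns many simple yes-or-no tests into a graded confidence level. A tree-ensemble library would learn both the tests and the thresholds from labeled record pairs instead of having them written by hand.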
VPAP now has a system that runs daily to perform the work previously handled manually by an analyst. It evaluates each new record and calculates the probability that this record matches any other record in the dataset. Records that the process is unsure about get flagged for human review, but most of the records are processed without the need for human intervention, which frees up the analyst’s time to pursue tasks that require more critical thinking and add more value.
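The routing step described above can be sketched as a pair of probability cutoffs. The cutoff values, the constant names, and the `route` function are hypothetical; the case study does not state the thresholds the production system uses.

```python
# Hypothetical cutoffs; real values would be tuned against reviewed records.
AUTO_MATCH_THRESHOLD = 0.90
AUTO_REJECT_THRESHOLD = 0.10


def route(match_probability):
    """Decide what happens to a record given its best match probability."""
    if match_probability >= AUTO_MATCH_THRESHOLD:
        return "match"          # confident duplicate: link to existing entity
    if match_probability <= AUTO_REJECT_THRESHOLD:
        return "new_entity"     # confident non-match: create a new entity
    return "human_review"       # uncertain: flag for an analyst


for probability in (0.97, 0.55, 0.02):
    print(probability, route(probability))
```

Widening the gap between the two cutoffs sends more records to analysts and lowers the risk of an automatic mistake; narrowing it does the opposite.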
How We Did It
Tech Stack
- Python
- SQL Server