Pitching EntiPy
2024-03-02
Summary
What is it?
EntiPy is a Python library for linking data records that point to the same real-world thing. This is a surprisingly common issue in data.
EntiPy is licensed under AGPLv3, which means that any software that uses EntiPy to serve users, including over a network, must make its own source code available under a compatible open-source license. However, companies can purchase commercial licenses for non-open-source use cases.
Why me?
- I wrote a paper on this exact subject.
- I am my own customer. I had this exact problem in my consulting project. Twice, actually, in the same project. I wrote this library to solve my problem.
Why now?
- This problem is as old as data.
- Regulations will make this even more of a problem in the near future.
Details
Problem
Each row in a database should point to exactly one thing. The converse doesn't hold: in real-world data, a single thing can be represented by many rows. This is called data duplication.
On the one hand, simple cases where column values match exactly are trivial: just run `SELECT DISTINCT`, `FROM x JOIN y`, or `.drop_duplicates()`. On the other hand, complex cases where column values don't match exactly are so difficult that papers need to be written about them.
This affects how companies make decisions. In tracking systems, duplication can make data useless. In regulated environments, it can lead to violations. In healthcare, it can lead to deaths.
Idea
The field that tackles data duplication is called entity resolution. You might also know it as record linkage, merge-purge, or data deduplication.
I encountered this problem when I collected OCR receipt data for a company's loyalty program in 2022. Interestingly, it arose in two areas. First, the scans of product names often had typos or smudges that made them useless in a groupby. Second, users would often fraudulently upload a receipt that had already been scanned by another user. The first problem made my data useless, and the second problem made it too expensive to collect data. These problems ended my business.
I only solved the problems in 2023, four months after the program ended. I published two things as a result of my research: a conference paper for CENTERIS 2023 and a Python library called EntiPy. Now, I want to turn EntiPy into a startup.
Progress
EntiPy is already available for download and use. You can find it here. Install it like so:
```shell
pip install entipy
```
EntiPy understands that your data is domain-specific. It exposes a simple data modeling interface to its users. I can now solve my product name problem like so:
```python
from entipy import Field, Reference, SerialResolver
from rapidfuzz import fuzz


class ObservedNameField(Field):
    true_match_probability = 0.85
    false_match_probability = 0.15

    def compare(self, other):
        return fuzz.ratio(self.value, other.value) >= 70


class ProductNameReference(Reference):
    observed_name = ObservedNameField


r1 = ProductNameReference(observed_name='PrimeHarvestCheese10Qg')
r2 = ProductNameReference(observed_name='PureGourCetYogurt2.4kg')
r3 = ProductNameReference(observed_name='PrimeHarvLstCheese1F0g')
r4 = ProductNameReference(observed_name='NutSaFusionBakingSoda200g')
r5 = ProductNameReference(observed_name='PrimeIarvestCh~ose100g')
r6 = ProductNameReference(observed_name='PureGotrmetYogurt2_4kg')

sr = SerialResolver([r1, r2, r3, r4, r5, r6])
sr.resolve()
clusters = sr.get_cluster_data()
```
Internally, EntiPy implements a streaming algorithm that can accept new data as it comes. This makes it suitable not only for overnight batch processing but also for live services that need to resolve data as it arrives.
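The streaming behavior can be sketched generically. The loop below is an illustration of incremental clustering, not EntiPy's internals: each arriving reference either joins the first cluster it matches or starts a new one, so data can be processed as it arrives rather than in one batch.

```python
def resolve_stream(references, matches):
    """Incremental clustering sketch (hypothetical helper, not EntiPy's API).
    `matches` is any pairwise predicate, such as a fuzzy string comparison."""
    clusters = []
    for ref in references:
        for cluster in clusters:
            # Join the first cluster containing a matching member.
            if any(matches(ref, member) for member in cluster):
                cluster.append(ref)
                break
        else:
            # No match anywhere: this reference starts its own cluster.
            clusters.append([ref])
    return clusters

# Toy predicate: two references match if their first two characters agree.
clusters = resolve_stream(['aa1', 'aa2', 'bb1'], lambda a, b: a[:2] == b[:2])
print(clusters)  # [['aa1', 'aa2'], ['bb1']]
```

The key property is that the loop never needs to see the whole dataset up front, which is what makes the approach viable for live services.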
I have two major goals for EntiPy's technical roadmap.
- First, I want to implement blocking. This is a technique where obviously dissimilar references are disqualified from comparison. Blocking is critical for scaling ER.
- Second, I want to implement my unpublished research on a mergesort-inspired improvement to the core algorithm. My initial tests suggest it will make serial performance about 12 times faster, and it will also unlock parallel processing, another critical requirement for scaling ER.
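To illustrate what blocking buys, here is a generic sketch (hypothetical code, not EntiPy's API): a cheap key, here a lowercased name prefix, partitions references so that only pairs within the same block are ever compared.

```python
from collections import defaultdict

def block_by_prefix(records, key_length=4):
    """Group records by a cheap blocking key. Only records sharing a
    block need pairwise comparison, which cuts the O(n^2) comparison
    count drastically. Illustrative sketch only."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec.lower()[:key_length]].append(rec)
    return blocks

names = ['PrimeHarvestCheese', 'PrimeHarvLstCheese', 'PureGourmetYogurt']
blocks = block_by_prefix(names)
# The two cheese variants share the 'prim' block; the yogurt record
# lands in 'pure' and is never compared against them.
```

Without blocking, six records mean fifteen comparisons; with it, only same-block pairs are scored. The trade-off is recall: a blocking key that is too aggressive can separate true matches, so key design matters.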
Why now?
Two main reasons.
- Entity resolution has been a problem for as long as data has existed, and it has been quietly poisoning datasets all along. Based on my research, the only reason it isn't more widely known is that (funnily enough) the problem itself goes by several names.
- The new regulations around data mean you have to know who people are. On one front, Segment argues that companies will need to start collecting their marketing data themselves. This means that they will have to clean their marketing data themselves, or at least that they can't rely on a data broker to do so anymore. On another front, regulations like GDPR require companies to delete personal data when asked. Missing a record in a duplicated database could mean noncompliance.
More and more companies may need to implement entity resolution themselves. EntiPy will enable them to do that.
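The GDPR risk is easy to demonstrate. In this sketch (the table schema and email addresses are invented for illustration), a deletion request keyed on the canonical email silently misses a typo'd duplicate of the same person:

```python
import sqlite3

# Hypothetical user table containing a typo'd duplicate record.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?)",
    [('jane.doe@example.com',), ('jane.doe@exampel.com',)],  # second row: same person, misspelled
)

# A right-to-erasure request arrives for jane.doe@example.com.
conn.execute("DELETE FROM users WHERE email = 'jane.doe@example.com'")
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(remaining)  # 1: the misspelled duplicate survives the deletion
```

The database reports a successful deletion, yet personal data remains, which is exactly the noncompliance scenario described above.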
Competitors
Other resolution tools exist. Here are some of them that I've surveyed.
- Dedupe is also a Python library. It uses machine learning instead of data modeling. I don't understand their interface, and their documentation was not very helpful. Interestingly, the Dedupe team used to run a cloud service, but as of 2023-01-31, they shuttered it in favor of their consultancy.
- Senzing is a company that offers enterprise-grade entity resolution. They are closed-source, so it isn't obvious to me how to use their software. Also, as a freelance consultant, I couldn't afford their prices.
- Zingg is a Java entity resolution tool. It also uses ML to do resolution.
- OpenRefine is a web app that cleans data. I actually tried this on my messy product data. I found the results satisfactory, and I wish I'd known about it during the project. I don't think it could have helped me with the user fraud problem, though.
- AWS Entity Resolution is a machine learning entity resolution service by AWS. It's a relatively new entrant. It's much cheaper than Senzing, but still pretty expensive.
- Splink is a Python package for probabilistic record linkage. It promises to be very fast, but it explicitly doesn't support the use cases I was interested in.
With the exception of Senzing, which I couldn't try, I felt that these tools were developed from the tool author's point of view: the implementation of each tool likely came before its interface. I intend to do the opposite with EntiPy.
I'm also not actually a fan of the machine-learning-based solutions. Services that define similarity with machine learning ask me to trust them a little too much. What if I want to tweak the weights? That would be difficult with ML, but it's really easy with EntiPy.
Money
I am EntiPy's sole copyright holder. I have licensed it to the general public under AGPLv3. I have not accepted contributions from anyone else yet. I also offer commercial licenses for projects that can't open-source themselves, but I haven't tried to sell these commercial licenses yet.
If Senzing's prices are any indication, entity resolution is expensive. Their lowest price tier is roughly $40,000 for 10,000,000 records processed, or about $4 per 1,000 records. AWS Entity Resolution is cheaper at roughly $2,500 for 10,000,000 records ($0.25 per 1,000), but it's still expensive.
I don't think I can price by record processed with this business model, but I think I can copy something like Akka's per-core license.
Ambition
I understand that VCs look for potential hundred-baggers or thousand-baggers. I think EntiPy has a sensible way to become such a home run.
Good hackers can bang out a prototype of an app in a week or even a weekend, but only if all the tech is already solved by libraries. If they find a really hard problem, they'll get stuck.
ER is one of those really hard problems. Maybe it's not as hard as self-driving cars, but it's still hard, and a lot of companies encounter it. It's key to these use cases:
- CDPs
- ETL on highly siloed data for business intelligence
- Watchlist applications for regulated industries
- Fraud detection
- Data cleanup of digitized (i.e., previously analog) data
- First-party master data management
My original intent was to have other companies implement these use cases. They would just pay us to use our technology.
However, if we solve the hardest problem of ER first, nothing will stop us from building these apps ourselves. To merge these two models, I'd also be happy to grant perpetual licenses to startups in exchange for some equity.
Having ER changes the economics of how difficult these data-intensive apps will be to make. EntiPy won't merely help our customers build these apps. It will help us build these apps, too.
Moat
I've considered some future scenarios.
- LLM capabilities advance fast. There's a chance they'll eventually be able to do entity resolution. This isn't something we can control. I still think that LLMs are opaque, so they'll have the same magic problem as the competitor ER services. I also doubt that they'll be able to resolve late-arriving data. I could be wrong, though. LLMs are world-changing. If they invade this space, I'll just have to pivot.
- LLMs have high promise as feature engineering aids to prepare data for ER. Someone mentioned this idea at CENTERIS '23.
PS
- I've often been asked about the frequency/intensity of the data duplication problem. Let me put it this way. Data duplication is insidious. It's the result of a complex domain, or bad architecture, or both, and once you have it, you're screwed.