Pitching Hyperjoin

2024-04-21

Update: It didn't work. Not EntiPy and Hyperjoin, anyway. Entity resolution in other forms, though, seems to be in demand. We'll see.

I'm Joe, the CEO of Hyperjoin. This is the updated version of this article.

Summary

What is it?

Hyperjoin is a data quality startup that links fragmented records that refer to the same real-world thing. We are the authors of EntiPy, a Python library that is our flagship product.

Humans can tell whether "John Doe" and "Jonathan Doe" are the same person. EntiPy helps computers do the same at scale.

Why us?

Why now?

Details

Problem

Each row in a database points to one thing. The converse is not true: in real-world data, a single thing can be represented by many rows. This is called data duplication.

On the one hand, simple cases where column values match exactly are trivial: just run SELECT DISTINCT, FROM x JOIN y, or .drop_duplicates(). On the other hand, complex cases where column values don't match exactly are so difficult that papers need to be written about them.
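To make the gap between the two cases concrete, here is a minimal sketch that uses only Python's standard library. It is an illustration, not how EntiPy works: the helper name `looks_like_same_product` and the 0.8 threshold are assumptions picked for this example, and `difflib` stands in for a dedicated string-similarity library.

```python
from difflib import SequenceMatcher

# Two OCR scans of the same product, each with its own typos.
scans = ["PrimeHarvestCheese10Qg", "PrimeHarvLstCheese1F0g"]

# Exact deduplication (the same idea as SELECT DISTINCT) keeps both rows,
# because the strings are not identical.
exact_unique = list(dict.fromkeys(scans))


def looks_like_same_product(a, b, threshold=0.8):
    """Hypothetical helper: fuzzy comparison with an arbitrary threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


print(len(exact_unique))                 # number of rows left after exact dedup
print(looks_like_same_product(*scans))   # whether the fuzzy check links them
```

The hard part is not the comparison itself but choosing thresholds and combining evidence across many fields and many rows, which is where the complex cases get their reputation.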

Duplication affects how companies make decisions. In tracking systems, it can make data useless. In regulated environments, it can lead to violations. In healthcare, it can lead to deaths.

Idea

The field that tackles data duplication is called entity resolution. You might also know it as record linkage, merge-purge, or data deduplication.

I encountered this problem when I collected OCR receipt data for a company's loyalty program in 2022. Interestingly, it arose in two areas. First, the scans of product names often had typos or smudges that made them useless in a groupby. Second, users would often fraudulently upload a receipt that had already been scanned by another user. The first problem made my data useless, and the second problem made it too expensive to collect data. These problems ended my business.

I only solved the problems in 2023, four months after the program ended. I published two things as a result of my research: a conference paper for CENTERIS 2023 and a Python library called EntiPy. Now, I want to turn EntiPy into a startup.

Progress

EntiPy is already available for download and use. I am using it to solve one of my consulting clients' problems.

You can find the library here. Install it like so:

pip install entipy

EntiPy understands that your data is domain-specific. It exposes a simple data modeling interface to its users. I can now solve my product name problem like so:

from entipy import Field, Reference, SerialResolver
from rapidfuzz import fuzz


class ObservedNameField(Field):
    true_match_probability = 0.85
    false_match_probability = 0.15

    def compare(self, other):
        # fuzz.ratio returns a similarity score from 0 to 100
        return fuzz.ratio(self.value, other.value) >= 70


class ProductNameReference(Reference):
    observed_name = ObservedNameField


# Six noisy OCR scans covering three products
r1 = ProductNameReference(observed_name='PrimeHarvestCheese10Qg')
r2 = ProductNameReference(observed_name='PureGourCetYogurt2.4kg')
r3 = ProductNameReference(observed_name='PrimeHarvLstCheese1F0g')
r4 = ProductNameReference(observed_name='NutSaFusionBakingSoda200g')
r5 = ProductNameReference(observed_name='PrimeIarvestCh~ose100g')
r6 = ProductNameReference(observed_name='PureGotrmetYogurt2_4kg')

sr = SerialResolver([r1, r2, r3, r4, r5, r6])

# Group the references that appear to describe the same product
sr.resolve()

clusters = sr.get_cluster_data()

Internally, EntiPy implements a streaming algorithm that can accept new data as it comes. This makes it suitable not only for overnight batch processing but also for live services that need to resolve data as it arrives.
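To show what "accepting new data as it comes" can look like, here is a toy sketch of incremental clustering. It is not EntiPy's algorithm: the class name `IncrementalClusterer`, the 0.7 threshold, and the choice to compare only against the first member of each cluster are all assumptions made to keep the example short, and `difflib` is used instead of rapidfuzz so the block has no dependencies.

```python
from difflib import SequenceMatcher


class IncrementalClusterer:
    """Toy streaming resolver: assigns each record to a cluster on arrival."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold
        self.clusters = []  # list of lists of record strings

    def add(self, record):
        """Add one record and return the index of the cluster it joined."""
        for index, cluster in enumerate(self.clusters):
            # Simplification: compare against the cluster's first member only.
            representative = cluster[0]
            score = SequenceMatcher(None, representative, record).ratio()
            if score >= self.threshold:
                cluster.append(record)
                return index
        # No existing cluster is similar enough, so start a new one.
        self.clusters.append([record])
        return len(self.clusters) - 1


clusterer = IncrementalClusterer()
for name in [
    "PrimeHarvestCheese10Qg",
    "PureGourCetYogurt2.4kg",
    "PrimeHarvLstCheese1F0g",
    "NutSaFusionBakingSoda200g",
    "PrimeIarvestCh~ose100g",
    "PureGotrmetYogurt2_4kg",
]:
    clusterer.add(name)  # more records could keep arriving after this loop

print(clusterer.clusters)
```

The point of the design is that no record ever requires reprocessing the whole dataset: each arrival is compared against the clusters that already exist, so the same code path serves a nightly batch and a live endpoint.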

Previously, I had two critical features on my roadmap: blocking and parallelization. I have now written both.
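For readers unfamiliar with blocking, the idea is to avoid comparing every record with every other record by first grouping records under a cheap key. The sketch below shows the general technique only, not EntiPy's implementation; the `blocking_key` function and its "first two characters" rule are hypothetical choices for this example.

```python
from collections import defaultdict
from itertools import combinations

names = [
    "PrimeHarvestCheese10Qg",
    "PureGourCetYogurt2.4kg",
    "PrimeHarvLstCheese1F0g",
    "NutSaFusionBakingSoda200g",
    "PrimeIarvestCh~ose100g",
    "PureGotrmetYogurt2_4kg",
]


def blocking_key(name):
    """Hypothetical cheap key: the first two characters, lowercased."""
    return name[:2].lower()


# Group records so that only records sharing a key are ever compared.
blocks = defaultdict(list)
for name in names:
    blocks[blocking_key(name)].append(name)

# Without blocking, every pair of records is a candidate comparison.
all_pairs = list(combinations(names, 2))

# With blocking, candidate pairs are generated inside each block only.
blocked_pairs = [
    pair for members in blocks.values() for pair in combinations(members, 2)
]

print(len(all_pairs), len(blocked_pairs))
```

The trade-off is recall: a typo inside the key itself would send a true duplicate to a different block. Because blocks share no candidate pairs, they are also a natural unit of work to hand to separate workers, which is one common way blocking and parallelization fit together.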

Why now?

Two main reasons.

More and more companies may need to implement entity resolution themselves. EntiPy will enable them to do that.

Competitors

Other resolution tools exist. Here are some of them that I've surveyed.

With the exception of Senzing, which I couldn't try, I felt that these tools were developed from the point of view of the developer. The implementation of each tool likely came before its interface. I intend to do the opposite for EntiPy.

I'm also not a fan of the machine-learning-based solutions. Services that define similarity with machine learning ask me to trust them a little too much. What if I want to tweak the weights? That would be difficult with ML, but really easy with EntiPy.
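To illustrate what tweakable weights can mean in a probabilistic setting, here is a sketch of the classical Fellegi-Sunter field weights. Treating `true_match_probability` and `false_match_probability` from the earlier example as the usual m- and u-probabilities is an assumption on my part for this sketch; it is not a statement about how EntiPy computes its scores internally, and `field_weights` is a hypothetical helper.

```python
import math


def field_weights(m, u):
    """Return (agreement, disagreement) log2 weights for one field.

    m: probability the field agrees when two records truly match.
    u: probability the field agrees when they do not match.
    """
    agreement = math.log2(m / u)
    disagreement = math.log2((1 - m) / (1 - u))
    return agreement, disagreement


# Values borrowed from ObservedNameField in the earlier example.
agree, disagree = field_weights(m=0.85, u=0.15)
print(round(agree, 2), round(disagree, 2))

# Raising m (a more trustworthy field) makes agreement count for more.
stricter_agree, _ = field_weights(m=0.95, u=0.15)
print(stricter_agree > agree)
```

Two numbers per field are easy to inspect, explain to an auditor, and change by hand, which is the kind of control a learned similarity model makes harder.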

Money

I am EntiPy's sole copyright holder. I have licensed it to the general public under AGPLv3. I have not accepted contributions from anyone else yet. I also offer commercial licenses for projects that can't open-source themselves, but I haven't tried to sell these commercial licenses yet.

If Senzing's prices are any indication, entity resolution is expensive. Their lowest price tier is roughly $40,000 for 10,000,000 records processed. AWS Entity Resolution is cheaper, at roughly $2,500 for 10,000,000 records, but it's still expensive.
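As a quick sanity check on those figures, the per-record cost works out as follows. The totals are the rough numbers quoted above, not official price lists.

```python
RECORDS = 10_000_000

senzing_total = 40_000  # USD, rough lowest-tier figure quoted above
aws_total = 2_500       # USD, rough figure quoted above

senzing_per_record = senzing_total / RECORDS
aws_per_record = aws_total / RECORDS

print(f"Senzing: ${senzing_per_record:.5f} per record")
print(f"AWS:     ${aws_per_record:.5f} per record")
print(f"Ratio:   {senzing_total / aws_total:.0f}x")
```

That is about 0.4 cents per record for Senzing and 0.025 cents per record for AWS, a 16x difference.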

Ambition

I understand that VCs look for potential hundred-baggers or thousand-baggers. I think EntiPy has a sensible way to become such a home run.

Good hackers can bang out a prototype of an app in a week or even a weekend, but only if all the tech is already solved by libraries. If they find a really hard problem, they'll get stuck.

ER is one of those really hard problems. Maybe it's not as hard as self-driving cars, but it's still hard, and a lot of companies encounter it. It's key to these use cases:

My original intent was to have other companies implement these use cases. They would just pay us to use our technology.

However, if we solve the hardest problem of ER first, there will be nothing stopping us from building these apps ourselves. To merge these two models, I'd also be happy to grant perpetual licenses to startups in exchange for some equity.

Having ER solved changes the economics of building these data-intensive apps. EntiPy won't merely help our customers build these apps. It will help us build these apps, too.

Moat

I've considered some future scenarios.

PS