Hidden complexity
2023-11-25
One of my early projects was a receipt-scanning chatbot. Consumers could upload their receipts to the chatbot, and if the receipt contained an SKU participating in a promo, the chatbot would dispense a voucher in near real-time. Its purpose was to collect market basket data.
I foolishly thought that this would be straightforward. It was not. I encountered two massive problems only after running the bot in production for a few weeks.
Fraud
First, since the bot could dispense vouchers, it was a huge target for fraud. We found that certain users would pass a single receipt around their circles to receive multiple vouchers for the same receipt.
I accept that I made the mistake of not being sufficiently skeptical of user input. I had considered that a user might resubmit a receipt they had already submitted, but I hadn't considered that they would pass a single receipt through multiple accounts. I was guarding against mistakes, but I had overlooked active deceit.
After I discovered this behavior while monitoring the system, I patched our similarity detection algorithm to compare each new receipt against all receipts from all users instead of just those from the submitting user. This worked fine at first, but our similarity algorithm (a highly flawed combination of date validation, Levenshtein edit distance, and receipt ID validation) was expensive, and checking every new receipt against every previous one meant the cost grew with each submission. After three months the system had slowed so much that it was almost unusable. (Luckily for me, the promo period ended after three months as well.)
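For the curious, the cross-user duplicate check can be sketched roughly like this. This is a hypothetical reconstruction, not the production code; the function names and threshold are invented:

```python
# Hypothetical sketch of the cross-user duplicate check described above.
# Every new receipt is compared against every previously accepted receipt,
# which is why the cost grows as submissions accumulate.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def is_duplicate(new_text: str, accepted_texts: list[str],
                 threshold: int = 10) -> bool:
    """Flag a receipt whose OCR text is within `threshold` edits
    of any receipt already accepted -- from any user, not just this one."""
    return any(levenshtein(new_text, old) <= threshold
               for old in accepted_texts)
```

Because the accepted list spans every user, each new submission triggers an edit-distance computation against every receipt ever stored, so the per-receipt cost keeps climbing as the promo runs.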
Dirty data
Second, since the bot read receipt data through OCR, the data it collected was imperfect. Imagine an analyst trying to find how many "XBrandVngr" (X Brand Vinegar) units were sold from a dataset that only knows about "XBrandUngr", with a U. Now imagine that these data imperfections exist in almost every row, and that each row has its own flavor of imperfection. The data was as good as useless.
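To make the analyst's problem concrete, here is a toy illustration. The misspellings follow the example above; the rows themselves are invented:

```python
# Toy illustration of why exact aggregation fails on OCR'd text.
# "XBrandVngr" is the real product; the other spellings are OCR misreads.
from collections import Counter

rows = ["XBrandVngr", "XBrandUngr", "XBrandVngr", "XBrandVn9r"]

counts = Counter(rows)
# One product is split across three spellings, so a query for
# "XBrandVngr" undercounts: it sees 2 units when 4 were sold.
```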
Unlike the fraud problem, I was not able to come up with a solution during the project itself. It strained my relationship with my client, and I believe that it was a bad enough problem to consider the project a failure from a business standpoint.
I did solve this issue four months after the end of the project. It was so difficult that I wrote a conference paper about it.
End
There are lessons to take from this. Clearly, a) don't trust user input, and b) in software, you don't know what you're up against until you've done it at least once.
What I think will stick with me, though, is the sense of powerlessness I had towards the end of the project (when the fraud detection algorithm became slow enough to break the bot) and while trying to solve the dirty data issue. Though I eventually solved both the fraud issue and the data issue (through entity resolution), I solved them too late.
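For illustration, one minimal form of entity resolution over the OCR'd names might look like this greedy clustering sketch. This is my own hypothetical example, not the method from the paper:

```python
# A minimal entity-resolution sketch (not the author's actual method):
# cluster noisy OCR'd SKU strings by pairwise similarity, then map
# every variant to a canonical representative before aggregating.
import difflib

def resolve(names: list[str], cutoff: float = 0.85) -> dict[str, str]:
    """Greedy single-pass clustering: each name joins the first
    existing canonical form it resembles closely enough, or else
    becomes a new canonical form itself."""
    canon: list[str] = []
    mapping: dict[str, str] = {}
    for name in names:
        match = difflib.get_close_matches(name, canon, n=1, cutoff=cutoff)
        if match:
            mapping[name] = match[0]
        else:
            canon.append(name)
            mapping[name] = name
    return mapping
```

Real entity resolution involves much more (blocking to limit comparisons, better similarity measures, human review of clusters), but the core idea is the same: map every noisy variant to one canonical entity before counting anything.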
This was the first time I took on a software problem and was defeated. I doubt it will be the last, but I now have a respect for the unknowns of software development, which I hope will make it harder for hidden complexity to catch me as badly as it did this time.