I've played around a bit using AI to form SQL queries, but I'm trying to find out if it's possible to implement it for a different use. 

We have a payee table that contains the fields name/address/city/state/zip/tax ID, as well as a secondary name field that's only used to house whatever might be printed on the 2nd line of a check. These are all free text fields, so of course it's a clusterberkeley, and we end up with sometimes dozens of variants of "John Smith MD" at "123 Main St,", etc. 

I've developed queries that use Levenshtein distance along with other normalization logic. That helps, but it often still leaves thousands of rows needing manual review to ensure there are no duplicates.
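For anyone following along, here is a minimal Python sketch of the normalize-then-compare idea. It is not the actual production query: the abbreviation map and the function names are made up for illustration, and a real list of abbreviations would be much longer.

```python
import re

# Hypothetical abbreviation map for illustration; a real one would be much longer.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "suite": "ste", "doctor": "dr"}


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace, abbreviate."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in cleaned.split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only one previous row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] computed on the normalized strings."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Normalizing first means variants such as "123 Main St," and "123 main street" compare as identical, so the edit distance only has to absorb genuine typos.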

This seems to me like something AI could possibly do better than humans, but I've not had any luck finding anything relevant online. 

Any suggestions?

GameboyRMH
GameboyRMH GRM+ Member and MegaDork
1/2/24 7:20 p.m.

I think you'd have to train something custom for that, or roll the dice on the accuracy of an existing option (maybe one of the llama-based systems available as a local executable).

Another option may be to use an address lookup system (either something online like Google Maps or something local from an OpenStreetMap client) and compare the returned coordinates. If you get two matching valid coordinates (errors may give you a spot in the middle of the Atlantic, so be careful), that's a duplicate for sure.
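A rough Python sketch of the comparison step, assuming the coordinates have already been fetched from whichever lookup service is used (no geocoder is called here). The function names, the 25 m tolerance, and the guess that failed lookups come back as (0, 0) are all assumptions for illustration.

```python
import math
from typing import Optional, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def is_plausible(coord: Optional[Coord]) -> bool:
    """Reject missing, out-of-range, or (0, 0) results."""
    if coord is None:
        return False
    lat, lon = coord
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        return False
    # Assumption: a failed lookup is reported as (0, 0), a point in the Atlantic.
    return not (abs(lat) < 1e-6 and abs(lon) < 1e-6)


def haversine_m(a: Coord, b: Coord) -> float:
    """Great-circle distance in meters between two coordinates."""
    earth_radius_m = 6_371_000.0
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(h))


def same_location(a: Optional[Coord], b: Optional[Coord],
                  tolerance_m: float = 25.0) -> bool:
    """True only when both lookups look valid and land within the tolerance."""
    return (is_plausible(a) and is_plausible(b)
            and haversine_m(a, b) <= tolerance_m)
```

Comparing with a small tolerance rather than exact equality allows for slightly different coordinates returned for the same building.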

In reply to GameboyRMH :

Thanks! I'll look into llama a bit. 

An address lookup could help, but the name portion is just as important. As an example, there may be a clinic or hospital with dozens of providers, all at the same address. So I need something smart enough to know that Jon/John/Jonathan/etc. are likely all the same person if the last name is a close match & the address details are close enough.

Keith Tanner
Keith Tanner GRM+ Member and MegaDork
1/2/24 7:36 p.m.

Mailing companies offer deduping services. This software already exists.

GameboyRMH
GameboyRMH GRM+ Member and MegaDork
1/2/24 7:43 p.m.
Pete Gossett (Forum Supporter) said:

In reply to GameboyRMH :

Thanks! I'll look into llama a bit. 

An address lookup could help, but the name portion is just as important. As an example, there may be a clinic or hospital with dozens of providers, all at the same address. So I need something smart enough to know that Jon/John/Jonathan/etc. are likely all the same person if the last name is a close match & the address details are close enough.

Careful with that. We dealt with that sort of thing a lot at my last job: two similar patients would get accidentally merged or otherwise confused (often father & son or mother & daughter with similar or even identical names), and we'd have to pull data from a backup to sort it out. Be sure to compare birthdates and any kind of health ID numbers as well.

Also check out some of the nickname databases listed here: https://opendata.stackexchange.com/questions/9777/nicknames-database
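As a rough illustration of how a nickname list could be used, here is a small Python sketch. The three-entry table is a hypothetical stand-in for one of those databases, and treating Jonathan as a variant of John is an assumption taken from the example earlier in the thread, not a rule.

```python
# Hypothetical stand-in for a real nickname database.
# Keys are variants, values are the canonical form used for comparison.
NICKNAME_TO_CANONICAL = {
    "jon": "john",
    "johnny": "john",
    "jonathan": "john",  # assumption: treated as a John variant per the thread
}


def canonical_first_name(name: str) -> str:
    """Map a first name to its canonical form; unknown names map to themselves."""
    key = name.strip().lower().rstrip(".")
    return NICKNAME_TO_CANONICAL.get(key, key)


def first_names_match(a: str, b: str) -> bool:
    """True when both names reduce to the same canonical form."""
    return canonical_first_name(a) == canonical_first_name(b)
```

A match here only says the first names are compatible; it should feed into a wider comparison rather than decide a merge by itself.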

I think it could be done without training an AI. You could use some kind of "closeness score" based on matching addresses, names, birth dates, etc., and consider it a match if the score is close enough to perfect. This would cut the development cost down a few orders of magnitude and eliminate the risk of hallucinations.
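A minimal Python sketch of what such a score might look like for the payee table described at the top of the thread. The weights, the 0.9 threshold, and the `Payee` field names are all assumptions picked for illustration. String similarity comes from the standard library's `difflib.SequenceMatcher` rather than Levenshtein distance, simply to keep the example short.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Payee:
    name: str
    address: str
    zip_code: str
    tax_id: str = ""  # empty string means "not on file"


# Hypothetical weights (summing to 1) and threshold; tune against reviewed data.
WEIGHTS = {"name": 0.4, "address": 0.3, "zip": 0.1, "tax_id": 0.2}
MATCH_THRESHOLD = 0.9


def _ratio(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def closeness(a: Payee, b: Payee) -> float:
    """Weighted closeness score; conflicting tax IDs veto the match outright."""
    if a.tax_id and b.tax_id and a.tax_id != b.tax_id:
        return 0.0
    tax_score = 1.0 if a.tax_id and a.tax_id == b.tax_id else 0.0
    zip_score = 1.0 if a.zip_code[:5] == b.zip_code[:5] else 0.0
    return (WEIGHTS["name"] * _ratio(a.name, b.name)
            + WEIGHTS["address"] * _ratio(a.address, b.address)
            + WEIGHTS["zip"] * zip_score
            + WEIGHTS["tax_id"] * tax_score)


def is_match(a: Payee, b: Payee) -> bool:
    """Auto-match only when the score clears the threshold."""
    return closeness(a, b) >= MATCH_THRESHOLD
```

With these particular weights, two records with identical name and address but no tax ID on file score about 0.8, so they fall below the threshold and go to manual review instead of being merged automatically. That is the behavior you want for the father & son case.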
