
How do you spell that? Why accurate KYC/KYB risk systems need to look beyond spelling.

Discover how advanced AI-driven name-matching algorithms enhance KYC/KYB processes by accurately identifying individuals despite spelling variations and errors.

Many firms, particularly in financial services, need to know who their customers and business partners are. Is the Katherine Smith trying to open a brokerage account at your firm, rent an apartment on your real estate platform, or help your customers with tasks around their home the same Katherine Smith who is also a felon, a terrorist, or on a government watch list? For a firm that wants to protect itself and its customers, it is imperative to know who it is dealing with, yet doing so is more complicated than you might think.

How AI can assist with KYC/KYB programs

Historically, firms checked names manually against sanctions, wanted, and watch lists. These lists are structured data, which represents just 20% of internet data, and they can be slow to update. If you had 10K, 20K, or 50K customers, manually checking them all wasn’t realistic. Firms would often triage large customer populations and review those deemed risky on a more regular basis. Even with many human reviewers assigned to the task, customers might only be reviewed at the start of the relationship and then once a year or once every few years thereafter. These gaps in manual review exposed the firm to risk.

Now, AI is being used to monitor much more than humans alone ever could.  AI can now help firms monitor structured data as well as unstructured data (80% of internet data). For example, Saifr uses multiple layers of sophisticated AI—including large language models (LLMs), natural language processing (NLP), and machine learning (ML)—to search 230K internet sources from 190 countries in 160 languages 24/7, allowing clients to more effectively monitor their full customer populations. AI is uniquely able to help monitor extremely large populations against publicly available information to help identify indications of financial or reputational risk, distinguishing behaviors that indicate fraud versus murder. AI models, if trained correctly, can accurately resolve adverse media to a person—Katherine Smith. Name-matching algorithms are at the heart of effective applications. 

Why is matching to the correct person a technical challenge?

Let me describe why matching to the correct person is a hard technical challenge. As an example, let’s just use the first name Katherine, a classic English female given name, derived from the Greek name "Aikaterine." The etymology is debated, but it is often associated with the Greek word "Katharos" meaning "pure." The name has been borne by numerous saints, queens, and prominent figures throughout history, contributing to its enduring popularity across Western cultures. Its variations are widespread due to its long history and adoption into many languages. 

When trying to find Katherine across the internet, spelling errors are frequent and occur for many reasons:

  • Common misspellings: phonetic similarities, common letter transpositions, or simple typos, such as "Catherine", "Kathryn", "Katharine", "Katerine", "Kathrine", "Kathryne", "Catherin", "Kathrin", "Katherin", and "Kateryn"
  • Insertions: variations created by adding one or more characters, such as "Katheriine", "Katherinne", "Katherinee", "Katheerine", "Kathereine", and "Katheriane"
  • Deletions: variations created by removing one or more characters, such as "Katherin", "Katheine", "Katherne", "Katrine", "Katerine", and "Kathrin"
  • Transpositions: variations created by swapping the order of adjacent characters, such as "Katheirne", "Katherien", "Kahteirne", "Kathreine", and "Kahterine"
  • Keyboard adjacent character mistakes: variations generated by replacing a character with an adjacent key on a standard QWERTY keyboard, such as "Jatherine", "Kathetine", "Katjerine", "Kqtherine", "Kathwrine", and "Katherinr"
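The mechanical error types above (insertions, deletions, transpositions, and keyboard slips) are easy to enumerate in code. What follows is a minimal sketch, not a description of any production system: the function names are hypothetical, the neighbor map is a partial, illustrative QWERTY layout covering only the letters in "katherine", and only single-character edits are generated.

```python
import string

# Partial, illustrative QWERTY neighbor map (an assumption for this sketch);
# a real system would cover the full keyboard and other layouts.
QWERTY_NEIGHBORS = {
    "k": "jliom",
    "a": "qwsz",
    "t": "rfgy",
    "h": "gjyubn",
    "e": "wrsd",
    "r": "edft",
    "i": "ujko",
    "n": "bhjm",
}


def insertions(name):
    """Every string made by adding one lowercase letter anywhere."""
    return {
        name[:i] + ch + name[i:]
        for i in range(len(name) + 1)
        for ch in string.ascii_lowercase
    }


def deletions(name):
    """Every string made by removing one character."""
    return {name[:i] + name[i + 1:] for i in range(len(name))}


def transpositions(name):
    """Every string made by swapping one pair of adjacent characters."""
    return {
        name[:i] + name[i + 1] + name[i] + name[i + 2:]
        for i in range(len(name) - 1)
        if name[i] != name[i + 1]
    }


def keyboard_slips(name):
    """Every string made by replacing one character with a neighboring key."""
    return {
        name[:i] + neighbor + name[i + 1:]
        for i, ch in enumerate(name)
        for neighbor in QWERTY_NEIGHBORS.get(ch, "")
    }


def single_edit_variants(name):
    """Union of all single-edit variants, excluding the name itself."""
    name = name.lower()
    variants = (
        insertions(name)
        | deletions(name)
        | transpositions(name)
        | keyboard_slips(name)
    )
    variants.discard(name)
    return variants


variants = single_edit_variants("Katherine")
print(len(variants), "single-edit variants of 'katherine'")
```

Applying the same functions to each variant again yields two-edit variants, which is one reason the counts grow so quickly.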

 

You get the idea. There are many, many more types of errors—phonetic, orthographic, transliterations, nicknames, etc.—that can contribute to potentially hundreds of variations of the name Katherine.  And that is just the first name.  This exercise can be done on the last name and any middle names to create thousands of variations for each individual being searched. In our work, we have seen over 100 thousand variations being created for a single, three-part (first, middle, and last) name. 
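The growth is multiplicative: the number of full-name variations is the product of the number of variations for each part. The sketch below illustrates this with tiny, made-up variant lists; the function names and the per-part counts in the final comment are hypothetical and are not figures from our work.

```python
from itertools import product
from math import prod


def count_full_name_variants(parts_variants):
    """Number of full-name combinations: the product of per-part counts."""
    return prod(len(variants) for variants in parts_variants)


def iter_full_name_variants(parts_variants):
    """Lazily yield every combination of first, middle, and last variants."""
    for combo in product(*parts_variants):
        yield " ".join(combo)


# Tiny illustrative lists; real lists would be far longer.
first = ["katherine", "kathryn"]
middle = ["ann", "anne", "an"]
last = ["smith", "smyth"]

print(count_full_name_variants([first, middle, last]))  # 2 * 3 * 2 = 12
print(list(iter_full_name_variants([first, middle, last]))[:3])

# With hypothetical counts of 60, 40, and 50 variants per part,
# the same arithmetic gives 60 * 40 * 50 = 120,000 combinations.
```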

How to ensure your name-matching algorithm is accurate

You want to know if the Kathryn Smyth who was in the paper for financial crimes in Europe, for example, is the Katherine Smith you are searching for in NYC. Identifying when two different name strings refer to the same real-world entity requires a robust hybrid algorithmic approach. This task is inherently complex due not only to the factors mentioned above, but also because of the algorithmic choices that must be made. Such a hybrid approach needs to be guided by the need to balance between recall and precision depending on the use case.  

First some definitions: 

  • Recall can be thought of as a measure of completeness. It measures the proportion of actual positives that were identified correctly. When the answer really was X, how often did the model say X? This matters most when a missed case is costly.
  • Precision can be thought of as a measure of exactness. It measures the quality of the positive predictions. When the model returned X, how often was it correct? This is an important measure for avoiding false positives.  
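In terms of counts, precision is TP / (TP + FP) and recall is TP / (TP + FN), where TP, FP, and FN are true positives, false positives, and false negatives. A minimal sketch of the calculation follows; the function name and the record identifiers are hypothetical.

```python
def precision_recall(flagged, true_matches):
    """Compute precision and recall from two sets of record IDs.

    flagged:      IDs the matcher returned as matches
    true_matches: IDs that really are the person being searched for
    """
    tp = len(flagged & true_matches)   # correctly flagged
    fp = len(flagged - true_matches)   # flagged but wrong
    fn = len(true_matches - flagged)   # missed
    # Both ratios are undefined when their denominator is zero;
    # this sketch reports 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical example: four articles flagged, three are truly about her.
flagged = {"A1", "A2", "A3", "A4"}
true_matches = {"A1", "A2", "A5"}
precision, recall = precision_recall(flagged, true_matches)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four flagged articles are correct (precision 0.50), and two of the three true articles were found (recall about 0.67).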

If you don’t want to miss someone who can cause serious harm, then you want recall to be high. But this usually comes at the cost of precision, because the two tend to trade off against each other. By widening the criteria to catch as many true positive matches as possible, you also increase the chance of a false positive (misidentifying someone as a threat who is not a threat).
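The trade-off is easiest to see by moving a match threshold. In the sketch below, the scores and labels are invented for illustration, and the function name is hypothetical; each candidate carries a similarity score and a flag saying whether it truly is the person being searched for.

```python
# Hypothetical scored candidates: (similarity score, is it really a match?)
scored = [
    (0.98, True),
    (0.91, True),
    (0.86, False),
    (0.80, True),
    (0.72, False),
    (0.65, False),
    (0.60, True),
]


def precision_recall_at(threshold, scored):
    """Flag every candidate at or above the threshold, then score the result."""
    flagged = [is_match for score, is_match in scored if score >= threshold]
    true_positives = sum(flagged)
    total_true = sum(is_match for _, is_match in scored)
    # Precision is undefined when nothing is flagged; reported as 0.0 here.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / total_true if total_true else 0.0
    return precision, recall


for threshold in (0.90, 0.75, 0.60):
    precision, recall = precision_recall_at(threshold, scored)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this made-up data, lowering the threshold from 0.90 to 0.60 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.57.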

A hybrid implementation can balance three signals: an exact match, a phonetic match (two names that sound similar but are spelled differently), and the degree of similarity between the two names (two names with slight misspellings that may refer to the same person). A lot of the science then involves how you set up a scoring system that determines which part of the algorithm gets more emphasis or weight (e.g., phonetic vs. similarity), and how you design a system that can run these comparisons efficiently, millions of times in seconds, when the number of named entities you are interested in is in the millions. It’s also important to note that similarity measures can be simple (e.g., how much the characters in two names overlap) or more complex, such as measuring whether two names sit in approximately the same neighborhood of a vector space.
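To make the idea concrete, here is a minimal sketch of a hybrid scorer. It is an illustration under stated assumptions, not Saifr's algorithm: American Soundex stands in for the phonetic signal, Python's difflib character-overlap ratio stands in for the similarity signal, the weights 0.4 and 0.6 are arbitrary, and all function names are hypothetical. The last helper shows one common way to cut down comparisons at scale, grouping names by a phonetic key so only names in the same bucket are compared.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Soundex digit for each coded consonant; vowels, h, w, and y carry no code.
_SOUNDEX = {
    ch: digit
    for letters, digit in (
        ("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
        ("l", "4"), ("mn", "5"), ("r", "6"),
    )
    for ch in letters
}


def soundex(name):
    """American Soundex: first letter plus three digits, e.g. 'Smith' -> 'S530'."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _SOUNDEX.get(letters[0], "")
    for ch in letters[1:]:
        digit = _SOUNDEX.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            previous = digit
    return (code + "000")[:4]


def name_score(a, b, w_phonetic=0.4, w_similarity=0.6):
    """Score two name parts in [0, 1]; the weights are illustrative assumptions."""
    a_norm, b_norm = a.strip().casefold(), b.strip().casefold()
    if a_norm == b_norm:  # exact match short-circuits the other signals
        return 1.0
    phonetic = 1.0 if soundex(a_norm) == soundex(b_norm) else 0.0
    similarity = SequenceMatcher(None, a_norm, b_norm).ratio()
    return w_phonetic * phonetic + w_similarity * similarity


def full_name_score(name_a, name_b):
    """Average the part-by-part scores; names with different part counts score 0."""
    parts_a, parts_b = name_a.split(), name_b.split()
    if not parts_a or len(parts_a) != len(parts_b):
        return 0.0
    return sum(name_score(x, y) for x, y in zip(parts_a, parts_b)) / len(parts_a)


def build_phonetic_index(names):
    """Group full names by the Soundex key of their last part (simple blocking)."""
    index = defaultdict(list)
    for name in names:
        if name.split():
            index[soundex(name.split()[-1])].append(name)
    return index


print(soundex("Katherine"), soundex("Kathryn"), soundex("Catherine"))
print(full_name_score("Kathryn Smyth", "Katherine Smith"))
print(full_name_score("Karen Jones", "Katherine Smith"))
```

Note that Soundex keeps the first letter, so "Catherine" and "Katherine" receive different phonetic codes and rely entirely on the similarity signal. That is one reason a single technique is rarely enough, and why the weighting between signals needs to be tuned to the recall and precision the use case requires.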

In a future blog, I plan to discuss how the introduction of non-Latin scripts (e.g., Arabic, Cyrillic, and Chinese) further complicates the science of name matching.

 Reach out if you are interested in learning more about this topic. Let’s engage: contact@saifr.ai 

The opinions provided are those of the author and not necessarily those of Saifr or its affiliates. Saifr and any other third parties are independent entities and not affiliated. Mentioning them does not suggest a recommendation or endorsement by Saifr. 


 

Vall Herard

CEO
Vall specializes in the intersection of financial markets and technology and has a mastery of emerging methods like AI, machine learning, blockchain, and micro-services. He has a proven track record of taking companies from ideation to scale on a global basis within FinTech and financial services.
