Background
Named Entity Recognition (NER) is a fundamental natural language processing (NLP) task that involves identifying and categorizing named entities within text. What are named entities? They are the kinds of things that appear in almost any email conversation: individual names, company names, currencies, dates, address fields, and so on. Essentially, NER covers the whole set of such attributes present in everyday human-to-human communication. In a business context where emails, PDFs, and other documents are constantly exchanged, there is a lot of value for businesses in being able to:
- Identify such attributes in the text,
- Extract them from the unstructured content, and finally
- Present the extracted entities so they can be stored in databases and other downstream systems.
The problem
Identifying and extracting named entities from unstructured text is an important business problem that cuts across industries: it allows work that would otherwise be done manually to be performed with high accuracy and efficiency by ML models.
Several pre-trained models exist today for extracting named entities (a minimal usage sketch follows the list below), such as:
- BERT-based NER
- Stanford NER
- spaCy, and others
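As a concrete illustration of this baseline, here is roughly what extraction with an off-the-shelf pipeline looks like, assuming spaCy's `en_core_web_sm` model; the email text is an invented example, not real broker data:

```python
# Minimal sketch: baseline NER with a pre-trained spaCy pipeline.
# The email body below is an invented example, not real customer data.
import spacy

nlp = spacy.load("en_core_web_sm")  # off-the-shelf English pipeline

email_body = (
    "Hi team, please quote dental and vision for Acme Logistics, "
    "effective Jan 1, 2024. The budget is around $250,000."
)

doc = nlp(email_body)
for ent in doc.ents:
    # Typical output: "Acme Logistics" -> ORG, "Jan 1, 2024" -> DATE, "$250,000" -> MONEY
    print(ent.text, ent.label_)
```

Out of the box, a pipeline like this does a reasonable job on generic entities, which is exactly the baseline the rest of this post builds on.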
While such pre-trained NER models can perform these extractions, they do not always satisfy the specific needs of certain applications. For example, suppose you want to identify all the dates, names, and various pieces of custom information present in a broker-to-insurer email. Some of the common challenges you run into are:
- The models do not understand the specific context of the domain.
- They struggle to recognize the language constructs and abbreviations used in the email.
- They have trouble extracting combined entities, for example an in-network and out-of-network copay written together as "80/120".
The solution
Developing custom models can be useful in instances where a lot of domain knowledge needs to be applied on top of the general-purpose recognition abilities of the base model. Below, we discuss custom NER models and show how we used them to solve the problem of accurately identifying Company/Group names in emails sent from a broker to an insurance provider.
What is a Group, and why is it important?
A Group Name is simply a company name, and it plays a very important role in the RFP journey: it helps in providing accurate quotes and benefits. The rates provided by carriers usually vary based on multiple factors, such as:
- Industry: The Group Name can reveal a lot about the company's industry (its SIC code), which helps categorize groups and filter out those that are not suitable. Companies in high-risk industries are typically offered fewer benefits.
- Reputation: A well-known company or brand carries a strong reputation, and providing richer benefits to such companies can increase the carrier's credibility.
Our approach
Ushur has approached the problem of extracting Group names from unstructured RFP emails by using two distinct models:
- one designed to capture long entity spans (> 6 tokens), and
- another trained on smaller or average-sized spans.
Obtaining Long Group Names
Some group names span many words (examples: Houston City Arts and Cultural Center Inc, The Hartford Financial Services Group of Indiana, etc.). We noticed that the base pre-trained NER fails to pick up the entire group name due to challenges like:
- Existing pre-trained models have not been trained on long group names; such a model would identify only "Cultural Center, Inc" as the group.
- Some long group names get identified as two different organizations, leading to partial extractions.
We quickly realized that we needed a dedicated NER model for capturing long spans in order to correctly detect these extended entities. A good-sized sample dataset containing annotated examples of such names was created. We also used standard techniques to augment the dataset with artificially created long names, and very soon, with this sample dataset, we were able to correctly detect and categorize these extended entities.
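As a rough sketch of that augmentation step (the word lists, the template sentence, and the ORG label below are illustrative placeholders, not our production generator), synthetic long names can be assembled from common building blocks and wrapped in spaCy-style training annotations:

```python
# Sketch: augmenting training data with artificially created long group names.
# Word lists, the template sentence, and the ORG label are illustrative assumptions.
import random

CITIES = ["Houston", "Springfield", "Aurora", "Madison"]
DESCRIPTORS = ["City Arts and Cultural", "Financial Services", "Community Health"]
SUFFIXES = ["Center Inc", "Group of Indiana", "Partners LLC"]

def create_long_name() -> str:
    """Assemble a synthetic multi-word group name."""
    return f"{random.choice(CITIES)} {random.choice(DESCRIPTORS)} {random.choice(SUFFIXES)}"

def make_example(name: str):
    """Embed the name in an email-like sentence with character-offset annotations."""
    prefix = "Please provide a quote for "
    text = f"{prefix}{name}, effective next quarter."
    start, end = len(prefix), len(prefix) + len(name)
    return text, {"entities": [(start, end, "ORG")]}

# A few hundred synthetic examples mixed into the manually annotated dataset.
augmented = [make_example(create_long_name()) for _ in range(500)]
print(augmented[0])
```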
Handling Smaller or Average-Sized Tokens
While the first model focuses on long spans, a second custom model was built to handle smaller or average-sized group names. You may wonder: why have another model for smaller names? Well, we found that when the model trained on longer group names was applied to shorter ones, spurious text around the actual name also got picked up!
The second model was trained on a different dataset that included annotations for shorter entities. As this went through a few iterations, we realized that this model complements the first model very well and enhances the overall accuracy of the system.
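Conceptually, combining the two models at inference time can be as simple as running both over the same text and preferring the long-name model whenever its span overlaps one from the short-name model. The model package names and the overlap rule below are assumptions for illustration rather than the exact production logic:

```python
# Sketch: merging predictions from the long-name and short-name NER models.
# The model names and the "prefer the longer span" rule are illustrative assumptions.
import spacy

long_nlp = spacy.load("ner_group_long")    # hypothetical custom model for long names
short_nlp = spacy.load("ner_group_short")  # hypothetical custom model for short names

def extract_group_names(text: str):
    long_ents = [e for e in long_nlp(text).ents if e.label_ == "ORG"]
    short_ents = [e for e in short_nlp(text).ents if e.label_ == "ORG"]

    def overlaps(a, b) -> bool:
        return a.start_char < b.end_char and b.start_char < a.end_char

    # Keep every long-model span; keep a short-model span only when it does not
    # overlap any long-model span.
    merged = list(long_ents)
    merged += [s for s in short_ents if not any(overlaps(s, l) for l in long_ents)]
    return [e.text for e in merged]
```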
Conclusion
A quick script was written to compare the performance of this combined approach: we did a fuzzy match of the predicted outcomes against the actual ground truth. This approach gave us a baseline accuracy of about 90%, and it is the version currently deployed in production today.
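A sketch of what such an evaluation script might look like, using difflib for the fuzzy match (the 0.85 similarity threshold and the sample data are illustrative choices, not the exact setup we used):

```python
# Sketch: fuzzy-matching predicted group names against ground truth.
# The similarity threshold and the sample data are illustrative assumptions.
from difflib import SequenceMatcher

def is_match(predicted: str, truth: str, threshold: float = 0.85) -> bool:
    ratio = SequenceMatcher(None, predicted.lower(), truth.lower()).ratio()
    return ratio >= threshold

def accuracy(predictions, truths) -> float:
    hits = sum(is_match(p, t) for p, t in zip(predictions, truths))
    return hits / len(truths)

preds = ["Houston City Arts and Cultural Center", "Acme Logistics Inc"]
golds = ["Houston City Arts and Cultural Center Inc", "Acme Logistics"]
print(f"Fuzzy-match accuracy: {accuracy(preds, golds):.0%}")
```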
While pre-trained NER models from Hugging Face and other sources offer good baseline accuracy, our experience tells us that training a custom NER model improves accuracy and predicts the correct group name in most cases.
The interesting thing about this technique is that it can be applied to various other domains and use cases as well.