System and method of combining statistical models, data models, and human-in-the-loop for text normalization

US11599723B2 · US · B2

Patent metadata
Field               Value
Publication number  US-11599723-B2
Application number  US-201916656955-A
Country             US
Kind code           B2
Filing date         Oct 18, 2019
Priority date       Oct 18, 2018
Publication date    Mar 7, 2023
Grant date          Mar 7, 2023

How to read this patent

A practical reading order for non-experts. Skip the full description unless you need deep technical detail.

  1. Title

    What the patent document calls the invention.

  2. Abstract

    A short plain-language summary of the technical disclosure.

  3. Assignees and inventors

    Who owns or filed the patent and who is credited as inventor.

  4. Key dates

    Filing, priority, publication, and grant dates set the timeline.

  5. First independent claim

    The legal scope of protection — read this for what is actually claimed.

  6. CPC / IPC classifications

    Technology tags used to group this patent with similar filings.

  7. Citations and related patents

    Prior art links and similar publications in this corpus.

Abstract

Official abstract text for this publication.

According to principles described herein, unsupervised statistical models, semi-supervised data models, and HITL methods are combined to create a text normalization system that is both robust and trainable with a minimum of human intervention. This system can be applied to data from multiple sources to standardize text for insertion into knowledge bases, machine learning model training and evaluation corpora, and analysis tools and databases.

First claim

Opening claim text (preview).

What is claimed is:
1. A non-transitory computer readable medium comprising instructions that, when executed by a processor of a processing system, cause the processing system to perform a method of updating a language model for a language domain, the method comprising: receiving text from at least two source platforms having different source text forms, the source platforms comprising interactive voice response log, social media platform, web page, or chat; extracting terms from the text in a source text form associated with a respective source platform to create a set of ingested terms; searching for the extracted terms in the source text form within a digitized data model; removing extracted terms that are found in the source text form in the digitized data model from the set of ingested terms; removing extracted terms that have a related form in the digitized data model from the set of ingested terms; identifying as a “new term” any term in the set of ingest terms that has not been discarded; assigning a priority to the new term based on context and probability of occurrence; and automatically adding the new term in the source text form to the digitized data model based on the priority above a predetermined threshold; and recompiling the language model after the new term is added to the digitized data model for a specific domain to expand vocabulary of the language model.
2. The non-transitory computer readable medium of claim 1, wherein the priority is assigned based on probability of occurrence of the new term based on one of a language model and a word embedding model.
3. The non-transitory computer readable medium of claim 2, wherein a low priority is assigned if the probability of occurrence is below a predetermined value.
4. The non-transitory computer readable medium of claim 1, the method further comprising passing the new terms to a human for determination of whether the term should be added to the data model if the new term crosses a predetermined threshold.
5. The non-transitory computer readable medium of claim 4, further comprising the human adding information to the training model about usage of the new term.
6. The non-transitory computer readable medium of claim 4, wherein the predetermined threshold is based on frequency of occurrence of the new term.
7. The non-transitory computer readable medium of claim 1, wherein adding the new term to the digitized data model comprises adding the new term to a training model of the digitized data model and recompiling the digitized data model based on the training model.
8. The non-transitory computer readable medium of claim 7, the method further comprising passing the new terms to a human for determination of whether the term should be added to the training model of the digitized data model if the new term crosses a predetermined threshold.
9. The non-transitory computer readable medium of claim 8, the method further comprising the human adding information to the training model about usage of the new term.
10. The computer program product of claim 1, the method further comprising adding the new term in a normal text form different from the source text based on the priority below the predetermined threshold.
11. A method of updating a language model for a language domain, comprising: receiving text from at least two source platforms having different source text forms, the source platforms comprising interactive voice response log, social media platform, web page, or chat; extracting terms from the text in a source text form associated with a respective source platform to create a set of ingested terms; searching for the extracted terms in the source text form within a digitized data model; removing extracted terms that are found in the source text form in the digitized data model from the set of ingested terms; removing extracted terms that have a related form in the digitized data model from the set of ingested terms; identifying as a “new term” any term in the set of ingest terms that has not been discarded; assigning a priority to the new term based on context and probability of occurrence; and automatically adding the new term in the source text form to the digitized data model based on the priority above a predetermined threshold; and recompiling the language model after the new term is added to the digitized data model for a specific domain to expand vocabulary of the language model.
12. The method of claim 11, wherein the priority is assigned based on probability of occurrence of the new term based on one of a language model and a word embedding model.
13. The method of claim 12, wherein a low priority is assigned if the probability of occurrence is below a predetermined value.
14. The method of claim 11, further comprising passing the new terms to a human for determination of whether the term should be added to the data model if the new term crosses a predetermined threshold.
15. The method of claim 14, further comprising the human adding information to the training model about usage of the new term.
16. The method of claim 14, wherein the predetermined threshold is based on frequency of occurrence of the new term.
17. The method of claim 11, wherein adding the new term to the digitized data model comprises adding the new term to a training model of the digitized data model and recompiling the digitized data model based on the training model.
18. The method of claim 17, further comprising passing the new terms to a human for determination of whether the term should be added to the training model of the digitized data model if the new term crosses a predetermined threshold.
19. The method of claim 18, further comprising the human adding information to the training model about usage of the new term.
20. The method of claim 11, further comprising adding the new term in a normal text form different from the source text based on the priority below the predetermined threshold.
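
Illustrative sketch (not part of the patent text). The steps recited in claims 1 and 11 can be pictured roughly as the Python below. Everything in it is an assumption made for illustration: the function names, the regular-expression tokenizer, the plural-stripping notion of a "related form", and the use of relative frequency as the priority are stand-ins, since the claims do not prescribe any of them. The data model is reduced to a set of lowercase strings, and terms that do not clear the threshold are simply held in a list for later review, which is an illustrative choice rather than something the claims require.

```python
import re
from collections import Counter


def extract_terms(text):
    """Split raw source text into candidate terms (hypothetical tokenizer)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def related_forms(term):
    """Assumed notion of a 'related form': case-folded, with or without a final 's'."""
    folded = term.lower()
    forms = {folded, folded + "s"}
    if folded.endswith("s"):
        forms.add(folded[:-1])
    return forms


def update_data_model(texts, model_terms, threshold):
    """Return (updated model terms, terms added automatically, terms held for review)."""
    # Ingest terms from every source text and count occurrences.
    counts = Counter(t for text in texts for t in extract_terms(text))
    total = sum(counts.values())

    added, review = [], []
    for term, count in counts.items():
        if term in model_terms:
            continue  # already present in its source form
        if related_forms(term) & model_terms:
            continue  # a related form is already present
        # Assumption: priority is the term's relative frequency in the ingested text.
        priority = count / total
        if priority > threshold:
            added.append(term)
        else:
            review.append(term)

    # Stand-in for "recompiling": the caller would rebuild the language model
    # from the returned term set.
    return model_terms | set(added), sorted(added), sorted(review)


texts = ["omg the wifi is down again omg", "wifi router resets pls"]
model = {"the", "is", "down", "again", "router", "reset"}
updated, added, review = update_data_model(texts, model, threshold=0.1)
print(added, review)
```

In this toy run "resets" is dropped because "reset" is already in the model, while the remaining unknown terms are split by the threshold. A real system would replace the frequency ratio with a probability from a language model or word embedding model, as claims 2 and 12 suggest.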

Frequently asked questions

Answers are generated from the same data shown on this page.

What does patent US11599723B2 cover?
According to principles described herein, unsupervised statistical models, semi-supervised data models, and HITL methods are combined to create a text normalization system that is both robust and trainable with a minimum of human intervention. This system can be applied to data from multiple sources to standardize text for insertion into knowledge bases, machine learning model training and eval…
Who is the assignee on this patent?
Verint Americas Inc
What technology area does this patent fall under?
Primary CPC classification G06F40/30. Mapped technology areas include Physics.
When was this patent published?
Publication date Mar 7, 2023 (B2). Legal status and post-grant events are not shown on this page.
What related patents are in patentsdb?
We list 2 related publications on this page (citations in our corpus or others sharing the same primary CPC).