Device and method for natural language processing
US-2018157643-A1 · Jun 7, 2018 · US
US11599723B2 · US · B2
| Field | Value |
|---|---|
| Publication number | US-11599723-B2 |
| Application number | US-201916656955-A |
| Country | US |
| Kind code | B2 |
| Filing date | Oct 18, 2019 |
| Priority date | Oct 18, 2018 |
| Publication date | Mar 7, 2023 |
| Grant date | Mar 7, 2023 |
Official abstract text for this publication.
According to principles described herein, unsupervised statistical models, semi-supervised data models, and HITL methods are combined to create a text normalization system that is both robust and trainable with a minimum of human intervention. This system can be applied to data from multiple sources to standardize text for insertion into knowledge bases, machine learning model training and evaluation corpora, and analysis tools and databases.
Opening claim text (preview).
What is claimed is:
1. A non-transitory computer readable medium comprising instructions that, when executed by a processor of a processing system, cause the processing system to perform a method of updating a language model for a language domain, the method comprising: receiving text from at least two source platforms having different source text forms, the source platforms comprising interactive voice response log, social media platform, web page, or chat; extracting terms from the text in a source text form associated with a respective source platform to create a set of ingested terms; searching for the extracted terms in the source text form within a digitized data model; removing extracted terms that are found in the source text form in the digitized data model from the set of ingested terms; removing extracted terms that have a related form in the digitized data model from the set of ingested terms; identifying as a “new term” any term in the set of ingest terms that has not been discarded; assigning a priority to the new term based on context and probability of occurrence; and automatically adding the new term in the source text form to the digitized data model based on the priority above a predetermined threshold; and recompiling the language model after the new term is added to the digitized data model for a specific domain to expand vocabulary of the language model.
2. The non-transitory computer readable medium of claim 1, wherein the priority is assigned based on probability of occurrence of the new term based on one of a language model and a word embedding model.
3. The non-transitory computer readable medium of claim 2, wherein a low priority is assigned if the probability of occurrence is below a predetermined value.
4. The non-transitory computer readable medium of claim 1, the method further comprising passing the new terms to a human for determination of whether the term should be added to the data model if the new term crosses a predetermined threshold.
5. The non-transitory computer readable medium of claim 4, further comprising the human adding information to the training model about usage of the new term.
6. The non-transitory computer readable medium of claim 4, wherein the predetermined threshold is based on frequency of occurrence of the new term.
7. The non-transitory computer readable medium of claim 1, wherein adding the new term to the digitized data model comprises adding the new term to a training model of the digitized data model and recompiling the digitized data model based on the training model.
8. The non-transitory computer readable medium of claim 7, the method further comprising passing the new terms to a human for determination of whether the term should be added to the training model of the digitized data model if the new term crosses a predetermined threshold.
9. The non-transitory computer readable medium of claim 8, the method further comprising the human adding information to the training model about usage of the new term.
10. The computer program product of claim 1, the method further comprising adding the new term in a normal text form different from the source text based on the priority below the predetermined threshold.
11. A method of updating a language model for a language domain, comprising: receiving text from at least two source platforms having different source text forms, the source platforms comprising interactive voice response log, social media platform, web page, or chat; extracting terms from the text in a source text form associated with a respective source platform to create a set of ingested terms; searching for the extracted terms in the source text form within a digitized data model; removing extracted terms that are found in the source text form in the digitized data model from the set of ingested terms; removing extracted terms that have a related form in the digitized data model from the set of ingested terms; identifying as a “new term” any term in the set of ingest terms that has not been discarded; assigning a priority to the new term based on context and probability of occurrence; and automatically adding the new term in the source text form to the digitized data model based on the priority above a predetermined threshold; and recompiling the language model after the new term is added to the digitized data model for a specific domain to expand vocabulary of the language model.
12. The method of claim 11, wherein the priority is assigned based on probability of occurrence of the new term based on one of a language model and a word embedding model.
13. The method of claim 12, wherein a low priority is assigned if the probability of occurrence is below a predetermined value.
14. The method of claim 11, further comprising passing the new terms to a human for determination of whether the term should be added to the data model if the new term crosses a predetermined threshold.
15. The method of claim 14, further comprising the human adding information to the training model about usage of the new term.
16. The method of claim 14, wherein the predetermined threshold is based on frequency of occurrence of the new term.
17. The method of claim 11, wherein adding the new term to the digitized data model comprises adding the new term to a training model of the digitized data model and recompiling the digitized data model based on the training model.
18. The method of claim 17, further comprising passing the new terms to a human for determination of whether the term should be added to the training model of the digitized data model if the new term crosses a predetermined threshold.
19. The method of claim 18, further comprising the human adding information to the training model about usage of the new term.
20. The method of claim 11, further comprising adding the new term in a normal text form different from the source text based on the priority below the predetermined threshold.
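The steps recited in the independent claims (ingest, filter known and related forms, prioritize, add above a threshold) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions and not the patented implementation. The function names, the plural-stripping rule used as the "related form" check, and the use of relative frequency as a stand-in for "probability of occurrence" are all hypothetical. Context-based priority and the recompilation of a real language model are not modeled, and the digitized data model is reduced to a plain set of known terms.

```python
import re
from collections import Counter


def extract_terms(text):
    """Lowercase the text and return its word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def related_form(term):
    """Toy 'related form' rule (an assumption): strip a trailing plural 's'."""
    return term[:-1] if term.endswith("s") and len(term) > 3 else term


def update_vocabulary(texts_by_platform, vocabulary, threshold=0.1):
    """Return (updated_vocabulary, auto_added_terms, terms_for_human_review).

    texts_by_platform maps a hypothetical platform name (e.g. "chat") to raw text.
    vocabulary is a set of known terms standing in for the digitized data model.
    """
    # Ingest terms from every source platform.
    counts = Counter()
    for text in texts_by_platform.values():
        counts.update(extract_terms(text))
    total = sum(counts.values())

    added, review = [], []
    for term, n in sorted(counts.items()):
        # Discard terms already known, in source form or in a related form.
        if term in vocabulary or related_form(term) in vocabulary:
            continue
        # Relative frequency stands in for probability of occurrence.
        priority = n / total
        if priority > threshold:
            added.append(term)    # high priority: add automatically
        else:
            review.append(term)   # low priority: candidate for human review

    # Building the updated set stands in for recompiling the language model.
    updated = set(vocabulary) | set(added)
    return updated, added, review


if __name__ == "__main__":
    texts = {
        "chat": "pls reset my wifi wifi",
        "ivr": "reset wifi routers pls",
    }
    known = {"reset", "my", "router"}
    updated, added, review = update_vocabulary(texts, known, threshold=0.25)
    print("added:", added)
    print("review:", review)
```

In this sketch, terms that fall below the threshold are only collected for review; how a reviewer would annotate usage or supply a normalized form, as in the dependent claims, is left out.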
Semantic analysis · CPC title
Machine learning · CPC title
Recognition of textual entities · CPC title
using probabilistic model · CPC title
Transformation · CPC title